NOTE: Since this article was written, updates have been made to the MagicMimeTypeIdentifier in the Aperture Framework. Read more...
My current project, vyasa, is a digital library management system. One of the features of a digital library is the ability to recognize the types of files (digital assets) that are loaded into the repository.
The process of detecting a file's type (also known as its MIME type) is non-trivial, yet there are important benefits. For example, type-related metadata, such as the length and bitrate of an audio file or the size and DPI of an image file, leads to more comprehensive asset management.
The problem of file type detection and various solutions to it are briefly explained in this Wikipedia article. Detailed coverage is available in the following academic papers (PDF):
- Content Based File Type Detection Algorithms, Mason McDaniel and M. Hossain Heydari, Computer Science Department, James Madison University, Harrisonburg.
- File Type Detection Technology, Douglas J. Hickok, Daine Richard Lesniak, Michael C. Rowe, Ph.D., Computer Science and Software Engineering Department, University of Wisconsin-Platteville.
The following libraries and resources, most of them written in Java, deal with file type detection:
- javax.activation.FileDataSource is part of the JavaBeans(TM) Activation Framework used by the JavaMail(TM) API to manage MIME data.
- Java Mime Magic "retrieves file and stream mime types by checking magic headers," according to Réal Gagnon. (LGPL)
- ffident is a Java library for metadata extraction and file format identification, created by Marco Schmidt. (LGPL)
- JHOVE (JSTOR/Harvard Object Validation Environment) will identify, validate and characterize file types. (LGPL)
- MagicMimeTypeIdentifier is from the Aperture Framework. It determines the MIME type of a binary resource based on magic number-based heuristics. (AFL, OSL)
- MimetypeRegistryService is part of the Nuxeo project. (LGPL)
- complex.filter.detection.typeDetection is OpenOffice.org's code for detecting file types (LGPL).
- Magic DB is a file containing magic numbers for identifying file types, as well as several other kinds of file metadata. The format of the file is specified by Optima SC. (see license on magicdb.org)
- Marco Schmidt lists several non-Java resources on his file formats page.
- FileType is an internal filetype detection engine for other coders who wish to have a simple-to-use C module.
- org.mmbase.util.magicfile determines file types based on parsing the UNIX magic file (the signature database used by the file command).
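Several of the entries above rely on "magic numbers": fixed byte sequences at the start of a file that hint at its format. The sketch below illustrates the general idea only. MagicSniffer and its small signature table are hypothetical names invented for this example; this is not the code or API of Aperture, Java Mime Magic or any other library in the list, all of which use far larger signature sets and extra heuristics.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

/** Hypothetical illustration of magic-number detection; not taken from any listed library. */
final class MagicSniffer {

    // A handful of well-known leading byte sequences...
    private static final byte[][] MAGIC = {
        {(byte) 0x89, 'P', 'N', 'G', 0x0D, 0x0A, 0x1A, 0x0A},
        {(byte) 0xFF, (byte) 0xD8, (byte) 0xFF},
        ascii("GIF87a"),
        ascii("GIF89a"),
        ascii("%PDF-"),
        {'P', 'K', 0x03, 0x04},
    };

    // ...and the MIME type each one suggests (same order as MAGIC).
    private static final String[] TYPES = {
        "image/png", "image/jpeg", "image/gif", "image/gif",
        "application/pdf", "application/zip",
    };

    // Longest signature in the table; no need to read more than this.
    private static final int MAX_HEADER = 8;

    private static byte[] ascii(String s) {
        return s.getBytes(StandardCharsets.US_ASCII);
    }

    /** Returns the MIME type suggested by the leading bytes, or a generic fallback. */
    static String detect(byte[] header) {
        for (int i = 0; i < MAGIC.length; i++) {
            if (startsWith(header, MAGIC[i])) {
                return TYPES[i];
            }
        }
        return "application/octet-stream";
    }

    /** Reads only the first few bytes of the file and delegates to detect(byte[]). */
    static String detect(Path file) throws IOException {
        try (InputStream in = Files.newInputStream(file)) {
            byte[] buf = new byte[MAX_HEADER];
            int n = in.readNBytes(buf, 0, buf.length); // Java 9+
            return detect(Arrays.copyOf(buf, n));
        }
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }
}
```

A call such as MagicSniffer.detect(Path.of("report.pdf")) (the file name is a placeholder) would return "application/pdf" for a file that starts with the bytes "%PDF-". The sketch also hints at why detection is non-trivial: the ZIP signature is shared by JAR, OpenDocument and other container formats, so a leading-bytes check alone cannot tell them apart.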
Another important aspect of file type detection is character encoding (aka codepage) detection for plain text files such as HTML or XML. I will cover this topic in a future article.
A series of text, image, audio, video and "other" files were used to test the Java libraries listed above. The details of the files I used are available here. The results of the test are summarized in the chart below:
The detection accuracy of most libraries was less than 50%. However, the Aperture Framework's MagicMimeTypeIdentifier was extremely accurate and correctly identified many proprietary formats. The code used to perform the actual testing, along with the test files themselves, is available here. More detail about the results of the tests is available in PDF form here.
MagicMimeTypeIdentifier from the Aperture Framework appears to be the most reliable and accurate file type detector.