Tuesday, December 26, 2006

File type detection

NOTE: Since this article was written, updates have been made to the MagicMimeTypeIdentifier in the Aperture Framework. Read more...

Problem Description

My current project, vyasa, is a digital library management system. One of the features of a digital library is the ability to recognize the types of files (digital assets) that are loaded into the repository.

The process of detecting a file's type (also known as its MIME type) is non-trivial, yet there are important benefits. For example, type-related metadata, such as the length and bitrate of an audio file, or the size and DPI of an image file, leads to comprehensive asset management.

Possible Solutions

The problem and various solutions to file type detection are briefly explained in this Wikipedia article. Detailed coverage is available in the following academic papers (PDF):

Java Implementations

  • javax.activation.FileDataSource is part of the JavaBeans(TM) Activation Framework used by the JavaMail(TM) API to manage MIME data.
  • Java Mime Magic, "retrieves file and stream mime types by checking magic headers," according to Réal Gagnon. (LGPL)
  • ffident is a Java metadata extraction and file format identification library created by Marco Schmidt. (LGPL)
  • JHOVE (JSTOR/Harvard Object Validation Environment) will identify, validate and characterize file types. (LGPL)
  • MagicMimeTypeIdentifier is from the Aperture Framework. It determines the MIME type of a binary resource based on magic number-based heuristics. (AFL, OSL)
  • MimetypeRegistryService is part of the Nuxeo project. (LGPL)
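
Several of the libraries above rely on "magic numbers": fixed byte sequences that appear at the start of files in a given format. The idea can be illustrated with a minimal sketch. The class name MagicSniffer, its method names and its small signature table are hypothetical and written only for this illustration; they are not the API of Aperture or of any other library listed here, and real detectors use far larger signature databases plus additional heuristics.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical illustration of magic-number detection; not taken from any listed library. */
final class MagicSniffer {
    /** Fallback returned when no signature matches. */
    static final String UNKNOWN = "application/octet-stream";

    /** Longest signature in the table below, in bytes. */
    private static final int MAX_SIGNATURE_LENGTH = 8;

    /** A tiny, assumed signature table: MIME type -> leading bytes. */
    private static final Map<String, byte[]> SIGNATURES = new LinkedHashMap<>();

    static {
        SIGNATURES.put("image/png",
                new byte[] {(byte) 0x89, 'P', 'N', 'G', 0x0D, 0x0A, 0x1A, 0x0A});
        SIGNATURES.put("image/jpeg",
                new byte[] {(byte) 0xFF, (byte) 0xD8, (byte) 0xFF});
        SIGNATURES.put("image/gif", new byte[] {'G', 'I', 'F', '8'});
        SIGNATURES.put("application/pdf", new byte[] {'%', 'P', 'D', 'F', '-'});
        // ZIP-based containers (JAR, OOXML, ...) share this prefix, so a real
        // detector would need to look inside the archive to tell them apart.
        SIGNATURES.put("application/zip", new byte[] {'P', 'K', 0x03, 0x04});
    }

    private MagicSniffer() {
    }

    /** Returns the MIME type whose signature prefixes the given bytes, or UNKNOWN. */
    static String detect(byte[] header) {
        for (Map.Entry<String, byte[]> entry : SIGNATURES.entrySet()) {
            if (startsWith(header, entry.getValue())) {
                return entry.getKey();
            }
        }
        return UNKNOWN;
    }

    /** Reads only the first few bytes of the file; the file name is ignored. */
    static String detect(Path file) throws IOException {
        byte[] buffer = new byte[MAX_SIGNATURE_LENGTH];
        try (InputStream in = Files.newInputStream(file)) {
            int read = in.readNBytes(buffer, 0, buffer.length); // Java 9+
            return detect(Arrays.copyOf(buffer, read));
        }
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }
}
```

Calling MagicSniffer.detect(path) on a file inspects its content only, so in this sketch the result does not depend on the file's extension. Formats without a fixed leading signature, such as plain text, fall through to the generic application/octet-stream type.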

Non-Java Resources

  • complex.filter.detection.typeDetection is OpenOffice.org's code for detecting file types (LGPL).
  • Magic DB is a file containing magic numbers for identifying file types, as well as other file metadata. The format of the file is specified by Optima SC. (see license on magicdb.org)
  • Marco Schmidt lists several non-Java resources on his file formats page.
  • FileType is an internal filetype detection engine for other coders who wish to have a simple to use C module.
  • org.mmbase.util.magicfile determines file types based on a parsing of the UNIX magic command.

Character Encoding

Another important aspect of file type detection is character encoding (aka codepage) detection for plain text files such as HTML or XML. I will cover this topic in a future article.

Testing

A series of text, image, audio, video and "other" files were used to test the Java libraries listed above. The details of the files I used are available here. The results of the test are summarized in the chart below:

file-type_stats.jpg

The detection accuracy of most libraries was less than 50%. However, the Aperture Framework's MagicMimeTypeIdentifier was extremely accurate. It was able to correctly identify many proprietary formats. The code used to perform the actual testing, along with the test files themselves, is available here. More detail about the results of the tests is available in PDF form here.


Conclusion

MagicMimeTypeIdentifier from the Aperture Framework appears to be the most reliable and accurate file type detector.

11 comments:

Anonymous said...

Thanks Fred.

I am a developer from China and try to find a solution to detect file type (not by extension).
And I appreaciate u have a Zodiac Year, and mine is Monkey :)

Fred Eaker said...

I am happy I could help!

fuxx said...

Thank you again.
Developer from Russia ;-)

Anonymous said...

Many thanks, Fred.
I was just looking for Java MIME type detection library and this article saved me a couple of hours of due diligence.

Oliver, Cary, NC

Ryan said...

This is a very nice article. I realize it is a few years old, but I wanted to let you know that your link to the Wiki page for MIME is for the silent clown art form. I got a chuckle from that. Also, your link titled "Content Based File Type Detection Algorithms" is now a dead link. Otherwise, this was very informative.

Sami Andoni said...

Really thanks... Very helpful

jcmúzimo said...

Iam a developer from Mexico. Thank you very much my friend, was very helpful.

Mechanic (Kharkov, UA) said...

Thanks!
You gave direct links to main subj-related resources here, imho.

Gray said...

I've recently finished my SimpleMagic Java library which uses the magic(5) unix config files. See: http://256.com/sources/simplemagic/

Komal Agarwal said...

MagicMimeTypeIdentifier fails to identify some file mime-types if we rename the extensions of file. For example, If we change File.deb to File.png, it shows image/png as contentType. Do you know any other robust solution for identifying correct file types even after renaming them?
