Sunday, February 11, 2007

MagicMimeTypeIdentifier update

MagicMimeTypeIdentifier, which scored so highly on my comparison of Java file type detectors, has been updated. Here is the email I received from Christian Fluit:

I have just updated Aperture's MagicMimeTypeIdentifier based on the results of your benchmark, in order to achieve the best score possible. This led to the addition of the following MIME types:

audio/x-ms-wma (previously mistakenly labeled as audio/x-ms-wmv)
application/x-ms-wm (artificial supertype of wma and wmv, they share the same magic number and can only be distinguished when they have the proper file name extension or when you interpret the container's contents in more depth)

Also, some MIME types had their description updated, e.g. .rmi files are now also labeled as audio/midi.

Unfortunately, a 100% score is not achievable at the moment, as the magic number of TGA files also matches with that of certain versions of Quattro Pro spreadsheets. At the moment this cannot be expressed in the identifier's config file, you would almost need a rule based language to express this.

Running the test with the new version of MagicMimeTypeIdentifier (available from the CVS repository) led to a great improvement: 95% accuracy! Most of this improvement came from recognition of the SVG and ICO files. Great job Chris!

Chris also mentions in his email:

A remark about your benchmark: I noticed that all your files had the proper file extensions. Our MIME type identifier primarily uses magic numbers and only switches to checking file extensions when magic number matching fails or when it is unable to discriminate between a family of related file formats (e.g. the MS Office formats). I wonder what the outcome of your test would be if you would remove all file name extensions :)
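To make the strategy Chris describes concrete, here is a minimal Java sketch of "magic number first, file extension second". It is not Aperture's actual code: the class name MimeSketch, its method names, and the tiny hard-coded rule set are all made up for illustration, and the text/plain fallback and the video/x-ms-wmv label are my own assumptions. The ASF bytes are the start of the header shared by WMA and WMV files, which is why the extension is needed to tell them apart.

```java
import java.util.Arrays;
import java.util.Locale;

// Hypothetical illustration only; not the MagicMimeTypeIdentifier implementation.
class MimeSketch {
    // First four bytes of the ASF header GUID, shared by WMA and WMV files.
    private static final byte[] ASF_MAGIC = {0x30, 0x26, (byte) 0xB2, 0x75};
    // "%PDF", the start of every PDF file.
    private static final byte[] PDF_MAGIC = {'%', 'P', 'D', 'F'};

    // Returns a MIME type, or null when neither magic number nor extension matches.
    static String identify(byte[] head, String fileName) {
        String ext = extensionOf(fileName);

        // Step 1: an unambiguous magic number decides on its own.
        if (startsWith(head, PDF_MAGIC)) {
            return "application/pdf";
        }

        // Step 2: an ambiguous magic number identifies a family of formats;
        // the extension picks the member, otherwise report the supertype.
        if (startsWith(head, ASF_MAGIC)) {
            if (ext.equals("wma")) return "audio/x-ms-wma";
            if (ext.equals("wmv")) return "video/x-ms-wmv"; // assumed label
            return "application/x-ms-wm";
        }

        // Step 3: magic number matching failed, so fall back to the extension.
        if (ext.equals("txt")) {
            return "text/plain"; // assumed fallback rule
        }
        return null;
    }

    // True when data begins with the given magic bytes.
    static boolean startsWith(byte[] data, byte[] magic) {
        if (data == null || data.length < magic.length) {
            return false;
        }
        return Arrays.equals(Arrays.copyOf(data, magic.length), magic);
    }

    // Lower-cased extension without the dot, or "" when there is none.
    static String extensionOf(String fileName) {
        if (fileName == null) {
            return "";
        }
        int dot = fileName.lastIndexOf('.');
        return dot < 0 ? "" : fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
    }
}
```

With this sketch, a file starting with the ASF bytes and named song.wma would come back as audio/x-ms-wma, while the same bytes with no extension would only get the application/x-ms-wm supertype. That is the kind of accuracy loss one would expect when extensions are stripped.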

Renaming the files to remove their extensions reduced the accuracy of MagicMimeTypeIdentifier to 78%, a small decrease from the 84% accuracy with file extensions. Still, this beats the heck out of other detectors.

Current project deadlines prevent me from doing more comprehensive testing, but I would like to thank Chris for his response.
