Sunday, February 11, 2007

MagicMimeTypeIdentifier update

MagicMimeTypeIdentifier, which scored so highly on my comparison of Java file type detectors, has been updated. Here is the email I received from Christian Fluit:

I have just updated Aperture's MagicMimeTypeIdentifier based on the results of your benchmark, in order to achieve the best score possible. This led to the addition of the following MIME types:

audio/x-aiff
audio/x-ms-wma (previously mistakenly labeled as audio/x-ms-wmv)
application/x-ms-wm (artificial supertype of wma and wmv, they share the same magic number and can only be distinguished when they have the proper file name extension or when you interpret the container's contents in more depth)
image/svg
image/x-icon
image/x-raw
image/x-tga
application/x-freemind

Also, some MIME types had their description updated, e.g. .rmi files are now also labeled as audio/midi.

Unfortunately, a 100% score is not achievable at the moment, as the magic number of TGA files also matches with that of certain versions of Quattro Pro spreadsheets. At the moment this cannot be expressed in the identifier's config file, you would almost need a rule based language to express this.

Running the test with the new version of MagicMimeTypeIdentifier (available from the CVS repository) led to a great improvement: 95% accuracy! Most of this improvement came from recognition of the SVG and ICO files. Great job, Chris!

Chris also mentions in his email:

A remark about your benchmark: I noticed that all your files had the proper file extensions. Our MIME type identifier primarily uses magic numbers and only switches to checking file extensions when magic number matching fails or when it is unable to discriminate between a family of related file formats (e.g. the MS Office formats). I wonder what the outcome of your test would be if you would remove all file name extensions :)

Renaming the files to remove their extensions reduced the accuracy of MagicMimeTypeIdentifier to 78%, a small decrease from the 84% accuracy with file extensions. Still, this beats the heck out of the other detectors.
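The magic-number-first strategy Chris describes can be sketched roughly as follows. This is only an illustration of the idea, not Aperture's code: the class name MagicFirstDetector, the three lookup tables, and the handful of signatures in them are assumptions made for the example.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

public class MagicFirstDetector {

    // Illustrative magic-number table; NOT Aperture's actual configuration.
    private static final Map<String, byte[]> MAGIC = new LinkedHashMap<>();
    // Extensions used to refine a family of formats that share one magic number.
    private static final Map<String, String> WM_FAMILY = new LinkedHashMap<>();
    // Extensions consulted only when no magic number matches.
    private static final Map<String, String> FALLBACK = new LinkedHashMap<>();

    static {
        MAGIC.put("image/png", new byte[] {(byte) 0x89, 'P', 'N', 'G'});
        MAGIC.put("application/pdf", new byte[] {'%', 'P', 'D', 'F'});
        // First bytes of the ASF header shared by WMA and WMV files.
        MAGIC.put("application/x-ms-wm", new byte[] {0x30, 0x26, (byte) 0xB2, 0x75});
        WM_FAMILY.put("wma", "audio/x-ms-wma");
        WM_FAMILY.put("wmv", "video/x-ms-wmv");
        FALLBACK.put("txt", "text/plain");
        FALLBACK.put("csv", "text/csv");
    }

    /** Identifies a type from the leading bytes first, then from the file name. */
    public static String identify(byte[] header, String fileName) {
        String ext = extensionOf(fileName);
        for (Map.Entry<String, byte[]> entry : MAGIC.entrySet()) {
            if (startsWith(header, entry.getValue())) {
                String type = entry.getKey();
                // The magic number only identifies the family; refine by extension if possible.
                if (type.equals("application/x-ms-wm") && WM_FAMILY.containsKey(ext)) {
                    return WM_FAMILY.get(ext);
                }
                return type;
            }
        }
        // No magic number matched: fall back to the extension, if there is one.
        String byExtension = FALLBACK.get(ext);
        return byExtension != null ? byExtension : "application/octet-stream";
    }

    private static String extensionOf(String fileName) {
        if (fileName == null) {
            return "";
        }
        int dot = fileName.lastIndexOf('.');
        return dot < 0 ? "" : fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
    }

    private static boolean startsWith(byte[] data, byte[] magic) {
        if (data.length < magic.length) {
            return false;
        }
        for (int i = 0; i < magic.length; i++) {
            if (data[i] != magic[i]) {
                return false;
            }
        }
        return true;
    }
}
```

With tables like these, a PNG named photo.txt is still reported as image/png, while an ASF file with no extension can only be reported as the application/x-ms-wm supertype. That is the kind of ambiguity stripping the extensions exposes.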

Current project deadlines prevent me from doing more comprehensive testing, but I would like to thank Chris for his response.

Wednesday, January 24, 2007

Microsoft Photo Info

Microsoft Photo Info is a new software add-in for Microsoft Windows that allows photographers to add, change and delete common "metadata" properties for digital photographs from inside Windows Explorer.

Visualization Links

Swivel is a Web site for curious people to explore data. They "use farms of powerful computers and algorithms ... to transform a lonely grid of numbers and letters into hundreds - sometimes thousands - of graphs that can be explored and compared with any other public data ... have ratings and comments and publishing shortcuts for bloggers, so folks can share ideas, talk about insights and understand data together ... we transform the sometimes tedious task of reading someone else's spreadsheet into a fun experience of clicking through a Web site full of images, graphs and color."

Gapminder "is a non-profit venture for development and provision of free software that visualize human development. This is done in collaboration with universities, UN organizations, public agencies and non-governmental organizations. The main project during the coming three years is a collaboration with UN Statistic Division with the aim to visualize UN common database..."

The IBM Visual Communication Lab "develop[s] visualization algorithms that help people see and exchange information in novel ways. Our designs aim to transform visualization from a solitary activity into a collaborative one. Some application areas are online discussions, email archives, social networks, software development, and executive decision support tools. By allowing people to observe and orient themselves in complex information landscapes, our inventions enable faster, more insightful decisions."

Monday, January 22, 2007

Bauhaus-Universität Weimar

I recently stumbled upon Bauhaus-Universität Weimar (English), a university for creative studies in Weimar, Germany. The university conducts research in Web Technology and Information Systems.

The site has information about:

  • The AItools suite which addresses text-based information retrieval tasks. It is comprised of basic and advanced algorithms, data structures, and design patterns to model complex real-world retrieval processes.

  • The AIsearch mining tool for the intelligent analysis of document collections. It offers a convenient interface for Web-based search and combines algorithms for the formation, labeling, and visualization of categories along with a smart spelling analysis.

  • The International Workshop on Text-based Information Retrieval which addresses researchers, users, and practitioners from different fields: data mining and machine learning, document and knowledge management, semantic technologies, computer linguistics, and information retrieval in general.

The TIR workshop occurs in conjunction with DEXA, which also hosts other interesting workshops on data processing, data management, data mining and retrieval, semantics, knowledge, self adaption, and autonomic computing.

Friday, January 19, 2007

File Type Metadata Discovery, Part 2: Images

In a previous article, I evaluated various libraries to determine which most accurately identified a file's type. This article represents part two in a series of articles that explore how to discover metadata about a file after its type has been detected.

Java ImageIO

The primary Java library for image handling is javax.imageio, which provides a pluggable architecture for working with images stored in files and accessed across the network, plus a framework for adding format-specific plug-ins. Plug-ins for several common formats are included with Java Image I/O, but third parties can use this API to create their own plug-ins to handle special formats.

There is also a jai-imageio project on java.net which is a set of ImageReader and ImageWriter plugins for the ImageIO API, primarily built by the JAI team (see this thread at java.net).

The javax.imageio package has comprehensive metadata support through the ImageReader.getStreamMetadata() and ImageReader.getImageMetadata() methods. These methods return an IIOMetadata object whose values are accessible as a DOM tree via IIOMetadata.getAsTree(). The amount of image-specific metadata available is staggering, and probably wouldn't be useful to someone unless they were creating a very image-centric application. However, in a digital asset management system, simple metadata such as height and width are readily available from ImageReader.getImageMetadata().
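As a rough illustration of the calls involved, here is a minimal sketch that opens an image from a stream, reads its dimensions, and asks for the metadata tree. The class name ImageMetadataSketch and the describe() method are made up for this example. It also assumes the reader supports the standard javax_imageio_1.0 metadata format, which the plug-ins bundled with the JDK generally do.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.Iterator;
import javax.imageio.ImageIO;
import javax.imageio.ImageReader;
import javax.imageio.metadata.IIOMetadata;
import javax.imageio.metadata.IIOMetadataFormatImpl;
import javax.imageio.stream.ImageInputStream;
import javax.imageio.stream.MemoryCacheImageInputStream;
import org.w3c.dom.Node;

public class ImageMetadataSketch {

    /** Returns a one-line summary such as "4x3, metadata root: javax_imageio_1.0". */
    public static String describe(InputStream source) throws IOException {
        try (ImageInputStream in = new MemoryCacheImageInputStream(source)) {
            // Ask ImageIO which installed plug-ins can decode this stream.
            Iterator<ImageReader> readers = ImageIO.getImageReaders(in);
            if (!readers.hasNext()) {
                throw new IOException("No ImageReader found for this stream");
            }
            ImageReader reader = readers.next();
            try {
                reader.setInput(in);
                // Dimensions of the first image (index 0) in the stream.
                int width = reader.getWidth(0);
                int height = reader.getHeight(0);
                // Per-image metadata, exposed as a DOM tree in the standard format.
                IIOMetadata metadata = reader.getImageMetadata(0);
                Node root = metadata.getAsTree(IIOMetadataFormatImpl.standardMetadataFormatName);
                return width + "x" + height + ", metadata root: " + root.getNodeName();
            } finally {
                reader.dispose();
            }
        }
    }
}
```

In this sketch the width and height come from ImageReader.getWidth() and ImageReader.getHeight(), which saves walking the tree when the dimensions are all you need.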

ImageMetadataDiscoverer

I have created a simple library that gathers all available image metadata into a Map. Download ImageMetadataDiscoverer at Sourceforge or read the javadocs.
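To give a feel for what gathering a metadata tree into a Map can look like, here is a small sketch that flattens any DOM node, such as the one returned by IIOMetadata.getAsTree(), into path-style keys. It is not the ImageMetadataDiscoverer source: the class name MetadataFlattener and the "element/child@attribute" key scheme are assumptions chosen for the example.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;

public class MetadataFlattener {

    /** Flattens every attribute in the tree into "path@attribute" -> value entries. */
    public static Map<String, String> flatten(Node root) {
        Map<String, String> out = new LinkedHashMap<>();
        walk(root, root.getNodeName(), out);
        return out;
    }

    private static void walk(Node node, String path, Map<String, String> out) {
        // Image I/O metadata trees keep their values in element attributes.
        NamedNodeMap attributes = node.getAttributes();
        if (attributes != null) {
            for (int i = 0; i < attributes.getLength(); i++) {
                Node attribute = attributes.item(i);
                out.put(path + "@" + attribute.getNodeName(), attribute.getNodeValue());
            }
        }
        // Recurse into children. Repeated sibling names share a path in this
        // simple scheme, so a later sibling overwrites an earlier one.
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            walk(child, path + "/" + child.getNodeName(), out);
        }
    }
}
```

For a PNG read through the native metadata format, a key might look like javax_imageio_png_1.0/IHDR@width. A real library would need a better answer for repeated elements than last-one-wins.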

Need More?

More specific needs can be met by the wide variety of image tools available for Java. Marco Schmidt has a nice list of raster and vector libraries. DMOZ also maintains a directory of libraries.

Wednesday, January 17, 2007

The Future of Semantic Search

Steven Arnold recently stated some facts that are very closely related to my project and research interests:

...what will carry us into 2007 is a collection of technologies we think of as text mining, where software algorithms look at documents and find the names of people, places and things and attempt to relate them to one another.
...companies like Attensity Corp. and nStein Technologies ... are focused on figuring out the nuances, relationships and the important concepts in a document. Their systems generate index terms that an enterprise search system can suck in.
...new companies ... are approaching the problem both mathematically and by doing vocabulary and knowledge-based analysis. Their software decomposes sentences into subjects, verbs and adjectives and analyzes the results with the predictive algorithms.

Wikiseek

A newly launched service called Wikiseek focuses on complementing Wikipedia by restricting search results to articles and references in the encyclopedia. Wikiseek's about page claims that this method makes it "an authoritative source of information less subject to spam and SEO schemes."

Wikiseek suggests search refinements "based on user tagging and categorization within Wikipedia." When I do a search for the word "apple," Wikiseek returns a category cloud displaying the categories in which my search phrase appears most often. The "Apple hardware" category shows most prominently. Actually, I was more interested in the fruit because I had one with lunch today. Clicking on the fruit category gives me a list of those references in Wikipedia that refer to "apple" as a fruit. Pretty nifty! If I go back and choose "Apple hardware" I see the expected list of articles about current and legacy Apple products.

There are some instances of strange results, such as the search for "wiki," and commenters on the TechCrunch article are quick to point out that since Wikipedia is editable, article spammers now have more incentive if their actions will affect a Wikipedia-specific search engine.

Personally, I find the category refinement feature and reference results useful, although not enough to build an entire site around. It would be better utilized as a module in a larger package... perhaps Wikipedia itself.