Monday, January 1, 2007

Character encoding detection

Problem

Documents stored as plain text, such as XML or XHTML, are written in a particular character encoding, also known as a character set or codepage. Knowing the encoding allows applications to determine how the bytes should be interpreted and displayed. This is especially important for CJK (Chinese, Japanese, Korean) languages, which have many encodings in common use.

A detailed analysis of the problem is available in A composite approach to language/encoding detection by Shanjian Li and Katsuhiko Momoi (2001). Li and Momoi's approach became the basis of Mozilla's Universal Charset Detector.

Implementations

  • jchardet is a Java port of Mozilla's character set detection algorithm. (MPL)

  • International Components for Unicode (ICU) is a set of C/C++ and Java libraries for Unicode support, software internationalization and globalization (i18n/g11n). It grew out of the JDK 1.1 internationalization APIs, which the ICU team contributed, and the project continues to be developed for the most advanced Unicode/i18n support. ICU is widely portable and gives applications the same results on all platforms and between C/C++ and Java software. (X license)
  • cpdetector is a Java framework for configurable code page-detection of documents. (MPL)
  • monq.stuff.EncodingDetector is part of the Java Finite Automata class library from the European Bioinformatics Institute. (GPL)
  • com.sun.syndication.io.XmlReader handles the character encoding of XML documents in Files, raw streams and HTTP streams by offering a wide set of constructors. Part of the ROME project for reading RSS and Atom feeds. A nice explanation of how ROME detects character encoding is available here. (Apache License)
  • com.glaforge.i18n.io.CharsetToolkit is a utility class that guesses the charset used in a byte buffer. (Unknown license)
  • MLang is a Microsoft library, documented on MSDN, that lists “detection of which possible code pages and languages text data is written in” as one of its features. (Microsoft Windows DLL)
  • Charset Detector is a stand-alone executable module for automatic charset/encoding detection based on Mozilla's i18n component. It can be compiled for MS Windows using Delphi or Free Pascal, or for Linux using Delphi/Kylix. (LGPL)
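Most of the libraries above share a core idea: attempt to decode the bytes in candidate encodings and rule out any that produce malformed sequences. As a minimal, dependency-free sketch of that idea (the class and method names here are mine, not from any of the libraries listed), the JDK's own CharsetDecoder can report decoding failures:

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;

public class CharsetGuesser {

    /** Returns true if the bytes decode cleanly in the given charset. */
    static boolean decodesAs(byte[] bytes, String charsetName) {
        CharsetDecoder decoder = Charset.forName(charsetName).newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    /** Returns the first candidate charset that decodes the bytes cleanly, or null. */
    static String guess(byte[] bytes, String... candidates) {
        for (String name : candidates) {
            if (decodesAs(bytes, name)) {
                return name;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // 0xE4 0xB8 0xAD is the UTF-8 encoding of U+4E2D; it is not valid US-ASCII.
        byte[] utf8 = {(byte) 0xE4, (byte) 0xB8, (byte) 0xAD};
        System.out.println(guess(utf8, "US-ASCII", "UTF-8")); // prints UTF-8
    }
}
```

Note that candidate order matters: single-byte encodings such as ISO-8859-1 accept any byte sequence, so they must be tried last, and a clean decode only proves the bytes are valid in that encoding, not that it is the right one. The statistical detectors above exist precisely because validity alone is ambiguous.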

Test Results

[Two charts summarizing the test results; full-size images were included in the original post]

Raw data (PDF) | Source and test files

The HTML Character Set Tables from Columbia University all have META tags that specify the encoding. cpdetector did an excellent job of recognizing these, while the other detectors failed to do so.

The XHTML files created from the Character Encoding Test Page all have XML prologs and META tags that specify the encoding. cpdetector and monq.stuff.EncodingDetector both shine here, while com.sun.syndication.io.XmlReader generated several java.io.UnsupportedEncodingExceptions.

The Japanese XML files from the W3C (“pr-” and “weekly-” prefix) were best handled by ICU and com.sun.syndication.io.XmlReader, despite the fact that XML prologs were not always available.

Finally, the Chinese TXT and XML files (“zh-” prefix), which all have XML prologs, were best handled by cpdetector, monq.stuff.EncodingDetector, and com.sun.syndication.io.XmlReader.

Conclusion

The strength of a character encoding detector lies in its focus: statistical analysis versus HTML META and XML prolog discovery. If you are processing HTML files that have META tags, use cpdetector. Otherwise, your best bet is either monq.stuff.EncodingDetector or com.sun.syndication.io.XmlReader.
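The META/prolog-discovery approach is straightforward to approximate for XML: read the first bytes of the file, decode them as ISO-8859-1 (which maps every byte to a character), and pull the encoding pseudo-attribute out of the declaration. A rough sketch under those assumptions (this is not the actual code of any library tested here):

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PrologSniffer {

    private static final Pattern ENCODING = Pattern.compile(
            "<\\?xml[^>]*encoding\\s*=\\s*[\"']([A-Za-z0-9._-]+)[\"']");

    /** Returns the encoding declared in an XML prolog, or null if absent. */
    static String declaredEncoding(byte[] head) {
        // ISO-8859-1 maps every byte to a char, so an ASCII-compatible prolog
        // survives the round trip even if the document body would not.
        String text = new String(head, StandardCharsets.ISO_8859_1);
        Matcher m = ENCODING.matcher(text);
        return m.find() ? m.group(1) : null;
    }
}
```

This only handles prologs in ASCII-compatible encodings; UTF-16 and EBCDIC families require sniffing the first four bytes first, as described in Appendix F of the XML 1.0 specification.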

What about files that have no META tag or XML prolog? That is a big question. If an XML file has neither, but contains multiple languages, such as Chinese content with English markup, a statistical analysis skewed toward the more prevalent markup language may not be enough to identify the encoding and display the document properly. This is where Unicode comes in: a single Unicode encoding can represent every language in the document, leaving multilingual rendering to the application used to display it. Check out Wikipedia's entry on Multilingual support.
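One signal that helps when there is no META tag or prolog: Unicode files frequently begin with a byte order mark (BOM). A small sketch that recognizes the common BOMs (the class and method names are illustrative):

```java
public class BomSniffer {

    /** Returns the charset implied by a leading byte order mark, or null if none. */
    static String detectBom(byte[] b) {
        // UTF-8 BOM: EF BB BF
        if (b.length >= 3 && (b[0] & 0xFF) == 0xEF && (b[1] & 0xFF) == 0xBB
                && (b[2] & 0xFF) == 0xBF) {
            return "UTF-8";
        }
        // UTF-32 BOMs must be checked before UTF-16, since FF FE 00 00 starts with FF FE.
        if (b.length >= 4 && (b[0] & 0xFF) == 0xFF && (b[1] & 0xFF) == 0xFE
                && b[2] == 0 && b[3] == 0) {
            return "UTF-32LE";
        }
        if (b.length >= 4 && b[0] == 0 && b[1] == 0
                && (b[2] & 0xFF) == 0xFE && (b[3] & 0xFF) == 0xFF) {
            return "UTF-32BE";
        }
        if (b.length >= 2 && (b[0] & 0xFF) == 0xFF && (b[1] & 0xFF) == 0xFE) {
            return "UTF-16LE";
        }
        if (b.length >= 2 && (b[0] & 0xFF) == 0xFE && (b[1] & 0xFF) == 0xFF) {
            return "UTF-16BE";
        }
        return null; // no BOM; fall back to markup discovery or statistical detection
    }
}
```

A null result means only that there is no BOM; plenty of valid UTF-8 files omit it, so this check can rule encodings in but never rule UTF-8 out.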

Test files

Other Useful Resources

While doing research, I used the following query in Google:
("character set" OR "charset" OR "codepage" OR "character encoding")

Comments

George said...

I think your j-a-b.net results are invalid. Even though the document is labeled with one charset, the contents are in a different charset. ICU and Mozilla got the j-a-b.net cases correct. The other 4 usually got them wrong, especially for the multi-byte ones, like ISO-2022-JP, EUC-KR and so on.

Some of the zh-* files from w3c.org were correctly detected by all packages.

Bart said...

Do you know of a tool that will detect the text encoding of a file?

I have a file that is semi-readable in UTF-8, US-ASCII, and Western European (DOS, Windows, Mac, and ISO) text encodings.


Alexander Zagniotov said...

@Bart:

I am not sure if you are familiar with programming, but you can use the ASCII value of a character to identify its byte size and display it correctly (keep in mind that the following post shows the approach in PHP):

How to detect UTF-8 multi byte characters
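For readers working in Java rather than PHP, the idea in the linked post boils down to inspecting the lead byte: the high bits of a UTF-8 lead byte encode how many bytes the character occupies. A sketch (the class and helper names are mine):

```java
public class Utf8Width {

    /**
     * Returns the length in bytes of the UTF-8 sequence that starts with the
     * given lead byte, or -1 for a continuation byte or invalid lead byte.
     */
    static int sequenceLength(byte lead) {
        int b = lead & 0xFF;
        if (b < 0x80) return 1;  // 0xxxxxxx: plain ASCII
        if (b < 0xC0) return -1; // 10xxxxxx: continuation byte, not a lead
        if (b < 0xE0) return 2;  // 110xxxxx: two-byte sequence
        if (b < 0xF0) return 3;  // 1110xxxx: three-byte sequence
        if (b < 0xF8) return 4;  // 11110xxx: four-byte sequence
        return -1;               // 0xF8-0xFF never appear in valid UTF-8
    }
}
```

A full validator would also check that each of the following bytes is a continuation byte (10xxxxxx); this sketch only classifies the lead byte.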

GreatGhoul said...

Good job. I like your post.
It saved me!

Martins said...

Thanks for your post!

Although now I see, that there are several character set encoding detection tools around, I wrote one simple UTF-8 detector myself. You can find it on my blog page: http://martinskemme.wordpress.com/2009/03/20/do-not-forget-about-encoding/


Achim Westermann said...

Thanks a lot. I should do another release of cpdetector ;-)
Achim

Steven said...

I ran the test myself against ICU. The test files it uses are not real content in the named codepages; they are code charts. I.e., they are not "text in encoding X" but "text in English with a list of the code points in encoding X". Some of them aren't actually in the named codepage, but contain numeric character references.

ICU should be used with the input filter enabled if you are feeding it XML/HTML; otherwise you are reading the markup: detector.enableInputFilter(true);

ICU's detection assumes that META and XML tags have already been taken into account.

So, your results may be somewhat misleading without these caveats.

Steven said...

Further testing: I ran the test case in that blog against the same detectors (latest versions) on untagged data. ONLY ICU detected: euc-jp, iso-2022-jp, koi8-r, iso-2022-cn, iso-2022-kr....

Only ICU and Mozilla jchardet detected: shift-jis, gb18030, big5...

I used samples from http://source.icu-project.org/repos/icu/icu/trunk/source/extra/uconv/samples/ and the utf-8 directory (some converted from files there into the target codepage).