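Documents that are stored as plain text, such as XML or XHTML, are written in a particular character encoding, also known as a character set or codepage. The encoding tells an application how to turn the stored bytes back into characters, so reading a file with the wrong encoding displays the wrong characters. When a document does not declare its encoding, the application has to detect it. This matters most for CJK (Chinese, Japanese, Korean) text, where several incompatible multi-byte encodings are in common use.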
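To make the effect concrete, here is a minimal sketch that writes a short Japanese string as bytes in one encoding and reads it back in another. The class name EncodingMismatchDemo and its roundTrip method are hypothetical names chosen for this illustration; only standard java.nio.charset classes are used.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodingMismatchDemo {
    // Three Japanese characters, written as Unicode escapes so that the
    // encoding of this source file does not affect the example.
    static final String ORIGINAL = "\u65E5\u672C\u8A9E";

    // Encodes the text with one charset and decodes the bytes with another,
    // which is what happens when a reader guesses the encoding of a file.
    static String roundTrip(String text, Charset writtenAs, Charset readAs) {
        byte[] stored = text.getBytes(writtenAs);
        return new String(stored, readAs);
    }
}
```

Calling roundTrip(ORIGINAL, UTF_8, UTF_8) returns the original three characters, while roundTrip(ORIGINAL, UTF_8, ISO_8859_1) returns nine unrelated Latin-1 characters, one for each stored byte.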
A detailed analysis of the encoding detection problem is available in "A composite approach to language/encoding detection" by Shanjian Li and Katsuhiko Momoi (2001). Li and Momoi's approach became Mozilla's Universal Charset Detector.
- jchardet is a Java port of Mozilla's character set detection algorithm. (MPL)
- International Components for Unicode (ICU) is a set of C/C++ and Java libraries for Unicode support, software internationalization and globalization (i18n/g11n). It grew out of the JDK 1.1 internationalization APIs, which the ICU team contributed, and the project continues to be developed for the most advanced Unicode/i18n support. ICU is widely portable and gives applications the same results on all platforms and between C/C++ and Java software. (X license)
- cpdetector is a Java framework for configurable code page-detection of documents. (MPL)
- monq.stuff.EncodingDetector is part of the Java Finite Automata class library from the European Bioinformatics Institute. (GPL)
- com.sun.syndication.io.XmlReader handles the character encoding of XML documents in Files, raw streams and HTTP streams by offering a wide set of constructors. Part of the ROME project for reading RSS and Atom feeds. A nice explanation of how ROME detects character encoding is available here. (Apache License)
- com.glaforge.i18n.io.CharsetToolkit is a utility class that guesses the charset used in a byte buffer. (Unknown license)
- MLang is an MSDN library that lists “detection of which possible code pages and languages text data is written in” as one of its features. (Microsoft Windows DLL)
- Charset Detector is a standalone executable module for automatic charset / encoding detection based on Mozilla's i18n component. It can be compiled for MS Windows using Delphi or Free Pascal, or for Linux using Delphi/Kylix. (LGPL)
The XHTML files created from the Character Encoding Test Page all have XML prologs and META tags that specify the encoding. cpdetector and monq.stuff.EncodingDetector both shine here, while com.sun.syndication.io.XmlReader generated several java.io.UnsupportedEncodingExceptions.
How well a character encoding detector works for you depends on whether it focuses on statistical analysis of the content or on discovering HTML META tags and XML prologs. If you are processing HTML files that have META tags, use cpdetector. Otherwise, your best bet is either monq.stuff.EncodingDetector or com.sun.syndication.io.XmlReader.
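META and prolog discovery can be sketched in a few lines. The example below is an illustration under stated assumptions, not the way cpdetector, monq or ROME actually implement it: the class DeclaredEncodingSniffer, its method and the 1024-byte limit are hypothetical, and the approach only works for ASCII-compatible encodings, because the first bytes are read as ISO-8859-1 before the declaration is searched for.

```java
import java.nio.charset.StandardCharsets;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DeclaredEncodingSniffer {
    // Matches the encoding pseudo-attribute of an XML prolog.
    private static final Pattern XML_PROLOG = Pattern.compile(
        "<\\?xml[^>]*?encoding\\s*=\\s*[\"']([A-Za-z][A-Za-z0-9._-]*)[\"']");

    // Matches charset=... inside an HTML META tag, quoted or unquoted.
    private static final Pattern HTML_META = Pattern.compile(
        "<meta[^>]*?charset\\s*=\\s*[\"']?([A-Za-z][A-Za-z0-9._-]*)",
        Pattern.CASE_INSENSITIVE);

    // Returns the encoding name declared near the start of the document, if any.
    // The XML prolog is checked first, then the META tag.
    static Optional<String> declaredEncoding(byte[] head) {
        // Assumption: the declaration sits in the first 1024 bytes and is
        // written in ASCII-compatible bytes, so ISO-8859-1 can read it.
        int length = Math.min(head.length, 1024);
        String text = new String(head, 0, length, StandardCharsets.ISO_8859_1);

        Matcher prolog = XML_PROLOG.matcher(text);
        if (prolog.find()) {
            return Optional.of(prolog.group(1));
        }
        Matcher meta = HTML_META.matcher(text);
        if (meta.find()) {
            return Optional.of(meta.group(1));
        }
        return Optional.empty();
    }
}
```

The returned name is only what the document claims. A caller would still need to check that the JVM supports it, for example with Charset.isSupported, before decoding the rest of the file.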
What about files that have no META or prolog? That is a big question. If an XML file has neither, but contains multiple languages, such as Chinese content and English markup, a statistical analysis skewed by the more prevalent markup language may not be enough to display a document properly. This is where Unicode comes in. Unicode with multiple languages is handled by the application used to render the document. Check out Wikipedia's entry on Multilingual support.
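For files with neither a META tag nor a prolog, one cheap check before any statistical analysis is to look for a Unicode byte order mark and then test whether the bytes are strictly valid UTF-8. The sketch below is not a statistical detector and is not how any library listed above works; the class UnicodeGuess and its methods are hypothetical names for this illustration, built only on standard java.nio.charset classes.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.Optional;

class UnicodeGuess {
    // Returns a Unicode encoding name if the bytes carry a byte order mark
    // or decode cleanly as UTF-8; otherwise returns empty, meaning a
    // statistical detector is still needed.
    static Optional<String> guess(byte[] data) {
        if (startsWith(data, 0xEF, 0xBB, 0xBF)) return Optional.of("UTF-8");
        if (startsWith(data, 0xFE, 0xFF)) return Optional.of("UTF-16BE");
        // Simplification: FF FE 00 00 is also the UTF-32LE mark; not handled here.
        if (startsWith(data, 0xFF, 0xFE)) return Optional.of("UTF-16LE");
        if (isStrictUtf8(data)) return Optional.of("UTF-8");
        return Optional.empty();
    }

    // True when every byte sequence is well-formed UTF-8. Pure ASCII input
    // also passes, and input cut off in the middle of a character fails.
    static boolean isStrictUtf8(byte[] data) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) return false;
        }
        return true;
    }
}
```

The strict check is useful because text in legacy CJK encodings such as Big5 or Shift_JIS rarely happens to form valid UTF-8, so a clean decode is strong evidence, while a failure says nothing about which legacy encoding is in use.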
- Extensible Markup Language (XML) Conformance Test Suites (10 December 2003)
- OASIS XML Conformance Subcommittee, XML 1.0 Test Suite, Second Edition (15 March 2001)
- Chinese XML Now! Test Files
- W3C HTML/XHTML Test Suites
- Windows Test page for Academic Russian
- Character Encoding Test Page
- Character Set Tables
Other Useful Resources
- The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) By Joel Spolsky, Wednesday, October 08, 2003
- Letter Database lists the characters and corresponding code pages for specific languages.
- W3C I18N tutorial: Character sets & encodings in XHTML, HTML and CSS
- netscape.public.mozilla.i18n has been abandoned and replaced by mozilla.dev.i18n
- XHTML Test Cases
- MIME Test - Character Encoding Test
While doing research, I used the following phrase in Google:
("character set" OR "charset" OR "codepage" OR "character encoding")