Extensible Markup Language (XML)



General

The Extensible Markup Language (XML) is a flexible text format designed for electronic publishing and as an interchange format between different data processing systems. It also plays an important role in exchanging data on the Web. XML is derived from SGML (ISO 8879).
Every XML document is built from characters of the Unicode repertoire. XML's primary purpose is data interoperability: it provides a framework within which vocabularies and data structures can be agreed upon.
XML has become popular as a technique for describing any kind of data. It is used today in almost every industry, and its use is still expanding. Almost every type of structured data or content can be managed and communicated with XML.

According to the W3C XML Recommendation, the design goals for XML are:

XML shall be straightforwardly usable over the Internet.
XML shall support a wide variety of applications.
XML shall be compatible with SGML.
It shall be easy to write programs which process XML documents.
The number of optional features in XML is to be kept to the absolute minimum, ideally zero.
XML documents should be human-legible and reasonably clear.
The XML design should be prepared quickly.
The design of XML shall be formal and concise.
XML documents shall be easy to create.
Terseness in XML markup is of minimal importance.



UTF (Unicode Transformation Format)

UTF-8, UTF-16 and UTF-32 are encoding schemes of Unicode; they specify how each character is encoded as bytes. UTF-8 and UTF-16 are the standard encodings for Unicode text in HTML documents, and UTF-8 is the preferred and most widely used of them. If an XML document has no encoding declaration, the encoding is generally assumed to be UTF-8. If UTF-16 or UTF-32 is used, it must be declared in the encoding declaration at the start of the document. Some systems use a different encoding internally, however: in SQL databases such as Microsoft SQL Server, for instance, XML data is stored as UTF-16.
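As a minimal sketch of how an encoding declaration is used in practice, the following Python snippet parses the same small document once as undeclared UTF-8 and once as declared UTF-16. It assumes Python's standard xml.etree.ElementTree module; the element name "greeting" and its content are made up for illustration.

```python
import xml.etree.ElementTree as ET

TEXT = "Gr\u00fc\u00dfe"  # illustrative content with non-ASCII characters

# No encoding declaration: the parser assumes UTF-8.
utf8_doc = f"<greeting>{TEXT}</greeting>".encode("utf-8")

# UTF-16 document: the encoding is named in the XML declaration.
# Python's "utf-16" codec also prepends a byte order mark (BOM).
utf16_doc = (
    '<?xml version="1.0" encoding="UTF-16"?>'
    f"<greeting>{TEXT}</greeting>"
).encode("utf-16")

root_utf8 = ET.fromstring(utf8_doc)
root_utf16 = ET.fromstring(utf16_doc)

# Both byte sequences differ, but the parsed text is the same.
print(utf8_doc != utf16_doc)
print(root_utf8.text == root_utf16.text == TEXT)
```

The bytes are passed to the parser unchanged, so it is the parser, not the program, that works out the encoding from the declaration and the byte order mark.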
UTF-32, UTF-16 and UTF-8 can each represent every character of the Unicode character set, so they can be used to communicate any letter or sign of any language in the world. Legacy code pages, by contrast, cover only a limited repertoire; the Thai character set, for instance, is defined as "CP874".
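A short Python sketch can make the relationship between the three Unicode encodings and a script-specific code page such as CP874 concrete. It uses only Python's built-in codecs; the sample characters are chosen purely for illustration.

```python
thai_ko_kai = "\u0e01"  # THAI CHARACTER KO KAI

# The explicit little-endian codecs are used so that no byte order mark
# is added and the lengths reflect the character alone.
for codec in ("utf-8", "utf-16-le", "utf-32-le", "cp874"):
    data = thai_ko_kai.encode(codec)
    print(f"{codec:10} {len(data)} byte(s): {data.hex()}")

# A code page covers only its own repertoire: a Chinese character
# cannot be encoded in CP874, while every UTF can encode it.
try:
    "\u4e2d".encode("cp874")
    cp874_can_encode_cjk = True
except UnicodeEncodeError:
    cp874_can_encode_cjk = False

print("CP874 can encode U+4E2D:", cp874_can_encode_cjk)
print("UTF-8 bytes for U+4E2D:", "\u4e2d".encode("utf-8").hex())
```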

UTF-8 encodes each character (code point) in 1 to 4 octets (8-bit bytes). UTF-8 is backwards-compatible with ASCII (the first 128 characters of the Unicode character set): for these 128 characters it uses a single octet whose value equals the ASCII code of the character.
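The backwards compatibility with ASCII can be checked with a few lines of Python. This is only an illustrative sketch using the built-in codecs; the sample string is arbitrary.

```python
ascii_text = "Hello, XML!"

# For pure ASCII text the two encodings produce identical bytes.
as_ascii = ascii_text.encode("ascii")
as_utf8 = ascii_text.encode("utf-8")
print(as_ascii == as_utf8)

# Each of the first 128 code points maps to one octet whose value
# equals the code point itself.
single_octet = all(
    chr(cp).encode("utf-8") == bytes([cp]) for cp in range(128)
)
print(single_octet)

# The first code point beyond ASCII (U+0080) already needs two octets.
beyond_ascii_length = len(chr(0x80).encode("utf-8"))
print(beyond_ascii_length)
```

A practical consequence is that a file containing only ASCII characters is already a valid UTF-8 file and needs no conversion.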

See: ASCII Code Table

First 128 Characters of Unicode encoded in UTF-8
The first 128 characters of Unicode (U+0000 to U+007F), which are identical to the ASCII table, are represented by only one byte.

Characters 129 to 2048 of Unicode encoded in UTF-8
The characters from 129 to 2048 (U+0080 to U+07FF) include the accented Latin letters used by most European languages as well as Greek, Coptic, Cyrillic, Armenian, Hebrew, Arabic, Syriac and Thaana. They are encoded by two bytes.

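All other Characters encoded in UTF-8
The rest of the so-called Basic Multilingual Plane (U+0800 to U+FFFF) contains nearly all other commonly used characters, including the everyday CJK characters (Chinese, Japanese, Korean). These are encoded by three bytes. Characters outside the Basic Multilingual Plane (U+10000 to U+10FFFF), such as rare CJK ideographs and historic scripts, are encoded by four bytes.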
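The byte lengths described in this section follow directly from the code point ranges defined for UTF-8. The helper below is a small sketch of that rule; the name utf8_length is hypothetical, and the sample characters are chosen only to cover each range. Note that a common CJK character such as U+4E2D lies in the Basic Multilingual Plane and takes three bytes, while a rare ideograph such as U+20000 takes four.

```python
def utf8_length(code_point: int) -> int:
    """Return the number of bytes UTF-8 uses for a Unicode code point."""
    if code_point < 0 or code_point > 0x10FFFF:
        raise ValueError("not a Unicode code point")
    if code_point <= 0x7F:      # ASCII range
        return 1
    if code_point <= 0x7FF:     # Latin supplements, Greek, Cyrillic, ...
        return 2
    if code_point <= 0xFFFF:    # rest of the Basic Multilingual Plane
        return 3
    return 4                    # supplementary planes


samples = [
    "A",           # U+0041 Latin capital letter A
    "\u00e9",      # U+00E9 Latin small letter e with acute
    "\u03a9",      # U+03A9 Greek capital letter omega
    "\u20ac",      # U+20AC euro sign
    "\u4e2d",      # U+4E2D common CJK ideograph
    "\U00020000",  # U+20000 rare CJK ideograph (Extension B)
]

for ch in samples:
    cp = ord(ch)
    encoded = ch.encode("utf-8")
    # The rule-based length agrees with Python's own encoder.
    assert len(encoded) == utf8_length(cp)
    print(f"U+{cp:04X}: {len(encoded)} byte(s)")
```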