Encodings – My Personal Summary

I just read Joel Spolsky’s article about encoding basics. Because I tend to forget things very fast, I decided to summarize the most important points for myself.

History

In the beginning, there was ASCII. ASCII defines its symbols with a seven-bit scheme. Because 128 symbols are very limited and one bit of every byte was left unused, several vendors decided to use the range from 128 to 255 for their own symbols. Such a vendor-specific assignment of the upper range is called a code page. As you can imagine, things became a big mess when a program written for, say, an IBM machine had to run on another vendor’s machine.
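To make the code page mess concrete, here is a minimal sketch in Python. The two code pages (ISO 8859-1 and the IBM PC code page 437) are my own picks for illustration; the article doesn’t name any specific ones.

```python
# One byte from the upper range (128 to 255), read through two code pages.
raw = bytes([0xE4])

as_latin1 = raw.decode("latin-1")  # ISO 8859-1 (Western European)
as_cp437 = raw.decode("cp437")     # code page of the original IBM PC

print(as_latin1)  # ä
print(as_cp437)   # Σ
```

The byte is identical in both cases; only the table used to interpret it differs.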

Unicode

To overcome this problem, the Unicode Consortium was founded in 1991. Unicode defines a huge set of characters (at the time of writing, Wikipedia states that Unicode 13.0 defines 143,859 characters). Each character is mapped to a so-called code point, a number that is usually written in hexadecimal with a U+ prefix. For example, the letter A has the code point U+0041 (hex), which is 65 in decimal. The important thing about code points is that they are abstract: a code point only assigns a numerical value to a character; it doesn’t define how this character is stored as bytes.
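A quick way to look at code points is Python’s built-in ord() and chr(). This is just a small sketch for my own reference; the variable names are made up.

```python
letter = "A"

code_point = ord(letter)       # the code point as a plain integer
label = f"U+{code_point:04X}"  # the usual U+ notation (hexadecimal)

print(code_point)  # 65
print(label)       # U+0041

# chr() goes the other way: from a code point back to the character.
same_letter = chr(0x41)
print(same_letter)  # A
```

Note that nothing here says anything about bytes yet; ord() only returns the number Unicode assigns to the character.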

Encodings

Encodings define how a code point is represented as a sequence of bits. The most popular encoding is UTF-8. The character "A" in UTF-8 is encoded as the single byte 0x41, which in this case happens to be the same value as the Unicode code point. The German letter "ß" is encoded as the two bytes 0xC3 0x9F, which no longer match the value of its code point U+00DF. Because there are a lot of encodings, we often face the problem that a sender encodes a character with, say, encoding A and the receiver can’t decode it correctly because it only knows or assumes encoding B. The result is often displayed as a question mark or as a wrong character. To overcome this problem, communication protocols carry additional headers that declare the encoding of the transported payload (like the Content-Type header in HTTP).
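The following sketch plays through these cases in Python: encoding with UTF-8, decoding with the wrong encoding, and picking the encoding from an HTTP-style Content-Type value. The helper charset_from_content_type and the fallback to UTF-8 are my own assumptions for illustration, not part of any standard API; cp1252 is just one example of a "wrong" encoding.

```python
from email.message import Message

# Encoding: code point -> bytes
encoded_a = "A".encode("utf-8")   # one byte, 0x41
encoded_sz = "ß".encode("utf-8")  # two bytes, 0xC3 0x9F

# Receiver assumes the wrong encoding: the bytes decode, but to garbage.
garbled = encoded_sz.decode("cp1252")
print(garbled)  # ÃŸ

# Target encoding has no such character: here it turns into "?".
replaced = "ß".encode("ascii", errors="replace")
print(replaced)  # b'?'


def charset_from_content_type(value: str, default: str = "utf-8") -> str:
    """Hypothetical helper: read the charset parameter of a Content-Type value."""
    msg = Message()
    msg["Content-Type"] = value
    return msg.get_content_charset() or default


# Receiver uses the declared encoding and gets the original character back.
charset = charset_from_content_type("text/plain; charset=utf-8")
text = encoded_sz.decode(charset)
print(text)  # ß
```

The last part is the important one: the bytes alone don’t tell the receiver which encoding was used, so the sender has to declare it.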

Fonts

Even if sender and receiver agree on the encoding, there may be cases where a character can’t be displayed on the receiver’s side. One of the most common causes is that the font in use simply has no glyph for the given character.

Summary

To sum it up, the following conditions must be fulfilled to transport a Unicode character from A to B and display it there:

  • Sender and receiver use the same encoding.
  • The receiver has an active font that is able to display the decoded character.

 

