Character Sets

ASCII Character Set

One of the first standardized character sets (see HTML ASCII Reference (w3schools.com))
Uses 7 bits to encode the English alphabet letters, digits 0 to 9, and some special symbols
Defines 128 codes, from 0 to 127; printable characters have codes from 32 to 126
- Space character has the code 32, letter ‘A’ has code 65,
- ~ has code 126 and delete key (DEL) has code 127
Codes from 0 to 31 are used for control characters, such as code 7, which makes the device beep
Codes 128 to 255 are not part of ASCII; extended character sets assign them in non-standard ways

Programming languages typically use 1 byte to represent characters in the ASCII character set. Since only 7 bits are needed to encode 128 characters, the 8th bit allows for the addition of 128 more characters. These additions, however, are not part of the standard ASCII character set.
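
The codes listed above can be checked with a short Python sketch. It relies only on built-in functions; the choice of "é" and of Latin-1 as a sample 8-bit extension is an assumption made for illustration, not something the notes prescribe.

```python
# Codes quoted in the notes, looked up with the built-in ord().
codes = {ch: ord(ch) for ch in (" ", "A", "~")}
print(codes)

# Every ASCII code is below 128, so the top bit of each byte is 0.
ascii_bytes = "Hello".encode("ascii")
fits_in_7_bits = all(b < 128 for b in ascii_bytes)
print(fits_in_7_bits)

# A character outside ASCII cannot be encoded with the 'ascii' codec.
try:
    "\u00e9".encode("ascii")  # "é", chosen only as an example
    encodable = True
except UnicodeEncodeError:
    encodable = False
print(encodable)

# Latin-1 is one example of an 8-bit set that uses codes 128 to 255.
latin1_code = "\u00e9".encode("latin-1")[0]
print(latin1_code)
```

Other 8-bit sets assign different characters to the same upper codes, which is why those codes are not portable.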

Unicode Character Set

A single character set that aims to include every character used in a writing system on this planet. (See Unicode – The World Standard for Text and Emoji.)

Represents characters with code points
- Format: U+XXXX or U+XXXXXX
  - U+ means Unicode
  - XXXX or XXXXXX is a hexadecimal number
- Examples:
  - English letter A is U+0041
  - Hello is U+0048 U+0065 U+006C U+006C U+006F
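
As a minimal sketch of the U+XXXX notation, the helper below prints the code point of each character in a string. The function name code_points is hypothetical and introduced here only for illustration.

```python
def code_points(text):
    """Return the U+XXXX notation for each character in text."""
    # At least four upper-case hex digits; larger code points use 5 or 6.
    return [f"U+{ord(ch):04X}" for ch in text]


print(code_points("A"))
print(code_points("Hello"))
print(code_points("\U0001F600"))  # an emoji outside the 4-digit range
```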

UTF-8 Encoding

Code points from 0 to 127 are stored in 1 byte
Code points 128 and above are stored using 2, 3, or 4 bytes
Backward compatible with ASCII: text containing only ASCII characters has identical bytes in both encodings
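
The byte counts above can be observed with str.encode(). The sample characters in this sketch are assumptions chosen to cover each length; any characters from the same ranges would behave the same way.

```python
# One sample character for each UTF-8 length (1, 2, 3, and 4 bytes).
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]

utf8_lengths = {
    f"U+{ord(ch):04X}": len(ch.encode("utf-8")) for ch in samples
}
print(utf8_lengths)

# ASCII-only text produces the same bytes under both encodings.
same_bytes = "Hello".encode("utf-8") == "Hello".encode("ascii")
print(same_bytes)
```
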
In HTML, the “Content-Type” meta tag in the head of an HTML file carries information such as: text/html; charset=UTF-8
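Python 3 stores strings as Unicode; for text file I/O with no encoding specified, it typically falls back to the preferred encoding of the platform on which Python runs
- In a Python session, run: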
  - >>> import locale
  - >>> locale.getpreferredencoding()
To find the Unicode code point of a single character, use ord()
The inverse of ord() is chr()
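
A short sketch of ord() and chr() as inverses follows. The sample string is an assumption, and the value printed for the preferred encoding depends on the platform, so no output is shown for it.

```python
import locale

# Platform dependent; often 'UTF-8' on Linux and macOS.
print(locale.getpreferredencoding())

code = ord("A")       # character -> code point
char = chr(0x20AC)    # code point -> character (the euro sign)
print(code, char)

# chr() undoes ord() for every character in the sample string.
round_trip = all(chr(ord(ch)) == ch for ch in "Hello, \u20ac")
print(round_trip)

# ord() accepts exactly one character; a longer string raises TypeError.
try:
    ord("Hi")
    accepts_strings = True
except TypeError:
    accepts_strings = False
print(accepts_strings)
```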

UTF-16 Encoding

Uses a minimum of 2 bytes (16 bits) to encode Unicode characters
Not backward compatible with ASCII (which stores each 7-bit code in 1 byte)
Numerical values for 2-byte Unicode code points range from
- 0x0000 (decimal value 0) to
- 0xFFFF (decimal value 2^16 - 1, which is 65,535)
Code points above 0xFFFF (65,535) are stored using 4 bytes (32 bits), as a pair of 16-bit surrogate code units
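
The 2-byte and 4-byte cases can be sketched as below. The helper name utf16_units is hypothetical; it follows the standard surrogate-pair arithmetic and is compared against Python's built-in "utf-16-be" codec (big-endian, no byte order mark) as a sanity check.

```python
def utf16_units(ch):
    """Return the 16-bit code units UTF-16 uses for one character."""
    cp = ord(ch)
    if cp <= 0xFFFF:
        # Fits in a single 16-bit unit (2 bytes).
        return [cp]
    # Above 0xFFFF: split the 20-bit offset into two 10-bit halves.
    offset = cp - 0x10000
    high = 0xD800 + (offset >> 10)    # high (lead) surrogate
    low = 0xDC00 + (offset & 0x3FF)   # low (trail) surrogate
    return [high, low]


print([hex(u) for u in utf16_units("A")])
print([hex(u) for u in utf16_units("\U0001F600")])

# The built-in codec agrees: 2 bytes for "A", 4 bytes for the emoji.
print("A".encode("utf-16-be"))
print(len("\U0001F600".encode("utf-16-be")))
```

The 2-byte form of "A" begins with a zero byte, which illustrates why UTF-16 text is not readable as ASCII.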

References

Joel Spolsky. 2003. The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!). Joel on Software.

Apil Tamang. 2018. Unicode, UTF-8, and ASCII encodings made easy. Medium.