Character Sets
ASCII Character Set
- One of the first standardized character sets (see HTML ASCII Reference (w3schools.com))
- Uses 7 bits to encode the English alphabet letters, digits 0 to 9, and some special symbols
- Represents printable characters with codes from 32 to 126
- Space character has the code 32, letter ‘A’ has code 65, ~ has code 126, and the delete key (DEL) has code 127
- Codes from 0 to 31 are used for control characters, such as code 7 (BEL), which makes the device beep
- Codes 128 to 255 are not part of ASCII; they are used by various non-standard extensions
Programming languages typically use 1 byte to store a character from the ASCII character set. Since only 7 bits are needed to encode its 128 characters (codes 0 to 127), the 8th bit allows for the addition of 128 more characters. These additions, however, are not part of the standard ASCII character set.
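As a small illustration of the points above, the following Python sketch prints the codes of a few ASCII characters in 7-bit binary and shows that a character outside the ASCII range is rejected by the ASCII codec. The sample characters are chosen only for illustration.

```python
# Printable ASCII runs from 32 (space) to 126 (~); 127 is DEL.
for ch in (" ", "A", "~"):
    code = ord(ch)
    # format(code, "07b") writes the code using exactly 7 binary digits
    print(repr(ch), code, format(code, "07b"))

# Every ASCII character fits in a single byte whose value is below 128.
ascii_bytes = "A~".encode("ascii")
print(list(ascii_bytes))  # [65, 126]

# A character outside the 7-bit range cannot be encoded as ASCII.
try:
    "\u00e9".encode("ascii")  # U+00E9 is the letter e with an acute accent
    encodable = True
except UnicodeEncodeError:
    encodable = False
print("U+00E9 encodable as ASCII:", encodable)
```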
Unicode Character Set
A single character set that aims to include every character used in any writing system on this planet. (See Unicode – The World Standard for Text and Emoji.)
- Represents characters with code points
- Format: U+XXXX or U+XXXXXX
  - U+ means Unicode
  - XXXX or XXXXXX is a hexadecimal number
- Examples:
  - English letter A is U+0041
  - Hello is U+0048 U+0065 U+006C U+006C U+006F
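The U+ notation can be reproduced in Python with a short helper. This is a minimal sketch: the function name `code_points` is made up for this example and is not part of any library.

```python
def code_points(text):
    """Return the U+XXXX notation for each character in text."""
    # :04X pads to at least four uppercase hex digits; larger values use more.
    return [f"U+{ord(ch):04X}" for ch in text]

print(code_points("A"))           # ['U+0041']
print(code_points("Hello"))       # ['U+0048', 'U+0065', 'U+006C', 'U+006C', 'U+006F']
print(code_points("\U0001F600"))  # ['U+1F600'], a code point that needs five hex digits
```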
UTF-8 Encoding
- Code points from 0 to 127 are stored in 1 byte
- Code points from 128 upward are stored using 2, 3, or 4 bytes
- Backward compatible with ASCII: text containing only ASCII characters is encoded with the same bytes in UTF-8
- In HTML, the “Content-Type” meta tag in the head of an HTML file carries information such as: text/html; charset=UTF-8
- Python 3 strings are Unicode; when a text file is opened without an explicit encoding, Python by default uses the preferred encoding of the platform on which it runs
  - In a Python session run:
    >>> import locale
    >>> locale.getpreferredencoding()
- To find the Unicode code point of a character, use ord()
- The inverse of ord() is chr()
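The byte counts described above can be checked directly in Python. The sketch below uses four sample characters, chosen only for illustration, that need 1, 2, 3, and 4 bytes in UTF-8, and then confirms the ASCII compatibility and the ord()/chr() relationship.

```python
# U+0041 (A), U+00E9 (e with acute), U+20AC (euro sign), U+1F600 (an emoji)
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]

for ch in samples:
    encoded = ch.encode("utf-8")
    print(f"U+{ord(ch):04X} -> {len(encoded)} byte(s): {encoded.hex()}")

# ASCII-only text produces identical bytes under ASCII and UTF-8.
same_bytes = "Hello".encode("utf-8") == "Hello".encode("ascii")
print(same_bytes)  # True

# ord() and chr() are inverses for a single character.
print(chr(ord("A")))            # A
print(hex(ord(chr(0x20AC))))    # 0x20ac
```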
UTF-16 Encoding
- Uses a minimum of 2 bytes (16 bits) to encode Unicode characters
- Not backward compatible with ASCII (which stores each 7-bit code in 1 byte)
- Numerical values for 2-byte Unicode code points range from
  - 0x0000 (decimal value 0) to
  - 0xFFFF (decimal value 2^16 - 1, which is 65,535)
- Code points above 65,535 are stored using 4 bytes (32 bits), as a pair of 16-bit units called a surrogate pair
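To see the 2-byte and 4-byte cases side by side, the following sketch splits a character's UTF-16 encoding into its 16-bit units. The helper name `utf16_units` is hypothetical, and big-endian byte order is chosen here only so that no byte order mark is added.

```python
import struct

def utf16_units(ch):
    """Return the 16-bit code units of ch in big-endian UTF-16."""
    data = ch.encode("utf-16-be")  # explicit byte order, so no BOM is added
    # Unpack the bytes as a sequence of unsigned 16-bit big-endian integers.
    return list(struct.unpack(f">{len(data) // 2}H", data))

for ch in ("A", "\u20ac", "\U0001F600"):
    units = utf16_units(ch)
    print(f"U+{ord(ch):04X}: {len(units) * 2} bytes, units {[hex(u) for u in units]}")

# "A" is 1 byte in ASCII but 2 bytes in UTF-16, so the byte sequences differ.
print("A".encode("ascii"), "A".encode("utf-16-be"))
```

The last character lies above U+FFFF, so it comes back as two units instead of one.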
References
Joel Spolsky. 2003. The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) – Joel on Software.
Apil Tamang. 2018. Unicode, UTF-8, and ASCII encodings made easy. Medium.