Skip to main content

Why do we need Unicode?



In the early days, all that existed was ASCII. This was okay, as all that would ever be needed were a few control characters, punctuation, numbers and letters like the ones in this sentence. Unfortunately, today's strange world of global intercommunication and social media was not foreseen, and it is not too unusual to see English in the same document 

But for argument's sake, lets say Joe Average is a software developer. He insists that he will only ever need English, and as such only wants to use ASCII. This might be fine for Joe the user, but this is not fine for Joe the software developer. Approximately half the world uses non-Latin characters and using ASCII is arguably inconsiderate to these people, and on top of that, he is closing off his software to a large and growing economy.

Therefore, an encompassing character set including all languages is needed. Thus came Unicode. It assigns every character a unique number called a code point. One advantage of Unicode over other possible sets is that the first 256 code points are identical to ISO-8859-1, and hence also ASCII. In addition, the vast majority of commonly used characters are representable by only two bytes, in a region called the Basic Multilingual Plane (BMP). Now a character encoding is needed to access this character set, and as the question asks, I will concentrate on UTF-8 and UTF-16.

Memory considerations

So how many bytes give access to what characters in these encodings?
  • UTF-8:
    • 1 byte: Standard ASCII
    • 2 bytes: Arabic, Hebrew, most European scripts (most notably excluding Georgian)
    • 3 bytes: BMP
    • 4 bytes: All Unicode characters
  • UTF-16:
    • 2 bytes: BMP
    • 4 bytes: All Unicode characters
It's worth mentioning now that characters not in the BMP include ancient scripts, mathematical symbols, musical symbols, and rarer Chinese/Japanese/Korean (CJK) characters.
If you'll be working mostly with ASCII characters, then UTF-8 is certainly more memory efficient. However, if you're working mostly with non-European scripts, using UTF-8 could be up to 1.5 times less memory efficient than UTF-16. When dealing with large amounts of text, such as large web-pages or lengthy word documents, this could impact performance.

Encoding basics

Note: If you know how UTF-8 and UTF-16 are encoded, skip to the next section for practical applications.
  • UTF-8: For the standard ASCII (0-127) characters, the UTF-8 codes are identical. This makes UTF-8 ideal if backwards compatibility is required with existing ASCII text. Other characters require anywhere from 2-4 bytes. This is done by reserving some bits in each of these bytes to indicate that it is part of a multi-byte character. In particular, the first bit of each byte is 1 to avoid clashing with the ASCII characters.
  • UTF-16: For valid BMP characters, the UTF-16 representation is simply its code point. However, for non-BMP characters UTF-16 introduces surrogate pairs. In this case a combination of two two-byte portions map to a non-BMP character. These two-byte portions come from the BMP numeric range, but are guaranteed by the Unicode standard to be invalid as BMP characters. In addition, since UTF-16 has two bytes as its basic unit, it is affected by endianness. To compensate, a reserved byte order mark can be placed at the beginning of a data stream which indicates endianness. Thus, if you are reading UTF-16 input, and no endianness is specified, you must check for this.
As can be seen, UTF-8 and UTF-16 are nowhere near compatible with each other. So if you're doing I/O, make sure you know which encoding you are using! For further details on these encodings, please see the UTF FAQ.

Comments

Popular posts from this blog

Normalization in DBMS: 1NF, 2NF, 3NF and BCNF in Database

Normalization is a process of organizing the data in database to avoid data redundancy, insertion anomaly, update anomaly & deletion anomaly. Let’s discuss about anomalies first then we will discuss normal forms with examples. Anomalies in DBMS There are three types of anomalies that occur when the database is not normalized. These are – Insertion, update and deletion anomaly. Let’s take an example to understand this. Example : Suppose a manufacturing company stores the employee details in a table named employee that has four attributes: emp_id for storing employee’s id, emp_name for storing employee’s name, emp_address for storing employee’s address and emp_dept for storing the department details in which the employee works. At some point of time the table looks like this: emp_id emp_name emp_address emp_dept 101 Rick Delhi D001 101 Rick Delhi D002 123 Maggie Agra D890 166 Glenn Chennai D900 166 Glenn Chennai D004 The above table ...

Downloading Cassandra in Ubuntu 16.04

Installation from Debian packages For the <release series> specify the major version number, without dot, and with an appended x . The latest <release series> is 311x . For older releases, the <release series> can be one of 30x , 22x , or 21x . Add the Apache repository of Cassandra to /etc/apt/sources.list.d/cassandra.sources.list , for example for the latest 3.11 version: echo "deb http://www.apache.org/dist/cassandra/debian 311x main" | sudo tee -a /etc/apt/sources.list.d/cassandra.sources.list Add the Apache Cassandra repository keys: curl https://www.apache.org/dist/cassandra/KEYS | sudo apt-key add - Update the repositories: sudo apt-get update If you encounter this error: GPG error: http://www.apache.org 311x InRelease: The following signatures couldn't be verified because the public key is not available: NO_PUBKEY A278B781FE4B2BDA Then add the public key A278B781FE4B2BDA as follows: sudo apt-key adv --keyser...

How to Install and Configure MongoDB on Ubuntu 16.04

What we will do in this tutorial: Install MongoDB Configure MongoDB Conclusion What we will do in this tutorial: Install MongoDB Configure MongoDB Conclusion Install MongoDB on Ubuntu 16.04 Step 1 - Importing the Public Key GPG keys of the software distributor are required by the Ubuntu package manager apt (Advanced Package Tool) to ensure package consistency and authenticity. Run this command to import MongoDB keys to your server. sudo apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv EA312927 Step 2 - Create source list file MongoDB Create a MongoDB list file in /etc/apt/sources.list.d/  with this command: echo "deb http://repo.mongodb.org/apt/ubuntu "$(lsb_release -sc)"/mongodb-org/3.2 multiverse" | sudo tee /etc/apt/sources.list.d/mongodb-org-3.2.list Step 3 - Update the repository update the repository with the apt command: sudo apt-get update Step 4 - Install MongoDB Now you can install MongoDB by typing this com...