
Why do we need Unicode?



In the early days, all that existed was ASCII. This was okay, as all that would ever be needed were a few control characters, punctuation, numbers and letters like the ones in this sentence. Unfortunately, today's strange world of global intercommunication and social media was not foreseen, and it is not too unusual to see English alongside several other scripts in the same document.

But for argument's sake, let's say Joe Average is a software developer. He insists that he will only ever need English, and as such only wants to use ASCII. This might be fine for Joe the user, but it is not fine for Joe the software developer. Approximately half the world uses non-Latin characters, so using ASCII is arguably inconsiderate to these people; on top of that, he is closing off his software to a large and growing economy.

Therefore, an encompassing character set including all languages is needed. Thus came Unicode. It assigns every character a unique number called a code point. One advantage of Unicode over other possible sets is that the first 256 code points are identical to ISO-8859-1, and hence also ASCII. In addition, the vast majority of commonly used characters have code points that fit in two bytes, in a region called the Basic Multilingual Plane (BMP). Now a character encoding is needed to access this character set; I will concentrate on the two most common ones, UTF-8 and UTF-16.
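
To make the idea of a code point concrete, here is a minimal sketch in Python (my choice for illustration; the post itself names no language). The built-in ord() returns a character's code point, and anything at or below U+FFFF lives in the BMP:

    # Inspect code points with Python's built-in ord(); chr() is the inverse.
    for ch in ["A", "é", "中", "🎉"]:
        cp = ord(ch)
        plane = "BMP" if cp <= 0xFFFF else "supplementary plane"
        print(f"{ch!r}: U+{cp:04X} ({plane})")

    # 'A': U+0041 (BMP)
    # 'é': U+00E9 (BMP)
    # '中': U+4E2D (BMP)
    # '🎉': U+1F389 (supplementary plane)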

Memory considerations

So how many bytes give access to what characters in these encodings?
  • UTF-8:
    • 1 byte: Standard ASCII
    • 2 bytes: Arabic, Hebrew, most European scripts (most notably excluding Georgian)
    • 3 bytes: BMP
    • 4 bytes: All Unicode characters
  • UTF-16:
    • 2 bytes: BMP
    • 4 bytes: All Unicode characters
It's worth mentioning now that characters not in the BMP include ancient scripts, mathematical symbols, musical symbols, and rarer Chinese/Japanese/Korean (CJK) characters.
If you'll be working mostly with ASCII characters, then UTF-8 is certainly more memory efficient. However, if you're working mostly with non-European scripts, UTF-8 can require up to 1.5 times the memory of UTF-16. When dealing with large amounts of text, such as large web pages or lengthy Word documents, this can impact performance.
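
The difference is easy to measure. A minimal sketch, again in Python; 'utf-16-le' is used so the two-byte byte order mark is not counted:

    samples = {"English": "Hello, world", "Russian": "Привет, мир", "Chinese": "你好，世界"}
    for name, text in samples.items():
        u8 = len(text.encode("utf-8"))
        u16 = len(text.encode("utf-16-le"))   # -le: no byte order mark in the count
        print(f"{name}: {len(text)} chars -> UTF-8 {u8} bytes, UTF-16 {u16} bytes")

    # English: 12 chars -> UTF-8 12 bytes, UTF-16 24 bytes
    # Russian: 11 chars -> UTF-8 20 bytes, UTF-16 22 bytes
    # Chinese: 5 chars  -> UTF-8 15 bytes, UTF-16 10 bytes  (the 1.5x case)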

Encoding basics

Note: If you know how UTF-8 and UTF-16 are encoded, skip to the next section for practical applications.
  • UTF-8: For the standard ASCII (0-127) characters, the UTF-8 codes are identical. This makes UTF-8 ideal if backwards compatibility is required with existing ASCII text. Other characters require anywhere from 2-4 bytes. This is done by reserving some bits in each of these bytes to indicate that it is part of a multi-byte character; in particular, the most significant bit of each byte is 1 to avoid clashing with the ASCII characters.
  • UTF-16: For valid BMP characters, the UTF-16 representation is simply its code point. However, for non-BMP characters UTF-16 introduces surrogate pairs: a combination of two two-byte units maps to a non-BMP character. These two-byte units come from the BMP numeric range, but are guaranteed by the Unicode standard to be invalid as BMP characters. In addition, since UTF-16 has two bytes as its basic unit, it is affected by endianness. To compensate, a reserved byte order mark (BOM) can be placed at the beginning of a data stream to indicate endianness. Thus, if you are reading UTF-16 input and no endianness is specified, you must check for this. (The sketch after this list dumps both encodings byte by byte.)
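
To see these layouts in practice, here is a sketch that dumps the raw bytes of a BMP character and a non-BMP one in both encodings (Python 3.8+ for hex()'s separator argument; 'utf-16-be' writes big-endian with no BOM):

    # € is U+20AC (BMP); 𝄞 is U+1D11E (a musical symbol outside the BMP).
    for ch in ["€", "𝄞"]:
        print(f"U+{ord(ch):04X}")
        print("  UTF-8 :", ch.encode("utf-8").hex(" "))      # multi-byte sequence
        print("  UTF-16:", ch.encode("utf-16-be").hex(" "))  # big-endian, no BOM

    # U+20AC   UTF-8 : e2 82 ac      UTF-16: 20 ac
    # U+1D11E  UTF-8 : f0 9d 84 9e   UTF-16: d8 34 dd 1e  (a surrogate pair)
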
As can be seen, UTF-8 and UTF-16 are nowhere near compatible with each other. So if you're doing I/O, make sure you know which encoding you are using! For further details on these encodings, please see the UTF FAQ.
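
Checking for that byte order mark is straightforward. A minimal sketch, assuming the first bytes of the stream are available up front:

    import codecs

    def utf16_variant(data: bytes) -> str:
        # The BOM, if present, occupies the first two bytes of the stream.
        if data.startswith(codecs.BOM_UTF16_LE):   # b'\xff\xfe'
            return "utf-16-le"
        if data.startswith(codecs.BOM_UTF16_BE):   # b'\xfe\xff'
            return "utf-16-be"
        # No BOM: the endianness must be known out of band.
        return "unknown"

    raw = "hello".encode("utf-16")   # Python's 'utf-16' codec prepends a native-order BOM
    print(utf16_variant(raw))        # 'utf-16-le' on typical little-endian machines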
