UTF-8 - character encoding

Unicode covers almost all existing character sets, and UTF-8 is generally the best encoding for Unicode text. It provides compatibility with ASCII, resistance to data corruption, efficiency and ease of processing. But first things first.

Encoding forms

Computers handle numbers not as abstract mathematical objects but as combinations of fixed-size units of storage and processing, such as bytes and 32-bit words. An encoding standard must take this into account when it defines how characters are represented by numbers.

In computer systems, integers are stored in memory cells of 8 bits (1 byte), 16 bits or 32 bits. Each Unicode encoding form determines which sequence of such cells (code units) represents the integer corresponding to a particular character. The standard provides three encoding forms for Unicode characters, based on 8-, 16- and 32-bit code units and called UTF-8, UTF-16 and UTF-32 respectively. UTF stands for Unicode Transformation Format. All three forms are equally valid ways of representing Unicode characters, and each has advantages in different applications.

Each of these encodings can represent every character of the Unicode standard. They are therefore fully interoperable, even when different solutions use different encoding forms for different reasons. Text in any one of them can be converted to either of the other two unambiguously and without loss of data.

Principle of non-overlap

Each Unicode encoding form is designed so that partial overlap of character codes is impossible. By contrast, Windows code page 932 builds characters from one or two bytes. The length of a sequence depends on its first byte, so the lead byte of a two-byte sequence never has the same value as a single-byte character. However, a single byte and the trail byte of a two-byte sequence may have the same value. This means, for example, that a search for the character "D" (hexadecimal code 44) can falsely match the second byte of the two-byte sequence for the Cyrillic character "Д" (code 84 44). To determine which interpretation is correct, the program must examine the preceding bytes.

The situation becomes more complicated if lead and trail byte values can also coincide. In that case, resolving the ambiguity requires scanning backward until the beginning of the text or an unambiguous code sequence is reached. This is not only inefficient but also fragile, because a single bad byte can be enough to make the rest of the text unreadable.

The Unicode encoding forms avoid this problem because the value ranges of lead, trail and single code units do not overlap. As a result, all Unicode encodings are suitable for searching and comparison and never produce a false match caused by different parts of character codes coinciding. Compliance with the principle of non-overlap distinguishes these encoding forms from legacy multi-byte East Asian encodings.
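The false match described above can be reproduced in a few lines. This is only a sketch: it assumes Python's built-in cp932 codec as a stand-in for Windows code page 932, and the variable names are illustrative.

```python
# Search for the byte of ASCII "D" (0x44) inside the encoded form of Cyrillic "Д".
needle = "D".encode("ascii")          # the single byte 0x44

legacy = "\u0414".encode("cp932")     # "Д" in code page 932: bytes 84 44
utf8 = "\u0414".encode("utf-8")       # "Д" in UTF-8: bytes D0 94

# In the legacy encoding the trail byte 0x44 is mistaken for "D".
false_match_in_legacy = needle in legacy
# In UTF-8 no byte of a multi-byte sequence is below 0x80, so no match occurs.
false_match_in_utf8 = needle in utf8

print(legacy.hex(), false_match_in_legacy)
print(utf8.hex(), false_match_in_utf8)
```

A byte-level search is therefore safe on UTF-8 data, while on the legacy encoding every hit would have to be checked against the preceding bytes.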

Another aspect of non-overlap is that every character has clearly defined boundaries, so there is no need to scan an indeterminate number of preceding characters. This property is sometimes called self-synchronization. Corruption of one code unit damages only one character, and the surrounding characters remain intact. In UTF-8, if a pointer refers to a byte of the form 10xxxxxx (in binary), at most three backward steps are needed to find the beginning of the character.
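The backward scan can be sketched as follows. The helper name char_start is hypothetical, and the sketch assumes the input is valid UTF-8.

```python
def char_start(data: bytes, index: int) -> int:
    """Return the index of the first byte of the character containing data[index]."""
    steps = 0
    # Continuation bytes have the bit pattern 10xxxxxx; step back over them.
    # A valid character has at most three continuation bytes.
    while index > 0 and (data[index] & 0xC0) == 0x80 and steps < 3:
        index -= 1
        steps += 1
    return index


# One-byte "a", three-byte euro sign, four-byte musical symbol G clef.
sample = "a\u20ac\U0001d11e".encode("utf-8")

print(char_start(sample, 3))  # last byte of the euro sign
print(char_start(sample, 7))  # last byte of the G clef
```

Because the loop stops at the first byte that is not of the form 10xxxxxx, it never needs to look at earlier characters.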

Consistency

The Unicode Consortium fully supports all three encoding forms. UTF-8 and Unicode should not be set against each other, because all of the transformation formats are equally legitimate implementations of Unicode character encoding.

Byte-orientation

In UTF-32, a character is represented by one 32-bit code unit whose value equals its Unicode code point. UTF-16 uses one or two 16-bit code units, and UTF-8 uses one to four bytes.
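To make these sizes concrete, the following sketch counts the code units each encoding form needs for a few sample characters. The function name code_units is illustrative; the little-endian codec variants are used only so that Python does not prepend a byte order mark.

```python
def code_units(ch: str) -> dict:
    """Count the code units needed for one character in each encoding form."""
    return {
        "utf-8": len(ch.encode("utf-8")),             # 8-bit units (bytes)
        "utf-16": len(ch.encode("utf-16-le")) // 2,   # 16-bit units
        "utf-32": len(ch.encode("utf-32-le")) // 4,   # 32-bit units
    }


# ASCII letter, accented Latin letter, Chinese character, emoji.
for ch in ["A", "\u00e9", "\u4e2d", "\U0001f600"]:
    print(f"U+{ord(ch):04X}", code_units(ch))
```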

UTF-8 is designed for compatibility with byte-oriented, ASCII-based systems. Most existing software and information technology practice has long relied on representing characters as sequences of bytes. Many protocols depend on the ASCII encoding remaining unchanged and either use or avoid special control characters. A simple way to adapt Unicode to such situations is to use an 8-bit encoding in which every ASCII character, including control characters, keeps its ASCII byte value. This is what UTF-8 is intended for.

Variable length

UTF-8 is a variable-length encoding made up of 8-bit code units whose high-order bits indicate which part of a sequence each byte belongs to. One range of values is reserved for the first byte of a code sequence and another for the subsequent bytes. This guarantees that the encoding is non-overlapping.
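Printing the bytes of a character in binary makes the role of the high-order bits visible. This is a small illustrative sketch; bit_patterns is a hypothetical helper name.

```python
def bit_patterns(ch: str) -> list:
    """Return the UTF-8 bytes of one character as 8-digit binary strings."""
    return [format(byte, "08b") for byte in ch.encode("utf-8")]


# First bytes start with 0, 110, 1110 or 11110; all following bytes start with 10.
print(bit_patterns("A"))        # one byte
print(bit_patterns("\u00e9"))   # two bytes
print(bit_patterns("\u20ac"))   # three bytes
```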

ASCII

UTF-8 fully preserves ASCII codes (0x00-0x7F). Unicode characters U+0000-U+007F are converted to the single bytes 0x00-0x7F and are thus indistinguishable from ASCII. Moreover, to avoid ambiguity, the values 0x00-0x7F never occur in any byte of a multi-byte sequence. Non-ASCII characters in the range U+0080-U+07FF, which includes alphabets such as extended Latin, Greek, Cyrillic, Hebrew and Arabic, are encoded as two bytes. Characters in the range U+0800-U+FFFF take three bytes, and supplementary characters with codes above U+FFFF take four bytes.
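The ranges above translate directly into a small encoder. The following is a minimal sketch for illustration, not a replacement for a library encoder; the function name utf8_encode is hypothetical.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one Unicode code point as UTF-8 bytes."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    if cp <= 0x7F:                       # ASCII: one byte, 0xxxxxxx
        return bytes([cp])
    if cp <= 0x7FF:                      # two bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:                     # three bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),     # four bytes: 11110xxx and three 10xxxxxx
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


# Compare the sketch with Python's built-in encoder on boundary values.
for cp in (0x41, 0x7F, 0x80, 0x7FF, 0x800, 0xFFFF, 0x10000, 0x10FFFF):
    print(f"U+{cp:04X}", utf8_encode(cp).hex())
```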

Application area

UTF-8 is usually the preferred encoding for HTML and similar formats.

XML was one of the first standards with full UTF-8 support, and standards organizations recommend the encoding as well. The problem of non-ASCII characters in URLs was resolved when the W3C and the IETF agreed that all URLs should be encoded exclusively in UTF-8.
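In practice, a non-ASCII character in a URL is first encoded as UTF-8, and each resulting byte is then written as a percent escape. The sketch below uses the standard urllib.parse module; the path is a made-up example.

```python
from urllib.parse import quote, unquote

path = "/wiki/\u00dcbersicht"   # hypothetical path containing the letter U with umlaut

encoded = quote(path)           # UTF-8 bytes C3 9C become %C3%9C; "/" is left as is
decoded = unquote(encoded)      # percent escapes are decoded back as UTF-8

print(encoded)
print(decoded == path)
```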

Compatibility with ASCII eases the transition to new software. Most text editors work with UTF-8, including jEdit, Emacs, BBEdit, Eclipse and Windows Notepad. No other Unicode encoding form enjoys such broad tool support.

Another advantage of the encoding is that it is a plain sequence of bytes, so UTF-8 strings are easy to work with in C and other programming languages. It is also the only encoding form that requires neither a byte order mark (BOM) nor an encoding declaration in XML.

Self-synchronization

In an environment using 8-bit character processing, compared to other multi-byte encodings, UTF-8 has the following advantages:

  • The first byte of a code sequence indicates its length, which makes forward scanning more efficient.
  • The beginning of a character is easy to find, because lead bytes are restricted to a fixed range of values.
  • Byte values with different roles do not overlap.
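The first point can be illustrated with a forward scan that reads each character's length from its first byte. This is a sketch that assumes well-formed input; sequence_length and split_characters are hypothetical names.

```python
def sequence_length(lead: int) -> int:
    """Return the length of a UTF-8 sequence, judged from its first byte."""
    if lead < 0x80:            # 0xxxxxxx
        return 1
    if lead >> 5 == 0b110:     # 110xxxxx
        return 2
    if lead >> 4 == 0b1110:    # 1110xxxx
        return 3
    if lead >> 3 == 0b11110:   # 11110xxx
        return 4
    raise ValueError("not a lead byte")


def split_characters(data: bytes) -> list:
    """Split UTF-8 bytes into per-character chunks without decoding them."""
    chunks = []
    i = 0
    while i < len(data):
        n = sequence_length(data[i])
        chunks.append(data[i:i + n])
        i += n                 # jump straight to the next lead byte
    return chunks


print(split_characters("a\u20ac\U0001d11e".encode("utf-8")))
```

The scanner never inspects continuation bytes; a continuation byte in lead position is rejected by sequence_length.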

Comparison of advantages

UTF-8 is compact, but East Asian characters (Chinese, Japanese and Korean text written with Chinese characters) take three bytes each. UTF-8 can also be slower to process than the other encoding forms. On the other hand, binary sorting of UTF-8 strings produces the same order as sorting by Unicode code point.
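The sorting property can be checked with a short experiment. The sketch relies on the fact that Python compares str values by code point; the sample strings are arbitrary, and UTF-16 is included only for contrast.

```python
# "z", e with acute, fullwidth tilde (U+FF5E) and a supplementary character (U+10000).
words = ["\U00010000", "\uff5e", "z", "\u00e9"]

by_code_point = sorted(words)                                   # str compares by code point
by_utf8 = sorted(words, key=lambda s: s.encode("utf-8"))        # plain byte comparison
by_utf16 = sorted(words, key=lambda s: s.encode("utf-16-be"))   # code-unit order

print(by_utf8 == by_code_point)
print(by_utf16 == by_code_point)
```

The UTF-16 order differs here because U+10000 is stored as a surrogate pair beginning with D800, which sorts before FF5E.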

Character encoding scheme

A character encoding scheme consists of a character encoding form plus a rule for serializing its code units into bytes. To identify the encoding scheme, the Unicode standard provides for an initial byte order mark (BOM).

When a BOM is included in UTF-8 text, its only function is to signal that this encoding form is used. The question of byte order does not arise in UTF-8, since its code unit is a single byte. Use of a BOM with this encoding form is neither required nor recommended. A BOM may nevertheless appear in text converted from other encodings that use a byte order mark, or serve as a UTF-8 signature. It is the three-byte sequence EF BB BF (hexadecimal).
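Detecting and removing the signature takes only a few lines. The sketch below uses the BOM_UTF8 constant from Python's standard codecs module; the function name strip_utf8_bom is illustrative.

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 signature (EF BB BF), if present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


with_bom = codecs.BOM_UTF8 + "hello".encode("utf-8")

print(strip_utf8_bom(with_bom))
# Python's "utf-8-sig" codec skips the signature while decoding.
print(with_bom.decode("utf-8-sig"))
```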

How to set UTF-8 encoding

In HTML, UTF-8 encoding is set using the following code:

<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">

In PHP, the UTF-8 encoding is specified with the header() function at the very beginning of the file, after the error reporting level has been set:

<?php
error_reporting(-1);
header('Content-Type: text/html; charset=utf-8');

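For connections to MySQL databases made with the legacy mysql extension (removed in PHP 7, where mysqli_set_charset() takes its place), the UTF-8 encoding is set as follows:

<?php
mysql_set_charset('utf8');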

In CSS files, UTF-8 character encoding is specified as follows:

@charset "utf-8";

Files of all types should be saved as UTF-8 without a BOM; otherwise the site may not work correctly, because in PHP the BOM bytes are sent to the browser before any header() call. To do this in Dreamweaver, choose the menu item "Modify - Page Properties - Title/Encoding" and change the encoding to UTF-8. Then reload the page, clear the "Include Unicode Signature (BOM)" check box and apply the changes. If any text on the page or in the database was entered in a different encoding, it must be re-entered or re-encoded. When working with regular expressions, the u modifier must always be used.

A file can also be saved as UTF-8 in Windows Notepad: choose "File - Save As ...", select UTF-8 as the encoding and save the file.

In the Notepad++ text editor, if the file's encoding is not UTF-8, convert it with the menu item "Convert to UTF-8 without BOM" and save the file.
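Outside an editor, the same conversion can be scripted. The following is a minimal sketch that assumes the source encoding is already known (cp1251 is used purely as an example); the function name to_utf8_without_bom is hypothetical.

```python
def to_utf8_without_bom(data: bytes, source_encoding: str) -> bytes:
    """Re-encode bytes from a known source encoding to UTF-8 without a BOM."""
    text = data.decode(source_encoding)
    # Drop a leading byte order mark (U+FEFF) if the source text carried one.
    if text.startswith("\ufeff"):
        text = text[1:]
    return text.encode("utf-8")


# Example: Cyrillic text stored in Windows-1251 (an assumed source encoding).
legacy_bytes = "\u041f\u0440\u0438\u0432\u0435\u0442".encode("cp1251")
converted = to_utf8_without_bom(legacy_bytes, "cp1251")

print(len(legacy_bytes), len(converted))
```

If the source encoding is guessed wrongly, decoding either fails or silently produces the wrong characters, so it should be confirmed before a file is overwritten.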

There is no alternative

In an age of globalization, when political and linguistic boundaries are blurring, character sets tied to a particular locale become less useful. Unicode is the only character set that supports all localizations. And UTF-8 is an example of a sound implementation of Unicode, one that:

  • Is supported by a wide range of tools and is compatible with ASCII;
  • Is resistant to data corruption;
  • Is simple and efficient to process;
  • Does not depend on the platform.

With the advent of UTF-8, discussions about which form of encoding or character set is better have become meaningless.

Copyright © 2018 en.birmiss.com. Theme powered by WordPress.