The Computer Oracle

Why does Unicode have big or little endian but UTF-8 doesn't?


Music by Eric Matyas
https://www.soundimage.org
Track title: Forest of Spells Looping

--

Chapters
00:00 Why does Unicode have big or little endian but UTF-8 doesn't?
00:24 Accepted Answer Score 36
01:51 Answer 2 Score 27
02:50 Answer 3 Score 1
06:42 Answer 4 Score 1
08:11 Thank you

--

Full question
https://superuser.com/questions/1648800/...

--

Content licensed under CC BY-SA
https://meta.stackexchange.com/help/lice...

--

Tags
#unicode #characterencoding #utf8 #endian

#avk47



ACCEPTED ANSWER

Score 36


Note: Windows uses the term "Unicode" for UCS-2 for historical reasons – originally that was the only way to encode Unicode codepoints into bytes, so the distinction didn't matter. In modern terminology, both examples are Unicode, but the first is specifically UCS-2 or UTF-16 and the second is UTF-8.

UCS-2 had big-endian and little-endian variants because it directly represented the codepoint as a 16-bit 'uint16_t' or 'short int' number, like in C and other programming languages. It's not so much an 'encoding' as a direct memory representation of the numeric values, and just as a uint16_t can be either BE or LE on different machines, so can UCS-2. The later UTF-16 simply inherited the same mess for compatibility.

(It probably could have been defined for a specific endianness, but I guess they felt it was out of scope or had to compromise between people representing different hardware manufacturers or something. I don't know the actual history.)
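
To make the memory-representation point concrete, here is a small C sketch (not from the original answer) that stores the code point U+54C8 in a plain uint16_t, just as UCS-2 effectively does, and prints the bytes the machine actually keeps in memory; the output depends on the host's endianness:

    #include <stdio.h>
    #include <stdint.h>

    int main(void) {
        /* The code point U+54C8 held as a bare 16-bit integer, UCS-2 style. */
        uint16_t codepoint = 0x54C8;
        unsigned char *bytes = (unsigned char *)&codepoint;

        printf("in-memory bytes: %02X %02X\n", bytes[0], bytes[1]);
        if (bytes[0] == 0xC8)
            printf("this machine stores it little-endian (C8 54)\n");
        else
            printf("this machine stores it big-endian (54 C8)\n");
        return 0;
    }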

Meanwhile, UTF-8 is a variable-length encoding, which can use anywhere from 1 to 6 bytes to represent a 31-bit value (as originally designed; the current standard limits it to 4 bytes and code points up to U+10FFFF). The byte representation has no relationship to the CPU architecture at all; instead there is a specific algorithm to encode a number into bytes, and vice versa. The algorithm always outputs or consumes bits in the same order no matter what CPU it is running on.
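
As an illustration of such an algorithm (a sketch, not the answer's own code; the helper name utf8_encode is invented), the following C function builds the UTF-8 bytes for a code point purely with shifts and masks, so it produces the same byte sequence on a big-endian and a little-endian CPU:

    #include <stdio.h>
    #include <stdint.h>
    #include <stddef.h>

    /* Encode one code point (up to U+10FFFF) into UTF-8; returns the byte count. */
    static size_t utf8_encode(uint32_t cp, unsigned char out[4]) {
        if (cp < 0x80) {                       /* 1 byte: 0xxxxxxx */
            out[0] = (unsigned char)cp;
            return 1;
        } else if (cp < 0x800) {               /* 2 bytes: 110xxxxx 10xxxxxx */
            out[0] = (unsigned char)(0xC0 | (cp >> 6));
            out[1] = (unsigned char)(0x80 | (cp & 0x3F));
            return 2;
        } else if (cp < 0x10000) {             /* 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx */
            out[0] = (unsigned char)(0xE0 | (cp >> 12));
            out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
            out[2] = (unsigned char)(0x80 | (cp & 0x3F));
            return 3;
        } else {                               /* 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
            out[0] = (unsigned char)(0xF0 | (cp >> 18));
            out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
            out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
            out[3] = (unsigned char)(0x80 | (cp & 0x3F));
            return 4;
        }
    }

    int main(void) {
        unsigned char buf[4];
        size_t n = utf8_encode(0x54C8, buf);   /* U+54C8, the 哈 example used later */
        for (size_t i = 0; i < n; i++)
            printf("%02X ", buf[i]);           /* prints E5 93 88 on any architecture */
        printf("\n");
        return 0;
    }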




ANSWER 2

Score 27


Exactly the same reason why an array of bytes (char[] in C or byte[] in many other languages) doesn't have any associated endianness but arrays of other types larger than a byte do. Endianness is the way you store a value that is represented by multiple bytes into memory. If you have just a single byte, there is only one way to store it in memory. But if an int is made up of 4 bytes with indices 1 to 4, then you can store them in many different orders, like [1, 2, 3, 4], [4, 3, 2, 1], [2, 1, 4, 3], [3, 1, 2, 4]... which are little endian, big endian, mixed endian...
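
To see this in C (an illustrative sketch, not part of the answer), compare a byte array, which has exactly one possible layout, with a 4-byte integer, whose in-memory layout depends on the machine:

    #include <stdio.h>
    #include <stdint.h>

    int main(void) {
        unsigned char arr[4] = { 0x01, 0x02, 0x03, 0x04 };  /* bytes: only one layout possible */
        uint32_t value = 0x01020304;                        /* one value, several possible layouts */
        unsigned char *p = (unsigned char *)&value;

        printf("byte array in memory: %02X %02X %02X %02X\n", arr[0], arr[1], arr[2], arr[3]);
        printf("uint32_t in memory:   %02X %02X %02X %02X\n", p[0], p[1], p[2], p[3]);
        /* Big-endian hosts print 01 02 03 04, little-endian hosts print 04 03 02 01. */
        return 0;
    }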

Unicode has many different encodings, called Unicode Transformation Formats, with the major ones being UTF-8, UTF-16 and UTF-32. UTF-16 and UTF-32 work on units of 16 and 32 bits respectively, and obviously when you store 2 or 4 bytes into byte-addressed memory you must define an order in which to read/write those bytes. UTF-8, OTOH, works on byte units, hence there's no endianness in it.
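
Here is a rough sketch of that "define an order" step (the function name utf32_serialize is invented for illustration): serializing one 32-bit code unit means choosing whether the most or the least significant byte goes first, which is exactly where the BE and LE schemes diverge:

    #include <stdio.h>
    #include <stdint.h>

    /* Write one 32-bit code unit (UTF-32) into 4 bytes, big- or little-endian. */
    static void utf32_serialize(uint32_t unit, int big_endian, unsigned char out[4]) {
        for (int i = 0; i < 4; i++) {
            int shift = big_endian ? 8 * (3 - i) : 8 * i;
            out[i] = (unsigned char)(unit >> shift);
        }
    }

    int main(void) {
        unsigned char be[4], le[4];
        utf32_serialize(0x54C8, 1, be);
        utf32_serialize(0x54C8, 0, le);
        printf("UTF-32BE: %02X %02X %02X %02X\n", be[0], be[1], be[2], be[3]);  /* 00 00 54 C8 */
        printf("UTF-32LE: %02X %02X %02X %02X\n", le[0], le[1], le[2], le[3]);  /* C8 54 00 00 */
        return 0;
    }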




ANSWER 3

Score 1


Here is the official, primary source material (published in March 2020):

"The Unicode® Standard, Version 13.0"
Chapter 2: General Structure (page 39 of the document; page 32 of the PDF)

2.6 Encoding Schemes

The discussion of Unicode encoding forms (ed. UTF-8, UTF-16, and UTF-32) in the previous section was concerned with the machine representation of Unicode code units. Each code unit is represented in a computer simply as a numeric data type; just as for other numeric types, the exact way the bits are laid out internally is irrelevant to most processing. However, interchange of textual data, particularly between computers of different architectural types, requires consideration of the exact ordering of the bits and bytes involved in numeric representation. Integral data, including character data, is serialized for open interchange into well-defined sequences of bytes. This process of byte serialization allows all applications to correctly interpret exchanged data and to accurately reconstruct numeric values (and thereby character values) from it. In the Unicode Standard, the specifications of the distinct types of byte serializations to be used with Unicode data are known as Unicode encoding schemes.

Byte Order. Modern computer architectures differ in ordering in terms of whether the most significant byte or the least significant byte of a large numeric data type comes first in internal representation. These sequences are known as “big-endian” and “little-endian” orders, respectively. For the Unicode 16- and 32-bit encoding forms (UTF-16 and UTF-32), the specification of a byte serialization must take into account the big-endian or little-endian architecture of the system on which the data is represented, so that when the data is byte serialized for interchange it will be well defined.

A character encoding scheme consists of a specified character encoding form plus a specification of how the code units are serialized into bytes. The Unicode Standard also specifies the use of an initial byte order mark (BOM) to explicitly differentiate big-endian or little-endian data in some of the Unicode encoding schemes. (See the “Byte Order Mark” subsection in Section 23.8, Specials.)

When a higher-level protocol supplies mechanisms for handling the endianness of integral data types, it is not necessary to use Unicode encoding schemes or the byte order mark. In those cases Unicode text is simply a sequence of integral data types.

For UTF-8, the encoding scheme consists merely of the UTF-8 code units (= bytes) in sequence. Hence, there is no issue of big- versus little-endian byte order for data represented in UTF-8. However, for 16-bit and 32-bit encoding forms, byte serialization must break up the code units into two or four bytes, respectively, and the order of those bytes must be clearly defined. Because of this, and because of the rules for the use of the byte order mark, the three encoding forms of the Unicode Standard result in a total of seven Unicode encoding schemes, as shown in Table 2-4.

The endian order entry for UTF-8 in Table 2-4 is marked N/A because UTF-8 code units are 8 bits in size, and the usual machine issues of endian order for larger code units do not apply. The serialized order of the bytes must not depart from the order defined by the UTF-8 encoding form. Use of a BOM is neither required nor recommended for UTF-8, but may be encountered in contexts where UTF-8 data is converted from other encoding forms that use a BOM or where the BOM is used as a UTF-8 signature. See the “Byte Order Mark” subsection in Section 23.8, Specials, for more information.
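
(Outside the quoted text: the BOM bytes that the standard refers to are easy to recognize programmatically. Below is a minimal C sketch, with the helper name sniff_bom invented for illustration, that checks a buffer's initial bytes against the well-known BOM sequences.)

    #include <stdio.h>
    #include <string.h>

    /* Identify a Unicode encoding scheme from an initial byte order mark, if present.
       UTF-32 is checked first because its LE BOM starts with the UTF-16LE BOM bytes. */
    static const char *sniff_bom(const unsigned char *buf, size_t len) {
        if (len >= 4 && memcmp(buf, "\x00\x00\xFE\xFF", 4) == 0) return "UTF-32BE";
        if (len >= 4 && memcmp(buf, "\xFF\xFE\x00\x00", 4) == 0) return "UTF-32LE";
        if (len >= 3 && memcmp(buf, "\xEF\xBB\xBF", 3) == 0)     return "UTF-8 (BOM used as a signature)";
        if (len >= 2 && memcmp(buf, "\xFE\xFF", 2) == 0)         return "UTF-16BE";
        if (len >= 2 && memcmp(buf, "\xFF\xFE", 2) == 0)         return "UTF-16LE";
        return "no BOM (e.g. plain UTF-8, or an unmarked scheme)";
    }

    int main(void) {
        const unsigned char sample[] = { 0xFF, 0xFE, 0xC8, 0x54 };  /* UTF-16LE "哈" with a BOM */
        printf("%s\n", sniff_bom(sample, sizeof sample));
        return 0;
    }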





ANSWER 4

Score 1


UTF-8 uses 3 bytes to represent the same character [哈 54 C8], but it does not have big or little endian. Why?

The reason (or potential explanation) is that those three bytes encode the code-point bits differently than UTF-16 does:

UTF-8    11100101 10010011 10001000    E5 93 88
         1110xxxx 10xxxxxx 10xxxxxx
             0101   010011   001000    54 C8

The 16 bits of the code-point (01010100 11001000 [哈 54 C8]) are distributed across three bytes in the UTF-8 byte-stream (a lead byte and two continuation bytes).

By the rules of the encoding, the most significant bit is always the left-most one. This allows UTF-8 to be parsed byte-by-byte from lowest to highest byte index.

Compare: UTF-8 (D92 UTF-8 encoding form - 3.9 Unicode Encoding Forms, Unicode 14.0.0 p. 123)

How the number of the code-point is then stored within the computer's memory is not affected by that.
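
Here is a small C sketch (not part of the original answer) of that byte-by-byte parse: it reassembles U+54C8 from the three UTF-8 bytes using only shifts and masks, so the result is the same on any architecture:

    #include <stdio.h>
    #include <stdint.h>

    int main(void) {
        /* UTF-8 bytes for U+54C8 (哈), read from lowest to highest byte index. */
        const unsigned char utf8[] = { 0xE5, 0x93, 0x88 };

        /* Lead byte 1110xxxx carries the top 4 code-point bits;
           each continuation byte 10xxxxxx carries the next 6 bits. */
        uint32_t cp = (uint32_t)(utf8[0] & 0x0F) << 12
                    | (uint32_t)(utf8[1] & 0x3F) << 6
                    | (uint32_t)(utf8[2] & 0x3F);

        printf("U+%04X\n", (unsigned)cp);   /* prints U+54C8 on any architecture */
        return 0;
    }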

With UTF-16 it is not that clear, as UTF-16 suggests reading the byte-stream word-by-word (not byte-by-byte). Hence the order of the bytes within a word (and therefore the order of the bits as well) may vary:

UTF-16BE    01010100 11001000    54 C8
UTF-16LE    11001000 01010100    C8 54

If you now map words from the stream into the computer's memory, you need to match the byte order to the architecture to get the correct code-point value.
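
A matching C sketch (the helper names read_utf16be and read_utf16le are invented for illustration): reading a UTF-16 byte-stream means explicitly choosing which byte of each word is the most significant one:

    #include <stdio.h>
    #include <stdint.h>

    /* Reassemble one UTF-16 code unit from two serialized bytes. */
    static uint16_t read_utf16be(const unsigned char b[2]) {
        return (uint16_t)((b[0] << 8) | b[1]);   /* most significant byte first */
    }
    static uint16_t read_utf16le(const unsigned char b[2]) {
        return (uint16_t)((b[1] << 8) | b[0]);   /* least significant byte first */
    }

    int main(void) {
        const unsigned char be[] = { 0x54, 0xC8 };   /* UTF-16BE for U+54C8 */
        const unsigned char le[] = { 0xC8, 0x54 };   /* UTF-16LE for U+54C8 */
        printf("BE stream -> U+%04X\n", (unsigned)read_utf16be(be));
        printf("LE stream -> U+%04X\n", (unsigned)read_utf16le(le));
        return 0;
    }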

See also: Difference between Big Endian and Little Endian Byte order