Unicode, Emojis, and a bit of Golang

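Recently, I had an issue with my Fedora Linux installation displaying emojis in the OS UI and in browsers. This issue led me to investigate the fontconfig project a bit, but to test my configuration and fonts I needed to produce emojis from all Unicode versions, which eventually led me to write a Golang "script" that prints all emojis along with some information about their internals.

Throughout this trip, I took a deep dive into the internals of emojis, their binary representations, and some of the weird and cute decisions the Unicode Standard made regarding emojis.

But first, let's take a quick step back and summarize some glossary.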

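Encoding (or Character Encoding)

An encoding can be described as the "mapping" or "translation" between a letter of a language and the binary representation of that letter. For example, the traditional ASCII encoding maps the letter a to 0x61 hex (0b01100001 binary). Examples of encodings are the Microsoft (Windows 125x) or ISO (ISO/IEC 8859) 8-bit code pages.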

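In these fixed 8-bit code pages, the minimum "amount" of information used is 8 bits (1 byte), which means that each can contain 256 different letters/characters. Different code pages were created by reusing the same 256 binary codes to support many languages. So, a text file with these 3 bytes written in it, [0xD0, 0xE5, 0xF2], reads as "Πες" using the Greek ISO 8859-7, or as "Ðåò" using the Western ISO 8859-1 (same bytes, interpreted differently based on the code page).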
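
The "same bytes, different interpretation" point can be sketched in Go. This is a minimal illustration, not part of the original script: `decodeLatin1` and `decodeGreek` are hypothetical helpers, and `decodeGreek` only covers ASCII and the letter range of ISO 8859-7 (a real program would more likely use a full charmap library).

```go
package main

import "fmt"

// decodeLatin1 reads bytes as ISO 8859-1, where each byte value equals its code point.
func decodeLatin1(data []byte) string {
	runes := make([]rune, 0, len(data))
	for _, b := range data {
		runes = append(runes, rune(b))
	}
	return string(runes)
}

// decodeGreek reads bytes as ISO 8859-7.
// Simplification (assumption of this sketch): only ASCII and the letter
// range 0xC1-0xFE are handled; every other byte becomes U+FFFD.
func decodeGreek(data []byte) string {
	runes := make([]rune, 0, len(data))
	for _, b := range data {
		switch {
		case b < 0x80:
			runes = append(runes, rune(b))
		case b >= 0xC1 && b <= 0xFE && b != 0xD2: // 0xD2 is unassigned
			runes = append(runes, rune(b)+0x2D0) // e.g. 0xD0 -> U+03A0
		default:
			runes = append(runes, '\uFFFD')
		}
	}
	return string(runes)
}

func main() {
	data := []byte{0xD0, 0xE5, 0xF2}
	fmt.Println(decodeGreek(data))  // Πες
	fmt.Println(decodeLatin1(data)) // Ðåò
}
```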

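At some point, having many different code pages stopped scaling well as technology progressed. So, we needed something that could fit all languages (and more) and be unified across systems.

[fast forward, leaving out a lot of history and standards, to the present]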

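The Unicode Standard

The Unicode Standard was designed to support all of the world's writing systems that can be digitized. So, using the above example, in the Unicode Standard the Greek letter "Π" has the code point 0x03A0 while the Latin capital letter eth "Ð" has the code point 0x00D0, and they no longer collide. The Unicode Standard has versions, and at the time of writing, the latest version is 16.0 (spec).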

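But wait a minute, what is this "code point"?

Unicode Code Points

In the Unicode Standard, every "letter", control character, emoji, and every defined item in general has a unique binary value called a "code point". The standard defines all the code points, and each code point contains pure code/binary information. The hexadecimal form of a code point is usually written with a U+ prefix. For example, the code point of the Greek small letter omega (ω) is U+03C9.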
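
As a small sketch of that notation in Go: the `%U` verb of the `fmt` package formats a rune in the U+ form. The helper name `codePoint` is only illustrative.

```go
package main

import "fmt"

// codePoint formats a rune in the conventional U+XXXX notation.
func codePoint(r rune) string {
	return fmt.Sprintf("%U", r)
}

func main() {
	for _, r := range []rune{'Π', 'Ð', 'ω'} {
		fmt.Printf("%c -> %s\n", r, codePoint(r))
	}
	// Π -> U+03A0
	// Ð -> U+00D0
	// ω -> U+03C9
}
```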

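So, who actually encodes these code points?

Unicode Encoding Forms and Encoding Schemes

The first part of encoding code points into bytes is the encoding form. According to the standard:

Encoding forms specify how each integer (code point) for a Unicode character is to be expressed as a sequence of one or more code units.

Encoding forms use the term "code unit" to refer to the smallest unit of data used to represent a Unicode code point within a particular encoding.

The Unicode Standard defines three different encoding forms: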

UTF-32. Fixed-length code unit per code point. Size per code point: one 32-bit code unit (4 bytes).
UTF-16. Variable-length code units per code point. Size per code point: one or two 16-bit code units (2 or 4 bytes).
UTF-8. Variable-length code units per code point. Size per code point: one to four 8-bit code units (1 to 4 bytes).

This means that a single code point, or a sequence of code points, may be encoded differently depending on the encoding form used.
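
To make the three encoding forms concrete, here is a short sketch that counts the code units each form needs for a few code points. It uses only the standard `unicode/utf8` and `unicode/utf16` packages; the helper `codeUnits` is a name made up for this example.

```go
package main

import (
	"fmt"
	"unicode/utf16"
	"unicode/utf8"
)

// codeUnits reports how many code units each encoding form needs for r.
func codeUnits(r rune) (u8, u16, u32 int) {
	u8 = utf8.RuneLen(r)               // number of 8-bit code units
	u16 = len(utf16.Encode([]rune{r})) // number of 16-bit code units
	u32 = 1                            // UTF-32 always uses one 32-bit code unit
	return u8, u16, u32
}

func main() {
	for _, r := range []rune{'a', 'ω', '⌚', 0x1F970} {
		u8, u16, u32 := codeUnits(r)
		fmt.Printf("%U: UTF-8=%d UTF-16=%d UTF-32=%d\n", r, u8, u16, u32)
	}
	// U+0061: UTF-8=1 UTF-16=1 UTF-32=1
	// U+03C9: UTF-8=2 UTF-16=1 UTF-32=1
	// U+231A: UTF-8=3 UTF-16=1 UTF-32=1
	// U+1F970: UTF-8=4 UTF-16=2 UTF-32=1
}
```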

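The layer in Unicode that takes care of the actual binary serialization is called the encoding scheme, and it handles all the low-level details (such as endianness). Table 2-4 of the Unicode spec: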


|Encoding Scheme| Endian Order                | BOM Allowed? |
| ------------- | ----------------------------| ------------ |
| UTF-8         | N/A                         | yes          |
| UTF-16        | Big-endian or little-endian | yes          |
| UTF-16BE      | Big-endian                  | no           |
| UTF-16LE      | Little-endian               | no           |
| UTF-32        | Big-endian or little-endian | yes          |
| UTF-32BE      | Big-endian                  | no           |
| UTF-32LE      | Little-endian               | no           |
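
The difference between the UTF-16BE and UTF-16LE rows can be sketched with `encoding/binary`. This is only an illustration of byte order: the helpers `encodeUTF16`, `utf16BE`, and `utf16LE` are hypothetical names, and no BOM is written.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"unicode/utf16"
)

// encodeUTF16 serializes s as UTF-16 code units in the given byte order, without a BOM.
func encodeUTF16(s string, order binary.ByteOrder) []byte {
	units := utf16.Encode([]rune(s))
	out := make([]byte, 2*len(units))
	for i, u := range units {
		order.PutUint16(out[2*i:], u)
	}
	return out
}

func utf16BE(s string) []byte { return encodeUTF16(s, binary.BigEndian) }
func utf16LE(s string) []byte { return encodeUTF16(s, binary.LittleEndian) }

func main() {
	fmt.Printf("% X\n", utf16BE("ω"))          // 03 C9
	fmt.Printf("% X\n", utf16LE("ω"))          // C9 03
	fmt.Printf("% X\n", utf16BE("\U0001F970")) // D8 3E DD 70 (a surrogate pair)
}
```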

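Note: Almost all modern programming languages, operating systems, and filesystems use Unicode (with one of its encoding schemes) as their native encoding. Java and .NET use UTF-16, whereas Golang uses UTF-8 as its internal string encoding (that is, when we create a string in memory, it is encoded in Unicode with the mentioned encoding form).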
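
A quick way to see this in Go is to print the raw bytes of a string literal. Go source files are UTF-8, so the bytes below are the UTF-8 code units of each character; `utf8Hex` is just an illustrative helper.

```go
package main

import "fmt"

// utf8Hex returns the bytes of s as space-separated lowercase hex.
func utf8Hex(s string) string {
	return fmt.Sprintf("% x", s)
}

func main() {
	for _, s := range []string{"a", "ω", "⌚"} {
		fmt.Printf("%s: %s\n", s, utf8Hex(s))
	}
	// a: 61
	// ω: cf 89
	// ⌚: e2 8c 9a
}
```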

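Emoji

The Unicode Standard also defines code points for emojis (a lot of them), and (after some mix-up with version numbers) the version of the Emoji "standard" progresses in parallel with the Unicode Standard. At the time of writing, we have Emoji "16.0" and Unicode Standard "16.0".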

Examples:
⛄ Snowman Without Snow (U+26C4)
🥰 Smiling Face with Smiling Eyes and Three Hearts (U+1F970)

Emoji Modifiers and Join

Unicode defines modifiers that can follow an emoji's base code point, such as variation selectors and skin tones (we will not explore the variation part).

We have five skin tone modifiers (following the six-type Fitzpatrick scale, with types 1 and 2 merged into one) called EMOJI MODIFIER FITZPATRICK TYPE-X (where X is 1-2, 3, 4, 5, or 6), and they affect all human emojis.

Light Skin Tone (Fitzpatrick Type-1-2) (U+1F3FB)
Medium-Light Skin Tone (Fitzpatrick Type-3) (U+1F3FC)
Medium Skin Tone (Fitzpatrick Type-4) (U+1F3FD)
Medium-Dark Skin Tone (Fitzpatrick Type-5) (U+1F3FE)
Dark Skin Tone (Fitzpatrick Type-6) (U+1F3FF)

So, for example, like all human emojis, the baby emoji 👶 (U+1F476), when not followed by a skin modifier, appears in a neutral yellow color. In contrast, when a skin color modifier follows it, it changes accordingly.
👶 U+1F476
👶🏿 U+1F476 U+1F3FF
👶🏾 U+1F476 U+1F3FE
👶🏽 U+1F476 U+1F3FD
👶🏼 U+1F476 U+1F3FC
👶🏻 U+1F476 U+1F3FB
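
In Go, a base emoji plus a modifier is simply two runes placed one after the other. The sketch below builds the baby variants from their code points; `withSkinTone` and the variable names are chosen for this example only, and how each variant is drawn depends on the font.

```go
package main

import "fmt"

// baby is the base code point; skinTones are the five Fitzpatrick modifiers.
const baby rune = 0x1F476

var skinTones = []rune{0x1F3FB, 0x1F3FC, 0x1F3FD, 0x1F3FE, 0x1F3FF}

// withSkinTone returns the base code point followed by the modifier.
func withSkinTone(base, modifier rune) string {
	return string([]rune{base, modifier})
}

func main() {
	fmt.Printf("%c %U\n", baby, baby)
	for _, tone := range skinTones {
		fmt.Printf("%s %U %U\n", withSkinTone(baby, tone), baby, tone)
	}
}
```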

Joining emojis together

The strangest but cutest decision of the Emoji/Unicode Standard is that some emojis have no standalone code point; instead, they are defined by joining other emojis together with the Zero Width Joiner (ZWJ).

So, for example, when we combine:
White Flag 🏳️ (U+1F3F3 U+FE0F) +
Zero Width Joiner (U+200D) +
Rainbow 🌈 (U+1F308)

It appears as Rainbow Flag 🏳️‍🌈 (U+1F3F3 U+FE0F U+200D U+1F308)

Or, ?? + ? => ??‍?
Or even, 👩🏼 + ❤️ + 💋 + 👨🏾 => 👩🏼‍❤️‍💋‍👨🏾
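
Joining can be sketched in Go as plain string concatenation with U+200D between the parts. `joinZWJ` is a hypothetical helper written for this example; whether the result renders as a single glyph depends on the font.

```go
package main

import (
	"fmt"
	"strings"
)

// zwj is the Zero Width Joiner (U+200D).
const zwj = "\u200D"

// joinZWJ concatenates the parts with a Zero Width Joiner between each pair.
func joinZWJ(parts ...string) string {
	return strings.Join(parts, zwj)
}

func main() {
	whiteFlag := "\U0001F3F3\uFE0F" // White Flag + VS16
	rainbow := "\U0001F308"         // Rainbow

	flag := joinZWJ(whiteFlag, rainbow)
	fmt.Println(flag)
	for _, r := range flag {
		fmt.Printf("%U ", r)
	}
	fmt.Println()
}
```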

It's like squeezing emojis together, and then, poof ?, a new emoji appears. How cute is that?

I wanted to create a Markdown table with all emojis, and the Unicode emoji sequence tables are the source of truth for that.

https://unicode.org/Public/emoji/16.0/emoji-sequences.txt
https://unicode.org/Public/emoji/16.0/emoji-zwj-sequences.txt

So I created a Golang parser (here) that fetches and parses those sequence files, generates each emoji when a range is described in the sequence file, and prints a markdown table with some internal information for each one (like the parts in case it joined, or the base + skin tone, etc.).
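
As a rough idea of what parsing one line of those files involves, here is a minimal sketch. It is not the parser mentioned above: the function names are made up, and it assumes each data line has the shape `code points ; type ; description # comment`, where the first field is either a range like `231A..231B` or a space-separated sequence of hex code points.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseHex converts a hex code point such as "1F476" into a rune.
func parseHex(s string) (rune, error) {
	v, err := strconv.ParseUint(s, 16, 32)
	if err != nil {
		return 0, err
	}
	return rune(v), nil
}

// expandLine returns the emojis described by one line of a sequence file.
// Comment-only and blank lines yield nil. The line format is an assumption
// of this sketch (see the lead-in above).
func expandLine(line string) ([]string, error) {
	if i := strings.IndexByte(line, '#'); i >= 0 {
		line = line[:i] // drop the trailing comment
	}
	field, _, _ := strings.Cut(line, ";") // keep only the code point field
	field = strings.TrimSpace(field)
	if field == "" {
		return nil, nil
	}
	// A range such as 231A..231B expands to one emoji per code point.
	if first, last, ok := strings.Cut(field, ".."); ok {
		lo, err := parseHex(first)
		if err != nil {
			return nil, err
		}
		hi, err := parseHex(last)
		if err != nil {
			return nil, err
		}
		var out []string
		for r := lo; r <= hi; r++ {
			out = append(out, string(r))
		}
		return out, nil
	}
	// Otherwise the field is a single sequence of code points.
	var sb strings.Builder
	for _, h := range strings.Fields(field) {
		r, err := parseHex(h)
		if err != nil {
			return nil, err
		}
		sb.WriteRune(r)
	}
	return []string{sb.String()}, nil
}

func main() {
	lines := []string{
		"# sample lines written for this sketch",
		"231A..231B ; Basic_Emoji ; watch..hourglass done # E0.6 [2]",
		"1F3F3 FE0F 200D 1F308 ; RGI_Emoji_ZWJ_Sequence ; rainbow flag # E4.0 [1]",
	}
	for _, l := range lines {
		emojis, err := expandLine(l)
		if err != nil {
			fmt.Println("parse error:", err)
			continue
		}
		fmt.Println(len(emojis), emojis)
	}
}
```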

You can find the markdown table here.

The last column of this table is in this format :.

Golang, Unicode and Rune


str := "⌚"
len([]rune(str)) // 1
len([]byte(str)) // 3

As we discussed, Golang's internal string encoding is UTF-8, which means that, for example, for the watch emoji ⌚ (U+231A) the byte length is 3 (because UTF-8 produces 3 bytes to "write" this code point), while the code point length is 1.

Golang rune == Unicode Code Point

But in the case of joined emoji -even if it "appears" as one- we have many code points (runes) and even more bytes.


str := "👩🏼‍❤️‍💋‍👨🏾"
len([]rune(str)) // 10
len([]byte(str)) // 35

And the reason is that:


👩🏼‍❤️‍💋‍👨🏾 : 👩🏼 + ZWJ + ❤️ + ZWJ + 💋 + ZWJ + 👨🏾

👩🏼  : 1F469 1F3FC // 👩 + skin tone modifier [2 code points]
ZWJ : 200D // [1 code point] * 3
❤️  : 2764 FE0F // ❤ + VS16 for emoji-style [2 code points]
💋  : 1F48B // [1 code point]
👨🏾  : 1F468 1F3FE // 👨 + skin tone modifier [2 code points]
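
The same breakdown can be reproduced programmatically. The sketch below writes the sequence with escapes so that the invisible ZWJ and VS16 characters stay visible in the source; `codePoints` is an illustrative helper name.

```go
package main

import "fmt"

// codePoints lists each rune of s in U+XXXX notation.
func codePoints(s string) []string {
	var out []string
	for _, r := range s { // ranging over a string decodes UTF-8 into runes
		out = append(out, fmt.Sprintf("%U", r))
	}
	return out
}

func main() {
	// woman + skin tone, ZWJ, heart + VS16, ZWJ, kiss mark, ZWJ, man + skin tone
	kiss := "\U0001F469\U0001F3FC\u200D\u2764\uFE0F\u200D\U0001F48B\u200D\U0001F468\U0001F3FE"

	cps := codePoints(kiss)
	fmt.Println(len(cps), len(kiss)) // 10 35
	fmt.Println(cps)
}
```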

It is worth mentioning that how we see emojis depends on our system font and which versions of emoji this font supports.

I don't know the exact internals of font rendering or how a font manages to render the joined emojis correctly. Perhaps that will be a future post.

Til then, cheers ?