Home Web Front-end JS Tutorial Detailed explanation of how JavaScript uses 300 lines of code to convert Chinese characters to Pinyin

Detailed explanation of how JavaScript uses 300 lines of code to convert Chinese characters to Pinyin

May 21, 2017 am 11:32 AM

This article mainly introduces the god-level programmer JavaScript300 lines of code to convert Chinese characters to pinyin. Friends in need can refer to the following

.The current situation of converting Chinese characters to Pinyin

First of all, it should be said that there is a strong demand for converting Chinese characters to Pinyin, such as sorting/filtering contacts by Pinyin letters; such as destinations (typically such as ticket purchases)
By Pinyin Initial letter classification and so on. But the solution to this requirement, but I haven’t heard of any clever implementation (especially on the browser side), probably requires a huge dictionary.
Specific to JavaScript, check github and npm. The better libraries for converting Chinese characters to pinyin include pinyin
and pinyinjs. You can see that both All come with a huge dictionary.
These dictionaries are often tens or hundreds of KB (some even several MB), and it still takes some courage to use them on the browser side. So when we encounter the need to convert Chinese characters to Pinyin, it is no wonder that our first reaction is to reject the request (or implement it on the server side).
Now, if I tell you that you can convert Chinese characters to Pinyin in 300 lines of code on the browser side, is it unbelievable?

2. Starting from the Android 4.2.2 contact code

Again emphasize this blog - using the Android source code, Easily convert Chinese characters to Pinyin.
Today I would like to share with you a solution for converting Chinese characters into Pinyin extracted from the Android system source code. With just one class and more than 560 lines of code, you can easily implement the function of converting Chinese characters into Pinyin without any other third party. rely.
Has this broken your thinking pattern: Is there any powerful algorithm that can abandon the dictionary?
After reading the blog for the first time, I was a little disappointed. There was no algorithm analysis. It just introduced the hundreds of lines of code discovered from the Android code. The second time I read the code with the idea of ​​​​porting it to JavaScript, I finally understood the principle, so I started the journey of porting.

3. Teach you step by step how to convert Chinese characters to Pinyin with 300 lines of JavaScript code

First, let’s get straight to the core: why converting Chinese characters to Pinyin requires a huge dictionary of thinking Settlement?
Because the arrangement of Chinese characters has nothing to do with pinyin, for example, in the Chinese character range \u4E00-\u9FFF, the former may be ha, and the latter may be ze. There is no way to associate the unicode of Chinese characters with pinyin, so we can only There is a huge dictionary that records the pinyin of every Chinese character (or commonly used Chinese character).
However, suppose we can sort all Chinese characters by pinyin, such as 'A','AI','AN','ANG','AO','BA',...,'ZUI',' ZUN','ZUO' sorting, then we only need to remember the first Chinese character of each Chinese character queue with the same pinyin. Then, the required dictionary will be very small (just cover all pinyin, the number of pinyin itself is not large).
Now, the difficulty is to sort the Chinese characters by pinyin. Fortunately, the ICU/localization related API provides this sorting API (if there were no convenient sorting/comparison methods, this article may not appear).

So, this is why 300 lines can be used to convert Chinese characters to Pinyin: Intl.CollatorAPI: Intl.Collator internally implements localization-related string sorting. We can basically sort all Chinese characters according to Pinyin through Intl.Collator.prototype.compare.
Boundary Chinese character table: records the sorted boundary points. Each Chinese character in this Chinese character table is the first Chinese character in a set of Chinese characters with the same pinyin after sorting (Eachunihansisthefirstonewithinsamepinyinwhencollatoriszh_CN).
Speaking of this, there may still be something that is not clearly explained, so I will directly upload a piece of code:

For interested students You can execute the script.js above node--icu-data-dir=node_modules/full-icu to see if it is basically sorted by pinyin. Chinese character table.

Here are a few points to note:

I bolded "Basic" again because the list of Chinese characters we got is not completely sorted according to Pinyin. There are occasionally some other Pinyin Chinese characters inserted in the middle. This needs to be done when making the boundary table. Extra attention.
The table obtained in the above script is the sorting of all Chinese characters. Some of them are different from the table of HanziToPinyin.java in the Android code, so the table of HanziToPinyin.java needs to be updated. (The biggest pitfall and workload in switching from Java to JavaScript: correcting the boundary table)
I believe everyone has seen the core code: constCOLLATOR=newIntl.Collator(['zh-Hans-CN']), Intl.Collator
(The locale specified here is China zh-Hans-CN) is the key to sorting Chinese characters by pinyin. It is an Internationalization API that sorts strings in locale-specific order.
Please first npmifull-icu when executing the script. This dependency will automatically install the missing Chinese support and prompt how to specify the ICU data file to execute the script.
1.ICUICU stands for InternationalComponentsforUnicode, which provides Unicode and internationalization support for applications.
ICUisamature,widelyusedsetofC/C++andJavalibrariesprovidingUnicodeandGlobalizationsupportforsoftwareapplications.ICUiswidelyportableandgivesapplicationsthesameresultsonallplatformsandbetweenC/C++andJavasoftware.
And ICU provides localized string comparison services (UnicodeCollationAlgorithm+locally specific comparison rules):
Collation:Comparestringsaccordingtotheconvention sandstandardsofaparticularlanguage,regionorcountry. ICU'scollationisbasedontheUnicodeCollationAlgorithmpluslocale-specificcomparisonrulesfromtheCommonLocaleDataRepository,acomprehensivesourceforthistypeofdata.
On modern browsers, generally ICU has built-in support for the user's local language, and we can use it directly.
But for node.js, usually, ICU only contains a subset (usually English), so we need to add support for Chinese ourselves. Generally speaking, you can install full-icu
through npminstallfull-icu to install missing Chinese support. (See node--icu-data-dir=node_modules/full-icu above).
2.IntlAPI The previous section should basically explain the knowledge related to internationalization/localization. Here we will add the use of built-in API. How to check whether the user language and Runtime support this language? Intl.Collator.supportedLocalesOf(array|string)
Returns an array containing locales that are supported (without falling back to the default locale). The parameter can be an array or a string, which is the locales you want to test (i.e. BCP47languagetag).

Construct the Collator object and sort the string

through Intl.Collator.prototype. compare, we can sort strings in the order specified by the language. In Chinese, this sorting happens to be mostly in pinyin order, 'A', 'AI', 'AN', 'ANG', 'AO', 'BA', 'BAI', 'BAN' ,'BANG','BAO','BEI','BEN','BENG','BI','BIAN','BIAO','BIE','BIN','BING','BO',' BU','CA','CAI','CAN',...
, this is the key to converting Chinese characters to Pinyin as we mentioned above.

4. Boundary table correction

Obviously, there is a problem with this boundary table and needs to be corrected.
We can see that most of the Chinese characters have been converted into qing. It can be seen that there is a problem with the Chinese character corresponding to the pinyin of qing.
Found this Chinese character, it is '\u72c5'/'狅', plus one character before and after, ['\u4eb2','\u72c5','\u828e']/["情","狅", "芎"]
.
Search, '\u72c5'/'狅' can be read as qing, but now it is read as kuang, which should be the cause of the error.
According to the initial sorting list of all Chinese characters, the first Chinese character of qing is '\u9751'/'靑'.
After the change, only 104 failed conversions.

The above is the detailed content of Detailed explanation of how JavaScript uses 300 lines of code to convert Chinese characters to Pinyin. For more information, please follow other related articles on the PHP Chinese website!

Statement of this Website
The content of this article is voluntarily contributed by netizens, and the copyright belongs to the original author. This site does not assume corresponding legal responsibility. If you find any content suspected of plagiarism or infringement, please contact admin@php.cn

Hot AI Tools

Undresser.AI Undress

Undresser.AI Undress

AI-powered app for creating realistic nude photos

AI Clothes Remover

AI Clothes Remover

Online AI tool for removing clothes from photos.

Undress AI Tool

Undress AI Tool

Undress images for free

Clothoff.io

Clothoff.io

AI clothes remover

AI Hentai Generator

AI Hentai Generator

Generate AI Hentai for free.

Hot Article

R.E.P.O. Energy Crystals Explained and What They Do (Yellow Crystal)
3 weeks ago By 尊渡假赌尊渡假赌尊渡假赌
R.E.P.O. Best Graphic Settings
3 weeks ago By 尊渡假赌尊渡假赌尊渡假赌
R.E.P.O. How to Fix Audio if You Can't Hear Anyone
3 weeks ago By 尊渡假赌尊渡假赌尊渡假赌

Hot Tools

Notepad++7.3.1

Notepad++7.3.1

Easy-to-use and free code editor

SublimeText3 Chinese version

SublimeText3 Chinese version

Chinese version, very easy to use

Zend Studio 13.0.1

Zend Studio 13.0.1

Powerful PHP integrated development environment

Dreamweaver CS6

Dreamweaver CS6

Visual web development tools

SublimeText3 Mac version

SublimeText3 Mac version

God-level code editing software (SublimeText3)

How to implement an online speech recognition system using WebSocket and JavaScript How to implement an online speech recognition system using WebSocket and JavaScript Dec 17, 2023 pm 02:54 PM

How to use WebSocket and JavaScript to implement an online speech recognition system Introduction: With the continuous development of technology, speech recognition technology has become an important part of the field of artificial intelligence. The online speech recognition system based on WebSocket and JavaScript has the characteristics of low latency, real-time and cross-platform, and has become a widely used solution. This article will introduce how to use WebSocket and JavaScript to implement an online speech recognition system.

WebSocket and JavaScript: key technologies for implementing real-time monitoring systems WebSocket and JavaScript: key technologies for implementing real-time monitoring systems Dec 17, 2023 pm 05:30 PM

WebSocket and JavaScript: Key technologies for realizing real-time monitoring systems Introduction: With the rapid development of Internet technology, real-time monitoring systems have been widely used in various fields. One of the key technologies to achieve real-time monitoring is the combination of WebSocket and JavaScript. This article will introduce the application of WebSocket and JavaScript in real-time monitoring systems, give code examples, and explain their implementation principles in detail. 1. WebSocket technology

How to implement an online reservation system using WebSocket and JavaScript How to implement an online reservation system using WebSocket and JavaScript Dec 17, 2023 am 09:39 AM

How to use WebSocket and JavaScript to implement an online reservation system. In today's digital era, more and more businesses and services need to provide online reservation functions. It is crucial to implement an efficient and real-time online reservation system. This article will introduce how to use WebSocket and JavaScript to implement an online reservation system, and provide specific code examples. 1. What is WebSocket? WebSocket is a full-duplex method on a single TCP connection.

How to use JavaScript and WebSocket to implement a real-time online ordering system How to use JavaScript and WebSocket to implement a real-time online ordering system Dec 17, 2023 pm 12:09 PM

Introduction to how to use JavaScript and WebSocket to implement a real-time online ordering system: With the popularity of the Internet and the advancement of technology, more and more restaurants have begun to provide online ordering services. In order to implement a real-time online ordering system, we can use JavaScript and WebSocket technology. WebSocket is a full-duplex communication protocol based on the TCP protocol, which can realize real-time two-way communication between the client and the server. In the real-time online ordering system, when the user selects dishes and places an order

JavaScript and WebSocket: Building an efficient real-time weather forecasting system JavaScript and WebSocket: Building an efficient real-time weather forecasting system Dec 17, 2023 pm 05:13 PM

JavaScript and WebSocket: Building an efficient real-time weather forecast system Introduction: Today, the accuracy of weather forecasts is of great significance to daily life and decision-making. As technology develops, we can provide more accurate and reliable weather forecasts by obtaining weather data in real time. In this article, we will learn how to use JavaScript and WebSocket technology to build an efficient real-time weather forecast system. This article will demonstrate the implementation process through specific code examples. We

Simple JavaScript Tutorial: How to Get HTTP Status Code Simple JavaScript Tutorial: How to Get HTTP Status Code Jan 05, 2024 pm 06:08 PM

JavaScript tutorial: How to get HTTP status code, specific code examples are required. Preface: In web development, data interaction with the server is often involved. When communicating with the server, we often need to obtain the returned HTTP status code to determine whether the operation is successful, and perform corresponding processing based on different status codes. This article will teach you how to use JavaScript to obtain HTTP status codes and provide some practical code examples. Using XMLHttpRequest

How to use insertBefore in javascript How to use insertBefore in javascript Nov 24, 2023 am 11:56 AM

Usage: In JavaScript, the insertBefore() method is used to insert a new node in the DOM tree. This method requires two parameters: the new node to be inserted and the reference node (that is, the node where the new node will be inserted).

How to get HTTP status code in JavaScript the easy way How to get HTTP status code in JavaScript the easy way Jan 05, 2024 pm 01:37 PM

Introduction to the method of obtaining HTTP status code in JavaScript: In front-end development, we often need to deal with the interaction with the back-end interface, and HTTP status code is a very important part of it. Understanding and obtaining HTTP status codes helps us better handle the data returned by the interface. This article will introduce how to use JavaScript to obtain HTTP status codes and provide specific code examples. 1. What is HTTP status code? HTTP status code means that when the browser initiates a request to the server, the service

See all articles