Which country has the most content on Wikipedia?

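Introduction

When I search for something on the internet, I often find that English content is far more comprehensive than French content.

This might seem obvious given how many more English speakers than French speakers there are in the world (about 4 to 5 times as many), but I wanted to test this hypothesis and quantify it.

TL;DR: On average, an English article on Wikipedia contains 19% more information than its French counterpart.

The source code for this analysis is available here: https://github.com/jverneaut/wikipedia-analysis/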

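Protocol

Wikipedia is one of the largest sources of quality content on the web worldwide.

At the time of writing, the English version has over 6,700,000 unique articles, while the French version has only 2,500,000. We will use this corpus as the basis for our study.

Using the Monte Carlo method, we will randomly sample articles from Wikipedia for each language and calculate the average character length of this corpus. With a large enough number of samples, we should obtain results close to reality.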
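To make the sampling idea concrete, here is a minimal sketch on synthetic data. The population of article lengths, its log-normal shape, and the function name `estimate_mean_length` are assumptions made for illustration; none of them come from the original analysis.

```python
import random
import statistics


def estimate_mean_length(population, sample_size, rng):
    """Estimate the mean of `population` from a uniform random sample."""
    sample = rng.sample(population, sample_size)
    return statistics.mean(sample)


# Hypothetical population of article lengths (synthetic, not Wikipedia data).
rng = random.Random(42)
population = [rng.lognormvariate(8, 1) for _ in range(100_000)]

true_mean = statistics.mean(population)
estimate = estimate_mean_length(population, 5_000, rng)
relative_error = abs(estimate - true_mean) / true_mean

print(f"true mean: {true_mean:.1f}")
print(f"estimate:  {estimate:.1f}")
print(f"relative error: {relative_error:.2%}")
```

The larger the sample, the smaller the expected gap between the estimate and the true mean, which is the property the protocol relies on.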

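Since the Wikimedia API does not provide a method to get the character length of an article, we will obtain this information as follows:

1. Retrieve the byte size of a large sample of articles via the Wikimedia API.
2. Estimate the number of bytes per character from a small sample of articles, using the Monte Carlo method.
3. Recover the character count of the large sample of articles using the bytes-per-character estimate obtained in step 2.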
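The conversion in step 3 is a single division. The helper below is a sketch of that step; the function name and the example values are hypothetical.

```python
def estimate_character_count(byte_length, bytes_per_character):
    """Convert an article size in bytes into an approximate character count."""
    if bytes_per_character <= 0:
        raise ValueError("bytes_per_character must be positive")
    return byte_length / bytes_per_character


# Hypothetical values: a 9,000-byte article in a language whose sampled
# text averages 1.5 bytes per character.
print(estimate_character_count(9_000, 1.5))  # 6000.0
```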

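Since we are using the Monte Carlo method to estimate the bytes per character, we need the largest possible number of articles to minimize the deviation from the actual value.

The Wikimedia API documentation specifies these limitations:

- No more than 500 random articles per request.
- No more than 50 article contents per request.

Given these limitations, and as a compromise between precision and query execution time, I chose to sample 100,000 articles per language as a reference for article byte length, and 500 articles to estimate the bytes per character for each language.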
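As a rough check of what these sample sizes cost, the sketch below counts the API requests needed per language. It assumes every request returns the full limit; the constant and function names are illustrative.

```python
import math

RANDOM_ARTICLES_PER_REQUEST = 500  # limit when only metadata is requested
ARTICLE_CONTENTS_PER_REQUEST = 50  # limit when article content is requested


def requests_needed(sample_size, per_request):
    """Number of requests needed, assuming each one returns the full limit."""
    return math.ceil(sample_size / per_request)


size_requests = requests_needed(100_000, RANDOM_ARTICLES_PER_REQUEST)
content_requests = requests_needed(500, ARTICLE_CONTENTS_PER_REQUEST)

print(size_requests)     # 200
print(content_requests)  # 10
```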

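Limitations

Currently, the Wikimedia API returns its own wikitext format when asked for the content of an article. This format is not plain text and is closer to HTML. Since all languages on Wikimedia use this same format, I estimated that we could rely on it without influencing the direction of the final results.

However, some languages are more verbose than others. For example, in French we say "Comment ça va ?" (15 characters) compared to "How are you?" (12 characters) in English. This study does not account for this phenomenon. To address it, we could compare different translations of the same corpus of books to establish a "density" correction variable for each language. In my research, I did not find any data providing a ratio to apply to each language.

I did, however, find a very interesting paper that compares the information density of 17 different languages and the speed at which they are spoken. Its conclusion is that the most "efficient" languages are spoken more slowly than the least efficient ones, so that the rate of information transmission stays consistently around 39 bits per second across languages.

Interesting.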

Get the average byte length of articles for each language

As stated in the protocol, we use the Wikipedia API to retrieve 500 random articles in a given language.

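import requests

def getRandomArticlesUrl(locale):
    # Up to 500 random main-namespace articles; prop=info includes each page's byte length
    return "https://" + locale + ".wikipedia.org/w/api.php?action=query&generator=random&grnlimit=500&grnnamespace=0&prop=info&format=json"

def getRandomArticles(locale):
    url = getRandomArticlesUrl(locale)
    response = requests.get(url)
    response.raise_for_status()  # fail loudly on HTTP errors
    # Mapping of page id -> page info (title, length in bytes, ...)
    return response.json()["query"]["pages"]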

This gives us a response like { "id1": { "title": "...", "length": 1234 }, "id2": { "title": "...", "length": 5678 }, ... }, which we can use to get the size in bytes of a large number of articles.

This data is then reworked to obtain the following table:

Language	Average length	...
EN	8865.33259
FR	7566.10867
RU	10923.87673
JA	9865.59485
...
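The averaging behind a table like this one can be sketched as follows. The function name `average_byte_length` is an assumption, and the sample dictionary simply reuses the placeholder values from the response shape shown earlier rather than real API output.

```python
def average_byte_length(pages):
    """Average the "length" field over a page-id -> page-info mapping."""
    lengths = [page["length"] for page in pages.values() if "length" in page]
    if not lengths:
        raise ValueError("no page with a 'length' field")
    return sum(lengths) / len(lengths)


# Placeholder data mirroring the response shape described in the article.
sample_pages = {
    "id1": {"title": "...", "length": 1234},
    "id2": {"title": "...", "length": 5678},
}

print(average_byte_length(sample_pages))  # 3456.0
```

In practice the mapping would be accumulated over many requests before averaging.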

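At first glance, English articles appear to have a greater byte length than French ones. Similarly, Russian articles have a greater byte length than those of the other languages.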

[Figure: Which country has the most content on Wikipedia?]

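Should we stop at this conclusion? Not quite. Since the length reported by Wikipedia is a length in bytes, we need to dig a little deeper into how characters are encoded to understand these initial results.

How characters are encoded: an introduction to UTF-8

What is a byte?

Unlike you and me, a computer has no concept of letters, let alone an alphabet. For a computer, everything is represented as a sequence of 0s and 1s.

In our decimal system, we go from 0 to 1, then from 1 to 2, and so on up to 10.

For a computer, which uses a binary system, we go from 0 to 1, then from 1 to 10, then from 10 to 11, 100, and so on.

Here is a comparison table to make this clearer: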

Decimal	Binary
0	0
1	1
2	10
3	11
4	100
5	101
6	110
7	111
8	1000
9	1001
10	1010
...

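Learning binary goes well beyond the scope of this article, but you can see that as numbers grow, their binary representation becomes "wider" than their decimal representation.

Since a computer needs to distinguish between numbers, it stores them in small packets of 8 units called bytes. A byte is made up of 8 bits, for example 01001011.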
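The table and the byte example can be reproduced with Python's built-in `format`. The helper name `to_byte_string` is illustrative.

```python
def to_byte_string(value):
    """Return the 8-bit binary representation of an integer from 0 to 255."""
    if not 0 <= value <= 255:
        raise ValueError("value does not fit in a single byte")
    return format(value, "08b")


for number in (0, 1, 2, 10, 75):
    print(number, to_byte_string(number))

# The example byte from the text, read back as a decimal number.
print(int("01001011", 2))  # 75
```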

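How UTF-8 stores characters

We have seen how to store numbers; storing letters is a bit more complicated.

The Latin alphabet used in many Western countries has 26 letters. Couldn't we just use a reference table where each number from 0 to 25 corresponds to a letter?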

Letter	Index	Binary index
a	0	00000000
b	1	00000001
c	2	00000010
...	...	...
z	25	00011001

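But there are many more characters than just lowercase letters. This simple sentence alone also contains uppercase letters, commas, periods, and so on. A standardized list, known as the ASCII standard, was created to fit all of these characters within a single byte.

In the early days of computing, ASCII was sufficient for basic uses. But what if we want to use other characters? How do we write in the Cyrillic alphabet (33 letters)? This is why the UTF-8 standard was created.

UTF-8 stands for Unicode (Universal Coded Character Set) Transformation Format - 8 bits. It is an encoding system that allows computers to store characters using one or more bytes.

To indicate how many bytes are used for a character, the first bits of the encoding signal this information.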

First UTF-8 bits	Number of bytes used
0xxxxxxx	1
110xxxxx ...	2
1110xxxx ... ...	3
11110xxx ... ... ...	4

The following bits also have their purpose, but once again, this goes beyond the scope of this article. Just note that, at a minimum, a single bit is enough as a signature when the character fits within the 7 remaining bits, which cover the values 0 (00000000) to 127 (01111111).
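These prefixes can be observed directly by encoding a few characters and printing their bytes as bits. This is a small sketch using Python's built-in `str.encode`; the helper name and the choice of sample characters are mine.

```python
def utf8_bytes_as_bits(character):
    """Return the UTF-8 bytes of a character as 8-bit binary strings."""
    return [format(byte, "08b") for byte in character.encode("utf-8")]


# One Latin letter, one accented letter, one Cyrillic letter,
# one kanji and one emoji.
for character in ("a", "é", "ж", "日", "😀"):
    bits = utf8_bytes_as_bits(character)
    print(character, len(bits), " ".join(bits))
```

The first byte of each line starts with the prefix listed in the table for that byte count.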

For English, which does not use accents, we can assume that most characters in an article will be encoded this way, and therefore the average number of bytes per character should be close to 1.

For French, which uses accents, cedillas, etc., we assume that this number will be higher.

Finally, for languages with a more extensive alphabet, such as Russian and Japanese, we can expect a higher number of bytes, which provides a starting point for explaining the results obtained earlier.
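A quick way to see this expectation is to measure the bytes-per-character ratio of one short sentence per language. The sentences below are illustrative choices of mine, not data from the analysis, and single sentences say nothing about whole articles.

```python
def bytes_per_character(text):
    """Ratio of UTF-8 bytes to characters (Unicode code points) in `text`."""
    if not text:
        raise ValueError("text must not be empty")
    return len(text.encode("utf-8")) / len(text)


# Illustrative sentences only; real articles also contain markup.
samples = {
    "EN": "How are you?",
    "FR": "Comment ça va ?",
    "RU": "Как дела?",
    "JA": "お元気ですか？",
}

for language, sentence in samples.items():
    print(language, round(bytes_per_character(sentence), 3))
```

Ratios measured on real articles may well differ from these, since wikitext mixes markup and other single-byte characters into every language.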

Get the average character length in bytes of articles for each language

Now that we understand what the value returned earlier by the Wikipedia API means, we want to calculate the number of bytes per character for each language in order to adjust these results.

To do this, we use a different way of accessing the Wikipedia API that allows us to obtain both the content of the articles and their byte length.

Why not use this API directly?

This API only returns 50 results per request, whereas the previous one returns 500. In the same amount of time, the previous method therefore retrieves 10 times more results.

More concretely, if the API calls took 20 minutes with the first method, they would take 3 hours and 20 minutes with this approach.

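import requests

def getRandomArticlesUrl(locale):
    # At most 50 random articles per request when page content is included;
    # rvprop=content|size asks for each article's wikitext and its byte size
    return "https://" + locale + ".wikipedia.org/w/api.php?action=query&generator=random&grnlimit=50&grnnamespace=0&prop=revisions&rvprop=content|size&format=json"

def getRandomArticles(locale):
    url = getRandomArticlesUrl(locale)
    response = requests.get(url)
    response.raise_for_status()  # fail loudly on HTTP errors
    # Mapping of page id -> page data, including its latest revision
    return response.json()["query"]["pages"]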

Once this data is synthesized, here is an excerpt of what we get:

Language	Bytes per character	...
EN	1.006978892420735
FR	1.0243214042939228
RU	1.5362439940531318
JA	1.843857157700553
...
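One way such a ratio could be derived from the response is sketched below. Two assumptions are worth flagging: the response layout (each page carrying a `revisions` list whose entries hold the wikitext under `"*"` and the byte size under `"size"`) is what I would expect from this query but is not confirmed by the article, and pooling all bytes and characters before dividing is my choice, not necessarily the author's.

```python
def estimate_bytes_per_character(pages):
    """Pooled bytes-per-character ratio over a page-id -> page-data mapping."""
    total_bytes = 0
    total_characters = 0
    for page in pages.values():
        for revision in page.get("revisions", []):
            content = revision.get("*", "")  # assumed key for the wikitext
            if not content:
                continue
            # Fall back to measuring the text if no size was reported.
            size = revision.get("size", len(content.encode("utf-8")))
            total_bytes += size
            total_characters += len(content)
    if total_characters == 0:
        raise ValueError("no article content found")
    return total_bytes / total_characters


# Hypothetical pages shaped like the assumed response.
mock_pages = {
    "1": {"title": "A", "revisions": [{"*": "été", "size": 5}]},
    "2": {"title": "B", "revisions": [{"*": "abc", "size": 3}]},
}

print(round(estimate_bytes_per_character(mock_pages), 3))
```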

So our intuition was correct: languages with a larger character set distort the data because of the way their content is stored.

We also see that French uses more bytes on average to store its characters than English, as we previously assumed.

Results

We can now correct the data by changing from a size in bytes to a size in characters which gives us the following graph:

[Figure: Which country has the most content on Wikipedia?]

Our hypothesis is therefore confirmed.

On average, English is the language with the most content per page on Wikipedia. It is followed by French, then Russian, Spanish, and German.

The standard deviation (shown with the black bars) is large for this dataset, which means that the content size varies greatly from the shortest to the longest article. Therefore, it is difficult to establish a general truth for all articles, but this trend still seems consistent with my personal experience of Wikipedia.

If you want all the results from this experiment, I have also created this representation, which shows, for each language, the percentage of additional or lesser content relative to each of the others.

[Figure: Which country has the most content on Wikipedia?]

This is where the conclusion comes from: on average, an English article on Wikipedia contains 19% more information than its equivalent in French.
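As a sanity check, the figure can be approximately recovered from the English and French rows of the two table excerpts shown earlier. This is only a sketch on those published averages, not a rerun of the analysis; the variable names are mine.

```python
# Values copied from the two table excerpts in this article.
average_bytes = {"EN": 8865.33259, "FR": 7566.10867}
bytes_per_character = {"EN": 1.006978892420735, "FR": 1.0243214042939228}

# Convert each average size from bytes to characters.
average_characters = {
    language: average_bytes[language] / bytes_per_character[language]
    for language in average_bytes
}

extra_content = average_characters["EN"] / average_characters["FR"] - 1

for language, characters in average_characters.items():
    print(language, round(characters, 1))
print(f"English vs French: {extra_content:+.1%}")
```

The result lands at roughly 19%, in line with the conclusion above.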

The source code for this analysis is available here: https://github.com/jverneaut/wikipedia-analysis/
