As the title says: as a simpler function, how do you count the word frequencies of an English API development document? (The document may be multiple HTML files, or a CHM file, not plain TXT text.)
A more complicated requirement: because development documents contain many class names, function or method names, and so on, words may be joined together, and ideally these should be split apart when counting (they can be split according to naming conventions);
An even more complicated requirement: since simply counting a document's word frequencies is not very useful on its own, how can the counted words be post-processed? For example:
Eliminate simple words that carry little meaning for development, such as the, are, to, is...
Identify computer-related technical terms, words with specific meanings in programming, and keywords of the programming language (which varies with the language the document covers);
Annotate the final set of words with explanations (in Chinese; a third-party API could be used)...
If you were to develop software with the above features, what specific technologies would be needed? Ideas are welcome...
Well, actually my pain point is this: when reading an English document there are too many words I don't understand, so I constantly have to look them up, which is very inefficient. If a tool could statistically analyze a document's vocabulary, I could get roughly familiar with the words before reading the document, which would improve efficiency; it would also be helpful for naming things during development...
Modification remarks:
Splitting words that are joined together is indeed not word segmentation; I used the wrong term before;
The original question mentioned machine learning. My idea was this: software with machine learning reads a lot of programming documents, finds the technical terms in them, and makes the whole feature more intelligent... Of course this is just my imagination and may not be correct; please don't criticize if you don't like it;
Finally, about the difficulty of reading English documents that I mentioned: everyone goes through a stage of not understanding at first and being inefficient, and everyone knows that reading more gradually improves efficiency... But that is not the focus of this discussion; I just had this idea and raised it for everyone to discuss.
Also, if there is anything wrong with my question, please leave a message and I will fix it, rather than just downvoting.
I'm preparing for the postgraduate entrance exam and haven't written code for a while, but the general idea should be:
Cleaning and filtering: for HTML, first extract the text content. You can write your own regular expressions or use an existing library.
Word splitting: first split on common delimiters such as whitespace, then split each token further according to the naming conventions of the relevant language.
Filtering common words: lists of common English words are easy to find online; match against one of those.
Word counting: you can implement a simple MapReduce-style count yourself in Python, or use Hadoop, Spark, etc.
At this point the word count, with simple words filtered out, is complete.
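The steps above can be sketched in plain Python. This is a minimal illustration, not a production pipeline: the stop-word set is a made-up sample (in practice you would load a fuller list from a file), and the splitting rule only handles camelCase and snake_case.

```python
import re
from collections import Counter
from html.parser import HTMLParser

# Sample stop-word list for illustration; load a real list from a file in practice.
STOP_WORDS = {"the", "a", "an", "is", "are", "to", "of", "and", "in", "for"}

class TextExtractor(HTMLParser):
    """Collect the text content of an HTML page, skipping script/style blocks."""
    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)

def split_identifier(token):
    """Split camelCase and snake_case identifiers into plain words."""
    # Insert a space at each lower-to-upper boundary, then keep only letter runs.
    spaced = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", token)
    return re.findall(r"[A-Za-z]+", spaced)

def word_frequencies(html):
    parser = TextExtractor()
    parser.feed(html)
    text = " ".join(parser.parts)
    words = []
    for token in re.findall(r"[A-Za-z_][A-Za-z0-9_]*", text):
        words.extend(w.lower() for w in split_identifier(token))
    return Counter(w for w in words if w not in STOP_WORDS)

freqs = word_frequencies("<p>Call getElementById to fetch the element.</p>")
```

Here `getElementById` is split into get / element / by / id, so "element" is counted twice while "the" and "to" are dropped by the stop-word filter.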
For computer-related words, download a data file of computer terminology online and match against it directly.
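Once the counts exist, that matching step is just a set lookup. A toy illustration, where both the counts and the terminology lexicon are made-up sample data:

```python
from collections import Counter

# Made-up word counts and a made-up terminology lexicon, for illustration only.
counts = Counter({"callback": 12, "thread": 9, "closure": 7, "banana": 3})
computer_terms = {"callback", "closure", "thread", "mutex", "socket"}

# Keep only words that appear in the lexicon, most frequent first.
term_counts = {w: n for w, n in counts.most_common() if w in computer_terms}
```

Non-technical words such as "banana" fall out, leaving only terms worth annotating.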
If you need to provide definitions, call the Youdao or Baidu Translate APIs; they should suffice, though they may have usage limits, and I haven't used them myself.
The steps above do not consider efficiency. If efficiency matters, you will need better algorithms or existing libraries written by others.
As for the machine learning you mentioned: the requirements here don't call for it, and there is no need to use it.
Finally: I still want to say that the fastest way to get comfortable with documents is to read more of them; keep reading and you'll find your reading speed improves. That said, treating this as a practice project is a fun thing to do.
Revised reply to the question:
Machine learning today is generally supervised or unsupervised, but regarding what you mentioned:
Supervised learning would definitely require corpus data. But if you already have that corpus, why not implement the feature directly with string matching?
As for unsupervised learning, I'm still a beginner, but as I understand it, here it could only achieve a clustering effect. To automatically identify computer terms, you would still need manual annotation or labeled data.
To go further than that, you would need to study NLP seriously.
I think you're interested in machine learning, but this doesn't strike me as a good project for practicing it.
This shouldn't be called English word segmentation; segmentation usually refers to dividing sentences into components. Identifier names that are joined together can be recognized by common naming conventions, such as CamelCase (distinguished by letter case) and snake_case (separated by underscores).
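As a sketch, that convention-based splitting fits in a couple of regular expressions. The function name and the example identifier below are made up for illustration; note the second alternative handles acronym runs like "HTTPResponse":

```python
import re

def split_name(name):
    """Split an identifier by naming convention: snake_case underscores,
    camelCase boundaries, and acronym runs like 'HTTPResponse'."""
    # Break lower-to-upper boundaries, and acronym-to-word boundaries
    # (an uppercase run followed by an Upper+lower pair).
    spaced = re.sub(r"(?<=[a-z0-9])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])", " ", name)
    return [w for w in re.split(r"[\s_]+", spaced) if w]

split_name("parseHTTPResponse_fast")  # → ['parse', 'HTTP', 'Response', 'fast']
```

Without the second alternative, "HTTPResponse" would come out as a single chunk instead of HTTP / Response.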
You can find various word-splitting libraries; there should be many in Python. Download a lexicon of computer terminology, then match the extracted words against it to get their meanings.
But in fact, even once built, it may not make reading any easier. Studying the words in isolation is a roundabout approach, and it's quite possible you still won't be able to read the document. The vocabulary of computer articles is not very large; see a word once or twice and you'll know it. It would be better to optimize the lookup experience: I recommend loading the Collins bilingual dictionary together with Macmillan into MDict or the Eudic dictionary, and in Chrome you can also install Saladict to look up words.