As the title says: as a simpler function, how do you count the word frequencies of an English API development document? (The document may be multiple HTML files, or a CHM file, not plain TXT text.)
A more complicated requirement: because development documents contain many class names, function or method names, and so on, words may be joined together, and ideally these should be split apart when counting (they can be split according to naming conventions);
An even more complicated requirement: since simply counting a document's word frequencies is not very useful on its own, how can the counted words be post-processed? For example:
Eliminate simple words that carry little meaning for development, such as the, are, to, is...
Identify computer-related technical terms, words with specific meanings in programming, and keywords of the programming language (which varies with the language the document covers);
Annotate the final set of words with explanations (in Chinese; a third-party API could be used)...
If you were to develop software with the above features, what specific technologies would be needed? Ideas are welcome...
Well, actually my pain point is this: when reading an English document there are too many words I don't understand, so I constantly have to look them up, which is very inefficient. If a tool could statistically analyze a document's vocabulary, I could get roughly familiar with the words before reading the document, which would improve efficiency; it would also be helpful for naming things during development...
Modification remarks:
Splitting words that are joined together is indeed not word segmentation; I used the wrong term before;
The original question mentioned machine learning. My idea was this: software with machine learning reads a lot of programming documents, finds the technical terms in them, and makes the whole feature more intelligent... Of course this is just my imagination and may not be correct; please don't criticize if you don't like it;
Finally, about the difficulty of reading English documents that I mentioned: everyone goes through a stage of not understanding at first and being inefficient, and everyone knows that reading more gradually improves efficiency... But that is not the focus of this discussion; I just had this idea and raised it for everyone to discuss.
Also, if there is anything wrong with my question, please leave a message and I will fix it, rather than just downvoting.
I'm preparing for the postgraduate entrance exam and haven't written code for a while, but the general idea should be:
Cleaning and filtering: for HTML, first extract the text content. You can write your own regular expressions or use an existing library.
Word splitting: first split on common delimiters such as whitespace, then split each token further according to the naming conventions of the relevant language.
Filtering common words: lists of common English words are easy to find online; match against one of those.
Word counting: you can implement a simple MapReduce-style count yourself in Python, or use Hadoop, Spark, etc.
At this point the word count, with simple words filtered out, is complete.
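The steps above can be sketched in plain Python. This is a minimal illustration, not a production pipeline: the stop-word set is a made-up sample (in practice you would load a fuller list from a file), and the splitting rule only handles camelCase and snake_case.

```python
import re
from collections import Counter
from html.parser import HTMLParser

# Sample stop-word list for illustration; load a real list from a file in practice.
STOP_WORDS = {"the", "a", "an", "is", "are", "to", "of", "and", "in", "for"}

class TextExtractor(HTMLParser):
    """Collect the text content of an HTML page, skipping script/style blocks."""
    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)

def split_identifier(token):
    """Split camelCase and snake_case identifiers into plain words."""
    # Insert a space at each lower-to-upper boundary, then keep only letter runs.
    spaced = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", token)
    return re.findall(r"[A-Za-z]+", spaced)

def word_frequencies(html):
    parser = TextExtractor()
    parser.feed(html)
    text = " ".join(parser.parts)
    words = []
    for token in re.findall(r"[A-Za-z_][A-Za-z0-9_]*", text):
        words.extend(w.lower() for w in split_identifier(token))
    return Counter(w for w in words if w not in STOP_WORDS)

freqs = word_frequencies("<p>Call getElementById to fetch the element.</p>")
```

Here `getElementById` is split into get / element / by / id, so "element" is counted twice while "the" and "to" are dropped by the stop-word filter.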
For computer-related words, download a data file of computer terminology online and match against it directly.
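Once the counts exist, that matching step is just a set lookup. A toy illustration, where both the counts and the terminology lexicon are made-up sample data:

```python
from collections import Counter

# Made-up word counts and a made-up terminology lexicon, for illustration only.
counts = Counter({"callback": 12, "thread": 9, "closure": 7, "banana": 3})
computer_terms = {"callback", "closure", "thread", "mutex", "socket"}

# Keep only words that appear in the lexicon, most frequent first.
term_counts = {w: n for w, n in counts.most_common() if w in computer_terms}
```

Non-technical words such as "banana" fall out, leaving only terms worth annotating.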
If you need to provide definitions, call the Youdao or Baidu Translate APIs; they should suffice, though they may have usage limits, and I haven't used them myself.
The steps above do not consider efficiency. If efficiency matters, you will need better algorithms or existing libraries written by others.
As for the machine learning you mentioned: the requirements here don't call for it, and there is no need to use it.
Finally: I still want to say that the fastest way to get comfortable with documents is to read more of them; keep reading and you'll find your reading speed improves. That said, treating this as a practice project is a fun thing to do.
Revised reply to the question:
Machine learning today is generally supervised or unsupervised, but regarding what you mentioned:
Supervised learning would definitely require corpus data. But if you already have that corpus, why not implement the feature directly with string matching?
As for unsupervised learning, I'm still a beginner, but as I understand it, here it could only achieve a clustering effect. To automatically identify computer terms, you would still need manual annotation or labeled data.
To go further than that, you would need to study NLP seriously.
I think you're interested in machine learning, but this doesn't strike me as a good project for practicing it.
This shouldn't be called English word segmentation; segmentation usually refers to dividing sentences into components. Identifier names that are joined together can be recognized by common naming conventions, such as CamelCase (distinguished by letter case) and snake_case (separated by underscores).
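As a sketch, that convention-based splitting fits in a couple of regular expressions. The function name and the example identifier below are made up for illustration; note the second alternative handles acronym runs like "HTTPResponse":

```python
import re

def split_name(name):
    """Split an identifier by naming convention: snake_case underscores,
    camelCase boundaries, and acronym runs like 'HTTPResponse'."""
    # Break lower-to-upper boundaries, and acronym-to-word boundaries
    # (an uppercase run followed by an Upper+lower pair).
    spaced = re.sub(r"(?<=[a-z0-9])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])", " ", name)
    return [w for w in re.split(r"[\s_]+", spaced) if w]

split_name("parseHTTPResponse_fast")  # → ['parse', 'HTTP', 'Response', 'fast']
```

Without the second alternative, "HTTPResponse" would come out as a single chunk instead of HTTP / Response.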
You can find various word-splitting libraries; there should be many in Python. Download a lexicon of computer terminology, then match the extracted words against it to get their meanings.
But in fact, even once built, it may not make reading any easier. Studying the words in isolation is a roundabout approach, and it's quite possible you still won't be able to read the document. The vocabulary of computer articles is not very large; see a word once or twice and you'll know it. It would be better to optimize the lookup experience: I recommend loading the Collins bilingual dictionary together with Macmillan into MDict or the Eudic dictionary, and in Chrome you can also install Saladict to look up words.