This article walks through the process of cleaning web text data, with examples. It is intended as a reference for anyone who needs to preprocess text before analysis.
Today, more than 80% of data is unstructured. Text preprocessing is a necessary step before data analysis. Most available text data is highly unstructured and noisy, so it has to be cleaned before it can yield useful insights or feed better algorithms.
Social media data is a good example. Because the communication is informal, it contains spelling errors, poor grammar, and slang, along with unwanted content such as URLs, stop words, and expressions.
Consider a typical business question: which features of the iPhone are most popular among fans? Suppose you have extracted a tweet containing consumer opinions about the iPhone:
The following steps preprocess this tweet:
1. Remove HTML characters:
Data obtained from the Web usually contains many HTML entities, such as &lt;, &gt;, and &amp;, embedded in the original data. These entities need to be removed. One way is to strip them directly with specific regular expressions. Another is to use an appropriate package or module (such as Python's HTMLParser), which can convert these entities back into the standard characters they represent.
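As a minimal sketch of the second approach, the block below uses `html.unescape` from the Python 3 standard library, which covers the entity-decoding role that older tutorials assign to HTMLParser. The function name and the sample string are illustrative assumptions, not part of the original tweet.

```python
import html


def unescape_entities(text: str) -> str:
    """Convert HTML entities such as &lt; and &amp; into plain characters."""
    return html.unescape(text)


# Hypothetical sample text, used only for illustration.
sample = "a &lt; b &amp; c &gt; d"
print(unescape_entities(sample))  # a < b & c > d
```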
2. Decoding data:
This is the process of converting information from complex symbols into simple, understandable characters. Text data may arrive in different encodings, such as Latin-1 or UTF-8. For better analysis, keep all of the data in one standard encoding. UTF-8 is widely accepted and recommended.
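One way to normalize incoming bytes is sketched below. It assumes the raw data is either UTF-8 or one known fallback encoding (Latin-1 here, chosen only as an example). The function name and the fallback policy are assumptions for illustration.

```python
def to_text(raw: bytes, fallback_encoding: str = "latin-1") -> str:
    """Decode bytes as UTF-8, falling back to another encoding on failure."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Latin-1 can decode any byte sequence, so this never raises,
        # but the result is only correct if the guess matches the source.
        return raw.decode(fallback_encoding)


utf8_bytes = "café".encode("utf-8")
latin_bytes = "café".encode("latin-1")
print(to_text(utf8_bytes))   # café
print(to_text(latin_bytes))  # café
```

Once the text is a Python `str`, writing it back out with `encoding="utf-8"` keeps the whole data set in one standard format.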
3. Apostrophe lookup: To avoid word-sense ambiguity in the text, it is recommended to keep a proper structure and follow the rules of context-free grammar. When apostrophes are used, the chance of ambiguity increases.
For example, "it's" is a contraction of either "it is" or "it has".
All contractions with apostrophes should be expanded into their standard forms. A lookup table of all possible contractions can be used to eliminate the ambiguity.
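A lookup table of this kind can be sketched with a plain dictionary. The table below is a tiny, hypothetical sample. A real one would be much longer, and it has to pick one expansion for ambiguous cases such as "it's" (here "it is", purely as an assumption).

```python
# Hypothetical, deliberately small lookup table.
CONTRACTIONS = {
    "it's": "it is",
    "don't": "do not",
    "i'm": "i am",
    "you're": "you are",
}


def expand_contractions(text: str, table: dict = CONTRACTIONS) -> str:
    """Replace each whitespace-separated token found in the lookup table."""
    # Normalize the typographic apostrophe (U+2019) to the ASCII one.
    text = text.replace("\u2019", "'")
    return " ".join(table.get(word.lower(), word) for word in text.split())


print(expand_contractions("it's great and you're right"))
# it is great and you are right
```

Tokens with punctuation attached (for example "it's,") are not matched by this simple split, so in practice this step works best after punctuation has been handled.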
4. Removal of stop words: When the analysis needs to be data-driven at the word level, commonly occurring words (stop words) should be removed. You can either create your own long list of stop words or use a predefined, language-specific library.
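The hand-built-list option might look like the sketch below. The stop-word set is a short, hypothetical sample. Predefined lists, such as the one shipped with NLTK, are far more complete.

```python
# Hypothetical, abbreviated stop-word list for illustration.
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "to", "and", "in", "it"}


def remove_stop_words(text: str, stop_words: set = STOP_WORDS) -> str:
    """Drop every whitespace-separated token that is a stop word."""
    kept = [word for word in text.split() if word.lower() not in stop_words]
    return " ".join(kept)


print(remove_stop_words("The camera of the iPhone is great"))
# camera iPhone great
```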
5. Delete punctuation marks: All punctuation marks should be handled according to their priority. For example, ".", "," and "?" are important punctuation marks that should be retained, while the others should be removed.
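A minimal sketch of priority-based punctuation removal follows. It keeps the three marks named above and deletes every other ASCII punctuation character. The `KEEP` set and the function name are assumptions and can be adjusted to the task.

```python
import string

# Punctuation considered important enough to retain.
KEEP = {".", ",", "?"}


def remove_punctuation(text: str, keep: set = KEEP) -> str:
    """Delete ASCII punctuation except the characters listed in `keep`."""
    to_drop = "".join(ch for ch in string.punctuation if ch not in keep)
    return text.translate(str.maketrans("", "", to_drop))


print(remove_punctuation("Wow!!! Is it good, really? #yes."))
# Wow Is it good, really? yes.
```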
6. Delete expressions: Text data (usually speech transcripts) may contain human expressions such as [laughing], [crying], or [audience pause]. These expressions are usually irrelevant to the content of the speech and therefore need to be removed. Simple regular expressions are useful here.
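Assuming expressions are always written in square brackets, as in the examples above, a regular expression along these lines could remove them. The function name and sample sentence are illustrative.

```python
import re


def remove_expressions(text: str) -> str:
    """Remove bracketed expressions like [laughing] and tidy the spacing."""
    without_tags = re.sub(r"\[[^\[\]]*\]", " ", text)
    # Collapse the whitespace left behind by the removed tags.
    return re.sub(r"\s+", " ", without_tags).strip()


print(remove_expressions("Thank you [laughing] all for coming [audience pause] today"))
# Thank you all for coming today
```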
7. Split attached words: Text generated by people in social forums is completely informal in nature. Most tweets contain multiple attached words, such as RainyDay or PlayingInTheCold. These can be split into their normal forms using simple rules and regular expressions.
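One simple rule is to insert a space wherever a lowercase letter is directly followed by an uppercase letter. The sketch below assumes attached words are written in CamelCase. It will not split all-lowercase run-ons, and it will also split brand names such as "iPhone", so a real pipeline would need an exception list.

```python
import re


def split_attached_words(text: str) -> str:
    """Insert a space at every lowercase-to-uppercase boundary."""
    return re.sub(r"(?<=[a-z])(?=[A-Z])", " ", text)


print(split_attached_words("RainyDay PlayingInTheCold"))
# Rainy Day Playing In The Cold
```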
8. Slang lookup: Social media text also contains a great deal of slang. These words should be converted into standard words to produce clean free text. For example, "luv" is converted to "love" and "Helo" to "Hello". A method similar to the apostrophe lookup can be used to convert slang into standard words. Many sources on the Internet provide lists of slang words that can serve as lookup dictionaries for this conversion.
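The slang lookup can be sketched the same way as the apostrophe lookup, this time matching whole alphabetic words with a regular expression so that neighbouring punctuation is left alone. The two-entry dictionary is a hypothetical stand-in for a downloaded slang list, and replacements are inserted exactly as stored in the table.

```python
import re

# Hypothetical sample. A real table would come from a slang word list.
SLANG = {"luv": "love", "helo": "hello"}


def replace_slang(text: str, table: dict = SLANG) -> str:
    """Replace each alphabetic word that appears in the slang table."""
    def lookup(match: re.Match) -> str:
        word = match.group(0)
        return table.get(word.lower(), word)

    return re.sub(r"[A-Za-z]+", lookup, text)


print(replace_slang("I LUV my iphone, helo world"))
# I love my iphone, hello world
```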
9. Standardize words: Sometimes words are not written in their proper form. For example, "I looooveee you" should be "I love you". Simple rules and regular expressions can help in these cases.
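As one possible rule, the sketch below collapses any run of three or more identical letters into a single letter. This is an assumption chosen to fit the example above. It would over-correct a stretched word whose proper spelling has a double letter (for example "coooool" becomes "col"), so a dictionary check is usually added afterwards.

```python
import re


def standardize_words(text: str) -> str:
    """Collapse runs of 3+ identical letters into one letter."""
    return re.sub(r"([A-Za-z])\1{2,}", r"\1", text)


print(standardize_words("I looooveee you"))
# I love you
```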
10. Delete URLs: URLs and hyperlinks in text data such as comments, reviews, and tweets should be removed.
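A rough sketch of URL removal is shown below. The pattern only targets tokens that start with `http://`, `https://`, or `www.`, which is an assumption that covers common cases rather than every valid URL. The sample address is a placeholder.

```python
import re

URL_PATTERN = re.compile(r"(?:https?://|www\.)\S+")


def remove_urls(text: str) -> str:
    """Strip URL-like tokens and collapse the leftover whitespace."""
    without_urls = URL_PATTERN.sub(" ", text)
    return re.sub(r"\s+", " ", without_urls).strip()


print(remove_urls("Check this https://example.com/page now"))
# Check this now
```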