This tutorial shows you how to quickly determine a document's main topic by analyzing word frequency using Python. Manually counting word occurrences is tedious; this automated approach simplifies the process.
We'll use a sample text file, test.txt (download it, but don't peek!), to illustrate. The goal is to guess the tutorial's subject based on word frequency.
Understanding Regular Expressions
This process uses regular expressions (regex). If you are unfamiliar with them, a regex is a sequence of characters that defines a search pattern, which is then used to match text in strings (much like "find and replace"). For a deeper dive, refer to a dedicated regex tutorial.
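As a small illustration of the "find and replace" idea, here is a minimal sketch using Python's built-in re module. The sample sentence and the pattern are made up for this example and are not part of the tutorial's program.

```python
import re

# Hypothetical sample sentence, used only for illustration.
sample = "The cat sat on the mat."

# re.search returns a match object for the first occurrence, or None.
first = re.search(r"[cm]at", sample)
found = first.group(0) if first else None

# re.sub replaces every non-overlapping match of the pattern.
replaced = re.sub(r"[cm]at", "dog", sample)

print(found)
print(replaced)
```

The pattern [cm]at matches "cat" and "mat" but not "sat", which is why only two words are replaced.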
Building the Program
Read the File: The program begins by reading the text file into a string:
# Open the file, read it all, and lowercase it so "Word" and "word" count as one
with open('test.txt', 'r') as document_text:
    text_string = document_text.read().lower()
Regular Expression: A regex extracts every word made up of 3 to 15 letters, so shorter and longer words are ignored:
import re
match_pattern = re.findall(r'\b[a-z]{3,15}\b', text_string)
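To see what this pattern keeps and what it drops, the sketch below runs it on a short sentence. The sentence is a hypothetical example chosen to include words that are too short and one that is too long.

```python
import re

# Hypothetical sample text, lowercased the same way the program does it.
sample = "It is a fact: incomprehensibilities vex readers of Python.".lower()

# Same pattern as in the tutorial: whole words of 3 to 15 lowercase letters.
words = re.findall(r'\b[a-z]{3,15}\b', sample)

print(words)
```

Words such as "it" and "of" are too short, and the 21-letter "incomprehensibilities" is too long, so none of them appear in the result. The \b word boundaries prevent the pattern from matching just a piece of a longer word.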
Word Frequency: A dictionary tracks word frequencies:
frequency = {}
for word in match_pattern:
    count = frequency.get(word, 0)
    frequency[word] = count + 1
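The same counting can also be done with collections.Counter from the standard library. This is an alternative to the dictionary loop above, not what the tutorial's program uses, and the word list here is a hypothetical stand-in for match_pattern.

```python
from collections import Counter

# Hypothetical list of words, standing in for match_pattern.
match_pattern = ["python", "file", "python", "word", "file", "python"]

# Counter counts each distinct item in the list in one step.
frequency = Counter(match_pattern)

print(frequency["python"])
# A word that never occurred reports 0 instead of raising KeyError.
print(frequency["missing"])
```

A Counter behaves like a dictionary, so the rest of the program could keep using frequency[word] unchanged.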
Output: The program then prints each word and its frequency:
frequency_list = frequency.keys()
for word in frequency_list:
    print(word, frequency[word])
Complete Program
Here's the combined Python code:
import re

frequency = {}

# Read the whole file and lowercase it
with open('test.txt', 'r') as document_text:
    text_string = document_text.read().lower()

# Extract words of 3 to 15 letters
match_pattern = re.findall(r'\b[a-z]{3,15}\b', text_string)

# Count how often each word occurs
for word in match_pattern:
    count = frequency.get(word, 0)
    frequency[word] = count + 1

# Print each word with its frequency
frequency_list = frequency.keys()
for word in frequency_list:
    print(word, frequency[word])
Running this will output a word frequency list. The most frequent word hints at the original tutorial's topic.
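Rather than scanning the printed list by eye, the single most frequent word can be picked out with the built-in max function. The sketch below assumes a small hypothetical frequency dictionary in place of the one built from test.txt.

```python
# Hypothetical frequency dictionary, standing in for the real one.
frequency = {"python": 7, "file": 4, "word": 5}

# max iterates over the keys and compares them by their counts.
top_word = max(frequency, key=frequency.get)

print(top_word, frequency[top_word])
```

If several words share the highest count, max returns the first one it encounters.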
Handling Larger Text Files
For larger files, sorting the frequency dictionary simplifies finding the most frequent words:
import re

frequency = {}

# Example: dracula.txt
with open('dracula.txt', 'r') as document_text:
    text_string = document_text.read().lower()

match_pattern = re.findall(r'\b[a-z]{3,15}\b', text_string)

for word in match_pattern:
    count = frequency.get(word, 0)
    frequency[word] = count + 1

# Sort by count, highest first; dicts keep insertion order in Python 3.7+
most_frequent = dict(sorted(frequency.items(), key=lambda elem: elem[1], reverse=True))

most_frequent_count = most_frequent.keys()
for word in most_frequent_count:
    print(word, most_frequent[word])
This outputs a sorted list, with the most frequent words appearing first.
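For a long text, printing every word can still be overwhelming. One possible refinement is to print only the top few entries. The helper name top_words and the sample counts below are hypothetical; Counter.most_common is part of the standard library.

```python
from collections import Counter

def top_words(frequency, n=10):
    """Return the n most frequent (word, count) pairs, highest count first."""
    return Counter(frequency).most_common(n)

# Hypothetical counts, standing in for the dictionary built from the file.
frequency = {"the": 12, "count": 9, "castle": 4, "night": 6}

top_three = top_words(frequency, n=3)
for word, count in top_three:
    print(word, count)
```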
Excluding Common Words
To refine the analysis, exclude common words such as "the" and "and" by using a blacklist:
import re

frequency = {}

with open('dracula.txt', 'r') as document_text:
    text_string = document_text.read().lower()

match_pattern = re.findall(r'\b[a-z]{3,15}\b', text_string)

# Common words to leave out of the count
blacklisted = ['the', 'and', 'for', 'that', 'which']

for word in match_pattern:
    if word not in blacklisted:
        count = frequency.get(word, 0)
        frequency[word] = count + 1

# Sort by count, highest first
most_frequent = dict(sorted(frequency.items(), key=lambda elem: elem[1], reverse=True))

most_frequent_count = most_frequent.keys()
for word in most_frequent_count:
    print(word, most_frequent[word])
This provides a more focused analysis.
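Because both the blacklist and the word-length limits are likely to change from one text to another, it can help to wrap the counting in a function that takes them as parameters. The following is a minimal sketch; the function name count_words, its parameter names, and the sample sentence are assumptions made for this example.

```python
import re

def count_words(text, stop_words=frozenset(), min_len=3, max_len=15):
    """Count words of min_len to max_len letters, skipping stop_words."""
    # Build the same kind of pattern as before, with adjustable limits.
    pattern = r'\b[a-z]{%d,%d}\b' % (min_len, max_len)
    frequency = {}
    for word in re.findall(pattern, text.lower()):
        if word not in stop_words:
            frequency[word] = frequency.get(word, 0) + 1
    return frequency

# Hypothetical sample sentence.
sample = "The count and the castle: the count waits."
result = count_words(sample, stop_words={"the", "and"})

print(result)
```

Passing the blacklist as a set rather than a list makes each membership check fast, which matters more as the blacklist grows.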
This enhanced Python script offers a simple way to analyze text and identify likely topics based on word frequency. Remember to adapt the blacklist and word length criteria to suit your specific needs.