Counting Word Frequency in a File Using Python
This tutorial shows you how to quickly determine a document's main topic by analyzing word frequency using Python. Manually counting word occurrences is tedious; this automated approach simplifies the process.
We'll use a sample text file, test.txt (download it, but don't peek!), to illustrate. The goal is to guess the tutorial's subject based on word frequency.
Understanding Regular Expressions
This process uses regular expressions (regex). If you are unfamiliar with them, a regex is a sequence of characters that defines a search pattern for matching strings, much like an advanced "find and replace". For a deeper dive, refer to a dedicated regex tutorial.
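To make the idea concrete, here is a minimal sketch of the pattern this tutorial relies on, applied to a made-up sentence. The sample string and the variable names are assumptions chosen for illustration; they have nothing to do with the contents of test.txt.

```python
import re

# Hypothetical sample sentence, used only for illustration.
sample = "It is a big cat, and the cat's internationalization hat is big."

# \b marks a word boundary; [a-z]{3,15} matches a run of 3 to 15 lowercase letters.
words = re.findall(r'\b[a-z]{3,15}\b', sample.lower())

print(words)
# ['big', 'cat', 'and', 'the', 'cat', 'hat', 'big']
```

Words shorter than three letters ("it", "is", "a") and longer than fifteen ("internationalization") are skipped. The apostrophe also splits "cat's" into "cat" and a lone "s", which is too short to count.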
Building the Program
Read the File: The program begins by reading the text file into a string and converting it to lowercase:
    # The with-statement closes the file once it has been read
    with open('test.txt', 'r') as document_text:
        text_string = document_text.read().lower()
Regular Expression: A regex extracts every word made of 3 to 15 lowercase letters from the text string:
    import re

    match_pattern = re.findall(r'\b[a-z]{3,15}\b', text_string)
Word Frequency: A dictionary tracks word frequencies:
    frequency = {}
    for word in match_pattern:
        count = frequency.get(word, 0)
        frequency[word] = count + 1
Output: The program then prints each word and its frequency:
    frequency_list = frequency.keys()
    for word in frequency_list:
        print(word, frequency[word])
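The dictionary tally in the steps above can also be written with collections.Counter from the standard library. The sketch below is an alternative, not the approach this tutorial uses, and the word list is a hypothetical stand-in for match_pattern.

```python
from collections import Counter

# Hypothetical word list, standing in for match_pattern.
words = ['big', 'cat', 'and', 'the', 'cat', 'hat', 'big']

# Manual tally, as in the steps above.
frequency = {}
for word in words:
    frequency[word] = frequency.get(word, 0) + 1

# Standard-library alternative that produces the same counts.
counter = Counter(words)

print(frequency == dict(counter))
# True
```

Both approaches give the same result; the manual loop simply makes each step visible.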
Complete Program
Here's the combined Python code:
    import re

    frequency = {}
    # Read the whole file and lowercase it so counting ignores case
    with open('test.txt', 'r') as document_text:
        text_string = document_text.read().lower()
    # Keep only words of 3 to 15 letters
    match_pattern = re.findall(r'\b[a-z]{3,15}\b', text_string)
    # Tally each word
    for word in match_pattern:
        count = frequency.get(word, 0)
        frequency[word] = count + 1
    # Print each word with its count
    frequency_list = frequency.keys()
    for word in frequency_list:
        print(word, frequency[word])
Running this will output a word frequency list. The most frequent word hints at the original tutorial's topic.
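If you only want the single most frequent word, one possible shortcut is max() with the dictionary's get method as the key. The counts below are invented for illustration and are not the real results for test.txt.

```python
# Hypothetical tally, standing in for the frequency dictionary built by the program.
frequency = {'apple': 7, 'pear': 4, 'plum': 5}

# max() iterates over the keys and compares them by their counts.
top_word = max(frequency, key=frequency.get)

print(top_word, frequency[top_word])
# apple 7
```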
Handling Larger Text Files
For larger files, sorting the frequency dictionary simplifies finding the most frequent words:
    import re

    frequency = {}
    # Example: dracula.txt
    with open('dracula.txt', 'r') as document_text:
        text_string = document_text.read().lower()
    match_pattern = re.findall(r'\b[a-z]{3,15}\b', text_string)
    for word in match_pattern:
        count = frequency.get(word, 0)
        frequency[word] = count + 1
    # Sort the (word, count) pairs by count, highest first
    most_frequent = dict(sorted(frequency.items(), key=lambda elem: elem[1], reverse=True))
    most_frequent_count = most_frequent.keys()
    for word in most_frequent_count:
        print(word, most_frequent[word])
This outputs a sorted list, with the most frequent words appearing first.
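When only the top few words matter, Counter.most_common(n) from the standard library returns them directly. This is a sketch of an alternative to the sorted dictionary, using a short made-up string in place of a large file.

```python
import re
from collections import Counter

# Hypothetical in-memory text, standing in for the contents of a large file.
text_string = "the count and the castle and the night and the count".lower()

words = re.findall(r'\b[a-z]{3,15}\b', text_string)

# most_common(3) returns the three highest counts as (word, count) pairs.
top_three = Counter(words).most_common(3)

print(top_three)
# [('the', 4), ('and', 3), ('count', 2)]
```

Note that common words such as "the" and "and" dominate the result even in this tiny example.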
Excluding Common Words
To refine the analysis, exclude common words such as "the" and "and" by checking each word against a blacklist:
    import re

    frequency = {}
    with open('dracula.txt', 'r') as document_text:
        text_string = document_text.read().lower()
    match_pattern = re.findall(r'\b[a-z]{3,15}\b', text_string)
    # Common words to leave out of the count
    blacklisted = ['the', 'and', 'for', 'that', 'which']
    for word in match_pattern:
        if word not in blacklisted:
            count = frequency.get(word, 0)
            frequency[word] = count + 1
    # Sort the (word, count) pairs by count, highest first
    most_frequent = dict(sorted(frequency.items(), key=lambda elem: elem[1], reverse=True))
    most_frequent_count = most_frequent.keys()
    for word in most_frequent_count:
        print(word, most_frequent[word])
This provides a more focused analysis.
This enhanced Python script offers a robust method for analyzing text and identifying key topics based on word frequency. Remember to adapt the blacklist and word length criteria to suit your specific needs.
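One way to make the blacklist and the word length criteria easy to adapt is to turn them into function parameters. The function name word_frequencies and its parameters blacklist, min_len, and max_len are hypothetical, introduced here for illustration; treat this as a sketch rather than a finished tool.

```python
import re

def word_frequencies(text, blacklist=(), min_len=3, max_len=15):
    """Count words of min_len to max_len letters in text, skipping blacklisted words."""
    # Build the pattern from the parameters so the length limits are adjustable.
    pattern = r'\b[a-z]{%d,%d}\b' % (min_len, max_len)
    excluded = set(blacklist)  # a set keeps membership checks fast
    frequency = {}
    for word in re.findall(pattern, text.lower()):
        if word not in excluded:
            frequency[word] = frequency.get(word, 0) + 1
    # Return (word, count) pairs sorted by count, highest first.
    return sorted(frequency.items(), key=lambda item: item[1], reverse=True)

# Hypothetical sample text.
result = word_frequencies("The count and the castle and the count",
                          blacklist=['the', 'and'])
print(result)
# [('count', 2), ('castle', 1)]
```

Passing a larger min_len, for example min_len=4, filters out short words without needing to list them in the blacklist.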