Counting Word Frequency in a File Using Python

Jennifer Aniston
Release: 2025-03-06 11:59:11

This tutorial shows you how to quickly determine a document's main topic by analyzing word frequency using Python. Manually counting word occurrences is tedious; this automated approach simplifies the process.

We'll use a sample text file, test.txt (download it, but don't peek!), to illustrate. The goal is to guess the tutorial's subject based on word frequency.

Understanding Regular Expressions

This process uses regular expressions (regex). If you are unfamiliar with them, a regex is a sequence of characters that defines a search pattern for matching text, much like a more flexible "find and replace". For a deeper dive, refer to a dedicated regex tutorial.
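
As a small illustration, the sketch below applies the pattern used later in this tutorial to a made-up sentence. The sample string is an assumption for demonstration only and is not taken from test.txt.

```python
import re

# Hypothetical sample text, not taken from test.txt
sample = "The cat sat on the mat; a cat is happy."

# \b marks a word boundary; [a-z]{3,15} matches 3 to 15 lowercase letters
words = re.findall(r'\b[a-z]{3,15}\b', sample.lower())
print(words)
# ['the', 'cat', 'sat', 'the', 'mat', 'cat', 'happy']
```

Note that "on", "a", and "is" are dropped because they are shorter than three letters.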

Building the Program

  1. Read the File: The program begins by reading the text file into a string:

    # The with block closes the file automatically once it has been read
    with open('test.txt', 'r') as document_text:
        text_string = document_text.read().lower()
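  2. Regular Expression: re.findall (from the re module, which must be imported) returns every word made of 3 to 15 lowercase letters, so shorter and longer words are skipped: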

    match_pattern = re.findall(r'\b[a-z]{3,15}\b', text_string)
  3. Word Frequency: A dictionary tracks word frequencies:

    frequency = {}
    for word in match_pattern:
        count = frequency.get(word, 0)
        frequency[word] = count + 1
  4. Output: The program then prints each word and its frequency:

    for word, count in frequency.items():
        print(word, count)
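
Steps 3 and 4 above can also be sketched with collections.Counter from the standard library, which performs the same tallying as the dictionary loop. This is an alternative, not the approach the tutorial's program uses, and the word list here is a made-up stand-in for match_pattern.

```python
from collections import Counter

# Hypothetical stand-in for the list returned by re.findall
match_pattern = ['cat', 'sat', 'cat', 'mat', 'cat', 'sat']

# Counter is a dict subclass that tallies each item it is given
frequency = Counter(match_pattern)

for word, count in frequency.items():
    print(word, count)
```

A Counter returns 0 for a word it has never seen, so no get(word, 0) fallback is needed when looking up counts.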

Complete Program

Here's the combined Python code:

import re

frequency = {}

# Read the whole file as lowercase text; the with block closes the file
with open('test.txt', 'r') as document_text:
    text_string = document_text.read().lower()

# Find every word made of 3 to 15 lowercase letters
match_pattern = re.findall(r'\b[a-z]{3,15}\b', text_string)

# Tally how often each word occurs
for word in match_pattern:
    count = frequency.get(word, 0)
    frequency[word] = count + 1

# Print each word with its count
for word, count in frequency.items():
    print(word, count)

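Running this prints each word with its count, in the order the words first appear in the file (dictionaries preserve insertion order in Python 3.7 and later). The words with the highest counts hint at the topic of the tutorial the sample text came from.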

Handling Larger Text Files

For larger files, sorting the frequency dictionary simplifies finding the most frequent words:

import re

frequency = {}

# Example: dracula.txt
with open('dracula.txt', 'r') as document_text:
    text_string = document_text.read().lower()
match_pattern = re.findall(r'\b[a-z]{3,15}\b', text_string)

for word in match_pattern:
    count = frequency.get(word, 0)
    frequency[word] = count + 1

# Sort the (word, count) pairs by count, highest first
most_frequent = sorted(frequency.items(), key=lambda elem: elem[1], reverse=True)

for word, count in most_frequent:
    print(word, count)

This outputs a sorted list, with the most frequent words appearing first.
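
When a file is too large to read comfortably in one go, the same count can be built one line at a time. The sketch below is one possible approach and not part of the original program: count_words and top_n are hypothetical names, and it relies on collections.Counter and its most_common method to return only the highest counts.

```python
import re
from collections import Counter

WORD_RE = re.compile(r'\b[a-z]{3,15}\b')

def count_words(lines):
    """Tally words from any iterable of text lines (e.g. an open file)."""
    frequency = Counter()
    for line in lines:
        frequency.update(WORD_RE.findall(line.lower()))
    return frequency

def top_n(frequency, n=10):
    """Return the n most frequent (word, count) pairs, highest first."""
    return frequency.most_common(n)

# Usage with in-memory lines; a file object from open() can be passed the same way
sample_lines = ["The count, the count!", "Count the bats, count."]
result = top_n(count_words(sample_lines), n=2)
print(result)
# [('count', 4), ('the', 3)]
```

Because a file object yields its lines one by one, passing it to count_words avoids holding the whole text in memory at once.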

Excluding Common Words

To refine the analysis, exclude common words like "the," "and," etc., using a blacklist:

import re

frequency = {}

with open('dracula.txt', 'r') as document_text:
    text_string = document_text.read().lower()
match_pattern = re.findall(r'\b[a-z]{3,15}\b', text_string)

# Common words to leave out of the count (a set gives fast lookups)
blacklisted = {'the', 'and', 'for', 'that', 'which'}

for word in match_pattern:
    if word not in blacklisted:
        count = frequency.get(word, 0)
        frequency[word] = count + 1

# Sort the (word, count) pairs by count, highest first
most_frequent = sorted(frequency.items(), key=lambda elem: elem[1], reverse=True)

for word, count in most_frequent:
    print(word, count)

With those words filtered out, the top of the list shifts toward words that say more about the text's subject.

This script offers a simple way to analyze a text and identify its key topics from word frequency. Remember to adapt the blacklist and the word length limits to suit your specific needs.
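
One way to make those adjustments easier is to turn the blacklist and the length limits into parameters. The function below is a sketch under that assumption; word_frequencies, min_len, max_len, and blacklist are illustrative names rather than part of the original script, and the sample sentence is made up.

```python
import re

def word_frequencies(text, min_len=3, max_len=15, blacklist=()):
    """Count lowercase words of min_len..max_len letters, skipping blacklisted ones."""
    pattern = r'\b[a-z]{%d,%d}\b' % (min_len, max_len)
    skip = set(blacklist)  # set membership checks are fast
    frequency = {}
    for word in re.findall(pattern, text.lower()):
        if word not in skip:
            frequency[word] = frequency.get(word, 0) + 1
    # Highest counts first
    return sorted(frequency.items(), key=lambda elem: elem[1], reverse=True)

# Hypothetical sample text
sample = "The ship and the storm: the ship sailed for days."
result = word_frequencies(sample, min_len=4, blacklist=['days'])
print(result)
# [('ship', 2), ('storm', 1), ('sailed', 1)]
```

Raising min_len to 4 removes "the", "and", and "for" without listing them, which shows how the two settings can overlap.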
