Topic Modeling with Top2Vec: Dreyfus, AI, and Word Clouds
Extracting Insights from PDFs with Python: A Comprehensive Guide
This script demonstrates a workflow for processing PDFs, extracting text, tokenizing sentences, and performing topic modeling with visualizations, tailored for efficient and insightful analysis.
Library Overview
- os: functions for interacting with the operating system.
- matplotlib.pyplot: creates static, animated, and interactive visualizations in Python.
- nltk: the Natural Language Toolkit, a suite of libraries and programs for natural language processing.
- pandas: data manipulation and analysis library.
- pdftotext: converts PDF documents to plain text.
- re: regular-expression matching operations.
- seaborn: statistical data visualization library built on matplotlib.
- nltk.tokenize.sent_tokenize: NLTK function that tokenizes a string into sentences.
- top2vec: topic modeling and semantic search library.
- wordcloud: creates word clouds from text data.
- cleantext: text cleaning and normalization library (imported below as `clean`).
Initial Setup
Importing modules
```python
import os
import re

import matplotlib.pyplot as plt
import nltk
import pandas as pd
import pdftotext
import seaborn as sns
from cleantext import clean
from nltk.tokenize import sent_tokenize
from top2vec import Top2Vec
from wordcloud import WordCloud
```
Next, make sure the punkt tokenizer is downloaded:
nltk.download('punkt')
Text Normalization
```python
def normalize_text(text):
    """Normalize text by removing special characters and extra spaces,
    and applying various other cleaning options."""
    # Apply the clean function with specified parameters
    cleaned_text = clean(
        text,
        fix_unicode=True,          # fix various unicode errors
        to_ascii=True,             # transliterate to closest ASCII representation
        lower=True,                # lowercase text
        no_line_breaks=False,      # normalize line breaks rather than stripping them
        no_urls=True,              # replace all URLs with a special token
        no_emails=True,            # replace all email addresses with a special token
        no_phone_numbers=True,     # replace all phone numbers with a special token
        no_numbers=True,           # replace all numbers with a special token
        no_digits=True,            # replace all digits with a special token
        no_currency_symbols=True,  # replace all currency symbols with a special token
        no_punct=False,            # keep punctuation (True would remove it)
        lang="en",                 # set to "de" for German special handling
    )
    # Remove any remaining special characters except word characters,
    # whitespace, periods, and commas
    cleaned_text = re.sub(r"[^\w\s.,]", "", cleaned_text)
    # Collapse runs of whitespace into a single space and strip leading/trailing spaces
    cleaned_text = re.sub(r"\s+", " ", cleaned_text).strip()
    return cleaned_text
```
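The two `re.sub` passes at the end of `normalize_text` can be exercised on their own, independent of `cleantext`. A minimal sketch (the helper name `strip_specials` and the sample string are illustrative):

```python
import re

def strip_specials(text):
    # Keep word characters, whitespace, periods, and commas; drop everything else
    text = re.sub(r"[^\w\s.,]", "", text)
    # Collapse runs of whitespace into single spaces
    return re.sub(r"\s+", " ", text).strip()

print(strip_specials("Dreyfus:  What   Computers* Can't Do!"))
# → Dreyfus What Computers Cant Do
```

Note that the apostrophe in "Can't" is dropped, which is why the stopword list below contains tokens like "cant" and "doesnt".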
PDF Text Extraction
```python
def extract_text_from_pdf(pdf_path):
    with open(pdf_path, "rb") as f:
        pdf = pdftotext.PDF(f)
    all_text = "\n\n".join(pdf)
    return normalize_text(all_text)
```
Sentence Tokenization
```python
def split_into_sentences(text):
    return sent_tokenize(text)
```
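`sent_tokenize` relies on the `punkt` model downloaded earlier. As a rough illustration of what sentence tokenization does (not a substitute for punkt, which also handles abbreviations like "Dr."), a naive regex splitter might look like:

```python
import re

def naive_sentence_split(text):
    # Split after ., ! or ? followed by whitespace; a crude stand-in for punkt
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

print(naive_sentence_split("AI has limits. Dreyfus argued this! Was he right?"))
# → ['AI has limits.', 'Dreyfus argued this!', 'Was he right?']
```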
Processing Multiple Files
```python
def process_files(file_paths):
    authors, titles, all_sentences = [], [], []
    for file_path in file_paths:
        file_name = os.path.basename(file_path)
        parts = file_name.split(" - ", 2)
        if len(parts) != 3 or not file_name.endswith(".pdf"):
            print(f"Skipping file with incorrect format: {file_name}")
            continue
        year, author, title = parts
        author, title = author.strip(), title.replace(".pdf", "").strip()
        try:
            text = extract_text_from_pdf(file_path)
        except Exception as e:
            print(f"Error extracting text from {file_name}: {e}")
            continue
        sentences = split_into_sentences(text)
        authors.append(author)
        titles.append(title)
        all_sentences.extend(sentences)
        print(f"Number of sentences for {file_name}: {len(sentences)}")
    return authors, titles, all_sentences
```
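`process_files` assumes file names follow a `YEAR - AUTHOR - TITLE.pdf` pattern. The parsing logic in isolation (the helper name `parse_pdf_name` and the sample file name are illustrative):

```python
def parse_pdf_name(file_name):
    # Expect "YEAR - AUTHOR - TITLE.pdf"; return None for any other shape
    parts = file_name.split(" - ", 2)
    if len(parts) != 3 or not file_name.endswith(".pdf"):
        return None
    year, author, title = parts
    return year.strip(), author.strip(), title[:-len(".pdf")].strip()

print(parse_pdf_name("1992 - Dreyfus - What Computers Still Can't Do.pdf"))
# → ('1992', 'Dreyfus', "What Computers Still Can't Do")
print(parse_pdf_name("notes.txt"))
# → None
```

The `maxsplit=2` argument matters: it keeps titles that themselves contain " - " intact as the third component.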
Saving Data to CSV
```python
def save_data_to_csv(authors, titles, file_paths, output_file):
    # Note: authors and titles must line up with file_paths, i.e. this assumes
    # every PDF in file_paths was processed successfully by process_files;
    # otherwise the DataFrame columns will have unequal lengths.
    texts = []
    for fp in file_paths:
        try:
            text = extract_text_from_pdf(fp)
            sentences = split_into_sentences(text)
            texts.append(" ".join(sentences))
        except Exception as e:
            print(f"Error processing file {fp}: {e}")
            texts.append("")
    data = pd.DataFrame({
        "Author": authors,
        "Title": titles,
        "Text": texts,
    })
    data.to_csv(output_file, index=False, quoting=1, encoding="utf-8")
    print(f"Data has been written to {output_file}")
```
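`quoting=1` in `to_csv` is the numeric value of `csv.QUOTE_ALL`, which wraps every field in quotes. The same setting with the standard-library `csv` module, using hypothetical rows:

```python
import csv
import io

rows = [("Author", "Title"), ("Dreyfus", "What Computers Can't Do")]
buf = io.StringIO()
writer = csv.writer(buf, quoting=csv.QUOTE_ALL)  # same as quoting=1 in to_csv
writer.writerows(rows)
print(buf.getvalue())
```

Quoting every field keeps sentence text containing commas from being split across columns when the CSV is read back.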
Loading Stopwords
```python
def load_stopwords(filepath):
    with open(filepath, "r") as f:
        stopwords = f.read().splitlines()
    additional_stopwords = [
        "able", "according", "act", "actually", "after", "again", "age", "agree",
        "al", "all", "already", "also", "am", "among", "an", "and", "another",
        "any", "appropriate", "are", "argue", "as", "at", "avoid", "based",
        "basic", "basis", "be", "been", "begin", "best", "book", "both", "build",
        "but", "by", "call", "can", "cant", "case", "cases", "claim", "claims",
        "class", "clear", "clearly", "cope", "could", "course", "data", "de",
        "deal", "dec", "did", "do", "doesnt", "done", "dont", "each", "early",
        "ed", "either", "end", "etc", "even", "ever", "every", "far", "feel",
        "few", "field", "find", "first", "follow", "follows", "for", "found",
        "free", "fri", "fully", "get", "had", "hand", "has", "have", "he",
        "help", "her", "here", "him", "his", "how", "however", "httpsabout",
        "ibid", "if", "im", "in", "is", "it", "its", "jstor", "june", "large",
        "lead", "least", "less", "like", "long", "look", "man", "many", "may",
        "me", "money", "more", "most", "move", "moves", "my", "neither", "net",
        "never", "new", "no", "nor", "not", "notes", "notion", "now", "of",
        "on", "once", "one", "ones", "only", "open", "or", "order", "orgterms",
        "other", "our", "out", "own", "paper", "past", "place", "plan", "play",
        "point", "pp", "precisely", "press", "put", "rather", "real", "require",
        "right", "risk", "role", "said", "same", "says", "search", "second",
        "see", "seem", "seems", "seen", "sees", "set", "shall", "she", "should",
        "show", "shows", "since", "so", "step", "strange", "style", "such",
        "suggests", "talk", "tell", "tells", "term", "terms", "than", "that",
        "the", "their", "them", "then", "there", "therefore", "these", "they",
        "this", "those", "three", "thus", "to", "todes", "together", "too",
        "tradition", "trans", "true", "try", "trying", "turn", "turns", "two",
        "up", "us", "use", "used", "uses", "using", "very", "view", "vol",
        "was", "way", "ways", "we", "web", "well", "were", "what", "when",
        "whether", "which", "who", "why", "with", "within", "works", "would",
        "years", "york", "you", "your", "without",
    ]
    stopwords.extend(additional_stopwords)
    return set(stopwords)
```
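`load_stopwords` expects a plain-text file with one word per line. A self-contained sketch of that contract using a temporary file (the sample words are illustrative):

```python
import os
import tempfile

# Hypothetical stopwords file: one word per line, as load_stopwords expects
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
    f.write("the\nand\nof\n")
    path = f.name

with open(path) as f:
    stopwords = set(f.read().splitlines())
os.remove(path)
print(stopwords == {"the", "and", "of"})
# → True
```

Returning a `set` makes the membership tests in the filtering step below O(1) per word.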
Filtering Stopwords from Topics
```python
def filter_stopwords_from_topics(topic_words, stopwords):
    filtered_topics = []
    for words in topic_words:
        filtered_topics.append([word for word in words if word.lower() not in stopwords])
    return filtered_topics
```
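The case-insensitive filtering can be checked on a single topic's word list (the sample words and stopword set are illustrative):

```python
stopwords = {"the", "and", "thus"}

def filter_topic(words, stopwords):
    # Lowercase each word before the membership test, as in
    # filter_stopwords_from_topics, so "The" and "the" are both dropped
    return [w for w in words if w.lower() not in stopwords]

print(filter_topic(["The", "embodiment", "AND", "intentionality"], stopwords))
# → ['embodiment', 'intentionality']
```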
Word Cloud Generation
```python
def generate_wordcloud(topic_words, topic_num, palette="inferno"):
    colors = sns.color_palette(palette, n_colors=256).as_hex()

    def color_func(word, font_size, position, orientation, random_state=None, **kwargs):
        return colors[random_state.randint(0, len(colors) - 1)]

    wordcloud = WordCloud(
        width=800,
        height=400,
        background_color="black",
        color_func=color_func,
    ).generate(" ".join(topic_words))
    plt.figure(figsize=(10, 5))
    plt.imshow(wordcloud, interpolation="bilinear")
    plt.axis("off")
    plt.title(f"Topic {topic_num} Word Cloud")
    plt.show()
```
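WordCloud invokes `color_func` with `random_state` bound to a `random.Random` instance, so the color picker can be exercised without rendering anything. A sketch with a hypothetical three-color palette standing in for the 256-color seaborn one:

```python
import random

# Stand-in for sns.color_palette("inferno", n_colors=256).as_hex()
colors = ["#000004", "#b73779", "#fcfdbf"]

def color_func(word, font_size, position, orientation, random_state=None, **kwargs):
    # random_state is a random.Random instance supplied by WordCloud
    return colors[random_state.randint(0, len(colors) - 1)]

picked = color_func("dreyfus", 12, (0, 0), None, random_state=random.Random(0))
print(picked in colors)
# → True
```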
Main Execution
```python
pdf_dir = "/home/roomal/Desktop/Dreyfus-Project/Dreyfus/"
file_paths = [
    os.path.join(pdf_dir, fname)
    for fname in os.listdir(pdf_dir)
    if fname.endswith(".pdf")
]

authors, titles, all_sentences = process_files(file_paths)

output_file = "/home/roomal/Desktop/Dreyfus-Project/Dreyfus_Papers.csv"
save_data_to_csv(authors, titles, file_paths, output_file)

stopwords_filepath = "/home/roomal/Documents/Lists/stopwords.txt"
stopwords = load_stopwords(stopwords_filepath)

try:
    topic_model = Top2Vec(
        all_sentences,
        embedding_model="distiluse-base-multilingual-cased",
        speed="deep-learn",
        workers=6,
    )
    print("Top2Vec model created successfully.")
except ValueError as e:
    print(f"Error initializing Top2Vec: {e}")
except Exception as e:
    print(f"Unexpected error: {e}")

num_topics = topic_model.get_num_topics()
topic_words, word_scores, topic_nums = topic_model.get_topics(num_topics)

filtered_topic_words = filter_stopwords_from_topics(topic_words, stopwords)
for i, words in enumerate(filtered_topic_words):
    print(f"Topic {i}: {', '.join(words)}")

keywords = ["heidegger"]
topic_words, word_scores, topic_scores, topic_nums = topic_model.search_topics(
    keywords=keywords, num_topics=num_topics
)
filtered_search_topic_words = filter_stopwords_from_topics(topic_words, stopwords)
for i, words in enumerate(filtered_search_topic_words):
    generate_wordcloud(words, topic_nums[i])
```
Reducing the Number of Topics
```python
reduced_num_topics = 5
topic_mapping = topic_model.hierarchical_topic_reduction(num_topics=reduced_num_topics)

# Print reduced topics and generate word clouds
for i in range(reduced_num_topics):
    topic_words = topic_model.topic_words_reduced[i]
    filtered_words = [word for word in topic_words if word.lower() not in stopwords]
    print(f"Reduced Topic {i}: {', '.join(filtered_words)}")
    generate_wordcloud(filtered_words, i)
```