NoisOCR is a Python library designed to simulate noise in texts generated after Optical Character Recognition (OCR). These texts may contain errors or annotations, reflecting the challenges of handling OCR in low-quality documents or manuscripts. The library offers features that facilitate the simulation of common errors in post-OCR texts and partitioning texts into sliding windows, with or without hyphenation. This can contribute to the training of neural network models for spelling correction.
GitHub Repository: NoisOCR
PyPI: NoisOCR on PyPI
You can easily install NoisOCR via pip:
    pip install noisocr
The sliding_window function divides a text into segments of limited size, keeping the words intact.

    import noisocr

    text = "Lorem Ipsum is simply dummy...type specimen book."
    max_window_size = 50

    windows = noisocr.sliding_window(text, max_window_size)

    # Output:
    # [
    #     'Lorem Ipsum is simply dummy text of the printing',
    #     ...
    #     'type and scrambled it to make a type specimen',
    #     'book.'
    # ]
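Word-preserving windowing can be illustrated with a greedy packing loop. The sketch below is not NoisOCR's implementation. The name `greedy_windows` is hypothetical, and it assumes that whitespace-separated words are appended to the current window until the next word would exceed the limit.

```python
def greedy_windows(text, max_window_size):
    """Pack whitespace-separated words into windows of at most max_window_size characters.

    Hypothetical helper for illustration only. A single word longer than the
    limit is kept whole in its own window rather than being split.
    """
    windows = []
    current = ""
    for word in text.split():
        candidate = word if not current else current + " " + word
        # Accept the word if it fits, or if the window is still empty
        # (an oversized word must go somewhere).
        if len(candidate) <= max_window_size or not current:
            current = candidate
        else:
            windows.append(current)
            current = word
    if current:
        windows.append(current)
    return windows


print(greedy_windows("the quick brown fox jumps over the lazy dog", 15))
# ['the quick brown', 'fox jumps over', 'the lazy dog']
```

Because no word is ever split, joining the windows with single spaces gives back the whitespace-normalized input.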
When using hyphenation, the function attempts to fit words that exceed the character limit per window by inserting hyphens as necessary. This functionality supports multiple languages through the PyHyphen package.
    import noisocr

    text = "Lorem Ipsum is simply dummy...type specimen book."
    max_window_size = 50

    windows = noisocr.sliding_window_with_hyphenation(text, max_window_size, 'en_US')

    # Output:
    # [
    #     'Lorem Ipsum is simply dummy text of the printing ',
    #     'typesetting industry. Lorem Ipsum has been the in-',
    #     ...
    #     'scrambled it to make a type specimen book.'
    # ]
The simulate_errors function adds random errors to the text, emulating issues commonly found in post-OCR texts. It relies on the typo library to generate errors such as character swaps, missing spaces, and extra characters.

    import noisocr

    text = "Hello, world!"

    # Errors are chosen at random, so the outputs below are examples only.
    text_with_errors = noisocr.simulate_errors(text, interactions=1)
    # Output: Hello, wotrld!

    text_with_errors = noisocr.simulate_errors(text, 2)
    # Output: Hsllo,wlorld!

    text_with_errors = noisocr.simulate_errors(text, 5)
    # Output: fllo,w0rlr!
The annotation simulation feature allows the user to add custom markings to the text based on a set of annotations, including those from the BRESSAY dataset.
    import noisocr

    text = "Hello, world!"

    # Annotations are applied at random, so the outputs below are examples only.
    text_with_annotation = noisocr.simulate_annotation(text, probability=0.5)
    # Output: Hello, $$--xxx--$$

    text_with_annotation = noisocr.simulate_annotation(text, probability=0.5)
    # Output: Hello, ##--world!--##

    text_with_annotation = noisocr.simulate_annotation(text, 0.01)
    # Output: Hello, world!
The core functions of NoisOCR rely on the typo library to simulate errors and on the hyphen module (from PyHyphen) to hyphenate words in different languages. The critical functions are explained below.
The simulate_annotation function selects a random word from the text and annotates it, following a defined set of annotations.
    import random

    annotations = [
        '##@@???@@##', '$$@@???@@$$', '@@???@@',
        '##--xxx--##', '$$--xxx--$$', '--xxx--',
        '##--text--##', '$$--text--$$',
        '##text##', '$$text$$', '--text--'
    ]

    def simulate_annotation(text, annotations=annotations, probability=0.01):
        words = text.split()

        # Texts with fewer than two words are returned unchanged.
        if len(words) > 1:
            target_word = random.choice(words)
        else:
            return text

        if random.random() < probability:
            annotation = random.choice(annotations)

            # Templates containing 'text' wrap the chosen word; the others replace it.
            if 'text' in annotation:
                annotated_text = annotation.replace('text', target_word)
            else:
                annotated_text = annotation

            # Only the first occurrence of the chosen word's characters is replaced.
            result_text = text.replace(target_word, annotated_text, 1)
            return result_text
        else:
            return text
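One detail of the listing above is that `text.replace(target_word, ..., 1)` replaces the first occurrence of the chosen word's characters, which may fall inside an earlier, longer word (for example, choosing "the" in "others the"). The following sketch shows one possible alternative that replaces by position. It is not part of NoisOCR: the function name `annotate_word_at_index` and the `rng` parameter are assumptions made for illustration, and rebuilding the text from the word list normalizes whitespace to single spaces.

```python
import random


def annotate_word_at_index(text, annotations, probability=0.01, rng=None):
    """Annotate one randomly chosen word, replacing it by its position in the text.

    Hypothetical variant for illustration. Passing a random.Random instance
    as rng makes the result reproducible.
    """
    if rng is None:
        rng = random.Random()
    words = text.split()
    # Single-word texts and "no annotation this time" both return the input unchanged.
    if len(words) <= 1 or rng.random() >= probability:
        return text
    index = rng.randrange(len(words))
    template = rng.choice(annotations)
    # Templates containing 'text' wrap the word; other templates replace it entirely.
    words[index] = template.replace("text", words[index])
    return " ".join(words)


print(annotate_word_at_index("others the", ["##text##"], probability=1.0,
                             rng=random.Random(0)))
```

With this approach the annotation always lands on a whole word, at the cost of losing the original spacing.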
The simulate_errors function applies various errors to the text, randomly selected from the typo library.
    import random
    import typo

    def simulate_errors(text, interactions=3, seed=None):
        methods = ["char_swap", "missing_char", "extra_char", "nearby_char", "similar_char",
                   "skipped_space", "random_space", "repeated_char", "unichar"]

        # Reseed the global generator (this runs again on every recursive call).
        if seed is not None:
            random.seed(seed)
        else:
            random.seed()

        # Pick one typo method at random and apply it to the text.
        instance = typo.StrErrer(text)
        method = random.choice(methods)
        method_to_call = getattr(instance, method)
        text = method_to_call().result

        # Recurse until the counter reaches zero, so interactions + 1 methods are applied in total.
        if interactions > 0:
            interactions -= 1
            text = simulate_errors(text, interactions, seed=seed)

        return text
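In the listing above, a non-None seed reseeds the global generator at the start of every recursive call, so each step draws the same method name. A loop with a single local generator is one way to keep seeded runs reproducible while still varying the error type. The sketch below illustrates that idea only: it is not NoisOCR's code, it does not use the typo library, and `drop_char`, `swap_chars`, `repeat_char`, and `apply_errors` are hypothetical stand-ins written in plain Python.

```python
import random


def drop_char(text, rng):
    """Remove one character at a random position (stand-in for a missing-character error)."""
    if not text:
        return text
    i = rng.randrange(len(text))
    return text[:i] + text[i + 1:]


def swap_chars(text, rng):
    """Swap two adjacent characters (stand-in for a character-swap error)."""
    if len(text) < 2:
        return text
    i = rng.randrange(len(text) - 1)
    return text[:i] + text[i + 1] + text[i] + text[i + 2:]


def repeat_char(text, rng):
    """Duplicate one character (stand-in for a repeated-character error)."""
    if not text:
        return text
    i = rng.randrange(len(text))
    return text[:i] + text[i] + text[i:]


ERROR_FUNCTIONS = [drop_char, swap_chars, repeat_char]


def apply_errors(text, count=3, seed=None, error_functions=ERROR_FUNCTIONS):
    """Apply exactly `count` randomly chosen error functions, one after another."""
    rng = random.Random(seed)  # local generator: seeded once, global state untouched
    for _ in range(count):
        text = rng.choice(error_functions)(text, rng)
    return text


noisy = apply_errors("Hello, world!", count=2, seed=42)
print(noisy)
```

Calling `apply_errors` twice with the same seed returns the same string, and `count=0` returns the input unchanged.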
The sliding-window functions split the text into windows, with or without hyphenation. The hyphenation variant is shown below.

    from hyphen import Hyphenator

    def sliding_window_with_hyphenation(text, window_size=80, language='pt_BR'):
        hyphenator = Hyphenator(language)
        words = text.split()
        windows = []
        current_window = []
        remaining_word = ""

        for word in words:
            # Prepend whatever was carried over from the previous window.
            if remaining_word:
                word = remaining_word + word
                remaining_word = ""

            if len(" ".join(current_window)) + len(word) + 1 <= window_size:
                current_window.append(word)
            else:
                # The word does not fit: add as many of its syllables as possible.
                syllables = hyphenator.syllables(word)
                temp_word = ""
                for i, syllable in enumerate(syllables):
                    if len(" ".join(current_window)) + len(temp_word) + len(syllable) + 1 <= window_size:
                        temp_word += syllable
                    else:
                        if temp_word:
                            current_window.append(temp_word + "-")
                            remaining_word = "".join(syllables[i:]) + " "
                            break
                        else:
                            remaining_word = word + " "
                            break
                else:
                    # Every syllable fit, so no hyphen is needed.
                    current_window.append(temp_word)
                    remaining_word = ""

                # Close the current window and start a new one.
                windows.append(" ".join(current_window))
                current_window = []

        if remaining_word:
            current_window.append(remaining_word)
        if current_window:
            windows.append(" ".join(current_window))

        return windows
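The word-splitting step can be looked at in isolation. The sketch below is a simplified illustration, not the library's code: `split_word_to_fit` is a hypothetical helper that takes an already syllabified word (for instance the list returned by `Hyphenator.syllables` in the listing above) and, unlike the length check in that listing, explicitly reserves one character for the trailing hyphen. The syllable lists in the example are written by hand and may differ from what a real hyphenation dictionary returns.

```python
def split_word_to_fit(syllables, space_left):
    """Split a syllabified word so that the head, plus a hyphen, fits in space_left.

    Returns (head, tail). head is "" when not even the first syllable fits,
    and tail is "" when the whole word fits without hyphenation.
    """
    word = "".join(syllables)
    if not syllables or len(word) <= space_left:
        return word, ""
    head = ""
    for i, syllable in enumerate(syllables):
        # Reserve one character for the hyphen that follows the head.
        if len(head) + len(syllable) + 1 > space_left:
            break
        head += syllable
    tail = "".join(syllables[i:])
    return (head + "-" if head else ""), tail


print(split_word_to_fit(["in", "dus", "try"], 3))  # ('in-', 'dustry')
print(split_word_to_fit(["in", "dus", "try"], 2))  # ('', 'industry')
```

The caller would append a non-empty head to the current window and carry the tail over to the next one.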
NoisOCR provides essential tools for those working on post-OCR text correction, making it easier to simulate real-world scenarios where digitized texts are prone to errors and annotations. Whether for automated testing, text correction model development, or analysis of datasets like BRESSAY, this library is a versatile and user-friendly solution.
Check out the project on GitHub: NoisOCR and contribute to its improvement!