When using OCR to digitize financial reports, you may encounter various approaches for detecting specific categories within those reports. For example, traditional methods like the Levenshtein algorithm can be used for string matching based on edit distance, making it effective for handling near matches, such as correcting typos or small variations in text.
However, the challenge becomes more complex when you need to detect multiple categories in a single line of a report, especially when those categories may not appear exactly as expected or could overlap semantically.
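To make the limitation concrete, here is a minimal sketch of edit-distance matching. The `levenshtein` helper is written for this illustration and is not taken from any particular library. It shows that a small typo stays close to its category, while a full line containing several categories is far from every individual one.

```python
def levenshtein(a: str, b: str) -> int:
    """Dynamic-programming edit distance (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[-1]


# A single-character typo is one edit away from the category...
typo_distance = levenshtein("net profet", "net profit")

# ...but a line with several categories is many edits away from any one of them,
# even though "net profit" appears in it verbatim.
line = "operating profit, net profit and profit after tax"
line_distance = levenshtein(line, "net profit")

print(typo_distance, line_distance)
```

A distance threshold tuned for typos would therefore reject the multi-category line, which is the gap the semantic approach tries to close.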
In this post, we analyze a semantic matching approach using Facebook's LASER (Language-Agnostic SEntence Representations) embeddings, showcasing how it can effectively handle this task.
The objective is to identify specific financial terms (categories) in a given text line. Let’s assume we have a fixed set of predefined categories that represent all possible terms of interest, such as:
["revenues", "operating expenses", "operating profit", "depreciation", "interest", "net profit", "tax", "profit after tax", "metric 1"]
Given an input line like:
"operating profit, net profit and profit after tax"
we aim to detect which of these identifiers appear in the line.
Instead of relying on exact or fuzzy text matches, we use semantic similarity. This approach leverages LASER embeddings to capture the semantic meaning of text and compares it using cosine similarity.
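Before embedding, the text is preprocessed by converting it to lowercase and stripping leading and trailing whitespace, so that identifiers and input lines are compared in a uniform form.
def preprocess(text):
    # Lowercase and trim surrounding whitespace
    return text.lower().strip()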
The LASER encoder generates normalized embeddings for both the list of identifiers and the input/OCR line.
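from laser_encoders import LaserEncoderPipeline

# Assumed setup: English LASER encoder from the laser_encoders package
encoder = LaserEncoderPipeline(lang="eng_Latn")

identifier_embeddings = encoder.encode_sentences(identifiers, normalize_embeddings=True)
ocr_line_embedding = encoder.encode_sentences([ocr_line], normalize_embeddings=True)[0]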
Identifiers are sorted by word count so that longer ones come first. This helps with nested matches, where a longer identifier contains a shorter one (e.g., "profit after tax" contains "tax").
ranked_identifiers = sorted(identifiers, key=lambda x: len(x.split()), reverse=True)
ranked_embeddings = encoder.encode_sentences(ranked_identifiers, normalize_embeddings=True)
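As a quick check of what the ranking step produces, the following sketch applies the same sort to the category list from this post and prints the resulting order. No encoder is needed for this part.

```python
identifiers = [
    "revenues", "operating expenses", "operating profit", "depreciation",
    "interest", "net profit", "tax", "profit after tax", "metric 1",
]

# Sort by number of words, longest first. Python's sort is stable, so
# identifiers with the same word count keep their original relative order.
ranked_identifiers = sorted(identifiers, key=lambda x: len(x.split()), reverse=True)

for identifier in ranked_identifiers:
    print(len(identifier.split()), identifier)
```

Note that the sort only fixes the order in which identifiers are considered. On its own it does not remove a shorter identifier that is contained in a longer one.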
Using cosine similarity, we measure how semantically similar each identifier is to the input line. Identifiers with similarity above a specified threshold are considered matches.
from sklearn.metrics.pairwise import cosine_similarity

matches = []
threshold = 0.6
for idx, identifier_embedding in enumerate(ranked_embeddings):
    similarity = cosine_similarity([identifier_embedding], [ocr_line_embedding])[0][0]
    if similarity >= threshold:
        matches.append((ranked_identifiers[idx], similarity))
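To see the thresholding logic in isolation, here is a small sketch that uses hand-made three-dimensional vectors in place of real LASER embeddings. The vectors and the `cosine` helper are purely illustrative assumptions; real embeddings have far more dimensions and different values.

```python
import math


def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy stand-ins for embeddings (hypothetical values, not LASER output)
toy_embeddings = {
    "net profit": [0.9, 0.1, 0.2],
    "depreciation": [0.1, 0.9, 0.1],
}
toy_line_embedding = [0.8, 0.2, 0.3]
threshold = 0.6

toy_matches = []
for name, vector in toy_embeddings.items():
    similarity = cosine(vector, toy_line_embedding)
    if similarity >= threshold:
        toy_matches.append((name, similarity))

print(toy_matches)
```

Only the vector pointing in roughly the same direction as the line vector passes the threshold. When embeddings are already normalized to unit length, as with `normalize_embeddings=True`, the cosine similarity reduces to a plain dot product.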
To handle overlapping identifiers, longer matches are prioritized, ensuring shorter matches within them are excluded.
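One possible way to implement this filtering is sketched below. The function name `drop_nested_matches` and the example scores are hypothetical and chosen only for illustration; the post's own implementation may differ.

```python
def drop_nested_matches(matches):
    """Keep a match only if no longer, already-kept identifier contains it."""
    kept = []
    # Visit longer identifiers first so containers are seen before their parts.
    ordered = sorted(matches, key=lambda m: len(m[0].split()), reverse=True)
    for identifier, score in ordered:
        # Pad with spaces so only whole-word containment counts
        # ("tax" is inside "profit after tax", but not inside "taxes").
        padded = f" {identifier} "
        if any(padded in f" {longer} " for longer, _ in kept):
            continue
        kept.append((identifier, score))
    return kept


# Illustrative (made-up) similarity scores
example_matches = [("profit after tax", 0.71), ("net profit", 0.68), ("tax", 0.62)]
filtered = drop_nested_matches(example_matches)
print(filtered)
```

A trade-off of this design is that "tax" is dropped whenever "profit after tax" matches, even if the line also mentions tax as a separate item. Whether that is acceptable depends on the reports being processed.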
When the code is executed, the output is a list of the detected identifiers along with their similarity scores for the example input line.
This method works well in structured financial reports with multiple categories on a single line, provided there aren't too many categories or much unrelated text. Because the whole line is represented by a single embedding, each category contributes less to that embedding as the line grows. Accuracy can therefore degrade with longer, complex inputs, and the method is less reliable for noisy or unpredictable text such as unstructured user-generated content.
This post demonstrates how LASER embeddings can be a useful tool for detecting multiple categories in text. Is it the best option? Maybe not, but it is certainly one of the options worth considering, especially when dealing with complex scenarios where traditional matching techniques might fall short.