Semantic Matching of Text Identifiers Using LASER Embeddings in Python

Linda Hamilton

Release: 2024-11-25 05:33:11

When using OCR to digitize financial reports, you may encounter various approaches for detecting specific categories within those reports. For example, traditional methods like the Levenshtein algorithm match strings based on edit distance, which makes them effective for near matches such as typos or small variations in text.
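For reference, the edit-distance idea can be sketched in a few lines of plain Python. This is a minimal illustration of the standard dynamic-programming formulation, not part of the matching approach discussed in this post; the function name levenshtein and the sample strings are chosen here for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the minimum number of single-character edits turning a into b."""
    # previous[j] holds the distance between the processed prefix of a and b[:j]
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


# An OCR typo ("l" read instead of "i") is one edit away from the category
print(levenshtein("net proflt", "net profit"))  # 1
```

A small distance relative to the string length suggests a near match, which is why this works for typos but not for lines that contain several categories at once.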

However, the challenge becomes more complex when you need to detect multiple categories in a single line of a report, especially when those categories may not appear exactly as expected or could overlap semantically.

In this post, we analyze a semantic matching approach using Facebook's LASER (Language-Agnostic SEntence Representations) embeddings, showcasing how it can effectively handle this task.

Problem

The objective is to identify specific financial terms (categories) in a given text line. Let’s assume we have a fixed set of predefined categories that represent all possible terms of interest, such as:

["revenues", "operating expenses", "operating profit", "depreciation", "interest", "net profit", "tax", "profit after tax", "metric 1"]

Given an input line like:

"operating profit, net profit and profit after tax"

We aim to detect which identifiers appear in this line.

Semantic Matching with LASER

Instead of relying on exact or fuzzy text matches, we use semantic similarity. This approach leverages LASER embeddings to capture the semantic meaning of text and compares it using cosine similarity.
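As a quick illustration of the comparison step, cosine similarity is the dot product of two vectors divided by the product of their lengths. The sketch below uses plain Python and tiny made-up vectors in place of real LASER embeddings, which have far more dimensions; the function name cosine is illustrative.

```python
import math


def cosine(u, v):
    """Cosine of the angle between vectors u and v (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Made-up 3-dimensional "embeddings", for illustration only
line_vec = [0.9, 0.1, 0.4]
close_vec = [0.8, 0.2, 0.5]   # points in a similar direction
far_vec = [-0.1, 0.9, -0.3]   # points elsewhere

print(cosine(line_vec, close_vec) > cosine(line_vec, far_vec))  # True
```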

Implementation

Preprocessing the Text

Before embedding, the text is preprocessed by converting it to lowercase and removing extra spaces. This ensures uniformity.

def preprocess(text):
    # Lowercase and collapse runs of whitespace into single spaces
    return " ".join(text.lower().split())

Embedding Identifiers and Input Line

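The LASER encoder generates normalized embeddings for both the list of identifiers and the input/OCR line. The snippet assumes encoder is a LaserEncoderPipeline from the laser_encoders package, whose encode_sentences method accepts a normalize_embeddings flag.

from laser_encoders import LaserEncoderPipeline
encoder = LaserEncoderPipeline(lang="eng_Latn")  # assumed setup for English text
identifier_embeddings = encoder.encode_sentences(identifiers, normalize_embeddings=True)
ocr_line_embedding = encoder.encode_sentences([ocr_line], normalize_embeddings=True)[0]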
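A note on normalize_embeddings=True: assuming this means L2 (unit-length) normalization, the cosine similarity of two normalized embeddings is simply their dot product. The sketch below checks this with plain Python and made-up vectors; the helper names are illustrative and not part of any LASER package.

```python
import math


def l2_normalize(v):
    """Scale v to unit length (assumes v is not all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


u = [3.0, 4.0]
v = [4.0, 3.0]

# Cosine similarity computed the long way on the raw vectors
cosine_raw = dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))
# Dot product of the unit-length versions
cosine_unit = dot(l2_normalize(u), l2_normalize(v))

print(round(cosine_raw, 2), round(cosine_unit, 2))  # 0.96 0.96
```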

Ranking Identifiers by Specificity

Longer identifiers are prioritized by sorting them based on word count. This helps handle nested matches, where longer identifiers might subsume shorter ones (e.g., "profit after tax" subsumes "profit").

ranked_identifiers = sorted(identifiers, key=lambda x: len(x.split()), reverse=True)
ranked_embeddings = encoder.encode_sentences(ranked_identifiers, normalize_embeddings=True)
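To see the effect of this ordering, the sort can be run on the example category list without any encoder. Because Python's sort is stable, even with reverse=True, identifiers with the same word count keep their original relative order.

```python
identifiers = [
    "revenues", "operating expenses", "operating profit", "depreciation",
    "interest", "net profit", "tax", "profit after tax", "metric 1",
]

# Most words first; ties keep their original order (stable sort)
ranked_identifiers = sorted(identifiers, key=lambda x: len(x.split()), reverse=True)

for identifier in ranked_identifiers:
    print(len(identifier.split()), identifier)
```

The three-word identifier "profit after tax" comes first, followed by the two-word identifiers and then the single-word ones.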

Calculating Similarity

Using cosine similarity, we measure how semantically similar each identifier is to the input line. Identifiers with similarity above a specified threshold are considered matches.

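from sklearn.metrics.pairwise import cosine_similarity

matches = []
threshold = 0.6  # minimum cosine similarity to count as a match

for identifier, identifier_embedding in zip(ranked_identifiers, ranked_embeddings):
    # cosine_similarity expects 2D inputs and returns a 1x1 matrix here
    similarity = cosine_similarity([identifier_embedding], [ocr_line_embedding])[0][0]
    if similarity >= threshold:
        matches.append((identifier, similarity))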

Resolving Nested Matches

To handle overlapping identifiers, longer matches are prioritized, ensuring shorter matches within them are excluded.
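One way to implement this step is sketched below. It assumes matches is the list of (identifier, similarity) pairs from the previous step, already ordered longest first, and treats an identifier as nested when its words appear as a contiguous phrase inside an identifier that has already been kept. The function name resolve_nested and the sample scores are hypothetical.

```python
def resolve_nested(matches):
    """Drop identifiers that occur as a whole phrase inside a longer kept match.

    matches: list of (identifier, similarity) tuples, longest identifiers first.
    """
    resolved = []
    for identifier, similarity in matches:
        # Pad with spaces so "tax" matches the word "tax" but not "taxation"
        nested = any(f" {identifier} " in f" {kept} " for kept, _ in resolved)
        if not nested:
            resolved.append((identifier, similarity))
    return resolved


# Hypothetical scores, for illustration only
matches = [("profit after tax", 0.82), ("net profit", 0.74), ("tax", 0.66)]
print(resolve_nested(matches))
# [('profit after tax', 0.82), ('net profit', 0.74)]
```

Here "tax" is dropped because it is contained in "profit after tax", while "net profit" survives because it is not a phrase inside any kept identifier.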

Results

When the code is executed on the example input line, the output is a list of detected matches along with their similarity scores.

Considerations for Longer and Complex Inputs

This method works well in structured financial reports with multiple categories on a single line, provided there aren't too many categories or much unrelated text. However, accuracy can degrade with longer, complex inputs or unstructured user-generated text, as the embeddings may struggle to focus on relevant categories. It is less reliable for noisy or unpredictable inputs.

Conclusion

This post demonstrates how LASER embeddings can be a useful tool for detecting multiple categories in text. Is it the best option? Maybe not, but it is certainly one of the options worth considering, especially when dealing with complex scenarios where traditional matching techniques might fall short.
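Putting the steps together, the whole procedure fits in one function. The following is a minimal sketch under stated assumptions, not a definitive implementation: it takes the embedding function as a parameter (encode, a hypothetical name) so the logic can run with any sentence encoder, and it computes cosine similarity in plain Python instead of calling scikit-learn.

```python
import math


def preprocess(text):
    """Lowercase and collapse runs of whitespace."""
    return " ".join(text.lower().split())


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norms if norms else 0.0


def semantic_matching(ocr_line, identifiers, encode, threshold=0.6):
    """Return (identifier, similarity) pairs detected in ocr_line.

    encode: callable mapping a list of strings to a list of vectors
            (hypothetical parameter; any sentence encoder can be plugged in).
    """
    line = preprocess(ocr_line)
    # Longest identifiers first so they win over phrases nested inside them
    ranked = sorted((preprocess(i) for i in identifiers),
                    key=lambda x: len(x.split()), reverse=True)
    line_vec = encode([line])[0]
    scored = [(ident, cosine(vec, line_vec))
              for ident, vec in zip(ranked, encode(ranked))]
    matches = [(ident, score) for ident, score in scored if score >= threshold]

    resolved = []
    for ident, score in matches:
        # Skip identifiers that appear as a whole phrase in a kept, longer match
        if not any(f" {ident} " in f" {kept} " for kept, _ in resolved):
            resolved.append((ident, score))
    return resolved


# Assumed wiring with the laser_encoders package (not run here):
#   encoder = LaserEncoderPipeline(lang="eng_Latn")
#   encode = lambda s: encoder.encode_sentences(s, normalize_embeddings=True)
```

Passing the encoder in as a function keeps the matching logic independent of the embedding library, which also makes the threshold and nesting rules easy to try out with small hand-made vectors.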
