This article explores the HybridSimilarity algorithm, a sophisticated neural network designed to assess the similarity between text pairs. This hybrid model cleverly integrates lexical, phonetic, semantic, and syntactic comparisons for a comprehensive similarity score.
<code class="language-python">import numpy as np from sklearn.feature_extraction.text import TfidfVectorizer from sklearn.decomposition import TruncatedSVD from sentence_transformers import SentenceTransformer from Levenshtein import ratio as levenshtein_ratio from phonetics import metaphone import torch import torch.nn as nn class HybridSimilarity(nn.Module): def __init__(self): super().__init__() self.bert = SentenceTransformer('all-MiniLM-L6-v2') self.tfidf = TfidfVectorizer() self.attention = nn.MultiheadAttention(embed_dim=384, num_heads=4) self.fc = nn.Sequential( nn.Linear(1152, 256), nn.ReLU(), nn.LayerNorm(256), nn.Linear(256, 1), nn.Sigmoid() ) def _extract_features(self, text1, text2): # Feature Extraction features = {} # Lexical Analysis features['levenshtein'] = levenshtein_ratio(text1, text2) features['jaccard'] = len(set(text1.split()) & set(text2.split())) / len(set(text1.split()) | set(text2.split())) # Phonetic Analysis features['metaphone'] = 1.0 if metaphone(text1) == metaphone(text2) else 0.0 # Semantic Analysis (BERT) emb1 = self.bert.encode(text1, convert_to_tensor=True) emb2 = self.bert.encode(text2, convert_to_tensor=True) features['semantic_cosine'] = nn.CosineSimilarity()(emb1, emb2).item() # Syntactic Analysis (LSA-TFIDF) tfidf_matrix = self.tfidf.fit_transform([text1, text2]) svd = TruncatedSVD(n_components=1) lsa = svd.fit_transform(tfidf_matrix) features['lsa_cosine'] = np.dot(lsa[0], lsa[1].T)[0][0] # Attention Mechanism att_output, _ = self.attention( emb1.unsqueeze(0).unsqueeze(0), emb2.unsqueeze(0).unsqueeze(0), emb2.unsqueeze(0).unsqueeze(0) ) features['attention_score'] = att_output.mean().item() return torch.tensor(list(features.values())).unsqueeze(0) def forward(self, text1, text2): features = self._extract_features(text1, text2) return self.fc(features).item() def similarity_coefficient(text1, text2): model = HybridSimilarity() return model(text1, text2)</code>
The HybridSimilarity model relies on these key components:
The HybridSimilarity
class, extending nn.Module
, initializes:
all-MiniLM-L6-v2
).<code class="language-python">self.bert = SentenceTransformer('all-MiniLM-L6-v2') self.tfidf = TfidfVectorizer() self.attention = nn.MultiheadAttention(embed_dim=384, num_heads=4) self.fc = nn.Sequential( nn.Linear(1152, 256), nn.ReLU(), nn.LayerNorm(256), nn.Linear(256, 1), nn.Sigmoid() )</code>
The _extract_features
method computes several similarity features:
<code class="language-python">features['levenshtein'] = levenshtein_ratio(text1, text2) features['jaccard'] = len(set(text1.split()) & set(text2.split())) / len(set(text1.split()) | set(text2.split()))</code>
<code class="language-python">features['metaphone'] = 1.0 if metaphone(text1) == metaphone(text2) else 0.0</code>
<code class="language-python">emb1 = self.bert.encode(text1, convert_to_tensor=True) emb2 = self.bert.encode(text2, convert_to_tensor=True) features['semantic_cosine'] = nn.CosineSimilarity()(emb1, emb2).item()</code>
TruncatedSVD
.<code class="language-python">tfidf_matrix = self.tfidf.fit_transform([text1, text2]) svd = TruncatedSVD(n_components=1) lsa = svd.fit_transform(tfidf_matrix) features['lsa_cosine'] = np.dot(lsa[0], lsa[1].T)[0][0]</code>
<code class="language-python">att_output, _ = self.attention( emb1.unsqueeze(0).unsqueeze(0), emb2.unsqueeze(0).unsqueeze(0), emb2.unsqueeze(0).unsqueeze(0) ) features['attention_score'] = att_output.mean().item()</code>
The extracted features are combined and fed into a fully connected neural network. This network outputs a similarity score (0-1).
<code class="language-python">import numpy as np from sklearn.feature_extraction.text import TfidfVectorizer from sklearn.decomposition import TruncatedSVD from sentence_transformers import SentenceTransformer from Levenshtein import ratio as levenshtein_ratio from phonetics import metaphone import torch import torch.nn as nn class HybridSimilarity(nn.Module): def __init__(self): super().__init__() self.bert = SentenceTransformer('all-MiniLM-L6-v2') self.tfidf = TfidfVectorizer() self.attention = nn.MultiheadAttention(embed_dim=384, num_heads=4) self.fc = nn.Sequential( nn.Linear(1152, 256), nn.ReLU(), nn.LayerNorm(256), nn.Linear(256, 1), nn.Sigmoid() ) def _extract_features(self, text1, text2): # Feature Extraction features = {} # Lexical Analysis features['levenshtein'] = levenshtein_ratio(text1, text2) features['jaccard'] = len(set(text1.split()) & set(text2.split())) / len(set(text1.split()) | set(text2.split())) # Phonetic Analysis features['metaphone'] = 1.0 if metaphone(text1) == metaphone(text2) else 0.0 # Semantic Analysis (BERT) emb1 = self.bert.encode(text1, convert_to_tensor=True) emb2 = self.bert.encode(text2, convert_to_tensor=True) features['semantic_cosine'] = nn.CosineSimilarity()(emb1, emb2).item() # Syntactic Analysis (LSA-TFIDF) tfidf_matrix = self.tfidf.fit_transform([text1, text2]) svd = TruncatedSVD(n_components=1) lsa = svd.fit_transform(tfidf_matrix) features['lsa_cosine'] = np.dot(lsa[0], lsa[1].T)[0][0] # Attention Mechanism att_output, _ = self.attention( emb1.unsqueeze(0).unsqueeze(0), emb2.unsqueeze(0).unsqueeze(0), emb2.unsqueeze(0).unsqueeze(0) ) features['attention_score'] = att_output.mean().item() return torch.tensor(list(features.values())).unsqueeze(0) def forward(self, text1, text2): features = self._extract_features(text1, text2) return self.fc(features).item() def similarity_coefficient(text1, text2): model = HybridSimilarity() return model(text1, text2)</code>
The similarity_coefficient
function initializes the model and computes the similarity between two input texts.
<code class="language-python">self.bert = SentenceTransformer('all-MiniLM-L6-v2') self.tfidf = TfidfVectorizer() self.attention = nn.MultiheadAttention(embed_dim=384, num_heads=4) self.fc = nn.Sequential( nn.Linear(1152, 256), nn.ReLU(), nn.LayerNorm(256), nn.Linear(256, 1), nn.Sigmoid() )</code>
This returns a float between 0 and 1, representing the similarity.
The HybridSimilarity algorithm offers a robust approach to text similarity by integrating various aspects of text comparison. Its combination of lexical, phonetic, semantic, and syntactic analysis allows for a more comprehensive and nuanced understanding of text similarity, making it suitable for various applications, including duplicate detection, text clustering, and information retrieval.
The above is the detailed content of HybridSimilarity Algorithm. For more information, please follow other related articles on the PHP Chinese website!