How can Apache Spark be used for efficient string matching and verification of text extracted from images using OCR?


Patricia Arquette

Published: 2024-10-29 05:25:31



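Efficient String Matching in Apache Spark for Verifying Extracted Text

Optical character recognition (OCR) tools often introduce errors when extracting text from images. Matching the extracted text against a reference dataset therefore calls for an algorithm in Spark that is both efficient and tolerant of those errors.

OCR output typically suffers from character substitutions, dropped emoji, and removed whitespace, so exact string comparison is not enough and an approximate matching approach is needed. Within Spark, a combination of machine learning transformers can be used to build an efficient solution.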

Pipeline Approach

A pipeline can be built to perform the following steps:

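Tokenization: RegexTokenizer splits the input text into single-character tokens (an empty pattern with a minimum token length of 1), so that a substitution such as "I" read as "|" changes only the few n-grams that contain it.
N-grams: NGram builds n-gram sequences from those tokens, so that a missing symbol affects only the n-grams around it.
Vectorization: HashingTF or CountVectorizer converts the n-grams into numeric vectors so that similarity can be measured efficiently.
Locality-sensitive hashing (LSH): MinHashLSH uses locality-sensitive hashing to approximate the Jaccard distance between the vectors.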
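
To make the first two steps concrete, here is a minimal plain-Scala sketch (no Spark required) of character trigrams and the exact Jaccard distance between two trigram sets. The object name `OcrNgramSketch` and the helpers `charNGrams` and `jaccardDistance` are hypothetical names introduced for illustration. Lowercasing is included on the assumption that it should mirror RegexTokenizer's default behavior.

```scala
// Plain-Scala sketch: exact Jaccard distance on character trigrams.
// All names here are illustrative and not part of any Spark API.
object OcrNgramSketch {

  /** Lowercased character n-grams of `text`, collected as a set. */
  def charNGrams(text: String, n: Int = 3): Set[String] =
    text.toLowerCase.sliding(n).toSet

  /** Jaccard distance 1 - |A ∩ B| / |A ∪ B|; two empty sets count as identical. */
  def jaccardDistance(a: Set[String], b: Set[String]): Double = {
    val unionSize = (a union b).size
    if (unionSize == 0) 0.0
    else 1.0 - (a intersect b).size.toDouble / unionSize
  }

  def main(args: Array[String]): Unit = {
    // Noisy OCR output and the two reference strings used in this article
    val query = charNGrams("Hello there 7l | real|y like Spark!")
    val references = Seq(
      "Hello there ?! I really like Spark ❤️!",
      "Can anyone suggest an efficient algorithm"
    )
    // Print the distance from the query to each reference string
    references.foreach { ref =>
      val distance = jaccardDistance(query, charNGrams(ref))
      println(f"$distance%.3f  $ref")
    }
  }
}
```

The noisy query should land much closer to the first reference string than to the second, because each single-character OCR error disturbs only the trigrams that contain it.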

Example Implementation

<code class="scala">import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{RegexTokenizer, NGram, HashingTF, MinHashLSH, MinHashLSHModel}
// toDF needs the session implicits; `spark` is assumed to be the
// SparkSession that spark-shell provides
import spark.implicits._

// Input text (noisy OCR output)
val query = Seq("Hello there 7l | real|y like Spark!").toDF("text")

// Reference data
val db = Seq(
  "Hello there ?! I really like Spark ❤️!", 
  "Can anyone suggest an efficient algorithm"
).toDF("text")

// Create pipeline: characters -> 3-grams -> hashed vectors -> MinHash LSH
val pipeline = new Pipeline().setStages(Array(
  // An empty pattern with minimum token length 1 splits the text into single characters
  new RegexTokenizer().setPattern("").setInputCol("text").setMinTokenLength(1).setOutputCol("tokens"),
  new NGram().setN(3).setInputCol("tokens").setOutputCol("ngrams"),
  new HashingTF().setInputCol("ngrams").setOutputCol("vectors"),
  new MinHashLSH().setInputCol("vectors").setOutputCol("lsh")
))

// Fit on reference data
val model = pipeline.fit(db)

// Transform both input text and reference data
val db_hashed = model.transform(db)
val query_hashed = model.transform(query)

// Approximate similarity join: keep pairs whose Jaccard distance is below 0.75
model.stages.last.asInstanceOf[MinHashLSHModel]
  .approxSimilarityJoin(db_hashed, query_hashed, 0.75).show</code>
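
The threshold passed to approxSimilarityJoin is a maximum Jaccard distance, and MinHash estimates that distance without comparing full n-gram sets. The plain-Scala sketch below illustrates the idea under simplifying assumptions: it uses seeded MurmurHash3 from the Scala standard library as the hash family, which is not the hash family Spark uses internally, and the names `MinHashSketch`, `signature`, `estimatedSimilarity`, and `trigrams` are hypothetical.

```scala
import scala.util.hashing.MurmurHash3

// Plain-Scala sketch of MinHash; illustrative only, not Spark's implementation.
object MinHashSketch {

  /** One min-hash per seed: the smallest seeded hash over the set's elements. */
  def signature(items: Set[String], numHashes: Int = 128): Vector[Int] = {
    require(items.nonEmpty, "MinHash needs at least one element")
    require(numHashes > 0, "numHashes must be positive")
    Vector.tabulate(numHashes) { seed =>
      items.iterator.map(item => MurmurHash3.stringHash(item, seed)).min
    }
  }

  /** Fraction of agreeing signature positions; this estimates Jaccard similarity. */
  def estimatedSimilarity(a: Vector[Int], b: Vector[Int]): Double = {
    require(a.nonEmpty && a.length == b.length, "signatures must have equal, non-zero length")
    a.zip(b).count { case (x, y) => x == y }.toDouble / a.length
  }

  /** Lowercased character trigrams, as in the pipeline's first two stages. */
  def trigrams(text: String): Set[String] = text.toLowerCase.sliding(3).toSet

  def main(args: Array[String]): Unit = {
    val query = signature(trigrams("Hello there 7l | real|y like Spark!"))
    val references = Seq(
      "Hello there ?! I really like Spark ❤️!",
      "Can anyone suggest an efficient algorithm"
    )
    // Estimated Jaccard distance is 1 minus the estimated similarity
    references.foreach { ref =>
      val distance = 1.0 - estimatedSimilarity(query, signature(trigrams(ref)))
      println(f"$distance%.3f  $ref")
    }
  }
}
```

Increasing `numHashes` tightens the estimate at the cost of more hashing work. The same kind of trade-off applies to the number of hash tables configured on MinHashLSH.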


This approach tolerates the OCR errors described above and provides an efficient way to match extracted text against large datasets in Spark.
