How can we efficiently split a text string of concatenated words without spaces into individual words?

Barbara Streisand
Release: 2024-11-04 10:48:02
Original
926 people have browsed it

How can we efficiently split a text string of concatenated words without spaces into individual words?

Splitting Text into a Word List Without Spaces

Problem

Given a text string consisting of concatenated words without spaces:

Input: "tableapplechairtablecupboard..."
Copy after login

How can we efficiently split this text into a list of individual words?

Output: ["table", "apple", "chair", "table", ["cupboard", ["cup", "board"]], ...]
Copy after login

Algorithm

A simple approach is to iteratively find the longest possible word within the text. However, this can lead to suboptimal results.

Frequency-Based Algorithm

Instead, we can exploit the relative frequency of words in the language to improve accuracy:

  1. Model the Word Distribution: Assume words are independently distributed and follow Zipf's law, where word probability is inversely proportional to its rank.
  2. Define Word Cost: The cost of a word is defined as the logarithm of the inverse of its likelihood.
  3. Dynamic Programming Approach:

    • Initialize a cost array where the first element is 0.
    • For each character in the text, find the word that minimizes the total cost for characters up to that point.
    • Backtrack from the end to reconstruct the minimum-cost word sequence.

Code Implementation

<code class="python">from math import log

wordcost = {}  # Dictionary of word costs using Zipf's law

maxword = max(len(word) for word in wordcost)

def infer_spaces(s):
    cost = [0]
    for i in range(1, len(s) + 1):
        candidates = enumerate(reversed(cost[max(0, i - maxword):i]))
        c, k = min((wordcost.get(s[i - k - 1:i], 9e999) + c, k + 1) for k, c in candidates)
        cost.append(c)

    out = []
    i = len(s)
    while i > 0:
        c, k = best_match(i)
        assert c == cost[i]
        out.append(s[i - k:i])
        i -= k

    return " ".join(reversed(out))</code>
Copy after login

Results

This algorithm is able to accurately segment text into a list of words, even in the absence of spaces.

Example:

Input: "tableapplechairtablecupboard..."
Output: ["table", "apple", "chair", "table", ["cupboard", ["cup", "board"]], ...]
Copy after login

Optimizations:

  • Suffix Tree: By building a suffix tree from the word list, the candidate search can be accelerated.
  • Text Block Splitting: For large text inputs, the text can be split into blocks to minimize memory usage while maintaining accuracy.

The above is the detailed content of How can we efficiently split a text string of concatenated words without spaces into individual words?. For more information, please follow other related articles on the PHP Chinese website!

source:php.cn
Statement of this Website
The content of this article is voluntarily contributed by netizens, and the copyright belongs to the original author. This site does not assume corresponding legal responsibility. If you find any content suspected of plagiarism or infringement, please contact admin@php.cn
Latest Articles by Author
Popular Tutorials
More>
Latest Downloads
More>
Web Effects
Website Source Code
Website Materials
Front End Template
About us Disclaimer Sitemap
php.cn:Public welfare online PHP training,Help PHP learners grow quickly!