Given a text string consisting of concatenated words without spaces:
Input: "tableapplechairtablecupboard..."
How can we efficiently split this text into a list of individual words?
Output: ["table", "apple", "chair", "table", ["cupboard", ["cup", "board"]], ...]
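A simple approach is to repeatedly take the longest possible word from the front of the text. However, this greedy strategy can produce poor results: committing to the longest match may leave a remainder that cannot be split into valid words, even when a valid segmentation exists.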
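To make the weakness of the longest-match strategy concrete, here is a minimal sketch. The function name `greedy_split` and the tiny `vocabulary` set are hypothetical and chosen only for illustration.

```python
def greedy_split(text, vocabulary):
    """Repeatedly strip the longest vocabulary word from the front of text.

    Returns a list of words, or None if the greedy choice gets stuck.
    """
    words = []
    pos = 0
    while pos < len(text):
        # Try the longest candidate first, shrinking until a word matches.
        for end in range(len(text), pos, -1):
            if text[pos:end] in vocabulary:
                words.append(text[pos:end])
                pos = end
                break
        else:
            return None  # no vocabulary word starts at pos
    return words


# Hypothetical toy vocabulary, for illustration only.
vocabulary = {"cup", "board", "boards", "cupboard"}

print(greedy_split("cupboard", vocabulary))   # ['cupboard']
print(greedy_split("cupboards", vocabulary))  # None
```

On "cupboards" the greedy pass takes "cupboard" first and is left with "s", which is not a word, although "cup" + "boards" would have been a valid split.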
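Instead, we can exploit the relative frequency of words in the language to improve accuracy: give each word a cost equal to the negative logarithm of its estimated probability, and choose the segmentation with the lowest total cost.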
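As a rough sketch of how such costs might be derived, assume the words are available in a list ordered from most to least frequent. Under Zipf's law, the word of rank r in a list of N words has probability of approximately 1 / (r * log(N)). The helper name `zipf_costs` and the short word list below are assumptions for illustration; the approximation is only meaningful for a reasonably large list.

```python
from math import log


def zipf_costs(ranked_words):
    """Map each word to -log(probability), estimating probability by Zipf's law.

    ranked_words must be ordered from most to least frequent and contain
    at least two words (log(1) == 0 would make the estimate undefined).
    """
    n = len(ranked_words)
    return {
        word: log(rank * log(n))
        for rank, word in enumerate(ranked_words, start=1)
    }


# Placeholder list, ordered by assumed frequency (illustration only).
ranked = ["table", "chair", "apple", "cup", "board", "cupboard"]
costs = zipf_costs(ranked)

# Frequent words are cheap, rare words are expensive.
print(costs["table"] < costs["cupboard"])                 # True
# One rarer word can still be cheaper than two common ones.
print(costs["cupboard"] < costs["cup"] + costs["board"])  # True
```

Because costs add up across words, a segmentation with fewer, plausible words tends to win over one that chops the text into many short fragments.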
Dynamic Programming Approach:
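<code class="python">from math import log

# Words ordered from most to least frequent. This short list is only a
# placeholder; substitute a real frequency-ranked word list.
words = ["table", "chair", "apple", "cup", "board", "cupboard"]

# Dictionary of word costs using Zipf's law: the word of rank r has a
# probability of roughly 1 / (r * log(N)), and cost = -log(probability).
wordcost = {word: log((i + 1) * log(len(words))) for i, word in enumerate(words)}
maxword = max(len(word) for word in wordcost)

def infer_spaces(s):
    # Best match for the first i characters, given cost[0..i-1].
    # Returns a pair (match_cost, match_length).
    def best_match(i):
        candidates = enumerate(reversed(cost[max(0, i - maxword):i]))
        return min((wordcost.get(s[i - k - 1:i], 9e999) + c, k + 1) for k, c in candidates)

    # Build the cost array.
    cost = [0]
    for i in range(1, len(s) + 1):
        c, k = best_match(i)
        cost.append(c)

    # Backtrack to recover the minimal-cost segmentation.
    out = []
    i = len(s)
    while i > 0:
        c, k = best_match(i)
        assert c == cost[i]
        out.append(s[i - k:i])
        i -= k
    return " ".join(reversed(out))</code>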
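This algorithm returns the segmentation with the lowest total cost, that is, the most probable sequence of words under the frequency model. Each position considers at most maxword candidate words, so for a fixed maxword the running time grows linearly with the length of the text. The result is a single space-separated string; call .split() on it to obtain a list of words.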
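If a list is preferred over a joined string, one possible variant is sketched below. It is not the original code: it records the best word length at each position during the forward pass, so the backtracking step needs no recomputation. The names `build_costs` and `segment` and the small ranked word list are hypothetical.

```python
from math import log


def build_costs(ranked_words):
    """Zipf-style costs: -log of an estimated probability 1 / (rank * log(N))."""
    n = len(ranked_words)
    return {w: log(r * log(n)) for r, w in enumerate(ranked_words, start=1)}


def segment(text, wordcost):
    """Return the minimum-cost split of text as a list of words."""
    maxword = max(len(w) for w in wordcost)
    unknown = float("inf")  # cost of any substring that is not a known word

    cost = [0.0]   # cost[i]: best total cost of segmenting text[:i]
    length = [0]   # length[i]: length of the last word in that segmentation
    for i in range(1, len(text) + 1):
        best_cost, best_len = min(
            (cost[j] + wordcost.get(text[j:i], unknown), i - j)
            for j in range(max(0, i - maxword), i)
        )
        cost.append(best_cost)
        length.append(best_len)

    # Walk backwards from the end, peeling off the recorded last word.
    out = []
    i = len(text)
    while i > 0:
        out.append(text[i - length[i]:i])
        i -= length[i]
    return out[::-1]


# Placeholder list, ordered by assumed frequency (illustration only).
ranked = ["table", "chair", "apple", "cup", "board", "cupboard"]
wordcost = build_costs(ranked)

print(segment("tableapplechairtablecupboard", wordcost))
# ['table', 'apple', 'chair', 'table', 'cupboard']
```

With these placeholder costs, "cupboard" is kept whole because its single cost is lower than the combined cost of "cup" and "board". Text containing characters that match no known word falls back to single-character pieces with infinite cost.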
Example:
Input: "tableapplechairtablecupboard..."
Output: ["table", "apple", "chair", "table", ["cupboard", ["cup", "board"]], ...]