Way to find words that differ by only one consonant in a large word list
P粉757640504
P粉757640504 2023-08-15 16:26:13
0
1
597
<p>I have a list of nearly 5000 "fantasy" words written in ASCII text. Some of these words are as follows: </p> <pre class="brush:php;toolbar:false;">txintoq txiqbal txiqfun txiqwek txiqyal txiyton txonmiq txoqwul txoqxik</pre> <p>I want to design an algorithm that checks/verifies that no two words in a list differ by only one "similar consonant".So I'll define "set of similar consonants" like this (tentatively): </p> <pre class="brush:php;toolbar:false;">zs xj pb td kg</pre> <p><em>There may be 3 or more consonants in a set, but I will only show 2 now. As I learn more about which consonants sound alike in the tones of fantasy languages, I need to further refine this definition. </em></p> <p>Therefore, words like the following will be marked as "requires correction" (because they sound too similar): </p> <pre class="brush:php;toolbar:false;">txindan txintan # Only d/t is different xumaq jumaq # Only x/j is different dolpar dolbar # Only a b/p is different</pre> <p>How do I find these words that differ by only one consonant in my list of ~5,000 words in a <em>relatively efficient</em> way? </p> <p>This is a very naive solution that I currently have in mind, as follows: </p> <pre class="brush:php;toolbar:false;">import fs from 'fs' const terms = fs .readFileSync('term.csv', 'utf-8') .trim() .split(/n /) .map(line => { let [term] = line.split(',') return term }) .filter(x => x) const consonantSets = ` zs xj pb td kg` .split(/n /) .map(x => x.split('')) function computeSimilarTerms( term: string, consonantSets: Array<Array<string>>, ) { const termLetters = term?.split('') ?? [] const newTerms: Array<string> = [] for (const consonantSet of consonantSets) { for (const letter of consonantSet) { for (const letter2 of consonantSet) { if (letter === letter2) { continue } let i = 0 while (i < termLetters.length) { const termLetter = termLetters[i] if (termLetter === letter) { const newTerm = termLetters.concat() termLetters[i] = letter2 newTerms.push(newTerm.join('')) } i } } } } return newTerms } for (const term of terms) { const similarTerms = computeSimilarTerms(term, consonantSets) similarTerms.forEach(similarTerm => { if (terms.includes(similarTerm)) { console.log(term, similarTerm) } }) }</pre> <p>How can this be accomplished with relatively little brute force? And this solution is incomplete because it does not build <em>all possible similar word combinations</em>. So somewhere in the algorithm it should be able to do this. Any ideas? </p>
P粉757640504
P粉757640504

reply all(1)
P粉238433862

Choose one consonant in each group to be the "representative" of that group. Then, build a map that groups words together such that they become identical when their consonants are replaced by their representative consonants.

Important note: This method only works when the consonant groups form equivalence classes. In particular, consonant similarity must be transitive. If 'bp' is similar, 'bv' is similar, but 'pv' is not similar, then this method is invalid.

The following is the code for the example in Python; I let you write the JavaScript code.

  • f is a mapping that maps each consonant to its representative consonant;
  • d is a map that maps each representative word to a list of words with this representative.
bigwordlist = '''dolbar
dolpar
jumaq
txindan
txintan
txintoq
txiqbal
txiqfun
txiqwek
txiqyal
txinton
txonmiq
txoqwul
txoqxik
xumaq'''.splitlines()

consonant_groups = '''zs
xj
pb
td
kg'''.splitlines()

f = {}
for g in consonant_groups:
    for c in g:
        f[c] = g[0]

print(f)
# {'z': 'z', 's': 'z', 'x': 'x', 'j': 'x', 'p': 'p', 'b': 'p', 't': 't', 'd': 't', 'k': 'k', 'g': 'k'}
    
d = {}
for word in bigwordlist:
    key = ''.join(f.get(c, c) for c in word)
    d.setdefault(key, []).append(word)

print(d)
# {'tolpar': ['dolbar', 'dolpar'], 'xumaq': ['jumaq', 'xumaq'], 'txintan': ['txindan', 'txintan'], 'txintoq': ['txintoq'], 'txiqpal': ['txiqbal'], 'txiqfun': ['txiqfun'], 'txiqwek': ['txiqwek'], 'txiqyal': ['txiqyal'], 'txinton': ['txinton'], 'txonmiq': ['txonmiq'], 'txoqwul': ['txoqwul'], 'txoqxik': ['txoqxik']}

Finally, we can see which words are similar:

print([g for g in d.values() if len(g) > 1])
# [['dolbar', 'dolpar'], ['jumaq', 'xumaq'], ['txindan', 'txintan']]
Latest Downloads
More>
Web Effects
Website Source Code
Website Materials
Front End Template