Way to find words that differ by only one consonant in a large word list
P粉757640504
2023-08-15 16:26:13
<p>I have a list of nearly 5000 "fantasy" words written in ASCII text. Some of these words are as follows: </p>
<pre class="brush:php;toolbar:false;">txintoq
txiqbal
txiqfun
txiqwek
txiqyal
txiyton
txonmiq
txoqwul
txoqxik</pre>
<p>I want to design an algorithm that checks/verifies that no two words in a list differ by only one "similar consonant".So I'll define "set of similar consonants" like this (tentatively): </p>
<pre class="brush:php;toolbar:false;">zs
xj
pb
td
kg</pre>
<p><em>There may be 3 or more consonants in a set, but I will only show 2 now. As I learn more about which consonants sound alike in the tones of fantasy languages, I need to further refine this definition. </em></p>
<p>Therefore, words like the following will be marked as "requires correction" (because they sound too similar): </p>
<pre class="brush:php;toolbar:false;">txindan
txintan # Only d/t is different
xumaq
jumaq # Only x/j is different
dolpar
dolbar # Only a b/p is different</pre>
<p>How do I find these words that differ by only one consonant in my list of ~5,000 words in a <em>relatively efficient</em> way? </p>
<p>This is a very naive solution that I currently have in mind, as follows: </p>
<pre class="brush:php;toolbar:false;">import fs from 'fs'
const terms = fs
.readFileSync('term.csv', 'utf-8')
.trim()
.split(/n /)
.map(line => {
let [term] = line.split(',')
return term
})
.filter(x => x)
const consonantSets = `
zs
xj
pb
td
kg`
.split(/n /)
.map(x => x.split(''))
function computeSimilarTerms(
term: string,
consonantSets: Array<Array<string>>,
) {
const termLetters = term?.split('') ?? []
const newTerms: Array<string> = []
for (const consonantSet of consonantSets) {
for (const letter of consonantSet) {
for (const letter2 of consonantSet) {
if (letter === letter2) {
continue
}
let i = 0
while (i < termLetters.length) {
const termLetter = termLetters[i]
if (termLetter === letter) {
const newTerm = termLetters.concat()
termLetters[i] = letter2
newTerms.push(newTerm.join(''))
}
i
}
}
}
}
return newTerms
}
for (const term of terms) {
const similarTerms = computeSimilarTerms(term, consonantSets)
similarTerms.forEach(similarTerm => {
if (terms.includes(similarTerm)) {
console.log(term, similarTerm)
}
})
}</pre>
<p>How can this be accomplished with relatively little brute force? And this solution is incomplete because it does not build <em>all possible similar word combinations</em>. So somewhere in the algorithm it should be able to do this. Any ideas? </p>
Choose one consonant in each group to be the "representative" of that group. Then, build a map that groups words together such that they become identical when their consonants are replaced by their representative consonants.
Important note: This method only works when the consonant groups form equivalence classes. In particular, consonant similarity must be transitive. If
'bp'
is similar,'bv'
is similar, but'pv'
is not similar, then this method is invalid.The following is the code for the example in Python; I let you write the JavaScript code.
f
is a mapping that maps each consonant to its representative consonant;d
is a map that maps each representative word to a list of words with this representative.Finally, we can see which words are similar: