I'm an algorithm noob, but I happen to have computing resources available, so I followed the original idea (brute-force, brainless parallel computation) and worked through the question's example:
run.py
from multiprocessing import Pool, cpu_count

def do_align(item):
    # Scan the whole background database for one candidate; keep the db
    # lines that differ from the candidate in fewer than 4 positions.
    with open("bg_db.txt") as fh, open("result.txt", "a") as rh:
        for db_line in fh:
            db_line = db_line.strip()
            counts = 0
            for a, b in zip(db_line, item):
                if a != b:
                    counts += 1
                    if counts >= 4:
                        break
            if counts < 4:
                rh.write("{}\n".format(db_line))

def main():
    # One worker per core; each call handles one candidate line.
    pool = Pool(cpu_count())
    with open("candidates.txt") as fh:
        pool.map(do_align, map(str.strip, fh))
    pool.close()
    pool.join()

if __name__ == "__main__":
    main()
Quickly generate some test data:
import random
import string

def id_generator(size=8, chars=string.ascii_letters + string.digits):
    return ''.join(random.choice(chars) for _ in range(size))

with open("candidates.txt", "w") as fh:
    for i in range(10000):
        fh.write("{}\n".format(id_generator(20, "ATCG")))

with open("bg_db.txt", "w") as fh:
    for i in range(1000000):
        fh.write("{}\n".format(id_generator(20, "ATCG")))
So I built a candidates.txt with 10,000 lines and a bg_db.txt with 1,000,000 lines, and ran it to see the time:
$time python run.py
real 15m45.445s
user 1362m41.712s
sys 1m12.099s
The asker's actual data is a candidates.txt with tens of millions of rows and a bg_db.txt with one billion rows. A rough estimate of the time: 16 * 1000 / (60 * 24) = 11.11, i.e. about 11 days. This is the configuration of the computing node I used:
CPU Utilization: user 1.0%, sys 0.0%, idle 98.9%
Hardware
CPUs: 96 x 2.10 GHz
Memory (RAM): 1009.68 GB
Local Disk: Using 351.623 of 941.596 GB
Most Full Disk Partition: 53.5% used.
It really takes a long time and urgently needs optimization
Based on @武合之zon's idea: using double the space and representing the four kinds of characters with bits can simplify the later counting of mismatched characters.
The idea is as follows:
First, compare for 4 differences: delete the characters at some 4 positions of the string, shortening it to length 16 (e.g. TGACGGGTGACACCCA after deletion), then match exactly identical strings. Use a map, storing TGACGGGTGACACCCA as the key and the four deleted characters as the value.
Second, compare for 3 differences: building on the above, compare the stored values to find ones identical in 3 of their characters.
And so on.
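A rough Python sketch of this deletion-key filter, with k = 4 deletions as described above (the helper names are my own): any two 20-mers that differ in at most 4 positions share at least one (positions, remainder) key, so hash lookups replace the all-pairs scan.

```python
from itertools import combinations
from collections import defaultdict

def deletion_keys(seq, k=4):
    """Yield (deleted_positions, remaining_string) for every way of
    deleting k characters from seq."""
    for pos in combinations(range(len(seq)), k):
        kept = "".join(c for i, c in enumerate(seq) if i not in pos)
        yield (pos, kept)

def build_index(db_lines, k=4):
    """Map each deletion key to the db lines that produce it."""
    index = defaultdict(list)
    for line in db_lines:
        for key in deletion_keys(line, k):
            index[key].append(line)
    return index

def lookup(candidate, index, k=4):
    """Return db lines differing from candidate in at most k positions:
    they must share a key with the same deleted positions."""
    hits = set()
    for key in deletion_keys(candidate, k):
        hits.update(index.get(key, ()))
    return hits
```

Note the trade-off: each 20-mer expands to C(20, 4) = 4845 keys, so this buys speed with a lot of memory; for a billion-row bg_db.txt the index would have to be sharded across machines.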
Here is another idea: encode the four characters as 00, 01, 10, 11. When comparing, first XOR the two encodings, so identical bases become 00 and differing bases become 01, 10, or 11. Then OR each adjacent bit pair of the XOR result, so 00 becomes 0 and 01, 10, or 11 becomes 1. Finally count the 1 bits. Since these are all bit operations, it should be very fast in theory. But I'm terrible at C, so I can't post the code.
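Since that answer comes without code, here is a minimal Python sketch of the 2-bit scheme (the encoding table and helper names are my own choices):

```python
# Assumed 2-bit encoding; any fixed assignment of the four bases works.
CODE = {"A": 0b00, "T": 0b01, "C": 0b10, "G": 0b11}

def encode(seq):
    """Pack a DNA string into a single int, 2 bits per base."""
    v = 0
    for ch in seq:
        v = (v << 2) | CODE[ch]
    return v

def mismatches(a, b, length):
    """Count positions where two packed sequences of `length` bases differ."""
    x = a ^ b                       # equal base -> 00, different -> 01/10/11
    mask = int("01" * length, 2)    # 0b0101...01: one low bit per base
    folded = (x | (x >> 1)) & mask  # OR each adjacent bit pair down to one bit
    return bin(folded).count("1")   # number of differing positions
```

For example, `mismatches(encode("ATCG"), encode("ATCC"), 4)` returns 1. In C the final count would be a single popcount instruction; on Python 3.10+ `folded.bit_count()` does the same.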
Sit back and wait for the master to show up
You could look into GPU parallel computing such as CUDA; for this kind of massively repeated simple operation the performance improvement is very noticeable.