I'm an algorithm noob, but I happen to have computing resources available, so I followed the original idea (brute-force, brainless parallel computation) and worked through the question's example:
run.py
from multiprocessing import Pool, cpu_count

def do_align(item):
    # Scan the whole background database for one candidate; keep the db
    # lines that differ from the candidate in fewer than 4 positions.
    with open("bg_db.txt") as fh, open("result.txt", "a") as rh:
        for db_line in fh:
            db_line = db_line.strip()
            counts = 0
            for a, b in zip(db_line, item):
                if a != b:
                    counts += 1
                    if counts >= 4:
                        break
            if counts < 4:
                rh.write("{}\n".format(db_line))

def main():
    # One worker per core; each call handles one candidate line.
    pool = Pool(cpu_count())
    with open("candidates.txt") as fh:
        pool.map(do_align, map(str.strip, fh))
    pool.close()
    pool.join()

if __name__ == "__main__":
    main()
Quickly generate some test data:
import random
import string

def id_generator(size=8, chars=string.ascii_letters + string.digits):
    return ''.join(random.choice(chars) for _ in range(size))

with open("candidates.txt", "w") as fh:
    for i in range(10000):
        fh.write("{}\n".format(id_generator(20, "ATCG")))

with open("bg_db.txt", "w") as fh:
    for i in range(1000000):
        fh.write("{}\n".format(id_generator(20, "ATCG")))
So I built a candidates.txt with 10,000 lines and a bg_db.txt with 1,000,000 lines, and ran it to see the time:
$time python run.py
real 15m45.445s
user 1362m41.712s
sys 1m12.099s
The asker's actual data is a candidates.txt with tens of millions of rows and a bg_db.txt with one billion rows. A rough estimate of the time: 16 * 1000 / (60 * 24) = 11.11, i.e. about 11 days. This is the configuration of the computing node I used:
CPU Utilization: user 1.0%, sys 0.0%, idle 98.9%
Hardware
CPUs: 96 x 2.10 GHz
Memory (RAM): 1009.68 GB
Local Disk: Using 351.623 of 941.596 GB
Most Full Disk Partition: 53.5% used.
It really takes a long time and urgently needs optimization
Based on @武合之zon's idea: using double the space and representing the four kinds of characters with bits can simplify the later counting of mismatched characters.
The idea is as follows:
First, compare for 4 differences: delete the characters at some 4 positions of the string, shortening it to length 16 (e.g. TGACGGGTGACACCCA after deletion), then match exactly identical strings. Use a map, storing TGACGGGTGACACCCA as the key and the four deleted characters as the value.
Second, compare for 3 differences: building on the above, compare the stored values to find ones identical in 3 of their characters.
And so on.
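A rough Python sketch of this deletion-key filter, with k = 4 deletions as described above (the helper names are my own): any two 20-mers that differ in at most 4 positions share at least one (positions, remainder) key, so hash lookups replace the all-pairs scan.

```python
from itertools import combinations
from collections import defaultdict

def deletion_keys(seq, k=4):
    """Yield (deleted_positions, remaining_string) for every way of
    deleting k characters from seq."""
    for pos in combinations(range(len(seq)), k):
        kept = "".join(c for i, c in enumerate(seq) if i not in pos)
        yield (pos, kept)

def build_index(db_lines, k=4):
    """Map each deletion key to the db lines that produce it."""
    index = defaultdict(list)
    for line in db_lines:
        for key in deletion_keys(line, k):
            index[key].append(line)
    return index

def lookup(candidate, index, k=4):
    """Return db lines differing from candidate in at most k positions:
    they must share a key with the same deleted positions."""
    hits = set()
    for key in deletion_keys(candidate, k):
        hits.update(index.get(key, ()))
    return hits
```

Note the trade-off: each 20-mer expands to C(20, 4) = 4845 keys, so this buys speed with a lot of memory; for a billion-row bg_db.txt the index would have to be sharded across machines.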
Here is another idea: encode the four characters as 00, 01, 10, 11. When comparing, first XOR the two encodings, so identical bases become 00 and differing bases become 01, 10, or 11. Then OR each adjacent bit pair of the XOR result, so 00 becomes 0 and 01, 10, or 11 becomes 1. Finally count the 1 bits. Since these are all bit operations, it should be very fast in theory. But I'm terrible at C, so I can't post the code.
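Since that answer comes without code, here is a minimal Python sketch of the 2-bit scheme (the encoding table and helper names are my own choices):

```python
# Assumed 2-bit encoding; any fixed assignment of the four bases works.
CODE = {"A": 0b00, "T": 0b01, "C": 0b10, "G": 0b11}

def encode(seq):
    """Pack a DNA string into a single int, 2 bits per base."""
    v = 0
    for ch in seq:
        v = (v << 2) | CODE[ch]
    return v

def mismatches(a, b, length):
    """Count positions where two packed sequences of `length` bases differ."""
    x = a ^ b                       # equal base -> 00, different -> 01/10/11
    mask = int("01" * length, 2)    # 0b0101...01: one low bit per base
    folded = (x | (x >> 1)) & mask  # OR each adjacent bit pair down to one bit
    return bin(folded).count("1")   # number of differing positions
```

For example, `mismatches(encode("ATCG"), encode("ATCC"), 4)` returns 1. In C the final count would be a single popcount instruction; on Python 3.10+ `folded.bit_count()` does the same.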
Sit back and wait for the master to show up
You could look into GPU parallel computing such as CUDA; for this kind of massively repeated simple operation the performance improvement is very noticeable.