java - 如何提高大量的字符串比较速度

Question

补充数据链接： candidates: https://pan.baidu.com/s/1nvGWbrVbg_db: https://pan.baidu.com/s/1sllFLAd 每个字符串长度都是23，只要前面20个字符就行，由于数据太大，我只传了五分之一，大神们可以挑战一下，有...

PHPz · Answer

Write an idea

candidates = [
    'GGGAGCAGGCAAGGACTCTG',
    'GCTCGGGCTTGTCCACAGGA',
    '...',
    # 被你看出来啦，这些其实人类基因的片段
]

bg_db = [
    'CTGCTGACGGGTGACACCCA',
    'AGGAACTGGTGCTTGATGGC',
    '...',
    # 这个更多，有十亿左右
]

Because your data is actually very unique, it can be streamlined here.
Because all strings are 20 characters in length and consist of ATCGfour characters. Then they can be converted into integers for comparison.
The binary representation is as follows

A  ==>  00
T  ==>  01
C  ==>  10
G  ==>  11

Because the length of a string is fixed and each character can be represented by 2 bits, each string can be expressed as a 40位的整数。可以表示为32+8的形式，也可以直接使用64位整形。建议使用Clanguage.

Let’s talk about comparison.
Because we need to find all records with a difference of less than or equal to 4 for each candidate in bg_db, so we only need to do ^ bitwise XOR of two integers Operation, the binary 1 in the result does not exceed 8, and these 8 1s can only be divided into 4 groups at most, it is possible that it meets the requirements (00^11 =11,10^01=11). 每一个candidate在bg_db中与之差异小于等于4的所有记录,所以只要两个整数一做^按位异或操作，结果中二进制中1不超过8个,且这不超过8个1最多只能分为4个组的才有可能是符合要求的（00^11=11,10^01=11）。
把结果的40个比特位分作20个组，那么就是说最多只有4个组为b01 b10 b11这三个值，其余的全部为b00。
那么比较算法就很好写了。
可以对每个字节(四个组)获取其中有几个组是为三个非零值的，来简介获取整体的比较结果。
因为每个字节只有256种可能的值~~，而符合条件的值只有3^4=81Divide the 40 bits of the result into 20 groups, which means that there are only 4 groups at most b01 b10 b11 These three values, the rest are all b00.~~ Then the comparison algorithm is easy to write.
You can obtain for each byte (four groups) how many groups have three non-zero values to briefly obtain the overall comparison result.

Because each byte has only 256 possible values , and the qualified values are only 3^4=81 🎜, so the result can be stored first Get up and get it. 🎜Here is a function to get how many non-zero groups there are in the result. 🎜
/*****************下面table中值的生成******//** int i; for( i=0;i<256;++i){ int t =0; t += (i&0x01 || i&0x02)?1:0; t += (i&0x04 || i&0x08)?1:0; t += (i&0x10 || i&0x20)?1:0; t += (i&0x40 || i&0x80)?1:0; printf("%d,",t); if(i%10 ==9){putchar(' ');} } ********************************************// int table[] = { 0,1,1,1,1,2,2,2,1,2, 2,2,1,2,2,2,1,2,2,2, 2,3,3,3,2,3,3,3,2,3, 3,3,1,2,2,2,2,3,3,3, 2,3,3,3,2,3,3,3,1,2, 2,2,2,3,3,3,2,3,3,3, 2,3,3,3,1,2,2,2,2,3, 3,3,2,3,3,3,2,3,3,3, 2,3,3,3,3,4,4,4,3,4, 4,4,3,4,4,4,2,3,3,3, 3,4,4,4,3,4,4,4,3,4, 4,4,2,3,3,3,3,4,4,4, 3,4,4,4,3,4,4,4,1,2, 2,2,2,3,3,3,2,3,3,3, 2,3,3,3,2,3,3,3,3,4, 4,4,3,4,4,4,3,4,4,4, 2,3,3,3,3,4,4,4,3,4, 4,4,3,4,4,4,2,3,3,3, 3,4,4,4,3,4,4,4,3,4, 4,4,1,2,2,2,2,3,3,3, 2,3,3,3,2,3,3,3,2,3, 3,3,3,4,4,4,3,4,4,4, 3,4,4,4,2,3,3,3,3,4, 4,4,3,4,4,4,3,4,4,4, 2,3,3,3,3,4,4,4,3,4, 4,4,3,4,4,4 }; int getCount(uint64_t cmpresult) { uint8_t* pb = &cmpresult; // 这里假设是小段模式，且之前比较结果是存在低40位 return table[pb[0]]+table[pb[1]]+table[pb[2]]+table[pb[3]]+table[pb[4]]; }

阿神 · Answer

First of all, your time estimate is completely wrong. This kind of large-scale data processing needs to run tens of thousands of items and last for more than ten seconds before it can be used for multiplication to calculate the total time. If only one item is counted, this time will be almost They are all the overhead of the initialization process, not the key IO and CPU overhead

The following text

The four possibilities of ACTG are equivalent to 2 bits. It is too wasteful to use one character to represent one gene position. One character is 8 bits and can hold 4 gene positions

Even if you don’t use any algorithm and just write your 20 genes into binary form, you can save 5 times the time

In addition, after looping 20 times, the number of CPU instructions is 20*n, and n is estimated to be at least 3. But for binary, the XOR operation for comparison is directly a CPU instruction, and the number of instructions is 1

巴扎黑 · Answer

I don’t know much about the algorithm, but speaking from experience, complex algorithms take longer and are not as fast as this simple and crude one

You can consider multi-threading and clustering to process data

By the way, the Hamming distance seems to be able to calculate this

伊谢尔伦 · Answer

Also no algorithm is used, brute force solution is written in C

On my machine (CPU: Core 2 Duo E7500, RAM: 4G, OS: Fedora 19), test results

candidates    bg.db        cost
10000    1000000    318758165微秒
500      1000000    14950302微秒

If you switch to the subject's 24-core CPU, the performance will be improved by 20 times, and then add 48 machines to operate together. 5000W operations will take 15 seconds, and the time will be
10000000 * 1000000000 / 500 / 1000000 * 15 / 20 / 48 / 3600 / 24 = 3.616898 days

#include 
#include 
#include 
#include 
#include 

#define START_CC(flag)  \
    struct timeval st_##flag##_beg; \
    gettimeofday(&st_##flag##_beg, 0)

#define STOP_CC(flag)   \
    struct timeval st_##flag##_end; \
    gettimeofday(&st_##flag##_end, 0)

#define PRINT_CC(flag)  \
    double cost_##flag = 0.0L; \
    cost_##flag = (double)(st_##flag##_end.tv_sec - st_##flag##_beg.tv_sec); \
    cost_##flag = cost_##flag * 1000000L + (double)(st_##flag##_end.tv_usec - st_##flag##_beg.tv_usec);    \
    printf(#flag" cost time %.6lf microsecond.
", cost_##flag);

#define GENEORDER_CODE_LENGTH 20 + 1

typedef struct GeneOrder
{
    char code[GENEORDER_CODE_LENGTH];
}GeneOrder, *GeneOrderPtr;

typedef struct GOArray
{
    size_t          capacity;
    size_t          length;
    GeneOrderPtr    data;
}GOArray;

GOArray createGOAarray(size_t capacity)
{
    GOArray goa;

    goa.capacity    = capacity;
    goa.length      = 0;
    goa.data        = (GeneOrderPtr)malloc(capacity * sizeof(GeneOrder));

    return goa;
}

void destroyGOArray(GOArray* goa)
{
    if (goa->capacity > 0) {
        free(goa->data);
    }
}

bool readGOFile(char const *file, GOArray *goarray)
{
    FILE* fp = NULL;

    if ((fp = fopen(file, "r+")) == NULL) {
        return false;
    }

    char buff[64];

    while (fgets(buff, 64, fp) != NULL) {
        if (goarray->length < goarray->capacity) {
            memcpy(goarray->data[goarray->length].code,
                buff,
                GENEORDER_CODE_LENGTH * sizeof(char)
            );
            goarray->data[goarray->length].code[GENEORDER_CODE_LENGTH - 1] = 'gcc -Wall -Wextra -o ccs main.c -std=c99 -Os && ./ccs candidate.list bg.db';
            goarray->length ++;
        } else {
            fclose(fp);
            return true;
        }
    }

    fclose(fp);
    return true;
}

int main(int argc, char* argv[])
{
    (void)argc;

    GOArray condgo  = createGOAarray(10000);
    GOArray bggo    = createGOAarray(1000000);

    printf("loading ...
");

    START_CC(loading);
    if (!readGOFile(argv[1], &condgo) || !readGOFile(argv[2], &bggo)) {
        destroyGOArray(&condgo);
        destroyGOArray(&bggo);
        return -1;
    }
    STOP_CC(loading);
    PRINT_CC(loading);


    int count = 0;

    START_CC(compare);
    for (size_t i = 0;i < 500;i ++) {
        const GeneOrderPtr gop = condgo.data + i;
        for (size_t j = 0;j < bggo.length;j ++) {
            const GeneOrderPtr inner_gop = bggo.data + j;
            int inner_count = 0;

            for (size_t k = 0;k < 20;k ++) {
                if (gop->code[k] != inner_gop->code[k]) {
                    if (++inner_count > 4) {
                        break;
                    }
                }
            }

            if (inner_count <= 4) {
            #ifdef DBGPRINT
                printf("%d %s - %s
", i, gop->code, inner_gop->code);
            #endif
                count++;
            }
        }
    }
    STOP_CC(compare);
    PRINT_CC(compare);

    printf("result = %d
", count);

    destroyGOArray(&condgo);
    destroyGOArray(&bggo);

    return 0;
}

Compile parameters & run

#include 
#include 
#include 
#include 
#include 
#include 

#define START_CC(flag)              \
    struct timeval st_##flag##_beg; \
    gettimeofday(&st_##flag##_beg, 0)

#define STOP_CC(flag)               \
    struct timeval st_##flag##_end; \
    gettimeofday(&st_##flag##_end, 0)

#define PRINT_CC(flag)                                                                                  \
    double cost_##flag = 0.0L;                                                                          \
    cost_##flag = (double)(st_##flag##_end.tv_sec - st_##flag##_beg.tv_sec);                            \
    cost_##flag = cost_##flag * 1000000L + (double)(st_##flag##_end.tv_usec - st_##flag##_beg.tv_usec); \
    printf(#flag " cost time %.6lf microsecond.
", cost_##flag);

#define GENEORDER_CODE_LENGTH 20 + 1

typedef struct GeneOrder {
    char code[GENEORDER_CODE_LENGTH];
} GeneOrder, *GeneOrderPtr;

typedef struct GOArray {
    size_t capacity;
    size_t length;
    GeneOrderPtr data;
} GOArray;

GOArray createGOAarray(size_t capacity)
{
    GOArray goa;

    goa.capacity = capacity;
    goa.length = 0;
    goa.data = (GeneOrderPtr)malloc(capacity * sizeof(GeneOrder));

    return goa;
}

void destroyGOArray(GOArray* goa)
{
    if (goa->capacity > 0) {
        free(goa->data);
    }
}

bool readGOFile(char const* file, GOArray* goarray)
{
    FILE* fp = NULL;

    if ((fp = fopen(file, "r+")) == NULL) {
        return false;
    }

    char buff[64];

    while (fgets(buff, 64, fp) != NULL) {
        if (goarray->length < goarray->capacity) {
            memcpy(goarray->data[goarray->length].code, buff,
                GENEORDER_CODE_LENGTH * sizeof(char));
            goarray->data[goarray->length].code[GENEORDER_CODE_LENGTH - 1] = 'gcc -Wall -Wextra -o ccs main.c -std=c99 -O3 -lpthread && ./ccs candidate.list bg.db';
            goarray->length++;
        }
        else {
            fclose(fp);
            return true;
        }
    }

    fclose(fp);
    return true;
}

typedef struct ProcessST {
    GOArray* pcond;
    GOArray* pbg;
    size_t beg;
    size_t end; // [beg, end)
} ProcessST;

void* processThread(void* parg)
{
    ProcessST* ppst = (ProcessST*)parg;
    GOArray* pcond = ppst->pcond;
    GOArray* pbg = ppst->pbg;
    int count = 0;

    for (size_t i = ppst->beg; i < ppst->end; i++) {
        const GeneOrderPtr gop = pcond->data + i;
        for (size_t j = 0; j < pbg->length; j++) {
            const GeneOrderPtr inner_gop = pbg->data + j;
            int inner_count = 0;

            for (size_t k = 0; k < 20; k++) {
                if (gop->code[k] != inner_gop->code[k]) {
                    if (++inner_count > 4) {
                        break;
                    }
                }
            }

            if (inner_count <= 4) {
#ifdef DBGPRINT
                printf("%d %s - %s
", i, gop->code, inner_gop->code);
#endif
                count++;
            }
        }
    }

    return (void*)count;
}

int main(int argc, char* argv[])
{
    (void)argc;

    GOArray condgo = createGOAarray(10000);
    GOArray bggo = createGOAarray(1000000);

    printf("loading ...
");

    START_CC(loading);
    if (!readGOFile(argv[1], &condgo) || !readGOFile(argv[2], &bggo)) {
        destroyGOArray(&condgo);
        destroyGOArray(&bggo);
        return -1;
    }
    STOP_CC(loading);
    PRINT_CC(loading);

    size_t range[] = { 0, 250, 500 };
    pthread_t thr[2] = { 0 };
    ProcessST pst[2];

    START_CC(compare);

    for (size_t i = 0; i < 2; i++) {
        pst[i].pcond = &condgo;
        pst[i].pbg = &bggo;
        pst[i].beg = range[i];
        pst[i].end = range[i + 1];
        pthread_create(&thr[i], NULL, processThread, &pst[i]);
    }

    int count = 0;
    int* ret = NULL;

    for (size_t i = 0; i < 2; i++) {
        pthread_join(thr[i], (void**)&ret);
        count += (int)ret;
    }

    STOP_CC(compare);
    PRINT_CC(compare);

    printf("result = %d
", count);

    destroyGOArray(&condgo);
    destroyGOArray(&bggo);

    return 0;
}

If you change it to multi-threading, the speed will be faster. On my machine, the performance improvement is not very big, but newer CPUs have hyper-threading technology, and the speed is expected to be better. . . 2个线程简单使用500条candidates测试，速度可以提升到9040257微秒，线程增加到4 rrreee

Compile and test

rrreee

巴扎黑 · Answer

Sorry, I saw someone else replying today.
Looking at the question carefully, I found that I thought it was just a match.
So I proposed to use ac automatic machine.

But the purpose of the question is to find the difference in the sequence.
This is to find the edit distance between the two.
wiki: Edit distance
wiki: Ravenstein distance

When I used to brush OJ, I used DP (dynamic programming) to find the minimum number of edits to convert a string into another string.

for(i:0->len)
    for(j:0->len)
        str1[i]==str2[j] ? cost=0 : cost=1
        dp[i,j]=min(
            dp[i-1, j  ] + 1,     // 刪除
            dp[i  , j-1] + 1,     // 插入
            dp[i-1, j-1] + cost   // 替換
        )

For example:

str1    "abcdef"
str2    "acddff"

str2 converted to str1

Insert b Count once
Delete d Count once
Modify f Count once

For the subject’s ATCG gene sequence, do you only need to find the modified one?
However, how should it be calculated like this
ATCGATCG
TCGATCGA
?

If you just find modifications, just compare str1[i] and str2[i] directly.

for(i:0->len)
if(str1[i]!=str2[i] )++count;

Inspired by @rockford.
We can preprocess the raw data.

Strings in

candidates
GGGAGCAGGCAAGGACTCTG
A5 T2 C4 G9

Extra data after processing A5 T2 C4 G9

String in

bg_db
CTGCTGACGGGTGACACCCA
A4 T3 C7 G6

Extra data after processing A4 T3 C7 G6

A5 -> A4 is recorded as -1
T2 -> T3 is recorded as +1
C4 -> C7 is recorded as +3
G9 -> G6 is recorded as -3

Obviously A can only become TCG if modified.
Similarly, we only need to count all + or all -
to know at least how many differences they have.
Anything greater than 4 does not need to be compared.

By first comparing the pre-processed additional data, and then performing the comparison through a single comparison algorithm.
(work overtime on Saturday, write this after get off work)

天蓬老师 · Answer

Your individual tasks are determined. What you need is to send these tasks to workers. Such calculations are not performed synchronously in a single process.
In fact, it is equivalent to you having [a, b] and [c, d] to compare, your task is

[a, c]
[a, d]
[b, c]
[b, d]

If you are synchronous serial, the time you need is 4 * single time
If you have 4 CPUs or 4 machines in parallel, your time is almost single time

So calculations like genomes are basically completed using multi-core parallel tasks on large machines. Basically, the reference principles are the principles of the paper Google MapReduce

黄舟 · Answer

I’m not good at algorithms, but for a large amount of data like yours, a computer is definitely not enough to compare. For CPU-intensive data tasks like yours, I agree with what others have said about using a cluster or multi-process method to calculate, which is what we Use the map-reduce model to calculate
map is mapping. You first map each of your candidates to bg_db one by one to form a data structure like this(candidates,bg_db)
Make it into a queue and then hand it over to different servers. Each server uses multiple processes to do it. This is the only way to calculate, but the amount of data you have is too large. Find a way to allocate your tasks and calculate in parallel,

PHP中文网 · Answer

You can try to use a dictionary tree to save all strings. Then you can use the method of traversing the dictionary tree when querying.
When traversing the tree, you can maintain a sequence of current nodes. This sequence stores the number of mismatches between the currently traversed node and the corresponding node.
When traversing the next node, try to go down all the nodes in the current sequence and form a new node sequence.
The advantage is that the current bits of many strings can be compared together, which can save some time. Since there are not many choices for each position and the mismatch is not large, there should be no excessive expansion of the current node sequence. (This is a guess... I haven’t verified it too carefully...)

def match(candidate, root):
  nset = [[],[]]
  currentId = 0
  nset[currentId].append((root, 0))
  for ch in candidate:
    nextId = 1 - currentId
    for item in nset[currentId]:
      node, mismatch = item
      for nx in node.child:
        if nx.key == ch:
          nset[nextId].append((nx, mismatch))
        else:
          if mismatch:
            nset[nextId].append((nx, mismatch - 1))
    nset[currentId].clear()
    currentId = 1 - currentId
  return nset[currentId]

The above code is a rough idea. It would be much faster if written in C++.
The entire process can be done using clusters for distributed computing.

PHP中文网 · Answer

Visually the questioner does not have multiple machines for him to calculate.
I have a simple idea. Calculate the sum of the alphabetical numbers of each string (A:0, T:1, C:2, G:3). First calculate the two The difference between the ordinal sum of the string cannot exceed 12 at most. Four A's become four G's. If the difference is less than 12, it will be processed.
It's just a rough idea. The specific weight can be set separately. In theory, it can Much faster.
There is another algorithm that calculates the string edit distance (the minimum number of edits to add, delete, and modify one string to another string). I can’t remember it at the moment. You can check it yourself.

巴扎黑 · Answer

I use blast blastn-short

:)