php - Import 100,000 pieces of data. There are 1 million pieces of data in the database. How to determine duplication?
PHP中文网 2017-05-16 13:09:47

Specific requirements

  1. The database table already contains 1 million rows;

  2. 100,000 rows need to be imported;

  3. If any row to be imported already exists in the table, the import should be terminated.

Question

How can each row be compared against the existing data quickly during the import, and how can the data be rolled back if the import fails?

PHP中文网

Certified Senior PHP Lecturer

All replies (4)
phpcn_u1582

Assume the target table is called target (1 million rows) and the table to be imported is called source (100,000 rows).
Most of the time in this scenario goes to determining whether records in the two tables are identical. I assume the author considers two records the same when all of their fields match (except the id field).

Compute a hash-like value for each record and store it in a new column (hashValue) added to both target and source. That is about 1.1 million hash computations in total.
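The per-record hash can be sketched as follows (a minimal illustration in Python; `row_hash` and the column names are made up for the example — the idea is simply to hash every field except id in a stable order):

```python
import hashlib

def row_hash(row, skip=("id",)):
    """Hash one record, ignoring the id column.

    `row` is a dict of column name -> value. Columns are sorted so the
    result does not depend on field order.
    """
    parts = [f"{k}={row[k]}" for k in sorted(row) if k not in skip]
    return hashlib.md5("|".join(parts).encode("utf-8")).hexdigest()

a = {"id": 1, "name": "alice", "age": 30}
b = {"id": 99, "name": "alice", "age": 30}  # same data, different id
assert row_hash(a) == row_hash(b)  # considered duplicates
```

Comparing one short hash column (ideally indexed) is much cheaper than comparing every field of every row pair.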
Then import the data with the following statement:

IF EXISTS(
    SELECT TOP 1 1
      FROM source a
      INNER JOIN target b
        ON a.hashValue = b.hashValue
) BEGIN

    -- Return the duplicate records and stop the import
    SELECT a.*
      FROM source a
      INNER JOIN target b
        ON a.hashValue = b.hashValue

END
ELSE
BEGIN
    INSERT INTO target (field1, field2, field3, field4 ... hashValue)
    SELECT
      t.field1,
      t.field2,
      t.field3,
      t.field4, ..., t.hashValue
    FROM source t
END
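The same check-then-insert logic, wrapped in a transaction so a failed import leaves the target untouched, can be sketched like this (illustrative only: Python's sqlite3 stands in for the real database, with tiny stand-in tables named after the answer's target/source/hashValue):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE target (id INTEGER PRIMARY KEY, data TEXT, hashValue TEXT);
CREATE TABLE source (id INTEGER PRIMARY KEY, data TEXT, hashValue TEXT);
INSERT INTO target (data, hashValue) VALUES ('a', 'h1'), ('b', 'h2');
INSERT INTO source (data, hashValue) VALUES ('c', 'h3'), ('a', 'h1');
""")

def import_if_unique(conn):
    """Copy source into target only if no hashValue collides."""
    with conn:  # one transaction: commits on success, rolls back on exception
        dup = conn.execute(
            "SELECT 1 FROM source a JOIN target b"
            " ON a.hashValue = b.hashValue LIMIT 1"
        ).fetchone()
        if dup:
            raise ValueError("duplicate records found; import aborted")
        conn.execute(
            "INSERT INTO target (data, hashValue)"
            " SELECT data, hashValue FROM source"
        )

try:
    import_if_unique(conn)
except ValueError:
    pass  # 'h1' already exists in target, so nothing was imported
count = conn.execute("SELECT COUNT(*) FROM target").fetchone()[0]
# count is still 2: the import was aborted before any insert
```

Using one transaction around the whole batch is what makes the "roll back if the import fails" requirement automatic.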
我想大声告诉你

If you rely on a transaction for this, efficiency will certainly not be high.

My own approach, for reference:

1. First check whether the data to be inserted already exists:

select id from table where id in ('k1', 'k2', 'k3');

2. If it does not exist, write the SQL to the database.

With 100,000 rows combined into one SQL statement, the insert itself finishes within about 2 seconds.

You can read an article I wrote earlier on optimizing a 200,000-row insert to complete in 3 seconds:
https://my.oschina.net/famous...
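The "one big INSERT" idea usually means batching rows into multi-row VALUES statements. A sketch (the function name and the `batch_size` knob are illustrative; in practice the batch size is tuned to stay under the server's packet limit):

```python
def build_bulk_insert(table, columns, rows, batch_size=1000):
    """Yield (sql, params) pairs, each a multi-row INSERT for one batch.

    Each statement inserts up to batch_size rows via placeholders,
    so 100k rows become ~100 statements instead of 100k round trips.
    """
    for i in range(0, len(rows), batch_size):
        batch = rows[i:i + batch_size]
        placeholders = ", ".join(
            "(" + ", ".join(["?"] * len(columns)) + ")" for _ in batch
        )
        params = [value for row in batch for value in row]
        sql = (f"INSERT INTO {table} ({', '.join(columns)}) "
               f"VALUES {placeholders}")
        yield sql, params

stmts = list(build_bulk_insert("t", ["id", "data"],
                               [(1, "a"), (2, "b"), (3, "c")], batch_size=2))
# 3 rows with batch_size=2 -> 2 statements
```

Fewer, larger statements cut per-statement overhead, which is where most of the speedup in bulk inserts comes from.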

滿天的星座

My personal take: the key point is de-duplication, so create a unique index to guarantee uniqueness during the import. For the batch import itself, you can refer to @石记's method.
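The unique-index approach lets the database itself reject duplicates, so no separate comparison pass is needed. A sketch (again using sqlite3 purely for illustration; SQLite's INSERT OR IGNORE plays the role of MySQL's INSERT IGNORE):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE target (data TEXT, hashValue TEXT)")
# The unique index makes the database enforce uniqueness on import.
conn.execute("CREATE UNIQUE INDEX ux_target_hash ON target (hashValue)")

conn.execute("INSERT INTO target VALUES ('a', 'h1')")
try:
    conn.execute("INSERT INTO target VALUES ('a-copy', 'h1')")
except sqlite3.IntegrityError:
    pass  # duplicate rejected by the index

# Or skip duplicates silently instead of raising:
conn.execute("INSERT OR IGNORE INTO target VALUES ('a-copy', 'h1')")
count = conn.execute("SELECT COUNT(*) FROM target").fetchone()[0]
# count is 1: both duplicate inserts were rejected
```

Whether a duplicate should abort the whole import (as the question requires) or just be skipped decides between the plain INSERT inside a transaction and the IGNORE variant.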

刘奇

Following up on this question: a big problem came up during the subsequent import (the data import was slow), so I wrote an article about it:

/a/11...

I hope it can serve as a reference for anyone who runs into this later.
