php - Import 100,000 pieces of data. There are 1 million pieces of data in the database. How to determine duplication?
PHP中文网 2017-05-16 13:09:47

Specific requirements

  1. The database table already contains 1 million rows;

  2. 100,000 rows need to be imported;

  3. If any row to be imported already exists in the table, the import should be terminated.

Question

How can each row be compared against the existing data quickly during the import, and how can the data be rolled back if the import fails?

PHP中文网

Certified Senior PHP Lecturer

All replies (4)
phpcn_u1582

Assume the target table is called target (1 million rows) and the table to be imported is called source (100,000 rows).
Most of the time in this scenario goes to determining whether records in the two tables are identical. I assume the author considers two records the same when all of their fields match (except the id field).

Compute a hash-like value for each record and store it in a new column (hashValue) added to both target and source. That is about 1.1 million hash computations in total.
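The per-record hash can be sketched as follows (a minimal illustration in Python; `row_hash` and the column names are made up for the example — the idea is simply to hash every field except id in a stable order):

```python
import hashlib

def row_hash(row, skip=("id",)):
    """Hash one record, ignoring the id column.

    `row` is a dict of column name -> value. Columns are sorted so the
    result does not depend on field order.
    """
    parts = [f"{k}={row[k]}" for k in sorted(row) if k not in skip]
    return hashlib.md5("|".join(parts).encode("utf-8")).hexdigest()

a = {"id": 1, "name": "alice", "age": 30}
b = {"id": 99, "name": "alice", "age": 30}  # same data, different id
assert row_hash(a) == row_hash(b)  # considered duplicates
```

Comparing one short hash column (ideally indexed) is much cheaper than comparing every field of every row pair.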
Then import the data with the following statement:

IF EXISTS(
    SELECT TOP 1 1
      FROM source a
      INNER JOIN target b
        ON a.hashValue = b.hashValue
) BEGIN

    -- Return the duplicate records and stop the import
    SELECT a.*
      FROM source a
      INNER JOIN target b
        ON a.hashValue = b.hashValue

END
ELSE
BEGIN
    INSERT INTO target (field1, field2, field3, field4 ... hashValue)
    SELECT
      t.field1,
      t.field2,
      t.field3,
      t.field4, ..., t.hashValue
    FROM source t
END
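The same check-then-insert logic, wrapped in a transaction so a failed import leaves the target untouched, can be sketched like this (illustrative only: Python's sqlite3 stands in for the real database, with tiny stand-in tables named after the answer's target/source/hashValue):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE target (id INTEGER PRIMARY KEY, data TEXT, hashValue TEXT);
CREATE TABLE source (id INTEGER PRIMARY KEY, data TEXT, hashValue TEXT);
INSERT INTO target (data, hashValue) VALUES ('a', 'h1'), ('b', 'h2');
INSERT INTO source (data, hashValue) VALUES ('c', 'h3'), ('a', 'h1');
""")

def import_if_unique(conn):
    """Copy source into target only if no hashValue collides."""
    with conn:  # one transaction: commits on success, rolls back on exception
        dup = conn.execute(
            "SELECT 1 FROM source a JOIN target b"
            " ON a.hashValue = b.hashValue LIMIT 1"
        ).fetchone()
        if dup:
            raise ValueError("duplicate records found; import aborted")
        conn.execute(
            "INSERT INTO target (data, hashValue)"
            " SELECT data, hashValue FROM source"
        )

try:
    import_if_unique(conn)
except ValueError:
    pass  # 'h1' already exists in target, so nothing was imported
count = conn.execute("SELECT COUNT(*) FROM target").fetchone()[0]
# count is still 2: the import was aborted before any insert
```

Using one transaction around the whole batch is what makes the "roll back if the import fails" requirement automatic.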
我想大声告诉你

If you rely on a transaction for this, efficiency will certainly not be high.

My own approach, for reference:

1. First check whether the data to be inserted already exists:

select id from table where id in ('k1', 'k2', 'k3');

2. If it does not exist, write the SQL to the database.

With 100,000 rows combined into one SQL statement, the insert itself finishes within about 2 seconds.

You can read an article I wrote earlier on optimizing a 200,000-row insert to complete in 3 seconds:
https://my.oschina.net/famous...
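The "one big INSERT" idea usually means batching rows into multi-row VALUES statements. A sketch (the function name and the `batch_size` knob are illustrative; in practice the batch size is tuned to stay under the server's packet limit):

```python
def build_bulk_insert(table, columns, rows, batch_size=1000):
    """Yield (sql, params) pairs, each a multi-row INSERT for one batch.

    Each statement inserts up to batch_size rows via placeholders,
    so 100k rows become ~100 statements instead of 100k round trips.
    """
    for i in range(0, len(rows), batch_size):
        batch = rows[i:i + batch_size]
        placeholders = ", ".join(
            "(" + ", ".join(["?"] * len(columns)) + ")" for _ in batch
        )
        params = [value for row in batch for value in row]
        sql = (f"INSERT INTO {table} ({', '.join(columns)}) "
               f"VALUES {placeholders}")
        yield sql, params

stmts = list(build_bulk_insert("t", ["id", "data"],
                               [(1, "a"), (2, "b"), (3, "c")], batch_size=2))
# 3 rows with batch_size=2 -> 2 statements
```

Fewer, larger statements cut per-statement overhead, which is where most of the speedup in bulk inserts comes from.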

滿天的星座

My personal take: the key point is de-duplication, so create a unique index to guarantee uniqueness during the import. For the batch import itself, you can refer to @石记's method.
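The unique-index approach lets the database itself reject duplicates, so no separate comparison pass is needed. A sketch (again using sqlite3 purely for illustration; SQLite's INSERT OR IGNORE plays the role of MySQL's INSERT IGNORE):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE target (data TEXT, hashValue TEXT)")
# The unique index makes the database enforce uniqueness on import.
conn.execute("CREATE UNIQUE INDEX ux_target_hash ON target (hashValue)")

conn.execute("INSERT INTO target VALUES ('a', 'h1')")
try:
    conn.execute("INSERT INTO target VALUES ('a-copy', 'h1')")
except sqlite3.IntegrityError:
    pass  # duplicate rejected by the index

# Or skip duplicates silently instead of raising:
conn.execute("INSERT OR IGNORE INTO target VALUES ('a-copy', 'h1')")
count = conn.execute("SELECT COUNT(*) FROM target").fetchone()[0]
# count is 1: both duplicate inserts were rejected
```

Whether a duplicate should abort the whole import (as the question requires) or just be skipped decides between the plain INSERT inside a transaction and the IGNORE variant.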

刘奇

Following up on this question: a big problem came up during the subsequent import (the data import was slow), so I wrote an article about it:

/a/11...

I hope it can serve as a reference for anyone who runs into this later.
