Concurrency - a Redis queue receives over 100 million records a day; how can Python write them into an SQL database efficiently?
PHPz 2017-04-17 18:01:11

I am using Python to pull data from a Redis queue. How can I write the 100+ million records that arrive each day into a MySQL/Oracle database completely and efficiently, without blocking?

Replies (6)
刘奇

The whole setup is a queue plus consumers that drain it.

A few concrete ideas:

  1. Use the MONITOR command to watch Redis activity, or upgrade Redis to 2.8+ and subscribe to keyspace notifications for change events.

  2. Have multiple processes/threads/coroutines handle those events (e.g. push them into a buffer).

  3. Flush the buffer to the database in batches (MySQL seems to cap how many rows a single multi-row INSERT can take, perhaps around 5,000; the exact limit is unclear, but there is one), and wrap each batch in a transaction so a failed batch rolls back. A consumer sketch follows this list.
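To make steps 2 and 3 concrete, here is a minimal consumer sketch. It assumes the data sits in a Redis list named events, that an events(payload) table already exists, and that redis-py and PyMySQL are the client libraries; all of those names and credentials are placeholders, not details from the question.

```python
import pymysql
import redis

BATCH_SIZE = 1000     # comfortably below any multi-row INSERT limit
POP_TIMEOUT = 1       # seconds to wait on an empty queue

r = redis.Redis(host="localhost", port=6379, db=0)
db = pymysql.connect(host="localhost", user="writer",
                     password="secret", database="analytics")

buffer = []
with db.cursor() as cur:
    while True:
        item = r.blpop("events", timeout=POP_TIMEOUT)   # blocking pop
        if item is not None:
            buffer.append((item[1].decode("utf-8"),))
        # Flush when the batch is full, or when the queue went quiet.
        if len(buffer) >= BATCH_SIZE or (item is None and buffer):
            try:
                cur.executemany(
                    "INSERT INTO events (payload) VALUES (%s)", buffer)
                db.commit()        # one commit per batch
            except Exception:
                db.rollback()      # the whole batch rolls back on failure
                raise
            buffer.clear()
```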

More speculative thoughts:

4. At this scale you could also consider a columnar store, HBase, or something similar.

And, take this with a grain of salt:

5. Some people, both in China and abroad, including colleagues who ran their own tests, report problems with MongoDB at very large data volumes.

伊谢尔伦

Which database are you using? Stand-alone MySQL and SQL Server differ a lot in performance: SQL Server, Oracle, and DB2 are roughly in the same class, while open-source MySQL is noticeably weaker; if you are considering a cluster, PostgreSQL is the better choice. To move that much data into the database you have to consider:
1) Appropriate indexes; index design has a large effect on write performance.
2) If the database is not set up as master-slave, consider RAID or SSDs; the gap over an ordinary mechanical hard drive is far from small.
3) The storage engine matters; with MySQL, for example, InnoDB and MyISAM still show real performance differences.
4) Don't commit every single record; committing in batches is easily more than ten times faster.
5) Separate reads from writes, otherwise you will find that writing blocks your reads (a minimal sketch of this follows the list).
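As a rough illustration of point 5, the sketch below routes inserts to the primary and analytical reads to a replica. PyMySQL is assumed as the driver, and the host names, credentials, and the events table are placeholders rather than anything from this thread.

```python
import pymysql

# Writes always go to the primary.
write_db = pymysql.connect(host="mysql-primary", user="writer",
                           password="secret", database="analytics")
# Heavy reads go to a replica so they never block the write path.
read_db = pymysql.connect(host="mysql-replica", user="reader",
                          password="secret", database="analytics")

def insert_events(rows):
    """Batch-insert rows on the primary in a single transaction."""
    with write_db.cursor() as cur:
        cur.executemany("INSERT INTO events (payload) VALUES (%s)", rows)
    write_db.commit()

def count_events():
    """Run an analytical query against the replica."""
    with read_db.cursor() as cur:
        cur.execute("SELECT COUNT(*) FROM events")
        return cur.fetchone()[0]
```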

If the daily volume really is that large, an SQL database is not a good fit; consider Hadoop + HBase.

In fact, which database to use depends largely on your data format; some databases handle particular formats better than others. HDF5, for example, is friendlier to binary and spatial data.

PHPzhong

With that much data, MongoDB would be faster.

伊谢尔伦

I have not used MySQL much, but SQL Server has a bulk-import facility that easily exceeds 10,000 rows per second. MySQL should have an equivalent bulk-load path as well; see the sketch below.
Once the data is in, the follow-up work has to consider sharding (splitting databases and tables) and master-slave replication.
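For reference, here is a hedged sketch of MySQL's bulk-load path, LOAD DATA LOCAL INFILE, which plays roughly the same role as SQL Server's bulk import. PyMySQL with local_infile=True is assumed, the server must also allow local_infile, and the CSV path, table, and column names are placeholders.

```python
import pymysql

db = pymysql.connect(host="localhost", user="writer", password="secret",
                     database="analytics", local_infile=True)

with db.cursor() as cur:
    # Load a pre-written CSV file in one server-side bulk operation.
    cur.execute(
        "LOAD DATA LOCAL INFILE '/tmp/events.csv' "
        "INTO TABLE events "
        "FIELDS TERMINATED BY ',' "
        "LINES TERMINATED BY '\\n' "
        "(payload)"
    )
db.commit()
```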

迷茫

Inserting 100 million rows into MySQL every day... running analytical queries on that will be a problem, but presumably it is just historical data.

PHPzhong

1. [Completeness]: use a verification/acknowledgement mechanism (see the sketch after this list).

2. [Efficiency]: use batch reads, batch writes, or a dedicated bulk interface.

3. [Non-blocking]: it is impossible to avoid blocking entirely; you can only keep the blocking time short enough that nobody notices. That requires two mechanisms, task segmentation and time-slicing, and you could even go further with performance-curve estimation based on machine learning and real-time load balancing.
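To make the completeness point concrete, here is a minimal sketch of Redis's reliable-queue pattern: each item is moved atomically into a per-worker processing list and only removed after the database commit succeeds, so anything left in the processing list after a crash can be replayed. redis-py and PyMySQL are assumed, the queue, list, and table names are placeholders, and it writes row by row to keep the acknowledgement logic obvious; in practice you would combine it with batching.

```python
import pymysql
import redis

r = redis.Redis()
db = pymysql.connect(host="localhost", user="writer",
                     password="secret", database="analytics")

QUEUE = "events"
PROCESSING = "events:processing:worker-1"   # one list per worker

with db.cursor() as cur:
    while True:
        # Atomically move one item into the processing list (blocks up to 1 s).
        item = r.brpoplpush(QUEUE, PROCESSING, timeout=1)
        if item is None:
            continue
        cur.execute("INSERT INTO events (payload) VALUES (%s)",
                    (item.decode("utf-8"),))
        db.commit()
        # Acknowledge only after the commit, so a crash leaves the item
        # in the processing list where it can be re-inserted later.
        r.lrem(PROCESSING, 1, item)
```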
