First of all, mysqldump's export speed is not slow: in testing it reached about 50MB/s, exporting 10GB of data in roughly 3 minutes. Clearly the bottleneck there is the network and disk IO, and no export tool can be much faster. The import, however, took 60 minutes, with only about 20% of disk and network utilized; the bottleneck was the write speed of the target library (ordinary sequential writes do not reach the IOPS limit). So mypumpkin was born, combining the import speed of myloader with the export flexibility of mysqldump.
The idea: use Python to build a queue, push all tables to be exported into it at once, then start N Python threads. Each thread takes a table name from the queue and uses subprocess to call the operating system's mysqldump command, exporting that table's data to a file named dbname.tablename.sql. Load in works like dump out in reverse: according to the specified database or table names, all matching .sql files are found under the dump-dir directory and pushed into the queue, and N threads call mysql concurrently, each constructing a new command that simulates the shell's < redirection.
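Below is a minimal sketch of this queue/threads/subprocess pattern, not mypumpkin's actual source; the names dump_worker, dump_out and conn_opts are illustrative, and connection options are assumed to be already parsed. Written for Python 2.7, the version the tool targets.

```python
import os
import Queue
import subprocess
import threading

def dump_worker(q, dump_dir, conn_opts):
    while True:
        try:
            db, table = q.get_nowait()
        except Queue.Empty:
            return                       # queue drained, thread done
        outfile = os.path.join(dump_dir, '%s.%s.sql' % (db, table))
        cmd = ['mysqldump'] + conn_opts + ['--single-transaction', db, table]
        with open(outfile, 'w') as f:
            # equivalent of: mysqldump ... db table > dump_dir/db.table.sql
            rc = subprocess.call(cmd, stdout=f)
        if rc != 0:
            print 'dump failed: %s.%s' % (db, table)
            return                       # a failed run stops this worker
        q.task_done()

def dump_out(tables, dump_dir, conn_opts, threads=2):
    q = Queue.Queue()
    for db, table in tables:             # fill the queue once, up front
        q.put((db, table))
    workers = [threading.Thread(target=dump_worker,
                                args=(q, dump_dir, conn_opts))
               for _ in range(threads)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()

# load in is symmetrical: push the .sql files found under dump_dir into a
# queue, and each worker simulates `mysql ... dbname < file` with:
#   subprocess.call(['mysql'] + conn_opts + [db], stdin=open(sqlfile))
```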
Parameter parsing was refactored from my original hand-rolled parsing to the argparse module.
If --tables is not specified, the program queries the source library for all table names, filters them, and puts them into the queue, roughly as sketched below.
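A minimal sketch of that table-discovery step using the MySQLdb library the tool depends on; the function name list_tables is hypothetical and the real implementation may differ:

```python
import MySQLdb

def list_tables(host, port, user, passwd, db):
    # query information_schema for all base tables of the given database
    conn = MySQLdb.connect(host=host, port=port, user=user, passwd=passwd)
    cur = conn.cursor()
    cur.execute(
        "SELECT table_name FROM information_schema.tables "
        "WHERE table_schema = %s AND table_type = 'BASE TABLE'", (db,))
    tables = [row[0] for row in cur.fetchall()]
    conn.close()
    return tables
```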
For load in to the target library, the options are as rich as for dump out: you can specify which databases and which tables to import, and which tables to ignore.
The key point is staying compatible with the original mysqldump: the table-related options (-B, -A, --tables, --ignore-table) have to be analyzed and recombined into a new command to execute, and there are many corner cases to consider; see the sketch below.
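One natural way to separate the table-related options from everything else is argparse's parse_known_args(). This is a sketch mirroring the --help output in section 4.1, not the tool's exact code:

```python
import argparse

parser = argparse.ArgumentParser(add_help=False)
parser.add_argument('--help', action='help')
parser.add_argument('-B', '--databases', nargs='+', metavar='db1')
parser.add_argument('-A', '--all-databases', action='store_true')
parser.add_argument('--tables', nargs='+', metavar='t1')
parser.add_argument('--ignore-table', nargs='+', metavar='db1.table1')
parser.add_argument('--threads', type=int)
parser.add_argument('--dump-dir', required=True)

# options argparse does not recognize are kept verbatim and passed
# through to the mysqldump/mysql command line unchanged
known, passthrough = parser.parse_known_args()
```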
Limitations
Important: the exported data does not guarantee database-level consistency.
Tables that do not change during the export are unaffected.
Whether an individual table is exported consistently depends on the options passed to mysqldump itself (e.g. --single-transaction).
Different tables are exported by different mysqldump commands, so transactions across tables cannot be guaranteed. In my scenario, a developer colleague helped out with a binlog parsing program, replaying all changes after completion to reach eventual consistency.
Besides, in many cases the imported data does not need to be complete or consistent; it is only used for offline analysis or a temporary export, and the point is to get the data to the developers quickly.
Unsupported option formats: the program tries its best to stay compatible with the mysqldump command line; you only need to prepend mypumpkin.py and specify --dump-dir to get the concurrency magic. However, some argument formats are inconvenient to parse and are not supported yet:
db1 table1 table2 db2 db3
That is, given the line above, the program cannot tell from the command line whether db1 and table1 are database names or table names. When using mypumpkin, just remember that one of the three option groups [-A|-B], [--tables], [--ignore-table] must appear: "db1 table1 table2" becomes "db1 --tables table1 table2", and "db2 db3" becomes "-B db2 db3".
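For example (the directory name mydir is hypothetical, and "..." stands for the connection options):

```
## ambiguous, not supported:
$ ./mypumpkin.py mysqldump ... db1 table1 table2 db2 db3 --dump-dir mydir

## rewritten with explicit group options:
$ ./mypumpkin.py mysqldump ... db1 --tables table1 table2 --dump-dir mydir
$ ./mypumpkin.py mysqldump ... -B db2 db3 --dump-dir mydir
```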
For now, the password can only be given explicitly on the command line.
4. Instructions for use
Installation: developed on Python 2.7; other versions have not been tested. Requires the MySQLdb library.
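For example, assuming pip is available (MySQL-python is the PyPI package that provides MySQLdb for Python 2):

```
$ pip install MySQL-python
```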
4.1 help
```
$ ./mypumpkin.py --help
Only mysqldump or mysql allowed after mypumpkin.py

usage: mypumpkin.py {mysqldump|mysql} [--help]

This's a program that wrap mysqldump/mysql to make them dump-out/load-in
concurrently. Attention: it can not keep consistent for whole database(s).

optional arguments:
  --help                show this help message and exit
  -B db1 [db1 ...], --databases db1 [db1 ...]
                        Dump one or more databases
  -A, --all-databases   Dump all databases
  --tables t1 [t1 ...]  Specify tables to dump. Override --databases (-B)
  --ignore-table db1.table1 [db1.table1 ...]
                        Do not dump the specified table. (format like
                        --ignore-table=dbname.tablename). Use the directive
                        multiple times for more than one table to ignore.
  --threads =N          Threads to dump out [2], or load in [CPUs*2].
  --dump-dir DUMP_DIR   Required. Directory to dump out (create if not
                        exist), Or Where to load in sqlfile

At least one of these 3 group options given: [-A,-B] [--tables] [--ignore-table]
```
--dump-dir is required; the shell's standard input/output redirection (> or <) cannot be used. If the directory specified by --dump-dir does not exist, mypumpkin tries to create it automatically.
--threads=N specifies the number of concurrent export or import threads. The default for dump out is 2; the default for load in is the number of CPUs * 2.
Note: more threads is not necessarily better. The metrics to watch are network bandwidth, disk IO, and the target library's IOPS; it is best to observe them with dstat.
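For example, while a dump or load is running (5-second interval; adjust the columns as needed):

```
$ dstat -tcdn 5    # time, cpu, disk and network throughput every 5 seconds
```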
-B, --tables, --ignore-table are used the same way as in mysqldump, e.g.:
In mysqldump, --tables will override the --databases/-B option
In mysqldump, --tables and --ignore-table cannot appear at the same time
In mysqldump, if -B is not specified, --tables or --ignore-table must follow the db name
Other options are left intact by mypumpkin and passed through to the shell command, so checking them is left to the native mysqldump; if an error occurs during execution, the affected thread exits.
4.2 example
Export:
```
## Dump all databases from the source into the visit_dumpdir2 directory
## (information_schema and performance_schema excluded)
$ ./mypumpkin.py mysqldump -h dbhost_name -utest_user -pyourpassword -P3306 \
  --single-transaction --opt -A --dump-dir visit_dumpdir2

## Dump databases db1 and db2; all table names are queried from the source library for filtering
$ ./mypumpkin.py mysqldump -h dbhost_name -utest_user -pyourpassword -P3306 \
  --single-transaction --opt -B db1 db2 --dump-dir visit_dumpdir2

## Dump only tables t1 and t2 of db1; a notice is printed if a specified table does not exist
$ ./mypumpkin.py mysqldump -h dbhost_name -utest_user -pyourpassword -P3306 \
  --single-transaction --opt -B db1 --tables t1 t2 --dump-dir visit_dumpdir2

## Dump db1 and db2, but ignore tables db1.t1, db2.t2 and db2.t3
## mysqldump itself only accepts --ignore-table=db1.t1, repeated once per table;
## mypumpkin extends this to allow multiple tables after one directive.
## (with -A instead of -B, this applies to all databases)
$ ./mypumpkin.py mysqldump -h dbhost_name -utest_user -pyourpassword --single-transaction \
  --opt -B db1 db2 --ignore-table=db1.t1 --ignore-table db2.t2 db2.t3 --dump-dir visit_dumpdir2

## Without -A/-B
$ ./mypumpkin.py mysqldump -h dbhost_name -utest_user -pyourpassword -P3306 \
  --single-transaction --opt db1 --ignore-table=db1.t1 --dump-dir=visit_dumpdir2

## Other options are passed through untouched
$ ./mypumpkin.py mysqldump -h dbhost_name -utest_user -pyourpassword -P3306 \
  --single-transaction --set-gtid-purged=OFF --no-set-names --skip-add-locks -e -q -t -n --skip-triggers \
  --max-allowed-packet=134217728 --net-buffer-length=1638400 --default-character-set=latin1 \
  --insert-ignore --hex-blob --no-autocommit \
  db1 --tables t1 --dump-dir visit_dumpdir2
```
Import:
-A, -B, --tables, --ignore-table, --threads, and --dump-dir have exactly the same usage and meaning as above. Some examples:
```
## Load all tables found under dump-dir
$ ./mypumpkin.py mysql -h dbhost_name -utest_user -pyourpassword --port 3307 -A \
  --dump-dir=visit_dumpdir2

## Load database db1 (all of its tables)
$ ./mypumpkin.py mysql -h dbhost_name -utest_user -pyourpassword --port 3307 -B db1 \
  --dump-dir=visit_dumpdir2

## Load only table db1.t1
$ ./mypumpkin.py mysql -h dbhost_name -utest_user -pyourpassword --port 3307 \
  --default-character-set=utf8mb4 --max-allowed-packet=134217728 --net-buffer-length=1638400 \
  -B db1 --tables t1 --dump-dir=visit_dumpdir2

## Load db1 and db2 but ignore db1.t1 (tables are checked against the files
## in dump-dir, not against the target library)
$ ./mypumpkin.py mysql -h dbhost_name -utest_user -pyourpassword --port 3307 \
  -B db1 db2 --ignore-table=db1.t1 --dump-dir=visit_dumpdir2
```