Application scenarios:
Sometimes you need to fill a database table with records for testing, and the scripts below make that easy.
Create table:
CREATE TABLE `tables_a` (
    `id` int(10) NOT NULL DEFAULT '0',
    `name` char(50) DEFAULT NULL,
    PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
Create a function that generates random strings:
set global log_bin_trust_function_creators = 1;
DROP FUNCTION IF EXISTS rand_string;
DELIMITER //
CREATE FUNCTION rand_string(n INT) RETURNS VARCHAR(255)
BEGIN
    DECLARE chars_str varchar(100) DEFAULT 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789';
    DECLARE return_str varchar(255) DEFAULT '';
    DECLARE i INT DEFAULT 0;
    WHILE i < n DO
        SET return_str = concat(return_str, substring(chars_str, FLOOR(1 + RAND()*62), 1));
        SET i = i + 1;
    END WHILE;
    RETURN return_str;
END //
DELIMITER ;
Create the procedure that inserts into the table. x is the starting id, y is the end value (exclusive, because the loop runs while i<y), and z is the length of the random string generated for each row:
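delimiter //
create procedure test(x int(10), y int(10), z int(10))
begin
    DECLARE i INT DEFAULT x;
    -- insert one row per id in [x, y), each with a random name of z characters
    while i < y do
        insert into tables_a values(i, rand_string(z));
        set i = i + 1;
    end while;
end //
delimiter ;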
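As a usage illustration, the procedure can also be called from Java over JDBC. This is only a sketch under assumptions: the class name FillTablesA and the helper expectedRows are hypothetical, and the caller is assumed to supply an already-open Connection to the database that holds tables_a.

```java
import java.sql.CallableStatement;
import java.sql.Connection;
import java.sql.SQLException;

// Hypothetical helper class, not part of the original scripts.
class FillTablesA {
    // Number of rows the procedure inserts: ids x, x+1, ..., y-1.
    static int expectedRows(int x, int y) {
        return Math.max(0, y - x);
    }

    // Calls the stored procedure test(x, y, z) on a connection the caller opened.
    static void fill(Connection conn, int x, int y, int z) throws SQLException {
        try (CallableStatement cs = conn.prepareCall("{call test(?, ?, ?)}")) {
            cs.setInt(1, x); // first id to insert
            cs.setInt(2, y); // end value, exclusive
            cs.setInt(3, z); // length of each random name
            cs.execute();
        }
    }
}
```

For example, fill(conn, 1, 1001, 20) would insert ids 1 through 1000. Since name is declared as char(50), z should stay at or below 50.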
MySQL random data generation and insertion
There is very little citation information in the dblp database, with an average of 0.2 citations per paper. A paper using dblp as an experimental data set mentioned that citation information can be added randomly. Inspired by this, I planned to add 20 random citations to each paper, so I wrote the following SQL statement:
String sql = "insert into citation(pId1,pId2) values( (select pId from papers limit ?,1),(select pId from papers limit ?,1))";
I used a PreparedStatement to submit the inserts to the database in batch mode.
The first parameter is the row offset of the citing paper, from 0 up to N (N is the total number of rows in papers; valid LIMIT offsets are 0 to N-1). The second parameter is one of 20 non-repeating random numbers generated in Java, drawn from the same range. Both are set inside nested for loops, and the batch is submitted to the database every 10,000 rows.
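The loop described above might look roughly like the following. This is a sketch, not the original program: the class name CitationBatchSketch, the method names, and the choice of a LinkedHashSet to enforce non-repeating numbers are all assumptions, and the caller is assumed to pass in an open Connection.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.LinkedHashSet;
import java.util.Random;
import java.util.Set;

// Hypothetical sketch of the batch insert; names are illustrative only.
class CitationBatchSketch {
    static final String SQL =
        "insert into citation(pId1,pId2) values("
        + "(select pId from papers limit ?,1),(select pId from papers limit ?,1))";

    // Picks `count` distinct row offsets in the range 0..n-1.
    static Set<Integer> pickDistinct(int count, int n, Random rng) {
        if (count > n) throw new IllegalArgumentException("count exceeds n");
        Set<Integer> picked = new LinkedHashSet<>();
        while (picked.size() < count) {
            picked.add(rng.nextInt(n)); // duplicates are ignored by the set
        }
        return picked;
    }

    // For each of the n papers, queues perPaper citations and flushes every batchSize rows.
    static void insertAll(Connection conn, int n, int perPaper, int batchSize)
            throws SQLException {
        Random rng = new Random();
        conn.setAutoCommit(false);
        try (PreparedStatement ps = conn.prepareStatement(SQL)) {
            int pending = 0;
            for (int paper = 0; paper < n; paper++) {
                for (int cited : pickDistinct(perPaper, n, rng)) {
                    ps.setInt(1, paper); // offset of the citing paper
                    ps.setInt(2, cited); // offset of the cited paper
                    ps.addBatch();
                    if (++pending == batchSize) {
                        ps.executeBatch();
                        conn.commit();
                        pending = 0;
                    }
                }
            }
            if (pending > 0) { // flush the final partial batch
                ps.executeBatch();
                conn.commit();
            }
        }
    }
}
```

With the numbers from the text, the call would be insertAll(conn, N, 20, 10000).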
This code uses the limit clause to pick tuples at random, which I was quietly pleased with. Since all the selection is done inside the database, with no extra round trips through JDBC, I expected it to finish quickly. Instead, it took a full 22 minutes to insert only 100,000 rows (10,000 * 10). The final experiment requires inserting 4 million rows, which at that rate would take more than 14 hours.
So I went back and wrote a series of similar test programs to find the bottleneck, and finally pinned it down to the select ... limit subqueries. This operation is very expensive: to serve limit offset,1 MySQL has to read and discard every row before the offset, so the cost of each lookup grows with the offset. I had used limit in the first place because the numbers are generated randomly and each number has to be mapped to a tuple, that is, to a rowid. Since the primary key of the papers table is not an auto-incrementing int, there is no ready-made rowid to use. Then I realized I could first add an auto_increment column named temp to the papers table and drop it again once the citations have been inserted. With that, the SQL statement becomes:
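Adding and later dropping the helper column might be done as sketched here. The exact DDL is an assumption, since the post does not show it; MySQL requires an auto_increment column to be indexed, so the sketch declares it UNIQUE, and that index is what makes the lookup by temp fast. The class name TempColumnSketch is hypothetical.

```java
import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Statement;

// Hypothetical sketch; the DDL strings are an assumption, not taken from the post.
class TempColumnSketch {
    // Existing rows are numbered automatically, by default starting at 1.
    static final String ADD_TEMP =
        "ALTER TABLE papers ADD COLUMN temp INT NOT NULL AUTO_INCREMENT UNIQUE";

    // Dropping the column also removes the single-column index on it.
    static final String DROP_TEMP = "ALTER TABLE papers DROP COLUMN temp";

    // Runs one DDL statement on a connection the caller opened.
    static void run(Connection conn, String ddl) throws SQLException {
        try (Statement st = conn.createStatement()) {
            st.executeUpdate(ddl);
        }
    }
}
```

Call run(conn, TempColumnSketch.ADD_TEMP) before the inserts and run(conn, TempColumnSketch.DROP_TEMP) afterwards. Because the generated values start at 1 by default, the random numbers would then need to range from 1 to N rather than from 0.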
String sql = "insert into citation(pId1,pId2) values((select pId from papers where temp=?), (select pId from papers where temp=?))";
Inserting 100,000 rows again now takes 38 seconds, roughly 35 times faster than the 22 minutes needed before. The efficiency has improved greatly, but I don't know whether it can be optimized further.
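To put the two measurements side by side, the small calculation below extrapolates each to the full 4 million rows. It assumes the time grows linearly with the row count, which is only a rough approximation (the limit version in particular is likely to get slower as offsets grow). The class and method names are hypothetical.

```java
// Hypothetical back-of-the-envelope estimate, assuming linear scaling.
class InsertTimeEstimate {
    // Scales a measured time for sampleRows up to totalRows.
    static double projectedSeconds(double sampleSeconds, long sampleRows, long totalRows) {
        return sampleSeconds * ((double) totalRows / sampleRows);
    }

    public static void main(String[] args) {
        double limitVersion = projectedSeconds(22 * 60, 100_000, 4_000_000);
        double tempVersion = projectedSeconds(38, 100_000, 4_000_000);
        System.out.println("limit version, hours: " + limitVersion / 3600);
        System.out.println("temp column version, minutes: " + tempVersion / 60);
    }
}
```

Under that assumption the limit version needs 52,800 seconds (between 14 and 15 hours), while the temp column version needs 1,520 seconds (about 25 minutes).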