In the current business, as some tables are getting larger and larger, the pressure is very high when reading (the demand for writing is relatively small), so on the database side, we decided to cut some tables with particularly large amounts of data, but in the back-end code There are a lot of codes/queries that need to be joined to these tables. How do you solve this situation?
For example, we now have a SampleTable with about 100 million pieces of data. We have divided it into about 16 different tables based on logic: SampleTable 1, SampleTable2...SampleTable31,
There was a query in the previous code, which was similar to:
select * from SampleTable join test_table
Do we need to execute this query multiple times and aggregate the data as the return result?
select * from SampleTable1 join test_table
Is there a better method or library recommendation? Is there any beset practice or sample code?
If we want to split multiple tables into different database servers later, do we need to add database connections for different DBs in the back-end code?
The basic idea and sharding strategy of database Sharding
This article is more about the strategy of database sharding. Can anyone provide an actual project code sample?
Database sharding and JPA
what -to-do-instead-of-sql-joins-while-scaling-horizontally
Some answers on stackoverflow
You can consider introducing database middleware
sharding-jdbc client level
mycat-server server level
A friend recommended Spark, which supports SQl-style queries and returns results in about 0.5 seconds for 100 million pieces of data
Only for the current situation in our project: when dividing tables, it falls to a specific table according to the hash algorithm, and then when fetching, first obtain the distribution position of the data according to the algorithm, and then it is a normal selection
Join table query is not recommended
1. Database resources are relatively precious, and join table query will take up a lot of memory, resulting in a decrease in database performance
2. Data is not supported in multiple database instances, the split database situation cannot be handled, and the scalability is poor
The common approach is to divide the join table query into multiple single table queries, and then summarize the results in the application.
1. Can solve the above problem of joining table query
2. For multiple queries, the intermediate results of each query can also be processed in the program, which is a flexibility.
3. The application can also be expanded at any time, making it more flexible
If it is an offline scenario, it is recommended to use the MR (mapreduce) framework to handle it, such as hadoop, etc. Accordingly, the data needs to be written to HDFS.
http://blog.csdn.net/tianyale...
Detailed explanation of sub-database and sub-table