Question 1:
I currently have more than 400,000 records, and I need to build a classification model for this data with some machine-learning algorithm. The problem I ran into is that the data is too large to read into memory at once, so I would like to ask how to process it.
Question 2:
I have a question about cross-validation in sklearn. Suppose I have 10,000 training samples. Following the cross-validation principle, KFold can split them into n groups (with the training portion accounting for 0.7 of the data). What I don't understand is this: I call fit() on the training set of the first group and then run prediction on its test set to get an accuracy score. But what is that accuracy actually used for? Does it affect the next round of training? Also, is the model trained in one round reused by the next call to fit()?
I have been studying data mining and big-data analysis recently. Regarding question 1, here is an idea for your reference: since the data cannot be read at once, you could build a distributed-style data model and read it in batches. Give each batch an address, i.e. a datanode (it can simply be a variable name), and keep a namenode (a table mapping names to addresses). When you need data, first look up its address in the namenode (which variable holds the data you need), then access that address to fetch and process it. I'm a beginner, so this is just my personal thought; the answer is not unique and is for reference only. If you don't like it, please don't criticize.
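To make the batch-reading idea concrete, here is a minimal sketch, assuming the data sits in a CSV file ("data.csv", the chunk size, and the "label" column name are placeholders) and using pandas' chunked reader together with an sklearn estimator that supports partial_fit. It is one possible way to train without loading everything at once, not the only one.

```python
# Minimal out-of-core sketch: read the rows in chunks and update an
# incremental model with each chunk. File name, chunk size and column
# names are assumptions for illustration.
import pandas as pd
from sklearn.linear_model import SGDClassifier

clf = SGDClassifier()          # linear classifier that supports partial_fit
classes = [0, 1]               # all class labels must be declared up front

for chunk in pd.read_csv("data.csv", chunksize=50_000):
    X = chunk.drop(columns=["label"]).to_numpy()
    y = chunk["label"].to_numpy()
    clf.partial_fit(X, y, classes=classes)
```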
400,000 is not much, a few gigabytes at most...
If memory really is as small as 8 GB, then it still depends on your specific scenario. For example, if you simply want to compute tf-idf, a generator is enough: stream the documents and keep only the final tf-idf dictionary in memory.
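As a rough illustration of the generator idea (the file name and the whitespace tokenization are assumptions), one can stream documents from disk and keep only the document-frequency counts, from which the idf part of tf-idf is derived:

```python
# Stream documents one at a time so that only the running counts dictionary
# stays in memory; "corpus.txt" (one document per line) is a placeholder.
import math
from collections import Counter

def stream_docs(path):
    with open(path, encoding="utf-8") as f:
        for line in f:
            yield line.split()              # naive whitespace tokenization

df = Counter()
n_docs = 0
for tokens in stream_docs("corpus.txt"):
    n_docs += 1
    df.update(set(tokens))                  # each term counted once per document

# idf dictionary; per-document term frequencies can be computed in a second pass
idf = {term: math.log(n_docs / count) for term, count in df.items()}
```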
Cross-validation is just for selecting the model with the smallest error; each fold is trained independently. The "one round influencing the next" effect you are asking about is a different concept, namely boosting.
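A small sketch of plain KFold cross-validation (synthetic data, with LogisticRegression chosen only as an example) may help with question 2: every fold fits a brand-new model, nothing is carried over between folds, and the per-fold accuracies are only aggregated to estimate generalization error and compare models.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

X, y = make_classification(n_samples=10_000, random_state=0)

scores = []
for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    model = LogisticRegression(max_iter=1000)    # a fresh model for every fold
    model.fit(X[train_idx], y[train_idx])        # nothing from the previous fold is reused
    scores.append(model.score(X[test_idx], y[test_idx]))

print(np.mean(scores))   # the average accuracy is the cross-validated estimate
```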
On a Q&A site like this it is best to keep one question per post. If necessary, split it into two separate questions and link them to each other, to avoid double-barreled questions.
(1) See the scikit-learn page "How to optimize for speed"; you will find many ways to keep the experiment under control, including (a) using the simplest algorithm possible, (b) profiling memory usage and speed under realistic conditions, (c) replacing nested loops with NumPy array operations wherever possible, and (d) using a Cython wrapper, if necessary, to call a more efficient C/C++ library. These are only basic principles and directions; in practice it still depends on where the bottleneck of your problem lies, whether speed or space. After optimizing the code, you can consider whether to use parallel computing and other methods (a small vectorization sketch for point (c) is attached after part (2) below).
(2) For your second question you have to distinguish between the mathematical and the empirical requirements. I hope you have a grasp of both the empirical and the mathematical meaning of overfitting and underfitting; the questions and answers on that topic here are quite good, and reading them will help (a small empirical sketch is attached below as well).
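To illustrate point (c) of part (1), here is a toy comparison of a nested Python loop and the equivalent NumPy broadcasting computation; the pairwise-distance workload is made up purely for the example.

```python
# Point (c): replace nested Python loops with a vectorized NumPy computation.
import numpy as np

A = np.random.rand(500, 32)
B = np.random.rand(400, 32)

# Nested-loop version: simple but slow in pure Python.
def pairwise_sq_dist_loops(A, B):
    out = np.empty((A.shape[0], B.shape[0]))
    for i in range(A.shape[0]):
        for j in range(B.shape[0]):
            out[i, j] = np.sum((A[i] - B[j]) ** 2)
    return out

# Vectorized version: the same result computed with broadcasting.
def pairwise_sq_dist_numpy(A, B):
    return ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)

assert np.allclose(pairwise_sq_dist_loops(A, B), pairwise_sq_dist_numpy(A, B))
```

And for part (2), a hedged sketch of how over- and under-fitting show up empirically: compare training and validation scores while varying model complexity. The synthetic data and the polynomial-degree parameter are illustrative choices, not something from the original question.

```python
# Compare training vs. cross-validated scores across model complexity.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import Ridge
from sklearn.model_selection import validation_curve

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=300)

model = make_pipeline(PolynomialFeatures(), Ridge())
train_scores, val_scores = validation_curve(
    model, X, y,
    param_name="polynomialfeatures__degree",
    param_range=[1, 3, 5, 10, 15],
    cv=5,
)
# Low scores on both sets suggest underfitting; a large gap between high
# training scores and lower validation scores suggests overfitting.
print(train_scores.mean(axis=1), val_scores.mean(axis=1))
```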