Data science and machine learning are growing in popularity, and the number of people working in the field increases every day. That means many data scientists are building their first machine learning models without much prior experience, and this is where mistakes tend to happen.
Recently, software architect, data scientist, and Kaggle master Agnis Liukis wrote an article discussing some of the most common beginner mistakes in machine learning, so that newcomers can understand and avoid them.
Agnis Liukis has over 15 years of experience in software architecture and development and is proficient in languages and frameworks such as Java, JavaScript, Spring Boot, React.JS, and Python. He is also interested in data science and machine learning, has taken part in many Kaggle competitions with good results, and has reached the level of Kaggle Competitions Master.
The following is the content of the article:
Take the features, feed them into the model, and have it make predictions - it seems very easy. But in some cases, the results of this simple approach can be disappointing, because it is missing a very important step: normalizing the data.
Some types of models require data normalization - linear regression and classic neural networks, for example. These models multiply feature values by trained weights. With non-normalized features, the possible range of one feature's values may differ dramatically from the range of another.
Suppose the values of one feature lie in the range [0, 0.001] and the values of another lie in the range [100000, 200000]. For the model to treat the two features as equally important, the weight of the first feature would have to be about 100 million times larger than the weight of the second. Huge weights can cause serious problems for the model, for example when outliers are present. Furthermore, estimating the importance of individual features becomes difficult: a large weight may mean the feature is important, but it may also simply mean that the feature's values are small.
After normalization, the values of all features are in the same range, usually [0, 1] or [-1, 1]. In this case, the weights will be in a similar range and closely correspond to the actual importance of each feature.
Overall, using data normalization where needed will produce better, more accurate predictions.
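As a minimal illustration (the article does not prescribe a specific library), here is a sketch using scikit-learn's MinMaxScaler to bring two features with very different ranges into [0, 1]; the data is synthetic and only mirrors the example above:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Two features with wildly different ranges, as in the example above:
# one in [0, 0.001], the other in [100000, 200000]
X = np.column_stack([
    np.random.uniform(0, 0.001, size=100),
    np.random.uniform(100_000, 200_000, size=100),
])

scaler = MinMaxScaler()             # rescales every feature to [0, 1]
X_scaled = scaler.fit_transform(X)

print(X_scaled.min(axis=0), X_scaled.max(axis=0))  # both columns now span [0, 1]
```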
Some people may think that adding all the features they can is a good idea, assuming the model will automatically pick out and use the best ones. In practice, this rarely works.
The more features a model has, the greater the risk of overfitting. Even in completely random data, the model is able to find some signals - sometimes weaker, sometimes stronger. Of course, there is no real signal in random noise. But if we have enough noisy columns, the model may end up using some of them based on these false signals. When this happens, the quality of the model's predictions degrades, because they are partly based on random noise.
There are many techniques to help with feature selection. But you should be able to explain every feature you include and why it will help your model.
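As one hedged example of such a technique (the article does not name a specific one), scikit-learn's SelectKBest can score features against the target and keep only the strongest ones; the data and column counts below are invented for illustration:

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression

rng = np.random.default_rng(0)

# 3 informative features plus 50 columns of pure noise
X_informative = rng.normal(size=(500, 3))
X_noise = rng.normal(size=(500, 50))
X = np.hstack([X_informative, X_noise])
y = X_informative @ np.array([1.5, -2.0, 0.7]) + rng.normal(scale=0.1, size=500)

# Keep the 3 features with the strongest univariate relationship to the target
selector = SelectKBest(score_func=f_regression, k=3)
X_selected = selector.fit_transform(X, y)

print(selector.get_support(indices=True))  # usually [0, 1, 2] - the informative columns
```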
Tree-based models are easy to use and powerful, which is why they are popular. However, in some cases using a tree-based model can be a mistake.
Tree-based models cannot extrapolate. Their predictions will never be greater than the maximum target value seen in the training data, and will never be smaller than the minimum target value seen in training.
In some tasks, the ability to extrapolate can be very important. For example, if the model predicts stock prices, future prices may be higher than anything seen before. Tree-based models are not directly useful in this case, because their predictions will never exceed the highest price in the training data.
There are several solutions to this problem. One is to predict changes or differences instead of predicting the value directly. Another is to use a different type of model for such tasks - linear regression or neural networks, for example, can extrapolate.
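As a hedged sketch of the first idea, a tree model can be trained on the day-over-day change rather than the raw price, and the predicted change can then be added back to the last known value. The series and lag features here are synthetic and only illustrate the pattern:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Synthetic upward-trending "price" series
price = pd.Series(100 + np.cumsum(rng.normal(loc=1.0, scale=2.0, size=300)))

df = pd.DataFrame({"price": price})
df["change"] = df["price"].diff()    # target: day-over-day change
df["lag_1"] = df["change"].shift(1)  # simple lag features
df["lag_2"] = df["change"].shift(2)
df = df.dropna()

train, test = df.iloc[:250], df.iloc[250:]

model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(train[["lag_1", "lag_2"]], train["change"])

# Predict the change, then add it back to the previous price.
# The reconstructed forecast is no longer capped at the training maximum.
pred_change = model.predict(test[["lag_1", "lag_2"]])
pred_price = test["price"].shift(1).fillna(train["price"].iloc[-1]) + pred_change
print(pred_price.max(), train["price"].max())
```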
An earlier point covered the need for data normalization, but it is not always required. Tree-based models do not need data normalization. Neural networks may not need explicit normalization either, because some networks already include a normalization layer internally, such as the Keras library's BatchNormalization layer.
In some cases, even linear regression may not require data normalization - namely when all features are already in a similar value range and have the same meaning. For example, if the model is applied to time-series data and all features are historical values of the same parameter.
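As a minimal sketch of the built-in normalization mentioned above (assuming TensorFlow/Keras is available; the layer sizes are arbitrary), a BatchNormalization layer can be placed right after the input so the network standardizes raw features itself:

```python
import tensorflow as tf

# A tiny regression network that normalizes its inputs internally,
# so raw (un-normalized) features can be fed to it directly
model = tf.keras.Sequential([
    tf.keras.Input(shape=(10,)),
    tf.keras.layers.BatchNormalization(),  # standardizes each feature batch by batch
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(1),
])

model.compile(optimizer="adam", loss="mse")
model.summary()
```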
Causing data leakage is easier than people think. Consider the following kind of feature-engineering snippet:
Example features of data leakage
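The original snippet is not reproduced here; the sketch below is a hedged reconstruction of the same idea, with sum_feature and diff_feature built from statistics computed over the full dataset before the split (the exact construction in the original may differ):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "category": rng.integers(0, 10, size=1000),
    "feature": rng.normal(size=1000),
    "target": rng.normal(size=1000),
})

# Leaky: both engineered features use statistics computed over ALL rows,
# including rows that will later end up in the test split
df["sum_feature"] = df.groupby("category")["target"].transform("sum")
df["diff_feature"] = df["feature"] - df["feature"].mean()  # global mean includes test rows

# The split happens only after the features were generated
train_df, test_df = train_test_split(df, test_size=0.2, random_state=0)
```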
In fact, both features (sum_feature and diff_feature) are incorrect. They leak information because, after splitting into train/test sets, the training part will contain some information derived from the test rows. This will result in higher validation scores but worse performance when the model is applied to real data.
The correct approach is to split into training and test sets first, and only then apply the feature-generation functions. In general, processing the training set and the test set separately is a good feature-engineering pattern.
In some cases some information does have to be passed between the two - for example, we may want to fit a StandardScaler on the training set and apply the same scaler to the test set.
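For that case, the usual pattern (a minimal sketch with synthetic data) is to fit the scaler on the training set only and then reuse it to transform the test set:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = np.random.rand(1000, 5)
X_train, X_test = train_test_split(X, test_size=0.2, random_state=0)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # mean/std learned from the training set only
X_test_scaled = scaler.transform(X_test)        # the same fitted scaler is reused on the test set
```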
All in all, it’s good to learn from your mistakes, and I hope the examples of mistakes provided above will help you.