In data preprocessing, a critical step is handling missing data, because most machine learning models will not accept NaN values as input. There are many ways to fill in these NaN values, but we first need to understand how extensive and how important the missing values are.
A very simple approach is to remove all rows with missing values, but before doing that, check the overall percentage of NaN values in the dataset. If it is less than 1%, we can remove them; otherwise we need to impute the data with other methods, such as a central tendency measure (mean, median, mode), a KNN imputer, etc.
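A minimal sketch of this check in pandas (assuming a hypothetical `data.csv`; the 1% threshold is just the rule of thumb from above):

```python
import pandas as pd

# Hypothetical dataset; replace with your own file.
df = pd.read_csv("data.csv")

# Overall percentage of missing cells in the dataset.
missing_pct = df.isna().sum().sum() / df.size * 100
print(f"Missing values: {missing_pct:.2f}%")

# If only a tiny share is missing (e.g. under 1%), dropping rows is usually safe.
if missing_pct < 1:
    df = df.dropna()
```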
For numeric features, we impute with the mean or the median. The mean is the average value, calculated by summing all the values in a column and then dividing by their count. The median also represents a kind of average: arrange the data in order of size to form a sequence, and the median is the value in the middle of that sequence. When the individual values in a dataset vary greatly, the median is often used to describe its central tendency.
If a feature in the dataset has a skewed distribution, it is often better to impute with the median than with the mean.
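To see why, here is a small sketch with scikit-learn's SimpleImputer on made-up numbers skewed by one large value:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# One column with a missing value; the 100.0 skews the distribution.
X = np.array([[1.0], [2.0], [3.0], [np.nan], [100.0]])

mean_imp = SimpleImputer(strategy="mean")
median_imp = SimpleImputer(strategy="median")

print(mean_imp.fit_transform(X).ravel())    # NaN -> 26.5, pulled up by the 100
print(median_imp.fit_transform(X).ravel())  # NaN -> 2.5, robust to the outlier
```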
Outliers are data points that differ significantly from the other observations, and some models are quite sensitive to them. Before dealing with outliers, it is recommended to examine the dataset first.
For example, a common way to flag outliers is the 1.5 × IQR rule; here is a minimal sketch on made-up numbers:
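```python
import pandas as pd

s = pd.Series([10, 12, 11, 13, 12, 95, 11])  # 95 looks suspicious

# Interquartile range and the usual 1.5 * IQR fences.
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

print(s[(s < lower) | (s > upper)])  # flags the 95
```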
What is the data leakage problem in machine learning models?
Data leakage occurs when the data we use to train a machine learning model contains information about the very thing the model is trying to predict, information that would not be available at prediction time. This can lead to unreliable prediction results after the model is deployed.
This problem is often caused by how data standardization or normalization is applied: many of us fit these transformations on the whole dataset before splitting it into training and test sets, which lets statistics from the test set leak into training.
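A minimal sketch of the safe order with scikit-learn (using the built-in iris data as a stand-in): split first, then fit the scaler on the training set only.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Split FIRST, so no test-set statistics influence training.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)  # learn mean/std from train only
X_test = scaler.transform(X_test)        # apply, but never re-fit, on test
```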
Another common question is how to choose a machine learning model. In practice, I feel that turning to complex models unnecessarily can create interpretability issues for business-oriented people; for example, linear regression is easier to interpret than a neural network.
Choose the model mainly based on the size and complexity of the dataset. If we are dealing with complex problems, we can use more powerful models such as SVM, KNN, random forest, etc.
Most of the time, the data exploration phase helps us choose the model. If a visualization shows the data follows a linear pattern, a linear model such as linear regression is a natural fit. Support vector machines and KNN are useful when we don't know much about the data.
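When in doubt, a quick cross-validated comparison of a few candidates is cheap. A sketch, again on the iris data as a stand-in:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Compare a few default models with 5-fold cross-validation.
for name, model in [("SVM", SVC()),
                    ("KNN", KNeighborsClassifier()),
                    ("Random forest", RandomForestClassifier(random_state=0))]:
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: mean accuracy {scores.mean():.3f}")
```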
Metrics quantitatively measure how well a model's predictions agree with the real data. For regression problems, the key metrics are the R2 score, MAE (mean absolute error), and RMSE (root mean squared error). For classification problems, the key metrics are precision, recall, F1 score, and the confusion matrix.
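All of these are available in scikit-learn; a small sketch on toy numbers:

```python
import numpy as np
from sklearn.metrics import (confusion_matrix, f1_score, mean_absolute_error,
                             mean_squared_error, precision_score, r2_score,
                             recall_score)

# Regression: toy true values vs. predictions.
y_true, y_pred = [3.0, 5.0, 2.5], [2.8, 5.2, 2.0]
print("R2:  ", r2_score(y_true, y_pred))
print("MAE: ", mean_absolute_error(y_true, y_pred))
print("RMSE:", np.sqrt(mean_squared_error(y_true, y_pred)))

# Classification: toy labels vs. predictions.
yt, yp = [1, 0, 1, 1, 0], [1, 0, 0, 1, 0]
print(precision_score(yt, yp), recall_score(yt, yp), f1_score(yt, yp))
print(confusion_matrix(yt, yp))
```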