The operating environment of this article: Windows 7 system, Dell G3 computer.
Data preprocessing refers to the necessary processing of collected data, such as reviewing, screening, and sorting, before the data is classified or grouped.
Data preprocessing serves two purposes: on the one hand, it improves the quality of the data; on the other hand, it adapts the data to the software or method used for analysis. Generally speaking, the data preprocessing steps are data cleaning, data integration, data transformation, and data reduction, and each major step contains some smaller substeps. Of course, not all four major steps are necessarily performed in every preprocessing task.
1. Data cleaning
Data cleaning, as the name suggests, turns "dirty" data into "clean" data. Dirty data can be dirty in form or in content:
Dirty in form, such as missing values and special symbols;
Dirty in content, such as outliers.
1. Missing values
Dealing with missing values involves two tasks: identifying them and processing them.
In R, the function is.na is used to identify missing values, and the function complete.cases is used to identify whether the sample data is complete.
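A minimal identification sketch in R; `mydata` is a hypothetical data frame invented for illustration:

```r
# Hypothetical data frame with some missing values
mydata <- data.frame(height = c(170, NA, 165), weight = c(60, 55, NA))

is.na(mydata)                 # TRUE wherever a value is missing
complete.cases(mydata)        # TRUE for rows with no missing values
sum(!complete.cases(mydata))  # number of incomplete rows
```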
Commonly used methods for dealing with missing values are deletion, replacement, and imputation; a short R sketch of all three follows the list.
- Deletion method: depending on what is deleted, this divides into deleting observations and deleting variables. Deleting observations (listwise deletion) can be done with the na.omit function in R, which removes rows containing missing values. This trades sample size for completeness of information. When a variable has many missing values and little bearing on the research objective, consider deleting the variable instead, using the statement mydata[,-p] in R, where mydata is the data set, p is the column number of the variable to delete, and the minus sign indicates deletion.
- Replacement method: as the name suggests, replaces missing values, with different rules for different variable types. If the variable is numeric, missing values are replaced by the mean of the other observed values of that variable; if it is non-numeric, the median or mode of the other observed values is used.
- Imputation method: divided into regression imputation and multiple imputation.
Regression imputation treats the variable to be imputed as the dependent variable y and the other variables as independent variables, fits a regression model with the lm function in R, and predicts the missing values from the fit;
Multiple imputation generates several complete data sets from a data set containing missing values by repeatedly drawing random plausible values for the missing entries; the mice package in R implements it.
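A hedged sketch of the three approaches, using a small invented data frame; the last lines assume the mice package is installed:

```r
# Hypothetical data frame with missing values, for illustration only
mydata <- data.frame(x1 = c(1, 2, NA, 4, 5),
                     x2 = c(2.1, 3.9, 6.0, NA, 10.2))

# Deletion: drop rows containing missing values (listwise deletion)
clean <- na.omit(mydata)

# Deletion of a variable: drop column p (here the 2nd column)
p <- 2
reduced <- mydata[, -p]

# Replacement: fill missing numeric values with the variable's mean
mydata$x1[is.na(mydata$x1)] <- mean(mydata$x1, na.rm = TRUE)

# Regression imputation: fit lm on complete cases, predict the missing x2
fit <- lm(x2 ~ x1, data = mydata)
miss <- is.na(mydata$x2)
mydata$x2[miss] <- predict(fit, newdata = mydata[miss, ])

# Multiple imputation with the mice package (if installed)
# library(mice)
# imp <- mice(mydata, m = 5)     # five imputed data sets
# completed <- complete(imp, 1)  # extract the first one
```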
2. Outliers
As with missing values, handling outliers involves both identification and processing.
Outliers are usually identified with a univariate scatter plot or a box plot. In R, the dotchart function draws a univariate scatter plot and the boxplot function draws a box plot; in either plot, points far from the normal range are regarded as outliers.
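A minimal identification sketch, using an invented vector x with one planted outlier (boxplot.stats is a base-R helper that returns the points a box plot would flag):

```r
# Hypothetical numeric vector with one obvious outlier
x <- c(5.2, 4.9, 5.1, 5.0, 30.0, 5.3)

dotchart(x)           # univariate scatter (dot) plot; the outlier stands apart
boxplot(x)            # box plot; outliers are drawn as isolated points
boxplot.stats(x)$out  # values the box plot flags as outliers
```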
There are several ways to process outliers; before choosing one, first review the possible causes of each outlier and then decide whether it should be discarded:
- Delete the observations containing outliers (with few samples, direct deletion shrinks the sample and changes the variables' distributions);
- Treat them as missing values (and fill them in using the existing information);
- Average correction (replace the outlier with the average of the observations immediately before and after it);
- Leave them unchanged.
A short R sketch of the middle two options follows.
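Continuing the invented example above:

```r
# Flag the values the box plot would mark as outliers
out <- x %in% boxplot.stats(x)$out

# Option 1: treat outliers as missing values, then impute as in section 1
x_na <- x
x_na[out] <- NA

# Option 2: average correction, replacing an interior outlier with the
# mean of the observations immediately before and after it
i <- which(out)[1]
x_avg <- x
x_avg[i] <- mean(c(x[i - 1], x[i + 1]))
```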
2. Data integration
Data integration means merging multiple data sources into one data store. Of course, if the data being analyzed already lives in a single data store, no integration is needed.
In R, data integration combines two data frames on a key with the merge function: merge(dataframe1, dataframe2, by="key"). By default the result is sorted in ascending order of the key.
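A small illustration with two invented data frames keyed on a hypothetical id column:

```r
# Two hypothetical data frames sharing the key column "id"
df1 <- data.frame(id = c(1, 2, 3), height = c(170, 165, 180))
df2 <- data.frame(id = c(2, 3, 4), weight = c(55, 70, 62))

merge(df1, df2, by = "id")              # inner join, sorted by id ascending
merge(df1, df2, by = "id", all = TRUE)  # keep unmatched rows too, NA-filled
```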
The following problems may occur when performing data integration:
- Same name, different meaning: an attribute in data source A and an attribute in data source B share a name but represent different entities, so the name cannot be used as a key;
- Different name, same meaning (synonyms): an attribute is named differently in the two data sources but represents the same entity, and it can still serve as a key;
- Data redundancy: integration often produces redundancy; the same attribute may appear several times, or duplicates may arise from inconsistent attribute names. Detect duplicate attributes with correlation analysis first, and delete them where found.
3. Data transformation
Data transformation converts the data into a form appropriate for the software or the analysis method.
1. Simple function transformation
Simple function transformations are used to turn data that lack a normal distribution into data that have one; commonly used transformations include the square, square root, logarithm, and difference. For example, in time series analysis, the logarithm or difference is often taken to turn a non-stationary series into a stationary one.
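A brief sketch, using an invented exponential-growth series, of the transformations mentioned above:

```r
# Hypothetical non-stationary series: exponential growth plus noise
set.seed(1)
y <- exp(0.05 * (1:100)) + rnorm(100, sd = 0.1)

log_y  <- log(y)       # log transform compresses the growth
diff_y <- diff(log_y)  # differencing the logged series is roughly stationary
sqrt_y <- sqrt(y)      # square-root transform, another variance stabilizer
```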
2. Standardization
Standardization removes the influence of a variable's scale. For example, heights and weights differ in both units and ranges of values, so their raw differences are not directly comparable. The common methods are listed below, with a short R sketch after the list.
- Min-max normalization: also called dispersion standardization; linearly transforms the data so that its range becomes [0,1];
- Zero-mean normalization: also called standard-deviation standardization; the processed data have mean 0 and standard deviation 1;
- Decimal scaling normalization: moves the decimal point of the attribute values, mapping them into [-1,1].
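A short sketch of the three normalizations on an invented vector (the built-in scale function performs zero-mean normalization):

```r
x <- c(12, 45, 7, 89, 33)  # hypothetical attribute values

# Min-max normalization: linear map onto [0, 1]
x_minmax <- (x - min(x)) / (max(x) - min(x))

# Zero-mean (z-score) normalization: mean 0, standard deviation 1
x_z <- scale(x)            # or (x - mean(x)) / sd(x)

# Decimal scaling: divide by 10^k so all values fall in [-1, 1]
k <- ceiling(log10(max(abs(x))))
x_dec <- x / 10^k
```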
3. Continuous attribute discretization
Converting continuous attribute variables into categorical attributes is called discretization of continuous attributes. Some classification algorithms, such as ID3, require the data to be categorical. Commonly used discretization methods include the following (a short R sketch follows the list):
- Equal-width method: divide the attribute's value range into intervals of equal width, much like building a frequency distribution table;
- Equal-frequency method: place the same number of records into each interval;
- One-dimensional clustering: two steps; first cluster the values of the continuous attribute with a clustering algorithm, then merge each resulting cluster into one group and mark it with a single label.
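A sketch of the three methods on an invented vector; cut and quantile are base R, and kmeans supplies the one-dimensional clustering:

```r
x <- c(3, 7, 12, 18, 25, 31, 40, 44, 52, 60)  # hypothetical continuous attribute

# Equal-width: split the range into 3 intervals of the same width
equal_width <- cut(x, breaks = 3)

# Equal-frequency: intervals that each hold the same number of records
equal_freq <- cut(x, breaks = quantile(x, probs = seq(0, 1, length.out = 4)),
                  include.lowest = TRUE)

# One-dimensional clustering: k-means on the single attribute, then use
# the cluster assignment as the categorical label
km <- kmeans(x, centers = 3)
one_dim <- factor(km$cluster)
```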
4. Data reduction
Data reduction means, on the basis of understanding the mining task and the data itself, finding useful features that depend on the discovery target so as to shrink the data, minimizing the amount of data while preserving its original character as much as possible. Data reduction lessens the impact of invalid and erroneous data on modeling, shortens run time, and reduces storage space.
1. Attribute reduction
Attribute reduction looks for the smallest attribute subset whose probability distribution is as close as possible to that of the original data. Common approaches:
- Merge attributes: combine several old attributes into one new attribute;
- Stepwise forward selection: starting from an empty attribute set, at each step select the current best attribute from the original set and add it to the subset, until no further best attribute can be selected or a constraint is satisfied;
- Stepwise backward elimination: starting from the full attribute set, at each step select the current worst attribute and remove it from the subset, until no further worst attribute can be selected or a constraint is satisfied;
- Decision tree induction: build a decision tree on the data; attributes that do not appear in the tree are deleted from the initial set, leaving a reduced attribute subset;
- Principal component analysis: explain most of the variation in the original data with fewer variables (it converts highly correlated variables into mutually independent or uncorrelated components). A brief sketch in R follows.
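A brief PCA sketch using base R's prcomp on invented correlated data:

```r
# Hypothetical data with two highly correlated columns
set.seed(1)
d <- data.frame(a = rnorm(50))
d$b <- d$a * 0.9 + rnorm(50, sd = 0.1)  # highly correlated with a
d$c <- rnorm(50)

pca <- prcomp(d, scale. = TRUE)  # standardize, then extract components
summary(pca)             # proportion of variance explained per component
reduced <- pca$x[, 1:2]  # keep the first two components as new variables
```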
2. Numerical reduction
Numerical reduction reduces the amount of data itself and includes parametric and non-parametric methods: parametric methods include linear regression and multiple regression; non-parametric methods include histograms, sampling, and so on.
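A short closing sketch on invented data: sampling and a histogram as non-parametric reductions, and a linear model whose coefficients stand in for the raw data as a parametric reduction:

```r
# Hypothetical large data frame
big <- data.frame(x = rnorm(10000), y = rnorm(10000))

# Non-parametric: simple random sampling of 1,000 rows
idx <- sample(nrow(big), size = 1000)
small <- big[idx, ]

# Non-parametric: a histogram summarizes x with a handful of bins
h <- hist(big$x, breaks = 20, plot = FALSE)

# Parametric: replace the (x, y) pairs with a fitted model's coefficients
fit <- lm(y ~ x, data = big)
coef(fit)  # two numbers stand in for the raw data
```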