What are the four methods for eliminating abnormal data?
The four methods for eliminating abnormal data are: 1. Isolation Forest; 2. DBSCAN; 3. OneClassSVM; 4. Local Outlier Factor (LOF), which calculates a numerical score that reflects how abnormal a sample is.
The operating environment of this tutorial: Windows 7 system, Dell G3 computer.
Outlier detection: methods for identifying outliers
1. Isolation Forest
1.1 Test sample example
File test.pkl
1.2 Isolation Forest demo
Isolation Forest principle
Isolation Forest builds a forest of random trees by repeatedly picking a random feature and a random split value. A point that can be separated from the rest after only a small number of splits is considered an abnormal point.
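The principle above can be illustrated with a minimal sketch. It assumes scikit-learn's IsolationForest and uses synthetic data in place of test.pkl; the data, the parameter values and the variable names are assumptions chosen for illustration, not the original demo.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic 2-D data (an assumption; the article loads its samples from test.pkl)
rng = np.random.RandomState(42)
X_inliers = 0.3 * rng.randn(200, 2)                      # dense cluster around the origin
X_outliers = rng.uniform(low=-4, high=4, size=(20, 2))   # scattered points
X_train = np.vstack([X_inliers, X_outliers])

# contamination is the expected proportion of abnormal points in the data
clf = IsolationForest(n_estimators=100, contamination=0.1, random_state=42)
labels = clf.fit_predict(X_train)        # 1 = normal, -1 = abnormal

# score_samples returns lower values for samples that are easier to isolate
scores = clf.score_samples(X_train)

X_train_normal = X_train[labels == 1]
print(len(X_train), "samples,", int((labels == -1).sum()), "flagged as abnormal")
```

Points that need fewer random splits to isolate receive lower scores, and the contamination value decides where the cut between normal and abnormal is placed.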
1.3 To adapt the demo, replace X_train with your own data
No standardization is applied here. You can standardize the data first (from sklearn.preprocessing import StandardScaler) and then remove outliers based on the standardized data.
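One possible way to standardize before removing outliers is sketched below. The data is hypothetical and the parameter values are assumptions. Scaling tends to matter most for distance-based methods such as DBSCAN and Local Outlier Factor, while the random splits of Isolation Forest are much less sensitive to the scale of each feature.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import StandardScaler

# Hypothetical data: two features on very different scales
rng = np.random.RandomState(0)
X_train = np.column_stack([rng.normal(0, 1, 300), rng.normal(0, 1000, 300)])

# Standardize each feature to zero mean and unit variance
X_scaled = StandardScaler().fit_transform(X_train)

# Detect outliers on the standardized data
clf = IsolationForest(contamination=0.05, random_state=0)
labels = clf.fit_predict(X_scaled)

# Use the labels to filter the rows of the original, unscaled data
X_train_normal = X_train[labels == 1]
```

Filtering the original array with labels computed on the scaled copy keeps the cleaned data in its original units.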
1.4 Core code
1.4.1 Sample data
1.4.2 Core code implementation
clf = IsolationForest(max_samples=0.8, contamination=0.25)
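A short sketch of how the classifier line above could be used end to end. The random placeholder data and random_state are assumptions added so the example runs on its own; max_samples and contamination are the values from the line above.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Placeholder data (an assumption); replace with your own X_train
rng = np.random.RandomState(1)
X_train = rng.randn(100, 2)

# max_samples=0.8: a float means each tree is built on that fraction of the samples
# contamination=0.25: about a quarter of the samples will be labeled abnormal
clf = IsolationForest(max_samples=0.8, contamination=0.25, random_state=1)
clf.fit(X_train)

labels = clf.predict(X_train)              # 1 = normal, -1 = abnormal
X_train_normal = X_train[labels == 1]
X_train_abnormal = X_train[labels == -1]
print(len(X_train_normal), "normal,", len(X_train_abnormal), "abnormal")
```

A contamination of 0.25 is aggressive: it removes roughly a quarter of the data whether or not that many points are truly abnormal, so it is worth tuning to the data set.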
2. DBSCAN
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) principle
Take each point as a center, and set a neighborhood radius (eps) and the number of points required inside that neighborhood (min_samples). If the neighborhood contains at least the required number of points, the point is a core point and belongs to the same category as the points in its neighborhood. If it contains fewer, but the point lies within the neighborhood of a core point, it is a boundary point. Points that are neither core nor boundary points are noise and are treated as abnormal.
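The core, boundary and noise categories described above can be inspected with a small sketch. It assumes scikit-learn's DBSCAN; the synthetic data and the eps and min_samples values are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Synthetic data (an assumption): two dense clusters plus three isolated points
rng = np.random.RandomState(0)
cluster_a = rng.normal(loc=(0, 0), scale=0.2, size=(100, 2))
cluster_b = rng.normal(loc=(3, 3), scale=0.2, size=(100, 2))
far_points = np.array([[10.0, 10.0], [-8.0, 6.0], [7.0, -9.0]])
X_train = np.vstack([cluster_a, cluster_b, far_points])

# eps is the neighborhood radius, min_samples the number of points required in it
model = DBSCAN(eps=0.5, min_samples=5)
labels = model.fit_predict(X_train)    # cluster index per sample, -1 for noise

# Split the samples into core, boundary and noise points
is_core = np.zeros(len(X_train), dtype=bool)
is_core[model.core_sample_indices_] = True
is_noise = labels == -1
is_border = ~is_core & ~is_noise

# Noise points are the ones treated as abnormal
X_train_normal = X_train[~is_noise]
print("core:", is_core.sum(), "boundary:", is_border.sum(), "noise:", is_noise.sum())
```

Unlike Isolation Forest, DBSCAN has no contamination parameter: how many points end up as noise follows entirely from eps and min_samples.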
2.1 DBSCAN demo
2.2 Using custom test examples
Note: At both ends of the test sample, DBSCAN classifies the samples at the "tips" better than Isolation Forest does.
2.3 Core code
model = DBSCAN(eps=eps, min_samples=min_samples)  # Construct the clustering model
2.4 Construct a filtering function
This function standardizes the data first, so that fixed parameters can be used for the analysis.
2.4.1 Filter function
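A minimal sketch of such a filter function, assuming StandardScaler followed by DBSCAN. The name filter_outliers, its default parameters and the test data are hypothetical and are not taken from the original code.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

def filter_outliers(X, eps=0.5, min_samples=5):
    """Hypothetical helper: standardize X, run DBSCAN, drop the noise points.

    Returns the kept rows of the original X and the boolean mask used.
    """
    X = np.asarray(X, dtype=float)
    # Standardizing first lets the same eps work across differently scaled data
    X_scaled = StandardScaler().fit_transform(X)
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X_scaled)
    keep = labels != -1                  # -1 marks noise, i.e. abnormal points
    return X[keep], keep

# Usage on hypothetical data: features on different scales plus one far point
rng = np.random.RandomState(0)
X = rng.normal(0, 1, size=(200, 2)) * np.array([1.0, 1000.0])
X = np.vstack([X, np.array([[8.0, 8000.0]])])
X_normal, keep = filter_outliers(X)
print(X.shape[0] - X_normal.shape[0], "points removed")
```

Returning the mask together with the filtered rows makes it easy to apply the same selection to labels or other aligned columns.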
2.4.2 Measure classification results
(I’m too lazy to convert the markdown format, so I just took a screenshot::>_<::)
3. OneClassSVM
3.2 Core code
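As a rough illustration of what the core OneClassSVM step might look like, here is a minimal sketch assuming scikit-learn's OneClassSVM. The synthetic data, the kernel and the nu value are assumptions, not the article's original settings.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.svm import OneClassSVM

# Synthetic data (an assumption): one dense cluster plus scattered points
rng = np.random.RandomState(0)
X_train = np.vstack([0.5 * rng.randn(200, 2),
                     rng.uniform(-6, 6, size=(20, 2))])

# The RBF kernel is distance based, so the features are standardized first
X_scaled = StandardScaler().fit_transform(X_train)

# nu bounds the fraction of training samples that may fall outside the boundary
clf = OneClassSVM(kernel="rbf", gamma="scale", nu=0.1)
labels = clf.fit_predict(X_scaled)     # 1 = normal, -1 = abnormal

X_train_normal = X_train[labels == 1]
print(int((labels == -1).sum()), "points flagged as abnormal")
```

Here nu plays a role similar to contamination in the other estimators, although it is a bound rather than an exact proportion.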
Before removing abnormal points
plt.scatter(X_train_normal[:,0],X_train_normal[:,1])
plt.show()
4. Local Outlier Factor
from sklearn.neighbors import LocalOutlierFactor
X_train = X_train_demo.values
# Build the classifier
# Each sample is compared with its 25 nearest neighbors; the expected proportion of outliers is 0.2
clf = LocalOutlierFactor(n_neighbors=25, contamination=0.2)
# Predict; each label is either -1 (abnormal) or 1 (normal)
labels = clf.fit_predict(X_train)
# Keep the normal points
X_train_normal = X_train[labels>0]
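Besides the -1/1 labels, the fitted estimator also exposes the numerical score that reflects how abnormal each sample is. The sketch below assumes scikit-learn's negative_outlier_factor_ attribute and uses synthetic data in place of X_train_demo; the data and variable names are assumptions.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

# Synthetic data (an assumption): one cluster plus two isolated points at the end
rng = np.random.RandomState(0)
X_train = np.vstack([0.3 * rng.randn(100, 2),
                     np.array([[5.0, 5.0], [-6.0, 4.0]])])

clf = LocalOutlierFactor(n_neighbors=25, contamination=0.2)
labels = clf.fit_predict(X_train)

# negative_outlier_factor_ stores the LOF with its sign flipped:
# values near -1 are normal, strongly negative values are abnormal
lof_scores = -clf.negative_outlier_factor_

# Indices of the two samples with the highest abnormality score
most_abnormal = np.argsort(lof_scores)[::-1][:2]
print("most abnormal samples:", sorted(most_abnormal.tolist()))
```

Ranking by score is useful when a fixed contamination value is hard to justify, since it lets you inspect the most suspicious samples before deciding on a cutoff.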
Before eliminating abnormal points
plt.scatter(X_train[:,0],X_train[:,1])
plt.show()
After eliminating abnormal points
plt.scatter(X_train_normal[:,0],X_train_normal[:,1])
plt.show()