How to Perform the Grubbs Test in Python

The Grubbs test is a statistical hypothesis test used to detect outliers in a data set. Outliers, also known as anomalies, are observations that deviate markedly from the rest of the data distribution. Models trained on data sets with outliers tend to be more susceptible to overfitting than models trained on approximately normal (Gaussian) data, so it is usually worth addressing outliers before machine learning modeling. Before we can handle outliers, we must detect and locate them in the data set. Popular outlier detection techniques include the Q-Q plot, the interquartile range, and the Grubbs statistical test. This article discusses only the Grubbs test. You will learn what the Grubbs test is and how to implement it in Python.

What is an outlier?

Outliers are observations that lie numerically far from the other values in the data. They fall outside the range expected for normally distributed data. In a normal distribution, about 68% of the records lie within one standard deviation of the mean, about 95% within two standard deviations, and about 99.7% within three standard deviations, so values far beyond that range are suspicious. A related rule of thumb uses quartiles: records more than 1.5 times the interquartile range below the first quartile or above the third quartile are considered outliers.
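As a point of comparison before turning to the Grubbs test, here is a minimal sketch of the interquartile-range rule. It assumes the conventional 1.5 × IQR fences (the multiplier is a common convention, not something this article prescribes) and NumPy's default percentile interpolation; the variable names are chosen for illustration only.

```python
import numpy as np

# Same sample the article uses later for the Grubbs examples.
data = np.array([5, 14, 15, 15, 14, 19, 17, 16, 20, 22, 8, 21, 28, 11, 9, 29, 40])

# First and third quartiles (NumPy's default linear interpolation) and the IQR.
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1

# Fences at 1.5 * IQR beyond each quartile (assumed conventional multiplier).
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Values outside the fences are flagged as candidate outliers.
iqr_outliers = data[(data < lower_fence) | (data > upper_fence)]
print("Fences:", lower_fence, upper_fence)
print("Flagged by the IQR rule:", iqr_outliers)
```

Unlike the Grubbs test, this rule is a heuristic rather than a hypothesis test, so it gives no significance level for the flagged values.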

Grubbs Statistical Hypothesis Test

Like any other statistical hypothesis test, the Grubbs test either rejects or fails to reject the null hypothesis (H0) in favor of the alternative hypothesis (H1). It tests whether the most extreme value in a data set is an outlier.

We can perform the Grubbs test in two ways, as a one-sided test or as a two-sided test, on univariate samples that are approximately normally distributed and contain at least seven observations. The test is also called the extreme studentized deviate test or the maximum normed residual test.
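The two preconditions above (enough observations, approximate normality) can be checked roughly before running the test. The sketch below is one possible way to do that, not part of the Grubbs test itself: `check_grubbs_preconditions` is a hypothetical helper name, the threshold of seven observations comes from the paragraph above, and the choice of the Shapiro-Wilk test (`scipy.stats.shapiro`) as the normality check is an assumption.

```python
import numpy as np
from scipy import stats

def check_grubbs_preconditions(data, min_n=7):
    """Hypothetical helper: summarize sample size and a Shapiro-Wilk normality check."""
    values = np.asarray(data, dtype=float)
    enough_points = bool(values.size >= min_n)
    shapiro_p = None
    if values.size >= 3:  # Shapiro-Wilk needs at least three observations
        _, p_value = stats.shapiro(values)
        shapiro_p = float(p_value)
    return {"n": int(values.size), "enough_points": enough_points, "shapiro_p": shapiro_p}

sample = [5, 14, 15, 15, 14, 19, 17, 16, 20, 22, 8, 21, 28, 11, 9, 29, 40]
report = check_grubbs_preconditions(sample)
print(report)
```

A small p-value suggests a departure from normality, but a single extreme value can itself lower the p-value, so the result is best read as a rough guide rather than a gate.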

The Grubbs test uses the following hypotheses -

Null (H0): The data set has no outliers.
Alternative (H1): The data set has exactly one outlier.

Grubbs Test in Python

Python has a large collection of libraries with ready-made functions for statistical testing, including one for running the Grubbs test to detect outliers. We will explore two ways to implement the Grubbs test in Python: using the built-in functions of a library, and implementing the formula from scratch.

The outlier_utils Library and smirnov_grubbs

Let us first install the outlier_utils library using the following command (the leading ! runs it from a notebook cell; omit it in a regular shell).

!pip install outlier_utils

Now let's make a dataset containing outliers and perform a Grubbs test.

Two-sided Grubbs test

Syntax

grubbs.test(data, alpha=.05)

Parameters

data - Numeric vector of data values.

alpha - The significance level of the test.

Description

In this method, we call the test() function of the smirnov_grubbs module from the outliers package and pass the data as input to run Grubbs' test.

Example

import numpy as np
from outliers import smirnov_grubbs as grubbs
 
#define data
data = np.array([ 5, 14, 15, 15, 14, 19, 17, 16, 20, 22, 8, 21, 28, 11, 9, 29, 40])
 
#perform Grubbs' test
grubbs.test(data, alpha=.05)

Output

array([ 5, 14, 15, 15, 14, 19, 17, 16, 20, 22,  8, 21, 28, 11,  9, 29])

The code above loads the library, defines the data, and calls the test method to run a Grubbs test on the data. The two-sided test considers both ends of the sample (left and right): the value farthest from the mean is tested, whether it is the minimum or the maximum. The data had only one outlier (40), and the returned array is the data with that value removed.

One-sided Grubbs test

Syntax

grubbs.min_test(data, alpha=.05)
grubbs.max_test(data, alpha=.05)

Description

In this method, we call the grubbs.min_test() function to test whether the smallest value in the data set is an outlier, or the grubbs.max_test() function to test whether the largest value is an outlier. Either call performs a one-sided Grubbs' test.

Example

import numpy as np
from outliers import smirnov_grubbs as grubbs
 
#define data
data = np.array([5, 14, 15, 15, 14, 19, 17, 16, 20, 22, 8, 21, 28, 11, 9, 29, 40])

#perform Grubbs' test to check whether the minimum value is an outlier
print(grubbs.min_test(data, alpha=.05))

#perform Grubbs' test to check whether the maximum value is an outlier
grubbs.max_test(data, alpha=.05)

Output

[ 5 14 15 15 14 19 17 16 20 22  8 21 28 11  9 29 40]
array([ 5, 14, 15, 15, 14, 19, 17, 16, 20, 22,  8, 21, 28, 11,  9, 29])

A one-sided Grubbs test checks only one end of the sample: the minimum or the maximum. In the output, min_test returns the data unchanged because the smallest value (5) is not an outlier, while max_test removes the largest value (40) as an outlier.
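Numerically, the one-sided and two-sided variants differ in the significance level handed to the t-distribution quantile: α/n for a one-sided test and α/(2n) for a two-sided test. The sketch below computes both critical values with SciPy; `grubbs_critical` is an illustrative name and is not a function of outlier_utils.

```python
import numpy as np
from scipy import stats

def grubbs_critical(n, alpha=0.05, two_sided=True):
    """Critical value of the Grubbs statistic for a sample of size n (illustrative helper)."""
    # A two-sided test splits alpha across both tails: alpha / (2n) instead of alpha / n.
    tail = alpha / (2 * n) if two_sided else alpha / n
    # Upper-tail quantile of the t-distribution with n - 2 degrees of freedom.
    t = stats.t.ppf(1 - tail, n - 2)
    return (n - 1) / np.sqrt(n) * np.sqrt(t**2 / (n - 2 + t**2))

n = 17  # size of the sample used in the examples above
print("two-sided critical value:", grubbs_critical(n, two_sided=True))
print("one-sided critical value:", grubbs_critical(n, two_sided=False))
```

The one-sided critical value is the smaller of the two, so for the same α a one-sided test is somewhat more willing to flag the extreme value on the side being tested.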

Formula implementation

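Here we implement the Grubbs test formula in Python using the NumPy and SciPy libraries. The test statistic is $G = \max_i |x_i - \bar{x}| / s$, where $\bar{x}$ is the mean and $s$ is the standard deviation of the data. The two-sided critical value at significance level $\alpha$ is $G_{crit} = \frac{n-1}{\sqrt{n}} \sqrt{\frac{t^2}{n-2+t^2}}$, where $n$ is the sample size and $t$ is the upper $\alpha/(2n)$ critical value of the t-distribution with $n-2$ degrees of freedom.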

Syntax

g_calculated = numerator/sd_x
g_critical = ((n - 1) * np.sqrt(np.square(t_value_1))) / (np.sqrt(n) * np.sqrt(n - 2 + np.square(t_value_1)))

Algorithm

The implementation steps are as follows -

Calculate the mean of the data set values.
Calculate the standard deviation of the data set values.
Calculate the numerator of the Grubbs statistic: the largest absolute deviation of any value from the mean.
Divide the numerator by the standard deviation to get the calculated value.
Calculate the critical value for the same sample size and significance level from the t-distribution.
If the critical value is greater than the calculated value, there are no outliers in the data set; otherwise there is an outlier.

Example

import numpy as np
import scipy.stats as stats

## define data
x = np.array([12,13,14,19,21,23])
y = np.array([12,13,14,19,21,23,45])

## implement Grubbs test (two-sided, alpha = 0.05 by default)
def grubbs_test(x, alpha=0.05):
   n = len(x)
   mean_x = np.mean(x)
   # Grubbs' statistic uses the sample standard deviation (ddof=1)
   sd_x = np.std(x, ddof=1)
   # largest absolute deviation from the mean
   numerator = np.max(np.abs(x - mean_x))
   g_calculated = numerator/sd_x
   print(f"Grubbs Calculated Value: {g_calculated:.4f}")
   # upper alpha/(2n) quantile of the t-distribution with n-2 degrees of freedom
   t_value_1 = stats.t.ppf(1 - alpha / (2 * n), n - 2)
   g_critical = ((n - 1) * np.sqrt(np.square(t_value_1))) / (np.sqrt(n) * np.sqrt(n - 2 + np.square(t_value_1)))
   print(f"Grubbs Critical Value: {g_critical:.4f}")
   if g_critical > g_calculated:
      print("The calculated value is less than the critical value. Fail to reject the null hypothesis: there are no outliers\n")
   else:
      print("The calculated value exceeds the critical value. Reject the null hypothesis: there is an outlier\n")

grubbs_test(x)
grubbs_test(y)

Output

Grubbs Calculated Value: 1.3031
Grubbs Critical Value: 1.8871
The calculated value is less than the critical value. Fail to reject the null hypothesis: there are no outliers

Grubbs Calculated Value: 2.1076
Grubbs Critical Value: 2.0200
The calculated value exceeds the critical value. Reject the null hypothesis: there is an outlier

The result of the Grubbs test shows that the array x does not have any outliers, but y has one outlier (45).
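For reuse in other code, it can be more convenient to return the result instead of printing it. The following is a sketch under stated assumptions, not the method of any particular library: `grubbs_single` and its `side` argument are names invented for this example, the statistic uses the sample standard deviation (`ddof=1`), and the one-sided variants use α/n in place of α/(2n).

```python
import numpy as np
from scipy import stats

def grubbs_single(data, alpha=0.05, side="two-sided"):
    """Run one Grubbs test; return (is_outlier, suspect_value, g, g_critical).

    side is "two-sided", "min", or "max" (names chosen for this sketch).
    """
    if side not in ("two-sided", "min", "max"):
        raise ValueError("side must be 'two-sided', 'min', or 'max'")
    values = np.asarray(data, dtype=float)
    n = values.size
    mean, sd = values.mean(), values.std(ddof=1)  # sample standard deviation
    if side == "min":
        suspect = values.min()
        g = (mean - suspect) / sd
    elif side == "max":
        suspect = values.max()
        g = (suspect - mean) / sd
    else:
        # Two-sided: test whichever value lies farthest from the mean.
        suspect = values[np.argmax(np.abs(values - mean))]
        g = abs(suspect - mean) / sd
    tail = alpha / (2 * n) if side == "two-sided" else alpha / n
    t = stats.t.ppf(1 - tail, n - 2)
    g_critical = (n - 1) / np.sqrt(n) * np.sqrt(t**2 / (n - 2 + t**2))
    return bool(g > g_critical), float(suspect), float(g), float(g_critical)

y = np.array([12, 13, 14, 19, 21, 23, 45])
print(grubbs_single(y))              # two-sided test on the largest deviation
print(grubbs_single(y, side="min"))  # one-sided test on the minimum only
```

Returning a tuple leaves the decision about what to do with the suspect value (drop it, inspect it, keep it) to the caller.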

Conclusion

We learned about outliers and Grubbs tests in Python in this article. Let’s wrap up this article with some key points.

Outliers are records that lie far from the rest of the data, for example well beyond the interquartile range.
Outliers do not conform to the distribution followed by the rest of the data set.
We can use the Grubbs statistical hypothesis test to detect outliers.
We can run Grubbs tests using the built-in methods provided in the outlier_utils library.
The two-sided Grubbs test checks both the left and right ends of the sample and removes the detected outlier.
The one-sided Grubbs test checks only one end, either the minimum or the maximum.
