How can I use SciPy to find the best-fitting theoretical distribution for an empirical dataset and calculate probabilities exceeding a given threshold?

DDD

Release： 2024-11-24 11:36:09

Original

589 people have browsed it

How can I use SciPy to find the best-fitting theoretical distribution for an empirical dataset and calculate probabilities exceeding a given threshold?

Fitting Empirical Distributions to Theoretical Ones with Scipy

Problem Overview

Consider a dataset of integer values sampled from an unknown continuous distribution. We seek to determine the probability (p-value) of encountering values greater than any given threshold. To accurately estimate these probabilities, it is essential to fit our empirical distribution to a suitable theoretical distribution. This article explores how to perform such fitting using Scipy in Python.

Distribution Fitting

To assess the goodness of fit, we can employ a sum of squared errors (SSE) metric to compare the histograms of the empirical data and the fitted distributions. The distribution with the lowest SSE is considered the best fit.

Scipy Implementation

Scipy's statistical module provides a wide range of continuous distribution classes. We can iterate through each distribution, estimate its parameters, compute the SSE, and store the results.

Example: El Niño Dataset

Let us illustrate the process using Sea Surface Temperature (SST) data from the El Niño dataset.

Load the data and plot its histogram.
Perform distribution fitting using the SSE metric.
Identify the best-fit distribution based on the lowest SSE.
Plot the probability density function (PDF) of the best-fit distribution along with the empirical histogram.

The code below showcases this implementation:

import numpy as np
import pandas as pd
import scipy.stats as st
import matplotlib.pyplot as plt
from scipy.stats._continuous_distns import _distn_names
import warnings

# El Niño SST data
data = pd.Series(sm.datasets.elnino.load_pandas().data.set_index('YEAR').values.ravel())

# Function to fit distributions based on SSE
def best_fit_distribution(data):
    return sorted(
        [
            (getattr(st, distribution), distribution.fit(data), np.sum(np.power(data.hist(bins=50).values - distribution.pdf(data.index), 2.0))) 
            for distribution in _distn_names 
            if not distribution in ['levy_stable', 'studentized_range']
        ], 
    key=lambda x:x[2]
)

# Find best fit
best_dist = best_fit_distribution(data)[0]

# Plot distribution
fig, ax = plt.subplots(figsize=(12,8))
ax.plot(data.hist(bins=50, density=True, alpha=0.5, color='gray'))
param_names = best_dist[0].shapes + ', loc, scale' if best_dist[0].shapes else ['loc', 'scale']
param_str = ', '.join(['{}={:0.2f}'.format(k, v) for k, v in zip(param_names, best_dist[1])])
dist_str = '{}({})'.format(best_dist[0].name, param_str)

ax.plot(best_dist[0].pdf(data.index, **best_dist[1]), lw=2, label=dist_str)
ax.set_title('Fitted Distribution: ' + dist_str)
ax.set_xlabel('SST (°C)')
ax.set_ylabel('Frequency')
ax.legend()

Copy after login

The output shows the best-fit distribution as the Weibull distribution with parameters:

scale=0.64, loc=15.59

Copy after login

The above is the detailed content of How can I use SciPy to find the best-fitting theoretical distribution for an empirical dataset and calculate probabilities exceeding a given threshold?. For more information, please follow other related articles on the PHP Chinese website!