SciPy - Statistical Tests and Inference



Statistical tests and inference involve deriving conclusions about a population from sample data. These methodologies are fundamental for validating hypotheses, analyzing data trends, and making informed decisions in research, economics, engineering and many other fields. SciPy's scipy.stats module offers a comprehensive set of tools to perform various statistical tests and data inferences.

Important Statistical Tests in SciPy

The scipy.stats library in Python includes a variety of functions to execute tests such as t-tests, chi-square tests and ANOVA, helping you validate assumptions and test hypotheses in different applications.

SciPy provides several statistical tests designed to assess different types of data and determine if observed differences or relationships are statistically significant. These tests play a critical role in hypothesis testing and analysis.

t-Test

A t-test is used to assess whether the means of two groups differ significantly from one another; it is typically applied when comparing the results of two sample groups. The function scipy.stats.ttest_ind() can be used to perform a t-test on two independent samples.

The following example demonstrates how to perform a t-test on two datasets −

from scipy.stats import ttest_ind
import numpy as np

# Generate sample data
group1 = np.random.normal(0, 1, 100)
group2 = np.random.normal(0.5, 1, 100)

# Conduct the t-test
stat, p_value = ttest_ind(group1, group2)
print(f"t-statistic: {stat:.4f}")
print(f"p-value: {p_value:.4f}")

Here is the result of the t-test showing the t-statistic and p-value, which help us determine whether the difference between the two groups is statistically significant −

t-statistic: -3.1020
p-value: 0.0022
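
By default, ttest_ind() assumes the two groups have equal variances. When that assumption is doubtful, Welch's t-test can be requested through the equal_var argument. The following is a minimal sketch, using hypothetical samples with different spreads −

from scipy.stats import ttest_ind
import numpy as np

# Hypothetical samples with clearly different variances
sample_a = np.random.normal(0, 1, 50)
sample_b = np.random.normal(0.5, 3, 50)

# Welch's t-test: set equal_var=False when variances may differ
stat, p_value = ttest_ind(sample_a, sample_b, equal_var=False)
print(f"Welch t-statistic: {stat:.4f}")
print(f"p-value: {p_value:.4f}")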

Chi-Squared Test

The Chi-Squared Test is typically used to analyze categorical data, determining whether there is an association between two categorical variables. It's useful in situations like contingency tables where data is grouped into categories.

To perform a Chi-Squared test, SciPy provides the scipy.stats.chi2_contingency() function −

from scipy.stats import chi2_contingency
import numpy as np

# Example data in a contingency table
data = np.array([[10, 20], [20, 30]])

# Run the chi-squared test
chi2_stat, p_val, dof, expected = chi2_contingency(data)
print(f"Chi-squared statistic: {chi2_stat:.4f}")
print(f"p-value: {p_val:.4f}")
print(f"Degrees of freedom: {dof}")
print(f"Expected values: \n{expected}")

Below is the output of the Chi-squared test showing the statistic, p-value, degrees of freedom, and expected values:

Chi-squared statistic: 0.1280
p-value: 0.7205
Degrees of freedom: 1
Expected values: 
[[11.25 18.75]
 [18.75 31.25]]
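
A common follow-up is to compare the p-value against a chosen significance level. The short sketch below, assuming the conventional 0.05 threshold, shows how that decision might be expressed in code −

from scipy.stats import chi2_contingency
import numpy as np

data = np.array([[10, 20], [20, 30]])
chi2_stat, p_val, dof, expected = chi2_contingency(data)

# Compare the p-value to a conventional 5% significance level
alpha = 0.05
if p_val < alpha:
    print("Reject the null hypothesis: the variables appear associated.")
else:
    print("Fail to reject the null hypothesis: no evidence of association.")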

ANOVA (Analysis of Variance)

ANOVA tests whether there are significant differences among the means of three or more groups. It's useful when comparing multiple datasets to determine if at least one of them is different from the others.

To perform a one-way ANOVA, we can use the scipy.stats.f_oneway() function. The following example performs the ANOVA test −

from scipy.stats import f_oneway
import numpy as np

# Example data from three groups
group1 = np.random.normal(0, 1, 100)
group2 = np.random.normal(1, 1, 100)
group3 = np.random.normal(2, 1, 100)

# Run one-way ANOVA
f_stat, p_value = f_oneway(group1, group2, group3)
print(f"F-statistic: {f_stat:.4f}")
print(f"p-value: {p_value:.4f}")

Here's the result of the ANOVA test showing the F-statistic and p-value, which help us assess whether the group means are statistically different:

F-statistic: 75.5012
p-value: 0.0000
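
A significant ANOVA result tells us that at least one group mean differs, but not which one. A common next step is a pairwise post-hoc comparison; the sketch below uses scipy.stats.tukey_hsd(), assuming SciPy 1.8 or newer −

from scipy.stats import tukey_hsd
import numpy as np

group1 = np.random.normal(0, 1, 100)
group2 = np.random.normal(1, 1, 100)
group3 = np.random.normal(2, 1, 100)

# Tukey's HSD compares every pair of group means
result = tukey_hsd(group1, group2, group3)
print(result)   # table of pairwise statistics, p-values and confidence intervals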

Normality Tests

To determine whether a dataset follows a normal distribution, we can use normality tests such as the Shapiro-Wilk test or D'Agostino and Pearson's test, both available in SciPy. The scipy.stats.shapiro() function conducts the Shapiro-Wilk test to check normality −

from scipy.stats import shapiro
import numpy as np

# Example data
data = np.random.normal(0, 1, 100)

# Perform Shapiro-Wilk normality test
stat, p_value = shapiro(data)
print(f"Test statistic: {stat:.4f}")
print(f"p-value: {p_value:.4f}")

Following is the output of the Shapiro-Wilk test, which helps evaluate whether the sample data is consistent with a normal distribution −

Test statistic: 0.9878
p-value: 0.4939
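
The D'Agostino and Pearson's test mentioned above is available as scipy.stats.normaltest(), which combines skewness and kurtosis into a single statistic. A minimal sketch follows −

from scipy.stats import normaltest
import numpy as np

# Example data
data = np.random.normal(0, 1, 100)

# D'Agostino and Pearson's test based on skewness and kurtosis
stat, p_value = normaltest(data)
print(f"Test statistic: {stat:.4f}")
print(f"p-value: {p_value:.4f}")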

Using Statistical Inference in SciPy

SciPy provides essential tools for making inferences about a population from sample data, such as −

  • p-value: This is used to determine the statistical significance of test results. A p-value below a threshold (commonly 0.05) suggests a significant result.
  • Confidence Intervals: Estimate the range in which a population parameter (such as the mean) lies based on sample data.
  • Effect Size: Quantifies the magnitude of an observed effect or difference.

Using these methods, researchers can perform thorough statistical analyses and make decisions backed by solid evidence from their data. The last two points are illustrated in the sketch below.
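
As a brief illustration, the following sketch computes a 95% confidence interval for a sample mean with scipy.stats.t.interval() and a Cohen's d effect size. The cohens_d helper is written here purely for illustration and is not part of SciPy −

from scipy import stats
import numpy as np

sample = np.random.normal(0, 1, 100)

# 95% confidence interval for the mean, using the t distribution
ci_low, ci_high = stats.t.interval(
    0.95, len(sample) - 1,
    loc=np.mean(sample),
    scale=stats.sem(sample)
)
print(f"95% CI for the mean: ({ci_low:.4f}, {ci_high:.4f})")

# Cohen's d: illustrative helper function, not a SciPy function
def cohens_d(a, b):
    pooled_sd = np.sqrt((np.var(a, ddof=1) + np.var(b, ddof=1)) / 2)
    return (np.mean(a) - np.mean(b)) / pooled_sd

other = np.random.normal(0.5, 1, 100)
print(f"Cohen's d: {cohens_d(sample, other):.4f}")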
