Python Statistics Module Tutorial

Statistics is the branch of applied mathematics concerned with collecting, organizing, analyzing, and interpreting data in order to reach conclusions. Statistical techniques help us make sense of large or complex datasets. In this article, we will explore how to solve common statistical problems in Python using the statistics module from the standard library.

The statistics module in Python provides an array of functions designed to compute and analyze the statistics of numerical data. This module includes a variety of functions, such as mean, median, mode, standard deviation, among others.

Descriptive statistics, such as the mean, summarize a dataset, and the results are often presented in tables, charts, or spreadsheets so that the important patterns stand out. Describing and summarizing a single variable is called univariate analysis. Examining the relationship between two variables is called bivariate analysis, and working with more than two variables is called multivariate analysis.

Methods of Descriptive Statistics

Descriptive statistics can be analyzed through two distinct approaches, which allow us to interpret the data for particular objectives.

  • Measure of Central Tendency
  • Measure of Variability

Measure of Central Tendency

A measure of central tendency is a single value that summarizes an entire dataset. There are three common measures of central tendency.

  • Mean
  • Median
  • Mode

Mean

The mean is the sum of all observations in the dataset divided by the number of observations.

The formula for the mean is:

Mean = (x1 + x2 + ... + xn) / n

In Python, the mean function returns the arithmetic average of the data passed as its argument. If the data is empty, it raises a StatisticsError.

Python Mean Statistics Module Example

Let's look at a practical example of the mean function from the statistics module.

Example

import statistics as st

data = [1, 2, 3, 4, 5]

print("The mean of data is: ", end="")

print(st.mean(data))

Output:

The mean of data is: 3

Explanation:

In the code presented above, the statistics module is imported under the alias st. A data list is then defined as data, and the average value of this data set is calculated and displayed by utilizing the mean function that is part of the statistics module.
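As a complement to the example above, the following minimal sketch compares the mean computed directly from the formula with the result of st.mean, and shows the empty-data case. The names manual_mean and empty_raised are illustrative only.

```python
import statistics as st

data = [1, 2, 3, 4, 5]

# Mean by the formula: sum of the values divided by their count.
manual_mean = sum(data) / len(data)
print("Manual mean:", manual_mean)
print("statistics.mean:", st.mean(data))

# An empty dataset has no mean, so st.mean raises StatisticsError.
try:
    st.mean([])
    empty_raised = False
except st.StatisticsError:
    empty_raised = True
print("Empty data raised StatisticsError:", empty_raised)
```

Catching StatisticsError explicitly is one way to handle datasets that may turn out to be empty at run time.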

Median

The median represents the central value within a dataset, effectively dividing the data into two equal halves. When the total count of values is an odd number, the median is identified as the middle value. Conversely, if the count is even, the median is calculated as the average of the two values that occupy the central positions.

To find the median, first sort the data and then pick the middle value, or average the two middle values when the count is even. The median function sorts the data internally, so the input does not need to be sorted beforehand. If the data is empty, median raises a StatisticsError.
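The sort-then-pick-the-middle procedure can be sketched by hand as follows. The helper manual_median is hypothetical and exists only to illustrate the steps; in practice statistics.median does the same job.

```python
from statistics import median


def manual_median(values):
    """Hypothetical helper: sort, then pick or average the middle value(s)."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("manual_median requires at least one value")
    mid = n // 2
    if n % 2 == 1:
        # Odd count: the single middle value is the median.
        return ordered[mid]
    # Even count: average the two values around the middle.
    return (ordered[mid - 1] + ordered[mid]) / 2


odd_data = [10, 20, 15, 30, 25]
even_data = [7, 3, 5, 9, -2, -7, -3, 2]

print(manual_median(odd_data), median(odd_data))
print(manual_median(even_data), median(even_data))
```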

Python Median Statistics Module Example

Let us consider an example that illustrates the median function from the statistics module.

Example

from statistics import median

from fractions import Fraction as fr

scores1 = (10, 20, 15, 30, 25)

scores2 = (4.2, 6.3, 8.9, 12.5, 9.1)

scores3 = (fr(3, 4), fr(5, 2), fr(7, 3), fr(5, 6), fr(2, 5))

scores4 = (-8, -13, -5, -17, -25)

scores5 = (7, 3, 5, 9, -2, -7, -3, 2)

print("Median of scores1 is %s" % (median(scores1)))

print("Median of scores2 is %s" % (median(scores2)))

print("Median of scores3 is %s" % (median(scores3)))

print("Median of scores4 is %s" % (median(scores4)))

print("Median of scores5 is %s" % (median(scores5)))

Output:

Median of scores1 is 20

Median of scores2 is 8.9

Median of scores3 is 5/6

Median of scores4 is -13

Median of scores5 is 2.5

Explanation:

In the code presented above, we first define several tuples of scores (integers, floats, fractions, and negative numbers) and then compute the median of each with the median function.

Median Low:

The median_low function returns the middle value when the number of elements in the dataset is odd. If the number of values is even, it returns the lower of the two middle values. Below is an example of Python code showcasing the median_low function from the statistics module.

Example

import statistics

ages = [18, 21, 23, 24, 27, 30]

print("Median of ages is %s" % (statistics.median(ages)))

print("Low Median of ages is %s" % (statistics.median_low(ages)))

Output:

Median of ages is 23.5

Low Median of ages is 23

Explanation:

The code presented above computes both the median and the low median of the ages for a dataset with an even number of values. The median function returns the average of the two middle values, 23.5, while the median_low function returns the lower of the two middle values, 23.

Median High:

The median_high function returns the middle value of a dataset when the number of values is odd. If the dataset contains an even number of values, it returns the greater of the two middle values. Below is Python code that uses the median_high function.

Example

import statistics

weights = [56, 60, 60, 62, 65, 68]

print("Median of weights is %s" % (statistics.median(weights)))

print("High Median of weights is %s" % (statistics.median_high(weights)))

Output:

Median of weights is 61.0

High Median of weights is 62

Explanation:

In the code provided, the median function averages the two middle values, 60 and 62, to give 61.0. The median_high function instead returns the greater of the two middle values, 62.

Mode

The mode is a measure of central tendency that indicates the value appearing most frequently in a dataset. If every value occurs with the same frequency, the dataset has no meaningful single mode, and if two or more values share the highest frequency, the dataset has multiple modes. In the Python statistics module, the mode function returns the most frequent value. Since Python 3.8, when several values tie, mode returns the first one encountered, and the multimode function returns all of them. The mode can be applied to both categorical and numerical data.

Python Mode Statistics Module Example

Let’s consider an example that illustrates the mode function from the statistics module.

Example

from statistics import mode

from fractions import Fraction as fr

scores_a = (8, 7, 8, 9, 8, 6, 7, 8)

scores_b = (3.2, 4.1, 5.6, 3.2, 6.8, 3.2)

scores_c = (fr(3, 5), fr(7, 10), fr(3, 5), fr(2, 5))

scores_d = (-10, -5, -5, -7, -10, -10, -5)

scores_e = ("apple", "banana", "apple", "cherry", "banana", "apple")

print("Mode of scores_a is %s" % (mode(scores_a)))

print("Mode of scores_b is %s" % (mode(scores_b)))

print("Mode of scores_c is %s" % (mode(scores_c)))

print("Mode of scores_d is %s" % (mode(scores_d)))

print("Mode of scores_e is %s" % (mode(scores_e)))

Output:

Mode of scores_a is 8

Mode of scores_b is 3.2

Mode of scores_c is 3/5

Mode of scores_d is -10

Mode of scores_e is apple

Explanation:

In the code presented above, we define five tuples containing integers, floats, fractions, negative numbers, and strings (categorical data). We then determine the mode of each tuple with the mode function from the statistics module. In scores_d, both -10 and -5 occur three times, and mode returns -10 because it is encountered first.
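When a dataset has more than one most-frequent value, the multimode function from the same module can be used to list every one of them. The short sketch below assumes Python 3.8 or later and reuses the tied data from scores_d under the illustrative name scores.

```python
from statistics import mode, multimode

# -10 and -5 both occur three times, so the data has two modes.
scores = (-10, -5, -5, -7, -10, -10, -5)

# mode returns only the first of the tied values it encounters.
print("mode:", mode(scores))

# multimode returns every value that shares the highest frequency.
print("multimode:", multimode(scores))

# For empty input, multimode returns an empty list instead of raising.
print("multimode of empty data:", multimode([]))
```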

Variability Measure

Central tendency alone does not fully describe a dataset. We also need a measure of variability, also called the spread of the data, which describes how widely the values are dispersed. Common measures of variability include:

  • Range
  • Variance
  • Standard deviation

Range

The range represents the difference between the highest and lowest values within a dataset. It is directly related to the dispersion of the data, indicating that an increase in the data's dispersion results in a larger range.

The formula for range is:

Range = maximum value - minimum value

The functions max and min can be utilized to compute the highest and lowest values, respectively.

Example

values = [10, 15, 22, 27, 33]

largest = max(values)

smallest = min(values)

spread = largest - smallest

print("Maximum = {}, Minimum = {} and Range = {}".format(largest, smallest, spread))

Output:

Maximum = 33, Minimum = 10 and Range = 23

Explanation:

In the code presented above, a list of values is defined, the maximum and minimum are found with the built-in max and min functions, and the range is computed as their difference.

Variance

Variance is the average of the squared deviations from the mean. To compute it, subtract the mean from each data point, square each deviation, sum the squares, and divide the total by the number of data points:

Variance = Σ (x_i - μ)² / N

Here, N represents the total count of values within the dataset, while μ denotes the mean. This is the population variance, which the statistics module exposes as pvariance. The variance function used in the example below computes the sample variance instead, dividing by N - 1 rather than N.

Example

from statistics import variance

from fractions import Fraction as fr

group1 = (7, 8, 9, 11, 10, 15, 12)

group2 = (-7, -5, -3, -8, -2, -9)

group3 = (14, -2, 0, 6, -7, 3, 9, -4)

group4 = (fr(2, 5), fr(4, 5), fr(6, 5), fr(8, 5), fr(10, 5))

group5 = (2.7, 3.4, 3.8, 4.2, 2.9)

print("Variance of group1 is %s " % (variance(group1)))

print("Variance of group2 is %s " % (variance(group2)))

print("Variance of group3 is %s " % (variance(group3)))

print("Variance of group4 is %s " % (variance(group4)))

print("Variance of group5 is %s " % (variance(group5)))

Output:

Variance of group1 is 7.238095238095238

Variance of group2 is 7.866666666666666

Variance of group3 is 49.410714285714285

Variance of group4 is 2/5

Variance of group5 is 0.385

Explanation:

In the preceding code snippet, we begin by importing the variance function from the statistics module. Next, we create five groups of data and compute the sample variance of each group.
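To make the difference between dividing by N and dividing by N - 1 concrete, here is a minimal sketch that computes both variants by hand for group1 and compares them with pvariance and variance. The variable names are illustrative.

```python
from statistics import mean, pvariance, variance

group = (7, 8, 9, 11, 10, 15, 12)

# Sum of squared deviations from the mean.
mu = mean(group)
squared_deviations = sum((x - mu) ** 2 for x in group)

# Population variance divides by N; sample variance divides by N - 1.
population_var = squared_deviations / len(group)
sample_var = squared_deviations / (len(group) - 1)

print("Population variance:", population_var, "vs pvariance:", pvariance(group))
print("Sample variance:", sample_var, "vs variance:", variance(group))
```

The sample variance is always slightly larger than the population variance for the same data, because the divisor is smaller.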

Standard Deviation

The standard deviation is the square root of the variance.

In Python, the statistics module includes the stdev function, which calculates the sample standard deviation of a given dataset. The pstdev function calculates the population standard deviation.

Example

from statistics import stdev

from fractions import Fraction as fr

scores1 = (11, 13, 15, 12, 14, 18, 17)

scores2 = (-12, -14, -11, -17, -15, -16)

scores3 = (5, -8, 0, 10, 3, -4, 7, -6)

scores4 = (3.5, 4.1, 5.9, 4.8, 5.2)

print("The Standard Deviation of scores1 is %s" % (stdev(scores1)))

print("The Standard Deviation of scores2 is %s" % (stdev(scores2)))

print("The Standard Deviation of scores3 is %s" % (stdev(scores3)))

print("The Standard Deviation of scores4 is %s" % (stdev(scores4)))

Output:

The Standard Deviation of scores1 is 2.563479777846623

The Standard Deviation of scores2 is 2.3166067138525404

The Standard Deviation of scores3 is 6.468329437674438

The Standard Deviation of scores4 is 0.9354143466934856

Explanation:

The stdev function is imported from the statistics module. Four tuples of scores are defined, and the standard deviation of each is computed and printed.
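The relationship between standard deviation and variance can be checked directly. The sketch below, using the scores4 data under the illustrative name scores, takes the square root of the sample variance and compares it with stdev, and also shows pstdev for the population case.

```python
import math
from statistics import pstdev, stdev, variance

scores = (3.5, 4.1, 5.9, 4.8, 5.2)

# Sample standard deviation, computed two ways.
sample_sd = stdev(scores)
from_variance = math.sqrt(variance(scores))

# Population standard deviation (divides by N instead of N - 1).
population_sd = pstdev(scores)

print("stdev:", sample_sd)
print("sqrt(variance):", from_variance)
print("pstdev:", population_sd)
```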

Harmonic Mean

The harmonic mean is the reciprocal of the arithmetic mean of the reciprocals of the data. As an illustration, the harmonic mean of three values x, y, and z is 3/(1/x + 1/y + 1/z). If any of the values is zero, harmonic_mean returns zero, and if any value is negative it raises a StatisticsError.

Example

from statistics import harmonic_mean

numbers1 = [10, 20, 25, 40]

numbers2 = [2.5, 3.7, 4.1, 5.2, 6.8]

print("Harmonic mean of numbers1 is", harmonic_mean(numbers1))

print("Harmonic mean of numbers2 is", harmonic_mean(numbers2))

Output:

Harmonic mean of numbers1 is 18.604651162790695

Harmonic mean of numbers2 is 3.9887064558944534
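The formula n / (1/x1 + 1/x2 + ... + 1/xn) can be written out by hand and compared with harmonic_mean. The helper manual_harmonic_mean below is hypothetical and assumes strictly positive inputs.

```python
from statistics import harmonic_mean


def manual_harmonic_mean(values):
    """Hypothetical helper: count divided by the sum of reciprocals.

    Assumes every value is strictly positive.
    """
    return len(values) / sum(1 / v for v in values)


numbers = [10, 20, 25, 40]

manual_result = manual_harmonic_mean(numbers)
library_result = harmonic_mean(numbers)
print("Manual:", manual_result)
print("statistics.harmonic_mean:", library_result)

# A zero anywhere in the data makes the library result zero.
with_zero = harmonic_mean([10, 0, 25])
print("With a zero value:", with_zero)
```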

Geometric Mean

The geometric mean is another measure of central tendency, computed from the product of the values: for n values, it is the n-th root of their product. This contrasts with the arithmetic mean, which is based on the sum of the values.

Example

from statistics import geometric_mean

data1 = [4, 9, 16, 25]

data2 = [1.2, 3.8, 2.4, 5.6, 7.9]

print("Geometric mean of data1 is", geometric_mean(data1))

print("Geometric mean of data2 is", geometric_mean(data2))

Output:

Geometric mean of data1 is 10.95445115010332

Geometric mean of data2 is 3.443485356947653
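As a rough check of the definition, the geometric mean of data1 can be computed as the n-th root of the product, or equivalently by averaging logarithms. This is a minimal sketch assuming Python 3.8 or later (for math.prod); the names manual_gm and log_gm are illustrative.

```python
import math
from statistics import geometric_mean

data = [4, 9, 16, 25]

# n-th root of the product of the n values.
manual_gm = math.prod(data) ** (1 / len(data))

# Equivalent form: exponential of the mean of the logarithms.
log_gm = math.exp(sum(math.log(v) for v in data) / len(data))

print("n-th root of product:", manual_gm)
print("exp of mean log:", log_gm)
print("statistics.geometric_mean:", geometric_mean(data))
```

The logarithmic form is often preferred for long datasets, since the raw product can overflow or underflow.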

Kernel Density Estimation (KDE)

KDE, or Kernel Density Estimation, applied to discrete datasets, yields a continuous probability distribution that smooths the data using a kernel function. This technique is instrumental in making inferences regarding the population based on a sample of data. The parameter 'h', known as the bandwidth, regulates the extent of smoothing. Lower values of 'h' tend to highlight local characteristics, whereas higher values result in a more uniform and smooth output.

In KDE, the kernel determines the relative weight given to each sample value. The choice of kernel shape is generally less important than the choice of the bandwidth h, which has a much larger effect on the result.

Example

import numpy as np

import matplotlib.pyplot as plt

from scipy.stats import gaussian_kde

# Draw 1000 samples from a standard normal distribution

data = np.random.normal(0, 1, 1000)

kde = gaussian_kde(data)

x = np.linspace(-4, 4, 100)

plt.plot(x, kde(x))

plt.show()

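Output:

(The script opens a plot window showing the estimated density curve; the figure is not reproduced here.)

Explanation:

In the code provided above, the gaussian_kde class from the SciPy library computes the kernel density estimate for a sample drawn from a standard normal distribution, and Matplotlib plots the estimate over the range -4 to 4. The statistics module itself provides a kde function starting with Python 3.13.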
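To see how the bandwidth h controls smoothing without any third-party library, the following sketch evaluates a Gaussian kernel density estimate by hand. The helper gaussian_kde_at and the small sample are illustrative assumptions, not part of SciPy or the statistics module.

```python
import math


def gaussian_kde_at(x, sample, h):
    """Hypothetical helper: KDE at point x with a normal kernel and bandwidth h."""
    if h <= 0:
        raise ValueError("bandwidth h must be positive")
    total = 0.0
    for xi in sample:
        # Scaled distance between the evaluation point and the sample value.
        u = (x - xi) / h
        # Standard normal density evaluated at u.
        total += math.exp(-0.5 * u * u) / math.sqrt(2 * math.pi)
    # Average the kernel contributions and rescale by the bandwidth.
    return total / (len(sample) * h)


sample = [-2.1, -1.3, -0.4, 1.9, 5.1, 6.2]

# Smaller h follows individual points closely; larger h smooths them out.
for h in (0.5, 1.5, 4.0):
    density = gaussian_kde_at(0.0, sample, h)
    print(f"h={h}: estimated density at x=0 is {density:.4f}")
```

Each sample value contributes one bell-shaped bump of width h, and the estimate at x is the average of those bumps.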

Quantiles:

The quantiles function divides the data into n continuous intervals with equal probability and returns a list of the n - 1 cut points that separate them. With n set to 4 (the default) it returns quartiles, with n set to 10 it returns deciles, and with n set to 100 it returns percentiles. The function accepts any iterable of sample data.

Example

import statistics

data = [15, 18, 21, 24, 29, 31, 33, 39, 40, 43, 47, 51, 53, 55, 60]

# Quartiles (4 equal groups)

quartiles = statistics.quantiles(data, n=4, method='inclusive')  # 'inclusive' matches Excel/Pandas style

print("Quartiles:", quartiles)

# Deciles (10 equal groups)

deciles = statistics.quantiles(data, n=10, method='inclusive')

print("Deciles:", deciles)

# Percentiles (100 equal groups-show a few for brevity)

percentiles = statistics.quantiles(data, n=100, method='inclusive')

print("25th Percentile:", percentiles[24])

print("50th Percentile (Median):", percentiles[49])

print("75th Percentile:", percentiles[74])

Output:

Quartiles: [26.5, 39.0, 49.0]

Deciles: [19.2, 23.4, 29.4, 32.2, 39.0, 41.2, 46.2, 51.4, 54.2]

25th Percentile: 26.5

50th Percentile (Median): 39.0

75th Percentile: 49.0

Explanation:

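In the code provided, the parameter n sets the number of equal-probability groups: 4 for quartiles, 10 for deciles, and 100 for percentiles. Each call returns n - 1 cut points, which is why the 25th percentile is found at index 24 of the percentiles list.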
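The method argument matters as well. As a small sketch using the same data, the default 'exclusive' method treats the data as a sample from a larger population, while 'inclusive' treats the minimum and maximum of the data as the ends of the distribution, so the outer quartiles differ.

```python
import statistics

data = [15, 18, 21, 24, 29, 31, 33, 39, 40, 43, 47, 51, 53, 55, 60]

# Default method is 'exclusive'.
exclusive = statistics.quantiles(data, n=4)

# 'inclusive' assumes the data already spans the full range of values.
inclusive = statistics.quantiles(data, n=4, method="inclusive")

print("Exclusive quartiles:", exclusive)
print("Inclusive quartiles:", inclusive)
print("Number of cut points:", len(inclusive))
```

For this dataset both methods agree on the middle cut point, which is the median.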

Covariance

The covariance function (added in Python 3.10) calculates the sample covariance of two inputs, x and y, which must be sequences of the same length. Covariance measures the degree to which the two variables vary together.

Example

import statistics

# Sample datasets

x = [4, 8, 15, 16, 23, 42]

y = [7, 6, 9, 12, 15, 20]

# Calculate the sample covariance

cov = statistics.covariance(x, y)

print("Covariance between x and y:", cov)

Output:

Covariance between x and y: 69.2
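The value above can be reproduced by hand: multiply the paired deviations from each mean, sum the products, and divide by one less than the number of pairs. The sketch below uses only statistics.mean, so it also runs on Python versions that lack statistics.covariance; the variable names are illustrative.

```python
from statistics import mean

x = [4, 8, 15, 16, 23, 42]
y = [7, 6, 9, 12, 15, 20]

x_bar = mean(x)
y_bar = mean(y)

# Sum of the products of paired deviations from the means.
cross_products = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

# Sample covariance divides by (number of pairs - 1).
sample_cov = cross_products / (len(x) - 1)

print("Manual sample covariance:", sample_cov)
```

A positive result indicates that larger x values tend to be paired with larger y values.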

Correlation

The correlation function in Python's statistics module (added in Python 3.10) computes the Pearson correlation coefficient for two inputs of equal length. The coefficient ranges from -1 to +1 and indicates both the strength and the direction of a linear association. A value of -1 signifies a perfect negative linear relationship, +1 a perfect positive linear relationship, and 0 no linear relationship.

Example

import statistics

# Sample data

x = [12, 15, 17, 19, 22, 24]

y = [30, 33, 36, 40, 45, 47]

# Compute Pearson correlation coefficient

corr = statistics.correlation(x, y)

print("Correlation between x and y:", corr)

Output:

Correlation between x and y: 0.9947213949442446

Explanation:

In the code provided above, we declare two input variables, x and y, and calculate the Pearson correlation coefficient for these variables by utilizing the statistics module.
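Pearson's coefficient is the sum of paired deviation products divided by the square root of the product of the two sums of squared deviations. The hypothetical helper pearson_r below is a minimal sketch of that formula, assuming equal-length inputs that are not constant.

```python
import math
from statistics import mean


def pearson_r(xs, ys):
    """Hypothetical helper: Pearson correlation from deviation sums."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("inputs must have the same length of at least 2")
    x_bar = mean(xs)
    y_bar = mean(ys)
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(xs, ys))
    sxx = sum((a - x_bar) ** 2 for a in xs)
    syy = sum((b - y_bar) ** 2 for b in ys)
    return sxy / math.sqrt(sxx * syy)


x = [12, 15, 17, 19, 22, 24]
y = [30, 33, 36, 40, 45, 47]

r = pearson_r(x, y)
print("Manual Pearson correlation:", r)

# A perfectly linear, decreasing relationship gives a coefficient near -1.
r_negative = pearson_r(x, [100 - 2 * value for value in x])
print("Perfect negative relationship:", r_negative)
```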
