How to get descriptive statistics of your Pandas DataFrame in Python
Pandas is a popular Python library for data analysis and manipulation. Its core data structure, the DataFrame, is a two-dimensional labeled table whose columns can hold different types. In this tutorial, we will learn how to get descriptive statistics of a Pandas DataFrame in Python.
To get started, we need to have a Pandas DataFrame. If you don’t have one, you can create one using the following code:
import pandas as pd
data = {'name': ['John', 'Jane', 'Jim', 'Joan'],
        'age': [32, 28, 45, 33],
        'income': [50000, 60000, 55000, 63000]}
df = pd.DataFrame(data)
The above code creates a Pandas DataFrame df from a dictionary data. The keys of the dictionary become the column names, and each list of values becomes the contents of the corresponding column.
Now that we have a Pandas DataFrame, let’s start with some basic descriptive statistics.
1. Shape of the DataFrame
The shape of a DataFrame is the number of rows and columns it has. You can get the shape of a DataFrame using the shape attribute:
print(df.shape)
The output of the above code will be:
(4, 3)
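Because shape is an ordinary (rows, columns) tuple, it can be unpacked into two separate counts. The following is a minimal sketch that rebuilds the sample data from above; the names n_rows and n_cols are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['John', 'Jane', 'Jim', 'Joan'],
    'age': [32, 28, 45, 33],
    'income': [50000, 60000, 55000, 63000],
})

# shape is a (rows, columns) tuple, so it can be unpacked directly
n_rows, n_cols = df.shape
print(f"{n_rows} rows, {n_cols} columns")

# len(df) is a shortcut for the number of rows only
print(len(df))
```

If you only need the number of rows, len(df) is usually the shorter option.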
2. Columns of the DataFrame
You can get the columns of a DataFrame using the columns attribute:
print(df.columns)
The output of the above code will be:
Index(['name', 'age', 'income'], dtype='object')
3. Data Types of the Columns
You can get the data types of the columns of a DataFrame using the dtypes attribute:
print(df.dtypes)
The output of the above code will be:
name      object
age        int64
income     int64
dtype: object
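Knowing the data types is useful because most descriptive statistics only make sense for numeric columns. As a small sketch, select_dtypes can split the sample DataFrame by type; the variable names numeric_df and text_df are assumptions made for this example.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['John', 'Jane', 'Jim', 'Joan'],
    'age': [32, 28, 45, 33],
    'income': [50000, 60000, 55000, 63000],
})

# Keep only the numeric columns (integer and float dtypes)
numeric_df = df.select_dtypes(include='number')
print(numeric_df.columns.tolist())

# Keep only the non-numeric columns
text_df = df.select_dtypes(exclude='number')
print(text_df.columns.tolist())
```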
4. Summary Statistics
You can get summary statistics of the columns of a DataFrame using the describe method:
print(df.describe())
The output of the above code will be:
             age        income
count   4.000000      4.000000
mean   34.500000  57000.000000
std     7.325754   5715.476066
min    28.000000  50000.000000
25%    31.000000  53750.000000
50%    32.500000  57500.000000
75%    36.000000  60750.000000
max    45.000000  63000.000000
The describe method provides summary statistics for numerical columns of the DataFrame. By default, it returns the count, mean, standard deviation, minimum, first quartile, median, third quartile, and maximum of each numerical column.
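The result of describe is itself a DataFrame, so individual statistics can be looked up by row and column label. The sketch below assumes the sample data from above; the names summary and custom are only illustrative, and the percentiles argument is shown as one possible way to request different percentile rows.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['John', 'Jane', 'Jim', 'Joan'],
    'age': [32, 28, 45, 33],
    'income': [50000, 60000, 55000, 63000],
})

summary = df.describe()

# Rows are labelled by statistic, columns by the original column name
print(summary.loc['mean', 'age'])      # mean age, 34.5 for this data
print(summary.loc['50%', 'income'])    # median income, 57500.0 for this data

# The percentiles argument requests other percentile rows, here 10% and 90%
custom = df.describe(percentiles=[0.1, 0.9])
print(custom)
```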
5. Count
You can get the count of the non-missing values of a column using the count method:
print(df['age'].count())
The output of the above code will be:
4
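The sample data has no missing values, so the count equals the number of rows. To see the difference, here is a minimal sketch with a hypothetical DataFrame, df_missing, in which one age is missing.

```python
import numpy as np
import pandas as pd

# Hypothetical data: Jane's age is missing
df_missing = pd.DataFrame({
    'name': ['John', 'Jane', 'Jim', 'Joan'],
    'age': [32, np.nan, 45, 33],
})

print(len(df_missing))                  # total number of rows: 4
print(df_missing['age'].count())        # non-missing ages: 3
print(df_missing['age'].isna().sum())   # missing ages: 1
```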
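Another way to get descriptive statistics of a Pandas DataFrame is to call the mean(), median(), mode(), min(), max(), sum(), count(), and std() methods directly. They return the mean, median, mode, minimum value, maximum value, sum, count, and standard deviation of each column, respectively. Because statistics such as the mean, median, and standard deviation are defined only for numbers, pass numeric_only=True when the DataFrame also contains a text column such as name; pandas 2.0 and later raise a TypeError otherwise.
Here’s an example:
print(df.mean(numeric_only=True))
print(df.median(numeric_only=True))
print(df.mode().iloc[0])  # first mode of each column
print(df.min(numeric_only=True))
print(df.max(numeric_only=True))
print(df.sum(numeric_only=True))
print(df.count())
print(df.std(numeric_only=True))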
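Several of these statistics can also be collected into a single table with the agg method. This is a sketch under the assumption that only the age and income columns are of interest; the name stats is illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['John', 'Jane', 'Jim', 'Joan'],
    'age': [32, 28, 45, 33],
    'income': [50000, 60000, 55000, 63000],
})

# Select the numeric columns first, then apply several statistics in one call
stats = df[['age', 'income']].agg(['mean', 'median', 'min', 'max', 'std'])
print(stats)

# Each statistic is a row, so single values can be looked up by label
print(stats.loc['median', 'income'])
```

Selecting the numeric columns explicitly avoids having to pass numeric_only=True to each method.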
Here is another example.
First, let’s start by creating a sample DataFrame:
import pandas as pd
import numpy as np
# creating a sample dataframe
data = {'name': ['John', 'Jane', 'Jim', 'Joan', 'Jake'],
'age': [30, 29, 31, 32, 33],
'score': [80, 75, 90, 85, 95]}
df = pd.DataFrame(data)
print(df)
Output:
   name  age  score
0  John   30     80
1  Jane   29     75
2   Jim   31     90
3  Joan   32     85
4  Jake   33     95
With the sample dataframe, we can now start calculating descriptive statistics:
df.describe(): This method gives a summary of the central tendency, dispersion, and shape of the distribution of each column, excluding NaN values. By default, only the numerical columns are included, but you can include all columns by passing include='all' as an argument.
# descriptive statistics of numerical columns
print(df.describe())
Output:
             age      score
count   5.000000   5.000000
mean   31.000000  85.000000
std     1.581139   7.905694
min    29.000000  75.000000
25%    30.000000  80.000000
50%    31.000000  85.000000
75%    32.000000  90.000000
max    33.000000  95.000000
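To see what include='all' adds, the sketch below runs describe on the same five-row sample. The name full is an assumption for this example. For a text column such as name, pandas reports count, unique, top, and freq, while the numeric-only rows such as mean are left as NaN.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['John', 'Jane', 'Jim', 'Joan', 'Jake'],
    'age': [30, 29, 31, 32, 33],
    'score': [80, 75, 90, 85, 95],
})

# Summarize every column, not just the numeric ones
full = df.describe(include='all')
print(full)

# The text column has a count and a number of unique values, but no mean
print(full.loc['unique', 'name'])
print(full.loc['mean', 'name'])
```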
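df.mean(): This method returns the mean of each numeric column. Pass numeric_only=True so that the text column name is skipped; pandas 2.0 and later raise a TypeError otherwise.
# mean of the numeric columns
print(df.mean(numeric_only=True))
Output:
age      31.0
score    85.0
dtype: float64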
df.median(): This method returns the median of each numeric column. As with the mean, pass numeric_only=True to skip the text column name.
# median of the numeric columns
print(df.median(numeric_only=True))
Output:
age      31.0
score    85.0
dtype: float64
df.mode(): This method returns the mode of each column. If there are multiple values with the highest frequency, all of them are returned in sorted order. In this sample every value occurs exactly once, so every value is a mode and each column simply comes back sorted.
# mode of all columns
print(df.mode())
Output:
   name  age  score
0  Jake   29     75
1  Jane   30     80
2   Jim   31     85
3  Joan   32     90
4  John   33     95
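The mode is easier to interpret when some values actually repeat. Here is a small sketch with hypothetical data (the names scores and df_ties are made up for this example) that shows a tie between two modes and how a DataFrame pads columns that have fewer modes.

```python
import pandas as pd

# Hypothetical scores in which 80 and 90 both appear twice
scores = pd.Series([80, 90, 75, 80, 90], name='score')
modes = scores.mode()
print(modes.tolist())   # both tied values are returned: [80, 90]

# In a DataFrame, columns with fewer modes are padded with NaN
df_ties = pd.DataFrame({'age': [30, 30, 31], 'score': [80, 90, 75]})
result = df_ties.mode()
print(result)

# Taking the first row gives one mode per column
print(result.iloc[0])
```

This padding is the reason the first row is often selected with iloc[0] when a single mode per column is enough.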
df.std(): This method returns the sample standard deviation of each numeric column. Pass numeric_only=True to skip the text column name.
# standard deviation of the numeric columns
print(df.std(numeric_only=True))
Output:
age      1.581139
score    7.905694
dtype: float64
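Note that pandas computes the sample standard deviation (dividing by n - 1), whereas NumPy's np.std divides by n by default, so the two can disagree on the same data. A minimal sketch using the score values from the sample follows; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

scores = pd.Series([80, 75, 90, 85, 95])

# pandas uses ddof=1 by default, which divides by n - 1 (sample standard deviation)
sample_std = scores.std()

# ddof=0 divides by n instead (population standard deviation)
population_std = scores.std(ddof=0)

# NumPy's default is ddof=0, so it matches the population value
numpy_std = np.std(scores.to_numpy())

print(sample_std, population_std, numpy_std)
```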
Summary in a nutshell:
Descriptive statistics are an important part of understanding your data. They give a quick overview of the central tendency, dispersion, and shape of a data set. In this article, we learned how to get descriptive statistics of a Pandas DataFrame in Python.
Pandas is a popular open-source library in Python for data analysis and manipulation. One of its most useful features is the ability to compute descriptive statistics on a DataFrame.
To get descriptive statistics of a Pandas DataFrame, we use the describe() method. This method calculates various summary statistics of the numerical data in the DataFrame. It returns a new DataFrame that includes the count, mean, standard deviation, minimum, 25th percentile, median (50th percentile), 75th percentile, and maximum of the data.
If you want to restrict the summary explicitly to the numerical columns in the DataFrame, you can use the describe() method with the include parameter set to [np.number]. If you want to get descriptive statistics of all columns, including non-numerical columns, you can set the include parameter to 'all'.
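The include parameter has a counterpart, exclude, that filters columns the other way. The sketch below uses the five-row sample data; the names numeric_summary and other_summary are assumptions made for this example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'name': ['John', 'Jane', 'Jim', 'Joan', 'Jake'],
    'age': [30, 29, 31, 32, 33],
    'score': [80, 75, 90, 85, 95],
})

# Summarize only the numeric columns
numeric_summary = df.describe(include=[np.number])
print(numeric_summary.columns.tolist())

# Summarize everything except the numeric columns
other_summary = df.describe(exclude=[np.number])
print(other_summary)
```

For a DataFrame that mixes numeric and text columns, include=[np.number] gives the same columns as calling describe() with no arguments.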
To get specific descriptive statistics, you can use other Pandas methods, such as mean(), median(), mode(), min(), max(), and sum().
It is important to note that descriptive statistics are only useful for understanding the basic features of your data. For more advanced analysis, you may need to use other techniques, such as inferential statistics or data visualization.
In conclusion, using the describe() method in Pandas is a convenient way to get a quick overview of the descriptive statistics of your data. This method provides a simple and efficient way to get important information about your data and is a great tool for understanding and exploring your data in Python.