Business Analytics for Beginners: How to get descriptive statistics of your Pandas DataFrame in Python


Pandas is a popular Python library for data analysis and manipulation. It provides a data structure called DataFrame, which is a two-dimensional labeled data structure with columns of potentially different types. In this tutorial, we will learn how to get descriptive statistics of a Pandas DataFrame in Python.

To get started, we need to have a Pandas DataFrame. If you don’t have one, you can create one using the following code:

import pandas as pd
data = {'name': ['John', 'Jane', 'Jim', 'Joan'],
        'age': [32, 28, 45, 33],
        'income': [50000, 60000, 55000, 63000]}
df = pd.DataFrame(data)

The above code creates a Pandas DataFrame df from a dictionary data. The keys of the dictionary become the column names, and each list of values becomes the contents of the corresponding column, one value per row.
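As a quick check of that mapping, the sketch below rebuilds the same DataFrame and compares its column labels and row count against the dictionary. The variable names columns_match and rows_match are only illustrative.

```python
import pandas as pd

data = {'name': ['John', 'Jane', 'Jim', 'Joan'],
        'age': [32, 28, 45, 33],
        'income': [50000, 60000, 55000, 63000]}
df = pd.DataFrame(data)

# Dictionary keys become the column labels, in insertion order.
columns_match = list(df.columns) == list(data.keys())

# Each list supplies one value per row, so the row count equals the list length.
rows_match = len(df) == len(data['name'])

print(columns_match, rows_match)  # True True
```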

Now that we have a Pandas DataFrame, let’s start with some basic descriptive statistics.

1. Shape of the DataFrame

The shape of a DataFrame is the number of rows and columns it has. You can get the shape of a DataFrame using the shape attribute:

print(df.shape)

For the DataFrame above, this prints (4, 3): four rows and three columns.

2. Columns of the DataFrame

You can get the columns of a DataFrame using the columns attribute:

print(df.columns)

This prints an Index object listing the three column labels: name, age, and income.

3. Data Types of the Columns

You can get the data types of the columns of a DataFrame using the dtypes attribute:

print(df.dtypes)

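For the DataFrame above, age and income are reported as int64. The name column holds text and is reported as object or as a string dtype, depending on the pandas version.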

4. Summary Statistics

You can get summary statistics of the columns of a DataFrame using the describe method:

print(df.describe())

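The result is a table with one column for each numeric column (age and income) and one row per statistic. For example, the mean age is 34.5 and the mean income is 57000.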

The describe method provides summary statistics for numerical columns of the DataFrame. By default, it returns the count, mean, standard deviation, minimum, first quartile, median, third quartile, and maximum of each numerical column.
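Because describe() returns a DataFrame, individual statistics can be read out of the result by label. The following is a minimal sketch using the sample data from above; the variable names stats, mean_age, and median_income are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({'name': ['John', 'Jane', 'Jim', 'Joan'],
                   'age': [32, 28, 45, 33],
                   'income': [50000, 60000, 55000, 63000]})

stats = df.describe()

# Rows are labelled by statistic, columns by the original column name.
mean_age = stats.loc['mean', 'age']
median_income = stats.loc['50%', 'income']  # the median is the 50% row

print(list(stats.index))
print(mean_age)
print(median_income)
```

Reading values this way avoids recomputing a statistic that describe() has already produced.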

5. Count

You can get the count of the non-missing values of a column using the count method:

print(df['age'].count())

For the DataFrame above, this prints 4, because none of the four ages is missing.
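The difference between count() and the number of rows only shows up when values are missing. A small sketch, using a hypothetical copy of the age column in which one value is unknown:

```python
import pandas as pd

# Hypothetical data: the third age is missing (None is stored as NaN).
ages = pd.Series([32, 28, None, 33], name='age')

n_rows = len(ages)         # counts every row, including the missing one
n_present = ages.count()   # counts only the non-missing values

print(n_rows, n_present)   # 4 3
```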

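Another way to get descriptive statistics of a Pandas DataFrame is to call the individual methods mean(), median(), mode(), min(), max(), sum(), count(), and std(). Each returns one statistic per column: the mean, median, mode, minimum value, maximum value, sum, count of non-missing values, and standard deviation, respectively. Because df contains the text column name, pass numeric_only=True to restrict the calculation to the numerical columns. In pandas 2.0 and later, mean(), median(), and std() raise a TypeError on a text column instead of silently skipping it, and sum() would concatenate the names.

Here’s an example:

print(df.mean(numeric_only=True))
print(df.median(numeric_only=True))
print(df.mode().iloc[0])  # mode() returns a table; take its first row
print(df.min(numeric_only=True))
print(df.max(numeric_only=True))
print(df.sum(numeric_only=True))
print(df.count())
print(df.std(numeric_only=True))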
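The name column holds text, so statistics such as the mean are only meaningful for age and income. One way to handle this, sketched below, is to select the numeric columns once with select_dtypes and compute the statistics on that selection. The variable name numeric is just a local name chosen for illustration.

```python
import pandas as pd

df = pd.DataFrame({'name': ['John', 'Jane', 'Jim', 'Joan'],
                   'age': [32, 28, 45, 33],
                   'income': [50000, 60000, 55000, 63000]})

# Keep only the numeric columns (age and income); name is text.
numeric = df.select_dtypes(include='number')

means = numeric.mean()
medians = numeric.median()
totals = numeric.sum()

print(means)
print(medians)
print(totals)
```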

Here is another example.

First, let’s start by creating a sample dataframe:

import pandas as pd
import numpy as np

# creating a sample dataframe
data = {'name': ['John', 'Jane', 'Jim', 'Joan', 'Jake'],
        'age': [30, 29, 31, 32, 33],
        'score': [80, 75, 90, 85, 95]}
df = pd.DataFrame(data)
print(df)

Output:

   name  age  score
0  John   30     80
1  Jane   29     75
2   Jim   31     90
3  Joan   32     85
4  Jake   33     95

With the sample dataframe, we can now start calculating descriptive statistics:

df.describe(): This function gives a summary of the central tendency, dispersion and shape of the distribution of the DataFrame, excluding NaN values. By default, only the numerical columns are returned, but you can include all columns by passing include='all' as a parameter.

# descriptive statistics of numerical columns
print(df.describe())

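Output:

             age      score
count   5.000000   5.000000
mean   31.000000  85.000000
std     1.581139   7.905694
min    29.000000  75.000000
25%    30.000000  80.000000
50%    31.000000  85.000000
75%    32.000000  90.000000
max    33.000000  95.000000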

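df.mean(): This method returns the mean of each column. Pass numeric_only=True so that the text column name is skipped; in pandas 2.0 and later, calling df.mean() on a DataFrame that contains a text column raises a TypeError.

# mean of the numeric columns
print(df.mean(numeric_only=True))

Output:

age      31.0
score    85.0
dtype: float64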

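df.median(): This method returns the median of each column. The median of text values is not defined, so pass numeric_only=True to leave out the name column; without it, pandas 2.0 and later raise an error.

# median of the numeric columns
print(df.median(numeric_only=True))

Output:

age      31.0
score    85.0
dtype: float64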

df.mode(): This function returns the mode of all the columns. If there are multiple values with the highest frequency, all of them are returned.

# mode of all columns
print(df.mode())

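Output (every value in the sample data occurs exactly once, so in each column all five values tie for the mode and are returned in sorted order):

   name  age  score
0  Jake   29     75
1  Jane   30     80
2   Jim   31     85
3  Joan   32     90
4  John   33     95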

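df.std(): This method returns the sample standard deviation of each column (it divides by n - 1 by default). Pass numeric_only=True so that only age and score are used; with the text column name included, pandas 2.0 and later raise an error.

# standard deviation of the numeric columns
print(df.std(numeric_only=True))

Output:

age      1.581139
score    7.905694
dtype: float64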

Summary in a nutshell:

Descriptive statistics give a quick overview of the central tendency, dispersion, and shape of a data set, which makes them an important first step in understanding your data. This article showed how to get them for a Pandas DataFrame in Python.

Pandas is a popular open-source library in Python for data analysis and manipulation. One of its most powerful features is the ability to perform descriptive statistics on a DataFrame.

To get descriptive statistics of a Pandas DataFrame, we use the describe() method. This method calculates various summary statistics of the numerical data in the DataFrame. It returns a new DataFrame that includes the count, mean, standard deviation, minimum, 25th percentile, median (50th percentile), 75th percentile, and maximum of the data.

When a DataFrame has numerical columns, describe() already limits its output to them by default. To make that choice explicit, set the include parameter to [np.number] (with NumPy imported as np). If you want descriptive statistics of all columns, including non-numerical ones, set the include parameter to 'all'.
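To make the effect of the include parameter concrete, here is a short sketch using the five-row sample DataFrame from the second example. The variable names numeric_summary and full_summary are only illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'name': ['John', 'Jane', 'Jim', 'Joan', 'Jake'],
                   'age': [30, 29, 31, 32, 33],
                   'score': [80, 75, 90, 85, 95]})

# Numeric columns only: for this DataFrame, the same columns as the default.
numeric_summary = df.describe(include=[np.number])

# All columns: adds 'unique', 'top' and 'freq' rows for the text column.
# Statistics that do not apply to a column are filled with NaN.
full_summary = df.describe(include='all')

print(numeric_summary)
print(full_summary)
```

With include='all', the name column reports how many distinct names there are, while rows such as mean and std are left empty for it.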

To get specific descriptive statistics, you can use other Pandas methods, such as mean(), median(), mode(), min(), max(), and sum().
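As a brief illustration of these methods on a single column, the sketch below uses a small hypothetical Series of scores (not part of the examples above) in which one value repeats, so that mode() has a single clear answer.

```python
import pandas as pd

# Hypothetical scores; 80 appears twice, every other value once.
scores = pd.Series([80, 75, 90, 80, 95], name='score')

mean_score = scores.mean()      # 84.0
median_score = scores.median()  # 80.0
modes = scores.mode()           # a Series, because several values can tie
lowest, highest, total = scores.min(), scores.max(), scores.sum()

print(mean_score, median_score)
print(list(modes))
print(lowest, highest, total)
```

Note that mode() returns a Series rather than a single number, which is why an index such as .iloc[0] is needed to pick one value from it.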

It is important to note that descriptive statistics are only useful for understanding the basic features of your data. For more advanced analysis, you may need to use other techniques, such as inferential statistics or data visualization.

In conclusion, using the describe() method in Pandas is a convenient way to get a quick overview of the descriptive statistics of your data. This method provides a simple and efficient way to get important information about your data and is a great tool for understanding and exploring your data in Python.

 
