Data Science Project for Beginners on Birth Rate Analysis with Python

Let’s take a look at the freely available data on births in the United States, provided by the Centers for Disease Control (CDC). This data can be found at births.csv

import pandas as pd
births = pd.read_csv("births.csv") print(births.head()) births['day'].fillna(0, inplace=True) births['day'] = births['day'].astype(int)

births[‘decade’] = 10 * (births[‘year’] // 10)

births.pivot_table('births', index='decade', columns='gender', aggfunc='sum')
print(births.head())

We immediately see that male births outnumber female births in every decade. To see this trend a bit more clearly, we can use the built-in plotting tools in Pandas to visualize the total number of births by year:

import matplotlib.pyplot as plt 
import seaborn as sns 
sns.set() 
birth_decade = births.pivot_table('births', index='decade', columns='gender', aggfunc='sum') 
birth_decade.plot() 
plt.ylabel("Total births per year") 
plt.show()

Further data exploration:

There are a few interesting features we can pull out of this dataset using the Pandas tools. We must start by cleaning the data a bit, removing outliers caused by mistyped dates or missing values. One easy way to remove these all at once is to cut outliers, we’ll do this via a robust sigma-clipping operation:

import numpy as np
quartiles = np.percentile(births['births'], [25, 50, 75])
mean = quartiles[1]
sigma = 0.74 * (quartiles[2] - quartiles[0])

This final line is a robust estimate of the sample mean, where the 0.74 comes from the interquartile range of a Gaussian distribution. With this we can use the query() method to filter out rows with births outside these values:

births = births.query(‘(births &gt; @mean – 5 * @sigma) &amp; (births &lt; @mean + 5 * @sigma)’) 

births.index = pd.to_datetime(10000 * births.year + 100 * births.month + births.day, format='%Y%m%d') 
births['day of week'] = births.index.dayofweek

Using this we can plot births by weekday for several decades:

births_day = births.pivot_table(‘births’, index=‘day of week’,

                                columns='decade', aggfunc='mean')
births_day.index = ['Mon', 'Tues', 'Wed', 'Thurs', 'Fri', 'Sat', 'Sun']
births_day.plot()
plt.ylabel("Average Births by Day")
plt.show()

Apparently births are slightly less common on weekends than on weekdays! Note that the 1990s and 2000s are missing because the CDC data contains only the month of birth starting in 1989.

Another interesting view is to plot the mean number of births by the day of the year. Let’s first group the data by month and day separately:

births_month = births.pivot_table(‘births’, [births.index.month, births.index.day])

print(births_month.head())
births_month.index = [pd.datetime(2012, month, day)
                      for (month, day) in births_month.index]
print(births_month.head())

Focusing on the month and day only, we now have a time series reflecting the average number of births by date of the year. From this, we can use the plot method to plot the data. It reveals some interesting trends:

fig, ax = plt.subplots(figsize=(12, 4))
births_month.plot(ax=ax)
plt.show()

Python Example for Beginners

Two Machine Learning Fields

There are two sides to machine learning:

Practical Machine Learning:This is about querying databases, cleaning data, writing scripts to transform data and gluing algorithm and libraries together and writing custom code to squeeze reliable answers from data to satisfy difficult and ill defined questions. It’s the mess of reality.
Theoretical Machine Learning: This is about math and abstraction and idealized scenarios and limits and beauty and informing what is possible. It is a whole lot neater and cleaner and removed from the mess of reality.

Data Science Resources: Data Science Recipes and Applied Machine Learning Recipes

Introduction to Applied Machine Learning & Data Science for Beginners, Business Analysts, Students, Researchers and Freelancers with Python & R Codes @ Western Australian Center for Applied Machine Learning & Data Science (WACAMLDS) !!!

Latest end-to-end Learn by Coding Recipes in Project-Based Learning:

Applied Statistics with R for Beginners and Business Professionals

Data Science and Machine Learning Projects in Python: Tabular Data Analytics

Data Science and Machine Learning Projects in R: Tabular Data Analytics

Python Machine Learning & Data Science Recipes: Learn by Coding

R Machine Learning & Data Science Recipes: Learn by Coding

Comparing Different Machine Learning Algorithms in Python for Classification (FREE)

Disclaimer: The information and code presented within this recipe/tutorial is only for educational and coaching purposes for beginners and developers. Anyone can practice and apply the recipe/tutorial presented here, but the reader is taking full responsibility for his/her actions. The author (content curator) of this recipe (code / program) has made every effort to ensure the accuracy of the information was correct at time of publication. The author (content curator) does not assume and hereby disclaims any liability to any party for any loss, damage, or disruption caused by errors or omissions, whether such errors or omissions result from accident, negligence, or any other cause. The information presented here could also be found in public knowledge domains.

M	T	W	T	F	S	S
1	2	3	4	5	6	7
8	9	10	11	12	13	14
15	16	17	18	19	20	21
22	23	24	25	26	27	28
29	30

Towards Advanced Analytics Specialist & Analytics Engineer

Beginner’s Data Science Project – Data Science Project for Beginners on Birth Rate Analysis with Python

Data Science Project for Beginners on Birth Rate Analysis with Python

Further data exploration:

Python Example for Beginners

Special 95% discount

2000+ Applied Machine Learning & Data Science Recipes

Portfolio Projects for Aspiring Data Scientists: Tabular Text & Image Data Analytics as well as Time Series Forecasting in Python & R

Two Machine Learning Fields

Related Posts

ETL vs. ELT: Navigating Data Integration Techniques in Data Warehousing

Unlocking the Power of Mixed Models in Statistical Analysis

Mastering Analysis of Covariance (ANCOVA): A Comprehensive Statistical Guide with Python and R Examples