Modeling and forecasting population in Bangladesh using ARIMA modelling approach in R

Using annual time series data on the total population of Bangladesh from 1960 to 2019, we model and forecast the total population over the next 20 years with the Box-Jenkins ARIMA technique. The article is organized as follows.

Contents

  • Introduction
  • Method and Modelling techniques
  • Data Source
  • R Codes, Stats and Interpretation
  • Model Diagnostics
  • Forecasts

 

Introduction

At the start of the 21st century, the world’s population was estimated at about 6.1 billion people. United Nations projections place the figure at more than 9.2 billion by 2050, reaching a maximum of 11 billion by 2200. Over 90% of that population will live in the developing world. Population growth is one of the main problems facing Bangladesh today. Rapid population growth over the past decades has frustrated development efforts in Bangladesh. The country is overpopulated, and growth in resources has not kept pace with growth in population.

The problem of population growth is basically not a problem of numbers but one of human welfare, as it affects the provision of welfare and development. The consequences of a rapidly growing population show up, on the one hand, in species extinction, deforestation, desertification, climate change and the destruction of natural ecosystems and, on the other hand, in unemployment, pressure on housing, traffic congestion, pollution, infrastructure security and strain on amenities. In Bangladesh, as in any other part of the world, population modeling and forecasting is important for policy dialogue, especially with regard to the future threat to natural resources, persistent unemployment and worsening poverty levels. This article models and forecasts the population of Bangladesh using the Box-Jenkins ARIMA technique.

 

Method and Modelling techniques

To better understand time series data and forecast future values, data scientists apply statistical models such as ARIMA. It is preferable to have 30 to 40 observations to build an ARIMA model for short-term forecasting.

Understanding Autoregressive Integrated Moving Average (ARIMA) model

An autoregressive integrated moving average (ARIMA) model is a form of regression analysis that gauges the strength of a dependent variable relative to its own lagged values as they change over time. If the series is non-stationary, the model predicts future values by examining the differences between successive values in the series instead of the actual values.

An ARIMA model can be understood by outlining each of its components as follows:

  • AutoRegression (AR): refers to a model that shows a changing variable that regresses on its own lagged, or prior, values.
  • Integrated (I): represents the differencing of raw observations to allow for the time series to become stationary, i.e., data values are replaced by the difference between the data values and the previous values.
  • Moving Average (MA): incorporates the dependency between an observation and a residual error from a moving average model applied to lagged observations.

 

ARIMA parameters

Each component of an ARIMA model is represented by a parameter with a standard notation: ARIMA(p, d, q), where integer values substitute for p, d, and q to indicate the type of ARIMA model used. The parameters are defined as:

  • p : the number of lag observations in the model; also known as the lag order. It can be visually determined through PACF (partial auto-correlation function) plot.
  • d : the number of times that the raw observations are differenced; also known as the degree of differencing. It is used to make the time series stationary.
  • q : the size of the moving average window; also known as the order of the moving average. It can be visually determined through ACF (auto-correlation function) plot.
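The ACF and PACF behind those plots can be computed by hand. The following is a minimal Python sketch using only the standard library; it is an illustration and not the article's R workflow. The function names `acf` and `pacf` are hypothetical, the PACF uses the Durbin-Levinson recursion, and the demo series is synthetic AR(1) data rather than the Bangladesh population.

```python
import random

def acf(series, max_lag):
    """Sample autocorrelations r_0 .. r_max_lag."""
    n = len(series)
    mean = sum(series) / n
    dev = [x - mean for x in series]
    c0 = sum(d * d for d in dev)
    return [sum(dev[t] * dev[t - k] for t in range(k, n)) / c0
            for k in range(max_lag + 1)]

def pacf(series, max_lag):
    """Partial autocorrelations via the Durbin-Levinson recursion."""
    r = acf(series, max_lag)
    out = [1.0]
    phi_prev = []
    for k in range(1, max_lag + 1):
        if k == 1:
            phi_kk = r[1]
            phi = [phi_kk]
        else:
            num = r[k] - sum(phi_prev[j] * r[k - 1 - j] for j in range(k - 1))
            den = 1.0 - sum(phi_prev[j] * r[j + 1] for j in range(k - 1))
            phi_kk = num / den
            phi = [phi_prev[j] - phi_kk * phi_prev[k - 2 - j]
                   for j in range(k - 1)] + [phi_kk]
        out.append(phi_kk)
        phi_prev = phi
    return out

# Synthetic AR(1) series with coefficient 0.7 (made-up data for the demo).
rng = random.Random(1)
series = [0.0]
for _ in range(1999):
    series.append(0.7 * series[-1] + rng.gauss(0.0, 1.0))

r = acf(series, 5)
partial = pacf(series, 5)
print("ACF :", [round(v, 2) for v in r])
print("PACF:", [round(v, 2) for v in partial])
```

For an AR(1) process the ACF decays gradually while the PACF should be close to zero after lag 1, which is the pattern used to read p off a PACF plot.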

 

Mathematical formulation of an ARIMA model

In an autoregression model (AR), we forecast the variable of interest using a linear combination of past values of the variable. The term autoregression indicates that it is a regression of the variable against itself.

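Thus, an autoregressive model of order p can be written as

$$y_t = c + \phi_1 y_{t-1} + \phi_2 y_{t-2} + \cdots + \phi_p y_{t-p} + \varepsilon_t,$$

where $\varepsilon_t$ is white noise. This is like a multiple regression but with lagged values of $y_t$ as predictors. We refer to this as an AR(p) model, an autoregressive model of order p.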
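To make the AR(p) recursion concrete, here is a small Python sketch (standard library only, offered as an illustration and not as the article's R code). The helper names `simulate_ar` and `ar_one_step` and all coefficient values are assumptions chosen for the demo.

```python
import random

def simulate_ar(c, phis, n, sigma=1.0, seed=0, burn_in=200):
    """Simulate y_t = c + phi_1*y_{t-1} + ... + phi_p*y_{t-p} + eps_t."""
    rng = random.Random(seed)
    p = len(phis)
    y = [0.0] * p                      # arbitrary start, discarded via burn-in
    for _ in range(n + burn_in):
        past = y[-p:][::-1]            # y_{t-1}, y_{t-2}, ..., y_{t-p}
        y.append(c + sum(ph * v for ph, v in zip(phis, past))
                 + rng.gauss(0.0, sigma))
    return y[-n:]

def ar_one_step(c, phis, history):
    """One-step-ahead forecast: the recursion with the error term set to 0."""
    past = history[-len(phis):][::-1]
    return c + sum(ph * v for ph, v in zip(phis, past))

# Illustrative AR(2) with made-up coefficients.
ar_series = simulate_ar(c=1.0, phis=[0.5, 0.3], n=5000)
ar_mean = sum(ar_series) / len(ar_series)
print("sample mean:", round(ar_mean, 2), "theoretical:", 1.0 / (1 - 0.5 - 0.3))
print("next value forecast:", ar_one_step(1.0, [0.5, 0.3], ar_series))
```

For a stationary AR(p) process the long-run mean is c / (1 - phi_1 - ... - phi_p), so the sample mean of a long simulated series should sit close to that value.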

 

 

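Rather than using past values of the forecast variable in a regression, a moving average model uses past forecast errors in a regression-like model:

$$y_t = c + \varepsilon_t + \theta_1 \varepsilon_{t-1} + \theta_2 \varepsilon_{t-2} + \cdots + \theta_q \varepsilon_{t-q},$$

where $\varepsilon_t$ is white noise. We refer to this as an MA(q) model, a moving average model of order q.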
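A short Python sketch (standard library only, for illustration and not the article's R code) shows the signature of an MA(q) process: its autocorrelation cuts off after lag q. The names `simulate_ma` and `autocorr` and the value theta = 0.6 are assumptions for the demo.

```python
import random

def simulate_ma(c, thetas, n, sigma=1.0, seed=0):
    """Simulate y_t = c + eps_t + theta_1*eps_{t-1} + ... + theta_q*eps_{t-q}."""
    rng = random.Random(seed)
    q = len(thetas)
    eps = [rng.gauss(0.0, sigma) for _ in range(n + q)]
    return [c + eps[t] + sum(th * eps[t - j - 1] for j, th in enumerate(thetas))
            for t in range(q, n + q)]

def autocorr(series, lag):
    """Sample autocorrelation at a single lag."""
    n = len(series)
    mean = sum(series) / n
    dev = [x - mean for x in series]
    return sum(dev[t] * dev[t - lag] for t in range(lag, n)) / sum(d * d for d in dev)

theta = 0.6                                   # made-up MA(1) coefficient
ma_series = simulate_ma(c=0.0, thetas=[theta], n=10000)
rho1_theory = theta / (1 + theta ** 2)        # lag-1 autocorrelation of an MA(1)
rho1 = autocorr(ma_series, 1)
rho2 = autocorr(ma_series, 2)
print("lag 1:", round(rho1, 3), "theory:", round(rho1_theory, 3))
print("lag 2:", round(rho2, 3), "theory: 0")
```

This cut-off is why the ACF plot is the usual guide for choosing q.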

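Thus, if we combine differencing with an autoregressive (AR) and a moving average (MA) model, we obtain a non-seasonal ARIMA model. The full model can be written as

$$y'_t = c + \phi_1 y'_{t-1} + \cdots + \phi_p y'_{t-p} + \theta_1 \varepsilon_{t-1} + \cdots + \theta_q \varepsilon_{t-q} + \varepsilon_t,$$

where $y'_t$ is the differenced series (it may have been differenced more than once), $\phi_1, \ldots, \phi_p$ are the autoregressive coefficients, $\theta_1, \ldots, \theta_q$ are the moving average coefficients, and $\varepsilon_t$ is white noise. The “predictors” on the right-hand side include both lagged values of $y'_t$ and lagged errors. We call this an ARIMA(p, d, q) model, where p = order of the autoregressive part; d = degree of the differencing involved, and q = order of the moving average part.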
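The same model is often written more compactly with the backshift operator $B$, defined by $B y_t = y_{t-1}$. The block below is a sketch of that standard notation, with $\phi_i$ the autoregressive coefficients, $\theta_j$ the moving average coefficients and $\varepsilon_t$ white noise. It also spells out what first and second differencing do to the original series.

```latex
% First and second differences of the original series y_t
y'_t  = y_t - y_{t-1} = (1 - B)\,y_t
y''_t = y'_t - y'_{t-1} = y_t - 2y_{t-1} + y_{t-2} = (1 - B)^2 y_t

% ARIMA(p, d, q) in backshift notation
(1 - \phi_1 B - \cdots - \phi_p B^p)\,(1 - B)^d\, y_t
    = c + (1 + \theta_1 B + \cdots + \theta_q B^q)\,\varepsilon_t

% Special case p = 5, d = 2, q = 1
(1 - \phi_1 B - \phi_2 B^2 - \phi_3 B^3 - \phi_4 B^4 - \phi_5 B^5)\,(1 - B)^2\, y_t
    = c + (1 + \theta_1 B)\,\varepsilon_t
```

The left-hand factor $(1 - B)^d$ is the "integrated" part: it states that the AR and MA structure applies to the series after $d$ rounds of differencing.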

 

Box-Jenkins approach to ARIMA model

An appropriate ARIMA model for forecasting can be determined through the Box-Jenkins approach. This approach to modelling ARIMA processes was described in a highly influential book by statisticians George Box and Gwilym Jenkins in 1970. Box-Jenkins modelling involves identifying an appropriate ARIMA model, fitting it to the data, and then using the fitted model for forecasting. One of the attractive features of the Box-Jenkins approach to forecasting is that ARIMA processes are a very rich class of possible models, and it is usually possible to find a process that provides an adequate description of the data. Sophisticated computational algorithms (statistical packages) are available in the R and Python programming languages to automate the Box-Jenkins approach. A brief description of the approach is given below.

Step 1: Data Preparation

Data preparation involves transformations and differencing. Transformations of the data (such as square roots or logarithms) can help stabilize the variance in a series where the variation changes with the level. Then the data are differenced until there are no obvious patterns such as trend or seasonality left in the data. “Differencing” means taking the difference between consecutive observations, or between observations a year apart. The differenced data are often easier to model than the original data.
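As a rough illustration of Step 1, the Python sketch below (standard library only, not the article's R code) applies a log transformation and repeated differencing to two synthetic series. The helper names `difference` and `difference_d` are hypothetical, and the numbers are made up to show the mechanics.

```python
import math

def difference(series, lag=1):
    """Difference between observations `lag` steps apart (lag=1: consecutive)."""
    return [series[t] - series[t - lag] for t in range(lag, len(series))]

def difference_d(series, d):
    """Apply first differencing d times."""
    for _ in range(d):
        series = difference(series)
    return series

# A quadratic trend becomes constant after two rounds of differencing.
quadratic = [3 * t * t + 2 * t + 5 for t in range(10)]
second_diff = difference_d(quadratic, 2)
print(second_diff)

# Exponential growth of 2% per period: the log transform makes the growth
# linear, so one difference of the logs is constant.
exponential = [100 * 1.02 ** t for t in range(10)]
log_diff = difference([math.log(v) for v in exponential])
print([round(v, 4) for v in log_diff])
```

Each round of differencing shortens the series by one observation, which matters when only a few dozen annual values are available.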

Step 2: Model Selection

Model selection in the Box-Jenkins framework uses various graphs based on the transformed and differenced data to try to identify potential ARIMA processes which might provide a good fit to the data. Later developments have led to other model selection tools such as Akaike’s Information Criterion (AIC).
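For reference, one common way to write the AIC of an ARIMA model is AIC = -2 log(L) + 2(p + q + k + 1), where L is the maximized likelihood and k = 1 if the model includes a constant and 0 otherwise. The corrected version, AICc, adds a small-sample penalty. The Python sketch below (standard library only) is an illustration under that assumed formula. The function names are hypothetical, and the log-likelihood values are invented for the demo rather than taken from any fitted model.

```python
def arima_aic(log_likelihood, p, q, has_constant):
    """AIC = -2*logL + 2*(p + q + k + 1), with k = 1 if a constant is included."""
    k = 1 if has_constant else 0
    return -2.0 * log_likelihood + 2.0 * (p + q + k + 1)

def arima_aicc(log_likelihood, p, q, has_constant, n_obs):
    """Small-sample corrected AIC; n_obs is the number of observations used."""
    k = 1 if has_constant else 0
    n_params = p + q + k + 1
    correction = 2.0 * n_params * (n_params + 1) / (n_obs - n_params - 1)
    return arima_aic(log_likelihood, p, q, has_constant) + correction

# Hypothetical candidates: (p, q) -> made-up maximized log-likelihood.
candidates = {(1, 0): -100.0, (1, 1): -97.0, (2, 1): -96.8}
scores = {order: arima_aic(ll, order[0], order[1], has_constant=False)
          for order, ll in candidates.items()}
best_order = min(scores, key=scores.get)
print(scores, "-> lowest AIC:", best_order)
```

Information criteria are only comparable between models fitted to the same differenced series, so candidates with different d should not be ranked this way.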

Step 3: Parameter Estimation

Parameter estimation means finding the values of the model coefficients which provide the best fit to the data. There are sophisticated computational algorithms designed to do this.
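To give a feel for what "best fit" means, the Python sketch below (standard library only, not the article's R code) estimates an AR(1) model by ordinary least squares, regressing each value on its predecessor. This is a simplification: statistical packages typically use maximum likelihood, and the function name `fit_ar1` and the simulated data are assumptions for the demo.

```python
import random

def fit_ar1(series):
    """Least-squares estimates (c, phi) for y_t = c + phi*y_{t-1} + eps_t."""
    x = series[:-1]                    # lagged values y_{t-1}
    y = series[1:]                     # current values y_t
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    phi = sxy / sxx
    c = mean_y - phi * mean_x
    return c, phi

# Simulate an AR(1) with known parameters, then try to recover them.
rng = random.Random(7)
true_c, true_phi = 2.0, 0.6
sim = [true_c / (1 - true_phi)]        # start at the long-run mean
for _ in range(3000):
    sim.append(true_c + true_phi * sim[-1] + rng.gauss(0.0, 1.0))

c_hat, phi_hat = fit_ar1(sim)
print("c_hat =", round(c_hat, 3), "phi_hat =", round(phi_hat, 3))
```

With only 60 annual observations the estimates would be far less precise than in this 3000-point simulation.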

Step 4: Model Checking

Model checking involves testing the assumptions of the model to identify any areas where the model is inadequate. If the model is found to be inadequate, it is necessary to go back to Step 2 and try to identify a better model.
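One widely used check is the Ljung-Box test, which asks whether the residuals still contain autocorrelation. Its statistic is Q = n(n + 2) Σ r_k² / (n − k), summed over lags k = 1..h. The Python sketch below (standard library only) computes Q for two synthetic series. The function name `ljung_box_q` is hypothetical, and 18.307 is the 95% chi-square critical value for 10 degrees of freedom. When the input is the residuals of a fitted ARIMA(p, d, q), the degrees of freedom are commonly reduced to h − p − q.

```python
import random

def ljung_box_q(residuals, max_lag):
    """Ljung-Box statistic Q = n(n+2) * sum_{k=1..h} r_k^2 / (n - k)."""
    n = len(residuals)
    mean = sum(residuals) / n
    dev = [e - mean for e in residuals]
    c0 = sum(d * d for d in dev)
    total = 0.0
    for k in range(1, max_lag + 1):
        r_k = sum(dev[t] * dev[t - k] for t in range(k, n)) / c0
        total += r_k * r_k / (n - k)
    return n * (n + 2) * total

# Two synthetic inputs: white noise (what good residuals look like) and a
# random walk (strongly autocorrelated, i.e. a badly specified model).
rng = random.Random(42)
white_noise = [rng.gauss(0.0, 1.0) for _ in range(500)]
random_walk = []
level = 0.0
for shock in white_noise:
    level += shock
    random_walk.append(level)

CHI2_95_DF10 = 18.307                  # 95% critical value, 10 degrees of freedom
q_noise = ljung_box_q(white_noise, 10)
q_walk = ljung_box_q(random_walk, 10)
print("white noise Q:", round(q_noise, 2), "exceeds critical:", q_noise > CHI2_95_DF10)
print("random walk Q:", round(q_walk, 2), "exceeds critical:", q_walk > CHI2_95_DF10)
```

A Q value above the critical value suggests that the residuals are not white noise and that the model should be revisited.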

Step 5: Forecasting

Forecasting is what the whole procedure is designed to accomplish. Once the model has been selected, estimated and checked, it is usually a straightforward task to compute forecasts. This is usually done by computer software such as R and/or Python packages.
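The mechanics of forecasting from a differenced model can be sketched in a few lines of Python (standard library only, not the article's R code). The example assumes a deliberately simple ARIMA(1,1,0): it forecasts the first difference with an AR(1) recursion and then adds the forecast differences back onto the last observed level. The function name, the coefficients and the three-point series are all made up.

```python
def forecast_arima_110(series, c, phi, horizon):
    """Point forecasts for y'_t = c + phi*y'_{t-1} + eps_t, with y'_t = y_t - y_{t-1}.

    Future errors are replaced by their expected value, zero.
    """
    last_level = series[-1]
    last_diff = series[-1] - series[-2]
    forecasts = []
    for _ in range(horizon):
        last_diff = c + phi * last_diff   # forecast the next difference
        last_level += last_diff           # undo the differencing
        forecasts.append(last_level)
    return forecasts

# Made-up numbers, not the Bangladesh data.
toy_forecast = forecast_arima_110([100.0, 102.0, 105.0], c=1.0, phi=0.5, horizon=3)
print(toy_forecast)   # [107.5, 109.75, 111.875]
```

A model with d = 2 would need the cumulative sum applied twice, once for each round of differencing.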

 

Data Source

The forecasts are based on 60 annual observations of the total population of Bangladesh, from 1960 to 2019. The dataset is taken from the World Bank website.

 

R Codes, Stats and Interpretation

  • The following code loads the necessary R packages and then the dataset.

 

  • After loading the dataset, we can check its head and tail to have a look at the original data.
    • Check the first 10 rows of the dataset.

    • Output

 

    • Check the last 10 rows of the dataset.

 

  • Now we plot the population of Bangladesh.

    • Output:

 

The figure above indicates that the Population variable is not stationary, since it trends upwards over the period 1960 to 2019. This implies that the mean and variance of “Population” are changing over time. The objective is to make the series stationary, or nearly stationary, so that reliable forecasts can be made. We therefore have to difference the Population variable.

First Difference of the original dataset:

Codes:

Figure:

 

Second Difference of the original dataset:

Codes:

Figure:

 

Third Difference of the original dataset:

Codes:

Figure:

 

The ACF and PACF plots of the twice-differenced (“Difference 2”) series:

Code:

Figures:

  • ACF Plot

  • PACF Plot

 

Auto ARIMA model that best fits the data

Code:

Output:

 

 

Model Diagnostics

Codes:

Outputs & Figures:

    • Ljung-Box statistics

    • Residuals check

    • Inverse roots of the characteristic polynomial of the fitted ARIMA model

The above diagnostics suggest the fitted ARIMA(5,2,1) model is quite stable and useful to forecast population.
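The inverse-roots check can be illustrated with a simplified case. An AR model is stationary when all inverse roots of its characteristic polynomial lie inside the unit circle. The Python sketch below (standard library only) works this out for an AR(2), where the inverse roots solve λ² − φ₁λ − φ₂ = 0. It does not use the coefficients of the fitted ARIMA(5,2,1); the function names and coefficient values are assumptions chosen for the demo.

```python
import cmath

def ar2_inverse_roots(phi1, phi2):
    """Inverse roots of 1 - phi1*z - phi2*z^2, i.e. roots of l^2 - phi1*l - phi2."""
    disc = cmath.sqrt(phi1 * phi1 + 4 * phi2)
    return [(phi1 + disc) / 2, (phi1 - disc) / 2]

def is_stationary(inverse_roots):
    """True when every inverse root lies strictly inside the unit circle."""
    return all(abs(root) < 1 for root in inverse_roots)

# Made-up coefficient pairs: two stationary cases and one that is not.
for phi1, phi2 in [(0.5, 0.3), (1.0, -0.5), (0.5, 0.6)]:
    roots = ar2_inverse_roots(phi1, phi2)
    moduli = [round(abs(root), 3) for root in roots]
    print((phi1, phi2), "moduli:", moduli, "stationary:", is_stationary(roots))
```

For higher orders such as p = 5 the roots need a numerical polynomial solver, which is what the plotting functions in statistical packages do behind the scenes.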

 

More on model diagnostics using tactile package

Code:

Figures:

 

Actual vs. Fitted population – ARIMA(5,2,1)

Codes:

Figure:

Forecasts for the next 3 decades

Codes:

Figures:

 

 

Conclusion

The ARIMA(5,2,1) model is a good candidate model to forecast the population of Bangladesh for the next 3 decades.

In subsequent posts, we will model and forecast the population of Bangladesh using Python with the FB Prophet and pmdarima packages, as well as with state-of-the-art deep learning techniques such as RNN, LSTM, CNN, TCN, FFT etc.

 

 

