Forecasting
Forecasting is used in many applications, such as:
Weather forecasting: Predicting what the weather will be
Sales forecasting: Predicting how many products will be sold
Demand forecasting
Economic forecasting: Predicting the income of a state or country
Workforce forecasting: Estimating how many people will be needed for work
Types of Forecasting
Quantitative Forecasting: uses historical data to establish causal relationships and trends which can be projected into the future.
Qualitative Forecasting: uses experience and judgment to estimate future behaviour.
Quantitative Forecasting Method
Time series data: A set of observations recorded at successive points in time, usually separated by regular intervals.
Time series analysis: Examining time series data to identify patterns and calculate statistics that can be used at a later stage, such as forecasting.
Time series forecasting: Using past data to make predictions about the future. For example, the Zomato app wants to predict the number of orders per day for the next month in order to plan its resources better. To do this, the Zomato team looks at large amounts of past data and uses it to forecast accurately.
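As a small illustration of the three terms above, the sketch below builds a toy daily series, computes one simple statistic on it, and produces a naive forecast that repeats the last observed value. The order counts are made up, and the naive forecast is only a placeholder baseline, not a method proposed in this article.

```python
import pandas as pd

# Time series data: observations indexed by equally spaced timestamps.
# The order counts below are made-up values for illustration only.
orders = pd.Series(
    [120, 135, 128, 150, 162, 158],
    index=pd.date_range("2024-01-01", periods=6, freq="D"),
    name="orders",
)

# Time series analysis: summarise the history, here with a 3-day moving average.
moving_avg = orders.rolling(window=3).mean()

# Time series forecasting: predict values for future timestamps from past values.
# Naive baseline: every future day is assumed to equal the last observed value.
future_index = pd.date_range(
    orders.index[-1] + pd.Timedelta(days=1), periods=3, freq="D"
)
naive_forecast = pd.Series(orders.iloc[-1], index=future_index, name="forecast")

print(moving_avg.round(1))
print(naive_forecast)
```

A baseline like this is mainly useful as a reference point: a real forecasting model should at least do better than repeating the last value.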
Air Passenger Traffic Forecasting Problem
An airline company has the data on the number of passengers that have travelled with them on a particular route for the past few years. Using this data, they want to see if they can forecast the number of passengers for the next twelve months.
Making this forecast could be quite beneficial to the company, as it would help them make some crucial decisions, such as:
What capacity aircraft should they use?
When should they fly?
How many air hostesses and pilots do they need?
How much food should they stock in their inventory?
Terminology
Goal: A set of business objectives. For example, maximising revenue, maximising capital, etc.
Plan: A set of actions that a business takes to achieve the goal. In order to come up with a good plan, they need a forecast.
Forecast: A prediction of the future.
Basic steps involved in any forecasting problem
Define the problem
Collect the data
Analyze the data
Build and evaluate the forecast model
1. Defining the Problem
The following three rules guide the decisions you make while defining the problem:
The Granularity Rule: The more aggregated your forecasts, the more accurate your predictions tend to be, simply because aggregated data has less variance and hence less noise. As a thought experiment, suppose you work at ABC, an online entertainment streaming service, and you want to predict the number of views for a newly launched TV show in Mumbai for the next year. Would your predictions be more accurate at the city level or at the area level? Accurately predicting the views from each area might be difficult, but when you sum the predicted views across areas and present your final predictions at the city level, they might be surprisingly accurate. This is because, for some areas, you will have predicted fewer views than the actual number, whereas for others you will have predicted more. When you sum all of these up, the errors tend to cancel each other out, leaving you with a good prediction. Hence, you should avoid making predictions at very granular levels.
The Frequency Rule: This rule tells you to update your forecasts regularly to capture any new information that comes in. Let's continue with the example of ABC, the online entertainment streaming service, where the problem is to predict the number of views for a newly launched TV show in Mumbai for the next year. If you update too infrequently, you might not capture new information accurately. For example, say you update your forecasts every 3 months. Due to the COVID-19 pandemic, residents may be confined to their homes for around 2-3 months, during which the number of views will increase significantly. If you update your forecasts only every 3 months, you will not capture this increase in time, which may cause significant losses and lead to mismanagement.
The Horizon Rule: When your horizon extends many months into the future, you are likely to be more accurate in the earlier months than in the later ones. Let's go back to the ABC example again. Suppose that in December 2019 the streaming service predicted the number of views for the next 6 months. The prediction may have been quite accurate for the first two months, but due to the unforeseen COVID-19 situation, the actual number of views in the following months would have been significantly higher than predicted because everyone was staying at home. The farther ahead we look into the future, the more uncertain our forecasts become.
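The Granularity Rule can be checked with a small simulation. The sketch below assumes that each area-level forecast is off by independent, zero-mean noise. The number of areas, the true view counts, and the noise level are all hypothetical. It then compares the percentage error at the area level with the error of the summed city-level forecast.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable

n_areas = 50      # hypothetical number of areas in the city
n_trials = 2000   # number of simulated forecasting rounds
true_views = np.full(n_areas, 1000.0)  # assumed true views per area

# Assumption: each area-level forecast misses by independent, zero-mean noise.
errors = rng.normal(loc=0.0, scale=200.0, size=(n_trials, n_areas))
area_forecasts = true_views + errors

# Mean absolute percentage error of the individual area-level forecasts.
area_mape = np.mean(np.abs(errors) / true_views) * 100

# Sum the area forecasts into a city-level forecast and measure its error.
city_forecast = area_forecasts.sum(axis=1)
city_truth = true_views.sum()
city_mape = np.mean(np.abs(city_forecast - city_truth) / city_truth) * 100

print(f"area-level MAPE: {area_mape:.1f}%")
print(f"city-level MAPE: {city_mape:.1f}%")
```

Under these assumptions the city-level error comes out several times smaller than the area-level error. The cancellation relies on the errors being independent: if every area were over-forecast at once, summing would not help nearly as much.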
Now that you have understood the steps in defining the problem, let’s apply them to the air passenger traffic problem.
Quantity: Number of passengers
Granularity: Flights from city A to city B; i.e., flights for a particular route
Frequency: Monthly
Horizon: 1 year (12 months)
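The frequency and horizon chosen above translate directly into the timestamps a forecast has to cover. The sketch below builds that index with pandas. The last observed month is a hypothetical placeholder, since the real value depends on the dataset.

```python
import pandas as pd

# Hypothetical last month of observed data; replace with the real last index value.
last_observed = pd.Timestamp("2020-12-01")

horizon = 12  # Horizon: 12 months

# Frequency: monthly. The "MS" alias means month start.
forecast_index = pd.date_range(
    start=last_observed + pd.DateOffset(months=1),
    periods=horizon,
    freq="MS",
)

# Empty placeholder (all NaN) that a forecasting model would fill in later.
forecast = pd.Series(index=forecast_index, dtype="float64", name="Passengers")

print(forecast_index[0], forecast_index[-1], len(forecast))
```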
2. Collecting Data
Collecting data is very important because every forecast depends on the data.
There are three important characteristics that any time series data must exhibit for us to make a good forecast.
Relevant: The time-series data should be relevant for the set objective that we want to achieve.
Accurate: The data should be accurate in terms of capturing the timestamps and capturing the observation correctly.
Long enough: The data should be long enough to forecast. This is because it is important to identify all the patterns in the past and forecast which patterns repeat in the future.
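Relevance has to be judged by a person, but accuracy and length can be partly checked in code. The sketch below is one possible set of sanity checks, not a standard routine. The helper name `check_series`, the toy values, and the 24-month threshold are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def check_series(series: pd.Series, min_periods: int) -> dict:
    """Run basic sanity checks on a series indexed by timestamps."""
    idx = series.index
    return {
        "sorted": bool(idx.is_monotonic_increasing),           # timestamps in order
        "duplicate_timestamps": int(idx.duplicated().sum()),   # repeated timestamps
        "missing_values": int(series.isna().sum()),            # gaps in observations
        "n_observations": len(series),
        "long_enough": len(series) >= min_periods,
    }


# Toy monthly series with one missing value (made-up numbers).
toy = pd.Series(
    [100.0, 110.0, np.nan, 130.0],
    index=pd.date_range("2020-01-01", periods=4, freq="MS"),
)

# Assumption: at least 24 monthly points (two full years) are wanted,
# so that a yearly pattern can be observed more than once.
report = check_series(toy, min_periods=24)
print(report)
```

The right minimum length depends on the patterns you expect. For a yearly pattern, a single year of data shows the pattern only once, which is not enough to tell whether it repeats.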
Time series data can come from various types of sources:
Private enterprise data: E.g. financial information about the quarterly results of a private organisation.
Public data: E.g. economic indicators published by the government, such as GDP and the consumer price index.
System/Sensor data: E.g. Logs generated by the servers during their 24/7 working hours.
3. Analyze the Data
We can ‘Analyze the data’ by understanding the different components of the time series.
The components of a time series are as follows:
Level: This is the baseline of a time series, to which the other components are added.
Trend: Over the long term, this indicates whether the time series moves lower or higher. For example, in the following Sensex graph you can clearly observe that the overall value increases with time, i.e. this particular time series has an increasing trend.
Seasonality: This is a pattern in time series data that repeats itself after a fixed period of time. For example, in the following graph 'Monthly sales data of company X', you can clearly observe that a fixed pattern repeats every year. The simplest example is the sales of winter wear in India: in the winter months, November to January, you would expect these sales to be very high, whereas in the other months they might be low. This is a seasonal pattern, and it proves very useful when making forecasts.
Cyclicity: This is also a repeating pattern in the data, but it repeats aperiodically, i.e. without a fixed period. We do not go into further detail on this component, as it is outside the scope of this module.
Noise: Noise is the completely random fluctuation present in the data and we cannot use this component to forecast into the future. This is that component of the time series data that no one can explain and is completely random.
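To make the components concrete, the sketch below builds a synthetic monthly series by adding a level, a trend, a seasonal pattern, and noise, and then recovers a rough trend with a moving average. The additive structure and every number used are assumptions for illustration. Cyclicity is left out, as in the text.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
months = pd.date_range("2015-01-01", periods=48, freq="MS")
t = np.arange(len(months))

level = 100.0                                     # baseline value
trend = 2.0 * t                                   # steady increase per month
seasonality = 10.0 * np.sin(2 * np.pi * t / 12)   # pattern repeating every 12 months
noise = rng.normal(0.0, 1.0, size=len(months))    # random fluctuation

# Additive model (an assumption): the series is the sum of its components.
series = pd.Series(level + trend + seasonality + noise, index=months)

# Rough estimate of level + trend: a 12-month moving average, approximately
# centred, smooths out the seasonal pattern because it spans one full period.
trend_estimate = series.rolling(window=12, center=True).mean()

# What is left after removing the smoothed part is seasonality plus noise.
detrended = series - trend_estimate

print(trend_estimate.dropna().head())
print(detrended.dropna().head())
```

The moving average needs a full 12-month window, so the estimate is missing for a few months at both ends of the series.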
Implementation
# Import packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
Read data:
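# Load the dataset; the file has no header row, so name the columns manually
df = pd.read_csv('https://raw.githubusercontent.com/rkmishracs/dataset/main/airline-passenger-traffic.csv', header=None)
df.columns = ['Month', 'Passengers']
df.head()
# Parse the year-month strings into timestamps and use them as the index
df['Month'] = pd.to_datetime(df['Month'], format='%Y-%m')
df = df.set_index('Month')
df.head()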
Time Series Analysis
df.plot(figsize=(12,4))
plt.legend(loc='best')
plt.title('Airlines Passenger Traffic Data')
plt.show(block=False)
Handling Missing Values
Mean imputation: Impute the missing values with the overall mean of the data.
Last observation carried forward: Impute each missing value with the previous value in the data.
Linear interpolation: Draw a straight line joining the previous and next points around the missing values and read the missing values off that line.
Mean Imputation
df=df.assign(Passengers_Mean_Imputation=df.Passengers.fillna(df.Passengers.mean()))
df[['Passengers_Mean_Imputation']].plot(figsize=(12,4))
plt.legend(loc='best')
plt.title('Missing Value: Mean Imputation')
plt.show(block=False)
Linear interpolation
df=df.assign(PLI=df.Passengers.interpolate(method='linear'))
df[['PLI']].plot()
plt.legend(loc='best')
plt.title('Missing Value: Linear Interpolation')
plt.show(block=False)
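Last observation carried forward is listed above but has no code in this walkthrough. The sketch below applies all three methods to a toy series with made-up values, so the differences are easy to see side by side. The column names in `filled` are chosen only for this example.

```python
import numpy as np
import pandas as pd

# Toy monthly series with two gaps (made-up values).
toy = pd.Series(
    [10.0, np.nan, 30.0, np.nan, 50.0],
    index=pd.date_range("2020-01-01", periods=5, freq="MS"),
    name="Passengers",
)

filled = pd.DataFrame({
    "mean": toy.fillna(toy.mean()),               # overall mean of the observed values
    "locf": toy.ffill(),                          # last observation carried forward
    "linear": toy.interpolate(method="linear"),   # straight line between neighbours
})

print(filled)
```

On the passenger data, the same call would be `df.Passengers.ffill()`, stored in a new column in the same way as the other two methods. Mean imputation ignores where in time the gap falls, so on a series with a trend it tends to distort early and late gaps more than the other two methods do.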