A time series is a sequence of data points recorded over a period of time. In layman's terms, pick a variable and note its values over time: for instance, the profit of a company over 8 years, or the temperature of a place every 5 minutes. The observations are ordered in time rather than cross-sectional, that is, they follow one variable through time instead of describing many subjects at a single moment. Time series forecasting is the use of models to predict future values from previously observed values.
Feature Engineering is the process of transforming a table that contains date-time variables and a target variable into a table with input variables and output variables.
A model cannot interpret a raw date such as 4th September on its own. It needs more information: is 4th September a weekend? Did it rain that day? Identifying the features that affect the output variable is the purpose of feature engineering.
Types of features while working with Time Series Data
- Date-Time features: Information such as the month, the season, or whether the day falls on a weekend can be derived from the timestamp itself.
- Lag features: These are values at previous points. For instance, considering last month’s sales for forecasting this month’s sales.
- Window features: These are summaries, such as the mean or the maximum, of the values over a window of previous time steps.
There are two types of window features: rolling window and expanding window.
A rolling window has a fixed width and slides forward along the data, so each value is computed from only the most recent observations. An expanding window starts at the first observation and grows with every step, so it covers all historical data up to that point.
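To make the difference concrete, here is a minimal sketch on a five-value toy series. The numbers are invented for illustration and are not part of the births data used in this article.

```python
import pandas as pd

# Toy series, invented purely for illustration
values = pd.Series([10, 20, 30, 40, 50])

# Rolling window: fixed width of 3, slides forward one step at a time
rolling_mean = values.rolling(window=3).mean()

# Expanding window: starts at the first observation and keeps growing
expanding_mean = values.expanding().mean()

comparison = pd.DataFrame({
    "value": values,
    "rolling_mean_3": rolling_mean,
    "expanding_mean": expanding_mean,
})
print(comparison)
```

The rolling mean is NaN for the first two rows, because a full window of 3 is not yet available, and then gives 20, 30, 40. The expanding mean gives 10, 15, 20, 25, 30, since each row averages everything seen so far.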
Consider the following example. The dataset records female births: each row holds a date and the number of births on that day.
From the above data, we can create new features that specifically tell us about the year, month and day of those births.
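    import pandas as pd

    # Assumption: `data` is a DataFrame with a datetime 'date' column and a 'births' column
    features = pd.DataFrame(index=data.index)
    # Create date-time features from the existing dataset
    features['year'] = data['date'].dt.year
    features['month'] = data['date'].dt.month
    features['day'] = data['date'].dt.day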
The dataset now has year, month, and day features.
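The weekend question raised earlier can be answered from the timestamp in the same way. The sketch below uses a small stand-in DataFrame with invented birth counts, and the column names `day_of_week`, `is_weekend` and `quarter` are my own choices rather than anything from the original dataset.

```python
import pandas as pd

# Stand-in for the births data; the counts are invented for illustration
data = pd.DataFrame({
    "date": pd.date_range("2024-01-01", periods=7, freq="D"),  # Monday to Sunday
    "births": [30, 34, 28, 41, 37, 33, 39],
})

features = pd.DataFrame(index=data.index)
# pandas numbers the days Monday=0 through Sunday=6
features["day_of_week"] = data["date"].dt.dayofweek
# Saturday (5) and Sunday (6) count as the weekend here
features["is_weekend"] = data["date"].dt.dayofweek >= 5
# Quarter of the year (1 to 4), a rough proxy for the season
features["quarter"] = data["date"].dt.quarter

print(features)
```

Which days count as a weekend depends on the region, so the `>= 5` rule is an assumption to adjust as needed.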
Let us now create 2 lag features from the existing data set with a shift of 1 and a shift of 365.
    # Value on the previous day
    features['lag1'] = data['births'].shift(1)
    # Value 365 rows earlier, i.e. the same day last year (ignoring leap days)
    features['lag2'] = data['births'].shift(365)
Rolling window and expanding window features can be created using the following code.
    # Mean of births over a rolling window of 5 observations
    features['roll_mean'] = data['births'].rolling(window=5).mean()
    # Maximum of births over a rolling window of 5 observations
    features['roll_max'] = data['births'].rolling(window=5).max()
    # Maximum of births over all observations from the start of the series up to each row
    features['expand_max'] = data['births'].expanding().max()
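Lag and window features leave missing values in the first rows, and a window that includes the current row also includes the value being predicted. One possible way to handle both, shown here as a sketch rather than as the method used above, is to shift the series by one step before rolling and then drop the incomplete rows. The data is invented and the names `prev3_mean`, `supervised`, `X` and `y` are hypothetical.

```python
import pandas as pd

# Stand-in for the births data; the counts are invented for illustration
data = pd.DataFrame({
    "date": pd.date_range("2024-01-01", periods=8, freq="D"),
    "births": [30, 34, 28, 41, 37, 33, 39, 36],
})

features = pd.DataFrame(index=data.index)
features["month"] = data["date"].dt.month
features["lag1"] = data["births"].shift(1)
# Variation on the code above: shift first, so the window covers
# only the three days before the current row
features["prev3_mean"] = data["births"].shift(1).rolling(window=3).mean()
features["target"] = data["births"]

# Rows without a full history contain NaN and are dropped
supervised = features.dropna()
X = supervised.drop(columns="target")  # input variables
y = supervised["target"]               # output variable

print(supervised)
```

The first three rows are dropped because the shifted three-day window is incomplete there, which leaves a table of input variables and one output variable as described at the start of this section.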
Upsampling and Downsampling with Time Series Data
At times, the frequency at which the data is available does not match the frequency at which we wish to predict. Changing the frequency of the data is called resampling.
Consider hospital admissions: we have the number of people admitted to a hospital each day, but we need to forecast the number admitted each month. This is a situation where resampling is needed.
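As a rough sketch of the hospital case, pandas can aggregate a daily series into monthly totals with `resample`. The admission counts below are invented, and the `"MS"` alias labels each month by its first day.

```python
import pandas as pd

# Invented daily admissions: 10 per day in January 2024, 20 per day in February 2024
daily = pd.Series(
    [10] * 31 + [20] * 29,
    index=pd.date_range("2024-01-01", periods=60, freq="D"),
    name="admissions",
)

# Group the daily values by calendar month and add them up
monthly = daily.resample("MS").sum()
print(monthly)
```

Here the sum is the natural summary because admissions are counts. For a quantity such as temperature, the mean would usually be the better choice.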
There are two types of resampling: upsampling and downsampling.
- Upsampling: Increasing the frequency of the data is called upsampling. To illustrate, if I have quarterly sales data and I want to convert it into monthly sales data, the frequency of the data must be increased. The new time steps have no observed values, so they are typically filled in using interpolation.
- Downsampling: This is the exact opposite of upsampling. Consider a case where I have quarterly sales data and I want to convert it into yearly data. The frequency of the data has to be reduced, which is known as downsampling. This can be done by aggregating with summary statistics such as the sum or the mean.
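Both directions can be sketched with pandas `resample`. The quarterly figures below are invented, and linear interpolation is only one of several possible fill methods, so treat this as an illustration under those assumptions.

```python
import pandas as pd

# Invented quarterly values, labelled by the first day of each quarter
quarterly = pd.Series(
    [100.0, 130.0, 160.0],
    index=pd.date_range("2024-01-01", periods=3, freq="QS"),
)

# Upsampling: move to a monthly index; the new months start out as NaN
monthly = quarterly.resample("MS").asfreq()
# Fill the gaps by drawing a straight line between the known points
monthly_interp = monthly.interpolate(method="linear")

# Downsampling: collapse the monthly values back to quarters using the mean
quarterly_mean = monthly_interp.resample("QS").mean()

print(monthly_interp)
print(quarterly_mean)
```

After interpolation the seven monthly values rise in even steps from 100 to 160. Interpolating like this suits level-type measurements. For totals such as sales, the quarterly figure would also need to be split across its months, which this sketch does not do.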