In Part 1 of this series we looked at time series analysis. We learned about the different properties of a time series, autocorrelation, partial autocorrelation, stationarity, tests for stationarity, and seasonality.
In this part of the series, we will see how we can make models that take a time series and predict how the series will move in the future. Specifically, in this tutorial, we will look at autoregressive models and exponential smoothing methods. In the final part of the series, we will look at machine learning and deep learning algorithms like linear regression and LSTMs.
You can also follow along with the code in this article (and run it for free) from a Gradient Community Notebook on the ML Showcase.
We will be using the same data we used in the previous article (i.e. the weather data from Jena, Germany) for our experiments. You can download it as shown below.
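As a minimal sketch, here is one way to fetch and unzip the data directly from Python. The URL is an assumption on my part (it points to the copy of the dataset mirrored for the Keras tutorials), so swap it out if you are using a different source.

```python
import urllib.request
import zipfile

# Assumed mirror of the Jena climate data (the copy hosted for the Keras tutorials)
url = "https://storage.googleapis.com/tensorflow/tf-keras-datasets/jena_climate_2009_2016.csv.zip"
urllib.request.urlretrieve(url, "jena_climate_2009_2016.csv.zip")

# Extract the CSV next to the notebook/script
with zipfile.ZipFile("jena_climate_2009_2016.csv.zip") as zf:
    zf.extractall(".")
```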
Unzip the file and you'll find CSV data that you can read using Pandas. The dataset has several different weather parameters recorded. For this tutorial, we'll be using the temperature in degrees Celsius. The data is recorded around the clock at 10-minute intervals; we will resample it to hourly data for our predictive models.
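A minimal sketch of the loading and resampling step, assuming the standard column names and timestamp format of the Jena CSV ("Date Time" and "T (degC)"):

```python
import pandas as pd

# Parse the timestamp column and use it as the index
df = pd.read_csv("jena_climate_2009_2016.csv")
df["Date Time"] = pd.to_datetime(df["Date Time"], format="%d.%m.%Y %H:%M:%S")
df = df.set_index("Date Time")

# Keep only the temperature column and resample from 10-minute to hourly readings
temperature = df["T (degC)"].resample("H").mean()
```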
There are different methods for time series forecasting, depending on the properties we discussed in the previous article. If a time series is stationary, autoregressive models can come in handy. If a series is not stationary, smoothing methods might work well. Seasonality can be handled in both autoregressive models and smoothing methods. We can also use classical machine learning algorithms like linear regression and random forest regression, as well as deep learning architectures based on LSTMs.
If you haven't read the first article in this series, I would suggest you read through it before diving into this one.
Autoregressive Models
In multiple linear regression, we predict a value based on the value of other variables. The expression for the model assumes a linear relationship between the output variable and the predictor variables.
In autoregressive models, we assume a linear relationship between the value of a variable at time $t$ and the values of the same variable at the previous time steps $t-1, t-2, \dots, t-p$.
$$y_t = c + \beta_1 y_{t-1} + \beta_2 y_{t-2} + \dots + \beta_p y_{t-p} + \epsilon_t$$

Here $p$ is the lag order of the autoregressive model.
For an AR(1) model:

- When $\beta_1 = 0$, it signifies random data
- When $\beta_1 = 1$ and $c = 0$, it signifies a random walk
- When $\beta_1 = 1$ and $c \neq 0$, it signifies a random walk with a drift
We usually restrict autoregressive models to stationary time series, which means that for an AR(1) model $-1 < \beta_1 < 1$.
Another way of representing a time series is by considering a pure Moving Average (MA) model, where the value of our variable depends on the residual errors of the series in the past.
$$y_t = m + \epsilon_t + \phi_1 \epsilon_{t-1} + \phi_2 \epsilon_{t-2} + \dots + \phi_q \epsilon_{t-q}$$
As we learned in the previous article, if a time series is not stationary, there are multiple ways of making it stationary. The most commonly used method is differencing.
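With pandas, first-order differencing is a one-liner; here is a quick sketch using the hourly temperature series loaded above:

```python
# First-order differencing: subtract each value from the previous one.
# Applying .diff() again would give second-order differencing.
temp_diff = temperature.diff().dropna()
```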
ARIMA models take into account all three mechanisms mentioned above (autoregression, differencing, and moving average terms) and represent a time series as shown below.
$$y_t = \alpha + \beta_1 y_{t-1} + \beta_2 y_{t-2} + \dots + \beta_p y_{t-p} + \epsilon_t + \phi_1 \epsilon_{t-1} + \phi_2 \epsilon_{t-2} + \dots + \phi_q \epsilon_{t-q}$$
(Where the series is assumed to be stationary.)
The stationarizing mechanism is implemented before the model is fitted on the series.
The order of differencing can be found by using different tests for stationarity and looking at PACF plots. You can refer to the first part of the series to understand the tests and their implementations.
The MA order is based on the ACF plot of the differenced series: it is chosen as the number of lagged error terms needed to remove the autocorrelation that remains after differencing.
You can implement this as follows, where $p$ is the lag order, $q$ is the MA order, and $d$ is the differencing order.
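A minimal sketch using statsmodels' ARIMA class; the orders below are placeholders for illustration, not values tuned for this dataset.

```python
from statsmodels.tsa.arima.model import ARIMA

# order=(p, d, q): lag order, differencing order, MA order.
# These values are illustrative placeholders, not tuned for this dataset.
p, d, q = 2, 1, 2
model = ARIMA(temperature, order=(p, d, q))
result = model.fit()

print(result.summary())
forecast = result.forecast(steps=24)  # forecast the next 24 hours
```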
This will not be enough for our time series: apart from being non-stationary, it also shows seasonal trends. We will need a SARIMA model.
The equation for the SARIMA model becomes (assuming a seasonal lag of 12):
$$y_t = \gamma + \beta_1 y_{t-1} + \beta_2 y_{t-2} + \dots + \beta_p y_{t-p} + \epsilon_t + \phi_1 \epsilon_{t-1} + \phi_2 \epsilon_{t-2} + \dots + \phi_q \epsilon_{t-q} + B_1 y_{t-12} + B_2 y_{t-13} + \dots + B_q y_{t-12-q} + \epsilon_{t-12} + \Phi_1 \epsilon_{t-13} + \Phi_2 \epsilon_{t-14} + \dots + \Phi_q \epsilon_{t-12-q}$$
This is a linear equation and the coefficients can be found using regression algorithms.
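If you want to specify the seasonal terms by hand, statsmodels exposes them through the SARIMAX class. A minimal sketch, assuming a series resampled to monthly means so that one seasonal cycle spans 12 observations (resampling is discussed in more detail below); the orders are illustrative placeholders:

```python
from statsmodels.tsa.statespace.sarimax import SARIMAX

# Resample to monthly means so the yearly cycle spans m = 12 observations
monthly = temperature.resample("M").mean()

# order=(p, d, q) are the non-seasonal terms, seasonal_order=(P, D, Q, m) the seasonal ones.
# These values are illustrative placeholders, not tuned for this dataset.
sarima = SARIMAX(monthly, order=(1, 0, 0), seasonal_order=(0, 1, 1, 12))
sarima_result = sarima.fit(disp=False)
print(sarima_result.summary())
```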
Sometimes a time series can be over- or under-differenced, because the ACF and PACF plots can be a little tricky to infer from. Lucky for us, there is a tool we can use to automate the selection of the ARIMA hyperparameters as well as the seasonal terms. You can install pmdarima using pip.
pmdarima uses grid search to search through the values of the ARIMA parameters, and picks the model with the lowest AIC value. It also determines the differencing order automatically, using the stationarity test you select.
For our time series, though, the frequency is one cycle per 365 days (which, with hourly data, is one cycle per 8,760 data points). This can get a little too heavy for your computer to handle. Even after I reduced the data from hourly to daily, I found the modeling script getting killed. The only thing left to do was to resample the data to monthly frequency and then run the model.
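Roughly, the call might look like the sketch below; the exact arguments here are my assumptions rather than the original script.

```python
# pip install pmdarima
import pmdarima as pm

# Monthly means, so one seasonal cycle is only 12 observations long
monthly = temperature.resample("M").mean()

# auto_arima searches over (p, d, q)(P, D, Q, m) and keeps the model with the
# lowest AIC; the 'adf' test is used here to choose the differencing order.
model = pm.auto_arima(
    monthly,
    seasonal=True,
    m=12,
    test="adf",
    stepwise=True,
    trace=True,
    error_action="ignore",
    suppress_warnings=True,
)

# Forecast the next 24 months together with confidence intervals
predictions, conf_int = model.predict(n_periods=24, return_conf_int=True)
```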
The output looks something like this.
The best model is ARIMA(1, 0, 0)(0, 1, 1)(12). From the results you can see the coefficient values and their p-values, which are all below 0.05, indicating that the coefficients are statistically significant.
The plot of the predicted values against the original time series looks like this.

The plot with the confidence intervals looks like this.

These plots are not bad; the predicted values all fall in the confidence range and the seasonal patterns are captured well, too.
That being said, we have lost all granularity in the data while trying to make the algorithm work for us with limited compute. We need other methods.
You can learn more about Autoregressive models in this article.
Smoothing Methods
Exponential smoothing methods are often used in time series forecasting. They utilize the exponential window function to smooth a time series. There are multiple variations of smoothing methods, too.
The simplest form of exponential smoothing can be thought of this way:
$$s_0 = x_0$$

$$s_t = \alpha x_t + (1 - \alpha) s_{t-1} = s_{t-1} + \alpha (x_t - s_{t-1})$$
Where $x$ represents the original values, $s$ represents the predicted values, and $\alpha$ is the smoothing factor, where:

$$0 \leq \alpha \leq 1$$
This means that the smoothed statistic $s_t$ is a weighted average of the current observation $x_t$ and the previous smoothed value $s_{t-1}$; equivalently, it is the previous smoothed value adjusted by a fraction $\alpha$ of the current forecast error.
For greater smoothing of the curve, the value of the smoothing factor (somewhat counterintuitively) needs to be lower; setting $\alpha = 1$ reproduces the original time series. The smoothing factor can be found using the method of least squares, where you minimize the following:
$$\sum_t (s_t - x_{t+1})^2$$
The smoothing method is called exponential smoothing because when you recursively apply the formula:
$$s_t = \alpha x_t + (1 - \alpha) s_{t-1}$$
You get:
$$s_t = \alpha \sum_{i=0}^{t} (1 - \alpha)^i x_{t-i}$$
The weights $(1 - \alpha)^i$ form a geometric progression, i.e. a discrete version of an exponential function, which is where the name comes from.
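To make the recursion concrete, here is a small NumPy sketch of the formula above (this is just the raw recursion, not the statsmodels implementation we use later):

```python
import numpy as np

def simple_exponential_smoothing(x, alpha):
    """Apply s_t = alpha * x_t + (1 - alpha) * s_{t-1}, with s_0 = x_0."""
    s = np.zeros(len(x))
    s[0] = x[0]
    for t in range(1, len(x)):
        s[t] = alpha * x[t] + (1 - alpha) * s[t - 1]
    return s

# Example: smooth the hourly temperatures with a small smoothing factor
smoothed = simple_exponential_smoothing(temperature.to_numpy(), alpha=0.2)
```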
We will implement three variants of exponential smoothing: Simple Exponential Smoothing, Holt's Linear Smoothing, and Holt's Exponential Smoothing. We will try to find out how changing the hyperparameters of the different smoothing algorithms changes our forecasting output, and see which one works best for us.
All three models have different hyperparameters, which we will test out using grid search. We will also compute the RMSE values so we can compare the models and pick the best one.
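A minimal sketch of the setup, assuming a simple 90/10 train/test split on the hourly series and RMSE as the comparison metric (the split ratio is my choice for illustration):

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from statsmodels.tsa.holtwinters import SimpleExpSmoothing, Holt

# Hold out the last 10% of the hourly series for evaluation
split = int(len(temperature) * 0.9)
train, test = temperature[:split], temperature[split:]

def evaluate(fitted_model, test):
    """Forecast over the test horizon and return the RMSE."""
    forecast = fitted_model.forecast(len(test))
    return np.sqrt(mean_squared_error(test, forecast))
```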
Now we run the experiments with different hyperparameters.
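A sketch of the grid search over illustrative hyperparameter grids. Note that Holt's exponential (multiplicative) trend needs strictly positive values, so the Celsius series is shifted to Kelvin for that variant, and that older statsmodels versions call the `smoothing_trend` argument `smoothing_slope` instead.

```python
results = {}
alphas = betas = [0.2, 0.4, 0.6, 0.8]  # illustrative search grids

# Simple exponential smoothing: search over the smoothing level alpha
for alpha in alphas:
    fit = SimpleExpSmoothing(train).fit(smoothing_level=alpha, optimized=False)
    results[("simple", alpha)] = evaluate(fit, test)

# Holt's linear (additive) trend: search over alpha and the trend factor beta
for alpha in alphas:
    for beta in betas:
        fit = Holt(train).fit(smoothing_level=alpha, smoothing_trend=beta, optimized=False)
        results[("holt_linear", alpha, beta)] = evaluate(fit, test)

# Holt's exponential (multiplicative) trend requires strictly positive data,
# so shift the Celsius values to Kelvin; the RMSE is unaffected by the shift.
train_k, test_k = train + 273.15, test + 273.15
for alpha in alphas:
    for beta in betas:
        fit = Holt(train_k, exponential=True).fit(smoothing_level=alpha, smoothing_trend=beta, optimized=False)
        results[("holt_exponential", alpha, beta)] = evaluate(fit, test_k)
```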
Let's look at how our plots turned out.
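Plots in this style can be produced with matplotlib; a quick sketch that fits one configuration and overlays its forecast on the held-out data:

```python
import matplotlib.pyplot as plt

# Fit one configuration and overlay its forecast on the observed series
fit = SimpleExpSmoothing(train).fit(smoothing_level=0.2, optimized=False)

plt.figure(figsize=(12, 5))
plt.plot(train.index, train, label="train")
plt.plot(test.index, test, label="test")
plt.plot(test.index, fit.forecast(len(test)), label="forecast")
plt.ylabel("T (degC)")
plt.legend()
plt.show()
```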



Since our data is very dense, the plots can look cluttered when viewed from start to finish. Let's also look at the data zoomed in.
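One simple way to zoom in is to plot only the first couple of weeks of the test window; a small sketch, reusing the fitted model from above:

```python
# Zoom in on the first two weeks of the test window (24 * 14 hourly points)
zoom = 24 * 14
forecast = fit.forecast(len(test))

plt.figure(figsize=(12, 5))
plt.plot(test.index[:zoom], test.iloc[:zoom], label="test")
plt.plot(test.index[:zoom], forecast[:zoom], label="forecast")
plt.legend()
plt.show()
```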



We can find the best model for all three methods and compare them, too.
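With the RMSE dictionary from the grid search above, picking the overall winner is a one-liner; a small sketch:

```python
# Configuration (method, hyperparameters) with the lowest RMSE across all methods
best_config = min(results, key=results.get)
print("Best model:", best_config, "RMSE:", results[best_config])
```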
We find the following results.


As it turns out, the simple exponential smoothing model has the smallest RMSE value, making it the best model we have.


You can find and run the code for this series of articles here.
Conclusion
In this part of the series we looked mainly at autoregressive models: moving average terms, lag orders, differencing, accounting for seasonality, and their implementation, including grid search-based hyperparameter selection. We then moved on to exponential smoothing methods, covering simple exponential smoothing, Holt's linear and exponential smoothing, grid search-based hyperparameter selection over a discrete user-defined search space, best model selection, and inference.
In the next part, we will look at how to create features, train models, and make predictions with classical machine learning algorithms like linear regression and random forest regression, as well as deep learning algorithms like LSTMs.
Hope you enjoyed the read.