%load_ext pretty_jupyter

Introduction - Simulating Autoregressive (AR) Processes

This walkthrough is part of the ECLR page.

Autoregressive (AR) processes are a fundamental building block in time-series econometrics. In this walkthrough, we will learn how to simulate AR processes using Python. This exercise not only helps us understand the practical aspects of working with AR models but also deepens our theoretical understanding of their properties.

General Autoregressive Process (AR)

A general autoregressive process is represented by the following model:

$$r_t = \alpha + \beta t + \sum_{i=1}^{p} \phi_i r_{t-i} + \epsilon_t$$

where $\epsilon_t$ is a white noise process satisfying:

  • $E[\epsilon_t] = 0$ (mean is zero),
  • $\text{Var}[\epsilon_t] = \sigma^2$ (variance is constant),
  • $\text{Cov}[\epsilon_t, \epsilon_{t-k}] = 0$ for all $k > 0$ (no autocorrelation).

The table below lists several important special cases of this model and the parameter restrictions that produce them.

| Process Name | Parameter Restrictions |
|---|---|
| White noise | $\alpha = \beta = \phi_i = 0$ for all $i$ |
| Deterministic trend model | $\phi_i = 0$ for all $i$ |
| Random walk | $\phi_1 = 1$, $\alpha = \beta = \phi_i = 0$ for all $i > 1$ |
| Random walk with drift | $\phi_1 = 1$, $\beta = \phi_i = 0$ for all $i > 1$ |
| AR(1) model | $\beta = \phi_i = 0$ for all $i > 1$ |
| AR(p) model | $\beta = 0$ |
| AR(p) model with deterministic trend | $\beta \neq 0$ |

Simulating an AR(1) process

Autoregressive (AR) processes are fundamental tools for modeling time-series data. For example, daily stock price returns are often modelled as a linear function of the previous day's return. Such a relationship can be captured with an AR(1) process:

$$r_t = \alpha + \phi_1 r_{t-1} + \epsilon_t$$

Where:

  • $r_t$: Return on day $t$.
  • $\phi_1$: AR(1) coefficient (persistence), which determines how much influence the previous day's return has on today's return.
  • $\epsilon_t$: White noise (random shocks), typically drawn from the standard normal distribution $N(0,1)$, although one could also choose a different distribution.

Before simulating an AR(1) process, we need to specify a number of pieces of information:

  1. Parameter Values:
    • $\alpha$: The constant term (intercept). For now, we will set $\alpha = 0$.
    • $\phi_1$: The persistence coefficient, which determines how the previous value ($r_{t-1}$) influences the current value ($r_t$). We will set $\phi_1 = 0.8$.
  2. Random Shocks ($\epsilon_t$):
    • These are the random terms that represent "white noise" or random shocks in the system. We assume $\epsilon_t \sim N(0,1)$, meaning they are drawn from a standard normal distribution with mean 0 and variance 1.
  3. Number of Time Periods:
    • Decide how many observations ($T$) you want to simulate. For example, $T = 1000$.
  4. Starting Value ($r_0$):
    • To simulate the first observation $r_1$, we need an initial value $r_0$. This is often set to 0 for simplicity, but the choice can affect the initial behaviour of the process.

Once we have this information, we can simulate the AR(1) process using the following code. We start by defining a function called simulate_ar1.

# import some libraries used
import pandas as pd
import numpy as np
import random
import time

# Plotting
from lets_plot import *
# Lets_plot requires a setup
LetsPlot.setup_html()

# Function to simulate AR(1) process
def simulate_ar1(phi, T, r0=0, alpha=0, seed=None):
    """
    Simulates an AR(1) process: r_t = alpha + phi * r_{t-1} + epsilon_t.

    Parameters:
    phi (float): AR(1) coefficient (persistence).
    T (int): Number of time periods to simulate.
    r0 (float): Initial value for r_0.
    alpha (float): Constant term in the AR(1) process.
    seed (int, optional): Random seed for reproducibility; if None, one is derived from the current time.

    Returns:
    np.ndarray: Simulated AR(1) time series.
    """
    # Set the random seed for reproducibility; if no seed is given,
    # derive one from the current time so repeated calls differ
    if seed is None:
        seed = np.int64(time.time())
    np.random.seed(seed)
    
    # Step 1: Generate random shocks (white noise) from N(0, 1)
    epsilon_t = np.random.normal(0, 1, T)  
    
    # Step 2: Initialize the time series array
    r_t = np.zeros(T)  # Start with all zeros
    r_t[0] = r0  # Set the starting value r_0
    
    # Step 3: Simulate the AR(1) process iteratively
    for t in range(1, T):
        r_t[t] = alpha + phi * r_t[t-1] + epsilon_t[t]  # Apply the AR(1) formula
    
    # The function returns the simulated time series $r_t$.
    return r_t

This function requires the input of parameters (phi, T, r0=0, alpha=0, seed=None). The meaning of phi, T, r0 and alpha has been explained above; of these, r0 and alpha have a default value of 0. The seed option allows you to feed in a particular random seed. If you do not provide one, a value based on the current time is used.

Now this function can be called. We set seed = 42 to ensure that we always obtain the same series; omit the seed if you want random variation.

# Define the parameters
alpha = 0  # Constant term
phi = 0.8  # AR(1) coefficient
T = 1000    # Number of time periods
r0 = 0     # Starting value

# Simulate the AR(1) process
r_sim = simulate_ar1(phi=phi, T=T, r0=r0, alpha=alpha, seed = 42)

With the object r_sim containing the simulated series, we can display it in a line graph.

data = {
  "t": range(1,T+1),
  "r": r_sim
}

#load data into a DataFrame object:
r_df = pd.DataFrame(data)

(
  ggplot(r_df,aes(x = 't', y = 'r')) + 
    geom_line() +
    labs(title = "Simulated AR(1) return series",
         x = "Time")
)

Well done, you have just created your first simulated autoregressive series using Python!
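As a quick sanity check (a minimal sketch using the properties stated above), the sample mean should be close to the unconditional mean $\alpha/(1-\phi_1) = 0$ and the lag-1 sample autocorrelation close to $\phi_1 = 0.8$:

# Sanity check: sample moments should be close to their theoretical values
print("Sample mean:", r_sim.mean())  # approx. alpha / (1 - phi) = 0
print("Lag-1 autocorrelation:", np.corrcoef(r_sim[:-1], r_sim[1:])[0, 1])  # approx. phi = 0.8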

Let's say that we want to simulate two AR(1) processes:

  • One with $\phi_1 = 0.97$ (near unit root)
  • Another with $\phi_1 = 1$ (a unit root process)

# Parameters
T = 1000  # Number of time periods
r0 = 0   # Starting value

# Simulate the two paths
phi_1_near = 0.97  # Near-unit root
phi_1_unit = 1.0   # Unit root

r_sim_near = simulate_ar1(phi_1_near, T, r0, seed=42)  # phi = 0.97
r_sim_unit = simulate_ar1(phi_1_unit, T, r0, seed=42)  # phi = 1.0
data = {
  "t": range(1,T+1),
  "r_near": r_sim_near,
  "r_unit": r_sim_unit
}

#load data into a DataFrame object:
r_df = pd.DataFrame(data)

r_df_long = pd.melt(r_df, 
                          id_vars = 't', 
                          value_vars = ['r_near', 'r_unit'],
                          var_name = "Process",
                          value_name = "r")

(
  ggplot(r_df_long,aes(x = 't', y = 'r', color = 'Process')) + 
    geom_line() +
    labs(title = "Simulated AR(1) return series",
         x = "Time")
)

We can observe that:

  1. For $\phi_1 = 0.97$ (near unit root): the series repeatedly returns to values close to $0$, but persistence is high, so it hovers around its mean of $0$ for long stretches before reverting.
  2. For $\phi_1 = 1$ (unit root): the series shows no tendency to return to $0$. Instead, it exhibits random walk behaviour and drifts indefinitely without any clear tendency to revert.
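One way to quantify this difference is the half-life of a shock, the number of periods until half of a shock's effect has decayed:

$$h = \frac{\ln(0.5)}{\ln(\phi_1)}$$

For $\phi_1 = 0.97$ this gives $h \approx 23$ periods, while for $\phi_1 = 1$ the half-life is infinite: shocks never die out.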

Simulating an AR(2) process

The AR(2) model introduces a second lag, meaning today's value depends on the previous two values of the series, plus a random shock. Mathematically, the AR(2) process is represented as:

$$r_t = \alpha + \phi_1 r_{t-1} + \phi_2 r_{t-2} + \epsilon_t$$

The function below (simulate_ar2) has been adjusted to simulate an AR(2) process. Compare this to simulate_ar1 to understand the differences.

def simulate_ar2(phi1, phi2, T, r0=0, alpha=0, seed=None):
    """
    Simulates an AR(2) process: r_t = alpha + phi1 * r_{t-1} + phi2 * r_{t-2} + epsilon_t.

    Parameters:
    phi1 (float): Coefficient for r_{t-1}.
    phi2 (float): Coefficient for r_{t-2}.
    T (int): Number of time periods to simulate.
    r0 (float): Return at t=0 (starting value).
    alpha (float): Constant term in the AR(2) process.
    seed (int, optional): Random seed for reproducibility; if None, one is derived from the current time.

    Returns:
    np.ndarray: Simulated AR(2) time series.
    """
    if seed is None:
        seed = np.int64(time.time())
    np.random.seed(seed)
    epsilon_t = np.random.normal(0, 1, T)  # Random shocks
    r_t = np.zeros(T)  # Initialize time series
    
    # Step 1: Set initial values
    r_t[0] = r0
    r_t[1] = r0

    # Step 2: Simulate AR(2) process using a loop
    for t in range(2, T):  # Start from t=2
        r_t[t] = alpha + phi1 * r_t[t-1] + phi2 * r_t[t-2] + epsilon_t[t]
    
    return r_t

# Parameters for AR(2)
phi1 = 0.6  # Coefficient for r_{t-1}
phi2 = -0.3  # Coefficient for r_{t-2}
T = 1000  # Number of time periods
r0 = 0  # Starting value at t=0

# Simulate the AR(2) process
r_sim_ar2 = simulate_ar2(phi1, phi2, T, r0)
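Following the same pattern as before, we can display the simulated AR(2) series in a line graph:

data = {
  "t": range(1,T+1),
  "r": r_sim_ar2
}

#load data into a DataFrame object:
r_df = pd.DataFrame(data)

(
  ggplot(r_df,aes(x = 't', y = 'r')) + 
    geom_line() +
    labs(title = "Simulated AR(2) return series",
         x = "Time")
)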

Improving the code

The task

A common feature of programming is that your code does what you want, but you realise that you can improve it. Sometimes the need for improvement only becomes apparent in particular use cases.

The case for improvement

Let's simulate the process for a new parameter combination, in particular one for which $\alpha \neq 0$.

# Parameters for AR(2)
alpha = 10
phi1 = 0.5  # Coefficient for r_{t-1}
phi2 = 0.4  # Coefficient for r_{t-2}
T = 1000  # Number of time periods
r0 = 0  # Starting value at t=0

# Simulate the AR(2) process
r_sim_ar2_v2 = simulate_ar2(phi1, phi2, T, r0, alpha = alpha)

data = {
  "t": range(1,T+1),
  "r": r_sim_ar2_v2
}

#load data into a DataFrame object:
r_df = pd.DataFrame(data)

(
  ggplot(r_df,aes(x = 't', y = 'r')) + 
    geom_line() +
    labs(title = "Simulated AR(2) series, with non-zero mean",
         x = "Time")
)

Clearly, the process looks very odd at the beginning.

The problem

The reason for the issue is that stationary AR processes converge towards a certain value, the unconditional mean. For an AR(2) process, that mean relates to the process parameters as follows: $E(r) = \alpha / (1 - \phi_1 - \phi_2)$. If we substitute the values for our process we get $E(r) = 10 / 0.1 = 100$. But the default starting value for the process is r0 = 0, so at the beginning the process spends time transitioning from 0 to $E(r)$. This is typically not what you want.

The solution

To solve the issue, we change the default starting value from 0 to $E(r)$.

def simulate_ar2_v2(phi1, phi2, T, r0=0, alpha=0, seed=None):
    """
    Simulates an AR(2) process: r_t = alpha + phi1 * r_{t-1} + phi2 * r_{t-2} + epsilon_t.

    Parameters:
    phi1 (float): Coefficient for r_{t-1}.
    phi2 (float): Coefficient for r_{t-2}.
    T (int): Number of time periods to simulate.
    r0 (float): Return at t=0 (starting value); if left at 0, the unconditional mean is used.
    alpha (float): Constant term in the AR(2) process.
    seed (int, optional): Random seed for reproducibility; if None, one is derived from the current time.

    Returns:
    np.ndarray: Simulated AR(2) time series.
    """
    if seed is None:
        seed = np.int64(time.time())
    np.random.seed(seed)
    epsilon_t = np.random.normal(0, 1, T)  # Random shocks
    r_t = np.zeros(T)  # Initialize time series
    
    # Default the starting value to the unconditional mean E(r)
    if (r0 == 0):
        r0 = alpha/(1 - phi1 - phi2)
    
    # Step 1: Set initial values
    r_t[0] = r0
    r_t[1] = r0

    # Step 2: Simulate AR(2) process using a loop
    for t in range(2, T):  # Start from t=2
        r_t[t] = alpha + phi1 * r_t[t-1] + phi2 * r_t[t-2] + epsilon_t[t]
    
    return r_t

If you were to re-run the simulation code you would see that this initial run-up period has disappeared.
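For instance, a minimal sketch re-using the parameters defined above:

# Re-simulate with the improved function; r0 now defaults to E(r) = 100
r_sim_ar2_v3 = simulate_ar2_v2(phi1, phi2, T, alpha = alpha)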


Using ArmaProcess from the statsmodels library

In Python, the ArmaProcess class from the statsmodels.tsa.arima_process module provides a way to simulate AR processes, and even more complex time-series models, directly (without manually writing loops), making it a more straightforward approach.

Let's simulate an AR(1) using the ArmaProcess class from the statsmodels library.

from statsmodels.tsa.arima_process import ArmaProcess

# AR(1) Parameters
phi = 0.8  # AR(1) coefficient
ar = np.array([1, -phi])  # AR coefficients: 1 - phi*L
ma = np.array([1])        # MA coefficients (no MA terms)

# Simulate AR(1) Process
np.random.seed(42)  # For reproducibility
T = 1000  # Number of time periods
ar1_process = ArmaProcess(ar, ma)  # Define AR(1) process
r_sim_ar1 = ar1_process.generate_sample(nsample=T)

Let's look at the time-series process.

data = {
  "t": range(1,T+1),
  "r": r_sim_ar1
}

#load data into a DataFrame object:
r_df = pd.DataFrame(data)

(
  ggplot(r_df,aes(x = 't', y = 'r')) + 
    geom_line() +
    labs(title = "Simulated AR(1) return series",
         x = "Time")
)

We can see that ArmaProcess defines the AR coefficients through ar = np.array([1, -phi]) and could also allow for a moving average component through ma = np.array([1]). To understand how this works, let's state the general process being simulated:

$$r_t = \phi_1 r_{t-1} + \phi_2 r_{t-2} + \epsilon_t$$

This can be re-formulated to

$$\phi_0 r_t - \phi_1 r_{t-1} - \phi_2 r_{t-2} = \delta_0 \epsilon_{t} + \delta_1 \epsilon_{t-1}$$

with $\phi_0 = \delta_0 = 1$, $\phi_1 = 0.8$ and all other parameters equal to 0. The line ar = np.array([1, -phi]) enters the non-zero $\phi$ coefficients (note the flipped sign on $\phi_1$) and ma = np.array([1]) the $\delta$ coefficients. If $\delta_1$, or any $\delta$ on higher error-term lags, is non-zero, we call the process an Auto-Regressive Moving Average (ARMA) process.
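As an illustration, here is a minimal sketch of simulating an ARMA(1,1) process with $\phi_1 = 0.8$ and a purely illustrative $\delta_1 = 0.4$:

# ARMA(1,1): phi_1 = 0.8 and (illustrative) delta_1 = 0.4
ar_arma = np.array([1, -0.8])   # phi coefficients, with the flipped sign on phi_1
ma_arma = np.array([1, 0.4])    # delta coefficients: delta_0 = 1, delta_1 = 0.4
arma11_process = ArmaProcess(ar_arma, ma_arma)
r_sim_arma11 = arma11_process.generate_sample(nsample=T)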

This class is meant to generate stationary processes, i.e. processes for which the unconditional mean $E(r)$ exists. While this is not the place to discuss the details of stationarity, a necessary condition in the case of a process with an AR(2) component is $\phi_1 + \phi_2 < 1$.
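ArmaProcess can also check this condition for us via its isstationary attribute, which tests whether the roots of the AR polynomial lie outside the unit circle:

# Stationarity check via the isstationary attribute
print(ArmaProcess(np.array([1, -0.8]), np.array([1])).isstationary)  # True: phi_1 = 0.8
print(ArmaProcess(np.array([1, -1.0]), np.array([1])).isstationary)  # False: unit root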

Unlike the "manual" implementation, ArmaProcess automatically takes care of the starting value problem as it initializes the series around the processes' $E(r)$, so we don’t need to specify $r_0$.

Simulating integrated processes

Auto-Regressive Integrated Moving Average (ARIMA) is a general class of statistical models used for time-series analysis and forecasting by identifying patterns and relationships within historical data. More specifically, ARIMA stands for:

  • Auto-Regressive (AR): The model uses the lagged values of the time series.
  • Moving Average (MA): It uses lagged error terms to generate a new observation.

So far this is the same as the ARMA model above. The new element is

  • Integrated (I): Differencing is applied to make the time series stationary.

If we produce a series of integration order $I=1$, then that series would have to be differenced once to obtain a stationary series.

In fact, to simulate such processes we only need a little extra trick: we take the stationary series we produced and integrate it up. What that means is the following:

| ARMA - $r_t$ | ARIMA - $y_t$ |
|---|---|
| $r_1$ | $y_1 = r_1$ |
| $r_2$ | $y_2 = r_1 + r_2 = y_1 + r_2$ |
| $r_3$ | $y_3 = r_1 + r_2 + r_3 = y_2 + r_3$ |
| ... | ... |
| $r_{\tau}$ | $y_{\tau} = r_1 + r_2 + ... + r_{\tau} = y_{\tau-1} + r_{\tau}$ |

So, for every $y_{\tau}$ we sum up all $r_t$ for $t = 1,...,\tau$. The Python function that does this is np.cumsum.

y_sim_arima = np.cumsum(r_sim_ar1) 

Let's visualise the simulated process.

data = {
  "t": range(1,T+1),
  "y": y_sim_arima
}

#load data into a DataFrame object:
y_df = pd.DataFrame(data)

(
  ggplot(y_df,aes(x = 't', y = 'y')) + 
    geom_line() +
    labs(title = "Simulated ARIMA process",
         x = "Time")
)

We call such processes difference stationary processes, as differencing them will turn them into a stationary process.
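We can verify this directly: differencing the integrated series with np.diff recovers the stationary series we started from.

# Differencing: y_t - y_{t-1} = r_t, so np.diff recovers the ARMA series
r_recovered = np.diff(y_sim_arima)
print(np.allclose(r_recovered, r_sim_ar1[1:]))  # True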

In your Econometrics classes you will have learned that it can be a challenge to distinguish between a difference stationary process and a trend stationary process, for instance

$$r_t = \alpha + \beta t + \phi_1 r_{t-1} + \epsilon_t$$

where $\alpha = 0.0$, $\beta = 0.1$ and $\phi_1 = 0.5$.


Challenge task

The task

Adjust the code above to allow your simulated process to have a time trend.

The solution

We adjust the simulate_ar1 function to also allow for the input of the $\beta$ parameter, which we default to 0.

def simulate_ar1(phi1, T, r0=0, alpha=0, beta = 0, seed=None):
    """
    Simulates an AR(1) process with a deterministic trend:
    r_t = alpha + beta * t + phi1 * r_{t-1} + epsilon_t.

    Parameters:
    phi1 (float): Coefficient for r_{t-1}.
    T (int): Number of time periods to simulate.
    r0 (float): Return at t=0 (starting value); if left at 0, the unconditional mean is used.
    alpha (float): Constant term.
    beta (float): Coefficient on the deterministic time trend.
    seed (int, optional): Random seed for reproducibility; if None, one is derived from the current time.

    Returns:
    np.ndarray: Simulated AR(1) time series.
    """
    if seed is None:
        seed = np.int64(time.time())
    np.random.seed(seed)
    epsilon_t = np.random.normal(0, 1, T)  # Random shocks
    r_t = np.zeros(T)  # Initialize time series
    
    # ensure that the process starts at the unconditional mean
    if (r0 == 0):
        r0 = alpha/(1 - phi1)
    
    # Step 1: Set initial values
    r_t[0] = r0
    
    # Step 2: Simulate the AR(1) process with trend using a loop
    for t in range(1, T):  # Start from t=1
        r_t[t] = alpha + beta * t + phi1 * r_t[t-1] + epsilon_t[t]
    
    return r_t

Now we can call the function to generate such a series and display it.

# Parameters for AR(1) with trend
alpha = 0
beta = 0.01

phi1 = 0.5  # Coefficient for r_{t-1}
T = 1000  # Number of time periods
r0 = 0      # will be set to the unconditional mean inside the function

# Simulate the AR(1) process with trend
r_sim_ar1_trend = simulate_ar1(phi1, T, r0, alpha = alpha, beta = beta)

data = {
  "t": range(1,T+1),
  "r": r_sim_ar1_trend
}

#load data into a DataFrame object:
r_df = pd.DataFrame(data)

(
  ggplot(r_df,aes(x = 't', y = 'r')) + 
    geom_line() +
    labs(title = "Simulated AR(1) series, with trend",
         x = "Time")
)


If you plot this process, it certainly looks non-stationary. However, if you subtract the trend component, it will be stationary. This is why such a process is called trend-stationary.
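A minimal sketch of that detrending step, fitting a linear trend by least squares with np.polyfit and subtracting it:

# Fit a linear trend r_t = b0 + b1 * t by OLS and subtract it
t = np.arange(1, T + 1)
slope, intercept = np.polyfit(t, r_sim_ar1_trend, 1)
r_detrended = r_sim_ar1_trend - (intercept + slope * t)  # stationary remainder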

There are further extensions you may want to implement. For instance, we assumed that the error terms ($\epsilon_t$) all follow a standard normal distribution. This is not necessary, and you could change the distribution or the variance, as sketched below.
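For example, inside the simulation function you could replace the line that generates the shocks (the 5 degrees of freedom here are a purely illustrative choice):

# Fat-tailed shocks from a t-distribution instead of N(0, 1)
epsilon_t = np.random.standard_t(df=5, size=T)

# Or normal shocks with a larger standard deviation, e.g. sigma = 2
epsilon_t = np.random.normal(0, 2, T)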

Summary

In this walkthrough you have learned:

  • how to write functions that can be used to simulate autoregressive processes with different parameter values
  • how to use the ArmaProcess class of the statsmodels library to simulate AR processes.

This walkthrough is part of the ECLR page.