from joblib import Parallel, delayed
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
import graphviz as gr
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style="ticks", context="talk")
font = {"family": "IBM Plex Sans", "weight": "normal", "size": 10}
plt.rc("font", **font)
plt.rcParams["figure.dpi"] = 200
Can we beat multiple regression for direct effects with multi-step regression?
The standard model of mediation posits the following system of equations:
\[ Y = \alpha + \tau W + \gamma M + \varepsilon \]
\[ M = \zeta + \kappa W + \eta \]
Here, \(\tau\) is the ‘Natural Direct Effect’ (holding the mediator at its natural values, i.e. marginalizing over its observed distribution, which is what multiple regression does), and \(\gamma \cdot \kappa\) is the ‘Natural Indirect Effect’, i.e. the effect of W that operates through M. In the linear setting, the former can be estimated directly using multiple regression, and the latter can be estimated as the product of two regression coefficients.
Finally, the coefficient \(\beta\) from the short regression
\[ Y = \varpi + \beta W + \epsilon \]
is the Total Effect, where Total Effect = NDE + NIE. An implication is that the NDE can also be estimated as \(\beta - \gamma \cdot \kappa\). This ‘dual path’ estimator should be equivalent to \(\tau\) from the multiple regression; it might seem like more work, but it does appear in the literature.
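To see why the Total Effect decomposes this way, substitute the mediator equation into the outcome equation (using only the two structural equations above and the assumption that \(W\) is exogenous):
\[ Y = \alpha + \tau W + \gamma (\zeta + \kappa W + \eta) + \varepsilon = (\alpha + \gamma \zeta) + (\tau + \gamma \kappa) W + (\gamma \eta + \varepsilon), \]
so \(\beta = \tau + \gamma \kappa\) (Total = NDE + NIE), and therefore \(\tau = \beta - \gamma \kappa\), which is the quantity the ‘dual path’ estimator targets.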
import inspect

def simulate(**kwargs):
    # Build each variable from its generating lambda; the lambda's argument
    # names declare its parents, which become edges in the DAG.
    values = {}
    g = gr.Digraph()
    caller_frame = inspect.currentframe().f_back
    for k, v in kwargs.items():
        parents = v.__code__.co_varnames
        inputs = {arg: values[arg] for arg in v.__code__.co_varnames if arg in values}
        # Check if any argument is not in the values dictionary
        missing_args = set(parents) - set(inputs.keys())
        for arg in missing_args:
            # Check if the argument exists in the caller's frame
            if arg in caller_frame.f_locals:
                inputs[arg] = caller_frame.f_locals[arg]
        values[k] = v(**inputs)
        for p in parents:
            if p in values and isinstance(values[p], np.ndarray):
                g.edge(p, k)
    return pd.DataFrame(values), g
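As a small usage sketch (the variable names X and Z here are hypothetical, chosen only for illustration): each keyword argument to simulate is a lambda whose argument names declare that variable's parents, and the function returns both the simulated DataFrame and the implied DAG.

# illustrative two-variable chain X -> Z; names are hypothetical
df_demo, g_demo = simulate(
    X=lambda: np.random.randn(100),            # exogenous root
    Z=lambda X: 2 * X + np.random.randn(100),  # child of X
)
df_demo.head()  # simulated data; g_demo renders the X -> Z graph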
def onesim(N=50,
           a=1, b=0, c=0.2, d=1,
           k=0, dat=False):
    df, g = simulate(
        W=lambda: np.round(np.random.rand(N), k),   # treatment
        M=lambda d, W: d * W + np.random.randn(N),  # mediator
        # outcome
        Y=lambda W, M, a, b, c: a + b * W + c * M + np.random.randn(N),
    )
    if dat:
        return df, g
    # multiple regression - direct effect
    b_hat = smf.ols("Y ~ W + M", data=df).fit().params.iloc[1]
    # total effect
    e_hat = smf.ols("Y ~ W", data=df).fit().params.iloc[1]
    # M -> Y path
    c_hat = smf.ols("Y ~ M", data=df).fit().params.iloc[1]
    # W -> M path
    d_hat = smf.ols("M ~ W", data=df).fit().params.iloc[1]
    b_tilde = e_hat - c_hat * d_hat
    return b_hat, b_tilde

# DAG of DGP
onesim(dat=True)[1]
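A single call with dat=False returns the two point estimates side by side; as a quick illustration (the parameter values mirror the first simulation below, and the numbers vary from run to run since no seed is set):

# one draw: multiple-regression vs. dual-path estimate of the direct effect
b_hat_once, b_tilde_once = onesim(N=50, b=0, c=3, d=4, k=0)
print(f"multiple regression: {b_hat_once:.3f}, dual path: {b_tilde_once:.3f}")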
def simulator(**kwargs):
    # 1,000 Monte Carlo replications of onesim, run in parallel
    res = Parallel(n_jobs=-1)(delayed(onesim)(**kwargs) for _ in range(1_000))
    res = np.c_[res]  # column 0: multiple regression, column 1: dual path
    b, k = kwargs["b"], kwargs["k"]
    abs_bias_direct = np.mean(np.abs(res[:, 0] - b))
    rmse_direct = np.sqrt(np.mean((res[:, 0] - b) ** 2))
    abs_bias_debias = np.mean(np.abs(res[:, 1] - b))
    rmse_debias = np.sqrt(np.mean((res[:, 1] - b) ** 2))
    return res, {
        "abs_bias_direct": abs_bias_direct,
        "rmse_direct": rmse_direct,
        "abs_bias_debias": abs_bias_debias,
        "rmse_debias": rmse_debias,
        "b": b,
        "k": k,
    }
def sumfig(res, metrics, aux):
    f, ax = plt.subplots(1, 1, figsize=(8, 4))
    ax.hist(res[:, 0], bins=30, density=True, alpha=0.6,
            label="Multiple Regression")
    ax.hist(res[:, 1], bins=30, density=True, alpha=0.6,
            label="Debiasing")
    ax.axvline(metrics["b"], 0, 5, color="black", linestyle="--")
    ax.legend(loc="upper right")
    ax.text(0.05, 0.8, f"Direct: Bias {metrics['abs_bias_direct']:.4f}", transform=ax.transAxes)
    ax.text(0.05, 0.7, f"Direct: RMSE {metrics['rmse_direct']:.4f}", transform=ax.transAxes)
    ax.text(0.05, 0.6, f"Debias: Bias {metrics['abs_bias_debias']:.4f}", transform=ax.transAxes)
    ax.text(0.05, 0.5, f"Debias: RMSE {metrics['rmse_debias']:.4f}", transform=ax.transAxes)
    ax.set_title(f"Direct Effect Estimate Distribution \n True Effect = {metrics['b']} \n {aux}")
    f.tight_layout()
Christopher Adams’ Original Simulations
(“Learning Microeconometrics with R”, sec 2.5.3 “Dual Path Estimator Versus Long Regression”)
Note that he uses W for the mediator, X for the treatment, and Y for the outcome, while I use M for the mediator, W for the treatment, and Y for the outcome.
sumfig(*simulator(N=50, b=0, c=3, d=4, k=0), "Very Strong Indirect Effect")
“Dual Path” does indeed dominate Long Regression here.
Let us now weaken the mediator’s effect on the outcome, and the treatment’s effect on the mediator, and see if the results change.
Small Samples
Change CPA’s parameters one at a time
sumfig(*simulator(b=0, c=0.1, d=4, k=0), "Weaker M -> Y Path")
Dual Path still does better
sumfig(*simulator(b=1, c=3, d=4, k=0), "Non-zero W -> Y Path")
Dual Path now has considerable bias. This is consequential: Dual Path seems to dominate only when the direct effect is zero.
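A back-of-the-envelope probability limit suggests why (using the code's parameter names, where b, c, d play the roles of \(\tau\), \(\gamma\), \(\kappa\), and assuming W is exogenous with independent errors as in the DGP above): the Y ~ M regression omits W, so when \(b \neq 0\) its slope picks up part of the direct effect through M's correlation with W,
\[ \hat{c} \overset{p}{\to} c + \frac{b\, d\, \mathrm{Var}(W)}{\mathrm{Var}(M)}, \qquad \hat{e} - \hat{c}\, \hat{d} \overset{p}{\to} (b + c d) - \left(c + \frac{b\, d\, \mathrm{Var}(W)}{\mathrm{Var}(M)}\right) d = b \left(1 - \frac{d^{2}\, \mathrm{Var}(W)}{\mathrm{Var}(M)}\right), \]
so the dual-path estimate is attenuated toward zero whenever both b and d are non-zero, while the multiple regression remains consistent for b.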
sumfig(*simulator(b=0, c=3, d=0.5, k=0), "Weaker W -> M Path")
Basically the same
sumfig(*simulator(b=1, c=3, d=0.5, k=0), "Weaker W -> M and Nonzero W -> Y")
sumfig(*simulator(b=1, c=0.1, d=0.1, k=0), "Weaker W -> M, M -> Y, Nonzero W -> Y")
So, did we beat OLS?