from joblib import Parallel, delayed
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
import graphviz as gr
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style="ticks", context="talk")
font = {"family": "IBM Plex Sans", "weight": "normal", "size": 10}
plt.rc("font", **font)
plt.rcParams["figure.dpi"] = 200
Can we beat multiple regression for direct effects with multi-step regression?
The standard model of mediation posits the following system of equations:
\[ Y = \alpha + \tau W + \gamma M + \varepsilon \]
\[ M = \zeta + \kappa W + \eta \]
Here, \(\tau\) is the ‘Natural Direct Effect’ (holding the mediator at its natural values, i.e. marginalizing over its observed distribution, which is what multiple regression does), and \(\gamma \cdot \kappa\) is the ‘Natural Indirect Effect’, i.e. the effect of W that operates through M. In the linear setting, the former can be estimated directly using multiple regression, and the latter can be estimated as the product of two regression coefficients.
Finally, the coefficient \(\beta\) from the short regression
\[ Y = \varpi + \beta W + \epsilon \]
is the Total Effect, where Total Effect = NDE + NIE. An implication is that the NDE can also be estimated as \(\beta - \gamma \cdot \kappa\). This ‘dual path’ estimator should be equivalent to \(\tau\) from the multiple regression; it might seem like more work, but it does appear in the literature.
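To see why the Total Effect decomposes this way, substitute the mediator equation into the outcome equation (using only the two structural equations above and the assumption that \(W\) is exogenous):
\[ Y = \alpha + \tau W + \gamma (\zeta + \kappa W + \eta) + \varepsilon = (\alpha + \gamma \zeta) + (\tau + \gamma \kappa) W + (\gamma \eta + \varepsilon), \]
so \(\beta = \tau + \gamma \kappa\) (Total = NDE + NIE), and therefore \(\tau = \beta - \gamma \kappa\), which is the quantity the ‘dual path’ estimator targets.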
import inspect

def simulate(**kwargs):
    # Build each variable from its generating lambda; the lambda's argument
    # names declare its parents, which become edges in the DAG.
    values = {}
    g = gr.Digraph()
    caller_frame = inspect.currentframe().f_back
    for k, v in kwargs.items():
        parents = v.__code__.co_varnames
        inputs = {arg: values[arg] for arg in v.__code__.co_varnames if arg in values}
        # Check if any argument is not in the values dictionary
        missing_args = set(parents) - set(inputs.keys())
        for arg in missing_args:
            # Check if the argument exists in the caller's frame
            if arg in caller_frame.f_locals:
                inputs[arg] = caller_frame.f_locals[arg]
        values[k] = v(**inputs)
        for p in parents:
            if p in values and isinstance(values[p], np.ndarray):
                g.edge(p, k)
    return pd.DataFrame(values), g
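As a small usage sketch (the variable names X and Z here are hypothetical, chosen only for illustration): each keyword argument to simulate is a lambda whose argument names declare that variable's parents, and the function returns both the simulated DataFrame and the implied DAG.

# illustrative two-variable chain X -> Z; names are hypothetical
df_demo, g_demo = simulate(
    X=lambda: np.random.randn(100),            # exogenous root
    Z=lambda X: 2 * X + np.random.randn(100),  # child of X
)
df_demo.head()  # simulated data; g_demo renders the X -> Z graph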
def onesim(N=50,
           a=1, b=0, c=0.2, d=1,
           k=0, dat=False):
    df, g = simulate(
        W=lambda: np.round(np.random.rand(N), k),   # treatment
        M=lambda d, W: d * W + np.random.randn(N),  # mediator
        # outcome
        Y=lambda W, M, a, b, c: a + b * W + c * M + np.random.randn(N),
    )
    if dat:
        return df, g
    # multiple regression - direct effect
    b_hat = smf.ols("Y ~ W + M", data=df).fit().params.iloc[1]
    # total effect
    e_hat = smf.ols("Y ~ W", data=df).fit().params.iloc[1]
    # M -> Y path
    c_hat = smf.ols("Y ~ M", data=df).fit().params.iloc[1]
    # W -> M path
    d_hat = smf.ols("M ~ W", data=df).fit().params.iloc[1]
    b_tilde = e_hat - c_hat * d_hat
    return b_hat, b_tilde

# DAG of DGP
onesim(dat=True)[1]
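A single call with dat=False returns the two point estimates side by side; as a quick illustration (the parameter values mirror the first simulation below, and the numbers vary from run to run since no seed is set):

# one draw: multiple-regression vs. dual-path estimate of the direct effect
b_hat_once, b_tilde_once = onesim(N=50, b=0, c=3, d=4, k=0)
print(f"multiple regression: {b_hat_once:.3f}, dual path: {b_tilde_once:.3f}")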
def simulator(**kwargs):
    # 1,000 Monte Carlo replications of onesim, run in parallel
    res = Parallel(n_jobs=-1)(delayed(onesim)(**kwargs) for _ in range(1_000))
    res = np.c_[res]  # column 0: multiple regression, column 1: dual path
    b, k = kwargs["b"], kwargs["k"]
    abs_bias_direct = np.mean(np.abs(res[:, 0] - b))
    rmse_direct = np.sqrt(np.mean((res[:, 0] - b) ** 2))
    abs_bias_debias = np.mean(np.abs(res[:, 1] - b))
    rmse_debias = np.sqrt(np.mean((res[:, 1] - b) ** 2))
    return res, {
        "abs_bias_direct": abs_bias_direct,
        "rmse_direct": rmse_direct,
        "abs_bias_debias": abs_bias_debias,
        "rmse_debias": rmse_debias,
        "b": b,
        "k": k,
    }
def sumfig(res, metrics, aux):
    f, ax = plt.subplots(1, 1, figsize=(8, 4))
    ax.hist(res[:, 0], bins=30, density=True, alpha=0.6,
            label="Multiple Regression")
    ax.hist(res[:, 1], bins=30, density=True, alpha=0.6,
            label="Debiasing")
    ax.axvline(metrics["b"], 0, 5, color="black", linestyle="--")
    ax.legend(loc="upper right")
    ax.text(0.05, 0.8, f"Direct: Bias {metrics['abs_bias_direct']:.4f}", transform=ax.transAxes)
    ax.text(0.05, 0.7, f"Direct: RMSE {metrics['rmse_direct']:.4f}", transform=ax.transAxes)
    ax.text(0.05, 0.6, f"Debias: Bias {metrics['abs_bias_debias']:.4f}", transform=ax.transAxes)
    ax.text(0.05, 0.5, f"Debias: RMSE {metrics['rmse_debias']:.4f}", transform=ax.transAxes)
    ax.set_title(f"Direct Effect Estimate Distribution \n True Effect = {metrics['b']} \n {aux}")
    f.tight_layout()
Christopher Adams’ Original Simulations
(“Learning Microeconometrics with R”, sec 2.5.3 “Dual Path Estimator Versus Long Regression”)
Note that he uses W for the mediator, X for the treatment, and Y for the outcome, while I use M for the mediator, W for the treatment, and Y for the outcome.
sumfig(*simulator(N=50, b=0, c=3, d=4, k=0), "Very Strong Indirect Effect")
“Dual Path” does indeed dominate Long Regression here.
Let us now weaken the mediator’s effect on the outcome, and the treatment’s effect on the mediator, and see if the results change.
Small Samples
Change CPA’s parameters one at a time
sumfig(*simulator(b=0, c=0.1, d=4, k=0), "Weaker M -> Y Path")
Dual Path still does better
sumfig(*simulator(b=1, c=3, d=4, k=0), "Non-zero W -> Y Path")
Dual Path now has considerable bias. This is consequential: Dual Path seems to dominate only when the direct effect is zero.
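A back-of-the-envelope probability limit suggests why (using the code's parameter names, where b, c, d play the roles of \(\tau\), \(\gamma\), \(\kappa\), and assuming W is exogenous with independent errors as in the DGP above): the Y ~ M regression omits W, so when \(b \neq 0\) its slope picks up part of the direct effect through M's correlation with W,
\[ \hat{c} \overset{p}{\to} c + \frac{b\, d\, \mathrm{Var}(W)}{\mathrm{Var}(M)}, \qquad \hat{e} - \hat{c}\, \hat{d} \overset{p}{\to} (b + c d) - \left(c + \frac{b\, d\, \mathrm{Var}(W)}{\mathrm{Var}(M)}\right) d = b \left(1 - \frac{d^{2}\, \mathrm{Var}(W)}{\mathrm{Var}(M)}\right), \]
so the dual-path estimate is attenuated toward zero whenever both b and d are non-zero, while the multiple regression remains consistent for b.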
sumfig(*simulator(b=0, c=3, d=0.5, k=0), "Weaker W -> M Path")
Basically the same
sumfig(*simulator(b=1, c=3, d=0.5, k=0), "Weaker W -> M and Nonzero W -> Y")
sumfig(*simulator(b=1, c=0.1, d=0.1, k=0), "Weaker W -> M, M -> Y, Nonzero W -> Y")
So, did we beat OLS?