12  Replication

12.1 Purpose

This chapter begins the empirical replication of the Post-Soviet gravity paper in Python. The goal is not to obtain a similar-looking table. The goal is to reproduce the data sample, variable definitions, model sequence, and coefficient comparison transparently.

The working notebook for this chapter is:

notebooks/post_soviet_replication_python.qmd

12.2 Dataset

The replication uses the cleaned Post-Soviet gravity dataset:

data/ExSoviet_balanced_clean.csv

The required variables are:

flow, gdp_o, gdp_d, distw, comlang_off, contig, wto_joint, EU_joint, EAEU_joint, year, iso_o, and iso_d.

Warning

The notebook is designed to stop if the dataset is missing or if any required variable is absent. This is intentional. A replication workflow should fail clearly rather than silently estimate the wrong model.

12.3 Replication workflow

The Python workflow has four steps:

  1. inspect the dataset;
  2. prepare gravity variables;
  3. estimate the model sequence;
  4. compare Python estimates with the manuscript.

The first diagnostic block checks rows, columns, missing values, zero flows, and the positive-flow sample.

df_raw = pd.read_csv("data/ExSoviet_balanced_clean.csv")

missing_required = [column for column in REQUIRED_COLUMNS if column not in df_raw.columns]
if missing_required:
    raise ValueError("Missing required variables: " + ", ".join(missing_required))

zero_flow_share = (df_raw["flow"] == 0).mean()
positive_flow_rows = (
    (df_raw["flow"] > 0)
    & (df_raw["distw"] > 0)
    & (df_raw["gdp_o"] > 0)
    & (df_raw["gdp_d"] > 0)
).sum()

12.4 Variable preparation

The log-linear models require positive values for trade, GDP, and distance. The notebook creates logs only where the underlying value is valid.

df["log_flow"] = np.where(df["flow"] > 0, np.log(df["flow"]), np.nan)
df["log_gdp_o"] = np.where(df["gdp_o"] > 0, np.log(df["gdp_o"]), np.nan)
df["log_gdp_d"] = np.where(df["gdp_d"] > 0, np.log(df["gdp_d"]), np.nan)
df["log_distw"] = np.where(df["distw"] > 0, np.log(df["distw"]), np.nan)

df["pair_id"] = df["iso_o"] + "_" + df["iso_d"]
df["exporter_year"] = df["iso_o"] + "_" + df["year"].astype(str)
df["importer_year"] = df["iso_d"] + "_" + df["year"].astype(str)

df["log_trade_intensity"] = np.where(
    (df["flow"] > 0) & (df["gdp_o"] > 0) & (df["gdp_d"] > 0),
    np.log(df["flow"]) - np.log(df["gdp_o"]) - np.log(df["gdp_d"]),
    np.nan,
)

Do not use log(flow + 1) as a default. It changes the dependent variable and makes the log-linear estimates difficult to compare with the manuscript.

12.5 Model sequence

The notebook estimates eight models.

Model Estimator Sample Main purpose
1 Classic OLS Positive flows Baseline log-linear gravity
2 FE OLS Positive flows Exporter and importer fixed effects
3 Double-demeaned OLS Positive flows Manual fixed-effects logic
4 BVU normalized gravity Positive flows Trade-intensity normalization
5 PPML with GDP controls Valid level sample Multiplicative gravity with zeros allowed
6 GPML Positive flows Robustness estimator
7 Zero-inclusive PPML Valid level sample Main zero-inclusive model
8 Structural PPML Valid level sample Exporter-year and importer-year fixed effects

12.6 Baseline models

The classic OLS model is:

\[ \begin{aligned} \log(flow_{ijt}) &= \beta_0 + \beta_1 \log(gdp\_o_{it}) + \beta_2 \log(gdp\_d_{jt}) \\ &\quad + \beta_3 \log(distw_{ij}) + \gamma Z_{ijt} + \varepsilon_{ijt} \end{aligned} \]

where \(Z_{ijt}\) contains comlang_off, contig, wto_joint, EU_joint, and EAEU_joint.

ols_formula = (
    "log_flow ~ log_gdp_o + log_gdp_d + log_distw + "
    "comlang_off + contig + wto_joint + EU_joint + EAEU_joint"
)

ols = smf.ols(ols_formula, data=positive).fit(cov_type="HC1")

The fixed-effects OLS model controls for exporter and importer differences:

fe_formula = (
    "log_flow ~ log_distw + comlang_off + contig + "
    "wto_joint + EU_joint + EAEU_joint + C(iso_o) + C(iso_d)"
)

fe_ols = smf.ols(fe_formula, data=positive).fit(cov_type="HC1")

12.7 Double demeaning

Double demeaning manually removes origin and destination averages from each variable:

\[ \tilde{x}_{ij} = x_{ij} - \bar{x}_{i \cdot} - \bar{x}_{\cdot j} + \bar{x}. \]

def double_demean(data, variable):
    overall = data[variable].mean()
    origin_mean = data.groupby("iso_o")[variable].transform("mean")
    destination_mean = data.groupby("iso_d")[variable].transform("mean")
    return data[variable] - origin_mean - destination_mean + overall

The double-demeaned specification is useful for teaching, but final replication tables should also report the explicit fixed-effects estimator.

12.8 BVU normalized gravity

The BVU teaching specification uses:

\[ \log\left(\frac{flow_{ijt}}{gdp\_o_{it} gdp\_d_{jt}}\right). \]

bvu_formula = (
    "log_trade_intensity ~ log_distw + comlang_off + contig + "
    "wto_joint + EU_joint + EAEU_joint"
)

bvu = smf.ols(bvu_formula, data=positive).fit(cov_type="HC1")

This is a normalized gravity implementation for teaching. It should be compared with the exact Bonus Vetus transformation used by the original workflow before publication.

12.9 PPML and GPML

PPML estimates gravity in levels and can include zero trade flows:

ppml_formula = (
    "flow ~ log_gdp_o + log_gdp_d + log_distw + "
    "comlang_off + contig + wto_joint + EU_joint + EAEU_joint"
)

ppml = smf.glm(
    formula=ppml_formula,
    data=ppml_data,
    family=sm.families.Poisson(),
).fit(cov_type="HC1")

GPML is estimated on positive trade flows only:

gpml_formula = (
    "flow ~ log_distw + comlang_off + contig + "
    "wto_joint + EU_joint + EAEU_joint + C(iso_o) + C(iso_d)"
)

gpml = smf.glm(
    formula=gpml_formula,
    data=positive,
    family=sm.families.Gamma(link=sm.families.links.Log()),
).fit(cov_type="HC1")

12.10 Structural PPML

The structural PPML model uses exporter-year and importer-year fixed effects:

structural_formula = (
    "flow ~ log_distw + comlang_off + contig + "
    "wto_joint + EU_joint + EAEU_joint + "
    "C(exporter_year) + C(importer_year)"
)

structural_ppml = smf.glm(
    formula=structural_formula,
    data=structural_data,
    family=sm.families.Poisson(),
).fit(cov_type="HC1")

In this specification, log_gdp_o and log_gdp_d are absorbed by exporter-year and importer-year fixed effects. If feasible, the notebook also attempts clustered standard errors by pair_id.

12.11 Results interpretation

The main coefficients to compare are:

Variable Interpretation
log_distw Distance elasticity of bilateral trade
comlang_off Association between common official language and trade
contig Association between common border and trade
wto_joint Association between joint WTO membership and trade
EU_joint Association between joint EU status and trade
EAEU_joint Association between joint EAEU status and trade
log_gdp_o Exporter GDP elasticity where not absorbed
log_gdp_d Importer GDP elasticity where not absorbed

For dummy variables in log or log-link models, use:

\[ 100 \times \left(\exp(\beta) - 1\right). \]

Do not describe these coefficients as causal effects unless the research design supports a causal interpretation.

12.12 Comparison with manuscript

The notebook creates a manuscript comparison table for:

log_distw, comlang_off, contig, wto_joint, EU_joint, EAEU_joint, log_gdp_o, and log_gdp_d.

The manuscript-reported coefficients are stored in:

materials/manuscript-coefficients.csv

The notebook reads this file, merges manuscript coefficients with Python estimates, and calculates python_beta - manuscript_beta.

12.13 Replication Quality Assessment

Replication matters because empirical trade research depends on reproducible data construction, transparent estimation, and honest comparison with the reference study. A student should not claim an extension until the baseline replication is understood.

The coefficient comparison is performed model by model. For each reported coefficient, the notebook records the manuscript value, the Python estimate, the difference, and a match status. The thresholds are:

Status Rule Interpretation
exact abs(difference) < 0.001 The Python result matches the reported coefficient after rounding.
close abs(difference) < 0.01 The result is very near the manuscript value and likely reflects rounding or minor implementation details.
review otherwise The difference requires diagnosis before the result can be treated as replicated.

Rows with absorbed or omitted coefficients are marked as not applicable in the notebook output. Review items should be traced to:

  • sample mismatch;
  • zero-flow treatment;
  • fixed effects;
  • robust versus clustered standard errors;
  • variable definitions;
  • missing values;
  • scaling of trade or GDP.

Exact and close matches validate the main Python implementation. In the current replication, the positive-flow specifications match the manuscript after rounding, which confirms that the core data preparation, variable construction, and baseline estimators are working as intended.

The remaining review items are concentrated in the zero-inclusive models. This reflects a sample-scope difference: the teaching dataset currently contains no zero-flow observations, while the manuscript reports zero-inclusive PPML estimates from a broader sample that retains zero trade flows. This is not a Python failure. It is a reminder that replication depends on using the same data scope.

Students should never force replication by changing data silently. If estimates do not match, document the discrepancy, identify the likely source, and state whether the issue is data scope, variable construction, fixed effects, standard errors, or estimator behavior.

12.14 Replication caution

Exact replication is stricter than reproducing coefficient signs. The dataset, sample restrictions, transformations, fixed effects, and covariance estimator must match the manuscript. If the Python estimates differ, the correct response is to diagnose the discrepancy, not to force agreement.