12 Replication
12.1 Purpose
This chapter begins the empirical replication of the Post-Soviet gravity paper in Python. The goal is not to obtain a similar-looking table. The goal is to reproduce the data sample, variable definitions, model sequence, and coefficient comparison transparently.
The working notebook for this chapter is:
notebooks/post_soviet_replication_python.qmd
12.2 Dataset
The replication uses the cleaned Post-Soviet gravity dataset:
data/ExSoviet_balanced_clean.csv
The required variables are:
flow, gdp_o, gdp_d, distw, comlang_off, contig, wto_joint, EU_joint, EAEU_joint, year, iso_o, and iso_d.
The notebook is designed to stop if the dataset is missing or if any required variable is absent. This is intentional. A replication workflow should fail clearly rather than silently estimate the wrong model.
12.3 Replication workflow
The Python workflow has four steps:
- inspect the dataset;
- prepare gravity variables;
- estimate the model sequence;
- compare Python estimates with the manuscript.
The first diagnostic block checks rows, columns, missing values, zero flows, and the positive-flow sample.
df_raw = pd.read_csv("data/ExSoviet_balanced_clean.csv")
missing_required = [column for column in REQUIRED_COLUMNS if column not in df_raw.columns]
if missing_required:
raise ValueError("Missing required variables: " + ", ".join(missing_required))
zero_flow_share = (df_raw["flow"] == 0).mean()
positive_flow_rows = (
(df_raw["flow"] > 0)
& (df_raw["distw"] > 0)
& (df_raw["gdp_o"] > 0)
& (df_raw["gdp_d"] > 0)
).sum()12.4 Variable preparation
The log-linear models require positive values for trade, GDP, and distance. The notebook creates logs only where the underlying value is valid.
df["log_flow"] = np.where(df["flow"] > 0, np.log(df["flow"]), np.nan)
df["log_gdp_o"] = np.where(df["gdp_o"] > 0, np.log(df["gdp_o"]), np.nan)
df["log_gdp_d"] = np.where(df["gdp_d"] > 0, np.log(df["gdp_d"]), np.nan)
df["log_distw"] = np.where(df["distw"] > 0, np.log(df["distw"]), np.nan)
df["pair_id"] = df["iso_o"] + "_" + df["iso_d"]
df["exporter_year"] = df["iso_o"] + "_" + df["year"].astype(str)
df["importer_year"] = df["iso_d"] + "_" + df["year"].astype(str)
df["log_trade_intensity"] = np.where(
(df["flow"] > 0) & (df["gdp_o"] > 0) & (df["gdp_d"] > 0),
np.log(df["flow"]) - np.log(df["gdp_o"]) - np.log(df["gdp_d"]),
np.nan,
)Do not use log(flow + 1) as a default. It changes the dependent variable and makes the log-linear estimates difficult to compare with the manuscript.
12.5 Model sequence
The notebook estimates eight models.
| Model | Estimator | Sample | Main purpose |
|---|---|---|---|
| 1 | Classic OLS | Positive flows | Baseline log-linear gravity |
| 2 | FE OLS | Positive flows | Exporter and importer fixed effects |
| 3 | Double-demeaned OLS | Positive flows | Manual fixed-effects logic |
| 4 | BVU normalized gravity | Positive flows | Trade-intensity normalization |
| 5 | PPML with GDP controls | Valid level sample | Multiplicative gravity with zeros allowed |
| 6 | GPML | Positive flows | Robustness estimator |
| 7 | Zero-inclusive PPML | Valid level sample | Main zero-inclusive model |
| 8 | Structural PPML | Valid level sample | Exporter-year and importer-year fixed effects |
12.6 Baseline models
The classic OLS model is:
\[ \begin{aligned} \log(flow_{ijt}) &= \beta_0 + \beta_1 \log(gdp\_o_{it}) + \beta_2 \log(gdp\_d_{jt}) \\ &\quad + \beta_3 \log(distw_{ij}) + \gamma Z_{ijt} + \varepsilon_{ijt} \end{aligned} \]
where \(Z_{ijt}\) contains comlang_off, contig, wto_joint, EU_joint, and EAEU_joint.
ols_formula = (
"log_flow ~ log_gdp_o + log_gdp_d + log_distw + "
"comlang_off + contig + wto_joint + EU_joint + EAEU_joint"
)
ols = smf.ols(ols_formula, data=positive).fit(cov_type="HC1")The fixed-effects OLS model controls for exporter and importer differences:
fe_formula = (
"log_flow ~ log_distw + comlang_off + contig + "
"wto_joint + EU_joint + EAEU_joint + C(iso_o) + C(iso_d)"
)
fe_ols = smf.ols(fe_formula, data=positive).fit(cov_type="HC1")12.7 Double demeaning
Double demeaning manually removes origin and destination averages from each variable:
\[ \tilde{x}_{ij} = x_{ij} - \bar{x}_{i \cdot} - \bar{x}_{\cdot j} + \bar{x}. \]
def double_demean(data, variable):
overall = data[variable].mean()
origin_mean = data.groupby("iso_o")[variable].transform("mean")
destination_mean = data.groupby("iso_d")[variable].transform("mean")
return data[variable] - origin_mean - destination_mean + overallThe double-demeaned specification is useful for teaching, but final replication tables should also report the explicit fixed-effects estimator.
12.8 BVU normalized gravity
The BVU teaching specification uses:
\[ \log\left(\frac{flow_{ijt}}{gdp\_o_{it} gdp\_d_{jt}}\right). \]
bvu_formula = (
"log_trade_intensity ~ log_distw + comlang_off + contig + "
"wto_joint + EU_joint + EAEU_joint"
)
bvu = smf.ols(bvu_formula, data=positive).fit(cov_type="HC1")This is a normalized gravity implementation for teaching. It should be compared with the exact Bonus Vetus transformation used by the original workflow before publication.
12.9 PPML and GPML
PPML estimates gravity in levels and can include zero trade flows:
ppml_formula = (
"flow ~ log_gdp_o + log_gdp_d + log_distw + "
"comlang_off + contig + wto_joint + EU_joint + EAEU_joint"
)
ppml = smf.glm(
formula=ppml_formula,
data=ppml_data,
family=sm.families.Poisson(),
).fit(cov_type="HC1")GPML is estimated on positive trade flows only:
gpml_formula = (
"flow ~ log_distw + comlang_off + contig + "
"wto_joint + EU_joint + EAEU_joint + C(iso_o) + C(iso_d)"
)
gpml = smf.glm(
formula=gpml_formula,
data=positive,
family=sm.families.Gamma(link=sm.families.links.Log()),
).fit(cov_type="HC1")12.10 Structural PPML
The structural PPML model uses exporter-year and importer-year fixed effects:
structural_formula = (
"flow ~ log_distw + comlang_off + contig + "
"wto_joint + EU_joint + EAEU_joint + "
"C(exporter_year) + C(importer_year)"
)
structural_ppml = smf.glm(
formula=structural_formula,
data=structural_data,
family=sm.families.Poisson(),
).fit(cov_type="HC1")In this specification, log_gdp_o and log_gdp_d are absorbed by exporter-year and importer-year fixed effects. If feasible, the notebook also attempts clustered standard errors by pair_id.
12.11 Results interpretation
The main coefficients to compare are:
| Variable | Interpretation |
|---|---|
log_distw |
Distance elasticity of bilateral trade |
comlang_off |
Association between common official language and trade |
contig |
Association between common border and trade |
wto_joint |
Association between joint WTO membership and trade |
EU_joint |
Association between joint EU status and trade |
EAEU_joint |
Association between joint EAEU status and trade |
log_gdp_o |
Exporter GDP elasticity where not absorbed |
log_gdp_d |
Importer GDP elasticity where not absorbed |
For dummy variables in log or log-link models, use:
\[ 100 \times \left(\exp(\beta) - 1\right). \]
Do not describe these coefficients as causal effects unless the research design supports a causal interpretation.
12.12 Comparison with manuscript
The notebook creates a manuscript comparison table for:
log_distw, comlang_off, contig, wto_joint, EU_joint, EAEU_joint, log_gdp_o, and log_gdp_d.
The manuscript-reported coefficients are stored in:
materials/manuscript-coefficients.csv
The notebook reads this file, merges manuscript coefficients with Python estimates, and calculates python_beta - manuscript_beta.
12.13 Replication Quality Assessment
Replication matters because empirical trade research depends on reproducible data construction, transparent estimation, and honest comparison with the reference study. A student should not claim an extension until the baseline replication is understood.
The coefficient comparison is performed model by model. For each reported coefficient, the notebook records the manuscript value, the Python estimate, the difference, and a match status. The thresholds are:
| Status | Rule | Interpretation |
|---|---|---|
exact |
abs(difference) < 0.001 |
The Python result matches the reported coefficient after rounding. |
close |
abs(difference) < 0.01 |
The result is very near the manuscript value and likely reflects rounding or minor implementation details. |
review |
otherwise | The difference requires diagnosis before the result can be treated as replicated. |
Rows with absorbed or omitted coefficients are marked as not applicable in the notebook output. Review items should be traced to:
- sample mismatch;
- zero-flow treatment;
- fixed effects;
- robust versus clustered standard errors;
- variable definitions;
- missing values;
- scaling of trade or GDP.
Exact and close matches validate the main Python implementation. In the current replication, the positive-flow specifications match the manuscript after rounding, which confirms that the core data preparation, variable construction, and baseline estimators are working as intended.
The remaining review items are concentrated in the zero-inclusive models. This reflects a sample-scope difference: the teaching dataset currently contains no zero-flow observations, while the manuscript reports zero-inclusive PPML estimates from a broader sample that retains zero trade flows. This is not a Python failure. It is a reminder that replication depends on using the same data scope.
Students should never force replication by changing data silently. If estimates do not match, document the discrepancy, identify the likely source, and state whether the issue is data scope, variable construction, fixed effects, standard errors, or estimator behavior.
12.14 Replication caution
Exact replication is stricter than reproducing coefficient signs. The dataset, sample restrictions, transformations, fixed effects, and covariance estimator must match the manuscript. If the Python estimates differ, the correct response is to diagnose the discrepancy, not to force agreement.