Appendix B — R to Python
B.1 Introduction
NREC6006 uses Python as the main research language. Much of the gravity-model literature, however, uses R or Stata, and many online gravity examples rely on R packages such as fixest, gravity, dplyr, and MASS.
This appendix translates common R gravity workflows into Python for the Post-Soviet project. The examples use the course variable names:
flow, gdp_o, gdp_d, distw, comlang_off, contig, wto_joint, EU_joint, EAEU_joint, year, iso_o, and iso_d.
The goal is not to make Python imitate R line by line. The goal is to reproduce the same empirical workflow with transparent Python code.
B.2 Package equivalence table
| Task | R package/function | Python package/function | Notes |
|---|---|---|---|
| Read CSV file | readr::read_csv, read.csv |
pandas.read_csv |
Use the same file and confirm column types. |
| Create variables | dplyr::mutate |
pandas.assign, direct column assignment |
Python usually uses df["new"] = .... |
| Filter rows | dplyr::filter |
pandas.query, boolean masks |
Boolean masks are explicit and easy to audit. |
| Baseline OLS | lm |
statsmodels.formula.api.ols |
Use robust standard errors when reporting. |
| Fixed-effects OLS | fixest::feols |
statsmodels with C() or linearmodels |
statsmodels dummy expansion is transparent but can be slow. |
| Poisson GLM | glm(..., family = poisson) |
statsmodels GLM Poisson |
Used for PPML-style estimation. |
| Double demeaning | gravity::ddm |
custom double_demean() function |
Python implementation should be checked against R output. |
| Bonus Vetus Unweighted | gravity::bvu |
custom BVU transformation | Python requires explicit variable construction. |
| Bonus Vetus Weighted | gravity::bvw |
custom BVW transformation | Weighting choices must match the replication design. |
| PPML | gravity::ppml, glm, fixest::fepois |
statsmodels GLM Poisson |
PPML does not require count data. |
| GPML | gravity::gpml, glm(..., family = Gamma) |
statsmodels GLM Gamma |
Requires positive trade flows. |
| NBPML | MASS::glm.nb |
statsmodels Negative Binomial |
Useful as a robustness estimator. |
| Network analysis | igraph |
networkx |
Descriptive, not causal. |
B.3 Data loading side-by-side
The first replication check is that R and Python read the same dataset with the same rows and columns.
B.3.1 R
library(readr)
df <- read_csv("data/gravity_clean.csv")
dim(df)
names(df)
head(df)Base R is also acceptable:
df <- read.csv("data/gravity_clean.csv")B.3.2 Python
import pandas as pd
df = pd.read_csv("data/gravity_clean.csv")
df.shape
df.columns
df.head()Before comparing any results, confirm that flow, gdp_o, gdp_d, distw, comlang_off, contig, wto_joint, EU_joint, EAEU_joint, year, iso_o, and iso_d are present and have the expected types.
B.4 Variable construction side-by-side
The basic gravity workflow creates logs, a pair identifier, and a positive-flow sample for log-linear OLS.
B.4.1 R
library(dplyr)
df <- df %>%
mutate(
log_flow = ifelse(flow > 0, log(flow), NA_real_),
log_distw = log(distw),
log_gdp_o = log(gdp_o),
log_gdp_d = log(gdp_d),
pair_id = paste(iso_o, iso_d, sep = "_")
)
ols_df <- df %>%
filter(flow > 0)B.4.2 Python
import numpy as np
df["log_flow"] = np.where(df["flow"] > 0, np.log(df["flow"]), np.nan)
df["log_distw"] = np.log(df["distw"])
df["log_gdp_o"] = np.log(df["gdp_o"])
df["log_gdp_d"] = np.log(df["gdp_d"])
df["pair_id"] = df["iso_o"] + "_" + df["iso_d"]
ols_df = df.loc[df["flow"] > 0].copy()log_flow is undefined for zero trade flows. Do not replace log(0) with zero.
B.5 Baseline OLS side-by-side
The baseline log-linear model explains positive bilateral trade flows using economic size, weighted distance, bilateral controls, and institutional indicators.
B.5.1 R
ols_r <- lm(
log_flow ~ log_gdp_o + log_gdp_d + log_distw +
comlang_off + contig + wto_joint + EU_joint + EAEU_joint,
data = ols_df
)
summary(ols_r)B.5.2 Python
import statsmodels.formula.api as smf
ols_formula = (
"log_flow ~ log_gdp_o + log_gdp_d + log_distw + "
"comlang_off + contig + wto_joint + EU_joint + EAEU_joint"
)
ols_py = smf.ols(ols_formula, data=ols_df).fit(cov_type="HC1")
print(ols_py.summary())B.5.3 Interpretation
log_gdp_o, log_gdp_d, and log_distw are interpreted as elasticities in the log-linear model. Dummy variables such as wto_joint, EU_joint, and EAEU_joint should be converted using:
\[ 100 \times \left(\exp(\beta) - 1\right) \]
This gives the approximate percent difference associated with the dummy variable, conditional on the model.
B.6 Fixed effects side-by-side
Fixed effects control for unobserved exporter, importer, and year differences. In the Post-Soviet workflow, fixed effects are also a bridge to structural gravity.
B.6.1 R
library(fixest)
fe_r <- feols(
log_flow ~ log_distw + comlang_off + contig +
wto_joint + EU_joint + EAEU_joint |
iso_o + iso_d + year,
data = ols_df,
vcov = "hetero"
)
summary(fe_r)B.6.2 Python
fe_formula = (
"log_flow ~ log_distw + comlang_off + contig + "
"wto_joint + EU_joint + EAEU_joint + "
"C(iso_o) + C(iso_d) + C(year)"
)
fe_py = smf.ols(fe_formula, data=ols_df).fit(cov_type="HC1")Python dummy expansion with C() is transparent and useful for teaching. It may be slower for large datasets because it creates explicit dummy variables. For larger panels, students may need specialized fixed-effects estimators such as linearmodels or a custom high-dimensional workflow.
B.7 Double demeaning side-by-side
Double demeaning removes exporter and importer means from a variable. It helps students understand the mechanics behind two-way fixed effects.
For a variable \(x_{ij}\):
\[ \tilde{x}_{ij} = x_{ij} - \bar{x}_{i \cdot} - \bar{x}_{\cdot j} + \bar{x} \]
B.7.1 R
R package workflows may use gravity::ddm or related logic. The explicit version is:
double_demean <- function(x, exporter, importer) {
x - ave(x, exporter) - ave(x, importer) + mean(x, na.rm = TRUE)
}
ols_df$log_flow_dd <- double_demean(
ols_df$log_flow,
ols_df$iso_o,
ols_df$iso_d
)B.7.2 Python
def double_demean(data, variable, exporter="iso_o", importer="iso_d"):
overall = data[variable].mean()
exporter_mean = data.groupby(exporter)[variable].transform("mean")
importer_mean = data.groupby(importer)[variable].transform("mean")
return data[variable] - exporter_mean - importer_mean + overall
ols_df["log_flow_dd"] = double_demean(
ols_df,
variable="log_flow",
exporter="iso_o",
importer="iso_d",
)Use this to understand the transformation. For publication models, report the actual fixed-effects estimator and sample used.
B.8 BVU side-by-side
Bonus Vetus Unweighted (BVU) approximates multilateral-resistance adjustment by transforming bilateral variables with unweighted country means.
B.8.1 Conceptual equation
For a bilateral variable \(x_{ij}\):
\[ x^{BVU}_{ij} = x_{ij} - \bar{x}_{i \cdot} - \bar{x}_{\cdot j} + \bar{x} \]
B.8.2 R-style pseudocode
R gravity::bvu automates the transformation when the dataset and function arguments match the package requirements.
library(gravity)
# Pseudocode: check actual function arguments against package documentation.
# bvu_model <- bvu(
# dependent_variable = "flow",
# distance = "distw",
# additional_regressors = c("comlang_off", "contig", "wto_joint", "EU_joint", "EAEU_joint"),
# code_origin = "iso_o",
# code_destination = "iso_d",
# data = df
# )B.8.3 Python transformation
Python requires explicit construction.
def bvu_transform(data, variable, exporter="iso_o", importer="iso_d"):
overall = data[variable].mean()
exporter_mean = data.groupby(exporter)[variable].transform("mean")
importer_mean = data.groupby(importer)[variable].transform("mean")
return data[variable] - exporter_mean - importer_mean + overall
for variable in [
"log_distw",
"comlang_off",
"contig",
"wto_joint",
"EU_joint",
"EAEU_joint",
]:
ols_df[f"{variable}_bvu"] = bvu_transform(ols_df, variable)bvu_formula = (
"log_flow ~ log_gdp_o + log_gdp_d + "
"log_distw_bvu + comlang_off_bvu + contig_bvu + "
"wto_joint_bvu + EU_joint_bvu + EAEU_joint_bvu + C(year)"
)
bvu_py = smf.ols(bvu_formula, data=ols_df).fit(cov_type="HC1")B.9 BVW side-by-side
Bonus Vetus Weighted (BVW) uses weighted means rather than simple means. In the Post-Soviet project, a plausible teaching implementation can use GDP-based weights, but the final publication workflow must match the exact weighting rule used in the replication design.
B.9.1 Conceptual equation
\[ x^{BVW}_{ij} = x_{ij} - \sum_k s_k x_{ik} - \sum_m s_m x_{mj} + \sum_k \sum_m s_k s_m x_{km} \]
B.9.2 R-style pseudocode
# Pseudocode only. Confirm arguments and weights before publication use.
# bvw_model <- bvw(
# dependent_variable = "flow",
# distance = "distw",
# additional_regressors = c("comlang_off", "contig", "wto_joint", "EU_joint", "EAEU_joint"),
# code_origin = "iso_o",
# code_destination = "iso_d",
# weights = "gdp_d",
# data = df
# )B.9.3 Python teaching implementation
def weighted_mean_by_group(data, variable, group, weight):
weighted_sum = (data[variable] * data[weight]).groupby(data[group]).transform("sum")
weight_sum = data[weight].groupby(data[group]).transform("sum")
return weighted_sum / weight_sum
def bvw_transform(data, variable, exporter="iso_o", importer="iso_d", weight="gdp_d"):
exporter_wmean = weighted_mean_by_group(data, variable, exporter, weight)
importer_wmean = weighted_mean_by_group(data, variable, importer, weight)
overall_wmean = (data[variable] * data[weight]).sum() / data[weight].sum()
return data[variable] - exporter_wmean - importer_wmean + overall_wmean
for variable in [
"log_distw",
"comlang_off",
"contig",
"wto_joint",
"EU_joint",
"EAEU_joint",
]:
ols_df[f"{variable}_bvw"] = bvw_transform(ols_df, variable, weight="gdp_d")bvw_formula = (
"log_flow ~ log_gdp_o + log_gdp_d + "
"log_distw_bvw + comlang_off_bvw + contig_bvw + "
"wto_joint_bvw + EU_joint_bvw + EAEU_joint_bvw + C(year)"
)
bvw_py = smf.ols(bvw_formula, data=ols_df).fit(cov_type="HC1")The Python BVW code above is a teaching implementation. Before publication, verify the weighting formula, grouping structure, and sample against the original R workflow and the Post-Soviet replication design.
B.10 PPML side-by-side
PPML estimates the multiplicative gravity model in levels.
B.10.1 R
ppml_r <- glm(
flow ~ log_gdp_o + log_gdp_d + log_distw +
comlang_off + contig + wto_joint + EU_joint + EAEU_joint +
factor(year),
data = df,
family = quasipoisson(link = "log")
)B.10.2 Python
import statsmodels.api as sm
ppml_formula = (
"flow ~ log_gdp_o + log_gdp_d + log_distw + "
"comlang_off + contig + wto_joint + EU_joint + EAEU_joint + C(year)"
)
ppml_py = smf.glm(
formula=ppml_formula,
data=df,
family=sm.families.Poisson(),
).fit(cov_type="HC1")PPML is estimated with a Poisson family, but the dependent variable does not need to be count data. In gravity research, PPML is used for trade values because it estimates a multiplicative conditional mean.
B.11 GPML side-by-side
GPML uses a Gamma family with a log link. It requires positive trade flows.
B.11.1 R
gpml_df <- subset(df, flow > 0)
gpml_r <- glm(
flow ~ log_gdp_o + log_gdp_d + log_distw +
comlang_off + contig + wto_joint + EU_joint + EAEU_joint +
factor(year),
data = gpml_df,
family = Gamma(link = "log")
)B.11.2 Python
gpml_df = df.loc[df["flow"] > 0].copy()
gpml_py = smf.glm(
formula=ppml_formula,
data=gpml_df,
family=sm.families.Gamma(link=sm.families.links.Log()),
).fit(cov_type="HC1")Report that GPML uses positive trade flows only. Do not compare GPML and PPML without noting sample differences.
B.12 NBPML side-by-side
NBPML uses a negative-binomial variance structure with a log-link mean.
B.12.1 R
library(MASS)
nbpml_r <- glm.nb(
flow ~ log_gdp_o + log_gdp_d + log_distw +
comlang_off + contig + wto_joint + EU_joint + EAEU_joint +
factor(year),
data = df
)B.12.2 Python
nbpml_py = smf.glm(
formula=ppml_formula,
data=df,
family=sm.families.NegativeBinomial(),
).fit(cov_type="HC1")Use NBPML as a robustness estimator unless the research design gives a specific reason to prefer it.
B.13 Tetrads side-by-side
Tetrads use ratio-of-ratios logic to remove exporter and importer terms by comparing four trade flows. They are useful for understanding how fixed effects can be differenced out of a gravity relationship.
B.13.1 Formula
For exporters \(i\), \(k\) and importers \(j\), \(l\):
\[ T_{ikjl} = \frac{flow_{ij} / flow_{il}}{flow_{kj} / flow_{kl}} \]
Taking logs:
\[ \log T_{ikjl} = \log(flow_{ij}) - \log(flow_{il}) - \log(flow_{kj}) + \log(flow_{kl}) \]
Exporter and importer terms cancel when the comparison is valid and the four flows are observed.
B.13.2 R-style pseudocode
# Pseudocode:
# 1. Select a year.
# 2. Choose exporter pair i, k and importer pair j, l.
# 3. Extract flow_ij, flow_il, flow_kj, flow_kl.
# 4. Compute log(flow_ij) - log(flow_il) - log(flow_kj) + log(flow_kl).B.13.3 Python function skeleton
def tetrad_log_flow(data, exporter_i, exporter_k, importer_j, importer_l, year_value):
sample = data.loc[data["year"] == year_value].copy()
def get_flow(exporter, importer):
value = sample.loc[
(sample["iso_o"] == exporter) & (sample["iso_d"] == importer),
"flow",
]
if value.empty:
return np.nan
return value.iloc[0]
flow_ij = get_flow(exporter_i, importer_j)
flow_il = get_flow(exporter_i, importer_l)
flow_kj = get_flow(exporter_k, importer_j)
flow_kl = get_flow(exporter_k, importer_l)
flows = np.array([flow_ij, flow_il, flow_kj, flow_kl], dtype=float)
if np.any(flows <= 0) or np.any(np.isnan(flows)):
return np.nan
return np.log(flow_ij) - np.log(flow_il) - np.log(flow_kj) + np.log(flow_kl)Tetrads are sensitive to missing and zero flows. Use them carefully and document the tetrad sample.
B.14 Common translation mistakes
- Logging zero trade flows.
- Forgetting robust standard errors.
- Treating PPML as count-only.
- Interpreting dummy coefficients as \(100 \times \beta\) when coefficients are not small.
- Expecting pair fixed effects to estimate distance.
- Comparing R and Python results without matching samples.
- Ignoring missing values.
- Using different year filters across languages.
- Letting software silently drop different rows.
- Comparing standard errors from different covariance estimators.
B.15 Replication checklist
Before comparing R and Python outputs, confirm:
- same dataset;
- same sample;
- same variable definitions;
- same fixed effects;
- same standard error type;
- same zero-flow treatment;
- same dropped observations;
- same sorting and grouping rules for transformations;
- same weights for BVW;
- same country identifiers:
iso_oandiso_d; - same institutional variables:
wto_joint,EU_joint, andEAEU_joint.
B.16 Final note
Python is the main teaching language in this course. R code appears here only so students can understand existing gravity examples and replicate them carefully. The final replication package should be written in Python, with all data construction and estimation decisions documented.