Appendix B — R to Python

B.1 Introduction

NREC6006 uses Python as the main research language. Much of the gravity-model literature, however, uses R or Stata, and many online gravity examples rely on R packages such as fixest, gravity, dplyr, and MASS.

This appendix translates common R gravity workflows into Python for the Post-Soviet project. The examples use the course variable names:

flow, gdp_o, gdp_d, distw, comlang_off, contig, wto_joint, EU_joint, EAEU_joint, year, iso_o, and iso_d.

The goal is not to make Python imitate R line by line. The goal is to reproduce the same empirical workflow with transparent Python code.

B.2 Package equivalence table

Task R package/function Python package/function Notes
Read CSV file readr::read_csv, read.csv pandas.read_csv Use the same file and confirm column types.
Create variables dplyr::mutate pandas.assign, direct column assignment Python usually uses df["new"] = ....
Filter rows dplyr::filter pandas.query, boolean masks Boolean masks are explicit and easy to audit.
Baseline OLS lm statsmodels.formula.api.ols Use robust standard errors when reporting.
Fixed-effects OLS fixest::feols statsmodels with C() or linearmodels statsmodels dummy expansion is transparent but can be slow.
Poisson GLM glm(..., family = poisson) statsmodels GLM Poisson Used for PPML-style estimation.
Double demeaning gravity::ddm custom double_demean() function Python implementation should be checked against R output.
Bonus Vetus Unweighted gravity::bvu custom BVU transformation Python requires explicit variable construction.
Bonus Vetus Weighted gravity::bvw custom BVW transformation Weighting choices must match the replication design.
PPML gravity::ppml, glm, fixest::fepois statsmodels GLM Poisson PPML does not require count data.
GPML gravity::gpml, glm(..., family = Gamma) statsmodels GLM Gamma Requires positive trade flows.
NBPML MASS::glm.nb statsmodels Negative Binomial Useful as a robustness estimator.
Network analysis igraph networkx Descriptive, not causal.

B.3 Data loading side-by-side

The first replication check is that R and Python read the same dataset with the same rows and columns.

B.3.1 R

library(readr)

df <- read_csv("data/gravity_clean.csv")

dim(df)
names(df)
head(df)

Base R is also acceptable:

df <- read.csv("data/gravity_clean.csv")

B.3.2 Python

import pandas as pd

df = pd.read_csv("data/gravity_clean.csv")

df.shape
df.columns
df.head()

Before comparing any results, confirm that flow, gdp_o, gdp_d, distw, comlang_off, contig, wto_joint, EU_joint, EAEU_joint, year, iso_o, and iso_d are present and have the expected types.

B.4 Variable construction side-by-side

The basic gravity workflow creates logs, a pair identifier, and a positive-flow sample for log-linear OLS.

B.4.1 R

library(dplyr)

df <- df %>%
  mutate(
    log_flow = ifelse(flow > 0, log(flow), NA_real_),
    log_distw = log(distw),
    log_gdp_o = log(gdp_o),
    log_gdp_d = log(gdp_d),
    pair_id = paste(iso_o, iso_d, sep = "_")
  )

ols_df <- df %>%
  filter(flow > 0)

B.4.2 Python

import numpy as np

df["log_flow"] = np.where(df["flow"] > 0, np.log(df["flow"]), np.nan)
df["log_distw"] = np.log(df["distw"])
df["log_gdp_o"] = np.log(df["gdp_o"])
df["log_gdp_d"] = np.log(df["gdp_d"])
df["pair_id"] = df["iso_o"] + "_" + df["iso_d"]

ols_df = df.loc[df["flow"] > 0].copy()

log_flow is undefined for zero trade flows. Do not replace log(0) with zero.

B.5 Baseline OLS side-by-side

The baseline log-linear model explains positive bilateral trade flows using economic size, weighted distance, bilateral controls, and institutional indicators.

B.5.1 R

ols_r <- lm(
  log_flow ~ log_gdp_o + log_gdp_d + log_distw +
    comlang_off + contig + wto_joint + EU_joint + EAEU_joint,
  data = ols_df
)

summary(ols_r)

B.5.2 Python

import statsmodels.formula.api as smf

ols_formula = (
    "log_flow ~ log_gdp_o + log_gdp_d + log_distw + "
    "comlang_off + contig + wto_joint + EU_joint + EAEU_joint"
)

ols_py = smf.ols(ols_formula, data=ols_df).fit(cov_type="HC1")

print(ols_py.summary())

B.5.3 Interpretation

log_gdp_o, log_gdp_d, and log_distw are interpreted as elasticities in the log-linear model. Dummy variables such as wto_joint, EU_joint, and EAEU_joint should be converted using:

\[ 100 \times \left(\exp(\beta) - 1\right) \]

This gives the approximate percent difference associated with the dummy variable, conditional on the model.

B.6 Fixed effects side-by-side

Fixed effects control for unobserved exporter, importer, and year differences. In the Post-Soviet workflow, fixed effects are also a bridge to structural gravity.

B.6.1 R

library(fixest)

fe_r <- feols(
  log_flow ~ log_distw + comlang_off + contig +
    wto_joint + EU_joint + EAEU_joint |
    iso_o + iso_d + year,
  data = ols_df,
  vcov = "hetero"
)

summary(fe_r)

B.6.2 Python

fe_formula = (
    "log_flow ~ log_distw + comlang_off + contig + "
    "wto_joint + EU_joint + EAEU_joint + "
    "C(iso_o) + C(iso_d) + C(year)"
)

fe_py = smf.ols(fe_formula, data=ols_df).fit(cov_type="HC1")

Python dummy expansion with C() is transparent and useful for teaching. It may be slower for large datasets because it creates explicit dummy variables. For larger panels, students may need specialized fixed-effects estimators such as linearmodels or a custom high-dimensional workflow.

B.7 Double demeaning side-by-side

Double demeaning removes exporter and importer means from a variable. It helps students understand the mechanics behind two-way fixed effects.

For a variable \(x_{ij}\):

\[ \tilde{x}_{ij} = x_{ij} - \bar{x}_{i \cdot} - \bar{x}_{\cdot j} + \bar{x} \]

B.7.1 R

R package workflows may use gravity::ddm or related logic. The explicit version is:

double_demean <- function(x, exporter, importer) {
  x - ave(x, exporter) - ave(x, importer) + mean(x, na.rm = TRUE)
}

ols_df$log_flow_dd <- double_demean(
  ols_df$log_flow,
  ols_df$iso_o,
  ols_df$iso_d
)

B.7.2 Python

def double_demean(data, variable, exporter="iso_o", importer="iso_d"):
    overall = data[variable].mean()
    exporter_mean = data.groupby(exporter)[variable].transform("mean")
    importer_mean = data.groupby(importer)[variable].transform("mean")
    return data[variable] - exporter_mean - importer_mean + overall

ols_df["log_flow_dd"] = double_demean(
    ols_df,
    variable="log_flow",
    exporter="iso_o",
    importer="iso_d",
)

Use this to understand the transformation. For publication models, report the actual fixed-effects estimator and sample used.

B.8 BVU side-by-side

Bonus Vetus Unweighted (BVU) approximates multilateral-resistance adjustment by transforming bilateral variables with unweighted country means.

B.8.1 Conceptual equation

For a bilateral variable \(x_{ij}\):

\[ x^{BVU}_{ij} = x_{ij} - \bar{x}_{i \cdot} - \bar{x}_{\cdot j} + \bar{x} \]

B.8.2 R-style pseudocode

R gravity::bvu automates the transformation when the dataset and function arguments match the package requirements.

library(gravity)

# Pseudocode: check actual function arguments against package documentation.
# bvu_model <- bvu(
#   dependent_variable = "flow",
#   distance = "distw",
#   additional_regressors = c("comlang_off", "contig", "wto_joint", "EU_joint", "EAEU_joint"),
#   code_origin = "iso_o",
#   code_destination = "iso_d",
#   data = df
# )

B.8.3 Python transformation

Python requires explicit construction.

def bvu_transform(data, variable, exporter="iso_o", importer="iso_d"):
    overall = data[variable].mean()
    exporter_mean = data.groupby(exporter)[variable].transform("mean")
    importer_mean = data.groupby(importer)[variable].transform("mean")
    return data[variable] - exporter_mean - importer_mean + overall

for variable in [
    "log_distw",
    "comlang_off",
    "contig",
    "wto_joint",
    "EU_joint",
    "EAEU_joint",
]:
    ols_df[f"{variable}_bvu"] = bvu_transform(ols_df, variable)
bvu_formula = (
    "log_flow ~ log_gdp_o + log_gdp_d + "
    "log_distw_bvu + comlang_off_bvu + contig_bvu + "
    "wto_joint_bvu + EU_joint_bvu + EAEU_joint_bvu + C(year)"
)

bvu_py = smf.ols(bvu_formula, data=ols_df).fit(cov_type="HC1")

B.9 BVW side-by-side

Bonus Vetus Weighted (BVW) uses weighted means rather than simple means. In the Post-Soviet project, a plausible teaching implementation can use GDP-based weights, but the final publication workflow must match the exact weighting rule used in the replication design.

B.9.1 Conceptual equation

\[ x^{BVW}_{ij} = x_{ij} - \sum_k s_k x_{ik} - \sum_m s_m x_{mj} + \sum_k \sum_m s_k s_m x_{km} \]

B.9.2 R-style pseudocode

# Pseudocode only. Confirm arguments and weights before publication use.
# bvw_model <- bvw(
#   dependent_variable = "flow",
#   distance = "distw",
#   additional_regressors = c("comlang_off", "contig", "wto_joint", "EU_joint", "EAEU_joint"),
#   code_origin = "iso_o",
#   code_destination = "iso_d",
#   weights = "gdp_d",
#   data = df
# )

B.9.3 Python teaching implementation

def weighted_mean_by_group(data, variable, group, weight):
    weighted_sum = (data[variable] * data[weight]).groupby(data[group]).transform("sum")
    weight_sum = data[weight].groupby(data[group]).transform("sum")
    return weighted_sum / weight_sum

def bvw_transform(data, variable, exporter="iso_o", importer="iso_d", weight="gdp_d"):
    exporter_wmean = weighted_mean_by_group(data, variable, exporter, weight)
    importer_wmean = weighted_mean_by_group(data, variable, importer, weight)
    overall_wmean = (data[variable] * data[weight]).sum() / data[weight].sum()
    return data[variable] - exporter_wmean - importer_wmean + overall_wmean

for variable in [
    "log_distw",
    "comlang_off",
    "contig",
    "wto_joint",
    "EU_joint",
    "EAEU_joint",
]:
    ols_df[f"{variable}_bvw"] = bvw_transform(ols_df, variable, weight="gdp_d")
bvw_formula = (
    "log_flow ~ log_gdp_o + log_gdp_d + "
    "log_distw_bvw + comlang_off_bvw + contig_bvw + "
    "wto_joint_bvw + EU_joint_bvw + EAEU_joint_bvw + C(year)"
)

bvw_py = smf.ols(bvw_formula, data=ols_df).fit(cov_type="HC1")
WarningPublication caution

The Python BVW code above is a teaching implementation. Before publication, verify the weighting formula, grouping structure, and sample against the original R workflow and the Post-Soviet replication design.

B.10 PPML side-by-side

PPML estimates the multiplicative gravity model in levels.

B.10.1 R

ppml_r <- glm(
  flow ~ log_gdp_o + log_gdp_d + log_distw +
    comlang_off + contig + wto_joint + EU_joint + EAEU_joint +
    factor(year),
  data = df,
  family = quasipoisson(link = "log")
)

B.10.2 Python

import statsmodels.api as sm

ppml_formula = (
    "flow ~ log_gdp_o + log_gdp_d + log_distw + "
    "comlang_off + contig + wto_joint + EU_joint + EAEU_joint + C(year)"
)

ppml_py = smf.glm(
    formula=ppml_formula,
    data=df,
    family=sm.families.Poisson(),
).fit(cov_type="HC1")

PPML is estimated with a Poisson family, but the dependent variable does not need to be count data. In gravity research, PPML is used for trade values because it estimates a multiplicative conditional mean.

B.11 GPML side-by-side

GPML uses a Gamma family with a log link. It requires positive trade flows.

B.11.1 R

gpml_df <- subset(df, flow > 0)

gpml_r <- glm(
  flow ~ log_gdp_o + log_gdp_d + log_distw +
    comlang_off + contig + wto_joint + EU_joint + EAEU_joint +
    factor(year),
  data = gpml_df,
  family = Gamma(link = "log")
)

B.11.2 Python

gpml_df = df.loc[df["flow"] > 0].copy()

gpml_py = smf.glm(
    formula=ppml_formula,
    data=gpml_df,
    family=sm.families.Gamma(link=sm.families.links.Log()),
).fit(cov_type="HC1")

Report that GPML uses positive trade flows only. Do not compare GPML and PPML without noting sample differences.

B.12 NBPML side-by-side

NBPML uses a negative-binomial variance structure with a log-link mean.

B.12.1 R

library(MASS)

nbpml_r <- glm.nb(
  flow ~ log_gdp_o + log_gdp_d + log_distw +
    comlang_off + contig + wto_joint + EU_joint + EAEU_joint +
    factor(year),
  data = df
)

B.12.2 Python

nbpml_py = smf.glm(
    formula=ppml_formula,
    data=df,
    family=sm.families.NegativeBinomial(),
).fit(cov_type="HC1")

Use NBPML as a robustness estimator unless the research design gives a specific reason to prefer it.

B.13 Tetrads side-by-side

Tetrads use ratio-of-ratios logic to remove exporter and importer terms by comparing four trade flows. They are useful for understanding how fixed effects can be differenced out of a gravity relationship.

B.13.1 Formula

For exporters \(i\), \(k\) and importers \(j\), \(l\):

\[ T_{ikjl} = \frac{flow_{ij} / flow_{il}}{flow_{kj} / flow_{kl}} \]

Taking logs:

\[ \log T_{ikjl} = \log(flow_{ij}) - \log(flow_{il}) - \log(flow_{kj}) + \log(flow_{kl}) \]

Exporter and importer terms cancel when the comparison is valid and the four flows are observed.

B.13.2 R-style pseudocode

# Pseudocode:
# 1. Select a year.
# 2. Choose exporter pair i, k and importer pair j, l.
# 3. Extract flow_ij, flow_il, flow_kj, flow_kl.
# 4. Compute log(flow_ij) - log(flow_il) - log(flow_kj) + log(flow_kl).

B.13.3 Python function skeleton

def tetrad_log_flow(data, exporter_i, exporter_k, importer_j, importer_l, year_value):
    sample = data.loc[data["year"] == year_value].copy()

    def get_flow(exporter, importer):
        value = sample.loc[
            (sample["iso_o"] == exporter) & (sample["iso_d"] == importer),
            "flow",
        ]
        if value.empty:
            return np.nan
        return value.iloc[0]

    flow_ij = get_flow(exporter_i, importer_j)
    flow_il = get_flow(exporter_i, importer_l)
    flow_kj = get_flow(exporter_k, importer_j)
    flow_kl = get_flow(exporter_k, importer_l)

    flows = np.array([flow_ij, flow_il, flow_kj, flow_kl], dtype=float)
    if np.any(flows <= 0) or np.any(np.isnan(flows)):
        return np.nan

    return np.log(flow_ij) - np.log(flow_il) - np.log(flow_kj) + np.log(flow_kl)

Tetrads are sensitive to missing and zero flows. Use them carefully and document the tetrad sample.

B.14 Common translation mistakes

  • Logging zero trade flows.
  • Forgetting robust standard errors.
  • Treating PPML as count-only.
  • Interpreting dummy coefficients as \(100 \times \beta\) when coefficients are not small.
  • Expecting pair fixed effects to estimate distance.
  • Comparing R and Python results without matching samples.
  • Ignoring missing values.
  • Using different year filters across languages.
  • Letting software silently drop different rows.
  • Comparing standard errors from different covariance estimators.

B.15 Replication checklist

Before comparing R and Python outputs, confirm:

  • same dataset;
  • same sample;
  • same variable definitions;
  • same fixed effects;
  • same standard error type;
  • same zero-flow treatment;
  • same dropped observations;
  • same sorting and grouping rules for transformations;
  • same weights for BVW;
  • same country identifiers: iso_o and iso_d;
  • same institutional variables: wto_joint, EU_joint, and EAEU_joint.

B.16 Final note

Python is the main teaching language in this course. R code appears here only so students can understand existing gravity examples and replicate them carefully. The final replication package should be written in Python, with all data construction and estimation decisions documented.