5  CEPII Data

5.1 Introduction

The Post-Soviet replication project is built from gravity-style country-pair-year data. The central source is the CEPII Gravity Database, which provides harmonized bilateral trade, geographic, macroeconomic, and institutional variables used in empirical gravity research.

The course dataset is:

data/ExSoviet_balanced_clean.csv

It contains 5,253 observations and 20 columns, covering 15 exporters, 15 importers, and the years 1992-2020. Every observation records a positive trade flow; the teaching dataset contains no zero-flow observations.
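These dimensions are worth verifying before any estimation. The sketch below is one minimal way to do so with pandas, assuming the column names iso_o, iso_d, year, and flow used in this replication. summarize_panel is a hypothetical helper, and the toy rows are invented so that the example runs without the course file.

```python
import pandas as pd


def summarize_panel(df: pd.DataFrame) -> dict:
    """Return the headline dimensions of a directed dyad-year panel."""
    return {
        "observations": len(df),
        "columns": df.shape[1],
        "exporters": df["iso_o"].nunique(),
        "importers": df["iso_d"].nunique(),
        "first_year": int(df["year"].min()),
        "last_year": int(df["year"].max()),
        "zero_flows": int((df["flow"] == 0).sum()),
    }


# Tiny synthetic stand-in for the course file (all values are invented).
toy = pd.DataFrame({
    "iso_o": ["ARM", "GEO", "ARM"],
    "iso_d": ["GEO", "ARM", "GEO"],
    "year": [1992, 1992, 1993],
    "flow": [10.5, 7.2, 11.0],
})

summary = summarize_panel(toy)
print(summary)
```

Applied to the course file, summarize_panel(pd.read_csv("data/ExSoviet_balanced_clean.csv")) should reproduce the figures stated above if the file matches this description.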

5.2 Why CEPII matters

Gravity models require variables for both countries in a pair and for the relationship between them. CEPII is useful because it combines these elements in a single bilateral panel:

  • exporter and importer identifiers;
  • bilateral trade flows;
  • GDP and population;
  • distance and geography;
  • language and contiguity indicators;
  • institutional and policy variables.

This structure allows students to estimate whether trade between two countries is associated with economic size, trade costs, and institutional alignment.

5.3 Directed dyad-year structure

The Post-Soviet dataset is a directed dyad-year panel. A directed dyad treats exporter-importer direction as meaningful:

Armenia -> Azerbaijan in 2015
Azerbaijan -> Armenia in 2015

These are different observations because exports from country \(i\) to country \(j\) need not equal exports from country \(j\) to country \(i\).

Country-pair-year data are needed because gravity models explain bilateral flows over time. The unit of observation is:

exporter, importer, year
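Because the unit of observation is the exporter-importer-year triple, each triple should appear at most once. The following sketch shows one way to test this with pandas and to retrieve the two directions of a pair separately. check_dyad_year_key is a hypothetical helper, and the flow values are invented for illustration.

```python
import pandas as pd

KEY = ["iso_o", "iso_d", "year"]


def check_dyad_year_key(df: pd.DataFrame) -> bool:
    """Return True if no (exporter, importer, year) combination repeats."""
    return not df.duplicated(subset=KEY).any()


# Two rows for the same pair and year, one per direction (invented flows).
toy = pd.DataFrame({
    "iso_o": ["ARM", "AZE"],
    "iso_d": ["AZE", "ARM"],
    "year": [2015, 2015],
    "flow": [1.0, 2.0],
})

key_is_unique = check_dyad_year_key(toy)

# Index by the full key so each direction can be looked up on its own.
by_key = toy.set_index(KEY)["flow"].sort_index()
arm_to_aze = by_key.loc[("ARM", "AZE", 2015)]
aze_to_arm = by_key.loc[("AZE", "ARM", 2015)]
print(key_is_unique, arm_to_aze, aze_to_arm)
```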

This structure supports pooled OLS, fixed-effects OLS, PPML, GPML, and structural PPML specifications.

5.4 Post-Soviet country sample

The project covers fifteen post-Soviet economies:

Armenia, Azerbaijan, Belarus, Estonia, Georgia, Kazakhstan, Kyrgyzstan, Latvia, Lithuania, Moldova, Russia, Tajikistan, Turkmenistan, Ukraine, and Uzbekistan.

Each country can appear as both exporter and importer. This is why the dataset has 15 exporters and 15 importers.
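The number of possible directed pairs follows directly from the country list. The sketch below enumerates them, assuming that a country is never paired with itself and that the identifiers are ISO 3166-1 alpha-3 codes; both assumptions should be checked against the course file.

```python
from itertools import permutations

# Assumed ISO alpha-3 codes for the fifteen sample countries.
ISO3 = ["ARM", "AZE", "BLR", "EST", "GEO", "KAZ", "KGZ", "LVA",
        "LTU", "MDA", "RUS", "TJK", "TKM", "UKR", "UZB"]
YEARS = range(1992, 2021)  # 1992 through 2020 inclusive

# Ordered pairs without self-pairs: (ARM, AZE) and (AZE, ARM) both appear.
directed_pairs = list(permutations(ISO3, 2))

n_pairs = len(directed_pairs)
n_years = len(YEARS)
potential_rows = n_pairs * n_years
print(n_pairs, n_years, potential_rows)
```

Under these assumptions there are 15 x 14 = 210 directed pairs and 210 x 29 = 6,090 potential pair-years. The course file has 5,253 rows, so some pair-years are absent from it, which would be consistent with the dataset containing only positive-flow observations.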

5.5 CEPII variables in the replication

The replication uses the following required variables:

flow, gdp_o, gdp_d, distw, comlang_off, contig, wto_joint, EU_joint, EAEU_joint, year, iso_o, and iso_d.
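A replication script can fail early if any required variable is absent. The following is a minimal sketch; missing_columns is a hypothetical helper, and the example header is deliberately incomplete to show the behavior.

```python
# Required variables, exactly as named in the replication.
REQUIRED = ["flow", "gdp_o", "gdp_d", "distw", "comlang_off", "contig",
            "wto_joint", "EU_joint", "EAEU_joint", "year", "iso_o", "iso_d"]


def missing_columns(columns):
    """Return the required variable names not found in `columns`."""
    present = set(columns)
    return [name for name in REQUIRED if name not in present]


# An incomplete header, used only to illustrate the check.
header = ["iso_o", "iso_d", "year", "flow", "gdp_o", "gdp_d", "distw"]
missing = missing_columns(header)
print(missing)
```

With a pandas DataFrame df, the same check can be run as missing_columns(df.columns), raising an error if the returned list is not empty.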

The variables map into three empirical roles: trade outcome, gravity controls, and institutional variables.

Variable family                Example variables     Research use
-----------------------------  --------------------  ---------------------------------------------
Exporter-importer identifiers  iso_o, iso_d, year    Define the directed dyad-year observation
Trade flow                     flow                  Dependent variable in OLS, PPML, and GPML
Economic mass                  gdp_o, gdp_d          Exporter and importer economic size
Geographic cost                distw                 Weighted distance between countries
Bilateral frictions            comlang_off, contig   Language and border controls
WTO status                     wto_joint             Both countries are WTO members in year \(t\)
EU status                      EU_joint              Both countries are EU members in year \(t\)
EAEU status                    EAEU_joint            Both countries are EAEU members in year \(t\)
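To see how the three roles enter an estimating equation, one possible log-linear pooled OLS specification and its PPML counterpart are sketched below. The coefficient labels \(\beta_k\), the error term \(\varepsilon_{ijt}\), and the shorthand \(x_{ijt}\) are notation introduced here for illustration. The course's own specifications may add fixed effects or differ in other details.

```latex
\begin{align*}
\ln(\mathit{flow}_{ijt}) ={}& \beta_0
  + \beta_1 \ln(\mathit{gdp\_o}_{it})
  + \beta_2 \ln(\mathit{gdp\_d}_{jt})
  + \beta_3 \ln(\mathit{distw}_{ij}) \\
& + \beta_4\, \mathit{comlang\_off}_{ij}
  + \beta_5\, \mathit{contig}_{ij} \\
& + \beta_6\, \mathit{wto\_joint}_{ijt}
  + \beta_7\, \mathit{EU\_joint}_{ijt}
  + \beta_8\, \mathit{EAEU\_joint}_{ijt}
  + \varepsilon_{ijt}
\\[1ex]
\mathbb{E}\left[\mathit{flow}_{ijt} \mid x_{ijt}\right] ={}& \exp\bigl(
  \beta_0
  + \beta_1 \ln(\mathit{gdp\_o}_{it})
  + \beta_2 \ln(\mathit{gdp\_d}_{jt})
  + \beta_3 \ln(\mathit{distw}_{ij}) \\
& \qquad + \beta_4\, \mathit{comlang\_off}_{ij}
  + \beta_5\, \mathit{contig}_{ij} \\
& \qquad + \beta_6\, \mathit{wto\_joint}_{ijt}
  + \beta_7\, \mathit{EU\_joint}_{ijt}
  + \beta_8\, \mathit{EAEU\_joint}_{ijt}
  \bigr)
\end{align*}
```

Here \(i\) indexes the exporter, \(j\) the importer, and \(t\) the year. The first equation takes the log of the trade flow as the outcome, while the second models the flow in levels, which is the form PPML estimates.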

5.6 Institutional variables

The institutional variables are pair-year indicators.

wto_joint equals 1 when both countries are WTO members in year \(t\). It measures shared multilateral trade-system membership.

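EU_joint equals 1 when both countries are EU members in year \(t\). In this sample the only EU members are the Baltic states (Estonia, Latvia, and Lithuania, which joined in 2004), so all variation in EU_joint comes from Baltic-Baltic pairs.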

EAEU_joint equals 1 when both countries are EAEU members in year \(t\). The EAEU came into force in 2015, so this variable can equal 1 only from 2015 onward.

These indicators do not directly measure implementation quality, tariff changes, customs procedures, or enforcement. They identify shared institutional status.
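The logic of a joint-membership indicator can be sketched as follows. This is an illustration, not the dataset's documented coding rule: joint_member is a hypothetical helper, and treating the whole accession year as a membership year is an assumption. The three Baltic states joined the EU on 1 May 2004, so a dataset may code 2004 differently.

```python
# EU accession years for the sample's EU members (all joined on 1 May 2004).
EU_ACCESSION = {"EST": 2004, "LVA": 2004, "LTU": 2004}


def joint_member(iso_o, iso_d, year, accession):
    """Return 1 if both countries have acceded by `year`, else 0.

    Assumption: a country counts as a member for its entire accession year.
    """
    def is_member(iso):
        return iso in accession and year >= accession[iso]

    return int(is_member(iso_o) and is_member(iso_d))


before = joint_member("EST", "LVA", 2003, EU_ACCESSION)  # neither has joined
after = joint_member("EST", "LVA", 2004, EU_ACCESSION)   # both are members
mixed = joint_member("EST", "RUS", 2010, EU_ACCESSION)   # only one member
print(before, after, mixed)
```

The indicator switches on only when both sides of the pair are members, which is why it captures shared status rather than either country's own membership.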

5.7 Source documentation

Every replication paper should document:

  • the database version;
  • country coverage;
  • year coverage;
  • trade-flow definition;
  • variable names;
  • sample restrictions;
  • institutional coding rules.

In this course, students should treat the source inventory as part of the research output, not as background administration.
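One way to treat the inventory as research output is to store it as a machine-readable file next to the data. The sketch below is a hypothetical layout, not a course requirement. The class name, field names, and wording of the entries are illustrative, and entries marked TO BE FILLED IN are deliberately left open because this chapter does not state them.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class SourceInventory:
    """One field per item in the documentation checklist."""
    database_version: str
    country_coverage: str
    year_coverage: str
    trade_flow_definition: str
    variable_names: list
    sample_restrictions: str
    institutional_coding_rules: str


inventory = SourceInventory(
    database_version="TO BE FILLED IN",
    country_coverage="15 post-Soviet economies",
    year_coverage="1992-2020",
    trade_flow_definition="TO BE FILLED IN",
    variable_names=["flow", "gdp_o", "gdp_d", "distw", "comlang_off",
                    "contig", "wto_joint", "EU_joint", "EAEU_joint",
                    "year", "iso_o", "iso_d"],
    sample_restrictions="No zero-flow observations",
    institutional_coding_rules="Joint indicators equal 1 when both "
                               "countries are members in year t",
)

# Serialize so the inventory can be saved and versioned with the data.
inventory_json = json.dumps(asdict(inventory), indent=2)
print(inventory_json)
```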

Note

Do not open large CEPII files in spreadsheet software for cleaning. Use Python so every filtering, variable-construction, and export step is reproducible.
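A scripted filter might look like the sketch below, which reads a large file in chunks with pandas and keeps only the sample countries and years. filter_gravity is a hypothetical helper, and the column names iso_o, iso_d, and year follow the course dataset; the raw CEPII file may use different names, in which case they would need to be renamed first. The in-memory CSV is invented so that the example runs on its own.

```python
import io

import pandas as pd

# Assumed ISO alpha-3 codes for the fifteen sample countries.
SAMPLE = {"ARM", "AZE", "BLR", "EST", "GEO", "KAZ", "KGZ", "LVA",
          "LTU", "MDA", "RUS", "TJK", "TKM", "UKR", "UZB"}


def filter_gravity(source, chunksize=100_000):
    """Keep rows where both countries are in SAMPLE and year is 1992-2020."""
    kept = []
    # Reading in chunks avoids loading the whole file into memory at once.
    for chunk in pd.read_csv(source, chunksize=chunksize):
        mask = (
            chunk["iso_o"].isin(SAMPLE)
            & chunk["iso_d"].isin(SAMPLE)
            & chunk["year"].between(1992, 2020)
        )
        kept.append(chunk.loc[mask])
    return pd.concat(kept, ignore_index=True)


# Invented stand-in for a raw file: one out-of-sample partner, one early year.
raw = io.StringIO(
    "iso_o,iso_d,year,flow\n"
    "ARM,GEO,1995,3.0\n"
    "ARM,FRA,1995,9.0\n"
    "GEO,ARM,1990,1.0\n"
    "UKR,RUS,2020,5.0\n"
)

subset = filter_gravity(raw, chunksize=2)
print(subset)
```

Because the filter is a function in a script, the same steps can be rerun on a new database version, and the result can be written out with subset.to_csv(..., index=False).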