Session 3: Data Infrastructure & Big Data
Working with complex environmental databases at scale
This session shifts from the well-curated OWID interface to a more realistic scenario: working with Exiobase—a massive, environmentally-extended input-output database tracking emissions through global supply chains. You'll experience how data infrastructure choices affect both human and automated workflows.
Background: Exiobase
Exiobase is a multi-regional, environmentally-extended input-output (MRIO) database tracking economic activity and environmental impacts across global supply chains. Unlike OWID's aggregated national totals, Exiobase provides sectoral resolution: emissions from electricity generation, transport, manufacturing, agriculture, and the rest of its 163 industry categories, for each of 49 regions.
At a glance: 163 industries, 49 regions, annual coverage 1995–2022.
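To get a feel for what these dimensions imply, the back-of-the-envelope sketch below estimates the size of a dense inter-industry flow matrix. It assumes one row and one column per industry-region pair and 8-byte floating-point cells with no compression. Both are illustrative assumptions, not a description of how the Exiobase files are actually stored.

```python
# Rough size estimate for a dense industry-by-industry flow matrix.
# Assumptions (illustrative only): one row/column per industry-region pair,
# 8-byte float64 cells, no compression and no sparsity.
INDUSTRIES = 163
REGIONS = 49
FIRST_YEAR, LAST_YEAR = 1995, 2022
BYTES_PER_CELL = 8

pairs = INDUSTRIES * REGIONS            # industry-region combinations
cells = pairs ** 2                      # cells in one square flow matrix
years = LAST_YEAR - FIRST_YEAR + 1      # annual tables, inclusive of both ends
gb_per_year = cells * BYTES_PER_CELL / 1e9
gb_total = gb_per_year * years

print(f"{pairs:,} industry-region pairs")
print(f"{cells:,} cells per yearly matrix")
print(f"{gb_per_year:.2f} GB per year, {gb_total:.1f} GB for {years} years")
```

Even under these simplified assumptions a single year does not fit comfortably in a spreadsheet, which is why file format and access pattern matter for the rest of the session.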
Key characteristics:
- Structure: Input-output tables linking inter-industry flows with environmental extensions
- Coverage: Environmental extensions for CO₂, CH₄, land use, water consumption, and other pressures
- Applications: Consumption-based emissions accounting, supply chain analysis, trade embodied emissions, sectoral decarbonization pathways
Methodological note: MRIO models make it possible to allocate emissions to final consumers rather than to production locations. For a smartphone manufactured in China but consumed in the US, the production emissions are attributed to the US in a consumption-based account derived from Exiobase, but to China in OWID's territorial totals.
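The difference between the two accounting views can be made concrete with a toy model. The following is a minimal sketch of the standard Leontief demand-pull calculation for two regions with one sector each. Every number is invented, and the variable names (`A`, `Y`, `f`, `E`) are our own shorthand rather than Exiobase field names.

```python
import numpy as np

# Toy MRIO with two regions and one sector each. All numbers are invented.
regions = ["CN", "US"]

# A[i, j]: input from region i required per unit of output in region j.
A = np.array([[0.10, 0.20],
              [0.05, 0.10]])

# Y[i, r]: final demand in region r for goods produced in region i.
Y = np.array([[80.0, 40.0],
              [10.0, 90.0]])

# f[i]: direct emissions per unit of output in region i.
f = np.array([0.8, 0.3])

# Leontief inverse: total (direct plus indirect) output per unit of final demand.
L = np.linalg.inv(np.eye(2) - A)

# Total output needed to satisfy all final demand.
x = L @ Y.sum(axis=1)

# E[i, r]: emissions occurring in region i that serve final demand in region r.
E = (f[:, None] * L) @ Y

territorial = E.sum(axis=1)   # by where emissions occur (equals f * x)
consumption = E.sum(axis=0)   # by whose final demand caused them

for name, t, c in zip(regions, territorial, consumption):
    print(f"{name}: territorial={t:.1f}, consumption-based={c:.1f}")
```

In this toy setup the exporting region's territorial emissions exceed its consumption-based emissions and the reverse holds for the importer, while the global total is identical under both views. Only the allocation changes, not the amount.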
Suggested Readings
- Exiobase Community on Zenodo — Original data repository and documentation
- Exiobase Official Website — Project overview and methodology
- Exiobase-3 on Source Cooperative — Cloud-optimized GeoParquet format we'll use
- Stadler et al. (2022) — "EXIOBASE 3rx: A flexible multi-regional input-output database" in Scientific Data
- Carbon Brief: CO2 Importers & Exporters — Accessible overview of consumption vs territorial accounting
What You'll Explore
The session notebook (in your module template repository) guides you through:
- Access friction: Attempt to load Exiobase from its original Zenodo archive—experience the barriers to automated workflows
- Cloud-optimized formats: Load the same data from GeoParquet—observe the difference in agent performance
- Schema exploration: Navigate the complex structure (163 industries × 49 regions × multiple environmental pressures)
- Sectoral analysis: Identify the most emission-intensive industries globally and by country
- Cross-dataset validation: Compare Exiobase totals with OWID—investigate methodological discrepancies
- Synthesis analysis: Independent investigation integrating multiple data sources
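The sectoral-analysis and cross-dataset steps above might look roughly like the pandas sketch below. It is not the notebook's code: the table layout, column names (`region`, `industry`, `co2_mt`), values, and the 5% threshold are all hypothetical stand-ins. It also ranks industries by absolute emissions; ranking by intensity would additionally require economic output per industry.

```python
import pandas as pd

# Hypothetical long-format extract: one row per region-industry pair.
# Column names and values are invented for illustration.
exio = pd.DataFrame({
    "region":   ["DE", "DE", "DE", "FR", "FR", "FR"],
    "industry": ["Electricity", "Transport", "Agriculture"] * 2,
    "co2_mt":   [250.0, 150.0, 60.0, 40.0, 120.0, 70.0],
})

# Hypothetical national totals from a second source.
owid = pd.DataFrame({
    "region": ["DE", "FR"],
    "co2_mt": [480.0, 200.0],
})

# Sectoral analysis: largest-emitting industries globally and per region.
top_global = (
    exio.groupby("industry")["co2_mt"].sum().sort_values(ascending=False)
)
top_by_region = (
    exio.sort_values("co2_mt", ascending=False).groupby("region").head(2)
)

# Cross-dataset validation: aggregate to national totals, then compare.
totals = exio.groupby("region", as_index=False)["co2_mt"].sum()
compare = totals.merge(owid, on="region", suffixes=("_exio", "_owid"))
compare["rel_diff"] = (
    (compare["co2_mt_exio"] - compare["co2_mt_owid"]) / compare["co2_mt_owid"]
)

# Flag gaps above an arbitrary 5% threshold for manual investigation.
compare["investigate"] = compare["rel_diff"].abs() > 0.05

print(top_global)
print(top_by_region)
print(compare)
```

A flagged row is a prompt to check methodology (accounting basis, gas coverage, sector boundaries), not evidence that either source is wrong.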
Learning Objectives
- Evaluate how data format and access patterns constrain automated workflows
- Compare schema complexity across datasets with different design goals
- Integrate multi-source data requiring methodological reconciliation
- Assess when coding agents can operate autonomously versus when domain expertise must guide analysis
Key Insight
Moving from ZIP archives to cloud-optimized Parquet removes technical friction. But understanding MRIO methodology, interpreting industry classifications, and deciding what to do when two authoritative sources disagree all require domain knowledge that no format change can provide.