---
title: "Environmental footprint analysis"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Environmental footprint analysis}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  eval = FALSE
)
```

## What is a footprint analysis?

A **footprint analysis** answers the question: *"How much of an
environmental pressure (land, water, emissions...) caused in country A
is ultimately driven by consumption in country B?"*

Modern supply chains are global. Spain might import soybeans from Brazil,
feed them to pigs, and export ham to France. A footprint analysis traces
these flows backwards through the entire supply chain, so that the
environmental cost of growing those soybeans is attributed to the French
consumer who ate the ham --- not to the Brazilian farmer who grew the
beans.

This is done by building a **Multi-Regional Input-Output (MRIO) model**,
a mathematical framework that captures all inter-industry and
inter-country linkages in one system of equations.

## The three-step pipeline

The whep package implements this in three steps:

1. **`build_io_model()`** --- Assemble the MRIO tables from supply-use
   data, bilateral trade, and commodity balance sheets.
2. **`compute_leontief_inverse()`** --- Compute the Leontief inverse,
   which captures both direct and indirect supply chain linkages.
3. **`compute_footprint()`** --- Multiply the Leontief inverse by
   environmental extension data to trace pressures through to final
   consumption.

The methodology follows the
[FABIO model](https://doi.org/10.1021/acs.est.9b03554) (Bruckner et
al., 2019), which builds physical MRIO tables for the global food
system.

## Concepts and terminology

Before diving into the code, here are the key concepts:

### Supply-use tables and process groups

A **supply table** records how much of each product each process
produces. A **use table** records how much of each product each process
consumes as input. Together they describe the technology of the economy.

In whep, each process belongs to one of four **process groups**, which
together cover all transformations in the food system:

| Process group | What it does | Inputs | Outputs |
|---|---|---|---|
| `crop_production` | Grow crops | Seed (from CBS) | Harvested crop + residues (straw, etc.) |
| `husbandry` | Raise livestock | Feed (crops, grass, etc.) | Live animals (heads) + non-slaughter products (milk, eggs, wool...) |
| `slaughtering` | Slaughter animals | Live animals (heads) | Meat, offals, fats, hides and skins |
| `processing` | Derive secondary products | Primary commodities (tonnes) | Processed products (oil, flour, sugar, etc.) |

Each process is uniquely identified by a `(proc_group, proc_cbs_code)`
pair.

#### Mixed units in the supply-use system

Most flows are measured in **tonnes**, but live animals are measured in
**heads** (number of animals). This matches the FABIO convention.
Different rows of the `Z` matrix can have different units because each
row's entries are only compared to that row's total output `X[i]`. The
technical coefficients $A_{ij} = Z_{ij} / X_j$ are dimensionless
ratios, so the system remains mathematically consistent.

For example, a slaughtering process for cattle:

- **Use**: 1000 heads of cattle (heads)
- **Supply**: 400 tonnes of beef + 50 tonnes of offals + 30 tonnes of
  hides (tonnes)

The cattle heads and beef tonnes live in different rows of `Z`, each
with its own unit.

### Bilateral trade

A **bilateral trade matrix** records, for each product, how many tonnes
(or heads, for live animals) flow from each exporting country to each
importing country. This is what connects domestic economies into a
global system.

### Commodity balance sheets (CBS)

A **commodity balance sheet** is an accounting identity for each product
in each country:

$$
\text{Production} + \text{Imports} = \text{Exports}
  + \text{Food} + \text{Feed} + \text{Other uses}
  + \Delta\text{Stocks}
$$

The right-hand categories (food, feed, other uses) form the
**final demand** --- the end of the supply chain where products are
consumed rather than being used as intermediate inputs.

For live animals (which FAO does not include in its CBS data), whep
constructs synthetic CBS entries from primary production data:

- **Meat animals** (cattle, pigs, poultry...): all production goes to
  `processing` (i.e. they enter the slaughtering process).
- **Draft animals** (horses, camels, asses...): all production goes to
  `other_uses`.

### The Z matrix (inter-industry flows)

`Z` is the core of the MRIO model. It is a square matrix where each row
and column represents a *(country, product)* pair. Entry $Z_{ij}$
gives the physical flow from sector $i$ to sector $j$.

For example, if row 5 is *(Spain, wheat)* and column 12 is *(Spain,
flour)*, then $Z_{5,12}$ is the amount of Spanish wheat used by
Spanish flour mills.

The model constructs `Z` using the **industry technology assumption**:

$$
Z = U_{MR} \times T
$$

where:

- $U_{MR}$ is the multi-regional use matrix. It starts from
  single-country use tables and expands them with trade shares, so that
  each input can originate from any country. For example, if Spain's pig
  farming uses 1000 tonnes of soybeans, and 60% of Spain's soybean
  supply comes from Brazil, then 600 tonnes are attributed to Brazil and
  400 to domestic production.
- $T$ is the row-normalized supply matrix (the **transformation
  matrix**). Each row sums to 1 and encodes how each process distributes
  its output across products. This converts process-level flows to
  product-level flows.

### Trade shares

For each product, trade shares tell us *where each country gets its
supply from*. If Spain consumes 1000 tonnes of soybeans, and 600 come
from domestic production while 400 come from Brazil, then the trade
share from Brazil to Spain is 0.4 and the domestic share is 0.6.

These shares are computed from the bilateral trade matrices combined
with domestic production data from the CBS. They are used to expand the
single-country use tables into a multi-regional system where each
input can originate from any country.

### The Y matrix (final demand)

`Y` is the final demand matrix. Rows are *(country, product)* pairs
(same as `Z`), and columns are *(country, demand category)* pairs.
Entry $Y_{ij}$ tells us how much of product $i$ goes to final demand
category $j$.

The demand categories are derived from the CBS columns present in
the data, typically: `food`, `other_uses`, and `stock_addition`.

Like `Z`, the final demand is expanded with trade shares so that, for
example, food consumption of wheat in France can partly originate from
French wheat and partly from imported wheat.

### The X vector (total output)

`X` is a vector where $X_i$ is the total output of sector $i$.
By construction:

$$
X = Z \cdot \mathbf{1} + Y \cdot \mathbf{1}
$$

meaning total output equals intermediate deliveries plus final demand.

### FABIO adjustments

After the initial construction of `Z` and `Y`, the model applies
several corrections following the FABIO methodology:

1. **Leftover redistribution**: When CBS reports `processing` or `seed`
   use but the supply-use table does not fully account for it, the
   unmatched quantity is redistributed to the `food` category in `Y`.
   This prevents demand from vanishing.

2. **Loss endogenization** (optional, via `endogenize_losses = TRUE`):
   If the CBS contains a `losses` column, losses are moved from `Y` to
   the diagonal of `Z`. This treats losses as self-use within each
   sector (the commodity is "consumed" by its own production process)
   rather than as final demand.

3. **Diagonal rebalancing**: When $\text{diag}(Z)_i \geq X_i$ (the
   diagonal accounts for all or more than total output, usually due
   to reporting errors), 80% of the diagonal value is moved to final
   demand and 20% is kept on the diagonal, proportionally distributed
   across the sector's demand categories.

4. **Negative output fixing**: Any sectors with $X_i < 0$ (which can
   arise from trade data inconsistencies) are corrected by adjusting
   negative entries in `Y`.

### The Leontief inverse (L)

The **Leontief inverse** $L = (I - A)^{-1}$ is the key analytical
tool. The technical coefficients matrix $A$ is defined as
$A_{ij} = Z_{ij} / X_j$, representing how much input from sector $i$
is needed per unit of output from sector $j$.

Before inversion, two corrections are applied:

- **Negative entries in $A$ are set to zero**: these arise from data
  inconsistencies and would distort the inverse.
- **Column sums of $A$ are capped at 1**: this ensures $(I - A)$ is
  invertible even when reported inputs exceed reported output.

The Leontief inverse captures the **total requirements** --- both direct
and indirect --- needed to deliver one unit of final demand. Entry
$L_{ij}$ tells us: *"To deliver one extra unit of product $j$ to final
consumers, how much total output does sector $i$ need to produce
(including all the intermediate steps)?"*

This is what makes footprint tracing possible. A consumer buying bread
in France triggers not just French bakeries, but also French mills,
French wheat farmers, Brazilian soybean exporters (if feed is
involved), and so on. The Leontief inverse encodes all of these ripple
effects in a single matrix.

### Environmental extensions

An **environmental extension** is any pressure you want to trace through
the supply chain. Common examples:

- **Land use** (hectares): How many hectares of cropland does each
  sector use?
- **Water use** (cubic meters): How much blue water is consumed?
- **Greenhouse gas emissions** (tonnes CO2-eq): How much does each
  sector emit?
- **Nitrogen surplus** (tonnes N): How much excess nitrogen enters the
  environment?

You provide these as a vector $e$ with one value per *(country,
product)* sector.

### The footprint

The footprint matrix is computed as:

$$
FP = \hat{f} \cdot L \times Y
$$

where $\hat{f}$ is a diagonal matrix of **extension intensities**
$f_i = e_i / X_i$ (pressure per unit of output). Each entry $FP_{ij}$
tells us: *"How much environmental pressure occurring in sector $i$ is
driven by final demand category $j$?"*

When `fd_labels` is provided, the footprint uses the **FABIO diagonal
approach**: for each final demand column $j$, the function computes
$MP \cdot \text{diag}(y_j) \cdot G$, where $G$ groups sectors by item.
This decomposes the footprint by both the target item being consumed
and the demand category (food, other uses, etc.).

The resulting tidy tibble decomposes this by origin country/product and
target country/product, giving you a full bilateral footprint map.

## Running the analysis

### Step 1: Build the IO model

The simplest call uses default arguments --- the function fetches all
input data internally:

```{r}
library(whep)

io <- build_io_model()
```

This downloads supply-use tables, bilateral trade, and CBS data, then
builds the IO model for all available years. To restrict to specific
years:

```{r}
io <- build_io_model(years = c(2010, 2015))
```

If you want to inspect or pre-filter the input data, you can build them
separately and pass them in:

```{r}
supply_use <- build_supply_use()
bilateral_trade <- get_bilateral_trade()
cbs <- get_wide_cbs()

io <- build_io_model(supply_use, bilateral_trade, cbs, years = 2010)
```

To endogenize losses (move them from final demand to the diagonal of
`Z`):

```{r}
io <- build_io_model(endogenize_losses = TRUE)
```

The result is a tibble with one row per year. Each row contains:

- `Z`: The inter-industry flow matrix (sparse).
- `Y`: The final demand matrix (sparse).
- `X`: The total output vector.
- `labels`: A tibble mapping row/column indices to `area_code` and
  `item_cbs_code`.
- `fd_labels`: A tibble mapping each Y column to its `area_code`
  (consuming country) and `fd_col` (demand category).

You can inspect a single year:

```{r}
# Extract matrices for one year
z_mat <- io$Z[[1]]
y_mat <- io$Y[[1]]
x_vec <- io$X[[1]]
labels <- io$labels[[1]]
fd_labels <- io$fd_labels[[1]]
selected_year <- io$year[[1]]

# Dimensions: (n_countries * n_products) x (n_countries * n_products)
dim(z_mat)

# Label mapping
labels

# Final demand column labels
fd_labels
```

### Step 2: Compute the Leontief inverse

For **small systems** (< 5000 sectors), you can pre-compute L:

```{r}
l_inv <- compute_leontief_inverse(z_mat, x_vec)
dim(l_inv)
```

This inverts the $(I - A)$ matrix. Column sums of $A$ are internally
capped at 1 to prevent singularity, and negative entries are zeroed.

#### For large systems

For systems larger than ~5000 sectors (e.g. 192 countries x 125+
products = 24 000 sectors), computing the explicit Leontief inverse
requires too much memory (e.g. 4.8 GiB for 25 000 sectors). In those
cases, pass `z_mat` directly to `compute_footprint()`, which solves
$(I - A)x = Y$ without ever materialising the full L.

### Step 3: Compute the footprint

Now provide your environmental extension vector. This must have one
value per sector, in the same order as the labels. For land use, you
can use `get_land_fp_production()`:

```{r}
land_use_data <- get_land_fp_production() |>
  dplyr::filter(year == selected_year) |>
  dplyr::select(area_code, item_cbs_code, hectares = impact_u)

# Match it to the IO model's label order
extensions <- labels |>
  dplyr::left_join(
    land_use_data,
    by = c("area_code", "item_cbs_code")
  ) |>
  dplyr::mutate(
    hectares = tidyr::replace_na(hectares, 0)
  ) |>
  dplyr::pull(hectares)
```

#### Dense path (pre-computed L, small systems)

If you pre-computed L:

```{r}
footprint <- compute_footprint(
  l_inv, x_vec, y_mat, extensions, labels,
  fd_labels = fd_labels
)
footprint
```

#### Using Z directly (large systems)

Alternatively, pass the Z matrix and `compute_footprint()` will
derive the Leontief inverse for you:

```{r}
footprint <- compute_footprint(
  z_mat = z_mat,
  x_vec = x_vec,
  y_mat = y_mat,
  extensions = extensions,
  labels = labels,
  fd_labels = fd_labels
)
footprint
```

#### With and without fd_labels

When `fd_labels` is provided (recommended), the result includes
`target_item` (the consumed product) and `target_fd` (the demand
category like `"food"` or `"other_uses"`). This uses the FABIO diagonal
decomposition approach.

When `fd_labels` is omitted, columns of Y are treated as individual
sectors. This is appropriate only for square Y matrices and does not
produce a `target_fd` column.

The result is a tidy tibble with columns:

| Column | Meaning |
|--------|---------|
| `origin_area` | Country where the pressure physically occurs |
| `origin_item` | Product causing the pressure |
| `target_area` | Country whose consumption drives the pressure |
| `target_item` | Product being consumed |
| `target_fd` | Demand category (only with `fd_labels`) |
| `value` | Footprint in extension units (e.g. hectares) |

### Analysing the results

Because the output is a tidy tibble, you can use standard dplyr
operations to answer questions:

**Which country's consumption drives the most land use?**

```{r}
footprint |>
  dplyr::summarise(
    total_land = sum(value), .by = target_area
  ) |>
  dplyr::arrange(dplyr::desc(total_land))
```

**How much of Spain's land use is driven by foreign consumption?**

```{r}
spain_code <- 203L

footprint |>
  dplyr::filter(origin_area == spain_code) |>
  dplyr::mutate(
    destination = ifelse(
      target_area == spain_code, "domestic", "exported"
    )
  ) |>
  dplyr::summarise(
    total_land = sum(value), .by = destination
  )
```

**What are the top bilateral flows?**

```{r}
footprint |>
  dplyr::summarise(
    total = sum(value),
    .by = c(origin_area, target_area)
  ) |>
  dplyr::arrange(dplyr::desc(total)) |>
  head(20)
```

**Which products drive the most land use in food consumption?**

```{r}
footprint |>
  dplyr::filter(target_fd == "food") |>
  dplyr::summarise(
    total_land = sum(value), .by = target_item
  ) |>
  add_item_cbs_name(code_column = "target_item") |>
  dplyr::arrange(dplyr::desc(total_land)) |>
  head(20)
```

### A conservation check

A useful sanity check: the total footprint should equal the total
extensions. All pressure must be attributed somewhere.

```{r}
# These two numbers should be equal (up to floating-point tolerance)
sum(footprint$value)
sum(extensions)
```

Small differences can arise from negative-output corrections or
column-sum capping in the technical coefficients matrix.

## Processing multiple years

Since `build_io_model()` returns one row per year, you can loop over
them:

```{r}
io <- build_io_model(years = 2010:2013)

land <- get_land_fp_production()

all_footprints <- purrr::pmap_dfr(
  list(
    io$Z, io$X, io$Y, io$labels,
    io$fd_labels, io$year
  ),
  function(z, x, y, lab, fdl, yr) {
    ext <- lab |>
      dplyr::left_join(
        land |>
          dplyr::filter(year == yr) |>
          dplyr::select(area_code, item_cbs_code, impact_u),
        by = c("area_code", "item_cbs_code")
      ) |>
      dplyr::mutate(impact_u = tidyr::replace_na(impact_u, 0)) |>
      dplyr::pull(impact_u)

    compute_footprint(
      z_mat = z, x_vec = x, y_mat = y,
      extensions = ext, labels = lab,
      fd_labels = fdl
    ) |>
      dplyr::mutate(year = yr)
  }
)
```

#### Parallel execution (optional)

Since each year is independent, you can parallelise using `furrr`:

```{r}
library(furrr)
future::plan(multisession, workers = 4)

all_footprints <- furrr::future_pmap_dfr(
  list(
    io$Z, io$X, io$Y, io$labels,
    io$fd_labels, io$year
  ),
  function(z, x, y, lab, fdl, yr) {
    ext <- lab |>
      dplyr::left_join(
        land |>
          dplyr::filter(year == yr) |>
          dplyr::select(area_code, item_cbs_code, impact_u),
        by = c("area_code", "item_cbs_code")
      ) |>
      dplyr::mutate(impact_u = tidyr::replace_na(impact_u, 0)) |>
      dplyr::pull(impact_u)

    compute_footprint(
      z_mat = z, x_vec = x, y_mat = y,
      extensions = ext, labels = lab,
      fd_labels = fdl
    ) |>
      dplyr::mutate(year = yr)
  }
)
```

## Summary

| Step | Function | Input | Output |
|------|----------|-------|--------|
| 1 | `build_io_model()` | Supply-use, trade, CBS (all optional --- fetched by default) | Z, Y, X, labels, fd_labels per year |
| 2 | `compute_leontief_inverse()` | Z, X | Leontief inverse L (small systems only) |
| 3 | `compute_footprint()` | L or Z, X, Y, extensions, labels, fd_labels | Tidy bilateral footprint |

## References

Bruckner, M., Wood, R., Moran, D., Kuschnig, N., Wieland, H.,
Maus, V., & Borner, J. (2019). FABIO --- The Construction of the Food
and Agriculture Biomass Input-Output Model. *Environmental Science &
Technology*, 53(19), 11302--11312.
https://doi.org/10.1021/acs.est.8b03704