--- title: "Environmental footprint analysis" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Environmental footprint analysis} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", eval = FALSE ) ``` ## What is a footprint analysis? A **footprint analysis** answers the question: *"How much of an environmental pressure (land, water, emissions...) caused in country A is ultimately driven by consumption in country B?"* Modern supply chains are global. Spain might import soybeans from Brazil, feed them to pigs, and export ham to France. A footprint analysis traces these flows backwards through the entire supply chain, so that the environmental cost of growing those soybeans is attributed to the French consumer who ate the ham --- not to the Brazilian farmer who grew the beans. This is done by building a **Multi-Regional Input-Output (MRIO) model**, a mathematical framework that captures all inter-industry and inter-country linkages in one system of equations. ## The three-step pipeline The whep package implements this in three steps: 1. **`build_io_model()`** --- Assemble the MRIO tables from supply-use data, bilateral trade, and commodity balance sheets. 2. **`compute_leontief_inverse()`** --- Compute the Leontief inverse, which captures both direct and indirect supply chain linkages. 3. **`compute_footprint()`** --- Multiply the Leontief inverse by environmental extension data to trace pressures through to final consumption. The methodology follows the [FABIO model](https://doi.org/10.1021/acs.est.9b03554) (Bruckner et al., 2019), which builds physical MRIO tables for the global food system. ## Concepts and terminology Before diving into the code, here are the key concepts: ### Supply-use tables and process groups A **supply table** records how much of each product each process produces. A **use table** records how much of each product each process consumes as input. Together they describe the technology of the economy. In whep, each process belongs to one of four **process groups**, which together cover all transformations in the food system: | Process group | What it does | Inputs | Outputs | |---|---|---|---| | `crop_production` | Grow crops | Seed (from CBS) | Harvested crop + residues (straw, etc.) | | `husbandry` | Raise livestock | Feed (crops, grass, etc.) | Live animals (heads) + non-slaughter products (milk, eggs, wool...) | | `slaughtering` | Slaughter animals | Live animals (heads) | Meat, offals, fats, hides and skins | | `processing` | Derive secondary products | Primary commodities (tonnes) | Processed products (oil, flour, sugar, etc.) | Each process is uniquely identified by a `(proc_group, proc_cbs_code)` pair. #### Mixed units in the supply-use system Most flows are measured in **tonnes**, but live animals are measured in **heads** (number of animals). This matches the FABIO convention. Different rows of the `Z` matrix can have different units because each row's entries are only compared to that row's total output `X[i]`. The technical coefficients $A_{ij} = Z_{ij} / X_j$ are dimensionless ratios, so the system remains mathematically consistent. For example, a slaughtering process for cattle: - **Use**: 1000 heads of cattle (heads) - **Supply**: 400 tonnes of beef + 50 tonnes of offals + 30 tonnes of hides (tonnes) The cattle heads and beef tonnes live in different rows of `Z`, each with its own unit. ### Bilateral trade A **bilateral trade matrix** records, for each product, how many tonnes (or heads, for live animals) flow from each exporting country to each importing country. This is what connects domestic economies into a global system. ### Commodity balance sheets (CBS) A **commodity balance sheet** is an accounting identity for each product in each country: $$ \text{Production} + \text{Imports} = \text{Exports} + \text{Food} + \text{Feed} + \text{Other uses} + \Delta\text{Stocks} $$ The right-hand categories (food, feed, other uses) form the **final demand** --- the end of the supply chain where products are consumed rather than being used as intermediate inputs. For live animals (which FAO does not include in its CBS data), whep constructs synthetic CBS entries from primary production data: - **Meat animals** (cattle, pigs, poultry...): all production goes to `processing` (i.e. they enter the slaughtering process). - **Draft animals** (horses, camels, asses...): all production goes to `other_uses`. ### The Z matrix (inter-industry flows) `Z` is the core of the MRIO model. It is a square matrix where each row and column represents a *(country, product)* pair. Entry $Z_{ij}$ gives the physical flow from sector $i$ to sector $j$. For example, if row 5 is *(Spain, wheat)* and column 12 is *(Spain, flour)*, then $Z_{5,12}$ is the amount of Spanish wheat used by Spanish flour mills. The model constructs `Z` using the **industry technology assumption**: $$ Z = U_{MR} \times T $$ where: - $U_{MR}$ is the multi-regional use matrix. It starts from single-country use tables and expands them with trade shares, so that each input can originate from any country. For example, if Spain's pig farming uses 1000 tonnes of soybeans, and 60% of Spain's soybean supply comes from Brazil, then 600 tonnes are attributed to Brazil and 400 to domestic production. - $T$ is the row-normalized supply matrix (the **transformation matrix**). Each row sums to 1 and encodes how each process distributes its output across products. This converts process-level flows to product-level flows. ### Trade shares For each product, trade shares tell us *where each country gets its supply from*. If Spain consumes 1000 tonnes of soybeans, and 600 come from domestic production while 400 come from Brazil, then the trade share from Brazil to Spain is 0.4 and the domestic share is 0.6. These shares are computed from the bilateral trade matrices combined with domestic production data from the CBS. They are used to expand the single-country use tables into a multi-regional system where each input can originate from any country. ### The Y matrix (final demand) `Y` is the final demand matrix. Rows are *(country, product)* pairs (same as `Z`), and columns are *(country, demand category)* pairs. Entry $Y_{ij}$ tells us how much of product $i$ goes to final demand category $j$. The demand categories are derived from the CBS columns present in the data, typically: `food`, `other_uses`, and `stock_addition`. Like `Z`, the final demand is expanded with trade shares so that, for example, food consumption of wheat in France can partly originate from French wheat and partly from imported wheat. ### The X vector (total output) `X` is a vector where $X_i$ is the total output of sector $i$. By construction: $$ X = Z \cdot \mathbf{1} + Y \cdot \mathbf{1} $$ meaning total output equals intermediate deliveries plus final demand. ### FABIO adjustments After the initial construction of `Z` and `Y`, the model applies several corrections following the FABIO methodology: 1. **Leftover redistribution**: When CBS reports `processing` or `seed` use but the supply-use table does not fully account for it, the unmatched quantity is redistributed to the `food` category in `Y`. This prevents demand from vanishing. 2. **Loss endogenization** (optional, via `endogenize_losses = TRUE`): If the CBS contains a `losses` column, losses are moved from `Y` to the diagonal of `Z`. This treats losses as self-use within each sector (the commodity is "consumed" by its own production process) rather than as final demand. 3. **Diagonal rebalancing**: When $\text{diag}(Z)_i \geq X_i$ (the diagonal accounts for all or more than total output, usually due to reporting errors), 80% of the diagonal value is moved to final demand and 20% is kept on the diagonal, proportionally distributed across the sector's demand categories. 4. **Negative output fixing**: Any sectors with $X_i < 0$ (which can arise from trade data inconsistencies) are corrected by adjusting negative entries in `Y`. ### The Leontief inverse (L) The **Leontief inverse** $L = (I - A)^{-1}$ is the key analytical tool. The technical coefficients matrix $A$ is defined as $A_{ij} = Z_{ij} / X_j$, representing how much input from sector $i$ is needed per unit of output from sector $j$. Before inversion, two corrections are applied: - **Negative entries in $A$ are set to zero**: these arise from data inconsistencies and would distort the inverse. - **Column sums of $A$ are capped at 1**: this ensures $(I - A)$ is invertible even when reported inputs exceed reported output. The Leontief inverse captures the **total requirements** --- both direct and indirect --- needed to deliver one unit of final demand. Entry $L_{ij}$ tells us: *"To deliver one extra unit of product $j$ to final consumers, how much total output does sector $i$ need to produce (including all the intermediate steps)?"* This is what makes footprint tracing possible. A consumer buying bread in France triggers not just French bakeries, but also French mills, French wheat farmers, Brazilian soybean exporters (if feed is involved), and so on. The Leontief inverse encodes all of these ripple effects in a single matrix. ### Environmental extensions An **environmental extension** is any pressure you want to trace through the supply chain. Common examples: - **Land use** (hectares): How many hectares of cropland does each sector use? - **Water use** (cubic meters): How much blue water is consumed? - **Greenhouse gas emissions** (tonnes CO2-eq): How much does each sector emit? - **Nitrogen surplus** (tonnes N): How much excess nitrogen enters the environment? You provide these as a vector $e$ with one value per *(country, product)* sector. ### The footprint The footprint matrix is computed as: $$ FP = \hat{f} \cdot L \times Y $$ where $\hat{f}$ is a diagonal matrix of **extension intensities** $f_i = e_i / X_i$ (pressure per unit of output). Each entry $FP_{ij}$ tells us: *"How much environmental pressure occurring in sector $i$ is driven by final demand category $j$?"* When `fd_labels` is provided, the footprint uses the **FABIO diagonal approach**: for each final demand column $j$, the function computes $MP \cdot \text{diag}(y_j) \cdot G$, where $G$ groups sectors by item. This decomposes the footprint by both the target item being consumed and the demand category (food, other uses, etc.). The resulting tidy tibble decomposes this by origin country/product and target country/product, giving you a full bilateral footprint map. ## Running the analysis ### Step 1: Build the IO model The simplest call uses default arguments --- the function fetches all input data internally: ```{r} library(whep) io <- build_io_model() ``` This downloads supply-use tables, bilateral trade, and CBS data, then builds the IO model for all available years. To restrict to specific years: ```{r} io <- build_io_model(years = c(2010, 2015)) ``` If you want to inspect or pre-filter the input data, you can build them separately and pass them in: ```{r} supply_use <- build_supply_use() bilateral_trade <- get_bilateral_trade() cbs <- get_wide_cbs() io <- build_io_model(supply_use, bilateral_trade, cbs, years = 2010) ``` To endogenize losses (move them from final demand to the diagonal of `Z`): ```{r} io <- build_io_model(endogenize_losses = TRUE) ``` The result is a tibble with one row per year. Each row contains: - `Z`: The inter-industry flow matrix (sparse). - `Y`: The final demand matrix (sparse). - `X`: The total output vector. - `labels`: A tibble mapping row/column indices to `area_code` and `item_cbs_code`. - `fd_labels`: A tibble mapping each Y column to its `area_code` (consuming country) and `fd_col` (demand category). You can inspect a single year: ```{r} # Extract matrices for one year z_mat <- io$Z[[1]] y_mat <- io$Y[[1]] x_vec <- io$X[[1]] labels <- io$labels[[1]] fd_labels <- io$fd_labels[[1]] selected_year <- io$year[[1]] # Dimensions: (n_countries * n_products) x (n_countries * n_products) dim(z_mat) # Label mapping labels # Final demand column labels fd_labels ``` ### Step 2: Compute the Leontief inverse For **small systems** (< 5000 sectors), you can pre-compute L: ```{r} l_inv <- compute_leontief_inverse(z_mat, x_vec) dim(l_inv) ``` This inverts the $(I - A)$ matrix. Column sums of $A$ are internally capped at 1 to prevent singularity, and negative entries are zeroed. #### For large systems For systems larger than ~5000 sectors (e.g. 192 countries x 125+ products = 24 000 sectors), computing the explicit Leontief inverse requires too much memory (e.g. 4.8 GiB for 25 000 sectors). In those cases, pass `z_mat` directly to `compute_footprint()`, which solves $(I - A)x = Y$ without ever materialising the full L. ### Step 3: Compute the footprint Now provide your environmental extension vector. This must have one value per sector, in the same order as the labels. For land use, you can use `get_land_fp_production()`: ```{r} land_use_data <- get_land_fp_production() |> dplyr::filter(year == selected_year) |> dplyr::select(area_code, item_cbs_code, hectares = impact_u) # Match it to the IO model's label order extensions <- labels |> dplyr::left_join( land_use_data, by = c("area_code", "item_cbs_code") ) |> dplyr::mutate( hectares = tidyr::replace_na(hectares, 0) ) |> dplyr::pull(hectares) ``` #### Dense path (pre-computed L, small systems) If you pre-computed L: ```{r} footprint <- compute_footprint( l_inv, x_vec, y_mat, extensions, labels, fd_labels = fd_labels ) footprint ``` #### Using Z directly (large systems) Alternatively, pass the Z matrix and `compute_footprint()` will derive the Leontief inverse for you: ```{r} footprint <- compute_footprint( z_mat = z_mat, x_vec = x_vec, y_mat = y_mat, extensions = extensions, labels = labels, fd_labels = fd_labels ) footprint ``` #### With and without fd_labels When `fd_labels` is provided (recommended), the result includes `target_item` (the consumed product) and `target_fd` (the demand category like `"food"` or `"other_uses"`). This uses the FABIO diagonal decomposition approach. When `fd_labels` is omitted, columns of Y are treated as individual sectors. This is appropriate only for square Y matrices and does not produce a `target_fd` column. The result is a tidy tibble with columns: | Column | Meaning | |--------|---------| | `origin_area` | Country where the pressure physically occurs | | `origin_item` | Product causing the pressure | | `target_area` | Country whose consumption drives the pressure | | `target_item` | Product being consumed | | `target_fd` | Demand category (only with `fd_labels`) | | `value` | Footprint in extension units (e.g. hectares) | ### Analysing the results Because the output is a tidy tibble, you can use standard dplyr operations to answer questions: **Which country's consumption drives the most land use?** ```{r} footprint |> dplyr::summarise( total_land = sum(value), .by = target_area ) |> dplyr::arrange(dplyr::desc(total_land)) ``` **How much of Spain's land use is driven by foreign consumption?** ```{r} spain_code <- 203L footprint |> dplyr::filter(origin_area == spain_code) |> dplyr::mutate( destination = ifelse( target_area == spain_code, "domestic", "exported" ) ) |> dplyr::summarise( total_land = sum(value), .by = destination ) ``` **What are the top bilateral flows?** ```{r} footprint |> dplyr::summarise( total = sum(value), .by = c(origin_area, target_area) ) |> dplyr::arrange(dplyr::desc(total)) |> head(20) ``` **Which products drive the most land use in food consumption?** ```{r} footprint |> dplyr::filter(target_fd == "food") |> dplyr::summarise( total_land = sum(value), .by = target_item ) |> add_item_cbs_name(code_column = "target_item") |> dplyr::arrange(dplyr::desc(total_land)) |> head(20) ``` ### A conservation check A useful sanity check: the total footprint should equal the total extensions. All pressure must be attributed somewhere. ```{r} # These two numbers should be equal (up to floating-point tolerance) sum(footprint$value) sum(extensions) ``` Small differences can arise from negative-output corrections or column-sum capping in the technical coefficients matrix. ## Processing multiple years Since `build_io_model()` returns one row per year, you can loop over them: ```{r} io <- build_io_model(years = 2010:2013) land <- get_land_fp_production() all_footprints <- purrr::pmap_dfr( list( io$Z, io$X, io$Y, io$labels, io$fd_labels, io$year ), function(z, x, y, lab, fdl, yr) { ext <- lab |> dplyr::left_join( land |> dplyr::filter(year == yr) |> dplyr::select(area_code, item_cbs_code, impact_u), by = c("area_code", "item_cbs_code") ) |> dplyr::mutate(impact_u = tidyr::replace_na(impact_u, 0)) |> dplyr::pull(impact_u) compute_footprint( z_mat = z, x_vec = x, y_mat = y, extensions = ext, labels = lab, fd_labels = fdl ) |> dplyr::mutate(year = yr) } ) ``` #### Parallel execution (optional) Since each year is independent, you can parallelise using `furrr`: ```{r} library(furrr) future::plan(multisession, workers = 4) all_footprints <- furrr::future_pmap_dfr( list( io$Z, io$X, io$Y, io$labels, io$fd_labels, io$year ), function(z, x, y, lab, fdl, yr) { ext <- lab |> dplyr::left_join( land |> dplyr::filter(year == yr) |> dplyr::select(area_code, item_cbs_code, impact_u), by = c("area_code", "item_cbs_code") ) |> dplyr::mutate(impact_u = tidyr::replace_na(impact_u, 0)) |> dplyr::pull(impact_u) compute_footprint( z_mat = z, x_vec = x, y_mat = y, extensions = ext, labels = lab, fd_labels = fdl ) |> dplyr::mutate(year = yr) } ) ``` ## Summary | Step | Function | Input | Output | |------|----------|-------|--------| | 1 | `build_io_model()` | Supply-use, trade, CBS (all optional --- fetched by default) | Z, Y, X, labels, fd_labels per year | | 2 | `compute_leontief_inverse()` | Z, X | Leontief inverse L (small systems only) | | 3 | `compute_footprint()` | L or Z, X, Y, extensions, labels, fd_labels | Tidy bilateral footprint | ## References Bruckner, M., Wood, R., Moran, D., Kuschnig, N., Wieland, H., Maus, V., & Borner, J. (2019). FABIO --- The Construction of the Food and Agriculture Biomass Input-Output Model. *Environmental Science & Technology*, 53(19), 11302--11312. https://doi.org/10.1021/acs.est.8b03704