---
title: "Introduction to pixieweb"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Introduction to pixieweb}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
editor_options:
  chunk_output_type: console
---

```{r opts, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

`pixieweb` is an R package for *discovering*, *inspecting* and *downloading* statistical data from [PX-Web](https://www.scb.se/px-web) APIs, the platform used by Statistics Sweden (SCB), Statistics Norway (SSB), Statistics Finland, and many other national statistics agencies.

This vignette provides an overview of the methods included in the `pixieweb` package and the design principles of the package API. For details on individual functions and a full list of what is included, see the [Reference section of the package homepage](https://lchansson.github.io/pixieweb/reference/index.html) or run `??pixieweb`. For a quick introduction to the package, see the vignette [Quick start guide to pixieweb](a-quickstart.html).

> **Note on pxweb:** The excellent [pxweb](https://cran.r-project.org/package=pxweb)
> package by rOpenGov already provides comprehensive R access to
> PX-Web APIs. `pixieweb` is not a replacement; it offers an
> *alternative paradigm* built around search-then-fetch discovery and
> progressive disclosure. Choose the workflow that fits your needs.

The design of `pixieweb` functions is inspired by the design and functionality provided by several packages in the `tidyverse` family. `pixieweb` uses the base R pipe (`|>`) throughout. Some vignette examples also use `dplyr`, `tidyr`, and `ggplot2` for data wrangling and visualisation. To install `pixieweb`:

```{r install, eval = FALSE}
install.packages("pixieweb")
```

## PX-Web, a platform for national statistical databases

[PX-Web](https://www.scb.se/px-web) is the statistical database platform used by national statistics agencies across the Nordic countries and beyond.
Each agency runs its own instance (Statistics Sweden at [scb.se](https://www.scb.se), Statistics Norway at [ssb.no](https://www.ssb.no), Statistics Finland at [stat.fi](https://stat.fi/), and many more), but they all share the same underlying API, which comes in two versions:

- **v1**: the legacy API, still widely deployed. POST-only data queries, no search endpoint, table discovery requires walking a folder hierarchy.
- **v2**: the modern API, launched by SCB in 2024. GET+POST data queries, full-text search, codelists endpoint, server-side saved queries.

`pixieweb` handles both versions transparently: the user-facing functions have the same signatures, and only the internal request building differs.

To get started with PX-Web you might want to visit the web frontend of any participating agency, or read through the [PxWebApi2 documentation](https://github.com/PxTools/PxApiSpecs) (English). However, you can also use the `pixieweb` package to explore data without prior knowledge of the database.

```{r setup}
library("pixieweb")
```

### The data model

Data in PX-Web are stored as *multi-dimensional data cubes*. Each agency publishes hundreds or thousands of **tables**, and each table defines its own set of **variables** (dimensions) along which observations are indexed. A typical population table might, for instance, be indexed by region, sex, age and year, while a foreign trade table might be indexed by partner country, commodity group and month.

When downloading data, the user needs to specify which values to include along each variable, or omit the variable entirely, in which case the API returns a pre-computed aggregate (this is called *elimination*). Every table has its own rules for which variables are *mandatory* and which are *eliminable*.

In summary, a PX-Web table is organised along four basic concepts:

- An **API instance**: a particular agency's PX-Web database (e.g.
  SCB, SSB)
- A **table**: a single statistical data cube published by that agency
- A set of **variables**: the dimensions of the cube (region, time, sex, product, etc.); each variable has a set of valid **values**
- One or more **content codes**: what is actually being measured in each cell (population count, deaths, tax revenue, GDP, ...)

Tables with multiple content codes can return data in either *long* format (one row per observation, one column for the content code) or *wide* format (one column per content code); see the section on wide output below.

Additionally, many variables come with one or more **codelists**: alternative groupings of the values that can be used to aggregate data on the fly. For example, a "Region" variable with 290 municipalities might offer codelists that group them into 21 counties, or into eight NUTS-2 regions.

`pixieweb` provides a function family for each of these concepts:

| Level        | Discover             | Search              | Describe              | Extract                  | Values              |
|--------------|----------------------|---------------------|-----------------------|--------------------------|---------------------|
| API instance | `px_api_catalogue()` | -                   | -                     | -                        | -                   |
| Table        | `get_tables()`       | `table_search()`    | `table_describe()`    | `table_extract_ids()`    | -                   |
| Variable     | `get_variables()`    | `variable_search()` | `variable_describe()` | `variable_extract_ids()` | `variable_values()` |
| Codelist     | `get_codelists()`    | -                   | `codelist_describe()` | `codelist_extract_ids()` | `codelist_values()` |
| Data         | `get_data()`         | -                   | -                     | -                        | -                   |

### Connecting to an API

All `pixieweb` workflows start by connecting to a PX-Web instance. `px_api()` accepts a short alias (`"scb"`, `"ssb"`, `"statfi"`, ...)
or a full URL:

```{r api_mock, eval = FALSE}
# Known aliases
scb <- px_api("scb", lang = "en")
ssb <- px_api("ssb", lang = "en")

# Or a custom URL
custom <- px_api("https://my.statbank.example/api/v2/", lang = "en")

# See all known APIs
px_api_catalogue()
```

```{r api, echo = FALSE}
scb <- pixieweb:::vd_scb
catalogue <- pixieweb:::vd_catalogue
catalogue
```

For cross-country comparison and a fuller tour of the multi-API catalogue, see `vignette("multi-api")`.

### Discovering tables: `get_tables()`

Tables are the central entity. On v2 APIs, `get_tables()` sends a server-side search query; on v1 APIs, it walks the folder tree. In both cases the result is a tibble with rich metadata, one row per table:

```{r tables_mock, eval = FALSE}
# Search the SCB catalogue for population-related tables
tables <- get_tables(scb, query = "population")
head(tables)
```

```{r tables, echo = FALSE}
tables <- pixieweb:::vd_tables
tibble::tibble(head(tables))
```

The table tibble includes subject path, time period range, time unit, and data source, all of which are searchable by `table_search()` (a client-side filter, analogous to `kpi_search()` in `rKolada`).

```{r table_search}
# Narrow down to tables about municipalities
tables |>
  table_search("municipal") |>
  table_describe(max_n = 2, format = "md", heading_level = 4)
```

`table_describe()` prints a human-readable summary of each table. As with all `describe` functions in the `pixieweb`/`rKolada`/`rTrafa` family, you can set `format = "md"` to produce Markdown output that can be embedded directly in an R Markdown document by setting the chunk option `results = 'asis'`.
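
To make the idea of a client-side filter concrete, here is a minimal base R sketch of that kind of search over an already-downloaded metadata table. It is not how `table_search()` is implemented: the data frame `tables_sketch`, its column names, and the helper `search_tables_sketch()` are all hypothetical and exist only for illustration.

```r
# Hypothetical table metadata; IDs, titles and column names are made up
tables_sketch <- data.frame(
  id = c("TAB001", "TAB002", "TAB003"),
  title = c(
    "Population by municipality and year",
    "Deaths by county and sex",
    "Municipal tax revenue by year"
  ),
  subject_path = c("Population", "Population", "Public finances"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep rows where the query matches (case-insensitively)
# in at least one of the chosen text columns. No network request is made;
# the filter runs entirely on data that is already in memory.
search_tables_sketch <- function(tbl, query,
                                 columns = c("title", "subject_path")) {
  per_column <- lapply(
    columns,
    function(col) grepl(query, tbl[[col]], ignore.case = TRUE)
  )
  hits <- Reduce(`|`, per_column)
  tbl[hits, , drop = FALSE]
}

municipal <- search_tables_sketch(tables_sketch, "municipal")
municipal$id
```

Because the filter only touches the local data frame, it can be chained and repeated freely without sending further requests to the API.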

### Exploring variables: `get_variables()`

Once you have chosen a table, inspect its variables with `get_variables()`:

```{r variables_mock, eval = FALSE}
vars <- get_variables(scb, "TAB638")
vars |> variable_describe()
```

```{r variables, echo = FALSE}
vars <- pixieweb:::vd_variables
vars |> variable_describe()
```

Each variable has a number of important properties worth knowing about:

- **`elimination`**: can this variable be left out of your `get_data()` call? If `TRUE`, omitting the variable means the API returns a pre-computed total (e.g. omitting "Sex" gives the total for all sexes). If `FALSE`, the variable is **mandatory** and must be included in the query.
- **`time`**: is this the time dimension of the cube? (There is always exactly one.)
- **`values`**: the set of available codes and their human-readable labels.
- **`codelists`**: alternative groupings of the values (see the "Codelists" section below).

To inspect the available values for a specific variable, use `variable_values()`:

```{r variable_values_mock, eval = FALSE}
vars |> variable_values("Region")
```

```{r variable_values, echo = FALSE}
pixieweb:::vd_region_values
```

### Downloading data: `get_data()`

`get_data()` is the workhorse function. Provide an API object, a table ID, and one argument per variable you want to include in the query:

```{r get_data_mock, eval = FALSE}
pop <- get_data(scb, "TAB638",
  Region = c("0180", "1480", "1280"),
  ContentsCode = "*",
  Tid = px_top(5)
)
head(pop)
```

```{r get_data, echo = FALSE}
pop <- pixieweb:::vd_pop
tibble::tibble(head(pop))
```

Variables you **omit** are *eliminated* (aggregated) if the API allows it. If a variable is mandatory, you must include it; otherwise `get_data()` raises an informative error.

#### Selection helpers

Often you don't want to hardcode specific codes.
`pixieweb` provides a family of selection helpers that translate common wishes into API-compatible arguments:

| Helper            | Meaning                           | API versions |
|-------------------|-----------------------------------|--------------|
| `c("0180")`       | Specific values                   | v1, v2       |
| `"*"`             | All values                        | v1, v2       |
| `px_top(5)`       | First N values (e.g. most recent) | v1, v2       |
| `px_bottom(3)`    | Last N values                     | v2 only      |
| `px_from("2020")` | From value onward                 | v2 only      |
| `px_to("2023")`   | Up to value                       | v2 only      |
| `px_range(a, b)`  | Inclusive range                   | v2 only      |

### Metadata-driven queries: `prepare_query()`

Instead of downloading data with codes you already know, you can let the table's metadata drive the query. For interactive exploration, `prepare_query()` inspects the table's variable metadata and builds a query with sensible defaults, so that you only need to specify the variables you care about:

```{r prepare_query_mock, eval = FALSE}
q <- prepare_query(scb, "TAB638",
  Region = c("0180", "1480", "1280")
)
q
```

```{r prepare_query, echo = FALSE}
q <- pixieweb:::vd_prepared_query
q
```

The default strategy is:

- **Content code**: all values (`"*"`)
- **Time variable**: latest 10 periods (`px_top(10)`)
- **Eliminable variables**: omitted (API aggregates)
- **Small mandatory variables** (≤ 22 values): all (`"*"`)
- **Large mandatory variables**: first value (`px_top(1)`)

With `maximize_selection = TRUE`, the function expands unspecified variables to include as many values as possible while staying under the API's cell limit.

Once you're happy, pass the validated query to `get_data()`:

```{r get_data_from_query, eval = FALSE}
pop <- get_data(scb, query = q)
```

## Codelists

Codelists provide alternative groupings of variable values. They are useful when you want data at a different aggregation level than the table's default.
For example, a "Region" variable with 290 Swedish municipalities might have a codelist that groups them into 21 counties (`vs_RegionLän07`):

```{r codelists_mock, eval = FALSE}
cls <- get_codelists(scb, "TAB638", "Region")
cls |> codelist_describe(max_n = 3)
```

```{r codelists, echo = FALSE}
cls <- pixieweb:::vd_codelists
cls |> codelist_describe(max_n = 3)
```

Apply a codelist in a query via the `.codelist` argument:

```{r codelist_use, eval = FALSE}
# Fetch data aggregated to counties instead of municipalities
get_data(scb, "TAB638",
  Region = "*",
  Tid = px_top(5),
  ContentsCode = "*",
  .codelist = list(Region = "vs_RegionLän07")
)
```

## Wide output and multiple content codes

When a table has multiple content codes (e.g. both Population and Deaths), the default long format has one row per content code per observation. Use `.output = "wide"` to pivot the content codes into separate columns, which is useful when you want to compute with several measures (e.g. death rate = Deaths / Population):

```{r wide_mock, eval = FALSE}
demo <- get_data(scb, "TAB638",
  Region = "0180",
  Tid = px_top(5),
  ContentsCode = "*",
  .output = "wide"
)
demo
```

```{r wide, echo = FALSE}
pixieweb:::vd_wide
```

## Visualising results

To explore results we can plot the downloaded data using `ggplot2`. Note that we convert the time column to a `Date` first: years stored as integers can produce awkward ggplot breaks like "2020, 2022.5, 2025", and a proper `Date` column lets `scale_x_date()` place tick marks on whole years.
This is a pattern you will want to use for any time-series analysis across all three sibling packages (`rKolada`, `pixieweb`, `rTrafa`):

```{r plot, fig.width = 7, fig.height = 4}
library("ggplot2")

pop_plot <- pop |>
  # Keep only the Population content code (the table also has
  # "Population growth"); convert year to Date for nice axis breaks
  dplyr::filter(ContentsCode == "BE0101N1") |>
  dplyr::mutate(year = as.Date(paste0(Tid, "-01-01")))

ggplot(pop_plot, aes(year, value, colour = Region_text)) +
  # One line per region; linetype adds distinction in B/W print
  geom_line(aes(linetype = Region_text), linewidth = 1) +
  geom_point(size = 2) +
  # Years as dates, one tick per year
  scale_x_date(date_breaks = "1 year", date_labels = "%Y") +
  scale_y_continuous(labels = scales::comma) +
  # Colour-blind-friendly palette
  scale_colour_viridis_d(option = "B", end = 0.8) +
  labs(
    title = "Population, Sweden's three most populous municipalities",
    x = "Year",
    y = "Population",
    colour = NULL,
    linetype = NULL,
    caption = px_cite(pop)
  ) +
  theme_minimal() +
  theme(legend.position = "top")
```

> **More on ggplot2?** See .

## Advanced: query composition

For full control over the HTTP request (useful for debugging, or when you need to inspect or modify the exact query before sending it), use the low-level query composers:

```{r compose, eval = FALSE}
q <- compose_data_query(scb, "TAB638",
  Region = c("0180"),
  ContentsCode = "*",
  Tid = px_top(3)
)

# Inspect the query
q$url
q$body

# Modify and execute
raw <- execute_query(scb, q$url, q$body)
```

## Saved queries (v2 only)

PX-Web v2 supports server-side stored queries, which are useful for recurring reports. Save a query once, then retrieve it by ID later:

```{r saved, eval = FALSE}
# Save a query
id <- save_query(scb, "TAB638",
  Region = "0180",
  Tid = px_top(5),
  ContentsCode = "*"
)

# Retrieve later
get_saved_query(scb, id)
```

## Citation

Always cite your data sources.
`px_cite()` generates a citation string from the metadata attached to a downloaded data frame:

```{r cite_mock, eval = FALSE}
px_cite(pop)
```

```{r cite, echo = FALSE}
pixieweb:::vd_citation
```

## Related packages

`pixieweb` is part of a family of R packages for Swedish and Nordic open statistics that share the same design philosophy:

- [rKolada](https://lchansson.github.io/rKolada/): R client for the [Kolada](https://kolada.se/) database of Swedish municipal and regional Key Performance Indicators
- [rTrafa](https://lchansson.github.io/rTrafa/): R client for the [Trafa](https://api.trafa.se/) API of Swedish transport statistics

See also [pxweb](https://cran.r-project.org/package=pxweb), the original and established PX-Web client for R, by rOpenGov.