Gap Minder Sample Queries

Author

JJB + Course

The following Quarto document requires additional R packages. Please run the code chunk before continuing.

pkg_list = c("ggplot2", "DBI", "RSQLite", "readr", "gifski")
mia_pkgs = pkg_list[!(pkg_list %in% installed.packages()[,"Package"])]
if(length(mia_pkgs) > 0) install.packages(mia_pkgs)
loaded_pkgs = lapply(pkg_list, require, character.only=TRUE)

Real-Data: Gapminder

For the next section, our focus shifts to working with the Gapminder dataset created by Hans Rosling from the Gapminder Foundation. The Gapminder data set represents a longitudinal study of various global developmental indicators over multiple countries across 3 centuries.

Download Gapminder Data

Let’s use curl a shell command to download the gap-every-five-years.csv data file at https://coatless.github.io/raw-data/gap-every-five-years.csv.

curl -o gap-every-five-years.csv -LJ https://coatless.github.io/raw-data/gap-every-five-years.csv
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed

  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100 70295  100 70295    0     0   708k      0 --:--:-- --:--:-- --:--:--  771k

Import Data into SQLite

Prior to working with the data in SQL, we will need to import it into the database. We achieve this by first creating a file-backed database called gapminder_db.sqlite3:

# Setup a connection into a file-backed SQLite database
gm_db = DBI::dbConnect(RSQLite::SQLite(), "gapminder_db.sqlite3")

Next, we read the data first into R and, then, using the dbWriteTable() function to write the data as the gapminder table into the database.

# Iteratively read in data to the database.
readr::read_csv_chunked(
    # Location of the data file
    file = "gap-every-five-years.csv",  
    
    # Process the data after it has been read in
    callback = function(chunk, dummy) {
        
        # Write data to the `gapminder` table iteratively
        dbWriteTable(gm_db, "gapminder", chunk, append = TRUE)
        
    },
    
    # Specify how much data should be written to a database in each chunk
    chunk_size = 1000
)

── Column specification ────────────────────────────────────────────────────────
cols(
  country = col_character(),
  year = col_double(),
  lifeExp = col_double(),
  pop = col_double(),
  gdpPercap = col_double()
)
NULL

Next, let’s look at the data inside of the gapminder table.

df = dbReadTable(gm_db, "gapminder")
head(df, 10)
       country year lifeExp      pop gdpPercap
1  Afghanistan 1952  28.801  8425333  779.4453
2  Afghanistan 1957  30.332  9240934  820.8530
3  Afghanistan 1962  31.997 10267083  853.1007
4  Afghanistan 1967  34.020 11537966  836.1971
5  Afghanistan 1972  36.088 13079460  739.9811
6  Afghanistan 1977  38.438 14880372  786.1134
7  Afghanistan 1982  39.854 12881816  978.0114
8  Afghanistan 1987  40.822 13867957  852.3959
9  Afghanistan 1992  41.674 16317921  649.3414
10 Afghanistan 1997  41.763 22227415  635.3414

Finally, we stop the database connection:

DBI::dbDisconnect(gm_db)

Load SQLite Database from Disk

Next, we’ll load the SQLite database we just created from disk.

# Setup a connection into a file-backed SQLite database
gm_db = DBI::dbConnect(RSQLite::SQLite(), "gapminder_db.sqlite3")

For each of the next SQL code chunks, we’re using the following options:

```{sql}
#| label: name-of-code-chunk
#| connection: gm_db
```

Note: We’re using gm_db as that is the name of the connection object to the database just established in the preceding code chunk.

Querying Real Data

Let’s begin by looking at a few exploratory components.

First, how many observations or records are in the Table?

SELECT COUNT(*) FROM gapminder
1 records
COUNT(*)
30672

What do the first five records of each field look like?

SELECT * FROM gapminder LIMIT 5;
5 records
country year lifeExp pop gdpPercap
Afghanistan 1952 28.801 8425333 779.4453
Afghanistan 1957 30.332 9240934 820.8530
Afghanistan 1962 31.997 10267083 853.1007
Afghanistan 1967 34.020 11537966 836.1971
Afghanistan 1972 36.088 13079460 739.9811

What countries are in the data set?

SELECT DISTINCT country FROM gapminder LIMIT 10;
Displaying records 1 - 10
country
Afghanistan
Albania
Algeria
Angola
Argentina
Australia
Austria
Bahrain
Bangladesh
Belgium
SELECT COUNT(DISTINCT country) FROM gapminder;
1 records
COUNT(DISTINCT country)
142

What does the population look if we rescale by thousands?

SELECT country, year, pop/1000.0 FROM gapminder LIMIT 5;
5 records
country year pop/1000.0
Afghanistan 1952 8425.333
Afghanistan 1957 9240.934
Afghanistan 1962 10267.083
Afghanistan 1967 11537.966
Afghanistan 1972 13079.460

We probably want a better variable than pop/1000.0. We can change the name by placing after the modification AS <new-field>:

SELECT country, year, pop/1000.0 AS rescaled_pop FROM gapminder LIMIT 5;
5 records
country year rescaled_pop
Afghanistan 1952 8425.333
Afghanistan 1957 9240.934
Afghanistan 1962 10267.083
Afghanistan 1967 11537.966
Afghanistan 1972 13079.460

How many people are represented in the data set?

SELECT SUM(pop) FROM gapminder; 
1 records
SUM(pop)
907928384418

What does the GDP per capita across the years look like?

SELECT country, year, ROUND(gdpPercap, 2) AS gdp FROM gapminder;
Displaying records 1 - 10
country year gdp
Afghanistan 1952 779.45
Afghanistan 1957 820.85
Afghanistan 1962 853.10
Afghanistan 1967 836.20
Afghanistan 1972 739.98
Afghanistan 1977 786.11
Afghanistan 1982 978.01
Afghanistan 1987 852.40
Afghanistan 1992 649.34
Afghanistan 1997 635.34

What are the features of Iceland, Germany, and Spain after 2000?

SELECT *
FROM gapminder
WHERE (year >= 2000) AND (country IN ("Iceland", "Germany", "Spain"));
Displaying records 1 - 10
country year lifeExp pop gdpPercap
Germany 2002 78.670 82350671 30035.80
Germany 2007 79.406 82400996 32170.37
Iceland 2002 80.500 288030 31163.20
Iceland 2007 81.757 301931 36180.79
Spain 2002 79.780 40152517 24835.47
Spain 2007 80.941 40448191 28821.06
Germany 2002 78.670 82350671 30035.80
Germany 2007 79.406 82400996 32170.37
Iceland 2002 80.500 288030 31163.20
Iceland 2007 81.757 301931 36180.79

Visualization

What if we wanted to visualize the data?

Well, there are built-in methods associated with the sql results… But, let’s take an interactive look at the data. To do so, we’ll use the gganimate package to rapidly make a graph.

Next, let’s query all of the information from the table.

```{sql}
#| label: retrieve-table-from-db
#| connection: gm_db
#| echo: true
#| output.var: gm_df_from_db
SELECT * FROM gapminder;
```

Note, we’re exporting the data as: gm_df_from_db

With the query done, we turn our focus to creating our plot that compares life expectancy vs. GDP per capita under all countries in Gap minder across time.

library(ggplot2)
library(gganimate)

p = ggplot(gm_df_from_db) +
    aes(gdpPercap, lifeExp, size = pop, colour = country) +
    geom_point(show.legend = FALSE, alpha = 0.7) +
    scale_color_viridis_d() +
    scale_size(range = c(2, 12)) +
    scale_x_log10() +
    theme(legend.position = 'none') +
    labs(
        title = 'Year: {frame_time}', 
        x = 'GDP per Capita', 
        y = 'Life Expectancy'
    ) +
    transition_time(year) +
    ease_aes('linear')

animate(p)