= c("ggplot2", "DBI", "RSQLite", "readr", "gifski")
pkg_list = pkg_list[!(pkg_list %in% installed.packages()[,"Package"])]
mia_pkgs if(length(mia_pkgs) > 0) install.packages(mia_pkgs)
= lapply(pkg_list, require, character.only=TRUE) loaded_pkgs
Gap Minder Sample Queries
The following Quarto document requires additional R packages. Please run the code chunk before continuing.
Real-Data: Gapminder
For the next section, our focus shifts to working with the Gapminder dataset created by Hans Rosling from the Gapminder Foundation. The Gapminder data set represents a longitudinal study of various global developmental indicators over multiple countries across 3 centuries.
Download Gapminder Data
Let’s use curl
a shell command to download the gap-every-five-years.csv
data file at https://coatless.github.io/raw-data/gap-every-five-years.csv.
curl -o gap-every-five-years.csv -LJ https://coatless.github.io/raw-data/gap-every-five-years.csv
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0
100 70295 100 70295 0 0 708k 0 --:--:-- --:--:-- --:--:-- 771k
Import Data into SQLite
Prior to working with the data in SQL, we will need to import it into the database. We achieve this by first creating a file-backed database called gapminder_db.sqlite3
:
# Setup a connection into a file-backed SQLite database
= DBI::dbConnect(RSQLite::SQLite(), "gapminder_db.sqlite3") gm_db
Next, we read the data first into R and, then, using the dbWriteTable()
function to write the data as the gapminder
table into the database.
# Iteratively read in data to the database.
::read_csv_chunked(
readr# Location of the data file
file = "gap-every-five-years.csv",
# Process the data after it has been read in
callback = function(chunk, dummy) {
# Write data to the `gapminder` table iteratively
dbWriteTable(gm_db, "gapminder", chunk, append = TRUE)
},
# Specify how much data should be written to a database in each chunk
chunk_size = 1000
)
── Column specification ────────────────────────────────────────────────────────
cols(
country = col_character(),
year = col_double(),
lifeExp = col_double(),
pop = col_double(),
gdpPercap = col_double()
)
NULL
Next, let’s look at the data inside of the gapminder
table.
= dbReadTable(gm_db, "gapminder")
df head(df, 10)
country year lifeExp pop gdpPercap
1 Afghanistan 1952 28.801 8425333 779.4453
2 Afghanistan 1957 30.332 9240934 820.8530
3 Afghanistan 1962 31.997 10267083 853.1007
4 Afghanistan 1967 34.020 11537966 836.1971
5 Afghanistan 1972 36.088 13079460 739.9811
6 Afghanistan 1977 38.438 14880372 786.1134
7 Afghanistan 1982 39.854 12881816 978.0114
8 Afghanistan 1987 40.822 13867957 852.3959
9 Afghanistan 1992 41.674 16317921 649.3414
10 Afghanistan 1997 41.763 22227415 635.3414
Finally, we stop the database connection:
::dbDisconnect(gm_db) DBI
Load SQLite Database from Disk
Next, we’ll load the SQLite database we just created from disk.
# Setup a connection into a file-backed SQLite database
= DBI::dbConnect(RSQLite::SQLite(), "gapminder_db.sqlite3") gm_db
For each of the next SQL code chunks, we’re using the following options:
```{sql}
#| label: name-of-code-chunk
#| connection: gm_db
```
Note: We’re using gm_db
as that is the name of the connection object to the database just established in the preceding code chunk.
Querying Real Data
Let’s begin by looking at a few exploratory components.
First, how many observations or records are in the Table?
SELECT COUNT(*) FROM gapminder
COUNT(*) |
---|
30672 |
What do the first five records of each field look like?
SELECT * FROM gapminder LIMIT 5;
country | year | lifeExp | pop | gdpPercap |
---|---|---|---|---|
Afghanistan | 1952 | 28.801 | 8425333 | 779.4453 |
Afghanistan | 1957 | 30.332 | 9240934 | 820.8530 |
Afghanistan | 1962 | 31.997 | 10267083 | 853.1007 |
Afghanistan | 1967 | 34.020 | 11537966 | 836.1971 |
Afghanistan | 1972 | 36.088 | 13079460 | 739.9811 |
What countries are in the data set?
SELECT DISTINCT country FROM gapminder LIMIT 10;
country |
---|
Afghanistan |
Albania |
Algeria |
Angola |
Argentina |
Australia |
Austria |
Bahrain |
Bangladesh |
Belgium |
SELECT COUNT(DISTINCT country) FROM gapminder;
COUNT(DISTINCT country) |
---|
142 |
What does the population look if we rescale by thousands?
SELECT country, year, pop/1000.0 FROM gapminder LIMIT 5;
country | year | pop/1000.0 |
---|---|---|
Afghanistan | 1952 | 8425.333 |
Afghanistan | 1957 | 9240.934 |
Afghanistan | 1962 | 10267.083 |
Afghanistan | 1967 | 11537.966 |
Afghanistan | 1972 | 13079.460 |
We probably want a better variable than pop/1000.0
. We can change the name by placing after the modification AS <new-field>
:
SELECT country, year, pop/1000.0 AS rescaled_pop FROM gapminder LIMIT 5;
country | year | rescaled_pop |
---|---|---|
Afghanistan | 1952 | 8425.333 |
Afghanistan | 1957 | 9240.934 |
Afghanistan | 1962 | 10267.083 |
Afghanistan | 1967 | 11537.966 |
Afghanistan | 1972 | 13079.460 |
How many people are represented in the data set?
SELECT SUM(pop) FROM gapminder;
SUM(pop) |
---|
907928384418 |
What does the GDP per capita across the years look like?
SELECT country, year, ROUND(gdpPercap, 2) AS gdp FROM gapminder;
country | year | gdp |
---|---|---|
Afghanistan | 1952 | 779.45 |
Afghanistan | 1957 | 820.85 |
Afghanistan | 1962 | 853.10 |
Afghanistan | 1967 | 836.20 |
Afghanistan | 1972 | 739.98 |
Afghanistan | 1977 | 786.11 |
Afghanistan | 1982 | 978.01 |
Afghanistan | 1987 | 852.40 |
Afghanistan | 1992 | 649.34 |
Afghanistan | 1997 | 635.34 |
What are the features of Iceland, Germany, and Spain after 2000?
SELECT *
FROM gapminder
WHERE (year >= 2000) AND (country IN ("Iceland", "Germany", "Spain"));
country | year | lifeExp | pop | gdpPercap |
---|---|---|---|---|
Germany | 2002 | 78.670 | 82350671 | 30035.80 |
Germany | 2007 | 79.406 | 82400996 | 32170.37 |
Iceland | 2002 | 80.500 | 288030 | 31163.20 |
Iceland | 2007 | 81.757 | 301931 | 36180.79 |
Spain | 2002 | 79.780 | 40152517 | 24835.47 |
Spain | 2007 | 80.941 | 40448191 | 28821.06 |
Germany | 2002 | 78.670 | 82350671 | 30035.80 |
Germany | 2007 | 79.406 | 82400996 | 32170.37 |
Iceland | 2002 | 80.500 | 288030 | 31163.20 |
Iceland | 2007 | 81.757 | 301931 | 36180.79 |
Visualization
What if we wanted to visualize the data?
Well, there are built-in methods associated with the sql results… But, let’s take an interactive look at the data. To do so, we’ll use the gganimate
package to rapidly make a graph.
Next, let’s query all of the information from the table.
```{sql}
#| label: retrieve-table-from-db
#| connection: gm_db
#| echo: true
#| output.var: gm_df_from_db
SELECT * FROM gapminder;
```
Note, we’re exporting the data as: gm_df_from_db
With the query done, we turn our focus to creating our plot that compares life expectancy vs. GDP per capita under all countries in Gap minder across time.
library(ggplot2)
library(gganimate)
= ggplot(gm_df_from_db) +
p aes(gdpPercap, lifeExp, size = pop, colour = country) +
geom_point(show.legend = FALSE, alpha = 0.7) +
scale_color_viridis_d() +
scale_size(range = c(2, 12)) +
scale_x_log10() +
theme(legend.position = 'none') +
labs(
title = 'Year: {frame_time}',
x = 'GDP per Capita',
y = 'Life Expectancy'
+
) transition_time(year) +
ease_aes('linear')
animate(p)