Demo SQLite Database and Queries

Author

JJB + Course

DBI, odbc, and RDBMS systems

Installing Drivers

There are two generic packages that are useful to establish a database connection with.

  • DBI: Database Interface
  • odbc: Open Database Connectivity

The prior provides a unified frontend for working with databases while the later provides connection drivers.

install.packages(c("DBI", "odbc"))

From here, we’ll want to also install DBI-compliant implementations for each database engine we are interested in using. To date, there are:

For these kinds of databases, we’ll need to connect in using the odbc package. This will require ODBC drivers to be installed on your local computer. Installation for each driver is operating system specific (e.g. different instructions for Windows, macOS, Linux…). An overview for each platform is provided at: https://solutions.posit.co/connections/db/best-practices/drivers/

# Install DBI-compliant database packages
install.packages(c(
  "RSQLite",
  "bigrquery",
  "RMySQL",
  "RMariaDB",
  "ROracle",
  "RPostgres"
))

# Install a `dplyr` to SQL translator package
install.packages("dbplyr")

SQL

Establish Database Connection

Here we establish a SQLite database.

library(DBI)
db = dbConnect(RSQLite::SQLite(), dbname = "my_db.sqlite")

Aside: Creating a SQL Code Chunk

Now, Quarto Documents allows for other languages to be used through “code engines.”

For example, we can use SQL by saying sql instead of r, c.f.

```{sql}
#| connection: db
SELECT 1
```

Yields:

SELECT 1
1 records
1
1

Note: Each sql code chunk must have a connection supplied to it based on a value set using R. Thus, throughout the code chunks below, you will see:

#| connection: db

The db command comes from the variable being established above.

Aside: SQL Comments

Just like with BASH, we can create comments that are not read by a computer. SQL comments use two minus signs (--) followed by the comment. For example, we could have a code chunk with:

```{sql}
#| label: sql-chunk-with-comment
#| connection: db

-- This is a SQL comment

SELECT "Hello World"
```

The output would look like so:

-- This is a SQL comment
SELECT "Hello World"
1 records
“Hello World”
Hello World

CREATE TABLE: Creating a Database

With this being said, we’ll use the sql code chunk option. Each code chunk must have only one command. Since we’re trying to create three different tables, we’ll create three different SQL tables: Students, Courses, and `Grades.

```{sql}
#| label: create-students-table
#| connection: db
CREATE TABLE Students (
   id INTEGER,
   firstname TEXT,
   lastname TEXT,
   age REAL,
   instate INTEGER,
   PRIMARY KEY (id)
);
```

Aside: Not specifying sql

Make sure to specify the appropriate code engine immediately after the curly bracket.

Notice: {r} is shown instead of {sql}

Failure to do so, will result in:

Error: unexpected symbol in “CREATE TABLE”

CREATE TABLE: Creating Courses and `Grades

Now, back to creating the other two tables…

CREATE TABLE Courses (
   course_id TEXT,
   acronym TEXT,
   PRIMARY KEY (course_id)
);
CREATE TABLE Grades (
   student_id INTEGER,
   course_id TEXT,
   grade TEXT,
   FOREIGN KEY (student_id) REFERENCES Students(id),
   FOREIGN KEY (course_id) REFERENCES Courses(course_id),
   PRIMARY KEY (student_id, course_id)
);

View Properties of Database

Let’s check by viewing properties of the tables

## In R 

# The names of the tables.
DBI::dbListTables(db)
[1] "Courses"  "Grades"   "Students"
# The variable names for the Grades Table.
DBI::dbListFields(db, "Grades") 
[1] "student_id" "course_id"  "grade"     

Note: These tables are currently empty. If we look at the data for each table, we’re missing them!

dbReadTable(db, "Students")
[1] id        firstname lastname  age       instate  
<0 rows> (or 0-length row.names)

INSERT: Inserting Values

Next, we’ll populate the table with values.

-- Populate Table
INSERT INTO Students VALUES
       (1, "Billy", "Joe", 23, 0),
       (2, "Theodore", "Squirrel", 25, 1),
       (3, "Keeya", "Nod", 21, 0);
INSERT INTO Courses VALUES
       ("STAT385", "SPM"),
       ("STAT432", "BSL"),
       ("HIST100", "GH");
INSERT INTO Grades VALUES
       (1, "STAT385", "A+"),
       (2, "STAT432", "A-"),
       (1, "HIST100", "A"),
       (3, "STAT385", "B+");

Aside: Blocking two inserts data

If multiple INSERT statements are used with the same data, then the insert statement will be rejected due to the primary key requirement of unique data being present. That is, re-running the code chunk populating Students table

INSERT INTO Students VALUES
       (1, "Billy", "Joe", 23, 0),
       (2, "Theodore", "Squirrel", 25, 1),
       (3, "Keeya", "Nod", 21, 0);

Yields:

UNIQUE constraint failed: Students.id

We can verify that the table was filled by looking at the contents in R.

my_local_Students_table = dbReadTable(db, "Students")

my_local_Students_table
  id firstname lastname age instate
1  1     Billy      Joe  23       0
2  2  Theodore Squirrel  25       1
3  3     Keeya      Nod  21       0

UPDATE: Modifying Values

Next, we can modify the data is stored using UPDATE

UPDATE Students 
SET firstname = "James"
WHERE id = 1;
UPDATE Grades 
SET grade = "B"
WHERE student_id > 1;

Your Turn: Updating a Value

Change the ages of anyone greater than 22 to 18 in the Students table.

UPDATE Students 
SET age = 18
WHERE age > 22;

Verify change:

SELECT * FROM Students;
3 records
id firstname lastname age instate
1 James Joe 23 0
2 Theodore Squirrel 25 1
3 Keeya Nod 21 0

DELETE: Removing Records

Here, we’re aiming to delete a row or record permanently from a Table.

::: {.callout-important}} Great care should be exercised here. :::

DELETE FROM Students 
WHERE id = 2;

SELECT: Selecting ALL values

Now, let’s retrieve ALL the values from the Student table in the database.

-- Retrieve all fields from the table
SELECT
  * 
FROM
  Students;
3 records
id firstname lastname age instate
1 James Joe 23 0
2 Theodore Squirrel 25 1
3 Keeya Nod 21 0

Aside: Case Insensitive Commands

In SQL-land, there is a preference to SPEAK IN CAPITAL LETTERS. However, SQL is largely case insensitive. Thus, the same query in lower letters will also work

select
  * 
from
  Students;
3 records
id firstname lastname age instate
1 James Joe 23 0
2 Theodore Squirrel 25 1
3 Keeya Nod 21 0

SELECT: Select two fields

From here, let’s select only two fields from the Students data base.

-- Retrieve only the firstname and lastname from the table
SELECT 
  firstname, lastname
FROM 
  Students;
3 records
firstname lastname
James Joe
Theodore Squirrel
Keeya Nod

SELECT: Select two fields

Now, let’s try our hand a computation involving the average age inside of the Students data base.

-- Compute a mean of variable   
SELECT 
  AVG(age) as MeanAge
FROM
  Students;
1 records
MeanAge
23

SELECT with R: Equivalents Operations

We can repeat the above selections using R. First, we’ll have to export the data into R

Students = DBI::dbReadTable(db, "Students")

To obtain all the data, we can type out the name of the data.frame

Students
  id firstname lastname age instate
1  1     James      Joe  23       0
2  2  Theodore Squirrel  25       1
3  3     Keeya      Nod  21       0

Retrieving variables from a data.frame.

Students[, c("firstname", "lastname")]
  firstname lastname
1     James      Joe
2  Theodore Squirrel
3     Keeya      Nod

The dplyr way would be:

library("dplyr")

Students %>% 
   select(firstname, lastname)

Create a summary statistic

MeanAge = mean(Students$age)

And with dplyr

Students %>% 
   summarise(MeanAge = mean(age))

WHERE: Subsetting the Initial Data

We request only the instate students that are stored with instate = 1 (equivalent to TRUE).

SELECT 
    *
FROM
    Students
WHERE 
    instate = 1;
1 records
id firstname lastname age instate
2 Theodore Squirrel 25 1

Alternatively, we can count how many students are out of state by using instate = 0 (equivalent to FALSE).

SELECT 
    COUNT(*) as NInstate
FROM
    Students
WHERE 
    instate = 0;
1 records
NInstate
2

GROUP BY: Aggregating by Group

Instead of performing two separate queries to identify group populations, we could instead aggregate information by membership in a group.

SELECT 
    COUNT(*) as NInstate
FROM
    Students
GROUP BY
    instate
ORDER BY 
    NInstate ASC;
2 records
NInstate
1
2

HAVING: Subsetting a Group

Perform a subset on the grouped data based on the new results

SELECT 
    COUNT(*) as NInstate
FROM
    Students
GROUP BY
    instate;
HAVING 
    NInstate > 1;
2 records
NInstate
2
1

DISTINCT: Select distinct

Finally, we can obtain distinct values by using the DISTINCT keyword neext to the field of interest.

SELECT 
    COUNT(DISTINCT student_id) as NStudents
FROM
    Grades
1 records
NStudents
3