install.packages(c("DBI", "odbc"))Demo SQLite Database and Queries
DBI, odbc, and RDBMS systems
Installing Drivers
There are two generic packages that are useful to establish a database connection with.
DBI: Database Interfaceodbc: Open Database Connectivity
The prior provides a unified frontend for working with databases while the later provides connection drivers.
From here, we’ll want to also install DBI-compliant implementations for each database engine we are interested in using. To date, there are:
For these kinds of databases, we’ll need to connect in using the odbc package. This will require ODBC drivers to be installed on your local computer. Installation for each driver is operating system specific (e.g. different instructions for Windows, macOS, Linux…). An overview for each platform is provided at: https://solutions.posit.co/connections/db/best-practices/drivers/
# Install DBI-compliant database packages
install.packages(c(
"RSQLite",
"bigrquery",
"RMySQL",
"RMariaDB",
"ROracle",
"RPostgres"
))
# Install a `dplyr` to SQL translator package
install.packages("dbplyr")SQL
Establish Database Connection
Here we establish a SQLite database.
library(DBI)
db = dbConnect(RSQLite::SQLite(), dbname = "my_db.sqlite")Aside: Creating a SQL Code Chunk
Now, Quarto Documents allows for other languages to be used through “code engines.”
For example, we can use SQL by saying sql instead of r, c.f.
```{sql}
#| connection: db
SELECT 1
```Yields:
SELECT 1| 1 |
|---|
| 1 |
Note: Each sql code chunk must have a connection supplied to it based on a value set using R. Thus, throughout the code chunks below, you will see:
#| connection: db
The db command comes from the variable being established above.
Aside: SQL Comments
Just like with BASH, we can create comments that are not read by a computer. SQL comments use two minus signs (--) followed by the comment. For example, we could have a code chunk with:
```{sql}
#| label: sql-chunk-with-comment
#| connection: db
-- This is a SQL comment
SELECT "Hello World"
```The output would look like so:
-- This is a SQL comment
SELECT "Hello World"| “Hello World” |
|---|
| Hello World |
CREATE TABLE: Creating a Database
With this being said, we’ll use the sql code chunk option. Each code chunk must have only one command. Since we’re trying to create three different tables, we’ll create three different SQL tables: Students, Courses, and `Grades.
```{sql}
#| label: create-students-table
#| connection: db
CREATE TABLE Students (
id INTEGER,
firstname TEXT,
lastname TEXT,
age REAL,
instate INTEGER,
PRIMARY KEY (id)
);
```Aside: Not specifying sql
Make sure to specify the appropriate code engine immediately after the curly bracket.
Notice: {r} is shown instead of {sql}
Failure to do so, will result in:
Error: unexpected symbol in “CREATE TABLE”
CREATE TABLE: Creating Courses and `Grades
Now, back to creating the other two tables…
CREATE TABLE Courses (
course_id TEXT,
acronym TEXT,
PRIMARY KEY (course_id)
);CREATE TABLE Grades (
student_id INTEGER,
course_id TEXT,
grade TEXT,
FOREIGN KEY (student_id) REFERENCES Students(id),
FOREIGN KEY (course_id) REFERENCES Courses(course_id),
PRIMARY KEY (student_id, course_id)
);View Properties of Database
Let’s check by viewing properties of the tables
## In R
# The names of the tables.
DBI::dbListTables(db)[1] "Courses" "Grades" "Students"
# The variable names for the Grades Table.
DBI::dbListFields(db, "Grades") [1] "student_id" "course_id" "grade"
Note: These tables are currently empty. If we look at the data for each table, we’re missing them!
dbReadTable(db, "Students")[1] id firstname lastname age instate
<0 rows> (or 0-length row.names)
INSERT: Inserting Values
Next, we’ll populate the table with values.
-- Populate Table
INSERT INTO Students VALUES
(1, "Billy", "Joe", 23, 0),
(2, "Theodore", "Squirrel", 25, 1),
(3, "Keeya", "Nod", 21, 0);INSERT INTO Courses VALUES
("STAT385", "SPM"),
("STAT432", "BSL"),
("HIST100", "GH");INSERT INTO Grades VALUES
(1, "STAT385", "A+"),
(2, "STAT432", "A-"),
(1, "HIST100", "A"),
(3, "STAT385", "B+");Aside: Blocking two inserts data
If multiple INSERT statements are used with the same data, then the insert statement will be rejected due to the primary key requirement of unique data being present. That is, re-running the code chunk populating Students table
INSERT INTO Students VALUES
(1, "Billy", "Joe", 23, 0),
(2, "Theodore", "Squirrel", 25, 1),
(3, "Keeya", "Nod", 21, 0);Yields:
UNIQUE constraint failed: Students.id
We can verify that the table was filled by looking at the contents in R.
my_local_Students_table = dbReadTable(db, "Students")
my_local_Students_table id firstname lastname age instate
1 1 Billy Joe 23 0
2 2 Theodore Squirrel 25 1
3 3 Keeya Nod 21 0
UPDATE: Modifying Values
Next, we can modify the data is stored using UPDATE
UPDATE Students
SET firstname = "James"
WHERE id = 1;UPDATE Grades
SET grade = "B"
WHERE student_id > 1;Your Turn: Updating a Value
Change the ages of anyone greater than 22 to 18 in the Students table.
UPDATE Students
SET age = 18
WHERE age > 22;Verify change:
SELECT * FROM Students;| id | firstname | lastname | age | instate |
|---|---|---|---|---|
| 1 | James | Joe | 23 | 0 |
| 2 | Theodore | Squirrel | 25 | 1 |
| 3 | Keeya | Nod | 21 | 0 |
DELETE: Removing Records
Here, we’re aiming to delete a row or record permanently from a Table.
::: {.callout-important}} Great care should be exercised here. :::
DELETE FROM Students
WHERE id = 2;SELECT: Selecting ALL values
Now, let’s retrieve ALL the values from the Student table in the database.
-- Retrieve all fields from the table
SELECT
*
FROM
Students;| id | firstname | lastname | age | instate |
|---|---|---|---|---|
| 1 | James | Joe | 23 | 0 |
| 2 | Theodore | Squirrel | 25 | 1 |
| 3 | Keeya | Nod | 21 | 0 |
Aside: Case Insensitive Commands
In SQL-land, there is a preference to SPEAK IN CAPITAL LETTERS. However, SQL is largely case insensitive. Thus, the same query in lower letters will also work
select
*
from
Students;| id | firstname | lastname | age | instate |
|---|---|---|---|---|
| 1 | James | Joe | 23 | 0 |
| 2 | Theodore | Squirrel | 25 | 1 |
| 3 | Keeya | Nod | 21 | 0 |
SELECT: Select two fields
From here, let’s select only two fields from the Students data base.
-- Retrieve only the firstname and lastname from the table
SELECT
firstname, lastname
FROM
Students;| firstname | lastname |
|---|---|
| James | Joe |
| Theodore | Squirrel |
| Keeya | Nod |
SELECT: Select two fields
Now, let’s try our hand a computation involving the average age inside of the Students data base.
-- Compute a mean of variable
SELECT
AVG(age) as MeanAge
FROM
Students;| MeanAge |
|---|
| 23 |
SELECT with R: Equivalents Operations
We can repeat the above selections using R. First, we’ll have to export the data into R
Students = DBI::dbReadTable(db, "Students")To obtain all the data, we can type out the name of the data.frame
Students id firstname lastname age instate
1 1 James Joe 23 0
2 2 Theodore Squirrel 25 1
3 3 Keeya Nod 21 0
Retrieving variables from a data.frame.
Students[, c("firstname", "lastname")] firstname lastname
1 James Joe
2 Theodore Squirrel
3 Keeya Nod
The dplyr way would be:
library("dplyr")
Students %>%
select(firstname, lastname)Create a summary statistic
MeanAge = mean(Students$age)And with dplyr…
Students %>%
summarise(MeanAge = mean(age))WHERE: Subsetting the Initial Data
We request only the instate students that are stored with instate = 1 (equivalent to TRUE).
SELECT
*
FROM
Students
WHERE
instate = 1;| id | firstname | lastname | age | instate |
|---|---|---|---|---|
| 2 | Theodore | Squirrel | 25 | 1 |
Alternatively, we can count how many students are out of state by using instate = 0 (equivalent to FALSE).
SELECT
COUNT(*) as NInstate
FROM
Students
WHERE
instate = 0;| NInstate |
|---|
| 2 |
GROUP BY: Aggregating by Group
Instead of performing two separate queries to identify group populations, we could instead aggregate information by membership in a group.
SELECT
COUNT(*) as NInstate
FROM
Students
GROUP BY
instate
ORDER BY
NInstate ASC;| NInstate |
|---|
| 1 |
| 2 |
HAVING: Subsetting a Group
Perform a subset on the grouped data based on the new results
SELECT
COUNT(*) as NInstate
FROM
Students
GROUP BY
instate;
HAVING
NInstate > 1;| NInstate |
|---|
| 2 |
| 1 |
DISTINCT: Select distinct
Finally, we can obtain distinct values by using the DISTINCT keyword neext to the field of interest.
SELECT
COUNT(DISTINCT student_id) as NStudents
FROM
Grades| NStudents |
|---|
| 3 |