install.packages(c("DBI", "odbc"))
Demo SQLite Database and Queries
DBI
, odbc
, and RDBMS systems
Installing Drivers
There are two generic packages that are useful to establish a database connection with.
DBI
: Database Interfaceodbc
: Open Database Connectivity
The prior provides a unified frontend for working with databases while the later provides connection drivers.
From here, we’ll want to also install DBI-compliant implementations for each database engine we are interested in using. To date, there are:
For these kinds of databases, we’ll need to connect in using the odbc
package. This will require ODBC drivers to be installed on your local computer. Installation for each driver is operating system specific (e.g. different instructions for Windows, macOS, Linux…). An overview for each platform is provided at: https://solutions.posit.co/connections/db/best-practices/drivers/
# Install DBI-compliant database packages
install.packages(c(
"RSQLite",
"bigrquery",
"RMySQL",
"RMariaDB",
"ROracle",
"RPostgres"
))
# Install a `dplyr` to SQL translator package
install.packages("dbplyr")
SQL
Establish Database Connection
Here we establish a SQLite database.
library(DBI)
= dbConnect(RSQLite::SQLite(), dbname = "my_db.sqlite") db
Aside: Creating a SQL Code Chunk
Now, Quarto Documents allows for other languages to be used through “code engines.”
For example, we can use SQL by saying sql
instead of r
, c.f.
```{sql}
#| connection: db
SELECT 1
```
Yields:
SELECT 1
1 |
---|
1 |
Note: Each sql
code chunk must have a connection supplied to it based on a value set using R. Thus, throughout the code chunks below, you will see:
#| connection: db
The db
command comes from the variable being established above.
Aside: SQL Comments
Just like with BASH, we can create comments that are not read by a computer. SQL comments use two minus signs (--
) followed by the comment. For example, we could have a code chunk with:
```{sql}
#| label: sql-chunk-with-comment
#| connection: db
-- This is a SQL comment
SELECT "Hello World"
```
The output would look like so:
-- This is a SQL comment
SELECT "Hello World"
“Hello World” |
---|
Hello World |
CREATE TABLE: Creating a Database
With this being said, we’ll use the sql
code chunk option. Each code chunk must have only one command. Since we’re trying to create three different tables, we’ll create three different SQL tables: Students
, Courses
, and `Grades.
```{sql}
#| label: create-students-table
#| connection: db
CREATE TABLE Students (
INTEGER,
id TEXT,
firstname TEXT,
lastname REAL,
age INTEGER,
instate PRIMARY KEY (id)
);```
Aside: Not specifying sql
Make sure to specify the appropriate code engine immediately after the curly bracket.
Notice: {r}
is shown instead of {sql}
Failure to do so, will result in:
Error: unexpected symbol in “CREATE TABLE”
CREATE TABLE: Creating Courses
and `Grades
Now, back to creating the other two tables…
CREATE TABLE Courses (
course_id TEXT,
acronym TEXT,PRIMARY KEY (course_id)
);
CREATE TABLE Grades (
INTEGER,
student_id
course_id TEXT,
grade TEXT,FOREIGN KEY (student_id) REFERENCES Students(id),
FOREIGN KEY (course_id) REFERENCES Courses(course_id),
PRIMARY KEY (student_id, course_id)
);
View Properties of Database
Let’s check by viewing properties of the tables
## In R
# The names of the tables.
::dbListTables(db) DBI
[1] "Courses" "Grades" "Students"
# The variable names for the Grades Table.
::dbListFields(db, "Grades") DBI
[1] "student_id" "course_id" "grade"
Note: These tables are currently empty. If we look at the data for each table, we’re missing them!
dbReadTable(db, "Students")
[1] id firstname lastname age instate
<0 rows> (or 0-length row.names)
INSERT: Inserting Values
Next, we’ll populate the table with values.
-- Populate Table
INSERT INTO Students VALUES
1, "Billy", "Joe", 23, 0),
(2, "Theodore", "Squirrel", 25, 1),
(3, "Keeya", "Nod", 21, 0); (
INSERT INTO Courses VALUES
"STAT385", "SPM"),
("STAT432", "BSL"),
("HIST100", "GH"); (
INSERT INTO Grades VALUES
1, "STAT385", "A+"),
(2, "STAT432", "A-"),
(1, "HIST100", "A"),
(3, "STAT385", "B+"); (
Aside: Blocking two inserts data
If multiple INSERT
statements are used with the same data, then the insert statement will be rejected due to the primary key requirement of unique data being present. That is, re-running the code chunk populating Students
table
INSERT INTO Students VALUES
1, "Billy", "Joe", 23, 0),
(2, "Theodore", "Squirrel", 25, 1),
(3, "Keeya", "Nod", 21, 0); (
Yields:
UNIQUE constraint failed: Students.id
We can verify that the table was filled by looking at the contents in R.
= dbReadTable(db, "Students")
my_local_Students_table
my_local_Students_table
id firstname lastname age instate
1 1 Billy Joe 23 0
2 2 Theodore Squirrel 25 1
3 3 Keeya Nod 21 0
UPDATE: Modifying Values
Next, we can modify the data is stored using UPDATE
UPDATE Students
SET firstname = "James"
WHERE id = 1;
UPDATE Grades
SET grade = "B"
WHERE student_id > 1;
Your Turn: Updating a Value
Change the ages of anyone greater than 22 to 18 in the Students table.
UPDATE Students
SET age = 18
WHERE age > 22;
Verify change:
SELECT * FROM Students;
id | firstname | lastname | age | instate |
---|---|---|---|---|
1 | James | Joe | 23 | 0 |
2 | Theodore | Squirrel | 25 | 1 |
3 | Keeya | Nod | 21 | 0 |
DELETE: Removing Records
Here, we’re aiming to delete a row or record permanently from a Table.
::: {.callout-important}} Great care should be exercised here. :::
DELETE FROM Students
WHERE id = 2;
SELECT: Selecting ALL values
Now, let’s retrieve ALL the values from the Student
table in the database.
-- Retrieve all fields from the table
SELECT
*
FROM
Students;
id | firstname | lastname | age | instate |
---|---|---|---|---|
1 | James | Joe | 23 | 0 |
2 | Theodore | Squirrel | 25 | 1 |
3 | Keeya | Nod | 21 | 0 |
Aside: Case Insensitive Commands
In SQL-land, there is a preference to SPEAK IN CAPITAL LETTERS. However, SQL is largely case insensitive. Thus, the same query in lower letters will also work
select
*
from
Students;
id | firstname | lastname | age | instate |
---|---|---|---|---|
1 | James | Joe | 23 | 0 |
2 | Theodore | Squirrel | 25 | 1 |
3 | Keeya | Nod | 21 | 0 |
SELECT: Select two fields
From here, let’s select only two fields from the Students
data base.
-- Retrieve only the firstname and lastname from the table
SELECT
firstname, lastnameFROM
Students;
firstname | lastname |
---|---|
James | Joe |
Theodore | Squirrel |
Keeya | Nod |
SELECT: Select two fields
Now, let’s try our hand a computation involving the average age inside of the Students
data base.
-- Compute a mean of variable
SELECT
AVG(age) as MeanAge
FROM
Students;
MeanAge |
---|
23 |
SELECT with R: Equivalents Operations
We can repeat the above selections using R. First, we’ll have to export the data into R
= DBI::dbReadTable(db, "Students") Students
To obtain all the data, we can type out the name of the data.frame
Students
id firstname lastname age instate
1 1 James Joe 23 0
2 2 Theodore Squirrel 25 1
3 3 Keeya Nod 21 0
Retrieving variables from a data.frame
.
c("firstname", "lastname")] Students[,
firstname lastname
1 James Joe
2 Theodore Squirrel
3 Keeya Nod
The dplyr
way would be:
library("dplyr")
%>%
Students select(firstname, lastname)
Create a summary statistic
= mean(Students$age) MeanAge
And with dplyr
…
%>%
Students summarise(MeanAge = mean(age))
WHERE: Subsetting the Initial Data
We request only the instate students that are stored with instate = 1
(equivalent to TRUE
).
SELECT
*
FROM
StudentsWHERE
= 1; instate
id | firstname | lastname | age | instate |
---|---|---|---|---|
2 | Theodore | Squirrel | 25 | 1 |
Alternatively, we can count how many students are out of state by using instate = 0
(equivalent to FALSE
).
SELECT
COUNT(*) as NInstate
FROM
StudentsWHERE
= 0; instate
NInstate |
---|
2 |
GROUP BY: Aggregating by Group
Instead of performing two separate queries to identify group populations, we could instead aggregate information by membership in a group.
SELECT
COUNT(*) as NInstate
FROM
StudentsGROUP BY
instateORDER BY
ASC; NInstate
NInstate |
---|
1 |
2 |
HAVING: Subsetting a Group
Perform a subset on the grouped data based on the new results
SELECT
COUNT(*) as NInstate
FROM
StudentsGROUP BY
instate;HAVING
> 1; NInstate
NInstate |
---|
2 |
1 |
DISTINCT: Select distinct
Finally, we can obtain distinct values by using the DISTINCT
keyword neext to the field of interest.
SELECT
COUNT(DISTINCT student_id) as NStudents
FROM
Grades
NStudents |
---|
3 |