SQL Fundamentals with Python - Database

Database

Data vs Information

The terms Data and Information are often used interchangeably, but they actually have distinct meanings.

Data refers to raw facts or statistics in the form of text, numbers, symbols, etc., that do not carry specific meaning or purpose. On the other hand, Information is knowledge derived from study, communication, research, or instructions. Essentially, information is obtained from analyzing and interpreting one or more pieces of data.

For example:

  • The number of likes on a social media post is a piece of data. When combined with other social media engagement statistics, like follows, comments, and shares, it becomes information that can tell us which social media platform is performing the best.

Introducing to database

A database is a systematic collection of data, managed in accordance with interrelated rules, making it easier to transform data into information.

When managing a database, some essential characteristics to consider are:

  • Speed of data retrieval
  • Storage space efficiency
  • Reduction or elimination of data redundancy
  • Accuracy in establishing relationships between data based on data type rules, data domains, and data uniqueness.

Database system

A database, or “basis data” in Indonesian, is comprised of a basis (warehouse) and data (real-world facts representing objects) like people, goods, and so forth, manifested in the form of letters, numbers, symbols, or combinations thereof.

A database is usually managed by a DBMS (Database Management System) that acts as an interface between the database and its users. Users here can be either humans or applications. DBMS can be used by users to view, change, and manage information in the database.

DBMS

DBMS (Sumber: digitaloceanspaces.com)

Some of the most popular DBMSs are:

  • MySQL: A SQL-based relational DBMS with a client-server architecture.
  • Elasticsearch: A distributed, RESTful search and analytics engine capable of addressing a growing number of use cases. It centrally stores your data for lightning fast search, fine-tuned relevancy, and powerful analytics that scale with ease.
  • SQLite: A relational database management system embedded into applications. SQLite stores all data in a single disk file and is very popular for mobile and desktop applications that require lightweight, local data storage.
  • And others

Data is typically modeled in rows and columns within a table. This structure makes it easier to perform addition, modification, deletion, and other operations. This model is common in relational databases (RDBMSs) that use SQL (Structured Query Language).

SQL is a programming language used to manipulate, search, access, and perform other operations within a database.

But how does this differ from a Spreadsheet (like Excel, or Google Sheets)? Both are easy to use but have clear differences:

Differences Spreadsheet Database
Usage For accounting tasks Used by large companies to store large amounts of data
Who can access Designed for single-user, but can be used by multiple users in turns Can be accessed by multiple users or applications at once
Amount of data that can be stored Data storage is limited Can store large amounts of data

To illustrate the usage of a database, we will use SQL with SQLite as the relational database management system (RDBMS).

Up until now, when creating applications in Python, we’ve been storing data in memory in the form of a list of objects:

class StudentRow:
    def __init__(self, ID, username, grade):
        self.ID = ID
        self.username = username
        self.grade = grade

students = []
students.append(StudentRow(1, 'aditira', 83))
students.append(StudentRow(2, 'gina', 77))
students.append(StudentRow(3, 'ria', 64))

for student in students:
    print(student.__dict__)
{'ID': 1, 'username': 'aditira', 'grade': 83}
{'ID': 2, 'username': 'gina', 'grade': 77}
{'ID': 3, 'username': 'ria', 'grade': 64}

This approach has some challenges, such as the data not being permanently stored when the application is terminated, and the data is hard to process because it requires several lines of Python code to manipulate the data.

For example, if we want to get the average Grade from the above student list:

sum_of_grades = sum(student.grade for student in students)
average_grade = sum_of_grades / len(students)

print("Grade Average is ", average_grade)
Grade Average is  74.66666666666667

This requires about 3 lines of code to get the average Grade. Now let’s compare this with using SQL.

Assuming we already have a students table with id, username, and grade columns using SQLite. We can input data with the following SQL command:

import sqlite3

# Connect to SQLite database in memory
conn = sqlite3.connect(':memory:')

# Create a cursor object
c = conn.cursor()

# Create table
c.execute("""
    CREATE TABLE students (
        id INTEGER PRIMARY KEY,
        username TEXT,
        grade INTEGER
    )
""")
<sqlite3.Cursor at 0x27c0db505c0>
INSERT INTO public.students (username, grade) values ('aditira',83),('gina',77),('ria',64);
# Insert values
c.execute("""
    INSERT INTO students (username, grade) 
    VALUES 
        ('aditira',83),
        ('gina',77),
        ('ria',64)
""")
<sqlite3.Cursor at 0x27c0db505c0>

And we can retrieve the data with this simple SQL command:

SELECT * FROM students;
import pandas as pd

# Commit the transaction
conn.commit()

# Retrieve data and convert into DataFrame
pd.read_sql_query("SELECT * FROM students", conn)
id username grade
0 1 aditira 83
1 2 gina 77
2 3 ria 64

Quite easy, isn’t it! 😎 And to get the average value of student grades is even easier, with this simple SQL command:

SELECT AVG(grade) FROM students;
import pandas as pd

# Commit the transaction
conn.commit()

# Retrieve data and convert into DataFrame
pd.read_sql_query("SELECT AVG(grade) FROM students", conn)
AVG(grade)
0 74.666667
# Close the connection
conn.close()

So you can see that using a database, our application data is much easier to process and is stored more securely for long-term use.

DDL (Data Definition Language)

What is DDL?

Before we start using a database, we need to set things up by creating tables, schemas, views, and indexes. This process is referred to as Data Definition Language (DDL).

There are several SQL commands used to carry out these tasks, such as CREATE, ALTER, DROP, TRUNCATE.

Let’s consider the following table as an example. This table contains simple employee data:

ID name age address salary
1 Rizki 25 Jl. Kebon Jeruk 2000000
2 Andi 27 Jl. Kebon Sirih 3000000
3 Budi 30 Jl. Kebon Melati 4000000
4 Caca 32 Jl. Kebon Anggrek 5000000
5 Deni 35 Jl. Kebon Mawar 6000000

Next, we will apply DDL to this table.

SQL Data Types

To create a table, we need to determine the data types to be used. These data types will limit the kind of data that can be inserted into each column. The data types in SQL are not much different from those in Python. Here are some commonly used data types:

Data type Description
VARCHAR(size) A character data type with a flexible length, like string in Python
TEXT A character data type with a maximum length of 65535 characters
INT / INTEGER An integer data type
FLOAT A decimal number data type
BOOLEAN A boolean data type (true / false)
DATE A date data type
DATETIME A date and time data type

For the VARCHAR data type, we need to specify the size (a number from 1 - 255). This size will limit the length of the characters that can be entered into that column. For instance, VARCHAR(50) will limit the length of the characters that can be entered into that column to 50 characters. If we enter characters longer than 50, an error will occur.

The VARCHAR data type can only hold a maximum of 255 characters. If we want to store more than 255 characters, we need to use the TEXT data type, which has a maximum length of 65535 characters.

For more details, you can check here.

CREATE

The CREATE command is used to create a new object such as a database, table, schema, view, or index.

To create a new database, we can use the CREATE DATABASE command as follows:

CREATE DATABASE <db-name>;

Where <db-name> is the name of the database to be created. Don’t forget to end each SQL command with a ;.

Example:

CREATE DATABASE db_employee;

With python and SQLite let’s import the necessary module and establish a connection to the database.

import sqlite3

# Connect to the SQLite database
# If the database does not exist, it will be created
conn = sqlite3.connect('db/sqlite.db')

# Create a cursor object
cur = conn.cursor()

The command above creates a new database named db_employee.

Next, we can also use CREATE to create a new table. We can use CREATE TABLE as follows:

CREATE TABLE <table-name> (
  <column-name> <data-type> <constraints>,
  <column-name> <data-type> <constraints>,
  ...
);

Where <table-name> is the name of the table to be created, <column-name> is the name of the column, <data-type> is the data type to be used, and <constraints> are the rules to be applied to that column. To create a column with more than one rule, we can separate them with a comma ,.

Example command to create an employee table:

CREATE TABLE employee (
  id INT,
  name VARCHAR(255) NOT NULL,
  age INT NOT NULL,
  address VARCHAR(255),
  salary INT
);

Let’s create an employee table.

# Execute a SQL command: this creates a new table
cur.execute('''
    CREATE TABLE employee (
        id INTEGER,
        name TEXT NOT NULL,
        age INTEGER NOT NULL,
        address TEXT,
        salary INTEGER
    )
''')

# Commit the transaction
conn.commit()

See the table:

import pandas as pd

pd.read_sql_query("SELECT * FROM employee", conn)
id name age address salary

From the example above, we are creating a new table named employee that has 5 columns, namely id, name, age, address, and salary. For the id and age columns, we use the INT data type, which means integer, while for the name and address columns, we use the VARCHAR data type, which means character. The salary column also uses the INT data type, which means integer.

For the name and age columns, we apply the NOT NULL constraint, which means that these columns cannot be empty. If we enter empty data or empty characters "" into these columns, an error will occur.

You can try creating a new table named employee as above in an online SQL compiler or in your own pgAdmin / DBMS tool.

ALTER

The ALTER command is used to modify the structure of the database. For example, if we want to add a new column to an existing table, we can use the ALTER TABLE command as follows:

ALTER TABLE <table-name> ADD <column-name> <data-type> <constraints>;

Where <table-name> is the name of the table to be modified, <column-name> is the name of the column to be added, <data-type> is the data type to be used, and <constraints> are the rules to be applied to that column.

If we want to add a new column named email with data type VARCHAR(255) to the employee table, we can use the following command:

ALTER TABLE employee ADD email VARCHAR(255);

Try in python using SQLite:

cur.execute('ALTER TABLE employee ADD COLUMN email TEXT')
conn.commit()

See the table:

import pandas as pd

pd.read_sql_query("SELECT * FROM employee", conn)
id name age address salary email

The example above means that we are adding a new column named email with the data type VARCHAR(255) to the employee table.

We can also modify an existing column with the ALTER COLUMN command as follows:

ALTER TABLE <table-name> ALTER COLUMN <column-name> TYPE <data-type> <constraints>;

We can change salary to the FLOAT data type with the following command:

ALTER TABLE employee ALTER COLUMN salary TYPE FLOAT;

We can also rename a column with the RENAME TO command as follows:

ALTER TABLE <table-name> RENAME TO <new-table-name>;

Where <table-name> is the name of the table to be modified, and <new-table-name> is the new table name.

We can rename the employee table to employees with the following command:

ALTER TABLE employee RENAME TO employees;

Try in python using SQLite:

cur.execute('ALTER TABLE employee RENAME TO employees')
conn.commit()

See the table:

import pandas as pd

pd.read_sql_query("SELECT * FROM employees", conn)  # Change table employee to employees
id name age address salary email

Finally, we can delete a column with the DROP COLUMN command as follows:

ALTER TABLE <table-name> DROP COLUMN <column-name>;

Where <table-name> is the name of the table to be modified, and <column-name> is the name of the column to be deleted.

We can delete the email column of the employee table with the following command:

ALTER TABLE employee DROP COLUMN email;

Try in python using SQLite:

cur.execute('ALTER TABLE employees DROP COLUMN email')
conn.commit()

See the table:

import pandas as pd

pd.read_sql_query("SELECT * FROM employees", conn)
id name age address salary

After running the command above, the email column of the employee table will be deleted.

You can try using the ALTER command as above in an online SQL compiler or in your own pgAdmin / DBMS tool.

DROP

The DROP command is used to delete records or objects that exist in the database.

We can delete the employee table that was previously created using DROP TABLE as follows:

DROP TABLE <table-name>

Where <table-name> is the name of the table to be deleted.

We can delete the employee table with the following command:

DROP TABLE employee;

Try in python using SQLite:

cur.execute('DROP TABLE employees')
conn.commit()

See the table (expected error):

import pandas as pd

pd.read_sql_query("SELECT * FROM employees", conn)

We can also delete a database with the DROP DATABASE command as follows:

DROP DATABASE <database-name>

Where <database-name> is the name of the database to be deleted.

We can delete the db_employee database with the following command:

DROP DATABASE db_employees;

You can try using the DROP command as above in an online SQL compiler or in your own pgAdmin / DBMS tool.

TRUNCATE

The TRUNCATE command is used to delete all records (data) within a table.

We can delete all records in the employee table with the TRUNCATE TABLE command as follows:

TRUNCATE TABLE <table-name>

Where <table-name> is the name of the table whose records will be deleted.

We can delete all records in the employee table with the following command:

TRUNCATE TABLE employee;

Try in python using SQLite (Insert Data):

cur.execute("INSERT INTO employees VALUES (1, 'Rizki', 25, 'Jl. Kebon Jeruk', 2000000)")
cur.execute("INSERT INTO employees VALUES (2, 'Aditira', 22, 'Jl. Cibinong', 19000000)")
conn.commit()

See the table:

import pandas as pd

pd.read_sql_query("SELECT * FROM employees", conn)
id name age address salary
0 1 Rizki 25 Jl. Kebon Jeruk 2000000
1 2 Aditira 22 Jl. Cibinong 19000000

Delete all records:

cur.execute('DELETE FROM employees')
conn.commit()

See the table:

import pandas as pd

pd.read_sql_query("SELECT * FROM employees", conn)
id name age address salary

When the command above is run, all records (rows of data) in the employee table will be deleted, resulting in 0 records.

You can try using the TRUNCATE command as above in an online SQL compiler or in your own pgAdmin / DBMS tool.

# Close the connection
conn.close()

Next we’ll go into how to set up and manipulate data in tables.

Back to top