Data engineering 103: Intermediate Concepts in Data Engineering (Python & SQL)

Ashtone Onyango
6 min readSep 9, 2022

--

Pandas

Pandas is a python library, which contains functions that make it useful for analyzing, cleaning, exploring, and manipulating data when working with datasets.

Fun Fact: The nomenclature of “Pandas” is a result of a reference to both “Panel Data” and “Python Data Analysis”. It was created by McKinney in 2008.

What can you do with Pandas?

Relevant data is very important in data science. With pandas, you can analyze big data and make conclusions on statistical theories, by giving you answers about the data such as:

  • Is there any correlation between two or more columns?
  • What is the average value?
  • What is the Max/Min value?

Pandas can also clean messy datasets, and make them readable and relevant. It does this by deleting rows that are redundant, and that contain wrong values (NULL entries).

An example is given:

Importing and reading data from a CSV file format. The head() function displays the first 5 rows of the data. I have used recorded data of patients predisposed to heart disease, that captures risk factors used to predict their getting heart disease.

Importing and reading data from a JSON file format.

You can check for inconsistencies in the data, by weeding out any NULL entries in the data, as demonstrated. Fortunately, the data had no missing values. Data cleaning (cleaning messy data) has been extensively explored in a separate article (See the end of the post).

You can also provide summary statistics of the data — min, mean, max, std — for general view using the describe() function, and also display overview information about the data (like data types) using the info() function.

To explore more about Pandas, check out the official pandas documentation.

Numpy

Numpy is a python library that is used for working with arrays. It has functions that make it suitable for working in the domains of Linear Algebra, Fourier Transform, and Matrices.

Fun Fact: Numpy is written partially in Python, but most of the parts that require fast computation are written in C or C++.

What can you do with Numpy?

Why arrays? In Python, Lists can serve the purpose of Arrays, but they are slow to process. Numpy, therefore, provides an Array Object that is up to 50 times faster than traditional Python lists

  • The array object is called ndarray. Numpy provides a lot of supporting functions that make working with ndarray easy.
  • Numpy arrays are stored at one continuous place in memory, unlike lists. Therefore, processes can access and manipulate them very efficiently i.e. locality of reference.
  • Numpy is also optimized to work with the latest CPU architectures.
  • Arrays are very frequently used in data engineering and data science, where speed and resources are of high value.

An example is given:

Creating a Numpy ndarray Object. We can create a Numpy ndarray object by using the array() function. type() is a built-in Python function that tells us the type of object passed to it.

You can also create an array with 5 dimensions and verify that it has 5 dimensions. In this array the innermost dimension (5th dim) has 4 elements, the 4th dim has 1 element that is the vector, the 3rd dim has 1 element that is the matrix with the vector, the 2nd dim has 1 element that is a 3D array and 1st dim has 1 element that is a 4D array.

You can generate random arrays of integer/float type numbers. The randint() method takes a size parameter where you can specify the shape of an array.

To explore more about Numpy, check out the official numpy documentation.

SQL for Data Engineering

SQL stands for Structured Query Language. It is a standard language that enables you to store, manipulate and retrieve data in databases. Every database system including PostgreSQL, SQLite, MySQL, SQL Server, etc. has its own “flavor” of SQL it implements because they are not 100% compliant with the SQL ANSI/ISO standards.

What can you do with SQL?

  • SQL can execute queries against a database.
  • SQL can retrieve data from a database.
  • SQL can insert records into a database.
  • SQL can update records in a database.
  • SQL can delete records from a database.
  • SQL can create new databases.
  • SQL can create new tables in a database.
  • SQL can create stored procedures in a database.
  • SQL can create views in a database.
  • SQL can set permissions on tables, procedures, and views.

RDBMS

RDBMS stands for Relational Database Management System and is the basis for SQL and all modern database systems like PostgreSQL, MS SQL Server, Oracle, MySQL, MS Access, and IBM DB2. The data in RDBMS is stored in database objects called tables. A table is a collection of related data entries and it consists of columns and rows.

SQL Practice Example

In this practice example, we create a database called Logistics. Within the database, we create two tables — Drivers and Vehicles and populate them with data. We then use SQL to manipulate and query said data. We can use any database system for this task; I used PostgreSQL for this example.

Requirement: Have PostgreSQL (psql client) installed in your system (Postgres installation) to be able to practice locally and save your queries and database state.

Alternatively, you can use a (hosted workspace) where you can practice your queries. The workspace has Postgres and all required software/tools installed, to provide convenience 😉.

Using the psql client, create a database Logistics using the command: (CREATE DATABASE Logistics;). List all your available databases using the “\l” command.

NB: You can use the default database (postgres) provided when you install psql, for the subsequent tasks.

Create the tables Drivers and Vehicles using the SQL statements shown in psql client. The statements have already been created in the hosted workspace — just run the code.

Manipulating and Querying data

  1. Insert a few records into both drivers and vehicles. Include 3 records of drivers who have vehicles, belonging in the vehicles table.

2. Select all driver records; select all vehicle records; select only 3 vehicle records (using LIMIT).

3. Driver with ID 2 no longer owns any vehicles. Update the database to reflect this.

4. Driver with ID 1 now owns a new vehicle in addition to the previous one they owned. Update the database to reflect this.

Joins and Group Bys

  1. Select all vehicles owned by driver with ID 3.
  2. Select all vehicles owned by driver with name ‘Sarah’ (without knowing their ID).
  3. Show a table of the number of vehicles owned per driver.
  4. Show the number of drivers that own a Nissan model.

Structuring Data

  1. Add information about vehicle color.
  2. Update all existing vehicle records to have a vehicle color.
  3. Add contact information (email, address) to the driver’s table.

You have now explored the basic CRUD operations on relational database systems. Data cleaning has been explored more comprehensively in a separate article.

Yours, Ashtone Onyango

Twitter.

--

--

Ashtone Onyango
Ashtone Onyango

Written by Ashtone Onyango

Data Engineer || MLOps || Full-Stack Developer || Alumnus @KamiLimu | Global Winner @Huawei ICT Competition 2022

No responses yet