Data Engineering 101: Introduction to Data Engineering

Ashtone Onyango
3 min read · Sep 2, 2022

What is data engineering?

The amount of data collected from many sources has grown steadily over the past 50 years, and it keeps growing day by day. Consequently, storing and handling this data has become very difficult. With the aid of emerging fields like data engineering, artificial intelligence, and machine learning, we can manage this data intelligently and make it valuable for business and user needs.

Data engineering is the process of designing and building systems for large-scale data gathering, storage, and analysis. It covers a wide range of topics and has use cases in almost every industry. Gathering vast amounts of data is easy; making sure the data is usable by the time it reaches data scientists and analysts requires the right people and technology.


This post, the first in the series, aims to familiarize the reader with data engineering and the tools they are likely to use along the way. The goal of data engineering is to make data accessible and available to data-driven processes, and this is typically accomplished with ETL (extract, transform, and load) pipelines.
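To make the ETL idea concrete, here is a minimal sketch of a pipeline in Python. The CSV data, table name, and column names are all illustrative, and SQLite stands in for a real warehouse:

```python
# A minimal ETL sketch: extract raw CSV, transform the rows, load into a table.
import csv
import io
import sqlite3

RAW_CSV = """user_id,signup_date,country
1,2022-01-05,ke
2,2022-02-10,us
"""

def extract(text):
    """Extract: read raw CSV text into a list of dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: cast IDs to integers and normalize country codes."""
    return [(int(r["user_id"]), r["signup_date"], r["country"].upper())
            for r in rows]

def load(records, conn):
    """Load: write the cleaned records into a relational table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS users "
        "(user_id INTEGER, signup_date TEXT, country TEXT)"
    )
    conn.executemany("INSERT INTO users VALUES (?, ?, ?)", records)

conn = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), conn)
print(conn.execute("SELECT COUNT(*) FROM users").fetchone()[0])  # 2
```

Real pipelines swap each stage for something heavier (an API or object store for extract, Spark for transform, a warehouse for load), but the three-stage shape stays the same.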


Roles of data engineers, data scientists, and data analysts

Data scientists combine the skills of statisticians and data managers. They handle entire data science initiatives, supporting predictive modeling, large-scale data archiving, and results reporting.

Data analysts, often referred to as business analysts, help people from across the organization understand data through queries, charts, and dashboards.

Data engineers are more closely related to database administrators and data architects. They are adaptable generalists who employ tools to process vast amounts of data. They usually concentrate on coding, acquiring and cleaning datasets, and fulfilling data requests from data scientists.

Tools and technologies

Data engineers rely on a collection of tools, technologies, and methods for transferring data between systems, transforming it, constructing data pipelines, and maintaining data infrastructure. Some of them include:

Programming Languages

· Python, Scala

Scripting and Automation

· Shell Scripting, cron, ETL
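As an illustration, cron is commonly used to schedule recurring pipeline runs. A crontab entry (the script path and log file here are hypothetical) might look like:

```
# Run a nightly ETL script at 02:30 every day, appending output to a log
30 2 * * * /opt/pipelines/nightly_etl.sh >> /var/log/etl.log 2>&1
```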

Relational Databases

· Data Modelling

· RDBMS — PostgreSQL, MySQL

· BigQuery
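Data modelling in a relational database comes down to defining tables and the relationships between them. A toy sketch using Python's built-in sqlite3 module (a production system would use PostgreSQL or MySQL; the schema and names are illustrative):

```python
# A two-table relational model with a foreign key, plus a join query.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE orders (
    id INTEGER PRIMARY KEY,
    customer_id INTEGER REFERENCES customers(id),
    amount REAL
);
INSERT INTO customers VALUES (1, 'Asha'), (2, 'Brian');
INSERT INTO orders VALUES (10, 1, 25.0), (11, 1, 40.0), (12, 2, 15.0);
""")

# Join the tables to get the total order amount per customer
rows = conn.execute("""
    SELECT c.name, SUM(o.amount)
    FROM customers c JOIN orders o ON o.customer_id = c.id
    GROUP BY c.name ORDER BY c.name
""").fetchall()
print(rows)  # [('Asha', 65.0), ('Brian', 15.0)]
```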

NoSQL Databases and Map-Reduce

· Advanced ETL

· Data Warehousing

· Data API

· Map-Reduce
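The map-reduce pattern splits a computation into a map phase that emits key-value pairs and a reduce phase that aggregates them. A word-count sketch in plain Python (frameworks like Hadoop run these same phases distributed across machines):

```python
# Word count via map-reduce: map emits (word, 1) pairs, reduce sums them.
from collections import defaultdict

documents = ["big data", "data pipelines move data"]

def map_phase(doc):
    """Map: emit a (word, 1) pair for every word in the document."""
    return [(word, 1) for word in doc.split()]

def reduce_phase(pairs):
    """Reduce: sum the counts for each distinct word."""
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

pairs = [p for doc in documents for p in map_phase(doc)]
print(reduce_phase(pairs))  # {'big': 1, 'data': 3, 'pipelines': 1, 'move': 1}
```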

Cloud Computing

· AWS, Azure, GCP

Data Processing

· Batch Processing — Apache Spark

· Stream Processing — Spark Streaming

· Basic Machine Learning
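The difference between batch and stream processing can be sketched in plain Python. Batch code sees the complete dataset at once; streaming code consumes one record at a time and keeps a running result. Spark and Spark Streaming apply the same ideas at cluster scale; the readings here are made up:

```python
# Batch vs. stream processing of the same numeric readings.
readings = [3, 8, 5, 10, 2]

# Batch: the whole dataset is available, so compute the average directly
batch_avg = sum(readings) / len(readings)

def stream_average(events):
    """Stream: update a running average after every incoming event."""
    total = count = 0
    for value in events:       # in a real system, an unbounded source
        total += value
        count += 1
        yield total / count

running = list(stream_average(iter(readings)))
print(batch_avg, running[-1])  # both 5.6
```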

Infrastructure

· Docker, Kubernetes

Workflows

· Apache Airflow
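Orchestrators like Airflow model a workflow as a directed acyclic graph (DAG) of tasks and run each task only after its dependencies finish. The scheduling core is a topological sort, which can be sketched with the standard library (the task names are hypothetical, and this is not the Airflow API itself):

```python
# Ordering pipeline tasks by their dependencies, as a workflow engine would.
from graphlib import TopologicalSorter

# Each task maps to the set of tasks it depends on
dag = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
    "report": {"load"},
}

order = list(TopologicalSorter(dag).static_order())
print(order)  # ['extract', 'transform', 'load', 'report']
```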

The next part of this series delves into the use of Python for Data Engineering, with useful examples.

Yours, Ashtone Onyango

Ashtone Onyango

Data Engineer || MLOps || Full-Stack Developer || Alumnus @KamiLimu | Global Winner @Huawei ICT Competition 2022