Data Engineering 102: Introduction to Python for Data Engineering

Ashtone Onyango
5 min read · Sep 4, 2022

A little background

Data science, AI, and ML are used by virtually every consumer-facing industry to understand customer preferences and make recommendations to customers. All of these technologies are built around data, which is shaped into useful structured datasets that let data scientists assess and address business problems.

Data engineering lays the foundation for data science, machine learning, and AI systems. Data engineers are employed by companies that have access to vast amounts of potentially valuable data, often arriving in real time. A data engineer prepares data for analytical and operational use through efficient collection, extraction, transformation, modeling, and loading. And how do they accomplish that?

In addition to database systems, several tools and languages are used for data engineering, such as Python, R, Java, Scala, and SAS, and more may emerge in the future. But when it comes to data engineering, Python is one of the best-known and most widely used languages.

Why Python though?

Python has an advantage over other languages because of its simplicity, open-source license, usability, accessibility, and adaptability. Python can be used to build a broad range of applications, from simple web and desktop GUI programs to sophisticated systems for scientific, mathematical, and statistical computation. Along with a large, active worldwide community, Python offers a vast collection of libraries and packages that support use cases across data engineering, data science, artificial intelligence (AI), machine learning (ML), and more. Commonly used examples include SciPy, Pandas, Beautiful Soup, and NumPy.

Python shows up in data engineering in the following areas:

Data Ingestion:

Business data may live in TSV, CSV, XLSX, JSON, HTML, and XML files, SQL databases, Microsoft products, external systems, online publications, and APIs, among other sources. To access such data, Python offers a wide range of libraries and modules. Pandas in particular reads data into data frames and offers a wide range of functions for handling and converting data consistently and efficiently.
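As a small sketch of ingestion, the snippet below reads the same kind of records from CSV and JSON text using only the standard library, so it runs without extra installs. The sample data and the helper names read_csv_records and read_json_records are made up for illustration; with Pandas installed, pandas.read_csv and pandas.read_json would cover the same ground and return data frames instead of lists.

```python
import csv
import io
import json

# Hypothetical sample data standing in for files on disk or API responses.
CSV_TEXT = "order_id,amount\n1,19.99\n2,5.00\n"
JSON_TEXT = '[{"order_id": 3, "amount": 42.5}]'


def read_csv_records(text):
    """Parse CSV text into a list of dicts, one per row."""
    return list(csv.DictReader(io.StringIO(text)))


def read_json_records(text):
    """Parse a JSON array into a list of dicts."""
    return json.loads(text)


csv_rows = read_csv_records(CSV_TEXT)
json_rows = read_json_records(JSON_TEXT)

# CSV values arrive as strings, so convert them before combining sources.
orders = [
    {"order_id": int(row["order_id"]), "amount": float(row["amount"])}
    for row in csv_rows
] + json_rows

print(orders)
```

Note the explicit type conversion for the CSV rows: the csv module hands back every field as a string, which is one of the chores Pandas takes care of when it infers column types.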

Data Acquisition:

Python is used to build, run, and monitor data pipelines on platforms like Airflow and Dagster, along with other Python-based workflow tools. These platforms schedule and orchestrate ETL processes and pipelines.

Data Manipulation:

Pandas makes it easy to manipulate small datasets. For large datasets, PySpark, the Python interface to Apache Spark, is a powerful tool for processing, manipulating, and aggregating data that makes efficient, fault-tolerant use of parallel computing.

Data Modelling:

Machine learning and deep learning jobs built with frameworks like TensorFlow/Keras, scikit-learn, or PyTorch are written in Python. Using Python for data engineering as well gives data engineers and data scientists a shared language, which makes team communication easier.

Data Surfacing:

With Python and frameworks like Flask and Django, it is simple to serve data to a dashboard or traditional report, or to expose data as a service. Similarly, Python APIs may be used to connect to Tableau and Power BI.

Cross Compatibility:

Python is awesome for gluing pieces together. You can simply call external scripts written in R, Java, etc. from a Python workflow.
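One common way to do this gluing is the standard library's subprocess module. The sketch below is a minimal illustration, not the only approach: run_external is a hypothetical helper, and a second Python process stands in for the external script so the example runs anywhere.

```python
import subprocess
import sys


def run_external(command):
    """Run an external program and return its standard output as text."""
    result = subprocess.run(
        command,
        capture_output=True,  # collect stdout and stderr
        text=True,            # decode bytes to str
        check=True,           # raise CalledProcessError on a non-zero exit
    )
    return result.stdout.strip()


# Stand-in for an external script: a second Python process.
# For R or Java the command might look like ["Rscript", "clean.R"] or
# ["java", "-jar", "loader.jar"] (hypothetical file names).
output = run_external([sys.executable, "-c", "print(6 * 7)"])
print(output)
```

Passing the command as a list rather than a single string avoids shell quoting problems, and check=True makes a failing script stop the workflow instead of failing silently.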

Cloud Support:

Python makes it straightforward to build on and integrate with cloud platform services. Serverless computing lets you trigger data ETL processes on demand. Python is also one of a select few programming languages supported by the serverless computing services on all three of the main cloud platforms: AWS, Azure, and GCP.
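To give a feel for what an on-demand ETL step can look like, here is a minimal sketch of a serverless-style function. The (event, context) signature follows the AWS Lambda convention for Python handlers; the event layout with a "records" key is an assumption made up for this example, and other platforms use different entry points.

```python
def handler(event, context):
    """Entry point a serverless platform would call when triggered.

    The event layout used here is a made-up example, not a real
    service payload.
    """
    records = event.get("records", [])
    # Transform step: drop rows without a name and normalize the rest.
    cleaned = [
        {"id": record["id"], "name": record["name"].strip().title()}
        for record in records
        if record.get("name")
    ]
    return {"processed": len(cleaned), "rows": cleaned}


# Local invocation with a fake event; no cloud account is needed for this.
sample_event = {
    "records": [
        {"id": 1, "name": "  ada lovelace "},
        {"id": 2, "name": ""},
    ]
}
result = handler(sample_event, None)
print(result)
```

Because the handler is a plain function, it can be tested locally like this before it is deployed behind a trigger such as a file upload or a schedule.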

Dive into Python basics

Python supports several programming paradigms, including procedural, object-oriented, and functional programming. Python’s design philosophy prioritizes code readability, and the language uses indentation to mark blocks of code.
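The three paradigms can be compared on one small task. The sketch below sums a list of made-up amounts in each style; the Ledger class is a hypothetical name used only for illustration. Notice that indentation alone marks the body of the loop, the class, and its methods.

```python
from functools import reduce

amounts = [10, 25, 40]

# Procedural: a step-by-step loop that updates a running total.
total_procedural = 0
for amount in amounts:
    total_procedural += amount


# Object-oriented: state and behavior live together in a class.
class Ledger:
    def __init__(self):
        self.total = 0

    def add(self, amount):
        self.total += amount


ledger = Ledger()
for amount in amounts:
    ledger.add(amount)

# Functional: combine the values with a function instead of mutating state.
total_functional = reduce(lambda acc, amount: acc + amount, amounts, 0)

print(total_procedural, ledger.total, total_functional)
```

In everyday code the built-in sum(amounts) would do the same job; the longer versions are here only to show the styles side by side.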

Environment set-up

At this point, I assume you have a Python environment installed on your computer (Mac, Linux, or Windows). Many computers will have Python already installed. To check, type python --version in your terminal/command line (on some systems the command is python3 --version). If Python is installed, the command prints its version number.

If you don’t have Python installed, visit the official Python website (python.org) for instructions. Also install a code editor of your choice. A few examples include Visual Studio Code, PyCharm, and Atom. Anaconda, a Python distribution that bundles its own tools, is another option.

Python Quickstart

Python is an interpreted language: you write code in a Python (.py) file in a code editor and then pass that file to the Python interpreter, which runs it. Here is an example.

Create the file helloworld.py in your text editor to serve as your first Python file. Write the following code in the file: print("Hello, World!"). Now save your file, open the location of your file in the command prompt or terminal, and run the command python helloworld.py. You should see Hello, World! printed in the terminal.

Congratulations! You have written and executed your first Python program!

Python Variables

Variables store data values. Python has no command for declaring a variable. A variable is created the moment you first assign a value to it using the assignment operator (=), for example x = 5.
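A few more assignments, as a small sketch with made-up names and values:

```python
# Assigning a value creates the variable; no declaration is needed.
x = 5
name = "Ada"
price = 19.99

print(x, name, price)
```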

Note: Variables do not need to be declared with any particular type, and can even change type after they have been set, by reassigning them to new data values.
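The note above can be checked with the built-in type function. In this sketch, first_type and second_type are illustrative names that record the type of x before and after reassignment:

```python
x = 4
first_type = type(x).__name__   # x currently holds an integer

x = "four"
second_type = type(x).__name__  # the same name now holds a string

print(first_type, second_type)
```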

Casting can be used to specify the data type of a value when you assign it. For example, x = str(3) stores the string "3" rather than the integer 3.
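A short sketch of casting with the built-in str, int, and float functions, using made-up values:

```python
x = str(3)    # the string "3"
y = int("3")  # the integer 3, converted from a string
z = float(3)  # the float 3.0

print(repr(x), repr(y), repr(z))
```

Casting fails loudly when the value cannot be converted: int("three"), for instance, raises a ValueError.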

A variable can have a short name (like x and y) or a more descriptive name (e.g. firstname, last_name). The rules for naming variables include:

· A variable name must start with a letter or the underscore character.

· A variable name cannot start with a number.

· A variable name can only contain alphanumeric characters and underscores (A-Z, a-z, 0-9, and _).

· Variable names are case-sensitive, i.e. age, Age, and AGE are three different variables.

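Variable names can also be made of several words, in which case they can be written in camel case (myVariableName), Pascal case (MyVariableName), or snake case (my_variable_name). Python's PEP 8 style guide recommends snake case for variable names.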
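The naming rules above can be checked with the string method isidentifier. The candidate names below are made up for illustration. Note that isidentifier only checks the character rules; reserved words such as class also pass it even though they cannot be used as variable names.

```python
# Check which candidate names follow the character rules for identifiers.
candidates = ["first_name", "_count", "2nd_place", "user-name"]
validity = {name: name.isidentifier() for name in candidates}
print(validity)

# Names are case-sensitive: these are three separate variables.
age = 30
Age = 31
AGE = 32
print(age, Age, AGE)

# The same multiword name in the three common styles.
myVariableName = "camel case"
MyVariableName = "Pascal case"
my_variable_name = "snake case"
```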

Python Data Types

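Variables can store data of different types, and different types can do different things. Python has a set of built-in data types that do not need to be declared. They fall into the following categories: text (str), numeric (int, float, complex), sequence (list, tuple, range), mapping (dict), set (set, frozenset), boolean (bool), binary (bytes, bytearray, memoryview), and none (NoneType).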
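To see the built-in types in action, the sketch below stores one made-up sample value per type and asks Python for its type name. The labels in the samples dictionary are illustrative only.

```python
# One sample value for most of the built-in types.
samples = {
    "text": "hello",
    "integer": 42,
    "float": 3.14,
    "complex": 2 + 3j,
    "list": [1, 2, 3],
    "tuple": (1, 2, 3),
    "range": range(3),
    "dict": {"key": "value"},
    "set": {1, 2, 3},
    "frozenset": frozenset({1, 2}),
    "bool": True,
    "bytes": b"raw",
    "none": None,
}

# type(value).__name__ gives the name of the value's type as a string.
type_names = {label: type(value).__name__ for label, value in samples.items()}

for label, type_name in type_names.items():
    print(f"{label:>10}: {type_name}")
```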

You now have a fundamental understanding of Python and why it is such a popular choice for data engineering. Python data structures, arithmetic operations, and using Python modules will be covered in the next article, Data Engineering 103.

Yours, Ashtone Onyango

Ashtone Onyango

Data Engineer || MLOps || Full-Stack Developer || Alumnus @KamiLimu | Global Winner @Huawei ICT Competition 2022