Welcome to the Data Engineering course repository. I am taking this course in May 2023 and will be documenting everything I learn here.
In this course, you will delve into the concepts and practices of handling large volumes of data through extraction, transformation, and loading (ETL) processes. Additionally, you will learn to design and manage a Data Warehousing architecture using the latest technologies prevalent in today's market. You will implement data handling processes using the Python programming language, leveraging powerful libraries like Pandas. Moreover, you will work with SQL dialects and Amazon Redshift databases. By the end of this course, you will be equipped to administer, maintain, and optimize modern data infrastructures.
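To make the ETL idea concrete, here is a minimal sketch in Python with Pandas of the extract-transform-load flow described above. The file paths and column names are hypothetical placeholders, not course material.

```python
import pandas as pd

# Extract: read raw data from a source file (hypothetical path).
raw = pd.read_csv("data/raw_sales.csv")

# Transform: clean and reshape the data.
clean = (
    raw.drop_duplicates()                       # remove repeated rows
       .dropna(subset=["customer_id"])          # drop rows missing a key field
       .assign(total=lambda df: df["quantity"] * df["unit_price"])
)

# Load: write the result to a destination. Here it is a local CSV;
# in the course the target would be a warehouse such as Amazon Redshift.
clean.to_csv("data/clean_sales.csv", index=False)
```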
Upon completion of the Data Engineering course, you will have the necessary skills and knowledge to:
- Understand the concepts and challenges associated with the Big Data ecosystem.
- Design and implement solutions for efficiently managing and analyzing large datasets.
- Develop ETL processes for data extraction, transformation, and loading, enabling subsequent data analytics.
- Successfully administer a data warehouse.
To make the most of this course, it is recommended to have intermediate knowledge in the following areas:
- Intermediate SQL skills and data analysis. Specifically, you should be familiar with the following concepts:
  - Understanding primary keys and foreign keys.
  - Knowledge of normalization concepts, ideally with hands-on experience normalizing a database.
  - Proficiency in querying tables, aggregating results with aggregation functions, performing joins, creating new tables, and inserting, deleting, and updating records.
  - Familiarity with SQL statements such as JOIN, GROUP BY, HAVING, INSERT, UPDATE, DELETE, and CREATE (see the SQL refresher sketch after this list).
- Intermediate Python skills. Specifically, you should be familiar with the following concepts:
  - Executing Python scripts; handling numeric variables, strings, lists/arrays, and dictionaries; and writing and calling functions.
  - Understanding APIs and experience extracting data from websites through APIs, working with lists, dictionaries, and JSON (see the API sketch after this list).
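As a quick self-check for the SQL prerequisites, here is a self-contained refresher using Python's built-in sqlite3 module; the schema and data are invented purely for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
cur = conn.cursor()

# CREATE: two tables linked by a primary key / foreign key pair.
cur.executescript("""
CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE orders (
    id INTEGER PRIMARY KEY,
    customer_id INTEGER REFERENCES customers(id),
    amount REAL
);
""")

# INSERT: add some sample rows.
cur.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "Ana"), (2, "Luis")])
cur.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                [(1, 1, 50.0), (2, 1, 75.0), (3, 2, 20.0)])

# UPDATE and DELETE.
cur.execute("UPDATE orders SET amount = 80.0 WHERE id = 2")
cur.execute("DELETE FROM orders WHERE id = 3")

# JOIN + GROUP BY + HAVING: customers whose total spend exceeds 100.
cur.execute("""
SELECT c.name, SUM(o.amount) AS total
FROM customers c
JOIN orders o ON o.customer_id = c.id
GROUP BY c.name
HAVING SUM(o.amount) > 100
""")
print(cur.fetchall())  # [('Ana', 130.0)]
conn.close()
```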
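And for the Python/API prerequisite, a minimal sketch of pulling JSON from a web API using only the standard library; the endpoint URL is a hypothetical placeholder.

```python
import json
import urllib.request

# Hypothetical endpoint returning a JSON array of user objects.
URL = "https://api.example.com/users"

with urllib.request.urlopen(URL) as response:
    users = json.load(response)  # parsed into ordinary Python lists/dicts

# Work with the result as lists and dictionaries.
names = [user["name"] for user in users]
print(names)
```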
The course covers the following topics:

- Introduction to the Unix terminal.
- Basic concepts of computer architecture.
- Python and SQL exercises.
- Exploring Big Data and its current challenges.
- Understanding the collaboration between Data Engineers, Data Analysts, and Data Scientists.
- Getting familiar with fundamental concepts in the field of Data Engineering.
- Reviewing a basic data architecture.
- Introduction to OLAP databases.
- Exploring MPP, clustering, and MapReduce techniques (see the MapReduce sketch after this list).
- Leveraging Amazon Redshift for data warehousing.
- Working with the Apache Parquet file format.
- Utilizing Pandas DataFrames for data manipulation.
- Applying transformation techniques to DataFrames (deduplication, merge, apply, etc.); see the Pandas/Parquet sketch after this list.
- Understanding the fundamentals of database security.
- Implementing security measures in Amazon Redshift.
- Performing manual backups to S3 (see the S3 backup sketch after this list).
- Introduction to containerization and virtual machines.
- Building Dockerfiles and creating Docker images.
- Practical exercises using Docker.
- Overview of Apache Airflow.
- Understanding the architecture of Airflow processes.
- Working with DAGs, Tasks, and Operators.
- Exploring advanced concepts: sensors, SubDAGs, XComs (see the Airflow DAG sketch after this list).
- Introduction to stream processing.
- Working with PubSub messaging.
- Theoretical introduction to Apache Kafka.
- Practical exercises using AWS Kinesis (see the Kinesis sketch after this list).
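To make the MapReduce item concrete, here is a toy word count expressed as map and reduce phases in plain Python; real frameworks distribute these phases across a cluster, but the logic is the same.

```python
from collections import Counter
from functools import reduce

documents = ["big data big ideas", "data pipelines move data"]

# "Map" phase: each document is processed independently into partial counts
# (in a real framework these would run in parallel on different nodes).
partials = [Counter(doc.split()) for doc in documents]

# "Reduce" phase: the partial results are merged into the final answer.
totals = reduce(lambda a, b: a + b, partials, Counter())
print(totals.most_common(2))  # [('data', 3), ('big', 2)]
```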
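For the Pandas and Parquet items, a minimal sketch of the transformation techniques listed (deduplication, merge, apply) plus a Parquet round trip; all data is invented, and writing Parquet from Pandas assumes the pyarrow (or fastparquet) package is installed.

```python
import pandas as pd

customers = pd.DataFrame({"customer_id": [1, 2, 2], "name": ["Ana", "Luis", "Luis"]})
orders = pd.DataFrame({"customer_id": [1, 2], "amount": [50.0, 20.0]})

# Deduplication: drop the repeated customer row.
customers = customers.drop_duplicates()

# Merge: join orders onto customers (analogous to a SQL JOIN).
enriched = customers.merge(orders, on="customer_id", how="left")

# Apply: derive a new column with a row-wise function.
enriched["amount_with_tax"] = enriched["amount"].apply(lambda x: x * 1.21)

# Parquet round trip (requires pyarrow or fastparquet).
enriched.to_parquet("enriched.parquet")
print(pd.read_parquet("enriched.parquet"))
```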
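The manual-backup item boils down to copying a file into S3. A minimal sketch with boto3 follows; the bucket name and paths are placeholders, and AWS credentials are assumed to be configured in the environment.

```python
import boto3

s3 = boto3.client("s3")  # credentials are taken from the environment/AWS config

# Upload a local dump file to a (hypothetical) backup bucket.
s3.upload_file(
    Filename="backups/warehouse_dump.sql",
    Bucket="my-backup-bucket",
    Key="redshift/2023-05/warehouse_dump.sql",
)
```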
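For the Airflow items, here is a minimal DAG sketch, assuming Apache Airflow 2.x; the DAG id, schedule, and task bodies are placeholders.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    print("pulling data from the source")  # placeholder task logic

def load():
    print("writing data to the warehouse")  # placeholder task logic

with DAG(
    dag_id="daily_etl",                # hypothetical pipeline name
    start_date=datetime(2023, 5, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    load_task = PythonOperator(task_id="load", python_callable=load)

    extract_task >> load_task  # run extract before load
```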
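And for the streaming items, a sketch of publishing a record to an AWS Kinesis data stream with boto3; the stream name and region are placeholders, and the stream is assumed to already exist.

```python
import json
import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")  # assumed region

event = {"sensor_id": "s-42", "temperature": 21.5}

# Publish one record; records sharing a partition key go to the same shard.
kinesis.put_record(
    StreamName="sensor-events",  # hypothetical, pre-created stream
    Data=json.dumps(event).encode("utf-8"),
    PartitionKey=event["sensor_id"],
)
```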
Feel free to explore the contents of each module in the respective folders. I will be continuously updating this repository with my learning progress throughout the course.
If you have any questions, don't hesitate to create an issue in this repository. Happy learning! 🧑‍💻👩‍💻