Data Pipeline to automate data transfer from Amazon S3 to Amazon Redshift using Apache Airflow, AWS Glue and Crawlers.
In this project, based on a real-world scenario, I acted as the Cloud Specialist and built an order-processing system that automates the data transfer from an Amazon S3 bucket to Amazon Redshift using Apache Airflow, the AWS Glue Data Catalog and Glue Crawlers.
Below are a few screenshots of the steps:
Created an EC2 instance and installed Python 3
Opened port 8080 for Apache Airflow
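The port was opened through the EC2 console security group settings; a roughly equivalent boto3 sketch, with a hypothetical security group ID, would be:

```python
import boto3

# Hypothetical security group attached to the Airflow EC2 instance
SECURITY_GROUP_ID = "sg-0123456789abcdef0"

ec2 = boto3.client("ec2")

# Allow inbound TCP 8080 so the Airflow webserver is reachable from a browser
ec2.authorize_security_group_ingress(
    GroupId=SECURITY_GROUP_ID,
    IpPermissions=[
        {
            "IpProtocol": "tcp",
            "FromPort": 8080,
            "ToPort": 8080,
            "IpRanges": [{"CidrIp": "0.0.0.0/0", "Description": "Airflow web UI"}],
        }
    ],
)
```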
Installed Apache Airflow and was able to open it from the browser
Configured VS Code to connect to the EC2 instance
Removed all sample DAG files from the Apache Airflow DAGs folder
Created an Amazon S3 bucket and uploaded the first order file
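The bucket was created and the file uploaded through the console; a minimal boto3 sketch, with hypothetical bucket and file names, would be:

```python
import boto3

# Hypothetical bucket and local file names
BUCKET_NAME = "orders-data-pipeline-bucket"
ORDER_FILE = "orders_1.csv"

s3 = boto3.client("s3")

# Create the bucket (outside us-east-1, also pass
# CreateBucketConfiguration={"LocationConstraint": "<region>"})
s3.create_bucket(Bucket=BUCKET_NAME)

# Upload the first order file under an "orders/" prefix
s3.upload_file(ORDER_FILE, BUCKET_NAME, f"orders/{ORDER_FILE}")
```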
Created an AWS Glue database and a Crawler to first bring the S3 data into the AWS Glue Data Catalog manually
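This was done in the Glue console; a boto3 sketch of the same setup, with hypothetical database, crawler and role names, looks like this:

```python
import boto3

glue = boto3.client("glue")

# Hypothetical names; the crawler catalogs the order files sitting in S3
DATABASE_NAME = "orders_db"
CRAWLER_NAME = "orders-s3-crawler"
GLUE_ROLE_ARN = "arn:aws:iam::123456789012:role/GlueServiceRole"

# Glue database that will hold the cataloged table
glue.create_database(DatabaseInput={"Name": DATABASE_NAME})

# Crawler pointed at the orders prefix in the S3 bucket
glue.create_crawler(
    Name=CRAWLER_NAME,
    Role=GLUE_ROLE_ARN,
    DatabaseName=DATABASE_NAME,
    Targets={"S3Targets": [{"Path": "s3://orders-data-pipeline-bucket/orders/"}]},
)
```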
Ran the Crawler and checked the data in Amazon Athena
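The equivalent calls from boto3 would be something like the sketch below (the query results bucket and table name are hypothetical):

```python
import boto3

glue = boto3.client("glue")
athena = boto3.client("athena")

# Run the crawler so the table shows up in the Glue Data Catalog
glue.start_crawler(Name="orders-s3-crawler")

# Once the crawler finishes, preview the cataloged data from Athena
athena.start_query_execution(
    QueryString="SELECT * FROM orders LIMIT 10;",
    QueryExecutionContext={"Database": "orders_db"},
    ResultConfiguration={
        "OutputLocation": "s3://orders-data-pipeline-bucket/athena-results/"
    },
)
```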
Created the Amazon Redshift cluster and the target table.
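The cluster was created in the Redshift console. The target table can be created from the query editor or from Python; a sketch using the Amazon Redshift Python connector, with hypothetical connection details and column layout, is shown here:

```python
import redshift_connector

# Hypothetical connection details for the new cluster
conn = redshift_connector.connect(
    host="orders-cluster.xxxxxxxx.us-east-1.redshift.amazonaws.com",
    database="dev",
    user="awsuser",
    password="********",
)

cursor = conn.cursor()

# Hypothetical layout for the order records
cursor.execute(
    """
    CREATE TABLE IF NOT EXISTS public.orders (
        order_id     INTEGER,
        customer_id  INTEGER,
        order_date   DATE,
        amount       DECIMAL(10, 2)
    );
    """
)
conn.commit()
```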
Created and ran the Crawler
The Crawler failed
Created an Amazon S3 gateway VPC endpoint to fix the error
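The endpoint was added in the VPC console; a roughly equivalent boto3 call, with hypothetical VPC and route table IDs, is:

```python
import boto3

ec2 = boto3.client("ec2")

# A gateway endpoint lets Glue reach S3 from inside the VPC
ec2.create_vpc_endpoint(
    VpcEndpointType="Gateway",
    VpcId="vpc-0123456789abcdef0",             # hypothetical VPC ID
    ServiceName="com.amazonaws.us-east-1.s3",  # adjust to the cluster's region
    RouteTableIds=["rtb-0123456789abcdef0"],   # hypothetical route table ID
)
```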
Reran the Crawler
Created the AWS Glue ETL job to transfer the data to Amazon Redshift
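The job was built visually in Glue Studio; a simplified sketch of what the generated script does, assuming a catalog table named orders and a Glue connection to the Redshift cluster (all names are hypothetical), looks roughly like this:

```python
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])

sc = SparkContext()
glue_context = GlueContext(sc)
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read the order data that the crawler cataloged from S3
orders = glue_context.create_dynamic_frame.from_catalog(
    database="orders_db",
    table_name="orders",
)

# Write into the Redshift table through a Glue connection; the temp dir
# is the S3 staging area Redshift uses for COPY/UNLOAD
glue_context.write_dynamic_frame.from_jdbc_conf(
    frame=orders,
    catalog_connection="redshift-connection",
    connection_options={"dbtable": "public.orders", "database": "dev"},
    redshift_tmp_dir="s3://orders-data-pipeline-bucket/temp/",
)

job.commit()
```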
Checked the count of records in the table created in Amazon Redshift
Ran the ETL job manually and checked the record count
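Both steps were done through the console and the Redshift query editor; the same check can be scripted along these lines (job name and connection details are hypothetical):

```python
import boto3
import redshift_connector

glue = boto3.client("glue")

# Start the ETL job by name and keep the run id for reference
run = glue.start_job_run(JobName="s3-to-redshift-orders-job")
print("Started Glue job run:", run["JobRunId"])

# After the run succeeds, verify the record count in Redshift
conn = redshift_connector.connect(
    host="orders-cluster.xxxxxxxx.us-east-1.redshift.amazonaws.com",
    database="dev",
    user="awsuser",
    password="********",
)
cursor = conn.cursor()
cursor.execute("SELECT COUNT(*) FROM public.orders;")
print("Rows in public.orders:", cursor.fetchone()[0])
```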
Then created the Python code (a DAG) for Apache Airflow to run the ETL job automatically, so that new files loaded into the S3 bucket are transferred to Amazon Redshift
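A minimal sketch of such a DAG, assuming Airflow 2.x and triggering the Glue job through boto3 in a PythonOperator (the job name and schedule are hypothetical, and the DAG actually used in the project may differ):

```python
from datetime import datetime

import boto3
from airflow import DAG
from airflow.operators.python import PythonOperator

GLUE_JOB_NAME = "s3-to-redshift-orders-job"  # hypothetical Glue job name


def run_glue_job():
    """Trigger the Glue ETL job that copies new order files into Redshift."""
    glue = boto3.client("glue")
    response = glue.start_job_run(JobName=GLUE_JOB_NAME)
    print("Started Glue job run:", response["JobRunId"])


with DAG(
    dag_id="s3_to_redshift_orders",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@hourly",  # run regularly so new S3 files get picked up
    catchup=False,
) as dag:
    PythonOperator(
        task_id="run_glue_etl_job",
        python_callable=run_glue_job,
    )
```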
Uploaded the second and third order files to the S3 bucket to test whether Apache Airflow can run the code and load the records from both files in one run without duplicating the records already in Redshift (see the note at the end of this section)
We can see that the records from both files were added to Redshift
Uploaded the fourth and fifth order files, ran the Apache Airflow job, and we can see that those records were also added to Redshift
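One common way to get the behaviour described above, where reruns load only the newly uploaded files and do not duplicate rows already in Redshift, is to enable Glue job bookmarks on the ETL job. The source does not state how the project handled this, so the following is only a sketch of that option, shown as if the job were defined from code (all names are hypothetical):

```python
import boto3

glue = boto3.client("glue")

# Job bookmarks make Glue track which S3 objects it has already processed,
# so each run picks up only the order files added since the previous run
glue.create_job(
    Name="s3-to-redshift-orders-job",
    Role="arn:aws:iam::123456789012:role/GlueServiceRole",
    Command={
        "Name": "glueetl",
        "ScriptLocation": "s3://orders-data-pipeline-bucket/scripts/etl_job.py",
        "PythonVersion": "3",
    },
    DefaultArguments={"--job-bookmark-option": "job-bookmark-enable"},
)
```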