Data Pipeline to automate data transfer from Amazon S3 to Amazon Redshift using Apache Airflow, AWS Glue and Crawlers.
In this project, based on a real-world scenario, I acted as the Cloud Specialist and built an order-processing system that automates the data transfer from an Amazon S3 bucket to Amazon Redshift using Apache Airflow, the AWS Glue Data Catalog and Glue Crawlers.
Below are a few screenshots of the steps:
Created an EC2 instance and installed Python 3
Opened port 8080 for Apache Airflow
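The port was opened through the EC2 console security group settings; a roughly equivalent boto3 sketch, with a hypothetical security group ID, would be:

```python
import boto3

# Hypothetical security group attached to the Airflow EC2 instance
SECURITY_GROUP_ID = "sg-0123456789abcdef0"

ec2 = boto3.client("ec2")

# Allow inbound TCP 8080 so the Airflow webserver is reachable from a browser
ec2.authorize_security_group_ingress(
    GroupId=SECURITY_GROUP_ID,
    IpPermissions=[
        {
            "IpProtocol": "tcp",
            "FromPort": 8080,
            "ToPort": 8080,
            "IpRanges": [{"CidrIp": "0.0.0.0/0", "Description": "Airflow web UI"}],
        }
    ],
)
```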
Installed Apache Airflow and was able to open it from the browser
Configured VS Code to connect to the EC2 instance
Removed all sample DAG files from the Apache Airflow DAGs folder
Created an Amazon S3 bucket and uploaded the first order file
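The bucket was created and the file uploaded through the console; a minimal boto3 sketch, with hypothetical bucket and file names, would be:

```python
import boto3

# Hypothetical bucket and local file names
BUCKET_NAME = "orders-data-pipeline-bucket"
ORDER_FILE = "orders_1.csv"

s3 = boto3.client("s3")

# Create the bucket (outside us-east-1, also pass
# CreateBucketConfiguration={"LocationConstraint": "<region>"})
s3.create_bucket(Bucket=BUCKET_NAME)

# Upload the first order file under an "orders/" prefix
s3.upload_file(ORDER_FILE, BUCKET_NAME, f"orders/{ORDER_FILE}")
```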
Created an AWS Glue database and a Crawler to first bring the S3 data into the AWS Glue Data Catalog manually
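This was done in the Glue console; a boto3 sketch of the same setup, with hypothetical database, crawler and role names, looks like this:

```python
import boto3

glue = boto3.client("glue")

# Hypothetical names; the crawler catalogs the order files sitting in S3
DATABASE_NAME = "orders_db"
CRAWLER_NAME = "orders-s3-crawler"
GLUE_ROLE_ARN = "arn:aws:iam::123456789012:role/GlueServiceRole"

# Glue database that will hold the cataloged table
glue.create_database(DatabaseInput={"Name": DATABASE_NAME})

# Crawler pointed at the orders prefix in the S3 bucket
glue.create_crawler(
    Name=CRAWLER_NAME,
    Role=GLUE_ROLE_ARN,
    DatabaseName=DATABASE_NAME,
    Targets={"S3Targets": [{"Path": "s3://orders-data-pipeline-bucket/orders/"}]},
)
```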
Ran the Crawler and checked the data in Amazon Athena
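The equivalent calls from boto3 would be something like the sketch below (the query results bucket and table name are hypothetical):

```python
import boto3

glue = boto3.client("glue")
athena = boto3.client("athena")

# Run the crawler so the table shows up in the Glue Data Catalog
glue.start_crawler(Name="orders-s3-crawler")

# Once the crawler finishes, preview the cataloged data from Athena
athena.start_query_execution(
    QueryString="SELECT * FROM orders LIMIT 10;",
    QueryExecutionContext={"Database": "orders_db"},
    ResultConfiguration={
        "OutputLocation": "s3://orders-data-pipeline-bucket/athena-results/"
    },
)
```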
Created the Amazon Redshift cluster and the target table.
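The cluster was created in the Redshift console. The target table can be created from the query editor or from Python; a sketch using the Amazon Redshift Python connector, with hypothetical connection details and column layout, is shown here:

```python
import redshift_connector

# Hypothetical connection details for the new cluster
conn = redshift_connector.connect(
    host="orders-cluster.xxxxxxxx.us-east-1.redshift.amazonaws.com",
    database="dev",
    user="awsuser",
    password="********",
)

cursor = conn.cursor()

# Hypothetical layout for the order records
cursor.execute(
    """
    CREATE TABLE IF NOT EXISTS public.orders (
        order_id     INTEGER,
        customer_id  INTEGER,
        order_date   DATE,
        amount       DECIMAL(10, 2)
    );
    """
)
conn.commit()
```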
Created and ran the Crawler
The Crawler failed
Created an Amazon S3 gateway VPC endpoint to fix the error
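The endpoint was added in the VPC console; a roughly equivalent boto3 call, with hypothetical VPC and route table IDs, is:

```python
import boto3

ec2 = boto3.client("ec2")

# A gateway endpoint lets Glue reach S3 from inside the VPC
ec2.create_vpc_endpoint(
    VpcEndpointType="Gateway",
    VpcId="vpc-0123456789abcdef0",             # hypothetical VPC ID
    ServiceName="com.amazonaws.us-east-1.s3",  # adjust to the cluster's region
    RouteTableIds=["rtb-0123456789abcdef0"],   # hypothetical route table ID
)
```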
Reran the Crawler
Created the AWS Glue ETL job to transfer the data to Amazon Redshift
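The job was built visually in Glue Studio; a simplified sketch of what the generated script does, assuming a catalog table named orders and a Glue connection to the Redshift cluster (all names are hypothetical), looks roughly like this:

```python
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])

sc = SparkContext()
glue_context = GlueContext(sc)
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read the order data that the crawler cataloged from S3
orders = glue_context.create_dynamic_frame.from_catalog(
    database="orders_db",
    table_name="orders",
)

# Write into the Redshift table through a Glue connection; the temp dir
# is the S3 staging area Redshift uses for COPY/UNLOAD
glue_context.write_dynamic_frame.from_jdbc_conf(
    frame=orders,
    catalog_connection="redshift-connection",
    connection_options={"dbtable": "public.orders", "database": "dev"},
    redshift_tmp_dir="s3://orders-data-pipeline-bucket/temp/",
)

job.commit()
```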
Checked the count of records in the table created in Amazon Redshift
Ran the ETL job manually and checked the record count
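Both steps were done through the console and the Redshift query editor; the same check can be scripted along these lines (job name and connection details are hypothetical):

```python
import boto3
import redshift_connector

glue = boto3.client("glue")

# Start the ETL job by name and keep the run id for reference
run = glue.start_job_run(JobName="s3-to-redshift-orders-job")
print("Started Glue job run:", run["JobRunId"])

# After the run succeeds, verify the record count in Redshift
conn = redshift_connector.connect(
    host="orders-cluster.xxxxxxxx.us-east-1.redshift.amazonaws.com",
    database="dev",
    user="awsuser",
    password="********",
)
cursor = conn.cursor()
cursor.execute("SELECT COUNT(*) FROM public.orders;")
print("Rows in public.orders:", cursor.fetchone()[0])
```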
Then created the Python code (a DAG) for Apache Airflow to run the ETL job automatically, so that new files loaded into the S3 bucket are transferred to Amazon Redshift
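A minimal sketch of such a DAG, assuming Airflow 2.x and triggering the Glue job through boto3 in a PythonOperator (the job name and schedule are hypothetical, and the DAG actually used in the project may differ):

```python
from datetime import datetime

import boto3
from airflow import DAG
from airflow.operators.python import PythonOperator

GLUE_JOB_NAME = "s3-to-redshift-orders-job"  # hypothetical Glue job name


def run_glue_job():
    """Trigger the Glue ETL job that copies new order files into Redshift."""
    glue = boto3.client("glue")
    response = glue.start_job_run(JobName=GLUE_JOB_NAME)
    print("Started Glue job run:", response["JobRunId"])


with DAG(
    dag_id="s3_to_redshift_orders",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@hourly",  # run regularly so new S3 files get picked up
    catchup=False,
) as dag:
    PythonOperator(
        task_id="run_glue_etl_job",
        python_callable=run_glue_job,
    )
```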
Uploaded the second and third order files to the S3 bucket to test whether Apache Airflow can run the code and load the records from both files in one run without duplicating the records already in Redshift (see the note at the end of this section)
We can see that the records from both files were added to Redshift
Uploaded the fourth and fifth order files, ran the Apache Airflow job, and we can see that those records were also added to Redshift
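One common way to get the behaviour described above, where reruns load only the newly uploaded files and do not duplicate rows already in Redshift, is to enable Glue job bookmarks on the ETL job. The source does not state how the project handled this, so the following is only a sketch of that option, shown as if the job were defined from code (all names are hypothetical):

```python
import boto3

glue = boto3.client("glue")

# Job bookmarks make Glue track which S3 objects it has already processed,
# so each run picks up only the order files added since the previous run
glue.create_job(
    Name="s3-to-redshift-orders-job",
    Role="arn:aws:iam::123456789012:role/GlueServiceRole",
    Command={
        "Name": "glueetl",
        "ScriptLocation": "s3://orders-data-pipeline-bucket/scripts/etl_job.py",
        "PythonVersion": "3",
    },
    DefaultArguments={"--job-bookmark-option": "job-bookmark-enable"},
)
```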