📊 Data Engineering: A complete serverless ETL pipeline that processes YouTube trending video statistics from multiple regions using a medallion architecture (raw → cleansed → analytics).

Raw Data (S3) → Lambda → Cleansed Layer (S3) → Glue ETL → Analytics Layer (S3) → Athena/QuickSight

This project implements a complete serverless ETL pipeline on AWS to process and analyze YouTube trending video statistics data from multiple regions. The pipeline leverages AWS managed services to build a scalable, cost-effective data analytics solution following the medallion architecture pattern.
🎯 What I built: A data lake solution that ingests raw YouTube statistics (CSV & JSON), processes them through Lambda and Glue ETL jobs, and makes the data queryable via Athena and visualizable in QuickSight — all serverless.
Raw CSV and JSON data files are uploaded to the S3 raw bucket. Data is organized using Hive-style partitioning (region=ca/, region=us/, etc.) supporting 10 regions: CA, DE, FR, GB, IN, JP, KR, MX, RU, US.
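To make the Hive-style layout concrete, here is a minimal sketch that maps a regional CSV file name to its partitioned S3 key. The helper name `raw_statistics_key` is hypothetical, and the `<REGION>videos.csv` naming pattern is an assumption inferred from the `CAvideos.csv` and `USvideos.csv` upload commands in this README.

```python
import re

# The ten regions listed above, lower-cased as they appear in the partition names.
SUPPORTED_REGIONS = {"ca", "de", "fr", "gb", "in", "jp", "kr", "mx", "ru", "us"}


def raw_statistics_key(filename: str, prefix: str = "youtube/raw_statistics") -> str:
    """Build the Hive-style S3 key for a regional CSV such as 'CAvideos.csv'."""
    # Assumed naming pattern: two-letter region code followed by 'videos.csv'.
    match = re.fullmatch(r"([A-Za-z]{2})videos\.csv", filename)
    if match is None:
        raise ValueError(f"unexpected file name: {filename!r}")
    region = match.group(1).lower()
    if region not in SUPPORTED_REGIONS:
        raise ValueError(f"unsupported region: {region!r}")
    # 'region=<code>/' is the partition directory that Glue and Athena recognise.
    return f"{prefix}/region={region}/{filename}"
```

For example, `raw_statistics_key("CAvideos.csv")` yields `youtube/raw_statistics/region=ca/CAvideos.csv`, which matches the destination used in the upload commands.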
A Lambda function is triggered by S3 events when JSON files are uploaded and writes its output to the cleansed layer.
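As a rough illustration of the S3-triggered entry point, the sketch below only shows how a handler can pull the bucket and object key out of the event. It is not the project's actual `lambda function.py`: the helper `extract_s3_objects` is hypothetical, and the real transformation step is left as a comment.

```python
from urllib.parse import unquote_plus


def extract_s3_objects(event):
    """Return (bucket, key) pairs for the JSON objects named in an S3 event."""
    objects = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Keys arrive URL-encoded in S3 notifications ('=' becomes '%3D').
        key = unquote_plus(record["s3"]["object"]["key"])
        if key.lower().endswith(".json"):
            objects.append((bucket, key))
    return objects


def lambda_handler(event, context):
    """Entry point: list the uploaded JSON objects that would be processed."""
    objects = extract_s3_objects(event)
    # A real handler would read each object here, transform it, and write
    # the result to the cleansed bucket. That step is omitted in this sketch.
    return {"objects": [f"s3://{bucket}/{key}" for bucket, key in objects]}
```

Decoding the key matters for this layout in particular, because partition directories such as `region=ca/` contain a character that S3 encodes in event payloads.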
Glue ETL jobs (PySpark-based) handle the heavy transformations.
Transformed data is stored in Parquet format, partitioned by region, queryable via Amazon Athena, and visualized using Amazon QuickSight dashboards.
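Because the analytics data is partitioned by region, an Athena query that filters on `region` only needs to read the matching partition. The sketch below builds such a query string; the function name `build_region_query` and any table name passed to it are hypothetical, since the README does not name the Athena table.

```python
SUPPORTED_REGIONS = ("ca", "de", "fr", "gb", "in", "jp", "kr", "mx", "ru", "us")


def build_region_query(table: str, region: str, limit: int = 10) -> str:
    """Build an Athena SQL query restricted to a single region partition."""
    region = region.lower()
    # Validate inputs instead of interpolating arbitrary text into SQL.
    if region not in SUPPORTED_REGIONS:
        raise ValueError(f"unsupported region: {region!r}")
    if not table.replace("_", "").isalnum():
        raise ValueError(f"unexpected table name: {table!r}")
    # Filtering on the partition column lets Athena skip the other regions.
    return f"SELECT * FROM {table} WHERE region = '{region}' LIMIT {int(limit)}"
```

The resulting string can be pasted into the Athena console or submitted through whichever client the project uses.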
git clone https://github.com/upper-stack/aws-youtube-etl-pipeline.git
cd aws-youtube-etl-pipeline

# Raw bucket
s3://bigdata-on-youtube-raw-{region}-{account-id}-{env}/
# Cleansed bucket
s3://bigdata-on-youtube-cleansed-{region}-{account-id}-{env}/
# Analytics bucket
s3://bigdata-on-youtube-analytics-{region}-{account-id}-{env}/

# Copy JSON reference data
aws s3 cp . s3://your-raw-bucket/youtube/raw_statistics_reference_data/ \
  --recursive --exclude "*" --include "*.json"
# Copy CSV files with regional partitioning
aws s3 cp CAvideos.csv s3://your-raw-bucket/youtube/raw_statistics/region=ca/
aws s3 cp USvideos.csv s3://your-raw-bucket/youtube/raw_statistics/region=us/

lambda function.py code and AWS Data Wrangler layer

Filters data at the source level, reducing data transfer and processing time.
Partitioned by region for query efficiency, enabling partition pruning in Athena.
Columnar format for compression and fast queries — ~75% storage reduction vs CSV.
Uses .coalesce(1) to reduce small file issues and improve query performance.
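The small-file point can be made concrete with some quick arithmetic. The sketch assumes the Glue job writes with a region partition column, as described above, and that each Spark task may write one file into every region directory it holds rows for. The function name is hypothetical, and settings such as a per-file record limit are ignored.

```python
def max_output_files(spark_partitions: int, region_count: int) -> int:
    """Upper bound on files written when output is partitioned by region."""
    if spark_partitions < 1 or region_count < 1:
        raise ValueError("both counts must be at least 1")
    # Each Spark partition (task) can emit one file per region it contains.
    return spark_partitions * region_count


# Without coalescing: e.g. 200 Spark partitions across the 10 regions.
uncoalesced = max_output_files(200, 10)
# After .coalesce(1): a single task writes, so at most one file per region.
coalesced = max_output_files(1, 10)
```

The 200 here is only an illustrative figure (it is Spark's default shuffle partition count); the actual number depends on how the job's data is partitioned before the write.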