Data Modelling
Designed and implemented a relational data model optimized for analytics, transforming raw song and event data into a structured star schema using PostgreSQL and Cassandra scripts
Problem
The analytics team needed a centralized and query-optimized database to analyze user listening behaviors. The existing JSON event logs and song metadata were stored in separate sources and were difficult to join or aggregate efficiently
Solution
Designed a star schema consisting of fact and dimension tables for users, artists, songs, and time data. Developed Python ETL scripts to process raw JSON files, extract and transform data, and load it into PostgreSQL. The schema was designed to support fast analytical queries on song plays and user activity
Impact
Improved analytical efficiency and query performance by normalizing and structuring the raw data. Enabled data analysts to easily query song play patterns, user preferences, and artist popularity across time periods
Tech Stack Used
Key Challenges & Learnings
- Designing an efficient schema that balances normalization and query performance.
- Ensuring referential integrity and handling missing or inconsistent data from raw logs.
- Implementing reliable ETL scripts to transform semi-structured JSON data into relational tables.
- Gained practical experience in data warehouse schema design and normalization techniques
- Learned to build robust ETL pipelines using Cassandra and PostgreSQL
- Improved understanding of how data modeling decisions affect analytical query performance