Data Modelling

Designed and implemented a relational data model optimized for analytics, transforming raw song and event data into a structured star schema using PostgreSQL and Cassandra scripts

Problem

The analytics team needed a centralized and query-optimized database to analyze user listening behaviors. The existing JSON event logs and song metadata were stored in separate sources and were difficult to join or aggregate efficiently

Solution

Designed a star schema consisting of fact and dimension tables for users, artists, songs, and time data. Developed Python ETL scripts to process raw JSON files, extract and transform data, and load it into PostgreSQL. The schema was designed to support fast analytical queries on song plays and user activity

Impact

Improved analytical efficiency and query performance by normalizing and structuring the raw data. Enabled data analysts to easily query song play patterns, user preferences, and artist popularity across time periods

Tech Stack Used

PythonPostgreSQLCassandraETLData ModelingStar Schema

Key Challenges & Learnings

Designing an efficient schema that balances normalization and query performance.
Ensuring referential integrity and handling missing or inconsistent data from raw logs.
Implementing reliable ETL scripts to transform semi-structured JSON data into relational tables.
Gained practical experience in data warehouse schema design and normalization techniques
Learned to build robust ETL pipelines using Cassandra and PostgreSQL
Improved understanding of how data modeling decisions affect analytical query performance

View on GitHub