Spotify Listening History Analysis

The Data

Spotify’s extended streaming history goes back to your first play. Mine starts in 2012 — 10,133 hours, 147,594 streams, 15,748 unique artists across 14 years. 422 continuous days of listening. Here’s what the data says about who I am.

Each record has a timestamp, play duration, track URI, skip flag, shuffle flag, platform, and whether it was offline. It’s enough to reconstruct not just what I listened to but how I listened.

What I Built

A Python + DuckDB analysis pipeline that ingests all the raw JSON files, deduplicates, cleans edge cases, and outputs a set of explorable views.

Key Findings

Top artist all-time: Mac Miller — by total minutes, by years present, by genre diversity.

Skip rate as signal: Songs I’ve never skipped once in 500+ plays are a fundamentally different kind of favorite than songs I’ve played 1,000 times but skipped 40% of the time. The “never-skip” list is the real canon.

Obsession cycles: I listen in phases. There’s a clear pattern of 6–8 week immersions into a single artist, followed by a reset. I can trace every relationship, job change, and major life event to a soundtrack change.

Listening by platform: Desktop vs. mobile vs. offline reveal different moods. Offline listening is almost always music I trust. Desktop listening skews to new exploration.

Discovery vs. catalog ratio: The share of first-time listens has dropped steadily since 2021 — I’m settling into a catalog instead of exploring. Concerning.

Tech Stack

Python 3.9 — ingestion, cleaning, analysis
DuckDB 1.4.4 — in-process analytics, fast joins across 67K records
pandas — dataframe manipulation
matplotlib + seaborn — chart generation
Parquet — processed data persistence

The Hard Parts

Deduplication: Spotify double-logs plays that span offline/online sync transitions. Same track, same timestamp, different offline_timestamp. Deduplicated on (ts, track_uri, ms_played).

Track identity: The same song re-released on three different albums has three different URIs. Fell back to (artist_name, track_name) grouping for anything where URI-level precision caused fragmentation.

Minimum play threshold: 30 seconds. Below that it’s a skip regardless of what the reason_end field says.

Podcast contamination: All podcast and audiobook entries have null track fields. Filtered out for music analysis, kept separate for potential future work.

Concert Connection

The concert history analysis (concerts →) used the same Spotify data as one of three corroboration sources — scrobble timestamps verified against Google Calendar events and email receipts to reconstruct 178 live shows since 2007.