Exploring the Cold Start Problem via Spotify

CS 109a Final Project

During the fall semester of my junior year at Harvard, I took CS 109a, an introduction to data science. The course focused on analyzing real, messy data and making predictions with statistical and machine learning models. Its main topics were data collection, data management, exploratory data analysis, prediction and statistical learning, and effective communication. The final project was an opportunity to explore each of these concepts.

For our final project, our team of three set out to explore the cold start problem using Spotify. One of Spotify's main services is a robust music recommendation system, which relies on data science and machine learning techniques. The streaming company aggregates a number of different models to generate new playlists based on a user’s musical preferences, and it can make useful recommendations because it has extensive user feedback to draw on. But what happens when that information is missing? Imagine a new user trying Spotify for the first time. How can Spotify make accurate recommendations with minimal user information? This is the cold start problem as it applies to Spotify.

Our project asks whether it is possible to generate a likable playlist from a single seed song. We develop six "cold start" models: four k-Nearest Neighbor models and two k-Means Clustering models. Each model attempts to generate a new playlist for a user given only one song as input. We first address the motivations for the project and then review the relevant literature in the field. Next, we examine the data we will use. Finally, we discuss and show the methods, models, and results of our project. Overall, we find that the generated playlists are similar in style to the seed song, suggesting that a roughly likable playlist can indeed be created with minimal user information.
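To make the idea of building a playlist from one seed song concrete, here is a minimal sketch of a nearest-neighbor approach. It is not the project's actual implementation: the track IDs, the three audio features (danceability, energy, valence), their values, and the function name knn_playlist are all hypothetical and chosen only for illustration. The sketch ranks every other track by Euclidean distance to the seed and returns the k closest.

```python
import math

# Hypothetical tracks with made-up audio features in [0, 1]:
# (danceability, energy, valence). Not real Spotify data.
TRACKS = {
    "seed": (0.80, 0.70, 0.60),
    "a": (0.78, 0.72, 0.58),
    "b": (0.20, 0.10, 0.30),
    "c": (0.82, 0.65, 0.66),
    "d": (0.50, 0.50, 0.50),
}


def knn_playlist(seed_id, tracks, k):
    """Return the IDs of the k tracks closest to the seed track."""
    seed_features = tracks[seed_id]
    # Euclidean distance from the seed to every other track.
    distances = [
        (math.dist(seed_features, features), track_id)
        for track_id, features in tracks.items()
        if track_id != seed_id  # never recommend the seed itself
    ]
    distances.sort()  # nearest first
    return [track_id for _, track_id in distances[:k]]


playlist = knn_playlist("seed", TRACKS, k=2)
print(playlist)
```

With these invented values, the two tracks whose features sit closest to the seed ("a" and "c") are selected, while the dissimilar track "b" is ranked last. A k-Means variant would instead cluster all tracks first and draw the playlist from the cluster that contains the seed.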