Deconstructing Hollywood Box Office Success


Since the advent of the film industry, filmmakers would rely on intuition to foretell how well their films would do. But this would oftentimes lead to unexpected or even embarrassing results. We investigated how we can use historical data to understand how a movie’s core ingredients impact its financial success.


We primarily used a public IMDb dataset of 5000 movies with normalized data on budget, gross revenues, ratings, Facebook likes, durations, genres, associated keywords as well as actor & director details. This was supplemented by the Twitter sentiment of actors and their collection of awards & nominations, obtained by web-scraping.


Using supervised and unsupervised learning techniques including linear regression, K-means clustering, random forest and decision trees, we were able to understand the contribution of various factors to the monetary success of a movie. These insights could subsequently be used to produce a film with a maximized profit (and minimized artistic purpose?).


Duration of project: 9 months

Tools involved: Excel, R, Python

Personal contribution & skills: Data Analysis (Excel, R), Marketing Research

Film Analytics