The Netflix Prize

What had begun as a contest to determine the best data scientist became a demonstration of diversity bonuses.

On October 6, 2006, Reed Hastings, the CEO of Netflix, announced an open competition to predict customers’ movie ratings. On that date, Netflix released data consisting of one hundred million movie ratings of one to five stars for seventeen thousand movies from nearly half a million users—the largest data set ever made available to the public. Contest rules were as follows: any contestant who could predict consumer ratings 10 percent more accurately than Netflix’s proprietary Cinematch algorithm would be awarded a $1,000,000 prize.10 Netflix had poured substantial resources into developing Cinematch. Improving on it by 10 percent would not prove easy.

The story of the Netflix Prize differs from traditional diversity narratives in which a single talented individual, given an opportunity, creates a breakthrough because of some idiosyncratic piece of information. Instead, teams of diverse, brilliant people competed to attain a goal. The contest attracted thousands of participants with a variety of technical backgrounds and work experiences. The teams applied an algorithmic zoo of conceptual, computational, and analytical approaches. Early in the contest, the top ten teams included a team of American undergraduate math majors, a team of Austrian computer programmers, a British psychologist and his calculus-wielding daughter, two Canadian electrical engineers, and a group of data scientists from AT&T research labs.

In the end, the participants discovered that their collective differences contributed as much as or more than their individual talents. By sharing perspectives, knowledge, information, and techniques, the contestants produced a sequence of quantifiable diversity bonuses.

Winning the Netflix Prize required the inference of patterns from an enormous data set. That data set covered a diverse population of people. Some liked horror films. Others preferred romantic comedies. Some liked documentaries. The modelers would attempt to account for this heterogeneity by creating categories of movies and of people.

To understand the nature of the task, imagine a giant spreadsheet with a row for each person and a column for each movie. If each user rated every movie, that spreadsheet would contain over 8.5 billion ratings. The data consisted of a mere 100 million ratings. Enormous as that amount is, it fills in less than 1.2 percent of the cells. If you opened the spreadsheet in Excel, you would see mostly blanks. Computer scientists refer to this as sparse data.
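The sparsity arithmetic can be checked directly. The counts below are round approximations of the contest's published figures, not exact values:

```python
# Back-of-the-envelope check of the spreadsheet's sparsity, using
# approximate counts from the contest (assumed round numbers).
n_users = 480_000        # "nearly half a million users"
n_movies = 17_770        # roughly "seventeen thousand movies"
n_ratings = 100_000_000  # the released ratings

cells = n_users * n_movies   # every user-movie pair
fill = n_ratings / cells     # fraction of cells actually filled
print(f"{cells:,} cells, {fill:.2%} filled")
```

Running the numbers gives roughly 8.5 billion cells with about 1.2 percent of them filled, matching the figures above.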

The contestants had to predict the blanks, or, to be more precise, predict the values for the blanks that consumers would fill in next. Inferring patterns from existing data, what data scientists call collaborative filtering, requires the creation of similarity measures between people and between movies. Similar people should rank the same movie similarly. And each person should rank similar movies similarly.
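A minimal collaborative-filtering sketch along these lines, with toy data and a simple user-to-user cosine similarity (the contest teams used far more elaborate variants):

```python
import numpy as np

# Toy ratings matrix. Rows are users, columns are movies;
# 0 marks a blank (unrated) cell.
R = np.array([
    [5, 4, 0, 1],
    [4, 5, 1, 0],    # tastes similar to user 0
    [1, 0, 5, 4],    # tastes roughly opposite to user 0
], dtype=float)

def cosine_sim(u, v):
    """Cosine similarity computed over the movies both users rated."""
    both = (u > 0) & (v > 0)
    if not both.any():
        return 0.0
    a, b = u[both], v[both]
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def predict(R, user, movie):
    """Similarity-weighted average of the other users' ratings."""
    num = den = 0.0
    for other in range(R.shape[0]):
        if other != user and R[other, movie] > 0:
            s = cosine_sim(R[user], R[other])
            num += s * R[other, movie]
            den += abs(s)
    return num / den if den else 0.0

print(predict(R, user=0, movie=2))  # pulled toward similar user 1's low rating
```

The prediction for user 0's blank leans toward the rating given by user 1, the user with similar tastes, which is exactly the behavior a similarity measure is supposed to produce.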

A team knows it has constructed effective similarity measures if the patterns identified in the existing data hold for the blanks. Characterizing similarity between people or movies involves difficult choices: Is Mel Brooks’s spoof Spaceballs closer to the Airplane! comedies or to Star Wars, the movie that Spaceballs parodied?

Early in the competition, contestants’ similarity measures of movies emphasized attributes such as genre (comedy, drama, action), box office receipts, and external rankings. Some models included the presence of specific actors (was Morgan Freeman or Will Smith in the movie?) or types of events, such as gruesome deaths, car chases, or sexual intimacy. Later models added data on the number of days between the movie’s release to video and the person’s day of rental.

One might think that including more features would lead to more accurate predictions. That need not hold. Models with too many variables can overfit the data. To guard against overfitting, computer scientists divide their data into two sets: a training set and a testing set. They fit their model to the first set, then check to see if it also works on the second set.11 In the Netflix Prize competition, the size of the data set and the costs of computation limited the number of variables that could be included in any one model. The winner would therefore not be the person or team that could think up the most features. It would be the team capable of identifying the most informative and tractable set of features.
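The training/testing discipline can be sketched in a few lines. Here a deliberately overflexible polynomial stands in for a model with too many variables; the data are synthetic:

```python
import numpy as np

# Holdout discipline: fit on a training set, judge on a held-out
# testing set. Overfitting shows up as a train/test error gap.
rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, 60)
y = 2 * x + rng.normal(0, 0.3, 60)   # true pattern is linear, plus noise

train, test = slice(0, 40), slice(40, 60)

def rmse(pred, actual):
    return float(np.sqrt(np.mean((pred - actual) ** 2)))

errors = {}
for degree in (1, 15):               # a modest model vs. too many variables
    coef = np.polyfit(x[train], y[train], degree)
    errors[degree] = (rmse(np.polyval(coef, x[train]), y[train]),
                      rmse(np.polyval(coef, x[test]), y[test]))
    print(degree, errors[degree])
```

The degree-15 polynomial fits the training set more closely than the line does, yet does worse on the held-out points: it has memorized the noise.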

Given a feature set, each team also needed an algorithm to make predictions. Dinosaur Planet, a team of three mathematics undergraduates that briefly led the competition in 2007, tried multiple approaches, including clustering (partitioning movies into sets based on similar characteristics), neural networks (algorithms that take features as inputs and learn patterns), and nearest-neighbor methods (algorithms that assign numerical scores to each feature for each movie and compute a distance based on vectors of features).
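A toy nearest-neighbor sketch, with invented feature scores, shows how the Spaceballs question becomes a distance computation:

```python
import numpy as np

# Hypothetical feature scores (comedy, action, romance) for a few
# movies. The vectors are invented for illustration only.
movies = {
    "Spaceballs":   np.array([0.90, 0.40, 0.10]),
    "Airplane!":    np.array([0.95, 0.20, 0.10]),
    "Star Wars":    np.array([0.20, 0.90, 0.30]),
    "The Notebook": np.array([0.10, 0.00, 0.95]),
}

def nearest(name):
    """Return the movie closest to `name` in feature space."""
    target = movies[name]
    others = [(float(np.linalg.norm(target - v)), m)
              for m, v in movies.items() if m != name]
    return min(others)[1]

print(nearest("Spaceballs"))
```

With these made-up scores the comedy feature dominates, so Spaceballs lands nearer Airplane! than Star Wars; a different feature set could reverse that answer, which is precisely why feature choice mattered so much.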

At the end of the first year, a team from AT&T research labs, known as BellKor, led the competition. Their best single model relied on fifty variables per movie and improved on Cinematch by 6.58 percent. That was just one of their models. By combining their fifty models in an ensemble, they could improve on Cinematch by 8.43 percent.
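Why an ensemble can beat its best member shows up in a toy simulation: average many models whose errors are partly independent, and the errors cancel. All numbers below are invented; only the effect is real:

```python
import numpy as np

# Simulate 50 models that each predict the truth plus independent noise,
# then compare one model's error with the error of their average.
rng = np.random.default_rng(2)
truth = rng.uniform(1, 5, 10_000)               # pretend "true" ratings
models = [truth + rng.normal(0, 0.5, truth.size) for _ in range(50)]

def rmse(pred):
    return float(np.sqrt(np.mean((pred - truth) ** 2)))

single = rmse(models[0])                        # one model alone
ensemble = rmse(np.mean(models, axis=0))        # the average of all fifty
print(round(single, 3), round(ensemble, 3))
```

Because the noise terms are independent, averaging fifty such models shrinks the error by roughly a factor of the square root of fifty. Real models share mistakes, so the gains are smaller, which is why diverse models that err differently blend so well.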

A year and a half into the competition, BellKor knew they could outperform the other teams, but also that they could not reach the 10 percent threshold. Rather than give up, BellKor opted to call in reinforcements. In 2008, they merged with the Austrian computer scientists, Big Chaos, a team that had developed sophisticated algorithms for combining models. BellKor had the best predictive models. Big Chaos knew better ways to combine them. By combining these repertoires, they produced a diversity bonus. However, that bonus was not sufficient to push them above the 10 percent threshold.

In 2009, the team again went looking for a new partner. This time, they added a Canadian team, Pragmatic Theory. Pragmatic Theory lacked BellKor’s ability to identify features or Big Chaos’s skills at aggregating models. Pragmatic Theory’s added value came in the form of new insights into human behavior.

They had developed novel methods for categorizing distinct users on the same account. They could separate one person into two identities: Eric alone and Eric with a date. These two Erics might rank the same movie differently. Pragmatic Theory also identified patterns in rankings based on the day of the week—some people rated movies higher on Sundays. They found that for some movies, rankings depended on whether people rated the movie immediately or after having time for reflection. As the credits roll, the hilarity of Snakes on a Plane or Anchorman results in high rankings. With time for reflection, most people no longer consider a flaming flute or a burrito in the face to be hallmarks of quality films and assign fewer stars.12

The combined team, now called BellKor’s Pragmatic Chaos, had thought up a jaw-dropping eight hundred predictive features.13 More diversity meant more ideas. Recall that the goal was not to come up with the most features. Not all the features would improve accuracy. The team had to select from among them to create powerful combinations. Eventually, the team developed a single model that improved on Cinematch by 8.4 percent. They now had a single model as good as BellKor’s entire ensemble of models. When BellKor’s Pragmatic Chaos combined that and other models, they produced even more accurate predictions.

The combined team’s composite models proved up to the task. On June 26, 2009, nearly three years after the contest began, BellKor’s Pragmatic Chaos surpassed the 10 percent threshold. Game over. BellKor’s Pragmatic Chaos won the $1,000,000 in prize money.

Not yet, though. They had to wait. To safeguard against the possibility that 10 percent would prove too easy, the organizers wrote the rules so that the contest would end thirty days after a team passed the threshold. Had the threshold been 5 percent, a level bested a mere six days into the contest, this decision would have been prescient. As events unfolded, the delay seemed unnecessary.

It was not. The fun had only begun. As if drawn from the script of Jurassic Park, the dinosaurs came roaring back. And they brought reinforcements. More than thirty teams, including top performers Grand Prize Team, Opera Solutions, and Vandelay Industries, joined forces with the Dinosaur Planet team to form the Ensemble. Within a few weeks, the Ensemble blended forty-eight models using a sophisticated weighting scheme and took the slightest of leads.
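One way such a weighting scheme can work, sketched with synthetic stand-ins for the models: rather than averaging equally, learn the blend weights by least squares against known ratings.

```python
import numpy as np

# Three hypothetical models: one good, one noisy, one biased upward.
# (Synthetic stand-ins; the Ensemble's actual scheme was more elaborate.)
rng = np.random.default_rng(3)
truth = rng.uniform(1, 5, 5_000)
preds = np.column_stack([
    truth + rng.normal(0, 0.3, truth.size),        # good
    truth + rng.normal(0, 1.0, truth.size),        # noisy
    truth + 0.5 + rng.normal(0, 0.3, truth.size),  # biased
])

def rmse(p):
    return float(np.sqrt(np.mean((p - truth) ** 2)))

# Least-squares blend; the intercept column soaks up constant bias.
X = np.column_stack([preds, np.ones(truth.size)])
weights, *_ = np.linalg.lstsq(X, truth, rcond=None)
blend = X @ weights

print(round(rmse(blend), 3), round(min(rmse(c) for c in preds.T), 3))
```

The learned blend downweights the noisy model, corrects the biased one, and beats the best single model, the same logic, at toy scale, behind blending forty-eight models.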

The ultimate winner was the team whose model performed best on the testing data—the data held back by Netflix. The result was a tie. Each had improved on Cinematch by an identical 10.06 percent. The tie was broken by order of submission. By turning in their code twenty-two minutes before the Ensemble, BellKor’s Pragmatic Chaos won.

Winning the contest required knowledge of the features of movies that matter most, awareness of available information on movies, methods for representing properties of movies in languages accessible to computers, good mental models of how people rank movies, the ability to develop algorithms to predict ratings, and expertise at combining diverse models into an ensemble. What had begun as a contest to determine the best data scientist became a demonstration of diversity bonuses.



