Predicting movie ratings with IMDb data and R

It’s Oscars season again so why not explore how predictable (my) movie tastes are. This has literally been a million dollar problem and obviously I am not gonna solve it here, but it’s fun and slightly educational to do some number crunching, so why not. Below, I will proceed from a simple linear regression to a generalized additive model to an ordered logistic regression analysis. And I will illustrate the results with nice plots along the way. Of course, all done in R (you can get the script here).

The data for this little project comes from the IMDb website and, in particular, from my personal ratings of 442 titles recorded there. IMDb keeps the movies you have rated in a nice little table which includes information on the movie title, director, duration, year of release, genre, IMDb rating, and a few other less interesting variables. Conveniently, you can export the data directly as a csv file.

Outcome variable
The outcome variable that I want to predict is my personal movie rating. IMDb lets you score movies with one to ten stars. Half-points and other fractions are not allowed. It is a tricky variable to work with. It is obviously not a continuous one; at the same time ten ordered categories are a bit too many to treat as a regular categorical variable. Figure 1 plots the frequency distribution (black bars) and density (red area) of my ratings and the density of the IMDb scores (in blue) for the 442 observations in the data.


The mean of my ratings is a good 0.9 points lower than the IMDb scores, which are also less dispersed and have a higher peak (can you say ‘kurtosis’?).

Data-generating process
Some reflection on how the data is generated can highlight its potential shortcomings. First, life is short and I try not to waste my time watching bad movies. Second, even if I get fooled into starting a bad movie, I usually would not bother rating it on IMDb. There are occasional two- and three-star scores, but these are usually movies that were terrible and annoyed me for some reason or another (like, for example, getting a Cannes award or featuring Bill Murray). The data-generating process leads to a selection bias with two important implications. First, the effective range of variation of both the outcome and the main predictor variables is restricted, giving the models less information to work with. Second, because movies with a decent IMDb rating which I disliked have a lower chance of being recorded in the dataset, the relationship we find in the sample will overestimate the real link between my ratings and the IMDb ones.

Take one: linear regression
Enough preliminaries, let’s get to business. An ordinary linear regression model is a common starting point for analysis and its results can serve as a baseline. Here are the estimates that lm provides for regressing my ratings on IMDb scores:

summary(lm(mine~imdb, data=d))

            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -0.6387     0.6669  -0.958    0.339    
imdb          0.9686     0.0884  10.957   ***
Residual standard error: 1.254 on 420 degrees of freedom
Multiple R-squared: 0.2223,	Adjusted R-squared: 0.2205

Since the slope is close to one, the negative intercept indicates that on average my ratings are more than half a point lower. The coefficient of the IMDb score is positive and very close to one, which implies that a one point higher (lower) IMDb rating would predict, on average, a one point higher (lower) personal rating. Figure 2 plots the relationship between the two variables (for an interactive version of the scatter plot, click here):


The solid black line is the regression fit, the blue one shows a non-parametric loess smoothing which suggests some non-linearity in the relationship that we will explore later.

Although the IMDb score coefficient is highly statistically significant, that should not fool us into thinking that we have gained much predictive capacity. The model fit is rather poor. The root mean squared error is 1.25, which is large given the variation in the data. But the inadequate fit is most clearly visible if we plot the actual data versus the predictions. Figure 3 below does just that. The grey bars show the prediction plus/minus two predictive standard errors. If the predictions derived from the model were good, the dots (observations) would be very close to the diagonal (indicated by the dotted line). In this case, they are not. The model does a particularly bad job in predicting very low and very high ratings.
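To make the RMSE and the prediction bars concrete, here is a minimal sketch on simulated data. The data frame `d_sim` and the coefficients used to generate it are assumptions chosen to loosely resemble the output above; this is not the original ratings file.

```r
# Simulated stand-in for the ratings data (NOT the author's data).
set.seed(1)
n <- 442
d_sim <- data.frame(imdb = round(runif(n, 5, 9), 1))
# Assumed relationship, loosely mimicking the reported coefficients.
d_sim$mine <- -0.64 + 0.97 * d_sim$imdb + rnorm(n, sd = 1.25)

m1 <- lm(mine ~ imdb, data = d_sim)

# Root mean squared error of the in-sample predictions.
rmse <- sqrt(mean(residuals(m1)^2))

# Point predictions with 95% prediction intervals
# (roughly plus/minus two predictive standard errors).
pred <- predict(m1, newdata = d_sim, interval = "prediction")

# Share of observed ratings that fall inside their prediction interval.
coverage <- mean(d_sim$mine >= pred[, "lwr"] & d_sim$mine <= pred[, "upr"])

print(round(c(rmse = rmse, coverage = coverage), 2))
```

The intervals cover most observations only because they are wide, which is another way of saying that the point predictions are not very informative.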


We can also see how little information IMDb scores contain about (my) personal scores by going back to the raw data. Figure 4 plots the density of my ratings for two sets of values of IMDb scores – from 6.5 to 7.5 (blue) and from 7.5 to 8.5 (red). The means for the two sets differ somewhat, but the overlap in the density is great.


In sum, knowing the IMDb rating provides some information but on its own doesn’t get us very far in predicting what my score would be.

Take two: adding predictors
Let’s add more variables to see if things improve. Some playing around shows that among the available candidates only the year of release of the movie and dummies for a few genres and directors (selected only from those with more than four movies in the data) give any leverage.
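The dummy variables mentioned above could be built along these lines. This is only a sketch on toy data: the column names `director` and `genres`, and the idea that several genres share one text field, are assumptions rather than a description of the actual IMDb export.

```r
# Toy data; column names and contents are hypothetical.
d_toy <- data.frame(
  director = c(rep("Director A", 6), rep("Director B", 5),
               rep("Director C", 2), "Director D"),
  genres   = rep(c("comedy, romance", "drama"), 7),
  stringsAsFactors = FALSE
)

# Keep only directors with more than four movies in the data.
counts   <- table(d_toy$director)
frequent <- names(counts[counts > 4])

# One 0/1 dummy column per frequent director.
for (nm in frequent) {
  d_toy[[nm]] <- as.numeric(d_toy$director == nm)
}

# Genre dummy: 1 if the genre string mentions "comedy".
d_toy$comedy <- as.numeric(grepl("comedy", d_toy$genres, ignore.case = TRUE))

print(colSums(d_toy[, c(frequent, "comedy")]))
```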

 summary(lm(mine~imdb+d$comedy +d$romance+d$mystery+d$"Stanley Kubrick"+d$"Lars Von Trier"+d$"Darren Aronofsky"+year.c, data=d))

                      Estimate Std. Error t value Pr(>|t|)    
(Intercept)           1.074930   0.651223   1.651  .  
imdb                  0.727829   0.087238   8.343  ***
d$comedy             -0.598040   0.133533  -4.479  ***
d$romance            -0.411929   0.141274  -2.916  ** 
d$mystery             0.315991   0.185906   1.700  .  
d$"Stanley Kubrick"   1.066991   0.450826   2.367  *  
d$"Lars Von Trier"    2.117281   0.582790   3.633  ***
d$"Darren Aronofsky"  1.357664   0.584179   2.324  *  
year.c                0.016578   0.003693   4.488  ***
Residual standard error: 1.156 on 413 degrees of freedom
Multiple R-squared: 0.3508,	Adjusted R-squared: 0.3382

The fit improves somewhat. The root mean squared error of this model is 1.14. Moreover, looking again at the actual versus predicted ratings, the fit is better, especially for highly rated movies – no surprise given that the director dummies pick these up.


The last variable in the regression above is the year of release of the movie. It is coded as the difference from 2014, so the positive coefficient implies that older movies get higher ratings. The statistically significant effect, however, has no straightforward predictive interpretation. The reason is again selection bias. I have only watched movies released before the 1990s that have withstood the test of time. So even though in the sample older films have higher scores, it is highly unlikely that if I pick a random film made in the 1970s I would like it more than a random film made after 2010. In any case, Figure 6 below plots the year of release versus the residuals from the regression of my ratings on IMDb scores (for the subset of films after 1960). We can see that the relationship is likely nonlinear (and that I really dislike comedies from the 1980s).
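A rough sketch of the year coding and the residual check described above, again on simulated data. The variable `year`, the simulated coefficients and the use of `loess` for the trend are assumptions made for illustration.

```r
# Simulated stand-in (NOT the author's data).
set.seed(2)
n <- 300
d_sim <- data.frame(year = sample(1960:2013, n, replace = TRUE),
                    imdb = round(runif(n, 5, 9), 1))

# Year coded as the difference from 2014: older films get larger values.
d_sim$year.c <- 2014 - d_sim$year

# Assumed data-generating process with a small positive effect of age.
d_sim$mine <- -0.6 + 0.95 * d_sim$imdb + 0.015 * d_sim$year.c + rnorm(n)

# Residuals from regressing personal ratings on IMDb scores only.
d_sim$res <- residuals(lm(mine ~ imdb, data = d_sim))

# Non-parametric trend of the residuals over the year of release.
trend_fit <- loess(res ~ year, data = d_sim)
trend <- predict(trend_fit, newdata = data.frame(year = c(1970, 1990, 2010)))
print(round(trend, 2))
```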


So far both regressions assumed that the relationship between the predictors and the outcome is linear. Needless to say, there is no compelling reason why this should be the case. Maybe our predictions will improve if we allow the relationships to take any form. This calls for a generalized additive model.

Take three: generalized additive model (GAM)
In R, we can use the mgcv library to fit a GAM. It doesn’t make sense to hypothesize non-linear effects for binary variables, so we only smooth the effects of IMDb rating and year of release. But why stop there? Perhaps the non-linear effects of IMDb rating and release year are not independent, so why not allow them to interact!

summary(gam(mine ~ te(imdb,year.c)+d$"comedy " +d$"romance "+d$"mystery "+d$"Stanley Kubrick"+d$"Lars Von Trier"+d$"Darren Aronofsky", data = d)) 

Parametric coefficients:
                     Estimate Std. Error t value Pr(>|t|)    
(Intercept)           6.80394    0.07541  90.225   ***
d$"comedy "          -0.60742    0.13254  -4.583   ***
d$"romance "         -0.43808    0.14133  -3.100   ** 
d$"mystery "          0.32299    0.18331   1.762   .  
d$"Stanley Kubrick"   0.83139    0.45208   1.839   .  
d$"Lars Von Trier"    2.00522    0.57873   3.465   ***
d$"Darren Aronofsky"  1.26903    0.57525   2.206   *  
Approximate significance of smooth terms:
                  edf Ref.df     F p-value    
te(imdb,year.c) 10.85  13.42 11.09

Well, the root mean squared error drops to 1.11 and the jointly smoothed (with a full tensor product smooth) variables are significant, but the added predictive value is minimal in this case. Nevertheless, the plot below shows the smoothed terms are more appropriate than the linear ones, and that there is a complex interaction between the two:
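For readers who want to try the comparison themselves, here is a minimal sketch that fits a linear model and a tensor product smooth to simulated data and compares their in-sample RMSE. The data-generating formula is an arbitrary assumption, and the sketch relies on the mgcv package being installed (it usually ships with R).

```r
library(mgcv)

# Simulated stand-in with a mildly non-linear, interacting relationship.
set.seed(3)
n <- 400
d_sim <- data.frame(imdb   = round(runif(n, 5, 9), 1),
                    year.c = sample(1:54, n, replace = TRUE))
d_sim$mine <- 3 + 0.3 * (d_sim$imdb - 5)^2 +
  0.02 * d_sim$year.c * (d_sim$imdb - 7) + rnorm(n)

# Linear baseline versus a full tensor product smooth of both predictors.
m_lin <- lm(mine ~ imdb + year.c, data = d_sim)
m_gam <- gam(mine ~ te(imdb, year.c), data = d_sim)

# In-sample root mean squared error for either model.
rmse <- function(m) sqrt(mean(residuals(m)^2))

print(round(c(linear = rmse(m_lin), gam = rmse(m_gam)), 3))
```

In-sample, the smooth can hardly do worse than the linear fit, since straight lines are a special case of it, so a small drop in RMSE by itself says little about out-of-sample performance.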


Take four: models for categorical data
So far we treated personal movie ratings as if they were a continuous variable, but they are not – taking into account that they are essentially an ordered categorical variable might help. But ten categories, while possible to model, would make the analysis rather unwieldy, so we recode the personal ratings into five categories without much loss of information: 5 or less, 6, 7, 8, and 9 or more.
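One way to do this recoding is with `cut()`. The sketch below uses a handful of made-up ratings; the labels "5" to "9" are an assumption chosen to match the threshold names in the model output further down.

```r
# A few made-up ratings on the one-to-ten scale.
mine <- c(2, 3, 5, 6, 7, 7, 8, 9, 10)

# Collapse into five ordered categories: 5 or less, 6, 7, 8, 9 or more.
# Intervals are right-closed, so (-Inf, 5] maps to "5" and (8, Inf] to "9".
mine.c <- cut(mine,
              breaks = c(-Inf, 5, 6, 7, 8, Inf),
              labels = c("5", "6", "7", "8", "9"),
              ordered_result = TRUE)

print(table(mine.c))
```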

We can first see a nonparametric conditional density plot of the newly created categorical variable as a function of IMDb scores:

The plot shows the observed density for each category of the outcome variable along the range of the predictor. For example, for a film with an IMDb rating of ‘6’, about 35% of the personal scores are ‘5’, a further 50% are ‘6’, and the remaining 15% are ‘7’. Remember that the plot is based on the observed conditional frequencies only (with some smoothing), not on the projections of a model. But the small ups and downs seem pretty idiosyncratic. We can also fit an ordered logistic regression model, which would be appropriate for the categorical outcome variable we have, and plot its predicted probabilities given the model.

First, here is the output of the model:

summary(polr(as.factor(mine.c) ~ imdb+year.c,  Hess=TRUE, data = d))
        Value Std. Error t value
imdb   1.4103   0.149921   9.407
year.c 0.0283   0.006023   4.699

    Value   Std. Error t value
5|6  9.0487  1.0795     8.3822
6|7 10.6143  1.1075     9.5840
7|8 12.1539  1.1435    10.6289
8|9 14.0234  1.1876    11.8079

Residual Deviance: 1148.665 
AIC: 1160.665

The coefficients of the two predictors are significant. The plot below shows the predicted probability of the outcome variable – personal movie rating – being in each of the five categories as a function of IMDb rating and illustrates the substantive scale of the effect.
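The predicted probabilities can be computed by hand from the estimates reported above. In the parametrization used by `polr`, the cumulative probability of being in category k or lower is the logistic function of the k-th threshold minus the linear predictor. The sketch below plugs in the printed coefficients; the helper `predict_probs` is a hypothetical name, and `year.c = 0` is just a convenient reference value (a film released in 2014 under the coding described earlier).

```r
# Estimates copied from the ordered logit output above.
beta_imdb <- 1.4103
beta_year <- 0.0283
zeta <- c("5|6" = 9.0487, "6|7" = 10.6143, "7|8" = 12.1539, "8|9" = 14.0234)

# Hypothetical helper: probabilities of the five rating categories.
predict_probs <- function(imdb, year.c = 0) {
  eta <- beta_imdb * imdb + beta_year * year.c
  # Cumulative probabilities P(rating <= k), with P(rating <= 9) = 1.
  cum <- c(plogis(zeta - eta), 1)
  # Differences of successive cumulative probabilities.
  probs <- diff(c(0, cum))
  names(probs) <- c("5", "6", "7", "8", "9")
  probs
}

print(round(predict_probs(imdb = 7), 3))
print(round(predict_probs(imdb = 8), 3))
```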


Compared to the non-parametric conditional density plot above, these model-based predictions are much smoother and have ‘disciplined’ the effect of the predictor to follow a systematic pattern.

It is interesting to ponder which of the two would be more useful for out-of-sample predictions. Despite the fact that the non-parametric one is more faithful to the current data, I think I would go for the parametric model projections. After all, is it really plausible that a random film with an IMDb rating of 5 would have a lower chance of getting a 5 from me than a film with an IMDb rating of 6, as the non-parametric conditional density plot suggests? I don’t think so. Interestingly, in this case the parametric model has actually corrected for some of the selection bias and made for more plausible out-of-sample predictions.

In sum, whatever the method, it is not very fruitful to try to predict how much a person (or at least, the particular person writing this) would like a movie based on the average rating the movie gets and covariates like the genre or the director. Non-linear regressions and other modeling tricks offer only marginal predictive improvements over a simple linear regression approach, but bring plenty of insight about the data itself.

What is the way ahead? Obviously, one would want to get more relevant predictors, but, unfortunately, IMDb seems to have a policy against web scraping from its database, so one would either have to ask for permission or look at a different website with a more liberal policy (like Rotten Tomatoes perhaps). For me, the purpose of this exercise has been mostly in its methodological educational value, so I think I will leave it at that. Finally, don’t forget to check out the interactive scatterplot of the data used here which shows a user’s entire movie rating history at a glance.

As you may have noted, the IMDb ratings come at a greater level of precision (like 7.3) than the one available for individual users (like 7). So a user who really thinks that a film is worth 7.5 has to pick 7 or 8, but its average IMDb score could well be 7.5. If the rating categories available to the user are indeed too coarse, this would show up in the relationship with the IMDb score: movies with an average score of 7.5 would be less predictable than movies with an average score of either 7 or 8. To test this conjecture, I reran the linear regression models on two subsets of the data: one comprising the movies with an average IMDb rating between 5.9 and 6.1, 6.9 and 7.1, etc., and a second one comprising those with an average IMDb rating between 5.4 and 5.6, 6.4 and 6.6, etc. The fit of the regression for the first group was better than for the second (RMSE of 1.07 vs. 1.11), but, frankly, I expected a more dramatic difference. So maybe ten categories are just enough.
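The subset comparison could be set up roughly as follows. This is a sketch on simulated data, so the resulting numbers mean nothing in themselves. Working with the tenths digit as an integer is a design choice that sidesteps floating-point surprises in comparisons like `imdb - floor(imdb) <= 0.1`.

```r
# Simulated stand-in (NOT the author's data).
set.seed(4)
n <- 442
d_sim <- data.frame(imdb = round(runif(n, 5, 9), 1))
# Personal ratings are whole stars between 1 and 10.
d_sim$mine <- pmin(10, pmax(1, round(-0.64 + 0.97 * d_sim$imdb +
                                       rnorm(n, sd = 1.2))))

# Tenths digit of the IMDb score, e.g. 7.4 -> 4.
tenths <- round(d_sim$imdb * 10) %% 10

near_whole <- tenths %in% c(9, 0, 1)  # 5.9-6.1, 6.9-7.1, ...
near_half  <- tenths %in% c(4, 5, 6)  # 5.4-5.6, 6.4-6.6, ...

# RMSE of the simple linear regression fitted to a subset of rows.
rmse_for <- function(rows) {
  m <- lm(mine ~ imdb, data = d_sim[rows, ])
  sqrt(mean(residuals(m)^2))
}

print(round(c(near_whole = rmse_for(near_whole),
              near_half  = rmse_for(near_half)), 2))
```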

Swimming in a sea of code

If you are looking for code here, move on.

In the beginning, there was only the relentless blinking of the cursor. With the maddening regularity of waves splashing on the shore: blink, blink, blink, blink… Beyond the cursor, the white wasteland of the empty page: vast, featureless, and terrifying as the sea. You stare at the empty page and primordial fear engulfs you: you are never gonna venture into this wasteland, you are never gonna leave the stable, solid, familiar world of menus and shortcuts, icons and buttons.

And then you take the first cautious steps.

print ‘Hello world’
> Hello world
, the sea obliges.

> 2
> 4

You are still scared, but your curiosity is aroused. The playful responsiveness of the sea is tempting, and quickly becomes irresistible. Soon, you are jumping around like a child, rolling upside-down and around and around:

> a=2
> b=3
> a+b

> for (x in 1:60) print (x)
1    2    3    4    5    6    7    8    9   10   11   12   13   14  15   16   17   18   19   20   21   22   23   24   25   26  27   28   29   30   31   32   33   34   35   36   37   38  39   40   41   42   43   44   45   46   47   48   49   50  51   52   53   54   55   56   57   58   59   60

The sense of freedom is exhilarating. You take a deep breath and dive:

> for (i in 1:10) ifelse (i>5, print ('ha'), print ('ho'))
[1] "ho"
[1] "ho"
[1] "ho"
[1] "ho"
[1] "ho"
[1] "ha"
[1] "ha"
[1] "ha"
[1] "ha"
[1] "ha"

Your old fear seems so silly now. Code is your friend. The sea is your friend. The white page is just a playground with endless possibilities.

Your confidence grows. You start venturing further into the deep. You write your first function. You let code scrape the web for you. You generate your first random variable. You run your first statistical models. Your code grows in length and takes you deeper and deeper into unexplored space.

Then suddenly you are lost. Panic sets in. The code stops obeying; you search for the problem but you cannot find it. Panic grows. Instinctively, you grasp for the icons for help, but there are none. You look to the menus for support, but they are gone. You are all alone in the middle of this long string of code which seems so alien right now. Clouds gather. Who tempted you in? How do you get back? What to do next? You want to turn these lists into vectors, but you can’t. You need to decompose your strings into characters but you don’t know how. Out of nowhere encoding problems appear and your entire code is defunct. You are lost….

Eventually, you give up and get back to the shore. The world of menus and icons and shortcuts is limited but safe. Your short flirt with code is over forever, you think. Sometimes you dare to dream about the freedom it gave you but then you remember the feelings of helplessness and entrapment, of being all alone in the open sea. No, getting into code was a childish mistake.

But as time goes by you learn to control your fear and approach the sea again. This time without headless enthusiasm but slowly, with humility and respect for its unfathomable depths. You never stray too far away from the shore in one go. You learn to avoid nested loops and keep your regular expressions to a minimum. You always leave signposts if you need to retrace your path.

Code will never be your friend. The sea will never be your lover. But maybe you can learn to get along just enough to harness part of its limitless power… without losing yourself in it forever.

R start screen

The origins of the digital universe

Just finished Turing’s Cathedral – a fine and stimulating book about the origins of the computer, the interlinked history of the first computers and nuclear bombs, the role of John von Neumann in all that, the Institute for Advanced Study (IAS) in Princeton, and much more. It is a very thoroughly researched volume based on archival materials, interviews, etc. Actually, if I have one complaint it is that it is too scrupulous in presenting the background of all primary, secondary and tertiary characters in the story of the computer and in documenting the development of the various buildings at the IAS. For that reason I found the first part of the book a bit tedious. But the later chapters in which the author allows his own ideas about the digital universe to roam more freely are truly inspired and inspiring. It was also quite fascinating to learn that one of the first uses of the digital computer, apart from calculating nuclear fusion processes and trying to predict the weather, was to run what would now be called agent-based modeling (by Nils Barricelli). Here is my favorite passage from the book:

‘Books are strings of code. But they have mysterious properties – like strings of DNA. Somehow the author captures a fragment of the universe, unravels it into a one-dimensional sequence, squeezes it through a keyhole, and hopes that a three-dimensional vision emerges in the reader’s mind. The translation is never exact.’ (p.312)

Constructivism in the world of Dragons

Here is an analysis of Game of Thrones from a realist international relations perspective. Inevitably, here is the response from a constructivist angle. These are supposed to be fun so I approached them with a light heart and popcorn. But halfway through the second article I actually felt sick to my stomach. I am not exaggerating, and it wasn’t the popcorn – seeing the same ‘arguments’ between realists and constructivists rehearsed in this new setting, the same lame responses to the same lame points, the same ‘debate’ where nobody ever changes their mind, the same dreaded confluence of normative, theoretical, and empirical notions that plagues this never-ending exchange in the real (sorry, socially constructed) world, all that really gave me a physical pain. I felt entrapped – even in this fantasy world there was no escape from the Realist and the Constructivist. The Seven Kingdoms were infected by the triviality of our IR theories. The magic of their world was desecrated. Forever….

Nothing wrong with the particular analyses. But precisely because they manage to be good examples of the genres they imitate, the bad taste in my mouth felt so real. So is it about interests or norms? Oh no. Is it realpolitik or the slow construction of a common moral order? Do leaders disregard the common folk at their own peril? Oh, please stop. How do norms construct identities? Noooo moooore. Send the Dragons!!!

By the way, just one example of how George R.R. Martin can explain a difficult political idea better than an entire conference of realists and constructivists. Why do powerful people keep their promises? Is it ’cause their norms make them do it or because it is in their interests or whatever? Why do Lannisters always pay their debts even though they appear to be some of the meanest, most self-centered characters in the entire world of Game of Thrones? We literally see the answer when Tyrion Lannister tries to escape from the sky cells, and the Lannisters’ reputation for paying their debts is the only thing that saves him, the only thing he has left to pay Mord, but it is enough (see episode 1.6). Having a reputation for paying your debts is one of the greatest assets you can have in every world. And it is worth all the pennies you pay to preserve it even when you can actually get away with not honoring your commitments. It could not matter less if you call this an interest-based or a norm-based explanation: it just clicks, but it takes creativity and insight to convey the point, not impotent meta-theoretical disputes.

The failure of political science

Last week the American Senate supported with a clear bi-partisan majority a decision to stop funding for political science research from the National Science Foundation. Of all disciplines, only political science has been singled out for the cuts and the money will go for cancer research instead.

The decision is obviously wrong for so many reasons but my point is different. How could political scientists who are supposed to understand better than anyone else how politics works allow this to happen? What does it tell us about the state of the discipline that the academic experts in political analysis cannot prevent overt political action that hurts them directly and rather severely?

To me, this failure of American political scientists to protect their own turf in the political game is scandalous. It is as bad as Nobel-winning economists Robert Merton and Myron Scholes leading the hedge fund ‘Long-Term Capital Management’ to go bust and losing 4.6 billion dollars with the help of their Nobel-winning economic theories. As Merton and Scholes’ hedge fund story reveals the true real-world value of (many) financial economics theories, so does the humiliation of political science by Congress reveal the true real-world value of (many) political theories.

Think about it – the world-leading academic specialists on collective action, interest representation and mobilization could not get themselves mobilized, organized and represented in Washington to protect their funding. The professors of the political process and legislative institutions could not find a way to work these same institutions to their own advantage. The experts on political preferences and incentives did not see the broad bi-partisan coalition against political science forming. That’s embarrassing.

It is even more embarrassing because American political science is the most productive, innovative, and competitive in the world. There is no doubt that almost all of the best new ideas, methods, and theories in political science over the last 50 years have come from the US. (And a lot of these innovations have been made possible because of the funding received from the National Science Foundation.) So it is not that individual American political scientists are not smart – of course they are, but for some reason as a collective body they have not been able to benefit from their own knowledge and insights. Or that knowledge and those insights about US politics are deficient in important ways. The fact remains, political scientists were beaten in what should have been their own game. Hopefully some kind of lesson will emerge from all that…

P.S. No reason for public administration, sociology and other related disciplines to be smug about pol sci’s humiliation – they have been saved (for now) mostly by their own irrelevance. 

The evolution of EU legislation (graphed with ggplot2 and R)

During the last half century the European Union has adopted more than 100 000 pieces of legislation. In this presentation I look into the patterns of legislative adoption over time. I tried to create clear and engaging graphs that provide some insight into the evolution of law-making activity: not an easy task given the byzantine nature of policy making in the EU and the complex nomenclature of types of legal acts possible.

The main plot showing the number of adopted directives, regulations and decisions since 1967 is pasted below. There is much more in the presentation. The time series data is available here, as well as the R script used to generate the plots (using ggplot2). Some of the graphs are also available as interactive visualizations via ManyEyes here, here, and here (requires Java). Enjoy.

EU laws over time