Predicting movie ratings with IMDb data and R

It’s Oscars season again, so why not explore how predictable (my) movie tastes are? This has literally been a million dollar problem and obviously I am not gonna solve it here, but it’s fun and slightly educational to do some number crunching, so why not. Below, I will proceed from a simple linear regression to a generalized additive model to an ordered logistic regression analysis. And I will illustrate the results with nice plots along the way. Of course, all done in R (you can get the script here).

Data
The data for this little project comes from the IMDb website and, in particular, from my personal ratings of 442 titles recorded there. IMDb keeps the movies you have rated in a nice little table which includes information on the movie title, director, duration, year of release, genre, IMDb rating, and a few other less interesting variables. Conveniently, you can export the data directly as a csv file.

Outcome variable
The outcome variable that I want to predict is my personal movie rating. IMDb lets you score movies with one to ten stars. Half-points and other fractions are not allowed. It is a tricky variable to work with. It is obviously not a continuous one; at the same time ten ordered categories are a bit too many to treat as a regular categorical variable. Figure 1 plots the frequency distribution (black bars) and density (red area) of my ratings and the density of the IMDb scores (in blue) for the 442 observations in the data.

figure1

The mean of my ratings is a good 0.9 points lower than the IMDb scores, which are also less dispersed and have a higher peak (can you say ‘kurtosis’).

Data-generating process
Some reflection on how the data is generated can highlight its potential shortcomings. First, life is short and I try not to waste my time watching bad movies. Second, even if I get fooled into starting a bad movie, usually I would not bother rating it on IMDb. There are occasional two- and three-star scores, but these are usually movies that were terrible and annoyed me for some reason or another (like, for example, getting a Cannes award or featuring Bill Murray). The data-generating process leads to a selection bias with two important implications. First, the effective range of variation of both the outcome and the main predictor variables is restricted, giving the models less information to work with. Second, because movies with a decent IMDb rating which I disliked have a lower chance of being recorded in the dataset, the relationship we find in the sample will overestimate the real link between my ratings and the IMDb ones.

Take one: linear regression
Enough preliminaries, let’s get to business. An ordinary linear regression model is a common starting point for analysis and its results can serve as a baseline. Here are the estimates that lm provides for regressing my ratings on IMDb scores:

summary(lm(mine~imdb, data=d))

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -0.6387     0.6669  -0.958    0.339    
imdb          0.9686     0.0884  10.957   ***
---
Residual standard error: 1.254 on 420 degrees of freedom
Multiple R-squared: 0.2223,	Adjusted R-squared: 0.2205

The intercept indicates that on average my ratings are more than half a point lower. The coefficient of the IMDb score is positive and very close to one, which implies that a one point higher (lower) IMDb rating would predict, on average, a one point higher (lower) personal rating. Figure 2 plots the relationship between the two variables (for an interactive version of the scatter plot, click here):

figure2

The solid black line is the regression fit, the blue one shows a non-parametric loess smoothing which suggests some non-linearity in the relationship that we will explore later.

Although the IMDb score coefficient is highly statistically significant, that should not fool us into thinking that we have gained much predictive capacity. The model fit is rather poor. The root mean squared error is 1.25, which is large given the variation in the data. But the inadequate fit is most clearly visible if we plot the actual data versus the predictions. Figure 3 below does just that. The grey bars show the prediction plus/minus two predictive standard errors. If the predictions derived from the model were good, the dots (observations) would be very close to the diagonal (indicated by the dotted line). In this case, they are not. The model does a particularly bad job in predicting very low and very high ratings.

figure3
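An actual-versus-predicted check of this kind can be sketched in a few lines of base R. The snippet below is only an illustration on simulated data (the data frame sim and the numbers used to generate it are made up, not the ratings analysed here); it computes the in-sample root mean squared error and draws the predictions with 95% prediction intervals, which are roughly plus/minus two predictive standard errors.

```r
# Simulated stand-in for the ratings data (hypothetical, NOT the author's data)
set.seed(1)
n <- 400
sim <- data.frame(imdb = round(runif(n, 5, 9), 1))
sim$mine <- pmin(10, pmax(1, round(-0.6 + 0.97 * sim$imdb + rnorm(n, sd = 1.25))))

fit <- lm(mine ~ imdb, data = sim)

# Root mean squared error of the in-sample predictions
rmse <- sqrt(mean(residuals(fit)^2))

# Point predictions with 95% prediction intervals for every observation
pred <- predict(fit, newdata = sim, interval = "prediction")

# Predicted (x-axis) versus actual (y-axis); a good model would hug the diagonal
plot(pred[, "fit"], sim$mine, xlim = c(1, 10), ylim = c(1, 10),
     xlab = "Predicted rating", ylab = "Actual rating")
segments(pred[, "lwr"], sim$mine, pred[, "upr"], sim$mine, col = "grey")
abline(0, 1, lty = 3)
```

Note that this RMSE divides by the number of observations, while the residual standard error reported by summary(lm) divides by the residual degrees of freedom, so the two differ slightly.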

We can also see how little information IMDb scores contain about (my) personal scores by going back to the raw data. Figure 4 plots the density of my ratings for two sets of values of IMDb scores – from 6.5 to 7.5 (blue) and from 7.5 to 8.5 (red). The means for the two sets differ somewhat, but the overlap in the density is great.

figure4

In sum, knowing the IMDb rating provides some information but on its own doesn’t get us very far in predicting what my score would be.

Take two: adding predictors
Let’s add more variables to see if things improve. Some playing around shows that among the available candidates only the year of release of the movie and dummies for a few genres and directors (selected only from those with more than four movies in the data) give any leverage.

 summary(lm(mine~imdb+d$comedy +d$romance+d$mystery+d$"Stanley Kubrick"+d$"Lars Von Trier"+d$"Darren Aronofsky"+year.c, data=d))

Coefficients:
                      Estimate Std. Error t value Pr(>|t|)    
(Intercept)           1.074930   0.651223   1.651  .  
imdb                  0.727829   0.087238   8.343  ***
d$comedy             -0.598040   0.133533  -4.479  ***
d$romance            -0.411929   0.141274  -2.916  ** 
d$mystery             0.315991   0.185906   1.700  .  
d$"Stanley Kubrick"   1.066991   0.450826   2.367  *  
d$"Lars Von Trier"    2.117281   0.582790   3.633  ***
d$"Darren Aronofsky"  1.357664   0.584179   2.324  *  
year.c                0.016578   0.003693   4.488  ***
---
Residual standard error: 1.156 on 413 degrees of freedom
Multiple R-squared: 0.3508,	Adjusted R-squared: 0.3382

The fit improves somewhat. The root mean squared error of this model is 1.14. Moreover, looking again at the actual versus predicted ratings, the fit is better, especially for highly rated movies – no surprise given that the director dummies pick these up.

figure5

The last variable in the regression above is the year of release of the movie. It is coded as the difference from 2014, so the positive coefficient implies that older movies get higher ratings. The statistically significant effect, however, has no straightforward predictive interpretation. The reason is again selection bias. I have only watched movies released before the 1990s that have withstood the test of time. So even though in the sample older films have higher scores, it is highly unlikely that if I pick a random film made in the 1970s I would like it more than a random film made after 2010. In any case, Figure 6 below plots the year of release versus the residuals from the regression of my ratings on IMDb scores (for the subset of films after 1960). We can see that the relationship is likely nonlinear (and that I really dislike comedies from the 1980s).

figure6

So far both regressions assumed that the relationship between the predictors and the outcome is linear. Needless to say, there is no compelling reason why this should be the case. Maybe our predictions will improve if we allow the relationships to take any form. This calls for a generalized additive model.

Take three: generalized additive model (GAM)
In R, we can use the mgcv library to fit a GAM. It doesn’t make sense to hypothesize non-linear effects for binary variables, so we only smooth the effects of IMDb rating and year of release. But why stop there? Perhaps the non-linear effects of IMDb rating and release year are not independent, so why not allow them to interact!

library(mgcv)
summary(gam(mine ~ te(imdb,year.c)+d$"comedy " +d$"romance "+d$"mystery "+d$"Stanley Kubrick"+d$"Lars Von Trier"+d$"Darren Aronofsky", data = d)) 

Parametric coefficients:
                     Estimate Std. Error t value Pr(>|t|)    
(Intercept)           6.80394    0.07541  90.225   ***
d$"comedy "          -0.60742    0.13254  -4.583   ***
d$"romance "         -0.43808    0.14133  -3.100   ** 
d$"mystery "          0.32299    0.18331   1.762   .  
d$"Stanley Kubrick"   0.83139    0.45208   1.839   .  
d$"Lars Von Trier"    2.00522    0.57873   3.465   ***
d$"Darren Aronofsky"  1.26903    0.57525   2.206   *  
---
Approximate significance of smooth terms:
                  edf Ref.df     F p-value    
te(imdb,year.c) 10.85  13.42 11.09

Well, the root mean squared error drops to 1.11 and the jointly smoothed (with a full tensor product smooth) variables are significant, but the added predictive value is minimal in this case. Nevertheless, the plot below shows the smoothed terms are more appropriate than the linear ones, and that there is a complex interaction between the two:

figure7
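To make the comparison between a linear fit and a tensor product smooth concrete, here is a minimal sketch on simulated data. Everything in it (the data frame sim, the coefficients used to generate it, and the helper rmse) is hypothetical and only meant to show the mechanics; te(), gam() and vis.gam() come from mgcv.

```r
library(mgcv)

# Simulated data with a mildly non-linear effect of imdb (hypothetical, not the real ratings)
set.seed(3)
n <- 400
sim <- data.frame(imdb = runif(n, 5, 9), year.c = runif(n, 0, 50))
sim$mine <- 2 + 0.1 * sim$imdb^2 + 0.02 * sim$year.c + rnorm(n)

# Linear baseline versus a joint (tensor product) smooth of the two predictors
fit.lm  <- lm(mine ~ imdb + year.c, data = sim)
fit.gam <- gam(mine ~ te(imdb, year.c), data = sim)

# In-sample root mean squared error of each model
rmse <- function(m) sqrt(mean(residuals(m)^2))
c(lm = rmse(fit.lm), gam = rmse(fit.gam))

# Contour plot of the fitted surface over the two smoothed predictors
vis.gam(fit.gam, view = c("imdb", "year.c"), plot.type = "contour")
```

The in-sample RMSE of the smooth model cannot be meaningfully worse than the linear one, so a small drop like the one reported above says little about out-of-sample performance; a hold-out comparison would be the more honest test.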

Take four: models for categorical data
So far we treated personal movie ratings as if they were a continuous variable, but they are not – taking into account that they are essentially an ordered categorical variable might help. But ten categories, while possible to model, would make the analysis rather unwieldy, so we recode the personal ratings into five categories without much loss of information: 5 or less, 6, 7, 8, and 9 or more.
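One way to do such a recoding in base R is with cut(). The sketch below uses a handful of made-up ratings; the category labels and the name mine.c for the recoded variable are assumptions chosen to match the model output further down.

```r
# A few made-up ratings on the 1-10 scale
mine <- c(2, 5, 6, 7, 8, 9, 10)

# Collapse to five ordered categories: 5 or less, 6, 7, 8, 9 or more
mine.c <- cut(mine,
              breaks = c(-Inf, 5, 6, 7, 8, Inf),   # intervals are closed on the right
              labels = c("5", "6", "7", "8", "9"),
              ordered_result = TRUE)

table(mine.c)
```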

We can first see a nonparametric conditional density plot of the newly created categorical variable as a function of IMDb scores:
figure8

The plot shows the observed density for each category of the outcome variable along the range of the predictor. For example, for a film with an IMDb rating of ‘6’, about 35% of the personal scores are ‘5’, a further 50% are ‘6’, and the remaining 15% are ‘7’. Remember that the plot is based on the observed conditional frequencies only (with some smoothing), not on the projections of a model. But the small ups and downs seem pretty idiosyncratic. We can also fit an ordered logistic regression model, which would be appropriate for the categorical outcome variable we have, and plot its predicted probabilities given the model.

First, here is the output of the model:

library(MASS)
summary(polr(as.factor(mine.c) ~ imdb+year.c,  Hess=TRUE, data = d))
Coefficients:
        Value Std. Error t value
imdb   1.4103   0.149921   9.407
year.c 0.0283   0.006023   4.699

Intercepts:
    Value   Std. Error t value
5|6  9.0487  1.0795     8.3822
6|7 10.6143  1.1075     9.5840
7|8 12.1539  1.1435    10.6289
8|9 14.0234  1.1876    11.8079

Residual Deviance: 1148.665 
AIC: 1160.665

The coefficients of the two predictors are significant. The plot below shows the predicted probability of the outcome variable – personal movie rating – being in each of the five categories as a function of IMDb rating and illustrates the substantive scale of the effect.

figure9

Compared to the non-parametric conditional density plot above, these model-based predictions are much smoother and have ‘disciplined’ the effect of the predictor to follow a systematic pattern.

It is interesting to ponder which of the two would be more useful for out-of-sample predictions. Despite the fact that the non-parametric one is more faithful to the current data, I think I would go for the parametric model projections. After all, is it really plausible that a random film with an IMDb rating of 5 would have a lower chance of getting a 5 from me than a film with an IMDb rating of 6, as the non-parametric conditional density plot suggests? I don’t think so. Interestingly, in this case the parametric model has actually corrected for some of the selection bias and made for more plausible out-of-sample predictions.
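For reference, predicted category probabilities from an ordered logit can be obtained with predict(..., type = "probs"). The sketch below runs on simulated data: the data frame sim, the cut points and the coefficient used to generate the outcome are all made up for illustration, and the release year is simply held at zero on its centred scale.

```r
library(MASS)

# Simulated ordered outcome driven by imdb (hypothetical data, not the real ratings)
set.seed(2)
n <- 400
sim <- data.frame(imdb = runif(n, 5, 9),
                  year.c = sample(0:50, n, replace = TRUE))
latent <- 1.4 * sim$imdb + rlogis(n)
sim$mine.c <- cut(latent, breaks = c(-Inf, 9, 10.6, 12.2, 14, Inf),
                  labels = c("5", "6", "7", "8", "9"), ordered_result = TRUE)

fit <- polr(mine.c ~ imdb + year.c, data = sim, Hess = TRUE)

# Predicted probability of each category along a grid of IMDb ratings
grid <- data.frame(imdb = seq(5, 9, by = 0.1), year.c = 0)
probs <- predict(fit, newdata = grid, type = "probs")

# One line per category
matplot(grid$imdb, probs, type = "l", lty = 1,
        xlab = "IMDb rating", ylab = "Predicted probability")
legend("top", legend = colnames(probs), col = 1:5, lty = 1, horiz = TRUE)
```

Each row of probs sums to one, which is what forces the smooth, ‘disciplined’ pattern across categories.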

Conclusion
In sum, whatever the method, it is not very fruitful to try to predict how much a person (or at least, the particular person writing this) would like a movie based on the average rating the movie gets and covariates like the genre or the director. Non-linear regressions and other modeling tricks offer only marginal predictive improvements over a simple linear regression approach, but bring plenty of insight about the data itself.

What is the way ahead? Obviously, one would want to get more relevant predictors, but, unfortunately, IMDb seems to have a policy against web scraping from its database, so one would either have to ask for permission or look at a different website with a more liberal policy (like Rotten Tomatoes perhaps). For me, the purpose of this exercise has been mostly in its methodological educational value, so I think I will leave it at that. Finally, don’t forget to check out the interactive scatterplot of the data used here which shows a user’s entire movie rating history at a glance.

Endnote
As you will have noted, the IMDb ratings come at a greater level of precision (like 7.3) than the one available for individual users (like 7). So a user who really thinks that a film is worth 7.5 has to pick 7 or 8, but its average IMDb score could well be 7.5. If the rating categories available to the user are indeed too coarse, this would show up in the relationship with the IMDb score: movies with an average score of 7.5 would be less predictable than movies with an average score of either 7 or 8. To test this conjecture, I reran the linear regression models on two subsets of the data: one comprising the movies with an average IMDb rating between 5.9 and 6.1, 6.9 and 7.1, etc., and a second one comprising those with an average IMDb rating between 5.4 and 5.6, 6.4 and 6.6, etc. The fit of the regression for the first group was better than for the second (RMSE of 1.07 vs. 1.11), but, frankly, I expected a more dramatic difference. So maybe ten categories are just enough.
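The two subsets can be picked out by looking at the first decimal of the IMDb score. Here is a small sketch with made-up scores; the names tenths, near.whole and near.half are hypothetical, and working with integer tenths is just a way to sidestep floating-point comparisons.

```r
# A few made-up IMDb scores
imdb <- c(5.9, 6.0, 6.1, 6.5, 7.4, 7.6, 8.0, 8.3)

# First decimal as an integer (0-9)
tenths <- round(imdb * 10) %% 10

near.whole <- tenths %in% c(9, 0, 1)   # 5.9-6.1, 6.9-7.1, ...
near.half  <- tenths %in% c(4, 5, 6)   # 5.4-5.6, 6.4-6.6, ...

data.frame(imdb, near.whole, near.half)
```

With the real data, flags like these could be passed to the subset argument of lm to fit the model on each group separately.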

The evolution of EU legislation (graphed with ggplot2 and R)

During the last half century the European Union has adopted more than 100 000 pieces of legislation. In this presentation I look into the patterns of legislative adoption over time. I tried to create clear and engaging graphs that provide some insight into the evolution of law-making activity: not an easy task given the byzantine nature of policy making in the EU and the complex nomenclature of types of legal acts possible.

The main plot showing the number of adopted directives, regulations and decisions since 1967 is pasted below. There is much more in the presentation. The time series data is available here, as well as the R script used to generate the plots (using ggplot2). Some of the graphs are also available as interactive visualizations via ManyEyes here, here, and here (requires Java). Enjoy.

EU laws over time

New data source for political science researchers

Political Data Yearbook Interactive is a new source for data on election results, turnout and government composition for all EU and some non-European countries. It is basically an online version of the yearbooks that ECPR has printed as part of the European Journal of Political Research for many years now.

The interactive online tool has some (limited) visualization options and can export data in several formats.

Music Network Visualization

Note: probably of interest only to the intersection of the readers who are into niche music genres and those interested in network visualization.

My music interests have always been rather, hmm…, eclectic. Somehow IDM, ambient, darkwave, triphop, acid jazz, bossa nova, qawali, Mali blues and other more or less obscure genres have managed to happily co-exist in my music collection. The sheer diversity always invited the question whether there is some structure to the collection, or each genre is an island of its own. Sounds like a job for network visualization!

Now, there are plenty of music network viz applications on the web. But they don’t show my collection, and just seem unsatisfactory for various reasons. So I decided to craft my own visualization using R and igraph.

As a first step I collected for all artists in my last.fm library the artists that the site classifies as similar. So I piggyback on last.fm for the network similarity measures. I also get info on the most-often used tag for the artist and the number of plays it has on the site. The rest is pretty straightforward as can be seen from the code.

# Load the igraph and foreign packages (install if needed)
require(igraph)
require(foreign)
lastfm<-read.csv("http://www.dimiter.eu/Data_files/lastfm_network_ad.csv", header=T,  encoding="UTF-8") #Load the dataset

lastfm$include<-ifelse(lastfm$Similar %in% lastfm$Artist==T,1,0) #Index the links between artists in the library
lastfm.network<-graph.data.frame(lastfm, directed=F) #Import as a graph

last.attr<-lastfm[-which(duplicated(lastfm$Artist)),c(5,3,4) ] #Create some attributes
V(lastfm.network)[1:106]$listeners<-last.attr[,2]
V(lastfm.network)[107:length(V(lastfm.network))]$listeners<-NA
V(lastfm.network)[1:106]$tag<-last.attr[,3]
V(lastfm.network)[107:length(V(lastfm.network))]$tag<-NA #Attach the attributes to the artist from the library (only)
V(lastfm.network)$label.cex<-ifelse(V(lastfm.network)$listeners>1200000, 1.4, 
                                    (ifelse(V(lastfm.network)$listeners>500000, 1.2,
                                            (ifelse(V(lastfm.network)$listeners>100000, 1.1,
                                                   (ifelse(V(lastfm.network)$listeners>50000, 1, 0.8))))))) #Scale the size of labels by the relative popularity

V(lastfm.network)$color<-"white" #Set the color of the dots
V(lastfm.network)$size<-0.1 #Set the size of the dots
V(lastfm.network)$label.color<-NA
V(lastfm.network)[1:106]$label.color<-"white" #Only the artists from the library should be in white, the rest are not needed

E(lastfm.network)[ include==0 ]$color<-"black" 
E(lastfm.network)[ include==1 ]$color<-"red" #Color edges between artists in the library red, the rest are not needed

fix(tkplot) #Add manually to the function an argument for the background color of the canvas and set it to black (bg=black)

tkplot(lastfm.network, vertex.label=V(lastfm.network)$name, layout=layout.fruchterman.reingold,
       canvas.width=1200, canvas.height=800) #Plot the graph and adjust as needed

I plot the network with the tkplot command which allows for the manual adjustments necessary because many artist names get on top of each other in the initial plot. Because the export options of tkplot are limited I just took a print screen (I know, I know, that’s kind of cheating ;-)), added the title in Photoshop and, voila, it’s done!

[click to enlarge and explore]
my-music-netowrk

Knowing intimately the artists in the graph, I can certify that the network definitely makes a lot of sense. I love the small clusters (Flying Lotus, Andy Stott, Extrawelt and Claro Intelecto [minimal/dub], or Anouar Brahem and Rabih Abou-Khalil [ethno jazz]) loosely connected to the rest of the network. And I love the fact that the boundary spanners are immediately obvious (e.g. Pink Martini between acid jazz and world music [what a stupid label by the way!], or Cesaria Evora between African and Caribbean music, or Portishead between brit-pop, trip-hop and darkwave, or Amon Tobin between trip-hop, electro and IDM). Even the different world music genres are close to each other but still unconnected. And somehow Banco de Gaia, the most ethno of all electronica in the library, ended up closest to the world/ethno clusters. There are a few problems, like Depeche Mode, which gets pulled from opposite sides of the graph, but these are very few.

Altogether, I have to admit I feel like a teenage dream of mine has finally been realized. But I realize the network is a rather personal thing (as it was meant to be) so I don’t expect many to get overly excited about it. Still, I would be glad to hear your comments or suggestions for extensions and improvements. And, if you were a good boy/girl during the year, I could also consider visualizing your last.fm network as a present for the new year!

Network visualization in R with the igraph package

In this post I showed a visualization of the organizational network of my department. Since several people asked for details on how the plot was produced, I will provide the code and some extensions below. The plot has been done entirely in R (2.14.01) with the help of the igraph package. It is a great package but I found the documentation somewhat difficult to use, so hopefully this post can be a helpful introduction to network visualization with R. Here we go:

# Load the igraph package (install if needed)

require(igraph)

# Data format. The data is in 'edges' format meaning that each row records a relationship (edge) between two people (vertices).
# Additional attributes can be included. Here is an example:
#	Supervisor	Examiner	Grade	Spec(ialization)
#	AA		BD		6	X	
#	BD		CA		8	Y
#	AA		DE		7	Y
#	...		...		...	...
# In this anonymized example, we have data on co-supervision with additional information about grades and specialization. 
# It is also possible to have the data in a matrix form (see the igraph documentation for details)

# Load the data. The data needs to be loaded as a table first: 

bsk<-read.table("http://www.dimiter.eu/Data_files/edgesdata3.txt", sep='\t', dec=',', header=T)#specify the path, separator(tab, comma, ...), decimal point symbol, etc.

# Transform the table into the required graph format:
bsk.network<-graph.data.frame(bsk, directed=F) #the 'directed' attribute specifies whether the edges are directed
# or equivalent irrespective of the position (1st vs 2nd column). For directed graphs use 'directed=T'

# Inspect the data:

V(bsk.network) #prints the list of vertices (people)
E(bsk.network) #prints the list of edges (relationships)
degree(bsk.network) #print the number of edges per vertex (relationships per people)

# First try. We can plot the graph right away but the results will usually be unsatisfactory:
plot(bsk.network)

Here is the result:

Not very informative indeed. Let’s go on:

 
#Subset the data. If we want to exclude people who are in the network only tangentially (participate in one or two relationships only)
# we can exclude them by subsetting the graph on the basis of the 'degree':

bad.vs<-V(bsk.network)[degree(bsk.network)<3] #identify those vertices part of less than three edges
bsk.network<-delete.vertices(bsk.network, bad.vs) #exclude them from the graph

# Plot the data. Some details about the graph can be specified in advance.
# For example we can separate some vertices (people) by color:

V(bsk.network)$color<-ifelse(V(bsk.network)$name=='CA', 'blue', 'red') #useful for highlighting certain people. Works by matching the name attribute of the vertex to the one specified in the 'ifelse' expression

# We can also color the connecting edges differently depending on the 'grade': 

E(bsk.network)$color<-ifelse(E(bsk.network)$grade==9, "red", "grey")

# or depending on the different specialization ('spec'):

E(bsk.network)$color<-ifelse(E(bsk.network)$spec=='X', "red", ifelse(E(bsk.network)$spec=='Y', "blue", "grey"))

# Note: the example uses nested ifelse expressions which is in general a bad idea but does the job in this case
# Additional attributes like size can be further specified in an analogous manner, either in advance or when the plot function is called:

V(bsk.network)$size<-degree(bsk.network)/10 #here the size of the vertices is set by the degree of the vertex, so that people supervising more get proportionally bigger dots. Getting the right scale takes some playing around with the divisor (or with the scale function from the 'base' package)

# Note that if the same attribute is specified beforehand and inside the function, the former will be overridden.
# And finally the plot itself:
par(mai=c(0,0,1,0)) 			#this specifies the size of the margins. the default settings leave too much free space on all sides (if no axes are printed)
plot(bsk.network,				#the graph to be plotted
layout=layout.fruchterman.reingold,	# the layout method. see the igraph documentation for details
main='Organizational network example',	#specifies the title
vertex.label.dist=0.5,			#puts the name labels slightly off the dots
vertex.frame.color='blue', 		#the color of the border of the dots 
vertex.label.color='black',		#the color of the name labels
vertex.label.font=2,			#the font of the name labels
vertex.label=V(bsk.network)$name,		#specifies the labels of the vertices. in this case the 'name' attribute is used
vertex.label.cex=1			#specifies the size of the font of the labels. can also be made to vary
)

# Save and export the plot. The plot can be copied as a metafile to the clipboard, or it can be saved as a pdf or png (and other formats).
# For example, we can save it as a png:
png(filename="org_network.png", height=800, width=600) #call the png writer
#run the plot
dev.off() #don't forget to close the device
#And that's the end for now.

Here is the result:

Still not perfect, but much more informative and aesthetically pleasing.

Additional information can be found on this guide to igraph which is in development, the examples here, and the official CRAN documentation of the package. Especially useful is this list of the plot attributes that can be tweaked. The plots can also be adjusted interactively using the tkplot function instead of plot, but the options for saving the resulting figure are limited.

Have fun with your networks!

Scatterplots vs. regression tables (Economics professors edition)

I have always considered scatterplots to be the best available device to show relationships between variables. But it must be even better to have the regression table and a full description of the results in addition, right? Not so fast:

A new paper shows that professional economists make largely correct inferences about data when looking at a scatterplot, but get confused when they are shown the details of the regressions next to the scatterplot, and totally mess it up when they are shown only the numbers without the plot! Wow! If you needed any more persuasion that graphing your data and your results is more important than those regression tables with zillions of numbers, now you have it.

P.S. The authors of this research could have done a better job themselves in communicating visually their findings…

[via Felix Salmon]

 

The illusion of predictability: How regression statistics mislead experts
Emre Soyer & Robin M. Hogarth
Abstract
Does the manner in which results are presented in empirical studies affect perceptions of the predictability of the outcomes? Noting the predominant role of linear regression analysis in empirical economics, we asked 257 academic economists to make probabilistic inferences given different presentations of the outputs of this statistical tool. Questions concerned the distribution of the dependent variable conditional on known values of the independent variable. Answers based on the presentation mode that is standard in the literature led to an illusion of predictability; outcomes were perceived to be more predictable than could be justified by the model. In particular, many respondents failed to take the error term into account. Adding graphs did not improve inferences. Paradoxically, when only graphs were provided (i.e., no regression statistics), respondents were more accurate. The implications of our study suggest, inter alia, the need to reconsider how to present empirical results and the possible provision of easy-to-use simulation tools that would enable readers of empirical papers to make accurate inferences.

New tool for discourse network analysis

EJPR has just published an article introducing a new tool for ‘discourse network analysis’. Using the tool, you can measure and visualize political discourses and the networks of actors affiliated to each discourse. One can study the actor congruence networks (based on the number of statements actors share), concept congruence networks (based on whether statements are used by an actor in the same way) and trace the evolution of both over time.

Here is a graph taken from the paper which illustrates the actor congruence networks for the issue of software patents in the EU (click to enlarge):

The discourse networks analysis tool is free and available from the website of Philip Leifeld, one of the co-authors of the article. I can’t wait to get my hands on the program and try it out for myself. The tool promises to be an interesting alternative to evolutionary factor analysis – another new method for studying policy frames and discourses that I recently discussed – with the added benefit of being able to present actors and frames in an integrated analysis.  

Here is the abstract of the EJPR article (there are more resources at this website):

In 2005, the European Parliament rejected the directive ‘on the patentability of computer-implemented inventions’, which had been drafted and supported by the European Commission, the Council and well-organised industrial interests, with an overwhelming majority. In this unusual case, a coalition of opponents of software patents prevailed over a strong industry-led coalition. In this article, an explanation is developed based on political discourse showing that two stable and distinct discourse coalitions can be identified and measured over time. The apparently weak coalition of software patent opponents shows typical properties of a hegemonic discourse coalition. It presents itself as being more coherent, employs a better-integrated set of frames and dominates key economic arguments, while the proponents of software patents are not as well-organised. This configuration of the discourse gave leeway for an alternative course of political action by the European Parliament. The notion of discourse coalitions and related structural features of the discourse are operationalised by drawing on social network analysis. More specifically, discourse network analysis is introduced as a new methodology for the study of policy debates. The approach is capable of measuring empirical discourses both statically and in a longitudinal way, and is compatible with the policy network approach.

Visualizing left-right government positions

How does the political landscape of Europe change over time? One way to approach this question is to map the socio-economic left-right positions of the governments in power. So let’s plot the changing ideological  positions of the governments using data from the Manifesto project! As you will see below, this proved to be a more challenging task than I imagined, but the preliminary results are worth sharing nonetheless.

First, we need to extract the left-right positions from the Manifesto dataset. Using the function described here, this is straightforward:

lr2000<-manifesto.position('rile', start=2000, end=2000)

This compiles the (weighted) cabinet positions for the European countries for the year 2000. Next, let’s generate a static map. We can use the new package rworldmap for this purpose. Let’s also build a custom palette that maps colors to left-right values. Since in Europe red traditionally is the color of the political left (the socialists), the palette ranges from dark red to gray to dark blue (for the right-wing governments).

library (rworldmap)
op <- c('red4','red3','red2','red1','grey','blue1', 'blue2','blue3', 'blue4') #store the colors directly: palette() would return the previously active palette, not the new one

After recoding the name of the UK, we are ready to bind our data and plot the map. You can save the map as a png file.

library(car)
lr2000$State<-recode(lr2000$State, "'Great Britain'='United Kingdom'")

lrmapdata <- joinCountryData2Map( lr2000,joinCode = "NAME", nameJoinColumn = "State", mapResolution='medium')

png(file='LR2000map.png', width=640,height=480)
par(mai=c(0,0,0.2,0),xaxs="i",yaxs="i") #set the margins after opening the png device so that they apply to it
mapCountryData( lrmapdata, nameColumnToPlot="position",colourPalette=op, xlim=c(-9,31), ylim=c(36,68), mapTitle='2000', aspect=1.25,addLegend=T )
dev.off()

The limits on the x- and y-axes center the map on Europe. It is a process of trial and error until you get it right, and the limits need to be coordinated with the aspect and the width and height of the png file so that the map looks reasonably well-proportioned. Here is the result (click to see in full resolution):
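
Some of the trial and error can be replaced by arithmetic. The sketch below assumes that rworldmap hands the aspect argument to base graphics as asp (I have not verified this for every version), in which case one unit on the y-axis is drawn asp times as long as one unit on the x-axis. The helper png.height is hypothetical, and it assumes the only margin is the one at the top and the default png resolution of 72 pixels per inch.

```r
# Hypothetical helper: pixel height at which a lon/lat window fits exactly.
png.height <- function(width, xlim, ylim, asp = 1, top.margin.in = 0, res = 72) {
  # With aspect ratio asp, one y-unit is drawn asp times as long as one x-unit,
  # so the plot region's height/width ratio is asp * diff(ylim) / diff(xlim).
  region <- width * asp * diff(ylim) / diff(xlim)
  # The margin is given in inches (as in par(mai = ...)); convert it to pixels.
  round(region + top.margin.in * res)
}

# The window and aspect used for the map in this post, with its 0.2 inch top margin.
h <- png.height(640, xlim = c(-9, 31), ylim = c(36, 68), asp = 1.25,
                top.margin.in = 0.2)
```

If the png file is shorter than the suggested height, base graphics keeps the aspect ratio and shows a wider range of longitudes than requested rather than distorting the map.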

It looks a bit chunky but not too bad. Next, we have to find a way to show developments over time. We could show several plots for different years on one page, but this is not very effective:

A much better way would be to make the maps dynamic, or, in other words, to animate them. But this is easier said than done. After searching for a few days for tools that can accomplish the job, I settled for producing individual maps for each month, importing the series into Adobe Flash, and exporting a simple animation movie. The R code to produce the individual maps:

lr<-manifesto.position('rile', start=1948, end=2008, period='month')
lr$State<-recode(lr$State, "'Great Britain'='United Kingdom'")
u.c<-unique(lr$Year.month)
dir.create('./maps', showWarnings=FALSE) # make sure the output folder exists
for (i in seq_along(u.c)){
     # keep only the cabinet positions for the current month
     lr.temp<-subset(lr, Year.month==u.c[i])
     lrmapdata <- joinCountryData2Map( lr.temp,joinCode = "NAME", nameJoinColumn = "State", mapResolution='medium')
     plot.name<-paste('./maps/map',i,'.png', sep='')

     png(file=plot.name, width=640,height=480)
     # set the margins after opening the png device: par() settings apply to the current device only
     par(mai=c(0,0,0.2,0),xaxs="i",yaxs="i")
     mapCountryData( lrmapdata, nameColumnToPlot="position",colourPalette=op, xlim=c(-9,31), ylim=c(36,68), mapTitle=u.c[i], aspect=1.25,addLegend=TRUE )
     dev.off()
}

And here is the result (opens outside the post):

Flash video of Left-Right positions (slow)

It kind of works, it has buttons for navigation, but it has one major flaw – it is damn slow. It should be 12 frames (maps) per second, and it is 12 fps inside Flash, but once exported, the frame rate goes down (probably because my laptop’s processor is too slow). In fact, I can export a fast version, but only if I get rid of the control buttons. Here it is (right-click and press play to start):

Flash video of Left-Right positions (fast)

You can also play the animation as an AVI video (uploaded on YouTube), but somehow, through the mysteries of video-processing, a crisp slideshow of 8 MB ended up as a low-res movie of 600 MB.


The results resemble my initial idea, although none is perfect. Ideally, I would want a fast movie with controls and a time-slider, but my Flash programming skills (and my computer) need to be upgraded for that. Meanwhile, the Manifesto project could also update their data on which the animation is based.

Altogether, the experience of creating the visualization has been much more painful than I anticipated. First, there doesn’t seem to be an easy way to get a map of Europe (or, more precisely, of the European Union territories) for use in R. The available options are either too low resolution, or too outdated (e.g. featuring Czechoslovakia), or require centering a world-map using ylim and xlim, which is a problem because these coordinates are connected to the dimensions and the resolution of the output plot. For the US, and for individual European states, there are tons of slick and easy-to-find maps (shapefiles), but for Europe I couldn’t find anything that doesn’t feature huge tracts of land east of the Urals, which are irrelevant and remain empty with political data (which is usually available for the EU+ states only). Any pointers to good, relatively high-res maps (shapefiles) of the EU will be much appreciated.

Second, producing an animation out of the individual maps is rather difficult. Currently, Google Charts offer dynamic plots and static maps; I hope that in the future they will include dynamic maps as well, especially because the googleVis package makes it possible to build Google charts from within R. I also found a new tool called StatPlanet which seems relevant and rather cool, but still relies on Adobe Flash and has no packaged Europe/EU maps. The big guns in visualization software are most probably up to the task, but Tableau is prohibitively expensive and Processing is said to have a steep learning curve. Again, any help in identifying solutions that do not require proprietary software to produce animated maps would be much appreciated. I hope to be able to post an update on the project soon.
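
One candidate that I have not tried on the full series of maps is ffmpeg, a free command-line tool that can stitch numbered image files into a video. The sketch below only builds the call from within R and runs it if ffmpeg is installed and the frames exist. It assumes the frames are named ./maps/map1.png, ./maps/map2.png and so on, as produced by the loop above; the helper ffmpeg.args and the output file name are made up for the illustration.

```r
# Build the argument vector for ffmpeg (it must be installed separately).
ffmpeg.args <- function(pattern = "./maps/map%d.png", fps = 12,
                        output = "lr-positions.mp4") {
  c("-y",                             # overwrite the output file if it exists
    "-framerate", as.character(fps),  # input frame rate: maps per second
    "-i", pattern,                    # numbered input frames (map1.png, map2.png, ...)
    "-pix_fmt", "yuv420p",            # pixel format that most players can decode
    output)
}

args <- ffmpeg.args()

# Run the conversion only if ffmpeg is on the PATH and the first frame exists.
if (nzchar(Sys.which("ffmpeg")) && file.exists("./maps/map1.png")) {
  system2("ffmpeg", args)
}
```

This yields a plain video without navigation buttons or a time-slider, so it addresses the frame rate problem but not the wish for controls.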

Creating Data Maps

There are several online tools for data visualization including IBM’s ManyEyes and Google’s Chart Tools. For a recent post on the other blog to which I contribute I wanted to map the distribution of a variable on a geographical map of Europe. I decided that’s a good opportunity to try a site called Target Map which promises free, high-quality, customizable data maps. The result of my efforts can be seen below:

The link to the map is here.

Altogether, I can’t say that I am too happy with the mapping utility. My main quibble is that there are no default color palettes that translate continuous variables well into color hues. By default, the program offers highly contrasting color choices for the different categories, but ones that don’t suggest the ranking of the categories. And I couldn’t find an easy way to customize the color palette.

Data entry is OK, although once you select Europe as the geographical scope of your data, you can’t have any values for Turkey, for example, even if you try to supply them manually. Altogether, Target Map might be useful for some very small and inconsequential projects, but for serious stuff one should bite the bullet and get familiar with R’s map utilities (something I have been planning to do for a while).