Swimming in a sea of code

If you are looking for code here, move on.

In the beginning, there was only the relentless blinking of the cursor. With the maddening regularity of waves splashing on the shore: blink, blink, blink, blink…Beyond the cursor, the white wasteland of the empty page: vast, featureless, and terrifying as the sea. You stare at the empty page and primordial fear engulfs you: you are never gonna venture into this wasteland, you are never gonna leave the stable, solid, familiar world of menus and shortcuts, icons and buttons.

And then you take the first cautious steps.

print ‘Hello world’
> Hello world
, the sea obliges.

1+1
> 2
2+2
> 4

You are still scared, but your curiosity is aroused. The playful responsiveness of the sea is tempting, and quickly becomes irresistible. Soon, you are jumpting around like a child, rolling upside-down and around and around:

> a=2
> b=3
> a+b
5

> for (x in 1:60) print (x)
1    2    3    4    5    6    7    8    9   10   11   12   13   14  15   16   17   18   19   20   21   22   23   24   25   26  27   28   29   30   31   32   33   34   35   36   37   38  39   40   41   42   43   44   45   46   47   48   49   50  51   52   53   54   55   56   57   58   59   60

The sense of freedom is exhilarating. You take a deep breath and dive:

> for (i in 1:10) ifelse (i>5, print ('ha'), print ('ho'))
[1] "ho"
[1] "ho"
[1] "ho"
[1] "ho"
[1] "ho"
[1] "ha"
[1] "ha"
[1] "ha"
[1] "ha"
[1] "ha"

Your old fear seems so silly now. Code is your friend. The sea is your friend. The white page is just a playground with endless possibilities.

Your confidence grows. You start venturing further into the deep. You write your first function. You let code scrape the web for you. You generate your first random variable. You run your first statistical models. Your code grows in length and takes you deeper and deeper into unexplored space.

Then suddenly you are lost. Panic sets in. The code stops to obey; you search for the problem but you cannot find it. Panic grows. Instinctively, you grasp for help for the icons, but there are none. You look for support by the menus but they are gone. You are all alone  in the middle of this long string of code which seems so alien right now. Clouds gather. Who tempted you in? How do you get back? What to do next? You want to turn these lists into vectors, but you can’t. You need to decompose your strings into characters but you don’t know how. Out of nowhere encoding problems appear and your entire code is defunct. You are lost….

Eventually, you give up and get back to the shore. The world of menus and icons and shortcuts is limited but safe. Your short flirt with code is over forever, you think. Sometimes you dare to dream about the freedom it gave you but then you remember the feelings of helplessness and entrapment, of being all alone in the open sea. No, getting into code was a childish mistake.

But as time goes by you learn to control your fear and approach the sea again. This time without headless enthusiasm but slowly, with humility and respect for its unfathomable depths. You never stray too far away from the shore in one go. You learn to avoid nested loops and keep your regular expressions to a minimum. You always leave signposts if you need to retrace your path.

Code will never be your friend. The sea will never be your lover. But maybe you can learn to get along just enough as to harness part of its limitless power… without losing yourself into it forever. >

R start screen

The evolution of EU legislation (graphed with ggplot2 and R)

During the last half century the European Union has adopted more than 100 000 pieces of legislation. In this presentation I look into the patterns of legislative adoption over time. I tried to create clear and engaging graphs that provide some insight into the evolution of law-making activity: not an easy task given the byzantine nature of policy making in the EU and the complex nomenclature of types of legal acts possible.

The main plot showing the number of adopted directives, regulations and decisions since 1967 is pasted below. There is much more in the presentation. The time series data is available here, as well as the R script used to generate the plots (using ggplot2). Some of the graphs are also available as interactive visualizations via ManyEyes here, here, and here (requires Java). Enjoy.

EU laws over time

Music Network Visualization

Note: probably of interest only to the intersection of the readers who are into niche music genres and those interested in network visualization.

My music interests have always been rather, hmm…, eclectic. Somehow IDM, ambient, darkwave, triphop, acid jazz, bossa nova, qawali, Mali blues and other more or less obscure genres have managed to happily co-exist in my music collection. The sheer diversity always invited the question whether there is some structure to the collection, or each genre is an island of its own. Sounds like a job for network visualization!

Now, there are plenty of music network viz applications on the web. But they don’t show my collection, and just seem unsatisfactory for various reasons. So I decided to craft my own visualization using R and igraph.

As a first step I collected for all artists in my last.fm library the artists that the site classifies as similar. So I piggyback on last.fm for the network similarity measures. I also get info on the most-often used tag for the artist and the number of plays it has on the site. The rest is pretty straightforward as can be seen from the code.

# Load the igraph and foreign packages (install if needed)
require(igraph)
require(foreign)
lastfm<-read.csv("http://www.dimiter.eu/Data_files/lastfm_network_ad.csv", header=T,  encoding="UTF-8") #Load the dataset

lastfm$include<-ifelse(lastfm$Similar %in% lastfm$Artist==T,1,0) #Index the links between artists in the library
lastfm.network<-graph.data.frame(lastfm, directed=F) #Import as a graph

last.attr<-lastfm[-which(duplicated(lastfm$Artist)),c(5,3,4) ] #Create some attributes
V(lastfm.network)[1:106]$listeners<-last.attr[,2]
V(lastfm.network)[107:length(V(lastfm.network))]$listeners<-NA
V(lastfm.network)[1:106]$tag<-last.attr[,3]
V(lastfm.network)[107:length(V(lastfm.network))]$tag<-NA #Attach the attributes to the artist from the library (only)
V(lastfm.network)$label.cex$tag<-ifelse(V(lastfm.network)$listeners>1200000, 1.4, 
                                    (ifelse(V(lastfm.network)$listeners>500000, 1.2,
                                            (ifelse(V(lastfm.network)$listeners>100000, 1.1,
                                                   (ifelse(V(lastfm.network)$listeners>50000, 1, 0.8))))))) #Scale the size of labels by the relative popularity

V(lastfm.network)$color<-"white" #Set the color of the dots
V(lastfm.network)$size<-0.1 #Set the size of the dots
V(lastfm.network)$label.color<-NA
V(lastfm.network)[1:106]$label.color<-"white" #Only the artists from the library should be in white, the rest are not needed

E(lastfm.network)[ include==0 ]$color<-"black" 
E(lastfm.network)[ include==1 ]$color<-"red" #Color edges between artists in the library red, the rest are not needed

fix(tkplot) #Add manually to the function an argument for the background color of the canvas and set it to black (bg=black)

tkplot(lastfm.network, vertex.label=V(lastfm.network)$name, layout=layout.fruchterman.reingold,
       canvas.width=1200, canvas.height=800) #Plot the graph and adjust as needed

I plot the network with the tkplot command which allows for the manual adjustments necessary because many artist names get on top of each other in the initial plot. Because the export options of tkplot are limited I just took a print screen ( I know, I know, that’s kind of cheating ;-) ), added the tittle in Photoshop and, voila, it’s done!

[click to enlarge and explore]
my-music-netowrk

Knowing intimately the artists in the graph, I can certify that the network definitely makes a lot of sense. I love the small clusters (Flying Louts, Andy Stott, Extrawelt and Claro Intelecto [minimal/dub], or Anouar Brahem and Rabih Abou-Khalil [ethno jazz]) loosely connected to the rest of the network. And I love the fact that the boundary spanners are immediately obvious (e.g. Pink Martini between acid jazz and world music [what a stupid label by the way!], or Cesaria Evora between African and Caribbean music, or Portishead between brit-pop, trip-hop and darkwave, or Amon Tobin between trip-hop, electro and IDM). Even the different world music genres are close to each other but still unconnected. And somehow Banco De Gaya, the most ethno of all electronica in the library, ended up closest to the world/ethno clusters. There are a few problems, like Depeche Mode, which get to be pulled from the opposite sides of the graph, but these are very few.

Altogether, I have to admit I feel like a teenage dream of mine has finally been realized. But I realize the network is a rather personal thing (as it was meant to be) so I don’t expect many to get overly excited about it. Still, I would be glad to hear your comments or suggestions for extensions and improvements. And, if you were a good boy/girl during the year, I could also consider visualizing your last.fm network as a present for the new year!

Network visualization in R with the igraph package

In this post I showed a visualization of the organizational network of my department. Since several people asked for details how the plot has been produced, I will provide the code and some extensions below. The plot has been done entirely in R (2.14.01) with the help of the igraph package. It is a great package but I found the documentation somewhat difficult to use, so hopefully this post can be a helpful introduction to network visualization with R. Here we go:

# Load the igraph package (install if needed)

require(igraph)

# Data format. The data is in 'edges' format meaning that each row records a relationship (edge) between two people (vertices).
# Additional attributes can be included. Here is an example:
#	Supervisor	Examiner	Grade	Spec(ialization)
#	AA		BD		6	X	
#	BD		CA		8	Y
#	AA		DE		7	Y
#	...		...		...	...
# In this anonymized example, we have data on co-supervision with additional information about grades and specialization. 
# It is also possible to have the data in a matrix form (see the igraph documentation for details)

# Load the data. The data needs to be loaded as a table first: 

bsk<-read.table("http://www.dimiter.eu/Data_files/edgesdata3.txt", sep='\t', dec=',', header=T)#specify the path, separator(tab, comma, ...), decimal point symbol, etc.

# Transform the table into the required graph format:
bsk.network<-graph.data.frame(bsk, directed=F) #the 'directed' attribute specifies whether the edges are directed
# or equivelent irrespective of the position (1st vs 2nd column). For directed graphs use 'directed=T'

# Inspect the data:

V(bsk.network) #prints the list of vertices (people)
E(bsk.network) #prints the list of edges (relationships)
degree(bsk.network) #print the number of edges per vertex (relationships per people)

# First try. We can plot the graph right away but the results will usually be unsatisfactory:
plot(bsk.network)

Here is the result:

Not very informative indeed. Let’s go on:

 
#Subset the data. If we want to exclude people who are in the network only tangentially (participate in one or two relationships only)
# we can exclude the by subsetting the graph on the basis of the 'degree':

bad.vs<-V(bsk.network)[degree(bsk.network)<3] #identify those vertices part of less than three edges
bsk.network<-delete.vertices(bsk.network, bad.vs) #exclude them from the graph

# Plot the data.Some details about the graph can be specified in advance.
# For example we can separate some vertices (people) by color:

V(bsk.network)$color<-ifelse(V(bsk.network)$name=='CA', 'blue', 'red') #useful for highlighting certain people. Works by matching the name attribute of the vertex to the one specified in the 'ifelse' expression

# We can also color the connecting edges differently depending on the 'grade': 

E(bsk.network)$color<-ifelse(E(bsk.network)$grade==9, "red", "grey")

# or depending on the different specialization ('spec'):

E(bsk.network)$color<-ifelse(E(bsk.network)$spec=='X', "red", ifelse(E(bsk.network)$spec=='Y', "blue", "grey"))

# Note: the example uses nested ifelse expressions which is in general a bad idea but does the job in this case
# Additional attributes like size can be further specified in an analogous manner, either in advance or when the plot function is called:

V(bsk.network)$size<-degree(bsk.network)/10#here the size of the vertices is specified by the degree of the vertex, so that people supervising more have get proportionally bigger dots. Getting the right scale gets some playing around with the parameters of the scale function (from the 'base' package)

# Note that if the same attribute is specified beforehand and inside the function, the former will be overridden.
# And finally the plot itself:
par(mai=c(0,0,1,0)) 			#this specifies the size of the margins. the default settings leave too much free space on all sides (if no axes are printed)
plot(bsk.network,				#the graph to be plotted
layout=layout.fruchterman.reingold,	# the layout method. see the igraph documentation for details
main='Organizational network example',	#specifies the title
vertex.label.dist=0.5,			#puts the name labels slightly off the dots
vertex.frame.color='blue', 		#the color of the border of the dots 
vertex.label.color='black',		#the color of the name labels
vertex.label.font=2,			#the font of the name labels
vertex.label=V(bsk.network)$name,		#specifies the lables of the vertices. in this case the 'name' attribute is used
vertex.label.cex=1			#specifies the size of the font of the labels. can also be made to vary
)

# Save and export the plot. The plot can be copied as a metafile to the clipboard, or it can be saved as a pdf or png (and other formats).
# For example, we can save it as a png:
png(filename="org_network.png", height=800, width=600) #call the png writer
#run the plot
dev.off() #dont forget to close the device
#And that's the end for now.

Here is the result:

Still not perfect, but much more informative and aesthetically pleasing.

Additional information can be found on this guide to igraph which is in development, the examples here, and the official CRAN documentation of the package. Especially useful is this list of the plot attributes that can be tweaked. The plots can also be adjusted interactively using the tkplot function instead of plot, but the options for saving the resulting figure are limited.

Have fun with your networks!

Visualizing left-right government positions

How does the political landscape of Europe change over time? One way to approach this question is to map the socio-economic left-right positions of the governments in power. So let’s plot the changing ideological  positions of the governments using data from the Manifesto project! As you will see below, this proved to be a more challenging task than I imagined, but the preliminary results are worth sharing nonetheless.

First, we need to extract the left-right positions from the Manifesto dataset. Using the function described here, this is straightforward:

lr2000<-manifesto.position('rile', start=2000, end=2000)

This compiles the (weighted) cabinet positions for the European countries for the year 2000. Next, let’s generate a static map. We can use the new package rworldmap for this purpose. Let’s also build a custom palette that maps colors to left-right values. Since in Europe red traditionally is the color of the political left (the socialists), the palette ranges from dark red to gray to dark blue (for the right-wing governments).

library (rworldmap)
op <- palette(c('red4','red3','red2','red1','grey','blue1', 'blue2','blue3', 'blue4'))

After recoding the name of the UK, we are ready to bind our data and plot the map. You can save the map as a png file.

library(car)
lr2000$State<-recode(lr$State, "'Great Britain'='United Kingdom'")

lrmapdata <- joinCountryData2Map( lr2000,joinCode = "NAME", nameJoinColumn = "State", mapResolution='medium')

par(mai=c(0,0,0.2,0),xaxs="i",yaxs="i")
png(file='LR2000map.png', width=640,height=480)
mapCountryData( lrmapdata, nameColumnToPlot="position",colourPalette=op, xlim=c(-9,31), ylim=c(36,68), mapTitle='2000', aspect=1.25,addLegend=T )
dev.off()

The limits on on the x- and y-axes center the map on Europe. It is a process of trial and error till you get it right, and the limits need to be co-ordinated with the aspect and the width and height of the png file so that the map looks reasonably well-proportioned. Here  is the result (click to see in full resolution):

It looks a bit chunky but not too bad. Next, we have to find a way to show developments over time. We could show several plots for different years on one page, but this is not very effective:

A much better way would be to make the maps dynamic, or, in other words, to animate them. But this is easier said than done. After searching for a few days for tools that can accomplish the job, I settled for producing individual maps for each month, importing the series into Adobe Flash, and exporting a simple animation movie. The R code to produce  the individual  maps:

lr<-manifesto.position('rile', start=1948, end=2008, period='month')
lr$State<-recode(lr$State, "'Great Britain'='United Kingdom'")
u.c<-unique(lr$Year.month)
for (i in 1:length(u.c)){
     lr.temp<-subset(lr, lr$Year.month==u.c[i])
     lrmapdata <- joinCountryData2Map( lr.temp,joinCode = "NAME", nameJoinColumn = "State", mapResolution='medium')
     plot.name<-paste('./maps/map',i,'.png', sep='') 

     par(mai=c(0,0,0.2,0),xaxs="i",yaxs="i")
     png(file=plot.name, width=640,height=480)
     mapCountryData( lrmapdata, nameColumnToPlot="position",colourPalette=op, xlim=c(-9,31), ylim=c(36,68), mapTitle=u.c[i], aspect=1.25,addLegend=T )
     dev.off() }

And here is the result (opens outside the post):

Flash video of Left-Right positions (slow)

It kind of works, it has buttons for navigation, but it has one major flow – it is damn slow. It should be 12 frames (maps) per second, and it is 12 fps inside Flash, but once exported, the frame rate goes down (probably because my laptop’s processor is too slow). In fact, I can export a fast version, but only if I get rid of the control buttons. Here it is (right-click and press play to start):

Flash video of Left-Right positions (fast)

You can also play the animation as an AVI video (uploaded on YouTube), but somehow, through the mysteries of video-processing, a crisp slideshow of 8mb ended up as a low-res movie of 600mb.


The results resemble my initial idea, although none is perfect. Ideally, I would want a fast movie with controls and a time-slider, but my Flash programming skills (and my computer) need to be upgraded for that. Meanwhile, the Manifesto project could also update their data on which the animation is based.

Altogether, the experience of creating the visualization has been much more painful than I anticipated. First, there doesn’t seem to be an easy way to get a map of Europe (or, more precisely, of the European Union territories) for use in R. The available options are  either too low resolution, or too outdated (e.g. featuring Czechoslovakia), or require centering a world-map using ylim and xlim which is a problem because these coordinates are connected to the dimensions and the resolution of the output plot. For the US, and for individual European states, there are tons of slick and easy-to-find maps (shapefiles), but for Europe I couldn’t find anything that doesn’t feature huge tracts of land east to the Urals, which are irrelevant and remain empty with political data (which is usually available for the EU+ states only). Any pointers to good, relatively high-res maps (shapefiles) of the EU will be much appreciated.

Second, producing an animation out of the individual maps is rather difficult. Currently, Google Charts offer dynamic plots and static maps, I hope in the future they include dynamic maps as well. Especially because the googleVis package makes it possible to build Google charts from within R. I also found a new tool called StatPlanet which seems relevant and rather cool, but still relies on Adobe Flash and has no packaged Europe/EU maps. The big guns in visualization software are most probably up to the task but Tableau is prohibitively expensive and Processing is said to have a steep learning curve. Again, any help in identifying solutions that do not require proprietary software to produce animated maps would be much appreciated. I hope to be able to post an update on the project soon.

Compiling government positions from the Manifesto Project data with R

The Manifesto Project (former Manifesto Research Group, Comparative Manifestos Project) has assembled a database of ‘quantitative content analyses of parties’ election programs from more than 50 countries covering all free, democratic elections since 1945′ and is freely accessible online. The data, however, is available only at the party, and not at the government (cabinet) level. In order to automate  the process of extracting government positions from the Manifesto data, I wrote a simple R function which combines the party-level Manifesto data with the data on government compositions from the ParlGov database. The function manifesto.position() produces a data frame with the country, the time period, the government position of interest, and an index (id) variable. You can get the data either at a monthly or yearly period of aggregation, specify the start and the end dates, and get the data in ‘long’ or ‘wide’ format.

Here is how it works: First, you would need R up and running (with the ‘ggplot2‘ library installed). Second, you need the original data on party positions and on government compositions, and this script to merge them. Alternatively, you can download (or source) directly the resulting merged dataset here. Third, you need to source the file containing the functions.

Here are a few examples of the function in action:

####
### 1. Load the data file from the working directory or from the URL (default)
#cabinets<-read.table ('cabinets.txt', as.is=TRUE)
cabinets<-read.table ('http://www.dimiter.eu/Data_files/cabinets/cabinets.txt', as.is=TRUE)

### 2. Load the functions from the working directory or from the URL (default)
#source('government position extraction functions.R')
source('http://www.dimiter.eu/Data_files/cabinets/government%20position%20extraction%20functions.R')

### Use of manifesto.position(x, weighted=TRUE, long=TRUE, period='year', start=1945, end=2010)
### Inputs:
###         x [the name of the Manifesto item]
###         weighted  [weighted mean of the government position or a simple unweighted mean]
###         period    [year (default)  or month - time period for which the position is extracted]
###         long      [long (default)  or wide version of the output data]
###         start     [starting year for the extraction; 1945 is default]
###         end       [end year of the extraction; 2010 is default]
### Output: A data frame with four columns - State, Year (Year.month), position [the actual position], id [Year.State(Year.month.State)]
### For details see the sourced file above

### Examples
##  1. Extract the left/right positions
lr<-manifesto.position('rile')
summary(lr)

## 2. Exatract the unweighted International peace position from 1980 until 1999
intp<-manifesto.position('intpeace', weighted=F, start=1980, end=1999)
hist(intp$position)

## 3. Exatract the weighted Welfare position from 1980 until 1999 in a wide, rather than long shape - states are rows and years are colunms
welfare<-manifesto.position('welfare', long=F, start=1980, end=1999)
welfareT<-t(welfare) ##this would make the countries columns and the years rows.

## 4. Left/right on a monthly basis from 1980 till 1990
lrm<-manifesto.position('rile', period='month', start=1980, end=1990)

I hope you find the function useful. Feel free to e-mail any suggestions, remarks, reports on bugs, etc. If you use the function and the data, don’t forget to acknowledge the work of the people who collected the Manifestos and who compiled the ParlGov database.