I’m a member of the American Statistical Association’s “Statistics in Sport” section (http://www.amstat.org/sections/sis/) and I’m also British by birth, so Andy Murray’s success at Wimbledon this year was interesting to me for two reasons. I took a look at some of the data on Murray (collected by IBM’s SlamTracker initiative — http://2013.usopen.org/en_US/slamtracker/ ) with a view to doing a little visual analysis, so now I have another reason to be interested …
I found some data on his performance over a few years leading up to Wimbledon 2013 and wanted to look at trends. Now usually I prefer to create several linked visualizations and look at them together, but for this data I found that several of the stats I was interested in worked nicely when plotted in the same system. Here’s what I came up with:
It is hard to find anyone in visualization today with much time for pie charts. In fact it seems de rigueur to disdain them. And yet we see an awful lot of them. Now, I’m not going to claim that they are a good, general purpose chart, but I do always like to think of times when a chart will actually work well.
When Pie Charts Work At All
One well-known requirement for a pie to have a chance of working is that the data represent a fraction of a whole. That’s the big selling point of pie charts — each data row should represent a fraction of the overall data. So pies work best for percentages and fractions, and second-best for counts, populations, weights — things for which there is a natural feeling that summing them all up and saying “that represents 100%” is good.
On the side of evil is when the numbers must not be summed — if the data represent means (for different sized groups) or degrees Fahrenheit, then a pie representation is flat-out wrong. It’s not a bad rule to say:
Only Use a Pie if it makes sense to think of the data values as summing to 100%
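As a rough illustration of that rule, the Python sketch below turns counts into shares of a whole and the matching pie angles. The function name, category names and numbers are made up for the example.

```python
def pie_angles(counts):
    """Convert non-negative counts into (share, angle in degrees) per category."""
    total = sum(counts.values())
    if total <= 0 or any(v < 0 for v in counts.values()):
        raise ValueError("a pie needs non-negative values with a positive total")
    return {k: (v / total, 360.0 * v / total) for k, v in counts.items()}

# Hypothetical counts: values like these can sensibly be summed to "100%"
votes = {"A": 50, "B": 30, "C": 20}
for name, (share, angle) in pie_angles(votes).items():
    print(f"{name}: {share:.0%} of the whole, {angle:.0f} degrees")
```

Note that the function cannot tell whether summing is meaningful: fed group means or temperatures it would still return angles, so that judgement stays with whoever makes the chart.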
The second rule I’d suggest is based on people’s inability to judge angles accurately. Pies do not work well for that, so if you need to read off numbers accurately, do not use a pie. Pies work well for “A is about twice as big as B” or “C is definitely smaller in the second pie”. They are not good for “C is very slightly lower than D” or “B is just under 33%”. Stating it positively:
Use a Pie if the goal is to make broad comparisons, not detailed ones.
Finally, I’d offer a third suggestion, rather than a rule. It’s based on the observation that a bar chart (a natural competitor to a pie chart) is very often improved by ordering — high to low values, for example. Pies can often look radically different when categories are re-ordered, and although it is sometimes suggested that you do this ordering for pies, I think that a pie for categories that can be re-ordered would almost certainly look better in another form. Instead I would suggest the following:
Use a Pie when the categories have a natural order
When Pie Charts Work Well
Stephen Few (Save the Pies for Dessert: http://www.perceptualedge.com/articles/08-21-07.pdf) quotes a study showing one case where pies are actively superior to bar charts: when it makes sense to compare sums of categories (e.g. the sum of the first two against the sum of the second two). The reason is that in a pie you can easily compare the angles for multiple segments, whereas in a bar chart that is not easy.
I’m a big fan of using languages for visualization rather than canned chart types. I’ve been working with the Grammar of Graphics approach for a number of years within SPSS and now IBM, and my book “Visualizing Time” is composed 95% of Grammar-based visualizations. It’s pretty safe to say it’s my preferred approach.
Protovis (the forerunner of D3, to a great extent) was built on the Grammar approach; Bostock and Heer’s 2009 article (on Heer’s site at http://hci.stanford.edu/jheer/files/2009-Protovis-InfoVis.pdf) gives a very good statement of the benefits of the Grammar-based approach as opposed to the “Chart Type” approach:
The main drawback of [the chart type] approach is that it requires a small, closed system. If the desired chart type is not supported, or the desired visual parameter is not exposed in the interface, no recourse is available to the user and either the visualization design must be compromised or another tool adopted. Given the high cost of switching tools, and the iterative nature of visualization design, frequent compromise is likely.
The Wikipedia Recent Changes Map is a good, clean, simple implementation that addresses the question:
“How is Wikipedia being Edited right now?”
Some of the features of this visualization that work:
- Filtered data: the potential data size is huge, and grows as we wait, so the display only shows the most recent events, both on the map and the list below it
- Multiple linked views: data is shown geographically on the world map, and as a list of events below. This is preferable to trying to have one combined view, as each view supports a different set of tasks, and combining them would complicate those tasks (WHERE are the changes coming from? WHAT is being changed?)
- Not using graphics: the report on what has changed is a simple scrolling text view; since the data is textual, and it is ordered, a simple list of text makes sense.
- Different fade-out rates: using the color for the country to show the most recent changes, and then fading that out in sync with the text description, focuses attention on changes very well. Leaving the dots behind for the changes allows us to keep a longer-term trend in mind.
As a map geek, I might prefer a different projection for the whole earth map; maybe Winkel Tripel?
I took the data from my last post, aggregated up some fields and made a Chord Diagram for it, using RAVE. I was lazy and didn’t do a stellar job on rolling up years, so the year indicated is actually the center of a 4-year span — so 2007 is actually [2005.5, 2009.5] which is a little odd.
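One way to end up with that behaviour is sketched below in Python. It is not the roll-up actually used for the chart, just a binning that is consistent with the interval quoted above: whole years are grouped in fours, and each group is labelled with a single year near its centre. The function name and the bin alignment are assumptions.

```python
def four_year_label(year, offset=2):
    """Label a whole year by its 4-year bin, e.g. 2006..2009 -> 2007.

    offset is an assumed alignment, chosen so that the bin labelled 2007
    covers 2006-2009, i.e. the continuous interval [2005.5, 2009.5].
    """
    start = ((year - offset) // 4) * 4 + offset  # first whole year in the bin
    return start + 1                             # the true centre is start + 1.5

for y in (2005, 2006, 2009, 2010):
    print(y, "->", four_year_label(y))
```

With this alignment the label sits half a year below the true centre of the span, which would explain the oddity mentioned above.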
No big insights here: podcasts are all recent; alternative music is mostly recent too (Eels and Killers are artists with a large number of songs in my library). Interesting that I didn’t buy a lot of music from around 1999 …
I thought there were more packages that could do chord visualizations, but was only able to find some D3 examples.
The track information stored in iTunes is pretty interesting from a visualization point of view, as it contains dates, durations, categories, groupings: all the sorts of things that make for complex, interesting data to look at. The only issue is … it’s in iTunes, and I’d like to get a CSV version of it so I can use it in a bunch of tools.
So, here is the result; a couple of Python scripts that use standard libraries to read the XML file exported by iTunes and convert it to CSV. It’s not general or robust code, just some script that worked for me and should be pretty easy to modify for you. I’m not a Pythonista, mostly doing Java, so apologies for non-idiomatic usage. Feel free to correct or suggest in the comments as this is also a learning exercise for me.
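The scripts themselves are not reproduced here, so the following is only a minimal sketch of the idea, using the standard library's plistlib and csv modules. It assumes the exported file is a property list with a top-level "Tracks" dictionary, and the column list is an arbitrary choice of track keys; adjust both to match your own export.

```python
import csv
import io
import plistlib

# Assumed track keys to export; tracks missing a key get an empty cell.
FIELDS = ["Name", "Artist", "Album", "Genre", "Year", "Total Time"]

def itunes_xml_to_csv(xml_bytes, fields=FIELDS):
    """Return CSV text with one row per track in an iTunes library export."""
    library = plistlib.loads(xml_bytes)          # parse the property list
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=fields)
    writer.writeheader()
    for track in library.get("Tracks", {}).values():
        # Keep only the chosen columns, blank where a track lacks the key
        writer.writerow({f: track.get(f, "") for f in fields})
    return out.getvalue()
```

Called with the bytes of the exported XML file (opened in binary mode), it returns CSV text that can be written straight to disk and loaded into other tools.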
For the Grammar of Graphics language-based approach to visualization, and therefore in the RAVE visualization system, maps are simply another element that can be used within the grammatical formulation.
Although most people consider a map a very different entity from a bar chart, all that really differs between a bar chart and a map of areas like the one included here is that instead of representing a row of data by a bar, we use a polygon (or set of polygons) on a map. Otherwise their properties ought to be the same — we can apply color, patterns, labels, transparency. We can set a summary statistic when there are multiple values for each polygon to reflect min, max, mean, median, range, or any of the regular sets of items. We can flip, transpose and panel the charts. Essentially, from the grammatical point of view, if you can do it to a bar chart, you can do it to a map. The only limitation is that whereas the sizes of the bars can be set or determined by data, the map polygons cannot, so setting sizes on the map polygons has no effect.
Orthogonality is also important, so we can say we want a point element instead of a polygon, as in the map above, where we’ve added a second element to a RAVE US map that conveys different data and is also a good place to put labels.
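The point about summary statistics can be made concrete with a small Python sketch that is independent of RAVE. It collapses multiple data rows per region into one value, which could then drive either a bar's height or a polygon's color. The region names, values and function name are invented for illustration.

```python
from collections import defaultdict
from statistics import mean, median

# Summary functions matching the ones named in the text.
SUMMARIES = {
    "min": min,
    "max": max,
    "mean": mean,
    "median": median,
    "range": lambda xs: max(xs) - min(xs),
}

def summarize(rows, statistic="mean"):
    """Collapse (region, value) rows to one summary value per region."""
    groups = defaultdict(list)
    for region, value in rows:
        groups[region].append(value)
    fn = SUMMARIES[statistic]
    return {region: fn(values) for region, values in groups.items()}

# Hypothetical data: several rows per state
rows = [("Ohio", 3), ("Ohio", 7), ("Texas", 10), ("Texas", 2), ("Texas", 6)]
print(summarize(rows, "median"))  # the same result can feed a bar chart or a map
```

The element that finally draws each region (bar, polygon or point) never appears in this step, which is the sense in which the choices are orthogonal.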
In English, we use many different words to describe the same basic objects. In one survey, researchers Dieth and Orton explored which words were used for the place where a farmer might keep his cow, depending on where the speaker resided in England. The results include words like byre, shippon, mistall, cow-stable, cow-house, cow-shed, neat-house or beast-house. We see the same situation in visualization, where a two-dimensional chart with data displayed as a collection of points, using one variable for the horizontal axis and one for the vertical, is variously called a scatterplot, a scatter diagram, a scatter graph, a 2D dotplot or even a star field.
There have been a number of attempts to form taxonomies, or categorizations, of visualizations. Most software packages for creating graphics, such as Microsoft Excel, focus on the type of graphical element used to display the data and then sub-classify from that. This has one immediate problem in that plots with multiple elements are hard to classify (should we classify a chart with bars and points as a bar chart with point additions, or instead as a point chart with bars added?). Other authors have started with the dimensionality of the data (one-dimensional, two-dimensional, etc.) and used that as a basic classification criterion, but that has similar problems.
Visualizations are too numerous, too diverse and too exciting to fit well into a taxonomy that divides and subdivides. In contrast to the evolution of animals and plants, which did occur essentially in a tree-like manner, with branches splitting and sub-splitting, information visualization techniques have been invented more by a compositional approach. We take a polar coordinate system, combine it with bars, and achieve a Rose diagram. We put a network in 3D. We add texture, shape and size mappings to all the above. We split it into panels. This is why a traditional taxonomy of information visualization is doomed to be unsatisfying. It is based on a false analogy with biology and denies the basic process by which visualizations have been created: composition.
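A toy Python sketch of this compositional view follows. It is not how RAVE or any grammar-based system is implemented; the class, field names and option lists are hypothetical, chosen only to show that independent choices multiply, while named chart types are just particular combinations.

```python
from dataclasses import dataclass
from itertools import product

@dataclass(frozen=True)
class ChartSpec:
    """A chart described by independent choices rather than by a type name."""
    coords: str            # e.g. "rectangular" or "polar"
    element: str           # e.g. "bar", "point", "polygon"
    paneled: bool = False  # whether the chart is split into panels

# Familiar names are labels for particular compositions.
KNOWN_NAMES = {
    ChartSpec("polar", "bar"): "rose diagram",
    ChartSpec("rectangular", "point"): "scatterplot",
}

coords = ["rectangular", "polar"]
elements = ["bar", "point", "polygon"]
specs = [ChartSpec(c, e, p) for c, e, p in product(coords, elements, [False, True])]

print(len(specs), "compositions from", len(coords) + len(elements) + 2, "building blocks")
for spec in specs:
    print(spec, "->", KNOWN_NAMES.get(spec, "(no common name)"))
```

Most of the combinations have no common name at all, which is roughly why a tree of named chart types struggles to cover them.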