Beyond visualization: The future of multivariate data analysis
Emerging data analysis paradigms are presenting the opportunity to represent high-dimensional data in intuitive and accessible forms. I will provide an overview of the use of data music, data video, and data food in the context of modern data analysis, and I will discuss how to integrate these methods into conventional analytics software.
Data are just abstract numbers, with no true representation in the real world. Data analysis is about getting the numbers into our brains.
We cannot perceive the data directly, so we must convert them into forms that we can perceive.
We often make graphs and data tables, and then we look at the graphs with our eyes. But perhaps we could instead make kebabs, and eating the kebabs would tell us something about the data.
My point here is that data analysis need not be limited to conventional visualization.
Data music analysis products
All of these bespoke data music implementations are great, but they're impractical for most analysis. Here are some data music tools that integrate with conventional data analysis software.
One approach is to think of music as data tables. So let's start by talking about data tables.
Here's a data table.
In a tidy data table, rows correspond to things. (You might also call them "observations" or "records".) For example, each row in this table is a country.
Columns correspond to attributes about each thing. You might also call them "variables", "fields", "properties", or "dimensions".
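To make the row and column idea concrete, here is a minimal sketch of a tidy table in R. The country names and numbers are made up for illustration; they are not the table from the talk.

```r
# Hypothetical toy data: the names and values are invented for illustration.
countries <- data.frame(
  country    = c("A", "B", "C"),   # one row per thing (here, a country)
  population = c(10, 20, 30),      # each further column is an attribute
  area       = c(100, 400, 900),
  stringsAsFactors = FALSE
)

nrow(countries)   # how many things (rows) we have
names(countries)  # which attributes (columns) we recorded
```

Adding a country means adding a row; adding a new measurement means adding a column.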
We can use data tables to represent musical scores.
In this figure we see music represented in two forms: as a data table and as conventional sheet music.
Each row in the table corresponds to a beat in the music. That is, the two regions with red boxes around them correspond to the same data. Different notes come from different columns. G2 is G2, G3 is G3, B5 is B5, and the lyrics aren't in the sheet music.
Music is often divided into measures. In sheet music, this is typically represented with bars. That's also how I do it in my spreadsheet.
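The same layout can be sketched outside a spreadsheet. The following R data frame is only an illustration: the column names (measure, beat, bass, melody) and the notes are my assumptions, not the contents of the spreadsheet in the figure.

```r
# Hypothetical score: one row per beat, one column per voice.
score <- data.frame(
  measure = c(1, 1, 1, 1, 2, 2, 2, 2),
  beat    = c(1, 2, 3, 4, 1, 2, 3, 4),
  bass    = c("C3", NA, "G2", NA, "C3", NA, "G2", NA),  # NA means a rest
  melody  = c("E4", "G4", "E4", "D4", "C4", "D4", "E4", "C4"),
  stringsAsFactors = FALSE
)

# Splitting by measure is the tabular analogue of drawing bar lines.
measures <- split(score, score$measure)
length(measures)
```

Each element of `measures` is itself a small table holding the beats of one measure.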
And that sheet music you saw was actually generated inside the spreadsheet.
I just showed you how we can think of music as a representation of the data in a spreadsheet.
But spreadsheets are also code, so let's also think about musical functions as spreadsheet functions.
For example, sheetmusic provides a third_interval function for making a third interval in a particular key. Here are some other functions.
chord_progression(<the progression>, <base note>)
For example, if I type "=ionian_scale('C4')", I'll get an Ionian scale starting at middle C.
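To show what such functions compute, here is a rough sketch in R. It is not sheetmusic's implementation: the names `ionian_scale_sketch` and `third_interval_sketch`, their arguments, and the use of MIDI note numbers are all assumptions made for illustration.

```r
note_names <- c("C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B")

# Convert a name such as "C4" to a MIDI note number (middle C, C4, is 60).
note_to_midi <- function(note) {
  pitch  <- sub("-?[0-9]+$", "", note)
  octave <- as.integer(sub("^[A-G]#?", "", note))
  (octave + 1) * 12 + match(pitch, note_names) - 1
}

# Convert a MIDI note number back to a name.
midi_to_note <- function(midi) {
  paste0(note_names[midi %% 12 + 1], midi %/% 12 - 1)
}

# Ionian (major) scale: fixed semitone offsets from the base note.
ionian_scale_sketch <- function(base) {
  midi_to_note(note_to_midi(base) + c(0, 2, 4, 5, 7, 9, 11, 12))
}

# Diatonic third: two scale steps above `note` within the major key `key`.
# Only handles notes in the first octave of the key's scale.
third_interval_sketch <- function(note, key) {
  scale <- note_to_midi(key) + c(0, 2, 4, 5, 7, 9, 11, 12, 14, 16)
  midi_to_note(scale[match(note_to_midi(note), scale) + 2])
}

ionian_scale_sketch("C4")
# "C4" "D4" "E4" "F4" "G4" "A4" "B4" "C5"
```

With these definitions, `third_interval_sketch("C4", "C4")` gives "E4", the major third above middle C in C major.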
data-driven rhythms (ddr + tuneR)
library(ddr)
# ChickWeight is one of R's built-in datasets.
chicks <- arpeggidata(sqrt(ChickWeight$weight), blip, scale="Emajor", bpm=200, count=1/32)
play(chicks)
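I have not shown how ddr turns numbers into notes, so here is a base R sketch of one plausible mapping: rescale the data linearly onto the steps of a scale. The function `data_to_pitches` and the two-octave E major range are assumptions for illustration, not ddr's actual algorithm.

```r
# E major over two octaves as MIDI note numbers, starting at E4 = 64.
e_major <- 64 + c(0, 2, 4, 5, 7, 9, 11, 12, 14, 16, 17, 19, 21, 23, 24)

# Map each value to a scale step: the minimum lands on the lowest note,
# the maximum on the highest, and everything else in between.
data_to_pitches <- function(x, scale) {
  rng <- range(x, na.rm = TRUE)
  idx <- round((x - rng[1]) / (rng[2] - rng[1]) * (length(scale) - 1)) + 1
  scale[idx]
}

pitches <- data_to_pitches(sqrt(ChickWeight$weight), e_major)
head(pitches)
```

Because every value snaps to a note in the scale, the result stays in key however the data are distributed.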
Features of different data music tools
To summarize, here are the tools.
"Freq" is whether a tool can convert frequencies, like "440 Hz", to sounds.
"Notes" is whether a tool can convert note names, like "middle A", to sounds.
"Chords" is whether a tool can generate chords.
All of these tools can play sounds, but some can also export to other software in other ways.
|Tool||Platform||Freq||Notes||Chords||Export|
|Tones in Tune||Gnumeric||Yes||No||No||?|
|Sheetmusic||Excel||No||Yes||Yes||MIDI file, sheet music|
Note: You can also implement rendering with the underlying platform.
|Generic Tool||What it does|
|MIDI synthesizer||Convert a MIDI file/events to sound or a wave file.|
|Music engraving||Convert a text file to MIDI or PDF.|
|Math||Calculate scales, chords, rhythms, &c.|
|Wave/PCM writer||Convert a numeric vector to a sound file.|
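As a rough sketch of how the "Math" and "Wave/PCM writer" pieces fit together, the following R code turns a note into a frequency and the frequency into a numeric vector. The helper names `midi_to_freq` and `sine_tone` are mine, and the file-writing step assumes the tuneR package is installed; it is skipped otherwise.

```r
# Math: equal-tempered frequency of a MIDI note number (A4 = 69 = 440 Hz).
midi_to_freq <- function(midi) 440 * 2^((midi - 69) / 12)

# Numeric vector: a sine tone sampled `rate` times per second.
sine_tone <- function(freq, seconds = 0.5, rate = 44100) {
  t <- seq_len(round(seconds * rate)) / rate
  sin(2 * pi * freq * t)
}

tone <- sine_tone(midi_to_freq(69))

# Wave/PCM writer: store the vector as 16-bit PCM, if tuneR is available.
if (requireNamespace("tuneR", quietly = TRUE)) {
  wave <- tuneR::Wave(left = round(tone * 32000), samp.rate = 44100, bit = 16)
  tuneR::writeWave(wave, file.path(tempdir(), "a440.wav"))
}
```

A MIDI synthesizer or a music engraving program would replace the last step when the goal is a MIDI file or sheet music rather than raw audio.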
As I mentioned earlier, we need to convert abstract data into something that we can perceive.
Usually, this would be graphs or tables, but there is no reason why we can't plot our data as kebabs.
library(ggplot2)
library(geomdoner)

# factor() keeps the level recoding working when class is stored as character.
mpg$truck <- factor(mpg$class)
levels(mpg$truck) <- c(TRUE,FALSE,FALSE,FALSE,TRUE,FALSE,TRUE)
mpg$y2008 <- mpg$year == 2008 # Alternative is 1999
mpg$id <- row.names(mpg)

set.seed(693)
ggplot(mpg[sample.int(nrow(mpg), 8),]) +
  aes(label = paste0('Make #', id, ' (', manufacturer, ' ', model, ')'),
      border = drv, knoblauch = truck, scharf = grepl('auto', trans),
      zwiebeln = y2008, tomaten = TRUE, salat = TRUE,
      x = hwy, y = cty) +
  xlab('Highway miles per gallon') +
  ylab('City miles per gallon') +
  ggtitle('Mileage of eight automobile makes.\n(Each döner is a make.)') +
  geom_text() +
  geom_doner()
We can use the geomdoner package to plot our data as kebabs. This ggplot code produces a text graph and a bunch of orders for döner kebabs.
Make #142 (nissan altima): döner box
* ohne knoblauch
* ohne kräuter
* ohne scharf
* ohne zwiebeln
* mit tomaten
* mit salat

Make #13 (audi a4 quattro): döner
* ohne knoblauch
* ohne kräuter
* ohne scharf
* ohne zwiebeln
* mit tomaten
* mit salat
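The idea behind orders like these can be sketched in a few lines of base R: each logical variable becomes "mit" (with) or "ohne" (without) a topping. This is only my guess at the mapping, not geomdoner's code, and the function name `doner_order` is hypothetical.

```r
# Build the text of one order from a label and a named logical vector.
doner_order <- function(label, toppings) {
  lines <- paste("*", ifelse(toppings, "mit", "ohne"), names(toppings))
  paste(c(paste0(label, ": döner"), lines), collapse = "\n")
}

order <- doner_order(
  "Make #13 (audi a4 quattro)",
  c("knoblauch" = FALSE, "kräuter" = FALSE, "scharf" = FALSE,
    "zwiebeln" = FALSE, "tomaten" = TRUE, "salat" = TRUE)
)
cat(order, "\n")
```

Each data row yields one order, so a sample of eight rows yields eight kebabs.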
Then we can order the kebabs and put them on top of the graph, which is what you see here.
The x-axis is highway mileage and the y-axis is city mileage.
These two were spicy, which meant automatic transmission and worse mileage.
Data gastronomification implementations
Unfortunately, aside from geomdoner, there are few end-user tools for data gastronomification analysis. But bespoke...
Applications of data music and data food
People who can't see
People keep saying I should look at this, but I haven't.
Pretending that data are valuable
FMS Symphony, csv soundsystem
The top line is interest rate on US treasury bonds.... Note that the interest rate on US treasury bonds doesn't really change that often; this must be the market rate or something, and I don't really know what it is. But it doesn't matter, because nobody cares about interpreting the data; data is hot right now, and people want to be part of the trend.
Journalism-Driven Data, Thomas Levine
This occurred to me when I saw this video by the White House advertising the State of the Union.
Multivariate data analysis
But the main potential I see in data music videos is in presenting high-dimensional data.
In order that our visualizations can reveal unexpected patterns, it is important that we present many dimensions at once. Edward Tufte says this a lot.
As I said earlier, we have historically represented our abstract data as visuals and then looked at them with our eyes. This used to work, but this approach is reaching its limits in the age of big data.
As you can see, today we have big data. Data visualization does not provide enough sensory bandwidth to represent the high variety of data that is so common today.
The issue comes down to the difference between a high volume of data and a high variety of data. Here I plot the iris data. If we have a lot of irises in our dataset, we just need to add more points. This might be intense computationally, but it's rather straightforward.
But what if we need to add more dimensions? We can vary the sizes of the points, add facets, and so on, but we can only go so far.
At some point, we wind up using a model to simplify our data and fit them into a data visualization. We reduce the data to something more manageable, but we lose the opportunity to analyze the raw variables. If we want to represent more variables in their original form, we need more bandwidth.
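One common example of such a model is principal component analysis. The sketch below, using base R's `prcomp` on the iris measurements, is my choice of illustration rather than something from the slides: it squeezes four variables into two so they fit on a flat plot.

```r
# iris ships with R: four numeric measurements plus a species label.
measurements <- iris[, 1:4]

# Fit the model: principal components of the standardized measurements.
pca <- prcomp(measurements, center = TRUE, scale. = TRUE)

# Keep two components so that the data fit on an ordinary scatterplot.
scores <- pca$x[, 1:2]

# Share of the total variance that each component retains.
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
round(var_explained, 2)
```

Calling `plot(scores, col = iris$Species)` would draw the reduced data, but the axes are now blends of the original variables, so we can no longer read petal length or sepal width directly off the plot.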
I have been trying to use our non-visual senses to increase our sensory bandwidth.
Music videos are a way of adding the sense of sound. But why stop there?
We should look for ways of using more of our senses to increase our sensory bandwidth.
A theory of data analysis
Even if you stick with visualization, this is the key to good data analysis.
Plotting data is about converting from abstract data to concrete metaphors. We have to find the meaningful representations that allow us to use our intuitions to understand the data.