Beyond visualization: The future of multivariate data analysis

Emerging data analysis paradigms are presenting the opportunity to represent high-dimensional data in intuitive and accessible forms. I will provide an overview of the use of data music, data video, and data food in the context of modern data analysis, and I will discuss how to integrate these methods into conventional analytics software.

Data are just abstract numbers, with no true representation in the real world. Data analysis is about getting the numbers into our brains.

We cannot perceive the data directly, so we must convert them into forms that we can perceive.

We often make graphs

and data tables,

and then we look at the graphs with our eyes. But perhaps we could instead make kebabs, and eating the kebabs would tell us something about the data.

My point here is that data analysis need not be limited to conventional visualization.

Data music

Data music analysis products

All of these bespoke data music implementations are great, but they're impractical for most analysis. Here are some data music tools that integrate with conventional data analysis software.

Sheetmusic

One approach is to think of music as data tables. So let's start by talking about data tables.

Here's a data table.

In a tidy data table, rows correspond to things. (You might also call them "observations" or "records".) For example, each row in this table is a country.

Columns correspond to attributes about each thing. You might also call them "variables", "fields", "properties", or "dimensions".

We can use data tables to represent musical scores.

spreadsheet versus ordinary sheet music

In this figure we see music represented in two forms: as a data table and as conventional sheet music.

spreadsheet versus ordinary sheet music slide, with row/beat highlighted

Each row in the table corresponds to a beat in the music. That is, the two regions with red boxes around them correspond to the same data. Different notes come from different columns. G2 is G2, G3 is G3, B5 is B5, and the lyrics aren't in the sheet music.
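To make the row-per-beat idea concrete, here is a minimal sketch in Python. It is not how the spreadsheet works internally; the table layout (one column per voice, with an empty cell meaning silence) and the names `beats_to_events`, `bass` and `melody` are assumptions for illustration.

```python
def beats_to_events(rows, voices):
    """Turn a beat-per-row table into (beat, voice, note) events.

    rows   -- list of dicts, one per beat (hypothetical layout)
    voices -- column names that hold note names; other columns,
              such as lyrics, are ignored
    """
    events = []
    for beat, row in enumerate(rows):
        for voice in voices:
            note = row.get(voice, "")
            if note:  # an empty cell means the voice rests on this beat
                events.append((beat, voice, note))
    return events


# Hypothetical two-beat score with a lyrics column that is not played.
score = [
    {"bass": "G2", "melody": "B5", "lyrics": "la"},
    {"bass": "", "melody": "G3", "lyrics": "la"},
]
events = beats_to_events(score, ["bass", "melody"])
```

Each event keeps its beat number, so anything that plays or engraves the music can line the voices up again.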

spreadsheet versus ordinary sheet music slide, with borders for bars

Music is often divided into measures. In sheet music, this is typically represented with bars. That's also how I do it in my spreadsheet.

And that sheet music you saw was actually generated inside the spreadsheet.

I just showed you how we can think of music as a representation of the data in a spreadsheet.

synthesizer functions

But spreadsheets are also code, so let's also think about musical functions as spreadsheet functions.

For example, sheetmusic provides a third_interval function for making a third interval in a particular key. Here are some other functions.

  • ionian_scale(<note>)
  • chord_progression(<the progression>, <base note>)
  • dominant7_chord(<note>)
  • dominant7_arpeggio(<note>)

For example, if I type "=ionian_scale('C4')", I'll get an Ionian scale starting at middle C.
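If you want to see what such a function has to compute, here is a rough sketch in Python. It borrows the name `ionian_scale` from the spreadsheet function, but the implementation is my own assumption and not sheetmusic's code. It only understands sharps and single-digit octaves.

```python
NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

# The Ionian (major) scale as semitone steps: whole, whole, half,
# whole, whole, whole, half.
IONIAN_STEPS = [2, 2, 1, 2, 2, 2, 1]


def note_to_midi(name):
    """Convert a name like 'C4' or 'F#3' to a MIDI note number (C4 = 60)."""
    pitch, octave = name[:-1], int(name[-1])
    return 12 * (octave + 1) + NOTE_NAMES.index(pitch)


def midi_to_note(number):
    """Convert a MIDI note number back to a name, spelling accidentals as sharps."""
    return NOTE_NAMES[number % 12] + str(number // 12 - 1)


def ionian_scale(root):
    """Return the eight notes of the Ionian scale starting at root."""
    midi = note_to_midi(root)
    scale = [midi]
    for step in IONIAN_STEPS:
        midi += step
        scale.append(midi)
    return [midi_to_note(m) for m in scale]
```

Calling `ionian_scale("C4")` gives the white keys from middle C up to the C an octave above.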

TonesInTune

data-driven rhythms (ddr + tuneR)

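library(ddr)

# Turn the square root of each chick's weight into notes in E major,
# played with the blip instrument.
chicks <- arpeggidata(sqrt(ChickWeight$weight),
                      blip,
                      scale="Emajor",
                      bpm=200,
                      count=1/32)
play(chicks)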
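The general idea behind this kind of sonification is to squeeze a numeric variable onto the notes of a scale. Here is a rough sketch of that idea in Python. It is not how ddr implements arpeggidata; the function name `data_to_scale_notes` and the linear rescaling are assumptions for illustration.

```python
def data_to_scale_notes(values, scale):
    """Map each value to a note of the scale by linear rescaling.

    The smallest value gets the first note of the scale and the
    largest value gets the last note.
    """
    lo, hi = min(values), max(values)
    span = hi - lo
    notes = []
    for value in values:
        # Position of the value within the data range, from 0 to 1.
        position = 0.0 if span == 0 else (value - lo) / span
        index = round(position * (len(scale) - 1))
        notes.append(scale[index])
    return notes


# One octave of E major, written out by hand.
E_MAJOR = ["E4", "F#4", "G#4", "A4", "B4", "C#5", "D#5", "E5"]
melody = data_to_scale_notes([0, 7, 1, 6], E_MAJOR)
```

Higher values come out as higher notes, so a rising trend in the data is heard as a rising melody.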

Features of different data music tools

To summarize, here are the tools.

"Freq" is whether a tool can convert frequencies, like "440 Hz", to sounds.

"Notes" is whether a tool can convert note names, like "middle A", to sounds.

"Chords, &c." is whether a tool can generate chords and similar musical structures, such as scales and arpeggios.

All of these tools can play sounds, but some can also export to other software in other ways.
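The "Freq" and "Notes" columns are two views of the same thing, because in twelve-tone equal temperament every note has a fixed frequency. Here is a small Python sketch of the standard conversion, assuming the usual reference of A4 = 440 Hz = MIDI note 69. The function names are my own and do not come from any of the tools above.

```python
import math

A4_MIDI = 69   # MIDI note number of the A above middle C
A4_HZ = 440.0  # common concert-pitch reference


def midi_to_frequency(midi_number):
    """Frequency in Hz of a MIDI note number in 12-tone equal temperament."""
    return A4_HZ * 2 ** ((midi_number - A4_MIDI) / 12)


def frequency_to_midi(frequency_hz):
    """Nearest MIDI note number for a frequency in Hz."""
    return round(A4_MIDI + 12 * math.log2(frequency_hz / A4_HZ))
```

Each semitone multiplies the frequency by the twelfth root of two, so going up twelve semitones doubles it.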

Tool           Platform  Freq  Notes  Chords, &c.  Export
Tones in Tune  Gnumeric  Yes   No     No           ?
Sheetmusic     Excel     No    Yes    Yes          MIDI file, sheet music
tuneR          R         Yes   No     No           Wave file
ddr            R         No    Yes    Yes          R vector
ddpy           pandas    No    Yes    No           MIDI file

Note: You can also implement the rendering yourself with generic tools available on the underlying platform, such as these.

Generic tool      What it does
MIDI synthesizer  Convert a MIDI file or MIDI events to sound or a wave file.
Music engraving   Convert a text file to MIDI or PDF.
Math              Calculate scales, chords, rhythms, &c.
Wave/PCM writer   Convert a numeric vector to a sound file.
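As an example of the last row, here is a minimal wave writer sketched in Python with only the standard library. The function names `sine_tone` and `write_wave` are my own, and the choice of 16-bit mono PCM is an assumption. Other sample formats would work just as well.

```python
import math
import struct
import wave


def sine_tone(frequency_hz, seconds, sample_rate=44100):
    """Return a numeric vector holding a sine wave with values in [-1, 1]."""
    n = int(seconds * sample_rate)
    return [math.sin(2 * math.pi * frequency_hz * i / sample_rate)
            for i in range(n)]


def write_wave(samples, path, sample_rate=44100):
    """Write a vector of floats in [-1, 1] as a 16-bit mono wave file."""
    frames = b"".join(
        # Clip to [-1, 1], then scale to a signed 16-bit integer.
        struct.pack("<h", int(max(-1.0, min(1.0, s)) * 32767))
        for s in samples
    )
    with wave.open(path, "wb") as f:
        f.setnchannels(1)            # mono
        f.setsampwidth(2)            # 2 bytes = 16 bits per sample
        f.setframerate(sample_rate)
        f.writeframes(frames)
```

For instance, `write_wave(sine_tone(440.0, 1.0), "a440.wav")` should produce one second of the A above middle C.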

Data food

As I mentioned earlier, we need to convert abstract data into something that we can perceive.

Usually, this would be graphs or tables, but there is no reason why we can't plot our data as kebabs.

library(ggplot2)  # provides ggplot() and the mpg dataset
library(geomdoner)

mpg$truck <- mpg$class
levels(mpg$truck) <- c(TRUE,FALSE,FALSE,FALSE,TRUE,FALSE,TRUE)

mpg$y2008 <- mpg$year == 2008 # Alternative is 1999
mpg$id <- row.names(mpg)

set.seed(693)
ggplot(mpg[sample.int(nrow(mpg), 8),]) +
  aes(label = paste0('Make #', id, ' (', manufacturer, ' ', model, ')'),
      border = drv,
      knoblauch = truck,
      scharf = grepl('auto', trans),
      zwiebeln = y2008,
      tomaten = TRUE, salat = TRUE,
      x = hwy, y = cty) +
  xlab('Highway miles per gallon') +
  ylab('City miles per gallon') +
  ggtitle('Mileage of eight automobile makes.\n(Each döner is a make.)') +
  geom_text() + geom_doner()

We can use the geomdoner package to plot our data as kebabs. This ggplot code produces a text graph

and a bunch of orders for döner kebabs.

Make #142 (nissan altima): döner box

* ohne knoblauch
* ohne kräuter
* ohne scharf
* ohne zwiebeln
* mit tomaten
* mit salat

Make #13 (audi a4 quattro): döner

* ohne knoblauch
* ohne kräuter
* ohne scharf
* ohne zwiebeln
* mit tomaten
* mit salat
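The mapping from data to an order is simple: every true-or-false variable becomes one topping, ordered "mit" (with) or "ohne" (without). Here is a sketch of that mapping in Python. It is not geomdoner's code; the function name `doner_order` and its arguments are assumptions, and I pass the dish in directly rather than guess how the package chooses between a döner and a döner box.

```python
def doner_order(label, dish, toppings):
    """Format one kebab order from a dict of topping name -> True/False."""
    lines = [f"{label}: {dish}", ""]
    for name, included in toppings.items():
        # True means "mit" (with), False means "ohne" (without).
        lines.append(f"* {'mit' if included else 'ohne'} {name}")
    return "\n".join(lines)


order = doner_order(
    "Make #13 (audi a4 quattro)",
    "döner",
    {"knoblauch": False, "scharf": False, "tomaten": True, "salat": True},
)
```

Printing `order` gives a heading line followed by one bullet per topping, in the same shape as the orders above.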

Then we can order the kebabs and put them on top of the graph, which is what you see here.

The x-axis is the highway mileage, and the y-axis is the city mileage.

These two were spicy, which means automatic transmission, and they had worse mileage.

Data gastronomification implementations

Unfortunately, aside from geomdoner, there are few end-user tools for data gastronomification analysis. But bespoke...

Hot Karot Open Sauce technology

Open Sauce

Census spices

Applications of data music and data food

People who can't see

People keep saying I should look at this, but I haven't.

Pretending that data are valuable

FMS Symphony, csv soundsystem

The top line is interest rate on US treasury bonds.... Note that the interest rate on US treasury bonds doesn't really change that often; this must be the market rate or something, and I don't really know what it is. But it doesn't matter, because nobody cares about interpreting the data; data is hot right now, and people want to be part of the trend.

Journalism-Driven Data, Thomas Levine

This occurred to me when I saw this video by the White House advertising the State of the Union.

Multivariate data analysis

But the main potential I see in data music videos is in presenting high-dimensional data.

In order that our visualizations can reveal unexpected patterns, it is important that we present many dimensions at once. Edward Tufte says this a lot.

As I said earlier, we have historically represented our abstract data as visuals and then looked at them with our eyes. This used to work, but the approach is reaching its limits in the age of big data.

As you can see, today we have big data. Data visualization does not provide enough sensory bandwidth to represent the high variety of data that is so common today.

The issue comes down to the difference between a high volume of data and a high variety of data. Here I plot the iris data. If we have a lot of irises in our dataset, we just need to add more points. This might be computationally intensive, but it's rather straightforward.

But what if we need to add more dimensions? We can vary the sizes of the points, add facets, and so on, but we can only go so far.

At some point, we wind up using a model to simplify our data and fit them into a data visualization. We reduce the data to something more manageable, but we lose the opportunity to analyze the raw variables. If we want to represent more variables in their original form, we need more bandwidth.

I have been trying to use our non-visual senses to increase our sensory bandwidth.

Music videos are a way of adding the sense of sound. But why stop there?

We should look for ways of using more of our senses to increase our sensory bandwidth.

A theory of data analysis

Even if you stick with visualization, this is the key to good data analysis.

Plotting data is about converting from abstract data to concrete metaphors. We have to find the meaningful representations that allow us to use our intuitions to understand the data.