R for people who already know something about programming

The introductory R texts and courses that I see normally read a bit differently from introductions to other languages.

Introduction to [some language] for people who already know programming

Sometimes, such materials for other languages are intended for people who are already familiar with programming languages. They are quite concise, and they introduce the language by using jargon about programming languages and by relating the language to other programming languages.

Introduction to programming through [some language]

Other times, these materials are intended as an introduction to programming through a particular programming language. They start out with elementary language features, including the flow of the program, simple data structures, and the relationship between the program and the computer hardware.

Introduction to R for people who already know statistics

Introductions to R typically ignore that R is a programming language and instead focus on statistics. Some of them teach people who already know some statistics how to use R to run the statistics that they know.

Introduction to statistics through R

Other introductions come alongside an introduction to statistics, so they are introductions to statistics through R. This works well for people who know something about statistics and nothing about programming, but it can be especially confusing for people who already know something about programming.

Introduction to R for people who already know programming

So I'm going to try teaching you about R the way you might expect a programming language to be introduced: by discussing some of its properties with programming language terminology and by relating it to other languages.

How to think about R

There are a lot of reasons why people make new languages, and understanding these reasons can help you understand the languages. Before we start talking about the specifics of the language, we need to understand why R is so bizarre and why learning materials related to R might make absolutely no sense to you.

Keep these three explosive statements about R in mind.

  1. The thing about R is that it was designed by statisticians.
  2. Everything in R is a complete hack.
  3. People who use R only know how to use R.

I don't really know whether the above statements are true, but R makes a lot of sense when you assume that they are.

Your preferred programming languages may not share these three characteristics, and that may put you off. The culture of R clashes with your programming culture, so you should not simply assume that R works the way your preferred languages do. You must embrace these aspects of R if you want to use R well.

R was designed by statisticians

I don't know whether R really was designed by statisticians (I also don't quite know what I mean by "statisticians"), but it makes sense that we would wind up with this language if it was designed by statisticians.

R's terminology is totally bizarre. Here are some examples.

  • apply instead of "map"
  • array and list don't do what you think

The higher-order function that you would usually call map is instead called apply (or lapply or sapply, depending on what you are mapping over).

There are array and list data structures in R, but they don't do what you think they do.
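To make the terminology concrete, here is a small sketch that uses only base R. The variable names (squares, m, row.sums, record) are made up for illustration.

```r
# sapply and lapply play the role of "map" over vectors and lists.
squares <- sapply(1:3, function(x) x^2)      # a numeric vector: 1 4 9
square.list <- lapply(1:3, function(x) x^2)  # the same values, as a list

# apply itself maps over the rows (1) or columns (2) of a matrix.
m <- matrix(1:6, nrow = 2)
row.sums <- apply(m, 1, sum)                 # 9 12

# A list can hold elements of different types and can name them,
# so it behaves more like a record or dictionary than a linked list.
record <- list(name = 'hat', diameter = 30, sizes = c('S', 'M'))
record$diameter                              # 30
```

An array, for its part, is a vector with dimensions attached; the indexing section has an example.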

Everything in R is a complete hack

Again, I don't really know if everything in R is a complete hack, but I don't see particularly strong evidence against that statement.

A common idiom in R is a statement that looks like it assigns a value to the result of a function call. As an example of this, let's look at the rownames function.

> rownames(Formaldehyde)
[1] "1" "2" "3" "4" "5" "6"
> rownames(Formaldehyde)[3] <- 8 # R uses arrows for assignment.
> rownames(Formaldehyde)
[1] "1" "2" "8" "4" "5" "6"

Here's how it works.

# These two commands are the same.
rownames(Formaldehyde) <- letters[1:6]
Formaldehyde <- 'rownames<-'(Formaldehyde, letters[1:6])

This is slightly bizarre, but the thing that makes me consider it a complete hack is the way this is implemented. I don't really understand how it works, but it involves defining two functions, one called 'rownames' and another called 'rownames<-'.
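Here is a minimal sketch of that two-function pattern, as far as I can tell. The names flavour and jelly are hypothetical; the only fixed parts are the `<-` suffix on the second function's name and that its last argument is called value.

```r
# The getter: an ordinary function.
flavour <- function(x) attr(x, 'flavour')

# The setter: its name ends in <- and it must return the modified object.
`flavour<-` <- function(x, value) {
  attr(x, 'flavour') <- value
  x
}

jelly <- 1:3
flavour(jelly) <- 'grape'   # R rewrites this as jelly <- `flavour<-`(jelly, 'grape')
flavour(jelly)              # "grape"
```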

R has connections to everything because people who use R only know how to use R.

Shiny is the best example of this; people who use R only know how to use R, so when they want to write websites, they need their own domain-specific templating language written in R. (That said, there are good reasons to use Shiny too.)

Here is the ui.R file from the Telephones by region Shiny demo application.

library(shiny)

# Rely on the 'WorldPhones' dataset in the datasets
# package (which generally comes preloaded).
library(datasets)

# Define the overall UI
shinyUI(

  # Use a fluid Bootstrap layout
  fluidPage(

    # Give the page a title
    titlePanel("Telephones by region"),

    # Generate a row with a sidebar
    sidebarLayout(

      # Define the sidebar with one input
      sidebarPanel(
        selectInput("region", "Region:", 
                    choices=colnames(WorldPhones)),
        hr(),
        helpText("Data from AT&T (1961) The World's Telephones.")
      ),

      # Create a spot for the barplot
      mainPanel(
        plotOutput("phonePlot")  
      )

    )
  )
)

In other languages, people would probably write templates that look mostly like HTML. But people who use R only know how to use R, so they need a new domain-specific language written in R.

Basics of the language

R is an interpreted language. It is largely written in C and Fortran. It is a Lisp, but you wouldn't guess that based on the C-style syntax.

  • Interpreted
  • Largely written in C and Fortran
  • A Lisp

Conventions, best practices, and notable syntax

Let's start with some syntax.

Arrows

You can use equal signs for assignment, but arrows are more fun.

a <- 4
a <- b <- 5
a <- b <- 6 -> c -> d

<<- is for global assignment.

a.function <- function() {global.variable <<- 'peanut butter'}
another.function <- function() {local.variable <- 'jelly'}
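"Global" is a slight simplification: <<- assigns to the nearest enclosing scope that already has a variable of that name, and only falls back to the global environment if there is none. A small sketch with a made-up counter (make.counter and n.calls are hypothetical names):

```r
make.counter <- function() {
  n.calls <- 0
  function() {
    # Modifies n.calls in make.counter's scope, not a global variable.
    n.calls <<- n.calls + 1
    n.calls
  }
}

counter <- make.counter()
counter()   # 1
counter()   # 2
```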

Dots

Sometimes, we want to put two words in the name of a function. Since we don't like having spaces in function names, we might (in other languages) use underscores or camel case.

camelCase <- function() {}
underscores_instead_of_spaces <- function() {}

In R, we use dots instead of spaces.

dots.instead.of.spaces <- function() {}

Dots are just ordinary characters; they do not separate tokens in any way.

Creating numeric vectors

The normal way of creating vectors is to concatenate vector elements.

c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)

The creation of numeric vectors is quite common, so there is a seq command (like in many other languages).

seq(1, 10)

And the creation of vectors of ascending or descending integers is so common that there's a special operator for that.

1:10
10:1
10:1 == rev(1:10)
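A few more variations, as a sketch using only base R (the variable n is made up). The last two lines show a trap worth knowing: because the colon operator happily counts downwards, 1:n is not empty when n is zero.

```r
seq(1, 10, by = 3)          # 1 4 7 10
seq(0, 1, length.out = 5)   # 0.00 0.25 0.50 0.75 1.00

n <- 0
1:n                         # 1 0, because the colon counts down
seq_len(n)                  # an empty vector, which is usually what you wanted
```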

Braces

We can group a bunch of expressions with a brace.

2 == {x <- 3; 8; 5 - x}

The braces that typically come in a function definition aren't specific to function definitions.

f <- function() 'The result'
f() == 'The result'

This is, in fact, the same as in JavaScript. Unlike in JavaScript, you'll actually need to know about this once in a while.

Indexing

First of all, indexing starts at 1.

c('a','b')[1] == 'a'

Consider a cube-shaped array with three cells per side.

contents <- c('one', 'two', 'three',
              'four', 'five', 'six',
              'seven', 'eight', 'nine',
              'ten', 'eleven', 'twelve',
              'thirteen', 'fourteen', 'fifteen',
              'sixteen', 'seventeen', 'eighteen',
              'nineteen', 'twenty', 'twenty-one',
              'twenty-two', 'twenty-three', 'twenty-four',
              'twenty-five', 'twenty-six', 'twenty-seven')
a <- array(contents, dim = c(3,3,3))

Look at all the ways we can index it.

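a[1,1,1]                  # a single cell: 'one'
a[1,1,2]                  # 'ten'; the first index varies fastest
a[,1,2]                   # an empty index selects everything along that dimension
a[,,2]                    # the second 3x3 slice
a[-1,1,2]                 # negative indices exclude: everything but the first
a[,,-2]                   # the first and third slices
a[,,-(2:3)]               # only the first slice
a[round(runif(27))]       # 0s and 1s as numbers: 0 selects nothing, 1 selects the first cell
a[as.logical(round(runif(27)))]     # a logical vector selects the cells marked TRUE
a[c(FALSE,TRUE,FALSE,FALSE,FALSE)]  # a short logical vector is recycled
a[1:27 %% 5 == 2]         # the same cells as the previous line
dimnames(a) <- list(c('small','medium','big'),
                    c('low','middle','high'),
                    c('a','b','c'))
a['small','high','c']     # names work as indices too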

Vectorization

R is really slow, but the parts written in C and Fortran are fast, so it is good practice to "vectorize" your code. For example, adding a scalar to a vector

vec <- 1:100
vec.2 <- vec + 1

is much faster than iterating through vec and adding 1 to each element.

vec <- 1:100
vec.2 <- numeric(length(vec))
for (i in seq_along(vec)) {
  vec.2[i] <- vec[i] + 1
}
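If you want to see the difference for yourself, here is a rough sketch that times both approaches with system.time. The function name add.one.loop and the vector size are arbitrary choices of mine, and the timings will differ from machine to machine.

```r
add.one.loop <- function(v) {
  out <- numeric(length(v))
  for (i in seq_along(v)) out[i] <- v[i] + 1
  out
}

big <- as.numeric(1:1e6)
loop.time <- system.time(looped <- add.one.loop(big))
vector.time <- system.time(vectorized <- big + 1)

all.equal(looped, vectorized)   # both give the same answer
loop.time['elapsed']            # compare this...
vector.time['elapsed']          # ...with this
```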

Types

R is dynamically typed, but there's a particular weird thing that you need to know about. Lots of R involves matrix calculations, and R magically does the right thing depending on what you pass. For example, let's start with this vector.

a <- 1:10

and see what happens when we add different things to it.

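a + 3        # the scalar is added to every element
a + (10:1)   # elementwise; every element is 11
a + 1:2      # 1:2 is repeated five times to match the length of a
a + 1:8      # still repeated, but with a warning: 10 is not a multiple of 8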

So you should think of the shape of a thing as a component of its type.
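The rule behind this is usually called recycling: the shorter vector is repeated until it is as long as the longer one. A small sketch that spells it out with rep (the variable warned is just for illustration):

```r
a <- 1:10

# Adding 1:2 is the same as adding 1:2 repeated out to ten elements.
identical(a + 1:2, a + rep(1:2, length.out = length(a)))   # TRUE

# When the lengths don't divide evenly, R recycles anyway and warns.
warned <- tryCatch({a + 1:8; FALSE}, warning = function(w) TRUE)
warned                                                     # TRUE
```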

Primitive data structures are weird

R's most primitive data structures (or at least the most primitive ones that I know about and would ever use) are very high-level.

A lot of times we work with numbers and text.

> str(8)
num 8
> str('A string')
chr "A string"

(str is the command for inspecting the structure of something. In other languages, you might instead call this sort of command type.)
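If you do want something closer to type, base R has typeof and class as well. This sketch also shows why I call these structures high-level: a bare number is already a vector of length one.

```r
typeof(8)           # "double"
typeof(8L)          # "integer"; the L suffix makes an integer literal
class('A string')   # "character"

length(8)           # 1
is.vector(8)        # TRUE
```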

Vectors

Sometimes we want to have several numbers inside of a container. We might use a vector for this.

a.vector <- c(8, 7, 6)

Vectors are kind of like the things you might call "arrays", but there's a lot more to them. For example, you can assign string keys for each of the values.

names(a.vector) <- c('Apple', 'Orange', 'Watermelon')

Now, we can query this vector with both numeric and character indices.

> a.vector
 Apple     Orange Watermelon 
     8          7          6
> a.vector[1]
Apple 
    8 
> a.vector['Apple']
Apple 
    8

So a vector is kind of like a mapping type as well. In fact, most R structures are the same in this respect; they usually support both numerical and character indexing.

Factors

The factor is a special version of the integer vector.

jelly.bean.vector <- c('blue', 'pink', 'yellow', 'pink', 'yellow', 'pink')
jelly.bean.factor <- factor(jelly.bean.vector)

"Factor" is a statisticsy word; other languages might call this an enumeration. Underneath, the factor is a bunch of integers with "levels" that map integers to character values.

as.numeric(jelly.bean.factor)
# [1] 1 2 3 2 3 2
levels(jelly.bean.factor)
# [1] "blue"   "pink"   "yellow"

We can change the level values.

levels(jelly.bean.factor)[1] <- 'violet'
jelly.bean.factor
# [1] violet  pink   yellow pink   yellow pink  
# Levels: violet pink yellow

The levels have whatever order you gave them (alphabetical if you didn't say it explicitly), so you can do this

sort(jelly.bean.factor)
# [1] violet pink   pink   pink   yellow yellow
# Levels: violet pink yellow
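To give the levels an order explicitly, pass levels to factor. A short sketch with a made-up sizes variable:

```r
sizes <- factor(c('medium', 'small', 'large', 'small'),
                levels = c('small', 'medium', 'large'))

as.integer(sizes)   # 2 1 3 1, the positions in the levels we gave
sort(sizes)         # small small medium large, not alphabetical
```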

Data frames

The data frame is like a list of vectors. (Remember, "list" means something strange in R.)

xy <- data.frame(x = 1:9, y = 1:3)

We can select the vectors with $

xy$x

or with normal indexing

xy[,1]
xy[,'x']

The presence of that comma matters. If we include the comma, we select a vector, and if we exclude it, we select a data frame with one column.

xy[1]
xy['x']

The part before the comma is for selecting rows.

xy[4:8,1:2]

And we can also select rows with characters.

row.names(xy)[3] <- 'third row'
xy['third row',]
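Two more things you will probably run into, sketched on the same kind of data frame (the name twos is made up): selecting rows with a logical vector, and using drop = FALSE to get a data frame back even when the comma is present.

```r
xy <- data.frame(x = 1:9, y = 1:3)   # y is recycled to nine rows

twos <- xy[xy$y == 2, ]              # the rows where y is 2
twos$x                               # 2 5 8

is.data.frame(xy[, 'x'])                  # FALSE, it's a vector
is.data.frame(xy[, 'x', drop = FALSE])    # TRUE
```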

Strange things

Don't think about pointers

In many languages, you should think of variables as pointers to some data.

// JavaScript
var container = {"bacon slices": 8, "eggs": 11}
var another_variable = container
another_variable["tomatoes"] = 4
console.log(container)
// { 'bacon slices': 8, eggs: 11, tomatoes: 4 }

In R, you can think of nearly every variable as a separate copy of data.

container <- list(bacon.slices = 8, eggs = 11)
another.variable <- container
another.variable$tomatoes <- 4
container
# $bacon.slices
# [1] 8
#
# $eggs
# [1] 11

In both versions, we assigned some data to a variable called container and then assigned container to another variable. Then we manipulated the other variable. In the JavaScript version, another_variable was a pointer to container, so manipulating another_variable manipulated container as well, because they are actually the same thing. In the R version, another.variable was a copy of container, so manipulating another.variable did not manipulate container.
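The same thing applies to function arguments. In this sketch (add.tomatoes and updated are hypothetical names), the function changes its argument, but the caller's variable is untouched; you have to return the modified value and assign it.

```r
add.tomatoes <- function(basket) {
  basket$tomatoes <- 4   # changes only the function's own copy
  basket
}

container <- list(bacon.slices = 8, eggs = 11)
updated <- add.tomatoes(container)

names(container)   # "bacon.slices" "eggs"
names(updated)     # "bacon.slices" "eggs" "tomatoes"
```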

Counter-intuitive execution order

In other dynamic languages, the arguments of a function are evaluated before the function is run with those arguments. This is what happens when we take the absolute value of 3 - 5 in R.

abs(3 - 5)

Some functions work differently. Let's consider eval. In JavaScript, you might write this.

12 == eval('var a = 4; a + 8')

In R, we don't provide a string of the expression-to-be-evaluated; instead, we just provide the expression.

12 == eval({a <- 4; a + 8})
'a <- 4; a + 8' == eval('a <- 4; a + 8')

(The braces are just there to group the expressions; they are not specific to eval.)

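How does it do this!? In this particular example, less happens than it seems: the braced expression is an ordinary argument, it evaluates to 12, and eval of the value 12 is 12. The genuinely bizarre part is that R evaluates arguments lazily, and a function can use meta-programming (quote and substitute) to get hold of the unevaluated expression that it was called with instead of its value, and then evaluate it later, somewhere else, or never.

Since this behavior is implemented through meta-programming, we could say that it is not really part of the language. Except that eval, quote, and substitute definitely are part of the language. Moreover, bizarre execution orders like this show up in many places, even in libraries, so it is good to be aware of this.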
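Here is a minimal sketch of an expression being captured without being evaluated, using base R's quote and substitute. The names expr and show.expression are made up for illustration.

```r
# quote captures the code itself; nothing runs yet.
expr <- quote({a <- 4; a + 8})
is.call(expr)             # TRUE, it's a piece of code, not a number
eval(expr)                # 12, now it runs

# substitute lets a function see the expression it was called with.
show.expression <- function(x) deparse(substitute(x))
show.expression(3 - 5)    # "3 - 5", not -2
```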

Missing values

Many languages have a concept of nothing, and R has this too; it's called NULL. But this is inappropriate for expressing missing data.

Suppose we are surveying the hats that people wear. Among other things, we record the longest diameter of the hat that each person is wearing.

  1. What do we do if the person isn't wearing a hat?
  2. What do we do if the fire alarm goes off and we can't finish taking the hat measurements?

In R, the answer to the first question is NULL, and the answer to the second is NA.

The distinction is relevant in many calculations. For example, the sum of nothing and eight is eight.

sum(NULL, 8)
# 8

(You might prefer that adding NULL to something produce an error, but please set that preference aside for now.)

If we add a missing value to eight, we don't know what the result is.

sum(NA, 8)
# NA

NULL indicates that we know that the correct value is nothing. NA, on the other hand, indicates that we do not know what the correct value is. When we add nothing (NULL) to eight, we know that the result will be eight. When we add a missing value (NA) to eight, we do not know what the result is, so the result is another missing value.
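In practice you mostly deal with NA inside vectors. A short sketch with made-up hat diameters, using only base R:

```r
diameters <- c(30, NA, 28)

mean(diameters)                  # NA, because one value is unknown
mean(diameters, na.rm = TRUE)    # 29, ignoring the unknown value
is.na(diameters)                 # FALSE TRUE FALSE

length(c(30, NULL, 28))          # 2, because NULL vanishes from a vector
```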

Cool language features

Formula objects

R has a nice syntax for specifying model formulae.

f <- Sepal.Length ~ Petal.Length
lm(f, data = iris)

This means that we are fitting a linear regression model (lm) where the dependent variable is Sepal.Length and the independent variable is Petal.Length.

We can pass a categorical variable rather than a numeric variable, and the model will interpret it reasonably.

f <- Sepal.Length ~ Species
lm(f, data = iris)

Species is a factor, so the above lm call creates the appropriate dummy variables.

We can specify multiple variables, of course. Here we have two main effects, Petal.Length and Petal.Width, and one interaction term.

f <- Sepal.Length ~ Petal.Width + Petal.Length + Petal.Width:Petal.Length
lm(f, data = iris)

The above call can be abbreviated thusly.

lm(Sepal.Length ~ Petal.Width * Petal.Length, data = iris)

That is, * produces the main effects and all the interaction terms.
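You can check these claims by looking at the fitted coefficients. A sketch using the built-in iris data; the variable names long, short, and by.species are mine.

```r
long <- lm(Sepal.Length ~ Petal.Width + Petal.Length + Petal.Width:Petal.Length,
           data = iris)
short <- lm(Sepal.Length ~ Petal.Width * Petal.Length, data = iris)
all.equal(coef(long), coef(short))   # TRUE, the two formulae are the same model

# The factor is expanded into dummy variables, one per non-baseline level.
by.species <- lm(Sepal.Length ~ Species, data = iris)
names(coef(by.species))
# "(Intercept)" "Speciesversicolor" "Speciesvirginica"
```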

An end-user program

You can think of R as an end-user program, akin to Excel. The R interpreter has a bunch of things that you might not expect of an interpreter.

  • help
  • example
  • data

You can write a help page and an example for anything that is exported from a package or that comes with base R. For example, we can ask for help about the normal random number generator

help(rnorm)

and run some example random number generations.

example(rnorm)

Instead of using random numbers, we could use one of the datasets. If you package datasets in the R way and then load the package, they show up when you run data.

plot and summary are some other methods that people typically define for their classes.

About plotting

Here are three types of graphics that people make in R

  • Base R graphics
  • Lattice graphics (from the lattice package)
  • ggplot graphics (from ggplot2 or ggplot packages)

You probably should use ggplot2, but I learned base R graphics before I learned ggplot2, and base R graphics are part of R proper, so I'm going to tell you about base R graphics.

Base R graphics make sense if you consider that they were designed for pen plotters.

[Image: a screenshot from Twitter]

I'm not going to talk about how to use the plotting capabilities, but you can look at this talk on Music Videos in R if you would like to learn more about making plots with R.

Object orientation

There are three ways to make classes in R, and I don't remember all of them. You probably won't use them very often, and I have to look them up when I want to use them.

You usually get the properties of an object with $, or sometimes with @, but there are some properties that you can only access with a separate function ("method" in object-oriented terminology) or with the attr function. I think these differences are related to the different ways of making classes, but I'd need to look it up again to say for sure.
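For what it's worth, the three systems are usually called S3, S4, and reference classes. Here is a minimal sketch of the simplest one, S3, which is just a list with a class attribute plus functions named generic.class. The hat class and its fields are made up for illustration.

```r
# An S3 object: an ordinary list tagged with a class name.
hat <- structure(list(diameter = 30, colour = 'red'), class = 'hat')

# Methods are plain functions named <generic>.<class>.
print.hat <- function(x, ...) {
  cat('A', x$colour, 'hat,', x$diameter, 'cm across\n')
  invisible(x)
}
summary.hat <- function(object, ...) {
  paste('hat of diameter', object$diameter)
}

print(hat)            # dispatches to print.hat
summary(hat)          # "hat of diameter 30"
hat$diameter          # properties of an S3 list come out with $
attr(hat, 'class')    # "hat"
```

S4 objects keep their properties in slots, which is where @ comes in.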

Environments

Environments are a way to get separate scopes; you can evaluate code in different places, with some-but-not-much interference.

env <- new.env()
# quote keeps the assignment unevaluated so that eval runs it inside env
eval(quote(twelve <- 3 + 9), envir = env)
get('twelve', envir = env)
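A slightly longer sketch of working inside an environment, using base R's assign, evalq, and exists. The name scratch is arbitrary. Passing inherits = FALSE makes exists look only in that environment rather than also searching the enclosing scopes.

```r
scratch <- new.env()

assign('twelve', 3 + 9, envir = scratch)
evalq(thirteen <- twelve + 1, envir = scratch)   # evalq quotes its argument for you

get('thirteen', envir = scratch)                        # 13
exists('thirteen', envir = scratch, inherits = FALSE)   # TRUE
ls(scratch)                                             # "thirteen" "twelve"
```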

The interference that I have discovered is that libraries get loaded across all environments.

What it means when people say they "know R"

When people say they know R, it doesn't mean that they know the things I just told you; it usually means that they know how to open a data file, fit whatever model they use in their work, and make a plot. For example, they might know how to do this.

ChickWeight <- read.csv('ChickWeight.csv')
plot(ChickWeight)
fit <- lm(weight ~ Time + Diet, data = ChickWeight)
print(summary(fit))
plot(fit)

The script will be a bit different for different people; they'll know how to import the particular data format that they work with, how to make the plots that their peers like to see, how to fit the models that their field likes to talk about, and how to do some sort of model diagnosis.

Other resources