This is part 1 of the R Ladies Netherlands bookclub.
We’re reading Advanced R by Hadley Wickham.
7/4/2020
Your organizers are:
Paloma Rojas (R Ladies Rotterdam)
Janine Khuc (R Ladies Amsterdam)
Margaux Sleckman (R Ladies Amsterdam)
Martine Jansen (R Ladies Den Bosch)
Laurel Brehm (R Ladies Nijmegen)
Eirini Zormpa (R Ladies Nijmegen)
Sara Iacozza (R Ladies Nijmegen)
We will work through all of Advanced R together!
You can find all materials on Github: https://github.com/rladiesnl/book_club
We are going to rotate through every 2 weeks, with an event hosted by each of the NL RLadies chapters involved.
We will be recruiting a presenter for each chapter, so we want YOU to sign up.
You can do this here: https://tinyurl.com/SignUpAdvR
Names and values: https://adv-r.hadley.nz/names-values.html
The way that R assigns names to objects and values isn’t quite what you think!
This chapter explores name/value assignment and what this means for memory use and the speed of your code.
We’ll walk through the chapter and then solve some exercises together!
You need the lobstr package. Install it if necessary by running: install.packages("lobstr")
library(lobstr)
-1. Given the following data frame, how do I create a new column called “3” that contains the sum of 1 and 2? You may only use $, not [[. What makes 1, 2, and 3 challenging as variable names?
df <- data.frame(runif(3), runif(3))
names(df) <- c(1, 2)
-2. In the following code, how much memory does y occupy?
x <- runif(1e6)
y <- list(x, x, x)
-3. On which line does a get copied in the following example?
a <- c(1, 5, 3, 2)
b <- a
b[[1]] <- 10
The code below doesn’t do what you might naively think:
x <- c(1, 2, 3)
A vector is created– and then bound to a name.
So, as Wickham says, “it’s actually the name that has a value”.
This means that the assignment arrow makes a new binding between a name and a value, not a copy of the object!
y <- x
You can query the actual object address– x and y are mapped to the same part of my computer memory– they’re the same ID.
(If you run this code on your own computer, the address will be different, but x and y will match)
obj_addr(x)
## [1] "0x7f8ee2dd3dd8"
obj_addr(y)
## [1] "0x7f8ee2dd3dd8"
R has rules about which names are valid (syntactic): a name may contain only letters, digits, . and _, and it cannot start with _ or a digit.
It also cannot be a reserved word such as if, NULL, TRUE, or function (see ?Reserved for the full list).
You can get around these if you need to (because someone else needs you to use a specific variable name, because you want to pass to panel plots, etc…)
Just use backticks, NOT quotes: quotes work on the left side of an assignment, but you then need a different syntax to get the value back.
`_abc` <- 1
`_abc`
## [1] 1
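One quick way to check whether a name is syntactic is to compare it with what make.names() would turn it into. This is only a small sketch in base R, not something from the chapter, and the candidate names are made up for illustration.

```r
# Hypothetical candidate names, chosen only for illustration.
candidates <- c("abc", "_abc", "1x", "if", ".123e1", "a b")

# make.names() leaves a name untouched only if it is already syntactic.
is_syntactic <- make.names(candidates) == candidates
print(data.frame(name = candidates, syntactic = is_syntactic))

# Non-syntactic names still work when wrapped in backticks.
`a b` <- 10
`a b` + 1
```

Only "abc" passes the check here; the others need backticks if you really have to use them.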
Why does R run an assignment arrow operation like this so fast?
x <- c(1, 2, 3)
y <- x
y[[3]] <- 4
A copy of the object wasn’t made until it needed to be– otherwise both are bindings to the same value.
You can use tracemem() to see when an object is actually copied, and untracemem() to turn tracking off.
x <- c(1, 2, 3)
tracemem(x)
## [1] "<0x7f8ee407bb68>"
y <- x
y[[3]] <- 4
## tracemem[0x7f8ee407bb68 -> 0x7f8ee1dd4b98]: eval eval withVisible withCallingHandlers handle timing_fn evaluate_call <Anonymous> evaluate in_dir block_exec call_block process_group.block process_group withCallingHandlers process_file <Anonymous> <Anonymous>
untracemem(x)
If you’ve written functions in R, you’ll have noticed that changes made to an object inside a function don’t show up outside of it.
The same copy-on-modify rule is at work: the argument inside the function is bound to the same object as outside, and a copy is made only if the function modifies it.
This is good because it’s fast and efficient in memory.
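Here is a minimal sketch of how this plays out with a function call. The helper set_first_to_zero() is hypothetical and exists only for this illustration; the tracemem() calls are skipped if your R build does not support memory profiling.

```r
# Hypothetical helper: changes the first element of its argument.
set_first_to_zero <- function(v) {
  v[[1]] <- 0   # v is modified here, so this is where R has to copy it
  v
}

x <- c(1, 2, 3)

# tracemem() needs an R build with memory profiling; skip it otherwise.
can_trace <- capabilities("profmem")
if (can_trace) invisible(tracemem(x))

out <- set_first_to_zero(x)   # when tracing, a tracemem message should show up here

if (can_trace) untracemem(x)

x     # the caller's vector is unchanged
out   # the modified copy returned by the function
```

If the function only read from v without changing it, no copy would be needed at all.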
Lists store references to values, whereas atomic vectors store the values themselves. (That’s why the syntax is different, among other things.)
This makes lists easy on memory: copying a list makes a shallow copy, so the references are copied but the values they point to are shared.
You can see how a list is stored by using ref() from lobstr.
Before any change, l1 and l2 point to the same list.
l1 <- list(1, 2, 3)
l2 <- l1
ref(l1, l2)
## █ [1:0x7f8ee42ecba8] <list>
## ├─[2:0x7f8ee42c2c90] <dbl>
## ├─[3:0x7f8ee42c2c58] <dbl>
## └─[4:0x7f8ee42c2c20] <dbl>
##
## [1:0x7f8ee42ecba8]
Note the difference after we change element 3 of l2.
l2[[3]] <- 4
ref(l1, l2)
## █ [1:0x7f8ee42ecba8] <list>
## ├─[2:0x7f8ee42c2c90] <dbl>
## ├─[3:0x7f8ee42c2c58] <dbl>
## └─[4:0x7f8ee42c2c20] <dbl>
##
## █ [5:0x7f8ee15acfb8] <list>
## ├─[2:0x7f8ee42c2c90]
## ├─[3:0x7f8ee42c2c58]
## └─[6:0x7f8ee2f1f4e8] <dbl>
A data frame is a list of vectors.
This means that columns are privileged: column-wise operations will be faster and more efficient in memory than row-wise operations.
(If you need to work row by row, consider tidyverse tools, which aim to make this as efficient as possible.)
Because of how R stores objects, sometimes things are surprisingly small. You can query this with obj_size().
Here, y is not 3 times the size of x: it’s the size of x plus a 3-element empty list.
This representation is more compact because of the names-bound-to-values property of R.
x <- runif(1e6)
obj_size(x)
## 8,000,048 B
y <- list(x, x, x)
obj_size(y)
## 8,000,128 B
There are exceptions to copy-on-modify which make perfect sense from an efficiency standpoint.
The first is: if an object has only one name bound to it, it gets modified in place.
This has a caveat: R can’t always tell accurately when an object has only one name, so sometimes it makes extra copies.
This is more likely with data frames, which is why modifying them inside for loops can be very slow.
Try using lists inside big for loops instead: modifying a list element triggers fewer copies.
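The advice above can be tried out with a small sketch, loosely based on the median-subtraction example in Advanced R. The object names and the data are made up for illustration, and how many copies each loop makes depends on your R version, so treat the comments as a guide rather than a guarantee.

```r
set.seed(1)
df <- data.frame(matrix(runif(5 * 1e3), ncol = 5))
medians <- vapply(df, median, numeric(1))

# Method 1: modify the data frame directly inside the loop.
# Each assignment goes through the data frame replacement method,
# which may copy the data frame.
df_centered <- df
for (i in seq_along(medians)) {
  df_centered[[i]] <- df_centered[[i]] - medians[[i]]
}

# Method 2: convert to a plain list first, loop, then convert back.
cols <- as.list(df)
for (i in seq_along(medians)) {
  cols[[i]] <- cols[[i]] - medians[[i]]
}
list_centered <- as.data.frame(cols)

# Both methods should give the same answer; only the copying differs.
same_result <- isTRUE(all.equal(df_centered, list_centered))
same_result
```

To see the copies themselves, you could wrap each loop in tracemem() and untracemem() calls as shown earlier.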
The second exception is for environments, which are always modified in place.
More about what an environment is specifically in a few chapters– but it’s this behavior that makes them really, really nice for some types of programming.
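As a minimal sketch of the difference, here is the same operation on a list and on an environment. The names l1, l2, e1, e2, and count are made up for this illustration.

```r
# A list follows copy-on-modify: changing l2 leaves l1 alone.
l1 <- list(count = 0)
l2 <- l1
l2$count <- 1
l1$count   # still 0

# An environment is modified in place: e1 and e2 are the same object.
e1 <- new.env()
e1$count <- 0
e2 <- e1
e2$count <- 1
e1$count   # now 1
identical(e1, e2)
```

With the environment, e2 <- e1 adds a second name for the same object, and the later assignment changes that object rather than a copy.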
When you bind objects to different names, sometimes there are things floating in memory that aren’t accessed.
R takes care of this for you with the garbage collector (GC).
It runs automatically whenever R needs more memory to create a new object.
-1. Given the following data frame, how do I create a new column called “3” that contains the sum of 1 and 2? You may only use $, not [[. What makes 1, 2, and 3 challenging as variable names?
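df <- data.frame(runif(3), runif(3))
names(df) <- c(1, 2)
df$`3` <- df$`1` + df$`2`
1, 2, and 3 are non-syntactic names (they start with a digit), so they have to be wrapped in backticks.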
-2. In the following code, how much memory does y occupy?
x <- runif(1e6)
y <- list(x, x, x)
lobstr::obj_size(x)
## 8,000,048 B
lobstr::obj_size(y)
## 8,000,128 B
lobstr::obj_size(list(NULL, NULL, NULL))
## 80 B
So y takes up the size of one copy of x plus the 80 B of a 3-element list: 8,000,048 B + 80 B = 8,000,128 B.
-3. On which line does a get copied in the following example?
a <- c(1, 5, 3, 2)
b <- a
tracemem(a)
## [1] "<0x7f8ee1ff08d8>"
tracemem(b)
## [1] "<0x7f8ee1ff08d8>"
b[[1]] <- 10
## tracemem[0x7f8ee1ff08d8 -> 0x7f8ee3f18248]: eval eval withVisible withCallingHandlers handle timing_fn evaluate_call <Anonymous> evaluate in_dir block_exec call_block process_group.block process_group withCallingHandlers process_file <Anonymous> <Anonymous>
The tracemem output shows that the copy is made on the line b[[1]] <- 10, when b is modified; b <- a only adds a second name for the same vector.
-1. Explain the relationship between a, b, c and d in the following code:
a <- 1:10
b <- a
c <- b
d <- 1:10
-2. The following code accesses the mean function in multiple ways. Do they all point to the same underlying function object? Verify this with lobstr::obj_addr().
mean
base::mean
get("mean")
evalq(mean)
match.fun("mean")
-3. By default, base R data import functions, like read.csv(), will automatically convert non-syntactic names to syntactic ones. Why might this be problematic? What option allows you to suppress this behaviour?
-4. What rules does make.names() use to convert non-syntactic names into syntactic ones?
-5. I slightly simplified the rules that govern syntactic names. Why is .123e1 not a syntactic name? Read ?make.names for the full details.
-1. Why is tracemem(1:10) not useful?
-2. Explain why tracemem() shows two copies when you run this code. Hint: carefully look at the difference between this code and the code shown earlier in the section.
x <- c(1L, 2L, 3L)
tracemem(x)
## [1] "<0x7f8ee4932348>"
x[[3]] <- 4
## tracemem[0x7f8ee4932348 -> 0x7f8ee585f508]: eval eval withVisible withCallingHandlers handle timing_fn evaluate_call <Anonymous> evaluate in_dir block_exec call_block process_group.block process_group withCallingHandlers process_file <Anonymous> <Anonymous>
## tracemem[0x7f8ee585f508 -> 0x7f8ee589dbd8]: eval eval withVisible withCallingHandlers handle timing_fn evaluate_call <Anonymous> evaluate in_dir block_exec call_block process_group.block process_group withCallingHandlers process_file <Anonymous> <Anonymous>
-3. Sketch out the relationship between the following objects:
a <- 1:10
b <- list(a, a)
c <- list(b, a, 1:10)
-4. What happens when you run this code? Draw a picture.
x <- list(1:10)
x[[2]] <- x
-1. In the following example, why are object.size(y) and obj_size(y) so radically different? Consult the documentation of object.size().
y <- rep(list(runif(1e4)), 100)
object.size(y)
## 8005648 bytes
obj_size(y)
## 80,896 B
-2. Take the following list. Why is its size somewhat misleading?
funs <- list(mean, sd, var)
obj_size(funs)
## 17,608 B
-3. Predict the output of the following code:
a <- runif(1e6)
obj_size(a)
b <- list(a, a)
obj_size(b)
obj_size(a, b)
b[[1]][[1]] <- 10
obj_size(b)
obj_size(a, b)
b[[2]][[1]] <- 10
obj_size(b)
obj_size(a, b)
-1. Explain why the following code doesn’t create a circular list.
x <- list()
x[[1]] <- x
-2. Wrap the two methods for subtracting medians into two functions, then use the ‘bench’ package (Hester 2018) to carefully compare their speeds. How does performance change as the number of columns increase?
-3. What happens if you attempt to use tracemem() on an environment?