library(tidyverse)
library(kableExtra)
library(knitr)

str() Function

Compactly display the internal structure of an R object.

nested_lists <- list(
  a = list(
    b = list(
      matrix = matrix(rnorm(4), ncol = 2),
      numeric_vector = 1:10,
      character_vector = LETTERS,
      data_frame = tibble(data_a = 1:10, data_b = 11:20),
      list = list(1:10)
    )
  )
)

# On an object
str(nested_lists)

## List of 1
##  $ a:List of 1
##   ..$ b:List of 5
##   .. ..$ matrix          : num [1:2, 1:2] -0.438 -0.669 0.488 0.592
##   .. ..$ numeric_vector  : int [1:10] 1 2 3 4 5 6 7 8 9 10
##   .. ..$ character_vector: chr [1:26] "A" "B" "C" "D" ...
##   .. ..$ data_frame      :Classes 'tbl_df', 'tbl' and 'data.frame':  10 obs. of  2 variables:
##   .. .. ..$ data_a: int [1:10] 1 2 3 4 5 6 7 8 9 10
##   .. .. ..$ data_b: int [1:10] 11 12 13 14 15 16 17 18 19 20
##   .. ..$ list            :List of 1
##   .. .. ..$ : int [1:10] 1 2 3 4 5 6 7 8 9 10

# On a function
str(matrix)

## function (data = NA, nrow = 1, ncol = 1, byrow = FALSE, dimnames = NULL)

# On dataframes
str(airquality)

## 'data.frame':    153 obs. of  6 variables:
##  $ Ozone  : int  41 36 12 18 NA 28 23 19 8 NA ...
##  $ Solar.R: int  190 118 149 313 NA NA 299 99 19 194 ...
##  $ Wind   : num  7.4 8 12.6 11.5 14.3 14.9 8.6 13.8 20.1 8.6 ...
##  $ Temp   : int  67 72 74 62 56 66 65 59 61 69 ...
##  $ Month  : int  5 5 5 5 5 5 5 5 5 5 ...
##  $ Day    : int  1 2 3 4 5 6 7 8 9 10 ...
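For deeply nested objects the full listing can get long. As a minimal sketch (the `nested` object is made up for illustration), the `max.level` argument of `str()` limits how many levels of nesting are displayed:

```r
# A small nested list, defined here so the example is self-contained
nested <- list(a = list(b = list(x = 1:3, y = letters)))

# Show only the top level of nesting
str(nested, max.level = 1)

# Show the full structure for comparison
str(nested)
```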

Simulation

Generating Random Numbers

Functions for probability distributions:

rnorm() - generate random normal variates.
dnorm() - evaluate the normal probability density at a point.
pnorm() - evaluate the cumulative distribution function for a normal distribution.
rpois() - generate random Poisson variates with a given rate.

For each distribution there are usually four functions with different prefixes:

‘r’ for random numbers.
‘d’ for density.
‘p’ for cumulative distribution.
‘q’ for quantile.
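A short sketch of how the four prefixes relate for the normal distribution. The input values (0 and 1.96) are arbitrary choices for illustration:

```r
# 'd': density of the standard normal at 0, equal to 1 / sqrt(2 * pi)
dnorm(0)

# 'p': probability that a standard normal variate falls below 1.96
p <- pnorm(1.96)
p

# 'q': the quantile function inverts the cumulative distribution,
# so this recovers 1.96 up to floating point error
qnorm(p)

# 'r': four random draws from the standard normal
rnorm(4)
```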

rnorm(4, 100, 10)

## [1] 109.98245  95.49942  85.53378  97.56938

# Should be .5
pnorm(10, mean = 10)

## [1] 0.5

rpois(10, 1)

##  [1] 1 2 3 0 0 3 0 2 0 1

rpois(10, 4)

##  [1] 2 6 4 8 3 3 6 5 5 7
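The draws above change on every run. The simulations that follow call set.seed() first so that results can be reproduced. A minimal sketch of the effect (the seed value 42 is arbitrary):

```r
# Setting the same seed before each draw gives the same numbers
set.seed(42)
first <- rnorm(3)

set.seed(42)
second <- rnorm(3)

# TRUE: the two vectors are exactly the same
identical(first, second)
```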

Simulating a Linear Model

Suppose we want to simulate

\[ y = \beta_0+ \beta_1x_1 + \epsilon \]

where:

\[ \epsilon \sim \mathcal{N}(0, 2^2), \quad x_1 \sim \mathcal{N}(0, 1^2), \quad \beta_0 = 0.5, \quad \beta_1 = 2 \]

set.seed(1)
x <- rnorm(100)
e <- rnorm(100, 0, 2)
y <- 0.5 + 2 * x + e

summary(y)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -5.4816 -1.1493  0.7582  0.6422  2.3404  6.1534

tibble(x = x, y = y) %>%
  ggplot(aes(x, y)) +
  geom_point()
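One way to sanity-check a simulation like this is to fit the model back to the simulated data and compare the estimates with the true coefficients. This is a sketch rather than part of the original notes; it assumes a larger sample (n = 10000 instead of 100) so the estimates land close to 0.5 and 2:

```r
set.seed(1)
n <- 10000                 # assumption: larger than the 100 used above
x <- rnorm(n)
e <- rnorm(n, 0, 2)
y <- 0.5 + 2 * x + e

# Ordinary least squares fit of y on x
fit <- lm(y ~ x)

# The intercept should be near 0.5 and the slope near 2
coef(fit)
```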

What if \(x\) is binary?

set.seed(1)
x <- rbinom(100, 1, 0.5)
e <- rnorm(100, 0, 2)
y <- 0.5 + 2 * x + e

summary(y)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -3.3287 -0.1764  1.1098  1.4248  3.1967  6.8452

tibble(x = x, y = y) %>%
  ggplot(aes(x, y)) +
  geom_point()
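With a binary predictor the model implies a mean of 0.5 when x is 0 and 2.5 when x is 1. As a rough check (a sketch, reusing the same simulation code), the group means can be compared; with only 100 observations they will be close to, not equal to, those values:

```r
set.seed(1)
x <- rbinom(100, 1, 0.5)
e <- rnorm(100, 0, 2)
y <- 0.5 + 2 * x + e

# Mean of y within each level of x
group_means <- tapply(y, x, mean)
group_means
```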

Simulating a Poisson Model

\[ Y \sim \text{Poisson}(\mu), \quad \log(\mu) = \beta_0 + \beta_1 x, \quad \beta_0 = 0.5, \quad \beta_1 = 0.3 \]

set.seed(1)
x <- rnorm(100)
log_mu <- 0.5 + 0.3 * x
y <- rpois(100, exp(log_mu))

tibble(x = x, y = y) %>%
  ggplot(aes(x,y)) +
  geom_point()
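As with the linear model, the simulation can be checked by fitting a Poisson regression to the simulated data. This is a sketch under the assumption of a larger sample (n = 10000) so that the estimates sit close to the true values of 0.5 and 0.3:

```r
set.seed(1)
n <- 10000                          # assumption: larger than the 100 used above
x <- rnorm(n)
log_mu <- 0.5 + 0.3 * x
y <- rpois(n, exp(log_mu))

# Poisson regression; the default link for this family is the log link
fit <- glm(y ~ x, family = poisson)

# The intercept should be near 0.5 and the slope near 0.3
coef(fit)
```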

Random Sampling

The sample() function draws randomly from a specified set of scalar objects.

set.seed(1)
sample(1:10, 4)

## [1] 9 4 7 1

sample(1:10, 4)

## [1] 2 7 3 6

sample(letters, 5)

## [1] "r" "s" "a" "u" "w"

# Permutation
sample(1:10)

##  [1] 10  6  9  2  1  5  8  4  3  7

# Sample with replacement
sample(1:10, 8, replace = TRUE)

## [1]  5  5  2 10  9  1  4  3
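Two further uses of sample(), sketched here for illustration: drawing random rows from a data frame, and weighted sampling through the `prob` argument. The weights 0.8 and 0.2 are arbitrary:

```r
set.seed(1)

# sample(n, k) with a single number n draws from 1:n,
# so this picks 5 distinct row indices of airquality
rows <- sample(nrow(airquality), 5)
airquality[rows, ]

# Weighted sampling with replacement: "a" is drawn with probability 0.8
draws <- sample(c("a", "b"), 1000, replace = TRUE, prob = c(0.8, 0.2))
table(draws)
```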

R Profiler

The most basic tool is system.time(), which times a single expression. It returns an object of class proc_time containing user time, system (kernel) time, and elapsed (wall clock) time.

system.time(
  solve( matrix(rnorm(2048 * 2048), ncol = 2048) )
)

##    user  system elapsed 
##   7.764   0.072   7.837

User time is less than elapsed time if the process spends time off the CPU (for example, waiting on I/O).
User time is greater than elapsed time if parallel processing has occurred (multi-threading).
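The first case is easy to reproduce. In this sketch Sys.sleep() keeps the process off the CPU, so elapsed time is about half a second while user time stays close to zero:

```r
# Sleeping consumes wall clock time but almost no CPU time
timing <- system.time(Sys.sleep(0.5))
timing

# Individual components can be extracted by name
timing[["elapsed"]]
timing[["user.self"]]
```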

Rprof

The Rprof() function starts the profiler in R.

The summaryRprof() function summarises the output from Rprof().

The profiler keeps track of the call stack at regular intervals - default is 0.02 seconds.

library(magrittr)

## 
## Attaching package: 'magrittr'

## The following object is masked from 'package:purrr':
## 
##     set_names

## The following object is masked from 'package:tidyr':
## 
##     extract

x <- c(1:2000)
y <- rnorm(2000)
Rprof(tmp <- tempfile())
invisible(
  solve( matrix(rnorm(2048 * 2048), ncol = 2048) )
)
# Rprof(NULL) stops the profiler; a bare Rprof() would start a new profile in "Rprof.out"
Rprof(NULL)
summaryRprof(tmp) %>%
  use_series(by.self) %>%
  kable() %>%
  kable_styling()

Function          self.time  self.pct  total.time  total.pct
"solve.default"        7.30     96.31        7.30      96.31
"rnorm"                0.26      3.43        0.26       3.43
"matrix"               0.02      0.26        0.28       3.69

Note: C or Fortran code is not profiled.

“By Total” and “By Self”

"By total" is the time spent in a function including the functions it calls. "By self" is the time spent in that function alone, excluding the functions it calls.
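The difference shows up clearly when one function only delegates to another. In this sketch the helper names busy_work() and outer_fn() are hypothetical, made up for illustration. outer_fn() should show a large total time but almost no self time, because nearly all of the work happens in the functions it calls. Exact timings will vary by machine:

```r
# Hypothetical helper: spin on R-level arithmetic for a given wall clock time
busy_work <- function(seconds) {
  start <- proc.time()[["elapsed"]]
  while (proc.time()[["elapsed"]] - start < seconds) {
    sum(sqrt(seq_len(1e4)))
  }
  invisible(NULL)
}

# Hypothetical wrapper: does nothing itself except call busy_work()
outer_fn <- function() busy_work(0.5)

tmp <- tempfile()
Rprof(tmp)      # start profiling into a temporary file
outer_fn()
Rprof(NULL)     # stop profiling

prof <- summaryRprof(tmp)
prof$by.total   # time including child calls
prof$by.self    # time in each function only
```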

Course 2 - R Programming - Week 4 - Notes

Greg Foletta

2019-09-26