Course 2 - R Programming - Week 1 - Notes

Greg Foletta

2019-09-26

Pre-Lecture - Writing Code

Working directory found with getwd().

See the files in a directory with dir().

Overview and History of R

R is a dialect of the S language.

S was developed by John Chambers and others at Bell Labs.

R was developed in 1991 in New Zealand by Ross Ihaka and Robert Gentleman.

R version 1.0.0 was released in 2000.

Features of R

  • Syntax is similar to S-PLUS
  • Semantics are superficially like S, but are in reality quite different.
  • Runs on almost any standard platform/OS
  • Frequent releases.
  • Quite lean, functionality in modular packages.
  • Graphics capabilities are very sophisticated and better than most stat packages.
  • Useful for interactive work.
  • Active and vibrant user community.

Free Software

With free software you are granted:

  • The freedom to run the program.
  • The freedom to study how the program works and adapt it to your needs.
  • The freedom to redistribute copies.
  • The freedom to improve the program.

Drawbacks of R

  • Essentially based on 40 year old technology.
  • Little built in support for dynamic or 3D graphics.
  • Functionality is based on consumer demand and user contributions.
  • Objects must generally be stored in physical memory, although advances have been made here.
  • Not ideal for all possible situations.

Design

  1. Base R
  2. Everything else

Base R comes with base packages (stats, utils, etc.) as well as recommended packages.

Data Types - R Objects and Attributes

Everything in R is an object.

R has five basic “atomic” classes of objects:

  • Character
  • Numeric (real numbers)
  • Integer
  • Complex
  • Logical

The most basic object is a vector, which can only contain objects of the same class. Empty vectors are created with the vector() function.

Numbers

Numbers are generally treated as numeric objects (double precision real numbers). Provide the L suffix if you want an integer.

## [1] "numeric"
## [1] "integer"

Inf represents infinity. NaN represents an undefined value.
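
The code that produced the output above is not shown, so the following is only a sketch of the same ideas. The variable names and literals are assumptions chosen for illustration.

```r
# Illustrative values; the notes do not show the original code.
x <- 1       # no suffix: stored as a double
y <- 1L      # L suffix: stored as an integer

class(x)     # "numeric"
class(y)     # "integer"

pos_inf   <- 1 / 0   # Inf
undefined <- 0 / 0   # NaN

is.infinite(pos_inf) # TRUE
is.nan(undefined)    # TRUE
```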

Attributes

R objects can have attributes. names, dimnames and dim are common - for example a matrix has a dim attribute holding the number of rows and columns:

##      [,1] [,2] [,3] [,4]
## [1,]    1    5    9   13
## [2,]    2    6   10   14
## [3,]    3    7   11   15
## [4,]    4    8   12   16
## [1] "Attributes:"
## $dim
## [1] 4 4

You also have class, length, and other user defined attributes / metadata.
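
A minimal sketch of how a matrix and its attributes might be inspected. The name m is an assumption; the notes do not show the original code.

```r
# Build a 4 x 4 matrix; values are filled column by column.
m <- matrix(1:16, nrow = 4, ncol = 4)

# attributes() returns a list; for a plain matrix it holds only $dim.
attrs <- attributes(m)
attrs$dim    # 4 4

# dim() reads the same attribute directly.
dim(m)       # 4 4
```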

Vectors and Lists

The c() function can be used to create vectors, as can vector().

##  [1]  1  2  3  4  5  6  7  8  9 10
##  [1] 0 0 0 0 0 0 0 0 0 0
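
Output like the above could come from code along these lines. This is a sketch with assumed variable names, not the original chunk.

```r
# c() concatenates its arguments into a vector.
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)

# vector() creates a vector of a given mode and length,
# filled with that mode's default value (0 for numeric).
y <- vector("numeric", length = 10)

x
y
```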

If you mix classes, the values are coerced to the lowest common denominator class.

## [1] "character"
## [1] "numeric"
## [1] "character"
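
A sketch of implicit coercion that gives the same three classes as above. The particular values mixed here are assumptions.

```r
a <- c(1.7, "a")    # numeric + character -> character
b <- c(TRUE, 2)     # logical + numeric   -> numeric (TRUE becomes 1)
d <- c("a", TRUE)   # character + logical -> character

class(a)   # "character"
class(b)   # "numeric"
class(d)   # "character"
```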

Explicit Coercion

Objects can be explicitly coerced with the as.*() functions:

## [1] "character"

Nonsensical coercion will result in NAs.

## [1] NA NA
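
A sketch of both cases, assuming illustrative inputs. as.numeric() also raises a warning when it introduces NAs, which is suppressed here to keep the example quiet.

```r
x <- 0:6
as_text <- as.character(x)
class(as_text)   # "character"

# Letters have no numeric interpretation, so each becomes NA.
bad <- suppressWarnings(as.numeric(c("a", "b")))
bad              # NA NA
```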

Lists

Special type of vector that can contain elements of different classes.

## [[1]]
## [1] "a"
## 
## [[2]]
##  [1]  1  2  3  4  5  6  7  8  9 10
## 
## [[3]]
## [1] TRUE

Using a single bracket on a list returns a list of one. Using a double bracket returns the element in the list.

## [[1]]
## [1] "a"
## [1] "list"
## [1] "a"
## [1] "character"
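
The list and bracket behaviour above can be sketched as follows. The variable names are assumptions.

```r
x <- list("a", 1:10, TRUE)

# Single bracket: still a list, now of length one.
first_as_list <- x[1]
class(first_as_list)   # "list"

# Double bracket: the element itself.
first_element <- x[[1]]
class(first_element)   # "character"
```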

Matrices

A special type of vector with a ‘dimension’ attribute. The attribute itself is an integer vector of length two (rows / cols).

##      [,1] [,2] [,3] [,4]
## [1,]    1    5    9   13
## [2,]    2    6   10   14
## [3,]    3    7   11   15
## [4,]    4    8   12   16

Matrices are built column wise - see above.

Can be created from vectors:

##  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
##      [,1] [,2] [,3] [,4]
## [1,]    1    5    9   13
## [2,]    2    6   10   14
## [3,]    3    7   11   15
## [4,]    4    8   12   16
##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
## [1,]    1    3    5    7    9   11   13   15
## [2,]    2    4    6    8   10   12   14   16

Column and Row Binding

Vectors can be column or row bound with cbind() and rbind():

##      [,1] [,2]
## [1,]    0   32
## [2,]    1   33
## [3,]    2   34
## [4,]    3   35
## [5,]    4   36
## [6,]    5   37
## [7,]    6   38
## [8,]    7   39
##      [,1] [,2] [,3] [,4]
## [1,]    1    2    3    4
## [2,]    8    9   10   11

Factors

Special vector used to represent categorical data. Factors can be ordered or unordered.

Can be thought of as an integer vector with labels.

## [1] y y n n y
## Levels: n y
## [1] "factor"
## [1] "integer"
## [1] 2 2 1 1 2
## attr(,"levels")
## [1] "n" "y"

Factor Ordering

The order of the levels can be set using the levels argument to factor(). By default the levels are set alphabetically - hence why ‘n’ above was coded as 1.

## [1] 1 1 2 2 1
## attr(,"levels")
## [1] "y" "n"
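
A sketch covering both the default and the explicit level ordering. The variable names are assumptions, and typeof() is used here to show the underlying integer storage.

```r
answers <- c("y", "y", "n", "n", "y")

# Default: levels sorted alphabetically, so n = 1 and y = 2.
f <- factor(answers)
class(f)        # "factor"
typeof(f)       # "integer"
as.integer(f)   # 2 2 1 1 2

# Explicit levels: y = 1 and n = 2.
g <- factor(answers, levels = c("y", "n"))
as.integer(g)   # 1 1 2 2 1
```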

Missing Values

Denoted by NA or NaN. Use is.na() or is.nan() to test whether an object holds one of those values.

The NAs can have a class as well.

A NaN is also considered NA, but an NA is not a NaN.

## [1] FALSE FALSE  TRUE  TRUE FALSE
## [1] FALSE FALSE  TRUE FALSE FALSE
## [1] NA
## [1] "numeric"
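
A sketch that is consistent with the output above. The input vector is inferred from the printed results, and NA_real_ is just one way of getting a numeric NA.

```r
x <- c(1, 2, NaN, NA, 4)

is.na(x)    # FALSE FALSE  TRUE  TRUE FALSE  (NaN counts as NA)
is.nan(x)   # FALSE FALSE  TRUE FALSE FALSE  (NA is not NaN)

# NA comes in typed variants; a bare NA is logical.
typed_na <- NA_real_
class(typed_na)   # "numeric"
class(NA)         # "logical"
```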

Data Frames

Special type of list where every element has the same length. Each element can be considered a column, and each column can have a different class.

They have an attribute row.names.
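
A minimal sketch of a data frame with mixed column classes. The column names and values are hypothetical.

```r
df <- data.frame(
  id   = 1:3,
  name = c("a", "b", "c"),
  flag = c(TRUE, FALSE, TRUE),
  stringsAsFactors = FALSE   # keep name as character
)

is.list(df)         # TRUE: a data frame is a list of columns
sapply(df, class)   # one class per column
row.names(df)       # "1" "2" "3"
```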

Names Attribute

Objects can have names.

## [1] 1 2 3
##   One   Two Three 
##     1     2     3

Named lists:

## $one
## [1] 1
## 
## $two
## [1] 2
## 
## $three
## [1] 3

Matrices have dimnames:

##   C D
## A 1 3
## B 2 4
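
The three kinds of names above can be sketched as follows. The variable names are assumptions.

```r
# Names on an atomic vector.
x <- 1:3
names(x) <- c("One", "Two", "Three")

# Names given when the list is created.
l <- list(one = 1, two = 2, three = 3)

# dimnames is a list: row names first, then column names.
m <- matrix(1:4, nrow = 2,
            dimnames = list(c("A", "B"), c("C", "D")))

x["Two"]      # 2, labelled Two
l$three       # 3
m["A", "D"]   # 3
```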

Reading Tabular Data

  • read.table() and read.csv() for reading tabular data.
  • readLines() for reading lines of a text file.
  • source() for reading R code.
  • dget() for reading deparsed R objects.
  • load() for workspaces.
  • unserialize() for reading single R objects in binary form.

Read Table

Arguments:

  • colClasses - character vector indicating the class of each column.
  • comment.char - character string indicating the comment character.
  • stringsAsFactors - should character variables be coded as factors?
    • Defaulted to TRUE before R 4.0.0; defaults to FALSE from R 4.0.0 onwards.

read.table() generally figures out the column classes, skips lines beginning with #, and works out how many rows there are.

read.csv() is identical except that the default separator is a comma instead of whitespace, and header defaults to TRUE.
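
A sketch of both functions reading the same file. The file contents, column names and the use of a temporary file are assumptions made so that the example is self-contained.

```r
# Write a small hypothetical CSV to a temporary file.
path <- tempfile(fileext = ".csv")
writeLines(c("id,name,score",
             "1,alice,9.5",
             "2,bob,7.25"), path)

col_types <- c("integer", "character", "numeric")

# read.csv(): comma separator and header = TRUE by default.
df <- read.csv(path, colClasses = col_types)

# read.table() needs the separator and header spelled out.
same <- read.table(path, header = TRUE, sep = ",",
                   colClasses = col_types)

unlink(path)   # remove the temporary file
sapply(df, class)
```

Supplying colClasses up front means R does not have to guess the type of each column.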

Reading Large Tables

  • Make a rough calculation on the memory required and determine if you have enough physical memory.
  • Set comment.char = '' if there are no comments.

Using the colClasses argument can make reading in much faster, and setting nrows can help memory usage, with a mild overestimate being okay.

Know your system - memory, applications, other users, OS.

You can calculate memory requirements. Consider 1,000,000 rows and 200 columns of numeric data. Each numeric value is 8 bytes, so the requirement in GB (2^30 bytes) is:

## [1] 1.490116
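
The arithmetic behind that figure can be sketched as follows, assuming the result is expressed in units of 2^30 bytes. The variable names are illustrative.

```r
rows <- 1e6
cols <- 200
bytes_per_numeric <- 8

total_bytes <- rows * cols * bytes_per_numeric   # 1.6e9 bytes
total_gb    <- total_bytes / 2^30

round(total_gb, 6)   # 1.490116
```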

Textual Formats

dump() and dput() are useful.

dump() dumps named objects from the environment into text format:

File has the following contents:

dput() writes out a single object:

File contains:

Textual formats are good with version control, easier to fix ‘corruption’, and adhere to the Unix philosophy. However they are space-inefficient.
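
A sketch of a round trip through both functions. The objects, the temporary files and the separate environment used for restoring are all assumptions made for illustration.

```r
x <- c(a = 1, b = 2)
y <- data.frame(n = 1:3)

# dump() writes R code that recreates the named objects.
dump_path <- tempfile(fileext = ".R")
dump(c("x", "y"), file = dump_path)

# source() runs that code; here it is evaluated in a fresh
# environment so the originals are left untouched.
restored <- new.env()
source(dump_path, local = restored)

# dput() writes a single, unnamed object; dget() reads it back.
dput_path <- tempfile(fileext = ".R")
dput(y, file = dput_path)
y2 <- dget(dput_path)

unlink(c(dump_path, dput_path))
```

dump() stores the object names along with the values, whereas dput() stores only the value, so the caller chooses the name when reading it back with dget().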

Connections

Connections can be made to:

  • file()
  • gzfile() - gzip compressed file.
  • bzfile() - bzip2 compressed file.
  • url()
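
A sketch of opening, using and closing a connection, using gzfile() as the example. The file contents and temporary path are assumptions.

```r
path <- tempfile(fileext = ".gz")

# Open a gzip-compressed file for writing, write, then close.
con <- gzfile(path, open = "w")
writeLines(c("first", "second", "third"), con)
close(con)

# Re-open for reading and pull only the first two lines.
con <- gzfile(path, open = "r")
first_two <- readLines(con, n = 2)
close(con)

unlink(path)
first_two   # "first" "second"
```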

Subsetting

The [ operator returns an object of the same class as the original and can be used to select more than one element (with one exception).

## [1] 4
##  [1] FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE
## [1]  6  7  8  9 10
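
A sketch that would give the output above, assuming a vector of the integers 1 to 10.

```r
x <- 1:10

# Numeric index: the fourth element.
x[4]        # 4

# A comparison yields a logical vector of the same length...
big <- x > 5

# ...which can then be used as a logical index.
x[big]      # 6 7 8 9 10
```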

The [[ operator returns the object within - a single element of a list or data frame.

The $ can extract by name, and the semantics are similar to [[.

Lists

## $a
## [1] 1 2 3 4
## 
## $c
## [1] 80 81 82 83
## $a
## [1] 1 2 3 4
## [1] 1 2 3 4
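
A sketch of list subsetting consistent with the output above. The middle element b and its value are hypothetical; only a and c appear in the printed output.

```r
x <- list(a = 1:4, b = "hypothetical", c = 80:83)

# Single bracket with several names: a list of those elements.
x[c("a", "c")]

# Single bracket with one name: still a list.
class(x["a"])   # "list"

# $ and [[ both return the element itself.
x$a             # 1 2 3 4
x[["a"]]        # 1 2 3 4
```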

Matrices

Can be subsetted in the normal way using (i, j) row / column indices.

## [1]  2  6 10 14
## [1] 1 2 3 4
## [1] 13

Subsetting a single row, column or element out of a matrix returns a vector, not another matrix.

If you want a matrix back you need to set the drop argument to FALSE:

##      [,1] [,2] [,3] [,4]
## [1,]    2    6   10   14
##      [,1]
## [1,]    1
## [2,]    2
## [3,]    3
## [4,]    4
##      [,1]
## [1,]   13

Partial Matching

Works with the $ and the [[ operators. $ matches partial names by default. The double bracket expects an exact match unless the exact = FALSE argument is given.
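
A minimal sketch of partial matching, using a hypothetical list with one long element name.

```r
x <- list(aardvark = 1:5)

# $ matches a unique prefix of the name.
x$a                       # 1 2 3 4 5

# [[ requires the exact name by default.
x[["a"]]                  # NULL

# exact = FALSE turns partial matching on for [[.
x[["a", exact = FALSE]]   # 1 2 3 4 5
```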

Removing Missing Values

Use is.na() or complete.cases().

## [1] 1 2 3 4 5
## [1] 1 2 3 4 5

complete.cases() works on data frames too, where it is evaluated row by row.

## [1]  TRUE FALSE FALSE FALSE  TRUE  TRUE
##   a b
## 1 1 1
## 5 5 5
## 6 6 6
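
A sketch consistent with the output above. The position of the NAs in the vector and in the data frame are assumptions chosen to give the same results.

```r
x <- c(1, 2, NA, 3, NA, 4, 5)

# Two equivalent ways of dropping NAs from a vector.
x[!is.na(x)]           # 1 2 3 4 5
x[complete.cases(x)]   # 1 2 3 4 5

# For a data frame, a row is complete only if no column is NA.
df <- data.frame(a = c(1, NA, 3, NA, 5, 6),
                 b = c(1, 2, NA, NA, 5, 6))
ok <- complete.cases(df)
ok                     # TRUE FALSE FALSE FALSE TRUE TRUE

clean <- df[ok, ]      # keeps rows 1, 5 and 6
```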