Pre-Lecture - Writing Code
Working directory found with getwd().
See the files in a directory with dir().
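A minimal sketch of the two calls above; the variable names `wd` and `files` are only illustrative.

```r
# Current working directory, returned as a single character string.
wd <- getwd()
print(wd)

# Names of the files and sub-directories in the working directory.
# dir() is an alias for list.files().
files <- dir()
print(files)
```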
Overview and History of R
R is a dialect of the S language.
S was developed by John Chambers and others at Bell Labs.
R was developed in 1991 in New Zealand by Ross Ihaka and Robert Gentleman.
R version 1 released in 2000.
Features of R
- Syntax is similar to S-PLUS
- Semantics are superficially like S, but are in reality quite different.
- Runs on almost any standard platform/OS
- Frequent releases.
- Quite lean, functionality in modular packages.
- Graphics capabilities are very sophisticated and better than most stat packages.
- Useful for interactive work.
- Active and vibrant user community.
Free Software
With free software you are granted:
- The freedom to run the program.
- The freedom to study how the program works and adapt it to your needs.
- The freedom to redistribute copies.
- The freedom to improve the program.
Drawbacks of R
- Essentially based on 40-year-old technology.
- Little built in support for dynamic or 3D graphics.
- Functionality is based on consumer demand and user contributions.
- Objects must generally be stored in physical memory - however advancements have been made on this.
- Not ideal for all possible situations.
Design
- Base R
- Everything else
Base R comes with base packages (stats, utils, etc.) as well as recommended packages.
Data Types - R Objects and Attributes
Everything in R is an object.
R has five basic “atomic” classes of objects:
- Character
- Numeric (real numbers)
- Integer
- Complex
- Logical
The most basic object is a vector. A vector can only contain objects of the same class. Empty vectors are created with the vector() function.
Numbers
Numbers are generally treated as numeric objects (double-precision real numbers). Provide the L suffix if you want an integer.
## [1] "numeric"
## [1] "integer"
Inf represents infinity. NaN represents an undefined value.
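The two class outputs above, and the special values, can be reproduced with something like the following sketch. The variable names are illustrative only.

```r
x <- 1    # no suffix: stored as a double-precision numeric
y <- 1L   # L suffix: stored as an integer

class(x)  # "numeric"
class(y)  # "integer"

1 / 0     # Inf
1 / Inf   # 0, so Inf can be used in ordinary arithmetic
0 / 0     # NaN, an undefined value
```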
Attributes
R objects can have attributes. names and dimnames are common. A matrix, for example, has a dim attribute holding the number of rows and columns:
## [,1] [,2] [,3] [,4]
## [1,] 1 5 9 13
## [2,] 2 6 10 14
## [3,] 3 7 11 15
## [4,] 4 8 12 16
## [1] "Attributes:"
## $dim
## [1] 4 4
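Output like the above is consistent with the sketch below. The original chunk is not shown in these notes, so the exact calls are an assumption.

```r
# Build a 4 x 4 matrix; values fill column by column.
m <- matrix(1:16, nrow = 4, ncol = 4)
m

print("Attributes:")
# A plain matrix carries a single attribute: dim (rows, columns).
attributes(m)
```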
You also have class, length, and other user defined attributes / metadata.
Vectors and Lists
The c() function can be used to create vectors, as can vector().
## [1] 1 2 3 4 5 6 7 8 9 10
## [1] 0 0 0 0 0 0 0 0 0 0
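One way to obtain the two vectors printed above; this is a sketch, and the original calls may have differed.

```r
# c() concatenates its arguments into a vector.
a <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
a

# vector() creates an "empty" vector of a given mode and length;
# numeric vectors are initialised to 0.
b <- vector("numeric", length = 10)
b
```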
If you mix classes, the values are coerced to the lowest common denominator, so every element ends up with the same class.
## [1] "character"
## [1] "numeric"
## [1] "character"
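The three results above match mixed vectors such as these. The particular values are assumptions chosen for illustration.

```r
# numeric + character -> everything becomes character
class(c(1.7, "a"))    # "character"

# logical + numeric -> TRUE becomes 1, result is numeric
class(c(TRUE, 2))     # "numeric"

# character + logical -> TRUE becomes "TRUE", result is character
class(c("a", TRUE))   # "character"
```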
Explicit Coercion
Can use explicit coercion:
## [1] "character"
Nonsensical coercion will result in NAs.
## [1] NA NA
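A sketch of explicit coercion with the as.*() functions. The input values are illustrative.

```r
x <- 0:6
class(as.character(x))   # "character"

# Coercing text that is not a number yields NA for each element.
# R also emits a "NAs introduced by coercion" warning, which is
# suppressed here only to keep the output clean.
bad <- suppressWarnings(as.numeric(c("a", "b")))
bad                      # NA NA
```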
Lists
Special type of vector that can contain elements of different classes.
## [[1]]
## [1] "a"
##
## [[2]]
## [1] 1 2 3 4 5 6 7 8 9 10
##
## [[3]]
## [1] TRUE
Using a single bracket on a list returns a list of one. Using a double bracket returns the element itself.
## [[1]]
## [1] "a"
## [1] "list"
## [1] "a"
## [1] "character"
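The list and the bracket behaviour shown above can be reproduced along these lines (a sketch; the object name `l` is an assumption).

```r
# A list can hold elements of different classes.
l <- list("a", 1:10, TRUE)
l

l[1]           # single bracket: a list containing the first element
class(l[1])    # "list"

l[[1]]         # double bracket: the element itself
class(l[[1]])  # "character"
```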
Matrices
A special type of vector with a ‘dimension’ attribute. The attribute itself is a vector of length two (rows / cols).
## [,1] [,2] [,3] [,4]
## [1,] 1 5 9 13
## [2,] 2 6 10 14
## [3,] 3 7 11 15
## [4,] 4 8 12 16
Matrices are built column wise - see above.
Can be created from vectors:
## [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
## [,1] [,2] [,3] [,4]
## [1,] 1 5 9 13
## [2,] 2 6 10 14
## [3,] 3 7 11 15
## [4,] 4 8 12 16
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
## [1,] 1 3 5 7 9 11 13 15
## [2,] 2 4 6 8 10 12 14 16
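A sketch of turning a plain vector into a matrix by assigning its dim attribute, which is consistent with the three outputs above.

```r
v <- 1:16
v

# Assigning dim reshapes the vector in place, filling column-wise.
dim(v) <- c(4, 4)
v

# The same 16 values can be reshaped again, here to 2 rows x 8 columns.
dim(v) <- c(2, 8)
v
```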
Column and Row Binding
Can column or row bind a vector:
## [,1] [,2]
## [1,] 0 32
## [2,] 1 33
## [3,] 2 34
## [4,] 3 35
## [5,] 4 36
## [6,] 5 37
## [7,] 6 38
## [8,] 7 39
## [,1] [,2] [,3] [,4]
## [1,] 1 2 3 4
## [2,] 8 9 10 11
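The two matrices above match calls like the following (a sketch; unnamed vectors are used so that no column or row names appear).

```r
# cbind() places each vector in its own column.
by_col <- cbind(0:7, 32:39)
by_col

# rbind() places each vector in its own row.
by_row <- rbind(1:4, 8:11)
by_row
```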
Factors
Special vector used to represent categorical data. Factors can be ordered or unordered.
Can be thought of as an integer vector with labels.
## [1] y y n n y
## Levels: n y
## [1] "factor"
## [1] "integer"
## [1] 2 2 1 1 2
## attr(,"levels")
## [1] "n" "y"
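A sketch that yields output like the above; the exact calls in the original chunk are not shown, so these are assumptions.

```r
f <- factor(c("y", "y", "n", "n", "y"))
f

class(f)    # "factor"
typeof(f)   # "integer": the underlying storage
unclass(f)  # the integer codes plus the "levels" attribute
```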
Factor Ordering
The order of the levels can be set using the levels argument to factor(). By default the levels are sorted alphabetically - hence why ‘n’ above was coded as 1.
## [1] 1 1 2 2 1
## attr(,"levels")
## [1] "y" "n"
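For instance, passing the levels explicitly makes ‘y’ the first level (a minimal sketch):

```r
f <- factor(c("y", "y", "n", "n", "y"), levels = c("y", "n"))

# "y" is now coded as 1 and "n" as 2.
unclass(f)
```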
Missing Values
Denoted by NA or NaN. Use is.na() or is.nan() to test whether an object holds that value.
The NAs can have a class as well.
A NaN is also considered NA, but an NA is not considered NaN.
## [1] FALSE FALSE TRUE TRUE FALSE
## [1] FALSE FALSE TRUE FALSE FALSE
## [1] NA
## [1] "numeric"
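The outputs above are consistent with a vector like this one. The values are an assumption reconstructed from the TRUE/FALSE pattern.

```r
x <- c(1, 2, NaN, NA, 5)

is.na(x)     # TRUE for both the NaN and the NA
is.nan(x)    # TRUE only for the NaN

x[4]         # NA
class(x[4])  # "numeric": this NA lives in a numeric vector
class(NA)    # "logical": a bare NA is logical by default
```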
Data Frames
Special type of list where every element has the same length. Each element can be considered a column, and each column can have a different class.
They have an attribute row.names.
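A small sketch of these properties; the column names and values are hypothetical.

```r
df <- data.frame(
  id   = 1:3,
  name = c("a", "b", "c"),
  flag = c(TRUE, FALSE, TRUE),
  stringsAsFactors = FALSE   # keep 'name' as character in any R version
)

is.list(df)        # TRUE: a data frame is a list of columns
sapply(df, class)  # one class per column
row.names(df)      # "1" "2" "3" by default
nrow(df)
ncol(df)
```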
Names Attribute
Objects can have names.
## [1] 1 2 3
## One Two Three
## 1 2 3
Named lists:
## $one
## [1] 1
##
## $two
## [1] 2
##
## $three
## [1] 3
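The named vector and named list above can be produced like this (a sketch; object names are illustrative).

```r
x <- 1:3
x                                     # no names yet
names(x) <- c("One", "Two", "Three")
x                                     # printed with names above the values

# Names can be given directly when building a list.
l <- list(one = 1, two = 2, three = 3)
l
```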
Matrices have dimnames:
named_matrix <- matrix(1:4, ncol = 2)
dimnames(named_matrix) <- list(c('A', 'B'), c('C', 'D'))
named_matrix
##   C D
## A 1 3
## B 2 4
Reading Tabular Data
- read.table() and read.csv() for reading tabular data.
- readLines() for reading lines of a text file.
- source() for reading R code.
- dget() for reading deparsed R objects.
- load() for workspaces.
- unserialize() for reading single R objects in binary form.
Read Table
Arguments:
- colClasses - character vector indicating the class of each column.
- comment.char - character string indicating the comment character.
- stringsAsFactors - should character variables be coded as factors? Defaulted to TRUE before R 4.0.0; the default is FALSE from R 4.0.0 onwards.
read.table() generally figures out the classes, skips lines beginning with #, and works out how many rows there are.
read.csv() is identical except that the default separator is a comma instead of whitespace and a header line is expected by default (header = TRUE).
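A self-contained sketch comparing the two readers. It uses the text argument instead of a file, and the column names and values are made up for illustration.

```r
csv_text <- "id,score\n1,9.5\n2,7.25"
tab_text <- "# a comment line, skipped by read.table\nid score\n1 9.5\n2 7.25"

# read.csv(): comma separator and header = TRUE by default.
df_csv <- read.csv(text = csv_text)

# read.table(): whitespace separator; the header must be requested.
df_tab <- read.table(text = tab_text, header = TRUE)

str(df_csv)
str(df_tab)
```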
Reading Large Tables
- Make a rough calculation on the memory required and determine if you have enough physical memory.
- Set comment.char = '' if there are no comments.
Using the colClasses argument can make reading in much faster, and setting nrows can help memory usage, with a mild overestimate being okay.
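One common pattern is to read a small sample, take the column classes from it, and reuse them for the full read. The sketch below uses a temporary file with made-up data; the file and column names are hypothetical.

```r
# Create a small example file to stand in for a large table.
set.seed(1)
tmp <- tempfile(fileext = ".txt")
write.table(data.frame(id = 1:500, value = rnorm(500)),
            tmp, row.names = FALSE)

# Read only the first rows and let R work out the classes.
initial <- read.table(tmp, header = TRUE, nrows = 100)
classes <- sapply(initial, class)

# Full read: classes supplied, nrows mildly overestimated,
# and comment scanning switched off.
full <- read.table(tmp, header = TRUE, colClasses = classes,
                   nrows = 600, comment.char = "")
unlink(tmp)
```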
Know your system - memory, applications, other users, OS.
You can calculate memory requirements. Consider 1,000,000 rows and 200 columns of numeric data. Each numeric value takes 8 bytes, which comes to roughly 1.5 GB:
## [1] 1.490116
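The figure above follows from this arithmetic (a sketch; dividing by 2^30 converts bytes to gigabytes).

```r
rows <- 1e6
cols <- 200
bytes_per_numeric <- 8

bytes <- rows * cols * bytes_per_numeric  # 1.6e9 bytes
gb <- bytes / 2^30                        # bytes -> GB
gb                                        # 1.490116
```

Reading the data typically needs noticeably more memory than the raw size, so treat this as a lower bound.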
Textual Formats
dump() and dput() are useful.
dump() dumps named objects from the environment into text format:
File has the following contents:
dump_obj <-
structure(list(x = 1:4, y = 2:5), row.names = c(NA, -4L), class = c("tbl_df",
"tbl", "data.frame"))
vec <-
1:10
dput() dumps a single object:
File contains:
structure(list(x = 1:16, blah = c("a", "b", "c", "d", "e", "f",
"g", "h", "i", "j", "k", "l", "m", "n", "o", "p")), row.names = c(NA,
-16L), class = c("tbl_df", "tbl", "data.frame"))
Textual formats are good with version control, make it easier to fix ‘corruption’, and adhere to the Unix philosophy. However, they are space-inefficient.
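A round-trip sketch for both functions. It assumes plain data frames rather than the tibbles shown in the file contents above, and writes to temporary files.

```r
dump_obj <- data.frame(x = 1:4, y = 2:5)
vec <- 1:10

# dump() takes the *names* of objects and writes R code that recreates them.
dump_file <- tempfile(fileext = ".R")
dump(c("dump_obj", "vec"), file = dump_file)

# Remove the objects, then bring them back by sourcing the file.
rm(dump_obj, vec)
source(dump_file, local = TRUE)

# dput() writes a single object (without its name); dget() reads it back.
dput_file <- tempfile(fileext = ".R")
dput(dump_obj, file = dput_file)
restored <- dget(dput_file)

all.equal(restored, dump_obj)  # TRUE
```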
Connections
Connections can be made to:
- file()
- gzfile() - gzip compressed file.
- bzfile() - bzip2 compressed file.
- url()
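A sketch of opening, using, and closing a connection, here with gzfile() and a temporary file; the file contents are made up.

```r
tmp <- tempfile(fileext = ".gz")

# Open a gzip connection for writing, write two lines, and close it.
con <- gzfile(tmp, "w")
writeLines(c("first line", "second line"), con)
close(con)

# Re-open for reading and pull just the first line.
con <- gzfile(tmp, "r")
first <- readLines(con, n = 1)
close(con)

first
unlink(tmp)
```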
Subsetting
The [ operator returns an object of the same class as the original and can be used to extract more than one element (with one exception).
## [1] 4
## [1] FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE
## [1] 6 7 8 9 10
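The three outputs above match this sketch of numeric and logical indexing (the vector `x` is an assumption).

```r
x <- 1:10

x[4]       # numeric index: the fourth element
x > 5      # a logical vector, one entry per element
x[x > 5]   # logical index: keep elements where the test is TRUE
```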
The [[ operator returns the object within, so the result need not have the same class as the original.
The $ operator extracts by name, and its semantics are similar to [[.
Lists
## $a
## [1] 1 2 3 4
##
## $c
## [1] 80 81 82 83
## $a
## [1] 1 2 3 4
## [1] 1 2 3 4
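A sketch that gives output like the above. The element `b` is hypothetical, since only `a` and `c` appear in the output.

```r
l <- list(a = 1:4, b = "middle", c = 80:83)

l[c("a", "c")]  # single bracket with several names: a list of two
l["a"]          # single bracket: a list of one
l[["a"]]        # double bracket: the integer vector itself
l$a             # same result as l[["a"]]
```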
Nested Elements
l <- list(
a = list(
b = "Nest B",
c = "Nest C"
),
d = "Element D",
e = "Element E"
)
# Access the nested element
l[[c(1,2)]]
## [1] "Nest C"
## [1] "Nest B"
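The "Nest B" result can be reached in several equivalent ways; which one the original used is not shown, so the forms below are a sketch.

```r
l <- list(
  a = list(b = "Nest B", c = "Nest C"),
  d = "Element D",
  e = "Element E"
)

l[[c(1, 1)]]  # recursive index: first element, then its first element
l[[1]][[1]]   # the same thing, one step at a time
l$a$b         # the same thing, by name
```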
Matrices
Can be subsetted in the normal way using (i, j) row / column indices.
## [1] 2 6 10 14
## [1] 1 2 3 4
## [1] 13
Subsetting a single row, column, or element out of a matrix returns a vector, not another matrix.
If you want a matrix you need to set the drop argument to FALSE:
## [,1] [,2] [,3] [,4]
## [1,] 2 6 10 14
## [,1]
## [1,] 1
## [2,] 2
## [3,] 3
## [4,] 4
## [,1]
## [1,] 13
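Assuming the same 4 x 4 matrix as earlier in these notes, the vector and matrix results above can be sketched as:

```r
m <- matrix(1:16, nrow = 4)

# Default behaviour: dimensions are dropped and a vector comes back.
m[2, ]    # second row:   2 6 10 14
m[, 1]    # first column: 1 2 3 4
m[1, 4]   # one element:  13

# drop = FALSE keeps the result as a matrix.
row2 <- m[2, , drop = FALSE]   # 1 x 4 matrix
col1 <- m[, 1, drop = FALSE]   # 4 x 1 matrix
cell <- m[1, 4, drop = FALSE]  # 1 x 1 matrix
```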
Partial Matching
Works with the $ and the [[. Double bracket expects an exact match by default, but can partially match with the exact = FALSE argument.
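A short sketch of the difference; the element name `aardvark` is just an example.

```r
x <- list(aardvark = 1:5)

x$a                      # partial match on the name: 1 2 3 4 5
x[["a"]]                 # exact match required by default: NULL
x[["a", exact = FALSE]]  # partial matching switched on: 1 2 3 4 5
```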
Removing Missing Values
Use is.na() or complete.cases().
## [1] 1 2 3 4 5
## [1] 1 2 3 4 5
Can be done on data frames too - done by rows.
## [1] TRUE FALSE FALSE FALSE TRUE TRUE
## a b
## 1 1 1
## 5 5 5
## 6 6 6
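The outputs in this section are consistent with the sketch below. The input values are assumptions reconstructed from the printed results.

```r
x <- c(1, NA, 2, 3, NA, 4, 5)

x[!is.na(x)]          # keep elements that are not NA
x[complete.cases(x)]  # same result for a single vector

# For a data frame, complete.cases() flags rows with no NA in any column.
df <- data.frame(a = c(1, 2, NA, 4, 5, 6),
                 b = c(1, NA, 3, NA, 5, 6))
good <- complete.cases(df)
good
df[good, ]            # rows 1, 5 and 6 survive
```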