Pre-Lecture - Writing Code
Working directory found with getwd().
See the files in a directory with dir().
Overview and History of R
R is a dialect of the S language.
S was developed by John Chambers and others at Bell Labs.
R was developed in 1991 in New Zealand by Ross Ihaka and Robert Gentleman.
R version 1.0.0 was released in 2000.
Features of R
- Syntax is similar to S-PLUS
- Semantics are superficially like S, but are in reality quite different.
- Runs on almost any standard platform/OS
- Frequent releases.
- Quite lean, functionality in modular packages.
- Graphics capabilities are very sophisticated and better than most stat packages.
- Useful for interactive work.
- Active and vibrant user community.
Free Software
With free software you are granted:
- The freedom to run the program.
- The freedom to study how the program works and adapt it to your needs.
- The freedom to redistribute copies.
- The freedom to improve the program.
Drawbacks of R
- Essentially based on 40-year-old technology.
- Little built in support for dynamic or 3D graphics.
- Functionality is based on consumer demand and user contributions.
- Objects must generally be stored in physical memory - however advancements have been made on this.
- Not ideal for all possible situations.
Design
- Base R
- Everything else
Base R comes with base packages (stats, utils, etc.) as well as recommended packages.
Data Types - R Objects and Attributes
Everything in R is an object.
R has five basic “atomic” classes of objects:
- Character
- Numeric (real numbers)
- Integer
- Complex
- Logical
The most basic object is a vector, which can only contain objects of the same class. Empty vectors are created with the vector() function.
Numbers
Numbers are generally treated as numeric objects (double precision real numbers). You need to provide the L suffix if you want an integer.
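The chunk that produced the output below is not shown here. A minimal sketch that gives the same result, with variable names chosen only for illustration:

```r
x <- 1    # a plain number is stored as numeric (double)
y <- 1L   # the L suffix requests an integer
class(x)
class(y)
```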
## [1] "numeric"
## [1] "integer"
Inf represents infinity. NaN represents an undefined value.
Attributes
R objects can have attributes. names and dimnames are common. A matrix, for example, has a dim attribute holding the number of rows and columns:
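Code along these lines would print the matrix and its attributes as shown below. It is a sketch: the name m and the use of print() for the label are assumptions.

```r
m <- matrix(1:16, nrow = 4)  # 4 x 4 matrix filled column by column
m
print("Attributes:")
attributes(m)                # a plain matrix carries only a dim attribute
```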
## [,1] [,2] [,3] [,4]
## [1,] 1 5 9 13
## [2,] 2 6 10 14
## [3,] 3 7 11 15
## [4,] 4 8 12 16
## [1] "Attributes:"
## $dim
## [1] 4 4
You also have class, length, and other user-defined attributes / metadata.
Vectors and Lists
The c() function can be used to create vectors, as well as vector().
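For instance (a sketch, the names v1 and v2 are illustrative), the two approaches look like this and print the vectors shown below:

```r
v1 <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)    # combine values into a vector
v2 <- vector("numeric", length = 10)      # empty numeric vector, filled with zeros
v1
v2
```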
## [1] 1 2 3 4 5 6 7 8 9 10
## [1] 0 0 0 0 0 0 0 0 0 0
If you mix classes, the values are coerced to the lowest common denominator, i.e. the most general class that can represent every element.
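The three results below are consistent with mixing classes as in this sketch (the particular values are assumptions):

```r
class(c(1.5, "a"))    # number + string  -> everything becomes character
class(c(1.5, TRUE))   # number + logical -> TRUE becomes 1
class(c("a", TRUE))   # string + logical -> TRUE becomes "TRUE"
```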
## [1] "character"
## [1] "numeric"
## [1] "character"
Explicit Coercion
Can use explicit coercion:
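A small sketch using as.character(), one of the as.*() family of coercion functions (the input vector is illustrative):

```r
nums <- 1:5
chars <- as.character(nums)  # "1" "2" "3" "4" "5"
class(chars)
```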
## [1] "character"
Nonsensical coercion will result in NAs.
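For example, letters cannot be turned into numbers. In this sketch (values assumed) R returns NA for each element and also emits a coercion warning:

```r
bad <- as.numeric(c("a", "b"))  # no numeric interpretation, so NA with a warning
bad
```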
## [1] NA NA
Lists
Special type of vector that can contain elements of different classes.
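A list like the one printed below could be built as follows (sketch; the name l is an assumption):

```r
l <- list("a", 1:10, TRUE)  # character, integer and logical elements side by side
l
```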
## [[1]]
## [1] "a"
##
## [[2]]
## [1] 1 2 3 4 5 6 7 8 9 10
##
## [[3]]
## [1] TRUE
Using a single bracket on a list returns a list of one. Using a double bracket returns the element in the list.
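The difference can be checked with class(). This sketch rebuilds an illustrative list and should give the four results below:

```r
l <- list("a", 1:10, TRUE)
l[1]           # still a list, holding one element
class(l[1])
l[[1]]         # the element itself
class(l[[1]])
```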
## [[1]]
## [1] "a"
## [1] "list"
## [1] "a"
## [1] "character"
Matrices
A special type of vector with a ‘dimension’ attribute. The attribute itself is a vector of length two (rows / cols).
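A sketch of how such a matrix can be created with matrix(), using an illustrative name:

```r
m <- matrix(1:16, nrow = 4, ncol = 4)  # values fill column 1 first, then column 2, ...
m
dim(m)                                 # the dimension attribute: rows, then columns
```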
## [,1] [,2] [,3] [,4]
## [1,] 1 5 9 13
## [2,] 2 6 10 14
## [3,] 3 7 11 15
## [4,] 4 8 12 16
Matrices are built column wise - see above.
Can be created from vectors:
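One way to do this is to assign a dim attribute to an existing vector. The following sketch (variable name assumed) reproduces the vector, the 4 x 4 matrix and the 2 x 8 matrix shown below:

```r
v <- 1:16
v
dim(v) <- c(4, 4)  # reinterpret the same 16 values as 4 rows x 4 columns
v
dim(v) <- c(2, 8)  # and again as 2 rows x 8 columns
v
```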
## [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
## [,1] [,2] [,3] [,4]
## [1,] 1 5 9 13
## [2,] 2 6 10 14
## [3,] 3 7 11 15
## [4,] 4 8 12 16
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
## [1,] 1 3 5 7 9 11 13 15
## [2,] 2 4 6 8 10 12 14 16
Column and Row Binding
Can column or row bind a vector:
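A sketch with cbind() and rbind() that matches the two matrices below (the names are illustrative):

```r
by_col <- cbind(0:7, 32:39)  # each vector becomes a column
by_row <- rbind(1:4, 8:11)   # each vector becomes a row
by_col
by_row
```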
## [,1] [,2]
## [1,] 0 32
## [2,] 1 33
## [3,] 2 34
## [4,] 3 35
## [5,] 4 36
## [6,] 5 37
## [7,] 6 38
## [8,] 7 39
## [,1] [,2] [,3] [,4]
## [1,] 1 2 3 4
## [2,] 8 9 10 11
Factors
Special vector used to represent categorical data. There are ordered and unordered factors.
Can be thought of as an integer vector with labels.
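The "integer vector with labels" view can be inspected directly. A sketch that is consistent with the output below, assuming a factor named f:

```r
f <- factor(c("y", "y", "n", "n", "y"))
f
class(f)    # the object's class
typeof(f)   # the underlying storage type
unclass(f)  # the integer codes plus the levels attribute
```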
## [1] y y n n y
## Levels: n y
## [1] "factor"
## [1] "integer"
## [1] 2 2 1 1 2
## attr(,"levels")
## [1] "n" "y"
Factor Ordering
Can be set using the levels argument to factor(). By default the levels are set alphabetically, which is why ‘n’ above was coded as 1.
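Passing levels explicitly changes the coding, as in this sketch (name assumed), which yields the codes shown below:

```r
f2 <- factor(c("y", "y", "n", "n", "y"), levels = c("y", "n"))
unclass(f2)  # "y" is now level 1 and "n" is level 2
```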
## [1] 1 1 2 2 1
## attr(,"levels")
## [1] "y" "n"
Missing Values
Denoted by NA or NaN. Use is.na() or is.nan() to determine whether an object holds that value.
The NAs can have a class as well.
An NaN is also considered NA.
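The results below are consistent with a vector that holds one NaN and one NA. A sketch, where the vector and the use of NA_real_ for a classed NA are assumptions:

```r
x <- c(1, 2, NaN, NA, 5)
is.na(x)         # TRUE for both NaN and NA
is.nan(x)        # TRUE only for NaN
NA_real_         # an NA of numeric class
class(NA_real_)
```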
## [1] FALSE FALSE TRUE TRUE FALSE
## [1] FALSE FALSE TRUE FALSE FALSE
## [1] NA
## [1] "numeric"
Data Frames
Special type of list where every element has the same length. Each element can be considered a column, and each column can have a different class.
They have an attribute row.names.
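A short sketch of these properties, using a made-up data frame (all column names and values are hypothetical):

```r
df <- data.frame(
  id    = 1:3,
  label = c("a", "b", "c"),
  flag  = c(TRUE, FALSE, TRUE),
  stringsAsFactors = FALSE  # keep label as character on any R version
)
sapply(df, class)  # one class per column
row.names(df)      # default row names are "1", "2", "3"
nrow(df)
ncol(df)
```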
Names Attribute
Objects can have names.
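Names are assigned with names(). A sketch matching the output below, with an illustrative variable name:

```r
x <- 1:3
x
names(x) <- c("One", "Two", "Three")  # attach a name to each element
x
```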
## [1] 1 2 3
## One Two Three
## 1 2 3
Named lists:
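Names can be given when the list is created, as in this sketch (list name assumed):

```r
named_list <- list(one = 1, two = 2, three = 3)
named_list
```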
## $one
## [1] 1
##
## $two
## [1] 2
##
## $three
## [1] 3
Matrices have dimnames:
named_matrix <- matrix(1:4, ncol = 2)
dimnames(named_matrix) <- list(c('A', 'B'), c('C', 'D'))
named_matrix
## C D
## A 1 3
## B 2 4
Reading Tabular Data
- read.table() and read.csv() for reading tabular data.
- readLines() for reading lines of a text file.
- source() for reading R code.
- dget() for reading deparsed R objects.
- load() for workspaces.
- unserialize() for reading single R objects in binary form.
Read Table
Arguments:
- colClasses - character vector indicating the class of each column.
- comment.char - character string indicating the comment character.
- stringsAsFactors - should character variables be coded as factors? Defaults to TRUE in R versions before 4.0.0 and to FALSE from R 4.0.0 onwards.
read.table() generally figures out the classes, skips lines beginning with #, and figures out how many rows there are.
read.csv() is identical except that the default separator is a comma instead of whitespace and header defaults to TRUE.
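As a sketch of these arguments in use, the following writes a small, made-up CSV file to a temporary location and reads it back. The file contents and column names are hypothetical. Note that read.csv() sets comment.char = "" by default, so the comment character is passed explicitly here.

```r
csv_path <- tempfile(fileext = ".csv")
writeLines(c("# scores exported for testing", "id,score", "1,2.5", "2,3.5"), csv_path)

scores <- read.csv(
  csv_path,
  comment.char = "#",                   # skip the leading comment line
  colClasses = c("integer", "numeric")  # one class per column, in order
)
str(scores)
```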
Reading Large Tables
- Make a rough calculation on the memory required and determine if you have enough physical memory.
- Set comment.char = '' if there are no comments.
Using the colClasses argument can make reading in much faster, and setting nrows can help memory usage, with a mild overestimate being okay.
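One common way to get colClasses without typing them by hand is to read a few rows first and reuse the classes R detects. A sketch under that assumption, with a generated file standing in for a large table:

```r
big_path <- tempfile(fileext = ".txt")
write.table(data.frame(a = 1:1000, b = rnorm(1000)), big_path, row.names = FALSE)

# Read a small sample and let R work out the column classes
initial <- read.table(big_path, header = TRUE, nrows = 100)
classes <- sapply(initial, class)

# Reuse the classes for the full read; nrows is a mild overestimate
full <- read.table(big_path, header = TRUE, colClasses = classes,
                   nrows = 1100, comment.char = "")
dim(full)
```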
Know your system - memory, applications, other users, OS.
You can calculate memory requirements. Consider 1,000,000 rows and 200 columns of numeric data. Numeric data is 8 bytes.
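The arithmetic can be written out as follows (variable names are illustrative). The result is in gigabytes, taking 1 GB as 2^30 bytes:

```r
n_rows <- 1e6
n_cols <- 200
bytes_per_numeric <- 8

total_bytes <- n_rows * n_cols * bytes_per_numeric  # 1.6e9 bytes
total_bytes / 2^30                                  # bytes -> GB
```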
## [1] 1.490116
Textual Formats
dump() and dput() are useful.
dump() dumps named objects from the environment into text format:
File has the following contents:
dump_obj <-
structure(list(x = 1:4, y = 2:5), row.names = c(NA, -4L), class = c("tbl_df",
"tbl", "data.frame"))
vec <-
1:10
Can use source() to read a dumped file back in.
dput() dumps a single object:
File contains:
structure(list(x = 1:16, blah = c("a", "b", "c", "d", "e", "f",
"g", "h", "i", "j", "k", "l", "m", "n", "o", "p")), row.names = c(NA,
-16L), class = c("tbl_df", "tbl", "data.frame"))
Textual formats are good with version control, easier to fix ‘corruption’, and adhere to the Unix philosophy. However they are space-inefficient.
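A round trip through both formats might look like the sketch below. It uses a plain data frame and temporary files rather than the objects shown above. All names are illustrative.

```r
df <- data.frame(x = 1:4, y = 2:5)
vec <- 1:10

# dump() writes several named objects; source() recreates them
dump_path <- tempfile(fileext = ".R")
dump(c("df", "vec"), file = dump_path)
rm(df, vec)
source(dump_path)

# dput() writes one object; dget() reads it back as a value
dput_path <- tempfile(fileext = ".R")
dput(df, file = dput_path)
df_copy <- dget(dput_path)
all.equal(df, df_copy)
```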
Connections
Connections can be made to:
- file()
- gzfile() - gzip compressed file.
- bzfile() - bzip2 compressed file.
- url()
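As a sketch of working with a connection, the following writes a few lines to a gzip compressed temporary file and reads only the first two back. The file and its contents are made up for illustration.

```r
gz_path <- tempfile(fileext = ".gz")

con <- gzfile(gz_path, "w")  # open for writing
writeLines(c("first", "second", "third"), con)
close(con)

con <- gzfile(gz_path, "r")  # open for reading
first_two <- readLines(con, n = 2)
close(con)
first_two
```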
Subsetting
The [ operator returns an object of the same class as the original and can be used to extract more than one element (with one exception).
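A sketch of numeric and logical indexing on a vector, consistent with the three results below (the vector is an assumption):

```r
x <- 1:10
x[4]      # numeric index: the fourth element
x > 5     # a logical vector, one entry per element
x[x > 5]  # logical index: keep elements where the condition is TRUE
```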
## [1] 4
## [1] FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE
## [1] 6 7 8 9 10
The [[ operator returns the object within.
The $ operator can extract by name, and its semantics are similar to [[.
Lists
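The output below is consistent with a named list subset in three ways. In this sketch the list name and the middle element b are hypothetical:

```r
l <- list(a = 1:4, b = "not selected", c = 80:83)
l[c("a", "c")]  # single bracket with several names: a list of two
l["a"]          # single bracket with one name: a list of one
l[["a"]]        # double bracket: the element itself (same as l$a)
```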
## $a
## [1] 1 2 3 4
##
## $c
## [1] 80 81 82 83
## $a
## [1] 1 2 3 4
## [1] 1 2 3 4
Nested Elements
l <- list(
a = list(
b = "Nest B",
c = "Nest C"
),
d = "Element D",
e = "Element E"
)
# Access the nested element
l[[c(1,2)]]
## [1] "Nest C"
## [1] "Nest B"
Matrices
Can be subsetted in the normal way using \(i,j\) row / column indices.
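Assuming the same 4 x 4 matrix used earlier, the following sketch gives the three results below:

```r
m <- matrix(1:16, nrow = 4)
m[2, ]   # second row
m[, 1]   # first column
m[1, 4]  # element in row 1, column 4
```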
## [1] 2 6 10 14
## [1] 1 2 3 4
## [1] 13
Subsetting a single row, column, or element of a matrix returns a vector, not another matrix.
If you want a matrix you need to use the drop = FALSE argument:
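The same three subsets with drop = FALSE keep their matrix shape, as this sketch shows (matrix assumed as before):

```r
m <- matrix(1:16, nrow = 4)
m[2, , drop = FALSE]   # 1 x 4 matrix
m[, 1, drop = FALSE]   # 4 x 1 matrix
m[1, 4, drop = FALSE]  # 1 x 1 matrix
```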
## [,1] [,2] [,3] [,4]
## [1,] 2 6 10 14
## [,1]
## [1,] 1
## [2,] 2
## [3,] 3
## [4,] 4
## [,1]
## [1,] 13
Partial Matching
Works with $ and [[. The double bracket expects an exact match by default, but partial matching can be enabled with the exact = FALSE argument.
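A brief sketch with a hypothetical list whose only element has a long name:

```r
l <- list(aardvark = 1:5)
l$a                     # $ matches the unique prefix "a" to aardvark
l[["a"]]                # [[ wants the exact name, so this is NULL
l[["a", exact = FALSE]] # partial matching switched on
```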
Removing Missing Values
Use is.na() or complete.cases().
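Both approaches build a logical vector and use it as an index. A sketch with an assumed vector, giving the two identical results below:

```r
x <- c(1, 2, NA, 3, NA, 4, 5)
x[!is.na(x)]          # keep elements that are not missing
x[complete.cases(x)]  # same result via complete.cases()
```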
## [1] 1 2 3 4 5
## [1] 1 2 3 4 5
Can be done on data frames too - done by rows.
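With a data frame, complete.cases() flags rows that have no missing value in any column. The data in this sketch are made up to be consistent with the output below:

```r
df <- data.frame(
  a = c(1, NA, 3, NA, 5, 6),
  b = c(1, 2, NA, NA, 5, 6)
)
complete.cases(df)        # TRUE where the whole row is present
df[complete.cases(df), ]  # keep only the complete rows
```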
## [1] TRUE FALSE FALSE FALSE TRUE TRUE
## a b
## 1 1 1
## 5 5 5
## 6 6 6