Course 3 - Getting and Cleaning Data - Week 4 - Notes

Greg Foletta

2019-10-25

Fixing Character Vectors

Fix the case of character vectors using tolower() and toupper().
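
As a minimal sketch of the case functions above, the following applies both to a small vector. The vector mixed_case and its contents are made up for illustration.

```r
# Hypothetical column names with inconsistent case
mixed_case <- c('Subject ID', 'ACTIVITY', 'tBodyAcc')

# Convert every element to lower case
lower <- tolower(mixed_case)
lower
## [1] "subject id" "activity"   "tbodyacc"

# Convert every element to upper case
upper <- toupper(mixed_case)
upper
## [1] "SUBJECT ID" "ACTIVITY"   "TBODYACC"
```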

Split a string on a pattern using strsplit(). The pattern is a regular expression, so a literal dot must be escaped as '\\.'.

string_to_split <- c('this.1', 'something.2', 'another')
strsplit(string_to_split, '\\.')
## [[1]]
## [1] "this" "1"   
## 
## [[2]]
## [1] "something" "2"        
## 
## [[3]]
## [1] "another"
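
strsplit() returns a list, so a common follow-up is to pull one piece out of each element. The sketch below is one way to do that with sapply(); the names parts and first_piece are illustrative. It also shows fixed = TRUE as an alternative to escaping the dot.

```r
string_to_split <- c('this.1', 'something.2', 'another')

# Split on a literal dot; the result is a list with one vector per input string
parts <- strsplit(string_to_split, '\\.')

# Keep only the first piece of each split
first_piece <- sapply(parts, function(x) x[1])
first_piece
## [1] "this"      "something" "another"

# fixed = TRUE treats the pattern as a plain string, so no escaping is needed
parts_fixed <- strsplit(string_to_split, '.', fixed = TRUE)
```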

Substitute characters using the sub() and gsub() functions.

strings_to_sub <- c('this-is-going-to', 'replace-some', 'hyphens')

# sub replaces only the first instance
sub('-', '_', strings_to_sub)
## [1] "this_is-going-to" "replace_some"     "hyphens"
# gsub replaces all instances
gsub('-', '_', strings_to_sub)
## [1] "this_is_going_to" "replace_some"     "hyphens"
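
Because sub() and gsub() treat the pattern as a regular expression, special characters such as '.' need care. The sketch below uses a made-up vector, dotted, to show the difference between an unescaped dot, an escaped dot, and fixed = TRUE.

```r
dotted <- c('a.b.c', 'x.y')

# An unescaped '.' matches any character, so everything is replaced
any_char <- gsub('.', '_', dotted)
any_char
## [1] "_____" "___"

# Escaping the dot matches only literal dots
escaped <- gsub('\\.', '_', dotted)
escaped
## [1] "a_b_c" "x_y"

# fixed = TRUE treats the pattern as a plain string
fixed <- gsub('.', '_', dotted, fixed = TRUE)
```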

Use grep() and grepl() to find values. grep() returns a vector of the matching indices, and grepl() returns a logical vector that is TRUE for each element that matches and FALSE otherwise.

greppable <- c('these', 'are', 'some', 'words', 'to', 'grep')

# Return indices
grep('to', greppable)
## [1] 5
# Return a logical vector
grepl('o', greppable)
## [1] FALSE FALSE  TRUE  TRUE  TRUE FALSE
# Return the matching values
grep('[swt]o', greppable, value = TRUE)
## [1] "some"  "words" "to"
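
The logical vector from grepl() can be used directly for subsetting and counting. The following is a small sketch reusing the greppable vector; the patterns are chosen only for illustration.

```r
greppable <- c('these', 'are', 'some', 'words', 'to', 'grep')

# Subset with a logical vector: words that start with 't'
starts_with_t <- greppable[grepl('^t', greppable)]
starts_with_t
## [1] "these" "to"

# Count how many elements contain an 'e'
n_with_e <- sum(grepl('e', greppable))
n_with_e
## [1] 4

# When nothing matches, grep() returns a zero-length vector
no_match <- grep('z', greppable)
length(no_match)
## [1] 0
```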

Working with Dates

Dates

date() returns the current date and time as a character string, while Sys.Date() returns a Date object:

d1 <- date()
d1
## [1] "Fri Oct 25 06:52:04 2019"
class(d1)
## [1] "character"
d2 <- Sys.Date()
d2
## [1] "2019-10-25"
class(d2)
## [1] "Date"

Use the format() function to alter the format of the date.

format(d2, "%A %d %B %Y")
## [1] "Friday 25 October 2019"
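
The format string is built from conversion codes such as %d (day of month), %m (month number), %y (two-digit year) and %Y (four-digit year). The sketch below uses a fixed date, fixed_date, rather than Sys.Date() so the output is reproducible. It sticks to numeric codes because name codes such as %A and %B depend on the locale.

```r
# A fixed date so the results do not change from day to day
fixed_date <- as.Date('2019-10-25')

# Day/month/two-digit year
short_form <- format(fixed_date, '%d/%m/%y')
short_form
## [1] "25/10/19"

# Compact form with a four-digit year
compact_form <- format(fixed_date, '%Y%m%d')
compact_form
## [1] "20191025"
```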

To create a Date from a character vector, use the as.Date() function with a format string that describes the input.

some_dates <- c('01oct1990', '02sep1991')
d3 <- as.Date(some_dates, '%d%b%Y')
weekdays(d3)
## [1] "Monday" "Monday"
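
Once values are stored as Date objects they support arithmetic. As a small sketch, the two dates from the example above are recreated here in ISO format (to avoid locale-dependent month names) and subtracted; d_start and d_end are illustrative names.

```r
d_start <- as.Date('1990-10-01')
d_end <- as.Date('1991-09-02')

# Subtracting two Dates gives a difftime object
gap <- d_end - d_start
gap
## Time difference of 336 days

# Convert to a plain number of days
gap_days <- as.numeric(gap)
gap_days
## [1] 336
```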

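The lubridate package simplifies parsing and working with dates and times. Its parsing functions are named after the order of the components in the input: ymd(), dmy() and so on.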

library(lubridate)
## 
## Attaching package: 'lubridate'
## The following object is masked from 'package:base':
## 
##     date
ymd('19200202')
## [1] "1920-02-02"
dmy('01-01-2001')
## [1] "2001-01-01"
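
The same naming pattern extends to other component orders, and lubridate provides accessor functions for pulling components back out. This is a short sketch assuming lubridate is installed; the input strings are made up.

```r
library(lubridate)

# Month/day/year input
parsed <- mdy('08/04/2013')
parsed
## [1] "2013-08-04"

# Extract components from a parsed date
year(parsed)
## [1] 2013
month(parsed)
## [1] 8
day(parsed)
## [1] 4
```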

Times

Lubridate also has functions for parsing date-times. The time zone defaults to UTC unless tz is supplied.

ymd_hms('2002-01-01 14:01:01')
## [1] "2002-01-01 14:01:01 UTC"
ymd_hms('2002-01-01 14:01:01', tz = 'Australia/Melbourne')
## [1] "2002-01-01 14:01:01 AEDT"
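
To view the same instant in another time zone, lubridate offers with_tz(). The sketch below assumes lubridate is installed and converts the Melbourne time above to UTC; AEDT is 11 hours ahead of UTC, so the clock time moves back by 11 hours.

```r
library(lubridate)

melbourne <- ymd_hms('2002-01-01 14:01:01', tz = 'Australia/Melbourne')

# Same instant in time, displayed in UTC
utc <- with_tz(melbourne, 'UTC')
utc
## [1] "2002-01-01 03:01:01 UTC"

# The two values represent the same moment
melbourne == utc
## [1] TRUE
```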