Motivation

Where is data?

Tidy
CSV
JSON
SQL
Plain English
On a remote machine

Goal

The items in bold are in scope for this course:

Raw Data -> Processing Script -> Tidy Data -> Data Analysis -> Data Communication

Raw and Processed Data

Data

Data are values or quantitative variables belonging to a set of items.

Raw Data

Original source, hard to use for analysis. Data analytics includes processing. May only need to be processed once.

Remember that even with ‘raw’ data, there may be even ‘rawer’ data upstream that was put through its own processing pipeline.

Processed Data

Data that is ready for analysis. Processing can include merging, subsetting, transforming, etc.

There may be standards for processing, and all steps should be recorded.

Tidy Data Components

Four things to have:

Raw data
Tidy dataset
Code book describing each variable in the tidy data set
An explicit and exact recipe you used to go from 1 to 2 & 3.

The Raw Data

Examples:

The strange binary file your measurement machine spits out.
An unformatted Excel document with 10 worksheets.
Complicated JSON data.
Hand entered measurements collected from looking through a microscope.

You know the raw data is in the right format if:

You ran no software on the data.
You did not manipulate the numbers in the data.
You did not remove any data.
You did not summarise any data.

The Tidy Data

Each variable should be in one column.
Each different observation should be in one row.
There should be one table for each “kind” of variable.
- e.g. ‘Facebook’ table and a ‘Twitter’ table.
If there are multiple tables, they should include a column in the table that allows them to be linked.

Other tips: - Make variable names human readable. - One file per table.

The Code Book

Information about the variables (inc. units) in the data set not contained in the tidy data.
Information about the summary choices.
Information about the experimental study design.

Other tips: - Common format is Word / text. - There should be a section called “Study design” that has a thorough description of how the data was collected. - There must be a section called “Code Book” that describes each variable and its units.

The Instruction List

Ideally this is a script - could be R or Perl, Python, etc.
The input is the raw data.
The output is the processed, tidy data.
There are no parameters to the script.

It may not be possible to script every step. You will need to provide instructions on what needs to be done manually.

Downloading and Reading Files

Downloading Files

Remember to get and set your working directory - getwd() and setwd().

setwd('~/Documents/Study/Coursera_Data_Science_JHU/C3_Getting_and_Cleaning_Data/')

Check and create directories:

if (!file.exists('./data_dir')) {
    dir.create('./data_dir')
}

Use download.file(). This helps reproducibility. Also keep track of the date and time the file was downloaded.

check_and_download <- function(uri, destfile, ...) {
    if(!file.exists(destfile)) {
    download.file(url = uri, destfile = destfile, ...)
    }
}

data_uri <- 'https://data.melbourne.vic.gov.au/api/views/uyp8-7ii8/rows.csv?accessType=DOWNLOAD'

destination_file <- './data_dir/bike_counts.csv'

check_and_download(data_uri, destfile = destination_file, method = 'curl')

list.files('./data_dir/')

## [1] "bike_counts.csv" "rental.xlsx"

dateDownloaded <- date()
dateDownloaded

## [1] "Wed Sep 25 06:40:32 2019"

Reading Local Files

bike_counts <- read_csv(destination_file)
bike_counts %>%
    select(state, electorate, site_id, latitude, longitude, legs, description) %>%
    head() %>%
    kable() %>%
    kable_styling()

state	electorate	site_id	latitude	longitude	legs	description
VIC	Melbourne	4399	-37.78798	144.9590	2	Royal Pde/shared path [N], Royal Pde/shared path [S]
VIC	Melbourne	4405	-37.79399	144.9420	3	Melrose St [N], Melrose St [S], Mark St [W]
VIC	Melbourne	4406	-37.79412	144.9277	4	McCracken St [N], Macaulay Rd [E], Kensington Rd [SW], Epsom [NW]
VIC	Melbourne	4407	-37.79454	144.9310	4	Macaulay Rd - east end [E], Eastwood St [S], Macaulay Rd [W], Eastwood St/Rankins Rd [N]
VIC	Melbourne	4408	-37.79510	144.9359	4	Upstream direction [N], Macaulay Rd [E], Moonee Ponds Crk Trail [S], Macaulay Rd [W]
VIC	Melbourne	4409	-37.79542	144.9389	5	Boundary Rd [N], Canning St [E], Macaulay Rd [SE], Boundary Rd [S], Macaulay Rd [W]

Reading Excel Files

Still probably one of the most widely used data formats.

rental_file <- './data_dir/rental.xlsx'

check_and_download('https://www.dhhs.vic.gov.au/sites/default/files/documents/201908/Tables%20from%20rental%20report%20-%20June%20quarter%202019.xlsx', destfile = rental_file, method = 'curl')

rental <- read_excel(rental_file, sheet = 5, skip = 1)
rental %>%
    kable() %>%
    kable_styling()

Region	Median Rent	Quarterly Change	Annual Change
Metropolitan Melbourne	NA	NA	NA
Inner Melbourne	500	0.0000000	0.0224949
Inner Eastern Melbourne	470	0.0217391	0.0217391
Southern Melbourne	480	0.0000000	0.0212766
Western Melbourne	380	0.0000000	0.0026385
North Western Melbourne	400	-0.0243902	0.0126582
North Eastern Melbourne	400	0.0000000	0.0256410
Outer Eastern Melbourne	410	0.0250000	0.0250000
South Eastern Melbourne	380	0.0000000	0.0270270
Mornington Peninsula	390	0.0263158	0.0263158
Regional Victoria	NA	NA	NA
Barwon-South West	365	0.0138889	0.0735294
Gippsland	298	0.0275862	0.0642857
Goulburn-Ovens-Murray	300	0.0000000	0.0344828
Loddon-Mallee	305	-0.0161290	0.0338983
Central Highlands-Wimmera	300	-0.0322581	0.0344828

Reading XML

Two components: the markup and the content.

Tags
- <start_tag>
- </end_tag>
- <empty_tag />
Elements are specific examples
- <tag>content</tag>
Attributes are components of the tag
- <tag attribute='value'>content</tag>

Reading JSON

data.table

library(data.table)

## 
## Attaching package: 'data.table'

## The following objects are masked from 'package:dplyr':
## 
##     between, first, last

## The following object is masked from 'package:purrr':
## 
##     transpose

DT <- data.table(
    x = rnorm(10000000),
    y = rep(letters[1:10], each = 100000),
    z = rnorm(10000000)
)

# All of the data.frame functions work on a data.table object.
head(DT)

##             x y          z
## 1: -0.1900912 a -1.5168462
## 2:  0.7244642 a -0.4720658
## 3:  0.3833784 a -0.5133854
## 4: -0.6634814 a  0.8714404
## 5: -0.6492069 a -0.1477487
## 6: -0.9422502 a  0.8069570

# View the current tables in memory
tables()

##    NAME       NROW NCOL  MB  COLS KEY
## 1:   DT 10,000,000    3 229 x,y,z    
## Total: 229MB

Special Variables

The .N variable is an integer of length 1 containing the number of observations.

# Total number of observations
DT[, .N]

## [1] 10000000

# Observations with grouping
DT[, .N, by = y]

##     y       N
##  1: a 1000000
##  2: b 1000000
##  3: c 1000000
##  4: d 1000000
##  5: e 1000000
##  6: f 1000000
##  7: g 1000000
##  8: h 1000000
##  9: i 1000000
## 10: j 1000000

Keys

Easy subsetting using a key.

setkey(DT, y)
DT['a']

##                   x y          z
##       1: -0.1900912 a -1.5168462
##       2:  0.7244642 a -0.4720658
##       3:  0.3833784 a -0.5133854
##       4: -0.6634814 a  0.8714404
##       5: -0.6492069 a -0.1477487
##      ---                        
##  999996:  1.0008490 a -1.7289161
##  999997: -0.4213282 a  0.8826318
##  999998:  0.2829264 a  1.8317264
##  999999:  0.2791454 a -0.7867978
## 1000000:  0.2327153 a  2.0527258

```

Course 3 - Getting and Cleaning Data - Week 1 - Notes

Greg Foletta

2019-09-25