Course 3 - Getting and Cleaning Data - Week 1 - Notes

Greg Foletta

2019-09-25

Motivation

Where is data?

  • Tidy
  • CSV
  • JSON
  • SQL
  • Plain English
  • On a remote machine

Goal

The items in bold are in scope for this course:

Raw Data -> Processing Script -> Tidy Data -> Data Analysis -> Data Communication

Raw and Processed Data

Data

Data are values or quantitative variables belonging to a set of items.

Raw Data

Original source, hard to use for analysis. Data analytics includes processing. May only need to be processed once.

Remember that even with ‘raw’ data, there may be even ‘rawer’ data upstream that was put through its own processing pipeline.

Processed Data

Data that is ready for analysis. Processing can include merging, subsetting, transforming, etc.

There may be standards for processing, and all steps should be recorded.

Tidy Data Components

Four things to have:

  • Raw data
  • Tidy dataset
  • Code book describing each variable in the tidy data set
  • An explicit and exact recipe you used to go from 1 to 2 & 3.

The Raw Data

Examples:

  • The strange binary file your measurement machine spits out.
  • An unformatted Excel document with 10 worksheets.
  • Complicated JSON data.
  • Hand entered measurements collected from looking through a microscope.

You know the raw data is in the right format if:

  1. You ran no software on the data.
  2. You did not manipulate the numbers in the data.
  3. You did not remove any data.
  4. You did not summarise any data.

The Tidy Data

  1. Each variable should be in one column.
  2. Each different observation should be in one row.
  3. There should be one table for each “kind” of variable.
    • e.g. ‘Facebook’ table and a ‘Twitter’ table.
  4. If there are multiple tables, they should include a column in the table that allows them to be linked.

Other tips: - Make variable names human readable. - One file per table.

The Code Book

  1. Information about the variables (inc. units) in the data set not contained in the tidy data.
  2. Information about the summary choices.
  3. Information about the experimental study design.

Other tips: - Common format is Word / text. - There should be a section called “Study design” that has a thorough description of how the data was collected. - There must be a section called “Code Book” that describes each variable and its units.

The Instruction List

  • Ideally this is a script - could be R or Perl, Python, etc.
  • The input is the raw data.
  • The output is the processed, tidy data.
  • There are no parameters to the script.

It may not be possible to script every step. You will need to provide instructions on what needs to be done manually.

Downloading and Reading Files

Reading Local Files

state electorate site_id latitude longitude legs description
VIC Melbourne 4399 -37.78798 144.9590 2 Royal Pde/shared path [N], Royal Pde/shared path [S]
VIC Melbourne 4405 -37.79399 144.9420 3 Melrose St [N], Melrose St [S], Mark St [W]
VIC Melbourne 4406 -37.79412 144.9277 4 McCracken St [N], Macaulay Rd [E], Kensington Rd [SW], Epsom [NW]
VIC Melbourne 4407 -37.79454 144.9310 4 Macaulay Rd - east end [E], Eastwood St [S], Macaulay Rd [W], Eastwood St/Rankins Rd [N]
VIC Melbourne 4408 -37.79510 144.9359 4 Upstream direction [N], Macaulay Rd [E], Moonee Ponds Crk Trail [S], Macaulay Rd [W]
VIC Melbourne 4409 -37.79542 144.9389 5 Boundary Rd [N], Canning St [E], Macaulay Rd [SE], Boundary Rd [S], Macaulay Rd [W]

Reading Excel Files

Still probably one of the most widely used data formats.

Region Median Rent Quarterly Change Annual Change
Metropolitan Melbourne NA NA NA
Inner Melbourne 500 0.0000000 0.0224949
Inner Eastern Melbourne 470 0.0217391 0.0217391
Southern Melbourne 480 0.0000000 0.0212766
Western Melbourne 380 0.0000000 0.0026385
North Western Melbourne 400 -0.0243902 0.0126582
North Eastern Melbourne 400 0.0000000 0.0256410
Outer Eastern Melbourne 410 0.0250000 0.0250000
South Eastern Melbourne 380 0.0000000 0.0270270
Mornington Peninsula 390 0.0263158 0.0263158
Regional Victoria NA NA NA
Barwon-South West 365 0.0138889 0.0735294
Gippsland 298 0.0275862 0.0642857
Goulburn-Ovens-Murray 300 0.0000000 0.0344828
Loddon-Mallee 305 -0.0161290 0.0338983
Central Highlands-Wimmera 300 -0.0322581 0.0344828

Reading XML

Two components: the markup and the content.

  • Tags
    • <start_tag>
    • </end_tag>
    • <empty_tag />
  • Elements are specific examples
    • <tag>content</tag>
  • Attributes are components of the tag
    • <tag attribute='value'>content</tag>

Reading JSON

data.table

## 
## Attaching package: 'data.table'
## The following objects are masked from 'package:dplyr':
## 
##     between, first, last
## The following object is masked from 'package:purrr':
## 
##     transpose
##             x y          z
## 1: -0.1900912 a -1.5168462
## 2:  0.7244642 a -0.4720658
## 3:  0.3833784 a -0.5133854
## 4: -0.6634814 a  0.8714404
## 5: -0.6492069 a -0.1477487
## 6: -0.9422502 a  0.8069570
##    NAME       NROW NCOL  MB  COLS KEY
## 1:   DT 10,000,000    3 229 x,y,z    
## Total: 229MB

Special Variables

The .N variable is an integer of length 1 containing the number of observations.

## [1] 10000000
##     y       N
##  1: a 1000000
##  2: b 1000000
##  3: c 1000000
##  4: d 1000000
##  5: e 1000000
##  6: f 1000000
##  7: g 1000000
##  8: h 1000000
##  9: i 1000000
## 10: j 1000000

Keys

Easy subsetting using a key.

##                   x y          z
##       1: -0.1900912 a -1.5168462
##       2:  0.7244642 a -0.4720658
##       3:  0.3833784 a -0.5133854
##       4: -0.6634814 a  0.8714404
##       5: -0.6492069 a -0.1477487
##      ---                        
##  999996:  1.0008490 a -1.7289161
##  999997: -0.4213282 a  0.8826318
##  999998:  0.2829264 a  1.8317264
##  999999:  0.2791454 a -0.7867978
## 1000000:  0.2327153 a  2.0527258

```