Motivation
Where is data?
- Tidy
- CSV
- JSON
- SQL
- Plain English
- On a remote machine
Goal
The items in bold are in scope for this course:
Raw Data -> Processing Script -> Tidy Data -> Data Analysis -> Data Communication
Raw and Processed Data
Data
Data are values or quantitative variables belonging to a set of items.
Raw Data
Original source, hard to use for analysis. Data analytics includes processing. May only need to be processed once.
Remember that even with ‘raw’ data, there may be even ‘rawer’ data upstream that was put through its own processing pipeline.
Processed Data
Data that is ready for analysis. Processing can include merging, subsetting, transforming, etc.
There may be standards for processing, and all steps should be recorded.
Tidy Data Components
Four things to have:
- Raw data
- Tidy dataset
- Code book describing each variable in the tidy data set
- An explicit and exact recipe you used to go from 1 to 2 & 3.
The Raw Data
Examples:
- The strange binary file your measurement machine spits out.
- An unformatted Excel document with 10 worksheets.
- Complicated JSON data.
- Hand entered measurements collected from looking through a microscope.
You know the raw data is in the right format if:
- You ran no software on the data.
- You did not manipulate the numbers in the data.
- You did not remove any data.
- You did not summarise any data.
The Tidy Data
- Each variable should be in one column.
- Each different observation should be in one row.
- There should be one table for each “kind” of variable.
- e.g. ‘Facebook’ table and a ‘Twitter’ table.
- If there are multiple tables, they should include a column in the table that allows them to be linked.
Other tips: - Make variable names human readable. - One file per table.
The Code Book
- Information about the variables (inc. units) in the data set not contained in the tidy data.
- Information about the summary choices.
- Information about the experimental study design.
Other tips: - Common format is Word / text. - There should be a section called “Study design” that has a thorough description of how the data was collected. - There must be a section called “Code Book” that describes each variable and its units.
The Instruction List
- Ideally this is a script - could be R or Perl, Python, etc.
- The input is the raw data.
- The output is the processed, tidy data.
- There are no parameters to the script.
It may not be possible to script every step. You will need to provide instructions on what needs to be done manually.
Downloading and Reading Files
Downloading Files
Remember to get and set your working directory - getwd()
and setwd()
.
Check and create directories:
Use download.file()
. This helps reproducibility. Also keep track of the date and time the file was downloaded.
check_and_download <- function(uri, destfile, ...) {
if(!file.exists(destfile)) {
download.file(url = uri, destfile = destfile, ...)
}
}
data_uri <- 'https://data.melbourne.vic.gov.au/api/views/uyp8-7ii8/rows.csv?accessType=DOWNLOAD'
destination_file <- './data_dir/bike_counts.csv'
check_and_download(data_uri, destfile = destination_file, method = 'curl')
list.files('./data_dir/')
## [1] "bike_counts.csv" "rental.xlsx"
## [1] "Wed Sep 25 06:40:32 2019"
Reading Local Files
bike_counts <- read_csv(destination_file)
bike_counts %>%
select(state, electorate, site_id, latitude, longitude, legs, description) %>%
head() %>%
kable() %>%
kable_styling()
state | electorate | site_id | latitude | longitude | legs | description |
---|---|---|---|---|---|---|
VIC | Melbourne | 4399 | -37.78798 | 144.9590 | 2 | Royal Pde/shared path [N], Royal Pde/shared path [S] |
VIC | Melbourne | 4405 | -37.79399 | 144.9420 | 3 | Melrose St [N], Melrose St [S], Mark St [W] |
VIC | Melbourne | 4406 | -37.79412 | 144.9277 | 4 | McCracken St [N], Macaulay Rd [E], Kensington Rd [SW], Epsom [NW] |
VIC | Melbourne | 4407 | -37.79454 | 144.9310 | 4 | Macaulay Rd - east end [E], Eastwood St [S], Macaulay Rd [W], Eastwood St/Rankins Rd [N] |
VIC | Melbourne | 4408 | -37.79510 | 144.9359 | 4 | Upstream direction [N], Macaulay Rd [E], Moonee Ponds Crk Trail [S], Macaulay Rd [W] |
VIC | Melbourne | 4409 | -37.79542 | 144.9389 | 5 | Boundary Rd [N], Canning St [E], Macaulay Rd [SE], Boundary Rd [S], Macaulay Rd [W] |
Reading Excel Files
Still probably one of the most widely used data formats.
rental_file <- './data_dir/rental.xlsx'
check_and_download('https://www.dhhs.vic.gov.au/sites/default/files/documents/201908/Tables%20from%20rental%20report%20-%20June%20quarter%202019.xlsx', destfile = rental_file, method = 'curl')
rental <- read_excel(rental_file, sheet = 5, skip = 1)
rental %>%
kable() %>%
kable_styling()
Region | Median Rent | Quarterly Change | Annual Change |
---|---|---|---|
Metropolitan Melbourne | NA | NA | NA |
Inner Melbourne | 500 | 0.0000000 | 0.0224949 |
Inner Eastern Melbourne | 470 | 0.0217391 | 0.0217391 |
Southern Melbourne | 480 | 0.0000000 | 0.0212766 |
Western Melbourne | 380 | 0.0000000 | 0.0026385 |
North Western Melbourne | 400 | -0.0243902 | 0.0126582 |
North Eastern Melbourne | 400 | 0.0000000 | 0.0256410 |
Outer Eastern Melbourne | 410 | 0.0250000 | 0.0250000 |
South Eastern Melbourne | 380 | 0.0000000 | 0.0270270 |
Mornington Peninsula | 390 | 0.0263158 | 0.0263158 |
Regional Victoria | NA | NA | NA |
Barwon-South West | 365 | 0.0138889 | 0.0735294 |
Gippsland | 298 | 0.0275862 | 0.0642857 |
Goulburn-Ovens-Murray | 300 | 0.0000000 | 0.0344828 |
Loddon-Mallee | 305 | -0.0161290 | 0.0338983 |
Central Highlands-Wimmera | 300 | -0.0322581 | 0.0344828 |
Reading XML
Two components: the markup and the content.
- Tags
<start_tag>
</end_tag>
<empty_tag />
- Elements are specific examples
<tag>content</tag>
- Attributes are components of the tag
<tag attribute='value'>content</tag>
Reading JSON
data.table
##
## Attaching package: 'data.table'
## The following objects are masked from 'package:dplyr':
##
## between, first, last
## The following object is masked from 'package:purrr':
##
## transpose
DT <- data.table(
x = rnorm(10000000),
y = rep(letters[1:10], each = 100000),
z = rnorm(10000000)
)
# All of the data.frame functions work on a data.table object.
head(DT)
## x y z
## 1: -0.1900912 a -1.5168462
## 2: 0.7244642 a -0.4720658
## 3: 0.3833784 a -0.5133854
## 4: -0.6634814 a 0.8714404
## 5: -0.6492069 a -0.1477487
## 6: -0.9422502 a 0.8069570
## NAME NROW NCOL MB COLS KEY
## 1: DT 10,000,000 3 229 x,y,z
## Total: 229MB
Special Variables
The .N variable is an integer of length 1 containing the number of observations.
## [1] 10000000
## y N
## 1: a 1000000
## 2: b 1000000
## 3: c 1000000
## 4: d 1000000
## 5: e 1000000
## 6: f 1000000
## 7: g 1000000
## 8: h 1000000
## 9: i 1000000
## 10: j 1000000
Keys
Easy subsetting using a key.
## x y z
## 1: -0.1900912 a -1.5168462
## 2: 0.7244642 a -0.4720658
## 3: 0.3833784 a -0.5133854
## 4: -0.6634814 a 0.8714404
## 5: -0.6492069 a -0.1477487
## ---
## 999996: 1.0008490 a -1.7289161
## 999997: -0.4213282 a 0.8826318
## 999998: 0.2829264 a 1.8317264
## 999999: 0.2791454 a -0.7867978
## 1000000: 0.2327153 a 2.0527258
```