Course 4 - Exploratory Data Analysis - Week 1 - Notes

Greg Foletta

2019-10-25

Principles of Analytic Graphics

Refer to the book ‘Beautiful Evidence’ by Edward Tufte.

Principle 1 - Show comparisons.

Evidence is always relative to another competing hypothesis.

Always ask “compared to what?” The comparison may be against a control group.

Principle 2 - Show causality, mechanism, explanation, systematic structure.

What is the causal framework for thinking about the question? How do you believe the system is operating?

Principle 3 - Show multivariate data.

The world is inherently multivariate (more than two variables), so you need to show the real picture of what is going on.

Principle 4 - Integrate the evidence.

  • Completely integrate words, numbers, images, diagrams. Don’t let the tools you use drive the analysis.

Principle 5 - Describe and document the evidence with appropriate labels, scales, sources, etc.

The graphic should tell a complete story and be credible.

Principle 6 - Content is king.

Analytical presentations ultimately stand or fall depending on the quality, relevance, and integrity of their content.

What’s the content, what’s the story? Then think about how to present it.

Exploratory Graphs

These are graphs for yourself to explore the data. We want to understand data properties, find patterns, suggest modeling strategies and ‘debug’ analyses.

Characteristics

  • Made quickly
  • A large number are made
  • Goal is for personal understanding of the data set. How does it look? What are the problems?
  • Axes and legends are cleaned up later.
  • Colour/size are primarily used for information.

One Dimensional Summaries

  • Five-number summary - not really a graph. R's summary() also reports the mean, giving six values.
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   10.40   15.43   19.20   20.09   22.80   33.90
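
The figures above appear to match the mpg column of R's built-in mtcars data set. That is an assumption, as the notes do not name the data. Under that assumption, a minimal sketch of producing the summary:

```r
# Assumption: the summary above is of mtcars$mpg (a built-in data set).
mpg <- mtcars$mpg

# summary() reports the minimum, quartiles, median and maximum, plus the mean.
s <- summary(mpg)
print(s)

# fivenum() returns Tukey's five numbers only (no mean). Its hinges can
# differ slightly from the quartiles that summary() reports.
f <- fivenum(mpg)
print(f)
```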

Boxplot

Histogram - a rug underneath can aid in interpretation.

Overlaying features can help interpretation - for example, on the boxplot a horizontal bar could be added to mark some specific value. This could be a government-mandated fuel efficiency.

Barplot can be used for categorical values:
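
The one-dimensional plots described in this section can be sketched in base graphics roughly as follows. The mtcars data set and the 25 mpg reference value are assumptions chosen for illustration; the notes do not specify either.

```r
# Assumption: mtcars stands in for the data set used in the notes.
mpg <- mtcars$mpg

# Boxplot with a horizontal reference line overlaid.
boxplot(mpg, col = "lightblue", ylab = "Miles per gallon")
abline(h = 25, lty = 2)  # 25 mpg is an illustrative value, not a real mandate

# Histogram with a rug marking each individual observation.
h <- hist(mpg, col = "lightgreen", main = "", xlab = "Miles per gallon")
rug(mpg)

# Barplot of counts for a categorical variable (number of cylinders).
cyl_counts <- table(mtcars$cyl)
barplot(cyl_counts, col = "wheat", xlab = "Cylinders", ylab = "Number of cars")
```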

Two Dimensional Summaries

Multiple boxplots can be used, with factors across the x-axis:

Multiple histograms can be achieved with a facet.

A scatterplot is an obvious go-to.

Can use colour to view different categories within the scatterplot:
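
A rough base-graphics sketch of the two-dimensional summaries above, again assuming mtcars as the data. Base graphics has no facets, so the side-by-side histograms here use a panel layout via par(mfrow) as a stand-in for faceting.

```r
# Multiple boxplots: one box per level of a factor along the x-axis.
boxplot(mpg ~ cyl, data = mtcars, xlab = "Cylinders", ylab = "Miles per gallon")

# Multiple histograms: a 1 x 2 panel layout stands in for facets.
old <- par(mfrow = c(1, 2))
hist(mtcars$mpg[mtcars$am == 0], main = "Automatic", xlab = "Miles per gallon")
hist(mtcars$mpg[mtcars$am == 1], main = "Manual", xlab = "Miles per gallon")
par(old)  # restore the previous layout

# Scatterplot, with colour distinguishing a category (cylinders).
cyl <- factor(mtcars$cyl)
plot(mtcars$wt, mtcars$mpg, col = as.integer(cyl), pch = 19,
     xlab = "Weight (1000 lbs)", ylab = "Miles per gallon")
legend("topright", legend = levels(cyl),
       col = seq_along(levels(cyl)), pch = 19, title = "Cylinders")
```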

Exploratory plots are ‘quick and dirty’. You summarise the data and explore basic questions and hypotheses, or rule some out.

Plotting Systems

Base

‘Artist’s palette model’ - there’s a blank canvas and you add things one by one. Uses annotation functions to add/modify (text(), lines(), points(), axis()).

It’s convenient and intuitive, but you can’t take items away once added. There’s no ‘language’ so it’s difficult to ‘translate’ to others.

Lattice System

Plots are created with a single function call (xyplot(), bwplot()). Most useful for conditioning types of plots: looking at how y changes with x across different levels of z.

Margins and spacing are set automatically because the plot is specified at once. Good for putting many many plots on screen quickly and easily.

However it’s awkward to specify a plot using a single function call, and annotation is not intuitive.

##            Population Income Illiteracy Life.Exp Murder HS.Grad Frost
## Alabama          3615   3624        2.1    69.05   15.1    41.3    20
## Alaska            365   6315        1.5    69.31   11.3    66.7   152
## Arizona          2212   4530        1.8    70.55    7.8    58.1    15
## Arkansas         2110   3378        1.9    70.66   10.1    39.9    65
## California      21198   5114        1.1    71.71   10.3    62.6    20
## Colorado         2541   4884        0.7    72.06    6.8    63.9   166
##              Area region
## Alabama     50708  South
## Alaska     566432   West
## Arizona    113417   West
## Arkansas    51945  South
## California 156361   West
## Colorado   103766   West
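
The table above looks like the built-in state.x77 matrix combined with state.region, though the notes do not show how it was built, so the construction below is an assumption. A minimal sketch of a conditioning plot on that data:

```r
library(lattice)  # ships with R as a recommended package

# Assumption: the table above was built from the built-in state data sets.
state <- data.frame(state.x77, region = state.region)

# One call describes the whole plot: life expectancy against income,
# conditioned on region (one panel per region, laid out in a single row).
p <- xyplot(Life.Exp ~ Income | region, data = state, layout = c(4, 1))

# Lattice functions return an object; print() is what draws it.
print(p)
```

Because the whole plot is specified in one call, lattice can work out the panel margins and spacing itself.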

ggplot2

Splits the difference between base and lattice. Automatically deals with spacings and text, but allows for annotation. Makes decisions for you, but you can customise.

The core plotting and graphics engine is encapsulated in two packages:

  • graphics, which contains plotting functions for the ‘base’ graphing system.
  • grDevices, which contains all the code implementing the various graphics devices: X11, PDF, PostScript, PNG, etc.

The lattice plotting system is implemented using:

  • lattice, which contains the code for producing Trellis graphics, which are independent of the ‘base’ graphics system.
  • grid, which implements a different graphing system independent of the base system. Lattice builds upon grid.

Process

  • Where will it be made?
    • Screen
    • File
  • How will it be used?
    • Temporarily on the screen?
    • Web browser?
    • Printed academic paper?
    • Presentation
  • Is it a large amount of data or a few points?
  • Does it need to be dynamically resized? e.g. vector format rather than a raster.

Base Graphics

There are two phases: initialising the new plot, then annotating (adding to) the plot.

Calling plot(x, y) or hist(x) will launch a graphics device (if one is not already open) and draw a new plot.

plot() is generic, so if the arguments are not of some special class, the default method for plot is called. This has many arguments.
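
A small sketch of this dispatch, using made-up vectors and the built-in cars data set for illustration:

```r
# plot() picks a method based on the class of its first argument.
x <- c(1, 3, 2, 5)
f <- factor(c("a", "b", "a", "c"))
fit <- lm(dist ~ speed, data = cars)

plot(x)               # numeric: the default method plots values against their index
plot(f)               # factor: the factor method draws a barplot of counts
plot(fit, which = 1)  # lm: the lm method draws a diagnostic plot

# The methods can be looked up explicitly.
default_method <- getS3method("plot", "default")
factor_method <- getS3method("plot", "factor")
```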

The base graphics system has many parameters; these are documented in ?par.

Important Parameters

  • pch - the plotting symbol (default is open circle)
  • lty - the line type (default is solid line).
  • lwd - the line width, specified as a positive multiple of the default width (default is 1).
  • col - the plotting colour, specified as a number, string, or hex code.
    • The colors() function gives you a vector of colours by name.
  • xlab / ylab - character string for the X/Y axis label.

The par() function is used to specify global graphics parameters. These are overridden when specified as arguments to specific plotting functions.

  • las - the orientation of the axis labels on the plot.
  • bg - the background colour.
  • mar - the margin size.
  • oma - the outer margin size.
  • mfrow - the number of rows and columns in the grid of plots, given as c(rows, columns).
    • Plots are filled row-wise.
  • mfcol - the same layout specification as mfrow.
    • Plots are filled column-wise.
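
A minimal sketch of setting global parameters with par() and overriding one locally. The airquality data set and the specific margin values are assumptions chosen for illustration.

```r
# par() returns the previous values, which makes restoring them easy.
old <- par(mfrow = c(1, 2), mar = c(4, 4, 2, 1), las = 1)

# Two plots side by side (1 row, 2 columns), filled row-wise.
with(airquality, plot(Wind, Ozone, main = "Ozone and Wind"))

# pch is given as an argument here, so it applies to this plot only.
with(airquality, plot(Temp, Ozone, main = "Ozone and Temperature", pch = 20))

par(old)  # restore the previous global settings
```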

Defaults

## $lty
## [1] "solid"
## 
## $col
## [1] "black"
## 
## $pch
## [1] 1
## 
## $bg
## [1] "white"
## 
## $mar
## [1] 5.1 4.1 4.1 2.1
## 
## $mfrow
## [1] 1 1
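
Output like the above can be obtained by passing a character vector of parameter names to par(). A short sketch, with the caveat that some values (bg in particular) can differ between graphics devices:

```r
# Query several parameters at once; the result is a named list.
defaults <- par(c("lty", "col", "pch", "bg", "mar", "mfrow"))
str(defaults)
```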

Functions

  • plot() makes a scatterplot, or other type depending on the class of the object.
  • lines() adds lines to a plot.
  • points() adds points to a plot.
  • text() adds text labels to a plot using x,y coordinates.
  • title() adds annotations to x/y axis, title, subtitle, outer margin.
  • mtext() adds arbitrary text to the margins.
  • axis() adds axis ticks or labels.
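
The two phases and the annotation functions listed above might be combined as in the sketch below. The airquality data set, the colours, and the label positions are all assumptions chosen for illustration.

```r
# Phase 1: initialise an empty plot. type = "n" sets up the coordinate
# system without drawing data; axes and labels are suppressed for now.
with(airquality, plot(Wind, Ozone, type = "n", axes = FALSE, xlab = "", ylab = ""))

# Phase 2: annotate, one element at a time.
may <- subset(airquality, Month == 5)
other <- subset(airquality, Month != 5)
points(may$Wind, may$Ozone, col = "blue", pch = 19)
points(other$Wind, other$Ozone, col = "grey40")

# A fitted line, drawn with lines() between its two end points.
fit <- lm(Ozone ~ Wind, data = airquality)
wind_range <- range(airquality$Wind)
lines(wind_range, predict(fit, data.frame(Wind = wind_range)), lwd = 2)

axis(1)  # x-axis ticks and labels
axis(2)  # y-axis ticks and labels
box()
title(main = "Ozone and Wind", xlab = "Wind (mph)", ylab = "Ozone (ppb)")
text(15, 150, "Blue points: May", col = "blue")
mtext("Source: airquality data set", side = 1, line = 4, adj = 1, cex = 0.8)
```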

Graphics Devices

These are ‘a place where you can make a plot appear’. Screen, PDF, PNG, etc. A plot must be ‘sent’ to a graphics device.

The screen device on a Mac is quartz(), on Windows it’s windows(), and on Linux/Unix it’s x11().

For devices that may be printed out or incorporated into a document, a file device is most appropriate.

Plot Creation

There are two basic plotting approaches. The first is calling plot() etc and it just appearing on the screen device.

The second is explicitly launching a device and then calling the plotting function.

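# 'data' is a placeholder for a data frame with columns x and y
pdf(file = 'plot.pdf')    # open a PDF file device
with(data, plot(x, y))    # the plot is drawn into the file, not on screen
title(main = 'A Plot')    # annotations also go to the file
dev.off()                 # close the device so the file is written out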
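
A self-contained variant of the same pattern, as a sketch. The faithful data set and the temporary file path are assumptions substituted for the placeholders so that it runs as-is.

```r
# Write to a temporary file so nothing is left in the working directory.
out <- tempfile(fileext = ".pdf")

pdf(file = out)                            # open the file device
with(faithful, plot(eruptions, waiting))   # drawn into the file, not on screen
title(main = "Old Faithful Geyser data")
dev.off()                                  # the file is only complete after this

file.exists(out)
```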

Graphics Devices

There are two kinds of file device: vector and bitmap.

Vector formats:

  • pdf
  • svg
  • win.metafile
  • postscript

Bitmap formats:

  • png
  • jpeg
  • tiff
  • bmp

It is possible to open multiple graphics devices, but you can only plot to one device at a time.

The currently active device is returned by dev.cur(). Every open graphics device gets an integer handle.

You set the active device with dev.set().

You can use dev.copy() to copy from one plot device to another, or use dev.copy2pdf() to specifically copy to a PDF. Copying is not an exact operation and the result may not be identical to the original.
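
A sketch of working with several devices at once. It uses PDF file devices and temporary files so that it runs without a screen; those choices are assumptions. File devices do not record plot contents for copying by default, which is why dev.control() is called here. On a screen device that step should not be needed.

```r
f1 <- tempfile(fileext = ".pdf")
f2 <- tempfile(fileext = ".pdf")
f3 <- tempfile(fileext = ".pdf")

# Open two devices; each gets an integer handle.
pdf(f1)
dev.control(displaylist = "enable")  # record the plot so it can be copied
d1 <- dev.cur()
pdf(f2)
d2 <- dev.cur()                      # the most recently opened device is active

dev.set(d1)                          # switch back to the first device
with(faithful, plot(eruptions, waiting))

# Copy the plot from the active device into a third PDF.
dev.copy(pdf, file = f3)             # the copy becomes the active device
dev.off()                            # close the copy so its file is written

# Close the remaining devices by handle.
dev.off(d1)
dev.off(d2)
```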