STAT 663/CSI 773 Exploratory Data Analysis
Syllabus prepared by Dr. Daniel B. Carr


Prerequisite: A 300 level statistics course or higher

Texts: Visualizing Data, by William S. Cleveland, Hobart Press 1994 
An Introduction to S and S-Plus, Phil Spector, Duxbury Press 1994

Exploratory data analysis is more than a collection of methods; it embodies a 
strategy of what to do next when attempting to describe and understand "new" 
data sets.  The strategy includes enlisting the support of those who 
collected the data and those who understand the phenomenology.  The strategy 
attempts to let the data speak without heavy dependence on preconceived 
assumptions and models.  The strategy encompasses some of the wisdom that has 
evolved in the task of describing structure in the presence of uncertainty.

The class starts with the basic notion of "replicates" and develops 
distributional descriptions of replicates.  From then on the basic theme is 
comparison: the comparison of empirical distributions to theoretical 
distributions and to other empirical distributions. The class stresses 
graphic representations to exploit the power of the eye-brain system and 
human intuition.  Graphics include dot plots, box plots, qqplots, rfplots, 
time series plots, brushing, scatterplot matrices and other multivariate 
graphics.  The class covers the perceptual advantages of particular graphical 
representations for revealing structure in the presence of noise.
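
As a rough illustration of comparing an empirical distribution to a 
theoretical one, the sketch below computes the point pairs behind a normal 
qqplot.  It is written in Python rather than Splus, and the function name, the 
plotting positions (i - 0.5)/n, and the sample values are assumptions made 
for illustration, not course material.

```python
from statistics import NormalDist, mean, stdev

def normal_qq_pairs(values):
    """Return (theoretical, empirical) quantile pairs for a normal q-q plot."""
    ordered = sorted(values)
    n = len(ordered)
    # Reference distribution: a normal with the sample's mean and std. deviation.
    reference = NormalDist(mean(ordered), stdev(ordered))
    # Plotting positions (i - 0.5) / n stay strictly inside (0, 1).
    probs = [(i - 0.5) / n for i in range(1, n + 1)]
    return [(reference.inv_cdf(p), x) for p, x in zip(probs, ordered)]

# Made-up sample, for illustration only.
sample = [4.1, 5.3, 4.8, 6.0, 5.1, 4.5, 5.6, 4.9, 5.2, 7.4]
pairs = normal_qq_pairs(sample)
for theoretical, empirical in pairs:
    print(f"{theoretical:6.2f} {empirical:6.2f}")
```

If the data were close to normal, the printed pairs would fall near the line 
where the two columns are equal; systematic departures at the ends suggest 
skewness or heavy tails.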

The eye-brain system is not well-suited for numerous perceptual tasks.  
Comparison strategies frequently use computational methods to transform the 
data into a form more suitable for visualization and statistical reasoning. The 
comparison strategies include reference simplification, structure and 
residual decompositions, data aggregation, and reexpression.  Statistical 
methods (such as analysis of variance, regression, density estimation and 
smoothing) are introduced as part of the course since they are fundamental 
tools in the transformation process.  Since full courses are devoted to these 
statistical topics, the treatment is cursory.
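
A structure and residual decomposition can be sketched very simply: fit 
something to the data, then look at what is left over.  The example below, in 
Python rather than Splus, uses group means as the fitted structure; the 
function name and the data are hypothetical, and group means are only one of 
many possible choices of fit.

```python
from statistics import mean

def fit_and_residuals(groups):
    """Split each observation into a fitted group mean plus a residual."""
    fit = {name: mean(vals) for name, vals in groups.items()}
    residuals = {
        name: [v - fit[name] for v in vals] for name, vals in groups.items()
    }
    return fit, residuals

# Made-up measurements for two groups, for illustration only.
groups = {"a": [10.0, 12.0, 14.0], "b": [20.0, 21.0, 25.0]}
fit, residuals = fit_and_residuals(groups)
print(fit)
print(residuals)
```

Comparing the spread of the fitted values with the spread of the residuals 
gives a sense of how much of the variation the fit accounts for.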

The class is a hands-on class and portions of the class are taught in a Unix 
workstation laboratory.  In the laboratory, students will use Splus to carry 
out their assignments and their data analysis project.  The assignments 
include simulation examples to help calibrate the interpretation of graphs 
and to provide a deeper understanding of the methods.  The class includes 
instruction in Splus, a high-level statistical programming language. No prior 
experience is required in either Unix or Splus.  Students quickly catch on to 
data access using the Internet and Splus functions for manipulating data, for 
producing graphics (Motif windows with PostScript hardcopy), and for 
extending the language by writing new functions.  The data analysis project 
is the primary determinant of the class grade.  Projects from work or student 
research are encouraged as long as they are not used for grades in another 
class.