STAT 663/CSI 773 Exploratory Data Analysis
Syllabus prepared by Dr. Daniel B. Carr
Prerequisite: A 300 level statistics course or higher
Texts: Visualizing Data, by William S. Cleveland, Hobart Press 1994
An Introduction to S and S-Plus, Phil Spector, Duxbury Press 1994
Exploratory data analysis is more than a collection of methods; it embodies a
strategy of what to do next when attempting to describe and understand "new"
data sets. The strategy includes enlisting the support of those who
collected the data and those who understand the phenomenology. The strategy
attempts to let the data speak without heavy dependence on preconceived
assumptions and models. The strategy encompasses some of the wisdom that has
evolved in the task of describing structure in the presence of uncertainty.
The class starts with the basic notion of "replicates" and develops
distributional descriptions of replicates. From then on the basic theme is
comparison: the comparison of empirical distributions to theoretical
distributions and to other empirical distributions. The class stresses
graphic representations to exploit the power of the eye-brain system and
human intuition. Graphics include dot plots, box plots, qqplots, rfplots,
time series plots, brushing, scatterplot matrices and other multivariate
graphics. The class covers the perceptual advantages of particular graphical
representations for revealing structure in the presence of noise.
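
As an illustration of comparing an empirical distribution to a theoretical one, the following minimal sketch computes the point pairs behind a normal qqplot. It is written in Python rather than the Splus used in class, and the function name normal_qq_pairs, the sample values, and the (i - 0.5)/n plotting positions are assumptions chosen for illustration only.

```python
from statistics import NormalDist


def normal_qq_pairs(sample):
    """Return (theoretical, empirical) quantile pairs for a normal qqplot."""
    ordered = sorted(sample)
    n = len(ordered)
    std_normal = NormalDist()  # standard normal: mean 0, standard deviation 1
    # Plotting positions (i - 0.5) / n are one common convention (assumed here).
    return [
        (std_normal.inv_cdf((i - 0.5) / n), x)
        for i, x in enumerate(ordered, start=1)
    ]


# Made-up replicate measurements, for illustration only.
replicates = [4.1, 5.3, 4.8, 6.0, 5.1, 4.6, 5.7]
pairs = normal_qq_pairs(replicates)

for theoretical, empirical in pairs:
    print(f"{theoretical:7.3f}  {empirical:5.2f}")
```

If the replicates were roughly normal, the printed pairs would fall close to a straight line when plotted; systematic curvature would suggest skewness or heavy tails.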
The eye-brain system is not well-suited for numerous perceptual tasks.
Comparison strategies frequently use computational methods to transform the
data into a form more suitable for visualization and statistical reasoning. The
comparison strategies include reference simplification, structure and
residual decompositions, data aggregation, and reexpression. Statistical
methods (such as analysis of variance, regression, density estimation and
smoothing) are introduced as part of the course since they are fundamental
tools in the transformation process. Since full courses are devoted to these
statistical topics, the treatment is cursory.
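
A structure and residual decomposition can be sketched with a straight-line fit: the fitted values carry the structure and the residuals are what remains to be examined. The sketch below is in Python rather than Splus, uses ordinary least squares as one possible choice of fit, and the function name fit_line and the data values are hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares intercept and slope for y = intercept + slope * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


# Made-up data, for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

intercept, slope = fit_line(xs, ys)

# Decomposition: data = structure (fitted line) + residual.
fitted = [intercept + slope * x for x in xs]
residuals = [y - f for y, f in zip(ys, fitted)]

for x, y, f, r in zip(xs, ys, fitted, residuals):
    print(f"x={x:.1f}  y={y:.2f}  fitted={f:.2f}  residual={r:+.2f}")
```

Plotting the residuals against x, rather than the raw data, removes the dominant trend so that smaller departures from the line are easier to see.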
The class is a hands-on class and portions of the class are taught in a Unix
workstation laboratory. In the laboratory, students will use Splus to carry
out their assignments and their data analysis project. The assignments
include simulation examples to help calibrate the interpretation of graphs
and to provide a deeper understanding of the methods. The class includes
instruction in Splus, a high-level statistical programming language. No prior
experience is required in either Unix or Splus. Students quickly catch on to
data access over the Internet and to Splus functions for manipulating data, for
producing graphics (Motif windows with PostScript hardcopy), and for
extending the language by writing new functions. The data analysis project
is the primary determinant of the class grade. Projects from work or student
research are encouraged as long as they are not used for grades in another
class.