Chapter 3
Initial Data Analysis
This chapter introduces the important, but often overlooked, topic of initial data analysis
(IDA). The aim of IDA is to process data so that its quality can be assessed before any
further analysis is undertaken. There are three basic steps in IDA: data processing, data
scrutiny and ‘cleaning’, and data description. Data processing involves coding and
entry of the data into a data set with a format suitable for subsequent exploratory analysis.
Data scrutiny and cleaning means checking on the quality and structure of data and
correcting any errors due to recording and processing. Data description involves
summary and display of the main characteristics of data distributions.
It is crucial to know the integrity of your data and to be confident that any data
recording and processing errors have been identified and remedied. Simple frequencies
(counts of each score for a variable) and range statistics (minimum and maximum values)
will reveal out-of-range or otherwise odd data values. A listing of cases will enable those cases with odd values
to be checked against raw data as recorded on questionnaires or coding sheets. After data
processing and cleaning, underlying distributions of variables may be examined using
data visualization techniques. The main features of the data can then be summarized
using appropriate descriptive statistics and possible statistical models identified.
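The frequency, range and case-listing checks described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a prescribed procedure: the data, the variable names (`sex`, `age`), the valid ranges and the helper functions are all hypothetical and would be replaced by the coding scheme of the study in hand.

```python
from collections import Counter

# Hypothetical coded data: one record per case (respondent).
# Variable names, codes and values are illustrative assumptions.
cases = [
    {"id": 1, "sex": 1, "age": 34},
    {"id": 2, "sex": 2, "age": 29},
    {"id": 3, "sex": 1, "age": 290},  # implausible age, possibly a keying error
    {"id": 4, "sex": 3, "age": 41},   # code 3 is not in the assumed coding scheme
    {"id": 5, "sex": 2, "age": 38},
]

# Assumed valid (minimum, maximum) for each variable, taken from the coding scheme.
valid_ranges = {"sex": (1, 2), "age": (16, 99)}


def frequencies(cases, variable):
    """Count how many cases take each value of a variable."""
    return Counter(case[variable] for case in cases)


def value_range(cases, variable):
    """Return the (minimum, maximum) observed value of a variable."""
    values = [case[variable] for case in cases]
    return min(values), max(values)


def odd_cases(cases, valid_ranges):
    """List (case id, variable, value) for every value outside its valid range."""
    flagged = []
    for case in cases:
        for variable, (low, high) in valid_ranges.items():
            value = case[variable]
            if not low <= value <= high:
                flagged.append((case["id"], variable, value))
    return flagged


sex_counts = frequencies(cases, "sex")
age_range = value_range(cases, "age")
flagged = odd_cases(cases, valid_ranges)

print("Frequencies for sex:", dict(sex_counts))
print("Range for age:", age_range)
for case_id, variable, value in flagged:
    print(f"Check case {case_id} against raw data: {variable} = {value}")
```

The flagged case identifiers are what would be looked up on the original questionnaires or coding sheets. Note that a check of this kind only catches values outside the permitted range; a miskeyed value that happens to be plausible will pass unnoticed.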
Concise and simple data presentation is essential for communication of research
findings. Examples of such presentation include bar charts, stem-and-leaf plots,
box-and-whisker plots, histograms and frequency tables. These represent a few of the many possible data
visualization and presentation techniques available, most of which are illustrated in later
sections of this chapter.
3.1 Data Processing
After collecting or being given some data, preliminary considerations should
include:
- Close examination of what exactly has been measured, that is, number of observations
and number of variables. You should also consider whether numbers used for
statistical variables represent nominal, ordinal, interval or ratio levels of measurement.
It should be stressed that taking numbers at face value without consideration of how
the data were obtained can lead to wasted time in data processing and, at worst, to
misleading results.