Chapter 16
The actual value we got for X^2 in the example was 19.18. Here's the probability that
this value is random:
round(float(cdf(19.18, 6)), 5)
0.00387
This probability is 3/775, with the denominator limited to 1000. Those are not good
odds of the data being random.
Summary
In this chapter, we looked at three optimization techniques. The first technique
involves finding the right algorithm and data structure. This has more impact on
performance than any other single design or programming decision. Using the
right algorithm can easily reduce runtimes from minutes to fractions of a second.
Changing a poorly used sequence to a properly used mapping, for example, may
reduce run time by a factor of 200.
We should generally optimize all of our recursions to be loops. This will be faster in
Python and it won't be stopped by the call stack limit that Python imposes. There are
many examples of how recursions are flattened into loops in other chapters, primarily,
Chapter 6, Recursions and Reductions. Additionally, we may be able to improve
performance in two other ways. First, we can apply memoization to cache results.
For numeric calculations, this can have a large impact; for collections, the impact may
be less. Secondly, replacing large materialized data objects with iterables may also
improve performance by reducing the amount of memory management required.
In the case study presented in this chapter, we looked at the advantage of using
Python for exploratory data analysis—the initial data acquisition including a little bit
of parsing and filtering. In some cases, a significant amount of effort is required to
normalize data from various sources. This is a task at which Python excels.
The calculation of a X^2 value involved three sum() functions: two intermediate
generator expressions, and a final generator expression to create a dictionary with
expected values. A final sum() function created the statistic. In under a dozen
expressions, we created a sophisticated analysis of data that will help us accept
or reject the null hypothesis.
We also evaluated some complex statistical functions: the incomplete and the
complete gamma function. The incomplete gamma function involves a potentially
infinite series; we truncated this and summed the values. The complete gamma
function has some potential complexity, but it doesn't happen to apply in our case.