11.6. NON-SEQUENTIAL COLLECTIONS
At the highest level, this is just a multi-accumulator problem. We need a count for each word that appears
in the document. We can use a loop that iterates through each word in the document and adds one to the
appropriate count. The only catch is that we will need hundreds or thousands of accumulators, one for each
unique word in the document. This is where a (Python) dictionary comes in handy.
We will use a dictionary where the keys are strings representing words in the document and the values are
ints that count how many times the word appears. Let’s call our dictionary counts. To update the count
for a particular word, w, we just need a line of code something like this:
    counts[w] = counts[w] + 1
This says to set the count associated with word w to be one more than the current count for w.
There is one small complication with using a dictionary here. The first time we encounter a word, it will
not yet be in counts. Attempting to access a non-existent key produces a run-time KeyError. To guard
against this, we need a decision in our algorithm.
    if w is already in counts:
        add one to the count for w
    else:
        set count for w to 1
This decision ensures that the first time a word is encountered, it will be entered into the dictionary with a
count of 1.
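To see why the decision is needed, the following minimal sketch updates a count for a word that is present and then for one that is not. The sample words and the name missing_key_failed are made up for illustration, and the try-except is used here only so that the sketch runs to completion after the error occurs.

```python
# Illustrative dictionary with one word already counted.
counts = {"foo": 1}

# An existing key can be read and updated directly.
counts["foo"] = counts["foo"] + 1

# Reading a key that is not in the dictionary raises KeyError,
# so the same update fails for a word we have not seen yet.
try:
    counts["bar"] = counts["bar"] + 1
    missing_key_failed = False
except KeyError:
    missing_key_failed = True
```

After this runs, the count for "foo" has been incremented, while "bar" has not been added to the dictionary at all.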
One way to implement this decision is to use the has_key method for dictionaries.
    if counts.has_key(w):
        counts[w] = counts[w] + 1
    else:
        counts[w] = 1
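Note that has_key exists only in Python 2; it was removed from dictionaries in Python 3. The same membership test can be written with the in operator, which works in both versions. The sketch below wraps the decision in a helper called add_word, a hypothetical name introduced only for this example, and uses a made-up word list.

```python
def add_word(counts, w):
    """Record one more occurrence of w in counts (hypothetical helper)."""
    if w in counts:
        # The word has been seen before: bump its count.
        counts[w] = counts[w] + 1
    else:
        # First occurrence: enter the word with a count of 1.
        counts[w] = 1


counts = {}
for w in ["foo", "bar", "foo"]:
    add_word(counts, w)
```

With this sample input, counts ends up mapping "foo" to 2 and "bar" to 1.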
Another approach is to use a try-except to catch the error.
    try:
        counts[w] = counts[w] + 1
    except KeyError:
        counts[w] = 1
This is a common pattern in programs that use dictionaries, and both of these coding styles are used.
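A third style, not one of the two shown above, folds the decision into a single line by using the dictionary get method, which returns a supplied default when the key is missing. This is offered only as an alternative sketch with a made-up word list; it behaves the same as the two versions above.

```python
counts = {}
for w in ["foo", "bar", "foo"]:
    # get returns the current count, or 0 if w is not yet a key,
    # so no explicit decision or exception handler is needed.
    counts[w] = counts.get(w, 0) + 1
```

The choice among the three is mostly a matter of taste; the explicit test and the try-except make the first-occurrence case more visible to the reader.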
The dictionary updating code will form the heart of our program. We just need to fill in the parts around
it. The first task is to split our text document into a sequence of words. In the process, we will also convert
all the text to lowercase (so occurrences of “Foo” match “foo”) and eliminate punctuation (so “foo,” matches
“foo”). Here’s the code to do that:
    import string

    fname = raw_input("File to analyze: ")
    # read file as one long string
    text = open(fname, 'r').read()
    # convert all letters to lower case
    text = string.lower(text)
    # replace each punctuation character with a space
    for ch in '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~':
        text = string.replace(text, ch, ' ')
    # split string at whitespace to form a list of words
    words = string.split(text)
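The code above relies on raw_input and on the functions string.lower, string.replace, and string.split, none of which are available in Python 3. As a rough equivalent, the same three steps can be written with string methods, which work in both versions. The function name split_into_words, the constant PUNCTUATION, and the sample sentence are assumptions made for this sketch; reading the file is left out so that the example is self-contained.

```python
# Same punctuation characters as in the code above.
PUNCTUATION = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~'


def split_into_words(text):
    """Lowercase text, blank out punctuation, and split into words."""
    # convert all letters to lower case
    text = text.lower()
    # replace each punctuation character with a space
    for ch in PUNCTUATION:
        text = text.replace(ch, " ")
    # split string at whitespace to form a list of words
    return text.split()


words = split_into_words("Foo, foo! Bar-baz.")
```

Because the apostrophe is not in the punctuation string, a contraction such as "let's" is kept as a single word.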
Now we can easily loop through the words to build the counts dictionary.
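One way such a loop might look is sketched here, using the try-except form of the update. The words list is a small stand-in for the result of splitting a real document.

```python
# Stand-in for the list produced by splitting the document.
words = ["foo", "bar", "foo", "baz", "foo"]

counts = {}
for w in words:
    try:
        # The word is already a key: add one to its count.
        counts[w] = counts[w] + 1
    except KeyError:
        # First time we see this word: start its count at 1.
        counts[w] = 1
```

When the loop finishes, counts holds one entry per unique word, with the value giving the number of times that word appeared.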