Python Programming: An Introduction to Computer Science

11.6. NON-SEQUENTIAL COLLECTIONS

At the highest level, this is just a multi-accumulator problem. We need a count for each word that appears
in the document. We can use a loop that iterates through each word in the document and adds one to the
appropriate count. The only catch is that we will need hundreds or thousands of accumulators, one for each
unique word in the document. This is where a (Python) dictionary comes in handy.
We will use a dictionary where the keys are strings representing words in the document and the values are
ints that count how many times the word appears. Let’s call our dictionary counts. To update the count
for a particular word, w, we just need a line of code something like this:


    counts[w] = counts[w] + 1


This says to set the count associated with word w to be one more than the current count for w.
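
As a minimal sketch of this update, assuming the word is already present in the dictionary (the starting counts below are made up for illustration):

```python
# Hypothetical starting state: 'foo' has been seen twice, 'bar' once.
counts = {'foo': 2, 'bar': 1}
w = 'foo'

# Look up the current count for w, add one, and store the result back.
counts[w] = counts[w] + 1

print(counts[w])   # 3
```

Only the entry for w changes; every other count in the dictionary is left alone.
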
There is one small complication with using a dictionary here. The first time we encounter a word, it will
not yet be in counts. Attempting to access a non-existent key produces a run-time KeyError. To guard
against this, we need a decision in our algorithm.


    if w is already in counts:
        add one to the count for w
    else:
        set count for w to 1


This decision ensures that the first time a word is encountered, it will be entered into the dictionary with a
count of 1.
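
To see the failure this decision guards against, the sketch below attempts the plain update on an empty dictionary. The try/except wrapper is there only so the example can report the error instead of halting, and the word 'foo' is an arbitrary example.

```python
counts = {}          # no words seen yet
w = 'foo'            # arbitrary example word

try:
    # Reading counts[w] on the right-hand side fails: the key is missing.
    counts[w] = counts[w] + 1
    raised = False
except KeyError:
    raised = True

print(raised)   # True
print(counts)   # {}
```

The dictionary is still empty afterwards, because the error occurs before any value is stored.
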
One way to implement this decision is to use the has_key method for dictionaries.


    if counts.has_key(w):
        counts[w] = counts[w] + 1
    else:
        counts[w] = 1
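
The has_key method belongs to older versions of Python and was removed in Python 3. A sketch of the same decision using the in operator, which works in current Python, might look like this. The helper name add_word is hypothetical and not part of the original program.

```python
def add_word(counts, w):
    """Record one more occurrence of w in counts (hypothetical helper)."""
    if w in counts:                  # membership test on the dictionary's keys
        counts[w] = counts[w] + 1
    else:
        counts[w] = 1                # first encounter: start the count at 1


counts = {}
add_word(counts, 'foo')   # first encounter
add_word(counts, 'foo')   # second encounter
print(counts)   # {'foo': 2}
```
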


Another approach is to use a try-except to catch the error.


    try:
        counts[w] = counts[w] + 1
    except KeyError:
        counts[w] = 1


This is a common pattern in programs that use dictionaries, and both of these coding styles are used.
The dictionary updating code will form the heart of our program. We just need to fill in the parts around
it. The first task is to split our text document into a sequence of words. In the process, we will also convert
all the text to lowercase (so occurrences of “Foo” match “foo”) and eliminate punctuation (so “foo,” matches
“foo”). Here’s the code to do that:


    import string

    fname = raw_input("File to analyze: ")

    # read file as one long string
    text = open(fname, 'r').read()

    # convert all letters to lower case
    text = string.lower(text)

    # replace each punctuation character with a space
    for ch in '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~':
        text = string.replace(text, ch, ' ')

    # split string at whitespace to form a list of words
    words = string.split(text)
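
In current Python the functions string.lower, string.replace, and string.split are no longer available. The same steps are written as methods on the string itself. The sketch below assumes the text is already in memory rather than read from a file, and the function name split_into_words is hypothetical.

```python
def split_into_words(text):
    """Lowercase text, blank out punctuation, and split on whitespace."""
    # convert all letters to lower case
    text = text.lower()

    # replace each punctuation character with a space
    for ch in '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~':
        text = text.replace(ch, ' ')

    # split string at whitespace to form a list of words
    return text.split()


sample = 'Foo, foo! Bar: foo.'
print(split_into_words(sample))   # ['foo', 'foo', 'bar', 'foo']
```

Note that the apostrophe is not in the punctuation list, so a word such as "don't" stays intact.
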

Now we can easily loop through the words to build the counts dictionary.
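
One way this loop might look, reusing the try-except update shown earlier. The function name count_words and the sample word list are assumptions made for illustration.

```python
def count_words(words):
    """Return a dict mapping each word to the number of times it occurs."""
    counts = {}
    for w in words:
        try:
            counts[w] = counts[w] + 1
        except KeyError:
            # first time this word has been seen
            counts[w] = 1
    return counts


words = ['foo', 'bar', 'foo', 'baz', 'foo']   # hypothetical word list
print(count_words(words))   # {'foo': 3, 'bar': 1, 'baz': 1}
```
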