python - How do I find the most common words in multiple separate texts?


A bit of a simple question really, but I can't seem to crack it. I have a string formatted in the following way:

["category1",("data","data","data")] ["category2", ("data","data","data")] 

I call the different categories "posts", and I want the most frequent words from the data section. I tried:

from nltk.tokenize import wordpunct_tokenize
from collections import defaultdict

freq_dict = defaultdict(int)

for cat, text2 in posts:
    tokens = wordpunct_tokenize(text2)
    for token in tokens:
        if token in freq_dict:
            freq_dict[token] += 1
        else:
            freq_dict[token] = 1
    top = sorted(freq_dict, key=freq_dict.get, reverse=True)
    top = top[:50]
    print top

However, this gives me the top words for each post separately.

I need a general top-words list across all the posts.
If I take print top out of the loop, it only gives me the results of the last post.
Any ideas?

Why not use Counter?

In [30]: from collections import Counter

In [31]: data = ["category1", ("data", "data", "data")]

In [32]: Counter(data[1])
Out[32]: Counter({'data': 3})

In [33]: Counter(data[1]).most_common()
Out[33]: [('data', 3)]
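To get one overall top-words list instead of per-post results, you can keep a single Counter and update it inside the loop, then ask for most_common(50) once at the end. A minimal sketch, using str.split() in place of wordpunct_tokenize and a made-up posts list in the question's [category, (data, ...)] shape:

```python
from collections import Counter

# Hypothetical posts in the question's format: [category, (text, text, ...)]
posts = [
    ["category1", ("data one", "data two", "data")],
    ["category2", ("data three", "one two")],
]

freq = Counter()
for category, texts in posts:
    for text in texts:
        # str.split() stands in for wordpunct_tokenize here
        freq.update(text.split())

# One aggregated list over all posts, not one list per post
top = freq.most_common(50)
print(top)
```

Counter.update adds to the existing counts rather than replacing them, which is exactly the cross-post accumulation the question asks for.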
