python - How do I find the most common words in multiple separate texts?


This is a fairly simple question, but I can't seem to crack it. I have a string formatted in the following way:

["category1",("data","data","data")] ["category2", ("data","data","data")] 

I call the different categories posts, and I want the most frequent words from the data section. I tried:

from nltk.tokenize import wordpunct_tokenize
from collections import defaultdict

freq_dict = defaultdict(int)

for cat, text2 in posts:
    tokens = wordpunct_tokenize(text2)
    for token in tokens:
        if token in freq_dict:
            freq_dict[token] += 1
        else:
            freq_dict[token] = 1
    # Sorting and printing happen inside the loop, once per post.
    top = sorted(freq_dict, key=freq_dict.get, reverse=True)
    top = top[:50]
    print(top)

However, this gives me the top words per post in the string.

I need a general list of top words.
If I take the print statement out of the loop, it only gives me the results of the last post.
Does anyone have an idea?

Why not use Counter?

In [30]: from collections import Counter

In [31]: data = ["category1", ("data", "data", "data")]

In [32]: Counter(data[1])
Out[32]: Counter({'data': 3})

In [33]: Counter(data[1]).most_common()
Out[33]: [('data', 3)]
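To get one overall list rather than one per post, the same Counter can be updated across every post and queried once at the end. The following is only a minimal sketch under some assumptions: the sample posts, the function name top_words, and the regex tokenizer are all hypothetical stand-ins (the regex is used so the example runs without NLTK; wordpunct_tokenize could be swapped in for it).

```python
import re
from collections import Counter

# Hypothetical sample data in the shape described in the question:
# each post is [category, (text, text, ...)].
posts = [
    ["category1", ("the cat sat", "the dog ran", "a cat")],
    ["category2", ("the bird flew", "a dog barked")],
]


def top_words(posts, n=50):
    """Return the n most common words across all posts (hypothetical helper)."""
    counts = Counter()
    for _category, texts in posts:
        for text in texts:
            # Stand-in tokenizer (an assumption): lowercase runs of word characters.
            counts.update(re.findall(r"\w+", text.lower()))
    # Queried once, after every post has been counted.
    return counts.most_common(n)


print(top_words(posts, 3))
```

The key difference from the loop in the question is that the ranking step runs a single time, after all posts have been counted, so the result covers the whole collection.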
