python - How do I find the most common words in multiple separate texts? -
A bit of a simple question really, but I can't seem to crack it. I have a string formatted in the following way:
["category1",("data","data","data")] ["category2", ("data","data","data")]
I call the different categories posts and I want to get the most frequent words from the data section. I tried:
```python
from nltk.tokenize import wordpunct_tokenize
from collections import defaultdict

freq_dict = defaultdict(int)

for cat, text2 in posts:
    tokens = wordpunct_tokenize(text2)
    for token in tokens:
        if token in freq_dict:
            freq_dict[token] += 1
        else:
            freq_dict[token] = 1
    top = sorted(freq_dict, key=freq_dict.get, reverse=True)
    top = top[:50]
    print top
```
However, this gives me the top words per post in the string. I need a general top-words list across all posts. If I take `print top` out of the loop, it only gives me the results of the last post. Does anyone have an idea?
Why not use a `Counter`?
```python
In [30]: from collections import Counter

In [31]: data = ["category1", ("data", "data", "data")]

In [32]: Counter(data[1])
Out[32]: Counter({'data': 3})

In [33]: Counter(data[1]).most_common()
Out[33]: [('data', 3)]
```
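To get one global ranking across every post rather than per-post counts, a single `Counter` can be updated inside the loop and queried once at the end. Here is a minimal sketch, assuming `posts` is a list of `[category, tuple_of_words]` pairs in the shape shown in the question (the sample data below is made up for illustration):

```python
from collections import Counter

# Hypothetical sample data in the same shape as the question's posts
posts = [
    ["category1", ("data", "foo", "data")],
    ["category2", ("data", "bar", "foo")],
]

# One Counter accumulates token counts across every post,
# so the final tally is global rather than per-post.
freq = Counter()
for cat, words in posts:
    freq.update(words)

# most_common(50) returns the 50 highest-count (word, count) pairs,
# sorted from most to least frequent.
top = freq.most_common(50)
print(top)
```

Because the `Counter` lives outside the loop and is only printed after the loop finishes, the result reflects all posts combined, which is what the question asks for.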