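Contents

- What are stop words?
- Why do we need to remove stop words?
- When should we remove stop words?
- Different ways to remove stop words
  - Using NLTK
  - Using spaCy
  - Using Gensim
- Introduction to text normalization
- What are stemming and lemmatization?
- Ways to perform stemming and lemmatization
  - Using NLTK
  - Using spaCy
  - Using TextBlob

1. What are stop words?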



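Consider the text "There is a pen on the table". The words "is", "a", "on" and "the" add nothing to the meaning of the statement when it is parsed, whereas words such as "there", "pen" and "table" are the keywords that tell us what the sentence is about.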



a about after all also always am an and any are at be been being but by came can cant come could did didn do does doesn doing don else for from get give goes going had happen has have having how i if ill im in into is isn it its ive just keep let like made make many may me mean more most much no not now of only or our really say see some something take tell than that the their them then they thing this to try up us use used uses very want was way we what when where which who why will with without wont you your youre
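As a minimal sketch of the idea, the snippet below filters the example sentence against the four words the example calls stop words. The `STOP_WORDS` set and the `remove_stop_words` helper are illustrative names, not part of any library, and whitespace splitting is assumed to be good enough for this one sentence.

```python
# The four words the example sentence treats as stop words (illustration only).
STOP_WORDS = {"is", "a", "on", "the"}


def remove_stop_words(sentence, stop_words=STOP_WORDS):
    """Return the words of `sentence` that are not stop words (case-insensitive)."""
    return [word for word in sentence.split() if word.lower() not in stop_words]


print(remove_stop_words("There is a pen on the table"))
```

Only "There", "pen" and "table" survive, which are the keywords identified in the example.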

2. Why do we need to remove stop words?






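- Removing stop words shrinks the dataset, which also reduces the time needed to train a model.
- Removing stop words can help improve performance, because fewer and more meaningful tokens remain. This can improve classification accuracy.
- Even search engines such as Google remove stop words so that data can be retrieved from the database quickly.

3. When should we remove stop words?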





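Stop-word removal is generally avoided in tasks where every word contributes to meaning or fluency, such as:

- Machine translation
- Language modeling
- Text summarization
- Question answering (QA) systems

4. Different ways to remove stop words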

4.1. Removing stop words with NLTK



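    import nltk
    from nltk.corpus import stopwords

    # nltk.download("stopwords")  # needed once if the corpus is not installed
    set(stopwords.words("english"))

Now, to remove stop words with NLTK, you can use the following code block: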

    # Remove stop words from a sentence with NLTK
    from nltk.corpus import stopwords
    from nltk.tokenize import word_tokenize

    # Example text
    text = """He determined to drop his litigation with the monastry, and relinguish his claims to the wood-cuting and
    fishery rihgts at once. He was the more ready to do this becuase the rights had become much less valuable, and he had
    indeed the vaguest idea where the wood and river in question were."""

    # Set of English stop words
    stop_words = set(stopwords.words("english"))

    # Tokenize the text
    word_tokens = word_tokenize(text)

    # Keep only the tokens that are not stop words
    filtered_sentence = [w for w in word_tokens if w not in stop_words]

    print("\n\nOriginal Sentence \n\n")
    print(" ".join(word_tokens))

    print("\n\nFiltered Sentence \n\n")
    print(" ".join(filtered_sentence))

This is the sentence after tokenization:

He determined to drop his litigation with the monastry, and relinguish his claims to the wood-cuting and fishery rihgts at once. He was the more ready to do this becuase the rights had become much less valuable, and he had indeed the vaguest idea where the wood and river in question were.

After removing stop words:

He determined drop litigation monastry, relinguish claims wood-cuting fishery rihgts. He ready becuase rights become much less valuable, indeed vaguest idea wood river question.

Notice that the text has shrunk to almost half its original size. That gives a sense of how useful removing stop words can be.
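The "almost half" claim can be checked with a rough count. The sketch below counts whitespace-separated words in the two strings shown above; the `word_count` helper is an illustrative name, and counting by whitespace is an assumption that ignores how NLTK splits off punctuation.

```python
original = (
    "He determined to drop his litigation with the monastry, and relinguish "
    "his claims to the wood-cuting and fishery rihgts at once. He was the more "
    "ready to do this becuase the rights had become much less valuable, and he "
    "had indeed the vaguest idea where the wood and river in question were."
)
filtered = (
    "He determined drop litigation monastry, relinguish claims wood-cuting "
    "fishery rihgts. He ready becuase rights become much less valuable, indeed "
    "vaguest idea wood river question."
)


def word_count(text):
    """Count whitespace-separated words."""
    return len(text.split())


kept_ratio = word_count(filtered) / word_count(original)
print(f"{word_count(original)} words before, {word_count(filtered)} after "
      f"({kept_ratio:.0%} kept)")
```

By this measure 24 of the 52 words remain, a little under half.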

4.2. Removing stop words with spaCy



    from spacy.lang.en import English
    from spacy.lang.en.stop_words import STOP_WORDS

    # Create a blank English pipeline; English() provides the tokenizer only
    nlp = English()

    text = """He determined to drop his litigation with the monastry, and relinguish his claims to the wood-cuting and
    fishery rihgts at once. He was the more ready to do this becuase the rights had become much less valuable, and he had
    indeed the vaguest idea where the wood and river in question were."""

    # The "nlp" object creates a document with linguistic annotations
    my_doc = nlp(text)

    # Build the list of tokens
    token_list = [token.text for token in my_doc]

    # Build the list of words that remain after removing stop words
    filtered_sentence = []
    for word in token_list:
        lexeme = nlp.vocab[word]
        if not lexeme.is_stop:
            filtered_sentence.append(word)

    print(token_list)
    print(filtered_sentence)

This is the list we get after tokenization:

He determined to drop his litigation with the monastry and relinguish his claims to the wood-cuting and \n fishery rihgts at once. He was the more ready to do this becuase the rights had become much less valuable, and he had \n indeed the vaguest idea where the wood and river in question were.

The list after removing stop words:

determined drop litigation monastry, relinguish claims wood-cuting \n fishery rihgts. ready becuase rights become valuable, \n vaguest idea wood river question

One thing to note: removing stop words does not remove punctuation or newline characters. We need to remove those ourselves.
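A minimal sketch of that manual clean-up, using only the standard library, could look like this. The `drop_punct_and_space` helper and the sample token list are hypothetical; the list merely resembles the filtered output above.

```python
import string


def drop_punct_and_space(tokens):
    """Keep tokens that are neither pure whitespace nor pure punctuation."""
    cleaned = []
    for token in tokens:
        if token.isspace():
            continue  # e.g. "\n " produced by line breaks in the input
        if all(ch in string.punctuation for ch in token):
            continue  # e.g. "," or "."
        cleaned.append(token)
    return cleaned


# Hypothetical token list resembling the filtered output above.
tokens = ["determined", "drop", "litigation", "monastry", ",", "wood-cuting",
          "\n ", "fishery", "rihgts", ".", "ready"]
print(drop_punct_and_space(tokens))
```

When working with spaCy tokens directly rather than strings, the `is_punct` and `is_space` token attributes can serve the same purpose.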

4.3. Removing stop words with Gensim



    # Remove stop words with Gensim
    from gensim.parsing.preprocessing import remove_stopwords

    # Pass the raw text to the remove_stopwords function
    result = remove_stopwords("""He determined to drop his litigation with the monastry, and relinguish his claims to the wood-cuting and
    fishery rihgts at once. He was the more ready to do this becuase the rights had become much less valuable, and he had
    indeed the vaguest idea where the wood and river in question were.""")

    print("\n\nFiltered Sentence\n\n")
    print(result)

He determined drop litigation monastry, relinguish claims wood-cuting fishery rihgts once. He ready becuase rights valuable, vaguest idea wood river question were.

With Gensim, stop words can be removed directly from the raw text. There is no need to tokenize before removing them, which can save a lot of time.
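The output above still contains "once." and "were." with their punctuation attached, and the capitalized "He". One plausible reading, stated here as an assumption rather than a description of Gensim's internals, is that the text is split on whitespace and each piece is compared with the stop-word set verbatim. The sketch below reproduces that behavior with a hypothetical `remove_stopwords_naive` helper and a tiny hand-picked stop-word set.

```python
# Hypothetical stop-word subset; Gensim ships its own, much larger list.
STOP_WORDS = {"the", "at", "once", "he", "had", "where", "were"}


def remove_stopwords_naive(text, stop_words=STOP_WORDS):
    """Drop whitespace-separated pieces that match a stop word exactly."""
    return " ".join(piece for piece in text.split() if piece not in stop_words)


print(remove_stopwords_naive("the rights at once. He had the idea where they were."))
```

Because "once." and "were." are not identical to "once" and "were", and "He" is not identical to "he", all three pass through. Lowercasing and stripping punctuation first would catch them.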

5. Introduction to text normalization


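- Lisa ate the food and washed the dishes.
- They were eating noodles at a cafe.
- Don’t you want to eat before we leave?
- We have just eaten our breakfast.
- It also eats fruit and vegetables.

In all of these sentences the word "eat" appears in several different forms. For us it is easy to see that eating is the activity in question, so it makes no difference whether the word is eat, ate or eaten: we know what is going on.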
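To make the problem concrete, the sketch below counts how often "eat" occurs in the five sentences, first on the raw tokens and then after mapping every form to "eat". The `EAT_FORMS` table and the `tokens` helper are written by hand for this example only; they are assumptions, not how any of the libraries discussed here work internally.

```python
from collections import Counter

# Hand-written lookup for this example only; real lemmatizers rely on a
# dictionary plus morphological rules.
EAT_FORMS = {"ate": "eat", "eating": "eat", "eaten": "eat", "eats": "eat"}

sentences = [
    "Lisa ate the food and washed the dishes.",
    "They were eating noodles at a cafe.",
    "Don't you want to eat before we leave?",
    "We have just eaten our breakfast.",
    "It also eats fruit and vegetables.",
]


def tokens(sentence):
    """Lowercase and split on whitespace, trimming surrounding punctuation."""
    return [word.strip(".,?!").lower() for word in sentence.split()]


raw_counts = Counter(t for s in sentences for t in tokens(s))
normalized_counts = Counter(EAT_FORMS.get(t, t) for s in sentences for t in tokens(s))

print(raw_counts["eat"], normalized_counts["eat"])
```

Without normalization a program sees "eat" once; with it, all five sentences count toward the same word.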



6. What are stemming and lemmatization?









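- He was driving
- He went for a drive

We can easily tell that both sentences convey the same meaning: an act of driving in the past. A machine, however, will treat the two sentences differently. To make the text understandable to a machine, we therefore need to perform stemming or lemmatization.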
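The difference between the two techniques can be sketched with toy versions of each. Both functions below are deliberately simplistic and hypothetical: `toy_stem` strips a few suffixes and is not the Porter algorithm, and `toy_lemmatize` looks words up in a four-entry table written for this example.

```python
def toy_stem(word):
    """Strip a few common suffixes; a toy rule set, not the Porter algorithm."""
    for suffix in ("ing", "ed", "es", "s", "e"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


# Hand-written lemma table for this example only.
TOY_LEMMAS = {"driving": "drive", "drove": "drive", "was": "be", "went": "go"}


def toy_lemmatize(word):
    """Look the word up in the table, falling back to the word itself."""
    return TOY_LEMMAS.get(word, word)


for word in ["driving", "drive", "drove", "went"]:
    print(word, toy_stem(word), toy_lemmatize(word))
```

The stemmer maps "driving" and "drive" to the same stem, "driv", which is not a real word, and it cannot relate "went" to "go". The lookup-based lemmatizer returns dictionary forms but only for words it knows about.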






7. Ways to perform text normalization

7.1. Text normalization with NLTK




    from nltk.corpus import stopwords
    from nltk.tokenize import word_tokenize
    from nltk.stem import PorterStemmer

    text = """He determined to drop his litigation with the monastry, and relinguish his claims to the wood-cuting and
    fishery rihgts at once. He was the more ready to do this becuase the rights had become much less valuable, and he had
    indeed the vaguest idea where the wood and river in question were."""

    # Tokenize and remove stop words first
    stop_words = set(stopwords.words("english"))
    word_tokens = word_tokenize(text)
    filtered_sentence = [w for w in word_tokens if w not in stop_words]

    # Stem each remaining token with the Porter stemmer
    ps = PorterStemmer()
    stem_words = [ps.stem(w) for w in filtered_sentence]

    print(filtered_sentence)
    print(stem_words)

He determined drop litigation monastry, relinguish claims wood-cuting fishery rihgts. He ready becuase rights become much less valuable, indeed vaguest idea wood river question.

He determin drop litig monastri, relinguish claim wood-cut fisheri rihgt. He readi becuas right become much less valuabl, inde vaguest idea wood river question.

The difference is clearly visible here. Next, let us perform lemmatization on the same text.


    from nltk.corpus import stopwords
    from nltk.tokenize import word_tokenize
    from nltk.stem import WordNetLemmatizer

    text = """He determined to drop his litigation with the monastry, and relinguish his claims to the wood-cuting and
    fishery rihgts at once. He was the more ready to do this becuase the rights had become much less valuable, and he had
    indeed the vaguest idea where the wood and river in question were."""

    # Tokenize and remove stop words first
    stop_words = set(stopwords.words("english"))
    word_tokens = word_tokenize(text)
    filtered_sentence = [w for w in word_tokens if w not in stop_words]
    print(filtered_sentence)

    lemma_word = []
    wordnet_lemmatizer = WordNetLemmatizer()
    for w in filtered_sentence:
        # Lemmatize as a noun, then as a verb, then as an adjective
        word1 = wordnet_lemmatizer.lemmatize(w, pos="n")
        word2 = wordnet_lemmatizer.lemmatize(word1, pos="v")
        word3 = wordnet_lemmatizer.lemmatize(word2, pos="a")
        lemma_word.append(word3)
    print(lemma_word)

He determined drop litigation monastry, relinguish claims wood-cuting fishery rihgts. He ready becuase rights become much less valuable, indeed vaguest idea wood river question.

He determined drop litigation monastry, relinguish claim wood-cuting fishery rihgts. He ready becuase right become much le valuable, indeed vaguest idea wood river question.

Here, v stands for verb, a for adjective and n for noun. The lemmatizer only lemmatizes words that match the pos argument passed to the lemmatize method.
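Chaining the noun, verb and adjective passes applies every rule set to every word, which is likely how "less" ended up as "le" in the output above. A common alternative is to tag each token first and pass only the matching pos. The helper below is a sketch of the mapping step; `penn_to_wordnet` is a hypothetical name, and the tagged pairs are made up in the shape that `nltk.pos_tag` returns.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank POS tag to the one-letter code lemmatize() expects."""
    if tag.startswith("J"):
        return "a"  # adjective
    if tag.startswith("V"):
        return "v"  # verb
    if tag.startswith("R"):
        return "r"  # adverb
    return "n"      # noun, also used as the fallback


# Hypothetical (word, tag) pairs in the shape nltk.pos_tag returns.
tagged = [("rights", "NNS"), ("become", "VBN"), ("less", "RBR"), ("valuable", "JJ")]
for word, tag in tagged:
    print(word, penn_to_wordnet(tag))
    # With NLTK installed, the lemma would then be:
    # wordnet_lemmatizer.lemmatize(word, pos=penn_to_wordnet(tag))
```

Each word is then lemmatized once, under the part of speech it actually has in the sentence.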


7.2. Text normalization with spaCy


    # Make sure the English model is downloaded first, for example with
    # "python -m spacy download en_core_web_sm" (older spaCy releases used the shortcut "en")
    import en_core_web_sm

    nlp = en_core_web_sm.load()

    doc = nlp("""He determined to drop his litigation with the monastry, and relinguish his claims to the wood-cuting and
    fishery rihgts at once. He was the more ready to do this becuase the rights had become much less valuable, and he had
    indeed the vaguest idea where the wood and river in question were.""")

    # Collect the lemma of every token
    lemma_word1 = [token.lemma_ for token in doc]
    print(lemma_word1)

-PRON- determine to drop -PRON- litigation with the monastry, and relinguish -PRON- claim to the wood-cuting and \n fishery rihgts at once. -PRON- be the more ready to do this becuase the right have become much less valuable, and -PRON- have \n indeed the vague idea where the wood and river in question be.

Here -PRON- is the placeholder for pronouns, and it can easily be removed with a regular expression. Note that -PRON- is specific to spaCy 2.x; newer releases return the pronoun itself as the lemma. The nice thing about spaCy is that we do not have to pass any pos argument to perform lemmatization.
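A minimal sketch of that regular-expression clean-up, using only the standard library, is shown below. The sample string and list are shortened, hand-copied excerpts of the output above rather than live spaCy output.

```python
import re

lemmatized = "-PRON- determine to drop -PRON- litigation with the monastry"

# Remove each "-PRON-" placeholder together with the whitespace that follows it.
without_pron = re.sub(r"-PRON-\s*", "", lemmatized).strip()
print(without_pron)

# When the lemmas are still in a list, a plain filter does the same job.
lemmas = ["-PRON-", "be", "the", "more", "ready"]
print([lemma for lemma in lemmas if lemma != "-PRON-"])
```

Filtering the list is usually the simpler choice, since it avoids joining and re-splitting the text.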

7.3. Text normalization with TextBlob



    # Import the Word class from the textblob library
    from textblob import Word

    text = """He determined to drop his litigation with the monastry, and relinguish his claims to the wood-cuting and
    fishery rihgts at once. He was the more ready to do this becuase the rights had become much less valuable, and he had
    indeed the vaguest idea where the wood and river in question were."""

    lem = []
    for i in text.split():
        # Lemmatize as a noun, then as a verb, then as an adjective
        word1 = Word(i).lemmatize("n")
        word2 = Word(word1).lemmatize("v")
        word3 = Word(word2).lemmatize("a")
        lem.append(Word(word3).lemmatize())
    print(lem)

He determine to drop his litigation with the monastry, and relinguish his claim to the wood-cuting and fishery rihgts at once. He wa the more ready to do this becuase the right have become much le valuable, and he have indeed the vague idea where the wood and river in question were.

As we saw in the NLTK section, TextBlob also uses POS tags to perform lemmatization.
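Because the code above splits on whitespace, punctuation stays attached to words, which is probably why the final "were." came through unchanged. One way around this, sketched below with the standard library only, is to separate words from punctuation before lemmatizing. The `TOKEN_PATTERN` expression and the `split_words_and_punct` helper are illustrative assumptions, not part of TextBlob.

```python
import re

# Words (optionally hyphenated) or single punctuation marks.
TOKEN_PATTERN = re.compile(r"\w+(?:-\w+)*|[^\w\s]")


def split_words_and_punct(text):
    """Return words and punctuation marks as separate tokens."""
    return TOKEN_PATTERN.findall(text)


print(split_words_and_punct("the wood-cuting and river in question were."))
```

Each word token could then be passed to `Word(...).lemmatize(...)` on its own, with the punctuation tokens skipped or re-attached afterwards.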

8. Conclusion

