Contents
Basics Module
1. Concepts
2. Regular Expressions
Data Cleaning Module
3. Removing Punctuation
4. Tokenization
5. Removing Stop Words
6. Stemming & Lemmatizing
stemming
lemmatizing
Similarities and differences
Vectorizing Module
7. Count Vectorization
8. N-gram Vectorization
9. Inverse Document Frequency Weighting
Feature Engineering Module
10. Creating New Features
11. Transformation
Basics Module
1. Concepts
NLP aims to enable machines to understand human language.
Typical tasks include:
sentiment analysis
topic modeling
text classification
sentence segmentation / part-of-speech tagging
Typical pipeline:
raw text → tokenize → clean text (remove stop words, punctuation, stemming) → vectorize (convert to numeric form)
2. Regular Expressions
Before diving into NLP, it helps to learn regular expressions first.
In Python, use the re module.
re.split(r'\s+', string) splits the string on the pattern given as the first argument
re.findall(pattern, string) finds all occurrences of pattern in the string
re.sub(pattern, replacement, string) replaces every match of pattern in the string with the replacement
Regular expressions are case sensitive.
The regex metacharacters are: . ^ $ * + ? { } [ ] \ | ( )
. matches any character except a newline
^ anchors the match to the start of the target string, e.g. ^hell matches "hello" and "hellboy"
$ anchors the match to the end of the target string, e.g. ar$ matches "car" and "bar"
* the preceding character must appear zero or more consecutive times
+ the preceding character must appear one or more consecutive times
? the preceding character must appear zero times or once
{n} matches exactly n occurrences, e.g. o{2} matches "oo"
{n,} matches at least n occurrences, e.g. o{2,} matches "oo", "ooo", "oooo", ...
{n,m} matches at least n and at most m occurrences, e.g. o{2,3} matches "oo" and "ooo"
[A-Z] any uppercase letter from A to Z
[a-z] any lowercase letter from a to z
[0-9] any digit from 0 to 9, equivalent to \d
[A-Za-z0-9] any letter or digit (note: \w additionally matches the underscore)
[^...] negation inside a character class, not to be confused with the ^ anchor above
\ escape character
| pipe symbol; if A and B are arbitrary REs, then A|B is a new RE that matches either A or B
\s matches a single whitespace character, including tab and newline
\S matches any single character that is not whitespace
\d matches a digit 0-9
\w matches a letter, digit, or underscore
\W matches any character that \w does not match
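The three re functions above can be sketched on a small made-up string (the sample text is arbitrary):

```python
import re

text = "NLP   is fun, NLP is useful"

# split on one or more whitespace characters
tokens = re.split(r'\s+', text)

# find every occurrence of a pattern
hits = re.findall(r'NLP', text)

# replace every match with another string
replaced = re.sub(r'NLP', 'Natural Language Processing', text)

print(tokens)    # ['NLP', 'is', 'fun,', 'NLP', 'is', 'useful']
print(hits)      # ['NLP', 'NLP']
print(replaced)
```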
Data Cleaning Module
3. Removing Punctuation

```python
import string

def remove_punct(text):
    text_nopunct = "".join([char for char in text if char not in string.punctuation])
    return text_nopunct
```

If you use [char for char in text if char not in string.punctuation] alone, without the join, you get a list of individual characters. Note that the square brackets are required to form a list; parentheses would return a generator instead.

```python
data['body_text_clean'] = data.body_text.apply(lambda x: remove_punct(x))
```
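A quick, self-contained usage sketch of the punctuation-stripping logic above (the sample sentence is made up):

```python
import string

def remove_punct(text):
    # keep only characters that are not in string.punctuation
    return "".join([char for char in text if char not in string.punctuation])

print(remove_punct("Hi, there! How are you?"))  # Hi there How are you
```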
4. 切分词功能
def tokenize(text):tokens = re.split('\W+', text)return tokens data['tokenized'] = data['clean'].apply(lambda x: tokenize(x.lower()))
此处Tokenize过程是相对于英文而言的。中文单词因为是由汉字复杂构成,将另外讨论。
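Splitting on r'\W+' works for simple English text, but note one gotcha: if the string ends with a non-word character, re.split produces a trailing empty string. A minimal sketch on a hypothetical sentence:

```python
import re

tokens = re.split(r'\W+', "hello, world!")
print(tokens)  # ['hello', 'world', ''] — trailing '' because the string ends in '!'
```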
5. 去掉停止词
import nltknltk.download("stopwords")stopword = nltk.corpus.stopwords.words('english')def remove_stopwords(tokenized_list):text = [word for word in tokenized_list if word not in stopword]return text
data["nostop"] = data.tokenized.apply(lambda x : remove_stopwords(x))
同样此处只讨论英文的停止词。
6. Stemming & Lemmatizing
stemming
Definition: the process of reducing inflected (or sometimes derived) words to their word stem or root.

```python
import nltk

ps = nltk.PorterStemmer()

def stemming(text):
    text = [ps.stem(word) for word in text]
    return text
```

lemmatizing
Definition: the process of grouping together the inflected forms of a word so they can be analyzed as a single term, identified by the word's lemma.

```python
import nltk

wn = nltk.WordNetLemmatizer()

def lemmatizing(text):
    text = [wn.lemmatize(word) for word in text]
    return text
```
Similarities and differences
Similarity:
both share the common goal of condensing derived words into their base forms
Differences:
stemming is faster, as it simply chops off the end of a word without understanding the context
lemmatizing is more accurate, as it uses analysis to create groups of words with similar meaning based on the context
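A quick sketch of the "chops without context" point: the Porter stemmer's output need not be a real word, whereas a lemmatizer would return the dictionary form (the sample words are arbitrary):

```python
import nltk

ps = nltk.PorterStemmer()

# Stemming strips suffixes mechanically, so the result may not be a word:
print(ps.stem("studies"))  # studi  (a lemmatizer would give "study")
print(ps.stem("running"))  # run
```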
Vectorizing Module
7. Count Vectorization

```python
import re
import string
from sklearn.feature_extraction.text import CountVectorizer

def clean_text(text):
    text = "".join([word.lower() for word in text if word not in string.punctuation])
    tokens = re.split(r'\W+', text)
    text = [ps.stem(word) for word in tokens if word not in stopwords]
    return text

count_vect = CountVectorizer(analyzer=clean_text)
X_counts = count_vect.fit_transform(data.body_text)
X_counts.shape
count_vect.get_feature_names()
```

The resulting X_counts is a sparse matrix.
Sparse Matrix: A matrix in which most entries are 0. In the interest of efficient storage, a sparse matrix will be stored by only storing the locations of the non-zero elements.
We can convert it into a DataFrame:

```python
x_counts_df = pd.DataFrame(X_counts.toarray())
x_counts_df.columns = count_vect.get_feature_names()
x_counts_df
```
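A minimal, self-contained sketch of the document-term matrix that fit_transform produces, on a toy two-sentence corpus (the sentences are made up; columns are ordered alphabetically by term):

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = ["nlp is fun", "nlp is hard nlp"]

count_vect = CountVectorizer()
X = count_vect.fit_transform(corpus)

print(sorted(count_vect.vocabulary_))  # ['fun', 'hard', 'is', 'nlp']
print(X.toarray())
# [[1 0 1 1]
#  [0 1 1 2]]   <- "nlp" appears twice in the second document
```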
8. N-gram Vectorization
This method is very similar to Count Vectorization.
Creates a document-term matrix where counts still occupy the cell but instead of the columns representing single terms, they represent all combinations of adjacent words of length n in your text.
For example, the bigrams of "NLP is an interesting topic" are "NLP is", "is an", "an interesting", and "interesting topic".

```python
def clean_text(text):
    text = "".join([word.lower() for word in text if word not in string.punctuation])
    tokens = re.split(r'\W+', text)
    # pay attention: unlike Count Vectorization, the stemmed tokens are joined
    # back into a single string, because the vectorizer builds the n-grams itself
    text = " ".join([ps.stem(word) for word in tokens if word not in stopwords])
    return text

data['cleaned_text'] = data['body_text'].apply(lambda x: clean_text(x))
ngram_vect = CountVectorizer(ngram_range=(2,2))
x_count = ngram_vect.fit_transform(data.cleaned_text)
```

The ngram_range parameter sets the range of n. This example takes bigrams only. Subsequent processing is the same as for Count Vectorization.
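The bigram example above can be checked directly with a self-contained one-sentence sketch:

```python
from sklearn.feature_extraction.text import CountVectorizer

# ngram_range=(2, 2) means: only bigrams (adjacent word pairs)
ngram_vect = CountVectorizer(ngram_range=(2, 2))
X = ngram_vect.fit_transform(["nlp is an interesting topic"])

print(sorted(ngram_vect.vocabulary_))
# ['an interesting', 'interesting topic', 'is an', 'nlp is']
```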
9. Inverse Document Frequency Weighting
TF-IDF: Creates a document-term matrix where the columns represent single unique terms (unigrams) but the cell represents a weighting meant to represent how important a word is to a document.
TF-IDF is simply TF * IDF, where TF is the term frequency and IDF is the inverse document frequency; in the classic formulation, the weight of term i in document j is w(i,j) = tf(i,j) × log(N / df(i)), with N the number of documents and df(i) the number of documents containing term i. TF-IDF is a statistical method for evaluating how important a word is to one document in a collection or corpus. A word's importance increases with the number of times it appears in the document, but decreases with its frequency across the corpus. If a word or phrase has a high TF in one article yet rarely appears in other articles, it is considered to discriminate well between categories and is well suited for classification.

```python
import re
import string
from sklearn.feature_extraction.text import TfidfVectorizer

def clean_text(text):
    text = "".join([word.lower() for word in text if word not in string.punctuation])
    tokens = re.split(r'\W+', text)
    text = [ps.stem(word) for word in tokens if word not in stopwords]
    return text

tf_vect = TfidfVectorizer(analyzer=clean_text)
x_tf = tf_vect.fit_transform(data.body_text)
print(x_tf.shape)
print(tf_vect.get_feature_names())
```
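A small sketch of the weighting behavior on a toy corpus: a word that occurs in every document gets a lower weight than a document-specific word. (Note sklearn uses a smoothed IDF plus L2 normalization, so the exact numbers differ from the plain formula above.)

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# "apple" occurs in every document, so its IDF (and hence weight) is low
corpus = ["apple banana", "apple cherry", "apple durian"]

tf_vect = TfidfVectorizer()
X = tf_vect.fit_transform(corpus)

# weights of each term in the first document (columns sorted alphabetically)
weights = dict(zip(sorted(tf_vect.vocabulary_), X.toarray()[0]))
print(weights)  # 'banana' outweighs the corpus-wide 'apple'
```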
Feature Engineering Module
Definition: creating new features or transforming existing features to get the most out of the data.
10. Creating New Features
Examples: text length, and the proportion of punctuation characters in the text.

```python
import string
import numpy as np
from matplotlib import pyplot

data["len"] = data["body_text"].apply(lambda x: len(x) - x.count(" "))

def punct_pct(text):
    count = sum([1 for char in text if char in string.punctuation])
    pct = count / (len(text) - text.count(" "))
    return round(pct, 3) * 100

data["punct%"] = data.body_text.apply(lambda x: punct_pct(x))

# density=True replaces the normed=True argument removed in modern matplotlib
bins = np.linspace(0, 200, 40)
pyplot.hist(data[data.label == 'spam']['len'], bins, alpha=0.5, density=True, label="spam")
pyplot.hist(data[data.label == 'ham']['len'], bins, alpha=0.5, density=True, label="ham")
pyplot.legend(loc="upper left")

bins = np.linspace(0, 50, 40)
pyplot.hist(data[data.label == 'spam']['punct%'], bins, alpha=0.5, density=True, label="spam")
pyplot.hist(data[data.label == 'ham']['punct%'], bins, alpha=0.5, density=True, label="ham")
pyplot.legend(loc="upper right")
```
11. Transformation
Process:
1. Determine what range of exponents to test
2. Apply each transformation to each value of your chosen feature
3. Use some criteria to determine which of the transformations yields the best distribution

```python
for i in range(1, 6):
    pyplot.hist(data["punct%"] ** (1 / i), bins=40)
    pyplot.title("transform: 1/{}".format(str(i)))
    pyplot.show()
```