Contents
Basics Module
1. Concepts
2. Regular Expressions
Data Cleaning Module
3. Removing Punctuation
4. Tokenization
5. Removing Stop Words
6. Stemming & Lemmatizing
stemming
lemmatizing
Similarities and differences
Vectorizing Module
7. Count Vectorization
8. N-gram Vectorization
9. Inverse Document Frequency Weighting
Feature Engineering Module
10. Creating New Features
11. Transformation
Basics Module
1. Concepts
NLP aims to enable machines to understand human language.
Typical tasks include:
sentiment analysis
topic modeling
text classification
sentence segmentation / part-of-speech tagging
Typical pipeline:
raw text → tokenize → clean text (remove stop words, punctuation, stemming) → vectorize (convert to numeric form)
2. Regular Expressions
Before diving into NLP, it helps to learn regular expressions first.
In Python, use the re module.
re.split(r'\s+', string) splits the string on the pattern given as the first argument
re.findall(pattern, string) finds all occurrences of pattern in the string
re.sub(pattern, replacement, string) replaces every match of pattern in the string with the replacement
Regular expressions are case sensitive.
The regex metacharacters are: . ^ $ * + ? { } [ ] \ | ( )
. matches any character except a newline
^ anchors the match to the start of the target string, e.g. ^hell matches "hello" and "hellboy"
$ anchors the match to the end of the target string, e.g. ar$ matches "car" and "bar"
* the preceding character must appear zero or more consecutive times
+ the preceding character must appear one or more consecutive times
? the preceding character must appear zero times or once
{n} matches exactly n occurrences, e.g. o{2} matches "oo"
{n,} matches at least n occurrences, e.g. o{2,} matches "oo", "ooo", "oooo", ...
{n,m} matches at least n and at most m occurrences, e.g. o{2,3} matches "oo" and "ooo"
[A-Z] any uppercase letter from A to Z
[a-z] any lowercase letter from a to z
[0-9] any digit from 0 to 9, equivalent to \d
[A-Za-z0-9] any letter or digit (note: \w additionally matches the underscore)
[^...] negation inside a character class, not to be confused with the ^ anchor above
\ escape character
| pipe symbol; if A and B are arbitrary REs, then A|B is a new RE that matches either A or B
\s matches a single whitespace character, including tab and newline
\S matches any single character that is not whitespace
\d matches a digit 0-9
\w matches a letter, digit, or underscore
\W matches any character that \w does not match
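The three re functions above can be sketched on a small made-up string (the sample text is arbitrary):

```python
import re

text = "NLP   is fun, NLP is useful"

# split on one or more whitespace characters
tokens = re.split(r'\s+', text)

# find every occurrence of a pattern
hits = re.findall(r'NLP', text)

# replace every match with another string
replaced = re.sub(r'NLP', 'Natural Language Processing', text)

print(tokens)    # ['NLP', 'is', 'fun,', 'NLP', 'is', 'useful']
print(hits)      # ['NLP', 'NLP']
print(replaced)
```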
Data Cleaning Module
3. Removing Punctuation

```python
import string

def remove_punct(text):
    text_nopunct = "".join([char for char in text if char not in string.punctuation])
    return text_nopunct
```

If you use [char for char in text if char not in string.punctuation] alone, without the join, you get a list of individual characters. Note that the square brackets are required to form a list; parentheses would return a generator instead.

```python
data['body_text_clean'] = data.body_text.apply(lambda x: remove_punct(x))
```
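A quick, self-contained usage sketch of the punctuation-stripping logic above (the sample sentence is made up):

```python
import string

def remove_punct(text):
    # keep only characters that are not in string.punctuation
    return "".join([char for char in text if char not in string.punctuation])

print(remove_punct("Hi, there! How are you?"))  # Hi there How are you
```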
4. 切分词功能
def tokenize(text):tokens = re.split('\W+', text)return tokens data['tokenized'] = data['clean'].apply(lambda x: tokenize(x.lower()))
此处Tokenize过程是相对于英文而言的。中文单词因为是由汉字复杂构成,将另外讨论。
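Splitting on r'\W+' works for simple English text, but note one gotcha: if the string ends with a non-word character, re.split produces a trailing empty string. A minimal sketch on a hypothetical sentence:

```python
import re

tokens = re.split(r'\W+', "hello, world!")
print(tokens)  # ['hello', 'world', ''] — trailing '' because the string ends in '!'
```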
5. 去掉停止词
import nltknltk.download("stopwords")stopword = nltk.corpus.stopwords.words('english')def remove_stopwords(tokenized_list):text = [word for word in tokenized_list if word not in stopword]return text
data["nostop"] = data.tokenized.apply(lambda x : remove_stopwords(x))
同样此处只讨论英文的停止词。
6. Stemming & Lemmatizing
stemming
Definition: the process of reducing inflected (or sometimes derived) words to their word stem or root.

```python
import nltk

ps = nltk.PorterStemmer()

def stemming(text):
    text = [ps.stem(word) for word in text]
    return text
```

lemmatizing
Definition: the process of grouping together the inflected forms of a word so they can be analyzed as a single term, identified by the word's lemma.

```python
import nltk

wn = nltk.WordNetLemmatizer()

def lemmatizing(text):
    text = [wn.lemmatize(word) for word in text]
    return text
```
Similarities and differences
Similarity:
both share the common goal of condensing derived words into their base forms
Differences:
stemming is faster, as it simply chops off the end of a word without understanding the context
lemmatizing is more accurate, as it uses analysis to create groups of words with similar meaning based on the context
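A quick sketch of the "chops without context" point: the Porter stemmer's output need not be a real word, whereas a lemmatizer would return the dictionary form (the sample words are arbitrary):

```python
import nltk

ps = nltk.PorterStemmer()

# Stemming strips suffixes mechanically, so the result may not be a word:
print(ps.stem("studies"))  # studi  (a lemmatizer would give "study")
print(ps.stem("running"))  # run
```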
Vectorizing Module
7. Count Vectorization

```python
import re
import string
from sklearn.feature_extraction.text import CountVectorizer

def clean_text(text):
    text = "".join([word.lower() for word in text if word not in string.punctuation])
    tokens = re.split(r'\W+', text)
    text = [ps.stem(word) for word in tokens if word not in stopwords]
    return text

count_vect = CountVectorizer(analyzer=clean_text)
X_counts = count_vect.fit_transform(data.body_text)
X_counts.shape
count_vect.get_feature_names()
```

The resulting X_counts is a sparse matrix.
Sparse Matrix: A matrix in which most entries are 0. In the interest of efficient storage, a sparse matrix will be stored by only storing the locations of the non-zero elements.
We can convert it into a DataFrame:

```python
x_counts_df = pd.DataFrame(X_counts.toarray())
x_counts_df.columns = count_vect.get_feature_names()
x_counts_df
```
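A minimal, self-contained sketch of the document-term matrix that fit_transform produces, on a toy two-sentence corpus (the sentences are made up; columns are ordered alphabetically by term):

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = ["nlp is fun", "nlp is hard nlp"]

count_vect = CountVectorizer()
X = count_vect.fit_transform(corpus)

print(sorted(count_vect.vocabulary_))  # ['fun', 'hard', 'is', 'nlp']
print(X.toarray())
# [[1 0 1 1]
#  [0 1 1 2]]   <- "nlp" appears twice in the second document
```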
8. N-gram Vectorization
This method is very similar to Count Vectorization.
Creates a document-term matrix where counts still occupy the cell but instead of the columns representing single terms, they represent all combinations of adjacent words of length n in your text.
For example, the bigrams of "NLP is an interesting topic" are "NLP is", "is an", "an interesting", and "interesting topic".

```python
def clean_text(text):
    text = "".join([word.lower() for word in text if word not in string.punctuation])
    tokens = re.split(r'\W+', text)
    # pay attention: unlike Count Vectorization, the stemmed tokens are joined
    # back into a single string, because the vectorizer builds the n-grams itself
    text = " ".join([ps.stem(word) for word in tokens if word not in stopwords])
    return text

data['cleaned_text'] = data['body_text'].apply(lambda x: clean_text(x))
ngram_vect = CountVectorizer(ngram_range=(2,2))
x_count = ngram_vect.fit_transform(data.cleaned_text)
```

The ngram_range parameter sets the range of n. This example takes bigrams only. Subsequent processing is the same as for Count Vectorization.
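The bigram example above can be checked directly with a self-contained one-sentence sketch:

```python
from sklearn.feature_extraction.text import CountVectorizer

# ngram_range=(2, 2) means: only bigrams (adjacent word pairs)
ngram_vect = CountVectorizer(ngram_range=(2, 2))
X = ngram_vect.fit_transform(["nlp is an interesting topic"])

print(sorted(ngram_vect.vocabulary_))
# ['an interesting', 'interesting topic', 'is an', 'nlp is']
```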
9. Inverse Document Frequency Weighting
TF-IDF: Creates a document-term matrix where the columns represent single unique terms (unigrams) but the cell represents a weighting meant to represent how important a word is to a document.
TF-IDF is simply TF * IDF, where TF is the term frequency and IDF is the inverse document frequency; in the classic formulation, the weight of term i in document j is w(i,j) = tf(i,j) × log(N / df(i)), with N the number of documents and df(i) the number of documents containing term i. TF-IDF is a statistical method for evaluating how important a word is to one document in a collection or corpus. A word's importance increases with the number of times it appears in the document, but decreases with its frequency across the corpus. If a word or phrase has a high TF in one article yet rarely appears in other articles, it is considered to discriminate well between categories and is well suited for classification.

```python
import re
import string
from sklearn.feature_extraction.text import TfidfVectorizer

def clean_text(text):
    text = "".join([word.lower() for word in text if word not in string.punctuation])
    tokens = re.split(r'\W+', text)
    text = [ps.stem(word) for word in tokens if word not in stopwords]
    return text

tf_vect = TfidfVectorizer(analyzer=clean_text)
x_tf = tf_vect.fit_transform(data.body_text)
print(x_tf.shape)
print(tf_vect.get_feature_names())
```
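A small sketch of the weighting behavior on a toy corpus: a word that occurs in every document gets a lower weight than a document-specific word. (Note sklearn uses a smoothed IDF plus L2 normalization, so the exact numbers differ from the plain formula above.)

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# "apple" occurs in every document, so its IDF (and hence weight) is low
corpus = ["apple banana", "apple cherry", "apple durian"]

tf_vect = TfidfVectorizer()
X = tf_vect.fit_transform(corpus)

# weights of each term in the first document (columns sorted alphabetically)
weights = dict(zip(sorted(tf_vect.vocabulary_), X.toarray()[0]))
print(weights)  # 'banana' outweighs the corpus-wide 'apple'
```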
Feature Engineering Module
Definition: creating new features or transforming existing features to get the most out of the data.
10. Creating New Features
Examples: text length, and the proportion of punctuation characters in the text.

```python
import string
import numpy as np
from matplotlib import pyplot

data["len"] = data["body_text"].apply(lambda x: len(x) - x.count(" "))

def punct_pct(text):
    count = sum([1 for char in text if char in string.punctuation])
    pct = count / (len(text) - text.count(" "))
    return round(pct, 3) * 100

data["punct%"] = data.body_text.apply(lambda x: punct_pct(x))

# density=True replaces the normed=True argument removed in modern matplotlib
bins = np.linspace(0, 200, 40)
pyplot.hist(data[data.label == 'spam']['len'], bins, alpha=0.5, density=True, label="spam")
pyplot.hist(data[data.label == 'ham']['len'], bins, alpha=0.5, density=True, label="ham")
pyplot.legend(loc="upper left")

bins = np.linspace(0, 50, 40)
pyplot.hist(data[data.label == 'spam']['punct%'], bins, alpha=0.5, density=True, label="spam")
pyplot.hist(data[data.label == 'ham']['punct%'], bins, alpha=0.5, density=True, label="ham")
pyplot.legend(loc="upper right")
```
11. Transformation
Process:
1. Determine what range of exponents to test
2. Apply each transformation to each value of your chosen feature
3. Use some criteria to determine which of the transformations yields the best distribution

```python
for i in range(1, 6):
    pyplot.hist(data["punct%"] ** (1 / i), bins=40)
    pyplot.title("transform: 1/{}".format(str(i)))
    pyplot.show()
```