文本数据处理

作者:欧新宇(Xinyu OU)

本文档所展示的测试结果,均运行于:Intel Core i7-7700K CPU 4.2GHz

1. 文本数据的特征提取、中文分词及词袋模型

2. 对文本数据进一步进行优化处理

In [ ]:
 
In [1]:
from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer()
en = ['The quick brown fox jumps over a lazy dog']
vect.fit(en)
print('单词数:{}'.format(len(vect.vocabulary_)))
print('分词:{}'.format(vect.vocabulary_))
单词数:8
分词:{'the': 7, 'quick': 6, 'brown': 0, 'fox': 2, 'jumps': 3, 'over': 5, 'lazy': 4, 'dog': 1}
In [2]:
vect
Out[2]:
CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)
In [3]:
cn = ['那只敏捷的棕色狐狸跳过了一只懒惰的狗']
vect.fit(cn)
print('单词数:{}'.format(len(vect.vocabulary_)))
print('分词:{}'.format(vect.vocabulary_))
单词数:1
分词:{'那只敏捷的棕色狐狸跳过了一只懒惰的狗': 0}
In [4]:
import jieba
cn = jieba.cut('那只敏捷的棕色狐狸跳过了一只懒惰的狗')
cn = [' '.join(cn)]
print(cn)
Building prefix dict from the default dictionary ...
Loading model from cache D:\Temp\jieba.cache
Loading model cost 1.550 seconds.
Prefix dict has been built succesfully.
['那 只 敏捷 的 棕色 狐狸 跳过 了 一只 懒惰 的 狗']
In [5]:
vect.fit(cn)
print('单词数:{}'.format(len(vect.vocabulary_)))
print('分词:{}'.format(vect.vocabulary_))
单词数:6
分词:{'敏捷': 2, '棕色': 3, '狐狸': 4, '跳过': 5, '一只': 0, '懒惰': 1}
In [6]:
bag_of_words = vect.transform(cn)
print('转化为词袋的特征:\n{}'.format(repr(bag_of_words)))
转化为词袋的特征:
<1x6 sparse matrix of type '<class 'numpy.int64'>'
	with 6 stored elements in Compressed Sparse Row format>
In [7]:
print(bag_of_words)
  (0, 0)	1
  (0, 1)	1
  (0, 2)	1
  (0, 3)	1
  (0, 4)	1
  (0, 5)	1
In [8]:
print('词袋的密度表达:\n{}'.format(bag_of_words.toarray()))
词袋的密度表达:
[[1 1 1 1 1 1]]
In [9]:
cn_1 = jieba.cut('懒惰的狐狸不如敏捷的狐狸敏捷,敏捷的狐狸不如懒惰的狐狸懒惰')
cn2 = [' '.join(cn_1)]
print(cn2)
['懒惰 的 狐狸 不如 敏捷 的 狐狸 敏捷 , 敏捷 的 狐狸 不如 懒惰 的 狐狸 懒惰']
In [10]:
new_bag = vect.transform(cn2)
print('转化为词袋的特征:\n{}'.format(repr(new_bag)))
print('词袋的密度表达:\n{}'.format(new_bag.toarray()))
转化为词袋的特征:
<1x6 sparse matrix of type '<class 'numpy.int64'>'
	with 3 stored elements in Compressed Sparse Row format>
词袋的密度表达:
[[0 3 3 0 4 0]]
In [11]:
joke = jieba.cut('道士看见和尚亲吻了尼姑的嘴唇')
joke = [' '.join(joke)]
vect.fit(joke)
joke_feature = vect.transform(joke)
print('这句话的特征表达:\n{}'.format(joke_feature.toarray()))
这句话的特征表达:
[[1 1 1 1 1 1]]
In [12]:
joke2 = jieba.cut('尼姑看见道士的嘴唇亲吻了和尚')
joke2 = [' '.join(joke2)]
joke2_feature = vect.transform(joke2)
print('这句话的特征表达:\n{}'.format(joke2_feature.toarray()))
这句话的特征表达:
[[1 1 1 1 1 1]]
In [13]:
vect = CountVectorizer(ngram_range=(2,2))
cv = vect.fit(joke)
joke_feature = cv.transform(joke)
print('调整n-Gram参数后的词典:{}'.format(cv.get_feature_names()))
print('新的特征表达:{}'.format(joke_feature.toarray()))
调整n-Gram参数后的词典:['亲吻 尼姑', '和尚 亲吻', '尼姑 嘴唇', '看见 和尚', '道士 看见']
新的特征表达:[[1 1 1 1 1]]
In [14]:
joke2 = jieba.cut('尼姑看见道士的嘴唇亲吻了和尚')
joke2 = [' '.join(joke2)]
joke2_feature = vect.transform(joke2)
print('这句话的特征表达:\n{}'.format(joke2_feature.toarray()))
这句话的特征表达:
[[0 0 0 0 0]]
In [15]:
from sklearn.feature_extraction.text import TfidfVectorizer
tf = TfidfVectorizer().fit(joke)
print(tf.get_feature_names())
['亲吻', '和尚', '嘴唇', '尼姑', '看见', '道士']
In [16]:
!tree ACLIMDB
#请将aclImdb替换成你放置数据集的文件夹地址
卷 Windows 的文件夹 PATH 列表
卷序列号为 00000169 2889:5EA9
C:\USERS\CHAO\DOCUMENTS\JUPYTER NOTEBOOK\ACLIMDB
├─test
│  ├─neg
│  └─pos
└─train
    ├─neg
    ├─pos
    └─unsup
In [17]:
from sklearn.datasets import load_files
train_set = load_files('Imdblite/train/')
X_train, y_train = train_set.data, train_set.target
print('训练集文件数量:{}'.format(len(X_train)))
print('\n随机抽一个看看:\n', X_train[22])
训练集文件数量:100

随机抽一个看看:
 b"All I could think of while watching this movie was B-grade slop. Many have spoken about it's redeeming quality is how this film portrays such a realistic representation of the effects of drugs and an individual and their subsequent spiral into a self perpetuation state of unfortunate events. Yet really, the techniques used (as many have already mentioned) were overused and thus unconvincing and irrelevant to the film as a whole.<br /><br />As far as the plot is concerned, it was lacklustre, unimaginative, implausible and convoluted. You can read most other reports on this film and they will say pretty much the same as I would.<br /><br />Granted some of the actors and actresses are attractive but when confronted with such boring action... looks can only carry a film so far. The action is poor and intermittent: a few punches thrown here and there, and a final gunfight towards the end. Nothing really to write home about.<br /><br />As others have said, 'BAD' movies are great to watch for the very reason that they are 'bad', you revel in that fact. This film, however, is a void. It's nothing.<br /><br />Furthermore, if one is really in need of an educational movie to scare people away from drug use then I would seriously recommend any number of other movies out there that board such issues in a much more effective way. 'Requiem For A Dream', 'Trainspotting', 'Fear and Loathing in Las Vegas' and 'Candy' are just a few examples. Though one should also check out some more lighthearted films on the same subject like 'Go' (overall, both serious and funny) and 'Halfbaked'.<br /><br />On a final note, the one possibly redeeming line in this movie, delivered by Vinnie Jones was stolen from 'Lock, Stock and Two Smokling Barrels'. To think that a bit of that great movie has been tainted by 'Loaded' is vile.<br /><br />Overall, I strongly suggest that you save you money and your time by NOT seeing this movie."
In [18]:
X_train = [doc.replace(b'<br />', b' ') for doc in X_train]
In [19]:
test = load_files('Imdblite/test/')
X_test, y_test = test.data, test.target
X_test = [doc.replace(b'<br />', b' ') for doc in X_test]
len(X_test)
Out[19]:
100
In [20]:
vect = CountVectorizer().fit(X_train)
X_train_vect = vect.transform(X_train)
print('训练集样本特征数量:{}'.format(len(vect.get_feature_names())))
print('最后10个训练集样本特征:{}'.format(vect.get_feature_names()[-10:]))
训练集样本特征数量:3941
最后10个训练集样本特征:['young', 'your', 'yourself', 'yuppie', 'zappa', 'zero', 'zombie', 'zoom', 'zooms', 'zsigmond']
In [21]:
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score
scores = cross_val_score(LinearSVC(), X_train_vect, y_train)
print('模型平均分:{:.3f}'.format(scores.mean()))
模型平均分:0.778
In [22]:
X_test_vect = vect.transform(X_test)
clf = LinearSVC().fit(X_train_vect, y_train)
print('测试集模型得分:{}'.format(clf.score(X_test_vect, y_test)))
测试集模型得分:0.58
In [23]:
from sklearn.feature_extraction.text import TfidfTransformer
tfidf = TfidfTransformer(smooth_idf = False)
tfidf.fit(X_train_vect)
X_train_tfidf = tfidf.transform(X_train_vect)
X_test_tfidf = tfidf.transform(X_test_vect)
print('未经tfidf处理的特征:\n',X_train_vect[:5,:5].toarray())
print('经过tfidf处理的特征:\n',X_train_tfidf[:5,:5].toarray())
未经tfidf处理的特征:
 [[0 0 0 0 0]
 [0 0 0 0 0]
 [0 1 0 0 0]
 [0 0 0 0 0]
 [0 0 0 0 0]]
经过tfidf处理的特征:
 [[ 0.          0.          0.          0.          0.        ]
 [ 0.          0.          0.          0.          0.        ]
 [ 0.          0.13862307  0.          0.          0.        ]
 [ 0.          0.          0.          0.          0.        ]
 [ 0.          0.          0.          0.          0.        ]]
In [24]:
clf = LinearSVC().fit(X_train_tfidf, y_train)
scores2 = cross_val_score(LinearSVC(), X_train_tfidf, y_train)

print('经过tf-idf处理的训练集交叉验证得分:{:.3f}'.format(scores.mean()))
print('经过tf-idf处理的测试集得分:{:.3f}'.format(clf.score(X_test_tfidf,
                                                y_test)))
经过tf-idf处理的训练集交叉验证得分:0.778
经过tf-idf处理的测试集得分:0.580
In [32]:
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
print('停用词个数:', len(ENGLISH_STOP_WORDS))
print('列出前20个和最后20个:\n', list(ENGLISH_STOP_WORDS)[:20],
     list(ENGLISH_STOP_WORDS)[-20:])
停用词个数: 318
列出前20个和最后20个:
 ['around', 'fifty', 'together', 'un', 'very', 'across', 'next', 'amongst', 'nor', 'first', 'more', 'its', 'de', 'serious', 'wherein', 'wherever', 'who', 'cry', 'full', 'after'] ['please', 'myself', 'himself', 'or', 'however', 'seems', 'almost', 'within', 'the', 'made', 'your', 'become', 'amoungst', 'all', 'him', 'had', 'already', 'least', 'it', 'anyone']
In [37]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer(smooth_idf = False, stop_words = 'english')
tfidf.fit(X_train)
X_train_tfidf = tfidf.transform(X_train)
scores3 = cross_val_score(LinearSVC(), X_train_tfidf, y_train)
clf.fit(X_train_tfidf, y_train)
X_test_tfidf = tfidf.transform(X_test)
print('去掉停用词后训练集交叉验证平均分:{:.3f}'.format(scores3.mean()))
print('去掉停用词后测试集模型得分:{:.3f}'.format(clf.score(X_test_tfidf, 
                                              y_test)))
去掉停用词后训练集交叉验证平均分:0.890
去掉停用词后测试集模型得分:0.670
In [ ]: