利用python进行数据预处理

pandas是个很方便的lib,来探索数据分布,查看缺失值、异常值

问题:利用图书的特征、用户的特征、用户的行为做图书的推荐

下面只介绍了图书特征的预处理:

图书维表 共200164行 X 38列

字段格式:

1
bookid|bookname|authorid|typeid|classid|chargetype|totalfee|serialize|publishtype|wordcount|onshelfwordcount|format|isfinish|waitonshelfcount|onshelfcount|lastonshelfchaptertime|downloadflag|feestartchapter|sawtime|avescore|commonlabel|expressioneva|themeeva|enactmenteva|lightenesseva|cadenceeva|charactereva|contenteva|roofsuiteva|scenarioeva|stylewritingeva|sexs|ages|peoplegroup|sortlabel|wwwdesc|wapdesc|keyword

数据样本:

1
2
3
451267|掌控生命|98295087|1|31|2|8|1|0|725981|804122|3|0|0|264|20120723142502|0|20|20090514165645|||||||||||||||||在未来世界,人类都被迫集中到一座座巨型城市中,战斗成为最热话题,异能者是最耀眼明星。在这个残酷的世界,高守想掌控自己的命运,只有掌控强大的异能一途  |在未来世界,人类都被迫集中到一座座巨型城市中,战斗成为最热话题,异能者是最耀眼明星。在这个残酷的世界,高守想掌控自己的命运,只有掌控强大的异能一途  |都市;异能;高守
488834|望月|55206452|1|479|1|200|0|1|60374|60374|2|1||14|20170413095704|1|4|20090519151035|||||||||||||||||望月|望月|望月
503526|三国风云之猛将传|74251700|1|14|2|12|1|0|718921|718921|3|1|0|215|20101112154633|0|20|20090515180037|7.6|热血;爽感强;三国|||||8|7|||8|7||||三国演义|现代人薛冰,莫名其妙的来到了三国时代,成为刘备手下的一名将领。他会如何在这个乱世生存下去?前面,等待他的又有什么?|现代人薛冰,莫名其妙的来到了三国时代,成为刘备手下的一名将领。他会如何在这个乱世生存下去?前面,等待他的又有什么?|三国;猛将;薛冰

读取数据,初选特征

根据需要,先选择出可用的特征,共18列

1
bookid, classid, chargetype, totalfee, serialize, publishtype, wordcount, onshelfwordcount, isfinish, waitonshelfcount, onshelfcount, lastonshelfchaptertime, downloadflag, feestartchapter, sawtime, sexs, ages, peoplegroup

探索缺失值

1
2
3
4
5
6
7
8
9
10
11
12
13
14
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
names = ['bookid', 'bookname', 'authorid', 'typeid', 'classid', 'chargetype', 'totalfee', 'serialize', 'publishtype', 'wordcount', 'onshelfwordcount', 'format', 'isfinish', 'waitonshelfcount', 'onshelfcount', 'lastonshelfchaptertime', 'downloadflag', 'feestartchapter', 'sawtime', 'avescore', 'commonlabel', 'expressioneva', 'themeeva', 'enactmenteva', 'lightenesseva', 'cadenceeva', 'charactereva', 'contenteva', 'roofsuiteva', 'scenarioeva', 'stylewritingeva', 'sexs', 'ages', 'peoplegroup', 'sortlabel', 'wwwdesc', 'wapdesc', 'keywords']
d = {'bookid': str,'classid': str, 'lastonshelfchaptertime': str, 'feestartchapter': str, 'sawtime':str}
df = pd.read_table('dim_bookinfo.txt', sep="|", quoting=3, names = names, dtype =d)
c18 = ['bookid','classid', 'chargetype', 'totalfee', 'serialize', 'publishtype', 'wordcount', 'onshelfwordcount', 'isfinish','waitonshelfcount', 'onshelfcount', 'lastonshelfchaptertime', 'downloadflag', 'feestartchapter', 'sawtime', 'sexs', 'ages', 'peoplegroup']
df = df[c18]
print(df.shape)
print(df.shape[0]- df.count()) #统计缺失值数量
print(1 - df.count()/df.shape[0]) #统计缺失值比例

1

waitonshelfcount、sexs、ages、peoplegroup缺失值太多,删除

最终选出14列:

bookid图书id | classid 分类id | chargetype 收费类型 | totalfee 总价格 | serialize 是否连载 | publishtype 是否出版 | wordcount 总字数 | onshelfwordcount 已上架章节的总字数 | isfinish 是否完本 | onshelfcount 已上架章节数 | lastonshelfchaptertime 最后上架章节的更新时间 | downloadflag 是否可下载 | feestartchapter 收费起始章节 | sawtime 入库时间


缺失值填充

缺失列: wordcount 总字数| lastonshelfchaptertime 最后上架章节的更新时间 | feestartchapter 收费起始章节 | sawtime 入库时间
填补方法:
a. feestartchapter列缺失项填 0;
b. wordcount列缺失项取平均值;
c. lastonshelfchaptertime列和sawtime 列互相补充,若两项均为空,则填补成最早的时间;
d. 对缺失的classid项“$”,替换成“-1”。

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
c14 = ['bookid','classid', 'chargetype', 'totalfee', 'serialize', 'publishtype', 'wordcount', 'onshelfwordcount', 'isfinish', 'onshelfcount', 'lastonshelfchaptertime', 'downloadflag', 'feestartchapter', 'sawtime']
df = df[c14]
df2 = df.replace("$","-1")
#classid(图书分类)列发现了一些"$",替换成"-1",作为单独一类
df2['lastonshelfchaptertime'][df2['lastonshelfchaptertime'].isnull()] = df2['sawtime']
df2['sawtime'][df2['sawtime'].isnull()] = df2['lastonshelfchaptertime']
#sawtime(图书入库时间)和lastonshelfchaptertime(最后上架章节时间)互相补充
print(df.shape[0]- df.count()) #统计缺失值数量
df2 = df2.fillna({'feestartchapter': "0", 'wordcount': df['wordcount'].mean(), 'sawtime': "20090919202213", 'lastonshelfchaptertime': "20090919202213"})
#feestartchapter(免费起始章节)补0,wordcount(总字数)补均值,剩余的sawtime(图书入库时间)和lastonshelfchaptertime(最后上架章节时间)补最早的日期20090919202213
print(df.shape[0]- df.count()) #统计缺失值数量
df2.to_csv("out.txt", sep="|", index = False) #保留异常值处理前后的数据

2

最终没有缺失值了


探索数据分布

除bookid及时间特征lastonshelfchaptertime、sawtime共3个以外,画出其他特征的数据直方图:

1
2
3
4
5
6
7
8
c11 = ['classid', 'chargetype', 'totalfee', 'serialize', 'publishtype', 'wordcount', 'onshelfwordcount', 'isfinish', 'onshelfcount', 'downloadflag', 'feestartchapter']
for i in range(11):
print(c11[i])
plt.subplot(3,4,i+1)
plt.hist(df2[c11[i]].astype(int), bins=50, normed=False)
plt.title(c11[i])
plt.show()

3

分析图可得,其中classid、chargetype、serialize、publishtype、isfinish、downloadflag这6个是类别特征,之后做one-hot变换。
剩下的totalfee、wordcount、onshelfwordcount、onshelfcount、feestartchapter这5个特征需要进行异常值检测。

异常值检测

需要异常值检测的5列: totalfee 总价格 | wordcount 总字数 | onshelfwordcount 已上架总字数 | onshelfcount 已上架章节数 | feestartchapter 收费起始章节

1
2
3
4
5
c5 = ['totalfee', 'wordcount', 'onshelfwordcount', 'onshelfcount', 'feestartchapter']
for c in c5 :
df3 = df2[[c]].astype(int)
p = df3.boxplot(sym='r+',vert=False, patch_artist=True, meanline=True, showmeans=True)
plt.show()

4
5
6
7
8

异常值处理

将超过异常值上界的值替换成上界。(此处上界自己定,比上四分位数+1.5*四分位距多一些也行)
totalfee异常值上界可定为: 3000
wordcount的异常值上界可定为: 500万
onshelfwordcount的异常值上界可定为: 500万
onshelfcount的异常值上界可定为: 2500
feestartchapter的异常值上界可定为: 100

1
2
3
4
5
6
7
8
df2.loc[df2['totalfee']> 3000, 'totalfee'] = 3000
df2.loc[df2['wordcount']> 5000000, 'wordcount'] = 5000000
df2.loc[df2['onshelfwordcount']> 5000000, 'onshelfwordcount'] = 5000000
df2.loc[df2['onshelfcount']> 2500, 'onshelfcount'] = 2500
df2.loc[df2['feestartchapter'].astype(int)> 100, 'feestartchapter'] = 100
#print(df2[['totalfee']].describe())
df2.to_csv("out1.txt", sep="|", index = False)

数据变换

之后用sklearn处理。

One-hot变换

需要One-hot变换的6列: classid 分类id | chargetype 收费类型 | serialize 是否连载 | publishtype 是否出版 | isfinish 是否完本 | downloadflag 是否可下载

特殊变换

需要特殊变换的列: lastonshelfchaptertime 最后上架章节的更新时间 | sawtime 入库时间

规范化变换

需要规范化的列:所有连续值列


附完整代码

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
names = ['bookid', 'bookname', 'authorid', 'typeid', 'classid', 'chargetype', 'totalfee', 'serialize', 'publishtype', 'wordcount', 'onshelfwordcount', 'format', 'isfinish', 'waitonshelfcount', 'onshelfcount', 'lastonshelfchaptertime', 'downloadflag', 'feestartchapter', 'sawtime', 'avescore', 'commonlabel', 'expressioneva', 'themeeva', 'enactmenteva', 'lightenesseva', 'cadenceeva', 'charactereva', 'contenteva', 'roofsuiteva', 'scenarioeva', 'stylewritingeva', 'sexs', 'ages', 'peoplegroup', 'sortlabel', 'wwwdesc', 'wapdesc', 'keywords']
d = {'bookid': str,'classid': str, 'lastonshelfchaptertime': str, 'feestartchapter': str, 'sawtime':str}
df = pd.read_table('dim_bookinfo.txt', sep="|",quoting=3, names = names, dtype =d)
c18 = ['bookid','classid', 'chargetype', 'totalfee', 'serialize', 'publishtype', 'wordcount', 'onshelfwordcount', 'isfinish','waitonshelfcount', 'onshelfcount', 'lastonshelfchaptertime', 'downloadflag', 'feestartchapter', 'sawtime', 'sexs', 'ages', 'peoplegroup']
df = df[c18]
print(df.shape)
print(df.shape[0]- df.count()) #统计缺失值数量
print(1 - df.count()/df.shape[0]) #统计缺失值比例
c14 = ['bookid','classid', 'chargetype', 'totalfee', 'serialize', 'publishtype', 'wordcount', 'onshelfwordcount', 'isfinish', 'onshelfcount', 'lastonshelfchaptertime', 'downloadflag', 'feestartchapter', 'sawtime']
df = df[c14]
df2 = df.replace("$","-1")
#classid(图书分类)列发现了一些"$",替换成"-1",作为单独一类
df2.loc[df2['lastonshelfchaptertime'].isnull(), 'lastonshelfchaptertime'] = df2['sawtime']
df2.loc[df2['sawtime'].isnull(), 'sawtime'] = df2['lastonshelfchaptertime']
#sawtime(图书入库时间)和lastonshelfchaptertime(最后上架章节时间)互相补充
print(df2.shape[0]- df2.count()) #统计缺失值数量
df2 = df2.fillna({'feestartchapter': "0", 'wordcount': df['wordcount'].mean(), 'sawtime': "20090919202213", 'lastonshelfchaptertime': "20090919202213"})
#feestartchapter(免费起始章节)补0,wordcount(总字数)补均值,剩余的sawtime(图书入库时间)和lastonshelfchaptertime(最后上架章节时间)补最早的日期20090919202213
print(df2.shape[0]- df2.count()) #统计缺失值数量
df2.to_csv("out.txt", sep="|", index = False) #保留异常值处理前后的数据
c11 = ['classid', 'chargetype', 'totalfee', 'serialize', 'publishtype', 'wordcount', 'onshelfwordcount', 'isfinish', 'onshelfcount', 'downloadflag', 'feestartchapter']
for i in range(11):
plt.subplot(3,4,i+1)
plt.hist(df2[c11[i]].astype(int), bins=50, normed=False)
plt.title(c11[i])
plt.show()
c5 = ['totalfee', 'wordcount', 'onshelfwordcount', 'onshelfcount', 'feestartchapter']
for c in c5 :
df3 = df2[[c]].astype(int)
p = df3.boxplot(sym='r+',vert=False, patch_artist=True, meanline=True, showmeans=True)
plt.show()
df2.loc[df2['totalfee']> 3000, 'totalfee'] = 3000
df2.loc[df2['wordcount']> 5000000, 'wordcount'] = 5000000
df2.loc[df2['onshelfwordcount']> 5000000, 'onshelfwordcount'] = 5000000
df2.loc[df2['onshelfcount']> 2500, 'onshelfcount'] = 2500
df2.loc[df2['feestartchapter'].astype(int)> 100, 'feestartchapter'] = 100
#print(df2[['totalfee']].describe())
df2.to_csv("out1.txt", sep="|", index = False)

END

0 Comments