程振兴 @czxa.top
Notes on the tutorial "和大熊猫们(Pandas)一起游戏吧!" (Let's Play with the Pandas!).
Why pandas?
An ordinary programmer handed a dataset would do something like this:

```python
import codecs
import requests
import numpy as np
import scipy as sp
import pandas as pd
import datetime
import json

# Download the iris dataset as a text file
r = requests.get('http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data')
with codecs.open('S1EP3_Iris.txt', 'w', encoding='utf-8') as f:
    f.write(r.text)

# Read it back and print it
with codecs.open('S1EP3_Iris.txt', 'r', encoding='utf-8') as f:
    lines = f.readlines()
    for line in lines:
        print(line)
```
The point of pandas is that it recognizes structured data right away. Reading the same data with pandas:

```python
irisdata = pd.read_csv('S1EP3_Iris.txt', header=None, encoding='utf-8')
irisdata.head()
# Result
Out[45]:
     0    1    2    3            4
0  5.1  3.5  1.4  0.2  Iris-setosa
1  4.9  3.0  1.4  0.2  Iris-setosa
2  4.7  3.2  1.3  0.2  Iris-setosa
3  4.6  3.1  1.5  0.2  Iris-setosa
4  5.0  3.6  1.4  0.2  Iris-setosa
```
It also makes the metadata easy to manipulate:

```python
# For example: column names
irisdata.columns = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'class']
irisdata.head()
```
Quick filtering:

```python
# For example, select the rows where petal_width takes its maximum value
irisdata[irisdata['petal_width'] == irisdata.petal_width.max()]
# Result
Out[46]:
     sepal_length  sepal_width  petal_length  petal_width           class
100           6.3          3.3           6.0          2.5  Iris-virginica
109           7.2          3.6           6.1          2.5  Iris-virginica
144           6.7          3.3           5.7          2.5  Iris-virginica
```
Slicing:

```python
# For example: every 30th row, and only the first two columns
irisdata.iloc[::30, :2]
# Result
Out[47]:
     sepal_length  sepal_width
0             5.1          3.5
30            4.8          3.1
60            5.0          2.0
90            5.5          2.6
120           6.9          3.2
```
Quick statistics:

```python
# Count how many of each iris class there are
print(irisdata['class'].value_counts())
# Result
Iris-virginica     50
Iris-versicolor    50
Iris-setosa        50
Name: class, dtype: int64
```
```python
# Per-column stats; only the first line survives in my notes, the body is my sketch
for x in range(0, 4):
    print(irisdata.iloc[:, x].describe())
```
Quick MapReduce (I don't fully understand this demo yet):

```python
slogs = lambda x: np.log(x) * x   # np.log here; the old sp.log alias was removed from SciPy
entpy = lambda x: np.exp((slogs(x.sum()) - x.map(slogs).sum()) / x.sum())
irisdata.groupby('class').agg(entpy)
# Result
                 sepal_length  sepal_width  petal_length  petal_width
class
Iris-setosa         49.878745    49.695242     49.654909    45.810069
Iris-versicolor     49.815081    49.680665     49.694505    49.452305
Iris-virginica      49.772059    49.714500     49.761700    49.545918
```
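Working through the lambdas (my own reading, not from the tutorial): with S = Σxᵢ and pᵢ = xᵢ/S, the exponent simplifies to (S·log S − Σxᵢ·log xᵢ)/S = −Σpᵢ·log pᵢ, so entpy returns exp of the Shannon entropy of each column within a group, i.e. the "effective number" of observations. For 50 nearly equal values that is just under 50, which matches the table.

```python
# Sanity check of that reading on one group/column
x = irisdata.loc[irisdata['class'] == 'Iris-setosa', 'sepal_length']
p = x / x.sum()
print(np.exp(-(p * np.log(p)).sum()))   # ~49.88, the first cell of the table above
```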
Welcome to the world of the pandas
pandas' important data types
- DataFrame
- Series
- index (row index)
Series
```python
# Series1 holds four standard-normal draws (the assignment is inferred from the output below)
Series1 = pd.Series(np.random.randn(4))
```
Series supports filtering on the same principle as NumPy:

```python
print(Series1 > 0)
0    False
1     True
2    False
3    False
dtype: bool

print(Series1[Series1 > 0])
1    0.173527
dtype: float64
```
Broadcasting works too:

```python
# Each element of the series is operated on individually
print(Series1 * 2)
0   -1.692684
1    0.347055
2   -0.654823
3   -0.646435
dtype: float64

print(Series1 + 5)
0    4.153658
1    5.173527
2    4.672589
3    4.676782
dtype: float64
```
As do universal functions:

```python
print(np.exp(Series1))
0    0.428981
1    1.189493
2    0.720787
3    0.723816
dtype: float64

f_np = np.frompyfunc(lambda x: np.exp(x * 2 + 5), 1, 1)
print(f_np(Series1))
0    27.3117
1    209.989
2    77.1057
3    77.7551
dtype: object

# lambda expressions
f = lambda x: pow(x, 2)
f(2)
Out[67]: 4
```
Using row labels on the series itself, instead of building a two-column table, makes it easy to tell which part is data and which is metadata:

```python
Series2 = pd.Series(Series1.values, index=['norm_' + str(i) for i in range(0, 4)])
print(Series2, type(Series2))
norm_0   -0.846342
norm_1    0.173527
norm_2   -0.327411
norm_3   -0.323218
dtype: float64 <class 'pandas.core.series.Series'>

print(Series2.index)
Index(['norm_0', 'norm_1', 'norm_2', 'norm_3'], dtype='object')

print(Series2.values)
[-0.84634196  0.17352745 -0.32741142 -0.32321775]

print(type(Series2.index))
<class 'pandas.core.indexes.base.Index'>
```
The rows are ordered, but data can still be reached through the row-level index:

```python
print(Series2[['norm_0', 'norm_3']])
norm_0   -0.846342
norm_3   -0.323218
dtype: float64
```
The following checks whether a label exists in the series' index:

```python
print('norm_0' in Series2)
True
print('norm_6' in Series2)
False
```
The default index behaves just like row numbers:

```python
print(Series1.index)
RangeIndex(start=0, stop=4, step=1)
```
Defining a Series from a dict (or an OrderedDict, whose keys cannot repeat) means never worrying about duplicate row indexes:

```python
Series3_dict = {"Japan": "Tokyo", "S.Korea": "Seoul", "China": "Beijing"}
Series3_pdSeries = pd.Series(Series3_dict)
print(Series3_pdSeries)
China      Beijing
Japan        Tokyo
S.Korea      Seoul
dtype: object

print(Series3_pdSeries.index)
Index(['China', 'Japan', 'S.Korea'], dtype='object')

print(Series3_pdSeries.values)
['Beijing' 'Tokyo' 'Seoul']
```
Want the series stored in an order of your choosing? Even missing values are no problem:

```python
Series4_IndexList = ["Japan", "China", "Singapoer", "S.Korea"]
Series4_pdSeries = pd.Series(Series3_dict, index=Series4_IndexList)
print(Series4_pdSeries)
Japan        Tokyo
China      Beijing
Singapoer      NaN
S.Korea      Seoul
dtype: object

print(Series4_pdSeries.values)
['Tokyo' 'Beijing' nan 'Seoul']

print(Series4_pdSeries.isnull())
Japan        False
China        False
Singapoer     True
S.Korea      False
dtype: bool

print(Series4_pdSeries.index)
Index(['Japan', 'China', 'Singapoer', 'S.Korea'], dtype='object')

print(Series4_pdSeries.notnull())
Japan         True
China         True
Singapoer    False
S.Korea       True
dtype: bool
```
Series-level metadata: name

```python
# Once the data series and its index both have names, later joins become much more convenient!
print(Series4_pdSeries.name)
None
print(Series4_pdSeries.index.name)
None

Series4_pdSeries.name = "Capital Series"
Series4_pdSeries.index.name = "Nation"
print(Series4_pdSeries)
Nation
Japan        Tokyo
China      Beijing
Singapoer      NaN
S.Korea      Seoul
Name: Capital Series, dtype: object
```
Row index labels may repeat, although this is not recommended:

```python
Series5_IndexList = ['A', 'B', 'B', 'C']
```
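Only the index list itself is here; a minimal sketch of the point (my own completion): selecting a duplicated label returns every matching row.

```python
# My sketch: with a duplicated label, selection yields a sub-Series, not a scalar
Series5 = pd.Series(np.random.randn(4), index=Series5_IndexList)
print(Series5['B'])   # both rows labelled 'B' come back
```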
DataFrame
Defining one from a NumPy 2-D array, from a file, or from a database: the data may be fine, but don't forget the column names:

```python
dn = np.asarray([('Japan', 'Tokyo', 4000), ('S.Korea', 'Seoul', 1200), ('China', 'Beijing', 910)])
df1 = pd.DataFrame(dn, columns=['nation', 'capital', 'gdp'])
df1
Out[101]:
    nation  capital   gdp
0    Japan    Tokyo  4000
1  S.Korea    Seoul  1200
2    China  Beijing   910
```
Equal-length columns stored in a dict (JSON-style): unfortunately, dict keys are unordered (note the column order in the result):

```python
dd = {'nation': ['Japan', 'S.Korea', 'China'], 'capital': ['Tokyo', 'Seoul', 'Beijing'], 'gdp': [4900, 1300, 9100]}
df2 = pd.DataFrame(dd)
df2
Out[104]:
   capital   gdp   nation
0    Tokyo  4900    Japan
1    Seoul  1300  S.Korea
2  Beijing  9100    China
```
Defining a DataFrame from another DataFrame:

```python
df21 = pd.DataFrame(df1, columns=['nation', 'capital', 'gdp'])
df21
Out[106]:
    nation  capital   gdp
0    Japan    Tokyo  4000
1  S.Korea    Seoul  1200
2    China  Beijing   910

df22 = pd.DataFrame(df2, columns=['nation', 'capital', 'gdp'], index=[2, 0, 1])
df22
Out[108]:
    nation  capital   gdp
2    China  Beijing  9100
0    Japan    Tokyo  4900
1  S.Korea    Seoul  1300
```
Pulling a column out of a DataFrame? Two ways (exactly as in JavaScript):
- the '.' notation can collide with reserved words and existing attributes
- the '[]' notation is always safe

```python
print(df22.nation, df22.capital)
2      China
0      Japan
1    S.Korea
Name: nation, dtype: object 2    Beijing
0      Tokyo
1      Seoul
Name: capital, dtype: object

print(df22['gdp'])
2    9100
0    4900
1    1300
Name: gdp, dtype: int64
```
Pulling rows out of a DataFrame:

```python
print(df22)
    nation  capital   gdp
2    China  Beijing  9100
0    Japan    Tokyo  4900
1  S.Korea    Seoul  1300

print(df22[0:1])
  nation  capital   gdp
2  China  Beijing  9100

# Take the first row (by position)
print(df22.iloc[0])
nation       China
capital    Beijing
gdp           9100
Name: 2, dtype: object

# Take the row whose index label is 0
print(df22.loc[0])
nation     Japan
capital    Tokyo
gdp         4900
Name: 0, dtype: object
```
The ultimate move: slicing just like NumPy:

```python
# Take the first row
print(df22.iloc[0, :])
nation       China
capital    Beijing
gdp           9100
Name: 2, dtype: object

# Take the first column
print(df22.iloc[:, 0])
2      China
0      Japan
1    S.Korea
Name: nation, dtype: object
```
Adding a column on the fly:

```python
df22['population'] = [1600, 130, 55]
df22
Out[123]:
    nation  capital   gdp  population
2    China  Beijing  9100        1600
0    Japan    Tokyo  4900         130
1  S.Korea    Seoul  1300          55
```
index
An index can be defined directly, just like a Series:

```python
index_names = ['a', 'b', 'c']
Series_for_index = pd.Series(index_names)
print(pd.Index(index_names))
Index(['a', 'b', 'c'], dtype='object')
print(pd.Index(Series_for_index))
Index(['a', 'b', 'c'], dtype='object')

index0 = pd.Index(index_names)
print(index0.get_values())   # removed in current pandas; np.asarray(index0) does the same
['a' 'b' 'c']

# The following line raises an error -- remember, an Index is immutable!
# index0[2] = 'd'
```
MultiIndex

```python
multi1 = pd.Index([('Row_' + str(x+1), 'Col_' + str(y+1)) for x in range(0, 4) for y in range(0, 4)])
```
Once a Series carries a MultiIndex, reshaping it is easy:

```python
data_for_multi = pd.Series(range(0, 16), index=multi1)
data_for_multi
Out[137]:
Row_1  Col_1     0
       Col_2     1
       Col_3     2
       Col_4     3
Row_2  Col_1     4
       Col_2     5
       Col_3     6
       Col_4     7
Row_3  Col_1     8
       Col_2     9
       Col_3    10
       Col_4    11
Row_4  Col_1    12
       Col_2    13
       Col_3    14
       Col_4    15
dtype: int64

data_for_multi.unstack()
Out[138]:
       Col_1  Col_2  Col_3  Col_4
Row_1      0      1      2      3
Row_2      4      5      6      7
Row_3      8      9     10     11
Row_4     12     13     14     15

data_for_multi.unstack().stack()
Out[139]:
Row_1  Col_1     0
       Col_2     1
       Col_3     2
       Col_4     3
Row_2  Col_1     4
       Col_2     5
       Col_3     6
       Col_4     7
Row_3  Col_1     8
       Col_2     9
       Col_3    10
       Col_4    11
Row_4  Col_1    12
       Col_2    13
       Col_3    14
       Col_4    15
dtype: int64
```
Next, an unbalanced example, where the Row and Col labels are not fully crossed:

```python
multi2 = pd.Index([('Row_' + str(x + 1), 'Col_' + str(y + 1)) for x in range(0, 5) for y in range(0, x)])
multi2
Out[143]:
MultiIndex(levels=[['Row_2', 'Row_3', 'Row_4', 'Row_5'], ['Col_1', 'Col_2', 'Col_3', 'Col_4']],
           labels=[[0, 1, 1, 2, 2, 2, 3, 3, 3, 3], [0, 0, 1, 0, 1, 2, 0, 1, 2, 3]])

data_for_multi2 = pd.Series(np.arange(10), index=multi2)
data_for_multi2
Out[145]:
Row_2  Col_1    0
Row_3  Col_1    1
       Col_2    2
Row_4  Col_1    3
       Col_2    4
       Col_3    5
Row_5  Col_1    6
       Col_2    7
       Col_3    8
       Col_4    9
dtype: int64

data_for_multi2.unstack()
Out[146]:
       Col_1  Col_2  Col_3  Col_4
Row_2    0.0    NaN    NaN    NaN
Row_3    1.0    2.0    NaN    NaN
Row_4    3.0    4.0    5.0    NaN
Row_5    6.0    7.0    8.0    9.0
```
The datetime standard library

```python
dates = [datetime.datetime(2015, 1, 1), datetime.datetime(2015, 1, 8), datetime.datetime(2015, 1, 30)]
```
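Only the list itself survives here; presumably the demo wrapped it in a pandas time index, roughly:

```python
# My sketch: a datetime list becomes a DatetimeIndex, usable as a Series index
print(pd.DatetimeIndex(dates))
# DatetimeIndex(['2015-01-01', '2015-01-08', '2015-01-30'], dtype='datetime64[ns]', freq=None)
ts = pd.Series(np.random.randn(3), index=pd.DatetimeIndex(dates))
```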
If you need not only a consistent time format but a consistent frequency as well:

```python
periodindex1 = pd.period_range('2015-01', '2015-04', freq='M')
print(periodindex1)
PeriodIndex(['2015-01', '2015-02', '2015-03', '2015-04'], dtype='period[M]', freq='M')
```
Converting between month-level and day-level precision:

```python
# Use the first day of each month
print(periodindex1.asfreq('D', how='start'))
PeriodIndex(['2015-01-01', '2015-02-01', '2015-03-01', '2015-04-01'], dtype='period[D]', freq='D')

# Use the last day of each month
print(periodindex1.asfreq('D', how='end'))
PeriodIndex(['2015-01-31', '2015-02-28', '2015-03-31', '2015-04-30'], dtype='period[D]', freq='D')

periodindex_mon = pd.period_range('2018-01', '2018-09', freq='M').asfreq('D', how='start')
periodindex_day = pd.period_range('2018-07-01', '2018-09-01', freq='D')
print(periodindex_mon)
PeriodIndex(['2018-01-01', '2018-02-01', '2018-03-01', '2018-04-01',
             '2018-05-01', '2018-06-01', '2018-07-01', '2018-08-01',
             '2018-09-01'],
            dtype='period[D]', freq='D')
print(periodindex_day)
PeriodIndex(['2018-07-01', '2018-07-02', '2018-07-03', '2018-07-04',
             '2018-07-05', '2018-07-06', '2018-07-07', '2018-07-08',
             ...
             '2018-08-26', '2018-08-27', '2018-08-28', '2018-08-29',
             '2018-08-30', '2018-08-31', '2018-09-01'],
            dtype='period[D]', freq='D')
```
Coarse-grained data + reindex + ffill/bfill:

```python
pd.Series(periodindex_mon, index=periodindex_mon).reindex(periodindex_day, method='ffill')
Out[165]:
2018-07-01    2018-07-01
2018-07-02    2018-07-01
2018-07-03    2018-07-01
...
2018-08-30    2018-08-01
2018-08-31    2018-08-01
2018-09-01    2018-09-01
Freq: D, Length: 63, dtype: object
```
Convenient operations on indexes

```python
index1 = pd.Index(['A', 'B', 'B', 'C', 'C'])
```
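The rest of this block is missing; the usual conveniences on an index look roughly like this (my own picks):

```python
# My sketch of common Index operations
print(index1.unique())          # Index(['A', 'B', 'C'], dtype='object')
print(index1.is_unique)         # False -- 'B' and 'C' repeat
print(index1.isin(['B', 'C']))  # element-wise membership mask
```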
pandas I/O
Structured data output
csv

```python
print(irisdata.columns)
```
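Only the columns check survives here; writing the cleaned iris data out again presumably looked something like:

```python
# My sketch: dump the DataFrame to CSV, dropping the integer row index
irisdata.to_csv('S1EP3_Iris.csv', index=False, encoding='utf-8')
```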
Some common read_csv() parameters (a combined sketch follows the list):
- header: whether the data carries column names; 0 if they sit on row 0, None if there are none
- names: specify the column names
- encoding: specify the encoding
- skiprows: skip a number of rows at the top
- nrows: read only this many rows
- skipfooter: always skip a fixed number of rows at the bottom
- skip_blank_lines: skip blank lines
- sep/delimiter: specify the separator
- na_values: values that should be treated as NaN
- thousands: thousands separators are not standardized; to parse such strings as numbers you must name the separator
- index_col: use an actual column (by position or by name) as the index
- squeeze: if only one column is read, return a pandas.Series instead of a pandas.DataFrame
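A combined sketch of these options; the file name and the concrete values are made up for illustration:

```python
# Hypothetical messy file: no header, ';'-separated, '?' marks missing, '1,234'-style numbers
df = pd.read_csv('some_messy_file.csv',   # hypothetical name
                 header=None,
                 names=['a', 'b', 'c'],
                 sep=';',
                 skiprows=2,
                 nrows=100,
                 na_values=['?'],
                 thousands=',',
                 index_col=0,
                 encoding='utf-8')
```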
Excel
```python
irisdata.to_excel('S1EP3_Iris.xls', index=None, encoding='utf-8')
```

When reading it back, the only important parameter is sheetname = k, which selects the k-th sheet of the Excel file (counting from 0); current pandas spells it sheet_name.
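A minimal sketch of the read side, using the modern sheet_name spelling:

```python
# Read the first sheet back; sheet_name takes a 0-based position or a sheet title
irisdata_from_excel = pd.read_excel('S1EP3_Iris.xls', sheet_name=0)
```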
Semi-structured data

```python
json_data = [ \
```
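The block is cut off after its opening line; judging from the merge outputs further down, the data and the reindex step were roughly the following (the name data_employ is my own):

```python
# Reconstruction from the outputs below: employee records as a list of dicts
json_data = [{'name': 'Wang',  'sal': 50000,  'job': 'VP'},
             {'name': 'Zhang', 'sal': 150000, 'job': 'VP'},
             {'name': 'Li',    'sal': 20000,  'job': 'IT'}]
data_employ = pd.DataFrame(json_data)
# reindex introduces the not-yet-known 'report' column, filled with NaN
data_employ_ri = data_employ.reindex(columns=['name', 'job', 'sal', 'report'])
```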
Digging deeper into pandas data manipulation
Stacking rows: directly as a DataFrame

```python
pd.DataFrame([np.random.rand(2), np.random.rand(2), np.random.rand(2)], columns=['C1', 'C2'])
```
Stacking rows: concat

```python
pd.concat([data_employ_ri, data_employ_ri])
```
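For the record (my own addition, not in the original notes): concat can also glue frames side by side with axis=1, aligning on the index:

```python
# axis=1 concatenates column-wise instead of stacking rows
pd.concat([data_employ_ri, data_employ_ri], axis=1)
```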
Joining columns: merge
Joining on data columns, with the on keyword:

```python
pd.merge(data_employ_ri, data_employ_ri, on='name')
Out[192]:
    name job_x   sal_x  report_x job_y   sal_y  report_y
0   Wang    VP   50000       NaN    VP   50000       NaN
1  Zhang    VP  150000       NaN    VP  150000       NaN
2     Li    IT   20000       NaN    IT   20000       NaN

pd.merge(data_employ_ri, data_employ_ri, on=['name', 'job'])
Out[193]:
    name job   sal_x  report_x   sal_y  report_y
0   Wang  VP   50000       NaN   50000       NaN
1  Zhang  VP  150000       NaN  150000       NaN
2     Li  IT   20000       NaN   20000       NaN
```
Joining on the index, with left_index and right_index:

```python
data_employ_ri.index.name = 'index1'
pd.merge(data_employ_ri, data_employ_ri, left_index=True, right_index=True)
Out[196]:
       name_x job_x   sal_x  report_x name_y job_y   sal_y  report_y
index1
0        Wang    VP   50000       NaN   Wang    VP   50000       NaN
1       Zhang    VP  150000       NaN  Zhang    VP  150000       NaN
2          Li    IT   20000       NaN     Li    IT   20000       NaN
```
Add the how keyword to choose the join type (see the sketch after this list):
- how = 'inner'
- how = 'left'
- how = 'right'
- how = 'outer'
Combined with how, merge essentially reproduces what SQL joins offer, while keeping the code clean.
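A minimal sketch of the four join types on two toy frames (my own, not from the tutorial):

```python
left = pd.DataFrame({'k': ['a', 'b'], 'x': [1, 2]})
right = pd.DataFrame({'k': ['b', 'c'], 'y': [3, 4]})
pd.merge(left, right, on='k', how='inner')   # only 'b'
pd.merge(left, right, on='k', how='left')    # 'a' and 'b'; y is NaN for 'a'
pd.merge(left, right, on='k', how='right')   # 'b' and 'c'; x is NaN for 'c'
pd.merge(left, right, on='k', how='outer')   # 'a', 'b', 'c', NaN where unmatched
```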
Mapping with custom functions

```python
dataNumPy32 = np.asarray([('Japan', 'Tokyo', 4000), ('S.Korea', 'Seoul', 1300), ('China', 'Beijing', 9100)])
```
map: treat a column as a mapping and run the same function over every element:

```python
def gdp_factorize(v):
```
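The function body is cut off; a sketch of what a gdp_factorize could look like (the thresholds, the level names, and dataFrame32 are my own):

```python
# My sketch: bucket the gdp column into levels; values are strings because of np.asarray
def gdp_factorize(v):
    fv = float(v)
    if fv > 6000:
        return 'High'
    elif fv < 2000:
        return 'Low'
    return 'Medium'

dataFrame32 = pd.DataFrame(dataNumPy32, columns=['nation', 'capital', 'gdp'])
dataFrame32['gdp_level'] = dataFrame32['gdp'].map(gdp_factorize)
```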
Sorting
sort: row-level sorting by the values of one or more columns (spelled sort_values in current pandas)
sort_index: sort by the index values; the axis argument decides whether rows or columns get rearranged

```python
dataNumPy33 = np.asarray([('Japan', 'Tokyo', 4000), ('S.Korea', 'Seoul', 1300), ('China', 'Beijing', 9100)])
```
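A minimal sketch of both calls on that data (dataFrame33 is my own name):

```python
dataFrame33 = pd.DataFrame(dataNumPy33, columns=['nation', 'capital', 'gdp'])
dataFrame33.sort_values(by='gdp')                 # rows reordered by the gdp column
dataFrame33.sort_index(axis=1, ascending=False)   # columns reordered by name, descending
```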
Note how equal values (ties) are handled when ranking (see the sketch after this list):
- method = 'average'
- method = 'min'
- method = 'max'
- method = 'first'
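A quick illustration of the four tie-breaking rules on a toy series (my own):

```python
s = pd.Series([7, 7, 5, 9])
s.rank(method='average')  # ties share the mean rank:   [2.5, 2.5, 1.0, 4.0]
s.rank(method='min')      # ties take the lowest rank:  [2.0, 2.0, 1.0, 4.0]
s.rank(method='max')      # ties take the highest rank: [3.0, 3.0, 1.0, 4.0]
s.rank(method='first')    # ties ranked by appearance:  [2.0, 3.0, 1.0, 4.0]
```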
Handling missing data:

```python
df34 = data_for_multi2.unstack()
df34
Out[222]:
       Col_1  Col_2  Col_3  Col_4
Row_2    0.0    NaN    NaN    NaN
Row_3    1.0    2.0    NaN    NaN
Row_4    3.0    4.0    5.0    NaN
Row_5    6.0    7.0    8.0    9.0

# Ignore the missing values
df34.mean(skipna=True)
Out[224]:
Col_1    2.500000
Col_2    4.333333
Col_3    6.500000
Col_4    9.000000
dtype: float64

df34.mean(skipna=False)
Out[225]:
Col_1    2.5
Col_2    NaN
Col_3    NaN
Col_4    NaN
dtype: float64

df34.fillna(0)
Out[226]:
       Col_1  Col_2  Col_3  Col_4
Row_2    0.0    0.0    0.0    0.0
Row_3    1.0    2.0    0.0    0.0
Row_4    3.0    4.0    5.0    0.0
Row_5    6.0    7.0    8.0    9.0

df34.fillna(0).mean(axis=0, skipna=False)
Out[227]:
Col_1    2.50
Col_2    3.25
Col_3    3.25
Col_4    2.25
dtype: float64

df34.fillna(0).mean(axis=1, skipna=False)
Out[228]:
Row_2    0.00
Row_3    0.75
Row_4    3.00
Row_5    7.50
dtype: float64
```
pandas groupby
groupby works much like SQL's GROUP BY keyword:
Split-Apply-Combine
1. Split: partition the rows into groups according to some rule
2. Apply: run an agg function that takes each group's pd.Series and returns a single value
3. Combine: collect the results back together
What makes pandas' groupby flexible (a minimal end-to-end example follows this list):
1. the grouping key can come from the index or from actual column data
2. the grouping rule can involve one column or several
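A minimal end-to-end Split-Apply-Combine run on the iris data from above (my own example):

```python
# Split by species, apply a mean per numeric column, combine into one frame
irisdata.groupby('class').mean()
# The same thing spelled through agg
irisdata.groupby('class').agg('mean')
```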
The concrete logic of grouping

```python
irisdata_group = irisdata.groupby('class')
```
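Iterating over the GroupBy object shows what the Split step actually produced; each iteration yields a key and a sub-DataFrame:

```python
for name, group in irisdata_group:
    print(name, len(group))
# Iris-setosa 50
# Iris-versicolor 50
# Iris-virginica 50
```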
Grouping makes it quick to implement MapReduce-style logic:
Map: the distinct values of the grouping column route rows into different groups for processing
Reduce: takes many values in and returns one, usually through agg, which accepts a function

```python
irisdata.groupby('class').agg(lambda x: ((x - x.mean())**3).sum() * len(x) / (len(x) - 1) / (len(x) - 2) / x.std()**3 if len(x) > 2 else None)
```
Broadcasting after aggregation

```python
irisdata.groupby('class').transform('mean')
```
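Unlike agg, transform broadcasts each group's aggregate back onto every original row, so the result keeps the input's shape; a quick check (my own):

```python
means = irisdata.groupby('class').transform('mean')
print(len(means) == len(irisdata))   # True: one row of group means per original row
```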
Pivot-table operations after a hierarchical index (grouping by multiple columns)

```python
factor1 = np.random.randint(0, 3, 50)
```
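The rest of this demo is missing; a sketch in its spirit (the second factor, the frame, and the pivot are all my own construction):

```python
# My sketch: group by two random factors, then pivot the 2-level result
factor2 = np.random.randint(0, 2, 50)
df = pd.DataFrame({'f1': factor1, 'f2': factor2, 'value': np.random.randn(50)})
grouped = df.groupby(['f1', 'f2'])['value'].mean()   # Series with a 2-level MultiIndex
print(grouped.unstack())                             # f1 as rows, f2 as columns
# pivot_table collapses the whole thing into one call
print(pd.pivot_table(df, values='value', index='f1', columns='f2', aggfunc='mean'))
```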