pandas Data Processing

Notes from the tutorial "学习和大熊猫们(Pandas)一起游戏吧!" (roughly: Let's Play with the Pandas!).

Why pandas?

Handed a dataset, an ordinary programmer would do something like this:

Python
import codecs
import requests
import numpy as np
import scipy as sp
import pandas as pd
import datetime
import json

# Download the iris dataset and save it as a text file
r = requests.get('http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data')
with codecs.open('S1EP3_Iris.txt', 'w', encoding='utf-8') as f:
    f.write(r.text)

# Read it back and print it line by line
with codecs.open('S1EP3_Iris.txt', 'r', encoding='utf-8') as f:
    lines = f.readlines()

for line in lines:
    print(line)

The point of pandas is that it recognizes structured data right away. Reading the same file with pandas:

Python
irisdata = pd.read_csv('S1EP3_Iris.txt', header = None, encoding = 'utf-8')
irisdata.head()

# Result
Out[45]:
     0    1    2    3            4
0  5.1  3.5  1.4  0.2  Iris-setosa
1  4.9  3.0  1.4  0.2  Iris-setosa
2  4.7  3.2  1.3  0.2  Iris-setosa
3  4.6  3.1  1.5  0.2  Iris-setosa
4  5.0  3.6  1.4  0.2  Iris-setosa

You can also manipulate the metadata just as quickly:

Python
# For example: set the column names
irisdata.columns = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'class']
irisdata.head()

Fast filtering

Python
# For example, select the rows where petal_width takes its maximum value
irisdata[irisdata['petal_width'] == irisdata.petal_width.max()]

# Result
Out[46]:
     sepal_length  sepal_width  petal_length  petal_width           class
100           6.3          3.3           6.0          2.5  Iris-virginica
109           7.2          3.6           6.1          2.5  Iris-virginica
144           6.7          3.3           5.7          2.5  Iris-virginica

Slicing

Python
# For example: every 30th row, first two columns
irisdata.iloc[::30, :2]

# Result
Out[47]:
     sepal_length  sepal_width
0             5.1          3.5
30            4.8          3.1
60            5.0          2.0
90            5.5          2.6
120           6.9          3.2

Fast statistics

Python
# Count the samples in each iris class
print(irisdata['class'].value_counts())

# Result
Iris-virginica     50
Iris-versicolor    50
Iris-setosa        50
Name: class, dtype: int64

Python
for x in range(0, 4):
    s = irisdata.iloc[:, x]
    print('{0:<12}'.format(s.name.upper()), " Statistics: ",
          '{0:>5} {1:>5} {2:>5} {3:>5}'.format(s.max(), s.min(), round(s.mean(), 2), round(s.std(), 2)))

# Result
SEPAL_LENGTH  Statistics:    7.9   4.3  5.84  0.83
SEPAL_WIDTH   Statistics:    4.4   2.0  3.05  0.43
PETAL_LENGTH  Statistics:    6.9   1.0  3.76  1.76
PETAL_WIDTH   Statistics:    2.5   0.1   1.2  0.76

Fast MapReduce-style aggregation. The entpy lambda below computes exp of the Shannon entropy of each column within each group, treating the values as weights: for weights x with total N, (N log N - Σ xᵢ log xᵢ)/N equals -Σ pᵢ log pᵢ with pᵢ = xᵢ/N, so the result is an "effective count" that is at most the group size (here 50).

Python
slogs = lambda x: np.log(x) * x    # sp.log was removed from modern SciPy; np.log is the same function
entpy = lambda x: np.exp((slogs(x.sum()) - x.map(slogs).sum()) / x.sum())
irisdata.groupby('class').agg(entpy)

# Result
                 sepal_length  sepal_width  petal_length  petal_width
class
Iris-setosa         49.878745    49.695242     49.654909    45.810069
Iris-versicolor     49.815081    49.680665     49.694505    49.452305
Iris-virginica      49.772059    49.714500     49.761700    49.545918
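
To make the formula concrete, here is a minimal sketch on toy data of my own, showing that entpy(x) is just exp of the Shannon entropy of x normalized to a probability distribution:

Python
x = pd.Series([1, 1, 2, 4])      # toy weights
p = x / x.sum()                  # normalize to a probability distribution
H = -(p * np.log(p)).sum()       # Shannon entropy
print(np.exp(H))                 # identical to entpy(x)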

Welcome to the world of pandas

pandas' core data types

  • DataFrame
  • Series
  • index (row labels)

Series

Python
np.random.randn(4)
Out[52]: array([-0.27891329, -0.57633302, -1.10535522,  0.05993792])

Series1 = pd.Series(np.random.randn(4))
print(Series1, type(Series1))
0   -0.846342
1    0.173527
2   -0.327411
3   -0.323218
dtype: float64 <class 'pandas.core.series.Series'>

print(Series1.index)
RangeIndex(start=0, stop=4, step=1)

print(Series1.values)
[-0.84634196  0.17352745 -0.32741142 -0.32321775]

A Series supports filtering exactly as NumPy does:

Python
print(Series1 > 0)
0    False
1     True
2    False
3    False
dtype: bool

print(Series1[Series1 > 0])
1    0.173527
dtype: float64

Broadcasting works too, of course:

Python
# The operation is applied to every element of the Series
print(Series1 * 2)
0   -1.692684
1    0.347055
2   -0.654823
3   -0.646435
dtype: float64

print(Series1 + 5)
0    4.153658
1    5.173527
2    4.672589
3    4.676782
dtype: float64

As do universal functions (ufuncs):

Python
print(np.exp(Series1))
0    0.428981
1    1.189493
2    0.720787
3    0.723816
dtype: float64

f_np = np.frompyfunc(lambda x: np.exp(x*2 + 5), 1, 1)

print(f_np(Series1))
0    27.3117
1    209.989
2    77.1057
3    77.7551
dtype: object

# A lambda expression is just an anonymous function
f = lambda x: pow(x, 2)
f(2)
Out[67]: 4

Using row labels on a Series, rather than building a two-column table, makes it easy to tell the data from the metadata:

Python
Series2 = pd.Series(Series1.values, index = ['norm_' + str(i) for i in range(0, 4)])
print(Series2, type(Series2))
norm_0   -0.846342
norm_1    0.173527
norm_2   -0.327411
norm_3   -0.323218
dtype: float64 <class 'pandas.core.series.Series'>

print(Series2.index)
Index(['norm_0', 'norm_1', 'norm_2', 'norm_3'], dtype='object')

print(Series2.values)
[-0.84634196  0.17352745 -0.32741142 -0.32321775]

print(type(Series2.index))
<class 'pandas.core.indexes.base.Index'>

The rows are ordered, but you can still access data by row label:

Python
print(Series2[['norm_0', 'norm_3']])
norm_0   -0.846342
norm_3   -0.323218
dtype: float64

The following statements check whether a label exists in the Series' index:

Python
print('norm_0' in Series2)
True

print('norm_6' in Series2)
False

The default index behaves like row numbers:

Python
print(Series1.index)
RangeIndex(start=0, stop=4, step=1)

Defining a Series from a dict (or from an OrderedDict with unique keys) means you never have to worry about duplicate row labels. Note that older pandas sorted the keys alphabetically, as in the output below; modern pandas preserves insertion order:

Python
Series3_dict = {"Japan": "Tokyo", "S.Korea": "Seoul", "China": "Beijing"}
Series3_pdSeries = pd.Series(Series3_dict)
print(Series3_pdSeries)
China      Beijing
Japan        Tokyo
S.Korea      Seoul
dtype: object

print(Series3_pdSeries.index)
Index(['China', 'Japan', 'S.Korea'], dtype='object')

print(Series3_pdSeries.values)
['Beijing' 'Tokyo' 'Seoul']

Want the Series kept in your own order? Missing values are no problem either:

Python
Series4_IndexList = ["Japan", "China", "Singapore", "S.Korea"]
Series4_pdSeries = pd.Series(Series3_dict, index = Series4_IndexList)
print(Series4_pdSeries)
Japan          Tokyo
China        Beijing
Singapore        NaN
S.Korea        Seoul
dtype: object

print(Series4_pdSeries.values)
['Tokyo' 'Beijing' nan 'Seoul']

print(Series4_pdSeries.isnull())
Japan        False
China        False
Singapore     True
S.Korea      False
dtype: bool

print(Series4_pdSeries.index)
Index(['Japan', 'China', 'Singapore', 'S.Korea'], dtype='object')

print(Series4_pdSeries.notnull())
Japan         True
China         True
Singapore    False
S.Korea       True
dtype: bool

Series-level metadata: name

Python
# Once the data and the index both have names, later joins become much more convenient!
print(Series4_pdSeries.name)
None

print(Series4_pdSeries.index.name)
None

Series4_pdSeries.name = "Capital Series"
Series4_pdSeries.index.name = "Nation"
print(Series4_pdSeries)
Nation
Japan          Tokyo
China        Beijing
Singapore        NaN
S.Korea        Seoul
Name: Capital Series, dtype: object

Row labels may even repeat, although this is not recommended:

Python
Series5_IndexList = ['A', 'B', 'B', 'C']
Series5 = pd.Series(Series1.values, index = Series5_IndexList)
print(Series5)
A   -0.846342
B    0.173527
B   -0.327411
C   -0.323218
dtype: float64

print(Series5[['B', 'A']])
B    0.173527
B   -0.327411
A   -0.846342
dtype: float64

DataFrame

Define one from a NumPy 2-D array, from a file, or from a database. The data may be fine as it is, but don't forget the column names:

Python
# Note: np.asarray casts every field to a string here, so gdp becomes a string column
dn = np.asarray([('Japan', 'Tokyo', 4000), ('S.Korea', 'Seoul', 1200), ('China', 'Beijing', 910)])
df1 = pd.DataFrame(dn, columns = ['nation', 'capital', 'gdp'])
df1
Out[101]:
    nation  capital   gdp
0    Japan    Tokyo  4000
1  S.Korea    Seoul  1200
2    China  Beijing   910

Or store equal-length columns in a dict (JSON-style). Unfortunately, dict keys counted as unordered in older Python/pandas, which is why the columns below come out alphabetically:

Python
dd = {'nation': ['Japan', 'S.Korea', 'China'], 'capital': ['Tokyo', 'Seoul', 'Beijing'], 'gdp': [4900, 1300, 9100]}
df2 = pd.DataFrame(dd)
df2
Out[104]:
   capital   gdp   nation
0    Tokyo  4900    Japan
1    Seoul  1300  S.Korea
2  Beijing  9100    China

Define a DataFrame from another DataFrame:

Python
df21 = pd.DataFrame(df1, columns = ['nation', 'capital', 'gdp'])
df21
Out[106]:
    nation  capital   gdp
0    Japan    Tokyo  4000
1  S.Korea    Seoul  1200
2    China  Beijing   910

df22 = pd.DataFrame(df2, columns = ['nation', 'capital', 'gdp'], index = [2, 0, 1])
df22
Out[108]:
    nation  capital   gdp
2    China  Beijing  9100
0    Japan    Tokyo  4900
1  S.Korea    Seoul  1300

Pulling a column out of a DataFrame? Two ways (just like JavaScript property access):

  • the '.' attribute syntax can collide with reserved words and existing method names
  • the '[]' syntax is always safe
Python
print(df22.nation, df22.capital)
2      China
0      Japan
1    S.Korea
Name: nation, dtype: object 2    Beijing
0      Tokyo
1      Seoul
Name: capital, dtype: object

print(df22['gdp'])
2    9100
0    4900
1    1300
Name: gdp, dtype: int64

Pulling rows out of a DataFrame:

Python
print(df22)
    nation  capital   gdp
2    China  Beijing  9100
0    Japan    Tokyo  4900
1  S.Korea    Seoul  1300

print(df22[0:1])
  nation  capital   gdp
2  China  Beijing  9100

# The first row, by position
print(df22.iloc[0])
nation       China
capital    Beijing
gdp           9100
Name: 2, dtype: object

# The row whose label is 0
print(df22.loc[0])
nation     Japan
capital    Tokyo
gdp         4900
Name: 0, dtype: object

The ultimate move: NumPy-style slicing

Python
# First row
print(df22.iloc[0, :])
nation       China
capital    Beijing
gdp           9100
Name: 2, dtype: object

# First column
print(df22.iloc[:, 0])
2      China
0      Japan
1    S.Korea
Name: nation, dtype: object

Adding a column on the fly:

Python
df22['population'] = [1600, 130, 55]
df22
Out[123]:
    nation  capital   gdp  population
2    China  Beijing  9100        1600
0    Japan    Tokyo  4900         130
1  S.Korea    Seoul  1300          55
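
A new column can also be derived from existing ones; a quick sketch (the column name gdp_per_capita is mine, not from the original notes):

Python
# Derive a column from two existing ones (hypothetical column name)
df22['gdp_per_capita'] = df22['gdp'] / df22['population']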

index

An Index can be defined directly, just as a Series can:

Python
index_names = ['a', 'b', 'c']
Series_for_index = pd.Series(index_names)
print(pd.Index(index_names))
Index(['a', 'b', 'c'], dtype='object')

print(pd.Index(Series_for_index))
Index(['a', 'b', 'c'], dtype='object')

index0 = pd.Index(index_names)
print(index0.values)   # get_values() was removed in pandas 1.0; .values is equivalent
['a' 'b' 'c']
# The next line would raise an error -- remember, an Index is immutable!
# index0[2] = 'd'

MultiIndex

Python
multi1 = pd.Index([('Row_' + str(x+1), 'Col_' + str(y+1)) for x in range(0, 4) for y in range(0, 4)])
multi1.name = ['index1', 'index2']   # no effect: a MultiIndex takes .names (plural)
print(multi1)
MultiIndex(levels=[['Row_1', 'Row_2', 'Row_3', 'Row_4'], ['Col_1', 'Col_2', 'Col_3', 'Col_4']],
           labels=[[0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3], [0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3]])

Once a Series has a MultiIndex, reshaping it is easy:

Python
data_for_multi = pd.Series(range(0, 16), index = multi1)
data_for_multi
Out[137]:
Row_1  Col_1     0
       Col_2     1
       Col_3     2
       Col_4     3
Row_2  Col_1     4
       Col_2     5
       Col_3     6
       Col_4     7
Row_3  Col_1     8
       Col_2     9
       Col_3    10
       Col_4    11
Row_4  Col_1    12
       Col_2    13
       Col_3    14
       Col_4    15
dtype: int64

data_for_multi.unstack()
Out[138]:
       Col_1  Col_2  Col_3  Col_4
Row_1      0      1      2      3
Row_2      4      5      6      7
Row_3      8      9     10     11
Row_4     12     13     14     15

data_for_multi.unstack().stack()
Out[139]:
Row_1  Col_1     0
       Col_2     1
       Col_3     2
       Col_4     3
Row_2  Col_1     4
       Col_2     5
       Col_3     6
       Col_4     7
Row_3  Col_1     8
       Col_2     9
       Col_3    10
       Col_4    11
Row_4  Col_1    12
       Col_2    13
       Col_3    14
       Col_4    15
dtype: int64

Now an unbalanced example, in which the Row_* and Col_* levels are not fully crossed:

Python
multi2 = pd.Index([('Row_' + str(x + 1), 'Col_' + str(y + 1)) for x in range(0, 5) for y in range(0, x)])
multi2
Out[143]:
MultiIndex(levels=[['Row_2', 'Row_3', 'Row_4', 'Row_5'], ['Col_1', 'Col_2', 'Col_3', 'Col_4']],
           labels=[[0, 1, 1, 2, 2, 2, 3, 3, 3, 3], [0, 0, 1, 0, 1, 2, 0, 1, 2, 3]])

data_for_multi2 = pd.Series(np.arange(10), index = multi2)
data_for_multi2
Out[145]:
Row_2  Col_1    0
Row_3  Col_1    1
       Col_2    2
Row_4  Col_1    3
       Col_2    4
       Col_3    5
Row_5  Col_1    6
       Col_2    7
       Col_3    8
       Col_4    9
dtype: int64

data_for_multi2.unstack()
Out[146]:
       Col_1  Col_2  Col_3  Col_4
Row_2    0.0    NaN    NaN    NaN
Row_3    1.0    2.0    NaN    NaN
Row_4    3.0    4.0    5.0    NaN
Row_5    6.0    7.0    8.0    9.0

The datetime standard library

Python
dates = [datetime.datetime(2015, 1, 1), datetime.datetime(2015, 1, 8), datetime.datetime(2015, 1, 30)]
pd.DatetimeIndex(dates)
Out[151]: DatetimeIndex(['2015-01-01', '2015-01-08', '2015-01-30'], dtype='datetime64[ns]', freq=None)

If you need not only a consistent time format but also a consistent frequency:

Python
periodindex1 = pd.period_range('2015-01', '2015-04', freq = 'M')
print(periodindex1)
PeriodIndex(['2015-01', '2015-02', '2015-03', '2015-04'], dtype='period[M]', freq='M')

Converting between monthly and daily precision:

Python
# Use the first day of each month
print(periodindex1.asfreq('D', how='start'))
PeriodIndex(['2015-01-01', '2015-02-01', '2015-03-01', '2015-04-01'], dtype='period[D]', freq='D')

# Use the last day of each month
print(periodindex1.asfreq('D', how='end'))
PeriodIndex(['2015-01-31', '2015-02-28', '2015-03-31', '2015-04-30'], dtype='period[D]', freq='D')

periodindex_mon = pd.period_range('2018-01', '2018-09', freq = 'M').asfreq('D', how='start')
periodindex_day = pd.period_range('2018-07-01', '2018-09-01', freq = 'D')
print(periodindex_mon)
PeriodIndex(['2018-01-01', '2018-02-01', '2018-03-01', '2018-04-01',
             '2018-05-01', '2018-06-01', '2018-07-01', '2018-08-01',
             '2018-09-01'],
            dtype='period[D]', freq='D')

print(periodindex_day)
PeriodIndex(['2018-07-01', '2018-07-02', '2018-07-03', '2018-07-04',
             '2018-07-05', '2018-07-06', '2018-07-07', '2018-07-08',
             ...
             '2018-08-26', '2018-08-27', '2018-08-28', '2018-08-29',
             '2018-08-30', '2018-08-31', '2018-09-01'],
            dtype='period[D]', freq='D')

Coarse-grained data + reindex + ffill/bfill

Python
pd.Series(periodindex_mon, index = periodindex_mon).reindex(periodindex_day, method = 'ffill')
Out[165]:
2018-07-01    2018-07-01
2018-07-02    2018-07-01
2018-07-03    2018-07-01
...
2018-08-30    2018-08-01
2018-08-31    2018-08-01
2018-09-01    2018-09-01
Freq: D, Length: 63, dtype: object
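
Only ffill is shown above; for completeness, a sketch of the bfill counterpart, which fills each day from the next known month start rather than the previous one:

Python
pd.Series(periodindex_mon, index = periodindex_mon).reindex(periodindex_day, method = 'bfill')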

Convenient operations on indexes:

Python
index1 = pd.Index(['A', 'B', 'B', 'C', 'C'])
index2 = pd.Index(['C', 'D', 'E', 'E', 'F'])
index3 = pd.Index(['B', 'C', 'A'])
print(index1.append(index2))
Index(['A', 'B', 'B', 'C', 'C', 'C', 'D', 'E', 'E', 'F'], dtype='object')

print(index1.difference(index2))
Index(['A', 'B'], dtype='object')

print(index1.intersection(index2))
Index(['C', 'C'], dtype='object')

print(index1.union(index2))
Index(['A', 'B', 'B', 'C', 'C', 'D', 'E', 'E', 'F'], dtype='object')

print(index1.isin(index2))
[False False False  True  True]

print(index1.delete(2))
Index(['A', 'B', 'C', 'C'], dtype='object')

print(index1.insert(0, 'K'))
Index(['K', 'A', 'B', 'B', 'C', 'C'], dtype='object')

print(index3.drop('A'))
Index(['B', 'C'], dtype='object')

print(index1.is_monotonic, index2.is_monotonic, index3.is_monotonic)
True True False

print(index1.is_unique, index2.is_unique, index3.is_unique)
False False True

pandas I/O

Structured data I/O

CSV

Python
print(irisdata.columns)
Index(['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'class'], dtype='object')

irisdata = pd.read_csv('S1EP3_Iris.txt', header = None, names = irisdata.columns, encoding = 'utf-8')
irisdata.head()
Out[181]:
   sepal_length  sepal_width  petal_length  petal_width        class
0           5.1          3.5           1.4          0.2  Iris-setosa
1           4.9          3.0           1.4          0.2  Iris-setosa
2           4.7          3.2           1.3          0.2  Iris-setosa
3           4.6          3.1           1.5          0.2  Iris-setosa
4           5.0          3.6           1.4          0.2  Iris-setosa

Commonly used read_csv() parameters:

  • header: which row holds the column names; 0 if it's the first row, None if there is none
  • names: supply the column names yourself
  • encoding: the file encoding
  • skiprows: skip a given number of rows at the top
  • nrows: read only a given number of rows
  • skipfooter: always skip a fixed number of rows at the bottom
  • skip_blank_lines: skip empty lines
  • sep/delimiter: the field separator
  • na_values: extra values to be treated as NaN
  • thousands: thousands separators are not written consistently, so parsing strings into numbers may require naming the separator explicitly
  • index_col: use a real column (given by position or by name) as the index
  • squeeze: if the parsed data contains only one column, return a pandas.Series instead of a pandas.DataFrame
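
A sketch combining several of these parameters on the iris file (the specific values here are illustrative, not from the original notes):

Python
# Skip the first 5 rows, read only 10, treat '?' as NaN, and index by class
subset = pd.read_csv('S1EP3_Iris.txt', header = None,
                     names = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'class'],
                     skiprows = 5, nrows = 10, na_values = ['?'], index_col = 'class')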

Excel

Python
# Note: recent pandas dropped .xls writing and the encoding argument; with a modern
# version, write S1EP3_Iris.xlsx and omit encoding
irisdata.to_excel('S1EP3_Iris.xls', index = None, encoding = 'utf-8')
irisdata_from_excel = pd.read_excel('S1EP3_Iris.xls', header = 0, encoding = 'utf-8')
irisdata_from_excel.head()
Out[184]:
   sepal_length  sepal_width  petal_length  petal_width        class
0           5.1          3.5           1.4          0.2  Iris-setosa
1           4.9          3.0           1.4          0.2  Iris-setosa
2           4.7          3.2           1.3          0.2  Iris-setosa
3           4.6          3.1           1.5          0.2  Iris-setosa
4           5.0          3.6           1.4          0.2  Iris-setosa

The only important parameter is sheet_name = k, which selects the k-th sheet of the Excel file (0-based). Older pandas spelled the parameter sheetname.
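
For instance, to pull the first sheet explicitly (assuming a pandas version that uses the sheet_name spelling):

Python
first_sheet = pd.read_excel('S1EP3_Iris.xls', sheet_name = 0)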

Semi-structured data

Python
json_data = [
    {'name': 'Wang', 'sal': 50000, 'job': 'VP'},
    {'name': 'Zhang', 'sal': 150000, 'job': 'VP'},
    {'name': 'Li', 'sal': 20000, 'job': 'IT'}]
data_employ = pd.read_json(json.dumps(json_data))
data_employ_ri = data_employ.reindex(columns = ['name', 'job', 'sal', 'report'])
data_employ_ri
Out[188]:
    name job     sal  report
0   Wang  VP   50000     NaN
1  Zhang  VP  150000     NaN
2     Li  IT   20000     NaN

Digging deeper: data manipulation in pandas

Concatenating rows: building a DataFrame directly

Python
pd.DataFrame([np.random.rand(2), np.random.rand(2), np.random.rand(2)], columns = ['C1', 'C2'])
Out[189]:
         C1        C2
0  0.495542  0.805998
1  0.877208  0.497127
2  0.563253  0.208177

Concatenating rows: concat

Python
pd.concat([data_employ_ri, data_employ_ri])
Out[190]:
    name job     sal  report
0   Wang  VP   50000     NaN
1  Zhang  VP  150000     NaN
2     Li  IT   20000     NaN
0   Wang  VP   50000     NaN
1  Zhang  VP  150000     NaN
2     Li  IT   20000     NaN

Joining columns: merge

Join on data columns with the on keyword:

Python
pd.merge(data_employ_ri, data_employ_ri, on = 'name')
Out[192]:
    name job_x   sal_x  report_x job_y   sal_y  report_y
0   Wang    VP   50000       NaN    VP   50000       NaN
1  Zhang    VP  150000       NaN    VP  150000       NaN
2     Li    IT   20000       NaN    IT   20000       NaN

pd.merge(data_employ_ri, data_employ_ri, on = ['name', 'job'])
Out[193]:
    name job   sal_x  report_x   sal_y  report_y
0   Wang  VP   50000       NaN   50000       NaN
1  Zhang  VP  150000       NaN  150000       NaN
2     Li  IT   20000       NaN   20000       NaN

Join on the index with left_index and right_index:

Python
data_employ_ri.index.name = 'index1'
pd.merge(data_employ_ri, data_employ_ri, left_index = True, right_index = True)
Out[196]:
       name_x job_x   sal_x  report_x name_y job_y   sal_y  report_y
index1
0        Wang    VP   50000       NaN   Wang    VP   50000       NaN
1       Zhang    VP  150000       NaN  Zhang    VP  150000       NaN
2          Li    IT   20000       NaN     Li    IT   20000       NaN

Add the how keyword to choose the join type (see the sketch after this list):

  • how = 'inner'
  • how = 'left'
  • how = 'right'
  • how = 'outer'

Combined with how, merge reproduces essentially everything a SQL join can do, while keeping the code tidy.
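
A minimal sketch of the four join types on two toy frames of my own (lhs and rhs are not from the original notes):

Python
lhs = pd.DataFrame({'name': ['Wang', 'Li'], 'dept': ['Sales', 'IT']})
rhs = pd.DataFrame({'name': ['Li', 'Zhao'], 'sal': [20000, 30000]})

pd.merge(lhs, rhs, on = 'name', how = 'inner')   # only 'Li'
pd.merge(lhs, rhs, on = 'name', how = 'left')    # 'Wang' and 'Li'; Wang's sal is NaN
pd.merge(lhs, rhs, on = 'name', how = 'right')   # 'Li' and 'Zhao'; Zhao's dept is NaN
pd.merge(lhs, rhs, on = 'name', how = 'outer')   # all three names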

Mapping with custom functions

Python
dataNumPy32 = np.asarray([('Japan', 'Tokyo', 4000), ('S.Korea', 'Seoul', 1300), ('China', 'Beijing', 9100)])
df32 = pd.DataFrame(dataNumPy32, columns = ['nation', 'capital', 'gdp'])
df32
Out[204]:
    nation  capital   gdp
0    Japan    Tokyo  4000
1  S.Korea    Seoul  1300
2    China  Beijing  9100

map applies one rule, i.e. the same function, to every element of a column:

Python
def gdp_factorize(v):
    fv = np.float64(v)   # gdp is stored as strings, so convert before comparing
    if fv > 6000.0:
        return 'High'
    elif fv < 2000.0:
        return 'Low'
    else:
        return 'Medium'

df32['GDP_Level'] = df32['gdp'].map(gdp_factorize)
df32['nation'] = df32.nation.map(str.upper)
df32
Out[205]:
    nation  capital   gdp GDP_Level
0    JAPAN    Tokyo  4000    Medium
1  S.KOREA    Seoul  1300       Low
2    CHINA  Beijing  9100      High

Sorting

sort_values: row-level sorting by the values of one or more columns (the older sort method was removed)
sort_index: sort by the index, with axis deciding whether rows or columns are reordered

Python
# np.asarray again makes gdp a string column; the comparisons below are lexicographic,
# which happens to agree with numeric order for these values
dataNumPy33 = np.asarray([('Japan', 'Tokyo', 4000), ('S.Korea', 'Seoul', 1300), ('China', 'Beijing', 9100)])
df33 = pd.DataFrame(dataNumPy33, columns = ['nation', 'capital', 'gdp'])
df33
Out[208]:
    nation  capital   gdp
0    Japan    Tokyo  4000
1  S.Korea    Seoul  1300
2    China  Beijing  9100

df33.sort_values('gdp')
Out[209]:
    nation  capital   gdp
1  S.Korea    Seoul  1300
0    Japan    Tokyo  4000
2    China  Beijing  9100

df33.sort_values(['capital', 'nation'], ascending = False)
Out[210]:
    nation  capital   gdp
0    Japan    Tokyo  4000
1  S.Korea    Seoul  1300
2    China  Beijing  9100

# The first sort below is redundant; it only demonstrates that calls chain
df33.sort_values('gdp').sort_values(by = 'gdp', ascending = False)
Out[211]:
    nation  capital   gdp
2    China  Beijing  9100
0    Japan    Tokyo  4000
1  S.Korea    Seoul  1300

df33.sort_index(axis = 1, ascending = True)
Out[212]:
   capital   gdp   nation
0    Tokyo  4000    Japan
1    Seoul  1300  S.Korea
2  Beijing  9100    China

df33.rank()
Out[213]:
   nation  capital  gdp
0     2.0      3.0  2.0
1     3.0      2.0  1.0
2     1.0      1.0  3.0

df33.rank(ascending = False)
Out[214]:
   nation  capital  gdp
0     2.0      1.0  2.0
1     1.0      2.0  3.0
2     3.0      3.0  1.0

Note how ties are handled, controlled by the method argument (see the sketch after this list):

  • method = 'average'
  • method = 'min'
  • method = 'max'
  • method = 'first'
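
A quick sketch on a toy Series with a tie (s is mine, not from the original notes):

Python
s = pd.Series([7, 7, 3])
s.rank(method = 'average')   # 2.5, 2.5, 1.0
s.rank(method = 'min')       # 2.0, 2.0, 1.0
s.rank(method = 'max')       # 3.0, 3.0, 1.0
s.rank(method = 'first')     # 2.0, 3.0, 1.0 -- ties broken by position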

Handling missing data:

Python
df34 = data_for_multi2.unstack()
df34
Out[222]:
       Col_1  Col_2  Col_3  Col_4
Row_2    0.0    NaN    NaN    NaN
Row_3    1.0    2.0    NaN    NaN
Row_4    3.0    4.0    5.0    NaN
Row_5    6.0    7.0    8.0    9.0

# Ignore missing values
df34.mean(skipna = True)
Out[224]:
Col_1    2.500000
Col_2    4.333333
Col_3    6.500000
Col_4    9.000000
dtype: float64

df34.mean(skipna = False)
Out[225]:
Col_1    2.5
Col_2    NaN
Col_3    NaN
Col_4    NaN
dtype: float64

df34.fillna(0)
Out[226]:
       Col_1  Col_2  Col_3  Col_4
Row_2    0.0    0.0    0.0    0.0
Row_3    1.0    2.0    0.0    0.0
Row_4    3.0    4.0    5.0    0.0
Row_5    6.0    7.0    8.0    9.0

df34.fillna(0).mean(axis = 0, skipna = False)
Out[227]:
Col_1    2.50
Col_2    3.25
Col_3    3.25
Col_4    2.25
dtype: float64

df34.fillna(0).mean(axis = 1, skipna = False)
Out[228]:
Row_2    0.00
Row_3    0.75
Row_4    3.00
Row_5    7.50
dtype: float64

pandas' groupby

groupby works much like SQL's GROUP BY:

Split-Apply-Combine

1. Split: partition the rows according to some rule
2. Apply: run an aggregation function that takes a pd.Series and returns a single value
3. Combine: collect the results back together

What makes pandas' groupby flexible (see the sketch after this list):

1. The grouping key can come from the index or from actual column data
2. You can group by a single column or by several
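
A minimal sketch of both points on irisdata (the derived key is_wide is mine):

Python
# 1. Group by a real column
irisdata.groupby('class')['sepal_length'].mean()

# 2. Group by a column plus a derived key, without adding it as a column first
is_wide = irisdata['sepal_width'] > 3.0
irisdata.groupby(['class', is_wide]).size()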

What the groups actually contain:

Python
irisdata_group = irisdata.groupby('class')
irisdata_group
Out[230]: <pandas.core.groupby.DataFrameGroupBy object at 0x116c6a898>

for level, subsetDF in irisdata_group:
    print(level)
    print(subsetDF)

Iris-setosa
    sepal_length  sepal_width  petal_length  petal_width        class
0            5.1          3.5           1.4          0.2  Iris-setosa
···
48           5.3          3.7           1.5          0.2  Iris-setosa
49           5.0          3.3           1.4          0.2  Iris-setosa
Iris-versicolor
    sepal_length  sepal_width  petal_length  petal_width            class
50           7.0          3.2           4.7          1.4  Iris-versicolor
···
98           5.1          2.5           3.0          1.1  Iris-versicolor
99           5.7          2.8           4.1          1.3  Iris-versicolor
Iris-virginica
     sepal_length  sepal_width  petal_length  petal_width           class
100           6.3          3.3           6.0          2.5  Iris-virginica
···
148           6.2          3.4           5.4          2.3  Iris-virginica
149           5.9          3.0           5.1          1.8  Iris-virginica

Grouping implements MapReduce-style logic in one line:

Map: the distinct values of the grouping column route each row to its group
Reduce: a function that takes many values and returns one, usually supplied through agg

Python
# This lambda is the adjusted sample skewness -- the same formula behind pd.Series.skew()
irisdata.groupby('class').agg(
    lambda x: ((x - x.mean())**3).sum() * len(x) / (len(x) - 1) / (len(x) - 2) / x.std()**3 if len(x) > 2 else None)
Out[232]:
                 sepal_length  sepal_width  petal_length  petal_width
class
Iris-setosa          0.120087     0.107053      0.071846     1.197243
Iris-versicolor      0.105378    -0.362845     -0.606508    -0.031180
Iris-virginica       0.118015     0.365949      0.549445    -0.129477
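
The same numbers should come out of the built-in method (assuming a pandas version where GroupBy exposes skew):

Python
irisdata.groupby('class').skew()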

Broadcasting aggregates back over the groups: transform

Python
irisdata.groupby('class').transform('mean')
Out[233]:
     sepal_length  sepal_width  petal_length  petal_width
0           5.006        3.418         1.464        0.244
..            ...          ...           ...          ...
149         6.588        2.974         5.552        2.026
[150 rows x 4 columns]

pd.concat([irisdata, irisdata.groupby('class').transform('mean')], axis = 1)
Out[234]:
     sepal_length  sepal_width  petal_length  petal_width           class  \
0             5.1          3.5           1.4          0.2     Iris-setosa
..            ...          ...           ...          ...             ...
149           5.9          3.0           5.1          1.8  Iris-virginica
     sepal_length  sepal_width  petal_length  petal_width
0           5.006        3.418         1.464        0.244
..            ...          ...           ...          ...
149         6.588        2.974         5.552        2.026
[150 rows x 9 columns]
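
Because transform returns a result aligned row-for-row with the original frame, it is handy for within-group standardization; a sketch (the name within_z is mine):

Python
grouped = irisdata.groupby('class')
numeric = irisdata.drop(columns = 'class')
within_z = (numeric - grouped.transform('mean')) / grouped.transform('std')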

Pivot-table operations after grouping by multiple columns (a hierarchical index)

Python
factor1 = np.random.randint(0, 3, 50)
factor2 = np.random.randint(0, 2, 50)
factor3 = np.random.randint(0, 3, 50)
# Note: np.random.randint(50) returns a single scalar (evidently 7 in this run), which
# pandas broadcasts to every row; np.random.randint(0, 50, 50) was probably intended
values = np.random.randint(50)
hierindexDF = pd.DataFrame({'F1': factor1, 'F2': factor2, 'F3': factor3, 'F4': values})
hierindexDF
Out[240]:
    F1  F2  F3  F4
0    2   0   1   7
···
49   1   1   0   7

hierindexDF_gbsum = hierindexDF.groupby(['F1', 'F2', 'F3']).sum()

hierindexDF_gbsum
Out[242]:
          F4
F1 F2 F3
0  0  0   21
      1   21
      2   42
   1  0    7
      1   21
      2   14
1  0  0    7
      1   21
      2   35
   1  0   21
      1    7
      2   35
2  0  0   14
      1   21
      2   14
   1  0    7
      1    7
      2   35

hierindexDF_gbsum.index
Out[243]:
MultiIndex(levels=[[0, 1, 2], [0, 1], [0, 1, 2]],
           labels=[[0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2], [0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1], [0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2]],
           names=['F1', 'F2', 'F3'])

hierindexDF_gbsum.unstack(0)
Out[244]:
       F4
F1      0   1   2
F2 F3
0  0   21   7  14
   1   21  21  21
   2   42  35  14
1  0    7  21   7
   1   21   7   7
   2   14  35  35

hierindexDF_gbsum.unstack(1)
Out[245]:
       F4
F2      0   1
F1 F3
0  0   21   7
   1   21  21
   2   42  14
1  0    7  21
   1   21   7
   2   35  35
2  0   14   7
   1   21   7
   2   14  35

hierindexDF_gbsum.unstack([2, 0])
Out[246]:
    F4
F3   0   1   2   0   1   2   0   1   2
F1   0   0   0   1   1   1   2   2   2
F2
0   21  21  42   7  21  35  14  21  14
1    7  21  14  21   7  35   7   7  35

# And the levels can be stacked back again:
hierindexDF_gbsum.unstack([2, 0]).stack([1, 2])
