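Ever since I started using the tidyverse family of R packages, I have been completely won over by the tidyverse data-processing workflow. An earlier weekly roundup mentioned two similar Python packages, dplython and pandas-ply. You can download my Jupyter notebook here: 在 Python 中像 dplyr 那样操作数据框.ipynb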
The dplython package

    import pandas as pd
    from dplython import (DplyFrame, X, diamonds, select, sift, sample_n,
                          sample_frac, head, arrange, mutate, group_by,
                          summarize, DelayFunction)
Selecting columns and previewing the first rows

    # Keep three columns, then show the first five rows
    diamonds >> select(X.carat, X.cut, X.price) >> head(5)
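The `>>` chaining used throughout this post relies on Python's operator overloading. As a rough illustration only (this is not dplython's actual implementation, and `Verb`, `select_cols` and `first` are hypothetical names), a pipeable verb can be sketched in plain Python, with the rows stored as a list of dicts:

```python
class Verb:
    """Hypothetical pipeable step: wraps a function that maps rows to rows."""

    def __init__(self, func):
        self.func = func

    def __rrshift__(self, rows):
        # Python calls this for `rows >> verb`, because a plain list
        # does not define `>>` itself.
        return self.func(rows)


def select_cols(*names):
    # Keep only the named keys in every row.
    return Verb(lambda rows: [{n: row[n] for n in names} for row in rows])


def first(n):
    # Keep the first n rows.
    return Verb(lambda rows: rows[:n])


rows = [
    {"carat": 0.23, "cut": "Ideal", "price": 326},
    {"carat": 0.21, "cut": "Premium", "price": 326},
    {"carat": 0.23, "cut": "Good", "price": 327},
]

result = rows >> select_cols("carat", "price") >> first(2)
print(result)
```

Each verb call only builds a small object; the work happens when data arrives from the left of `>>`, which is what makes the left-to-right reading possible.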

Filtering rows

    (diamonds >>
      sift(X.carat > 4) >>
      select(X.carat, X.cut, X.depth, X.price))

Sampling

    # Draw a fixed number of rows
    (diamonds >>
      sample_n(10) >>
      arrange(X.carat) >>
      select(X.carat, X.cut, X.depth, X.price))

    # Draw a fraction of the rows
    (diamonds >>
      sample_frac(0.0002) >>
      arrange(X.depth) >>
      select(X.carat, X.depth, X.price))

Passing the whole data frame

`X._` refers to the entire data frame, so you can operate on it directly, for example to transpose it:

    (diamonds >>
      sample_n(5) >>
      select(X.carat, X.price, X.depth) >>
      X._.T)

Passing the data frame or columns to a function

To pass the data frame or its columns to your own function, decorate the function with @DelayFunction.

    @DelayFunction
    def PairwiseGreater(series1, series2):
        # Element-wise maximum of two columns, keeping the original index
        index = series1.index
        newSeries = pd.Series([max(s1, s2) for s1, s2 in zip(series1, series2)])
        newSeries.index = index
        return newSeries

    diamonds >> head(5) >> PairwiseGreater(X.x, X.y)

    # The two input columns, for comparison
    diamonds >> head(5) >> select(X.x, X.y)
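The `@DelayFunction` decorator is needed because an expression such as `X.x` is only a placeholder when the line is written, so an ordinary function would receive the placeholder rather than a real column. The sketch below is a conceptual illustration of that delayed evaluation, not dplython's code: `Column`, `delay` and `pairwise_greater` are hypothetical names, and the "data frame" is just a dict of lists.

```python
class Column:
    """Hypothetical placeholder for a column that is looked up later."""

    def __init__(self, name):
        self.name = name

    def resolve(self, frame):
        return frame[self.name]


def delay(func):
    """Wrap func so that calling it only records the arguments; the real
    call happens once a frame is piped in with `>>`."""

    class Delayed:
        def __init__(self, args):
            self.args = args

        def __rrshift__(self, frame):
            # Swap every placeholder for the actual column, then call func.
            real = [a.resolve(frame) if isinstance(a, Column) else a
                    for a in self.args]
            return func(*real)

    return lambda *args: Delayed(args)


@delay
def pairwise_greater(xs, ys):
    return [max(x, y) for x, y in zip(xs, ys)]


frame = {"x": [3.95, 3.89, 4.05], "y": [3.98, 3.84, 4.07]}
larger = frame >> pairwise_greater(Column("x"), Column("y"))
print(larger)
```

Without the wrapper, `pairwise_greater(Column("x"), Column("y"))` would try to zip two placeholder objects immediately and fail before any data was available.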
mutate()

    (diamonds >>
      mutate(carat_bin=X.carat.round()) >>
      group_by(X.cut, X.carat_bin) >>
      summarize(avg_price=X.price.mean()))
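For readers more used to plain pandas, roughly the same mutate / group_by / summarize pipeline can be written with `assign`, `groupby` and `agg`. This is a sketch on a tiny hand-made frame; the values are made up for illustration and are not taken from the diamonds dataset.

```python
import pandas as pd

# Tiny stand-in for the diamonds data; values are illustrative only.
df = pd.DataFrame({
    "cut": ["Ideal", "Ideal", "Good", "Good"],
    "carat": [0.4, 0.3, 1.2, 0.9],
    "price": [500, 700, 4000, 3000],
})

summary = (
    df.assign(carat_bin=df["carat"].round())            # like mutate()
      .groupby(["cut", "carat_bin"], as_index=False)    # like group_by()
      .agg(avg_price=("price", "mean"))                 # like summarize()
)
print(summary)
```

The named-aggregation form `avg_price=("price", "mean")` keeps the output column name next to the computation, much as `summarize(avg_price=...)` does.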

If a column name cannot be used as an attribute (for example, because it contains spaces), index X with the name instead:

    diamonds["column w/ spaces"] = range(len(diamonds))
    diamonds >> select(X["column w/ spaces"]) >> head()
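Passing the whole data frame to ggplot for plotting

If importing the ggplot library fails, see this blog post: 基于 Python 的 Anaconda3,导包报错 cannot import name ‘Timestamp’ (Anaconda3 for Python: import fails with cannot import name ‘Timestamp’)

    from ggplot import ggplot, aes, geom_point, facet_wrap

    # Wrap ggplot so that it can take X._ (the piped data frame) as its data
    ggplot = DelayFunction(ggplot)
    (diamonds >> ggplot(aes(x="carat", y="price", color="cut"), data=X._) +
      geom_point() + facet_wrap("color"))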

Passing data frame columns to pylab for plotting

    import pylab as pl

    # Wrap scatter so that it accepts the delayed columns X.carat and X.price
    pl.scatter = DelayFunction(pl.scatter)
    diamonds >> sample_frac(0.1) >> pl.scatter(X.carat, X.price)

The pandas-ply package

    import pandas as pd
    from pandas_ply import install_ply, X, sym_call

    # Attach the ply_* methods to pandas objects
    install_ply(pd)
ply_select()
First, read the feather data exported from R. On the R side, generate the flights.feather file with the following code:

    nycflights13::flights %>% feather::write_feather("flights.feather")

Then, in Python:

    import feather

    flights = feather.read_dataframe('flights.feather')
    flights.head(5)

    (flights
      .groupby(['year', 'month', 'day'])
      .ply_select(
        arr = X.arr_delay.mean(),
        dep = X.dep_delay.mean())
      .ply_where(X.arr > 30, X.dep > 30)
      .head(5))

The same operation in plain pandas:

    grouped_flights = flights.groupby(["year", "month", "day"])
    output = pd.DataFrame()
    output['arr'] = grouped_flights.arr_delay.mean()
    output['dep'] = grouped_flights.dep_delay.mean()
    filtered_output = output[(output.arr > 30) & (output.dep > 30)]
    filtered_output.head(5)

    flights.ply_select('*',
        gain = X.arr_delay - X.dep_delay,
        speed = X.distance / X.air_time * 60).head(5)
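In plain pandas, adding computed columns while keeping all the existing ones (the role `'*'` plays in the call above) is usually done with `DataFrame.assign`. A sketch with a small made-up frame standing in for flights; `flights_demo` and its values are illustrative assumptions, not real flight records:

```python
import pandas as pd

# Made-up rows standing in for the flights data.
flights_demo = pd.DataFrame({
    "arr_delay": [11.0, 20.0],
    "dep_delay": [2.0, 4.0],
    "distance": [1200, 1400],
    "air_time": [200.0, 175.0],
})

with_speed = flights_demo.assign(
    gain=lambda d: d["arr_delay"] - d["dep_delay"],
    speed=lambda d: d["distance"] / d["air_time"] * 60,
)
print(with_speed)
```

The lambdas receive the frame being built, which plays a role similar to the `X` placeholder in pandas-ply.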

ply_where()

    flights.ply_where(X.month == 1, X.day == 1).head(5)