feather: a data storage format shared between R and Python

Warning
This article was last updated on 2020-04-15; its content may be out of date.

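Update

According to news reports, feather has now officially become part of the Apache Arrow project. With backing from major industry players, its performance is even better.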

Project: Apache Arrow

  • The Python package is now pyarrow
  • The R package is now `arrow`
## Python installation


## R installation
install.packages("arrow")
arrow::install_arrow()

A data storage file format shared by R and Python: feather

A detailed introduction to the project is on GitHub: https://github.com/wesm/feather

Python

pip install feather-format

R

install.packages("feather")
%%bash
ls -alh /home/william/20200414
total 2.4G
drwx------   2 william william 4.0K Apr 15 17:57 .
drwxr-xr-x 107 william william  12K Apr 15 17:57 ..
-rw-r--r--   1 william william 6.4K Apr 14 08:37 commission.csv
-rw-r--r--   1 william william 1.6M Apr 14 08:37 instrument.csv
-rw-r--r--   1 william william 2.4G Apr 14 15:32 tick.csv

Performance test: Python

import pandas as pd
import numpy as np
import feather
%time tick_csv = pd.read_csv("/home/william/20200414/tick.csv")
for col in tick_csv.columns[6:]:
    tick_csv[col] = tick_csv[col].astype(float)
<string>:2: DtypeWarning: Columns (6,7,13,14,15,16,17,19) have mixed types.Specify dtype option on import or set low_memory=False.


CPU times: user 37.1 s, sys: 3.31 s, total: 40.4 s
Wall time: 41.1 s
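
The DtypeWarning above suggests specifying `dtype` at import time instead of casting afterwards. Here is a rough sketch of that approach on a tiny in-memory CSV. The column names are hypothetical, and the float columns start at position 1 here rather than at position 6 as in the tick file. If a column contains values that cannot be parsed as numbers, `read_csv` will raise an error, so this only works when the data is clean.

```python
import io

import pandas as pd

# Tiny stand-in for tick.csv; the column names are hypothetical.
csv_text = "symbol,price,volume\nrb2010,3350.0,12\nrb2010,3351.5,7\n"

# First pass: read only the header to learn the column names.
columns = pd.read_csv(io.StringIO(csv_text), nrows=0).columns

# Declare every column from position 1 onward as float up front
# (the article casts columns[6:] after loading instead).
float_cols = {col: "float64" for col in columns[1:]}

df = pd.read_csv(io.StringIO(csv_text), dtype=float_cols)
print(df.dtypes)
```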
tick_csv.head(10)
len(tick_csv)
13373363
## Writing the file is relatively slow because the data has to be serialized
%time tick_csv.to_feather("/home/william/20200414/tick.feather")
CPU times: user 3.26 s, sys: 1.49 s, total: 4.75 s
Wall time: 6.13 s
## Reading the file is very fast
%time tick_feather = pd.read_feather("/home/william/20200414/tick.feather")
CPU times: user 4.34 s, sys: 1.51 s, total: 5.85 s
Wall time: 5.15 s
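
`%time` is an IPython magic and only works inside a notebook or IPython shell. Outside of one, a small helper along these lines could collect comparable wall-clock numbers. The `timed` function is a hypothetical name introduced here, and the workload is a stand-in so the sketch runs without the tick file.

```python
import time


def timed(func, *args, **kwargs):
    """Call func and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


# Stand-in workload; with the article's data this might instead be
# timed(pd.read_feather, "/home/william/20200414/tick.feather")
total, seconds = timed(sum, range(1_000_000))
print(f"sum={total}, wall time={seconds:.3f} s")
```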
tick_feather.head(10)
len(tick_feather)
13373363

Performance test: R

%load_ext rpy2.ipython
%%R
library(data.table)
%%R
system.time({dt <- fread('/home/william/20200414/tick.csv', verbose = FALSE, showProgress = FALSE)})
   user  system elapsed
 63.591   1.474  18.146
%%R
head(dt)
%%R
system.time({dt_feather <- feather::read_feather('/home/william/20200414/tick.feather')})
   user  system elapsed
  8.342   0.761   9.112
%%R
head(dt_feather)
%%R
system.time({
    fst::write_fst(dt, "/home/william/20200414/tick.fst")
})
   user  system elapsed
 10.718   1.065   4.356
%%R
system.time({
    dt_fst <- fst::read_fst("/home/william/20200414/tick.fst", as.data.table = TRUE)
})
   user  system elapsed
  6.918   0.751   5.671

R -> Python

from rpy2.robjects import r, pandas2ri
pandas2ri.activate()
%%R
r_data = data.table(x = 1, y = 2)
r.r_data

x y
1 1.0 2.0
py_data = r.r_data
print(py_data)
     x    y
1  1.0  2.0
