Warning
This article was last updated on 2020-04-15; its content may be out of date.
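Update
According to news reports, feather has now officially become part of the Apache Arrow project. With support from prominent contributors in the industry, its performance has improved further.
Project page: Apache Arrow
- The Python version is now pyarrow
- The R version is now `arrow`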
## Install for Python
pip install pyarrow
## Install for R
install.packages("arrow")
arrow::install_arrow()
A data storage file format shared by R and Python: feather
A detailed introduction to the project is on GitHub: https://github.com/wesm/feather
python
pip install feather-format
R
install.packages("feather")
%%bash
ls -alh /home/william/20200414
total 2.4G
drwx------ 2 william william 4.0K Apr 15 17:57 .
drwxr-xr-x 107 william william 12K Apr 15 17:57 ..
-rw-r--r-- 1 william william 6.4K Apr 14 08:37 commission.csv
-rw-r--r-- 1 william william 1.6M Apr 14 08:37 instrument.csv
-rw-r--r-- 1 william william 2.4G Apr 14 15:32 tick.csv
Performance test: Python
import pandas as pd
import numpy as np
import feather
%time tick_csv = pd.read_csv("/home/william/20200414/tick.csv")
## Convert every column from the 7th onward to float
for col in tick_csv.columns[6:]:
    tick_csv[col] = tick_csv[col].astype(float)
<string>:2: DtypeWarning: Columns (6,7,13,14,15,16,17,19) have mixed types.Specify dtype option on import or set low_memory=False.
CPU times: user 37.1 s, sys: 3.31 s, total: 40.4 s
Wall time: 41.1 s
13373363
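The DtypeWarning above appears because pandas, by default, parses a large CSV in chunks and infers types per chunk. As the warning itself suggests, passing `dtype` (or `low_memory=False`) to `pd.read_csv` avoids the mixed-type inference, and declaring the float columns up front may also make a later `astype(float)` pass unnecessary. A minimal sketch on a small in-memory CSV; the column names are hypothetical, since the real schema of tick.csv is not shown here.

```python
import io

import pandas as pd

# Tiny stand-in for a CSV with a missing value; column names are hypothetical.
csv_text = "id,price,volume\n1,10.5,100\n2,,200\n3,11,300\n"

# Declare the numeric columns as float64 at read time instead of converting later.
df = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"price": "float64", "volume": "float64"},
)

print(df.dtypes)
```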
## Writing is relatively slow because the data has to be serialized
%time tick_csv.to_feather("/home/william/20200414/tick.feather")
CPU times: user 3.26 s, sys: 1.49 s, total: 4.75 s
Wall time: 6.13 s
## Reading is very fast
%time tick_feather = pd.read_feather("/home/william/20200414/tick.feather")
CPU times: user 4.34 s, sys: 1.51 s, total: 5.85 s
Wall time: 5.15 s
13373363
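The `%time` magic used above only works inside IPython or Jupyter. To repeat these measurements from a plain Python script, something along the following lines could be used. `timed` is a hypothetical helper written for this sketch, not part of pandas or feather, and the workload inside the `with` block is only a placeholder.

```python
import os
import time
from contextlib import contextmanager


@contextmanager
def timed(label):
    """Report CPU and wall-clock time of the enclosed block, similar in spirit to %time."""
    cpu_start = os.times()
    wall_start = time.perf_counter()
    stats = {}
    try:
        yield stats
    finally:
        cpu_end = os.times()
        stats["wall"] = time.perf_counter() - wall_start
        stats["user"] = cpu_end.user - cpu_start.user
        stats["sys"] = cpu_end.system - cpu_start.system
        print(
            f"{label}: user {stats['user']:.2f} s, "
            f"sys {stats['sys']:.2f} s, wall {stats['wall']:.2f} s"
        )


# Placeholder workload; swap in a call such as pd.read_feather(...) to time real I/O.
with timed("sum of squares") as stats:
    total = sum(i * i for i in range(1_000_000))
```

After the block finishes, `stats` holds the three measurements, so several runs can be collected and compared instead of relying on a single timing.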
Performance test: R
%%R
library(data.table)
%%R
system.time({dt <- fread('/home/william/20200414/tick.csv', verbose = FALSE, showProgress = FALSE)})
user system elapsed
63.591 1.474 18.146
%%R
system.time({dt_feather <- feather::read_feather('/home/william/20200414/tick.feather')})
user system elapsed
8.342 0.761 9.112
%%R
system.time({
fst::write_fst(dt, "/home/william/20200414/tick.fst")
})
user system elapsed
10.718 1.065 4.356
%%R
system.time({
dt_fst <- fst::read_fst("/home/william/20200414/tick.fst", as.data.table = TRUE)
})
user system elapsed
6.918 0.751 5.671
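To put the read timings above side by side, the wall-clock (Python) and elapsed (R) seconds can be turned into speedup ratios relative to reading the CSV. The numbers below are copied from the outputs in this post; they come from single runs on one machine, so the ratios are only a rough indication. `speedups` is a hypothetical helper for this sketch.

```python
# Wall-clock / elapsed seconds copied from the timing outputs in this post.
python_wall = {"pd.read_csv": 41.1, "pd.read_feather": 5.15}
r_elapsed = {
    "data.table::fread": 18.146,
    "feather::read_feather": 9.112,
    "fst::read_fst": 5.671,
}


def speedups(timings, baseline):
    """Return how many times faster each entry is than the baseline entry."""
    base = timings[baseline]
    return {name: base / secs for name, secs in timings.items() if name != baseline}


py_ratios = speedups(python_wall, "pd.read_csv")
r_ratios = speedups(r_elapsed, "data.table::fread")

for name, ratio in {**py_ratios, **r_ratios}.items():
    print(f"{name}: {ratio:.1f}x faster than the CSV reader")
```

On these figures, reading the Feather file is roughly 8 times faster than `pd.read_csv` in Python, while in R it is about 2 times faster than `fread`, with `fst` at about 3.2 times.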
R -> Python
from rpy2.robjects import r, pandas2ri
pandas2ri.activate()
%%R
r_data = data.table(x = 1, y = 2)
x y
1 1.0 2.0