Update

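According to recent news, feather has now officially become part of the Apache Arrow project. With backing from leading figures in the industry, its performance has improved further.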

Project home: Apache Arrow

  • The Python package is now pyarrow
  • The R package is now `arrow`
## Python installation
pip install pyarrow

## R installation
install.packages("arrow")
arrow::install_arrow()

Using feather, a data storage file format shared by R and Python

A detailed introduction to the project is on GitHub: https://github.com/wesm/feather

Python

pip install feather-format

R

install.packages("feather")
%%bash
ls -alh /home/william/20200414
total 2.4G
drwx------   2 william william 4.0K Apr 15 17:57 .
drwxr-xr-x 107 william william  12K Apr 15 17:57 ..
-rw-r--r--   1 william william 6.4K Apr 14 08:37 commission.csv
-rw-r--r--   1 william william 1.6M Apr 14 08:37 instrument.csv
-rw-r--r--   1 william william 2.4G Apr 14 15:32 tick.csv

Performance test: Python

import pandas as pd
import numpy as np
import feather
%time tick_csv = pd.read_csv("/home/william/20200414/tick.csv")
for col in tick_csv.columns[6:]:
    tick_csv[col] = tick_csv[col].astype(float)
<string>:2: DtypeWarning: Columns (6,7,13,14,15,16,17,19) have mixed types.Specify dtype option on import or set low_memory=False.


CPU times: user 37.1 s, sys: 3.31 s, total: 40.4 s
Wall time: 41.1 s
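
The DtypeWarning above suggests specifying the `dtype` option on import. A minimal sketch of that idea follows, using a made-up three-column CSV instead of the tick file; the column names `symbol`, `bid`, and `ask` are hypothetical, since the real columns of tick.csv are not listed here.

```python
import io

import pandas as pd

# Made-up sample: the "bid" column has one empty field.
csv_text = "symbol,bid,ask\nA,1.5,1.6\nB,,2.1\nC,3.0,3.2\n"

# Hypothetical column names. Declaring the dtypes up front means pandas
# does not have to infer a type chunk by chunk.
dtypes = {"symbol": str, "bid": "float64", "ask": "float64"}
sample = pd.read_csv(io.StringIO(csv_text), dtype=dtypes)

# The empty field is parsed as NaN in a float64 column.
print(sample.dtypes)
```

With the dtypes declared this way, the `astype(float)` loop after loading would likely no longer be needed, assuming every value in those columns parses as a number.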
tick_csv.head(10)
len(tick_csv)
13373363
## Writing the file is relatively slow because the data has to be serialized
%time tick_csv.to_feather("/home/william/20200414/tick.feather")
CPU times: user 3.26 s, sys: 1.49 s, total: 4.75 s
Wall time: 6.13 s
## Reading the file is very fast
%time tick_feather = pd.read_feather("/home/william/20200414/tick.feather")
CPU times: user 4.34 s, sys: 1.51 s, total: 5.85 s
Wall time: 5.15 s
tick_feather.head(10)
len(tick_feather)
13373363
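
The `%time` magic only works inside IPython or Jupyter. To repeat this kind of comparison in a plain Python script, a small timing helper along these lines might do. The name `wall_time` is a hypothetical helper, not part of pandas or feather, and the summed range is just a stand-in workload.

```python
import time
from contextlib import contextmanager


@contextmanager
def wall_time(label, results):
    """Record the wall-clock seconds spent inside the with-block."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start


timings = {}

# Stand-in workload; in practice this would wrap a read or write call.
with wall_time("sum", timings):
    total = sum(range(1_000_000))

print(f"sum: {timings['sum']:.4f} s")
```

This measures wall time only. Unlike `%time`, it does not report user and system CPU time separately.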

Performance test: R

%load_ext rpy2.ipython
%%R
library(data.table)
%%R
system.time({dt <- fread('/home/william/20200414/tick.csv', verbose = FALSE, showProgress = FALSE)})
   user  system elapsed 
 63.591   1.474  18.146 
%%R
head(dt)
%%R
system.time({dt_feather <- feather::read_feather('/home/william/20200414/tick.feather')})
   user  system elapsed 
  8.342   0.761   9.112 
%%R
head(dt_feather)
%%R
system.time({
    fst::write_fst(dt, "/home/william/20200414/tick.fst")
})
   user  system elapsed 
 10.718   1.065   4.356 
%%R
system.time({
    dt_fst <- fst::read_fst("/home/william/20200414/tick.fst", as.data.table = TRUE)
})
   user  system elapsed 
  6.918   0.751   5.671 

R -> Python

from rpy2.robjects import r, pandas2ri
pandas2ri.activate()
%%R
r_data = data.table(x = 1, y = 2)
r.r_data
     x    y
1  1.0  2.0
py_data = r.r_data
print(py_data)
     x    y
1  1.0  2.0