R 读取数据

萝卜与青菜

我在之前的一篇博客谈到了如何使用 readr读取包含中文字段的数据文件。同时,文件也提到,readr 对于读取大文件存在性能上的不足。因此,我日常仍然以另外一个软件包为主要工具。

这个神奇的软件包,就是大名鼎鼎的 data.table,提高了从读取数据、数据清洗、数据汇总到数据写入的全方位功能函数,而且难能可贵的是,它的处理效率惊人得强大。我估计会在以后单独写一篇博客来介绍这个扩展包。不过这里作为引子,请允许我隆重介绍今天的主角:fread

read_csv

library(readr)
help(read_csv)

fread

library(data.table)
help(fread)

myFread.R

myFraed.R 是我编写的一个函数脚本,集成了以上两个函数的功能,并在此基础上对数据文件进行初步的判断:

  • 如果数据文件支持使用 fread 进行读取,则优先使用这个函数,因为这个读取速度极快,对于一个 1Gb 大小的文件,大概只需要10 ~ 15 秒`
  • 如果对于部分数据文件,有可能由于里面包含中文字段、数据格式异常等问题,这时候就需要借助 read_csv 更加强大的底层支撑来处理,不过读取的速度稍微有所降低

FromDC

## =============================================================================
## FUN: myFreadBar
## 从 DC 那份数据文件读取数据,用于制作 Bar
myFreadFromDC <- function(x){
  ## -- 如果使用 fread 可以正常读取数据文件
  if(class(try(fread(x, showProgress = FALSE, fill = TRUE, nrows = 1000),
               silent = TRUE))[1] != "try-error"){
    dt <- fread(x, showProgress = TRUE, fill = TRUE,
                select = c('TradingDay','UpdateTime','UpdateMillisec'
                           ,'InstrumentID','LastPrice'
                           ,"OpenPrice", "HighestPrice", "LowestPrice","ClosePrice"
                           ,'Volume','Turnover','OpenInterest'
                           ,'SettlementPrice','UpperLimitPrice','LowerLimitPrice'
                           ,'BidPrice1','BidVolume1','BidPrice2','BidVolume2'
                           ,'BidPrice3','BidVolume3','BidPrice4','BidVolume4'
                           ,'BidPrice5','BidVolume5'
                           ,'AskPrice1','AskVolume1','AskPrice2','AskVolume2'
                           ,'AskPrice3','AskVolume3','AskPrice4','AskVolume4'
                           ,'AskPrice5','AskVolume5','AveragePrice'
                ),
                colClasses = list(character = c("TradingDay","InstrumentID","UpdateTime"),
                                  numeric   = c("Volume","Turnover") ))
  }else{
  ## -- 如果使用 fread 读取失败,则使用 read_csv
    dt <- read_csv(x,
                   col_types = list(TradingDay   = col_character(),
                                    InstrumentID = col_character(),
                                    UpdateTime   = col_character(),
                                    Volume       = col_number(),
                                    Turnover     = col_number())
    ) %>% as.data.table() %>%
      .[,.(TradingDay, UpdateTime, UpdateMillisec
           ,InstrumentID,LastPrice
           ,OpenPrice, HighestPrice, LowestPrice,ClosePrice
           ,Volume,Turnover,OpenInterest
           ,SettlementPrice,UpperLimitPrice,LowerLimitPrice
           ,BidPrice1,BidVolume1,BidPrice2,BidVolume2
           ,BidPrice3,BidVolume3,BidPrice4,BidVolume4
           ,BidPrice5,BidVolume5
           ,AskPrice1,AskVolume1,AskPrice2,AskVolume2
           ,AskPrice3,AskVolume3,AskPrice4,AskVolume4
           ,AskPrice5,AskVolume5,AveragePrice)]
  }
  ##----------------------------------------------------------------------------
  return(dt)
}

CTPMD

## =============================================================================
## FUN: myFreadBarCTP
## 用于制作 bar
myFreadBarCTP <- function(x){
  ## -- 如果使用 fread 可以正常读取数据文件
  if(class(try(fread(x, showProgress = FALSE, fill = TRUE, nrows = 1000),
               silent = TRUE))[1] != "try-error"){
    dt <- fread(x, showProgress = TRUE, fill = TRUE,
                select = c('TimeStamp','TradingDay','UpdateTime','UpdateMillisec'
                           ,'InstrumentID','LastPrice'
                           ,"OpenPrice", "HighestPrice", "LowestPrice","ClosePrice"
                           ,'Volume','Turnover','OpenInterest'
                           ,'SettlementPrice','UpperLimitPrice','LowerLimitPrice'
                           ,'BidPrice1','BidVolume1','BidPrice2','BidVolume2'
                           ,'BidPrice3','BidVolume3','BidPrice4','BidVolume4'
                           ,'BidPrice5','BidVolume5'
                           ,'AskPrice1','AskVolume1','AskPrice2','AskVolume2'
                           ,'AskPrice3','AskVolume3','AskPrice4','AskVolume4'
                           ,'AskPrice5','AskVolume5'
                ),
                colClasses = list(character = c("TradingDay","InstrumentID","UpdateTime"),
                                  numeric   = c("Volume","Turnover") )) %>%
      .[grep("^[0-9]{8}:[0-9]{2}:[0-9]{2}:[0-9]{2}:[0-9]{4,6}$", TimeStamp)]
  }else{
  ## -- 如果使用 fread 读取失败,则使用 read_csv
    dt <- read_csv(x,
                   col_types = list(TradingDay   = col_character(),
                                    InstrumentID = col_character(),
                                    UpdateTime   = col_character(),
                                    Volume       = col_number(),
                                    Turnover     = col_number())
    ) %>% as.data.table() %>%
      .[grep("^[0-9]{8}:[0-9]{2}:[0-9]{2}:[0-9]{2}:[0-9]{4,6}$", TimeStamp)] %>%
      .[,.(TimeStamp, TradingDay, UpdateTime, UpdateMillisec
           ,InstrumentID,LastPrice
           ,OpenPrice, HighestPrice, LowestPrice,ClosePrice
           ,Volume,Turnover,OpenInterest
           ,SettlementPrice,UpperLimitPrice,LowerLimitPrice
           ,BidPrice1,BidVolume1,BidPrice2,BidVolume2
           ,BidPrice3,BidVolume3,BidPrice4,BidVolume4
           ,BidPrice5,BidVolume5
           ,AskPrice1,AskVolume1,AskPrice2,AskVolume2
           ,AskPrice3,AskVolume3,AskPrice4,AskVolume4
           ,AskPrice5,AskVolume5)]
  }
  ##----------------------------------------------------------------------------
  return(dt)
}

vnpyData

## =============================================================================
## FUN: myFreadvnpy
myFreadvnpy <- function(x){
  ## -- 如果使用 fread 可以正常读取数据文件
  if(class(try(fread(x, showProgress = FALSE, fill = TRUE, nrows = 100000),
               silent = TRUE))[1] != "try-error"){
    dt <- fread(x, showProgress = TRUE, fill = TRUE,
                select = c('timeStamp','date','time'
                           ,'symbol','lastPrice'
                           ,"openPrice", "highestPrice", "lowestPrice","closePrice"
                           ,'volume','turnover','openInterest'
                           ,'settlementPrice','upperLimit','lowerLimit'
                           ,'bidPrice1','bidVolume1','bidPrice2','bidVolume2'
                           ,'bidPrice3','bidVolume3','bidPrice4','bidVolume4'
                           ,'bidPrice5','bidVolume5'
                           ,'askPrice1','askVolume1','askPrice2','askVolume2'
                           ,'askPrice3','askVolume3','askPrice4','askVolume4'
                           ,'askPrice5','askVolume5'
                ),
                colClasses = list(character = c("date","symbol","time"),
                                  numeric   = c("volume","turnover") )) %>%
      .[grep("^[0-9]{8} [0-9]{2}:[0-9]{2}:[0-9]{2}.[0-9]{4,6}$", timeStamp)]
      ## 考虑到部分文件可能使用的 Timestamp 是乱码
  }else{
  ## -- 如果使用 fread 读取失败,则使用 read_csv
    dt <- read_csv(x,
                   col_types = list(timeStamp = col_character(),
                                    date      = col_character(),
                                    symbol    = col_character(),
                                    time      = col_character(),
                                    volume    = col_number(),
                                    turnover  = col_number())
                   ) %>% as.data.table() %>%
      .[grep("^[0-9]{8} [0-9]{2}:[0-9]{2}:[0-9]{2}.[0-9]{4,6}$", timeStamp)] %>%
      .[,.(timeStamp, date, time
           ,symbol,lastPrice
           ,openPrice,highestPrice,lowestPrice,closePrice
           ,volume,turnover,openInterest
           ,settlementPrice,upperLimit,lowerLimit
           ,bidPrice1,bidVolume1,bidPrice2,bidVolume2
           ,bidPrice3,bidVolume3,bidPrice4,bidVolume4
           ,bidPrice5,bidVolume5
           ,askPrice1,askVolume1,askPrice2,askVolume2
           ,askPrice3,askVolume3,askPrice4,askVolume4
           ,askPrice5,askVolume5)]
  }
  ##----------------------------------------------------------------------------
  return(dt)
}