如何搭建 Stock.Daily 数据

警告
本文最后更新于 2022-11-19,文中内容可能已过时。

如何搭建 Stock.Daily 数据

Created: November 18, 2022 10:09 PM Last Edited Time: November 18, 2022 11:43 PM Type: Technical Spec

对于股票研究与交易,我们需要一份内容正确且规则统一的历史数据 ,尽可能的降低数据噪音,避免因为数据错误导致的谬误,提高实证研究的可靠性与准确性。

根据现有的数据,我们复原了2014年1月至今的股票日频数据(stock.daly),运用统一的规则,对历史数据与每日更新的实盘数据进行获取、清洗、转换、验证、入库。

本文档旨在介绍股票日盘数据库的整理规则与入库标准,以供参考。

数据概览

数据来源

  • 历史数据(2014.01.01 ~ 2022.11.18)
    • PreClose、OHLC、Volume、Turnover 来自交易所
      • 上海:根据购买的交易所历史数据(Day.csv)
      • 深圳:从深交所网站爬虫获取
    • UpperLimit、LowerLimit:来自 RiceQuant api 获取
  • 每日更新数据(2022-11-18起)
    • PreClose、UpperLimit、LowerLimit:来自交易柜台查询(xtp_info, xeleq_info)
    • OHLC、Volume、Turnover 来自万得落地数据库(MySQL)、交易所网站爬虫
  • BeginDay、EndDay:爬虫交易所股票列表
  • TotalShare、FloatAShare:万得落地数据库(MySQL)、交易所网站爬虫
  • STStatus:根据股票名称是否包含 ST|退 判断,实盘数据来自交易柜台查询、历史数据来自网易财经 Net163Scraper

每日更新时间:18:30/19:30

数据库格式

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
CREATE TABLE stock.daily
(
    `TradingDay` Date,
    `Exchange` LowCardinality(String),
    `Category` Enum('X'=0,'FUTURES'=1,'STOCK'=2,'BOND'=3,'FUND'=4),
    `Product` Enum('X'=0,'STK'=1,'ETF'=2,'INDEX'=3),
    `Symbol` LowCardinality(String),
    `PreClose` Nullable(Float32),
    `Open` Nullable(Float32),
    `High` Nullable(Float32),
    `Low` Nullable(Float32),
    `Close` Float32,
    `UpperLimit` Nullable(Float32),
    `LowerLimit` Nullable(Float32),
    `Volume` Nullable(UInt64),
    `Amount` Nullable(Float64),
    `BeginDay` Nullable(Date),
    `EndDay` Nullable(Date),
    `STStatus` Enum('TRUE'=1,'FALSE'=0) DEFAULT 0,
    `TradeStatus` Enum('X'=0,'NEW'=1,'TRADING'=2,'SUSPENDED'=3),
    `TotalShare` Nullable(Float32),
    `FloatAShare` Nullable(Float32)
)
ENGINE = MergeTree
ORDER BY (TradingDay, Exchange, Symbol)
SETTINGS index_granularity = 8192

字段说明

  • CategoryProduct 用于区分不同交易市场
    • FUTURES:期货类
    • STOCK:证券类
      • STK
      • INDEX
    • BOND:债券类
    • FUND:基金类
      • ETF
  • PreClose : 交易所除权除息后的昨收,可结合 LastClose 计算复权因子(AdjFactor
  • BeginDay : 股票上市日期
  • EndDay : 股票退市日期
  • STStatus : 股票是否出于 ST 状态(ST|S|*ST|SST|退市),每日根据股票名称是否包含ST或退来判断
  • TradeStatus : 当天的交易状态
    • NEW : 上市第一日
    • TRADING : 正常交易
    • SUSPENDED : 当日停牌
  • TotalShare : 总股本(万得对应:TOT_SHARE_TODAY)
  • FloatAShare : 流通股(总股本减去限售股)

数据清洗规则

  • 对于没有涨跌停的合约(上市前五日),统一设置涨跌停价格为

    1
    2
    3
    4
    5
    6
    7
    
    UPPER_LIMIT_PX_THRES = 999999.999
    LOWER_LIMIT_PX_THRES = 0.001
    
    df.fillna({'UpperLimit': self.UPPER_LIMIT_PX_THRES}, inplace = True)
    df.fillna({'LowerLimit': self.LOWER_LIMIT_PX_THRES}, inplace = True)
    df.loc[((df['days'] <= 5) | (df.TradeStatus.isin(['NEW']))) & (df.UpperLimit <= 0.01), 'UpperLimit'] = self.UPPER_LIMIT_PX_THRES
    df.loc[((df['days'] <= 5) | (df.TradeStatus.isin(['NEW']))) & (df.LowerLimit <= 0.01), 'LowerLimit'] = self.LOWER_LIMIT_PX_THRES
  • 对于停牌的股票:

    • 有可能拿不到当日的数据,这时候需要先把当日所有的合约补齐(TRADING + SUSPENDED)

       1
       2
       3
       4
       5
       6
       7
       8
       9
      10
      11
      12
      13
      
      _day = cal.ymd(df.TradingDay.values[0])
      sql = f"""
      		select Exchange,Symbol,
                 OnlistingDay,
                 DelistingDay
          from bardata.listing
          where OnlistingDay <= '{_day}'
          and DelistingDay > '{_day}'
          """
      df_listing = ch.read(sql)
      
      ## 先合成所有的上市合约
      df = pd.merge(df, df_listing, on = 'Symbol', how = 'outer', suffixes = ['', '_listing'])
    • 使用上一次有效交易日的收盘价作为当天的 OHLC,Volume = 0, Turnover = 0.0, UpperLimit = LowerLimit = PreClose

       1
       2
       3
       4
       5
       6
       7
       8
       9
      10
      11
      12
      13
      14
      15
      16
      17
      18
      19
      20
      21
      22
      23
      24
      25
      26
      27
      28
      29
      30
      31
      32
      33
      34
      35
      
      ## 获取上一次有效交易日的数据
      sql = f"""
      		select stock.daily.* from stock.daily as a,
                 (select Symbol, max(TradingDay) as LastDay
                  from stock.daily
                  where TradingDay < '{_day}'
                  and Volume > 0
                  group by Symbol) as b
           where a.Symbol = b.Symbol and a.TradingDay = b.LastDay
           and BeginDay <= '{_day}'
           and EndDay > '{_day}'
      		"""
      df_last = self.ch.read(sql)
      ## 然后使用上一次最后交易的数据来补充
      df = pd.merge(df, df_last, on = 'Symbol', how = 'outer', suffixes = ['', '_last'])
      
      l = ['PreClose', 'Open', 'High', 'Low', 'Close', 'UpperLimit', 'LowerLimit',
           'Volume', 'Amount', 'STStatus', 'TradeStatus', 'TotalShare', 'FloatAShare']
      for col in l:
      		if col in ['Volume', 'Amount']:
      		    val = 0
      		elif col in ['PreClose', 'Open', 'High', 'Low', 'Close', 'UpperLimit', 'LowerLimit']:
              val = df.loc[(df.Product_last == 'STK') &
                           (df.Exchange_last.isin(['sse', 'szse'])) &
                           ((pd.isna(df[col])) | (df.Volume <= 0.0001)), "Close_last"]
          elif col in ['TradeStatus']:
      		    val = 'SUSPENDED'
          else:
              val = df.loc[(df.Product_last == 'STK') &
                           (df.Exchange_last.isin(['sse', 'szse'])) &
                           ((pd.isna(df[col])) | (df.Volume <= 0.0001)), f"{col}_last"]
      
          df.loc[(df.Product_last == 'STK') &
                 (df.Exchange_last.isin(['sse', 'szse'])) &
                 ((pd.isna(df[col]) | (df.Volume <= 0.0001))), col] = val
  • 对于指数,统一设置:

    1
    2
    3
    4
    5
    6
    7
    
    for col in ['Open', 'High', 'Low']:
    		df.loc[(df.Product == 'INDEX') & (df[col] <= 0.0001), col] = df.loc[(df.Product == 'INDEX') & (df[col] <= 0.0001), 'Close']
    for col in ['UpperLimit', 'LowerLimit', 'TotalShare', 'FloatAShare']:
    		df.loc[df.Product == 'INDEX', col] = np.nan
    
    df.loc[df.Product == 'INDEX', 'TradeStatus'] = 'TRADING'
    df.loc[df.Product == 'INDEX', 'STStatus'] = False
  • STStatue 规则:

     1
     2
     3
     4
     5
     6
     7
     8
     9
    10
    11
    
    _day = cal.ymd(df.TradingDay.values[0])[:10]
    df_xtp = self.read_xtp_info(_day)
    
    df = pd.merge(df, df_xtp,
                  left_on = ['Symbol'],
                  right_on = ['Symbol'],
                  how = 'left',
                  suffixes = ['', '_xtp'])
    df['STStatus'] = df['TickerName'].apply(
        lambda x: True if re.search(r"sT|St|st|ST|退|退市|^退", str(x), re.IGNORECASE) else False
        )
  • 其他字段设置:

     1
     2
     3
     4
     5
     6
     7
     8
     9
    10
    11
    12
    13
    14
    15
    16
    
    df['Category'] = df[['Category', 'Symbol']].apply(
        lambda x:
            self.convert_category(
                x[0] if x[0] in self.CATEGORY_DICT
                else self.guess_category_from_symbol(x[1])
                ),
        axis = 1)
    df['Product'] = df[['Product', 'Symbol']].apply(
        lambda x:
            self.convert_product(
                x[0] if not pd.isna(x[0])
                else self.guess_product_from_symbol(x[1])
                ),
        axis = 1)
    df['STStatus'] = df['STStatus'].apply(lambda x: 1 if x else 0)
    df['TradeStatus'] = df['TradeStatus'].apply(self.convert_tradestatus)

入库前检查规则

PreClose, Open, High, Low, Close, UpperLimit, LowerLimit, (Amount/Volume), 这几个应该是同一个数量级; TotalShare, FloatAShare, FreeShare, (NetAsset/Close) 应该是差不多是一个。

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
    检查项目:
    1. vwap vs close
    2. turnover = Volume/FloatAShare < 1.0,
    3. lower_limit <= px <= upper_limit
    4. pb > 0.2(去掉负的 NetAsset)
    5. 涨跌停幅度(去掉新股)
    6. 关于 TotalShare vs TotalShareToday: 前面是全部股本,后面是当日股本,比如
        TotalShare < FloatAShare: ---------------------------
        TradingDay     Symbol         TotalShare        FloatAShare
        2020-10-19  603882.SH 457,884,577.000000 459,487,577.000000
    7. 如果是 ST, 涨跌幅是 5%

    可以再加两个检查,
    1. 就是'STK'里TradeStatus不应该出现'X'(我这边现在就是用Suspen来判断股票不交易)。
    2. 也可以检查一下每天,Volume=0, Amount=0和TradeStatus='Sus' 是不是同一批股票

    对于index的检查(000016.SH, 000300.SH, 000905.SH, 000852.SH, 000985.CSI)暂时只需要检查这五个就行了。
    1. 每天都有数据
    2. PreClose, Open, High, Low, Close,Volume, Amount不能有NaN。UpperLimit, LowerLimit,Totalshare, FloatAshares强制是NaN.

历史数据

每日更新数据

相关内容

william 支付宝支付宝
william 微信微信
0%