ULL (ULTRA LOW LATENCY) ARCHITECTURES FOR ELECTRONIC TRADING

ULL (Ultra Low Latency) Architectures for Electronic Trading *NYU SPS – Online, Adjunct Instructor: Ted Hruzd* FALL 2019 GLOBAL ON-LINE SEP 18 – NOV 7

8 weeks with 4 modules and 5 assignments. Optional 90-minute weekly Google Hangouts sessions, plus collaboration via WhatsApp with my responses within 24 hours.

On-Line Registration – https://www.sps.nyu.edu/professional-pathways/topics/finance/asset-management-and-investment-strategies/FINA1-CE9515-ull-ultra-low-latency-architectures-for-electronic-trading.html

*** All course content will be online by Sep 16 for the Sep 18 start. WhatsApp will also be set up so we can all communicate, collaborate, and discuss assignments. An optional 90-minute Google Hangouts session will then be scheduled Saturday mornings, New York time, likely 9:00 am – 10:30 am.

Course Objectives

Develop advanced skills in architecting electronic trading (ET) and market data applications for ultra low latency (ULL), for competitive advantage, and for positive ROI. By the end of the course, one will have developed expertise in end-to-end architecture of ET applications and infrastructure, including:

  • roles of FPGA’s, GPU’s, over-clocked servers, and high-end Intel Cascade Lake and AMD EPYC Rome servers
  • Linux kernel and NIC kernel bypass tuning,
  • options available for architecting ULL networks from infrastructure and application perspectives
  • network performance analysis via WireShark and Corvil (hands-on tech expertise via remote access to a simulated trading app)
  • Machine Learning (ML), AI, and Neural Networks, including LSTM (Long Short-Term Memory) Recurrent Neural Networks via Python / TensorFlow, plus Decision Trees, Random Forests, Anomaly Detection Engines, Reinforcement Learning Engines, Pattern Recognition, and Classification Models with RStudio. ML models include alpha seeking, smart order routing, and fill-rate prediction. Even if you have little or no experience developing ML, you will learn enough R and Python to develop your own models by course completion
  • Intro to Blockchain, exploring how to scale Blockchain for financial apps

MODULE-1

MODULE-1: Hardware/Application Accelerated Architectures

  • Tick-to-Trade applications with single-digit microsecond, even sub-1-microsecond, latencies
  • How to architect for deterministic latencies even in times of volume spikes
  • Why ‘Meta-Speed’ (information on how to use speed) is more important than pure speed
  • Proper use of multi-layer ULL switches, FPGA’s, GPU’s, MicroWave wireless RF network technologies & over-clocked servers
  • Options available with FPGA’s integrated in multi-layer ULL switches (ex: Market Data normalization & Book Builds)
  • Assess advantages among the leading FPGA vendors Intel/Altera and Xilinx
    • Examine each for both trading & analytics
    • Assess each vendor’s capabilities for FPGA applications based on OpenCL, C++ using FPGA libraries
    • Learn best practices in FPGA architectures for market data, order routing, Machine Learning & AI
    • Learn engine-to/from-memory optimizations on FPGA’s
  • NVIDIA DGX-2 GPU processors role for precision analytics
  • Compare FPGA’s vs GPU’s for ML Deep Learning
  • Explore relevancy to ULL ET and ML of new “Processor in Memory” or PIM architectures from Intel and NVIDIA (speed up data ingestion to CPU & GPU processing cycles)
  • Alternate role Data Direct Networks (DDN) for above data/processing speedups for HPC, HFT, AI
  • Market Data Feed Handlers in FPGA; Order Books in Intel Cores or FPGA’s – achieve 20 x’s parallelization for full depth books?
  • Integration of FPGA’s and Intel cores via high speed caches, FPGA’s and cores on same die (Intel-Altera and Xilinx — current and upcoming enhancements)
  • FIX engines in FPGA based NIC’s and appliances
  • Multi core, high speed cache Intel based servers + Intel’s new MESH socket interconnects for ULL and deterministic memory I/O
  • Leading FPGA based NIC(s) – from SolarFlare, Mellanox, ExaBlaze, Enyx
  • SolarFlare Direct TCP
  • Layer 1 and multi layer network switches (Arista/Metamako, ExaBlaze)
  • Fundamentals of FPGA design and programming
  • OpenCL and C++ for ULL programming best practices & FPGA programming
  • Intel’s optimizing C++ with deep vectors AVX-512, Thread Building blocks (TBB), and Intel’s new AVX 512 VNNI for Neural Network speedups
  • Intel C++ best-practice design pattern: vectorize code inside an inner loop or iteration, and parallelize the outer code via pragmas and explicit threading
  • How to optimize app code performance with hardware server config (NUMA)
  • Prospects for Application Specific Integrated Circuits (ASICs) supplanting FPGA’s in 1-3 years for most latency sensitive applications
  • Best Practices for Market Makers & High Freq Traders (HFT)

    • Automated ULL software
    • Wide range of markets
    • Role of ML & AI
    • Direct Market Access (DMA) architectures
    • Risk Mgt
    • Colo Configurations
    • Resiliency & High Availability, DR
    • High Performance Compute Clusters
    • News Sentiment Analytics
  • Python development of basic Algo strategies & software design/analysis for back-testing Algo’s (a minimal back-testing sketch in Python appears after this list)

  • Intel’s new “HPAT” Python compiler with directives to parallelize Python code

  • Hot right now – Chronicle: a Java based microservices framework touting superior mem mgt + horizontal scalability for FIX Engines and more

  • Intro to Blockchain – can common interests lead to technology that benefits all: cut costs, speed up settlements (low-latency settlements), enhance TCA?

    • What electronic trading applications can integrate with BlockChain?
    • How to architect such applications
  • ROI analysis
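
The Python back-testing item above, in brief: the sketch below back-tests a simple moving-average crossover on a synthetic random-walk price series. It is a minimal illustration under stated assumptions; the strategy, parameters, and data are illustrative, not course material.

# Minimal sketch (illustrative data and parameters): back-test a simple
# moving-average crossover strategy of the kind developed in this module.
import random

def sma(prices, window):
    # Simple moving average; None until enough history exists.
    return [None if i + 1 < window else sum(prices[i + 1 - window:i + 1]) / window
            for i in range(len(prices))]

def backtest(prices, fast=5, slow=20):
    # Go long when the fast SMA is above the slow SMA, flat otherwise.
    fast_ma, slow_ma = sma(prices, fast), sma(prices, slow)
    position, pnl = 0, 0.0
    for i in range(1, len(prices)):
        if fast_ma[i] is None or slow_ma[i] is None:
            continue
        pnl += position * (prices[i] - prices[i - 1])   # mark-to-market P&L
        position = 1 if fast_ma[i] > slow_ma[i] else 0  # crossover rule
    return pnl

if __name__ == "__main__":
    random.seed(42)
    prices = [100.0]
    for _ in range(500):                                # synthetic random walk
        prices.append(prices[-1] + random.gauss(0, 0.5))
    print(f"Strategy P&L on synthetic data: {backtest(prices):.2f}")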

HOMEWORK:

Ted will present 3 Visio ULL architectures of end-to-end trading systems and ask the class to critique all infrastructure components individually and in the aggregate, along with first steps to start an ROI analysis. In addition, students will be required to enhance the architecture they choose as the “best”.

MODULE-2

MODULE-2: Linux kernel tuning + NIC kernel bypass technologies and configurations

  • Detail benefits of Red Hat Linux Network-Latency profile (ex favors performance over power savings)
  • Linux 7.3/7.4/7.5/7.6 and Linux 8 kernel and NIC tuning for kernel bypass
  • Identify niche kernel tuning for extremely high message processing
  • Kernel bypass technologies including RDMA and LDMA
  • Infiniband (IB) and RDMA over Ethernet (RoCE) protocols for ULL
  • Identify additional tuning for ultimate ULL kernel and micro services frameworks
  • How to validate the tuned OS via load tests and commands such as sysctl -a (see the verification sketch after this list)
  • SolarFlare (SF) latency benefits of Open OnLoad kernel bypass and how to further configure/ tune subsequent to analysis via sfnettest, sfjitter, SF Dump, jhickup, and performance load tests
  • SF ef_vi, TCP Direct and their role in “raw” Tick-2-Trade (T2T) times of under 100 ns via OnLoad + LDA-Tech LightFleet FPGA appliance + Penguin servers (STAC T0 benchmark) and other potential options
  • Mellanox VMA kernel bypass, 40Gig E NICs, up to 100 Gig switches, how to integrate with IB and Exegy Market Data appliance
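
As a taste of the validation step above, the sketch below reads a few latency-related kernel settings straight from /proc/sys (the same values sysctl -a reports) and flags mismatches. The parameter list and "expected" values are illustrative assumptions, not a recommended configuration.

# Minimal sketch: read a few latency-related kernel settings from /proc/sys
# and flag mismatches. EXPECTED values are illustrative assumptions only.
from pathlib import Path

EXPECTED = {
    "net/ipv4/tcp_low_latency": "1",
    "net/core/busy_poll": "50",
    "net/core/busy_read": "50",
    "kernel/numa_balancing": "0",
}

def read_sysctl(key):
    # Read one sysctl value from /proc/sys (equivalent to `sysctl <key>`).
    path = Path("/proc/sys") / key
    return path.read_text().strip() if path.exists() else None

if __name__ == "__main__":
    for key, want in EXPECTED.items():
        got = read_sysctl(key)
        status = "OK" if got == want else "CHECK"
        print(f"{status:5} {key.replace('/', '.')} = {got} (expected {want})")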

HOMEWORK:

Ted will pose several technical questions pertaining to kernel tuning and bypass for class to answer.

MODULE-3

MODULE-3: Machine Learning (ML) & Artificial Intelligence (AI) for ULL Electronic Trading, Wealth Mgt, and Blockchain applications

  • Math behind multiple ML models, with deep dive into Neural Networks (NN), especially LSTM Recurrent NN, Auto Encoding NN for Anomaly Detection Engines
    • Will learn the subset of Python & TensorFlow code to create an RNN that predicts future stock prices, fill rates, and even trading revenue (a minimal LSTM sketch appears after this list); then integrate a Decision Tree model in R using the same input data to identify which factors may be “tuned” to render more accurate projections; we will also learn how to use the ML insights to identify options to improve fill rates and increase revenues
    • Explore how Reinforcement Learning ML can integrate with near real-time NN for competitive advantage via ULL insights into alpha, risk, routes (SOR), TCA, and compliance
  • ML / NN for seeking alpha via basic R programming + specific ML libraries
  • Supervised vs unsupervised ML
  • Synergies with Data Mining
  • Optimal Architectures for ML: Infrastructure, Software
  • Role of SME in ML & AI
  • Determining what model to choose
  • How to interpret results
  • How to verify models
  • Tensor Flow for parallelization of ML models
  • How to tune, tweak models for greater accuracy and predictive value
  • ML and Event Stream processing, real time analytics for seeking alpha (trade opportunities)
  • Definition of Deep Learning (DL)
  • DL Models and use cases
  • Define AI; provide use cases
  • ML and DL as inputs to AI
  • Time-2-Market & ROI projections for ML / AI initiatives end-2-end
  • Best Practices in AI in our industry
  • Options to integrate ML/AI alpha seeking capabilities in CoLo environments

    • How to decrease Total Cost of Ownership (TCO) in CoLo architectures
  • In Class (Hands-On)

    • RStudio & H20
    • Portfolio analysis via Classification Model using R/H20
    • Predictive analysis of new trading strategies via Decision Trees (R or Python)
    • Pattern Recognition of Trading Patterns to provide an Alpha service for the Buy-Side
  • Blockchain – more depth than Module 1

  • Blockchain scaling limitations

  • Assess CFTC BlockChain plans for near real time clearing (then extend the BlockChain for low volume trading, ultimately to higher volume trading of commodities, derivatives, options)

  • How to integrate real time ML and AI with BlockChain architectures

  • Learn a rapidly spreading Blockchain-related protocol, “Smart Contracts,” that may largely solve Blockchain scaling limitations

  • Assess Ant Financial Wealth Mgt BlockChain plans (examine its use of Smart Contracts)
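
A minimal sketch of the LSTM exercise referenced above, assuming TensorFlow 2.x / Keras is installed: train a small LSTM on a synthetic price series and predict the next step. The window length, layer sizes, and data are illustrative assumptions; course work would substitute real market data.

# Minimal sketch (assumes TensorFlow 2.x / Keras): LSTM next-step prediction
# on a synthetic price series.
import numpy as np
import tensorflow as tf

def make_windows(series, lookback=20):
    # Turn a 1-D series into (samples, lookback, 1) windows plus next-step targets.
    X, y = [], []
    for i in range(len(series) - lookback):
        X.append(series[i:i + lookback])
        y.append(series[i + lookback])
    return np.array(X)[..., np.newaxis], np.array(y)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    prices = 100 + np.cumsum(rng.normal(0, 0.5, 1000))    # synthetic random walk
    scaled = (prices - prices.mean()) / prices.std()      # simple normalization
    X, y = make_windows(scaled)

    model = tf.keras.Sequential([
        tf.keras.layers.LSTM(32, input_shape=(X.shape[1], 1)),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")
    model.fit(X[:-50], y[:-50], epochs=5, batch_size=32, verbose=0)

    pred = model.predict(X[-1:], verbose=0)[0, 0]
    print(f"Predicted next (scaled) price: {pred:.3f}, actual: {y[-1]:.3f}")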

HOMEWORK:

  • Ted will provide several “how-to” docs with all code / procedures to run to create multiple ML models. Students will run at least 1 of these models and respond with what predictive insights they provide. ** NO Coding will be required. But there will be one simple ML model where one will have option to alter or add code to improve accuracy or provide more meaningful insights (extra credit)

MODULE-4

MODULE-4: Architecting ULL Networks + how to diagnose/resolve network problems via Corvil and Wireshark

  • ULL network configuration best practices
  • Spine-Leaf architectures
  • Multicast best practices
  • Examine ULL Multicast architectures available from lightfleet.com; determine applicability for ULL networks and how to project resulting performance (latency) improvements
  • Network protocols including TCP, UDP, BGP, OSPF, LLDP
  • Remote access to a Corvil appliance for deep dive in network and transaction diagnostics
  • How to utilize Corvil decoders for FIX protocol, LBM messaging, and market data feeds
  • Wireshark – to supplement Corvil analytics with deep-dive network diagnostics to identify the RCA of latencies (a minimal latency-percentile sketch appears after this list)
  • Best practices in architecting Corvil’s new App Agent software for software processing insights
  • ULL messaging middleware (29 West LBM/UME) and 60 East Tech AMPS
  • PTP architectures for large market data / trading application infrastructures
  • Network appliances – detailed timings/analytics – network, market data, and order routing – Corvil, Instrumentix, SolarCapture
  • ULL Networks, including options for integrating multi-layer switches, FPGA appliances, new approaches to ULL multi-cast market data distribution
  • ULL storage networks, including NVMeOF fabrics, Intel Optane, 3D XPoint, EverSpin new MRAM deterministic memory + persistent storage options. Special focus on DDN and Pure Storage
  • How ULL deterministic memory can lower end-2-end latencies for subset of application flows, especially those based on ULL analytics
  • Correlation of ULL networks and fill rates
  • Tools (some free, several with RH Linux) to attain network performance optimization insights
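
A minimal sketch of the latency RCA referenced above: given paired ingress/egress timestamps (the kind of data a Corvil stream or a Wireshark capture can export), compute one-way latency percentiles to expose the tail. The timestamps here are synthetic and the 1% "microburst" penalty is an illustrative assumption.

# Minimal sketch: one-way latency percentiles from paired ingress/egress
# timestamps. All data below is synthetic.
import random

def percentile(sorted_vals, p):
    # Nearest-rank percentile on an already-sorted list.
    idx = min(len(sorted_vals) - 1, int(round(p / 100.0 * (len(sorted_vals) - 1))))
    return sorted_vals[idx]

if __name__ == "__main__":
    random.seed(1)
    samples, t = [], 0.0
    for _ in range(10_000):
        t += random.expovariate(1 / 100.0)             # ~100 µs between messages
        base = random.gauss(8.0, 1.0)                  # ~8 µs wire-to-wire
        spike = random.choice([0.0] * 99 + [50.0])     # illustrative 1% microburst penalty
        samples.append((t, t + base + spike))          # (ingress_ts, egress_ts) in µs

    latencies = sorted(egress - ingress for ingress, egress in samples)
    for p in (50, 90, 99, 99.9):
        print(f"p{p}: {percentile(latencies, p):.1f} µs")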

HOMEWORK

  • Specific hands on Corvil exercises for class to evaluate results and propose mitigation of network problems and strategic redesigns

FINAL ASSIGNMENT: Architect an End-to-End ULL Electronic Trading App integrated with ULL ML/AI for Alpha, Risk, Routing (SOR), TCA, Compliance

  • Ted will detail requirements by Oct 23, with 2 weeks to complete

PreReq – (for most, expecting basic to intermediate expertise, unless noted)

  • Most important: at least 2 years working with electronic trading applications/infrastructures as Developer, SA, network admin/engineer, Architect, QA analyst, tech project mgr, operations engineer, manager, CTO, CIO, CEO, vendor or consultant providing technology to Wall Street IT,
  • TCP/IP, UDP, multicast (basic knowledge),
  • Linux OS and shell or scripting (ex bash, perl); at minimum basic familiarity with the output and usefulness of core Linux commands such as sysctl -a, ethtool, ifconfig, top, ls, grep, awk, sed, and others listed later in this syllabus
  • Intel servers, cores, sockets, GHz clock speed, NUMA
  • Network routers, switches
  • 1 or more network protocols from BGP, OSPF, EIGRP, MPLS, IB
  • FIX protocol
  • Market Data, at minimum contents of equities consolidated feeds
  • Visio (will use for homework assignments; HOWEVER – to save time I will accept ‘pictures’ of white board architectures / designs)
  • R programming (nice to have. Will use basics that one can learn in 1-2 hours), then extend upon that in classes for class hands-on Machine Learning
  • Python (very basic will be fine – a 2 hour reading assignment will be arranged for beginners). We will use a text written for traders with zero programming experience that quickly trains them how to use small set of Python for creating trading algo’s

Course Logistics

Related Lectures

INTRO TO ULTRA-LOW-LATENCY (ULL) ARCHITECTURES FOR ELECTRONIC TRADING

Expect a 30-45 minute free session, sponsored by a tech vendor, that I will lead by end of year in NYC. This will be an intro to (1) a 2-day course I am negotiating with Queens University Belfast for January 2017 and (2) a 9-week NYU course, Mondays 6-9 pm, Summer 2017. Both courses will cover much more than Linux. Belfast course details follow this intro.

WE BEGIN … with an optimized network for a ULL CoLo trading app

Chart (BELOW): optimize network latency for a CoLo Trading App

Few minutes on this design, then concentrate on LINUX, Servers, Application Design


Run with Linux 7.2 (for now) but with the network-latency profile set

  • This out-of-box kernel tuning favors speed over power savings, enables a 2-way TCP session handshake instead of 3, and decreases kernel interrupts by disabling automatic NUMA balancing and transparent huge-page management.
  • Linux 7.4 – “internal socket” cache to priority arrays and memory objects
  • Intel’s Vtune and PCM to identify how to optimize cache line usage to decrease memory I/O’s

Cache Access (Intel’s latest processors – SkyLake)

  • L1 5 ns
  • L2 7 ns
  • Main mem 100 ns

MCDRAM

  • New high bandwidth “multi channel” caches for Intel PHI Knights Landing processors
  • Options to dedicate this cache per high priority threads

NEVER configure 1 NIC for both market data and order flow.

  • A spike in 1 may significantly impact latencies in the other

Use NICs with built in Kernel Bypass (else get out of Electronic Trading)

  • No excuse not to use kernel bypass. If you do not heed this recommendation, you will lose fills and soon be out of business.
  • To further speed up out-of-the-box kernel bypass (SolarFlare & Mellanox) from approx 1 – 1.5 µs per I/O, use the ef_vi API for SolarFlare or VERBS for Mellanox to lower latencies by approx another 200 nanoseconds.
  • For really ULL, utilize FPGA based NICs like those from FiberBlaze, ExaBlaze, or Enyx or SolarFlare’s version.
  • Imagine no kernel bypass and CPU core running at near 100% (largely user%). Many trading partners will take note and may NEVER trade with you again.

Record real time Critical performance metrics

  • Record Latencies with your trading partners under all scenarios, especially during market data spikes.
  • Have a list of trading partners with ‘deterministic’ latencies

Why deterministic latencies are CRITICAL

What is ‘Speed2’, as Tabb and Corvil jointly coined it, and why all electronic trading firms, and especially market makers, must be aware of ‘Speed2’

Set up accurate time stamps – PTP or NTP

  • Important for MIFID II and upcoming US regulations
  • For best PTP accuracy, within 10 ns, utilize hardware based time stamps;
  • multiple network interface vendors are offering that including SolarFlare, Mellanox, ExaBlaze.
  • They can handle inputs from GPS sources, can function as “boundary” clocks, and can function as client or server clocks. Hence they can be one source for all precise timestamps for the three main levels of timestamps within a trading network. (A minimal sketch of the offset and path-delay arithmetic a PTP slave performs follows the SolarFlare note below.)

SolarFlare solution may be most impressive

  • The PTP daemon “sfptpd” avoids any kernel processing for time stamps, in contrast to the default Linux PTP daemon, which requires kernel processing and kernel interrupts
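
A minimal sketch of the arithmetic a PTP slave performs on each Sync / Delay_Req exchange, using the standard four timestamps; the nanosecond values in the example are illustrative assumptions.

# Minimal sketch of PTP slave arithmetic; timestamps below are illustrative.
def ptp_offset_and_delay(t1, t2, t3, t4):
    # t1: master sends Sync, t2: slave receives Sync,
    # t3: slave sends Delay_Req, t4: master receives Delay_Req.
    offset = ((t2 - t1) - (t4 - t3)) / 2.0           # how far the slave clock is off
    mean_path_delay = ((t2 - t1) + (t4 - t3)) / 2.0  # one-way network delay estimate
    return offset, mean_path_delay

if __name__ == "__main__":
    # Hypothetical case: slave runs ~150 ns ahead, ~400 ns wire delay each way.
    t1, t2 = 1_000_000, 1_000_550    # Sync: 400 ns wire + 150 ns slave offset
    t3, t4 = 1_002_150, 1_002_400    # Delay_Req: -150 ns offset + 400 ns wire
    off, delay = ptp_offset_and_delay(t1, t2, t3, t4)
    print(f"offset = {off:.0f} ns, mean path delay = {delay:.0f} ns")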

NTP – sync software is inferior to PTP

  • With NTP, switch queues and latencies are more critical, and the number of levels away from the central GPS source should be kept to a minimum.

TIME PERMITTING WILL COVER PART OR ALL OF THIS

Deep Dive into RH 7.2 LINUX TUNING – perf over power saving

  • Tcp_fastopen=3 (2-way handshake: an encrypted cookie is issued to the client at initial connect, so reconnects are 2-way, using the cookie)
  • Enable intel_pstate & min_perf_pct=100 (steady GHz; disable frequency fluctuations)
  • Disable THP (Transparent Huge Pages of 2 MB under K control)
  • Cpu_dma_latency
    • @ c_states, keep cores from sleeping; part of QoS
  • Busy_read 50 uSec (100 uSec for large# pkts) & busy_poll 50 uSec (skt poll recvQ of NIC, disable net interrupt); cores “active”
    • BUT — K bypass much better (discuss 3 methods of K bypass)
  • Numa_balance 0 (no auto NUMA mgt)
  • Disable unnecessary daemons and services (ex firewalld & iptables)
  • Max # ring buffer size
    • Dev driver drains buf via soft IRQ (other tasks not interr vs hard interr)
  • Set RFS (Recv Flow Steering)- increase CPU cache hits,forwards pkts to consuming app
  • TCP SACK – retransmits only missed bytes – tcp_sack=1
  • TCP Window scaling – up to 1 GB
  • Sysctl –w net.ipv4.tcp_low_latency=1
  • Timing and scheduling:– decrease timing dispatch interrupts
    • Sched_latency_ns (20 ms default; increase!!)
    • Sched_min_granularity (4 ms default; increase!)
    • Increase # procs, threads – formula may lower this 4 ms
    • Some applications may benefit from a tickless kernel
    • (ex: small # procs, threads at no more than # cores)
    • Sched_migration_ns (default 500 µs; increase!)
    • This pertains to the period of “hot” cache, prevents premature task migration
    • Basic Linux and Server measures and utilities for performance analytics:
    • BIOS updates and tuning
    • Turbostat
    • Lstopo – cores and memory caches
    • Lscpu – cpu arch info from sysfs & /proc/cpuinfo
    • Numactl
    • Numastat
    • Tuned
    • Tuned-admin network-latency configuration (set profile)
    • Isolcpus
    • Interrupt affinity or isolation
    • Irqbalance
    • Busy_poll — any value other than 0 (Red Hat recommends 50), the number of microseconds to wait for packets on the device queue for socket polls and selects
    • Check gamut of process (pid) info, much pertaining to performance in /proc/; for ex: files numa_maps, stat, syscall
    • Tuna – control processor and scheduler affinity (a minimal affinity-pinning sketch appears after this list)
      • Options: Isolate sockets from user space, push to socket 0
    • VTune Amplifier 2016
      • CPU, GPU, threads, bandwidth, cache, locks, spin time, function calls, serial + parallel time
      • Identify code sections for parallelization; ex: TBB – more control than OpenMP
      • MPI analysis ex locks, MCDRAM analysis
    • Intel’s PCM (Performance Counter Monitor) – major enhancements
      • Ex: times specific threads hit/miss L1-2-3 caches and measures cache times and impacts of misses; helps ID priority procs, threads for cache
    • Tx Profilers – Wily, VisualJVM, BeaWLS, valgrind, custom FREE– T/S, ESP correl, ML
    • Perf: perf top –g (functions)
      • Perf counters in hardware (cpu) , with kernel trace points (ex: cache miss, cpu-migration, softirq’s)
    • strace
    • Ftrace- uses the frysk engine to trace systemcalls
      • sycalls of procs and threads
      • Dynamic kernel fx trace, including latencies (ex: how long a proc takes to wake/start)
      • /debug/tracing
      • Trace_clock
    • Dtrace for Linux:
      • Dynamic – cpu, fs, net resources by active procs, can be quite specific
      • Procs accessing specific files
      • what procs with most packets or bandwidth%
    • SystemTap

Oprofile uses hw counters, tracks mem access and L2 cache, hw interrupts

  • Mpstat, vmstat, iostat, nicstat, free, top, netstat, ss [filter/script for analytics]
  • VM (Virtual Memory) and page flushes, optimize market data caches
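
A minimal, Linux-only sketch of the affinity pinning mentioned above (the manual equivalent of what tuna or taskset do): pin the current process to one core and confirm the mask. Logical core 2 is an illustrative choice.

# Minimal sketch (Linux only): pin the current process to one core.
import os
import sys

if __name__ == "__main__":
    if not hasattr(os, "sched_setaffinity"):
        sys.exit("sched_setaffinity is Linux-only")
    print("affinity before:", sorted(os.sched_getaffinity(0)))
    try:
        os.sched_setaffinity(0, {2})                  # pin this PID to logical core 2
    except OSError as exc:                            # e.g. core 2 not present
        print("could not set affinity:", exc)
    print("affinity after: ", sorted(os.sched_getaffinity(0)))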

Application Design:

Programming with Multiple core multi thread, parallelism

  • Vectorize application code
  • Design – Internal loops with deep vector instructions, outer loops with parallelization (threads)

INTRO ENDS RIGHT HERE

———————————————————————–

  • Explain how the following Linux tuning options will impact latencies
    • Swappiness =0;
    • Dirty-ratio 10;
    • Background- ratio 10
    • NIC interrupt coalescing (pre kernel-bypass)
    • Ring buffer increase
    • UDP receive buffer at 32 MB
    • Netdev_max_backlog 1000000 (traffic stored before TCP/IP processing; one queue per core)
  • Explain what following commands produce for latency analysis:
    • ifconfig command
    • Netstat –s (send/recv Q’s)
    • Ss utility
  • Detail major benefits of VTune and DTrace and when you would use either

Linux 7.4

  • Dedicate “internal socket” cache to priority arrays and memory objects
  • Utilize Intel’s Vtune and PCM to identify how to optimize cache line usage to decrease memory I/O’s

Cache Access (Intel’s latest processors – SkyLake)

  • L1 5 ns
  • L2 7 ns
  • Main mem 100 ns
  • Role of TLB’s

MCDRAM

  • New high bandwidth “multi channel” caches for Intel PHI Knights Landing processors
  • Options to dedicate this cache per high priority threads

Application Design:

  • Programming with Multiple core multi thread, parallelism
  • Vectorize application code
  • Design – Internal loops with deep vector instructions, outer loops with parallelization (threads)
  • Servers, sockets, cores, caches. MCDRAM (Intel Phi)
  • Core speeds GHz vs more cores, larger and faster caches
  • Over clocked servers – features and what applications can benefit
  • Linux, Solaris, Windows, other ex SmartOS, Mesosphere DC OS
  • How to benchmark performance, analyze, tune
  • NUMA aware processes and threads
  • Optimize cache assignments per high priority threads
  • Intel technologies including …
  • AVX-512 deep vector instructions (speeds up FP ops)
    • 6-8 registers; more ops/instruction; less power
  • TBB thread Building blocks (limit oversubscription of threads)
    • OpenMP- explosion of threads
  • Omni-Path high speed / bandwidth interconnect (no HBA, fabric QoS, MTU to 10K, OFA verbs,105 ns thru switch ports, 50 GB/s bi ) & QPI
    • Uses Silicon Photonics (constant light beam, lower latencies and deterministic)
  • QuickPath: mult pairs serial links 25.6 GB/s (prior to Omni-Path)
    • Mem controllers integrated with microprocessors
    • Replaced legacy bus technology
    • Cache coherent
  • Shared Memory is faster than memory maps; it allows multiple procs to read/write shared memory among themselves – without OS read/write calls. Procs just access the part of shared memory of interest (a minimal shared-memory sketch follows this list).
    • Discuss ex of server proc sending HTML file to client; file is passed to mem then net function copies mem to OS mem; client calls OS function which copies to its own mem; contrast with Shared Mem
  • PCIE
  • NVME
  • Flash SSD Drives
  • C ++ vs Java for ULL
  • Lists vs vectors
  • Iterate lists
  • Role of FPGA, GPU, MicroWave networks for ULL
  • C/C++, Java, Python, CUDA, FPGA – OpenCL: programming design considerations
  • Java 8 new streams API and lambda expressions – for analytics
  • Class Ex – Explain how Quick Path & Omni Path both improve latencies and advise which is preferred for ULL and why
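
A minimal sketch (Python 3.8+) of the shared-memory point above: a producer writes into a shared segment and a second process reads the same bytes directly, with no OS read/write copies in between. Names and sizes are illustrative.

# Minimal sketch (Python 3.8+): two processes share one memory buffer.
from multiprocessing import Process, shared_memory

def reader(name):
    shm = shared_memory.SharedMemory(name=name)       # attach to the existing segment
    print("reader sees:", bytes(shm.buf[:5]))
    shm.close()

if __name__ == "__main__":
    shm = shared_memory.SharedMemory(create=True, size=64)
    shm.buf[:5] = b"hello"                            # producer writes in place
    p = Process(target=reader, args=(shm.name,))
    p.start()
    p.join()
    shm.close()
    shm.unlink()                                      # release the segment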

End of Intro Session


Queens University 2-Day Course Proposal – Ted Hruzd, January 2017

ULL (Ultra Low Latency) Architectures for Electronic Trading

Course Objectives

Develop advanced skills in architecting electronic trading and market data applications for ultra low latency (ULL), for competitive advantage, and for positive ROI. By the end of the course one will have developed expertise in end-to-end architecture of electronic trading infrastructure, including architecting of:

  • Multi core, high speed cache Intel based servers
  • Linux 7.2 kernel and NIC tuning + quick intro to upcoming Linux 7.4
  • Kernel bypass technologies including RDMA and LDMA
  • FPGA based NIC(s) – from ExaBlaze
  • Single tier (or simplified spine-leaf); Ex: Plexxi
  • Layer 1 network switches (ExaBlaze & Metamako)
  • SDN (Software Defined Networks)
  • New binary FIX protocol for ULL order routing
  • ULL messaging middleware (29 West LBM/UME) and 60 East Tech AMPS
  • ULL software design (deep vectors ex AVX-512 and multi threading – OpenMP, TBB)
  • Databases – structured and unstructured
  • Storage, including NVME Flash
  • Tools (some free) to attain performance optimization insights
  • Network appliances – detailed timings/analytics – network, market data, and order routing
  • Big Data and Event Stream processing, real time analytics for seeking alpha (trade opportunities)
  • Fundamentals of FPGA design and programming
  • ROI analysis

PreReq – Intermediate – advanced knowledge of

  • TCP/IP, UDP, multicast,
  • Linux OS and shell or scripting (ex bash, perl),
  • Intel servers
  • Network routers, switches
  • 1 or more network protocols from BGP, OSPF, EIGRP, MPLS, IB
  • FIX protocol
  • Market Data, including basic multicast knowledge
  • Visio
  • Python (very basic will be fine – a 2 hour reading assignment will be arranged for beginners)
  • R programming (nice to have. Will use basics that one can learn in 1-2 hours),
  • at least 2 years working with electronic trading applications/infrastructures as Developer, SA, network admin/engineer, Architect, QA analyst, tech project mgr, operations engineer, manager, CTO, CIO, CEO, vendor or consultant providing technology to Wall Street IT

Course Logistics

  • 2 full days (8-9 hours each day)
  • Tech book(s) to download to kindle TBD
    • Architects of Electronic Trading, Stephanie Hammer, Wiley 2013
    • Ultimate Algorithmic Trading Systems ToolBox, George Pruitt, Wiley, 2016
    • (optional) Trading and Electronic Markets: What Investment Professionals Need to Know, Larry Harris, CFA, 2015
  • Multiple web site links to technical white papers and tech analyses (ex nextplatform.com, www.intelligenttradingtechnology.com , and www.tradersmagazine.com )
  • Visio
  • Extensive use of white board by instructor and students. Sessions will present students with a few infrastructures to architect per specific business success criteria

Day-1

ULL components: servers, OS, networks, software & middleware, FPGA’s, market data

  • Will present a Visio diagram with a potential co-lo ULL architecture that receives orders destined for trading venues, utilizing Layer 1 switching. Goal: 500 nanosecond Ack times from ingress to the appliance that includes Layer 1 switching to/from trading venues.
  • Will present an alternative architecture utilizing a Single Tier network (Plexxi)
  • We will periodically revisit this co-lo architecture throughout this course when we cover specific architecture components in depth (Algo Trading and/or SOR that feeds this architecture, use of FPGA, multi-cores, and Layer 1 switching)
  • Why speed of processing still matters and will for next several years at least
  • Why Layer 1 switches
  • Layer 1 switch with integrated cores and FPGA for risk checks
  • High speed real time analytics for seeking alpha (trade opportunities) & infrastructure analytics
  • Exchange (Trading Venue) connectivity
  • Layer 2/3 aggregation in new switch appliances
  • Role of Linux kernel tuning for ULL – use network-latency profile
  • Present some Linux configurations to critique
  • Class Exercise – Given few server, Linux configurations with flaws, respond with measures to optimize performance & lower latencies

  • Deep dive into Linux 7.2 network-latency configuration

    • Base config includes (perf over power saving):
    • Tcp_fastopen=3 (2 way handshake – encryption of cookie of client @ init, so reconnect is 2 way, using the cookie)
    • Enable Intel_pstat & min_perf_pct =100 (Ghz steady; disable fluctuations)
    • Disable THP (Transparent Huge Pages of 2 MB under K control)
    • Cpu_dma_latency
      • @ c_states, keep cores from sleeping; part of QoS
    • Busy_read 50 uSec (100 uSec for large# pkts) & busy_poll 50 uSec (skt poll recvQ of NIC, disable net interrupt); cores “active”
      • BUT — K bypass much better (discuss 3 methods of K bypass)
    • Numa_balance 0 (no auto NUMA mgt)
    • Disable unnecessary daemons and services (ex firewalld & iptables)
    • Max # ring buffer size
    • Dev driver drains buf via soft IRQ (other tasks not interr vs hard interr)
    • Set RFS (Recv Flow Steering)- increase CPU cache hits,forwards pkts to consuming app
    • TCP SACK – retransmits only missed bytes – tcp_sack=1
    • TCP Window scaling – up to 1 GB
    • Sysctl –w net.ipv4.tcp_low_latency=1
    • Timing and scheduling:– decrease timing dispatch interrupts
    • Sched_latency_ns (20 ms default; increase!!)
    • Sched_min_granularity (4 ms default; increase!)
      • Increase # procs, threads – formula may lower this 4 ms
    • Some applications may benefit from a tickless kernel
      • (ex: small # procs, threads at no more than # cores)
    • Sched_migration_ns (default 500 µs; increase!)
      • This pertains to the period of “hot” cache, prevents premature task migration
    • Basic Linux and Server measures and utilities for performance analytics:
      • BIOS updates and tuning
      • Turbostat
      • Lstopo – cores and memory caches
      • Lscpu – cpu arch info from sysfs & /proc/cpuinfo
      • Numactl
      • Numastat
      • Tuned
      • Tuned-admin network-latency configuration (set profile)
      • Isolcpus
      • Interrupt affinity or isolation
      • Irqbalance
      • Busy_poll — any # other than 0 say 50 RH rec, ms to wait for pkts on devQ for sock poll & selects
      • Check gamut of process (pid) info, much pertaining to performance in /proc/; for ex: files numa_maps, stat, syscall
      • Tuna – control processor and scheduler affinity
      • Options: Isolate sockets from user space, push to socket 0
      • VTune Amplifier 2016
      • CPU, GPU, threads, bandwidth, cache, locks, spin time, function calls, serial + parallel time
      • Identify code sections for parallelization; ex: TBB – more control than OpenMP
      • MPI analysis ex locks, MCDRAM analysis
      • Intel’s PCM (Performance Counter Monitor) – major enhancements
      • Ex: times specific threads hit/miss L1-2-3 caches and measures cache times and impacts of misses; helps ID priority procs, threads for cache
      • Tx Profilers – Wily, VisualJVM, BeaWLS, valgrind, custom FREE– T/S, ESP correl, ML
      • Perf: perf top –g (functions)
      • Perf counters in hardware (cpu) , with kernel trace points (ex: cache miss, cpu-migration, softirq’s)
      • strace
      • Ftrace- uses the frysk engine to trace systemcalls
      • sycalls of procs and threads
      • Dynamic kernel fx trace, including latencies (ex: how long a proc takes to wake/start)
      • /debug/tracing
      • Trace_clock
      • Dtrace for Linux:
      • Dynamic – cpu, fs, net resources by active procs, can be quite specific
      • Log of args /fx
      • Procs accessing specific files
      • # New processes with arguments
      • dtrace -n ‘proc:::exec-success { trace(curpsinfo->pr_psargs); }’

      • # Pages paged in by process: dtrace -n ‘vminfo:::pgpgin { @pg[execname] = sum(arg0); }’
      • # Syscall count by process: dtrace -n ‘syscall:::entry { @num[pid,execname] = count(); }’ … specific syscall ct per process or thread
      • Also ‘canned’ scripts for processes with top tcp and udp traffic, ranking of processes by bandwidth
    • SystemTap – ex: probe tcp.setsockopt.return
      • Uses strace points for kernel and user probes
      • Script thief.stp – interrupts by procs histogram
      • Dynamically instruments running production Linux kernel-based operating systems. System administrators can use SystemTap to extract, filter and summarize data in order to enable diagnosis of complex performance or functional problems.
    • SysDig Tool – only syscalls, dump for post processing scripting

  • Oprofile uses hw counters, tracks mem access and L2 cache, hw interrupts
    • Mpstat, vmstat, iostat, nicstat, free, top, netstat, ss [filter/script for analytics]
  • VM (Virtual Memory) and page flushes, optimize market data caches
  • Slab allocation = memory management for kernel objects; eliminates fragmentation
  • Slow network connections and packet drops
  • Intro to NetPerf tool
  • NIC tuning
  • Kernel bypass, LDMA, RDMA
  • Kernel bypass with NIC vendors (SolarFlare, Mellanox, ExaBlaze) – description of how each works

    • SolarFlare OpenOnLoad sets up all socket calls in user space instead of kernel space, with dedicated socket connection & data handled in NIC memory
    • Mellanox VMA linked library to user space, also sets up user space calls to NIC; Connect-IB NIC allows non-contiguous memory transfers for app-app; RV offload – speeds up MC; MLNX OFED open fabric verbs for IB and Ethernet; PCIe switch & NVMe over Fabric; MPI offloads; 2 ports at 100 Gbps; IB & Ethernet connections < 600 ns latency
    • Enyx NICs: differ from SF and MX, whose network stacks run in user space (which can be CPU intensive)
    • Enyx places full TCP stack in hardware (FPGA); reduce jitter
    • Network appliances:
    • ExaBlaze Fusion
    • Metamako MetaApp
    • FixNetics ZeroLatency
    • Precision Timing – PTP and NTP
    • PTP Symmetricom Sync Server S300s – NTP & PTP,owned by Microsemi GM
    • GPS Satellite satisfies UTC Req.
    • MiFID II and PTP (software (sw) + hardware (hw) critical for accuracy); requirement: within 100 µs of UTC
    • Symmetricom PTP GM 6 ports +/- 4 ns, <25ns to UTC
    • GPS -> GM (Spectracom) -> B-Clock (Arista 7150S with FPGA timing + NAT) -> servers running PTP software with FPGA-based NICs (ex: Exablaze ExaNIC models) – or SolarFlare NICs with HW timestamps
      • Linuxptp – ptp4l & phc2sys (can act as B-Clk) sync the PTP hw clock on the client, including VLAN-tagged interfaces and bonded interfaces, to the master (GM), but via the kernel; daemons can’t consume MC; the kernel delivers the pkt to the bonded interface. SF’s sfptpd does all in HW and can sync every SF adapter; ptpd – multiple platforms but software only.
      • Timemaster – on start, reads NTP & PTP time servers, starts daemons, can sync sys clock to all time servers in multiple PTP domains
      • Master-slave time sync (ex:
      • PTP Timing within 6 ns –
      • consider disable tickless kernel : nohz=off (for accuracy) BUT test this and app impact
      • PTP in hardware best but costs; do ROI
      • If multiple interfaces in diff networks, set reverse FWD mode to loose mode
      • Cmd: ethtool -T <interface> – verify hardware timestamping support
      • “timemaster” reads config of PTP time source
      • Cmd: systemctl start timemaster
      • ExaNIC FPGA can be programmed for extra analytics; some base programs available
      • MC if sync msg from Master but UDP unicast delay msg from slave to Master
      • PTP assumptions:
      • Network path symmetry (hence switch, router, FW, OS impact this)
      • Master and slave accurately measure when at pt of send/receive
      • Every hop can reduce PTP accuracy
      • PTP options:
      • Each slave clock direct cables to master .. but complexity. Cost …
      • Dedicate PTP switch infrastructure; switch PTP aware & eliminate switch delay or act as PTP M B-Clk; do not mix traffic
      • In dedicated LAN, PTP thru switch L2 Bcast to PTP bridge (server as B-Clk & bonded interface mgr), sends MC to FW (–if no SF; FW has list MC groups, IGMPv3 config ), MC to PTP clients for Time Sync, best if clients have with SF
        • FW configured for IGMP3, has necessary config allowing PTP-Bridge & clients to join std PTP MC group 224.0.1.129
        • Sfptpd can work on bonded interfaces so PTP clients need specify mgt interface to get PTP TS (from PTP bridge)
      • Hardware time stamps at every point
      • More PTP details:
      • Slaves periodically send messages back to Master (sync)
      • Sfptpd → file or syslog; ptp4l → stdout
      • Offset: amt Slave Clk off from Master
      • Freq Adjustment: how much clock oscillator adjusts to run at same rate as Mstr
      • Path Delay: how long to Slave and vice versa
      • Metrics – collectd, applies RegEx
      • NTP – selects accurate Time servers from multiple (ex 3); polls 3 or more servers
      • Keep stratum levels to no more than 2
      • Keep 3 clock sources near for sync
      • Use switches with light or no queuing
      • Use “timekeeper” – transforms any server into a timing appliance
      • Class Exercise – Explain the different approaches to kernel bypass of the following: ExaBlaze, SolarFlare, Mellanox, Enyx. Explain the strengths and advantages of each; advise what specific electronic trading applications would best benefit from each.
      • Explain how the following Linux tuning options will impact latencies
        • Swappiness =0;
        • Dirty-ratio 10;
        • Background- ratio 10
        • NIC interrupt coalescing (pre kernel-bypass)
        • Ring buffer increase
        • UDP receive buffer at 32 MB
        • Netdev_max_backlog 1000000 (traffic stored before TCP/IP processing; one queue per core)
      • Explain what following commands produce for latency analysis:
        • ifconfig command
        • Netstat –s (send/recv Q’s)
        • Ss utility
      • Detail major benefits of VTune and DTrace and when you would use either
  • OPTIONAL — Quick intro Python

  • Programming with Multiple core multi thread, parallelism

  • Vectorize application code

  • Design – Internal loops with deep vector instructions, outer loops with parallelization (threads)

  • Servers, sockets, cores, caches. MCDRAM (Intel Phi)

  • Core speeds GHz vs more cores, larger and faster caches

  • Over clocked servers – features and what applications can benefit

  • Linux, Solaris, Windows, other ex SmartOS, Mesosphere DC OS

  • How to benchmark performance, analyze, tune

  • NUMA aware processes and threads

  • Optimize cache assignments per high priority threads

  • Intel technologies including …

  • AVX-512 deep vector instructions (speeds up FP ops)

    • 6-8 registers; more ops/instruction; less power
  • TBB thread Building blocks (limit oversubscription of threads)

    • OpenMP- explosion of threads
  • Omni-Path high speed / bandwidth interconnect (no HBA, fabric QoS, MTU to 10K, OFA verbs,105 ns thru switch ports, 50 GB/s bi ) & QPI

    • Uses Silicon Photonics (constant light beam, lower latencies and deterministic)
  • QuickPath: mult pairs serial links 25.6 GB/s (prior to Omni-Path)

    • Mem controllers integrated with microprocessors
    • Replaced legacy bus technology
    • Cache coherent
  • Shared Memory is faster than Mem Maps; allows multiple procs read/write into shared mem among the procs – without OS read/write commands. Procs just access the part of shared mem of interest.

    • Discuss ex of server proc sending HTML file to client; file is passed to mem then net function copies mem to OS mem; client calls OS function which copies to its own mem; contrast with Shared Mem
  • PCIE

  • NVME

  • Flash SSD Drives

  • C ++ vs Java for ULL

  • Lists vs vectors

  • Iterate lists

  • Role of FPGA, GPU, MicroWave networks for ULL

  • C/C++, Java, Python, CUDA, FPGA – OpenCL: programming design considerations

  • Java 8 new streams API and lambda expressions – for analytics

  • Class Ex – Explain how Quick Path & Omni Path both improve latencies and advise which is preferred for ULL and why

  • Intro to Wireshark

  • Intro to FIX Protocol

  • Intro to Wireshark with FIX protocol “Plug-in”

  • TCP, UDP, multicast (MC), then analysis via WireShark, Corvil

  • New age networks – Spine leaf to single tier

  • SDN (Software Defined Networks)

    • Cisco ACI + Tetration (ACI tells switches what to do instead of SDN SW doing it)
    • Cloudistics
    • Plexxi
    • NSX
  • Pico – ULL SDN vendor

  • Options-IT – colo managed ULL infrastructure

  • Cisco and Arista switches for ULL

  • Cisco ACI and Cisco Tetration – Deep machine Learning to automatically optimize large networks

  • Switches with deep buffers, great for Big Data Analytics

  • Configure Routers for ULL – LLDP, MLAG, VRRP, VARP (active-active L3 gateway)

    • LLDP = protocol by which LAN devices (Ethernet switches, routers, wireless LAN APs) advertise their configuration to other nodes; allows 2 systems running different network-layer protocols to learn about each other
  • Network protocols – BGP, OSPF, HSRP

  • Arista 7124FX with EOS

  • Plexxi switches – a disruptive technology – single tier

  • Plexxi optimal bandwidth via its SDN

  • Optimal VLANs configuration for analytics

    • Use trunks from 1 switch to another switch after defining a VLAN, or use router
  • VPLS (Virtual Private LAN Service) also for analytics

    • Ethernet-based multipoint-to-multipoint over IP or MPLS
  • Decrease network hops for speed

    • (ex: Slim-Fly: Low diameter network architecture if not ready for Single Tier)
  • Network protocols:

    • EBGP: external, distance vector, via paths, network policies, rule sets, finite state machine; BGP peering of AS-AS
    • BGP-MP: multi protocol + IPv6, unicast & MC; use for MPLS-VPN
    • OSPF: interior within AS, link state routing with metrics of RTT, amt data thru specific links, and link reliability
    • MOSPF: uses group membership info from IGMP + OSPF DB, builds MC trees
    • EIGRP: OSPF + more criteria: latencies, effective BW, delays, MTU
    • MPLS: between network nodes, short path labels – avoid complex lookups in the route table; multi-protocol – includes ATM, Frame Relay, DSL
    • IB: hw, lightweight, no pkt reordering; link-level flow control, lossless, QoS virtual lanes 0-14, RDMA verbs (adds latency), UFM tool, 4000-byte MTU (adds latency as this MTU must fill before transmission)
    • OPA: Intel’s Omni Path Architecture: 100 Gbps, new 48 ports switch silicon, silicon photonics, no HBA, 50% decrease infra vs IB, 100-110 ns / port, congestion control – reroutes traffic, MTU up to 10K
    • FC: FibreChannel – optical pkts, units of 4 10-bit codes, 4 codes = TransWord; meta-data sets up link and sequence; tools – Agilent, CATC, Finisar, Xyratex; FC frames are similar to Ethernet packets, ex: multiple frames assembled with src, dest
    • IGMP: used by hosts and adjacent routers on IPv4 networks to establish multicast group memberships
  • OPTIONAL

  • Next Gen Firewalls (ex: Fortinet)

    • 1 platform end-to-end, with multiple security-related aspects including anti-virus, malware, intrusion detection, database and OS access controls, web filtering, web app security, user ID awareness, standard rules access, and internal segmentation (into functional security zones, which limits the spread of malware & mischief, identifies mischief & quarantines infected devices); shares all info via its fabric with the whole network; Zero Trust Policy – places the FW at the network center, in front of data
    • Empow – Security Orchestration product
  • Kerberos

    • Client/server network authentication protocol via secret key cryptography, stronger than traditional firewalls as they focus on external threats whereas Kerberos focuses on internal
    • Each KDC has copy of Kerberos DB; Master KDC has copy of realm DB, which is replicated to slave KDC’s @ regular intervals; DB password changes are in the Master; slaves grant Kerberos ticket servers / services, time series critical & create ACLs too; Kerberos daemons are started on the Master, assigns hostnames to Kerberos realms, ports, slaves
    • Opportunity for Docker container security:
    • Kerberos for access to multiple levels of container types (ex: checking account KYC vs withdrawal .. acct mgr vs authenticated client)
    • IPTables – may opt to disable for ULL, rely on external FW
    • Set up, inspect tables of IP packet filter rules; each table: built-in “chains” & user defined chains; chains list rules to match set of packets
    • Required for server “routers” NAT aware
      • Rte intercepts, determines NAT@
    • END- OPTIONAL
    • Class Exercise – Determine whether Single Tier Networks improve ULL versus Spine Leaf. If so explain why. Several scenarios will be presented and students will architect networks on white boards.

FPGA’s & Market Data

  • Hardware accelerated appliances for ULL and deterministic performance

  • OPTIONAL

— Intro to FPGA’s including intro to FPGA design & programming (I/O blocks + Logic blocks, OpenCL for creating “kernels” + synchronization for parallelism )

  • FPGA cache may be a limiting factor

  • Why performance tends to be very deterministic with FPGA’s & why deterministic performance (latencies) is critical for HFT and algo traders

  • Pitfalls of FPGA’s

  • FPGA’s vs GPU’s, Intel Phi, and multi cores

  • Feeds in FPGA –architecture, performance, design, support

  • Switch crossbars or caches for fan out with TCP distribution

  • Multicast (MC) performance considerations

    • Turn on IGMP Snooping on the switch

    • Switch listens to IGMP conversations between hosts/routers; maps links that require MC streams; Routers periodically query; 1 member per MC group per subnet reports.

    • Clients issue IGMP join requests to MC groups (see the multicast-subscriber sketch after this list)

    • Routers solicit group member requests from direct connect hosts

    • PIM-SM (Sparse Mode …low % MC) requires a Rendezvous Point (RP) router

    • Routers in PIM domain provide mappings to RP (exchange info for other routers)

    • PIM domain: enable PIM on each router

    • Enable PIM sparse mode on each interface

    • After RP, forward to receivers down shared distribution tree

    • When receiver’s 1st hop router learns source, it sends join message directly to source

    • Protocol Independent Multicast (PIM) is used between the local and remote MC routers, to direct MC traffic from the MC server to many MC clients.

  • Message based appliances, including FPGA based

  • Direct feed normalization

  • Conflation to conserve bandwidth

  • NBBO

  • Levels 1 and 2 market data

  • Depth of book builds (in FPGA’s or new multi core servers)

  • Smart order routers

  • Exablaze NICs and switches v Metamako switches for market data

  • ENYX FPGA NICs and Appliances for market data and order flow

  • Nova Sparks FPGA based market data ticker

  • Fixnetics ZeroLatency – multi-threaded risk checks in FPGA and order processing in parallel on a core

  • Other products — Exegy, Algo logic, Redline, SR labs

  • Consolidated feed vendors Bloomberg and Thomson Reuters

  • Class Ex – (1) white board sessions where students will design ULL market data and multicast architectures, per specific business/application criteria. (2) Given a Visio of a large network but with only a few MC groups and subscribers, identify the likely path(s) to the few sources. Include the choice of router as RP.
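
A minimal sketch of the IGMP join mentioned above: a multicast market-data subscriber whose IP_ADD_MEMBERSHIP socket option triggers the IGMP membership report that switches snoop on. The group address and port are illustrative, not a real feed.

# Minimal sketch: join a multicast group and wait for one datagram.
import socket
import struct

GROUP, PORT = "239.1.1.1", 30001                      # illustrative group/port

def subscribe(group, port):
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind(("", port))
    mreq = struct.pack("4s4s", socket.inet_aton(group), socket.inet_aton("0.0.0.0"))
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, mreq)  # IGMP join
    return sock

if __name__ == "__main__":
    sock = subscribe(GROUP, PORT)
    sock.settimeout(5.0)
    try:
        data, src = sock.recvfrom(65535)
        print(f"received {len(data)} bytes from {src}")
    except socket.timeout:
        print("no multicast data within 5 s (expected outside a live feed network)")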

Day-2

Middleware, Analytics, Machine Learning, leading to end-end ULL Architectures

Analytics & Machine Learning: to seek alpha and for infrastructure analytics

  • Intro to Big Data Analytics & Machine Learning (focus on neural networks)
  • Role of Java 8 new streams API
    • Speeds up extracting insight from large collections via methods such as:
    • Filter, sort, max, map, flatmap, reduce, collect
    • Use with ArrayLists, HashMaps (does not replace them)
    • Stream is 1-time use object
  • Intro to Complex Event Processing (CEP) and Event Stream Processing (ESP)
  • Databases – Never in the path of ULL
  • Column based (contiguous memory) vs relational
  • KDB and OneTick – leading players in high speed market data tick databases
  • Event Stream Processing (ESP) – use ESP to seek alpha
  • Combine market data with News sentiment analytics to seek alpha,
  • Intro to Ravenpack news sentiment analytics
  • Intro to Spark
  • Role of new storage technology (ex NVMe Flash drives)
  • In-mem analytics ex HANA, Spark
  • Corvil – intro to how to configure Corvils and how to analyze FIX order flow with it
  • Machine learning, neural networks in R or Python – create equations to project latencies (a minimal least-squares sketch appears after this list)
  • Machine learning for Latency analysis, tuning insight, seeking alpha -trade opportunities
  • Programming for multi threaded trading risk analytics
  • Class Ex – output from Corvil streams will be provided. Students will analyze and determine how latencies can be projected using neural networks (design only – no programming)
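
A minimal sketch of the latency-projection idea above, using a plain least-squares fit as the design-only precursor to a neural-network version. The message-rate/latency data is synthetic; a real exercise would use exported Corvil statistics.

# Minimal sketch: project latency from message rate with a least-squares fit.
import numpy as np

if __name__ == "__main__":
    rng = np.random.default_rng(7)
    msg_rate = rng.uniform(10_000, 200_000, 200)                  # msgs/sec
    latency_us = 5 + 0.00004 * msg_rate + rng.normal(0, 1, 200)   # hidden "true" relation

    slope, intercept = np.polyfit(msg_rate, latency_us, 1)        # least-squares fit
    print(f"latency ≈ {intercept:.2f} µs + {slope * 1000:.3f} µs per 1k msgs/s")
    print(f"projected at 500k msgs/s: {intercept + slope * 500_000:.1f} µs")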

Middleware, High Speed Messaging

  • 60 East AMPS
  • 29 West LBM (UME)
  • New FIX Binary protocol in beta promises to lower latencies
  • Importance of High Speed messaging for Algo Trading
  • Intro to basic algo’s for trading equities (ex VWAP, Volume Participation, use of AVX and RSI)
  • How to back-test algo’s for trading
  • Class Ex – output from application logs will be provided. Students will analyze and determine how AMPS can be configured for both high speed middleware and event stream processing for analytics

End-End ULL Architectures & Intro to Cloud Architectures

  • Co-Lo with 500 ns order ack times (revisited with our new knowledge)
  • Dark pools
  • Algo Trading (servers, appliances, or FPGA’s) in the architecture
  • Smart Order routers
  • Prop Trading
  • Exchanges
  • Why traditional cloud architectures fall short for ULL
  • Cloud for analytics – pitfalls vs best practices
  • Micro services potential
  • How to conduct ROI for new ULL architectures
  • Class Ex – VISIO or white boarding of a new trading system TBD, applying all learned in course

OPTIONAL

Futures, including Cloud Architectures for ULL

  • Brief Quiz on material covered last week + review of VISIO assignment
  • Projections on new technologies’ impacts on ULL – may include:
  • new Intel cores & software,
  • adoption of Single Tier networks,
  • impact of in memory machine learning for alpha generation of trading signals,
  • integration of deep machine learning from cloud to live trading networks via high speed interconnects, to an asynchronous Q, with NO latency impact,
  • applicability of block chains,
  • system reliability engineering (SRE),

END-OPTIONAL

Quick Bio of Ted Hruzd

Ted has 33 years of Wall Street IT experience in multiple capacities, ranging from Software Developer, SA, Manager of SAs, Performance Architect, and ULL Infrastructure Architect. Ted started his career with SIAC (1983-99), then progressed to firms with priorities in the electronic trading space – Instinet, Arca, Citigroup, Deutsche Bank, JP Morgan, RBC.

Ted’s career theme has been to trade as fast and as intelligently as possible, for competitive advantage. Ted now specializes in architecting ultra low latency electronic trading and market data infrastructures and in evangelizing disruptive technologies, with focus on ROI & cost savings. Ted keeps up to date with all latest technologies, conducts deep dive tech sessions with vendors, and can quickly architect and design new age infrastructures for any capital markets application, to increase revenues and net income. Subsequent to developing skills in Machine Learning / Neural Networks, Ted projects a very significant role for Machine Learning for both analytics to seek alpha and risk checks and thus also speed up electronic trading.

Ted is currently negotiating teaching the following course at NYU and Queens University (Belfast):

ULL (Ultra Low Latency) Architectures for Electronic Trading, link below:

https://homerunfitness.wordpress.com/2016/08/19/nyu-course-proposal-ull-architectures-for-electronic-trading/.

Ted’s passion for ultra low latencies and high speed processing is closely matched with his main hobby – being a lifetime serious athlete. This includes being an ACSM-certified Personal Trainer since 2008. Ted trains clients very part-time and only on weekends, specializing in safe strength and speed training, especially increasing speed for events from the 40-yard dash up to 10K.

Yes – Ted is all about speed and ultra low latencies.

APPENDIX

MC Protocols

IGMP

A network designed to deliver a multicast service using IGMP might use this basic architecture:

IGMP operates between the client computer and a local multicast router. Switches featuring IGMP snooping derive useful information by observing these IGMP transactions. Protocol Independent Multicast (PIM) is then used between the local and remote multicast routers, to direct multicast traffic from the multicast server to many multicast clients.

IGMP operates on the network layer, just the same as other network management protocols like ICMP.

The IGMP protocol is implemented on a particular host and within a router. A host requests membership to a group through its local router while a router listens for these requests and periodically sends out subscription queries.

Spanning Tree Protocol:

A subset of internetwork links is selected to define a tree structure (loop-less graph) such that there is only one active path between any two routers. Since this tree spans all nodes in the internetwork it is called a spanning tree. Whenever a router receives a multicast packet, it forwards the packet on all the links which belong to the spanning tree except the one on which the packet arrived, guaranteeing that the multicast packet reaches all the routers in the internetwork. The only information a router needs to keep is a boolean variable per network interface indicating whether the link belongs to the spanning tree or not. We use a small network with five nodes and six links to show different trees. For simplicity’s sake, we do not differentiate between hosts and routers, subnets and links. We also assume that links are symmetric and their costs are shown next to the links. The spanning tree from source node (C) is shown in Figure 4:

Reverse Path Broadcasting (RPB)

The RPB algorithm which is currently being used in the MBone (Multicast Backbone), is a modification of the Spanning Tree algorithm. In this algorithm, instead of building a network-wide spanning tree, an implicit spanning tree is constructed for each source. Based on this algorithm whenever a router receives a multicast packet on link “L” and from source “S”, the router will check and see if the link L belongs to the shortest path toward S. If this is the case the packet is forwarded on all links except L. Otherwise, the packet is discarded. Three Multicast trees from two sources of our test network are shown in Figure 5.

The RPB algorithm can be easily improved by considering the fact that if the local router is not on the shortest path between the source node and a neighbor, the packet will be discarded at the neighboring router. Therefore, if this is the case there is no need to forward the message to that neighbor. This information can be easily obtained if a link-state routing protocol is being used. If a distance-vector routing protocol is being used, a neighbor can either advertise its previous hop for the source as part of its routing update messages or “poison-reverse” the route [Semeria].

This algorithm is efficient and easy to implement. Furthermore, since packets are forwarded along the shortest path from the source to the destination nodes, it is very fast. The RPB algorithm does not need any mechanism to stop the forwarding process. The routers do not need to know about the entire spanning tree, and since packets are delivered through different spanning trees (not a unique spanning tree), traffic is distributed over multiple trees and the network is better utilized. Nevertheless, the RPB algorithm suffers from a major deficiency: it does not take into account information about multicast group membership when constructing the distribution trees. (A minimal sketch of the RPB forwarding check follows.)
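
A minimal sketch of the RPB rule just described: a router accepts a multicast packet only if it arrived on the link that lies on its shortest path back to the source, then forwards it on every other link. The five-node topology and link costs are illustrative, in the spirit of the example network in the text.

# Minimal sketch of the RPB forwarding check on an illustrative topology.
import heapq

def shortest_path_first_hop(graph, node, source):
    # Dijkstra from `node`; return the neighbor used as first hop toward `source`.
    dist, first = {node: 0}, {node: None}
    pq = [(0, node)]
    while pq:
        d, u = heapq.heappop(pq)
        if d > dist.get(u, float("inf")):
            continue
        for v, w in graph[u].items():
            if d + w < dist.get(v, float("inf")):
                dist[v] = d + w
                first[v] = v if u == node else first[u]
                heapq.heappush(pq, (d + w, v))
    return first.get(source)

def rpb_forward(graph, router, source, arrival_link):
    # Return the links to forward on, or [] if the packet fails the RPB check.
    if shortest_path_first_hop(graph, router, source) != arrival_link:
        return []                                  # not on the reverse shortest path: discard
    return [n for n in graph[router] if n != arrival_link]

if __name__ == "__main__":
    # Symmetric links with costs, in the spirit of the small example network.
    graph = {
        "A": {"B": 1, "C": 4},
        "B": {"A": 1, "C": 2, "D": 5},
        "C": {"A": 4, "B": 2, "D": 1},
        "D": {"B": 5, "C": 1, "E": 1},
        "E": {"D": 1},
    }
    print(rpb_forward(graph, "C", "A", "B"))   # arrived via shortest path -> forward on others
    print(rpb_forward(graph, "C", "A", "D"))   # arrived off-path -> discard ([])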

*** Reverse Path Multicasting (RPM)

The RPM algorithm (also known as RPB with prunes) is an enhancement to the RPB and TRPB algorithms. RPM constructs a delivery tree that spans only: 1) subnetworks with group members, and 2) routers and subnetworks along the shortest path to subnetworks with group members [Semeria]. The RPM tree can be pruned such that the multicast packets are forwarded along links which lead to members of the destination group.

For a given pair of (source, group) the first multicast packet is forwarded based on the TRPB algorithm. The routers which do not have any downstream router in the TRPB tree are called leaf routers. If a leaf router receives a multicast packet for a (source, group) pair and it does not have any group member on its subnetworks, it will send a “prune” message to the router from which it has received the multicast packet. The prune message indicates that the multicast packets of that particular (source, group) pair should not be forwarded on the link from which the prune message has been received. It is important to note that prune messages are only sent one hop back towards the source. The upstream router is required to record the prune information in its memory. On the other hand, if the upstream router does not have any local recipient and receives prune messages from all of its children in the TRPB tree, the upstream router will send a prune message itself to its parent in the TRPB tree indicating that the multicast packets for the (source, group) pair need not be forwarded to it. The cascaded prune messages will truncate the original TRPB tree such that the multicast packets will be forwarded only on those links that will lead to a destination node(multicast group member). For showing the tree obtained after the exchange of prune messages in a network, we need to use a more complicated network. Figure 6 illustrates pruning and the obtained RPM tree.

Networks

TCP PSH ACK

The sender (S) sets PSH to indicate that if the receiver's (R) TCP has not yet delivered buffered data to the application, it should do so as soon as possible.

When R's TCP sees PSH, R must not wait for more data.

The receive buffer is returned to the user application for processing.

RST – can occur when a packet is received that is not intended for any existing connection (the segment is rejected and the connection reset).
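The PSH flag itself is set by the TCP stack rather than the application; on Linux the practical knob for a latency-sensitive sender is TCP_NODELAY, which disables Nagle so small writes (single orders) go out immediately. A minimal sketch, with an illustrative port and payload:

#include <arpa/inet.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);

    /* Disable Nagle: don't hold back small segments waiting for ACKs. */
    int one = 1;
    setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &one, sizeof(one));

    struct sockaddr_in dst = {0};
    dst.sin_family      = AF_INET;
    dst.sin_port        = htons(9000);              /* illustrative port */
    dst.sin_addr.s_addr = htonl(INADDR_LOOPBACK);

    if (connect(fd, (struct sockaddr *)&dst, sizeof(dst)) == 0) {
        const char order[] = "NEW,AAPL,100,178.25";  /* illustrative payload */
        /* Goes out in its own segment right away; the stack typically sets
         * PSH on the last segment of the send. */
        send(fd, order, sizeof(order) - 1, 0);
    }
    close(fd);
    return 0;
}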

Perf Tools and tracers

http://www.brendangregg.com/blog/

http://www.brendangregg.com/blog/2015-07-08/choosing-a-linux-tracer.html

Linux has available built-in:

Tools – perf, ftrace, eBPF

FTRACE

  • Built into the kernel
  • Function-flow walking
  • Is NOT easily programmable, so it can’t calculate latencies, unless you:
    • Dump events to user level and post-process with scripts, or use eBPF
  • uses the frysk engine to trace system calls
    • syscalls of processes and threads
    • Dynamic kernel function tracing, including latencies (e.g., how long from a process wakeup until it starts running)
    • /debug/tracing (the debugfs tracing directory)
    • trace_clock
  • http://lwn.net/Articles/370423/

Probably the most powerful tracer derived from Ftrace is the function tracer. It has the ability to trace practically every function in the kernel. It can be run not just for debugging or analyzing, but also to learn and observe the flow of the Linux kernel.

  • CONFIG_FUNCTION_TRACER
  • CONFIG_DYNAMIC_FTRACE
  • CONFIG_FUNCTION_GRAPH_TRACER

  • Here is a quick example of how to enable the function tracer:

[tracing]# echo function > current_tracer
[tracing]# cat trace
 -0 [000] 1726568.996435: hrtimer_get_next_event <-get_next_timer_interrupt
 -0 [000] 1726568.996436: _spin_lock_irqsave <-hrtimer_get_next_event
 -0 [000] 1726568.996436: _spin_unlock_irqrestore <-hrtimer_get_next_event
 -0 [000] 1726568.996437: rcu_needs_cpu <-tick_nohz_stop_sched_tick
 -0 [000] 1726568.996438: enter_idle <-cpu_idle

PERF_EVENTS – source lives in the kernel tree, usually packaged as the “perf” tool

  • perf top -g (functions)
  • does most of what ftrace does, but not function-flow walking
  • profiles, samples, dumps user-level stacks, line tracing for local variables

SystemTap

  • profiling, tracepoints, kernel & user probes (kprobes/uprobes)
  • in-kernel programming
  • compiles scripts into kernel modules
  • has a history of panics/freezes
  • download, code, and test in dev first

sysdig:

  • syscalls (ONLY) with tcpdump-like syntax
  • dumps all events to user level for post processing

DTRACE

http://www.brendangregg.com/dtrace.html

DTraceToolkit

See the DTraceToolkit website.

DTrace One Liners

These are handy one-liners to use at the command line. dtrace_oneliners.txt contains the full listing with examples.

# New processes with arguments:
dtrace -n 'proc:::exec-success { trace(curpsinfo->pr_psargs); }'
# Files opened by process:
dtrace -n 'syscall::open*:entry { printf("%s %s", execname, copyinstr(arg0)); }'
# Syscall count by program:
dtrace -n 'syscall:::entry { @num[execname] = count(); }'
# Syscall count by syscall:
dtrace -n 'syscall:::entry { @num[probefunc] = count(); }'
# Syscall count by process:
dtrace -n 'syscall:::entry { @num[pid,execname] = count(); }'
# Read bytes by process:
dtrace -n 'sysinfo:::readch { @bytes[execname] = sum(arg0); }'
# Write bytes by process:
dtrace -n 'sysinfo:::writech { @bytes[execname] = sum(arg0); }'
# Read size distribution by process:
dtrace -n 'sysinfo:::readch { @dist[execname] = quantize(arg0); }'
# Write size distribution by process:
dtrace -n 'sysinfo:::writech { @dist[execname] = quantize(arg0); }'
# Disk size by process:
dtrace -n 'io:::start { printf("%d %s %d", pid, execname, args[0]->b_bcount); }'
# Pages paged in by process:
dtrace -n 'vminfo:::pgpgin { @pg[execname] = sum(arg0); }'
# Minor faults by process:
dtrace -n 'vminfo:::as_fault { @mem[execname] = sum(arg0); }'
# Profile user-level stacks at 99 Hertz, for PID 189:
dtrace -n 'profile-99 /pid == 189 && arg1/ { @[ustack()] = count(); }'

There are also many one-liners in the DTrace book, and as Appendix D of the Systems Performance book.

  • psio is another DTrace enabled disk I/O tool.

iotop – displays top disk I/O events by process. It tracks disk I/O by process and prints a summary report that is refreshed every interval.

# iotop -C
Sampling… Please wait.
2005 Jul 16 00:31:38,  load: 1.03,  disk_r:   5023 Kb,  disk_w:     22 Kb
  UID    PID   PPID CMD              DEVICE  MAJ MIN D    BYTES
    0  27740  20320 tar              cmdk0   102  16 W    23040
    0  27739  20320 find             cmdk0   102   0 R   668672
    0  27740  20320 tar              cmdk0   102  16 R  1512960
    0  27740  20320 tar              cmdk0   102   3 R  3108864
2005 Jul 16 00:31:43,  load: 1.06,  disk_r:   8234 Kb,  disk_w:      0 Kb
  UID    PID   PPID CMD              DEVICE  MAJ MIN D    BYTES
    0  27739  20320 find             cmdk0   102   0 R  1402880
    0  27740  20320 tar              cmdk0   102   3 R  7069696
[…]

tcpsnoop – snoops TCP network packets by process. It analyses TCP network packets and prints the responsible PID and UID, plus standard details such as IP address and port. It captures traffic of newly created TCP connections established while the program was running, and can help identify which processes are causing TCP traffic.

# tcpsnoop.d
  UID    PID LADDR           LPORT DR RADDR           RPORT  SIZE CMD
  100  20892 192.168.1.5     36398 -> 192.168.1.1        79    54 finger
  100  20892 192.168.1.5     36398 -> 192.168.1.1        79    54 finger
  100  20892 192.168.1.5     36398 <- 192.168.1.1        79    54 finger
    0    242 192.168.1.5        23 <- 192.168.1.1     54224    54 inetd
    0    242 192.168.1.5        23 -> 192.168.1.1     54224    54 inetd
    0    242 192.168.1.5        23 <- 192.168.1.1     54224    54 inetd
    0    242 192.168.1.5        23 <- 192.168.1.1     54224    78 inetd
    0    242 192.168.1.5        23 -> 192.168.1.1     54224    54 inetd
    0  20893 192.168.1.5        23 -> 192.168.1.1     54224    57 in.telnetd
    0  20893 192.168.1.5        23 <- 192.168.1.1     54224    54 in.telnetd
    0  20893 192.168.1.5        23 -> 192.168.1.1     54224    78 in.telnetd
    0  20893 192.168.1.5        23 <- 192.168.1.1     54224    57 in.telnetd
    0  20893 192.168.1.5        23 -> 192.168.1.1     54224    54 in.telnetd
[…]

tcptop – displays top TCP network packets by process. It captures traffic of newly created TCP connections established while the program was running, and can help identify which processes are causing TCP traffic.

# tcptop -C 30
Sampling… Please wait.
2005 Jul  5 05:18:56,  load: 1.07,  TCPin:      3 Kb,  TCPout:    112 Kb
  UID    PID LADDR           LPORT RADDR           RPORT    SIZE NAME
    0    242 192.168.1.5        79 192.168.1.1     54283     272 inetd
    0    242 192.168.1.5        23 192.168.1.1     54284     294 inetd
    0  20929 192.168.1.5        79 192.168.1.1     54283     714 in.fingerd
  100  20926 192.168.1.5     36409 192.168.1.1        79    1160 finger
  100  20927 192.168.1.5     36410 192.168.1.1        79    1160 finger
  100  20928 192.168.1.5     36411 192.168.1.1        23    1627 telnet
    0  20313 192.168.1.5        22 192.168.1.1     54285    2798 sshd
    0  20931 192.168.1.5        23 192.168.1.1     54284    4622 in.telnetd
  100  20941 192.168.1.5       858 192.168.1.1       514  115712 rcp
2005 Jul  5 05:19:26,  load: 1.04,  TCPin:      0 Kb,  TCPout:      4 Kb
  UID    PID LADDR           LPORT RADDR           RPORT    SIZE NAME
  100  20942 192.168.1.5     36412 192.168.1.1        79    1160 finger
    0  20931 192.168.1.5        23 192.168.1.1     54284    7411 in.telnetd
[…]

udpsnoop.d – snoops UDP network I/O by process. It analyses UDP network I/O and prints the responsible PID and UID, plus standard details such as IP address and port. It tracks UDP reads/writes by payload.

# udpsnoop.d
  UID    PID LADDR           LPORT DR RADDR           RPORT  SIZE CMD
    0  27127 192.168.1.5     35534 -> 192.168.1.1        53    29 nslookup
    0  27127 192.168.1.5     35534 <- 192.168.1.1        53   181 nslookup
    1    221 192.168.1.5       111 <- 192.168.1.1     37524    56 rpcbind
    1    221 192.168.1.5       111 -> 192.168.1.1     37524    28 rpcbind
    0  27128 192.168.1.5     35116 <- 192.168.1.1     37524    40 rpc.sprayd
    0  27128 192.168.1.5     35116 -> 192.168.1.1     37524    24 rpc.sprayd
    0  27128 192.168.1.5     35116 <- 192.168.1.1     37524    44 rpc.sprayd
    0  27128 192.168.1.5     35116 <- 192.168.1.1     37524    44 rpc.sprayd
    0  27128 192.168.1.5     35116 <- 192.168.1.1     37524    44 rpc.sprayd
    0  27128 192.168.1.5     35116 <- 192.168.1.1     37524    44 rpc.sprayd
    0  27128 192.168.1.5     35116 <- 192.168.1.1     37524    44 rpc.sprayd
    0  27128 192.168.1.5     35116 <- 192.168.1.1     37524    44 rpc.sprayd
    0  27128 192.168.1.5     35116 <- 192.168.1.1     37524    44 rpc.sprayd
    0  27128 192.168.1.5     35116 <- 192.168.1.1     37524    44 rpc.sprayd
    0  27128 192.168.1.5     35116 <- 192.168.1.1     37524    44 rpc.sprayd
    0  27128 192.168.1.5     35116 <- 192.168.1.1     37524    44 rpc.sprayd
    0  27128 192.168.1.5     35116 <- 192.168.1.1     37524    40 rpc.sprayd
    0  27128 192.168.1.5     35116 -> 192.168.1.1     37524    36 rpc.sprayd
^C

connections – snoops inbound TCP connections as they are established, displaying the server process that accepted the connection.

# connections
  UID   PID CMD          TYPE     PORT  IP_SOURCE
    0   242 inetd        tcp        79  192.168.1.1
    0   359 sshd         tcp        22  192.168.1.1
  100  1532 Xorg         tcp      6000  192.168.1.1
^C

prustat – displays %CPU, %Mem, %Disk and %Net utilisation by process. Examining all four key performance areas by process in Solaris was prohibitively difficult without DTrace. prustat also uses Perl, Kstat and the procfs structures from /proc//. It is a new tool and still under development, released as a demonstration.

# prustat -t5 5
  PID  %CPU  %Mem %Disk  %Net  COMM
22301 65.01  3.17  0.00  0.00  setiathome
  440  8.91 45.39  0.00  0.00  Xsun
 2618  0.33 14.34  0.00  0.00  mozilla-bin
  582  4.01  2.16  0.00  0.00  gnome-terminal
  574  1.80  1.31  0.00  0.00  metacity
  PID  %CPU  %Mem %Disk  %Net  COMM
22694  3.74  0.20 74.47  0.00  tar
22301 66.70  3.17  0.00  0.00  setiathome
  440  6.67 45.39  0.00  0.00  Xsun
 2618  0.33 14.34  0.00  0.00  mozilla-bin
22693  3.81  1.50  0.00  0.00  dtrace
  PID  %CPU  %Mem %Disk  %Net  COMM
22301 63.72  3.17  0.00  0.00  setiathome
  440  8.14 45.39  0.00  0.00  Xsun
22694  6.47  0.20 36.47  0.00  tar
22698  0.00  0.00  6.88 22.43  rcp
 2618  0.34 14.34  0.00  0.00  mozilla-bin
^C

dtruss – a DTrace version of truss, designed to be less of a burden and safer than truss. In the example below, dtruss examines all processes named "bash" and prints regular truss output plus elapsed and overhead times.

# dtruss -eon bash
PID/LWP    ELAPSD  OVERHD SYSCALL(args)                 = return
39111:         41      26 write(0x2, "l\0", 0x1)        = 1 0
39111:    1001579      43 read(0x0, "s\0", 0x1)         = 1 0
39111:         38      26 write(0x2, "s\0", 0x1)        = 1 0
39111:    1019129      43 read(0x0, " \001\0", 0x1)     = 1 0
39111:         38      26 write(0x2, " \0", 0x1)        = 1 0
39111:     998533      43 read(0x0, "-\0", 0x1)         = 1 0
39111:         38      26 write(0x2, "-\001\0", 0x1)    = 1 0
39111:    1094323      42 read(0x0, "l\0", 0x1)         = 1 0
39111:         39      27 write(0x2, "l\001\0", 0x1)    = 1 0
39111:    1210496      44 read(0x0, "\r\0", 0x1)        = 1 0
[…]

hotkernel – samples on-CPU kernel-level functions and modules. It samples at 1000 Hertz, for a simple yet effective module-level profiling tool. The output identifies which function is on-CPU the most – the hottest. The following examples show hotkernel analysing an x86 kernel.

# ./hotkernel
Sampling… Hit Ctrl-C to end.
^C
FUNCTION                                COUNT   PCNT
unix`swtch                                  1   0.1%
pcplusmp`apic_redistribute_compute          1   0.1%
genunix`strrput                             1   0.1%
unix`sys_call                               1   0.1%
genunix`fsflush_do_pages                    1   0.1%
TS`ts_wakeup                                1   0.1%
genunix`callout_schedule_1                  1   0.1%
unix`page_create_putback                    1   0.1%
unix`mutex_enter                            4   0.3%
unix`cpu_halt                            1575  99.2%

# ./hotkernel -m
Sampling… Hit Ctrl-C to end.
^C
MODULE                                  COUNT   PCNT
usbms                                       1   0.0%
specfs                                      1   0.0%
uhci                                        1   0.0%
sockfs                                      2   0.0%
genunix                                    28   0.6%
unix                                     4539  99.3%

Differences between Mellanox (MX) and Solarflare (SF)?

What changes does an application need for OnLoad?

/proc/pid NUMA and memory

Receive-side steering

Cross calls

mmap vs. shared memory

http://www.solarflare.com/openonload-enterpriseonload

No app change needed (preload mode)

But use the asynchronous callback API (EF_VI) for more speed

http://www.mellanox.com/related-docs/whitepapers/WP_VMA_TCP_vs_Solarflare_Benchmark.pdf

https://www.informatix-sol.com/low-latency.html

Kernel-based TCP/IP has its limitations; you can typically halve its contribution to latency by adopting a user-space implementation such as OpenOnload. This can be used in any of three modes, each with decreasing latency but increasing API complexity. The simplest is to take an existing TCP/IP socket-based program (it can even be in binary form) and preload Onload before it is started. The next is to use TCPDirect, which uses a simplified socket API referred to as zsockets; this requires refactoring of existing Linux socket C code. The best latency with Onload, however, comes from rewriting your application to use the EF_VI API. This is asynchronous, using completion events, so it usually requires a complete rewrite of the send/receive modules.

Where you have control of both ends of the wire, the lowest latency and jitter are obtained by bypassing TCP altogether. There are three main candidates for this: InfiniBand, RoCE and iWARP. These all share a common set of APIs known as verbs, so it is possible to develop applications that will run on any of them. As with OpenOnload, SDP and Mellanox's VMA both preload to accelerate an existing TCP/IP socket program. OpenOnload retains the TCP/IP protocol so can be used single-ended; SDP and VMA both map to verbs, so they must be deployed on both ends of the wire. Best latency with OpenOnload is achieved by receive polling: this sacrifices a core just to receive packets on the socket, but avoids the kernel wake-up delay of the user thread. Verbs-based programs can be run over Ethernet, InfiniBand or OmniPath. Verbs programs over Ethernet use either RoCE or iWARP. RoCE relies on PFCs (priority flow control) to limit senders, and on non-drop queuing in the switches for RDMA-Ethertype packets, to provide the underlying reliability, whilst iWARP uses TCP (implemented with offload engines). Unfortunately, whilst the DCB standard includes the mechanisms to enable this, the current generation of Ethernet switches typically only enables dropless behaviour for the FCoE Ethertype. While RoCE programs may appear to work, any drops may go undetected and result in data corruption. Large-scale RDMA Ethernet deployments also need L2 mesh support to replace spanning tree; a number of proprietary approaches are appearing to solve this, whilst the DCB group is focusing on TRILL, which we should see emerge during 2012.

OUTLINE:

Three options for Solarflare OnLoad:

  1. Preload OnLoad (no code changes)
  2. TCPDirect – zsockets / refactor Linux socket C code
  3. EF_VI API – recode send/receive modules for asynchronous operation / completion events

Bypass TCP altogether:

IB, RoCE, iWARP (verbs APIs)
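A minimal sketch of the receive-polling pattern described above: a non-blocking socket spun on by a dedicated core avoids the kernel wake-up delay, and under option 1 the same unmodified binary can be accelerated by launching it under the Onload wrapper (e.g. onload ./app). The port number and tick handling are illustrative:

#include <arpa/inet.h>
#include <errno.h>
#include <fcntl.h>
#include <netinet/in.h>
#include <stdio.h>
#include <sys/socket.h>
#include <unistd.h>

/* Busy-poll a UDP market-data socket instead of blocking in the kernel. */
int main(void)
{
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    fcntl(fd, F_SETFL, O_NONBLOCK);             /* never sleep in recv()    */

    struct sockaddr_in addr = {0};
    addr.sin_family      = AF_INET;
    addr.sin_port        = htons(14310);        /* illustrative feed port   */
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    bind(fd, (struct sockaddr *)&addr, sizeof(addr));

    char buf[2048];
    for (;;) {
        ssize_t n = recv(fd, buf, sizeof(buf), 0);
        if (n > 0) {
            /* handle_tick(buf, n): decode and act on the market update */
        } else if (n < 0 && errno != EAGAIN && errno != EWOULDBLOCK) {
            perror("recv");
            break;
        }
        /* else: nothing yet -- keep spinning, burning this core on purpose */
    }
    close(fd);
    return 0;
}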

Multi-Process vs Multi-Thread

In a multi-threaded program, multiple actors live in a shared program context. In a multi-process system there are multiple actors, but each lives in its own independent program context.

Threads – positives:

  • Less overhead to establish and terminate vs. a process: because very little memory copying is required (just the thread stack), threads are faster to start than processes. To start a process, the whole process image must be duplicated for the new copy to start; while some operating systems only copy memory once it is modified (copy-on-write), this is not universally guaranteed.
  • Faster task switching: in many cases it is faster for an operating system to switch between threads for the active CPU task than to switch between different processes. CPU caches and program context can be maintained across threads in a process, rather than being reloaded as when switching the CPU to a different process.
  • Data sharing with other threads in a process: for tasks that require sharing large amounts of data, the fact that threads all share a process’s memory pool is very beneficial. Not having separate copies means that different threads can read and modify a shared pool of memory easily. While data sharing is possible between separate processes through shared memory and inter-process communication, this sharing is of an arm’s-length nature and is not inherently built into the process model.

Threads – negatives:

  • Synchronization overhead of shared data: shared data that is modified requires special handling in the form of locks, mutexes and other primitives to ensure that data is not read while being written, nor written by multiple threads at the same time.
  • Shared process memory space: all threads in a process share the same memory space. If something goes wrong in one thread and causes data corruption or an access violation, it affects and corrupts all the threads in that process.
  • Program debugging: multi-threaded programs present difficulties in finding and resolving bugs over and beyond the normal difficulties of debugging. Synchronization issues, non-deterministic timing and accidental data corruption all conspire to make debugging multi-threaded programs an order of magnitude more difficult than single-threaded ones.

Processes – positives:

  • Processes are a useful choice for parallel workloads where tasks take significant computing power, memory or both. For example, rendering or printing complicated file formats (such as PDF) can take significant time (many milliseconds per page) and involve significant memory and I/O. In this situation, using a single-threaded process per file allows better throughput, thanks to increased independence and isolation between the tasks, than one process with multiple threads.
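A small illustrative sketch of the difference: the thread shares the parent's address space, so its increment is visible to the parent, while the forked child only updates its own copy-on-write copy (compile with -pthread):

#include <pthread.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static long counter = 0;              /* lives in the process's memory */

static void *thread_fn(void *arg)
{
    (void)arg;
    counter++;                        /* same address space: parent sees this */
    return NULL;
}

int main(void)
{
    pthread_t t;
    pthread_create(&t, NULL, thread_fn, NULL);
    pthread_join(t, NULL);
    printf("after thread: counter = %ld\n", counter);   /* prints 1 */

    pid_t pid = fork();               /* child gets a copy-on-write copy */
    if (pid == 0) {
        counter++;                    /* modifies the child's copy only */
        _exit(0);
    }
    waitpid(pid, NULL, 0);
    printf("after fork:   counter = %ld\n", counter);   /* still 1 */
    return 0;
}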

http://www.tradersmagazine.com/news/buyside/buy-side-focuses-on-order-routing-115494-1.html

Asset managers are increasingly including order routing in their best execution due diligence as equity volumes shift to algorithmic trading.

Consultancy Greenwich Associates said in a report yesterday that large institutions are shifting trading volume to algorithmic avenues of execution which has kept the overall commission pool flat. The annual pool of cash equity commissions paid by institutional investors to brokers on US equity trades was $9.65bn (€8.7bn), down more than 30% from its peak in 2009 but 4% more than the low of $9.3bn reported in 2013.

The largest commission-generating accounts participating in the study increased their use of algorithmic trading strategies/smart-order routing algorithms by almost 10% between 2015 and 2016. Greenwich Associates interviewed 223 US equity portfolio managers and 321 US equity traders between November 2015 and February 2016 for the report ‘Flat E-Trading Volumes in U.S. Equities Mask Increase Among Larger Accounts’.

Greenwich added that although notional dollar volumes traded via algorithms have not been more than a third of total volume since 2012, recent bouts of volatility and the approval of IEX’s exchange application are likely to force a refocus on the use of algo-driven routing logic to help navigate an increasingly complex market structure.

The US Securities and Exchange Commission controversially approved IEX Group as a stock exchange last month. The venue, featured in the bestseller “Flash Boys”, incorporates a speed bump – a 350-microsecond delay between matching an order and publicly displaying the match – in order to make it more difficult for high-frequency trading strategies to interact with its order flow.

Pragma Securities, a provider of algorithmic trading tools, also said in a report yesterday that the buyside is paying increasing attention to order routing practices.

“This increasing diligence is appropriate,” said Pragma. “Ultimately most high-touch order flow also ends up being traded algorithmically, and algo routing logic can compromise execution quality for high-touch trades in the same ways as self-directed trading.”

http://www.therealizationgroup.com/Downloads/FPGA_beyond_market_data_140529.pdf

“However, despite the marketing messages you might hear, this approach to kernel bypass is not a miracle cure because it remains CPU-intensive. Some network hardware vendors now use kernel bypass technology in their ‘low-latency’ NICs to try to avoid bottlenecks by taking the whole network stack out of the kernel and into the user space. But the problem with this approach is that the network stack is still running on the CPU and is therefore loading the CPU”, says de Barry.

See figure one below.

“Everything you can offload from the CPU helps improve latency and – more importantly – reduce jitter”, continues de Barry. “So our solution is to place the full TCP stack in hardware. That way the CPU doesn’t have to worry about TCP any more as all of those processes are offloaded to the FPGA”.

See figure two below.

“The main advantage with this approach is that we don’t use the CPU at all, it’s all done on the FPGA card”, says de Barry.

Scalable Broadcast

The traditional approach to broadcasting data is to use a multicast switch or regenerative in-line network taps. Neither approach scales very well with increased port counts, in terms of performance or manageability. In contrast, MetaConnect can replicate an input to all of its ports in an extremely low 4 ns. Each signal is regenerated to avoid degradation in signal quality. With two layers of cascading 48-port MetaConnects it is possible to scale out to over 2,000 feed copies (roughly 47 × 47 ≈ 2,200, with one port per device used as the input) within 16 nanoseconds, including fibres.

Here is a trading example: a router connects through MetaConnect to a WAN, from which it receives routing information (BGP), multicast subscriptions (PIM) and other negotiations. Any downstream packets from the WAN to the router are replicated and sent to each of the trader machines, with a latency between the WAN and the trader machines of less than 4 ns. Orders can be placed via a second network port on each trading machine.

More links, docs, and pictures:

http://www.fixnetix.com/

http://www.fixnetix.com/perch/resources/cscfixnetixzerolatencypov-1.pdf

http://www.fixnetix.com/services/trading-and-risk-solutions/

http://www.netcope.com/en

100 Gig E adapter

Processes all incoming network traffic within the FPGA chip at wire speed and transfers more than 100G of user data over PCI Express to the software.

Ethernet variants like 25G/50G, recently announced by Google, Microsoft and Arista Networks, are supported as well.

Network link speed, performance of on-board network controller, throughput of PCI Express bus, performance of host system – these are all factors that influence the whole solution and we paid maximum attention to make all links of this chain as strong as possible.

FPGA cards are equipped with the latest PCI Express Gen 3 bus. 100G variants have an x16-wide PCI Express interface and, in combination with support for PCI Express bifurcation, are able to achieve sustained throughput of more than 100 Gbps to the host system. Learn more in the Xilinx blog.

  • PPS (pulse per second) input for precise hardware timestamping
  • Xilinx Virtex 7 Series FPGA chip
  • QDR/DDR memories

http://www.netcope.com/en/products/tradecope

a tick-to-trade engine directly in FPGA hardware:

Write your decision logic in C/C++.

The performance of software implementations of feed handlers, trading-strategy logic or order generation usually shows a slowdown caused by the operating system running on the machine. There are several clear latency bottlenecks on virtually every common system running an OS.

Tradecope successfully overcomes these latency bottlenecks by implementing a tick-to-trade engine directly in FPGA hardware:

  • The network communication containing the market data does not need to be forwarded to software, but instead is processed inside the FPGA network interface card eliminating OS kernel network stack latency and latency of PCIe transfers.
  • True parallelism is a natural feature of FPGA hardware which can handle many distinct jobs at the same time.

TradeCope Features

  • Input traffic filtering
  • A-B channel arbitrage
  • Sequence number gap detection
  • Sending orders to market
  • Message and symbol filtering
  • Pre-trade risk check
  • Computation of pre-defined statistics (EMA, MID, etc.)
  • Trading strategy triggered by market update
  • Building an order book representation in FPGA
  • FIX/FAST and binary decoding

The whole trading pipeline – from receiving market data, through packet arbitration, message decoding, book building, trading-strategy execution and risk checks, to order generation – is implemented and runs directly in the hardware. This allows every single processing step to be fine-tuned, achieving sub-microsecond tick-to-trade latency.

Strategies and API in C++. A latency-optimized API allows the user to communicate easily with the card and build software routines for further management of the trading behaviour. The latency-insensitive part of the trading strategy should run on the CPU to keep the latency-sensitive part in the FPGA as fast as possible. Using the optimized API, the software can send parameters to the logic in the FPGA to modify its behaviour on the fly.

Write your decision logic in C/C++. From the end-user point of view, the only requirement is to provide decision-making logic written in C/C++ (or optionally in HDL). This code is automatically transformed into a hardware representation and connected to the hardware pipeline in the FPGA, after which Tradecope is ready to trade. Software remains responsible for statistics gathering, trading-parameter updates, and the latency-insensitive parts of the trading strategy, keeping the latency-sensitive part in the FPGA as fast as possible.
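As an illustration of what "decision logic in C/C++" can look like, here is a minimal sketch; the structs and the function signature are hypothetical stand-ins, not the actual Tradecope API:

#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for whatever book/order types the vendor
 * toolchain exposes; NOT the real Tradecope interface. */
struct book_top {
    uint32_t symbol_id;
    int64_t  bid_px, ask_px;      /* fixed-point prices */
    uint32_t bid_qty, ask_qty;
};

struct order {
    uint32_t symbol_id;
    int64_t  px;
    uint32_t qty;
    bool     is_buy;
    bool     send;                /* false => no order on this update */
};

/* Decision logic expressed as a plain C function over the top of book:
 * lift the offer when the spread has collapsed below a threshold pushed
 * down from software. Code in this style is what such a flow would
 * translate into hardware alongside the feed handler and book builder. */
struct order on_book_update(const struct book_top *b, int64_t max_spread)
{
    struct order o = { .symbol_id = b->symbol_id, .send = false };

    if (b->ask_px - b->bid_px <= max_spread && b->ask_qty > 0) {
        o.px     = b->ask_px;
        o.qty    = b->ask_qty < 100 ? b->ask_qty : 100;
        o.is_buy = true;
        o.send   = true;
    }
    return o;
}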

?? how switch?

Metamako

ExaBlaze Fusion

FixNetix ZeroLatency

API with examples of software applications

FIX, Arca Direct and OTTO

Accelerated feed handler. Tradecope processes market data directly (filtering, decoding) in the FPGA and creates an order book. Predefined statistics are computed as well. All the data is accessible from the software using a highly optimized API.

Accurate market data recording & replaying

user code is automatically transformed into an FPGA representation

Pre-trade risk checks and TCP injection. Tradecope includes cores for the processing and decoding of Order entry protocols, which makes it possible to perform wire-speed risk checks in the high-performance FPGA chip transparently to the user. A packet with an order may be dropped if risk checks do not pass, or a TCP packet may be injected if a specific event is detected. For instance, it is possible to specify thresholds for open positions or traded volume. Tradecope is able to cancel orders on the book with ultra-low latency when the threshold is breached.

TradeCope part of NetCope FPGA Boards (NFB)?

NetCope Pkt Capture (NPC)

Filtering-based packet manipulation and high-throughput transfer to the host system. The standard PCI Express form factor makes it suitable for commodity multi-core servers. Multiple cards can be plugged into a single server to build a high-density solution.

Vs Corvil?

A hardware filter that supports up to eight thousand filtering rules. L3 and L4 header fields can be used to express conditions. The corresponding action can transfer a packet to the host system, send it to an output network interface, or drop it. Intelligent transfer to the host system can be used to distribute traffic over CPU cores, based on hashing, in a flow-aware fashion.
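A conceptual sketch of how such an L3/L4 rule table behaves (the structs, wildcard convention and first-match policy are illustrative assumptions, not the NPC API; the real card evaluates rules in hardware at wire speed, with software only programming the table):

#include <stdbool.h>
#include <stdint.h>

/* Conceptual model of an L3/L4 filtering rule: a zero field is a wildcard. */
enum filter_action { ACT_TO_HOST, ACT_TO_PORT, ACT_DROP };

struct filter_rule {
    uint32_t src_ip, dst_ip;           /* IPv4, host byte order */
    uint16_t src_port, dst_port;
    uint8_t  proto;                    /* e.g. 6 = TCP, 17 = UDP */
    enum filter_action action;
    uint8_t  out_port;                 /* used by ACT_TO_PORT */
};

struct pkt_hdr {
    uint32_t src_ip, dst_ip;
    uint16_t src_port, dst_port;
    uint8_t  proto;
};

static bool rule_matches(const struct filter_rule *r, const struct pkt_hdr *p)
{
    return (!r->src_ip   || r->src_ip   == p->src_ip)   &&
           (!r->dst_ip   || r->dst_ip   == p->dst_ip)   &&
           (!r->src_port || r->src_port == p->src_port) &&
           (!r->dst_port || r->dst_port == p->dst_port) &&
           (!r->proto    || r->proto    == p->proto);
}

/* First-match lookup over the rule table; default action is illustrative. */
static enum filter_action classify(const struct filter_rule *rules, int n,
                                   const struct pkt_hdr *p)
{
    for (int i = 0; i < n; i++)
        if (rule_matches(&rules[i], p))
            return rules[i].action;
    return ACT_TO_HOST;
}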

zero-copy API for high-speed transfers to the memories of the host system

Interconnection of two cards to achieve overall throughput of 200Gbps to software with load balancing

Acceleration of IDS/IPS (Intrusion Detection/Prevention Systems)

NPC-100G2

Separate products?

These products are ideal for OEM vendors, R&D customers and end customers to build, develop and deploy hardware-accelerated solutions.
