Market Data for Quants
Honestly, data matters no matter what kind of trader you are. At the very least, it matters enough that you should know something about it.
So, Larry, you want to be a serious quant researcher. Well, let me ask myself a question then. Where are you going to start? Data? Why would you want to start there? I thought you were going deep in math, bragging about learning stochastic calculus and shit. Why, all of a sudden, does data matter?
Let me answer myself...
Data matters. I have mentioned this many times before, but it wasn't until recently, while avoiding work and contemplating all the different ways to try to build quantKit, that I realized: I can't test anything until I have data. Profound, right?
Well, then I started thinking about how I wanted to handle this data. Of course, I could do what everyone else does: just use CSV files and call it a day. But that shit gets messy quick. And it's slow. Sure, you can work some magic—compress the files, stream the data in chunks, make sure nothing sits in memory that doesn't need to be there—but it's still slow. Especially when you want Python to handle it.
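For the curious, here is roughly what that "magic" looks like; a minimal sketch, assuming a gzipped CSV of trade ticks with made-up column names (this is not quantKit code):

```python
import pandas as pd

# Minimal sketch of the "stream the CSV in chunks" workaround.
# Assumes a gzipped CSV of trade ticks with hypothetical columns:
# price, size. Only running totals are kept in memory.
def stream_vwap(path: str, chunk_rows: int = 100_000) -> float:
    """Compute VWAP over a large trade file without loading it all at once."""
    notional = 0.0
    volume = 0.0
    for chunk in pd.read_csv(path, compression="gzip", chunksize=chunk_rows):
        notional += (chunk["price"] * chunk["size"]).sum()
        volume += chunk["size"].sum()
    return notional / volume

# vwap = stream_vwap("ES_trades_2024-01-02.csv.gz")
```

It works, and it keeps memory flat, but you still pay the text-parsing cost on every single pass. That's the slowness I'm talking about.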
Also, it isn't exactly easy for the user, is it?
I mean, no one wants to have to constantly manually import CSV files of data. Imagine having to do this with hundreds of equities in an index. It gets old fast, and if I am going to take the time to abstract that away for the user, I might as well go ahead and manage the data more appropriately.
This got me thinking. I started reading about market data and asking AI to take up its role as my research assistant once again. After a few weeks of trying to consume as much information as I could, I came to a conclusion.
Data matters more than you think.
And because it matters more than you think, it should get a decent amount of consideration when you are attempting to build a robust research engine for traders to use.
So, I am going to go over some basics in this post. Just some food for thought. This will be a part of a mini-series that will explain market data, why it matters, and how I intend to handle it for quantKit.
Spoiler: It ain't with fucking CSV files.
Types of Market Data
There are really only two main types of market data as far as an exchange is concerned: trade data and market depth.
Trade data is "tape". It is the source of truth: every executed transaction on the exchange, including price, size, and timestamp.
Market depth is the real-time record of resting buy/sell orders at multiple price levels, updated as orders are added, canceled, or executed.
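If it helps to picture them, here's a rough sketch of what those two streams look like as plain Python structures. The field names are mine, not any exchange's actual message spec:

```python
from dataclasses import dataclass

# Rough shape of the two core streams. Field names are illustrative,
# not any exchange's actual message specification.

@dataclass
class Trade:             # one print on the tape
    ts_ns: int           # exchange timestamp in nanoseconds
    price: float
    size: int
    aggressor: str       # "buy" or "sell", if the feed provides it

@dataclass
class BookUpdate:        # one change to one price level of the book
    ts_ns: int
    side: str            # "bid" or "ask"
    price: float
    size: int            # total resting size at this level (0 = level removed)
```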
From there, data gets aggregated into other formats such as snapshots and OHLCV data. The latter, which we are all familiar with, builds most of our charts and indicators.
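The aggregation step itself is nothing exotic. A bar is just a reduction over the trades that fall inside a time bucket; here's a toy pandas sketch with made-up prices:

```python
import pandas as pd

# Toy example: roll raw trades up into 1-minute OHLCV bars.
# Prices and timestamps are made up; real feeds carry more fields.
trades = pd.DataFrame(
    {"price": [100.00, 100.50, 100.25, 101.00],
     "size":  [2, 1, 3, 5]},
    index=pd.to_datetime([
        "2024-01-02 09:30:01", "2024-01-02 09:30:40",
        "2024-01-02 09:31:05", "2024-01-02 09:31:50",
    ]),
)

bars = trades["price"].resample("1min").ohlc()        # open/high/low/close
bars["volume"] = trades["size"].resample("1min").sum()
print(bars)
```

Every OHLCV bar you have ever looked at is some version of that reduction, done by you, your platform, or your vendor.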
Technically, exchanges also publish reference data. This is essentially metadata that the exchange broadcasts about symbols. It's not true market data, at least not as far as a quant is concerned. You can't math this data.
How the data gets from the exchange to you
This is where it gets fun.
The data has to get from the exchange to your computer somehow, and many of us don't have direct access to exchange feeds, so it takes a path through several layers:
Vendors (Rithmic, CQG, DTN IQFeed, etc.) aggregate and redistribute exchange data.
Brokers also receive market data and redistribute it to you, usually throttled or normalized (IBKR's 250ms snapshots are a good example of this).
Platforms (Sierra Chart, NinjaTrader, TradingView) sit on top of these feeds and present data visually, while also adding their own calculations and aggregations.
Example:
Exchange → Vendor → Broker → Platform → You
This isn't always the chain of events. Some brokers and vendors simply pass through exchange data without modification. For example, Alpaca's paid IEX stream is a direct redistribution of IEX exchange data. Sierra Chart's Denali feed works the same way.
The key point: every time the data changes hands, or hops, there's potential for increased latency and distortion, especially if the intermediary is throttling, batching, or applying its own adjustments.
Why this matters for research
As quants, we aren't just consuming charts. We are building systems and testing ideas. That means we need to understand exactly what kind of data we're working with, and how it was handled before it reached us.
Historical data is what we use to populate charts and calculate our features/indicators for testing ideas. Not all historical data is the same. It could be raw data that we aggregate ourselves or pre-aggregated OHLCV bars from a vendor or a broker.
Streaming data drives real-time execution and decisions. This can be true tick-by-tick data or aggregated snapshots (like IBKR's 250ms updates) that skip over many small events.
These distinctions matter because the fidelity of your backtests and research depends on the fidelity of the underlying data. Aggregated bar data is fine for forming ideas or running simple tests. Tick-level data gives you more realistic insights into slippage and execution. Order book data adds the missing dimension of liquidity, showing you how your order would have interacted with the market at the time of the trade.
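To make the snapshot point concrete, here's a toy illustration (made-up prices, not real data) of how a throttled feed can hide the very price your backtest assumes it traded at:

```python
import pandas as pd

# One second of made-up trades at full tick resolution, then the same
# second sampled every 250ms the way a snapshot feed might deliver it.
ticks = pd.Series(
    [100.00, 99.40, 100.05, 100.60, 100.20, 100.35],
    index=pd.to_datetime([
        "2024-01-02 09:30:00.050",
        "2024-01-02 09:30:00.120",   # flash dip between snapshot points
        "2024-01-02 09:30:00.180",
        "2024-01-02 09:30:00.300",
        "2024-01-02 09:30:00.430",
        "2024-01-02 09:30:00.700",
    ]),
)

snapshots = ticks.resample("250ms").last().dropna()  # what a throttled feed shows

print("low from ticks:    ", ticks.min())      # 99.40 -- the dip is there
print("low from snapshots:", snapshots.min())  # 100.05 -- the dip never appears
```

If your stop was resting at 99.50, the tick data says it got run; the snapshot data swears that price never traded.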
If we want our research to reflect reality, we need to match the granularity of our data to the questions we are asking.
Quant's Dilemma
There is a common saying in trading that I've even repeated myself: "Test with the data you trade on." It sounds reasonable, but for quants and traders it's misleading.
When you place an order, it's executed against the raw exchange order book. Your broker's data feed, whether throttled, aggregated, or delayed, is not what determines your fill. Execution always happens on the exchange.
This means that your research data and execution broker don't have to match. In fact, separating them often gives you an edge. I remember when I was first learning about day trading and watched Ross Cameron trade. He used Lightspeed for trade execution and eSignal for his charting and market data. It was a simple but powerful example: the data you test and analyze with doesn't have to come from the broker you execute through. What matters is pairing the best data for decision-making with the fastest broker for fills.
Data feeds and execution feeds are not the same, and they do not need to come from the same source. The goal is simple: use the highest-quality data you can for research and decision-making, and choose a broker that gives you the fastest, most reliable execution.
Closing thoughts
At the exchange level, all market data boils down to two streams: trade and market depth. Everything else is derived from them. Once it passes through vendors and brokers, it's often batched, throttled, or normalized before it ever reaches you. That's why understanding how your data is handled is critical for researchers.
Data quality drives research.
Execution depends on your broker.
Data feeds and execution feeds are separate.
For quants, this isn't a liability—it's an advantage. By doing a little work, we can get the best of both worlds.
Coming up next
The next post will explore the ways that data gets delivered. I will look at the different protocols and see how the plumbing of these systems shapes the speed, granularity, and reliability of the feeds we depend on.
This post doesn't represent any type of advice, financial or otherwise. Its purpose is to be informative and educational. Backtest results are based on historical data, not real-time data. There is no guarantee that these hypothetical results will continue in the future. Day trading is extremely risky, and I do not suggest running any of these strategies live.