A quick guide to using Python for market data
Part 1 of 2. Just showing you simple ways to use Python to get data for testing.
In today’s post I am going to cover a few different ways to use Python to get data and format it for testing purposes. There are plenty of ways to get data without having to use Python. Data services like Norgate and Databento will let you download CSV data for your testing purposes without writing any code, and most trading platforms, like Sierra Chart, have export-to-CSV features as well. The trick is learning how each source structures its data and what data you actually need for testing. This is where scripting languages like Python come in handy.
All of the code for these tutorials can be found on my GitHub.
Install Python and setup your environment
Before you do anything, you have to make sure that you have Python installed. The examples I am going to give are for Windows, but if you are on a different operating system, a quick search online should help you figure out what to do for your particular system.
# Install (or upgrade) via winget
winget install --id Python.Python.3.13 -e
# or upgrade if already installed
winget upgrade --id Python.Python.3.13 -e
# Verify
python --version
After this, you want to create a directory for your Python projects or scripts. Then, you want to create a virtual environment for that project. This lets us compartmentalize the dependencies each project needs and avoid cluttering up the main Python installation on our machine. Each tutorial in this guide will have a different environment with different dependencies.
mkdir python_tutorials\tutorial1
cd python_tutorials\tutorial1
python -m venv .venv
.\.venv\Scripts\Activate
After we get the environment activated, we need to create a requirements.txt file in our directory. This is where we will list each of the dependencies that we need for our project. In PowerShell, that looks like:
New-Item requirements.txt
Then, using your favorite text editor (I use Neovim, btw), open that file and add the following two lines:
pandas
yfinance
Then, we run the command to install them:
python -m pip install -r requirements.txt
Once it installs, we are ready to start playing with data.
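If you want to confirm that the packages landed in the virtual environment (and not your global install), you can list what pip sees:
python -m pip list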
Using yfinance to get free data
The first thing we are going to cover is the most popular and easiest way to get free (and mostly unreliable, but good for learning) data. Before we get started, make sure you create a data directory inside this project; that is where we will save our CSVs when we are done. Then, create a new file named tutorial1.py and open it with your text editor.
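If you would rather do both from PowerShell, the same commands we used earlier work here:
mkdir data
New-Item tutorial1.py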
In tutorial1.py, we are going to add the following code:
import yfinance as yf
import pandas as pd
from pathlib import Path
# Parameters
ticker = "SPY"
period = "15mo" # last 15 months
interval = "1d"
# Fetch data with auto-adjust so prices reflect dividends/splits
data = yf.download(ticker, period=period, interval=interval, auto_adjust=True)
# Show raw data straight from Yahoo
print("\nRaw data (first 5 rows):")
print(data.head())
# Handle multi-index columns if present
if isinstance(data.columns, pd.MultiIndex):
    data.columns = data.columns.get_level_values(0)  # keep only column names, drop ticker level
# Clean data with pandas
out = (
    data.rename(
        columns={
            "Open": "open",
            "High": "high",
            "Low": "low",
            "Close": "close",
            "Volume": "volume",
        }
    )
    .reset_index()
    .rename(columns={"Date": "date"})
    .loc[:, ["date", "open", "high", "low", "close", "volume"]]
)
# Save to file
Path("data").mkdir(exist_ok=True)
all_path = Path("data") / f"{ticker}_15mo.csv"
ins_path = Path("data") / f"{ticker}_in_sample.csv"
oos_path = Path("data") / f"{ticker}_out_sample.csv"
out.to_csv(all_path, index=False)
# Split ~12mo IS / ~3mo OOS using pandas slicing
insample = out.iloc[:-63]
outsample = out.iloc[-63:]
insample.to_csv(ins_path, index=False)
outsample.to_csv(oos_path, index=False)
print(f"\nSaved {len(out):,} total rows")
print(f"In-sample rows: {len(insample)} → {ins_path}")
print(f"Out-of-sample rows: {len(outsample)} → {oos_path}")
# Verify by reading back with pandas
print("\nPreview of saved CSVs:")
print(f"\nAll data:\n{pd.read_csv(all_path).head()}")
print(f"\nIn-sample:\n{pd.read_csv(ins_path).head()}")
print(f"\nOut-of-sample:\n{pd.read_csv(oos_path).head()}")
Then, run this code with:
python tutorial1.py
And behold the data.
A good bit of the code in the block above is print statements that just confirm we got data and that it was formatted correctly. Aside from that, what we are doing is:
1. Defining the data we want to get (SPY), the timeframe (daily), and how much of it (15 months).
2. Downloading that data and formatting it. This isn't always necessary, but the example shows you how to rename columns and drop the ones we don't need.
3. Saving it to CSV in our data directory.
Notice that we separated our data out into three different files. One of them holds all the data; the other two are our in-sample (IS) and out-of-sample (OOS) data. It is always a good practice to withhold some data from your testing. This helps us ensure that we are not overfitting our strategy to all of the available data. When we get to testing, we will create, test, and tweak our strategy with the IS data, then run it against the OOS data to see how it performs.
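The split above counts rows (63 trading days is roughly three months). You can get the same effect with a calendar-based cutoff, which is what we will do in the Alpaca example below. A minimal sketch that reads back the CSV we just saved and reproduces the split by date:
import pandas as pd

# Read the full file back and split on a 12-month calendar cutoff
out = pd.read_csv("data/SPY_15mo.csv", parse_dates=["date"])
cutoff = out["date"].min() + pd.DateOffset(months=12)
out[out["date"] < cutoff].to_csv("data/SPY_in_sample.csv", index=False)
out[out["date"] >= cutoff].to_csv("data/SPY_out_sample.csv", index=False)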
Getting data from Alpaca (or any similar web API)
Moving along to our next example. Instead of using yfinance, this time we are going to use Alpaca. I am using Alpaca here because it is a developer-first broker/data provider with solid APIs and documentation. It requires having an account with them (which you can create for free) and an API key to make calls to their API.
Make sure you deactivate your previous venv session if you are going through this in order.
deactivate
Create a new directory (tutorial2), add your data folder, create a new venv for this project, and then add the requirements file:
pandas
alpaca-py
python-dotenv
Then, just like before, we need to activate our venv and install the dependencies.
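In PowerShell, that is the same pair of commands we ran in tutorial 1:
.\.venv\Scripts\Activate
python -m pip install -r requirements.txt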
We will also need to create a .env file to add our API keys to:
ALPACA_API_KEY=your_api_key_here
ALPACA_API_SECRET=your_api_secret_here
If you are planning on using git for version control, be sure to add the .env file to your .gitignore so you don’t advertise your keys to the internet.
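A minimal .gitignore for these tutorials might look like this (drop the data/ line if you would rather commit your CSVs):
.venv/
.env
data/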
In your Alpaca account, go to the profile settings and manage accounts. There you can generate the API keys you need for the .env file above. Make sure you use the paper account for this tutorial. We won’t be making any calls to the account itself, but there is no need to take chances.
Once that is done, open up the new tutorial2.py file and add:
from datetime import datetime, timedelta
import os
from pathlib import Path
import pandas as pd
from dotenv import load_dotenv
from alpaca.data.historical import StockHistoricalDataClient
from alpaca.data.requests import StockBarsRequest
from alpaca.data.timeframe import TimeFrame
# Load keys from .env
load_dotenv()
API_KEY = os.getenv("ALPACA_API_KEY")
API_SECRET = os.getenv("ALPACA_API_SECRET")
# Initialize the market data client (no need for base_url here)
data_client = StockHistoricalDataClient(API_KEY, API_SECRET)
# Date range
end = pd.Timestamp.today(tz="UTC").normalize()
start = end - pd.DateOffset(months=15)
# Define the request (last 15 months of SPY daily bars)
request_params = StockBarsRequest(
    symbol_or_symbols=["SPY"],
    timeframe=TimeFrame.Day,
    start=start.to_pydatetime(),
    end=end.to_pydatetime(),
    adjustment="all",
)
# Fetch the bars
bars = data_client.get_stock_bars(request_params)
# JSON --> DataFrame
df = bars.df.reset_index().rename(columns={"timestamp": "date"})
df["date"] = pd.to_datetime(df["date"]).dt.tz_localize(None)
df = df.loc[:, ["date", "open", "high", "low", "close", "volume"]]
# Print the raw response so you can inspect its format
bar_list = bars.data["SPY"]
print("\nFirst Bar:", bar_list[0])
print("Last Bar:", bar_list[-1])
print(df.head())
# Save into data/
Path("data").mkdir(exist_ok=True)
all_path = Path("data") / f"SPY_15mo.csv"
is_path = Path("data") / f"SPY_in_sample.csv"
os_path = Path("data") / f"SPY_out_sample.csv"
df.to_csv(all_path, index=False)
# Split Data
cutoff = df["date"].min() + pd.DateOffset(months=12)
df[df["date"] < cutoff].to_csv(is_path, index=False)
df[df["date"] >= cutoff].to_csv(os_path, index=False)
# Print to confirm csv data
print("\nPreview of saved CSVs:")
print("\nAll data:\n", pd.read_csv(all_path).head())
print("\nIn-sample:\n", pd.read_csv(is_path).head())
print("\nOut-of-sample:\n", pd.read_csv(os_path).head())
Similar to the method we used before, this retrieves data from an API, formats the fields we want, and saves them into a CSV file for us. The biggest change is that the API returns the data as JSON rather than a DataFrame; the alpaca-py client parses it into Bar objects, and we use pandas to convert the result to a DataFrame and then save it to CSV. This is pretty standard for web APIs, but there is no guarantee that any two services will format their data the same way. It is always good to check the data format in the documentation, or just print it out like we did here, to see what it looks like before saving or changing anything.
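One lightweight habit that helps here: before writing anything to disk, assert that the columns you expect are actually present. A minimal sketch, using the OHLCV schema from both tutorials (the check_schema helper is just something made up for illustration):
import pandas as pd

EXPECTED = ["date", "open", "high", "low", "close", "volume"]

def check_schema(df: pd.DataFrame) -> None:
    # Raise early if the provider changed or dropped a column we rely on
    missing = [c for c in EXPECTED if c not in df.columns]
    if missing:
        raise ValueError(f"data is missing expected columns: {missing}")

check_schema(df)  # call this right before df.to_csv(...)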
Closing Notes
The purpose of this post was just to show you some easy ways to use Python to get data. Both methods use freely available data, with the Alpaca data being more reliable than the yfinance data. Free data is still free, and you get what you pay for.
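If you ran both tutorials, you can see the difference for yourself by diffing the two SPY files. A rough sketch, assuming both CSVs exist and you adjust the paths to wherever your tutorial folders actually live:
import pandas as pd

# Load the yfinance and Alpaca files saved by the two tutorials (hypothetical paths)
yf_df = pd.read_csv("tutorial1/data/SPY_15mo.csv", parse_dates=["date"])
ap_df = pd.read_csv("tutorial2/data/SPY_15mo.csv", parse_dates=["date"])

# Alpaca timestamps carry a time-of-day component, so compare by calendar day
for frame in (yf_df, ap_df):
    frame["date"] = frame["date"].dt.normalize()

# Join on date and summarize how far apart the closes are
merged = yf_df.merge(ap_df, on="date", suffixes=("_yf", "_alpaca"))
print((merged["close_yf"] - merged["close_alpaca"]).abs().describe())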
Next time
Part two will cover how to get data from your trading platform. For my example, I will be using Sierra Chart. Most platforms offer at least one way to export data, but Sierra Chart actually has two methods for getting it. One is a simple export-to-CSV feature (the easiest if your goal is just to get data); the other is a server option that lets you retrieve data via their own protocol (DTC), which makes it possible to use your programming language of choice to retrieve data, stream data, and even make trades.
Before part two, I will also cover the different protocols and data formats that researchers are likely to run into on their quant journeys.
Until next time, happy hunting!
This post doesn’t represent any type of advice, financial or otherwise. Its purpose is to be informative and educational. Backtest results are based on historical data, not real-time data. There is no guarantee that these hypothetical results will continue in the future. Day trading is extremely risky, and I do not suggest running any of these strategies live.