How I Built an ML Stock Predictor with 68.7% Accuracy
The Problem
Predicting stock movements is one of the hardest problems in machine learning. Markets are noisy, non-stationary, and influenced by countless factors. Most academic papers report barely-better-than-random results.
I set out to build a practical system that could give me a real edge: not a theoretical exercise, but a production pipeline I run daily.
The Approach
Data Pipeline
I collect 10+ years of daily data for 502 S&P 500 tickers, including:
- OHLCV price data from yfinance
- Technical indicators (RSI, MACD, Bollinger Bands, etc.)
- Macro features (VIX, treasury yields, sector momentum)
- Sentiment signals
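To make the technical-indicator step concrete, here is a minimal sketch of one indicator from the list, RSI, computed from a list of closing prices. The post does not say which RSI variant or lookback is used, so the simple-average form and the 14-day default below are assumptions, and `rsi` is a hypothetical helper name rather than the pipeline's actual code.

```python
def rsi(closes, period=14):
    """Relative Strength Index over the last `period` price changes.

    Uses plain averages of gains and losses (the simple-average variant),
    not Wilder's exponential smoothing. Both choices are assumptions.
    """
    if len(closes) < period + 1:
        raise ValueError("need at least period + 1 closing prices")

    # Day-over-day changes for the most recent `period` days.
    changes = [
        today - yesterday
        for yesterday, today in zip(closes[-period - 1:-1], closes[-period:])
    ]
    gains = sum(c for c in changes if c > 0)
    losses = sum(-c for c in changes if c < 0)

    if gains == 0 and losses == 0:
        return 50.0   # flat prices: treat as neutral
    if losses == 0:
        return 100.0  # only gains in the window

    relative_strength = gains / losses
    return 100.0 - 100.0 / (1.0 + relative_strength)
```

Values near 100 mean the window was dominated by up days and values near 0 by down days. In a real pipeline this would be computed per ticker on a rolling basis.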
Feature Engineering
I started with 95 features and pruned them down to 33 based on importance scores, dropping every feature with an importance below 0.009. This reduced overfitting significantly.
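The pruning rule can be sketched in a few lines. This assumes the importance scores are already available as a mapping from feature name to score (for example, from a fitted gradient boosting model). The function name `prune_features` and the feature names in the usage example are hypothetical.

```python
def prune_features(importances, threshold=0.009):
    """Return the names of features whose importance is at least `threshold`,
    ordered from most to least important."""
    kept = {name: score for name, score in importances.items() if score >= threshold}
    return sorted(kept, key=kept.get, reverse=True)


# Illustrative scores only; these are not the post's real features or values.
example_importances = {"rsi_14": 0.05, "vix_close": 0.02, "day_of_week": 0.004}
selected = prune_features(example_importances)
```

Here `selected` keeps `rsi_14` and `vix_close` and drops `day_of_week`, because its score falls below the 0.009 cutoff.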
3-Model Ensemble
Instead of relying on a single model, I use three gradient boosting models.
Final predictions use majority voting: the signal is whichever direction at least 2 of the 3 models predict. Predictions where all three models agree hit 70.0% accuracy.
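The voting rule might look like the sketch below. It assumes each model emits a single class label per ticker per day. The `"up"`/`"down"` labels and the `ensemble_signal` name are illustrative, not taken from the actual pipeline.

```python
from collections import Counter


def ensemble_signal(predictions):
    """Combine three model predictions by majority vote.

    Returns (label, unanimous). `label` is None when no label gets at least
    two votes, which can only happen with more than two classes.
    """
    if len(predictions) != 3:
        raise ValueError("expected exactly three predictions")

    label, votes = Counter(predictions).most_common(1)[0]
    if votes < 2:
        return None, False
    return label, votes == 3
```

The `unanimous` flag makes it easy to track the all-agree subset separately, since that subset is the one with the higher reported accuracy.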
Walk-Forward Validation
This is the most important part. I use 6-fold walk-forward validation, in which each fold trains only on data that precedes its test period, so the accuracy numbers are honest out-of-sample results rather than in-sample overfitting.
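One way to generate walk-forward folds is sketched below. It assumes rows are sorted by date and that the training window expands with each fold. The post does not say whether the window expands or slides, so treat that choice, along with the `walk_forward_splits` name, as an assumption.

```python
def walk_forward_splits(n_samples, n_folds=6):
    """Yield (train_indices, test_indices) pairs for time-ordered data.

    Every test block comes strictly after its training block, and the
    training window grows by one block per fold.
    """
    if n_folds < 1 or n_samples < n_folds + 1:
        raise ValueError("not enough samples for the requested number of folds")

    # Split the timeline into n_folds + 1 blocks: the first is train-only.
    block = n_samples // (n_folds + 1)
    for k in range(1, n_folds + 1):
        train_end = k * block
        # The last fold absorbs any leftover rows at the end of the timeline.
        test_end = n_samples if k == n_folds else train_end + block
        yield range(0, train_end), range(train_end, test_end)
```

Because no test row ever precedes a training row, each fold's score reflects predictions on data the model has not seen. scikit-learn's `TimeSeriesSplit` implements a similar scheme.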
Result: 67.9% ± 3.8% walk-forward accuracy.
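A figure of this form is typically the mean of the per-fold accuracies plus or minus their spread. The sketch below assumes the ± term is the sample standard deviation across folds, which the post does not state. The `summarize_folds` name is hypothetical.

```python
from statistics import mean, stdev


def summarize_folds(accuracies):
    """Return (mean, sample standard deviation) of per-fold accuracies."""
    if len(accuracies) < 2:
        raise ValueError("need at least two folds to estimate the spread")
    return mean(accuracies), stdev(accuracies)


def format_summary(accuracies):
    """Format fold accuracies as 'mean ± std' in percent."""
    avg, spread = summarize_folds(accuracies)
    return f"{avg:.1%} ± {spread:.1%}"
```

Calling `format_summary` on the list of six fold accuracies would produce a string in the same format as the result line.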
Key Lessons
What's Next
I'm working on integrating alternative data sources and experimenting with attention-based models for feature interaction. The goal is to push past 70% walk-forward accuracy.