
How I Built an ML Stock Predictor with 68.7% Accuracy

Machine Learning · Python · Finance

The Problem

Predicting stock movements is one of the hardest problems in machine learning. Markets are noisy, non-stationary, and influenced by countless factors. Most academic papers report barely-better-than-random results.

I set out to build a practical system that could give me a real edge — not a theoretical exercise, but a production pipeline I run daily.

The Approach

Data Pipeline

I collect 10+ years of daily data for 502 S&P 500 tickers, including:

- OHLCV price data from yfinance

- Technical indicators (RSI, MACD, Bollinger Bands, etc.)

- Macro features (VIX, treasury yields, sector momentum)

- Sentiment signals
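As a minimal sketch of this pipeline, the snippet below separates indicator computation (testable offline) from the download step. The RSI/MACD formulas are standard definitions; the exact indicator set and parameters in my production pipeline differ.

```python
import pandas as pd

def add_indicators(df: pd.DataFrame) -> pd.DataFrame:
    """Add example technical indicators to an OHLCV frame (needs a 'Close' column)."""
    out = df.copy()

    # 14-day RSI with Wilder-style smoothing via an exponential moving average
    delta = out["Close"].diff()
    gain = delta.clip(lower=0).ewm(alpha=1 / 14, adjust=False).mean()
    loss = (-delta.clip(upper=0)).ewm(alpha=1 / 14, adjust=False).mean()
    out["rsi_14"] = 100 - 100 / (1 + gain / loss)

    # MACD line: 12-day EMA minus 26-day EMA of the close
    ema12 = out["Close"].ewm(span=12, adjust=False).mean()
    ema26 = out["Close"].ewm(span=26, adjust=False).mean()
    out["macd"] = ema12 - ema26

    return out

# Fetching the raw data would look like this (network required):
# import yfinance as yf
# prices = yf.download("AAPL", period="10y", interval="1d", auto_adjust=True)
# features = add_indicators(prices)
```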

Feature Engineering

I started with 95 features and pruned down to 33 based on importance scores: any feature whose importance fell below a threshold of 0.009 was dropped. This reduced overfitting significantly.
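The pruning step itself is simple. A sketch, assuming importance scores come from a fitted model (e.g. a `feature_importances_` attribute):

```python
def prune_features(names, importances, threshold=0.009):
    """Keep only features whose importance meets the threshold.

    `threshold=0.009` is the cutoff described in the post; the scores
    themselves would come from a fitted model's feature importances.
    """
    return [name for name, imp in zip(names, importances) if imp >= threshold]
```

Retrain on the surviving columns and re-check importances, since scores shift once correlated features are removed.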

3-Model Ensemble

Instead of relying on a single model, I use three gradient boosting models:

  • XGBoost (67.7% accuracy)
  • LightGBM (68.4% accuracy)
  • CatBoost (69.3% accuracy)

Final predictions use majority voting: if 2 out of 3 models agree, that's the signal. Predictions where all three models agree hit 70.0% accuracy.
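The voting logic reduces to a few lines. A sketch for binary 0/1 predictions, which also flags the unanimous subset since those calls are the most accurate:

```python
import numpy as np

def majority_vote(pred_a, pred_b, pred_c):
    """Majority vote across three models' binary (0/1) predictions.

    Returns the voted signal plus a mask of rows where all three agree.
    """
    preds = np.stack([pred_a, pred_b, pred_c])  # shape (3, n_samples)
    votes = preds.sum(axis=0)
    signal = (votes >= 2).astype(int)           # at least 2 of 3 predict "up"
    unanimous = (votes == 0) | (votes == 3)
    return signal, unanimous
```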

Walk-Forward Validation

The most important part. I use 6-fold walk-forward validation to ensure the accuracy numbers are honest out-of-sample results, not in-sample overfitting.

Result: 67.9% ± 3.8% walk-forward accuracy.
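A minimal sketch of the scheme using scikit-learn's `TimeSeriesSplit` as a stand-in for my exact fold boundaries: each fold trains only on the past and evaluates on the chronologically next block, so no future data leaks into training.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit
from sklearn.ensemble import GradientBoostingClassifier

def walk_forward_accuracy(X, y, n_splits=6):
    """6-fold walk-forward evaluation: train on the past, test on the next block.

    GradientBoostingClassifier is a stand-in here for the boosted ensemble.
    """
    scores = []
    for train_idx, test_idx in TimeSeriesSplit(n_splits=n_splits).split(X):
        model = GradientBoostingClassifier()
        model.fit(X[train_idx], y[train_idx])
        scores.append(model.score(X[test_idx], y[test_idx]))
    return np.mean(scores), np.std(scores)
```

Report the mean and standard deviation across folds, as in the result above, rather than a single cherry-picked fold.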

Key Lessons

  • Feature pruning matters more than feature creation — removing noisy features improved accuracy more than adding new ones.
  • Ensembles beat single models — the majority vote consistently outperforms any individual model.
  • Walk-forward validation is non-negotiable — traditional cross-validation gives inflated numbers for time series data.
  • Predict big moves, not direction — switching from "up/down" to ">2% moves" dramatically improved signal quality.
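The ">2% moves" target from the last lesson can be labeled in a couple of lines. A sketch; the `horizon` parameter is an assumption, since the post doesn't state the exact forward window:

```python
import pandas as pd

def label_big_moves(close: pd.Series, horizon: int = 1, threshold: float = 0.02) -> pd.Series:
    """Label 1 when the forward return over `horizon` days exceeds +2%, else 0.

    `horizon=1` is a hypothetical default; the real pipeline's window may differ.
    """
    fwd_ret = close.shift(-horizon) / close - 1
    return (fwd_ret > threshold).astype(int)  # trailing rows with no forward data label as 0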
What's Next

I'm working on integrating alternative data sources and experimenting with attention-based models for feature interaction. The goal is to push past 70% walk-forward accuracy.