Curious Now

A Controlled Comparison of Deep Learning Architectures for Multi-Horizon Financial Forecasting: Evidence from 918 Experiments

Computing · Math & Economics

Key takeaway

Researchers tested different deep learning models for predicting asset prices, including crypto, forex, and equity indices, over multiple time horizons. This could help investors and traders make better financial decisions.

Quick Explainer

This study undertook a comprehensive comparison of deep learning architectures for forecasting financial time series. The key idea was to systematically evaluate a range of popular model families, including Transformers, MLPs, CNNs, and RNNs, across a diverse set of assets and forecast horizons. The researchers employed a controlled experimental protocol, with fixed-seed hyperparameter optimization and multi-seed final training, to isolate the impact of architectural choices. The results revealed a clear hierarchy: two top-performing models, ConvTCN and PatchTST, consistently outperformed the others. ConvTCN's combination of large-kernel convolutions and multi-stage downsampling, along with PatchTST's patch-based self-attention, provided the most transferable temporal representations for this challenging forecasting task.

Deep Dive

Deep Learning for Financial Time-Series Forecasting: A Controlled, Multi-Horizon Comparison

Overview

This document summarizes the key findings from a comprehensive benchmark of deep learning architectures for multi-horizon financial time-series forecasting. The study:

  • Compares 9 architectures spanning 4 model families (Transformer, MLP, CNN, RNN) across 12 assets, 3 asset classes, and 2 forecasting horizons
  • Employs a controlled experimental protocol with fixed-seed hyperparameter optimization, multi-seed final training, and rank-based statistical validation
  • Establishes that architectural choice dominates performance, explaining 99.90% of variance versus only 0.01% for random seed
  • Identifies a clear three-tier hierarchy, with 2 top models (ConvTCN, PatchTST) consistently outperforming the rest
  • Shows that top architectures maintain their relative rankings across short and long horizons despite 2-2.5x error amplification
  • Finds that MSE-optimized models lack directional forecasting skill, producing predictions no better than a coin flip

Problem & Context

  • Multi-horizon price forecasting is central to finance, but financial time series exhibit complex properties that make accurate prediction challenging
  • Deep learning architectures for sequence modeling have proliferated rapidly, raising the question of which models are best suited for financial data
  • Prior benchmarks suffer from methodological limitations, including uncontrolled hyperparameter tuning, single-seed evaluations, narrow asset coverage, and lack of statistical validation

Methodology

  • Employs a 5-stage experimental pipeline: fixed-seed Bayesian hyperparameter optimization, configuration freezing, multi-seed final training, metric aggregation, and statistical validation
  • Evaluates 9 architectures (4 Transformer, 2 MLP/linear, 2 CNN, 1 RNN) across 12 assets in 3 classes (crypto, forex, equity indices)
  • Computes RMSE, MAE, and directional accuracy on held-out test sets, aggregating across 3 seeds
  • Conducts formal statistical tests, including Friedman, Holm-Wilcoxon pairwise comparisons, Spearman rank correlations, variance decomposition, and Jonckheere-Terpstra trend analysis
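The rank-based validation stage can be sketched as below with SciPy. The error matrix here is synthetic and purely illustrative, not results from the study; the Holm step-down over Wilcoxon p-values mirrors the paper's "Holm-Wilcoxon pairwise comparisons".

```python
import numpy as np
from scipy import stats

# Illustrative RMSE values: rows = 12 assets, columns = 9 architectures.
# These numbers are made up for demonstration, not taken from the paper.
rng = np.random.default_rng(0)
errors = rng.uniform(0.8, 1.2, size=(12, 9))
errors[:, 0] -= 0.3  # pretend architecture 0 is clearly best

# Friedman test: do architectures differ in rank across assets?
stat, p = stats.friedmanchisquare(*errors.T)
print(f"Friedman chi2={stat:.2f}, p={p:.4f}")

# Holm-corrected Wilcoxon signed-rank tests: best model vs. each rival.
pvals = [stats.wilcoxon(errors[:, 0], errors[:, j]).pvalue for j in range(1, 9)]
order = np.argsort(pvals)
rejected = []
for rank, idx in enumerate(order):
    alpha = 0.05 / (len(pvals) - rank)  # Holm step-down threshold
    if pvals[idx] > alpha:
        break
    rejected.append(idx + 1)
print("rivals significantly worse:", rejected)
```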

Data & Experimental Setup

  • Uses H1 (hourly) OHLCV data for 12 assets (4 per class), with 70/15/15 chronological train/val/test splits
  • Enforces category-level hyperparameter tuning and frozen configurations to prevent asset-level overfitting
  • Adopts a direct multi-step forecasting strategy, predicting all horizon steps simultaneously
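A minimal sketch of the chronological splitting and direct multi-step windowing described above, using a toy series and placeholder lookback/horizon values (the paper's exact settings are not assumed):

```python
import numpy as np

def chronological_splits(n, train=0.70, val=0.15):
    """Index ranges for a 70/15/15 chronological train/val/test split."""
    i_train = int(n * train)
    i_val = int(n * (train + val))
    return slice(0, i_train), slice(i_train, i_val), slice(i_val, n)

def make_windows(series, lookback, horizon):
    """Direct multi-step windows: each input of `lookback` steps is paired
    with all `horizon` future steps at once (no recursive feeding)."""
    X, Y = [], []
    for t in range(lookback, len(series) - horizon + 1):
        X.append(series[t - lookback:t])
        Y.append(series[t:t + horizon])
    return np.array(X), np.array(Y)

prices = np.linspace(100, 110, 1000)   # toy hourly close series
tr, va, te = chronological_splits(len(prices))
X, Y = make_windows(prices[tr], lookback=48, horizon=12)
print(X.shape, Y.shape)                # (641, 48) (641, 12)
```

Splitting chronologically before windowing keeps the validation and test periods strictly after the training period, avoiding look-ahead leakage.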

Results

  • Identifies a clear three-tier performance hierarchy, with ConvTCN and PatchTST significantly outperforming the rest
  • Shows that top models maintain their relative rankings across short and long horizons despite 2-2.5x error amplification
  • Finds that architecture choice explains 99.90% of forecast variance, dwarfing the 0.01% contribution from random seed
  • Demonstrates that MSE-optimized models lack directional forecasting skill, with accuracy no better than 50%
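The coin-flip finding can be illustrated by computing directional accuracy for an uninformative forecast on synthetic data (the numbers below are illustrative, not the paper's):

```python
import numpy as np

def directional_accuracy(last_price, y_true, y_pred):
    """Fraction of horizon steps where the forecast and the realized
    move (relative to the last observed price) share the same sign."""
    true_dir = np.sign(y_true - last_price[:, None])
    pred_dir = np.sign(y_pred - last_price[:, None])
    return float(np.mean(true_dir == pred_dir))

rng = np.random.default_rng(1)
last = np.full(500, 100.0)
y_true = last[:, None] + rng.normal(0, 1, size=(500, 12))
y_pred = last[:, None] + rng.normal(0, 1, size=(500, 12))  # forecast with no skill
acc = directional_accuracy(last, y_true, y_pred)
print(f"directional accuracy ≈ {acc:.3f}")  # close to 0.5, i.e. a coin flip
```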

Interpretation

  • The combination of large-kernel convolutions and multi-stage downsampling in ConvTCN provides the most transferable temporal representations
  • Patch-based self-attention in PatchTST offers an effective compromise between local and global modeling
  • Architectural inductive bias matters more than raw model capacity for this task
  • Practitioners should prioritize architecture selection over hyperparameter tuning or ensemble techniques
  • Directional trading requires explicit loss functions beyond standard MSE regression
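As a hedged illustration of what an "explicit loss function beyond standard MSE" might look like (this is a hypothetical construction, not the paper's proposal), one can augment MSE with a hinge penalty on sign disagreement between the predicted and realized moves:

```python
import numpy as np

def directional_mse(y_true, y_pred, last_price, lam=0.5):
    """MSE plus a hinge penalty on sign disagreement between predicted
    and realized moves. `lam` trades level accuracy against direction.
    Hypothetical illustration, not a loss from the paper."""
    mse = np.mean((y_true - y_pred) ** 2)
    true_move = y_true - last_price[:, None]
    pred_move = y_pred - last_price[:, None]
    sign_penalty = np.mean(np.maximum(0.0, -true_move * pred_move))
    return mse + lam * sign_penalty

last = np.array([100.0, 100.0, 100.0])
y_true = last[:, None] + np.array([[1.0], [0.5], [-1.0]])
wrong_way = 2 * last[:, None] - y_true   # same magnitudes, opposite direction
loss_same = directional_mse(y_true, y_true, last)   # 0.0: perfect forecast
loss_flip = directional_mse(y_true, wrong_way, last)
print(loss_same, loss_flip)
```

The hinge term is zero whenever the forecast points the right way, so a model can still be trained end-to-end with gradient methods while being pushed toward correct directions.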

Limitations & Uncertainties

  • Hyperparameter search budget is limited to 5 trials per model
  • Statistical power is constrained for pairwise comparisons within the middle performance tier
  • Feature set is restricted to raw OHLCV, excluding technical indicators or other exogenous data
  • Temporal scope is limited to hourly frequency; generalization to higher or lower frequencies is untested
  • Asset universe, while representative, does not include commodities, fixed income, or emerging markets

What Comes Next

  • Compute standardized effect sizes (Cohen's d) to complement the statistical significance tests
  • Increase hyperparameter search budget and seed count to strengthen ranking evidence for middle-tier models
  • Explore directional loss functions and post-processing to improve forecasting skill beyond MSE regression
  • Expand horizon coverage to map architecture-specific scaling behavior more precisely
  • Assess asset-specific model selection to identify further niche advantages beyond the observed altcoin specialization
  • Investigate heterogeneous ensemble methods combining the strengths of top-performing architectures
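The effect-size computation proposed in the first item can be sketched as a paired Cohen's d over per-asset errors; the RMSE values below are invented for illustration:

```python
import numpy as np

def cohens_d_paired(a, b):
    """Standardized effect size for paired samples: mean difference
    divided by the standard deviation of the differences."""
    diff = np.asarray(a) - np.asarray(b)
    return float(diff.mean() / diff.std(ddof=1))

# Illustrative per-asset RMSEs for two architectures (not from the paper).
rmse_a = np.array([0.91, 0.88, 0.95, 0.90, 0.87, 0.93])
rmse_b = np.array([1.02, 0.99, 1.04, 1.01, 0.97, 1.05])
d = cohens_d_paired(rmse_a, rmse_b)
print(f"d = {d:.2f}")  # negative: architecture A has lower error
```

Unlike a p-value, this quantifies how large the gap between two architectures is, which matters for the middle tier where significance alone is inconclusive.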
