
Causal Transformer for High-Frequency Return Prediction

This was my Lingjun Quant Challenge project. I used it to practice sequence modeling under a harsh information boundary: high-frequency return prediction is not only about architecture, but also about refusing to leak the future into validation.

Challenge Context

The task was to model ten-minute-ahead returns from A-share high-frequency microstructure data:

  • 500 stocks
  • 239 intraday minutes per day
  • 384 features
  • Parquet-scale data storage

The central difficulty was not only modeling a large sequential table; I also had to design a validation protocol that preserved enough intraday context while preventing information leakage.

Method

I treated each stock-day as a minute-level sequence. Each minute became a token, and a strict causal mask prevented the model from reading future minutes while doing minute-wise regression. The architecture followed the information boundary of the task instead of ignoring it.
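To make the masking idea concrete, here is a minimal pure-Python sketch of a causal mask and the attention weights it produces. It is an illustration under assumptions, not the project's code: the function names `causal_mask` and `causal_attention_weights` are hypothetical, and a real model would compute this with batched tensor operations over the 239 minutes of each stock-day.

```python
import math

def causal_mask(n):
    """n x n boolean matrix; entry [t][s] is True when minute t may attend to minute s."""
    return [[s <= t for s in range(n)] for t in range(n)]

def causal_attention_weights(scores):
    """Row-wise softmax over raw attention scores, with future minutes masked out.

    `scores` is an n x n list of lists: scores[t][s] is the raw score of
    query minute t against key minute s (hypothetical input for illustration).
    """
    n = len(scores)
    mask = causal_mask(n)
    weights = []
    for t in range(n):
        # Keep only the current and past minutes for query minute t.
        allowed = [scores[t][s] for s in range(n) if mask[t][s]]
        m = max(allowed)  # subtract the max for numerical stability
        exps = [math.exp(x - m) for x in allowed]
        z = sum(exps)
        # Future positions receive exactly zero weight.
        weights.append([e / z for e in exps] + [0.0] * (n - t - 1))
    return weights

# With uniform scores, minute t spreads its weight evenly over minutes 0..t.
weights = causal_attention_weights([[0.0] * 4 for _ in range(4)])
```

Because every weight above the diagonal is zero, the prediction at minute t cannot depend on any later minute, which is the property minute-wise regression needs here.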

The main components were:

  • causal self-attention over intraday minute sequences
  • intraday time embeddings
  • stock identity embeddings
  • train-date-fitted normalization reused unchanged for validation and inference

Validation and Leakage Control

I split validation chronologically by dateid. Standardization statistics were fit once on training dates and then fixed. That was the main discipline of the project: time-respecting splits, fixed normalization, and suspicion toward any improvement that might come from leakage or time artifacts.
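The split-then-fit order described above can be sketched as follows. This is a hedged illustration, not the project's pipeline: the row layout (dicts with a `dateid` key), the feature name `f0`, the cutoff value, and the helper names are all assumptions, and a Parquet-scale dataset would use a dataframe library rather than Python lists.

```python
import statistics

def chronological_split(rows, cutoff_dateid):
    """Rows with dateid < cutoff go to train; all later dates go to validation."""
    train = [r for r in rows if r["dateid"] < cutoff_dateid]
    valid = [r for r in rows if r["dateid"] >= cutoff_dateid]
    return train, valid

def fit_standardizer(train_rows, feature):
    """Fit mean and (population) standard deviation on training dates only."""
    values = [r[feature] for r in train_rows]
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    # Guard against constant features to avoid division by zero.
    return mean, (std if std > 0 else 1.0)

def apply_standardizer(rows, feature, mean, std):
    """Apply already-fitted statistics; nothing is re-estimated here."""
    return [(r[feature] - mean) / std for r in rows]

# Toy data: one hypothetical feature "f0" over five dates.
rows = [{"dateid": d, "f0": float(d)} for d in range(5)]
train, valid = chronological_split(rows, cutoff_dateid=3)
mean, std = fit_standardizer(train, "f0")
valid_z = apply_standardizer(valid, "f0", mean, std)  # reuses train statistics
```

The point of the sketch is the order of operations: the statistics are computed before the validation rows are ever touched, and the same `mean` and `std` are then reused unchanged for validation and inference.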

Diagnostics

I used multi-checkpoint diagnostics to understand model behavior instead of only watching a scalar loss:

  • feature dependency views
  • parameter distribution tracking
  • prediction and residual distributions
  • per-timeid behavior profiles
  • positional embedding norm tracking

These diagnostics helped identify late-training degradation and rising dependence on time features.
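One of these views, the per-timeid profile, can be sketched in a few lines. This is an illustrative assumption about how such a profile might be computed, not the diagnostics code from the project; the function name and the choice of summary statistics are hypothetical.

```python
from collections import defaultdict

def residual_profile_by_timeid(timeids, y_true, y_pred):
    """Mean residual and mean absolute residual for each intraday timeid."""
    sums = defaultdict(lambda: [0.0, 0.0, 0])  # [sum residual, sum |residual|, count]
    for t, y, p in zip(timeids, y_true, y_pred):
        r = y - p
        acc = sums[t]
        acc[0] += r
        acc[1] += abs(r)
        acc[2] += 1
    return {
        t: {
            "mean_residual": a[0] / a[2],
            "mean_abs_residual": a[1] / a[2],
            "count": a[2],
        }
        for t, a in sorted(sums.items())
    }

# Toy example with two timeids.
profile = residual_profile_by_timeid(
    timeids=[0, 0, 1, 1],
    y_true=[1.0, 3.0, 0.0, 0.0],
    y_pred=[0.0, 0.0, 1.0, -1.0],
)
```

Comparing such profiles across checkpoints is one way to see whether errors are drifting toward particular minutes of the day, which would be consistent with a growing reliance on time features.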

What I Learned

I do not want to oversell this project: the metric was weak. The lesson I keep is more basic and more useful: in high-frequency sequence modeling, validation protocol and leakage control matter as much as the architecture. Diagnostics are necessary because an apparent gain can be a real signal, a time artifact, or simply the model learning the wrong shortcut.
