Processing 1M Chess Games in 15 Seconds with Rust

Source: DEV Community
I train self-supervised models on chess game data. My Python pipeline using python-chess took 25 minutes to parse and tokenize 1M games from Lichess PGN dumps. I rewrote it in Rust. It now takes 15 seconds. This post covers the architecture, why Rust was the right choice, and what I learned.

The problem

Training a chess move predictor requires converting PGN (Portable Game Notation) files into tokenized sequences: arrays of integer IDs that a neural network can consume. A typical Lichess monthly dump has 5M+ games in a zstd-compressed PGN file.

My Python pipeline had three bottlenecks:

- PGN parsing: python-chess parses SAN notation, validates every move on a board, and handles edge cases. Correct, but slow; ~15 minutes for 1M games.
- Tokenization: converting validated UCI moves to token IDs while tracking piece types and turns. ~10 minutes.
- Memory: all games loaded into a Python list of dicts. 1M games = ~4GB RAM.

The Rust rewrite

The tool is called ailed-soulsteal (named after a Castlevania ability).
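To make the tokenization step concrete, here is a minimal Rust sketch of mapping UCI move strings to integer token IDs. The `MoveTokenizer` type and its assign-an-ID-on-first-sight vocabulary are illustrative assumptions, not the tool's actual design:

```rust
use std::collections::HashMap;

/// Hypothetical tokenizer: maps UCI move strings (e.g. "e2e4") to
/// integer token IDs, building the vocabulary on the fly.
struct MoveTokenizer {
    vocab: HashMap<String, u32>,
}

impl MoveTokenizer {
    fn new() -> Self {
        MoveTokenizer { vocab: HashMap::new() }
    }

    /// Return the token ID for a UCI move, assigning a fresh ID if unseen.
    fn token_id(&mut self, uci: &str) -> u32 {
        let next = self.vocab.len() as u32;
        *self.vocab.entry(uci.to_string()).or_insert(next)
    }

    /// Tokenize a whole game given as a slice of UCI moves.
    fn encode(&mut self, moves: &[&str]) -> Vec<u32> {
        moves.iter().map(|m| self.token_id(m)).collect()
    }
}

fn main() {
    let mut tok = MoveTokenizer::new();
    // First three plies of a game in UCI notation.
    let ids = tok.encode(&["e2e4", "e7e5", "g1f3"]);
    assert_eq!(ids, vec![0, 1, 2]);
    // A repeated move reuses its existing ID.
    assert_eq!(tok.token_id("e2e4"), 0);
    println!("{:?}", ids);
}
```

In practice a real pipeline would likely use a fixed vocabulary (there are only a few thousand legal UCI moves) so that IDs are stable across runs, but the hash-map version above keeps the sketch short.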