Recurrent Neural Networks (RNNs) laid the foundation for sequence modeling, but their intrinsic sequential nature restricts parallel computation, creating a fundamental barrier to scaling. This has led to the dominance of parallelizable architectures such as Transformers and, more recently, State Space Models (SSMs). While SSMs achieve efficient parallelization through structured linear recurrences, this linearity constraint limits their expressive power and precludes modeling complex, nonlinear sequence-wise dependencies. To address this, we present ParaRNN, a framework that breaks the sequence-parallelization barrier for nonlinear RNNs. Building on prior work, we cast the sequence of nonlinear recurrence relations as a single system of equations, which we solve in parallel using Newton's iterations combined with custom parallel reductions. Our implementation achieves speedups of up to 665x over naive sequential application, allowing training of nonlinear RNNs at unprecedented scales. To showcase this, we apply ParaRNN to adaptations of LSTM and GRU architectures, successfully training models of 7B parameters that reach perplexity comparable to similarly-sized Transformer and Mamba2 architectures. To accelerate research in efficient sequence modeling, we release the ParaRNN codebase as an open-source framework for automatic training-parallelization of nonlinear RNNs, enabling researchers and practitioners to explore new nonlinear RNN models at scale.
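The following is a minimal sketch (not the authors' implementation) of the core idea described above: stacking the nonlinear recurrence h_t = f(h_{t-1}, x_t) over the whole sequence into one root-finding problem F(H) = 0 and solving it with Newton iterations. The toy cell `f`, its Jacobian `J_h`, and all names are illustrative assumptions; the inner linear solve is written sequentially here, whereas the paper replaces it with custom parallel reductions.

```python
import numpy as np

def f(h_prev, x):
    # Toy elementwise nonlinear recurrence (stand-in for an LSTM/GRU-style cell).
    return np.tanh(0.5 * h_prev + x)

def J_h(h_prev, x):
    # Jacobian of f with respect to h_prev (diagonal, since f is elementwise).
    return 0.5 * np.diag(1.0 - np.tanh(0.5 * h_prev + x) ** 2)

def newton_parallel_rnn(x_seq, h0, num_iters=10, tol=1e-10):
    """Solve for all hidden states H = (h_1, ..., h_T) at once via Newton's method."""
    T, d = x_seq.shape
    H = np.zeros((T, d))  # initial guess for every hidden state
    for _ in range(num_iters):
        # Residuals F_t = h_t - f(h_{t-1}, x_t), with h_0 fixed.
        prev = np.vstack([h0[None, :], H[:-1]])
        F = H - np.array([f(prev[t], x_seq[t]) for t in range(T)])
        if np.max(np.abs(F)) < tol:
            break
        # Newton step: the Jacobian of F is block lower-bidiagonal (I on the
        # diagonal, -A_t = -J_h(h_{t-1}, x_t) on the sub-diagonal), so solving
        # J dH = -F reduces to the *linear* recurrence
        #   dH_t = A_t dH_{t-1} - F_t,
        # which is the part a parallel reduction would evaluate in O(log T) depth.
        A = [J_h(prev[t], x_seq[t]) for t in range(T)]
        dH = np.zeros_like(H)
        dH[0] = -F[0]
        for t in range(1, T):
            dH[t] = A[t] @ dH[t - 1] - F[t]
        H = H + dH
    return H

# Usage: should match a plain sequential rollout of f once Newton has converged.
rng = np.random.default_rng(0)
x = rng.normal(size=(16, 4))
H = newton_parallel_rnn(x, h0=np.zeros(4))
```

In this framing, each Newton iteration only ever requires solving a linear recurrence, which is exactly the structure that admits parallel-scan-style reductions; the sketch solves it with a sequential loop purely for clarity.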
