ParaRNN: Giant-Scale Nonlinear RNNs, Trainable in Parallel

April 25, 2026

86

Recurrent Neural Networks (RNNs) are naturally suited to environment friendly inference, requiring far much less reminiscence and compute than attention-based architectures, however the sequential nature of their computation has traditionally made it impractical to scale up RNNs to billions of parameters. A brand new development from Apple researchers makes RNN coaching dramatically extra environment friendly — enabling large-scale coaching for the primary time and widening the set of structure decisions out there to practitioners in designing LLMs, significantly for resource-constrained deployment.

In ParaRNN: Unlocking Parallel Coaching of Nonlinear RNNs for Giant Language Fashions, a brand new paper accepted to ICLR 2026 as an Oral, Apple researchers share a brand new framework for parallelized RNN coaching that achieves a 665× speedup over the standard sequential strategy (see Determine 1). This effectivity achieve permits the coaching of the primary 7-billion-parameter classical RNNs that may obtain language modeling efficiency aggressive with transformers (see Determine 2).

To speed up analysis in environment friendly sequence modeling and allow researchers and practitioners to discover new nonlinear RNN fashions at scale, the ParaRNN codebase has been launched as an open-source framework for computerized training-parallelization of nonlinear RNNs.

Speedup from Parallel RNN Coaching

Determine 1: Runtime comparability for parallel and sequential software of the tailored ParaGRU and ParaLSTM cells as a perform of enter sequence size. ParaRNN unlocks training-time parallelizability, permitting dramatic speedups over vanilla sequential software.

Efficiency of Giant-Scale Traditional RNNs

Determine 2: Perplexity (decrease is best) for varied mannequin sizes for Mamba2, ParaLSTM, ParaGRU, and a transformer. With large-scale coaching enabled by parallelization, the tailored GRU and LSTM fashions present perplexity aggressive with a transformer and Mamba2.

The computational value of the eye mechanism in a transformer grows quadratically with sequence size, whereas the computation required for a single ahead cross via an RNN is identical no matter how a lot context got here earlier than. This permits constant-time token technology throughout inference, making them significantly enticing for environment friendly deployment.

However there’s a catch: this effectivity benefit solely applies at inference time. Not like transformers, RNN coaching can’t be parallelized alongside the sequence size.

The very property that makes RNN environment friendly at inference — their sequential, recurrent construction — turns into a basic bottleneck throughout coaching. Not like the eye mechanism, which may course of all tokens in a sequence concurrently, an RNN software should be unrolled step-by-step, as illustrated in Determine 3.

Elementary trade-off between RNNs and Consideration

Determine 3: The elemental trade-off between RNNs and a focus. RNNs preserve a compact hidden state

h_l

Fashionable recurrent architectures have leveraged a intelligent workaround to allow sequence parallelization: simplifying the recurrence relationship to be purely linear within the hidden state. Selective state area fashions (SSMs) like Mamba use a recurrence within the kind:

$textual content{SSMs: } h_{l} = A_l h_{l-1} + B_l x_l,$

whereas classical RNNs embrace nonlinearities:

$textual content{RNNs: } h_l = sigma_l(h_{l-1}, x_l).$

Linearity permits parallelization as a result of linear operations are associative, that means the order by which you mix them doesn’t have an effect on the ultimate end result, identical to $(2+3)+4=2+(3+4)$

A Compromise for Parallelizability: Linearity

Determine 4: How one can flip a sequential associative operation (like cumulative sum) right into a parallel one: the computation followers out throughout many parallel employees that every deal with a partial end result, that are progressively merged collectively.

Linearity, nevertheless, comes with its personal set of limitations: the sorts of hidden state evolutions that may be modeled by a linear system is diminished in its selection, which has a direct impression on the general expressivity of the RNN mannequin. The query then turns into: should we sacrifice expressivity for velocity, or can we’ve each?

The important thing perception is that we can have one of the best of each worlds, by adapting Newton’s technique — a traditional numerical approach for fixing nonlinear equations. Moderately than tackling the complicated system straight, Newton’s technique works by iteratively constructing and fixing an approximation (within the type of a extra tractable linear system) which is used to progressively refine the answer to the goal nonlinear one.

In Determine 5 we illustrate how we will apply this to our state of affairs. As an alternative of considering of the RNN as a sequence of sequential steps, we reframe your entire sequence as a single system of equations, the place the hidden states throughout all steps are unknowns to resolve for concurrently. Newton’s technique solves this method iteratively, approximating the nonlinearities with a linearization given by their native derivatives (i.e., their Jacobians). That is the place the magic occurs: the linearized RNN system has precisely the identical kind as a linear SSM, with the Jacobians taking part in the function of the state matrices $A_l$

Whereas this strategy introduces some overhead with respect to computing the RNN software sequentially, if the Newton iterations converge shortly sufficient (which we empirically noticed for well-designed RNN fashions), we will successfully get well the total nonlinear RNN habits in a fraction of the time, due to parallelization.

Newton’s method applied to RNN parallelization — Determine 5: Newton’s technique utilized to RNN parallelization. We first collate the recurrence relationships for every recurrent step right into a single system, which includes all hidden states concurrently. The system is then iteratively linearized and solved, to get well a greater and higher approximation to the goal hidden state evolution. Crucially, the linearizations themselves symbolize the evolution of an SSM, which may then be solved effectively by way of parallel scan.

In our experiments, we apply this system to diversifications of two classical RNNs: the GRU and LSTM cells, and obtain convergence constantly with simply three iterations. In different phrases, with three fastidiously designed, parallel SSM purposes, we will get well the identical hidden state evolution as with the sequential software of a nonlinear RNN — dramatically decreasing wall-clock time at coaching.

Whereas the ParaRNN framework can in precept be utilized to any RNN, some cautious engineering remains to be required to make it sensible for large-scale coaching. The parallel discount algorithm on the coronary heart of the tactic must effectively assemble, retailer, and multiply collectively the Jacobian matrices arising from the linearization. For generic RNNs, these Jacobians are dense, which makes their storage develop quadratically and their multiplication cubically with hidden state measurement — a price intractable for large-scale fashions.

We tackle this following the design rules from fashionable SSMs like Mamba, and introduce the ParaGRU and ParaLSTM cells: diversifications of the classical GRU and LSTM cells that yield structured Jacobians. Specifically, we simplify the matrices within the cells’ definition to solely have nonzero parts of their primary diagonal. This ensures that their Jacobians are additionally diagonal (for ParaGRU) and block-diagonal (for ParaLSTM), as outlined in Determine 6.

General: dense — Determine 6: Vanilla GRU and LSTM cells embrace mixing amongst hidden state parts, which causes their Jacobians to be dense normally. In our work, we simplify the cells definitions to forestall mixing. The ensuing Jacobians are closely structured: a diagonal matrix for the GRU cell, and a 2×2 block diagonal matrix for the LSTM cell.

Structured: (block) diagonal — Determine 6: Vanilla GRU and LSTM cells embrace mixing amongst hidden state parts, which causes their Jacobians to be dense normally. In our work, we simplify the cells definitions to forestall mixing. The ensuing Jacobians are closely structured: a diagonal matrix for the GRU cell, and a 2×2 block diagonal matrix for the LSTM cell.

To get one of the best speedups, we then implement customized CUDA kernels to carry out environment friendly parallel discount of Jacobians presenting these constructions. Within the design of our kernels, we make certain to carefully comply with the GPU reminiscence hierarchy to maintain the information as native as potential. Because of this, our fully-fused implementation handles Newton iterations, system meeting, and parallel discount in a single kernel, reaching outstanding speedups over Mamba, In Determine 7 we offer timing outcomes for the three implementations we offer to use our ParaRNN cells in parallel: in pure PyTorch, in PyTorch with CUDA-accelerated discount, and fully-fused in CUDA.

Ahead Move via ParaGRU and ParaLSTM Cells

Determine 7: Timing for a ahead cross via a ParaGRU (left) and a ParaLSTM (proper) cell, for varied implementations of the ParaRNN parallelization methodology: in pure PyTorch, utilizing customized CUDA kernels for under the parallel discount operation, and with a fully-fused kernel for the entire Newton routine. The fully-fused implementations are aggressive with the Mamba one throughout all sequence lengths.

To validate the effectiveness of the ParaRNN framework in enabling sensible coaching of nonlinear RNNs at scale, we skilled fashions starting from 400M to 7B parameters on language modelling duties. Our objective is to check how classical RNNs can carry out as LLMs, as soon as large-scale coaching turns into possible by way of parallelization.

The outcomes present that even classical RNNs make for aggressive LLMs when skilled on the 7B scale. Each by way of sheer perplexity and downstream job efficiency (see Desk 1), ParaGRU and ParaLSTM obtain scores similar to transformers and state-of-the-art SSMs.

Mannequin	#params	↓ PPL	↑ Arc-C (25) (3)		↑ HSwag (10) (0)		↑ OBQA (10) (0)		↑ WinoG (5) (0)		↑ PiQA (0)	↑ MMLU (0)
Mamba2	6.96B	8.62	40.02	39.59	69.78	69.68	42.20	42.20	65.19	63.77	76.66	26.61
ParaLSTM	6.76B	9.16	37.46	36.52	62.47	62.85	42.20	38.80	57.70	59.12	75.19	25.31
ParaGRU	6.76B	9.19	39.68	36.77	65.85	65.75	42.20	40.40	61.40	59.83	76.66	25.29
Transformer	6.89B	9.55	34.30	33.36	62.98	62.20	40.00	37.20	61.48	60.85	74.97	23.12

Desk 1: Fashions accuracy on downstream duties from the lm-eval-harness analysis suite (variety of photographs in brackets). Total, the RNN fashions attain efficiency similar to Mamba and transformer throughout the board.

The actual benefit of recurrent fashions shines at inference time, as proven in Determine 8. The constant-time token technology of RNNs means sustaining excessive throughput no matter context size, making them an interesting selection for purposes the place reaching quick technology is paramount.

Determine 8: In distinction to transformer, whose technology time will increase with sequence size, recurrent fashions like Mamba and RNNs require a relentless time, unbiased of sequence size, to generate every new token — thus leading to increased throughput.

Furthermore, the flexibility to incorporate nonlinearities within the recurrence step definition dramatically boosts efficiency on artificial duties requiring state monitoring and retrieval capabilities. These are helpful to check the mannequin’s potential to keep up and replace significant info in its inside state, and to retailer and get well related info when wanted. The improved efficiency outlined in Desk 2 signifies that nonlinear RNNs present superior expressivity over linear ones, and are value contemplating to design extra highly effective fashions.

Mannequin	MQAR	𝑘-hop	Parity
Transformer	100%	78%	53%
Mamba2	100%	98%	51%
ParaGRU	100%	100%	100%
ParaLSTM	100%	100%	100%

Desk 2: Together with nonlinearities into the recursion definition permits to realize
superior efficiency on state monitoring and recall duties over purely linear RNNs like Mamba.

Finally, our experiments present that scalability was the lacking piece within the RNN puzzle all alongside. At billion scale, they carry out in addition to fashionable language fashions, and boast superior expressivity and sooner throughput as well. These outcomes open the door to reconsidering nonlinear recurrence in fashionable sequence modelling: the nonlinearity vs coaching effectivity trade-off isn’t basic — it was only a consequence of computational limitations, which we will now overcome.

To speed up progress and allow additional exploration nonlinear RNN fashions at scale, the ParaRNN codebase has been launched as an open-source framework. Researchers and practitioners can simply concentrate on designing the RNN cell — the framework takes care of every part else.

To outline a customized cell in ParaRNN, simply implement its recurrence step:

class MyRNNCell( BaseRNNCell ):
    def recurrence_step( self, h, x, system_parameters ):
        h_new = ... 
        return h_new

The framework robotically handles Newton’s technique software, Jacobian meeting, parallel discount routines, and optimizations for structured Jacobians.

Pure PyTorch: For prototyping. Makes use of computerized differentiation and works for any RNN, however is just not optimized for scale
CUDA-Accelerated: For generic cells with (block-)diagonal Jacobians. Makes use of customized kernels for parallel discount
Absolutely-Fused: Single-kernel implementation fusing the entire Newton routine. Requires CUDA implementation of recurrence step

The modular design makes it simple to experiment with customized cells, Jacobian constructions, and solver configurations—it suffices to inherit from the out there base lessons and implement the precise discount operations.

RNN software is just not inherently sequential anymore. For the primary time, classical RNNs will be skilled on the scale of billions of parameters with coaching instances and efficiency matching fashionable architectures — unlocking the inference effectivity and expressivity benefits that made recurrence interesting within the first place.

ParaRNN opens the door to exploring nonlinear recurrence at scale: experimenting with novel architectures and pushing the boundaries of what’s potential with recurrent fashions is now simpler than ever. Nonlinear RNNs are again and able to scale.

ParaRNN: Giant-Scale Nonlinear RNNs, Trainable in Parallel

Speedup from Parallel RNN Coaching

Efficiency of Giant-Scale Traditional RNNs

Elementary trade-off between RNNs and Consideration

A Compromise for Parallelizability: Linearity

Newton’s Methodology Utilized

Ahead Move via ParaGRU and ParaLSTM Cells

Related Articles

Safety within the Submit-Mythos Period

Google Releases Gemini 3.5 Reside Translate, a Streaming Speech-to-Speech Audio Mannequin Masking 70+ Languages Throughout Meet, Translate, and the Reside API

The AI boomerang impact: extra information suggests employers are reversing AI layoffs

Latest Articles

Safety within the Submit-Mythos Period

Google Releases Gemini 3.5 Reside Translate, a Streaming Speech-to-Speech Audio Mannequin Masking 70+ Languages Throughout Meet, Translate, and the Reside API

The AI boomerang impact: extra information suggests employers are reversing AI layoffs

Planet 9 thriller deepens as new discovery challenges hidden planet concept

Lastly the Steady Diff-in-Diff Estimator Reveals Up!