Reflections on ICLR 2026

In late April, a delegation from the IMC Quantitative Research team headed to beautiful Rio de Janeiro for ICLR 2026, aiming to engage with the AI research community and understand the main themes emerging in the field.

Overall Metrics

ICLR 2026 in Rio marked a significant expansion in research activity compared to the previous ICLR conference in Singapore, with submissions rising sharply from 11,603 in 2025 to 19,525 in 2026 (a 68% increase). The number of accepted papers also grew, from 3,704 to 5,357 (up 45%), but this did not keep pace with submissions, so the acceptance rate dropped from 32% to 27%. ICLR 2026 was therefore notably more competitive than 2025, and the higher bar was noticeable on the ground: we found it very easy to meet highly capable and inspiring researchers across a variety of fields.

In contrast to the growth in submissions, overall participation declined compared to Singapore. Total attendance fell from 11,039 in 2025 to 9,954 in 2026, driven primarily by a substantial decrease in in-person attendance (from 10,435 to 7,054). At the same time, virtual participation increased significantly, from 604 to 2,900, indicating a shift toward more hybrid engagement. The number of represented countries decreased slightly from 85 to 80.

From an organisational standpoint, the scale of the review process expanded considerably compared to 2025. The number of reviewers increased from 18,325 to 21,674, while the number of area chairs nearly doubled from 823 to 1,634, reflecting the additional coordination required to handle the surge in submissions.

Source: ICLR 2025 and ICLR 2026 official fact sheets (media.iclr.cc).
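For readers who want to check the headline percentages, here is a small sketch that recomputes them from the figures quoted above. The variable names are ours, and the only inputs are the numbers already stated in this section.

```python
# Figures exactly as quoted in the text above.
submissions = {2025: 11_603, 2026: 19_525}
accepted = {2025: 3_704, 2026: 5_357}
in_person = {2025: 10_435, 2026: 7_054}
virtual = {2025: 604, 2026: 2_900}


def pct_change(old, new):
    """Relative change from old to new, in percent."""
    return 100.0 * (new - old) / old


submission_growth = pct_change(submissions[2025], submissions[2026])
accepted_growth = pct_change(accepted[2025], accepted[2026])

# Acceptance rate = accepted / submitted, per year, in percent.
acceptance_rate = {y: 100.0 * accepted[y] / submissions[y] for y in submissions}

# Total attendance = in-person + virtual, per year.
total_attendance = {y: in_person[y] + virtual[y] for y in in_person}

print(f"submissions: +{submission_growth:.0f}%, accepted: +{accepted_growth:.0f}%")
print({y: f"{r:.0f}%" for y, r in acceptance_rate.items()})
print(total_attendance)
```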

Main Research Themes

Compared to 2025, which felt like a year of consolidating and refining foundation models, ICLR 2026 leaned more into understanding how these systems actually behave and how to build on top of them.

Last year, it seemed a lot of research energy went into making large models more practical—improving alignment, refining fine-tuning approaches, editing model behaviour, working out data attribution, extending to new modalities like video, and squeezing out more efficient inference. The focus was very much on making existing models work better in real-world settings.

In Rio, the emphasis shifted. There was greater focus on reasoning, agent-style setups, and what’s going on inside the models themselves. You saw more work trying to pin down what transformers can represent, where they break down, and how to train them more effectively at scale. The workshops reflected this as well, with a noticeable tilt toward agents, memory, continual learning, and reasoning-heavy use cases.

Viewed as a whole, ICLR 2026 feels like a transition point compared to 2025. The conference grew significantly in scale and became more selective, even as in-person attendance dipped and hybrid participation became more prominent.

More interestingly, the centre of gravity shifted. Where 2025 focused on refining foundation models for deployment, 2026 was more about what happens when you start to use these models in more complex settings. Agentic systems were a big part of that—models acting over multiple steps, interacting with tools, and maintaining some notion of state—but they weren’t the only story.

Alongside this, there was a stronger push to understand reasoning, memory, and model behaviour more generally: what transformers can represent, where they fail in longer interactions, and how to train them more robustly at scale. In that sense, agentic AI is less a standalone theme and more a visible outcome of a broader shift toward building and understanding systems of models that can operate reliably in more realistic, multi-step environments.

Most Interesting Papers

One of the winners of the Outstanding Papers award, “Transformers are Inherently Succinct”, provides a theoretical explanation for the effectiveness of transformer architectures. The paper argues that transformers can represent certain functions far more compactly than other sequence models, offering a new lens on their expressive power. Rather than focusing on empirical scaling alone, it helps explain why transformers work so well, and suggests that their architectural advantages are fundamental rather than incidental.

Apart from the officially recognised papers, the following section highlights a selection of papers from the conference that stood out on a personal level. Given the breadth of topics and the sheer volume of high-quality work presented, any such selection is inevitably subjective. Different researchers are drawn to different ideas—whether that’s theoretical insights, empirical breakthroughs, or practical applications—so this should be read as a reflection of individual interests rather than a definitive list of the most important contributions.

With that in mind, the papers discussed here were chosen because they offer interesting perspectives on emerging directions in the field, particularly around model behaviour, training dynamics, and system-level design.

Rényi Sharpness: A Novel Sharpness that Strongly Correlates with Generalization

In this paper, Qiaozhe Zhang and his collaborators from Huazhong University of Science and Technology introduce Rényi sharpness, a new measure of neural network sharpness based on the Rényi entropy of the Hessian eigenvalue spectrum. The motivation is to revisit the long-standing link between flat minima and good generalisation, which existing sharpness metrics have struggled to capture reliably. Instead of focusing only on the magnitude of curvature, Rényi sharpness accounts for how the curvature is distributed across directions.

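In simplified form, if $\lambda_1, \dots, \lambda_n$ are the eigenvalues of the Hessian $H$, the Rényi sharpness of order $\alpha$ is:

$$R_\alpha(H) = \frac{1}{\alpha - 1} \log \sum_{i=1}^{n} p_i^{\alpha}$$

where $p_i = \lambda_i / \sum_{j=1}^{n} \lambda_j$ are the normalised eigenvalues (forming a probability distribution).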

The key idea is that good generalisation is associated not just with low curvature, but with a more uniform spread of Hessian eigenvalues. The authors provide theoretical results showing that this measure leads to meaningful generalisation bounds and avoids issues like reparameterisation sensitivity that affect earlier definitions of sharpness.
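To make the "spread of eigenvalues" idea concrete, here is a minimal sketch that computes a Rényi-style sharpness from an explicit Hessian. It assumes the measure is the negative Rényi entropy of the normalised eigenvalue spectrum, i.e. log(sum_i p_i^alpha) / (alpha - 1), and that negative eigenvalues can be clipped to zero; both are our simplifications for illustration, not a reproduction of the paper's code. The function name `renyi_sharpness` is ours.

```python
import numpy as np


def renyi_sharpness(hessian, alpha=2.0, eps=1e-12):
    """Negative Rényi entropy of the normalised Hessian eigenvalue spectrum."""
    eigvals = np.linalg.eigvalsh(hessian)
    eigvals = np.clip(eigvals, 0.0, None)  # assumption: treat H as PSD
    total = eigvals.sum()
    if total <= 0.0:
        raise ValueError("Hessian has no positive curvature to normalise")
    p = eigvals / total  # normalised eigenvalues, sums to 1
    if abs(alpha - 1.0) < 1e-8:
        nz = p[p > eps]
        return float(np.sum(nz * np.log(nz)))  # alpha -> 1: negative Shannon entropy
    return float(np.log(np.sum(p ** alpha)) / (alpha - 1.0))


# Two toy Hessians with the same trace (same "total curvature").
even_spectrum = np.diag([1.0, 1.0, 1.0, 1.0])
spiky_spectrum = np.diag([3.7, 0.1, 0.1, 0.1])

sharp_even = renyi_sharpness(even_spectrum)
sharp_spiky = renyi_sharpness(spiky_spectrum)
print(sharp_even, sharp_spiky)  # the evenly spread spectrum scores lower
```

Under this definition, rescaling the Hessian by a positive constant leaves the value unchanged, because only the normalised spectrum enters. That is a small illustration of why a distribution-based measure can be less sensitive to scale than a raw curvature magnitude.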

Empirically, Rényi sharpness correlates more strongly with test performance than prior metrics across a range of models and settings. The paper also proposes Rényi Sharpness-Aware Minimization (RSAM), a training method that uses this metric as a regulariser and improves over standard sharpness-aware methods such as SAM. It is worth noting that RSAM avoids Hessian eigenvalue computations by estimating sharpness through random weight perturbations and aggregating their losses, rather than explicitly measuring curvature. This replaces expensive second-order analysis with a first-order, Monte Carlo–style approximation of the local loss landscape.
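The general idea of probing curvature with loss evaluations only can be sketched as follows. This is not the RSAM objective, whose exact aggregation we do not reproduce here. It is a generic Monte Carlo proxy (average loss increase under Gaussian weight perturbations), and the function name, `sigma`, and the toy quadratic loss are all our assumptions.

```python
import numpy as np


def perturbation_sharpness(loss_fn, weights, sigma=0.1, n_samples=20_000, seed=0):
    """Average loss increase under random Gaussian weight perturbations."""
    rng = np.random.default_rng(seed)
    base = loss_fn(weights)
    noise = rng.standard_normal((n_samples, weights.size))
    perturbed = np.array([loss_fn(weights + sigma * e) for e in noise])
    return float(perturbed.mean() - base)


# Toy quadratic loss with a known Hessian, evaluated at its minimum.
hessian = np.diag([1.0, 2.0, 3.0])


def quadratic_loss(w):
    return 0.5 * w @ hessian @ w


w_star = np.zeros(3)
estimate = perturbation_sharpness(quadratic_loss, w_star, sigma=0.1)

# For a quadratic at its minimum the expectation is 0.5 * sigma^2 * trace(H),
# so no eigenvalue computation is needed to sense the curvature.
expected = 0.5 * 0.1 ** 2 * np.trace(hessian)
print(estimate, expected)
```

The estimate only ever calls `loss_fn`, which is the practical appeal: the cost scales with the number of perturbations, not with the size of the Hessian.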

Spectral Conditioning of Attention Improves Transformer Performance

Whilst Transformers and the attention mechanism are certainly not new, this paper by two Australian researchers, Hemanth Saratchandran and Simon Lucey, shows that there are benefits from revisiting a fundamental issue in transformers: the conditioning of attention layers. The authors show theoretically that the Jacobian of an attention block is heavily influenced by the query, key, and value projections, and that poor spectral conditioning can make optimisation and learning less effective.

To address this, the paper introduces a method for spectral conditioning of attention layers. At a high level, the method works by normalising or reshaping the spectrum of these projection matrices during training. Concretely, the weights are reparameterised (or adjusted) so that their singular values are kept within a controlled range—preventing them from becoming too large or too skewed. This avoids situations where a few directions dominate (leading to large Jacobian condition numbers) and instead promotes a more balanced transformation.
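As a rough illustration of keeping singular values within a controlled range, the sketch below clips the spectrum of a single projection matrix via an SVD. This is a deliberately simple stand-in and should not be read as the authors' method. The function name, the bounds `s_min` and `s_max`, and the toy matrix are our assumptions.

```python
import numpy as np


def clip_singular_values(weight, s_min, s_max):
    """Return a matrix with the same singular vectors but a clipped spectrum."""
    u, s, vt = np.linalg.svd(weight, full_matrices=False)
    s_clipped = np.clip(s, s_min, s_max)
    return (u * s_clipped) @ vt  # scales each column of u by its singular value


rng = np.random.default_rng(0)

# A deliberately ill-conditioned stand-in for a query projection:
# column scales span four orders of magnitude.
w_q = rng.standard_normal((8, 8)) @ np.diag(np.logspace(0, -4, 8))
w_q_conditioned = clip_singular_values(w_q, s_min=0.5, s_max=2.0)

cond_before = np.linalg.cond(w_q)
cond_after = np.linalg.cond(w_q_conditioned)
print(f"condition number before: {cond_before:.1f}, after: {cond_after:.2f}")
```

With all singular values forced into [0.5, 2.0], the condition number is bounded by 2.0 / 0.5 = 4 by construction, so no single direction can dominate the transformation.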

Importantly, the method is lightweight and designed as a drop-in replacement that can be applied across different transformer architectures and attention variants.

Empirically, the authors report consistent performance improvements across a range of transformer models and tasks. The broader significance of the work is that it reframes attention not just as a representational mechanism, but as an optimisation problem: improving the conditioning of attention layers can directly improve transformer training dynamics and downstream performance.

On The Wasserstein Geodesic Principal Component Analysis Of Probability Measures

This paper by Nina Vesseron et al. studies Geodesic Principal Component Analysis (GPCA) for datasets where each observation is a probability distribution rather than a vector. Instead of applying PCA in Euclidean space, the authors work in Wasserstein space, where distances are defined via optimal transport, allowing them to capture modes of variation that respect the geometry of distributions.

The paper develops a tractable GPCA formulation for Gaussian distributions by lifting the problem to a space of invertible linear maps, leveraging the closed-form structure of Wasserstein geometry in this case. It then extends the approach to more general distributions using neural network parameterisations of transport maps and geodesics, avoiding tangent-space linearisation while still enabling practical computation.
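The closed-form structure mentioned above is easiest to see in the distance itself. For two Gaussians, the squared 2-Wasserstein distance is the squared distance between the means plus the Bures term tr(A + B - 2 (A^{1/2} B A^{1/2})^{1/2}) for covariances A and B. The sketch below implements that standard formula. It does not implement the paper's GPCA procedure, and the helper names are ours.

```python
import numpy as np


def psd_sqrt(mat):
    """Symmetric square root of a positive semi-definite matrix."""
    vals, vecs = np.linalg.eigh(mat)
    vals = np.clip(vals, 0.0, None)  # guard against tiny negative round-off
    return (vecs * np.sqrt(vals)) @ vecs.T


def gaussian_w2_squared(mean_a, cov_a, mean_b, cov_b):
    """Squared 2-Wasserstein distance between N(mean_a, cov_a) and N(mean_b, cov_b)."""
    root_a = psd_sqrt(cov_a)
    cross = psd_sqrt(root_a @ cov_b @ root_a)
    bures = np.trace(cov_a + cov_b - 2.0 * cross)
    return float(np.sum((mean_a - mean_b) ** 2) + bures)


mean_a, cov_a = np.zeros(2), np.eye(2)
mean_b, cov_b = np.array([3.0, 4.0]), 4.0 * np.eye(2)

# Means contribute 3^2 + 4^2 = 25; covariances contribute tr(I + 4I - 2 * 2I) = 2.
dist_sq = gaussian_w2_squared(mean_a, cov_a, mean_b, cov_b)
print(dist_sq)
```

Because the distance depends on the covariances only through matrix square roots, no optimisation over transport plans is needed in the Gaussian case.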

Empirically, the method captures nonlinear geometric variation that classical tangent PCA can miss. The work is notable for combining optimal transport, differential geometry, and deep learning to provide a more faithful notion of dimensionality reduction for distributional data.

Potential applications arise in settings where each data point is a distribution and its evolution matters, such as time-varying systems (e.g. traffic or population dynamics), where the method can help characterise dominant modes of change.

EigenScore: OOD Detection using Posterior Covariance in Diffusion Models

Diffusion models are widely considered state-of-the-art for high-quality sample generation, but they remain susceptible to out-of-distribution (OOD) inputs. This paper proposes a new method for detecting OOD inputs using diffusion models by analysing a covariance-like structure derived from the diffusion denoiser. Instead of relying on reconstruction error or likelihood estimates, the method examines the spectral properties of this structure. The key insight is that OOD samples exhibit systematic differences in their covariance spectra, particularly in the leading eigenvalues, yielding a measurable “spectral signature” of distribution shift.

The authors introduce EigenScore, which uses the dominant eigenvalues of this matrix as an OOD score. To make the method computationally practical, they use a Jacobian-free subspace iteration approach that estimates leading eigenvalues without explicitly forming the Jacobian. Empirically, the approach achieves competitive or state-of-the-art OOD detection performance and remains effective even in challenging near-OOD settings (such as CIFAR-10 vs CIFAR-100). More broadly, the paper shows that the local geometric structure of diffusion models provides useful signals for identifying distribution mismatch.
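A Jacobian-free subspace iteration can be sketched in a few lines. The version below is a toy, not the paper's implementation: it approximates Jacobian-vector products with central finite differences, assumes the Jacobian is symmetric, and uses a hypothetical linear "denoiser" so the true eigenvalues are known. All function names are ours.

```python
import numpy as np


def jvp_fd(func, x, v, eps=1e-4):
    """Central finite-difference estimate of J(x) @ v, without forming J."""
    return (func(x + eps * v) - func(x - eps * v)) / (2.0 * eps)


def top_eigenvalues(func, x, k=2, iters=50, seed=0):
    """Leading k eigenvalues of the (assumed symmetric) Jacobian of func at x."""
    rng = np.random.default_rng(seed)
    q, _ = np.linalg.qr(rng.standard_normal((x.size, k)))
    for _ in range(iters):
        # Apply the Jacobian to each basis vector, then re-orthonormalise.
        z = np.column_stack([jvp_fd(func, x, q[:, j]) for j in range(k)])
        q, _ = np.linalg.qr(z)
    # Rayleigh-Ritz step: eigenvalues of the Jacobian restricted to span(q).
    jq = np.column_stack([jvp_fd(func, x, q[:, j]) for j in range(k)])
    small = q.T @ jq
    return np.sort(np.linalg.eigvalsh(0.5 * (small + small.T)))[::-1]


# Hypothetical linear "denoiser" whose Jacobian is this diagonal matrix.
shrinkage = np.diag([0.9, 0.5, 0.2, 0.1, 0.05])


def toy_denoiser(x):
    return shrinkage @ x


leading = top_eigenvalues(toy_denoiser, np.ones(5), k=2)
print(leading)  # should approach the two largest diagonal entries
```

Each iteration costs 2k evaluations of the denoiser, which is what makes the approach viable when the input dimension is too large to materialise the Jacobian. In practice an autodiff JVP would replace the finite-difference helper.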

Over-parametrization bends the landscape: BBP transitions at initialization in simple Neural Networks

This paper studies how over-parameterisation changes the geometry of neural network loss landscapes at initialisation. Using a simplified teacher–student setup related to phase retrieval, the authors analyse the Hessian spectrum of the loss function and identify a BBP (Baik–Ben Arous–Péché) transition that separates regimes where the loss landscape contains directions aligned with the target signal from regimes where it does not.

The main result is that increasing over-parameterisation reshapes the Hessian spectrum, effectively shifting this transition point in a favourable direction. In highly over-parameterised settings, the transition moves closer to the information-theoretic recovery limit, suggesting that wider networks naturally create optimisation landscapes that are easier for gradient-based methods to exploit. The paper also distinguishes between continuous and discontinuous transitions and shows that finite-size effects can allow learning even below the asymptotic threshold.

More broadly, the work provides theoretical evidence for why over-parameterised neural networks can be easier to train despite their complexity. Rather than simply increasing model capacity, over-parameterisation appears to reshape the optimisation landscape in ways that make informative directions easier to discover during training.

It is important to note, especially if you work with financial data, that the analysis in the paper largely assumes a low-noise teacher–student setting, where a clear signal direction exists and the Hessian exhibits a BBP-type phase transition familiar from spiked random matrix theory (RMT): a leading eigenvalue separates from the bulk when the signal is strong enough. In noisy settings, this picture maps directly onto classical RMT results where the spike must exceed a signal-to-noise threshold to remain detectable; increasing noise shrinks or submerges the spike into the Marchenko–Pastur bulk, blurring or eliminating the transition. From this perspective, over-parameterisation can still help by effectively amplifying or stabilising signal directions, but its benefits are governed by SNR limits rather than clean phase transitions, making the theory qualitatively relevant but quantitatively less predictive in high-noise regimes.
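The spike-versus-bulk picture is easy to reproduce numerically with the textbook spiked Wigner model, a symmetric noise matrix plus a rank-one signal of strength theta. This is the classical RMT toy, not the Hessian studied in the paper. The matrix size, seed, and function name are our choices.

```python
import numpy as np


def top_eigenpair_spiked(n, theta, seed=0):
    """Top eigenvalue of noise + theta * v v^T and its squared overlap with v."""
    rng = np.random.default_rng(seed)
    g = rng.standard_normal((n, n))
    noise = (g + g.T) / np.sqrt(2.0 * n)  # bulk spectrum fills roughly [-2, 2]
    v = np.ones(n) / np.sqrt(n)           # planted unit-norm signal direction
    vals, vecs = np.linalg.eigh(noise + theta * np.outer(v, v))
    overlap = float((vecs[:, -1] @ v) ** 2)
    return float(vals[-1]), overlap


# Below the threshold (theta < 1) the spike is submerged in the bulk;
# above it, the top eigenvalue detaches towards theta + 1 / theta.
weak_val, weak_overlap = top_eigenpair_spiked(n=1000, theta=0.5)
strong_val, strong_overlap = top_eigenpair_spiked(n=1000, theta=3.0)

print(f"theta=0.5: top eigenvalue {weak_val:.2f}, overlap {weak_overlap:.3f}")
print(f"theta=3.0: top eigenvalue {strong_val:.2f}, overlap {strong_overlap:.3f}")
```

Raising the noise level relative to the signal is equivalent to lowering theta here, which is the sense in which a high-noise dataset can push a problem back below the transition.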

Selective Rotary Position Embedding

This paper, co-authored by an incoming IMC QR intern, Timur Carstensen, revisits how transformers encode positional information. Standard Rotary Position Embeddings (RoPE) use fixed-angle rotations to encode token order, while many linear or gated transformer variants instead rely on learned decay or gating mechanisms. The authors argue that these approaches capture complementary aspects of sequence modelling: rotations encode positional structure, while gating controls forgetting.

The paper introduces Selective RoPE, an input-dependent extension of RoPE that allows the model to learn token-conditioned rotation angles rather than relying on fixed positional rotations. This generalises rotary embeddings and provides a common perspective linking softmax attention, gated linear transformers, and state-space models. A key theoretical insight is that softmax attention can be interpreted as implicitly inducing a form of selective rotation, while the proposed method makes this mechanism explicit and learnable.
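To show the difference between fixed and token-conditioned rotations, the sketch below applies the same pairwise rotation with two different angle schedules. The fixed schedule follows the usual RoPE recipe (position times a per-pair frequency). The input-dependent schedule, where each token emits an angle increment through a hypothetical projection `w_angle` and increments are accumulated along the sequence, is our guess at a simple parameterisation and should not be taken as the paper's exact formulation.

```python
import numpy as np


def rotate_pairs(x, angles):
    """Rotate consecutive feature pairs of x (T, d) by angles (T, d // 2)."""
    x1, x2 = x[:, 0::2], x[:, 1::2]
    cos, sin = np.cos(angles), np.sin(angles)
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out


def rope_angles(seq_len, dim, base=10000.0):
    """Fixed RoPE schedule: angle = position * per-pair frequency."""
    freqs = base ** (-np.arange(dim // 2) * 2.0 / dim)
    return np.arange(seq_len)[:, None] * freqs[None, :]


def selective_angles(x, w_angle):
    """Hypothetical input-dependent schedule: cumulative token-conditioned increments."""
    return np.cumsum(x @ w_angle, axis=0)


rng = np.random.default_rng(0)
seq_len, dim = 6, 8
x = rng.standard_normal((seq_len, dim))
w_angle = 0.1 * rng.standard_normal((dim, dim // 2))  # stand-in for learned weights

x_rope = rotate_pairs(x, rope_angles(seq_len, dim))
x_selective = rotate_pairs(x, selective_angles(x, w_angle))

# Rotations never change token norms; only the relative phases differ.
print(np.linalg.norm(x, axis=1))
print(np.linalg.norm(x_selective, axis=1))
```

With the fixed schedule, the dot product between a rotated query and key depends only on their position offset. With the input-dependent schedule, the effective "distance" between two tokens depends on what lies between them, which is the property that links rotations to gating-style memory.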

Empirically, the authors show that adding Selective RoPE consistently improves performance on language modelling and long-sequence tasks such as copying, retrieval, and state tracking. More broadly, the work suggests that positional encoding should be treated as a dynamic, input-dependent operation that interacts closely with memory and attention mechanisms, rather than a fixed preprocessing step.

Onwards and Upwards

Overall, ICLR 2026 was an energising week with many interesting new ideas presented. Beyond the individual papers, what stood out was the sheer volume of new perspectives — from theoretical insights to practical modelling tricks — that we’re bringing back with us. The conference reinforced how fast the field is moving, but also how many genuinely new directions are opening up across optimisation, sequence modelling, generative methods, and beyond. We’re leaving with plenty to explore, test, and build on internally. Looking ahead, it’s exciting to see how these ideas will evolve over the next few months — and we’re already looking forward to continuing the conversation at ICML in South Korea!