Deep sequence models tend to memorize geometrically; it is unclear why.
[Preprint Link Available Soon]
Abstract: We present a clean and analyzable phenomenon that contrasts the predominant associative view of Transformer memory with a nascent geometric view. Concretely, we construct an in-weights path-finding task where a next-token learner succeeds in planning ahead, despite the task being adversarially constructed. This observation is incompatible with memory as strictly a store of local associations; instead, training with gradient descent must have synthesized a geometry of global relationships from witnessing mere local associations. While such a geometric memory may seem intuitive in hindsight, we argue that its emergence cannot be easily explained by various pressures, be they statistical, architectural, or supervisory. To make sense of this, we draw a connection to an open question in the simpler Node2Vec algorithm, and we provide empirical clues toward a closed-form solution for the graph embeddings that are learned. Our insight is that global geometry arises from a spectral bias that, in contrast to prevailing intuition, does not require low dimensionality of the embeddings. Our study raises open questions concerning implicit reasoning and the bias of gradient-based memorization, while offering a simple example for analysis. Our findings also call for revisiting theoretical abstractions of parametric memory in Transformers.
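To make the contrast between local associations and global geometry concrete, here is a minimal sketch of what an in-weights path-finding setup for a next-token learner could look like. Every detail below (graph size, the `edge`/`path` document formats, the BFS query construction) is an assumption for illustration only; the paper's actual task is adversarially constructed and is not specified here.

```python
# Hypothetical sketch: training documents expose only local associations
# (single edges), while evaluation queries require global plans (multi-hop
# paths). All formats and parameters are illustrative assumptions.
import random
from collections import deque

def random_graph(n_nodes=50, n_edges=150, seed=0):
    """Sample a random undirected graph as an adjacency list."""
    rng = random.Random(seed)
    adj = {v: set() for v in range(n_nodes)}
    while sum(len(s) for s in adj.values()) // 2 < n_edges:
        u, v = rng.sample(range(n_nodes), 2)
        adj[u].add(v)
        adj[v].add(u)
    return adj

def shortest_path(adj, src, dst):
    """BFS shortest path: the global structure a geometric memory must capture."""
    parent = {src: None}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        if u == dst:
            path = [u]
            while parent[path[-1]] is not None:
                path.append(parent[path[-1]])
            return path[::-1]
        for w in adj[u]:
            if w not in parent:
                parent[w] = u
                queue.append(w)
    return None  # dst unreachable from src

adj = random_graph()

# Training data: one document per edge, i.e. purely local associations.
edge_docs = [f"edge {u} {v}" for u in adj for v in adj[u] if u < v]

# Evaluation query: the model must emit a multi-hop path token by token.
src, dst = 3, 41
path = shortest_path(adj, src, dst)
print("train doc example:", edge_docs[0])
print("eval query:", f"path {src} {dst} :", " ".join(map(str, path or [])))
```

Succeeding on such queries from edge-only training would require the weights to encode more than a lookup table of edges, which is the sense in which a "geometric" memory goes beyond an associative one.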
