That's one of the things that sticks out for me about the paper. Having tried very hard myself to solve ARC, it's pretty insane what they're claiming to have done here.
(I think a lot of the sceptics in this thread are unaware of just how difficult ARC-1 is, and are focusing on the Sudoku part, which I agree is much simpler and a less surprising thing to do well on.)
I hope/fear this HRM model is going to be merged with MoE very soon. Given the huge economic pressure to develop powerful LLMs I think this can be done in just a month.
The paper seems to study only problems like Sudoku solving, not question answering or other applications of LLMs. Furthermore, they omit any section on future applications or fusion with current LLMs.
I think anyone working in this field can envision the applications, but the details of combining MoE with an HRM model could be their next paper.
I only skimmed the paper and I am not an expert; surely others will/can explain why they don't discuss such a new structure. Anyway, my post is just blissful ignorance of the complexity involved and the impossible task of predicting change.
Edit: A more general idea is that Mixture of Experts relates to clusters of concepts, and now we would have to consider a cluster of concepts related by the time they take to be grasped. In a sense the model would hold in latent space an estimate of the depth, number of layers, and time required for each concept, just as we adapt our reading style differently for a dense math book than for a short newspaper story.
This HRM is essentially purpose-designed for solving puzzles with a small number of rules interacting in complex ways. Because the number of rules is small, a small model can learn them. Because the model is small, it can be run many times in a loop to resolve all interactions.
In contrast, language modeling requires storing a large number of arbitrary phrases and their relation to each other, so I don't think you could ever get away with a similarly small model. Fortunately, a comparatively small number of steps typically seems to be enough to get decent results.
But if you tried to use an LLM-sized model in an HRM-style loop, it would be dog slow, so I don't expect anyone to try it anytime soon. Certainly not within a month.
Maybe you could have a hybrid where an LLM has a smaller HRM bolted on to solve the occasional constraint-satisfaction task.
> In contrast, language modeling requires storing a large number of arbitrary phrases and their relation to each other
A person has some ~10k word vocabulary, with words fitting specific places in a really small set of rules. All combined, we probably have something on the order of a few million rules in a language.
Which, yes, is larger than what the model in this paper can handle, but is nowhere near large enough to require something the size of a modern LLM. So it is well worth trying to enlarge models with other architectures, trying hybrid models (note that this one is necessarily hybrid already), and exploring every other possibility out there.
What about many small HRM models that solve conceptually distinct subtasks, as determined and routed to by a master model that then analyzes and aggregates the outputs, with all of that learned during training?
I would assume that training an LLM is infeasible for a small research lab, so isn't tackling small problems like this unavoidable? Given that current LLMs have clear limitations, I can't think of anything better than developing better architectures on small test cases; a company can then try scaling them later.
Skimming this, there is no reason why a MoE LLM system (whether autoregressive, diffusion, energy-based or mixed) couldn't be given a nested architecture that duplicates the layout of an HRM. Combining these in different ways should allow for some novel benchmarks around efficiency and quality, which will be interesting.
I've been keeping an eye on this one as well. Based on what the paper claims, this would be huge. But I think, like many here, we are waiting for either confirmation or denial of the claims by third parties. The concept behind it sounds legit, but I'd like to see it in practice.
This is really interesting, but does anyone think this is something that might generalize to more ambiguous reasoning situations with further development? I am no expert, but Sudoku and such puzzles seem like very well-defined problem spaces.
I really like this usage of recurrent modules to augment attention-based models, and I think this is a really cool result and a fruitful avenue for future work
> For ARC-AGI challenge, we start with all input-output example pairs in the training and the evaluation sets ... At test time, we proceed as follows for each test input in the evaluation set: ...
Very often I see people misuse the ARC-AGI data when training. The input examples in the evaluation set are not intended for training your AI system. It is a downside of ARC that its data is (somehow?) complicated enough for clever people building AI systems to miss this point, and results get reported and compared as a single percentage even when the training data mix makes the comparison inapplicable.
Based on a quick first skim of the abstract and the introduction, the results from hierarchical reasoning (HRM) models look incredible:
> Using only 1,000 input-output examples, without pre-training or CoT supervision, HRM learns to solve problems that are intractable for even the most advanced LLMs. For example, it achieves near-perfect accuracy in complex Sudoku puzzles (Sudoku-Extreme Full) and optimal pathfinding in 30x30 mazes, where state-of-the-art CoT methods completely fail (0% accuracy). In the Abstraction and Reasoning Corpus (ARC) AGI Challenge [27,28,29] - a benchmark of inductive reasoning - HRM, trained from scratch with only the official dataset (~1000 examples), with only 27M parameters and a 30x30 grid context (900 tokens), achieves a performance of 40.3%, which substantially surpasses leading CoT-based models like o3-mini-high (34.5%) and Claude 3.7 8K context (21.2%), despite their considerably larger parameter sizes and context lengths, as shown in Figure 1.
I'm going to read this carefully, in its entirety.
Thank you for sharing it on HN!
Exactly!
> It uses two interdependent recurrent modules: a *high-level module* for abstract, slow planning and a *low-level module* for rapid, detailed computations. This structure enables HRM to achieve significant computational depth while maintaining training stability and efficiency, even with minimal parameters (27 million) and small datasets (~1,000 examples).
> HRM outperforms state-of-the-art CoT models on challenging benchmarks like Sudoku-Extreme, Maze-Hard, and the Abstraction and Reasoning Corpus (ARC-AGI), where CoT methods fail entirely. For instance, it solves 96% of Sudoku puzzles and achieves 40.3% accuracy on ARC-AGI-2, surpassing larger models like Claude 3.7 and DeepSeek R1.
Erm what? How? Needs a computer and sitting down.
Yeah, that was pretty much my reaction. I will need time on a computer too.
The repo is at https://github.com/sapientinc/HRM .
I love it when authors publish working code. It's usually a good sign. If the code does what the authors claim, no one can argue with it!
Same! Guan’s work on sample packing during finetuning has become a staple. His openchat code is also super simple and easy to understand.
Is it talking about fine tuning existing models with 1000 examples to beat them in those tasks?
I am extremely skeptical of a 27M parameter model being trained “from scratch” on 1000 datapoints. I am likewise incredulous of the lack of comparison with any other model which is trained “from scratch” using their data preparation. Instead they strictly compare with 3rd party LLMs which are massively more general purpose and may not have any of those 1000 examples in their training set.
This smells like some kind of overfit to me.
Yeah, the results look incredible indeed. That's why I and many others here have decided to download, review, and test the code published by the authors.[a] If their code doesn't live up to their claims, we will all ignore their work and move on. If their code lives up to their claims, no one can argue with it. In my experience, when authors publish working code, it's usually a good sign.
---
[a] https://github.com/sapientinc/HRM
Did it work? :)
The architecture is very similar to offset LSTMs, which have been studied extensively. The main difference is the handover of the hidden state, which my naive mind would assume makes optimization substantially more difficult.
I haven't had a chance to read the preprint carefully or play with the code yet. Best place to follow what's happening is by looking at the github repo, specifically open and closed issues and pull requests.
I'll wait until some more benchmarks are run in this case. Unlike traditional software, vetting that a model architecture works better than alternatives is a time- and compute-intensive process. You really can't just download it and "try it out" outside of general-purpose models (which this is not).
> "After completing the T steps, the H-module incorporates the sub-computation’s outcome (the final state L) and performs its own update. This H update establishes a fresh context for the L-module, essentially “restarting” its computational path and initiating a new convergence phase toward a different local equilibrium."
So they let the low-level RNN bottom out, evaluate the output in the high-level module, and generate a new context for the low-level RNN. Rinse, repeat. The low-level RNN iterates toward a local equilibrium while the high-level module periodically kicks it with fresh context to get better outputs. Loops within loops. Composition.
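For intuition, here is a minimal PyTorch-flavoured sketch of that two-timescale loop as I read it. The module names, the plain GRU cells, and the loop counts are my own illustration under those assumptions, not the authors' implementation:

```python
import torch
import torch.nn as nn

class TwoTimescaleLoop(nn.Module):
    """Toy sketch of an HRM-style nested recurrence: a slow H-module
    periodically re-contextualises a fast L-module."""
    def __init__(self, d_in, d_hidden):
        super().__init__()
        self.l_cell = nn.GRUCell(d_in + d_hidden, d_hidden)  # fast, detailed updates
        self.h_cell = nn.GRUCell(d_hidden, d_hidden)          # slow, abstract updates
        self.readout = nn.Linear(d_hidden, d_in)

    def forward(self, x, n_cycles=4, t_steps=8):
        h = x.new_zeros(x.size(0), self.h_cell.hidden_size)
        l = x.new_zeros(x.size(0), self.l_cell.hidden_size)
        for _ in range(n_cycles):          # slow outer loop (H-module)
            for _ in range(t_steps):       # fast inner loop (L-module), conditioned on h
                l = self.l_cell(torch.cat([x, h], dim=-1), l)
            h = self.h_cell(l, h)          # H-update "restarts" the L-module's context
        return self.readout(h)
```

The point is just the structure: the inner loop runs toward its equilibrium under a fixed high-level context, then the outer update swaps in a new context and the inner loop converges again.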
Another interesting part:
> "Neuroscientific evidence shows that these cognitive modes share overlapping neural circuits, particularly within regions such as the prefrontal cortex and the default mode network. This indicates that the brain dynamically modulates the “runtime” of these circuits according to task complexity and potential rewards.
> Inspired by the above mechanism, we incorporate an adaptive halting strategy into HRM that enables `thinking, fast and slow'"
A scheduler that dynamically balances resources based on the necessary depth of reasoning and the available data.
I love how this paper cites parallels with real brains throughout. I believe AGI will be solved as the primitives we're developing are composed to extreme complexity, utilizing many cooperating, competing, communicating, concurrent, specialized "modules." It is apparent to me that the human brain must have this complexity, because it's the only feasible way evolution had to achieve cognition using slow, low-power tissue.
As soon as I read about the high-level/low-level module split, it immediately reminded me of the human brain.
Composition is the whole point of deep learning. Deep as in multilayer, multilevel.
You need recursion at some point: you can't account for all possible scenarios of combinations, as you would need an infinite number of layers.
> infinite number of layers
That’s not as impossible as it seems, Gaussian Processes are equivalent to a Neural Network with infinite hidden units, and any multilayer NN can be approximated by one with a single, larger layer of hidden units.
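For reference, the classical universal approximation statement behind that claim: for any continuous $f$ on a compact set $K \subset \mathbb{R}^n$ and any tolerance $\varepsilon > 0$, a single hidden layer with a non-polynomial activation $\sigma$ suffices,

```latex
\exists\, N,\ \{\alpha_i, b_i \in \mathbb{R},\ w_i \in \mathbb{R}^n\}_{i=1}^{N} \;:\quad
g(x) \;=\; \sum_{i=1}^{N} \alpha_i\, \sigma\!\left(w_i^{\top} x + b_i\right),
\qquad \sup_{x \in K}\, \bigl|\, f(x) - g(x) \,\bigr| \;<\; \varepsilon .
```

The catch, relevant to the depth discussion here, is that the required width N is unbounded and can grow very fast compared with a deeper network computing the same function.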
"a single, larger layer of hidden units"
Does this not mean that the entire model must cycle to operate any given part? Division into concurrent "modules" (the term appearing in this paper) affords optimizing frequency independently and intentionally.
Also, what certainty is there that everything is best modelled with multilayer NNs? Diversity of algorithms, independently optimized, could yield benefits.
Further, can we hope that modularity will create useful points of observability? The inherent serialization that develops between modules could be analyzed, and possibly reveal great insights.
Finally, isn't there a possibility that AGI could be achieved more rapidly by factoring the various processes into discrete modules, as opposed to solving every conceivable difficulty in a monolithic manner, whatever the algorithm?
That's a lot of questions. Seems like identifying possible benefits is easy enough that this approach is worthwhile exploring. We shall see I suppose. At the very least we know the modularization of HRM has a valid precedent: real biological brains.
It would not surprise me if all of these tangential advances in various models and approaches ultimately become part of a larger framework of modules designed to handle certain tasks - similar to how your medulla oblongata regulates breathing and heart rate, your amygdala sorts out memory and hormone production, your cingulate gyrus helps control motor function, and so on.
We have a great example (us), we just need to hone and replicate it.
I mean recurrence is an attempt to allow approximation of recursive processes, no?
I advise scepticism.
This work does have some very interesting ideas, specifically avoiding the costs of backpropagation through time.
However, it does not appear to have been peer reviewed.
The results section is odd. It does not include details of how they performed the assessments, and the only numerical values are in the figure on the front page. The results for ARC2 are (contrary to that figure) not top of the leaderboard (currently 19% compared to HRM's 5%: https://www.kaggle.com/competitions/arc-prize-2025/leaderboa...)
The authors' code is at https://github.com/sapientinc/HRM .
In fields like AI/ML, I'll take a preprint with working code over peer-reviewed work without any code, always, even when the preprint isn't well edited.
Everyone everywhere can review a preprint and its published code, instead of a tiny number of hand-chosen reviewers who are often overworked, underpaid, and on tight schedules.
If the authors' claims hold up, the work will gain recognition. If the claims don't hold up, the work will eventually be ignored. Credentials are basically irrelevant.
Think of it as open-source, distributed, global review. It may be messy and ad-hoc, since no one is in charge, but it works much better than traditional peer review!
I sympathize partially with your views, but how would this work in practice? Where would the review comments be stored? Is one supposed to browse Hacker News to check the validity of a paper?
If a professional reviewer spots a serious problem, the paper will not make it to a conference or journal, saving us a lot of trouble.
Peer review is a way to distribute the work of identifying which papers are potentially worth reading. If you're starting from an individual paper and then ask yourself whether it was peer reviewed or not, you're doing it wrong. If you really need to know, read it yourself and accept that you might just be wasting your time.
If you want to mostly read papers that have already been reviewed, start with people or organizations you trust to review papers in an area you're interested in and read what they recommend. That could be on a personal blog or through publishing a traditional journal, the difference doesn't matter much.
“Find papers that support what you want via online echo chambers” isn’t the advice you want to be giving, but it is the net result of it. Society needs trusted institutions. Not that publishers are the best embodiment of that, but ad hoc blog posts are decidedly not better.
It is totally the advice I want to be giving. Given the choice between an echo chamber matched to my interests and wading through a stream of unfiltered crap, I'll take the echo chamber every time. (Of course there's also the option of not reading papers at all, which is typically a good choice if you're not a subject matter expert and don't intend to put in the work to become one.)
If you choose to focus on the output of a well-known publisher, you're not avoiding echo chambers, you're using a heuristic to hopefully identify a good one.
Those are not the only options; notably, the parent mentioned 'trusted institutions'. That is the best way to defer the filtering to a group of other humans whose collective expertise will surpass any one individual's.
The destruction of trust in both public and private institutions - newspapers, journals, research institutions, universities - and replacement with social media 'influencers' and online echo chambers is how we arrived at the current chaotic state of politics worldwide, the rise of extremist groups, cults, a resurgence of nationalism, religious fanaticism... This is terrible advice.
> If a professional reviewer spots a serious problem
Did that ever happen? :-)
Of course. As usual, you tend to not hear about it when a system we rely on works well.
Your question is like asking "how can I verify this rod is 1m long if I can't ask an expert". The answer is of course you measure it. That's much more reliable than asking an expert. However, the results of many papers take a huge amount of work to replicate, so we've built a network of experts over the years to evaluate them.
But this is open source, so TL;DR: you download the code, run it, and see if it gets the results claimed.
Scepticism is almost always a good idea with ML papers. Once you start publishing regularly in ML conferences, you understand that there is no traditional form of peer review anymore in this domain. The volume of papers has meant that 'peers' are often students coming to grips with parts of the field that rarely align with what they are asked to review. Conference peer review has become a 'vibe check' more than anything.
Real peer review is when other experts independently verify your claims in the arXiv submission through implementation and (hopefully) cite you in their followup work. This thread is real peer review.
I appreciate this insight. It makes you wonder: why even publish a paper if review only amounts to a vibe check? If it's just the code we need, we can get that peer reviewed through other channels.
Because publications is the number that academics have to make go up.
This and the exposure. There are so many papers on arXiv now that people often look to conference or journal publication lists.
The number has clearly ceased its function, so what are we chasing?
Clout, funding, and employment I'd imagine?
THIS is so true but also not limited to ML.
Having been both a publisher and a reviewer across multiple engineering, science, and biomedical disciplines, I can say this occurs across academia.
Skepticism is best expressed by repeating the experiment and comparing results. I'm game and I have 10 days off work next month. I wonder what can be had in terms of full source and data, etc. from the authors?
24 hours on a 4070. Seems quite doable. https://github.com/sapientinc/HRM
Nice! They provide trained checkpoints on their GitHub. Repeating their results would be a good start. https://github.com/sapientinc/HRM
I think that’s too harsh a position solely for not being peer reviewed yet. Neither of the original Mamba 1 and Mamba 2 papers was peer reviewed. That said, strong claims warrant strong proofs, and I’m also trying to reproduce the results locally.
The fact that you are expecting a paper just published to have been peer reviewed already tells me that you are likely not familiar with the process. The first step to have your work peer reviewed is to publish it.
>does not appear to have been peer reviewed
Enough already. Please. The paper + code is here for everybody to read and test. Either it works or it doesn't. Either people will build upon it or they won't. I don't need to wait 20 months for 3 anonymous dudes to figure it out.
Do you consider yourself a peer? Feel free to review it.
A peer reviewer will typically comment that some figures are unclear, that a few relevant prior works have gone uncited, or point out a followup experiment that they should do.
That's about the extent of what peer reviewers do, and basically what you did yourself.
> However, it does not appear to have been peer reviewed.
My observation is that peer reviewers never try to reproduce results or do a basic code audit to check, for example, that there is no data leakage into the training dataset.
Skepticism is an understatement. There are tons of issues with this paper. Why are they comparing results of their expert model, trained from scratch on a single task, to general-purpose reasoning models? It is well established in the literature that you can still beat general-purpose LLMs on narrow-domain tasks with specially trained, small models. The only comparison that would have made sense is one to vanilla transformers using the same number of parameters and trained on the same input-output dataset. But the paper shows no such comparison. In fact, I would be surprised if it was significantly better, because such architecture improvements are usually very modest or not applicable in general. And insinuating that this is some significant development to improve general-purpose AI by throwing in ARC is just straight-up dishonest. I could probably cook up a neural net in PyTorch in a few minutes that beats o3 on some hand-crafted single task that o3 can't solve in an hour. That doesn't mean that I made any progress towards AGI.
Have you spent much time with the ARC-1 challenge? Their results on that are extremely compelling, showing results close to the initial competition's SOTA (as of closing anyway) with a tiny model and no hacks like data augmentation, pretraining, etc that all of the winning approaches leaned on heavily.
Your criticism makes sense for the maze solving and sudoku sets, of course, but I think it kinda misses the point (there are traditional algos that solve those just fine - it's more about the ability of neural nets to figure them out during training, and known issues with existing recurrent architectures).
Assuming this isn't fake news lol.
Looking at the code, there is a lot of data augmentation going on there. For the Sudoku and ARC data sets, they augment every example by a factor of 1,000.
https://github.com/sapientinc/HRM/blob/main/dataset/build_ar...
That's fair, they are relabelling colours and rotating the boards. I meant more like mass generation of novel puzzles to try and train specific patterns. But you are right that technically there is some augmentation going on here, my bad.
Hm, I'm not so sure it's fair play for the Sudoku puzzle. Suggesting that the AI will understand the rules of the game with only 1,000 examples, and then adding 1,000,000 derived examples does not feel fair to me. Those extra examples leak a lot of information about the rules of the game.
I'm not too familiar with the ARC data set, so I can't comment on that.
True, it leaks information about all the symmetries of the puzzle, but that's about it. I guess someone needs to test how much that actually helps - if I get the model running I'll give it a try!
> That's fair, they are relabelling colours and rotating the boards.
Photometric augmentation, Geometric augmentation
> I meant more like mass generation of novel puzzles to try and train specific patterns.
What is the difference between Synthetic Data Generation and Self Play (like AlphaZero)? Don't self play simulations generate synthetic training data as compared to real observations?
I don't know the jargon, but for me the main thing is the distinction between humans injecting additional bits of information into the training set vs the algorithm itself discovering those bits of information. So self-play is very interesting (it's automated as part of the algorithm) but stuff like generating tons of novel sudoku puzzles and adding them to the training set is less interesting (the information is being fed into the training set "out-of-band", so to speak).
In this case I was wrong, the authors are clearly adding bits of information themselves by augmenting the dataset with symmetries (I propose "symmetry augmentation" as a much more sensible phrase for this =P). Since symmetries share a lot of mutual information with each other, I don't think this is nearly as much of a crutch as adding novel data points into the mix before training, but ideally no augmentation would be needed.
I guess you could argue that in some sense it's fair play - when humans are told the rules of sudoku the symmetry is implicit, but here the AI is only really "aware" of the gradient.
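For concreteness, here is a rough sketch of the kind of validity-preserving Sudoku transformations being discussed (digit relabelling, transposition, band/row shuffles). This is my own illustration of "symmetry augmentation", not the authors' build script:

```python
import numpy as np

def augment_sudoku(puzzle, solution, rng):
    """Apply one random validity-preserving transformation to a 9x9 Sudoku pair.
    puzzle and solution are 9x9 int arrays, 0 marking empty cells. These
    symmetries preserve the rules, so every augmented pair is still valid."""
    puzzle, solution = puzzle.copy(), solution.copy()

    # 1. Relabel digits 1..9 with a random permutation (0 stays "empty").
    perm = np.concatenate(([0], rng.permutation(np.arange(1, 10))))
    puzzle, solution = perm[puzzle], perm[solution]

    # 2. Maybe transpose the grid.
    if rng.random() < 0.5:
        puzzle, solution = puzzle.T, solution.T

    # 3. Shuffle the three row-bands, and the rows within each band.
    bands = rng.permutation(3)
    row_order = np.concatenate([3 * b + rng.permutation(3) for b in bands])
    puzzle, solution = puzzle[row_order], solution[row_order]
    return puzzle, solution

# Usage: rng = np.random.default_rng(0); augment_sudoku(p, s, rng)
```

Each call produces a new (puzzle, solution) pair that obeys the same rules, which is exactly why it leaks information about the rules' symmetries and nothing more.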
Symmetry augmentation sounds good for software.
Traditional ML computer-vision (CV) research has perhaps been supplanted by multimodal LLMs that are trained on image-analysis annotations. (CLIP, the Brownian-motion-based DALL-E, and Latent Diffusion were published in 2021. More recent research: Brownian bridges, SDEs, Lévy processes. What are the foundational papers in video genai?)
TOPS are now necessary.
I suspect that existing CV algorithms for feature extraction would also be useful for training LLMs. OpenCV, for example, has open algorithms like ORB (Oriented FAST and Rotated BRIEF), KAZE and AKAZE, and SIFT (patent-free since 2020). SIFT "is highly robust to rotation, scale, and illumination changes".
But do existing CV feature extraction and transform algos produce useful training data for LLMs as-is?
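For concreteness, extracting ORB features with OpenCV looks roughly like this; the random-noise stand-in image is just to keep the snippet self-contained, and whether such descriptors are useful LLM training data as-is remains the open question:

```python
import cv2
import numpy as np

# Stand-in image (random noise); in practice you'd load a real one with cv2.imread(...).
img = np.random.randint(0, 256, (256, 256), dtype=np.uint8)

# ORB: Oriented FAST keypoints + Rotated BRIEF binary descriptors.
orb = cv2.ORB_create(nfeatures=500)
keypoints, descriptors = orb.detectAndCompute(img, None)

# Each descriptor is a 32-byte binary vector.
print(len(keypoints), None if descriptors is None else descriptors.shape)
```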
Similarly, pairing code and tests with a feature transform at training time probably yields better solutions to SWE-bench.
Self-play algos are given the rules of the sim. Are self-play simulations already used as synthetic training data for LLMs and SLMs?
There are effectively rules for generating synthetic training data.
The orbits of the planets might be a good example of where synthetic training data is limited and perhaps we should rely upon real observations at different scales given cost of experimentation and confirmations of scale invariance.
Extrapolations from orbital observations and classical mechanics failed to predict the perihelion precession of Mercury (the first confirmation of general relativity, GR).
To generate synthetic training data from orbital observations where Mercury's 43 arcsecond deviation from Newtonian mechanics was disregarded as an outlier would result in a model overweighted by existing biases in real observations.
Tests of general relativity > Perihelion precession of Mercury https://en.wikipedia.org/wiki/Tests_of_general_relativity#Pe...
Okay, haha, I'm not sure what we're doing here.
I have a list of questions for an AI or an expert, IDK what.
As the other commenter already pointed out, I'll believe it when I see it on the leaderboard. But even then it already lost twice against the winner of last year's competition, because that too was a general purpose LLM that could also do other things.
Let's not move the goalposts here =) I don't think it's really fair to compare them directly like that. But I agree, this is triggering my "too good to be true" reflex very hard.
If anything, they moved the goalpost closer to the starting line. I'm merely putting it back where it belongs.
As a cognitive psychologist, I highly suspected that, broadly speaking, this was the needed direction for AI. See Fuzzy Trace Theory[1].
Fuzzy Trace Theory basically suggests that memory (and cognition generally) works at multiple levels spanning verbatim representations to gist-level representations, that get bound together into memories. Recalling gist, the general idea, along with specific details, allows for powerful generalization and flexible retrieval pathways.
[1] https://pmc.ncbi.nlm.nih.gov/articles/PMC4979567/
If I understand this correctly, it learns the rules of Sudoku by looking at 1,000 examples of (puzzle, solution) pairs. It is then able to solve previously unseen puzzles with 55% accuracy. If given millions of examples, it becomes almost perfect.
This is apparently without pretraining of any sort, which is kind of amazing. In contrast, systems like AlphaZero have the rules to go or chess built-in, and only learn the strategy, not the rules.
Off to their GitHub repository [1] to see this for myself.
[1] https://github.com/sapientinc/HRM
AlphaZero may have the rules built in, but MuZero and the other follow-ups didn't. MuZero not only matched or surpassed AlphaZero, but it did so with less training, especially in the EfficientZero variant; notably also on the Atari playground.
This is "The Bitter Lesson" of AI, no? "More compute beats clever algorithm."
> MuZero not only matched or surpassed AlphaZero, but it did so with less training
Seems the opposite?
Quite the opposite, a clever algorithm needs less compute, and can leverage extra compute even more.
Apologies, "clever" is a poor paraphrase of "domain-specific", or "methods that leveraged human understanding."[0]
0. http://www.incompleteideas.net/IncIdeas/BitterLesson.html
Thanks for pointing that out.
To be fair, MuZero only learns a model of the rules for navigating its search tree. To make actual moves, it gets a list of valid actions from the game engine, so at that level it does not learn the rules of the game.
(HRM possibly does the same, and could be in the same realm as MuZero. It probably makes a lot of illegal moves.)
To follow up, after experimenting a bit with the source code:
1. Please, for the love of God, and for scientific reproducibility, specify library versions explicitly, and use pyproject.toml instead of an incomplete requirements.txt (a minimal sketch follows below).
2. The 1,000 Sudoku examples are augmented with hand-coded permutation algorithms, so the actual input data set is more like 1,000,000 examples, not 1,000.
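To illustrate point 1 above, a minimal pyproject.toml sketch with fully pinned versions; the project name, Python requirement, package names, and version numbers here are placeholders, not the repo's actual dependency list:

```toml
[project]
name = "hrm-repro"          # hypothetical project name
version = "0.1.0"
requires-python = ">=3.10"  # placeholder; match whatever the authors actually used
dependencies = [
    "torch==2.3.1",         # exact pins make the environment reproducible
    "numpy==1.26.4",
    "einops==0.8.0",
]
```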
Do you have a fork or the changes? I might take a look, and python dependency hell on Sunday is no good
> specify library versions explicitly
Sometimes even that is not helpful. It's a pain we have to deal with.
How is it not helpful?
A dependency lock file with resolved versions for both direct and transitive dependencies = reproducible build
I don't know how common this is, but the fschat library maintainers went for at least a year without making an official release or updating the version number in their GitHub repo. So the only way to have both current code and a reproducible build (without just vendoring the fschat library directly, of course) was to pin it to a particular GitHub commit hash, which would get you code that was current but with a version number from 12+ months earlier.
fschat is pretty popular for LLM-related work, so I assume this is at least not unheard-of for other notable third-party libraries.
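For anyone hitting the same issue, pip does support pinning a dependency to an exact upstream commit in requirements.txt; the commit hash below is a placeholder, not a real pin:

```text
# Pin fschat to a specific commit of the upstream repo (hash is a placeholder)
fschat @ git+https://github.com/lm-sys/FastChat.git@0123abcd0123abcd0123abcd0123abcd0123abcd
```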
I don't remember the exact scenario, but it might have been related to the underlying Python or some system library being a little different and then the dependency lock not being compatible with it.
I appreciate the connections with neurology, and the paper itself doesn't ring any alarm bells. I don't think I'd reject it if it fell to me to peer review.
However, I have extreme skepticism when it comes to the applicability of this finding. Based on what they have written, they seem to have created a universal (maybe; adaptable at the very least) constraint-satisfaction solver that learns the rules of the constraint-satisfaction problem from a small number of examples. If true (I have not yet had the leisure to replicate their examples and try them on something else), this is pretty cool, but I do not understand the comparison with CoT models.
CoT models can, in principle, solve _any_ complex task. This needs to be trained on a specific puzzle which it can then solve: it makes no pretense to universality. It isn't even clear that it is meant to be capable of adapting to any given puzzle. I suspect it is not, just based on what I have read in the paper and on the indicative choice of examples they tested it against.
This is kind of like claiming that Stockfish is way smarter than current state of the art LLMs because it can beat the stuffing out of them in chess.
I feel the authors have a good idea here, but that they have marketed it a bit too... generously.
Yes, I agree, but this is a huge deal in and of itself. I suppose the authors had to frame it in this way for obvious reasons of hype surfing, but this is an amazing achievement, especially given the small size of the model! I'd rather use a customized model for a specific problem than a supposedly "generally intelligent" model that burns orders of magnitude more energy for much less reliability.
> CoT models can, in principle, solve _any_ complex task.
What is the justification for this? Is there a mathematical proof? To me, CoT seems like a hack to work around the severe limitations of current LLMs.
That's a fair argument to make. I should have, perhaps, written "are supposed to be able," or "have become famous for their apparent ability to solve loosely-specified arbitrary problems."
CoT _is,_ in my mind at least, a hack that is bolted onto LLMs to create some sort of loose approximation of reasoning. When I read the paper I expected to see a better hack, but could not find anything on how you take this architecture, interesting though it is, and put it to use in a way similar to CoT. The whole paper seems to make a wild pivot between the fully general biomimetic grandeur of its first half and the narrow effectiveness of its second half.
The Universal Approximation Theorem.
I don't see how that changes anything. By this logic, there's no need for CoT reasoning at all, as a single pass should be sufficient. I don't see how that proves that CoT increases capabilities.
> CoT models can, in principle, solve _any_ complex task.
The authors explicitly discuss the expressive power of transformers and CoT in the introduction. They can only solve problems in a fairly restrictive complexity class (lower than PTIME!) - it's one of the theoretical motivations for the new architecture.
"The fixed depth of standard Transformers places them in computational complexity classes such as AC0 [...]"
This architecture by contrast is recurrent with inference time controlled by the model itself (there's a small Q-learning based subnetwork that decides halting time as it "thinks"), so there's no such limitation.
The main meat of the paper is describing how to train this architecture efficiently, as that has historically been the issue with recurrent nets.
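For intuition, here is a minimal sketch of that idea: a recurrent block with a learned halting signal. This is an ACT-style simplification of my own, not the paper's Q-learning halting scheme, and the GRU cell, threshold rule, and sizes are placeholders:

```python
import torch
import torch.nn as nn

class RecurrentWithHalting(nn.Module):
    """Toy adaptive-depth recurrence; the HRM paper uses a Q-learning
    halting head rather than this simple thresholded probability."""
    def __init__(self, dim=128, max_steps=16):
        super().__init__()
        self.cell = nn.GRUCell(dim, dim)
        self.halt_head = nn.Linear(dim, 1)
        self.max_steps = max_steps

    def forward(self, x):                             # x: (batch, dim)
        h = torch.zeros_like(x)
        for step in range(self.max_steps):
            h = self.cell(x, h)                       # refine the state
            p_halt = torch.sigmoid(self.halt_head(h)) # learned stop signal
            if p_halt.mean() > 0.5:                   # simplistic stopping rule
                break
        return h, step + 1

model = RecurrentWithHalting()
state, steps_used = model(torch.randn(4, 128))
```

The point is only that the effective depth is chosen at run time by the model itself, instead of being fixed by the layer count.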
Agreed, regarding the computational simplicity of CoT LLMs, and that this solution certainly has much more flexibility. But is there a reason to believe that this architecture (and training method) is as applicable to the development of generally-capable models as it is to the solution of individual puzzles?
Don't get me wrong, this is a cool development, and I would love to see how this architecture behaves on a constraint-based problem that's not easily tractable via a traditional algorithm.
The ARC-1 problem set that they benchmark on is an example of such a problem, I believe. It's still more or less completely unsolved. They don't solve it either, mind, but they achieve very competitive results with their tiny (27M-parameter) model, competitive with architectures that use extensive pretraining and hundreds of billions of parameters!
That's one of the things that sticks out for me about the paper. Having tried very hard myself to solve ARC it's pretty insane what they're claiming to have done here.
(I think a lot of the sceptics in this thread are unaware of just how difficult ARC-1 is, and are focusing on the Sudoku part, which I agree is much simpler and much less surprising to do well on.)
I hope/fear this HRM model is going to be merged with MoE very soon. Given the huge economic pressure to develop powerful LLMs, I think this could be done in just a month.
The paper seems to only study problems like Sudoku solving, not question answering or other applications of LLMs. Furthermore, they omit a section on future applications or on fusion with current LLMs.
I think anyone working in this field can envision the applications, but the details of combining MoE with an HRM model could be their next paper.
I only skimmed the paper and I am not an expert; surely others can explain why they don't discuss such a structure. Anyway, my post is just blissful ignorance of the complexity involved and of how impossible it is to predict change.
Edit: A more general idea is that Mixture of Experts is related to clusters of concepts, and now we would also have to consider clusters of concepts grouped by the time they take to be grasped. In a sense, the model would then hold in latent space an estimate of the depth, number of layers, and time required for each concept, just as we adapt our reading style for a dense math book versus a short newspaper story.
This HRM is essentially purpose-designed for solving puzzles with a small number of rules interacting in complex ways. Because the number of rules is small, a small model can learn them. Because the model is small, it can be run many times in a loop to resolve all interactions.
In contrast, language modeling requires storing a large number of arbitrary phrases and their relation to each other, so I don't think you could ever get away with a similarly small model. Fortunately, a comparatively small number of steps typically seems to be enough to get decent results.
But if you tried to use an LLM-sized model in an HRM-style loop, it would be dog slow, so I don't expect anyone to try it anytime soon. Certainly not within a month.
Maybe you could have a hybrid where an LLM has a smaller HRM bolted on to solve the occasional constraint-satisfaction task.
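For reference, here's a rough sketch of the kind of nested loop being described (a fast low-level module run for T steps inside a slow high-level one). The GRU cells, dimensions, and step counts below are my own placeholders, not the authors' modules:

```python
import torch
import torch.nn as nn

class TwoTimescaleLoop(nn.Module):
    """Rough sketch of an HRM-style H/L recurrence; not the authors' code."""
    def __init__(self, dim=128, t_low=4, n_high=8):
        super().__init__()
        self.low = nn.GRUCell(2 * dim, dim)   # fast, detailed module
        self.high = nn.GRUCell(dim, dim)      # slow, abstract module
        self.t_low, self.n_high = t_low, n_high

    def forward(self, x):                     # x: (batch, dim) input encoding
        z_h = torch.zeros_like(x)             # high-level state
        z_l = torch.zeros_like(x)             # low-level state
        for _ in range(self.n_high):          # slow outer loop
            for _ in range(self.t_low):       # T fast inner steps
                z_l = self.low(torch.cat([x, z_h], dim=-1), z_l)
            # H-update from L's final state gives L fresh context next round
            z_h = self.high(z_l, z_h)
        return z_h

model = TwoTimescaleLoop()
out = model(torch.randn(4, 128))
```

Even with small cells, the inner x outer loop is where the wall-clock cost comes from, which is why an LLM-sized module in this position looks impractical.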
> In contrast, language modeling requires storing a large number of arbitrary phrases and their relation to each other
A person has a vocabulary of some ~10k words, with words fitting specific slots in a really small set of rules. All combined, we probably have something on the order of a few million rules in a language.
That, yes, is larger than what the model in this paper can handle, but it is nowhere near a problem that should require something the size of a modern LLM. So it's well worth trying to enlarge models with other architectures, trying hybrid models (note that this one is necessarily hybrid already), and exploring every other possibility out there.
What about many small HRM models that solve conceptually distinct subtasks, as determined and routed by a master model that then analyzes and aggregates their outputs, with all of that learned during training?
I must say I am suspicious in this regard, as they don't show applications other than a Sudoku solver and don't discuss downsides.
And the training was only on Sudoku, which means they would need to train a small model for every problem that currently exists.
Back to ML models?
I would assume that training an LLM would be infeasible for a small research lab, so isn't tackling small problems like this unavoidable? Given that current LLMs have clear limitations, I can't think of anything better than developing better architectures on small test cases; then a company can try scaling them later.
Not only on Sudoku; there is also maze solving and ARC-AGI.
Skimming this, there is no reason why a MoE LLM system (whether autoregressive, diffusion, energy-based, or mixed) couldn't be given a nested architecture that duplicates the layout of an HRM. Combining these in different ways should allow for some novel benchmarks around efficiency and quality, which will be interesting.
I've been keeping an eye on this one as well. Based on what the paper claims, this would be huge. But I think, like many here, we are waiting for either confirmation or denial of the claims by third parties. The concept behind it sounds legit, but I'd like to see it in practice.
This is really interesting, but does anyone think this is something that might generalize to ambiguous reasoning situations with more development? I am no expert, but Sudoku and puzzles seem like very well-defined problem spaces.
Is this not a variation of ReAct [1] + Chain-of-Thought + Structured Planning? Or is that too unfair to the authors' work?
[1] - https://arxiv.org/abs/2210.03629
I really like this use of recurrent modules to augment attention-based models, and I think this is a really cool result and a fruitful avenue for future work.
> For ARC-AGI challenge, we start with all input-output example pairs in the training and the evaluation sets ... At test time, we proceed as follows for each test input in the evaluation set: ...
Very often I see people misuse the ARC-AGI data when training. The input examples in the evaluation set are not intended for training your AI system. It is a downside of ARC that its data is (somehow?) complicated enough for the clever people building AI systems to miss the point, and results get reported and compared as a single percentage even when the training data mix makes the comparison inapplicable.
Goodbye captchas I guess? Somehow they are still around.
but does it scale?
Is it just me, or is symbolic (or, as I like to call it, 'video game') AI seeping back into AI?
Perhaps so - but represented in a trainable, neural form. Very exciting!
Natural general intelligences sure seem to work this way.
Training databases is much easier than training neural networks.
But symbolic != hierarchical