Reproducing HRM Experiments: A Deep Dive Into Sudoku-Extreme and ARC-AGI Results
Hey guys! We at HigherOrderCO were super intrigued by the Hierarchical Reasoning Model (HRM)'s impressive results, especially its modest compute budget compared to models like Large Language Models (LLMs). So we decided to roll up our sleeves and reproduce the experiments ourselves, focusing on Sudoku-Extreme and ARC-AGI. In this post we'll share our setup, the challenges we hit, and the results we observed.
Sudoku-Extreme 9x9 Reproduction
First up, we tackled the Sudoku-Extreme 9x9 experiment. We wanted to see if we could replicate the results using the configurations outlined in the README. We used a single H200 GPU for this, and the training clocked in at around an hour – not bad, right?
Training Setup
Our training process followed the exact steps in the README. We kicked things off with this command:
OMP_NUM_THREADS=1 torchrun --nproc-per-node 1 pretrain.py data_path=data/sudoku-extreme-1k-aug-1000 epochs=20000 eval_interval=2000 lr=1e-4 puzzle_emb_lr=1e-4 weight_decay=1.0 puzzle_emb_weight_decay=1.0
This command launches a single-GPU training run (with OMP_NUM_THREADS=1 pinning OpenMP to one CPU thread) on the 1k-example augmented Sudoku-Extreme dataset: 20,000 epochs, evaluation every 2,000 epochs, a learning rate of 1e-4 for both the model weights and the puzzle embeddings, and a weight decay of 1.0 on both. Getting these overrides right matters, since they are the values the README specifies for this experiment.
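Just to make those settings concrete, here's a tiny sketch of the overrides as a plain config object. The field names simply mirror the command-line overrides above; they're illustrative, not the actual schema pretrain.py uses.

```python
from dataclasses import dataclass

# Illustrative only: these fields mirror the command-line overrides above,
# not the real config schema inside pretrain.py.
@dataclass
class SudokuRunConfig:
    data_path: str = "data/sudoku-extreme-1k-aug-1000"
    epochs: int = 20_000              # total training epochs
    eval_interval: int = 2_000        # run evaluation every 2,000 epochs
    lr: float = 1e-4                  # learning rate for the model weights
    puzzle_emb_lr: float = 1e-4       # separate learning rate for puzzle embeddings
    weight_decay: float = 1.0
    puzzle_emb_weight_decay: float = 1.0

print(SudokuRunConfig())
```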
Evaluation Process
Next, we evaluated our model using this command:
OMP_NUM_THREADS=1 torchrun --nproc-per-node 1 evaluate.py checkpoint=checkpoints/Sudoku-extreme-1k-aug-1000\ ACT-torch/HierarchicalReasoningModel_ACTV1\ loose-caracara/step_26040
This command loads the trained model from the checkpoint and evaluates its performance on the Sudoku-Extreme dataset. We were eager to see how well our model would perform compared to the original results.
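If you want to poke at a checkpoint before running the full evaluation, a plain PyTorch load is enough to see what's inside. This is a generic sketch, not the repo's evaluate.py logic, and the path is just the one from our run above.

```python
import torch

# Path from our run above; your auto-generated run name will differ.
ckpt_path = ("checkpoints/Sudoku-extreme-1k-aug-1000 ACT-torch/"
             "HierarchicalReasoningModel_ACTV1 loose-caracara/step_26040")

# Load on CPU so this works on a GPU-less machine. Depending on how the
# checkpoint was saved, newer PyTorch versions may need weights_only=False.
state = torch.load(ckpt_path, map_location="cpu")

# Checkpoints are typically a state_dict (or a dict containing one);
# printing the top-level keys tells you which case you're looking at.
if isinstance(state, dict):
    print(list(state.keys())[:10])
```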
The Results: A Slight Discrepancy
Here's where things got interesting. Our evaluation gave us these results:
- Accuracy: 45.8% (a bit lower than the reported 55% in the paper)
- Perfect Halting Accuracy: Spot on!
- Total Parameters: 27,275,266
The 45.8% accuracy was a tad disappointing, sitting roughly 9 percentage points below the 55% reported in the paper. On the other hand, perfect halting accuracy was a positive sign, indicating the model was correctly deciding when to stop its reasoning process, and the parameter count matched the expected value.
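For reference, that parameter figure is just the sum of element counts across the model's tensors, which is easy to check yourself once the model is instantiated. Here's a generic PyTorch sketch; the demo module is a stand-in, and with the actual HRM Sudoku model the count should come out to 27,275,266.

```python
import torch.nn as nn

def count_parameters(model: nn.Module) -> int:
    """Total number of elements across all parameters of a module."""
    return sum(p.numel() for p in model.parameters())

# Tiny stand-in module so the function runs as-is; swap in the HRM model
# to reproduce the 27,275,266 figure from the Sudoku run.
demo = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))
print(count_parameters(demo))  # 676 for this toy module
```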
ARC-AGI-1 Experiment: Scaling Up the Challenge
Feeling up for a bigger challenge, we moved on to reproducing the ARC-AGI-1 experiment. This one required more horsepower, so we used 8 H200 GPUs. The runtime for this experiment was roughly 24 hours.
Dataset Creation
First, we built the dataset using this command:
python dataset/build_arc_dataset.py
This script generates the necessary dataset for training the model on the ARC-AGI-1 task.
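We won't dig into the script's internals here, but a quick sanity check after it finishes is to list what it actually wrote under data/ (a generic sketch; file names and layout depend on the script):

```python
import os

# Walk the data/ folder (the same parent directory used by the Sudoku run)
# and print every file the build script produced, with its size in MB.
for root, _dirs, files in os.walk("data"):
    for name in sorted(files):
        path = os.path.join(root, name)
        print(f"{path}  ({os.path.getsize(path) / 1e6:.1f} MB)")
```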
Training Phase
We then trained the model with the following command:
OMP_NUM_THREADS=8 torchrun --nproc-per-node 8 pretrain.py
This command launches the training process across all 8 GPUs, with torchrun spawning one process per GPU to speed things up. Setting OMP_NUM_THREADS is important for controlling the number of CPU threads OpenMP uses per process, which can affect data loading and overall throughput.
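As a reminder of what torchrun does under the hood, here's a minimal, generic sketch of how a training script picks up the per-process rank that torchrun sets and wraps its model for data-parallel training. This is standard PyTorch distributed boilerplate, not a copy of pretrain.py:

```python
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def setup_ddp(model: torch.nn.Module) -> torch.nn.Module:
    # torchrun exports RANK, WORLD_SIZE, and LOCAL_RANK for each of the 8 processes.
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # NCCL is the usual backend for multi-GPU training on a single node;
    # rendezvous details come from the environment torchrun already set up.
    dist.init_process_group(backend="nccl")

    # Each process keeps a full model replica; gradients are averaged across GPUs.
    return DDP(model.cuda(local_rank), device_ids=[local_rank])
```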
Evaluating the ARC-AGI Model
Finally, we evaluated the trained model using this command:
OMP_NUM_THREADS=8 torchrun --nproc-per-node 8 evaluate.py checkpoint=<CHECKPOINT_PATH>
Make sure to replace <CHECKPOINT_PATH> with the actual path to your saved model checkpoint. This command evaluates the model's performance on the ARC-AGI-1 dataset.
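One caveat on metrics: we read the headline "accuracy" as per-puzzle exact match, i.e. a prediction only counts if the entire output grid is right. If that's the right reading, the metric boils down to something like this sketch (our own formulation, not the repo's evaluation code):

```python
import numpy as np

def exact_match_accuracy(preds: np.ndarray, targets: np.ndarray) -> float:
    """Fraction of puzzles whose entire predicted grid matches the target.

    preds, targets: integer arrays of shape (num_puzzles, height, width).
    """
    # A puzzle counts as correct only if every single cell is right.
    per_puzzle = (preds == targets).all(axis=(1, 2))
    return float(per_puzzle.mean())

# Tiny demo: 2 of 3 toy "grids" match exactly -> 0.666...
preds = np.array([[[1, 2], [3, 4]], [[1, 1], [1, 1]], [[0, 0], [0, 0]]])
targets = np.array([[[1, 2], [3, 4]], [[1, 1], [1, 1]], [[0, 1], [0, 0]]])
print(exact_match_accuracy(preds, targets))
```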
ARC-AGI Results: A Wider Gap
Our results for the ARC-AGI-1 experiment were:
- Accuracy: ~25% (a noticeable 15-percentage-point drop from the reported 40%)
- Halting Accuracy: 58%
- Parameters: 27,276,290
The accuracy result was the biggest surprise here. We got around 25%, a full 15 percentage points below the reported 40%. Halting accuracy came in at 58%, and the parameter count was consistent with expectations. This discrepancy raised some questions for us.
Key Findings and Discussion
So, we successfully reproduced the HRM training and evaluation pipeline end to end, which is a great step! But the burning question is: why did we see a roughly 9-point accuracy drop on Sudoku-Extreme and a significant 15-point drop on ARC-AGI?
Potential Reasons for the Discrepancies
- Training Time: We stumbled upon a tweet mentioning that the ARC training took between 50 and 200 hours. The exact setup wasn't shared, but it suggests the authors may have trained for much longer than our 24 hours, and longer training runs often translate into better final accuracy.
- Hyperparameter Tuning: It's possible that subtle tweaks to the training setup or hyperparameters could account for the difference. Small changes in learning rates, weight decay, or other settings can sometimes have a significant impact on the final results.
- Dataset Variations: Even with the same dataset generation script, there might be slight variations in the dataset used. Different random seeds or data augmentation choices could lead to subtle differences in the training data, affecting the model's performance (see the seeding sketch after this list).
- Hardware and Software Differences: While we used powerful H200 GPUs, the exact hardware and software environment can also play a role. Differences in GPU drivers, PyTorch versions, or other system-level configurations could contribute to the discrepancies.
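Since a couple of the points above come down to plain run-to-run randomness, this is the kind of seeding we'd pin down before comparing runs head-to-head. It's a generic sketch; the repo may already handle some or all of this internally.

```python
import os
import random

import numpy as np
import torch

def seed_everything(seed: int = 0) -> None:
    """Pin the common sources of randomness so repeated runs are comparable."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    # Some CUDA kernels are nondeterministic by default; these settings trade
    # a little speed for more reproducible runs where PyTorch supports it.
    os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

seed_everything(42)
```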
The Intriguing Efficiency of HRM
Despite the accuracy differences, it's still super impressive that HRM reached about 25% accuracy on ARC-AGI with just 960 examples and 24 hours of training on 8 H200s (roughly 192 GPU-hours). This highlights the model's potential for efficient learning, especially when compared to the vast amounts of data and compute typically required by Large Language Models (LLMs). The efficiency demonstrated by HRM is a key strength that warrants further exploration.
Conclusion: Reproducibility and the Quest for Understanding
Our journey to reproduce the HRM experiments has been both enlightening and intriguing. While we didn't exactly match the original accuracy results, we successfully replicated the experiments and gained valuable insights into the model's behavior. The discrepancies we observed highlight the importance of carefully documenting experimental setups and considering the potential impact of factors like training time, hyperparameters, and dataset variations.
The fact that HRM achieved a reasonable level of performance on ARC-AGI with limited data and compute is a testament to its efficiency. We're excited to continue exploring this model and understanding its strengths and limitations. Hopefully, this detailed account of our reproduction efforts will help others in the community as well! We believe that sharing our findings and engaging in open discussions is crucial for advancing research in AI and machine learning. Let's keep the conversation going and work together to unlock the full potential of these exciting models!