3 min read · from Machine Learning

[R] Controlled experiment: giving an LLM agent access to CS papers during automated hyperparameter search improves results by 3.2%


Ran a controlled experiment measuring whether LLM coding agents benefit from access to research literature during automated experimentation.

Setup:

Two identical runs using Karpathy's autoresearch framework: a Claude Code agent optimizing a ~7M-param GPT-2 on TinyStories. M4 Pro, 100 experiments each, same seed config. The only variable: one agent had access to an MCP server that performs full-text search over 2M+ CS papers and returns synthesized methods with citations.

Results:

| | Without papers | With papers |
|---|---|---|
| Experiments run | 100 | 100 |
| Papers considered | 0 | 520 |
| Papers cited | 0 | 100 |
| Techniques tried | standard | 25 paper-sourced |
| Best improvement | 3.67% | 4.05% |
| 2hr val_bpb | 0.4624 | 0.4475 |

The gap was 3.2% and still widening at the 2-hour mark.

Techniques the paper-augmented agent found:

  • AdaGC — adaptive gradient clipping (Feb 2025)
  • sqrt batch scaling rule (June 2022)
  • REX learning rate schedule
  • WSD cooldown scheduling
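Of the schedules above, WSD (warmup-stable-decay) is the easiest to sketch: linear warmup, a long flat plateau at the base rate, then a cooldown to zero over the final stretch of training. A minimal illustration — the function name and the warmup/cooldown fractions below are illustrative defaults, not values from this experiment:

```python
def wsd_lr(step, max_steps, base_lr, warmup_frac=0.05, cooldown_frac=0.2):
    """Warmup-Stable-Decay schedule: linear warmup, flat plateau,
    then a linear cooldown to zero over the final cooldown_frac of steps.
    (A 1 - sqrt cooldown shape is also common in the literature.)"""
    warmup = int(warmup_frac * max_steps)
    cooldown_start = int((1 - cooldown_frac) * max_steps)
    if step < warmup:
        return base_lr * (step + 1) / warmup          # linear warmup
    if step < cooldown_start:
        return base_lr                                # stable plateau
    frac = (step - cooldown_start) / (max_steps - cooldown_start)
    return base_lr * (1 - frac)                       # linear cooldown
```

One practical appeal of WSD is that the plateau can be extended without retraining from scratch; only the short cooldown phase needs to be rerun from a plateau checkpoint.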

What didn't work:

  • DyT (Dynamic Tanh) — incompatible with architecture
  • SeeDNorm — same issue
  • Several paper techniques were tried and reverted after failing to improve metrics

Key observation: Both agents attempted halving the batch size. Without literature access, the agent didn't adjust the learning rate — the run diverged. With access, it retrieved the sqrt scaling rule, applied it correctly on first attempt, then successfully halved again to 16K.
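The sqrt scaling rule the agent retrieved is simple to state: when the batch size changes by a factor k, scale the learning rate by sqrt(k) rather than keeping it fixed, which roughly preserves per-step gradient noise for Adam-style optimizers. A minimal sketch (the function name and example values are illustrative, not taken from the run):

```python
import math

def scale_lr(base_lr, base_batch, new_batch):
    """Square-root LR scaling rule: lr_new = lr_old * sqrt(B_new / B_old).
    Keeps gradient-noise scale roughly constant across batch-size changes;
    the linear rule (lr * B_new / B_old) is the usual alternative for SGD."""
    return base_lr * math.sqrt(new_batch / base_batch)

# Halving a batch (e.g. 32K -> 16K tokens, numbers illustrative)
# suggests multiplying the learning rate by sqrt(0.5) ~= 0.707.
```

This matches the failure mode described above: halving the batch while leaving the learning rate unchanged leaves the effective step size too large for the noisier gradients, which is consistent with the divergence the paper-less agent hit.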

Interpretation:

The agent without papers was limited to techniques already encoded in its weights — essentially the "standard ML playbook." The paper-augmented agent accessed techniques published after its training cutoff (AdaGC, Feb 2025) and surfaced techniques it may have seen during training but didn't retrieve unprompted (sqrt scaling rule, 2022).

This was deliberately tested on TinyStories — arguably the most well-explored small-scale setting in ML — to make the comparison harder. The effect would likely be larger on less-explored problems.

Limitations: Single run per condition. The model is tiny (7M params). Some of the improvement may come from the agent spending more time reasoning about each technique rather than the paper content itself. More controlled ablations needed.

I built the paper search MCP server (Paper Lantern) for this experiment. Free to try: https://code.paperlantern.ai

Full writeup with methodology, all 15 paper citations, and appendices: https://www.paperlantern.ai/blog/auto-research-case-study

Would be curious to see this replicated at larger scale or on different domains.

submitted by /u/kalpitdixit


Tagged with

#LLM
#hyperparameter search
#MCP server
#automated experimentation
#TinyStories
#adaptive gradient clipping
#sqrt scaling rule
#REX learning rate schedule
#WSD cooldown scheduling