🔬 Deep Dives
2026-04-07 · 326 pts · cs.AI · Bowen Ye, Rang Li…
📄 New in cs.AI
However, existing agent benchmarks suffer from three critical limitations: (1) trajectory-opaque grading that checks only final outputs, (2) underspecified safety and robustness evaluation, and (3) narrow modality coverage and interaction paradigms. Experiments on 14 frontier models reveal that: (1) trajectory-opaque evaluation is systematically unreliable, missing 44% of safety violations and 13% of robustness failures that our hybrid pipeline catches; (2) controlled error injection primarily degrades consistency rather than peak capability, with Pass^3 dropping up to 24% while Pass@3 remains stable; (3) multimodal performance varies sharply, with most models performing worse on video than on documents or images, and no single model dominating across all modalities. Beyond benchmarking, Claw-Eval highlights actionable directions for agent development, shedding light on what it takes to build agents that are not only capable but reliably deployable.
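The abstract contrasts Pass@3 with Pass^3 without defining them. A minimal sketch of the distinction, assuming the common reading that Pass@3 counts a task as solved if at least one of three trials succeeds, while Pass^3 requires all three to succeed. The function name `summarize` and the trial data are made up for illustration and are not taken from Claw-Eval.

```python
from typing import Dict, Sequence, Tuple


def summarize(results: Dict[str, Sequence[bool]]) -> Tuple[float, float]:
    """Return (pass_at_k, pass_hat_k) over tasks, where each task has k trials.

    Assumed definitions (not confirmed by the abstract):
      pass@k : fraction of tasks where at least one trial succeeded
      pass^k : fraction of tasks where every trial succeeded
    """
    n_tasks = len(results)
    pass_at_k = sum(any(trials) for trials in results.values()) / n_tasks
    pass_hat_k = sum(all(trials) for trials in results.values()) / n_tasks
    return pass_at_k, pass_hat_k


# Hypothetical outcomes for four tasks, three trials each.
clean = {
    "task_1": [True, True, True],
    "task_2": [True, True, True],
    "task_3": [False, False, False],
    "task_4": [True, True, True],
}

# Same tasks with hypothetical injected errors: some trials now fail,
# but every previously solvable task still succeeds at least once.
with_errors = {
    "task_1": [True, False, True],
    "task_2": [True, True, True],
    "task_3": [False, False, False],
    "task_4": [True, True, False],
}

clean_scores = summarize(clean)
error_scores = summarize(with_errors)
print("clean:      ", clean_scores)
print("with errors:", error_scores)
```

In this toy data the at-least-once score is unchanged while the all-trials score falls, which is the shape of result the abstract describes: error injection hurts consistency more than peak capability.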
2026-04-08 · 4 pts · 3 comments · ⭐ 3.0k today
⭐ 3.0k GitHub stars today · ⭐ 34.1k total stars
The agent that grows with you
2026-04-07 · 7 pts · cs.LG · cs.AI · Guhao Feng, Shengjie Luo…
📄 New in cs.LG, cs.AI
Test-Time Training (TTT) offers a compelling alternative by updating a subset of model parameters (fast weights) at inference time, yet its potential in the current LLM ecosystem is hindered by critical barriers including architectural incompatibility, computational inefficiency and misaligned fast weight objectives for language modeling. In this work, we introduce In-Place Test-Time Training (In-Place TTT), a framework that seamlessly endows LLMs with Test-Time Training ability. Collectively, our results establish In-Place TTT as a promising step towards a paradigm of continual learning in LLMs.
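As a rough illustration of the general fast-weight idea only, and not of In-Place TTT itself, the toy sketch below freezes a "slow" weight and adapts a separate "fast" weight by gradient descent on examples seen at inference time. The scalar model, the squared-error objective, and every name here (`predict`, `adapt_fast_weight`, `lr`, `steps`) are assumptions made for illustration.

```python
from typing import List, Tuple


def predict(w_slow: float, w_fast: float, x: float) -> float:
    """Toy scalar model: the effective weight is slow + fast."""
    return (w_slow + w_fast) * x


def adapt_fast_weight(
    w_slow: float,
    context: List[Tuple[float, float]],
    lr: float = 0.1,
    steps: int = 50,
) -> float:
    """Fit only the fast weight to (x, y) pairs observed at inference time.

    The slow weight is never modified; gradient descent on mean squared
    error updates the fast weight alone.
    """
    w_fast = 0.0
    for _ in range(steps):
        grad = sum(
            2.0 * (predict(w_slow, w_fast, x) - y) * x for x, y in context
        ) / len(context)
        w_fast -= lr * grad
    return w_fast


# The pretrained (slow) weight models y = 1.0 * x, but the context seen
# at test time follows y = 1.5 * x.
w_slow = 1.0
context = [(1.0, 1.5), (2.0, 3.0)]

w_fast = adapt_fast_weight(w_slow, context)
adapted_prediction = predict(w_slow, w_fast, 4.0)
print(w_fast, adapted_prediction)
```

The point of the split is that adaptation is cheap and reversible: discarding `w_fast` restores the original model. How the paper chooses which parameters act as fast weights, and what objective it trains them on, is not stated in the abstract.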
2026-04-08 · 10 pts · ⭐ 99 today
⭐ 99 stars today on GitHub · ⭐ 56.2k total stars
12 Lessons to Get Started Building AI Agents
2026-04-08 · 171 pts · 188 comments
💬 Major HN discussion (188 comments)
⚡ Quick Signals
Agent / Python
GitHub
lyogavin / airllm
2 pts · ⭐ 61 stars today on GitHub · ⭐ 15.1k total stars
Demystifying / Pruning
Iran / Agree
Project / Glasswing
Card / Claude
Neural / Networks