Daily AI Digest — 2026-04-08

🔬 Deep Dives

arXiv Claw-Eval: Toward Trustworthy Evaluation of Autonomous Agents

2026-04-07 · 326 pts · cs.AI · Bowen Ye, Rang Li…

📄 New in cs.AI

Existing agent benchmarks suffer from three critical limitations: (1) trajectory-opaque grading that checks only final outputs, (2) underspecified safety and robustness evaluation, and (3) narrow modality coverage and interaction paradigms. Experiments on 14 frontier models reveal that: (1) trajectory-opaque evaluation is systematically unreliable, missing 44% of safety violations and 13% of robustness failures that our hybrid pipeline catches; (2) controlled error injection primarily degrades consistency rather than peak capability, with Pass^3 dropping up to 24% while Pass@3 remains stable; (3) multimodal performance varies sharply, with most models performing worse on video than on document or image, and no single model dominating across all modalities. Beyond benchmarking, Claw-Eval highlights actionable directions for agent development, shedding light on what it takes to build agents that are not only capable but reliably deployable.
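The excerpt does not define Pass@3 and Pass^3. A minimal sketch, assuming the common convention that Pass@k counts a task as solved if at least one of k trials succeeds and Pass^k only if all k trials succeed. The function names and the trial outcomes below are hypothetical illustrations, not data from the paper.

```python
from typing import Sequence


def pass_at_k(trials: Sequence[Sequence[bool]]) -> float:
    """Fraction of tasks where at least one of the k trials succeeded."""
    return sum(any(task) for task in trials) / len(trials)


def pass_hat_k(trials: Sequence[Sequence[bool]]) -> float:
    """Fraction of tasks where every one of the k trials succeeded."""
    return sum(all(task) for task in trials) / len(trials)


# Hypothetical outcomes: 4 tasks, 3 trials each (not taken from the paper).
clean = [
    [True, True, True],
    [True, True, True],
    [True, True, False],
    [False, False, False],
]
# Same tasks with injected errors: each solvable task still succeeds
# at least once, but no task succeeds on all three trials.
noisy = [
    [True, False, True],
    [True, True, False],
    [False, True, False],
    [False, False, False],
]

for name, runs in (("clean", clean), ("noisy", noisy)):
    print(name, "Pass@3 =", pass_at_k(runs), "Pass^3 =", pass_hat_k(runs))
```

Under this assumed definition, Pass@3 tracks peak capability and Pass^3 tracks consistency, which is why one can stay flat while the other drops.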

GitHub NousResearch / hermes-agent

2026-04-08 · 4 pts · 3 comments · ⭐ 3.0k today

⭐ 3.0k GitHub stars today · ⭐ 34.1k total stars

The agent that grows with you

arXiv In-Place Test-Time Training

2026-04-07 · 7 pts · cs.LG · cs.AI · Guhao Feng, Shengjie Luo…

📄 New in cs.LG, cs.AI

Test-Time Training (TTT) offers a compelling alternative by updating a subset of model parameters (fast weights) at inference time, yet its potential in the current LLM ecosystem is hindered by critical barriers including architectural incompatibility, computational inefficiency and misaligned fast weight objectives for language modeling. In this work, we introduce In-Place Test-Time Training (In-Place TTT), a framework that seamlessly endows LLMs with Test-Time Training ability. Collectively, our results establish In-Place TTT as a promising step towards a paradigm of continual learning in LLMs.
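To make the "fast weights updated at inference time" idea concrete, here is a generic toy sketch. It is not the In-Place TTT method, whose details the excerpt does not give. It assumes a linear predictor whose weights take one gradient step per incoming item on a squared-error objective; `ttt_step`, `run_stream`, the learning rate, and the toy stream are all hypothetical.

```python
from typing import List, Sequence, Tuple


def ttt_step(
    w: List[float], x: Sequence[float], target: float, lr: float
) -> Tuple[List[float], float]:
    """One gradient step on 0.5 * (w.x - target)^2; returns new weights and pre-update loss."""
    pred = sum(wi * xi for wi, xi in zip(w, x))
    err = pred - target
    new_w = [wi - lr * err * xi for wi, xi in zip(w, x)]
    return new_w, 0.5 * err * err


def run_stream(
    stream: Sequence[Tuple[Sequence[float], float]], dim: int, lr: float = 0.1
) -> Tuple[List[float], List[float]]:
    """Process a sequence at inference time, adapting the fast weights as items arrive."""
    w = [0.0] * dim  # fast weights, reset for each new sequence (an assumption)
    losses = []
    for x, target in stream:
        w, loss = ttt_step(w, x, target, lr)
        losses.append(loss)
    return w, losses


# Toy stream: the same input/target pair repeated, standing in for a
# self-supervised signal derived from the context itself.
stream = [([1.0, 2.0], 3.0)] * 5
w, losses = run_stream(stream, dim=2)
print(losses)
```

In this toy setting the loss on the repeated item shrinks with every step, which is the sense in which the model adapts to its context while it runs.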

GitHub microsoft / ai-agents-for-beginners

2026-04-08 · 10 pts · ⭐ 99 today

⭐ 99 stars today on GitHub · ⭐ 56.2k total stars

12 Lessons to Get Started Building AI Agents

HF GBQA: A Game Benchmark for Evaluating LLMs as Quality Assurance Engineers

2026-04-08 · 171 pts · 188 comments

💬 Major HN discussion (188 comments)

⚡ Quick Signals

Agent / Python

arXiv Paper Circle: An Open-source Multi-agent Research Discovery and Analysis Framework 1 pts · 📄 New in cs.CL

arXiv Action Images: End-to-End Policy Learning via Multiview Video Generation 📄 New in cs.CV, cs.RO

arXiv MMEmb-R1: Reasoning-Enhanced Multimodal Embedding with Pair-Aware Selection and Adaptive Control 📄 New in cs.CV, cs.AI

HF ThinkTwice: Jointly Optimizing Large Language Models for Reasoning and Self-Refinement 107 pts · 💬 Active HN discussion (49 comments)

arXiv Scientific Graphics Program Synthesis via Dual Self-Consistency Reinforcement Learning 📄 New in cs.CV, cs.AI

GitHub lyogavin / airllm 2 pts · ⭐ 61 stars today on GitHub · ⭐ 15.1k total stars

GitHub HKUDS / DeepTutor ⭐ 168 GitHub stars today · ⭐ 12.6k total stars

GitHub vectorize-io / hindsight ⭐ 160 GitHub stars today

HN Revision Demoparty 2026: Razor1911 [video] 🔥 60 pts on Hacker News

arXiv A Multi-Stage Validation Framework for Trustworthy Large-scale Clinical Information Extraction using Large Language Models 6 pts · 📄 New in cs.CL, cs.AI

Demystifying / Pruning

HF Demystifying When Pruning Works via Representation Hierarchies 218 pts · 💬 Active HN discussion (93 comments)

Iran / Agree

HN US and Iran agree to provisional ceasefire 448 pts · 💬 Major HN discussion (1.2k comments)

Project / Glasswing

HN Project Glasswing: Securing critical software for the AI era 1.2k pts · 💬 Major HN discussion (591 comments) · 🔥 Trendi…

Card / Claude

HN System Card: Claude Mythos Preview [pdf] 658 pts · 💬 Major HN discussion (475 comments) · 🔥 Trendi…

Neural / Networks

RSS What Convolutional Neural Networks Taught Me About Life 🆕 New article
