Terminal Bench 2
3 mentions across 1 person
All mentions
“I think [it] is probably one of the more kind of, like, popular coding benchmarks right now. You can actually see they have, like, the agent harness and then the model, and so you can see the variation in performance, and Claude Code is not at the top of that.”
The Evolution of LLM Agent Development from Scaffolds to Long-Horizon Agent Harn ↗