UQ: Assessing Language Models on Unsolved Questions
“We introduce UQ, a testbed of 500 challenging, diverse questions sourced from Stack Exchange, spanning topics from CS theory and math to sci-fi and history, probing capabilities including reasoning, factuality, and browsing. UQ is difficult and realistic by construction: unsolved questions are often hard and naturally arise when humans seek answers, thus solving them yields direct real-world value.”
UQ: A Novel Benchmark for Language Model Evaluation on Unsolved Questions