May-June 2025 Progress in Guaranteed Safe AI
Benchmarks! Get your benchmarks! A startup and a nonprofit lab each launch.
There will be an AIxFM conference in the Bay Area in Q4, according to a little birdie.
Morph ships big autoformalization result in 3599 lines of Lean
The decomposition into 67 lemmas in the LaTeX/Lean blueprint was done by humans, with human spot-checking along the way. Still, I’m impressed with their system (called Trinity).
I’d like to know how expensive (in tokens, or some other compute metric) it was to do this!
On Verified Superintelligence
I of course have opinions on their blog post Verified Superintelligence.
Today's most advanced AI systems—reasoning LLMs trained with supervised RL—have hit a fundamental wall. They can only improve on problems where we can verify the (known) answers. Every math problem needs a known solution. Every coding challenge requires test cases. Every reasoning task demands ground truth.
I think I know where they’re going with this: “verification” in the sense of supervised RL as a cognate with “verification” in the sense of typechecking.
Perhaps most importantly, this approach offers a path to what we call "trustless alignment." By ensuring all outputs can be independently validated, we create AI systems whose safety doesn't depend on faith in their training process or architecture. The proof is in the output.
This effectively describes a boxing strategy, not alignment. Boxing is when you police the boundary between the AI and the world. Alignment is when you don’t need to. Is this an important distinction, or a narcissism of small differences you’d only notice if you have Lesswrong Disease (the disease they diagnose you with when you read or write on Lesswrong)? Being afflicted myself, I’m not sure how I would tell.
Two benchmarks
VERINA
Large language models (LLMs) are increasingly integrated in software development, but ensuring correctness in LLM-generated code remains challenging and often requires costly manual review. Verifiable code generation -- jointly generating code, specifications, and proofs of code-specification alignment -- offers a promising path to address this limitation and further unleash LLMs' benefits in coding. Yet, there exists a significant gap in evaluation: current benchmarks often lack support for end-to-end verifiable code generation. In this paper, we introduce Verina (Verifiable Code Generation Arena), a high-quality benchmark enabling a comprehensive and modular evaluation of code, specification, and proof generation as well as their compositions. Verina consists of 189 manually curated coding tasks in Lean, with detailed problem descriptions, reference implementations, formal specifications, and extensive test suites. Our extensive evaluation of state-of-the-art LLMs reveals significant challenges in verifiable code generation, especially in proof generation, underscoring the need for improving LLM-based theorem provers in verification domains. The best model, OpenAI o4-mini, generates only 61.4% correct code, 51.0% sound and complete specifications, and 3.6% successful proofs, with one trial per task. We hope Verina will catalyze progress in verifiable code generation by providing a rigorous and comprehensive benchmark. We release our dataset on this https URL and our evaluation code on this https URL.
CLEVER
We introduce CLEVER, a high-quality, curated benchmark of 161 problems for end-to-end verified code generation in Lean. Each problem consists of (1) the task of generating a specification that matches a held-out ground-truth specification, and (2) the task of generating a Lean implementation that provably satisfies this specification. Unlike prior benchmarks, CLEVER avoids test-case supervision, LLM-generated annotations, and specifications that leak implementation logic or allow vacuous solutions. All outputs are verified post-hoc using Lean's type checker to ensure machine-checkable correctness. We use CLEVER to evaluate several few-shot and agentic approaches based on state-of-the-art language models. These methods all struggle to achieve full verification, establishing it as a challenging frontier benchmark for program synthesis and formal reasoning. Our benchmark can be found on GitHub (this https URL) as well as HuggingFace (this https URL). All our evaluation code is also available online (this https URL).
Welcome to the summer of benchmarks. VERINA and CLEVER each critically cite a paper we saw in a previous newsletter to patch some of its issues, especially around completeness of specification synthesis.
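To make the three-way evaluation concrete, here is roughly what a single end-to-end instance looks like, shrunk to a toy problem. This is my own illustrative sketch, not a task drawn from either benchmark (the names myMax and myMax_spec are made up), and it assumes a recent Lean 4 toolchain; the point is that the model has to produce all three artifacts, and only Lean’s typechecker is trusted to judge them.

```lean
-- Toy stand-in for a verifiable-code-generation task (not from VERINA or CLEVER).
-- Task description: return the maximum of two natural numbers.

-- (1) code: the implementation the model would generate
def myMax (a b : Nat) : Nat :=
  if a ≤ b then b else a

-- (2) specification: the result is an upper bound on both inputs
--     and is equal to one of them
def myMax_spec (a b r : Nat) : Prop :=
  a ≤ r ∧ b ≤ r ∧ (r = a ∨ r = b)

-- (3) proof: the implementation meets the specification,
--     checked by Lean's kernel rather than by test cases
theorem myMax_correct (a b : Nat) : myMax_spec a b (myMax a b) := by
  unfold myMax myMax_spec
  by_cases h : a ≤ b <;> simp [h] <;> omega
```

VERINA’s headline numbers map onto exactly these three pieces: the best model gets piece (1) right 61.4% of the time and piece (3) only 3.6% of the time, with one attempt per task.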
The Theorem launch
Friends of the newsletter Jason and Rajashree launched Theorem (YC W25)
MISSION: Theorem is an AI and programming languages research lab. Our question: as the models get better at generating code, what is the bottleneck to safely deploying vastly more computation in the world?
Code has bugs. AI-enabled attackers will increase the volume and velocity of attacks on software infrastructure. Human code review will not scale.
AIs are untrusted generators. In order to deploy AIs in critical systems without fully understanding their generalization behavior, we’ll need robust, scalable methods for overseeing their solutions.
I’ve seen some of the tech and it’s fantastic; when I think about projects I could do, I try only to do things that Theorem wouldn’t be able to do better.
Proof Carrying Code Completions
It would’ve been great to cover this paper when it came out, but it’s relevant enough to discuss even though I’m late.
Code completions produced by today’s large language models (LLMs) offer no formal guarantees. We propose proof-carrying code completions (PC^3). In this paradigm, a high-resourced entity (the LLM provided by the server) must provide a code completion together with a proof of a chosen safety property which can be independently checked by a low-resourced entity (the user). In order to provide safety proofs without requiring the user to write specifications in formal logic, we statically generate preconditions for all dangerous function calls (i.e., functions that may violate the safety property) which must be proved by the LLM.
To demonstrate the main ideas, we provide a prototype implementation in the program verification language Dafny, and a case study focusing on file system vulnerabilities. Unlike Python code generated by GPT-4, Dafny code generated by PC^3 provably avoids a common weakness related to path traversal (CWE-35), using a single generation attempt (k = 1) and a modest number of tokens (3,350). Our tool is available as an open source repository at https://github.com/DavisPL/PCCC.
The setup in this paper will be familiar to people who’ve been thinking about guaranteed safe AI for a bit. We have trusted and untrusted components, where a core trusted component is a proof checker of some kind.
Proof-carrying code is conceptually similar to a sigma type or dependent pair. It just means you tag or decorate whatever code you deliver with a proof of its correctness.
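Here is a minimal Lean sketch of that framing, with hypothetical names (Code, ProofCarrying, NonemptySource) and a toy property standing in for the paper’s path-traversal check; the paper itself works in Dafny, so this shows the shape of the idea rather than their construction.

```lean
-- What the untrusted generator produces (placeholder type, not from the paper)
structure Code where
  source : String

-- Proof-carrying code as a dependent pair: the artifact bundled with a proof
-- that it satisfies whatever safety property the consumer cares about
abbrev ProofCarrying (Safe : Code → Prop) : Type :=
  { c : Code // Safe c }

-- Toy stand-in for a real safety property (e.g. "no path traversal")
abbrev NonemptySource (c : Code) : Prop :=
  c.source.length > 0

-- The high-resourced side must construct the pair; the low-resourced side
-- only has to run the typechecker on it
def delivered : ProofCarrying NonemptySource :=
  ⟨⟨"print 1"⟩, by decide⟩
```

The trusted surface is just the checker that validates the second component; nothing about how the first component was generated has to be trusted.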
LawZero launch
Yoshua Bengio’s new org, pursuing the Scientist AI agenda. Scientist AI is, to an extent, in the same conceptual family as Safeguarded AI: both lean heavily on probability theory (Scientist AI with an extra helping of information theory), and both try to keep informally specified agents from running amok by locking AI behind probabilistic assurances. Safeguarded AI, however, emphasizes a robust notion of certificate, while Scientist AI does not. People have criticized Scientist AI for falling into the pitfalls of “oracles” that Bostrom et al. discussed over a decade ago: specifically, you can’t block agency without banning loops and subprocess.run, since tool use is easily recovered in this day and age.
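To make the “tool use is easily recovered” point concrete, here is a minimal sketch, in Lean to match the rest of this issue; the wrapper, its names, and the shell-out are my own hypothetical illustration, not anything LawZero proposes.

```lean
-- Hypothetical wrapper showing how a text-only "oracle" becomes an agent:
-- an outer loop feeds its outputs to a shell and returns the observations.
partial def agentLoop (oracle : String → IO String) (transcript : String) : IO Unit := do
  -- the oracle only ever emits text...
  let cmd ← oracle transcript
  -- ...but the wrapper executes that text as a command (the subprocess.run step)
  let out ← IO.Process.output { cmd := "sh", args := #["-c", cmd] }
  -- and loops, closing the sense-act cycle
  agentLoop oracle (transcript ++ "\n$ " ++ cmd ++ "\n" ++ out.stdout)
```

The oracle itself never “acts”; the ten-line wrapper around it does, which is why assurances that live only at the model boundary are hard to make stick.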
LawZero is hiring for senior roles in Montreal.
AI Security Forum in Tel Aviv, DC, and Vegas this summer
Following the success of Defcon satellite events and the Paris AI Action Summit satellite event, the security forum hits Tel Aviv and DC for the first time before returning to Vegas.
Foresight RFP alludes to some GSAI topics
AI tools that assist with increasing the tractability of formal verification for real-world systems, starting at security-critical infrastructure.
Designing or proving properties of secure system architectures.
They accept applications on a rolling basis year-round, but evaluate and disburse quarterly. The next “deadline” is June 30th. Sorry I didn’t tell you sooner, I’m bad at my job.
Safeguarded AI Technical Area 2
Time to fire up a machine learning lab to automate all the proof engineering in the new proof stack that TA1 cooks up. Deadline October 1st.