# October 2024 Progress in Guaranteed Safe AI

### Safeguarded AI TA 1.4 funding call, a way of verifying quantitative properties of program pairs, and previews of the recent ICLR submissions.

There are no benefits of subscribing, except you get a cut of my hard earned shapley points.

As always, do reach out to inform me how the newsletter could be better, unless you want less biased and wrong quick takes. The biased and wrong quick takes are *mandatory*. Do you like inlining abstracts, or should I just expect people to click when they want to read the abstract? This month we’re inlining all of them.

If you're just joining us, background on GSAI here.

I attended an ARIA meeting this month, so most of the newsletter is papers I found there that happen to be recent-ish.

A little birdie told me there’s a meeting in Berkeley at the beginning of December– reach out for more info!

# Big list of openreview links for ICLR 2025 papers related to scaled interactive theorem proving

From the Lean Zulip. There’s probably more from ICLR I’d like to cover in the newsletter, but I haven’t parsed the openreview firehose personally. I’m freeloading off Lean Zulip users.

One thing that jumped out at me is in a thread about a polarizing/unpopular paper, the discussion of what programming practices would lead to ML papers having an easier time at making reasonable claims about the nontriviality of discovered proofs and broadly preventing contamination at scale. Terence Tao discusses placeholder design patterns, suggests a proof_wanted command to replace a theorem … : 0 = 1 := sorry, and discusses downsides with the practice of sorrying out 0=1 to do placeholding (spoiler: an AI can exploit the proof of a false proposition to misbehave). This is an important point: any GSAI stack will have collaboration between proof engineers of different power levels, and lots of collaboration in proof engineering is driven by very careful bookkeeping of placeholders (saying what we need, what we have, and what would be nice to have).

# A quantitative probabilistic relational hoare logic

**Hoare logic** is when you have precondition *P : Memory -> Prop* and postcondition *Q : Memory -> Prop*, and some imperative program *c* such that forall memories *m *and *n*, if *Pm* and executing *c* sends *m* to *n*, then you can prove *Qn*, written {P}c{Q}. Sometimes, you want to reason about relations between programs. **Relational hoare logic**** ****(RHL) **is when preconditions and postconditions are relations on memory states and you have two programs, so {P} c₁ ~ c₂ {Q} means that forall memories *m1 m2 n1 n2*, if *P m1 m2* and *exec c1 m1 = n1* and *exec c2 m2 = n2*, then *Q n1 n2*.

You might like to write a proof of a RHL quadruple for a cryptographic primitive, requiring you to reason about independence of two PRNGs. Independence is one of the things you don’t get in RHL, because RHL assumes deterministic programs. **Probabilistic relational hoare logic (pRHL)** looks again like {P} c1 ~ c2 {Q}, but this time Q is lifted to a relation between distributions over memory, reflecting the possible outcomes of programs involving probabilistic choice or random sampling. So precondition P is of type Memory -> Memory -> Prop, postcondition Q is of type ΔMemory -> ΔMemory -> Prop, and the pRHL quadruple unpacks to forall *m1 m2 : Memory, *forall* N1 N2 : ΔMemory,* if *P m1 m2* and *exec c1 m1 ~ N1 *and exec c2 m2 ~ N2, then *Q N1 N2*.

We introduce eRHL, a program logic for reasoning about relational expectation properties of pairs of probabilistic programs. eRHL is quantitative, i.e., its pre- and post-conditions take values in the extended non-negative reals. Thanks to its quantitative assertions, eRHL overcomes randomness alignment restrictions from prior logics, including PRHL, a popular relational program logic used to reason about security of cryptographic constructions, and apRHL, a variant of PRHL for differential privacy. As a result, eRHL is the first relational probabilistic program logic to be supported by non-trivial soundness and completeness results for all almost surely terminating programs. We show that eRHL is sound and complete with respect to program equivalence, statistical distance, and differential privacy. We also show that every PRHL judgment is valid iff it is provable in eRHL. We showcase the practical benefits of eRHL with examples that are beyond reach of PRHL and apRHL.

pRHL is not *expectational*. Since preconditions and postconditions land in Prop, we can’t *value* preconditions and postconditions as quantities. Last july, the authors introduced a generalization of pRHL called eRHL (e for expectation), where now preexpectations and postexpectations are of type Memory -> Memory -> NNReal, and the rest follows. eRHL is special because they can prove completeness theorems for almost surely terminating (Pr(halt) = 1) programs that pRHL can’t, along with some fancy desiderata like a statistical difference ⇔ eRHL quadruple correspondence and differential privacy.

If you’re wondering why you just read all that, here’s the juice: often in GSAI position papers there’ll be some reference to expectations that capture “harm” or “safety”. Preexpectations and postexpectations with respect to particular pairs of programs could be a great way to cash this out, cuz we could look at programs as interventions and simulate RCTs (labeling one program control and one treatment) in our world modeling stack. When it comes to harm and safety, Prop and bool are definitely not rich enough.

# Bengio talk coming up

November 14, 18:00-19:00 UTC

Description: Could there be safety advantages to the training of a Bayesian oracle that is trained to only do that job, i.e., estimate P(answer | question, data)? What are the scenarios in which such an AI could cause catastrophic harm? Can we even use such an oracle as the intelligence engine of an agent, e.g., by sampling actions that help to achieve goals? What can go wrong even if we assume that we have a perfect prediction of the Bayesian posterior, e.g., if the true explanatory theory is a minority voice in the Bayesian posterior regarding harm prediction? If such an oracle is estimated by a neural network with amortized inference, what could go wrong? Could the implicit optimization used to train the estimated posterior create loopholes with an optimistic bias regarding harm? Could we also use such a Bayesian oracle to obtain conservative risk estimates, i.e., bounds on the probability of harm, that can mitigate the imperfections in such an agent?

# AutoVerus

Verus is a framework for *annotating* rust code with preconditions (“requires”) and postconditions (“ensures”) and kicks static verification down to SMT. Here’s an example program for multiplying a number by 8:

fn octuple(x1: i8) -> (x8: i8)

requires

-16 <= x1,

x1 < 16,

ensures

x8 == 8 * x1,

{

let x2 = x1 + x1;

let x4 = x2 + x2;

x4 + x4

}

Without the “require”, we may overflow the 8 bit integer. The “ensure” is a functional specification of what it means for the function to be correct. The verus paper is here and the verus docs are here.

Generative AI has shown its values for many software engineering tasks. Still in its infancy, large language model (LLM)-based proof generation lags behind LLM-based code generation. In this paper, we present AutoVerus. AutoVerus uses LLM to automatically generate correctness proof for Rust code. AutoVerus is designed to match the unique features of Verus, a verification tool that can prove the correctness of Rust code using proofs and specifications also written in Rust. AutoVerus consists of a network of LLM agents that are crafted and orchestrated to mimic human experts' three phases of proof construction: preliminary proof generation, proof refinement guided by generic tips, and proof debugging guided by verification errors. To thoroughly evaluate AutoVerus and help foster future research in this direction, we have built a benchmark suite of 150 non-trivial proof tasks, based on existing code-generation benchmarks and verification benchmarks. Our evaluation shows that AutoVerus can automatically generate correct proof for more than 90% of them, with more than half of them tackled in less than 30 seconds or 3 LLM calls.

The autoverus paper is an example in the genre of using verification tools as the ground truth in a machine learning process. A proof system is a great data labeler. GSAI is happy when this basic viewpoint makes data quality more scalable than other forms of ground truth across machine learning.

# Sufficient and Necessary Barrier-like Conditions for Safety and Reach-avoid Verification of Stochastic Discrete-time Systems

A reach-avoid specification is something of the form “eventually P and always not Q”. Barrier certs are a way of proving that a system starting in the safe region stays in the safe region without directly solving the whole system of differential equations.

In this paper, we examine sufficient and necessary barrier-like conditions for the safety verification and reach-avoid verification of stochastic discrete-time systems. Safety verification aims to certify the satisfaction of the safety property, which stipulates that the probability of the system, starting from a specified initial state, remaining within a safe set is greater than or equal to a specified lower bound. A sufficient and necessary barrier-like condition is formulated for safety verification. In contrast, reach-avoid verification extends beyond safety to include reachability, seeking to certify the satisfaction of the reach-avoid property. It requires that the probability of the system, starting from a specified initial state, reaching a target set eventually while remaining within a safe set until the first hit of the target, is greater than or equal to a specified lower bound. Two sufficient and necessary barrier-like conditions are formulated under certain assumptions. These conditions are derived via relaxing Bellman equations.

# Compositional Design of Safety Controllers for Large-scale Stochastic Hybrid Systems

What is the problem? When is a solution “compositional”?

The problem is stochastic hybrid systems (SHS), which are very heterogeneous: dynamics (i.e. of the kind described by differential equations) that might be discrete time in some parts and continuous time in others, and discrete state in some parts but continuous state in others. Specifically, the problem is synthesizing guaranteed controllers for these systems (some actuator that intervenes to enact a predicted output). A solution to a problem is compositional when it allows you to piece together solutions to bigger problems out of solutions to smaller problems.

In this work, we propose a compositional scheme based on small-gain reasoning for the safety controller synthesis of interconnected stochastic hybrid systems with both continuous evolutions and instantaneous jumps. In our proposed setting, we first offer an augmented scheme to represent each stochastic hybrid subsystem with continuous and discrete evolutions in a unified framework, ensuring that the state trajectories match those of the original hybrid systems. We then introduce the concept of augmented control sub-barrier certificates (A-CSBC) for each subsystem, which allows the construction of augmented control barrier certificates (A-CBC) for interconnected systems and their safety controllers under small-gain compositional conditions. We eventually leverage the constructed A-CBC and quantify a guaranteed probabilistic bound across the safety of the interconnected system. While the computational complexity of designing a barrier certificate and its safety controller grows polynomially with network dimension using sum-of-squares (SOS) optimization program, our compositional approach significantly reduces it to a linear scale with respect to the number of subsystems. We verify the efficacy of our proposed approach over an interconnected stochastic hybrid system composed of 1000 nonlinear subsystems.

# DafnyBench (back in june)

The Beneficial AI Foundation and others shipped a benchmark for Dafny coding.

We introduce DafnyBench, the largest benchmark of its kind for training and evaluating machine learning systems for formal software verification. We test the ability of LLMs such as GPT-4 and Claude 3 to auto-generate enough hints for the Dafny formal verification engine to successfully verify over 750 programs with about 53,000 lines of code. The best model and prompting scheme achieved 68% success rate, and we quantify how this rate improves when retrying with error message feedback and how it deteriorates with the amount of required code and hints. We hope that DafnyBench will enable rapid improvements from this baseline as LLMs and verification techniques grow in quality.

There’s also this VSCode extension complimenting the paper.

# Safeguarded AI TA1.4 funding call

Economists and voting theorists are being summoned to help us reason about specification generating processes under multiple stakeholders. In video form!

We already knew that ARIA is being wildly ambitious. I’d be happy if we could get *any specs at all* on the new spec/modeling/cert stack that ARIA is building to be checked against prod code; it seems a lot harder to make them also socially/democratically desirable. Worried a little about scope creep with these guys, but what else is new.