How can we belief the correctness of a discovered mannequin on a selected enter of curiosity? Mannequin accuracy is usually measured on common over a distribution of inputs, giving no assure for any mounted enter. This paper proposes a theoretically-founded resolution to this downside: to coach Self-Proving fashions that show the correctness of their output to a verification algorithm V by way of an Interactive Proof. Self-Proving fashions fulfill that, with excessive likelihood over an enter sampled from a given distribution, the mannequin generates an accurate output and efficiently proves its correctness to V. The soundness property of V ensures that, for each enter, no mannequin can persuade V of the correctness of an incorrect output. Thus, a Self-Proving mannequin proves correctness of most of its outputs, whereas all incorrect outputs (of any mannequin) are detected by V. We devise and analyze two generic strategies for studying Self-Proving fashions: Transcript Studying (TL) which depends on entry to transcripts of accepting interactions, and Reinforcement Studying from Verifier Suggestions (RLVF) which trains a mannequin by emulating interactions with the verifier.
- †College of California, Berkeley
