With the increased deployment of large language models (LLMs), one concern is their potential misuse for generating harmful content. Our work studies the alignment problem, with a focus on filters to prevent the generation of unsafe information. Two natural points of intervention are filtering the input prompt before it reaches the model, and filtering the output after generation. Our main results demonstrate computational challenges in filtering both prompts and outputs. First, we show that there exist LLMs for which there are no efficient prompt filters: adversarial prompts that elicit harmful behavior can be constructed easily and are computationally indistinguishable from benign prompts for any efficient filter. Our second main result identifies a natural setting in which output filtering is computationally intractable. All of our separation results are under cryptographic hardness assumptions. Beyond these core findings, we also formalize and study relaxed mitigation approaches, demonstrating further computational barriers. We conclude that safety cannot be achieved by designing filters external to the LLM internals (architecture and weights); in particular, black-box access to the LLM will not suffice. Based on our technical results, we argue that an aligned AI system's intelligence cannot be separated from its judgment.
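As a rough formalization (the notation below is illustrative, not taken from the paper itself), the prompt-filtering barrier can be read as a standard computational-indistinguishability statement: for every probabilistic polynomial-time filter $F$, an adversarial prompt $p_{\mathrm{adv}}$ that elicits harmful behavior and a benign prompt $p_{\mathrm{ben}}$ satisfy

$$\bigl|\Pr[F(p_{\mathrm{adv}}) = 1] - \Pr[F(p_{\mathrm{ben}}) = 1]\bigr| \le \mathrm{negl}(\lambda),$$

where $\lambda$ denotes the security parameter of the underlying cryptographic hardness assumption, so no efficient filter gains a non-negligible advantage in flagging the adversarial prompt.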
- † Ludwig-Maximilians-Universität München (MCML)
- ‡ University of California, Berkeley
- § JPSM, University of Maryland
- ¶ Stanford University
