Generative models are often deployed to make decisions on behalf of users, such as vision-language models (VLMs) identifying which person in a room is a doctor to help visually impaired individuals. Yet VLM decisions are influenced by the perceived demographic attributes of people in the input, which can lead to biased outcomes such as failing to identify women as doctors. Moreover, when reducing bias leads to performance loss, users may have varying needs for balancing bias mitigation against overall model capabilities, highlighting the demand for methods that enable controllable bias reduction during inference. Activation steering is a popular approach for inference-time controllability that has shown potential in inducing safer behavior in large language models (LLMs). However, we observe that existing steering methods struggle to correct biases, where equiprobable outcomes across demographic groups are required. To address this, we propose Direct Steering Optimization (DSO), which uses reinforcement learning to find linear transformations for steering activations, tailored to mitigate bias while maintaining control over model performance. We demonstrate that DSO achieves a state-of-the-art trade-off between fairness and capabilities on both VLMs and LLMs, while offering practitioners inference-time control over that trade-off. Overall, our work highlights the benefit of designing steering strategies that are directly optimized to control model behavior, providing more effective bias intervention than methods that rely on pre-defined heuristics for controllability.
- † Carnegie Mellon University
- ‡ Equal contribution
- ** Work done while at Apple
