Dense image captioning is important for cross-modal alignment in vision-language pretraining and text-to-image generation, yet scaling expert-quality annotations is prohibitively costly. While synthetic captioning with strong vision-language models (VLMs) is a practical alternative, supervised distillation often yields limited output diversity and weak generalization. Reinforcement learning (RL) could overcome these limitations, but its successes have so far been concentrated in verifiable domains that rely on deterministic checkers, a luxury not available in open-ended captioning. We address this bottleneck with RubiCap, a novel RL framework that derives fine-grained, sample-specific reward signals from LLM-written rubrics. RubiCap first assembles a diverse committee of candidate captions, then employs an LLM rubric writer to extract consensus strengths and diagnose deficiencies in the current policy. These insights are converted into explicit evaluation criteria, enabling an LLM judge to decompose holistic quality assessment and replace coarse scalar rewards with structured, multi-faceted evaluations. Across extensive benchmarks, RubiCap achieves the highest win rates on CapArena, outperforming supervised distillation, prior RL methods, human-expert annotations, and GPT-4V-augmented outputs. On CaptionQA, it demonstrates superior word efficiency: our 7B model matches Qwen2.5-VL-32B-Instruct, and our 3B model surpasses its 7B counterpart. Remarkably, using the compact RubiCap-3B as a captioner produces stronger pretrained VLMs than those trained on captions from proprietary models.
- † University of Wisconsin–Madison
- ** Work done while at Apple
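The reward pipeline described in the abstract can be illustrated with a minimal sketch: a rubric is derived from a committee of candidate captions, and a judge scores each caption against the rubric's criteria, replacing a single coarse scalar with a weighted, multi-faceted evaluation. Everything here is hypothetical scaffolding: `write_rubric` and `judge` are deterministic stubs standing in for the LLM rubric writer and LLM judge, and the criteria and weights are invented for illustration, not taken from the paper.

```python
def write_rubric(candidates):
    # Stub rubric writer: in RubiCap an LLM inspects the candidate
    # committee to extract consensus strengths and diagnose deficiencies.
    # Here we simply return fixed (criterion, weight) pairs.
    return [("mentions_color", 0.4), ("mentions_count", 0.3), ("concise", 0.3)]

def judge(caption, criterion):
    # Stub judge: a binary check per criterion in place of an LLM call.
    checks = {
        "mentions_color": lambda c: any(w in c for w in ("red", "blue", "green")),
        "mentions_count": lambda c: any(w in c.split() for w in ("one", "two", "three")),
        "concise": lambda c: len(c.split()) <= 12,
    }
    return 1.0 if checks[criterion](caption) else 0.0

def rubric_reward(caption, rubric):
    # Structured reward: weighted sum of per-criterion scores, so the
    # policy receives fine-grained, sample-specific feedback rather
    # than one holistic scalar.
    return sum(weight * judge(caption, criterion) for criterion, weight in rubric)

candidates = ["two red cars parked outside", "a photo of cars"]
rubric = write_rubric(candidates)
rewards = [rubric_reward(c, rubric) for c in candidates]
```

In the real framework each call is an LLM query, and the per-criterion scores can also be logged individually, which is what makes the reward decomposable rather than a single opaque number.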
