To check whether this problem holds for today's large multimodal models, the team carried out a controlled study. They trained the selected models on five target tasks, including fine-grained bird classification, counting, medical visual question answering, OCR reading, and time reading. They then measured how much performance dropped across eight standard benchmarks that were not part of the fine-tuning set.
These experiments led to two key discoveries, according to the paper. Tuning only the self-attention projection layers (SA Proj), the part of the model that helps it decide which input elements to focus on, allowed the models to learn new tasks with little or no measurable forgetting. Also, what initially appeared to be forgotten knowledge often resurfaced when the model was later trained on another specialized task.
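In PyTorch terms, restricting fine-tuning to the self-attention projections amounts to freezing every parameter except those layers. The sketch below is our illustration, not the authors' code; it assumes a Hugging Face-style transformer where the projections are named `q_proj`, `k_proj`, `v_proj`, and `o_proj`, and uses a toy layer to stand in for the real model.

```python
import torch.nn as nn

# Assumed projection-layer names (Hugging Face convention; other models may differ).
SA_PROJ_NAMES = ("q_proj", "k_proj", "v_proj", "o_proj")

def freeze_all_but_sa_proj(model):
    """Freeze every parameter except the self-attention projections.
    Returns the names of the parameters left trainable."""
    trainable = []
    for name, param in model.named_parameters():
        param.requires_grad = any(tag in name for tag in SA_PROJ_NAMES)
        if param.requires_grad:
            trainable.append(name)
    return trainable

# Toy stand-in for one transformer layer (attention projections + MLP).
class ToyAttention(nn.Module):
    def __init__(self, d=8):
        super().__init__()
        self.q_proj = nn.Linear(d, d)
        self.k_proj = nn.Linear(d, d)
        self.v_proj = nn.Linear(d, d)
        self.o_proj = nn.Linear(d, d)

class ToyLayer(nn.Module):
    def __init__(self, d=8):
        super().__init__()
        self.self_attn = ToyAttention(d)
        self.mlp = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))

model = ToyLayer()
trainable = freeze_all_but_sa_proj(model)
# Only the four projections' weights and biases stay trainable; the MLP is frozen.
print(len(trainable))  # 8 (4 projections x weight + bias)
```

An optimizer built afterwards would then receive only `[p for p in model.parameters() if p.requires_grad]`, so gradient updates never touch the frozen MLP.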
“We thus hypothesize that perhaps what looks like forgetting or interference after fine-tuning on a narrow target task is actually bias in the output distribution due to the task distribution shift,” the researchers added. “Through in-depth analysis when tuning the counting task, we confirm this hypothesis: tuning the MLP increases target accuracy but also increases the likelihood of outputting numeric tokens and a highly correlated drop in held-out task accuracy, whereas tuning the self-attention achieves the target learning without much bias toward numeric tokens and without dropping held-out accuracy.”
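The output-distribution bias the researchers describe can be made concrete with a simple diagnostic: compare what fraction of a model's generated tokens are numeric before and after fine-tuning on counting. This is our own hedged sketch of such a check, not the paper's analysis code, with made-up token lists standing in for real model outputs.

```python
def numeric_token_fraction(token_strings):
    """Fraction of generated tokens that are plain numbers.

    A rising value after fine-tuning on a counting task would signal the
    output-distribution bias hypothesized in the paper.
    """
    if not token_strings:
        return 0.0
    numeric = sum(1 for t in token_strings if t.strip().isdigit())
    return numeric / len(token_strings)

# Hypothetical decoded outputs for the same prompt, before and after tuning.
before = ["The", " bird", " is", " red", "."]
after = ["3", " 4", " 2", " birds", "."]
print(numeric_token_fraction(before))  # 0.0
print(numeric_token_fraction(after))   # 0.6
```

In the paper's framing, an MLP-tuned model would show a jump like the one simulated above alongside a correlated drop on held-out tasks, while an SA-Proj-tuned model would not.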