The composition of objects and their parts, together with object-object positional relationships, provides a rich source of information for representation learning. Consequently, spatial-aware pretext tasks have been actively explored in self-supervised learning. Existing works commonly start from a grid structure, where the goal of the pretext task involves predicting the absolute position index of patches within a fixed grid. However, grid-based approaches fall short of capturing the fluid and continuous nature of real-world object compositions. We introduce PART, a self-supervised learning approach that leverages continuous relative transformations between off-grid patches to overcome these limitations. By modeling how parts relate to one another in a continuous space, PART learns the relative composition of images: an off-grid structural relative positioning that is less tied to absolute appearance and can remain coherent under variations such as partial visibility or stylistic changes. In tasks requiring precise spatial understanding, such as object detection and time series prediction, PART outperforms grid-based methods like MAE and DropPos, while maintaining competitive performance on global classification tasks. By breaking free from grid constraints, PART opens up a new trajectory for general self-supervised pretraining across diverse data types, from images to EEG signals, with potential in medical imaging, video, and audio.
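To make the core idea concrete, the following is a minimal sketch of what a continuous relative-transformation target between two off-grid patches could look like. This is an illustration under our own assumptions, not the paper's actual implementation: the names `sample_patch` and `relative_transform`, the box parameterization `(x, y, w, h)`, and the choice of normalized translation plus log-scale as the target are all hypothetical.

```python
import numpy as np

def sample_patch(img_size, patch_size, rng):
    """Sample an off-grid patch as a box (x, y, w, h) at a continuous position,
    rather than snapping it to a fixed grid of cells."""
    x = rng.uniform(0, img_size - patch_size)
    y = rng.uniform(0, img_size - patch_size)
    return np.array([x, y, float(patch_size), float(patch_size)])

def relative_transform(box_a, box_b):
    """Continuous relative transformation from box_a to box_b:
    translation normalized by the source patch size, plus a log scale ratio.
    Identical boxes map to the zero vector."""
    dx = (box_b[0] - box_a[0]) / box_a[2]
    dy = (box_b[1] - box_a[1]) / box_a[3]
    ds = np.log(box_b[2] / box_a[2])
    return np.array([dx, dy, ds])

# A pair of off-grid patches yields a continuous regression target,
# in contrast to a discrete grid-index classification label.
rng = np.random.default_rng(0)
a = sample_patch(224, 32, rng)
b = sample_patch(224, 32, rng)
target = relative_transform(a, b)
```

Because the target depends only on the relation between the two patches, not their absolute coordinates, it is invariant to translating both boxes by the same offset, which matches the abstract's emphasis on relative over absolute positioning.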
- †University of Amsterdam
