Accepted at ECCV 2026 Main Conference

DSeq-JEPA: Discriminative Sequential Joint-Embedding Predictive Architecture

A JEPA-style self-supervised learner that decides where to predict and in what order, turning latent prediction into a curriculum from primary to secondary visual cues.

* Equal contribution

Comparison between I-JEPA parallel prediction and DSeq-JEPA discriminative sequential prediction
DSeq-JEPA replaces flat, independent target prediction with attention-ranked sequential latent prediction.
82.4% linear probing accuracy (+1.3 over I-JEPA, ViT-H)
+1.5 on FGVC tasks over I-JEPA (ViT-H)
50.5 MS-COCO AP^box (ViT-B/16)
17.8 GFLOPs/image at inference, nearly unchanged

Abstract

Predictive SSL, ordered by visual importance.

Fig. 2. DSeq-JEPA selects and ranks discriminative regions, then predicts next-region embeddings in that ordered sequence.

Recent advances in self-supervised visual representation learning have demonstrated the effectiveness of predictive latent-space objectives for learning transferable features. In particular, the Image-based Joint-Embedding Predictive Architecture (I-JEPA) learns representations by predicting latent embeddings of masked target regions from visible context. However, it predicts all target regions in parallel and has no mechanism for ordering those predictions meaningfully.

Inspired by human visual perception, which attends selectively and progressively from primary to secondary cues, we propose DSeq-JEPA, a Discriminative Sequential Joint-Embedding Predictive Architecture that bridges latent predictive and autoregressive self-supervised learning. DSeq-JEPA identifies primary discriminative regions using an attention-derived saliency map and predicts subsequent regions in discriminative order, inducing a curriculum-like semantic progression during pre-training.

01

Discriminative Region Prioritization

Attention-derived saliency identifies informative regions and ranks them by visual importance.

02

Sequential Next-Region Prediction

The predictor estimates each next region embedding from previous discriminative cues.

03

Broad Transfer Gains

Improvements hold across ImageNet, FGVC, dense prediction, CLEVR, and ablation settings.

Method

Discriminative sequential latent prediction.

DSeq-JEPA keeps the JEPA latent prediction objective while replacing random parallel target prediction with an attention-guided sequence. It follows the intuition of selective human visual perception: identify the most discriminative visual cue first, then use that context to predict progressively less dominant regions.

DSeq-JEPA architecture overview
Given an image, DSeq-JEPA estimates class-token to patch similarity, extracts connected attention-guided regions, ranks them by discriminative response, and predicts each next-region embedding in that order.

Where

Estimate saliency

Compute class-token to patch similarity from the target encoder as a lightweight proxy for discriminative visual content.
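
As a rough illustration of this step, the sketch below scores each patch by its similarity to a class token. It is a minimal sketch under stated assumptions, not the authors' implementation: the use of cosine similarity, the min-max normalization, the function name `cls_patch_saliency`, and the random stand-in features are all illustrative choices that the page does not specify.

```python
import numpy as np

def cls_patch_saliency(cls_token, patch_tokens, grid_hw):
    """Similarity between a class token and every patch token.

    Assumption: cosine similarity followed by min-max normalization.
    Returns a map of shape grid_hw with values in [0, 1].
    """
    h, w = grid_hw
    cls_unit = cls_token / (np.linalg.norm(cls_token) + 1e-8)
    patch_norms = np.linalg.norm(patch_tokens, axis=1, keepdims=True)
    patch_unit = patch_tokens / (patch_norms + 1e-8)
    sim = patch_unit @ cls_unit                      # shape (h * w,)
    sim = (sim - sim.min()) / (sim.max() - sim.min() + 1e-8)
    return sim.reshape(h, w)

# Random features stand in for target-encoder outputs (hypothetical sizes).
rng = np.random.default_rng(0)
cls_token = rng.normal(size=64)
patch_tokens = rng.normal(size=(14 * 14, 64))
saliency = cls_patch_saliency(cls_token, patch_tokens, (14, 14))
print(saliency.shape)
```

In a real pipeline the two inputs would come from the target encoder; only the shape of the computation is meant to carry over.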

Rank

Select regions

Apply adaptive thresholding and connected components, then sort candidate regions by average normalized attention response.
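
The region selection described here can be sketched as follows. This is a hedged illustration only: the threshold rule (mean plus `k` standard deviations of the map), the 4-connectivity, and the name `rank_regions` are assumptions, since the page says only that thresholding is adaptive and that regions are connected components sorted by average normalized response.

```python
from collections import deque
import numpy as np

def rank_regions(saliency, k=0.5):
    """Connected above-threshold regions, sorted by mean saliency (highest first).

    Each region is a list of (row, col) patch coordinates.
    Assumption: the adaptive threshold is mean + k * std of this image's map.
    """
    thresh = saliency.mean() + k * saliency.std()
    mask = saliency > thresh
    h, w = mask.shape
    seen = np.zeros((h, w), dtype=bool)
    regions = []
    for r in range(h):
        for c in range(w):
            if not mask[r, c] or seen[r, c]:
                continue
            # Breadth-first flood fill with 4-connectivity.
            queue, cells = deque([(r, c)]), []
            seen[r, c] = True
            while queue:
                y, x = queue.popleft()
                cells.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if 0 <= ny < h and 0 <= nx < w and mask[ny, nx] and not seen[ny, nx]:
                        seen[ny, nx] = True
                        queue.append((ny, nx))
            regions.append(cells)
    # Rank by average response inside each region, strongest first.
    regions.sort(key=lambda cells: -float(np.mean([saliency[y, x] for y, x in cells])))
    return regions

# Toy map with a strong 2x2 blob and a weaker 2x3 blob.
saliency = np.zeros((6, 6))
saliency[0:2, 0:2] = 0.9
saliency[4:6, 3:6] = 0.6
regions = rank_regions(saliency)
print([len(cells) for cells in regions])
```

On the toy map, the smaller but stronger blob is ranked ahead of the larger, weaker one, which is the behavior that ranking by average response (rather than by area) is meant to give.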

Order

Predict next embeddings

Predict region embeddings from the most discriminative cue to the least, aligning predictions with target encoder features.
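
One way to picture the ordered prediction is the loop below, which predicts each region from the regions ranked before it. This is a sketch under assumptions rather than the paper's training code: the mean-squared-error loss, conditioning on context-side embeddings of earlier regions, and the names `sequential_latent_loss` and `mean_predictor` are hypothetical, and the averaging predictor is only a stand-in for a learned network.

```python
import numpy as np

def sequential_latent_loss(cue_embs, target_embs, predictor):
    """Average latent loss over next-region predictions in ranked order.

    cue_embs[i]    : context-side embedding of region i (assumed available)
    target_embs[i] : target-encoder embedding of region i
    Regions are assumed to be sorted already, most discriminative first.
    """
    losses = []
    for i in range(1, len(target_embs)):
        # Predict region i from all higher-ranked regions 0..i-1.
        pred = predictor(np.stack(cue_embs[:i]))
        # Assumed loss: mean squared error in latent space.
        losses.append(np.mean((pred - target_embs[i]) ** 2))
    return float(np.mean(losses))

def mean_predictor(cues):
    """Hypothetical stand-in for the predictor: averages the cue embeddings."""
    return cues.mean(axis=0)

# Three toy regions with 4-dimensional embeddings.
embs = [np.zeros(4), np.ones(4), 2.0 * np.ones(4)]
loss = sequential_latent_loss(embs, embs, mean_predictor)
print(loss)
```

The point of the loop is the dependency structure: step `i` sees only regions ranked above it, which is what distinguishes this from predicting all targets independently in parallel.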

Results

Consistent gains across recognition, dense prediction, and reasoning.

The improvements are not isolated to one benchmark: sequential discriminative prediction transfers to global, fine-grained, dense, and low-level tasks.

ImageNet and Fine-Grained Transfer

Top-1 accuracy using ViT-H/16 at 448px, comparing I-JEPA and DSeq-JEPA.

Visual Analysis

Interpretable region order emerges during pre-training.

Qualitative analysis shows compact discriminative regions and patch clusters that become increasingly object-aligned over training.

Qualitative comparison of attention and selected target regions
DSeq-JEPA produces compact, ordered regions that track discriminative object parts.
Patch-level clustering evolution across pre-training epochs
Patch clusters become more coherent and object-aligned from 150 to 600 epochs.

Citation

Cite DSeq-JEPA.

Use this BibTeX entry for the ECCV 2026 version of the paper.

@inproceedings{he2026dseqjepa,
  title={DSeq-JEPA: Discriminative Sequential Joint-Embedding Predictive Architecture},
  author={He, Xiangteng and Sakai, Shunsuke and Chandhok, Shivam and Beery, Sara and Yuan, Kun and Padoy, Nicolas and Hasegawa, Tatsuhito and Sigal, Leonid},
  booktitle={Proceedings of the European Conference on Computer Vision (ECCV)},
  year={2026}
}