# Universal Instance Perception as Object Discovery and Retrieval
All instance perception tasks aim at finding certain objects specified by some query, such as category names, language expressions, or target annotations, yet this field has long been split into multiple independent subtasks. In this work, we present a universal instance perception model of the next generation, termed UNINEXT. UNINEXT reformulates diverse instance perception tasks into a unified object discovery and retrieval paradigm and can flexibly perceive different types of objects by simply changing the input prompts.
Most current instance perception methods are developed for only a single sub-task (or a small subset of sub-tasks) and are trained on data from specific domains. This philosophy has the following drawbacks:
- Independent designs hinder models from learning and sharing generic knowledge across different tasks and domains, causing redundant parameters.
- The possibility of mutual collaboration between different tasks is overlooked. For example, object detection data teaches models to recognize common objects, which can naturally improve the performance of REC and RES.
- Restricted by fixed-size classifiers, traditional object detectors are hard to train jointly on multiple datasets with different label vocabularies, and cannot dynamically change which object categories to detect at inference time (see the sketch right after this list).
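A minimal sketch of why the fixed-size classifier is restrictive (my own illustration, not code from the paper): a conventional detector head maps region features to a fixed K-way logit vector, whereas matching region features against encoded category-name prompts lets the label space change freely at inference time.

```python
import torch
import torch.nn as nn

d, K = 256, 80                      # feature dim, fixed label vocabulary size (e.g. COCO)
region_feats = torch.randn(100, d)  # features of 100 candidate regions

# Conventional detector: the classifier weight is baked in for exactly K classes.
fixed_classifier = nn.Linear(d, K)
logits_fixed = fixed_classifier(region_feats)   # (100, 80) -- cannot add new classes later

# Prompt-based retrieval: scores are dot products with embeddings of whatever
# category names are given as the prompt, so the "classifier" is built on the fly.
def encode_names(names, dim=d):
    """Placeholder text encoder -- stands in for a real language encoder (e.g. BERT)."""
    return torch.randn(len(names), dim)

prompt = ["person", "bicycle", "zebra", "traffic light"]   # any vocabulary, any size
name_embeds = encode_names(prompt)                          # (4, d)
logits_prompt = region_feats @ name_embeds.t()              # (100, 4) -- vocabulary-free head
```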
Since essentially all instance perception tasks aim at finding certain objects according to some query, a natural question arises: could we design a unified model to solve all mainstream instance perception tasks once and for all?
We propose a unified prompt-guided formulation for universal instance perception, reuniting previously fragmented instance-level sub-tasks into a whole. The benefits are:
- Enormous data from different tasks and label vocabularies can be exploited to jointly train general instance-level representations, which is especially beneficial for tasks lacking training data.
- The unified model is parameter-efficient and saves redundant computation when handling multiple tasks simultaneously.
Benefiting from the flexible object discovery and retrieval paradigm, UNINEXT can be trained on different tasks and domains with no need for task-specific heads. UNINEXT achieves superior performance on 20 challenging benchmarks from 10 instance perception tasks using a single model with the same model parameters.
Retrieval by category names includes object detection and instance segmentation, two tasks that are fundamental to instance perception.
Object Detection Methods:
- Two-stage methods: first generate regions of interest (RoIs), then classify each RoI into a specific category. R-CNN (2014) is a representative two-stage method.
- One-stage methods: directly predict the category (or a per-category score) for each pixel or anchor. DETR ("End-to-End Object Detection with Transformers", 2020) follows this direct-prediction idea with a Transformer architecture.
R-CNNs, YOLO, DETR (DEtection TRansformer), etc. are box-level detectors: they output a list of bounding boxes together with class labels and confidence scores indicating the likelihood that each box contains an object of that class, as illustrated below.
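As a concrete picture of what a box-level detector returns, here is a tiny illustrative structure (field and class names are my own, not from any specific library):

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Detection:
    box: tuple      # (x1, y1, x2, y2) in pixel coordinates
    label: str      # predicted category name
    score: float    # confidence that the box contains an object of that category

# A detector's output for one image is simply a list of such records:
predictions: List[Detection] = [
    Detection(box=(34, 50, 210, 380), label="person", score=0.92),
    Detection(box=(120, 200, 480, 420), label="bicycle", score=0.78),
]
```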
Methods for Video Tasks: The online "detection then association" paradigm is the mainstream approach for MOT: objects are detected frame by frame and then linked to existing tracks (a toy sketch of the association step follows below). VIS methods, in contrast, often work in an offline fashion on shorter video clips, processing the whole clip before making decisions; offline methods perform well on simpler benchmarks but struggle on more challenging ones.
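A heavily simplified sketch of the online "detection then association" idea: greedy IoU matching between the current frame's detections and existing tracks (real MOT systems also use motion and appearance cues; this is only an illustration).

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-6)

def associate(tracks, detections, thr=0.5):
    """Greedily match each existing track (id -> last box) to the most overlapping detection."""
    matches, unmatched = {}, list(range(len(detections)))
    for tid, last_box in tracks.items():
        best_j, best_iou = None, thr
        for j in unmatched:
            o = iou(last_box, detections[j])
            if o > best_iou:
                best_j, best_iou = j, o
        if best_j is not None:
            matches[tid] = best_j
            unmatched.remove(best_j)
    return matches, unmatched   # unmatched detections typically start new tracks
```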
REC: Referring Expression Comprehension. Given an image and a language expression, the model is required to find the object referred to by the expression.
RES: Referring Expression Segmentation. Given an image and a language expression, the model is required to segment the object referred to by the expression.
R-VOS: Referring Video Object Segmentation. Given a video and a language expression, the model is required to segment the object referred to by the expression in each frame.
REC is treated more like a classification/localization task; existing methods can be grouped into one-stage, two-stage, and Transformer-based approaches.
RES approaches focus more on designing
diverse attention mechanisms to achieve vision-language
alignment.
Current SOTA methods for R-VOS are Transformer-based and process the whole video in an offline fashion. However, the offline paradigm hinders real-world applications such as long videos and ongoing (streaming) videos.
Single Object Tracking (SOT): Given a video and a bounding-box or mask annotation in the first frame, the model is required to track the object throughout the video.
Video Object Segmentation (VOS): Given a video and a bounding-box or mask annotation in the first frame, the model is required to segment the object in each frame.
Problems and Solutions:
- Extracting informative target features is hard: most SOT methods encode target information by passing a template through a Siamese backbone (a minimal sketch of this idea follows after this list), while VOS approaches usually pass multiple previous frames together with their mask results to a memory encoder to extract fine-grained target information.
- Fusing information: correlations were adopted early in SOT but suffer from information loss; Transformer-based methods are capable of finding more discriminative features. Fusion for VOS, in contrast, is dominated by space-time memory networks.
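To make the Siamese idea concrete, here is a minimal sketch of template-search cross-correlation (the SiamFC-style trick of using the template feature map as a convolution kernel over the search-region feature map; the shapes below are placeholders):

```python
import torch
import torch.nn.functional as F

def siamese_response(template_feat, search_feat):
    """Correlate a template feature map against a search-region feature map.

    template_feat: (C, h, w)  features of the annotated target (first frame)
    search_feat:   (C, H, W)  features of the current frame, H >= h, W >= w
    returns:       (H - h + 1, W - w + 1) response map; its peak locates the target.
    """
    kernel = template_feat.unsqueeze(0)   # (1, C, h, w) -> used as a conv kernel
    x = search_feat.unsqueeze(0)          # (1, C, H, W)
    response = F.conv2d(x, kernel)        # cross-correlation via conv2d
    return response[0, 0]

# toy usage
response = siamese_response(torch.randn(64, 6, 6), torch.randn(64, 22, 22))
print(response.shape)   # torch.Size([17, 17])
```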
MuST, INTERN, Unified-IO, and OFA propose unified learning paradigms for different tasks, but the commonality and inner relationships among different tasks are still less explored and exploited.
MuST: A multi-task self-training approach for 6 vision tasks.
INTERN: Introduces a continuous learning scheme and shows strong generalization on 26 popular benchmarks.
Unified-IO and OFA: Propose a unified sequence-to-sequence framework for various vision, language, and multi-modal tasks.
Before introducing the detailed method, we first categorize existing instance perception tasks into three classes according to their prompts (summarized in the small snippet after this list):
- Object detection, instance segmentation, MOT, MOTS, and VIS take category names as prompts to find all instances of the specified classes.
- REC, RES, and R-VOS exploit a language expression as the prompt to localize a certain target.
- SOT and VOS use the annotation given in the first frame as the prompt for predicting the trajectory of the tracked target.
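The same taxonomy written out as a small illustrative mapping (task keys and value names are mine, not from the paper):

```python
PROMPT_TYPE = {
    # category names as prompts: find every instance of the listed classes
    "object_detection": "category_names",
    "instance_segmentation": "category_names",
    "MOT": "category_names",
    "MOTS": "category_names",
    "VIS": "category_names",
    # a language expression as the prompt: localize one referred target
    "REC": "expression",
    "RES": "expression",
    "R-VOS": "expression",
    # the first-frame annotation (box or mask) as the prompt: track that target
    "SOT": "annotation",
    "VOS": "annotation",
}
```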
The language encoder $Enc_L$ is a crucial component of UNINEXT: it transforms the diverse language-related prompts into a unified, standardized form that the model can process.
Input: Takes in language expressions as prompts. These can be either category names (e.g., "person, bicycle, …, toothbrush" for the COCO dataset) or specific language expressions (e.g., "the man holding an umbrella").
Output: Produces a prompt embedding $F_p \in \mathbb{R}^{L \times d}$, i.e. $d$-dimensional with a sequence length of $L$.
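A hedged sketch of what $Enc_L$ could look like using an off-the-shelf BERT-style text encoder from HuggingFace `transformers` (whether this matches UNINEXT's exact language encoder and pre-processing is not guaranteed here; it only shows how a prompt string becomes an $L \times d$ embedding):

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
text_encoder = AutoModel.from_pretrained("bert-base-uncased")

def enc_l(prompt: str) -> torch.Tensor:
    """Encode a language prompt (category names or a referring expression) into F_p."""
    tokens = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        out = text_encoder(**tokens)
    return out.last_hidden_state[0]   # (L, d) -- one embedding per token, d = 768 for bert-base

F_p = enc_l("person. bicycle. toothbrush")          # category names joined into one prompt
# or: F_p = enc_l("the man holding an umbrella")    # a referring expression
print(F_p.shape)   # torch.Size([L, 768])
```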
Annotation Encoder
Input: Takes in the annotation of the first frame as the prompt. The annotation can be a bounding box or a segmentation mask.
Output: Produces a prompt embedding $F_p \in \mathbb{R}^{L \times d}$, i.e. $d$-dimensional with a sequence length of $L$.
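A rough sketch of one plausible $Enc_A$ design (an assumption-laden illustration, not the paper's exact module): rasterize the box or mask into a single-channel prior map, concatenate it with the reference frame, and run a small visual encoder whose pooled tokens serve as $F_p$.

```python
import torch
import torch.nn as nn

class AnnotationEncoder(nn.Module):
    """Toy Enc_A: reference frame (3ch) + target prior map (1ch) -> (L, d) prompt tokens."""
    def __init__(self, d=256, L=16):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(4, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, d, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.pool = nn.AdaptiveAvgPool2d(4)   # 4x4 = 16 = L prompt tokens

    def forward(self, frame, prior):
        # frame: (3, H, W); prior: (1, H, W) -- 1 inside the box/mask, 0 elsewhere
        x = torch.cat([frame, prior], dim=0).unsqueeze(0)   # (1, 4, H, W)
        feats = self.pool(self.backbone(x))                  # (1, d, 4, 4)
        return feats.flatten(2).transpose(1, 2)[0]           # (L, d)

enc_a = AnnotationEncoder()
F_p = enc_a(torch.randn(3, 256, 256), torch.zeros(1, 256, 256))
print(F_p.shape)   # torch.Size([16, 256])
```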
So the whole prompt generation can be written as:

$$
F_p =
\begin{cases}
Enc_L(\text{category names or expression}) & \text{for language prompts} \\
Enc_A(\text{first-frame annotation}) & \text{for annotation prompts}
\end{cases}
$$
To enhance the original prompt embedding with image context and to make the original visual features prompt-aware, an early fusion module is adopted. A bi-directional cross-attention module (Bi-XAtt) is used to retrieve information between the visual features and the prompt embedding, yielding prompt-aware visual features and an enhanced prompt embedding $F_p'$.
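A minimal sketch of a bi-directional cross-attention fusion, assuming standard multi-head attention as a stand-in for the paper's Bi-XAtt module: the prompt attends to the visual tokens and vice versa, and the retrieved features are added back to the originals.

```python
import torch
import torch.nn as nn

class BiXAtt(nn.Module):
    """Toy early fusion: prompt <-> vision bidirectional cross-attention with residual adds."""
    def __init__(self, d=256, heads=8):
        super().__init__()
        self.p2v = nn.MultiheadAttention(d, heads, batch_first=True)  # vision queries prompt
        self.v2p = nn.MultiheadAttention(d, heads, batch_first=True)  # prompt queries vision

    def forward(self, F_v, F_p):
        # F_v: (1, HW, d) flattened visual tokens; F_p: (1, L, d) prompt tokens
        F_p2v, _ = self.p2v(query=F_v, key=F_p, value=F_p)   # info retrieved from the prompt
        F_v2p, _ = self.v2p(query=F_p, key=F_v, value=F_v)   # info retrieved from the image
        F_v_prime = F_v + F_p2v   # prompt-aware visual features
        F_p_prime = F_p + F_v2p   # image-aware prompt embedding
        return F_v_prime, F_p_prime

fusion = BiXAtt()
F_v_prime, F_p_prime = fusion(torch.randn(1, 1024, 256), torch.randn(1, 20, 256))
```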
Transformer encoder: input is the prompt-aware visual features (together with the enhanced prompt embedding $F_p'$) produced by early fusion. Method: multi-scale deformable attention from Deformable DETR. The encoder outputs stronger, prompt-aware features (the output of an encoder is usually called an embedding).
Besides, as in two-stage Deformable DETR [136], an auxiliary prediction head is appended at the end of the encoder: it scores every encoder token, and the $N$ highest-scoring positions are taken as initial reference points that are fed to the decoder, as unpacked in the sketch below.
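A toy sketch of that top-$N$ selection (all module names and the value of $N$ are placeholders of mine; real two-stage Deformable DETR also regresses proposal boxes, which is omitted here for brevity):

```python
import torch
import torch.nn as nn

d, N = 256, 900                           # feature dim, number of object queries (placeholder)
enc_tokens = torch.randn(1, 1024, d)      # flattened multi-scale encoder outputs
token_xy = torch.rand(1, 1024, 2)         # normalized (x, y) location of each token

objectness = nn.Linear(d, 1)              # auxiliary prediction head (scores only, for brevity)
scores = objectness(enc_tokens).squeeze(-1)   # (1, 1024)
top_idx = scores.topk(N, dim=1).indices       # indices of the N best-scoring tokens

# their locations serve as initial reference points; their features can initialize the queries
reference_points = torch.gather(token_xy, 1, top_idx.unsqueeze(-1).expand(-1, -1, 2))  # (1, N, 2)
```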
With the help of deformable attention, the object queries can efficiently retrieve prompt-aware visual features and learn strong instance embeddings $F_{ins} \in \mathbb{R}^{N \times d}$.
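Finally, a very reduced picture of this decoder stage, assuming plain cross-attention in place of deformable attention: $N$ learnable object queries attend to the prompt-aware visual features and come out as instance embeddings $F_{ins} \in \mathbb{R}^{N \times d}$, which downstream heads can then compare against the prompt to retrieve the requested objects.

```python
import torch
import torch.nn as nn

d, N = 256, 900
queries = nn.Embedding(N, d)                                      # learnable object queries
cross_attn = nn.MultiheadAttention(d, num_heads=8, batch_first=True)

F_v_prime = torch.randn(1, 1024, d)                               # prompt-aware visual features
q = queries.weight.unsqueeze(0)                                   # (1, N, d)
F_ins, _ = cross_attn(query=q, key=F_v_prime, value=F_v_prime)    # (1, N, d) instance embeddings
```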