OV-DEIM is a real-time DETR-style framework for open-vocabulary object detection. It extends DEIMv2 to the open-vocabulary setting and achieves state-of-the-art performance on open-vocabulary benchmarks with superior inference efficiency. The framework is further enhanced by the Query Supplement strategy, which improves Fixed AP without sacrificing speed. In addition, GridSynthetic is introduced as a data augmentation approach to mitigate noisy localization effects in classification learning and enhance robustness, particularly for rare categories.
Abstract
Real-time open-vocabulary object detection faces two key challenges: achieving high inference efficiency and maintaining robust semantic recognition across a large vocabulary. In this work, we present OV-DEIM, an end-to-end DETR-style detector built upon the recent DEIMv2 framework with vision–language modeling for efficient open-vocabulary inference. Unlike YOLO-style detectors, whose category-dependent post-processing cost increases with vocabulary size, OV-DEIM avoids such overhead and scales more gracefully to large vocabularies. This is further supported by a lightweight query supplement strategy that improves Fixed AP without sacrificing inference speed. Beyond architectural efficiency, we focus on strengthening classification robustness. We propose GridSynthetic, a simple yet effective data augmentation method that composes multiple training samples into structured image grids. By exposing the model to richer object co-occurrence patterns and spatial layouts within a single forward pass, GridSynthetic reduces the negative impact of noisy localization signals in the classification loss and enhances semantic discrimination, particularly for rare categories. Importantly, this improvement introduces no additional inference cost. Extensive experiments demonstrate that OV-DEIM achieves state-of-the-art performance on open-vocabulary detection benchmarks, delivering superior efficiency and notable gains on challenging rare categories. Code will be released upon publication.

- For training data, OG denotes Objects365v1 and GoldG.
- FPS is measured on an NVIDIA T4 GPU with TensorRT.
- Fixed AP is improved by the Query Supplement strategy.
| Model | Size | Params | Data | FPS | | | | | Config | Checkpoint |
|---|---|---|---|---|---|---|---|---|---|---|
| OV-DEIM-S | 640 | 11M | OG | 161 | 27.7 / 29.6 | 23.6 / 25.2 | 28.1 / 30.2 | 28.0 / 30.0 | S | Baidu |
| OV-DEIM-M | 640 | 20M | OG | 109 | 30.6 / 32.6 | 25.3 / 26.9 | 30.2 / 31.5 | 31.9 / 34.1 | M | Baidu |
| OV-DEIM-L | 640 | 36M | OG | 91 | 33.7 / 35.9 | 34.3 / 36.8 | 33.4 / 35.5 | 34.0 / 36.0 | L | Baidu |
| Model | Size | Params | | | |
|---|---|---|---|---|---|
| OV-DEIM-S | 640 | 11M | 40.8 | 56.3 | 44.4 |
| OV-DEIM-M | 640 | 20M | 43.3 | 60.2 | 48.0 |
| OV-DEIM-L | 640 | 36M | 45.9 | 62.3 | 49.9 |
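
The abstract describes GridSynthetic as composing multiple training samples into structured image grids. The exact procedure is not spelled out in this README, so the following is only a generic sketch of grid composition, not the authors' implementation. It assumes equal-size square cells, nearest-neighbour resizing, and boxes given as `xyxy` pixel coordinates; the function name `compose_grid` and its parameters are hypothetical.

```python
import numpy as np


def compose_grid(images, boxes, cell=320, cols=2):
    """Tile images into a grid and remap their boxes (illustrative sketch only).

    images: non-empty list of HxWx3 uint8 arrays.
    boxes:  list of (N_i, 4) arrays in xyxy pixel coordinates, one per image.
    Returns the grid canvas and an (N, 4) array of boxes in canvas coordinates.
    """
    rows = -(-len(images) // cols)  # ceiling division
    canvas = np.zeros((rows * cell, cols * cell, 3), dtype=np.uint8)
    grid_boxes = []
    for i, (img, bxs) in enumerate(zip(images, boxes)):
        r, c = divmod(i, cols)
        h, w = img.shape[:2]
        # Nearest-neighbour resize by index sampling (assumption: keeps the
        # sketch dependency-free; a real pipeline would use a proper resize).
        ys = np.arange(cell) * h // cell
        xs = np.arange(cell) * w // cell
        canvas[r * cell:(r + 1) * cell, c * cell:(c + 1) * cell] = img[ys][:, xs]
        # Scale each box into the cell, then shift it by the cell origin.
        scale = np.array([cell / w, cell / h, cell / w, cell / h])
        offset = np.array([c * cell, r * cell, c * cell, r * cell])
        bxs = np.asarray(bxs, dtype=np.float64).reshape(-1, 4)
        grid_boxes.append(bxs * scale + offset)
    return canvas, np.concatenate(grid_boxes, axis=0)
```

Class labels would simply be concatenated in the same order as the boxes. How cells are sampled, sized, or filtered in the actual method is not covered by this sketch.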
conda create -p ./envs/ovod python=3.10
conda activate ./envs/ovod
pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 --index-url https://download.pytorch.org/whl/cu118
pip install hydra-core --upgrade
pip install albumentations
pip install opencv-python
pip install lvis
pip install swanlab -i https://mirrors.cernet.edu.cn/pypi/web/simple
swanlab login
pip install pycocotools
pip install h5py
Locate the line:
from lvis import LVIS, LVISEval
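Navigate to the source file where LVISEval is defined. In that file, modify lines 361 and 362 by replacing np.float with np.float64. The np.float alias was deprecated in NumPy 1.20 and removed in NumPy 1.24, so the unpatched file raises an AttributeError with recent NumPy versions.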
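
If editing the file by hand is inconvenient, the same replacement can be scripted. The sketch below is not part of the repository; the helper names `replace_np_float` and `patch_lvis_eval` are hypothetical. It locates the file through `inspect` instead of relying on line numbers, and rewrites only the bare `np.float` alias, leaving `np.float32` and `np.float64` untouched.

```python
import inspect
import re
from pathlib import Path

# Matches the removed alias `np.float`, but not `np.float32`, `np.float64`
# or `np.float_` (the trailing \b fails before a digit or underscore).
_NP_FLOAT = re.compile(r"\bnp\.float\b")


def replace_np_float(source: str) -> str:
    """Return `source` with every bare `np.float` replaced by `np.float64`."""
    return _NP_FLOAT.sub("np.float64", source)


def patch_lvis_eval() -> Path:
    """Patch the installed file that defines LVISEval in place; return its path."""
    # Imported lazily so replace_np_float stays usable without lvis installed.
    from lvis import LVISEval

    source_file = inspect.getsourcefile(LVISEval)
    if source_file is None:
        raise RuntimeError("Could not locate the source file of LVISEval")
    path = Path(source_file)
    original = path.read_text()
    patched = replace_np_float(original)
    if patched != original:
        path.write_text(patched)
    return path
```

Calling `patch_lvis_eval()` once inside the activated environment applies the fix; running it again is a no-op because no bare `np.float` remains.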
Download the original backbone weights from DEIMv2 and the corresponding text data, where the text embeddings are extracted using MobileCLIP-B(LT).
| Images | Raw Annotations |
|---|---|
| Objects365v1 | objects365_train.json |
| GQA | final_mixed_train_no_coco.json |
| Flickr30k | final_flickr_separateGT_train.json |
torchrun \
--nnodes=${NNODES} \
--nproc_per_node=${NUM_GPUS} \
--node_rank=${NODE_RANK} \
--rdzv_backend=c10d \
--rdzv_endpoint=${MASTER_ADDR}:${MASTER_PORT} \
train_ori_torchrun.py \
--config 'base_l' \
--collate_func "train_collate_final" \
--pipeline_type 'aug' \
--batch_size 16 \
--num_training_classes 150 \
--alpha 0.5
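
The command above expects `NNODES`, `NUM_GPUS`, `NODE_RANK`, `MASTER_ADDR` and `MASTER_PORT` to be set in the shell. As a convenience, the sketch below assembles the same command from environment variables and falls back to single-node values when they are missing. The fallback values (one node, one GPU, `127.0.0.1:29500`) and the helper name `build_torchrun_cmd` are assumptions for illustration, not settings taken from the repository.

```python
import os
import shlex


def build_torchrun_cmd(env=None):
    """Build the torchrun argument list shown above from environment variables."""
    env = os.environ if env is None else env
    # Fallbacks below are assumed single-node defaults, not repository settings.
    nnodes = env.get("NNODES", "1")
    num_gpus = env.get("NUM_GPUS", "1")
    node_rank = env.get("NODE_RANK", "0")
    master_addr = env.get("MASTER_ADDR", "127.0.0.1")
    master_port = env.get("MASTER_PORT", "29500")
    return [
        "torchrun",
        f"--nnodes={nnodes}",
        f"--nproc_per_node={num_gpus}",
        f"--node_rank={node_rank}",
        "--rdzv_backend=c10d",
        f"--rdzv_endpoint={master_addr}:{master_port}",
        "train_ori_torchrun.py",
        "--config", "base_l",
        "--collate_func", "train_collate_final",
        "--pipeline_type", "aug",
        "--batch_size", "16",
        "--num_training_classes", "150",
        "--alpha", "0.5",
    ]


# Print the command for inspection; pass the list to subprocess.run to launch it.
print(shlex.join(build_torchrun_cmd()))
```

Printing the command first makes it easy to check the resolved values before starting a long training run.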
The code base is built with YOLO-World, YOLOE, MobileCLIP, RT-DETR, DINOv3 and DEIMv2.
@misc{wang2026ovdeim,
title={OV-DEIM: Real-time DETR-Style Open-Vocabulary Object Detection with GridSynthetic Augmentation},
author={Leilei Wang and Longfei Liu and Xi Shen and Xuanlong Yu and Ying Tiffany He and Fei Richard Yu and Yingyi Chen},
year={2026},
eprint={2603.07022},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2603.07022},
}
