This repository builds on GoMatching and GoMatching++, adding three novel enhancements:
- MST-Det (Multi-Scale Temporal-aware Detector)
- LST-Tracker (Trajectory Smoothing & Feature Weighted Tracker)
- CRN-Recognizer (Difficult Sample Mining Recognizer)
Introduction | Enhancements | Usage | Demo | Main Results | Statement
GoMatching/GoMatching++ are state-of-the-art video text spotters that turn an off-the-shelf query-based image text spotter (DeepSolo) into a video specialist via:
- A rescoring mechanism and long-short term matching module.
- Parameter-efficient fine-tuning (freezing backbone, training only ROI heads).
- The ArTVideo benchmark for curved-text evaluation.
The enhanced version adds three complementary improvements that target weaknesses in detection, tracking, and recognition:
- MST-Det — Adaptive threshold and inter-frame temporal smoothing for more robust text detection.
- LST-Tracker — Motion-constrained association with weighted matching for improved ID consistency.
- CRN-Recognizer — Difficult sample mining with weighted focal loss for better recognition on hard cases.
All enhancements are optional and controlled through the config: set ENHANCED.ENABLED: True in the config YAML to activate them.
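As a rough illustration of the master switch, the sketch below models the config as a nested dict and treats a sub-flag as active only when ENHANCED.ENABLED is also true. The gating behavior, the dict layout, and the helper names `get_key` and `flag_active` are assumptions made for illustration; the repository's own config loader may resolve flags differently.

```python
# Hypothetical in-memory view of the config; only the key names come from this README.
cfg = {
    "ENHANCED": {
        "ENABLED": True,
        "MST_DET": {"ADAPTIVE_THRESHOLD": True, "TEMPORAL_SMOOTHING": False},
    }
}


def get_key(config, dotted_key, default=None):
    """Look up a dotted key such as 'ENHANCED.MST_DET.ADAPTIVE_THRESHOLD'."""
    node = config
    for part in dotted_key.split("."):
        if not isinstance(node, dict) or part not in node:
            return default
        node = node[part]
    return node


def flag_active(config, dotted_key):
    """Assumed semantics: a sub-flag only counts when the master switch is on."""
    master_on = bool(get_key(config, "ENHANCED.ENABLED", False))
    return master_on and bool(get_key(config, dotted_key, False))
```

Under this assumption, switching ENHANCED.ENABLED to False disables every enhancement at once, which would let the same config file reproduce the GoMatching++ baseline.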
MST-Det components:

| Component | Description |
|---|---|
| Channel Attention (CAM) | Squeeze-and-excitation-like channel reweighting via avg+max pooling |
| Spatial Attention (SAM) | Spatial focus via concatenated avg+max pooling + 7×7 conv |
| Adaptive Threshold | Dynamic confidence threshold based on text area ratio (small text → lower threshold) |
| Temporal Smoothing | EMA-based inter-frame bounding box smoothing along matched tracks |
Config flags: ENHANCED.MST_DET.ADAPTIVE_THRESHOLD, ENHANCED.MST_DET.TEMPORAL_SMOOTHING, ENHANCED.MST_DET.SMOOTH_ALPHA
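To make the adaptive threshold and temporal smoothing rows more concrete, here is a minimal sketch of both ideas. It is not the repository's implementation: the function names, the linear interpolation rule, and every numeric default (`base`, `min_thresh`, `ref_ratio`, `alpha`) are assumptions, as is the convention that `alpha` weights the current frame.

```python
def adaptive_threshold(box_area, image_area, base=0.3, min_thresh=0.15, ref_ratio=0.01):
    """Lower the confidence threshold linearly as the text region gets smaller.

    All numeric defaults are illustrative assumptions, not values from the repo.
    """
    ratio = box_area / image_area
    scale = min(ratio / ref_ratio, 1.0)  # near 0 for tiny text, 1 at or above ref_ratio
    return min_thresh + (base - min_thresh) * scale


def ema_smooth(prev_box, cur_box, alpha=0.7):
    """Exponential moving average over box coordinates along one track.

    Assumed convention: alpha is the weight of the current frame.
    """
    if prev_box is None:  # first frame of a track: nothing to smooth against
        return list(cur_box)
    return [alpha * c + (1.0 - alpha) * p for p, c in zip(prev_box, cur_box)]
```

In a tracking loop, `ema_smooth` would be called once per matched track per frame, feeding the previous smoothed box back in as `prev_box`.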
LST-Tracker components:

| Component | Description |
|---|---|
| Motion Consistency | Augments the matching score with a motion-prior term (center distance over box size), improving long-range ID association |
| Kalman Filter (experimental) | Per-track Kalman filter for state estimation and trajectory smoothing |
| Adaptive Weighted Matching | Balances appearance (ReID features) and position (IoU + motion) in the Hungarian matching cost |
Config flags: ENHANCED.LST_TRACKER.MOTION_CONSISTENCY, ENHANCED.LST_TRACKER.MOTION_WEIGHT, ENHANCED.LST_TRACKER.KALMAN_ENABLED
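The motion-prior term and the weighted matching cost can be sketched as follows. This is an illustration under stated assumptions rather than the tracker's actual code: the exponential mapping of the normalized center distance, the weight defaults, and all function names are hypothetical, and the exhaustive `assign` helper is a plain stand-in for the Hungarian algorithm that is only practical for a handful of tracks.

```python
import math
from itertools import permutations


def iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def motion_score(track_box, det_box):
    """Center distance divided by track box size, mapped to (0, 1] (assumed mapping)."""
    tcx, tcy = (track_box[0] + track_box[2]) / 2, (track_box[1] + track_box[3]) / 2
    dcx, dcy = (det_box[0] + det_box[2]) / 2, (det_box[1] + det_box[3]) / 2
    dist = math.hypot(tcx - dcx, tcy - dcy)
    area = (track_box[2] - track_box[0]) * (track_box[3] - track_box[1])
    size = math.sqrt(max(area, 1e-6))
    return math.exp(-dist / size)


def match_cost(app_sim, track_box, det_box, app_weight=0.5, motion_weight=0.3):
    """Blend appearance similarity with a position term (IoU + motion); weights are assumptions."""
    position = (1.0 - motion_weight) * iou(track_box, det_box) \
        + motion_weight * motion_score(track_box, det_box)
    return 1.0 - (app_weight * app_sim + (1.0 - app_weight) * position)


def assign(cost):
    """Exhaustive minimum-cost assignment for a small square cost matrix.

    Stand-in for the Hungarian algorithm; returns the detection index per track.
    """
    n = len(cost)
    best = min(permutations(range(n)), key=lambda p: sum(cost[i][p[i]] for i in range(n)))
    return list(best)
```

With two tracks and two detections whose appearance scores are identical, the position term alone decides the match, which is the situation where a motion prior is expected to help ID consistency.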
CRN-Recognizer components:

| Component | Description |
|---|---|
| Difficult Sample Focal Loss | Applies extra weight (up to 3×) to samples with confidence below a threshold (default p < 0.4), forcing the model to focus on hard cases |
| Text Quality Scorer (training only) | Predicts a quality score per text instance; used to filter low-quality detections in multi-frame selection |
| Label Smoothing | Replaces hard one-hot targets with smoothed distributions (ε = 0.1) to improve generalization |
Config flags: ENHANCED.CRN_RECOGNIZER.DIFFICULT_SAMPLE_LOSS, ENHANCED.CRN_RECOGNIZER.QUALITY_SCORING, ENHANCED.CRN_RECOGNIZER.LABEL_SMOOTHING
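A minimal per-sample sketch of the difficult-sample weighting and label smoothing described in the table is shown below. Only the 3× cap, the 0.4 threshold, and ε = 0.1 come from the table; the linear ramp of the weight, the focal exponent `gamma=2.0`, and the function names are assumptions, so treat this as one plausible reading rather than the recognizer's loss.

```python
import math


def hard_sample_weight(p_true, thresh=0.4, max_weight=3.0):
    """Weight 1 for confident samples, ramping linearly up to max_weight as p_true -> 0.

    The linear ramp is an assumption; the README only states 'up to 3x' below p < 0.4.
    """
    if p_true >= thresh:
        return 1.0
    return 1.0 + (max_weight - 1.0) * (thresh - p_true) / thresh


def weighted_focal_loss(p_true, gamma=2.0, thresh=0.4, max_weight=3.0):
    """Focal loss on the true-class probability, scaled by the hard-sample weight.

    gamma=2.0 is an assumed default, not a value taken from the repository.
    """
    p = min(max(p_true, 1e-7), 1.0)  # clamp to avoid log(0)
    focal = -((1.0 - p) ** gamma) * math.log(p)
    return hard_sample_weight(p_true, thresh, max_weight) * focal


def smooth_labels(target_index, num_classes, eps=0.1):
    """Standard label smoothing: (1 - eps) on the target plus eps spread uniformly."""
    off = eps / num_classes
    return [(1.0 - eps) + off if k == target_index else off for k in range(num_classes)]
```

A sample predicted with low confidence on its true class is penalized twice over here, once by the focal term and once by the extra weight, which matches the stated goal of focusing training on hard cases.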
Videos in ICDAR15-video, DSText, and BOVText must be extracted into frames. For training, use the JSON annotation files we provide (available for ICDAR15-video and DSText). Download ArTVideo to ./datasets. Expected structure:
|- ./datasets
|--- ICDAR15
| |--- frame/ frame_test/ train.json
|--- DSText
| |--- frame/ frame_test/ train.json ...
|--- BOVText
| |--- frame/ frame_test/ Train/ Test/ ...
|--- ArTVideo
| |--- Train/ Test/ json/ ...
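Before training, it can help to confirm that the layout matches the tree above. The following sketch is not part of the repository; the `EXPECTED` mapping is copied from the tree (entries hidden behind `...` are omitted), and `missing_entries` is a hypothetical helper name.

```python
from pathlib import Path

# Expected entries per dataset, copied from the tree above; "..." entries are omitted.
EXPECTED = {
    "ICDAR15": ["frame", "frame_test", "train.json"],
    "DSText": ["frame", "frame_test", "train.json"],
    "BOVText": ["frame", "frame_test", "Train", "Test"],
    "ArTVideo": ["Train", "Test", "json"],
}


def missing_entries(root="./datasets", expected=EXPECTED):
    """Return 'Dataset/entry' strings for every expected path that does not exist."""
    root = Path(root)
    return [
        f"{name}/{entry}"
        for name, entries in expected.items()
        for entry in entries
        if not (root / name / entry).exists()
    ]


if __name__ == "__main__":
    for item in missing_entries():
        print("missing:", item)
```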
Processing raw data:
python tools/video2frame.py
python tools/convert_gom_label/icdar15.py # or bovtext.py / dstext.py

Python 3.8 + PyTorch 1.9.0 + CUDA 11.1 + Detectron2 v0.6
git clone https://github.com/Hxyz-123/GoMatching.git
cd GoMatching
conda create -n gomatching python=3.8 -y
conda activate gomatching
pip install torch==1.9.0+cu111 torchvision==0.10.0+cu111 -f https://download.pytorch.org/whl/torch_stable.html
pip install -r requirements.txt
python -m pip install detectron2 -f https://dl.fbaipublicfiles.com/detectron2/wheels/cu111/torch1.9/index.html
cd third_party
python setup.py build develop

Download DeepSolo weights from Google Drive to ./pretrained_models.
If using custom weights, decouple backbone and transformer:
python tools/decouple_deepsolo.py --input path_to_weights --output output_path

Enhanced training (uses the Enh config to activate all three improvements):
python train_net.py --num-gpus 1 --config-file configs/GoMatching_PP_Enh_ICDAR15.yaml

Original baselines:
# GoMatching
python train_net.py --num-gpus 1 --config-file configs/GoMatching_ICDAR15.yaml
# GoMatching++
python train_net.py --num-gpus 1 --config-file configs/GoMatching_PP_ICDAR15.yaml

Other datasets (DSText, BOVText, ArTVideo) follow the same pattern with their respective config files.
python eval.py --config-file configs/GoMatching_PP_Enh_ICDAR15.yaml \
--input ./datasets/ICDAR15/frame_test/ \
--output output/enhanced_icdar15 \
--opts MODEL.WEIGHTS trained_models/GoMPP_Enh_IC15/xxx.pth
cd output/enhanced_icdar15/preds
zip -r ../preds.zip ./*
zip -r ../track.zip ./*.xml

Submit preds.zip / track.zip to the ICDAR15 evaluation server.
For DSText, BOVText, and ArTVideo, see the original GoMatching++ evaluation protocol.
The demo.py script runs the full end-to-end pipeline on a video clip and displays detection → tracking → recognition in real time. It was built for thesis defense presentations.
python demo.py --video path/to/test_video.mp4

| Argument | Default | Description |
|---|---|---|
| --video | (required) | Input video file |
| --config | configs/GoMatching_PP_Enh_ICDAR15.yaml | Model config |
| --weights | pretrained_models/deepsolo_icdar15_rescore.pth | Checkpoint |
| --output | demo_output/ | Output directory |
| --vis-thresh | 0.3 | Confidence threshold for visualization |
| --max-frames | (none) | Limit number of frames (quick test) |
| --cpu | (none) | Force CPU inference |
| --save-frames | (none) | Save individual frame images |
| --no-display | (none) | Disable live display window |
# Full pipeline with display
python demo.py --video datasets/test_clip.mp4
# Quick CPU test (first 30 frames, no display, save frames)
python demo.py --video datasets/test_clip.mp4 \
--weights pretrained_models/deepsolo_icdar15_rescore.pth \
--max-frames 30 --cpu --save-frames --no-display
# Comparison: enhanced vs baseline
python demo.py --video datasets/test_clip.mp4 \
--config configs/GoMatching_PP_Enh_ICDAR15.yaml \
--output demo_enhanced/
python demo.py --video datasets/test_clip.mp4 \
--config configs/GoMatching_PP_ICDAR15.yaml \
--output demo_baseline/

Outputs:
- demo_output/result.mp4: Annotated video with track IDs, bounding polygons, and recognized text
- demo_output/frame_*.jpg: (if --save-frames) Individual frame outputs
| Method | MOTA | MOTP | IDF1 | Trainable Params (M) |
|---|---|---|---|---|
| GoMatching | 72.04 | 78.53 | 80.11 | 32.79 |
| GoMatching++ | 72.20 | 78.52 | 80.11 | 11.80 |
| Enhanced (ours) | TBD | TBD | TBD | 11.80 |
| Method | MOTA | MOTP | IDF1 | Trainable Params (M) |
|---|---|---|---|---|
| GoMatching | 22.83 | 80.43 | 46.06 | 32.79 |
| GoMatching++ | 23.23 | 80.24 | 46.24 | 11.80 |
| Enhanced (ours) | TBD | TBD | TBD | 11.80 |
| Method | MOTA | MOTP | IDF1 | Trainable Params (M) |
|---|---|---|---|---|
| GoMatching++ | 52.9 | 87.2 | 62.8 | 11.80 |
| Method | MOTA | MOTP | IDF1 | Trainable Params (M) |
|---|---|---|---|---|
| GoMatching++ | 75.7 | 83.5 | 82.3 | 11.80 |
Enhanced results will be updated after full training.
This project is for research purposes only. The original GoMatching/GoMatching++ is by Haibin He et al.; the enhanced modules were implemented by Hello,Mr Crab. For questions, please contact the repository maintainer.
If you find this work helpful, please consider citing:
@inproceedings{he2024gomatching,
title={GoMatching: A Simple Baseline for Video Text Spotting via Long and Short Term Matching},
author={He, Haibin and Ye, Maoyuan and Zhang, Jing and Liu, Juhua and Du, Bo and Tao, Dacheng},
booktitle={NeurIPS},
year={2024}
}
@article{he2025gomatching++,
title={GoMatching++: Parameter- and Data-Efficient Arbitrary-Shaped Video Text Spotting and Benchmarking},
author={He, Haibin and Zhang, Jing and Ye, Maoyuan and Liu, Juhua and Du, Bo and Tao, Dacheng},
journal={arXiv:2505.22228},
year={2025}
}