Hello-Mr-Crab/GoMatching

 
 

GoMatching & GoMatching++ & Enhanced

This repository builds on GoMatching and GoMatching++, adding three novel enhancements:
MST-Det (Multi-Scale Temporal-aware Detector) · LST-Tracker (Trajectory Smoothing & Feature Weighted Tracker) · CRN-Recognizer (Difficult Sample Mining Recognizer)

Enhanced by Hello,Mr Crab

Introduction | Enhancements | Usage | Demo | Main Results | Statement


Introduction

GoMatching/GoMatching++ are state-of-the-art video text spotters that turn an off-the-shelf query-based image text spotter (DeepSolo) into a video specialist via:

  • A rescoring mechanism and long-short term matching module.
  • Parameter-efficient fine-tuning (freezing backbone, training only ROI heads).
  • The ArTVideo benchmark for curved-text evaluation.

The enhanced version adds three complementary improvements that target weaknesses in detection, tracking, and recognition:

  1. MST-Det — Adaptive threshold and inter-frame temporal smoothing for more robust text detection.
  2. LST-Tracker — Motion-constrained association with weighted matching for improved ID consistency.
  3. CRN-Recognizer — Difficult sample mining with weighted focal loss for better recognition on hard cases.

All enhancements are optional and controlled from the config: set ENHANCED.ENABLED: True in the config YAML to activate them.
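
As a rough illustration of how these flags might nest, the sketch below mirrors the ENHANCED.* keys named in this README as a plain Python dict. The key names come from the "Config flags" lines in the Enhancements section. The values, the helper name is_active, and the rule that ENHANCED.ENABLED acts as a master switch over every sub-flag are assumptions made for illustration, so check the shipped config YAML for the actual defaults.

```python
# Hypothetical mirror of the ENHANCED config tree. Key names come from this
# README; every value below is a placeholder, not the repository's default.
ENHANCED = {
    "ENABLED": True,
    "MST_DET": {
        "ADAPTIVE_THRESHOLD": True,
        "TEMPORAL_SMOOTHING": True,
        "SMOOTH_ALPHA": 0.7,
    },
    "LST_TRACKER": {
        "MOTION_CONSISTENCY": True,
        "MOTION_WEIGHT": 0.3,
        "KALMAN_ENABLED": False,
    },
    "CRN_RECOGNIZER": {
        "DIFFICULT_SAMPLE_LOSS": True,
        "QUALITY_SCORING": True,
        "LABEL_SMOOTHING": True,
    },
}


def is_active(cfg, dotted_key):
    """Return True only if the master switch and the named sub-flag are both on.

    Assumes ENABLED gates all sub-flags (an assumption, not confirmed here).
    """
    if not cfg.get("ENABLED", False):
        return False
    node = cfg
    for part in dotted_key.split("."):
        node = node[part]
    return bool(node)
```

For example, is_active(ENHANCED, "LST_TRACKER.KALMAN_ENABLED") is False with the placeholder values above, while the other boolean flags resolve to True.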


Enhancements

1. MST-Det (Multi-Scale Temporal-aware Detector)

Component               | Description
Channel Attention (CAM) | Squeeze-and-excitation-like channel reweighting via avg+max pooling
Spatial Attention (SAM) | Spatial focus via concatenated avg+max pooling + 7×7 conv
Adaptive Threshold      | Dynamic confidence threshold based on text area ratio (small text → lower threshold)
Temporal Smoothing      | EMA-based inter-frame bounding box smoothing along matched tracks

Config flags: ENHANCED.MST_DET.ADAPTIVE_THRESHOLD, ENHANCED.MST_DET.TEMPORAL_SMOOTHING, ENHANCED.MST_DET.SMOOTH_ALPHA
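
The adaptive threshold and temporal smoothing components can be sketched in a few lines of plain Python. This is a minimal illustration, not the repository's implementation: the function names, the linear interpolation rule, the numeric defaults (base, floor, small_ratio), and the convention that alpha weights the newest observation are all assumptions.

```python
def adaptive_threshold(area_ratio, base=0.3, floor=0.15, small_ratio=0.01):
    """Lower the confidence threshold for small text.

    area_ratio is text area divided by image area. All defaults are
    illustrative placeholders, not values taken from the repository.
    """
    if area_ratio >= small_ratio:
        return base
    # Interpolate linearly from `floor` (area 0) up to `base` (area == small_ratio).
    return floor + (base - floor) * (area_ratio / small_ratio)


def ema_smooth(prev_box, new_box, alpha=0.7):
    """Exponential moving average of box coordinates along one track.

    alpha weights the new observation (an assumed convention for SMOOTH_ALPHA).
    """
    if prev_box is None:
        # First observation of a track: nothing to smooth against yet.
        return list(new_box)
    return [alpha * n + (1.0 - alpha) * p for p, n in zip(prev_box, new_box)]


# Usage: smooth one track frame by frame.
smoothed = None
for box in [(0, 0, 10, 10), (2, 0, 12, 10), (3, 1, 13, 11)]:
    smoothed = ema_smooth(smoothed, box)
```

Under this convention a larger alpha follows the detector more closely, while a smaller alpha gives steadier boxes at the cost of lag on fast-moving text.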

2. LST-Tracker (Trajectory Smoothing & Feature Weighted Tracker)

Component                    | Description
Motion Consistency           | Augments the matching score with a motion-prior term (center distance over box size), improving long-range ID association
Kalman Filter (experimental) | Per-track Kalman filter for state estimation and trajectory smoothing
Adaptive Weighted Matching   | Balances appearance (ReID features) and position (IoU + motion) in the Hungarian matching cost
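
One way to read the motion-consistency and weighted-matching rows is as a single cost that blends appearance similarity with a position term built from IoU and a motion prior. The sketch below assumes axis-aligned boxes (x1, y1, x2, y2), cosine similarity for appearance, and an exponential decay for the motion prior. The function names and default weights are hypothetical, and the actual modules may combine these terms differently.

```python
import math


def iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def motion_score(track_box, det_box):
    """Motion prior in (0, 1]: center distance normalized by the track box diagonal."""
    tcx, tcy = (track_box[0] + track_box[2]) / 2, (track_box[1] + track_box[3]) / 2
    dcx, dcy = (det_box[0] + det_box[2]) / 2, (det_box[1] + det_box[3]) / 2
    size = math.hypot(track_box[2] - track_box[0], track_box[3] - track_box[1])
    if size <= 0:
        return 0.0
    return math.exp(-math.hypot(tcx - dcx, tcy - dcy) / size)


def cosine(u, v):
    """Cosine similarity between two appearance feature vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm > 0 else 0.0


def match_cost(track_box, track_feat, det_box, det_feat,
               motion_weight=0.3, appearance_weight=0.5):
    """Cost in [0, 1]; lower means a better track/detection pair (weights are illustrative)."""
    position = ((1.0 - motion_weight) * iou(track_box, det_box)
                + motion_weight * motion_score(track_box, det_box))
    score = (appearance_weight * cosine(track_feat, det_feat)
             + (1.0 - appearance_weight) * position)
    return 1.0 - score


def cost_matrix(tracks, dets, **weights):
    """Rows are tracks, columns are detections; each item is a (box, feature) pair."""
    return [[match_cost(tb, tf, db, df, **weights) for db, df in dets]
            for tb, tf in tracks]
```

The resulting matrix is the kind of input a Hungarian solver such as scipy.optimize.linear_sum_assignment expects. That call is left out here to keep the sketch dependency-free.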

Config flags: ENHANCED.LST_TRACKER.MOTION_CONSISTENCY, ENHANCED.LST_TRACKER.MOTION_WEIGHT, ENHANCED.LST_TRACKER.KALMAN_ENABLED
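
For the experimental Kalman option, a constant-velocity filter over a single coordinate shows the predict/update cycle that per-track smoothing relies on. This is a generic textbook filter written in plain Python, not the repository's code: the class name Kalman1D, the one-frame time step, and the noise values q and r are assumptions. A box tracker would typically run one such filter per coordinate or use a joint state.

```python
class Kalman1D:
    """Constant-velocity Kalman filter for one coordinate (state: position, velocity)."""

    def __init__(self, x0, q=1e-2, r=1.0):
        self.x, self.v = float(x0), 0.0          # state estimate
        self.p = [[1.0, 0.0], [0.0, 1.0]]         # state covariance
        self.q, self.r = q, r                     # process / measurement noise (illustrative)

    def predict(self):
        """Advance one frame: x <- x + v, P <- F P F^T + Q with F = [[1, 1], [0, 1]]."""
        self.x += self.v
        p = self.p
        self.p = [
            [p[0][0] + p[0][1] + p[1][0] + p[1][1] + self.q, p[0][1] + p[1][1]],
            [p[1][0] + p[1][1], p[1][1] + self.q],
        ]
        return self.x

    def update(self, z):
        """Correct the prediction with a measured position z (H = [1, 0])."""
        p = self.p
        s = p[0][0] + self.r                      # innovation variance
        k0, k1 = p[0][0] / s, p[1][0] / s         # Kalman gain
        y = z - self.x                            # innovation
        self.x += k0 * y
        self.v += k1 * y
        # P <- (I - K H) P
        self.p = [
            [(1.0 - k0) * p[0][0], (1.0 - k0) * p[0][1]],
            [p[1][0] - k1 * p[0][0], p[1][1] - k1 * p[0][1]],
        ]
        return self.x


# Usage: track a coordinate that moves one pixel per frame.
kf = Kalman1D(0.0)
for z in range(1, 31):
    kf.predict()
    kf.update(float(z))
```

After a few frames the velocity estimate settles near the true per-frame motion, which lets predict() bridge short gaps when a detection is missed.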

3. CRN-Recognizer (Difficult Sample Mining Recognizer)

Component                           | Description
Difficult Sample Focal Loss         | Applies extra weight (up to 3×) to samples with confidence below a threshold (default p < 0.4), forcing the model to focus on hard cases
Text Quality Scorer (training only) | Predicts a quality score per text instance; used to filter low-quality detections in multi-frame selection
Label Smoothing                     | Replaces hard one-hot targets with smoothed distributions (ε = 0.1) to improve generalization

Config flags: ENHANCED.CRN_RECOGNIZER.DIFFICULT_SAMPLE_LOSS, ENHANCED.CRN_RECOGNIZER.QUALITY_SCORING, ENHANCED.CRN_RECOGNIZER.LABEL_SMOOTHING
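
The loss-side components can be illustrated on scalar probabilities. The threshold (0.4), the 3× cap, and ε = 0.1 come from the table above. The linear ramp used to reach the cap, the focal exponent gamma = 2, and all function names are assumptions, since this README does not specify them.

```python
import math


def hard_sample_weight(p, threshold=0.4, max_weight=3.0):
    """1.0 for confident samples; ramps linearly toward max_weight as p approaches 0.

    The linear ramp is an assumed shape for the "up to 3x" weighting.
    """
    if p >= threshold:
        return 1.0
    return 1.0 + (max_weight - 1.0) * (threshold - p) / threshold


def weighted_focal_loss(p, gamma=2.0, threshold=0.4, max_weight=3.0):
    """Focal loss -(1 - p)^gamma * log(p) on the true-class probability p,
    scaled by the hard-sample weight."""
    p = min(max(p, 1e-7), 1.0)  # clamp to avoid log(0)
    weight = hard_sample_weight(p, threshold, max_weight)
    return -weight * (1.0 - p) ** gamma * math.log(p)


def smooth_labels(target_index, num_classes, eps=0.1):
    """Standard label smoothing: (1 - eps) * one_hot + eps / num_classes."""
    base = eps / num_classes
    return [base + (1.0 - eps if i == target_index else 0.0)
            for i in range(num_classes)]
```

With these placeholder settings a sample predicted at p = 0.2 receives twice the weight of a confident one, and a four-class smoothed target puts 0.925 on the true class and 0.025 on each of the others.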


Usage

Dataset

Extract the videos in ICDAR15-video, DSText, and BOVText into frames. For training, use the JSON annotation files provided for ICDAR15-video and DSText. For ArTVideo, download the dataset to ./datasets. The expected structure is:

|- ./datasets
    |--- ICDAR15
    |      |--- frame/  frame_test/  train.json
    |--- DSText
    |      |--- frame/  frame_test/  train.json ...
    |--- BOVText
    |      |--- frame/  frame_test/  Train/  Test/ ...
    |--- ArTVideo
           |--- Train/  Test/  json/ ...
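
Before training, it can help to confirm that the layout above is in place. The helper below is a hypothetical convenience script, not part of the repository. It checks only the entries spelled out in the tree and deliberately ignores the items elided with "...".

```python
from pathlib import Path

# Entries named in the tree above; the "..." items are left out on purpose.
EXPECTED = {
    "ICDAR15": ["frame", "frame_test", "train.json"],
    "DSText": ["frame", "frame_test", "train.json"],
    "BOVText": ["frame", "frame_test", "Train", "Test"],
    "ArTVideo": ["Train", "Test", "json"],
}


def missing_entries(root="./datasets", datasets=None):
    """Return the expected dataset paths (relative to root) that do not exist."""
    root = Path(root)
    names = datasets if datasets is not None else list(EXPECTED)
    return [
        f"{name}/{entry}"
        for name in names
        for entry in EXPECTED[name]
        if not (root / name / entry).exists()
    ]


if __name__ == "__main__":
    for path in missing_entries():
        print("missing:", path)
```

Passing a subset such as missing_entries("./datasets", ["ICDAR15"]) limits the check to the datasets actually in use.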

Processing raw data:

python tools/video2frame.py
python tools/convert_gom_label/icdar15.py   # or bovtext.py / dstext.py

Installation

Python 3.8 + PyTorch 1.9.0 + CUDA 11.1 + Detectron2 v0.6

git clone https://github.com/Hello-Mr-Crab/GoMatching.git
cd GoMatching
conda create -n gomatching python=3.8 -y
conda activate gomatching
pip install torch==1.9.0+cu111 torchvision==0.10.0+cu111 -f https://download.pytorch.org/whl/torch_stable.html
pip install -r requirements.txt
python -m pip install detectron2 -f https://dl.fbaipublicfiles.com/detectron2/wheels/cu111/torch1.9/index.html
cd third_party
python setup.py build develop

Pre-trained Model

Download the DeepSolo weights from Google Drive into ./pretrained_models. If you use custom weights, first decouple the backbone and transformer:

python tools/decouple_deepsolo.py --input path_to_weights --output output_path

Training

Enhanced training (uses the Enh config to activate all three improvements):

python train_net.py --num-gpus 1 --config-file configs/GoMatching_PP_Enh_ICDAR15.yaml

Original baselines:

# GoMatching
python train_net.py --num-gpus 1 --config-file configs/GoMatching_ICDAR15.yaml
# GoMatching++
python train_net.py --num-gpus 1 --config-file configs/GoMatching_PP_ICDAR15.yaml

Other datasets (DSText, BOVText, ArTVideo) follow the same pattern with their respective config files.

Evaluation

python eval.py --config-file configs/GoMatching_PP_Enh_ICDAR15.yaml \
    --input ./datasets/ICDAR15/frame_test/ \
    --output output/enhanced_icdar15 \
    --opts MODEL.WEIGHTS trained_models/GoMPP_Enh_IC15/xxx.pth

cd output/enhanced_icdar15/preds
zip -r ../preds.zip ./*
zip -r ../track.zip ./*.xml

Submit preds.zip / track.zip to the ICDAR15 evaluation server.
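
Where the zip command-line tool is unavailable (for example on Windows), the packaging step above can be approximated with Python's standard zipfile module. The function name pack_predictions is hypothetical. The sketch assumes the same preds/ directory layout as the commands above and, unlike the shell glob, does not skip hidden files.

```python
import zipfile
from pathlib import Path


def pack_predictions(preds_dir):
    """Write preds.zip (all files) and track.zip (top-level *.xml only)
    into the parent of preds_dir, mirroring the two zip commands."""
    preds_dir = Path(preds_dir)
    out_dir = preds_dir.parent
    preds_zip, track_zip = out_dir / "preds.zip", out_dir / "track.zip"

    # Equivalent of: zip -r ../preds.zip ./*
    with zipfile.ZipFile(preds_zip, "w", zipfile.ZIP_DEFLATED) as zf:
        for path in sorted(p for p in preds_dir.rglob("*") if p.is_file()):
            zf.write(path, path.relative_to(preds_dir).as_posix())

    # Equivalent of: zip -r ../track.zip ./*.xml
    with zipfile.ZipFile(track_zip, "w", zipfile.ZIP_DEFLATED) as zf:
        for path in sorted(preds_dir.glob("*.xml")):
            if path.is_file():
                zf.write(path, path.name)

    return preds_zip, track_zip
```

Calling pack_predictions("output/enhanced_icdar15/preds") would then produce both archives next to the preds directory.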

For DSText, BOVText, and ArTVideo, see the original GoMatching++ evaluation protocol.


Demo

The demo.py script runs the full end-to-end pipeline on a video clip and shows the detection → tracking → recognition results in a live display window as frames are processed. It was built for thesis defense presentations.

Quick Start

python demo.py --video path/to/test_video.mp4

Options

Argument      | Default                                        | Description
--video       | (required)                                     | Input video file
--config      | configs/GoMatching_PP_Enh_ICDAR15.yaml         | Model config
--weights     | pretrained_models/deepsolo_icdar15_rescore.pth | Checkpoint
--output      | demo_output/                                   | Output directory
--vis-thresh  | 0.3                                            | Confidence threshold for visualization
--max-frames  |                                                | Limit number of frames (quick test)
--cpu         |                                                | Force CPU inference
--save-frames |                                                | Save individual frame images
--no-display  |                                                | Disable live display window
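
For readers who want to script around the demo, the options above map naturally onto an argparse parser. The sketch below reproduces the table's flag names and defaults, but it is not the actual demo.py source. The argument types and the choice of None as the default for --max-frames are assumptions.

```python
import argparse


def build_parser():
    """Parser mirroring the documented demo options (illustrative sketch only)."""
    parser = argparse.ArgumentParser(description="Video text spotting demo (sketch)")
    parser.add_argument("--video", required=True, help="Input video file")
    parser.add_argument("--config", default="configs/GoMatching_PP_Enh_ICDAR15.yaml",
                        help="Model config")
    parser.add_argument("--weights",
                        default="pretrained_models/deepsolo_icdar15_rescore.pth",
                        help="Checkpoint")
    parser.add_argument("--output", default="demo_output/", help="Output directory")
    parser.add_argument("--vis-thresh", type=float, default=0.3,
                        help="Confidence threshold for visualization")
    # Assumed: no limit unless the flag is given.
    parser.add_argument("--max-frames", type=int, default=None,
                        help="Limit number of frames (quick test)")
    parser.add_argument("--cpu", action="store_true", help="Force CPU inference")
    parser.add_argument("--save-frames", action="store_true",
                        help="Save individual frame images")
    parser.add_argument("--no-display", action="store_true",
                        help="Disable live display window")
    return parser


# Usage: parse the "quick CPU test" style of invocation.
args = build_parser().parse_args(
    ["--video", "datasets/test_clip.mp4", "--max-frames", "30", "--cpu", "--no-display"]
)
```

argparse exposes the dashed flags with underscores, so the values are read as args.vis_thresh, args.max_frames, args.save_frames, and args.no_display.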

Example

# Full pipeline with display
python demo.py --video datasets/test_clip.mp4

# Quick CPU test (first 30 frames, no display, save frames)
python demo.py --video datasets/test_clip.mp4 \
    --weights pretrained_models/deepsolo_icdar15_rescore.pth \
    --max-frames 30 --cpu --save-frames --no-display

# Comparison: enhanced vs baseline
python demo.py --video datasets/test_clip.mp4 \
    --config configs/GoMatching_PP_Enh_ICDAR15.yaml \
    --output demo_enhanced/
python demo.py --video datasets/test_clip.mp4 \
    --config configs/GoMatching_PP_ICDAR15.yaml \
    --output demo_baseline/

Output

  • demo_output/result.mp4 — Annotated video with track IDs, bounding polygons, and recognized text
  • demo_output/frame_*.jpg — (if --save-frames) Individual frame outputs

Main Results

ICDAR15-video

Method          | MOTA  | MOTP  | IDF1  | Trainable Params (M)
GoMatching      | 72.04 | 78.53 | 80.11 | 32.79
GoMatching++    | 72.20 | 78.52 | 80.11 | 11.80
Enhanced (ours) | TBD   | TBD   | TBD   | 11.80

DSText

Method          | MOTA  | MOTP  | IDF1  | Trainable Params (M)
GoMatching      | 22.83 | 80.43 | 46.06 | 32.79
GoMatching++    | 23.23 | 80.24 | 46.24 | 11.80
Enhanced (ours) | TBD   | TBD   | TBD   | 11.80

BOVText

Method       | MOTA | MOTP | IDF1 | Trainable Params (M)
GoMatching++ | 52.9 | 87.2 | 62.8 | 11.80

ArTVideo

Method       | MOTA | MOTP | IDF1 | Trainable Params (M)
GoMatching++ | 75.7 | 83.5 | 82.3 | 11.80

Results for the enhanced model (the TBD entries above) will be added once full training is complete.


Statement

This project is for research purposes only. The original GoMatching and GoMatching++ are by Haibin He et al.; the enhanced modules were implemented by Hello,Mr Crab. For questions, please contact the repository maintainer.

Citation

If you find this work helpful, please consider citing:

@inproceedings{he2024gomatching,
  title={GoMatching: A Simple Baseline for Video Text Spotting via Long and Short Term Matching},
  author={He, Haibin and Ye, Maoyuan and Zhang, Jing and Liu, Juhua and Du, Bo and Tao, Dacheng},
  booktitle={NeurIPS},
  year={2024}
}

@article{he2025gomatching++,
  title={GoMatching++: Parameter- and Data-Efficient Arbitrary-Shaped Video Text Spotting and Benchmarking},
  author={He, Haibin and Zhang, Jing and Ye, Maoyuan and Liu, Juhua and Du, Bo and Tao, Dacheng},
  journal={arXiv:2505.22228},
  year={2025}
}

Acknowledgements

Built on DeepSolo, GTR, TransDETR, and BOVText.
