GoMatching & GoMatching++ & Enhanced

This repository builds on GoMatching and GoMatching++, adding three novel enhancements:
MST-Det (Multi-Scale Temporal-aware Detector) · LST-Tracker (Trajectory Smoothing & Feature Weighted Tracker) · CRN-Recognizer (Difficult Sample Mining Recognizer)

Enhanced by Hello,Mr Crab

Introduction

GoMatching/GoMatching++ are state-of-the-art video text spotters that turn an off-the-shelf query-based image text spotter (DeepSolo) into a video specialist via:

A rescoring mechanism and long-short term matching module.
Parameter-efficient fine-tuning (freezing backbone, training only ROI heads).
The ArTVideo benchmark for curved-text evaluation.

The Enhanced version adds three complementary improvements, one each for weaknesses in detection, tracking, and recognition:

MST-Det — Adaptive threshold and inter-frame temporal smoothing for more robust text detection.
LST-Tracker — Motion-constrained association with weighted matching for improved ID consistency.
CRN-Recognizer — Difficult sample mining with weighted focal loss for better recognition on hard cases.

All enhancements are optional and controlled from the config: set ENHANCED.ENABLED: True in the config YAML to activate them.

Enhancements

1. MST-Det (Multi-Scale Temporal-aware Detector)

Component	Description
Channel Attention (CAM)	Squeeze-and-excitation-like channel reweighting via avg+max pooling
Spatial Attention (SAM)	Spatial focus via concatenated avg+max pooling + 7×7 conv
Adaptive Threshold	Dynamic confidence threshold based on text area ratio (small text → lower threshold)
Temporal Smoothing	EMA-based inter-frame bounding box smoothing along matched tracks

Config flags: ENHANCED.MST_DET.ADAPTIVE_THRESHOLD, ENHANCED.MST_DET.TEMPORAL_SMOOTHING, ENHANCED.MST_DET.SMOOTH_ALPHA
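
The adaptive threshold and temporal smoothing rows above can be illustrated with a minimal sketch. It is not the repository's implementation: the function names, the linear interpolation, and every numeric default (base, floor, small_ratio, alpha) are assumptions chosen for illustration, and the exact meaning of SMOOTH_ALPHA in the repo is assumed to be the weight of the new observation.

```python
def adaptive_threshold(area_ratio, base=0.3, floor=0.15, small_ratio=0.01):
    """Return a confidence threshold that drops for small text.

    area_ratio is text area divided by frame area. All defaults are
    illustrative assumptions, not values taken from the repository.
    """
    if area_ratio >= small_ratio:
        return base
    # Interpolate linearly from `floor` (area 0) up to `base` (area == small_ratio).
    return floor + (base - floor) * (area_ratio / small_ratio)


def ema_smooth(prev_box, new_box, alpha=0.7):
    """Exponential moving average of two boxes given as (x1, y1, x2, y2).

    alpha weights the new observation (assumed role of SMOOTH_ALPHA).
    """
    return tuple(alpha * n + (1.0 - alpha) * p for p, n in zip(prev_box, new_box))


def smooth_track(boxes, alpha=0.7):
    """Smooth the boxes of one matched track; the first box is kept as-is."""
    if not boxes:
        return []
    smoothed = [tuple(float(v) for v in boxes[0])]
    for box in boxes[1:]:
        smoothed.append(ema_smooth(smoothed[-1], box, alpha))
    return smoothed
```

A larger alpha follows the detector more closely, while a smaller alpha suppresses jitter at the cost of lag on fast-moving text.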

2. LST-Tracker (Trajectory Smoothing & Feature Weighted Tracker)

Component	Description
Motion Consistency	Augments the matching score with a motion-prior term (center distance over box size), improving long-range ID association
Kalman Filter (experimental)	Per-track Kalman filter for state estimation and trajectory smoothing
Adaptive Weighted Matching	Balances appearance (ReID features) and position (IoU + motion) in the Hungarian matching cost

Config flags: ENHANCED.LST_TRACKER.MOTION_CONSISTENCY, ENHANCED.LST_TRACKER.MOTION_WEIGHT, ENHANCED.LST_TRACKER.KALMAN_ENABLED
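
As a rough illustration of how a motion-prior term and weighted matching might combine, the sketch below builds a cost matrix from appearance similarity, IoU, and center distance over box size. The exponential mapping, the dictionary layout with "box" and "feat" keys, and the weight values are all assumptions; the actual formulation in the repository may differ.

```python
import math


def iou(a, b):
    """Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def motion_prior(track_box, det_box):
    """Map center distance over track box size to (0, 1]; 1 means no displacement."""
    dx = (det_box[0] + det_box[2]) / 2 - (track_box[0] + track_box[2]) / 2
    dy = (det_box[1] + det_box[3]) / 2 - (track_box[1] + track_box[3]) / 2
    size = math.hypot(track_box[2] - track_box[0], track_box[3] - track_box[1])
    if size <= 0:
        return 0.0
    return math.exp(-math.hypot(dx, dy) / size)


def cosine(u, v):
    """Cosine similarity of two feature vectors; 0 if either has zero norm."""
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(x * x for x in v))
    if nu == 0 or nv == 0:
        return 0.0
    return sum(x * y for x, y in zip(u, v)) / (nu * nv)


def match_score(track, det, motion_weight=0.3, appearance_weight=0.5):
    """Blend appearance with position (IoU plus motion prior). Weights are assumed."""
    position = ((1.0 - motion_weight) * iou(track["box"], det["box"])
                + motion_weight * motion_prior(track["box"], det["box"]))
    appearance = cosine(track["feat"], det["feat"])
    return appearance_weight * appearance + (1.0 - appearance_weight) * position


def cost_matrix(tracks, dets, **weights):
    """Rows are tracks, columns are detections; lower cost means a better match."""
    return [[1.0 - match_score(t, d, **weights) for d in dets] for t in tracks]
```

The resulting matrix could then be handed to an assignment solver such as scipy.optimize.linear_sum_assignment to obtain the Hungarian matching.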

3. CRN-Recognizer (Difficult Sample Mining Recognizer)

Component	Description
Difficult Sample Focal Loss	Applies extra weight (up to 3×) to samples with confidence below a threshold (default p < 0.4), forcing the model to focus on hard cases
Text Quality Scorer (training only)	Predicts a quality score per text instance; used to filter low-quality detections in multi-frame selection
Label Smoothing	Replaces hard one-hot targets with smoothed distributions (ε = 0.1) to improve generalization

Config flags: ENHANCED.CRN_RECOGNIZER.DIFFICULT_SAMPLE_LOSS, ENHANCED.CRN_RECOGNIZER.QUALITY_SCORING, ENHANCED.CRN_RECOGNIZER.LABEL_SMOOTHING
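
One way to read the loss described above is a focal-weighted cross-entropy on label-smoothed targets, with an extra multiplier for low-confidence samples. The sketch below works on a single character's class probabilities in plain Python. The linear ramp up to 3×, the focusing parameter gamma, and the function names are assumptions; only the 3× cap, the 0.4 threshold, and ε = 0.1 come from the table.

```python
import math


def smooth_targets(label, num_classes, eps=0.1):
    """Label smoothing: (1 - eps) on the true class plus eps / num_classes everywhere."""
    off = eps / num_classes
    return [off + (1.0 - eps if k == label else 0.0) for k in range(num_classes)]


def hard_sample_weight(p_true, threshold=0.4, max_weight=3.0):
    """Weight 1 at or above the threshold, ramping linearly to max_weight at p = 0.

    The linear ramp is an assumed shape; the source only states "up to 3x".
    """
    if p_true >= threshold:
        return 1.0
    return 1.0 + (max_weight - 1.0) * (threshold - p_true) / threshold


def difficult_sample_focal_loss(probs, label, gamma=2.0, eps=0.1,
                                threshold=0.4, max_weight=3.0):
    """Focal-modulated, label-smoothed cross-entropy for one sample."""
    p_true = probs[label]
    targets = smooth_targets(label, len(probs), eps)
    # Cross-entropy against the smoothed distribution; clamp to avoid log(0).
    ce = -sum(t * math.log(max(p, 1e-12)) for t, p in zip(targets, probs))
    focal = (1.0 - p_true) ** gamma  # down-weights easy, confident samples
    return hard_sample_weight(p_true, threshold, max_weight) * focal * ce
```

With these defaults, a sample predicted at p = 0.2 receives twice the weight of one at p = 0.4 or above, in addition to the usual focal modulation.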

Usage

Dataset

Extract the videos in ICDAR15-video, DSText and BOVText into frames. For training, use the JSON annotation files we provide [ICDAR15-video & DSText]. Download ArTVideo to ./datasets. Expected structure:

|- ./datasets
    |--- ICDAR15
    |      |--- frame/  frame_test/  train.json
    |--- DSText
    |      |--- frame/  frame_test/  train.json ...
    |--- BOVText
    |      |--- frame/  frame_test/  Train/  Test/ ...
    |--- ArTVideo
           |--- Train/  Test/  json/ ...

Processing raw data:

python tools/video2frame.py
python tools/convert_gom_label/icdar15.py   # or bovtext.py / dstext.py
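
Before training, it can help to confirm that the directory layout matches the tree above. The helper below is a hypothetical convenience script, not part of the repository; it only checks the entries named explicitly in the tree and ignores anything elided with "...".

```python
from pathlib import Path

# Entries listed explicitly in the expected structure; elided items are not checked.
EXPECTED = {
    "ICDAR15": ["frame", "frame_test", "train.json"],
    "DSText": ["frame", "frame_test", "train.json"],
    "BOVText": ["frame", "frame_test", "Train", "Test"],
    "ArTVideo": ["Train", "Test", "json"],
}


def missing_entries(root="./datasets", expected=EXPECTED):
    """Return the expected paths that do not exist under `root`."""
    root = Path(root)
    missing = []
    for dataset, entries in expected.items():
        for entry in entries:
            path = root / dataset / entry
            if not path.exists():
                missing.append(str(path))
    return missing
```

Calling missing_entries() from the repository root returns an empty list when every listed file and folder is in place, and the offending paths otherwise.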

Installation

Python 3.8 + PyTorch 1.9.0 + CUDA 11.1 + Detectron2 v0.6

git clone https://github.com/Hxyz-123/GoMatching.git
cd GoMatching
conda create -n gomatching python=3.8 -y
conda activate gomatching
pip install torch==1.9.0+cu111 torchvision==0.10.0+cu111 -f https://download.pytorch.org/whl/torch_stable.html
pip install -r requirements.txt
python -m pip install detectron2 -f https://dl.fbaipublicfiles.com/detectron2/wheels/cu111/torch1.9/index.html
cd third_party
python setup.py build develop

Pre-trained Model

Download the DeepSolo weights from Google Drive to ./pretrained_models. If you use custom weights, decouple the backbone and transformer first:

python tools/decouple_deepsolo.py --input path_to_weights --output output_path

Training

Enhanced training (uses the Enh config to activate all three improvements):

python train_net.py --num-gpus 1 --config-file configs/GoMatching_PP_Enh_ICDAR15.yaml

Original baselines:

# GoMatching
python train_net.py --num-gpus 1 --config-file configs/GoMatching_ICDAR15.yaml
# GoMatching++
python train_net.py --num-gpus 1 --config-file configs/GoMatching_PP_ICDAR15.yaml

Other datasets (DSText, BOVText, ArTVideo) follow the same pattern with their respective config files.

Evaluation

python eval.py --config-file configs/GoMatching_PP_Enh_ICDAR15.yaml \
    --input ./datasets/ICDAR15/frame_test/ \
    --output output/enhanced_icdar15 \
    --opts MODEL.WEIGHTS trained_models/GoMPP_Enh_IC15/xxx.pth

cd output/enhanced_icdar15/preds
zip -r ../preds.zip ./*
zip -r ../track.zip ./*.xml

Submit preds.zip / track.zip to the ICDAR15 evaluation server.

For DSText, BOVText, and ArTVideo, see the original GoMatching++ evaluation protocol.

Demo

The demo.py script runs the full end-to-end pipeline on a video clip and displays detection → tracking → recognition in real time. It was built for thesis defense presentations.

Quick Start

python demo.py --video path/to/test_video.mp4

Options

Argument	Default	Description
`--video`	(required)	Input video file
`--config`	`configs/GoMatching_PP_Enh_ICDAR15.yaml`	Model config
`--weights`	`pretrained_models/deepsolo_icdar15_rescore.pth`	Checkpoint
`--output`	`demo_output/`	Output directory
`--vis-thresh`	`0.3`	Confidence threshold for visualization
`--max-frames`	—	Limit number of frames (quick test)
`--cpu`	—	Force CPU inference
`--save-frames`	—	Save individual frame images
`--no-display`	—	Disable live display window

Example

# Full pipeline with display
python demo.py --video datasets/test_clip.mp4

# Quick CPU test (first 30 frames, no display, save frames)
python demo.py --video datasets/test_clip.mp4 \
    --weights pretrained_models/deepsolo_icdar15_rescore.pth \
    --max-frames 30 --cpu --save-frames --no-display

# Comparison: enhanced vs baseline
python demo.py --video datasets/test_clip.mp4 \
    --config configs/GoMatching_PP_Enh_ICDAR15.yaml \
    --output demo_enhanced/
python demo.py --video datasets/test_clip.mp4 \
    --config configs/GoMatching_PP_ICDAR15.yaml \
    --output demo_baseline/

Output

demo_output/result.mp4 — Annotated video with track IDs, bounding polygons, and recognized text
demo_output/frame_*.jpg — (if --save-frames) Individual frame outputs

Main Results

ICDAR15-video

Method	MOTA	MOTP	IDF1	Trainable Params (M)
GoMatching	72.04	78.53	80.11	32.79
GoMatching++	72.20	78.52	80.11	11.80
Enhanced (ours)	TBD	TBD	TBD	11.80

DSText

Method	MOTA	MOTP	IDF1	Trainable Params (M)
GoMatching	22.83	80.43	46.06	32.79
GoMatching++	23.23	80.24	46.24	11.80
Enhanced (ours)	TBD	TBD	TBD	11.80

BOVText

Method	MOTA	MOTP	IDF1	Trainable Params (M)
GoMatching++	52.9	87.2	62.8	11.80

ArTVideo

Method	MOTA	MOTP	IDF1	Trainable Params (M)
GoMatching++	75.7	83.5	82.3	11.80

Enhanced results will be updated after full training.

Statement

This project is for research purposes only. The original GoMatching/GoMatching++ is by Haibin He et al.; the enhanced modules were implemented by Hello,Mr Crab. For questions, please contact the repository maintainer.

Citation

If you find this work helpful, please consider citing:

@inproceedings{he2024gomatching,
  title={GoMatching: A Simple Baseline for Video Text Spotting via Long and Short Term Matching},
  author={He, Haibin and Ye, Maoyuan and Zhang, Jing and Liu, Juhua and Du, Bo and Tao, Dacheng},
  booktitle={NeurIPS},
  year={2024}
}

@article{he2025gomatching++,
  title={GoMatching++: Parameter- and Data-Efficient Arbitrary-Shaped Video Text Spotting and Benchmarking},
  author={He, Haibin and Zhang, Jing and Ye, Maoyuan and Liu, Juhua and Du, Bo and Tao, Dacheng},
  journal={arXiv:2505.22228},
  year={2025}
}

Acknowledgements

Built on DeepSolo, GTR, TransDETR, and BOVText.
