Hierarchical Deep Temporal Model for Group Activity Recognition

A modern, highly modular PyTorch implementation of the CVPR 2016 paper:
A Hierarchical Deep Temporal Model for Group Activity Recognition

Figure: model inference demo.


Project Overview

This project hierarchically models individual player actions and team-level dynamics in volleyball footage using a two-stage ResNet50 + LSTM pipeline. The final model (Baseline 8) reaches 91.85% group activity accuracy on the Volleyball dataset, up from 72.50% for the single-frame spatial baseline and 9.95 percentage points above the best result reported in the original CVPR 2016 paper (81.9%).


Table of Contents

  1. Key Features
  2. Results
  3. Architecture
  4. Getting Started
  5. Project Structure
  6. Dataset Overview
  7. Ablation Study
  8. Cloud Training
  9. License

Key Features

This repository upgrades the original 2016 Caffe implementation to modern standards:

  • Modern PyTorch Pipeline: Upgraded the original 2016 Caffe architecture into a clean, modular, and easily extensible PyTorch implementation.
  • ResNet50 Backbone: Replaced the legacy AlexNet with a ResNet50 feature extractor to capture significantly richer spatial representations.
  • Multi-Modal Pooling: Fused both Max and Mean Pooling to capture both the dominant individual action and the overall team context.
  • Automatic Mixed Precision (AMP): Integrated PyTorch GradScaler and autocast to cut VRAM usage and speed up training on modern GPUs (the exact gains depend on the model and hardware).
  • Seamless Cloud-to-Local CI/CD: Built-in environment detection (env_utils.py) auto-routes dataset paths and multiprocessing settings (spawn vs fork) for Kaggle, Colab, or local runs.

Results

Group activity classification accuracy on the Volleyball dataset test split. The table compares the original 2016 paper's AlexNet/Caffe results against this repository's ResNet50 PyTorch reimplementation:

| Baseline | Description | Paper's Accuracy | My Accuracy | My F1 Score | Δ Accuracy (pp) |
|---|---|---|---|---|---|
| B1 | Single-frame classifier (spatial only) | 66.7% | 72.50% | 0.72 | +5.8 |
| B3 | Fine-tuned person crop pooling | 68.1% | 75.50% | 0.76 | +7.4 |
| B4 | Full-frame LRCN (temporal) | 63.1% | 72.63% | 0.73 | +9.5 |
| B5 | Person LSTM + frozen linear pool | 67.6% | 82.65% | 0.82 | +15.1 |
| B6 | Group BiLSTM (no person LSTM) | 74.7% | 78.46% | 0.78 | +3.8 |
| B7 | Full two-stage hierarchical model | 80.2% | 83.77% | 0.83 | +3.6 |
| B8 | Two-stage + sub-group pooling | 81.9% | 91.85% | 0.92 | +9.95 |

The PyTorch reimplementation outperforms the original Caffe baselines at every stage, with the largest gain in person-level temporal modeling (B5: +15.1 percentage points). The final model exceeds the paper's best reported result by 9.95 percentage points, a gain attributed to the ResNet50 backbone, Center-Frame Spatial Anchoring, and Max+Mean feature concatenation.

Paper scores sourced from Table 5 of the original CVPR 2016 paper.
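The Δ Accuracy column is the difference between the two accuracy columns, measured in percentage points. The short sketch below recomputes it from the figures in the table; the dictionary layout and the `delta_pp` helper are illustrative only.

```python
# (paper accuracy %, this repository's accuracy %), copied from the results table
results = {
    "B1": (66.7, 72.50),
    "B3": (68.1, 75.50),
    "B4": (63.1, 72.63),
    "B5": (67.6, 82.65),
    "B6": (74.7, 78.46),
    "B7": (80.2, 83.77),
    "B8": (81.9, 91.85),
}


def delta_pp(paper, ours):
    """Absolute accuracy difference in percentage points, rounded to 2 decimals."""
    return round(ours - paper, 2)


deltas = {name: delta_pp(paper, ours) for name, (paper, ours) in results.items()}
largest_gain = max(deltas, key=deltas.get)

for name, delta in deltas.items():
    print(f"{name}: {delta:+.2f} pp")
```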

Figure: sample confusion matrix from Baseline 8.

Architecture

This section describes the main models, B7 and B8.

Figure: two-stage hierarchical architecture.

The model operates in two hierarchical stages:

  1. Stage 1 — Person-Level Temporal Modeling: Individual player bounding box crops are passed through a shared ResNet50 backbone frame-by-frame. The resulting feature sequences are fed into a Person LSTM that learns each player's action semantics over time.

  2. Stage 2 — Group-Level Temporal Modeling: Per-player LSTM outputs are spatially pooled (with optional left/right sub-group splitting in B8) and passed as a timeline into a Group BiLSTM, which classifies the overall team activity.
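The two stages can be sketched roughly as follows. This is not the repository's model code: it assumes the ResNet50 features have already been extracted into a tensor of shape (batch, frames, players, feat_dim), and the class name, hidden sizes, and use of the final BiLSTM hidden states for classification are all choices made for the illustration.

```python
import torch
from torch import nn


class TwoStageSketch(nn.Module):
    """Person LSTM per player, pooling across players, then a group BiLSTM."""

    def __init__(self, feat_dim=2048, person_hidden=128, group_hidden=128, num_classes=8):
        super().__init__()
        self.person_lstm = nn.LSTM(feat_dim, person_hidden, batch_first=True)
        self.group_lstm = nn.LSTM(2 * person_hidden, group_hidden,
                                  batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * group_hidden, num_classes)

    def forward(self, feats):
        b, t, n, d = feats.shape
        # Stage 1: treat every player as an independent sequence over time.
        per_player = feats.permute(0, 2, 1, 3).reshape(b * n, t, d)
        person_out, _ = self.person_lstm(per_player)          # (b*n, t, h)
        person_out = person_out.reshape(b, n, t, -1)          # (b, n, t, h)
        # Pool across players at each time step (max and mean, concatenated).
        pooled = torch.cat([person_out.max(dim=1).values,
                            person_out.mean(dim=1)], dim=-1)  # (b, t, 2h)
        # Stage 2: the pooled timeline goes through the group BiLSTM.
        _, (h_n, _) = self.group_lstm(pooled)                 # h_n: (2, b, group_hidden)
        return self.classifier(torch.cat([h_n[0], h_n[1]], dim=-1))


# Small dimensions keep the example fast: 2 clips, 9 frames, 12 players, 32-d features.
model = TwoStageSketch(feat_dim=32, person_hidden=16, group_hidden=16)
logits = model(torch.randn(2, 9, 12, 32))
```

The left/right sub-group splitting used by B8 is omitted here; it would pool each team separately before the group BiLSTM.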


Getting Started

Requirements

| Requirement | Version |
|---|---|
| Python | 3.12+ |
| PyTorch | 2.10+ |
| CUDA | 12.8 |
| GPU VRAM | 16 GB recommended |
| Training time | ~2–5 hrs per baseline |

1. Clone & Install

git clone https://github.com/Amr2054/Group-Activity-Recognition.git
cd Group-Activity-Recognition
pip install -r requirements.txt

2. Environment Configuration

This project uses environment variables to seamlessly route dataset and checkpoint paths across local and cloud environments.

Copy the template environment file:

cp .env.example .env

Open the created .env file and populate the variables.
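For readers unfamiliar with .env files, the sketch below shows one way such a file can be parsed into paths. The variable names DATASET_ROOT and CHECKPOINT_DIR are hypothetical; the real names are the ones listed in .env.example, and the repository may load them differently.

```python
from pathlib import Path


def parse_env_text(text):
    """Parse KEY=VALUE lines, skipping blanks and # comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Strip surrounding whitespace and optional quotes from the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


# Hypothetical .env contents; check .env.example for the real variable names.
example = """
# dataset and checkpoint locations
DATASET_ROOT=data/volleyball
CHECKPOINT_DIR="models/checkpoints"
"""

paths = {key: Path(value) for key, value in parse_env_text(example).items()}
```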

3. Dataset Preparation

Download the Volleyball dataset and place it under data/. Then parse the raw annotations into the optimized .pkl format:

python -m data.data_annot_loader

4. Train a Model

Run training scripts from the root directory using module execution, passing the corresponding YAML config:

# Baseline 4 — Full-frame temporal LRCN
python -m models.baseline_4.trainer --config configs/baseline_4.yaml

# Baseline 7 — Full two-stage hierarchical model
python -m models.baseline_7.trainer --config configs/baseline_7.yaml

# Baseline 8 — Final model with sub-group pooling
python -m models.baseline_8.trainer --config configs/baseline_8.yaml

All outputs — .pth weights, TensorBoard logs, and confusion matrices — are automatically saved to:

models/baseline_X/outputs/run_[timestamp]/

5. Monitor Training

tensorboard --logdir models/baseline_X/outputs/

6. Evaluate a Checkpoint

python -m models.baseline_8.test_model --config configs/baseline_8.yaml --checkpoint path/to/weights.pth

Project Structure

Group-Activity-Recognition/
├── assets/                   # Images for README (header, architecture, demo GIF, etc.)
├── configs/                  # YAML files controlling all model/training parameters
│   ├── baseline_1.yaml
│   ├── baseline_3_phase_A.yaml
│   ├── baseline_3_phase_B.yaml
│   ├── baseline_4.yaml
│   ├── baseline_5_phase_A.yaml
│   ├── baseline_5_phase_B.yaml
│   ├── baseline_6.yaml
│   ├── baseline_7.yaml
│   └── baseline_8.yaml
├── data/                     # Data ingestion, pickling, and PyTorch Datasets
│   ├── box_annot.py
│   ├── data_annot_loader.py
│   └── data_loader.py        # Bounding Box, Frame-by-Frame, and Anchor-Sorted Datasets
├── utils/                    # Core engineering utilities
│   ├── env_utils.py          # Auto-detects Kaggle vs. local environments
│   └── helper.py             # Config parsers, seed setting, and logging formatters
├── models/                   # Architecture and training scripts
│   ├── train_utils.py        # Universal training/validation loop
│   ├── eval_utils.py         # Universal testing loop
│   ├── baseline_1/           # Static ResNet50 image classifier
│   ├── baseline_3/           # Spatial person & group classifier
│   ├── baseline_4/           # Full-frame temporal LRCN
│   ├── baseline_5/           # Person LSTM + deep-freeze linear pool
│   ├── baseline_6/           # Single-stage group BiLSTM (no person LSTM)
│   ├── baseline_7/           # Full two-stage model (person LSTM + group BiLSTM)
│   └── baseline_8/           # Two-stage model with left/right sub-group pooling
├── requirements.txt
└── README.md

Dataset Overview

The Volleyball dataset consists of 55 publicly available YouTube volleyball videos with 4,830 annotated frames.

Group Activity Labels

| Group Activity Class | Instances |
|---|---|
| Right set | 644 |
| Right spike | 623 |
| Right pass | 801 |
| Right winpoint | 295 |
| Left winpoint | 367 |
| Left pass | 826 |
| Left spike | 642 |
| Left set | 633 |

Player Action Labels

| Action Class | Instances |
|---|---|
| Waiting | 3,601 |
| Setting | 1,332 |
| Digging | 2,333 |
| Falling | 1,241 |
| Spiking | 1,216 |
| Blocking | 2,458 |
| Jumping | 341 |
| Moving | 5,121 |
| Standing | 38,696 |

Train / Validation / Test Split

| Split | Video IDs |
|---|---|
| Train | 1, 3, 6, 7, 10, 13, 15, 16, 18, 22, 23, 31, 32, 36, 38, 39, 40, 41, 42, 48, 50, 52, 53, 54 |
| Validation | 0, 2, 8, 12, 17, 19, 24, 26, 27, 28, 30, 33, 46, 49, 51 |
| Test | 4, 5, 9, 11, 14, 20, 21, 25, 29, 34, 35, 37, 43, 44, 45, 47 |
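The split above can be written as a lookup table, which also makes it easy to check that the three splits are disjoint and cover all 55 videos. The names `SPLITS` and `VIDEO_TO_SPLIT` are illustrative and not taken from the repository.

```python
# Video IDs per split, copied from the table above.
SPLITS = {
    "train": [1, 3, 6, 7, 10, 13, 15, 16, 18, 22, 23, 31, 32, 36, 38, 39,
              40, 41, 42, 48, 50, 52, 53, 54],
    "val": [0, 2, 8, 12, 17, 19, 24, 26, 27, 28, 30, 33, 46, 49, 51],
    "test": [4, 5, 9, 11, 14, 20, 21, 25, 29, 34, 35, 37, 43, 44, 45, 47],
}

# Invert the mapping so a video ID resolves directly to its split.
VIDEO_TO_SPLIT = {vid: name for name, ids in SPLITS.items() for vid in ids}

total_ids = sum(len(ids) for ids in SPLITS.values())
is_disjoint = total_ids == len(VIDEO_TO_SPLIT)    # no ID appears in two splits
covers_all = sorted(VIDEO_TO_SPLIT) == list(range(55))
```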

Ablation Study

Each baseline isolates one architectural variable to show its contribution to final accuracy.

| Baseline | Key Idea |
|---|---|
| B1 | Single-frame ResNet50 |
| B3 | Pooled player crops |
| B4 | Full-frame LRCN |
| B5 | Person LSTM + frozen pool |
| B6 | Group BiLSTM only |
| B7 | Two-stage hierarchical |
| B8 | + Sub-group pooling |

Baseline 1 (Image Classification) — A purely spatial model using ResNet50 to classify group activity from a single static frame. Serves as the spatial-only reference point for the later baselines.

Baseline 3 (Fine-tuned Person Classification) — ResNet50 extracts 2048-d features from individual player bounding boxes. Features are pooled across all players in a frame and fed to a classifier.

Baseline 4 (LRCN) — Introduces time. A 9-frame clip passes through ResNet50 frame-by-frame; the feature sequence is fed into an LSTM to capture group motion before classification.

Baseline 5 (Temporal Person Features) — Two-phase architecture. Phase A trains an LSTM to track individual player crops over 9 frames. Phase B freezes Phase A and pools the 12 player features into a final linear classifier using a Center-Frame Spatial Anchor to preserve left/right court orientation.

Baseline 6 (Group-Only Temporal) — Skips the person LSTM. Extracts spatial features for all 12 players, applies Max + Mean pooling to summarize team posture frame-by-frame, and passes the timeline into a Group BiLSTM.
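Max + Mean pooling over players can be illustrated with a very small tensor. The `max_mean_pool` helper is a sketch written for this example, not the repository's function; it concatenates the per-dimension maximum and mean across the player axis.

```python
import torch


def max_mean_pool(player_feats):
    """Pool (batch, players, dim) features into (batch, 2 * dim)."""
    max_part = player_feats.max(dim=1).values   # strongest response per dimension
    mean_part = player_feats.mean(dim=1)        # average team context
    return torch.cat([max_part, mean_part], dim=-1)


# One frame, two players, two feature dimensions.
feats = torch.tensor([[[1.0, 4.0],
                       [3.0, 0.0]]])
pooled = max_mean_pool(feats)   # tensor([[3., 4., 2., 2.]])
```

With 2048-d ResNet50 features, the pooled vector per frame would be 4096-d.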

Baseline 7 (Full Two-Stage Hierarchical) — Combines B5 and B6. Uses the Phase A Person LSTM for individual action semantics, applies frame-by-frame pooling, and feeds the temporal sequence into a Group BiLSTM.

Baseline 8 (Sub-Group Pooling) — Keeps the Group BiLSTM from confusing "Left Spike" with "Right Spike" by adding Anchor Frame X-axis Sorting: players are sorted by horizontal position in the anchor frame and the court is split in half, so each player is assigned explicitly to a Left Team or Right Team tensor before temporal modeling.
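A minimal sketch of the sorting idea: order players by the horizontal centre of their anchor-frame bounding box and split the ordering in half. The (x1, y1, x2, y2) box format and the `split_left_right` helper are assumptions for illustration; the repository's dataset classes may represent boxes and handle uneven player counts differently.

```python
def split_left_right(anchor_boxes):
    """Return (left_ids, right_ids): player indices sorted by anchor-frame x-centre."""
    def x_centre(i):
        x1, _, x2, _ = anchor_boxes[i]
        return (x1 + x2) / 2

    order = sorted(range(len(anchor_boxes)), key=x_centre)
    half = len(order) // 2
    return order[:half], order[half:]


# Four hypothetical boxes in (x1, y1, x2, y2) pixel coordinates.
boxes = [
    (900, 300, 960, 420),   # x-centre 930
    (100, 310, 150, 430),   # x-centre 125
    (520, 290, 570, 410),   # x-centre 545
    (700, 305, 750, 425),   # x-centre 725
]
left, right = split_left_right(boxes)   # left = [1, 2], right = [3, 0]
```

Because the assignment is fixed from a single anchor frame, each player stays in the same team tensor for the whole clip.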


Cloud Training

Kaggle / Colab Setup Instructions

This repository is designed to be edited locally (PyCharm / VSCode) and executed on cloud GPUs without path errors.

Steps:

  1. Push your code to GitHub.
  2. In a Kaggle Notebook, clone the repository and change into it:
!git clone https://github.com/Amr2054/Group-Activity-Recognition.git
%cd Group-Activity-Recognition
  3. The internal setup_environment() call will automatically:

    • Detect the Kaggle kernel
    • Reroute dataset paths to /kaggle/input/
    • Write all outputs to /kaggle/working/
    • Set num_workers to prevent container deadlocks
  4. Pull latest changes and train:

!git pull origin main
!python -m models.baseline_8.trainer --config configs/baseline_8.yaml
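The environment routing described in these steps could look something like the sketch below. It is not the code in utils/env_utils.py: the functions `detect_environment` and `runtime_settings`, the environment variables checked (KAGGLE_KERNEL_RUN_TYPE, COLAB_RELEASE_TAG), the Colab and local paths, and the num_workers values are all assumptions. Only the /kaggle/input and /kaggle/working locations come from the text above.

```python
import os


def detect_environment(environ=None):
    """Guess the runtime from environment variables (assumed markers)."""
    env = os.environ if environ is None else environ
    if "KAGGLE_KERNEL_RUN_TYPE" in env:
        return "kaggle"
    if "COLAB_RELEASE_TAG" in env:
        return "colab"
    return "local"


def runtime_settings(name):
    """Map an environment name to data/output roots and DataLoader workers."""
    if name == "kaggle":
        return {"data_root": "/kaggle/input", "out_root": "/kaggle/working", "num_workers": 2}
    if name == "colab":
        # Hypothetical Colab layout; adjust to where the dataset is mounted.
        return {"data_root": "/content/data", "out_root": "/content/outputs", "num_workers": 2}
    # Local default: relative paths and a hypothetical worker count.
    return {"data_root": "data", "out_root": "models", "num_workers": 4}


settings = runtime_settings(detect_environment())
```

Passing a plain dictionary to `detect_environment` makes the routing easy to check without running on Kaggle.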

License

This project is licensed under the MIT License.


Built on top of the original work by Mostafa Saad Ibrahim et al.
