alshehriamal1994/uk-statute-retrieval

Neural Reranking for UK Statutory Retrieval

This repository contains the official code for the paper "Neural Reranking for UK Statutory Retrieval". This README explains how to reproduce the retrieval, reranking, and evaluation experiments reported in the paper, including the distillation of a Voyage-law-2 teacher into a ModernBERT student model.

📂 Data Access

The dataset is not included in this repository. Please access the UK Legislation Corpus and Verified Query Set via the Durham University Research Data Repository.

Expected Data Structure

To run the scripts, place the downloaded data in the data/ directory or adjust paths accordingly:

  • data/Data.json: The corpus file.
  • data/queries.jsonl: The evaluation queries with relevance judgments.
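
Before running any script, it can help to confirm that both files are in place. The sketch below is not part of the repository: the helper names `missing_files` and `peek_records` are hypothetical, and because the record schema is not documented here, it prints only the keys of the first few query records rather than assuming field names.

```python
import json
from pathlib import Path

# File names taken from the list above; data/ is the README's default location.
EXPECTED_FILES = ("Data.json", "queries.jsonl")


def missing_files(data_dir="data"):
    """Return the expected file names that are absent from data_dir."""
    root = Path(data_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]


def peek_records(lines, n=3):
    """Parse up to n non-empty JSON Lines records from an iterable of strings."""
    records = []
    for line in lines:
        if len(records) >= n:
            break
        line = line.strip()
        if line:
            records.append(json.loads(line))
    return records


if __name__ == "__main__":
    missing = missing_files("data")
    if missing:
        print("Missing from data/:", ", ".join(missing))
    else:
        with open("data/queries.jsonl", encoding="utf-8") as fh:
            for record in peek_records(fh):
                # Show only the keys, since the schema is an unknown here.
                print(sorted(record) if isinstance(record, dict) else type(record).__name__)
```

If the data lives elsewhere, pass that directory to `missing_files` and adjust the path opened in the final block.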

🛠️ Installation

pip install -r requirements.txt

Requirements: torch, transformers, sentence-transformers, pyserini, scipy, pandas, faiss-cpu (or faiss-gpu), matplotlib.

🚀 Usage

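All scripts are located in src/; the script paths in the commands below (e.g., retrieval/bm25_search.py) are relative to that directory.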

1. Retrieval (Candidate Generation)

Generate candidates using BM25 or Dense retrieval (MPNet).

# BM25
python retrieval/bm25_search.py \
  --data_json data/Data.json \
  --eval_jsonl data/queries.jsonl \
  --output_file outputs/candidates_bm25.jsonl \
  --k 100

# Dense (MPNet)
python retrieval/dense_search.py \
  --data_json data/Data.json \
  --eval_jsonl data/queries.jsonl \
  --output_file outputs/candidates_dense.jsonl \
  --k 100
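
To make the candidate-generation step concrete, here is a minimal pure-Python sketch of BM25 top-k retrieval. It is an illustration only and not the code in `retrieval/bm25_search.py`: the whitespace tokenizer, the `k1` and `b` values, the IDF variant, and the toy provisions are all assumptions.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace tokenization, chosen for brevity.
    return text.lower().split()


def bm25_top_k(query, docs, k=100, k1=1.2, b=0.75):
    """Score every document against the query and return the k best.

    docs maps doc_id -> text; the result is a list of (doc_id, score)
    pairs sorted by descending score.
    """
    tokenized = {doc_id: tokenize(text) for doc_id, text in docs.items()}
    n_docs = len(tokenized)
    if n_docs == 0:
        return []
    avg_len = sum(len(t) for t in tokenized.values()) / n_docs or 1.0

    # Document frequency: number of documents containing each term.
    df = Counter()
    for tokens in tokenized.values():
        df.update(set(tokens))

    query_terms = set(tokenize(query))  # each distinct query term counts once
    scored = []
    for doc_id, tokens in tokenized.items():
        tf = Counter(tokens)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            # Non-negative IDF variant (the "+1 inside the log" form).
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scored.append((doc_id, score))

    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:k]


if __name__ == "__main__":
    toy_docs = {  # hypothetical provisions, not taken from the corpus
        "s1": "theft of property by deception",
        "s2": "road traffic speed limits",
        "s3": "property tax relief",
    }
    for doc_id, score in bm25_top_k("theft property", toy_docs, k=2):
        print(doc_id, round(score, 3))
```

The `k` argument plays the same role as `--k 100` in the commands above: it caps how many candidates are passed on to the reranking stage.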

2. Reranking

Run inference with various rerankers (CrossEncoders, ColBERT, RankZephyr, or APIs).

Distilled ModernBERT (Our Student Model):

python reranking/rerank.py \
  --model_type hf-seqcls \
  --model_name amal1994/distilled-voyage-modernbert \
  --data_json data/Data.json \
  --candidates outputs/candidates_dense.jsonl \
  --out_tsv results/results_modernbert.tsv

ColBERT:

python reranking/rerank_colbert.py \
  --model_name colbert-ir/colbertv2.0 \
  --data_json data/Data.json \
  --candidates outputs/candidates_dense.jsonl \
  --out_tsv results/results_colbert.tsv

RankZephyr (LLM):

python reranking/rerank_llm.py \
  --model_name castorini/rank_zephyr_7b_v1_full \
  --data_json data/Data.json \
  --candidates outputs/candidates_dense.jsonl \
  --out_tsv results/results_rankzephyr.tsv
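
All three commands follow the same pattern: take the first-stage candidates for a query, score each (query, provision) pair with a stronger model, and reorder. The sketch below shows that pattern with a pluggable scoring function. The names `rerank` and `overlap_score` are hypothetical, and the token-overlap scorer is only a stand-in for a real model such as a cross-encoder; it is not how the scripts above score pairs.

```python
def rerank(query, candidate_ids, corpus, score_fn, top_k=10):
    """Reorder first-stage candidates by a second-stage relevance score.

    corpus maps doc_id -> text; score_fn(query, text) returns a float
    where higher means more relevant.
    """
    scored = [(doc_id, score_fn(query, corpus[doc_id])) for doc_id in candidate_ids]
    # Python's sort is stable, so ties keep their first-stage order.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


def overlap_score(query, passage):
    """Toy scorer: fraction of query tokens that also occur in the passage."""
    q_tokens = set(query.lower().split())
    p_tokens = set(passage.lower().split())
    return len(q_tokens & p_tokens) / len(q_tokens) if q_tokens else 0.0


if __name__ == "__main__":
    corpus = {  # hypothetical provisions, not taken from the corpus
        "a": "speed limits on restricted roads",
        "b": "penalty for theft of property",
        "c": "definition of theft",
    }
    first_stage = ["a", "b", "c"]  # order as returned by the retriever
    for doc_id, score in rerank("theft of property", first_stage, corpus, overlap_score):
        print(doc_id, score)
```

Swapping `overlap_score` for a function that calls a neural model leaves the rest of the loop unchanged, which is why the scorer is passed in as an argument.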

3. Evaluation & Significance

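Test whether the other rerankers differ significantly from the base system at cutoff k, applying the Holm-Bonferroni correction for multiple comparisons, and generate forest plots.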

python evaluation/significance.py \
  --base_file results/results_modernbert.tsv \
  --others results/results_colbert.tsv results/results_rankzephyr.tsv \
  --k 10
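
For readers unfamiliar with the correction, the following is a minimal sketch of Holm-Bonferroni adjusted p-values in plain Python. It is independent of `evaluation/significance.py`, whose internals are not described here; the function name and the example p-values are illustrative assumptions.

```python
def holm_bonferroni(p_values, alpha=0.05):
    """Return (adjusted_p_values, reject_flags) in the input order.

    Step-down procedure: sort the m p-values ascending, multiply the
    i-th smallest (0-based) by (m - i), cap at 1, and enforce that the
    adjusted values never decrease along the sorted order.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        candidate = min(1.0, (m - rank) * p_values[idx])
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    reject = [p <= alpha for p in adjusted]
    return adjusted, reject


if __name__ == "__main__":
    # Hypothetical raw p-values from three pairwise comparisons.
    adjusted, reject = holm_bonferroni([0.01, 0.04, 0.03])
    for p, r in zip(adjusted, reject):
        print(round(p, 4), r)
```

Comparing each adjusted p-value with `alpha` gives the same decisions as the classic step-down rule of testing the i-th smallest raw p-value against `alpha / (m - i)` and stopping at the first failure.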

4. Distillation (Training)

Train your own student model using teacher scores (e.g., from Voyage).

python distillation/train_distill.py \
  --data data/train.jsonl \
  --output_dir checkpoints/distilled_model \
  --model_name nomic-ai/modernbert-embed-base \
  --epochs 3
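
The idea behind score distillation is that the student is trained to reproduce the teacher's relevance scores instead of binary labels. This README does not state which objective `train_distill.py` uses, so the sketch below shows two common choices as assumptions: a pointwise mean squared error and a listwise KL divergence over the candidates of one query. It is written in plain Python for clarity; an actual training loop would compute the same quantities on tensors.

```python
import math


def mse_loss(teacher_scores, student_scores):
    """Pointwise objective: mean squared difference between the scores."""
    n = len(teacher_scores)
    return sum((t - s) ** 2 for t, s in zip(teacher_scores, student_scores)) / n


def log_softmax(scores):
    """Numerically stable log-softmax over one query's candidate scores."""
    top = max(scores)
    log_norm = top + math.log(sum(math.exp(s - top) for s in scores))
    return [s - log_norm for s in scores]


def listwise_kl(teacher_scores, student_scores):
    """Listwise objective: KL(teacher || student) over softmax distributions."""
    log_t = log_softmax(teacher_scores)
    log_s = log_softmax(student_scores)
    return sum(math.exp(lt) * (lt - ls) for lt, ls in zip(log_t, log_s))


if __name__ == "__main__":
    teacher = [2.0, 1.0, 0.0]  # hypothetical teacher scores for three candidates
    student = [0.0, 1.0, 2.0]  # a student that ranks them in reverse
    print(mse_loss(teacher, student))
    print(listwise_kl(teacher, student))
```

The pointwise loss ties the student to the teacher's score scale, while the listwise loss depends only on score differences within a query's candidate list.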

📖 Publication

If you use this code or the UK-STATUTECORPUS in your research, please cite our paper:

Alshehri, A.S., Eken, C., Bencomo, N. et al. Neural reranking for UK statutory retrieval: Provision-level evaluation and an open distilled model. Artif Intell Law (2026).

BibTeX

@article{alshehri2026neural,
  title={Neural reranking for UK statutory retrieval: Provision-level evaluation and an open distilled model},
  author={Alshehri, Amal Saad and Eken, Can and Bencomo, Nelly and Atapour-Abarghouei, Amir},
  journal={Artificial Intelligence and Law},
  year={2026},
  month={Feb},
  day={05},
  publisher={Springer},
  doi={10.1007/s10506-025-09501-6},
  url={https://doi.org/10.1007/s10506-025-09501-6}
}

About

Provision-level retrieval and neural reranking for UK legislation, with an open distilled ModernBERT reranker. Official code for the Artificial Intelligence and Law (2026) paper.
