Devils in Middle Layers of Large Vision-Language Models: Interpreting, Detecting and Mitigating Object Hallucinations via Attention Lens
-
Firstly, we use the attention lens to identify the stages at which LVLMs handle visual data, discovering that the middle layers are crucial. Moreover, we find that these layers can be further divided into two stages: "visual information enrichment" and "semantic refinement" which respectively propagate visual data to object tokens and interpret it through text.
-
By analyzing attention patterns during the visual information enrichment stage, we find that real tokens consistently receive higher attention weights than hallucinated ones, serving as a strong indicator of hallucination.
-
Further examination of multi-head attention maps reveals that hallucination tokens often result from heads interacting with inconsistent objects.
Based on these insights, we propose a simple inference-time method that adjusts visual attention by integrating information across various heads.
We conduct our experiments based on:
Python version: 3.10.14
CUDA toolkit version: 12.1
Take LLaVA-1.5-7B as an example, the model weights can be downloaded from liuhaotian/llava-v1.5-7b.
To install the package in an existing environment, run
pip install -r requirements.txtOur experiments is based on MSCOCO 2014 Validation set. Please download here.
middle_layers_indicating_hallucination.ipynb: our main findings.modify_attention.py: implementation of Heads Guided Attention Intervention method.model_manager.py: used to load model, prepare inputs and decode output.chair.py: contains the CHAIR evaluator for sampling real and hallucinated object tokens.extract_VAR.py: used to extract visual attention ratio (VAR) features for detection.train.py: train the two-layer MLP classifier with extracted VAR features.
To extract the VAR features for real and hallucinated object tokens, run
python extract_VAR.py --instruction-path examples/toy_img_query_list.jsonl --output-file llava_7b_var_features.ptwhere the object token type is annotated by COCO gt labels. The features will be saved in var_data/.
To train the classifier with VAR features in a layer range, run
python train.py --data-path ./var_data/llava_7b_var_features.pt --model llava-1.5 --start-layer 5 --end-layer 18and the hyper-parameters (e.g., lr, hidden size) for training the classifier can be changed in train.py.
To use our attention intervention method, please insert the following code before model inference:
llama_head_guide(
model=model_manager.llm_model,
guided_layer_range=guided_layer_range,
aggregation='mean',
alpha=alpha,
img_start_idx=model_manager.img_start_idx,
img_end_idx=model_manager.img_end_idx
)where the parameter guided_layer_range represents the range of the visual information enrichment stage, e.g., 5-18 layers for LLaVA-1.5-7B.
First, generate the LVLM's descriptions and save them in log/[model]/[file_name].jsonl:
# Greedy heads guided decoding
python chair_eval.py --model llava-1.5 --use-head-guide --guide-range 5,18 --alpha 0.5Then, evaluate CHAIR metrics, run
python chair.py --cap_file ./log/[model]/[file_name].jsonlPlease cite our paper if it's helpful to your work!
@article{jiang2024devils,
title={Devils in Middle Layers of Large Vision-Language Models: Interpreting, Detecting and Mitigating Object Hallucinations via Attention Lens},
author={Jiang, Zhangqi and Chen, Junkai and Zhu, Beier and Luo, Tingjin and Shen, Yankun and Yang, Xu},
journal={arXiv preprint arXiv:2411.16724},
year={2024}
}
This repository borrows heavily from PAI and VLM-Visualizer. We thank them for their impressive work and contributions to open source!
