P3B3 is a benchmark for evaluating language model biases toward Portuguese language variants (European Portuguese vs Brazilian Portuguese) through multi-turn conversation generation and automated assessment.
Note: This work has been accepted at the MeLLM Workshop at ACL 2026. The codebase is provided for reproducibility and further research in this area.
This work investigates whether large language models exhibit preferences or biases toward specific Portuguese language variants when generating responses. The system generates multi-turn conversations, evaluates them using multiple methods (classifier-based and LLM-as-judge), and analyzes performance across different models and prompt conditions.
P3B3/
├── config/ # Configuration files
│ └── settings.py # Model and API settings
├── resources/ # Static resources
│ ├── all_prompts.json # Multi-turn conversation prompts
│ └── image_markers/ # Provider logos for visualizations
├── src/ # Source code
│ ├── analysis/ # Result aggregation and turn-level analysis
│ ├── annotation/ # Human annotation tools and agreement metrics
│ ├── dataset_analysis/ # Diversity metrics and conversation statistics
│ ├── evaluation/ # Generation and scoring pipelines
│ ├── models/ # Model backend implementations (API/VLLM/Ollama)
│ └── utils/ # Shared utilities
├── results/ # Generated model responses and scores
├── outputs/ # Analysis outputs and visualizations
├── run_scripts/ # SLURM job submission scripts
└── environment.yml # Conda environment specification
# Create conda environment
conda env create -f environment.yml
conda activate p3b3
Copy the template environment file and add your API keys:
# Copy the template
cp .env_copy .env
# Edit .env and replace with your actual API keys as needed:
# - GEMINI_API_KEY: Used by Gemini models
# - MARITACA_API_KEY: Used by Sabia models
# - OLLAMA_BASE_URL: Used for local Ollama models
Edit config/settings.py to adjust:
- MAX_RETRIES: API retry attempts
- MAX_OUTPUT_TOKENS: Generation token limit
- MAX_MODEL_LEN: VLLM context window
- MAX_CONNECTIONS: Concurrent API requests
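Before launching a run, it can help to confirm which of the keys listed above are actually set. The following is a minimal sketch, not part of the repository: it assumes .env uses plain KEY=value lines, and the helper names `parse_env_text` and `missing_keys` are hypothetical. Only the three variable names come from the setup step above.

```python
# Hypothetical helper: not part of the P3B3 codebase.
# Assumes .env contains plain KEY=value lines, with '#' starting a comment line.
KNOWN_KEYS = ("GEMINI_API_KEY", "MARITACA_API_KEY", "OLLAMA_BASE_URL")


def parse_env_text(text):
    """Parse KEY=value lines into a dict, skipping blanks and comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_keys(values, keys=KNOWN_KEYS):
    """Return the keys that are absent or empty, in the order given."""
    return [k for k in keys if not values.get(k)]


if __name__ == "__main__":
    try:
        with open(".env", encoding="utf-8") as fh:
            env = parse_env_text(fh.read())
    except FileNotFoundError:
        env = {}
    for key in missing_keys(env):
        print(f"{key} is not set")
```

A missing key is only a problem if the corresponding backend is used; for example, a run that uses only VLLM models may not need any of them.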
Generate multi-turn responses using a language model:
# Using API model (e.g., Gemini)
python -m src.evaluation.generate --model-name-or-path google-langchain-api/gemini-3-flash-preview
# Using VLLM (requires GPU)
python -m src.evaluation.generate --model-name-or-path meta-llama/Meta-Llama-3-8B-Instruct
# Using Ollama (local)
python -m src.evaluation.generate --model-name-or-path ollama/llama3
Generates 3 files per model (neutral, pt-pt, and pt-br variants) in results/<model_name>/
python -m src.evaluation.score_with_classifier results/<model_folder>
Outputs: results/<model_folder>/class_scores/*.csv
python -m src.evaluation.score_with_llm \
results/<model_folder> \
--judge_name gemini-3-flash-preview \
--max_connections 50 \
--no-accumulate-context
Outputs: results/<model_folder>/llm_scores/*.csv
python -m src.analysis.aggregation
Creates:
- results/combined_comprehensive_scores_llm_scores.csv: All model scores
- results/z_classifier_scores/: Aggregated classifier results
- results/z_llm_scores/: Aggregated LLM judge results
python -m src.analysis.turn_analysis \
results/z_classifier_scores/turn_level/aggregated_scores_by_turn_normal_all_prompts_PtVId.csv \
--max-turns 3
Generates: outputs/turn_progression_*.pdf
- Gemini (via LangChain): google-langchain-api/gemini-3-flash-preview
- Sabia: Set API key in .env
- Any Hugging Face model with CUDA support
- Examples: meta-llama/Meta-Llama-3-8B-Instruct, mistralai/Mistral-7B-Instruct-v0.2
- Requires a running local Ollama server
- Format: ollama/<model_name>
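The naming conventions above suggest that the backend can be inferred from the model string. The sketch below illustrates that idea only; it is not the repository's actual dispatch code, and the function name `pick_backend` and the returned labels are hypothetical. The prefix used for Sabia models is not documented here, so it is left out.

```python
# Illustrative only: the real routing lives in src/models/ and may differ.
def pick_backend(model_name_or_path):
    """Guess the backend from the documented model-name conventions."""
    if model_name_or_path.startswith("google-langchain-api/"):
        return "api"      # Gemini via LangChain
    if model_name_or_path.startswith("ollama/"):
        return "ollama"   # local Ollama server
    # Assumption: anything else is treated as a Hugging Face model for VLLM.
    return "vllm"


for name in (
    "google-langchain-api/gemini-3-flash-preview",
    "meta-llama/Meta-Llama-3-8B-Instruct",
    "ollama/llama3",
):
    print(name, "->", pick_backend(name))
```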
Transformer-based models classify responses as European Portuguese (pt_pt) or Brazilian Portuguese (pt_br) with probability scores (0-1).
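To make the classifier output concrete, the sketch below turns per-response probabilities into labels and a summary. It rests on assumptions not stated in this README: that each response has a single probability of being pt_pt, and that 0.5 is the decision threshold. The function name `summarize_variant_scores` is hypothetical and the CSV layout produced by the scoring script is not reproduced here.

```python
from statistics import mean


def summarize_variant_scores(pt_pt_probs, threshold=0.5):
    """Label each response and summarize how often pt_pt is predicted.

    pt_pt_probs: probabilities in [0, 1] that a response is European Portuguese.
    threshold: assumed decision boundary; scores at or above it count as pt_pt.
    """
    if not pt_pt_probs:
        raise ValueError("need at least one score")
    labels = ["pt_pt" if p >= threshold else "pt_br" for p in pt_pt_probs]
    return {
        "labels": labels,
        "mean_pt_pt_prob": mean(pt_pt_probs),
        "share_pt_pt": labels.count("pt_pt") / len(labels),
    }


summary = summarize_variant_scores([0.9, 0.2, 0.7, 0.4])
print(summary["labels"], summary["share_pt_pt"])
```

Reporting both the mean probability and the share of pt_pt labels is a design choice: the mean keeps the classifier's confidence, while the share is easier to compare across prompt conditions.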
Gemini evaluates responses on a 0-10 scale with explanations for:
- Portuguese variant preference
- Linguistic markers
A simple script runs the complete pipeline for a single model:
# Generate conversations with VLLM
bash run_scripts/run_pipeline.sh <model_name>
<model_name> can be any Hugging Face model compatible with VLLM (e.g., meta-llama/Meta-Llama-3-8B-Instruct), a Gemini model, or a model served by Ollama.
SLURM scripts for batch processing multiple models:
# Submit single model job
sbatch run_scripts/run_single_model.sh <model_path>
# Submit all models
bash run_scripts/submit_all_models.sh
# Run LLM scoring (as a separate step to avoid rate limits)
sbatch run_scripts/run_llm_scoring.sh
Adapt the scripts to your cluster environment and model list.
If you find this work relevant, please cite:
@inproceedings{ferreira_p3b3,
title={{P3B3}: A Multi-Turn Conversational Benchmark for Measuring European and Brazilian Portuguese Variety Bias in {LLMs}},
author={Rafael Ferreira and Inês Vieira and Inês Calvo and James Furtado and Iago Paulo and Diogo Tavares and Diogo Glória-Silva and David Semedo and João Magalhães},
booktitle={Proceedings of the 1st Workshop on Multilinguality in the Era of Large Language Models (MeLLM)},
year={2026},
publisher={Association for Computational Linguistics},
eprint={2606.16753},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2606.16753},
}