P3B3 - Portuguese Language Variant Bias Evaluation

P3B3 is a benchmark for evaluating language model biases toward Portuguese language variants (European Portuguese vs Brazilian Portuguese) through multi-turn conversation generation and automated assessment.

Note: This work has been accepted at the MeLLM Workshop at ACL 2026. The codebase is provided for reproducibility and further research in this area.

Overview

This work investigates whether large language models exhibit preferences or biases toward specific Portuguese language variants when generating responses. The system generates multi-turn conversations, evaluates them using multiple methods (classifier-based and LLM-as-judge), and analyzes performance across different models and prompt conditions.

Code Structure

P3B3/
├── config/                  # Configuration files
│   └── settings.py          # Model and API settings
├── resources/               # Static resources
│   ├── all_prompts.json     # Multi-turn conversation prompts
│   └── image_markers/       # Provider logos for visualizations
├── src/                     # Source code
│   ├── analysis/            # Result aggregation and turn-level analysis
│   ├── annotation/          # Human annotation tools and agreement metrics
│   ├── dataset_analysis/    # Diversity metrics and conversation statistics
│   ├── evaluation/          # Generation and scoring pipelines
│   ├── models/              # Model backend implementations (API/VLLM/Ollama)
│   └── utils/               # Shared utilities
├── results/                 # Generated model responses and scores
├── outputs/                 # Analysis outputs and visualizations
├── run_scripts/             # SLURM job submission scripts
└── environment.yml          # Conda environment specification

Installation

Environment Setup

# Create conda environment
conda env create -f environment.yml
conda activate p3b3

Configuration

Copy the template environment file and add your API keys:

# Copy the template
cp .env_copy .env

# Edit .env and replace with your actual API keys as needed:
# - GEMINI_API_KEY: Used by Gemini models
# - MARITACA_API_KEY: Used by Sabia models
# - OLLAMA_BASE_URL: Used for local Ollama models
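
Before launching a run, it can help to confirm which of the variables above are visible to the process. The following is a minimal sketch, not part of the codebase: the helper name missing_keys is hypothetical, and it only inspects the process environment, so it assumes the values from .env have already been loaded or exported.

```python
import os

# Variable names are taken from this README; the helper itself is hypothetical.
EXPECTED_KEYS = ("GEMINI_API_KEY", "MARITACA_API_KEY", "OLLAMA_BASE_URL")


def missing_keys(env=None):
    """Return the expected variables that are unset or empty in `env`."""
    env = os.environ if env is None else env
    return [key for key in EXPECTED_KEYS if not env.get(key)]


if __name__ == "__main__":
    missing = missing_keys()
    if missing:
        print("Not set:", ", ".join(missing))
    else:
        print("All expected variables are set.")
```

Only the keys for the backends you actually use need to be set, so a non-empty result is not necessarily an error.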

Edit config/settings.py to adjust:

MAX_RETRIES: API retry attempts
MAX_OUTPUT_TOKENS: Generation token limit
MAX_MODEL_LEN: VLLM context window
MAX_CONNECTIONS: Concurrent API requests
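
To illustrate what a retry limit such as MAX_RETRIES typically controls, here is a minimal sketch of a retry loop with exponential backoff. It is an assumption about the general pattern, not the code used in this repository: the function call_with_retries, the value 3, and the backoff schedule are all hypothetical, and MAX_RETRIES is read here as the number of retries after the first attempt.

```python
import time

# Assumed value for illustration; the real value lives in config/settings.py.
MAX_RETRIES = 3


def call_with_retries(fn, max_retries=MAX_RETRIES, base_delay=1.0, sleep=time.sleep):
    """Call fn(); on an exception, retry up to max_retries times.

    The wait doubles after each failed attempt (base_delay, 2x, 4x, ...).
    The last exception is re-raised once the retries are exhausted.
    """
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_retries:
                raise
            sleep(base_delay * 2 ** attempt)
```

Passing sleep as a parameter keeps the sketch easy to test without actually waiting.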

Usage

1. Generate Conversations

Generate multi-turn responses using a language model:

# Using API model (e.g., Gemini)
python -m src.evaluation.generate --model-name-or-path google-langchain-api/gemini-3-flash-preview

# Using VLLM (requires GPU)
python -m src.evaluation.generate --model-name-or-path meta-llama/Meta-Llama-3-8B-Instruct

# Using Ollama (local)
python -m src.evaluation.generate --model-name-or-path ollama/llama3

This generates three files per model, one per prompt condition (neutral, pt-pt, and pt-br), in results/<model_name>/.

2. Score Responses

Classifier-based Scoring

python -m src.evaluation.score_with_classifier results/<model_folder>

Outputs: results/<model_folder>/class_scores/*.csv

LLM Judge Scoring

python -m src.evaluation.score_with_llm \
    results/<model_folder> \
    --judge_name gemini-3-flash-preview \
    --max_connections 50 \
    --no-accumulate-context

Outputs: results/<model_folder>/llm_scores/*.csv
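
To check what the two scoring steps produced, the sketch below lists every score CSV under a model folder together with its header row. It relies only on the output locations stated above (class_scores/ and llm_scores/); the function name score_file_headers is hypothetical, and no column names are assumed.

```python
import csv
from pathlib import Path


def score_file_headers(model_folder):
    """Map each score CSV under model_folder to its header row.

    Looks in the class_scores/ and llm_scores/ subfolders; a missing
    subfolder simply contributes no entries.
    """
    headers = {}
    for subdir in ("class_scores", "llm_scores"):
        for path in sorted(Path(model_folder, subdir).glob("*.csv")):
            with path.open(newline="", encoding="utf-8") as handle:
                # An empty file yields an empty header list.
                headers[f"{subdir}/{path.name}"] = next(csv.reader(handle), [])
    return headers
```

For example, calling score_file_headers("results/<model_folder>") with a real folder name returns a dictionary keyed by relative file path.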

3. Aggregate Results

python -m src.analysis.aggregation

Creates:

results/combined_comprehensive_scores_llm_scores.csv - All model scores
results/z_classifier_scores/ - Aggregated classifier results
results/z_llm_scores/ - Aggregated LLM judge results

4. Visualize Turn-Level Performance

python -m src.analysis.turn_analysis \
    results/z_classifier_scores/turn_level/aggregated_scores_by_turn_normal_all_prompts_PtVId.csv \
    --max-turns 3

Generates: outputs/turn_progression_*.pdf

Supported Model Backends

API Models

Gemini (via LangChain): google-langchain-api/gemini-3-flash-preview
Sabia: set MARITACA_API_KEY in .env

VLLM Models

Any Hugging Face model supported by VLLM (requires a CUDA-capable GPU)
Examples: meta-llama/Meta-Llama-3-8B-Instruct, mistralai/Mistral-7B-Instruct-v0.2

Ollama Models

Requires a running local Ollama server
Format: ollama/<model_name>
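
The naming conventions above suggest that the backend can be inferred from the prefix of the value passed to --model-name-or-path. The sketch below shows one way this could work; it is an assumption for illustration, not the repository's actual dispatch logic. The function select_backend and the returned labels are hypothetical, and Sabia models are not covered because their name format is not given here.

```python
def select_backend(model_name_or_path: str) -> str:
    """Guess the backend from the model name prefix.

    Prefixes follow the examples in this README; anything without a
    known prefix is treated as a Hugging Face model served by VLLM.
    """
    if model_name_or_path.startswith("google-langchain-api/"):
        return "api"
    if model_name_or_path.startswith("ollama/"):
        return "ollama"
    return "vllm"


if __name__ == "__main__":
    for name in (
        "google-langchain-api/gemini-3-flash-preview",
        "ollama/llama3",
        "meta-llama/Meta-Llama-3-8B-Instruct",
    ):
        print(name, "->", select_backend(name))
```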

Evaluation Metrics

Classifier Scores

Transformer-based models classify responses as European Portuguese (pt_pt) or Brazilian Portuguese (pt_br) with probability scores (0-1).
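
As a rough illustration of how such probabilities can be summarized, the sketch below turns a list of per-response scores into label shares. It assumes a binary classifier that reports P(pt_pt), so that P(pt_br) = 1 - P(pt_pt); the function name, the 0.5 threshold, and the example scores are all hypothetical and are not taken from the benchmark.

```python
from statistics import mean


def summarize_variant_scores(pt_pt_probs, threshold=0.5):
    """Summarize per-response P(pt_pt) scores.

    Assumes a binary classifier, so P(pt_br) = 1 - P(pt_pt).
    A response is labeled pt_pt when its score reaches the threshold.
    """
    if not pt_pt_probs:
        raise ValueError("need at least one score")
    labels = ["pt_pt" if p >= threshold else "pt_br" for p in pt_pt_probs]
    return {
        "mean_pt_pt_prob": mean(pt_pt_probs),
        "share_pt_pt": labels.count("pt_pt") / len(labels),
        "share_pt_br": labels.count("pt_br") / len(labels),
    }


# Made-up scores for illustration only; not results from the benchmark.
example = [0.9, 0.2, 0.7, 0.4]
print(summarize_variant_scores(example))
```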

LLM Judge Scores

Gemini evaluates responses on a 0-10 scale with explanations for:

Portuguese variant preference
Linguistic markers

Run and Slurm Scripts for Generating and Scoring

A simple script runs the complete pipeline for a single model:

# Run the complete pipeline for one model
bash run_scripts/run_pipeline.sh <model_name>

<model_name> can be any Hugging Face model compatible with VLLM (e.g., meta-llama/Meta-Llama-3-8B-Instruct), a Gemini model, or a model served by an Ollama server.

SLURM scripts for batch processing multiple models:

# Submit single model job
sbatch run_scripts/run_single_model.sh <model_path>

# Submit all models
bash run_scripts/submit_all_models.sh

# Run LLM scoring (as a separate step to avoid rate limits)
sbatch run_scripts/run_llm_scoring.sh

Adapt the scripts to your cluster environment and model list.

Citation

If you find this work relevant, please cite:

@inproceedings{ferreira_p3b3,
      title={{P3B3}: A Multi-Turn Conversational Benchmark for Measuring European and Brazilian Portuguese Variety Bias in {LLMs}}, 
      author={Rafael Ferreira and Inês Vieira and Inês Calvo and James Furtado and Iago Paulo and Diogo Tavares and Diogo Glória-Silva and David Semedo and João Magalhães},
      booktitle={Proceedings of the 1st Workshop on Multilinguality in the Era of Large Language Models (MeLLM)},
      year={2026},
      publisher={Association for Computational Linguistics},
      eprint={2606.16753},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2606.16753}, 
}
