OcWikiDialects: A Wikipedia Dataset with Rich Metadata for Occitan Dialect Identification

This repository contains the code, dataset, models and analyses described in the following paper, presented at VarDial 2026:

@inproceedings{nedey-etal-2026-ocwikidialects,
    title = "{O}c{W}iki{D}ialects: A {W}ikipedia Dataset With Rich Metadata for {O}ccitan Dialect Identification",
    author = "N{\'e}dey, Oriane  and
      Bawden, Rachel  and
      Cl{\'e}rice, Thibault  and
      Sagot, Beno{\^i}t",
    editor = {Scherrer, Yves  and
      Aepli, No{\"e}mi  and
      Blaschke, Verena  and
      Jauhiainen, Tommi  and
      Ljube{\v{s}}i{\'c}, Nikola  and
      Nakov, Preslav  and
      Tiedemann, J{\"o}rg  and
      Zampieri, Marcos},
    booktitle = "Proceedings of the 13th Workshop on {NLP} for Similar Languages, Varieties and Dialects (VarDial 2026)",
    month = mar,
    year = "2026",
    address = "Rabat, Morocco",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2026.vardial-1.4/",
    doi = "10.18653/v1/2026.vardial-1.4",
    pages = "45--57"
}

Released assets

The released dataset and models can be found in the assets of the GitHub release. We released the OcWikiDialects dataset (JSONL + CSV splits) as well as the SVM, FastText and BERT models trained on the concatenated datasets without text preprocessing. Due to its size, our custom FastText embedding model will be released at another location.

The OcWikiDialects dataset

The OcWikiDialects dataset contains 7,247 articles, each split into paragraphs (57k in total) and into sentences (288k), for a total of approximately 4M tokens. Ten varieties of Occitan (or closely related to it) are represented: Auvergnat, Gascon, Limousin, Languedocian, Provençal, Vivaroalpine, Aranese, Niçard, Aguianese and Marchese.

This dataset was built from two sources:

metadata from the XML dump dated 20250901, taking into account all revisions until 2025-08-20
clean Markdown texts from FineWiki, built from the HTML dump dated 20250820

The dataset is available in JSONL format. Each article is associated with the following elements:

id
title
dialect: dialect category found in the XML dump
text_xml: raw text from the XML dump, in WikiText format
text_finewiki: clean Markdown text from FineWiki
paragraphs: list of lightly cleaned paragraphs extracted from text_finewiki
segments: list of lightly cleaned segments extracted from text_finewiki
len_bytes_creation: size of the article (in bytes) in its first revision
len_bytes_latest: size of the article (in bytes) in the latest revision
date_creation: date of the first revision
date_latest_xml: date of the latest revision in the XML dump
date_latest_finewiki: date of the latest revision in FineWiki
latest_revision_id
nb_revisions
nb_contributors
user_creator_id: ID of the user who made the first revision
user_latest_id: ID of the user who made the latest revision
user_most_contrib_id: user with the highest number of contributions to this article
user_biggest_contrib_id: user who made the most changes on this article (in bytes, negative contributions are counted positively with a 0.5 factor)
user_rank_freq: ranking of users by the number of contributions made to the article
user_rank_size: ranking of users by the size of their contributions made to the article (negative contributions are counted positively with a 0.5 factor)
oc_level_first: Occitan proficiency level declared by the user of the first revision
oc_level_max: highest Occitan proficiency level declared among all contributors of the article
oc_level_rank_freq: ranking of Occitan levels by the number of contributions from users declaring each level
oc_level_rank_size: ranking of Occitan levels by the size of contributions from users declaring each level (negative contributions are counted positively with a 0.5 factor)
bot_created: True if the user of the first contribution was detected as a bot (username and/or tag in user page)
bot_first_author: True if the user who made the first contribution of more than +100 bytes (or possibly +200) was detected as a bot
bot_nb_revisions: Number of revisions authored by a user detected as a bot (username and/or tag in user page)
user_dialects: list of dialects declared among all contributors of the article (whitespace separated)
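As a minimal sketch of how these records might be consumed, the snippet below reads JSONL lines, counts articles per dialect, and splits the whitespace-separated user_dialects field. The sample records, their values, and the helper names iter_articles and weighted_size are illustrative assumptions, not part of the released code. weighted_size only illustrates the 0.5 factor applied to negative contributions, assuming that per-revision size changes are available as signed byte counts.

```python
import io
import json
from collections import Counter


def iter_articles(lines):
    """Yield one article dict per non-empty JSONL line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def weighted_size(byte_deltas):
    """Total contribution size; removals count positively at half weight."""
    return sum(d if d >= 0 else 0.5 * abs(d) for d in byte_deltas)


# Made-up records with a subset of the documented fields (values are hypothetical).
sample = io.StringIO(
    '{"id": 1, "title": "A", "dialect": "gascon", "user_dialects": "gascon lengadocian"}\n'
    '{"id": 2, "title": "B", "dialect": "gascon", "user_dialects": ""}\n'
    '{"id": 3, "title": "C", "dialect": "lemosin", "user_dialects": "lemosin"}\n'
)

articles = list(iter_articles(sample))
dialect_counts = Counter(a["dialect"] for a in articles)
declared = [a["user_dialects"].split() for a in articles]

print(dialect_counts.most_common())
print(declared)
print(weighted_size([120, -40, 10]))  # 120 + 0.5 * 40 + 10
```

To read the released file instead of the in-memory sample, an open file handle (opened with encoding="utf-8") can be passed to iter_articles in place of the StringIO object.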

Code to build the dataset

The dataset was built using the script ocwikidialects/ocwikidialects-pipeline.py, with all default options.

The code depends on the internal package wisteps as well as 3rd-party dependencies which can be installed as follows:

git clone git@github.com:DEFI-COLaF/OcWikiDialects.git
cd OcWikiDialects
pip install .

Note that the script will take some time to download the XML dump as well as to parse it, and that it creates an intermediary SQLite database that was used in our analyses.
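Since the schema of the intermediary SQLite database is not described here, a generic way to explore it is to list its tables and count their rows. The sketch below uses only the standard sqlite3 module; the in-memory database and the revisions table are stand-ins for illustration, and list_tables and count_rows are hypothetical helper names.

```python
import sqlite3


def list_tables(conn):
    """Return the names of all tables in a SQLite database, sorted."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    return [name for (name,) in rows]


def count_rows(conn, table):
    """Count rows in a table whose name was taken from list_tables()."""
    # Table names cannot be bound as parameters, so quote the identifier.
    quoted = '"' + table.replace('"', '""') + '"'
    return conn.execute(f"SELECT COUNT(*) FROM {quoted}").fetchone()[0]


# Stand-in database: the real file name and schema are not documented here.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE revisions (id INTEGER PRIMARY KEY, article_id INTEGER)")
conn.executemany("INSERT INTO revisions (article_id) VALUES (?)", [(1,), (1,), (2,)])

for table in list_tables(conn):
    print(table, count_rows(conn, table))
```

With the real database, sqlite3.connect would receive the path of the file created by the pipeline instead of ":memory:".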

Occitan Dialect Identification (ocDI) experiments

Our experiments rely on the internal ocDI package which can be installed by running pip install . at the root of this repository.

Data preparation

We experimented with the OcWikiDialects dataset as well as other datasets labelled with Occitan varieties.

The data preprocessing steps are described in ocdi_scripts/ocdi_data_prep/README.md.

Note that the experiments were conducted on an early version of OcWikiDialects that was based on the XML dump dated 20250801, where articles created after that date were not included. The 14 missing paragraphs were appended afterwards at the end of the released train and full splits.
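To approximate the data used in the experiments from a released split, one option is to set aside the trailing rows. The sketch below assumes that each appended paragraph occupies exactly one row at the end of the CSV file and that the file has a header row; the text and label column names, the sample rows, and the helper split_appended are assumptions made for illustration.

```python
import csv
import io


def split_appended(rows, n_appended=14):
    """Separate the trailing n_appended rows from the rest of a split."""
    rows = list(rows)
    if n_appended <= 0:
        return rows, []
    return rows[:-n_appended], rows[-n_appended:]


# Stand-in for a released CSV split; the "text" and "label" columns are assumed.
fake_csv = io.StringIO(
    "text,label\n" + "".join(f"paragraph {i},gas\n" for i in range(20))
)
rows = list(csv.DictReader(fake_csv))

original, appended = split_appended(rows)
print(len(original), len(appended))
```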

Experiments with SVM

We trained SVM models using scikit-learn. The training and evaluation pipelines are available in the following notebooks:

models without text preprocessing: experiments/svm/ocDI_svm_noprep.ipynb
models with text preprocessing: experiments/svm/ocDI_svm_prep.ipynb

We release the SVM model trained on the concatenated datasets without text preprocessing.

Example of model usage:

from ocdi.svm.save import load_sklearn_pipeline
from ocdi.svm.predict import predict_with_svm

# model_path: path to the released SVM model
model = load_sklearn_pipeline(model_path)

# Predict the dialect of each text in the list
predict_with_svm(model, ["Adishatz tot lo monde !"])

Experiments with FastText

We trained custom FastText embeddings using the script ocdi_scripts/training/fasttext/train_embs.py, then trained various classifiers with the script ocdi_scripts/training/fasttext/train_ft_classifer.py.

Training configs can be found under experiments/fasttext/configs:

Config fasttext_did_config_ccembs.yaml enables training a classifier on top of the official FastText vectors trained on CommonCrawl
Config fasttext_did_config_ocdiembs.yaml enables training a classifier on top of our custom embeddings
Config fasttext_preprocess_lapduen.yaml is meant to preprocess a dataset before training (cf. ocdi_scripts/ocdi_data_prep/fasttext/README.md)

We release our custom FastText embedding model as well as the FastText ocDI model trained with these embeddings on the concatenated datasets, without text preprocessing.

Predict with ocDI-fasttext:

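import fasttext

# model_path: path to the released ocDI FastText model
model = fasttext.load_model(model_path)
# predict returns a tuple of labels and an array of matching probabilities
labels, scores = model.predict("Adishatz tot lo monde !")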
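The fasttext predict method returns a tuple of labels together with their probabilities, and labels carry the library's default "__label__" prefix unless the model was trained with a different one. The helper below is a small sketch for turning that output into (label, probability) pairs; clean_prediction is a hypothetical name, the sample values are made up, and the use of the default prefix by the released model is an assumption.

```python
def clean_prediction(labels, probs, prefix="__label__"):
    """Pair each predicted label (prefix stripped) with its probability."""
    cleaned = []
    for label, prob in zip(labels, probs):
        if label.startswith(prefix):
            label = label[len(prefix):]
        cleaned.append((label, float(prob)))
    return cleaned


# Made-up values shaped like the return value of model.predict(text, k=2).
labels = ("__label__gas", "__label__lan")
probs = [0.91, 0.06]

print(clean_prediction(labels, probs))
```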

Evaluate:

import pandas as pd
from ocdi.fasttext.evaluate import evaluate_fasttext

df_test = pd.DataFrame({"text": ["Adishatz tot lo monde !"], "label": ["gas"]})
results = evaluate_fasttext(model_path, df_test)

Experiments with BERT

We fine-tuned pretrained BERT models using the script ocdi_scripts/training/bert/finetune.py and training configs in experiments/bert/configs.

We compared performance of the following models:

The suffix _prep in the config name indicates that the data is preprocessed before passing to the model.

We release the oc-mBERT model fine-tuned for the ocDI task on the concatenated datasets, without text preprocessing.

Predict with ocDI-bert:

from transformers import pipeline

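# model_path: path to the released fine-tuned oc-mBERT model
pipe = pipeline("text-classification", model=model_path)
# Returns a list with one {"label", "score"} dict per input text
pipe("Adishatz tot lo monde!")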
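A text-classification pipeline returns one dictionary per input text, with "label" and "score" keys. As a sketch of one possible post-processing step, the helper below keeps a prediction only when its score reaches a threshold. The helper name confident_labels, the 0.5 threshold, the "unknown" fallback, and the sample label names are all assumptions; the labels exposed by the released model may differ.

```python
def confident_labels(outputs, threshold=0.5, fallback="unknown"):
    """Map pipeline outputs to labels, using a fallback below the threshold."""
    return [
        out["label"] if out["score"] >= threshold else fallback
        for out in outputs
    ]


# Made-up values shaped like the output of pipe([...]) on two texts.
outputs = [
    {"label": "gas", "score": 0.97},
    {"label": "lan", "score": 0.41},
]

print(confident_labels(outputs))
```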

Evaluate:

from transformers import BertTokenizer, BertForSequenceClassification

from ocdi.bert.evaluate import evaluate_bert

model = BertForSequenceClassification.from_pretrained(model_path, local_files_only=True)
tokenizer = BertTokenizer.from_pretrained(model_path, local_files_only=True)

model_results, preds, true_labels = evaluate_bert(
    model,
    test_path,  # CSV file
    tokenizer,
    test_label_scheme="fasttext",  # label scheme of the test set (see ocdi.labels.SOURCE_TO_SCHEME)
    train_label_scheme="fasttext",  # label scheme of the model (for CONCAT, use 'fasttext')
    batch_size=1,
    num_workers=1,
    only_report=False,  # Set to True to only show the classification summary without returning scores
    show_confusion_matrix=False,  # Set to True to display the confusion matrix
)
