This repository contains code to reproduce the paper: Fast Organic Crystal Structure Prediction with Unit Cell Flow Matching (arXiv).
To install only the packages required for Clari to run inference:
pip install clari
or install all dependencies by cloning the repository and running
uv sync
See the full inference guide: inference reference, which doubles as an agent skill and can be printed by running uv run clari-skill.
The workflow has three steps:
1. clari: sample candidate crystal structures → predictions.parquet (+ .cif files by default)
2. rank: score with FairChem UMA energy → rankings.csv
3. export-cifs: re-export .cif files from saved samples (e.g. a ranked subset)
clari writes .cif files automatically after sampling; pass --no-export-cifs
to write only predictions.parquet.
Models (clari-m, clari-l, clari-h) download automatically from HuggingFace on
first use. The default is clari-h (highest quality); pass --model clari-m for
the fastest model.
# 10 candidate structures for ethanol, written to results/CCO_x4/
clari "CCO" --samples 10
The grammar is clari SMILES [copies] [SMILES [copies]]... In other words, a request
is a flat list of (component, copies) pairs. Dots in a SMILES split it into
components, a copies value broadcasts over the dot-separated components of its
token, and omitted copies default to 4 (the Z value, molecules per unit cell).
Hydrogens are added automatically.
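The grammar above can be illustrated with a small parser. This is a minimal sketch of the rules as described, not Clari's actual argument handling: parse_request and DEFAULT_COPIES are hypothetical names introduced here, and the sketch assumes a copies value is any token made only of digits.

```python
from typing import List, Tuple

DEFAULT_COPIES = 4  # documented default (Z, molecules per unit cell)


def parse_request(tokens: List[str]) -> List[Tuple[str, int]]:
    """Hypothetical helper: turn positional tokens into (component, copies) pairs."""
    pairs: List[Tuple[str, int]] = []
    i = 0
    while i < len(tokens):
        smiles = tokens[i]
        if smiles.isdigit():
            raise ValueError(f"expected a SMILES string, got copies value {smiles!r}")
        i += 1
        copies = DEFAULT_COPIES
        # An optional integer right after a SMILES token is its copies value.
        if i < len(tokens) and tokens[i].isdigit():
            copies = int(tokens[i])
            i += 1
        # Dots split the token into components; copies broadcasts over all of them.
        for component in smiles.split("."):
            pairs.append((component, copies))
    return pairs


print(parse_request(["CCO.O", "2"]))  # [('CCO', 2), ('O', 2)]
print(parse_request(["CCO"]))         # [('CCO', 4)]
```

The first call mirrors the dotted-SMILES case, where a single copies value applies to both components; the second shows the default of 4 when copies is omitted.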
Known limitation: Clari ignores tetrahedral chirality tags (e.g.
[C@@H]), and will generate samples with arbitrary tetrahedral chirality.
# `uv run clari` is equivalent to `clari`
uv run clari "CC(=O)Oc1ccccc1C(=O)O" 1 "O" 3 --samples 8 # aspirin trihydrate co-crystal
uv run clari "CCO.O" 2 # dotted SMILES: (CCO,2),(O,2)
uv run clari "CCO" --model clari-h --id ethanol # pick model, label outputs
uv run clari "CCO" --samples 10 --seed 0 # reproducible sampling
The --smiles/--copies flags are a synonym for the positional form (use one or the
other): clari --smiles "CC(=O)Oc1ccccc1C(=O)O" --copies 1 --smiles "O" --copies 3.
--id labels the output rows and becomes the CIF subdirectory name; it is
auto-generated from the SMILES if omitted. Prefer setting it explicitly, since the
auto-generated SMILES-based name is cryptic and can collide. --output-dir defaults to results/<id>.
uv run clari --config batch.json
{
  "model": "clari-m",
  "output_dir": "results/batch_run",
  "requests": [
    { "id": "ethanol", "smiles": "CCO", "copies": 4, "samples": 4 },
    {
      "id": "aspirin_trihydrate",
      "smiles": [["CC(=O)Oc1ccccc1C(=O)O", 1], ["O", 3]],
      "samples": 4,
      "batch_size": 8
    }
  ]
}
Top-level keys (all optional): model, output_dir, use_ema, use_bf16, pbar, filter_clashing.
Per-request keys: id, smiles, copies, samples, batch_size.
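For many requests it can be convenient to generate the config file from a script. The sketch below assembles a config with the keys listed above and writes it to batch.json. build_batch_config is a hypothetical helper, and the unknown-key check is an assumption added for safety; whether Clari itself rejects unknown keys is not stated here.

```python
import json
from pathlib import Path

# Key names taken from the lists above.
TOP_LEVEL_KEYS = {"model", "output_dir", "use_ema", "use_bf16", "pbar", "filter_clashing"}
REQUEST_KEYS = {"id", "smiles", "copies", "samples", "batch_size"}


def build_batch_config(requests, **top_level):
    """Hypothetical helper: assemble a batch config dict, rejecting unknown keys."""
    unknown = set(top_level) - TOP_LEVEL_KEYS
    if unknown:
        raise ValueError(f"unknown top-level keys: {sorted(unknown)}")
    for request in requests:
        unknown = set(request) - REQUEST_KEYS
        if unknown:
            raise ValueError(f"unknown request keys: {sorted(unknown)}")
    return {**top_level, "requests": list(requests)}


config = build_batch_config(
    [
        {"id": "ethanol", "smiles": "CCO", "copies": 4, "samples": 4},
        {
            "id": "aspirin_trihydrate",
            "smiles": [["CC(=O)Oc1ccccc1C(=O)O", 1], ["O", 3]],
            "samples": 4,
        },
    ],
    model="clari-m",
    output_dir="results/batch_run",
)
Path("batch.json").write_text(json.dumps(config, indent=2))
```

The resulting file can then be passed to uv run clari --config batch.json.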
Batch configs are convenience orchestration for running several independent requests. Each request is sampled and written separately, rather than combining independent requests into one parquet:
results/batch_run/
  manifest.json
  ethanol/
    predictions.parquet
    config.json
  aspirin_trihydrate/
    predictions.parquet
    config.json
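Because each request lands in its own subdirectory, downstream scripts can discover them by walking the batch output directory. The following is a minimal sketch that relies only on the layout shown above; find_request_outputs is a hypothetical helper, and it does not read manifest.json, whose schema is not described here.

```python
from pathlib import Path
from typing import Dict


def find_request_outputs(batch_dir: str) -> Dict[str, Path]:
    """Map each request id (subdirectory name) to its predictions.parquet."""
    outputs: Dict[str, Path] = {}
    for child in sorted(Path(batch_dir).iterdir()):
        parquet = child / "predictions.parquet"
        # Skip manifest.json and any directory without saved predictions.
        if child.is_dir() and parquet.is_file():
            outputs[child.name] = parquet
    return outputs


if Path("results/batch_run").is_dir():
    for request_id, parquet in find_request_outputs("results/batch_run").items():
        print(request_id, parquet)
```

Each discovered directory (for example results/batch_run/ethanol) can then be passed individually to rank or export-cifs.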
Ranking requires fairchem-core:
pip install "clari[uma]" # or: uv sync --extra uma
uv run --extra uma rank results/ethanol
clari already exports CIFs to <output_dir>/cifs/<id>/ after sampling (disable
with --no-export-cifs). Use export-cifs to re-export later, for example a
ranked subset once rank has produced rankings.csv:
uv run export-cifs results/ethanol # all samples
uv run export-cifs results/ethanol --top-k 3 # top 3 ranked (requires rankings.csv)
uv run export-cifs results/ethanol --sample-idx 0 --sample-idx 2
uv run export-cifs results/batch_run/ethanol # one request from a batch run
uv run export-cifs results/ethanol --output-dir my_cifs/
Filenames: <id>/sample_000000.cif without rankings, <id>/rank_0000_sample_000000.cif with rankings.
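When post-processing exported files, it can help to reconstruct the expected path for a given sample. The sketch below reproduces the two filename patterns shown above. cif_filename is a hypothetical helper; the zero-padding widths (6 digits for the sample index, 4 for the rank) and zero-based ranks are inferred from the example names rather than from a documented guarantee.

```python
from typing import Optional


def cif_filename(request_id: str, sample_idx: int, rank: Optional[int] = None) -> str:
    """Hypothetical helper: build the relative CIF path for one sample."""
    stem = f"sample_{sample_idx:06d}"
    if rank is not None:
        # Ranked exports prefix the sample name with its rank.
        stem = f"rank_{rank:04d}_{stem}"
    return f"{request_id}/{stem}.cif"


print(cif_filename("ethanol", 0))          # ethanol/sample_000000.cif
print(cif_filename("ethanol", 2, rank=0))  # ethanol/rank_0000_sample_000002.cif
```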
from clari.inference import ClariSampler
sampler = ClariSampler("clari-m")
crystals = sampler.sample("CCO", id="ethanol", samples=8) # in-memory
sampler.sample("CCO", id="ethanol", samples=8, output_dir="results/ethanol") # disk-backed
# Co-crystal: dot-separated SMILES (uniform copies) or list (per-component copies)
sampler.sample("CCO.O", id="ethanol_hydrate", copies=2, samples=4)
sampler.sample(
    ["CC(=O)Oc1ccccc1C(=O)O", "O"],
    id="aspirin_trihydrate",
    copies=[1, 3],
    samples=4,
    output_dir="results/aspirin_trihydrate",
)
sample() kwargs: id, copies (int or list, default 4), samples (default 1), output_dir, filter_clashing.
Pass filter_clashing=True to the ClariSampler(...) constructor or top-level sample(...) helper to discard clashing structures and resample replacements.
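The discard-and-resample behavior can be pictured as a simple rejection loop. This is a generic sketch of the idea only, not Clari's implementation: sample_without_clashes, draw, is_clashing, and max_draws are all hypothetical, and the toy "structure" below is just a number standing in for a shortest interatomic distance.

```python
import random
from typing import Callable, List, TypeVar

T = TypeVar("T")


def sample_without_clashes(
    draw: Callable[[], T],
    is_clashing: Callable[[T], bool],
    samples: int,
    max_draws: int = 1000,
) -> List[T]:
    """Keep drawing candidates until `samples` non-clashing ones are collected."""
    kept: List[T] = []
    draws = 0
    while len(kept) < samples:
        if draws >= max_draws:
            raise RuntimeError(f"only {len(kept)} clean samples after {draws} draws")
        candidate = draw()
        draws += 1
        # Clashing candidates are dropped; the loop draws a replacement.
        if not is_clashing(candidate):
            kept.append(candidate)
    return kept


# Toy stand-ins: a "structure" is its shortest interatomic distance (arbitrary units).
rng = random.Random(0)
structures = sample_without_clashes(
    draw=lambda: rng.uniform(0.5, 3.0),
    is_clashing=lambda distance: distance < 1.0,  # illustrative threshold only
    samples=8,
)
print(len(structures))  # 8
```

The practical consequence is that a filtered run returns the requested number of samples but may take more model evaluations than an unfiltered one.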
from clari.inference import save, rank, export_cifs
crystals = sampler.sample("CCO", id="ethanol", samples=100)
save(crystals, "results/ethanol")
df = rank("results/ethanol") # writes energies.csv + rankings.csv, returns DataFrame
df = rank(crystals) # fully in-memory: ranks a list of Crystals, writes nothing
export_cifs("results/ethanol")
export_cifs("results/ethanol", top_k=3)
export_cifs("results/ethanol", sample_idx=[0, 2])
export_cifs("results/ethanol", output_dir="my_cifs/ethanol")
export_cifs(crystals, output_dir="my_cifs/", id="ethanol")
To install all dependencies needed for development, use uv:
uv sync
uv run -s *.py ... resolves and caches that environment. You will still need a valid CCDC license configured on the machine.
We expect the final data folder to be structured as follows:
data/
  raw/
    csd_metadata.parquet
    csd_conquest.parquet
  csd/
    config.json
    metadata.parquet
    {train,val,test}.pt
To generate the data, first extract the metadata of entries in CSD:
uv run -s scripts/data/0_metadata.py
This creates the csd_metadata.parquet file from above. Next, download ALL of CSD in .mol2 and .cif format using ConQuest (not csd-python-api since it sanitizes molecules and removes some bond information) into the csd_conquest.parquet file. Finally, generate the data/csd folder with:
uv run python -m scripts.data.1_process --num_workers=16
For reference, the CSD refcodes we use and our dataset split are uploaded to HuggingFace.
Training and evaluation paths default to data/, results/, and logs/ under the current working directory. Override them with CLARI_DATA_DIR, CLARI_RESULTS_DIR, and CLARI_LOG_DIR; see clari/paths.py.
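The override behavior described above can be sketched as follows. This is an illustration of the documented defaults, not the contents of clari/paths.py: resolve_dir and _DEFAULTS are hypothetical names, and treating an empty variable as unset is an assumption.

```python
import os
from pathlib import Path

# Environment variable -> default directory name under the current working directory.
_DEFAULTS = {
    "CLARI_DATA_DIR": "data",
    "CLARI_RESULTS_DIR": "results",
    "CLARI_LOG_DIR": "logs",
}


def resolve_dir(env_var: str) -> Path:
    """Hypothetical helper: return the override if set, else the cwd-relative default."""
    override = os.environ.get(env_var)
    if override:  # assumption: an empty value counts as unset
        return Path(override).expanduser()
    return Path.cwd() / _DEFAULTS[env_var]


for name in _DEFAULTS:
    print(name, "->", resolve_dir(name))
```

For example, running with CLARI_DATA_DIR=/path/to/data would make the first line print that path, while the other two stay under the current working directory.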
To reproduce the paper numbers, run the stages below in order. Each stage writes into the same <experiment_dir> and reads what the previous stage produced.
These evaluation commands require the prepared data/csd directory described above. When running CLARI from an installed package, run the commands from a working directory containing data/csd, or set CLARI_DATA_DIR=/path/to/data.
# 0. One-time: build the GT CIF cache the standalone compack script reads
uv run python clari/evaluation/build_test_cifs_cache.py
# 1. Sample the CSD test set, creates a folder results/experiment_dir.
# Use clari-m, clari-l, or clari-h as the first argument.
uv run sample-test clari-m <num_samples> <experiment_dir> --subset <teaching/oxtal>
# 2. Clash check (writes collision.csv)
uv run collision <experiment_dir>
# 3. UMA energies (writes energies.csv)
uv run compute-energies <experiment_dir>
# 4. COMPACK packing similarity (writes compack.csv, isolated uv script env)
uv run -s clari/evaluation/compack.py <experiment_dir> --num-processes n
# 5. Summary table (SolC per subset, all k)
uv run summarize <experiment_dir>
The exact commands used to train our ablated and final models can be found in scripts/train. After running inference as above, the metrics used for ablations are defined in:
from clari.pipelines.utils.metrics import assess_crystals_eval
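Since the evaluation stages listed earlier must run in order against the same experiment directory, it may be convenient to generate the command list from one place. The sketch below only builds and prints the commands; evaluation_commands is a hypothetical helper, and the argument values in the usage lines ("my_experiment", 10 samples, the "oxtal" subset, 8 processes) are placeholders, not recommended settings.

```python
import shlex
from typing import List


def evaluation_commands(
    model: str,
    num_samples: int,
    experiment_dir: str,
    subset: str,
    num_processes: int,
) -> List[List[str]]:
    """Hypothetical helper: the evaluation stages, in order, as argv lists."""
    return [
        ["uv", "run", "python", "clari/evaluation/build_test_cifs_cache.py"],
        ["uv", "run", "sample-test", model, str(num_samples), experiment_dir, "--subset", subset],
        ["uv", "run", "collision", experiment_dir],
        ["uv", "run", "compute-energies", experiment_dir],
        ["uv", "run", "-s", "clari/evaluation/compack.py", experiment_dir,
         "--num-processes", str(num_processes)],
        ["uv", "run", "summarize", experiment_dir],
    ]


# Placeholder arguments for illustration only.
for argv in evaluation_commands("clari-m", 10, "my_experiment", "oxtal", 8):
    print(shlex.join(argv))
```

To execute the stages rather than print them, each argv list could be passed to subprocess.run(argv, check=True) so that a failing stage stops the pipeline before later stages read incomplete outputs.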
The source code of Clari is released under the MIT License (see LICENSE).
However, Clari was trained on data from the Cambridge Structural Database (CSD). Therefore, the model weights are licensed under a Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0) (see LICENSE_MODEL). For commercial use of the model weights, please ensure that you have a proper CCDC License.
@misc{lo2026clari,
title={Fast Organic Crystal Structure Prediction with Unit Cell Flow Matching},
author={Alston Lo and Luka Mucko and Austin H. Cheng and Andy Cai and Alastair J. A. Price and Wojciech Matusik and Alán Aspuru-Guzik},
year={2026},
eprint={2606.03199},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2606.03199},
}