Learning to defer (L2D) algorithms improve human-AI collaboration (HAIC) by deferring decisions to human experts when they are more likely to be correct than the AI model. This framework hinges on machine learning (ML) models’ ability to assess their own certainty and that of human experts. L2D struggles in dynamic environments, where distribution shifts impair deferral. We argue that robust HAIC in dynamic environments requires uncertainty-driven policy switching rather than reliance on a single deferral strategy. To operationalize this principle, we introduce two uncertainty-aware approaches that estimate epistemic uncertainty to guide the deferral policy choice. Both methods are the first uncertainty-aware approaches for HAIC that also address limitations of L2D systems including cost-sensitive scenarios, limited human predictions, and capacity constraints. Empirical evaluation in fraud detection shows both approaches outperform state-of-the-art baselines while improving calibration and supporting real-world adoption.
To ensure full reproducibility, we provide the code used to run all experiments. Datasets, models, and results are organized in the repository as follows.
Data
- `data/` — Main dataset and config (`data_config.yaml`).
- `alert_data/` — Alert data and preprocessing.
- `drift_alert_data/` — Drift alert data (for creating drift scenarios).
Synthetically generated data
- `L2D/synthetic_experts/` — Expert predictions and probabilities of error (train/test, per noise level and seed); expert parameters and properties.
- `capacity_constraints/` — Human capacity matrix H and batch vector b for each deferral rate and noise scenario.
Models
- `alert_model/` — Alert model.
- `L2D/classifier_h/` — Classifier h (selected models and training artifacts in `selected_model/`, `models/`, `param_spaces/`).
- `L2D/expert_models/ova/` — One-vs-all (OvA) expert classifiers.
- `density_softmax/` — Density-softmax / expert density models (`expert_density_models/`).
- `density_based_CP/` — Conformal prediction and feature extraction; processed data in `feature_extraction/processed_data/`.
Results
- `results/` — Analysis notebook (`results.ipynb`), figures (`figs/`), LaTeX tables (`latex_tables/`, `short_results_table.tex`), and deferral experiment outputs: assignment matrices, counts (e.g. `count_rl`, `count_l2d`), and alpha parameter tuning results under `deferral_results/`.
Note: LightGBM models can produce different results depending on operating system, Python version, and number of cores used during training.
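If you need tighter run-to-run reproducibility, one option (a sketch, not the exact configuration used in the paper) is to pin LightGBM's determinism-related parameters; the `reproducible_lgbm_params` helper below is hypothetical:

```python
# Illustrative LightGBM parameters that reduce run-to-run variation.
# These are real LightGBM options, but not necessarily the paper's settings.
lgbm_params = {
    "objective": "binary",
    "deterministic": True,   # force deterministic histogram construction
    "force_row_wise": True,  # skip the row-/col-wise auto-choice, which can differ per machine
    "num_threads": 1,        # single-threaded training removes thread-order nondeterminism
    "seed": 42,              # master seed (also seeds bagging / feature subsampling)
}

def reproducible_lgbm_params(seed: int = 42) -> dict:
    """Return a copy of the base parameters with the given seed."""
    params = dict(lgbm_params)
    params["seed"] = seed
    return params
```

Note that `num_threads=1` trades training speed for determinism; results may still differ across LightGBM versions.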
The paper is available here. For the FiFAR dataset and OpenL2D framework, see Alves et al. (2025a, 2025b) and the links in the paper.
- Install Python 3.7 (if not already installed): `pyenv install 3.7.16`
- Create a virtual environment: `pyenv virtualenv 3.7.16 uncertainty-l2d`
- Activate the virtual environment: `pyenv activate uncertainty-l2d` (or `source ~/.pyenv/versions/3.7.16/envs/uncertainty-l2d/bin/activate`)
- Install dependencies: `pip install -r requirements.txt`
The paper uses the FiFAR dataset (Alves et al., 2025a): 30K instances flagged by a fraud detection model (months 4–8) with synthetic expert predictions. That dataset already includes the alert set and fraud model scores, so the alert model is not trained in the paper’s pipeline. The steps below assume you have the alert data in the expected form; if you start from raw BAF (e.g. data/Base.csv) instead of FiFAR, use the optional Step 2 to create it.
Attention: Run each Python script from the directory that contains it (e.g. cd L2D/classifier_h && python training.py) so relative paths in the code resolve correctly.
Activate the Python environment and install dependencies as described in Creating the python environment.
If you are not using the pre-built FiFAR alert set, create the alert set from the raw BAF data:
- `alert_model/training.py` — Trains the fraud (alert) model on months < 3, scores the deployment months, and writes `alert_data/BAF_alert_model_score.parquet`.
- `alert_data/preprocessing.py` — Applies the 5% FPR threshold to obtain the 30K alerts and writes `alert_data/alerts.parquet`.
If you already have FiFAR (or equivalent), place the 30K-alert table as alert_data/alerts.parquet (and optionally the full scored deployment set as alert_data/BAF_alert_model_score.parquet) and skip this step.
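The 5% FPR thresholding step can be pictured as follows. This is an illustrative helper (the `fpr_threshold` name and the flagging rule `score >= threshold` are assumptions; the actual logic lives in `alert_data/preprocessing.py`):

```python
import numpy as np

def fpr_threshold(scores: np.ndarray, labels: np.ndarray, fpr: float = 0.05) -> float:
    """Score threshold such that at most a `fpr` fraction of negatives are flagged.

    Instances with score >= threshold are raised as alerts.
    """
    neg_scores = np.sort(scores[labels == 0])
    # Number of negatives allowed above the threshold.
    k = int(np.floor(fpr * len(neg_scores)))
    if k == 0:
        return float("inf")  # no false positives allowed
    return float(neg_scores[-k])

# Toy example: 10 negatives, 20% FPR -> the top 2 scores become alerts.
scores = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.95])
labels = np.zeros(10, dtype=int)
t = fpr_threshold(scores, labels, fpr=0.2)
```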
Create noisy test sets for distribution-shift evaluation (Section 6 of the paper): run drift_alert_data/create_drift.py from drift_alert_data/. This writes parquet files under drift_alert_data/ for different noise levels and seeds.
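Conceptually, each drift scenario is one corrupted copy of the test features per (noise level, seed) pair. A minimal sketch with Gaussian feature noise (illustrative only; the actual corruption is defined in `drift_alert_data/create_drift.py`):

```python
import numpy as np

def add_gaussian_drift(X: np.ndarray, noise_std: float, seed: int) -> np.ndarray:
    """Return a noisy copy of the numeric feature matrix X.

    One corrupted copy is produced per (noise_std, seed) combination,
    matching the per-noise-level, per-seed parquet files.
    """
    rng = np.random.default_rng(seed)  # seeded for reproducibility
    return X + rng.normal(0.0, noise_std, size=X.shape)

X = np.zeros((4, 3))
X_noisy = add_gaussian_drift(X, noise_std=0.5, seed=0)
```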
Both the uncertainty-aware L2D and the conformal prediction for HAIC systems share the same classifier h. Run L2D/classifier_h/training.py from L2D/classifier_h/. It uses alerts.parquet and (for noisy evaluation) the drift data from Step 3.
Train the shared MLP feature extractor (penultimate layer used for density estimation). Run density_based_CP/feature_extraction/training.py from density_based_CP/feature_extraction/. This produces latent features under feature_extraction/processed_data/ used by both the conformal prediction and density-softmax pipelines.
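Extracting penultimate-layer "latent" features from an MLP amounts to stopping the forward pass one layer before the output. A toy numpy sketch (the real extractor is the trained network in `density_based_CP/feature_extraction/`; weights and layer sizes here are arbitrary):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def penultimate_features(X, W1, b1, W2, b2):
    """Forward pass of a small MLP, returning the activations of the
    last hidden (penultimate) layer. These activations, rather than the
    output logits, are used as features for density estimation."""
    h1 = relu(X @ W1 + b1)
    h2 = relu(h1 @ W2 + b2)  # penultimate layer: stop before the output head
    return h2

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))                       # 5 instances, 8 raw features
W1, b1 = rng.normal(size=(8, 16)), np.zeros(16)
W2, b2 = rng.normal(size=(16, 4)), np.zeros(4)
Z = penultimate_features(X, W1, b1, W2, b2)       # 5 instances, 4 latent features
```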
The core expert data (expert predictions, probability of error, and expert parameters per instance) come from FiFAR / OpenL2D (Alves et al., 2025a); this repo does not generate them. Place `expert_predictions.parquet`, `prob_of_error.parquet`, and `expert_parameters.parquet` in `L2D/synthetic_experts/`. Then, after Step 3, run `expert_pred_on_drift_data.py` and `expert_prob_of_error_on_drift_data.py` (from `L2D/synthetic_experts/`) to build the train/test and noise-specific files (`train_expert_predictions.parquet`, `test_expert_predictions_{noise}_seed_{seed}.parquet`, and the corresponding prob_of_error files). `expert_properties.py` uses those files for analysis and plots only.
- Density-softmax (RealNVP): Run `density_softmax/training.py` from `density_softmax/` to train the density models used for uncertainty-aware L2D.
- OvA expert classifiers: Train the One-vs-All LightGBM classifiers (classifier h and per-expert correctness models) via the code in `L2D/expert_models/`, which uses classifier h from Step 4 and expert data from Step 6.
- Density-based conformal prediction: Train class-conditional density models and compute calibration quantiles using the code in `density_based_CP/` (feature extractor from Step 5).
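The calibration-quantile step follows the standard split-conformal recipe, sketched below (generic illustration, not the repo's exact class-conditional variant):

```python
import numpy as np

def conformal_quantile(cal_scores: np.ndarray, alpha: float) -> float:
    """Split-conformal threshold: the ceil((n + 1) * (1 - alpha))-th smallest
    calibration nonconformity score. Test points whose nonconformity score is
    at most this threshold are included in the prediction set, giving
    1 - alpha marginal coverage."""
    n = len(cal_scores)
    k = int(np.ceil((n + 1) * (1 - alpha)))  # 1-indexed rank
    k = min(k, n)
    return float(np.sort(cal_scores)[k - 1])

# With 99 calibration scores 1..99 and alpha = 0.1, the rank is
# ceil(100 * 0.9) = 90, so the threshold is the 90th smallest score.
cal = np.arange(1, 100, dtype=float)
q = conformal_quantile(cal, alpha=0.1)
```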
- L2D with density-softmax: Run `density_softmax/get_preds.py` from `density_softmax/` to produce deferral predictions under `L2D/deferral/l2d_ds_predictions/`.
- Capacity constraints: Run `capacity_constraints/define_H_and_b.py` from `capacity_constraints/` to define the capacity matrix H and batch vector b for each deferral rate and expert count; outputs go to `capacity_constraints/`.
- All of `results/deferral_results/`: Run `density_based_CP/deferral/deferral.py` from `density_based_CP/deferral/`. It reads the setting lists in `results/deferral_results/settings_*.json` (cp, ds, l2d, rl, random) and, for each strategy, solves the capacity-constrained assignment and writes assignment matrices to `assignment_matrices/`, `ds_assignment_matrices/`, `l2d_assignment_matrices/`, `rl_assignment_matrices/`, and `random_assignment_matrices/`. For the CP strategy it also writes to `alphas/`, `count_rl/`, and `count_l2d/`. Then run `density_based_CP/deferral/results.py` from the same directory to compute misclassification costs and other metrics from those assignment matrices.
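The capacity-constrained assignment can be pictured with a minimal greedy sketch: each instance goes to one decision-maker (the model or an expert), and no decision-maker may exceed its capacity. This is an illustrative simplification, not the solver used in `deferral.py`:

```python
import numpy as np

def greedy_capacity_assignment(cost: np.ndarray, capacity: np.ndarray) -> np.ndarray:
    """Assign each instance (row) to a decision-maker (column) so that
    column j receives at most capacity[j] instances, greedily taking the
    cheapest still-available (instance, decision-maker) pairs.

    cost: (n_instances, n_decision_makers) predicted misclassification cost.
    Returns the chosen column index for each row.
    """
    n, m = cost.shape
    assert capacity.sum() >= n, "capacities must cover all instances"
    assignment = np.full(n, -1)
    remaining = capacity.copy()
    # Visit all pairs from cheapest to most expensive.
    for flat in np.argsort(cost, axis=None):
        i, j = divmod(flat, m)
        if assignment[i] == -1 and remaining[j] > 0:
            assignment[i] = j
            remaining[j] -= 1
    return assignment

cost = np.array([[0.1, 0.9],
                 [0.2, 0.3],
                 [0.8, 0.05]])
a = greedy_capacity_assignment(cost, capacity=np.array([2, 2]))
```

A greedy pass is not optimal in general; an exact solution would use a min-cost matching, but the capacity bookkeeping is the same.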
Run the analysis and reproduce figures/tables from results/results.ipynb (run the notebook from the results/ directory). Outputs include results/figs/, results/latex_tables/, and results/deferral_results/.
The results/results.ipynb notebook loads the data and outputs from the pipeline (deferral_results, expert predictions, classifier and density models) to reproduce the paper’s figures and tables. Sections include: feature extraction and density estimation (t-SNE, density scores, effect on classifier probabilities); uncertainty-aware modelling (classifier h and expert correctness models, with ROC and calibration curves with and without density-softmax); conformal prediction for HAIC (ECE on non-null and null prediction sets for the classifier and experts); and evaluation against baselines (misclassification cost across strategies and settings, LaTeX tables, and summary plots). The notebook writes figures to results/figs/ and can export LaTeX to results/latex_tables/.
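The ECE metric used in the conformal prediction section can be sketched with the standard equal-width-bin definition (illustrative; not necessarily the notebook's exact implementation):

```python
import numpy as np

def expected_calibration_error(probs, labels, n_bins=10):
    """Expected Calibration Error for a binary classifier: the weighted
    average over confidence bins of |accuracy - mean confidence|,
    where probs are P(y = 1)."""
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels)
    predictions = (probs >= 0.5).astype(int)
    confidence = np.where(probs >= 0.5, probs, 1 - probs)  # confidence in predicted class
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidence > lo) & (confidence <= hi)
        if mask.any():
            acc = (predictions[mask] == labels[mask]).mean()
            conf = confidence[mask].mean()
            ece += mask.mean() * abs(acc - conf)  # bin weight * calibration gap
    return ece
```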
If you use this code in your research, please cite our paper and the FiFAR/OpenL2D dataset:
@article{pearson2026uncertaintyaware,
  title={Uncertainty-Aware Systems for Human-{AI} Collaboration},
  author={Vasco Pearson and Jean V. Alves and Jacopo Bono and Mario A. T. Figueiredo and Pedro Bizarro},
  journal={Transactions on Machine Learning Research},
  issn={2835-8856},
  year={2026},
  url={https://openreview.net/forum?id=PiRYCyNBqQ}
}

