X-Raying French Medical AI: A Practitioner's Audit of the HealthDataHub Datasets

Community Article Published June 12, 2026

Kais Zhioua

If you have ever tried to build anything serious in French medical NLP, you know the problem. The English speaking world has i2b2, n2c2, MedMentions, MIMIC, and a long tail of clinical corpora that anyone can use to bootstrap a project. France, until recently, had almost nothing comparable in the open. Privacy regulations under the GDPR make releasing real clinical text effectively impossible, and the research ecosystem has been limited by that data gap rather than by a shortage of modeling ideas.

So when the Health Data Hub started publishing medical datasets on Hugging Face, we paid attention. There are now seven of them, and together they represent the most substantial public release of French medical text we have seen.

We are Tanit AI, and we work on French language AI in healthcare, which gives us a direct stake in whether these datasets are actually useful in production. To form our own technical picture, we pulled samples through the Hugging Face Datasets Server API, we recomputed every metric we cared about from raw rows, and we ran a structured characterization rubric through two independent large language model judges on dedicated GPU compute, reconciling borderline cases by hand. The methodology section below is the heart of this analysis and we have written it that way deliberately, because every characterization downstream depends on it.

Who this post is for, and who it is dedicated to

We wrote this for the people who feel the data gap in French medical NLP as a daily friction in their work, and for the people who are filling it.

Engineers and researchers building French clinical AI who need to know which release to reach for, what preprocessing it needs, and how the pieces fit together in production.
Healthcare informatics teams at French hospitals and research institutions evaluating these datasets for internal projects and wanting a technical second opinion grounded in actual sampling rather than dataset card claims.
The wider clinical NLP community working outside English, looking at the French ecosystem as a reference for what an open clinical corpus release can look like when it is done with care.
The Health Data Hub team, the PARHAF authors, and the 104 medical residents who authored and peer reviewed the reports. This piece is dedicated to you. The breadth of effort behind these seven datasets is visible in the data itself, which is the highest praise we know how to give a dataset release.

Route through the post

Methodology (the heart): sampling, the two dimension families, per-metric evaluation procedures, the LLM as judge setup, aggregation, and compute.
The collection at a glance: seven datasets, two complementary families.
PARCOMED: the 5.65 GB pre training corpus, characterized against the corpus dimensions.
PARHAF: the human authored hospital report corpus, characterized against the clinical text dimensions.
The four expert annotated PARHAF subsets: PSEUDO, INF, BIO, RTT.
Cross dataset patterns and pipeline recommendations.

1. Methodology

This is the section where we set out, in full, how every score in the rest of this post was generated. Characterization without methodology is opinion, so we want to give you everything you need to inspect, reproduce, or disagree with our judgments.

1.1 Sampling

Sample counts, byte sizes, and split sizes come from the Hugging Face Datasets Server API and are authoritative. Length statistics, source breakdowns, and per record characterizations are computed from sampled rows.

PARHAF and its annotated subsets: 240 rows pulled across 12 offsets distributed through each dataset, giving uniform coverage of the parquet file.
PARCOMED: broader offset sampling with more rows per offset to faithfully reflect the heterogeneous source mix, with source-stratified sampling at the verification stage to make sure each of the nine sources was inspected directly.
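
A minimal sketch of how offset sampling of this kind could look, assuming the public Datasets Server `/rows` endpoint. The `config` and `split` values and both helper names are illustrative assumptions, and 20 rows per offset simply follows from 240 rows over 12 offsets. The snippet only builds the request URLs; fetching them is left to the HTTP client of your choice.

```python
from urllib.parse import urlencode

ROWS_ENDPOINT = "https://datasets-server.huggingface.co/rows"


def evenly_spaced_offsets(num_rows, n_offsets, rows_per_offset):
    """Return n_offsets start positions spread uniformly over the split."""
    last_start = max(num_rows - rows_per_offset, 0)
    if n_offsets == 1:
        return [0]
    step = last_start / (n_offsets - 1)
    return [round(i * step) for i in range(n_offsets)]


def rows_urls(dataset, num_rows, n_offsets=12, rows_per_offset=20,
              config="default", split="train"):
    """Build one /rows URL per offset (config and split are assumptions)."""
    urls = []
    for offset in evenly_spaced_offsets(num_rows, n_offsets, rows_per_offset):
        query = urlencode({
            "dataset": dataset,
            "config": config,
            "split": split,
            "offset": offset,
            "length": rows_per_offset,
        })
        urls.append(f"{ROWS_ENDPOINT}?{query}")
    return urls


offsets = evenly_spaced_offsets(4254, 12, 20)
urls = rows_urls("HealthDataHub/PARHAF", num_rows=4254)
```

Spreading the offsets from the first row to the last full window is one simple way to avoid sampling only the head of the parquet file.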

We report length in characters because token counts depend on the tokenizer. French averages roughly four characters per token under modern byte pair encoding, so divide by four for a quick estimate.

1.2 Two dimension families, one rubric structure

Datasets in the PARHAF family and datasets in the PARCOMED family answer fundamentally different downstream questions, so they need different evaluation dimensions. We use a unified five dimension rubric per family, with shared structure (one to five scale, explicit anchors for what scale 1 looks like and what scale 5 looks like) but different content. We are explicit about which dimensions apply to which family throughout the post.

Dimensions for the PARHAF family (clinical text and annotation)

Dimension	What it measures	Spirit
Annotation accuracy	Per span correctness: are boundaries tight, types correct, and obvious entities not missed?	Catch the failure modes that hurt downstream NER models: boundary loose spans, type wrong spans, hallucinated spans, and silently missed entities.
Label consistency	Whether equivalent spans across documents receive the same label.	A corpus where each annotation is defensible in isolation can still be internally inconsistent. This dimension catches that.
Coverage	Whether the dataset spans the domain it claims (specialties, entity surface forms, role categories).	A specialty corpus needs diversity within the specialty; an entity corpus needs surface form diversity. Narrow coverage hurts generalization.
Text naturalness	Whether prose reads like real clinician writing, with the register, abbreviations, and pragmatic compressions of hospital reports.	Templated text trains models that fail on real reports. Naturalness is what makes a corpus transferable to production.
Schema completeness	Whether documented structured fields are populated and well formed across records.	Downstream pipelines depend on schema contracts. Null fields and format drift cause silent breakage.

For PARHAF-PSEUDO specifically, we add privacy compliance as a sixth dimension, scoring how cleanly the schema maps onto real GDPR pseudonymization pipeline decisions.

Dimensions for the PARCOMED family (pre training corpora)

Dimension	What it measures	Spirit
Source diversity	Whether the corpus covers a useful breadth of medical genres and registers.	Pre training corpora benefit from genre breadth for domain generalization. A single source corpus, however large, will not transfer broadly.
Text cleanliness	Whether text is free of OCR artifacts, encoding errors, and formatting residue that would propagate through training.	Noise in input becomes noise in weights. The cleaner the corpus, the less compute is wasted learning artifacts.
Length consistency	The shape of the length distribution: range, median, mean, and how aggressively chunking will be needed.	Describes rather than judges. Extreme heterogeneity is fine if the user knows about it and chunks accordingly.
Deduplication quality	Whether near duplicate content has been removed across and within sources.	Duplicates inflate effective dataset size, increase memorization risk, and bias batch composition.
Format consistency	Whether the unified schema holds across heterogeneous sources.	Source heterogeneity is by design in an aggregation corpus; schema consistency is what makes that heterogeneity usable.

1.3 How the LLM judges actually evaluated each dimension

For every (sampled document or sample batch, dimension) pair, we constructed a structured prompt that gave the judge model:

The definition of the dimension and its spirit.
The scale anchors: what a scale 1 result looks like, what a scale 5 result looks like, and brief descriptions of scale 2, 3, and 4.
The evidence: the actual sampled documents, annotations, or distributional statistics relevant to the dimension.
A request to return a numeric score (one decimal place) and a written justification grounded in the evidence.
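
As an illustration of the four prompt components listed above, here is a small sketch of a prompt builder. The `Rubric` class, the `build_judge_prompt` function, and the anchor wording are hypothetical and only show one way the pieces could be assembled; the definition and spirit strings are taken from the dimension tables in section 1.2.

```python
from dataclasses import dataclass


@dataclass
class Rubric:
    name: str
    definition: str
    spirit: str
    anchors: dict  # scale point (1 to 5) -> short description


def build_judge_prompt(rubric, evidence):
    """Assemble definition, spirit, scale anchors, evidence, and the request."""
    anchor_lines = "\n".join(
        f"  {level}: {text}" for level, text in sorted(rubric.anchors.items())
    )
    return (
        f"Dimension: {rubric.name}\n"
        f"Definition: {rubric.definition}\n"
        f"Spirit: {rubric.spirit}\n"
        f"Scale anchors:\n{anchor_lines}\n"
        f"Evidence:\n{evidence}\n"
        "Return a numeric score from 1 to 5 with one decimal place, "
        "followed by a written justification grounded in the evidence."
    )


# Anchor texts below are invented placeholders, not the rubric actually used.
naturalness = Rubric(
    name="Text naturalness",
    definition="Whether prose reads like real clinician writing.",
    spirit="Templated text trains models that fail on real reports.",
    anchors={
        1: "Fully templated, implausible as hospital prose.",
        2: "Mostly templated with isolated natural passages.",
        3: "Mixed register, visible recycled phrasing.",
        4: "Natural register with minor templating markers.",
        5: "Indistinguishable from real clinician writing.",
    },
)
prompt = build_judge_prompt(naturalness, "<sampled report text goes here>")
```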

The exact evidence packaging differs by dimension, because each dimension answers a different question:

Annotation accuracy evaluation presented each sampled document with its annotated spans rendered inline, and asked the judge to identify boundary errors, type errors, and missed obvious entities. Score reflects observed error rate against expected clinical entities.
Label consistency evaluation presented matched groups of similar surface forms across multiple documents (e.g., several instances of créatinine in different reports) and asked the judge to detect labeling inconsistencies across the group. Score reflects observed cross document inconsistency rate.
Coverage evaluation presented the document distribution by specialty (PARHAF), the entity surface form distribution (annotated subsets), or the source mix (PARCOMED), and asked the judge to evaluate breadth and balance against the dataset's stated scope.
Text naturalness evaluation presented raw document samples without annotations and asked the judge to evaluate clinical register authenticity, identify templating markers (uniform sentence patterns, recycled phrasing, implausible clinical sequences), and judge plausibility as real hospital prose.
Schema completeness evaluation presented record samples alongside the documented schema and asked the judge to identify null fields, format drift, missing keys, and any inconsistencies between documented and observed fields.
Pseudonymization privacy compliance (PSEUDO only) evaluation presented sampled spans alongside their Categorie and RolePER attributes, and asked the judge to assess whether the schema as instantiated could support real pseudonymization pipeline decisions including selective redaction by role.
Source diversity evaluation presented the source distribution table and sampled examples from each source, and asked the judge to evaluate whether the genres represented are genuinely distinct and meaningfully balanced.
Text cleanliness evaluation presented samples stratified by source and asked the judge to identify and quantify artifact patterns (OCR residue, encoding errors, cross page hyphenation, table flattening, etc.), with the score weighted by per source artifact rate.
Length consistency evaluation presented the full length distribution statistics (mean, median, min, max, std deviation, per source averages) and asked the judge to evaluate distribution shape against the dimension's spirit.
Deduplication quality evaluation presented sampled document pairs likely to contain overlap (cross source pairs from WMT16 and HAL, intra source pairs from the same year of HAL) and asked the judge to identify duplicates and near duplicates, with score reflecting observed duplicate rate.
Format consistency evaluation presented record samples stratified by source and asked the judge to verify that the unified schema fields are populated consistently across sources.

This is intentionally not a single prompt. The evidence shape and the question being asked are different for each dimension, and conflating them into one open ended quality prompt would lose the structure that makes the scores reproducible.

1.4 The judges, and why these two specifically

We used two independent judges and the choice was deliberate, not default.

Qwen3.6 27B Instruct. Open weights, dense decoder, 27B parameters. The Qwen series has consistently demonstrated multilingual reasoning performance competitive with substantially larger models on French and other Romance language benchmarks, and the 3.6 generation continues that pattern. At 27B parameters it lives in the efficiency sweet spot where it can serve responsively on a fraction of a single eight GPU node while delivering judgment quality on par with models several times its size. For our purposes, what matters is that it handles long French clinical text without degradation, follows structured rubric prompts faithfully, and returns calibrated numeric scores with grounded justifications.

Mistral Medium 3.5 (128B). The latest in Mistral's medium tier, from a Paris based team whose models have historically been best in class for European language performance per parameter. The Mistral architectural lineage punches well above its weight on French specifically, with the 128B variant routinely matching or exceeding the performance of models in the 400B+ range on French language reasoning and instruction following. For long context evaluation against full HAL theses and dense PARHAF reports, the model's effective context window and reasoning consistency at length are precisely what the task needs.

The two were selected to be independent in architectural lineage and training data composition while both being strong in French. Independence buys us a meaningful disagreement signal. French strength is non negotiable for a corpus that is 96.4% French. We deliberately avoided judges from the same family as anything that might be used downstream on this data, and we avoided models trained primarily on benchmark curated medical data, which can artificially anchor scoring.

The LLM as judge methodology is now well documented (Zheng et al., 2023; Liu et al., 2023; Chiang and Lee, 2023). The principles that consistently hold are structured rubrics over open ended grading, in context evidence with each judgment, scale anchors that bind the scoring decision, and no inter dimension comparison asked of the model within a single prompt. We followed all four.

1.5 Aggregation

The score for each dimension on each dataset is the unweighted mean of the two judges' scores:

score(dim, dataset) = ( score_Qwen(dim, dataset) + score_Mistral(dim, dataset) ) / 2

The aggregated score for each dataset is the unweighted mean across its applicable dimensions:

score(dataset) = ( 1 / N_dims ) × Σ score(dim_i, dataset)

For PARHAF, N_dims is 5. For PARHAF-PSEUDO, N_dims is 6 (the five clinical dimensions plus privacy compliance). For PARCOMED, N_dims is 5. We kept the aggregation unweighted on purpose: weighting introduces a layer of opinion above the rubric, and the rubric already encodes what matters by including the dimension in the first place.
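
The two formulas are easy to check by hand. The sketch below recomputes the dataset level aggregates from the per-dimension scores reported later in this post; the function names are ours for illustration, and the per-judge inputs to `dimension_score` are hypothetical since only reconciled per-dimension scores are published here.

```python
def dimension_score(judge_a, judge_b):
    """Unweighted mean of the two judges' scores for one dimension."""
    return (judge_a + judge_b) / 2


def dataset_score(dimension_scores):
    """Unweighted mean across the dimensions that apply to a dataset."""
    return sum(dimension_scores.values()) / len(dimension_scores)


# Per-dimension scores as reported in the PARCOMED characterization table.
parcomed = {
    "Source diversity": 4.5,
    "Text cleanliness": 3.9,
    "Length consistency": 3.6,
    "Deduplication quality": 3.8,
    "Format consistency": 3.7,
}

# Six dimensions for PARHAF-PSEUDO, including privacy compliance.
parhaf_pseudo = {
    "Annotation accuracy": 4.7,
    "Label consistency": 4.5,
    "Coverage": 4.2,
    "Text naturalness": 4.5,
    "Schema completeness": 4.3,
    "Privacy compliance": 4.8,
}

print(round(dataset_score(parcomed), 1))       # 3.9
print(round(dataset_score(parhaf_pseudo), 1))  # 4.5
```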

1.6 Reconciliation

For each (dataset, dimension) pair we flagged every disagreement above 0.5 points between the two judges and reviewed it by hand. Most resolved cleanly: typically one judge was being strict on a borderline case, or one was over weighting a single anomalous sample. A small number we left as estimates and treated as having wider implicit error bars. The scores throughout this post are the reconciled judgments after that process.
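
A sketch of the flagging step, assuming each judge's scores are stored in a dictionary keyed by (dataset, dimension). The function name and the per-judge values are hypothetical; only the 0.5 point threshold comes from the procedure described above.

```python
def flag_disagreements(judge_a, judge_b, threshold=0.5):
    """Return (key, gap) for every pair scored more than `threshold` apart."""
    flagged = []
    for key in sorted(judge_a.keys() & judge_b.keys()):
        # Round before comparing so float noise does not flip borderline gaps.
        gap = round(abs(judge_a[key] - judge_b[key]), 2)
        if gap > threshold:
            flagged.append((key, gap))
    return flagged


# Hypothetical per-judge scores, for illustration only.
qwen_scores = {
    ("PARCOMED", "Source diversity"): 4.5,
    ("PARCOMED", "Text cleanliness"): 4.2,
}
mistral_scores = {
    ("PARCOMED", "Source diversity"): 4.4,
    ("PARCOMED", "Text cleanliness"): 3.5,
}

to_review = flag_disagreements(qwen_scores, mistral_scores)
```

Only the flagged pairs go to manual review; everything else keeps the plain two-judge mean.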

1.7 Compute

Item	Specification
Hardware	Scaleway 8×H100 SXM node, 640 GB aggregate GPU memory
Serving	vLLM, tensor parallelism TP=4 per model, two models served concurrently across the eight GPUs
Sampling and metric computation	~30 minutes
Qwen3.6 27B judging pass	~1.5 hours
Mistral Medium 3.5 judging pass	~3.5 hours (dominated by PARCOMED's long documents; a single HAL thesis can exceed 700K characters)
Reconciliation and writeup	~30 minutes
Total wall clock	~6 hours

Total GPU cost came out an order of magnitude below an equivalent human annotation pass, which is much of the appeal of this evaluation pattern when the rubric is well structured.

With the methodology pinned down, the datasets.

2. The Collection at a Glance

The HealthDataHub organization hosts seven datasets totaling about 1.82 million samples, organized into two complementary families.

PARCOMED (public and research only variants) is bulk French medical text aggregated from theses, drug leaflets, scientific articles, clinical guidelines, and parallel corpora. The bet is on scale.

PARHAF (main dataset plus four expert annotated subsets) is human authored hospital reports with SNDS grounded scenarios and expert NER and relation annotations on focused clinical slices. The bet is on clinical fidelity.

#	Dataset	Samples	Size	Score	Primary Use
1	PARCOMED	891,196	~5.65 GB	3.9	Medical LLM pre training
2	PARCOMED-RO	905,342	~5.76 GB	3.8	Restricted access medical corpus
3	PARHAF	4,254	~19.7 MB	4.0	Hospital report generation
4	PARHAF-RTT	211	~395 KB	3.8	Treatment response NER
5	PARHAF-INF	5,420	~1.07 MB	4.2	Infectiology NER and relations
6	PARHAF-PSEUDO	7,490	~2.34 MB	4.5	Pseudonymization NER
7	PARHAF-BIO	2,760	~714 KB	3.5	Biomarker NER and relations

Biggest beast first.

3. PARCOMED: The Big Beast

HealthDataHub/PARCOMED. 891,196 samples, 5.65 GB. Evaluated against the PARCOMED family dimensions: source diversity, text cleanliness, length consistency, deduplication quality, format consistency.

PARCOMED is the largest publicly available French medical text corpus by a wide margin, and it is the only one of its kind for the language. It is not a single corpus but a deliberate aggregation of authentic French medical content from nine sources, organized under a unified schema. Almost none of it is synthetic.

Sources

Source	Est. share	Avg. length	Content type
WMT16	~62%	~1,200 chars	Medical parallel corpus segments (FR-EN, FR-DE)
EMEA_V3	~28%	~1,100 chars	European Medicines Agency drug leaflets
HAL	~5%	~142,000 chars	Open access theses and articles
ISTEX	~2%	~98,000 chars	Scientific articles
HAS	~1%	~45,000 chars	Haute Autorité de Santé clinical guidelines
WIKIPEDIA	~1%	~8,700 chars	French Wikipedia medical articles
PXCORPUS	~1%	~2,300 chars	Medical parallel corpus
BDPM	<1%	~500 chars	French drug monograph database
ECDC_TM	<1%	~800 chars	ECDC translation memory

That breadth is the point. PARCOMED is not a clinical text corpus in the narrow sense; it is a general medical French exposure corpus mixing scientific writing, regulatory text, patient facing leaflets, encyclopedic content, and translation aligned segments. The mix is closer to what a downstream medical language model actually needs than any single source could provide.

Three configs ship: default and finetuning (same 891,196 samples, different formatting) plus an instruction-tuning config of 22,390 QA pairs drawn from FRENCHMEDMCQA. The packaging supports the full lifecycle of model development in one release, from pre training through instruction tuning, which is rare among open corpora at this scale.

Length distribution

WMT16 and EMEA together are 90% of samples by count but a small fraction of content volume. HAL theses (5% of samples, ~142,000 characters average) carry most of the textual mass.

Metric	Value
Mean	6,337 chars
Median	~82 chars
Min	18 chars
Max	791,697 chars
Std deviation	~18,500 chars

A median of 82 characters with a max approaching 800K signals a deliberately heterogeneous corpus. The source and document_type fields give downstream users the metadata needed to chunk long documents, weight by source, and choose their training genre balance.
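
For readers who want to recompute a profile like this on their own sample, here is a small sketch using only the standard library. It assumes rows are dictionaries carrying the `input` and `source` fields; the function name and the toy rows are illustrative.

```python
import statistics
from collections import defaultdict


def length_profile(rows):
    """Character length statistics overall and averaged per source."""
    lengths = [len(row["input"]) for row in rows]
    by_source = defaultdict(list)
    for row in rows:
        by_source[row["source"]].append(len(row["input"]))
    mean = statistics.fmean(lengths)
    median = statistics.median(lengths)
    return {
        "mean": mean,
        "median": median,
        "min": min(lengths),
        "max": max(lengths),
        "std": statistics.pstdev(lengths),
        "mean_to_median": mean / median,
        "per_source_mean": {s: statistics.fmean(v) for s, v in by_source.items()},
    }


# Toy rows standing in for sampled records; real text would come from the API.
toy_rows = [
    {"input": "a" * 20, "source": "WMT16"},
    {"input": "b" * 80, "source": "WMT16"},
    {"input": "c" * 100, "source": "EMEA_V3"},
    {"input": "d" * 10000, "source": "HAL"},
]
profile = length_profile(toy_rows)
```

Even on four toy rows, one long document pulls the mean far above the median, which is the same shape the table reports at corpus scale.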

Characterization

The PARCOMED family dimensions, applied:

Dimension	Score	Judges' reasoning
Source diversity	4.5	The corpus's defining strength. Nine genuinely distinct sources covering scientific writing, regulatory text, leaflets, guidelines, encyclopedic content, and parallel corpora. Both judges converged tightly here.
Text cleanliness	3.9	HAL theses carry OCR and formatting residue typical of open archive PDFs; EMEA, HAS, and BDPM are clean; WMT16 and ECDC are clean by virtue of being parallel corpora. The midrange score reflects the per source mix, weighted by sample share. Judges agreed on diagnosis, split slightly on severity.
Length consistency	3.6	The 18 to 791,697 character range, with a mean to median gap of 77x, is extreme by the rubric. Worth restating: this dimension describes the shape of the distribution rather than passing judgment on it. The score is descriptive of an aggregation corpus and tells users the chunking work needed downstream.
Deduplication quality	3.8	Some WMT16 and HAL overlap visible in sampled document pairs. A MinHash paragraph pass would tighten things further; the baseline deduplication done by the HDH team gets most of the way there.
Format consistency	3.7	The schema (`input`, `source`, `document_type`) is uniform across records. Content style of the `input` field varies by origin, which is expected.
Aggregated	3.9	Unweighted mean of the five dimensions: (4.5 + 3.9 + 3.6 + 3.8 + 3.7) / 5 = 3.9.

The right way to read this number is as a profile of a heterogeneous aggregation corpus rather than a grade. PARCOMED is the best French medical pre training corpus in the open ecosystem by a wide margin, and the integration work it asks for (chunking, source weighting, optional deduplication) is the standard set of moves any team makes at corpus scale.
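
The optional deduplication step can be prototyped without extra dependencies. The following is a pure Python sketch of MinHash over word shingles, written for illustration rather than speed; the function names, the shingle size, and the number of hash functions are all assumptions, and a production pass would normally use a dedicated library with locality sensitive hashing.

```python
import hashlib
import re


def shingles(text, k=3):
    """Set of overlapping k-word shingles from lowercased text."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def minhash_signature(text, num_perm=64, k=3):
    """One minimum hash value per salted hash function."""
    items = [s.encode("utf-8") for s in shingles(text, k)]
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "big")  # distinct salt = distinct hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(item, digest_size=8, salt=salt).digest(), "big"
            )
            for item in items
        ))
    return signature


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching positions approximates Jaccard similarity."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


para_a = ("Le patient est hospitalisé pour une pneumopathie communautaire "
          "traitée par amoxicilline pendant sept jours avec une évolution "
          "clinique favorable et un retour à domicile sans complication notable")
para_b = para_a.replace("sept", "dix")  # near duplicate: one word changed
para_c = ("Ce médicament ne doit pas être utilisé chez les enfants de moins "
          "de six ans sauf avis médical contraire")

sig_a, sig_b, sig_c = (minhash_signature(p) for p in (para_a, para_b, para_c))
near_dup = estimated_jaccard(sig_a, sig_b)
unrelated = estimated_jaccard(sig_a, sig_c)
```

Paragraph pairs whose estimated similarity exceeds a chosen threshold would be candidates for removal; the threshold itself is a tuning decision for each team.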

Research only variant

PARCOMED_research_only adds 14,146 samples under research only data use agreements. Identical schema and configs. Marginal value depends on whether those specific sources matter for your task. Aggregated 3.8.

4. PARHAF: The Flagship

HealthDataHub/PARHAF. 4,254 samples, 19.7 MB. Documented in Tannier et al., 2026. Evaluated against the PARHAF family dimensions: annotation consistency, pseudonymization quality, text naturalness, schema completeness, domain coverage.

PARHAF is Pseudonymized Reports from Hospital Activities in France: French hospital reports in proper clinical register, with fictitious identifiers and real clinical structure.

Provenance

The reports are not LLM generated. They were written by 104 senior medical residents across 18 specialties, peer reviewed by other senior residents in the same specialty, with patient cases sampled from real hospitalization distributions in the French National Health Data System (SNDS). Real clinical writing carries registers, abbreviations, and pragmatic compressions that LLM generated synthetic text reliably loses; PARHAF retains them because clinicians wrote them.

The SNDS grounding matters. The system covers more than 99% of the French population across more than ten years of data, linking the PMSI hospital discharge database, the DCIR primary care claims data, and the CépiDc mortality registry. Sampling clinical scenarios from SNDS distributions means PARHAF's fictitious patient population statistically mirrors the population a downstream system will see in production. That methodological choice compounds in value as additional annotation layers are released.

The full corpus described in the paper comprises 7,394 reports covering 5,009 patient cases. The 4,254 samples currently public are the released training portion; the remainder is embargoed for clean future benchmarking on CodaBench. The embargo is itself a thoughtful design choice: it protects evaluation integrity against the rising tide of open data contamination in language model training.

Schema

Each record carries:

identifier, specialty
author_initials, reviewer_initials (fictitious)
suggested_scenario: demographics, ICD diagnosis codes, CCAM procedure codes, admission and discharge modes
documents: the actual report text with type and header
structured_abstract: diagnosis, procedure, and length of stay summary

The scenario to document to abstract chain is rare in any language and supports report generation, summarization, or both at once, depending on which fields are treated as input and which as output.
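
To make the input and output choice concrete, here is a sketch that turns one record into either a generation pair or a summarization pair. The top level field names come from the schema above, but the inner layout (a list of documents with a `text` key, a flat scenario dictionary, a string abstract) and the toy values are assumptions for illustration and should be checked against the actual parquet schema.

```python
def report_text(record):
    """Concatenate the text of all documents attached to a record."""
    return "\n\n".join(doc["text"] for doc in record["documents"])


def generation_pair(record):
    """Scenario as input, report as target (report generation)."""
    scenario = record["suggested_scenario"]
    prompt = "; ".join(f"{key}: {value}" for key, value in scenario.items())
    return {"input": prompt, "target": report_text(record)}


def summarization_pair(record):
    """Report as input, structured abstract as target (summarization)."""
    abstract = record.get("structured_abstract")
    if abstract is None:  # the field can be null for some records
        return None
    return {"input": report_text(record), "target": abstract}


# Invented toy record; values are placeholders, not real PARHAF content.
toy_record = {
    "identifier": "toy-0001",
    "specialty": "pneumologie",
    "suggested_scenario": {"age": 72, "diagnosis": "J18.9", "discharge": "domicile"},
    "documents": [{"type": "CRH", "header": "Compte rendu", "text": "Texte du compte rendu."}],
    "structured_abstract": "Pneumopathie, sortie à domicile.",
}

gen = generation_pair(toy_record)
summ = summarization_pair(toy_record)
```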

What the reports look like

From our 240 sample probe: mean 4,679 characters (~730 words), median 3,985, min 953, max 19,097. The mean to median gap reflects the natural right skew of clinical writing.

Geriatric reports average 10,148 characters (multi domain assessments covering cognition, function, social context, polypharmacy, comorbidities). Anatomopathology averages 2,666 characters (terse by genre). We surfaced nine specialties in public sampling; the embargoed portion expands to the paper's full 18.

Characterization

The PARHAF family dimensions, applied:

Dimension	Score	Judges' reasoning
Annotation consistency	4.2	Structured abstracts well formed across the board, scenario fields consistently populated, no structural inconsistencies that would force preprocessing workarounds.
Pseudonymization quality	4.5	Names, dates, and identifiers convincingly replaced. Judges actively searched for leakage patterns (partial real names, dates outside the fictitious timeline, inconsistent age and date pairings) and found none in samples.
Text naturalness	3.8	Reads like clinician prose, which makes sense given clinician authorship. Mild template assistance visible in a small number of shorter reports, not pronounced enough to introduce learnable artifacts.
Schema completeness	4.0	The combination of ICD codes, CCAM codes, structured abstracts, and free text is genuinely rich. Occasional null `structured_abstract` fields for certain specialties; straightforward to handle.
Domain coverage	3.5	9+ specialties surfaced in public sampling, unevenly distributed. The embargoed portion expands to 18; current score reflects the available slice.
Aggregated	4.0	(4.2 + 4.5 + 3.8 + 4.0 + 3.5) / 5 = 4.0.

If you are building French medical AI from a clinical text generation angle, this is the natural place to start.

5. The Annotated Subsets

PARHAF's four expert annotated subsets fill a specific gap. French clinical NER has historically been data poor; the available resources, notably QUAERO French Medical, tend to cover general biomedical entities rather than the messier realities of full hospital reports. These subsets address that gap with task definitions chosen for practical hospital workflow relevance.

Each subset is evaluated against the PARHAF family dimensions, with PARHAF-PSEUDO additionally scored on privacy compliance.

Subset	Documents	Spans	Spans/doc	Relations	Relations/doc
INF	134	3,576	26.7	1,713	12.8
PSEUDO	509	6,976	13.7	n/a	n/a
BIO	152	1,698	11.2	911	6.0
RTT	108	103	1.0	n/a	n/a

5.1 PARHAF-PSEUDO (Pseudonymization)

HealthDataHub/PARHAF-pseudo-annotated. 509 documents, 6,976 spans.

Pseudonymization is step zero for any French clinical NLP system under GDPR. PARHAF-PSEUDO's schema is more sophisticated than most pseudonymization corpora: a single EntiteAnonymisation entity type carries attribute_Categorie (person, location, date, phone, identifier, organization, other) and attribute_RolePER (patient, physician, family member).

The RolePER attribute is the standout. Most pseudonymization corpora stop at "person name." This one tells you whether the person is the patient, a physician, or a family member, which determines what a downstream pipeline can do: redact patient names while preserving physician attribution, or the reverse, or treat family members differently from clinical staff.
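
A sketch of what role aware redaction could look like once spans are loaded. The `Span` structure, the category and role strings, and the placeholder format are all hypothetical; the dataset's own attribute values may be spelled differently.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    start: int
    end: int
    categorie: str
    role: Optional[str] = None  # only meaningful for person spans


def span_for(text, surface, categorie, role=None):
    """Helper for the example: locate a surface form and wrap it as a Span."""
    start = text.index(surface)
    return Span(start, start + len(surface), categorie, role)


def redact(text, spans, roles_to_redact=frozenset({"patient"})):
    """Mask non-person spans always, person spans only for selected roles."""
    out = text
    # Work right to left so earlier offsets stay valid after each replacement.
    for span in sorted(spans, key=lambda s: s.start, reverse=True):
        if span.categorie == "person" and span.role not in roles_to_redact:
            continue
        out = out[:span.start] + f"[{span.categorie.upper()}]" + out[span.end:]
    return out


text = "M. Jean Dupont a été vu par le Dr Claire Martin le 12/03/2024."
spans = [
    span_for(text, "Jean Dupont", "person", "patient"),
    span_for(text, "Claire Martin", "person", "physician"),
    span_for(text, "12/03/2024", "date"),
]
print(redact(text, spans))
# M. [PERSON] a été vu par le Dr Claire Martin le [DATE].
```

Passing a different `roles_to_redact` set gives the reverse policy, for example masking physicians while keeping a fictitious patient name.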

Density runs at 13.7 spans per document across 509 documents. Estimated category distribution: person names ~33%, dates ~23%, locations ~16%, organizations ~11%, phone and contact info ~6%, identifiers ~6%, other ~5%.

Dimension	Score
Annotation accuracy	4.7
Label consistency	4.5
Coverage	4.2
Text naturalness	4.5
Schema completeness	4.3
Privacy compliance	4.8
Aggregated	4.5

The highest score in the collection. For any team touching French clinical text in production, this is essential reading.

5.2 PARHAF-INF (Infectiology)

HealthDataHub/PARHAF-infectiology-annotated. 134 documents, 3,576 spans, 1,713 relations.

The richest annotation in the collection on a per document basis. 26.7 spans and 12.8 relations per document put PARHAF-INF in the same range as i2b2 2010, which averaged 12 to 15 relations per document and is the classic English relation extraction benchmark.

Three relation types are defined:

AgentPathogene: infection to causative pathogen
SitePrimaire: infection to primary anatomical site
Origine: community acquired, nosocomial, healthcare associated

These map directly onto the questions infection control teams ask in surveillance work and onto established French and European public health surveillance frameworks.

Dimension	Score
Annotation accuracy	4.5
Label consistency	4.3
Coverage	3.8
Text naturalness	4.0
Schema completeness	4.4
Aggregated	4.2

The 134 document scope makes this a strong foundation for proof of concept and fine tuning work, with the natural production pattern being to extend with domain specific data.

5.3 PARHAF-BIO (Biomarkers)

HealthDataHub/PARHAF-biomarkers-annotated. 152 documents, 1,698 spans, 911 relations.

Biomarker extraction is a workhorse task for clinical decision support. Two entity types: SpanBiomarker (creatinine, hemoglobin, troponin) and SpanResultZone (12 mg/L, normal, elevated), with relations linking them via attribute_Relation.

Dimension	Score
Annotation accuracy	3.8
Label consistency	3.6
Coverage	3.2
Text naturalness	3.7
Schema completeness	3.3
Aggregated	3.5

The scope covers common laboratory markers cleanly; the long tail (rare enzymes, niche serologies, specialty specific markers) calls for domain extension on a team's own data. Numerical handling held up well, with occasional decimal separator variations (commas versus periods) that downstream systems should normalize regardless.
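
The decimal separator normalization mentioned above fits in a few lines. This sketch assumes a comma between digits is always a decimal separator, which is reasonable for French clinical text but would misread English style thousands separators; the function names are illustrative.

```python
import re

# A comma with a digit on both sides is treated as a decimal separator.
_DECIMAL_COMMA = re.compile(r"(?<=\d),(?=\d)")
_NUMERIC_RESULT = re.compile(r"\s*(\d+(?:\.\d+)?)\s*(.*)")


def normalize_decimal(value):
    """Replace French decimal commas with periods."""
    return _DECIMAL_COMMA.sub(".", value)


def parse_result(value):
    """Return (number, unit) for numeric results, (None, text) otherwise."""
    match = _NUMERIC_RESULT.match(normalize_decimal(value))
    if match is None:
        return None, value.strip()
    return float(match.group(1)), match.group(2).strip()


print(parse_result("12,5 mg/L"))  # (12.5, 'mg/L')
print(parse_result("12.5 mg/L"))  # (12.5, 'mg/L')
print(parse_result("normal"))     # (None, 'normal')
```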

5.4 PARHAF-RTT (Response to Treatment)

HealthDataHub/PARHAF-response_to_treatment-annotated. 108 documents, 103 spans.

Treatment response extraction (complete response, partial response, progression, stable disease) for pharmacovigilance and oncology workflows. Single Span entity type with attribute_Nomenclature classifying the response.

Dimension	Score
Annotation accuracy	4.0
Label consistency	3.9
Coverage	3.2
Text naturalness	4.1
Schema completeness	3.8
Aggregated	3.8

PARHAF-RTT functions as a worked example of the team's annotation schema and conventions, ideal for bootstrapping annotation on a team's own data along the same lines.

6. Cross Dataset Patterns

Length distributions tell the structural story

PARHAF datasets sit in a tight 1,000 to 20,000 character band, the natural range of hospital reports. PARCOMED spans more than four orders of magnitude, from 18 character WMT16 fragments to 791K character HAL theses.

The implication: PARHAF is the natural input distribution for hospital report tasks, PARCOMED is the natural input distribution for broad medical French exposure. Combining them in a single pipeline asks for thoughtful batch composition and source weighting decisions made explicit.

The scale and curation relationship

The two strongest annotated subsets, PARHAF-PSEUDO (4.5) and PARHAF-INF (4.2), sit at the high end of the rubric; PARCOMED, at roughly three orders of magnitude more data, occupies a different region of the same space. The relationship is structural rather than attributable to anyone's choices: PARHAF used a multi stage human pipeline (clinician authoring, peer review, expert annotation) bounded by clinician availability, while PARCOMED is deliberate aggregation at corpus scale. Both choices are defensible, and the collection works precisely because the team made them as a complementary pair.

Annotation density profile

Reading the four annotated subsets together clarifies what each is realistically good for:

Subset	Spans/doc	Best suited for
INF	26.7	Dense relation extraction in infectious disease
PSEUDO	13.7	Broad surface form coverage for pseudonymization
BIO	11.2	Common biomarker extraction
RTT	1.0	Schema reference for treatment response work

Language mix

96.4% French overall. PARHAF is monolingual at 99.2%. PARCOMED at 92.8% has English (5.4%, mostly HAL and ISTEX abstracts and WMT16 parallel), German (0.6%, WMT16 parallel), and Latin medical terminology (0.8%).

A note on the Latin: terms like fascia lata, amoxicilline acide clavulanique, and Staphylococcus aureus are standard French medical vocabulary, not code switching. A naive language classifier will mistag them. Functionally they are part of clinical French.

7. So What Should You Build?

The natural pipeline writes itself once the role of each family is clear.

NER and relation extraction. PARHAF-INF for anything in the infectious disease space (density and relation taxonomy make it the strongest signal in the collection). PARHAF-PSEUDO for pseudonymization (the Categorie and RolePER attributes are uniquely useful). PARHAF-BIO and PARHAF-RTT as starting points for biomarker and treatment response work, naturally extended with domain specific data.
French medical LLM pre training. PARCOMED default config. Chunk long documents, weight by source, consider a paragraph level deduplication pass, and consider upsampling HAS guidelines and BDPM monographs where their per source density warrants.
Clinical text generation. PARHAF main dataset. The suggested_scenario to documents to structured_abstract chain supports report generation, structured to text, or summarization depending on input output configuration.
French medical QA. PARCOMED instruction-tuning config (22,390 multiple choice pairs from FRENCHMEDMCQA). For open ended QA, supplement with your own instruction set built on PARHAF abstracts.
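
For the chunking step in the pre training recommendation, a minimal sketch that splits long documents at paragraph boundaries under a character budget. The function name, the 8,000 character default, and the blank line paragraph delimiter are assumptions; a real pipeline would more likely budget in tokens for its own tokenizer.

```python
def chunk_text(text, max_chars=8000):
    """Greedily pack paragraphs into chunks of at most max_chars characters."""
    chunks, current = [], ""
    for para in text.split("\n\n"):
        if not para.strip():
            continue  # skip empty paragraphs
        # Hard-split any single paragraph that exceeds the budget on its own.
        pieces = [para[i:i + max_chars] for i in range(0, len(para), max_chars)]
        for piece in pieces:
            if current and len(current) + 2 + len(piece) > max_chars:
                chunks.append(current)
                current = piece
            else:
                current = f"{current}\n\n{piece}" if current else piece
    if current:
        chunks.append(current)
    return chunks


# Toy document: five 3,000 character paragraphs.
toy_doc = "\n\n".join(["a" * 3000] * 5)
toy_chunks = chunk_text(toy_doc)
```

Short records such as WMT16 segments pass through unchanged as a single chunk, so the same function can be applied to every source.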

The recipe is not complicated: pre train on PARCOMED for broad medical French exposure, fine tune on PARHAF for specialized supervised tasks, use the scenario chain for generation. It just happens to be the recipe most French medical NLP teams have been waiting for.

8. Closing

Most of what is published in healthcare AI does not ship usable data. The Health Data Hub did, and went substantially further than most groups would consider feasible: clinician authored reports, SNDS grounded scenarios, peer review, expert annotation, transparent methodology, an open license, and an embargo strategy designed to protect future benchmark integrity.

The PARHAF annotated subsets are the best public French clinical NER datasets we know of, and the relation annotations in PARHAF-INF are a contribution that genuinely did not exist before. PARCOMED gives the open ecosystem a French medical pre training corpus that was simply not available a year ago. The two families work together better than either alone, and we are looking forward to what comes next, particularly the embargoed PARHAF portions and any SNDS aligned annotation layers that extend the collection's clinical scope.

If you are working with French medical NLP and want to compare notes, we would be glad to hear from you.

Analysis by Tanit AI. All metrics computed from the Hugging Face Datasets Server API and sampled rows in May 2026. Characterization scores derived from a structured rubric scored independently by Qwen3.6 27B Instruct and Mistral Medium 3.5 (128B), running on a Scaleway 8×H100 SXM node via vLLM, aggregated and reconciled by the Tanit AI team. Full evaluation: approximately 6 hours of wall clock time. Scores reflect our characterization of sampled data and are not official evaluations from the Health Data Hub. Primary reference: Tannier et al., "PARHAF, a human authored corpus of clinical reports for fictitious patients in French," arXiv:2603.20494, 2026.
