rafmacalaba committed on
Commit e29e88f · verified · 1 Parent(s): 96b4e8b

Add model card

Files changed (1):
  1. README.md +68 -49

README.md CHANGED
@@ -7,80 +7,99 @@ tags:
  - two-pass-hybrid
  base_model: fastino/gliner2-large-v1
  library_name: gliner2
  ---
 
- # GLiNER2 Data Mention Extractor (v1-hybrid-entities)
 
  Fine-tuned GLiNER2 LoRA adapter for extracting structured data mentions from
  development economics and humanitarian research documents.
 
- ## Architecture: Two-Pass Hybrid
 
- This adapter uses a **two-pass** inference strategy to bypass the count_pred/count_embed
- mode collapse that limits native `extract_json` to 1 mention per chunk:
 
  - **Pass 1** (`extract_entities`): Finds ALL data mention spans using 3 entity types
    (`named_mention`, `descriptive_mention`, `vague_mention`). Bypasses count_pred entirely.
- - **Pass 2** (`extract_json`): Classifies each span individually using sentence-level context.
-   count=1 is always correct since each call contains exactly 1 mention.
 
- See `finetuning/ARCHITECTURE.md` for the full rationale.
 
- ## Task
 
- Given a document passage, extracts structured information about each dataset mentioned:
 
- - **Entity types** (Pass 1 span detection):
-   - `named_mention`: Proper names and acronyms (DHS, LSMS, FAOSTAT)
-   - `descriptive_mention`: Described data with identifying detail but no formal name
-   - `vague_mention`: Generic data references with minimal identifying detail
- - **Classification fields** (Pass 2 — fixed choices):
-   - `typology_tag`: survey / census / database / administrative / indicator / geospatial / microdata / report / other
-   - `is_used`: True / False
-   - `usage_context`: primary / supporting / background
 
- ## Training
 
- - **Base model**: `fastino/gliner2-large-v1`
- - **Method**: LoRA (r=16, alpha=32.0)
- - **Target modules**: ['encoder', 'span_rep']
- - **Training examples**: 8087
- - **Val examples**: 563
- - **Best val loss**: None
 
  ## Usage
-
  ```python
  from gliner2 import GLiNER2
-
- # Install the patched library first
- # pip install git+https://github.com/rafmacalaba/GLiNER2.git@feat/main-mirror
 
  extractor = GLiNER2.from_pretrained("fastino/gliner2-large-v1")
- extractor.load_adapter("rafmacalaba/gliner2-datause-large-v1-hybrid-entities")
 
- # Pass 1: Extract all mention spans
- entity_schema = {
      "entities": ["named_mention", "descriptive_mention", "vague_mention"],
      "entity_descriptions": {
-         "named_mention": "A proper name or well-known acronym for a data source...",
-         "descriptive_mention": "A described data reference with enough detail...",
-         "vague_mention": "A generic or loosely specified reference to data...",
-     },
- }
- spans = extractor.extract(text, entity_schema, threshold=0.3)
-
- # Pass 2: Classify each span
- json_schema = {
-     "data_mention": {
-         "mention_name": "",
-         "typology_tag": {"choices": ["survey", "census", "administrative", "database",
-                                      "indicator", "geospatial", "microdata", "report", "other"]},
-         "is_used": {"choices": ["True", "False"]},
-         "usage_context": {"choices": ["primary", "supporting", "background"]},
      },
  }
- for span in spans.get("named_mention", []):
-     context = extract_sentence_context(text, span)
-     tags = extractor.extract(context, json_schema)
  ```
 
  - two-pass-hybrid
  base_model: fastino/gliner2-large-v1
  library_name: gliner2
+ license: apache-2.0
  ---
 
+ # GLiNER2 Data Mention Extractor — datause-extraction
 
  Fine-tuned GLiNER2 LoRA adapter for extracting structured data mentions from
  development economics and humanitarian research documents.
 
+ Mirrored from [`rafmacalaba/gliner2-datause-large-v1-hybrid-entities`](https://huggingface.co/rafmacalaba/gliner2-datause-large-v1-hybrid-entities).
 
+ ## Architecture: Two-Pass Hybrid
 
  - **Pass 1** (`extract_entities`): Finds ALL data mention spans using 3 entity types
    (`named_mention`, `descriptive_mention`, `vague_mention`). Bypasses count_pred entirely.
+ - **Pass 2** (`extract_json`): Classifies each span individually (count=1).
 
+ ## Entity Types
 
+ - `named_mention`: Proper names and acronyms (DHS, LSMS, FAOSTAT)
+ - `descriptive_mention`: Described data with identifying detail but no formal name
+ - `vague_mention`: Generic data references with minimal identifying detail
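To make the three entity types concrete, here is an illustrative labeling of one passage (editor's invented example, not taken from the model card or its training data):

```python
# Invented passage containing one mention of each entity type.
text = (
    "We use the 2018 DHS, a panel survey of 5,000 rural households, "
    "and other secondary data."
)

# How the three Pass-1 entity types would apply to this passage:
expected_spans = {
    "named_mention": "2018 DHS",                                        # proper name / acronym
    "descriptive_mention": "a panel survey of 5,000 rural households",  # detail, no formal name
    "vague_mention": "other secondary data",                            # minimal identifying detail
}

# Each labeled span is a verbatim substring of the passage.
for etype, span in expected_spans.items():
    assert span in text, (etype, span)
```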
 
+ ## Classification Fields
 
+ - `typology_tag`: survey / census / administrative / database / indicator / geospatial / microdata / report / other
+ - `is_used`: True / False
+ - `usage_context`: primary / supporting / background
 
+ ## Installation
 
+ ```bash
+ pip install git+https://github.com/rafmacalaba/GLiNER2.git@feat/main-mirror
+ ```
 
  ## Usage
 
  ```python
  from gliner2 import GLiNER2
+ import re
 
  extractor = GLiNER2.from_pretrained("fastino/gliner2-large-v1")
+ extractor.load_adapter("ai4data/datause-extraction")
 
+ ENTITY_SCHEMA = {
      "entities": ["named_mention", "descriptive_mention", "vague_mention"],
      "entity_descriptions": {
+         "named_mention": "A proper name or well-known acronym for a data source (DHS, LSMS, FAOSTAT).",
+         "descriptive_mention": "A described data reference with identifying detail but no formal name.",
+         "vague_mention": "A generic or loosely specified reference to data.",
      },
  }
+
+ def extract_sentence_context(text, char_start, char_end, margin=1):
+     """Return the sentence containing [char_start, char_end), plus `margin` sentences each side."""
+     boundaries = [0] + [m.end() for m in re.finditer(r"[.!?]\s+", text)] + [len(text)]
+     for i in range(len(boundaries) - 1):
+         if boundaries[i] <= char_start < boundaries[i + 1]:
+             s = max(0, i - margin)
+             e = min(len(boundaries) - 1, i + margin + 1)
+             return text[boundaries[s]:boundaries[e]].strip()
+     return text
+
+ json_schema = (
+     extractor.create_schema()
+     .structure("data_mention")
+     .field("mention_name", dtype="str")
+     .field("typology_tag", dtype="str", choices=["survey", "census", "administrative", "database",
+                                                  "indicator", "geospatial", "microdata", "report", "other"])
+     .field("is_used", dtype="str", choices=["True", "False"])
+     .field("usage_context", dtype="str", choices=["primary", "supporting", "background"])
+ )
+
+ text = "The analysis draws on the DHS 2018 and administrative records from the National Statistics Office."
+
+ # Pass 1 — span detection
+ pass1 = extractor.extract(text, ENTITY_SCHEMA, threshold=0.3, include_confidence=True, include_spans=True)
+ entities = pass1.get("entities", {})
+
+ # Pass 2 — classification per span
+ results = []
+ for etype in ["named_mention", "descriptive_mention", "vague_mention"]:
+     for span in entities.get(etype, []):
+         mention_text = span.get("text", span) if isinstance(span, dict) else span
+         char_start = span.get("start", text.find(mention_text)) if isinstance(span, dict) else text.find(mention_text)
+         char_end = span.get("end", char_start + len(mention_text)) if isinstance(span, dict) else char_start + len(mention_text)
+         context = extract_sentence_context(text, char_start, char_end)
+         tags = extractor.extract(context, json_schema)
+         tag = (tags.get("data_mention") or [{}])[0]
+         results.append({
+             "mention_name": mention_text,
+             "specificity": etype.replace("_mention", ""),
+             "typology": tag.get("typology_tag"),
+             "is_used": tag.get("is_used"),
+             "usage_context": tag.get("usage_context"),
+         })
+
+ for r in results:
+     print(r)
  ```
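The `extract_sentence_context` helper in the usage snippet is plain Python, so its windowing behavior can be checked without loading the model. A standalone sketch (the sample text is an editor's invented example):

```python
import re

# Same helper as in the usage snippet: returns the sentence containing the
# character span [char_start, char_end), plus `margin` neighboring sentences.
def extract_sentence_context(text, char_start, char_end, margin=1):
    boundaries = [0] + [m.end() for m in re.finditer(r"[.!?]\s+", text)] + [len(text)]
    for i in range(len(boundaries) - 1):
        if boundaries[i] <= char_start < boundaries[i + 1]:
            s = max(0, i - margin)
            e = min(len(boundaries) - 1, i + margin + 1)
            return text[boundaries[s]:boundaries[e]].strip()
    return text

text = "Funding came from several donors. The study uses the 2018 DHS. Results are robust."
start = text.find("2018 DHS")

# margin=0 keeps only the sentence containing the mention.
assert extract_sentence_context(text, start, start + 8, margin=0) == "The study uses the 2018 DHS."

# margin=1 (the default) widens the window by one sentence on each side,
# which here covers the whole three-sentence passage.
assert extract_sentence_context(text, start, start + 8, margin=1) == text
```

Note that the span falls back to returning the full text when no sentence boundary brackets `char_start`, so Pass 2 always receives non-empty context.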