rafmacalaba committed on
Commit e29e88f · verified · 1 Parent(s): 96b4e8b

Add model card

Files changed (1):
  1. README.md +68 -49

README.md CHANGED
@@ -7,80 +7,99 @@ tags:
  - two-pass-hybrid
  base_model: fastino/gliner2-large-v1
  library_name: gliner2
  ---
 
- # GLiNER2 Data Mention Extractor (v1-hybrid-entities)
 
  Fine-tuned GLiNER2 LoRA adapter for extracting structured data mentions from
  development economics and humanitarian research documents.
 
- ## Architecture: Two-Pass Hybrid
 
- This adapter uses a **two-pass** inference strategy to bypass the count_pred/count_embed
- mode collapse that limits native `extract_json` to 1 mention per chunk:
 
  - **Pass 1** (`extract_entities`): Finds ALL data mention spans using 3 entity types
    (`named_mention`, `descriptive_mention`, `vague_mention`). Bypasses count_pred entirely.
- - **Pass 2** (`extract_json`): Classifies each span individually using sentence-level context.
-   count=1 is always correct since each call contains exactly 1 mention.
 
- See `finetuning/ARCHITECTURE.md` for the full rationale.
 
- ## Task
 
- Given a document passage, extracts structured information about each dataset mentioned:
 
- - **Entity types** (Pass 1 span detection):
-   - `named_mention`: Proper names and acronyms (DHS, LSMS, FAOSTAT)
-   - `descriptive_mention`: Described data with identifying detail but no formal name
-   - `vague_mention`: Generic data references with minimal identifying detail
- - **Classification fields** (Pass 2 — fixed choices):
-   - `typology_tag`: survey / census / database / administrative / indicator / geospatial / microdata / report / other
-   - `is_used`: True / False
-   - `usage_context`: primary / supporting / background
 
- ## Training
 
- - **Base model**: `fastino/gliner2-large-v1`
- - **Method**: LoRA (r=16, alpha=32.0)
- - **Target modules**: ['encoder', 'span_rep']
- - **Training examples**: 8087
- - **Val examples**: 563
- - **Best val loss**: None
 
  ## Usage
-
  ```python
  from gliner2 import GLiNER2
-
- # Install the patched library first
- # pip install git+https://github.com/rafmacalaba/GLiNER2.git@feat/main-mirror
 
  extractor = GLiNER2.from_pretrained("fastino/gliner2-large-v1")
- extractor.load_adapter("rafmacalaba/gliner2-datause-large-v1-hybrid-entities")
 
- # Pass 1: Extract all mention spans
- entity_schema = {
      "entities": ["named_mention", "descriptive_mention", "vague_mention"],
      "entity_descriptions": {
-         "named_mention": "A proper name or well-known acronym for a data source...",
-         "descriptive_mention": "A described data reference with enough detail...",
-         "vague_mention": "A generic or loosely specified reference to data...",
-     },
- }
- spans = extractor.extract(text, entity_schema, threshold=0.3)
-
- # Pass 2: Classify each span
- json_schema = {
-     "data_mention": {
-         "mention_name": "",
-         "typology_tag": {"choices": ["survey", "census", "administrative", "database",
-                                      "indicator", "geospatial", "microdata", "report", "other"]},
-         "is_used": {"choices": ["True", "False"]},
-         "usage_context": {"choices": ["primary", "supporting", "background"]},
      },
  }
- for span in spans.get("named_mention", []):
-     context = extract_sentence_context(text, span)
-     tags = extractor.extract(context, json_schema)
  ```
 
  - two-pass-hybrid
  base_model: fastino/gliner2-large-v1
  library_name: gliner2
+ license: apache-2.0
  ---
 
+ # GLiNER2 Data Mention Extractor — datause-extraction
 
  Fine-tuned GLiNER2 LoRA adapter for extracting structured data mentions from
  development economics and humanitarian research documents.
 
+ Mirrored from [`rafmacalaba/gliner2-datause-large-v1-hybrid-entities`](https://huggingface.co/rafmacalaba/gliner2-datause-large-v1-hybrid-entities).
 
+ ## Architecture: Two-Pass Hybrid
 
  - **Pass 1** (`extract_entities`): Finds ALL data mention spans using 3 entity types
    (`named_mention`, `descriptive_mention`, `vague_mention`). Bypasses count_pred entirely.
+ - **Pass 2** (`extract_json`): Classifies each span individually (count=1).
 
+ ## Entity Types
 
+ - `named_mention`: Proper names and acronyms (DHS, LSMS, FAOSTAT)
+ - `descriptive_mention`: Described data with identifying detail but no formal name
+ - `vague_mention`: Generic data references with minimal identifying detail
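To make the three entity types concrete, here is an illustrative labeling of one passage (editor's invented example, not taken from the model card or its training data):

```python
# Invented passage containing one mention of each entity type.
text = (
    "We use the 2018 DHS, a panel survey of 5,000 rural households, "
    "and other secondary data."
)

# How the three Pass-1 entity types would apply to this passage:
expected_spans = {
    "named_mention": "2018 DHS",                                        # proper name / acronym
    "descriptive_mention": "a panel survey of 5,000 rural households",  # detail, no formal name
    "vague_mention": "other secondary data",                            # minimal identifying detail
}

# Each labeled span is a verbatim substring of the passage.
for etype, span in expected_spans.items():
    assert span in text, (etype, span)
```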
 
+ ## Classification Fields
 
+ - `typology_tag`: survey / census / administrative / database / indicator / geospatial / microdata / report / other
+ - `is_used`: True / False
+ - `usage_context`: primary / supporting / background
 
+ ## Installation
 
+ ```bash
+ pip install git+https://github.com/rafmacalaba/GLiNER2.git@feat/main-mirror
+ ```
 
  ## Usage
 
  ```python
  from gliner2 import GLiNER2
+ import re
 
  extractor = GLiNER2.from_pretrained("fastino/gliner2-large-v1")
+ extractor.load_adapter("ai4data/datause-extraction")
 
+ ENTITY_SCHEMA = {
      "entities": ["named_mention", "descriptive_mention", "vague_mention"],
      "entity_descriptions": {
+         "named_mention": "A proper name or well-known acronym for a data source (DHS, LSMS, FAOSTAT).",
+         "descriptive_mention": "A described data reference with identifying detail but no formal name.",
+         "vague_mention": "A generic or loosely specified reference to data.",
      },
  }
+
+ def extract_sentence_context(text, char_start, char_end, margin=1):
+     """Return the sentence containing [char_start, char_end), plus `margin` sentences each side."""
+     boundaries = [0] + [m.end() for m in re.finditer(r"[.!?]\s+", text)] + [len(text)]
+     for i in range(len(boundaries) - 1):
+         if boundaries[i] <= char_start < boundaries[i + 1]:
+             s = max(0, i - margin)
+             e = min(len(boundaries) - 1, i + margin + 1)
+             return text[boundaries[s]:boundaries[e]].strip()
+     return text
+
+ json_schema = (
+     extractor.create_schema()
+     .structure("data_mention")
+     .field("mention_name", dtype="str")
+     .field("typology_tag", dtype="str", choices=["survey", "census", "administrative", "database",
+                                                  "indicator", "geospatial", "microdata", "report", "other"])
+     .field("is_used", dtype="str", choices=["True", "False"])
+     .field("usage_context", dtype="str", choices=["primary", "supporting", "background"])
+ )
+
+ text = "The analysis draws on the DHS 2018 and administrative records from the National Statistics Office."
+
+ # Pass 1 — span detection
+ pass1 = extractor.extract(text, ENTITY_SCHEMA, threshold=0.3, include_confidence=True, include_spans=True)
+ entities = pass1.get("entities", {})
+
+ # Pass 2 — classification per span
+ results = []
+ for etype in ["named_mention", "descriptive_mention", "vague_mention"]:
+     for span in entities.get(etype, []):
+         mention_text = span.get("text", span) if isinstance(span, dict) else span
+         char_start = span.get("start", text.find(mention_text)) if isinstance(span, dict) else text.find(mention_text)
+         char_end = span.get("end", char_start + len(mention_text)) if isinstance(span, dict) else char_start + len(mention_text)
+         context = extract_sentence_context(text, char_start, char_end)
+         tags = extractor.extract(context, json_schema)
+         tag = (tags.get("data_mention") or [{}])[0]
+         results.append({
+             "mention_name": mention_text,
+             "specificity": etype.replace("_mention", ""),
+             "typology": tag.get("typology_tag"),
+             "is_used": tag.get("is_used"),
+             "usage_context": tag.get("usage_context"),
+         })
+
+ for r in results:
+     print(r)
  ```
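The `extract_sentence_context` helper in the usage snippet is plain Python, so its windowing behavior can be checked without loading the model. A standalone sketch (the sample text is an editor's invented example):

```python
import re

# Same helper as in the usage snippet: returns the sentence containing the
# character span [char_start, char_end), plus `margin` neighboring sentences.
def extract_sentence_context(text, char_start, char_end, margin=1):
    boundaries = [0] + [m.end() for m in re.finditer(r"[.!?]\s+", text)] + [len(text)]
    for i in range(len(boundaries) - 1):
        if boundaries[i] <= char_start < boundaries[i + 1]:
            s = max(0, i - margin)
            e = min(len(boundaries) - 1, i + margin + 1)
            return text[boundaries[s]:boundaries[e]].strip()
    return text

text = "Funding came from several donors. The study uses the 2018 DHS. Results are robust."
start = text.find("2018 DHS")

# margin=0 keeps only the sentence containing the mention.
assert extract_sentence_context(text, start, start + 8, margin=0) == "The study uses the 2018 DHS."

# margin=1 (the default) widens the window by one sentence on each side,
# which here covers the whole three-sentence passage.
assert extract_sentence_context(text, start, start + 8, margin=1) == text
```

Note that the span falls back to returning the full text when no sentence boundary brackets `char_start`, so Pass 2 always receives non-empty context.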