browndw committed on
Commit
7308f56
1 Parent(s): 208cbe9

Update spaCy pipeline

.gitattributes CHANGED
@@ -34,3 +34,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
 en_docusco_spacy_fc_trf-any-py3-none-any.whl filter=lfs diff=lfs merge=lfs -text
 transformer/model filter=lfs diff=lfs merge=lfs -text
+en_docusco_spacy_cd_trf-any-py3-none-any.whl filter=lfs diff=lfs merge=lfs -text
README.md CHANGED
@@ -6,7 +6,7 @@ language:
6
  - en
7
  license: mit
8
  model-index:
9
- - name: en_docusco_spacy_fc_trf
10
  results:
11
  - task:
12
  name: NER
@@ -14,44 +14,44 @@ model-index:
14
  metrics:
15
  - name: NER Precision
16
  type: precision
17
- value: 0.889028963
18
  - name: NER Recall
19
  type: recall
20
- value: 0.8833963688
21
  - name: NER F Score
22
  type: f_score
23
- value: 0.886203716
24
  - task:
25
  name: TAG
26
  type: token-classification
27
  metrics:
28
  - name: TAG (XPOS) Accuracy
29
  type: accuracy
30
- value: 0.9838746739
31
  ---
32
- English pipeline for part-of-speech and rhetorical tagging.
33
 
34
  | Feature | Description |
35
  | --- | --- |
36
- | **Name** | `en_docusco_spacy_fc_trf` |
37
- | **Version** | `1.1` |
38
- | **spaCy** | `>=3.4.3,<3.5.0` |
39
  | **Default Pipeline** | `transformer`, `tagger`, `ner` |
40
  | **Components** | `transformer`, `tagger`, `ner` |
41
  | **Vectors** | 0 keys, 0 unique vectors (0 dimensions) |
42
  | **Sources** | n/a |
43
  | **License** | `MIT` |
44
- | **Author** | [David Brown](https://browndw.github.io/docuscope-docs/) |
45
 
46
  ### Label Scheme
47
 
48
  <details>
49
 
50
- <summary>View label scheme (269 labels for 2 components)</summary>
51
 
52
  | Component | Labels |
53
  | --- | --- |
54
- | **`tagger`** | `APPGE`, `AT`, `AT1`, `BCL21`, `BCL22`, `CC`, `CCB`, `CS`, `CS21`, `CS22`, `CS31`, `CS32`, `CS33`, `CS41`, `CS42`, `CS43`, `CS44`, `CSA`, `CSN`, `CST`, `CSW`, `CSW31`, `CSW32`, `CSW33`, `DA`, `DA1`, `DA2`, `DAR`, `DAT`, `DB`, `DB2`, `DD`, `DD1`, `DD2`, `DDQ`, `DDQGE`, `DDQV`, `DDQV31`, `DDQV32`, `DDQV33`, `EX`, `FO`, `FU`, `FW`, `GE`, `IF`, `II`, `II21`, `II22`, `II31`, `II32`, `II33`, `II41`, `II42`, `II43`, `II44`, `IO`, `IW`, `JJ`, `JJ21`, `JJ22`, `JJ31`, `JJ32`, `JJ33`, `JJR`, `JJT`, `JK`, `MC`, `MC1`, `MC2`, `MC221`, `MC222`, `MCMC`, `MD`, `MF`, `ND1`, `NN`, `NN1`, `NN121`, `NN122`, `NN131`, `NN132`, `NN133`, `NN141`, `NN142`, `NN143`, `NN144`, `NN2`, `NN21`, `NN22`, `NN221`, `NN222`, `NN231`, `NN232`, `NN233`, `NN31`, `NN33`, `NNA`, `NNB`, `NNL1`, `NNL2`, `NNO`, `NNO2`, `NNT1`, `NNT2`, `NNU`, `NNU1`, `NNU2`, `NNU21`, `NNU22`, `NP`, `NP1`, `NP2`, `NPD1`, `NPD2`, `NPM1`, `NPM2`, `PN`, `PN1`, `PN121`, `PN122`, `PN21`, `PN22`, `PNQO`, `PNQS`, `PNQS31`, `PNQS32`, `PNQS33`, `PNQV`, `PNX1`, `PPGE`, `PPH1`, `PPHO1`, `PPHO2`, `PPHS1`, `PPHS2`, `PPIO1`, `PPIO2`, `PPIS1`, `PPIS2`, `PPX1`, `PPX121`, `PPX122`, `PPX2`, `PPX221`, `PPX222`, `PPY`, `RA`, `RA21`, `RA22`, `REX`, `REX21`, `REX22`, `REX41`, `REX42`, `REX43`, `REX44`, `RG`, `RG21`, `RG22`, `RGQ`, `RGQV`, `RGQV31`, `RGQV32`, `RGQV33`, `RGR`, `RGT`, `RL`, `RL21`, `RL22`, `RP`, `RPK`, `RR`, `RR21`, `RR22`, `RR31`, `RR32`, `RR33`, `RR41`, `RR42`, `RR43`, `RR44`, `RR51`, `RR52`, `RR53`, `RR54`, `RR55`, `RRQ`, `RRQV`, `RRQV31`, `RRQV32`, `RRQV33`, `RRR`, `RRT`, `RT`, `RT21`, `RT22`, `RT31`, `RT32`, `RT33`, `RT41`, `RT42`, `RT43`, `RT44`, `TO`, `UH`, `UH21`, `UH22`, `UH31`, `UH32`, `UH33`, `VB0`, `VBDR`, `VBDZ`, `VBG`, `VBI`, `VBM`, `VBN`, `VBR`, `VBZ`, `VD0`, `VDD`, `VDG`, `VDI`, `VDN`, `VDZ`, `VH0`, `VHD`, `VHG`, `VHI`, `VHN`, `VHZ`, `VM`, `VM21`, `VM22`, `VMK`, `VV0`, `VVD`, `VVG`, `VVGK`, `VVI`, `VVN`, `VVNK`, `VVZ`, `XX`, `Y`, `ZZ1`, `ZZ2`, `ZZ221`, `ZZ222` |
55
  | **`ner`** | `ActorsAbstractions`, `ActorsFirstPerson`, `ActorsPeople`, `ActorsPublicEntities`, `CitationAuthority`, `CitationControversy`, `CitationNeutral`, `ConfidenceHedged`, `ConfidenceHigh`, `OrganizationNarrative`, `OrganizationReasoning`, `PlanningFuture`, `PlanningStrategy`, `SentimentNegative`, `SentimentPositive`, `SignpostingAcademicWritingMoves`, `SignpostingMetadiscourse`, `StanceEmphatic`, `StanceModerated` |
56
 
57
  </details>
@@ -60,10 +60,10 @@ English pipeline for part-of-speech and rhetorical tagging.
60
 
61
  | Type | Score |
62
  | --- | --- |
63
- | `TAG_ACC` | 98.39 |
64
- | `ENTS_F` | 88.62 |
65
- | `ENTS_P` | 88.90 |
66
- | `ENTS_R` | 88.34 |
67
- | `TRANSFORMER_LOSS` | 2319800.36 |
68
- | `TAGGER_LOSS` | 669777.78 |
69
- | `NER_LOSS` | 2048423.35 |
 
6
  - en
7
  license: mit
8
  model-index:
9
+ - name: en_docusco_spacy_cd_trf
10
  results:
11
  - task:
12
  name: NER
 
14
  metrics:
15
  - name: NER Precision
16
  type: precision
17
+ value: 0.8975978922
18
  - name: NER Recall
19
  type: recall
20
+ value: 0.8996163997
21
  - name: NER F Score
22
  type: f_score
23
+ value: 0.8986060124
24
  - task:
25
  name: TAG
26
  type: token-classification
27
  metrics:
28
  - name: TAG (XPOS) Accuracy
29
  type: accuracy
30
+ value: 0.9860324848
31
  ---
32
+ English pipeline for part-of-speech and rhetorical tagging using a smaller 'common dictionary'.
33
 
34
  | Feature | Description |
35
  | --- | --- |
36
+ | **Name** | `en_docusco_spacy_cd_trf` |
37
+ | **Version** | `1.3` |
38
+ | **spaCy** | `>=3.7.4,<3.8.0` |
39
  | **Default Pipeline** | `transformer`, `tagger`, `ner` |
40
  | **Components** | `transformer`, `tagger`, `ner` |
41
  | **Vectors** | 0 keys, 0 unique vectors (0 dimensions) |
42
  | **Sources** | n/a |
43
  | **License** | `MIT` |
44
+ | **Author** | [David Brown](https://docuscope.github.io) |
45
 
46
  ### Label Scheme
47
 
48
  <details>
49
 
50
+ <summary>View label scheme (289 labels for 2 components)</summary>
51
 
52
  | Component | Labels |
53
  | --- | --- |
54
+ | **`tagger`** | `APPGE`, `AT`, `AT1`, `BCL21`, `BCL22`, `CC`, `CCB`, `CS`, `CS21`, `CS22`, `CS31`, `CS32`, `CS33`, `CS41`, `CS42`, `CS43`, `CS44`, `CSA`, `CSN`, `CST`, `CSW`, `CSW31`, `CSW32`, `CSW33`, `DA`, `DA1`, `DA2`, `DAR`, `DAT`, `DB`, `DB2`, `DD`, `DD1`, `DD2`, `DDQ`, `DDQGE`, `DDQGE31`, `DDQGE32`, `DDQGE33`, `DDQV`, `DDQV31`, `DDQV32`, `DDQV33`, `EX`, `FO`, `FU`, `FW`, `GE`, `IF`, `II`, `II21`, `II22`, `II31`, `II32`, `II33`, `II41`, `II42`, `II43`, `II44`, `IO`, `IW`, `JJ`, `JJ21`, `JJ22`, `JJ31`, `JJ32`, `JJ33`, `JJ41`, `JJ42`, `JJ43`, `JJ44`, `JJR`, `JJT`, `JK`, `MC`, `MC1`, `MC2`, `MC221`, `MC222`, `MCMC`, `MD`, `MF`, `ND1`, `NN`, `NN1`, `NN121`, `NN122`, `NN131`, `NN132`, `NN133`, `NN141`, `NN142`, `NN143`, `NN144`, `NN2`, `NN21`, `NN22`, `NN221`, `NN222`, `NN31`, `NN32`, `NN33`, `NNA`, `NNB`, `NNL1`, `NNL2`, `NNO`, `NNO2`, `NNT1`, `NNT131`, `NNT132`, `NNT133`, `NNT2`, `NNU`, `NNU1`, `NNU2`, `NNU21`, `NNU22`, `NNU221`, `NNU222`, `NP`, `NP1`, `NP2`, `NPD1`, `NPD2`, `NPM1`, `NPM2`, `PN`, `PN1`, `PN121`, `PN122`, `PN21`, `PN22`, `PNQO`, `PNQS`, `PNQS31`, `PNQS32`, `PNQS33`, `PNQV`, `PNQV31`, `PNQV32`, `PNQV33`, `PNX1`, `PPGE`, `PPH1`, `PPHO1`, `PPHO2`, `PPHS1`, `PPHS2`, `PPIO1`, `PPIO2`, `PPIS1`, `PPIS2`, `PPX1`, `PPX121`, `PPX122`, `PPX2`, `PPX221`, `PPX222`, `PPY`, `RA`, `RA21`, `RA22`, `REX`, `REX21`, `REX22`, `REX41`, `REX42`, `REX43`, `REX44`, `RG`, `RG21`, `RG22`, `RG41`, `RG42`, `RG43`, `RG44`, `RGQ`, `RGQV`, `RGQV31`, `RGQV32`, `RGQV33`, `RGR`, `RGT`, `RL`, `RL21`, `RL22`, `RL31`, `RL32`, `RL33`, `RP`, `RPK`, `RR`, `RR21`, `RR22`, `RR31`, `RR32`, `RR33`, `RR41`, `RR42`, `RR43`, `RR44`, `RR51`, `RR52`, `RR53`, `RR54`, `RR55`, `RRQ`, `RRQV`, `RRQV31`, `RRQV32`, `RRQV33`, `RRR`, `RRT`, `RT`, `RT21`, `RT22`, `RT31`, `RT32`, `RT33`, `RT41`, `RT42`, `RT43`, `RT44`, `TO`, `UH`, `UH21`, `UH22`, `UH31`, `UH32`, `UH33`, `VB0`, `VBDR`, `VBDZ`, `VBG`, `VBI`, `VBM`, `VBN`, `VBR`, `VBZ`, `VD0`, `VDD`, `VDG`, `VDI`, `VDN`, `VDZ`, `VH0`, `VHD`, `VHG`, `VHI`, `VHN`, `VHZ`, `VM`, `VM21`, `VM22`, `VMK`, `VV0`, `VVD`, `VVG`, `VVGK`, `VVI`, `VVN`, `VVNK`, `VVZ`, `XX`, `Y`, `ZZ1`, `ZZ2`, `ZZ221`, `ZZ222` |
55
  | **`ner`** | `ActorsAbstractions`, `ActorsFirstPerson`, `ActorsPeople`, `ActorsPublicEntities`, `CitationAuthority`, `CitationControversy`, `CitationNeutral`, `ConfidenceHedged`, `ConfidenceHigh`, `OrganizationNarrative`, `OrganizationReasoning`, `PlanningFuture`, `PlanningStrategy`, `SentimentNegative`, `SentimentPositive`, `SignpostingAcademicWritingMoves`, `SignpostingMetadiscourse`, `StanceEmphatic`, `StanceModerated` |
56
 
57
  </details>
 
60
 
61
  | Type | Score |
62
  | --- | --- |
63
+ | `TAG_ACC` | 98.60 |
64
+ | `ENTS_F` | 89.86 |
65
+ | `ENTS_P` | 89.76 |
66
+ | `ENTS_R` | 89.96 |
67
+ | `TRANSFORMER_LOSS` | 4671131.21 |
68
+ | `TAGGER_LOSS` | 1405830.04 |
69
+ | `NER_LOSS` | 4168254.47 |
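
The updated `ENTS_F` can be cross-checked against the new `ENTS_P` and `ENTS_R` values. The sketch below assumes the reported F score is the plain harmonic mean of the overall precision and recall; the helper name `f_score` is illustrative and not part of the repository.

```python
def f_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (F1)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values copied from the updated model-index block in README.md.
ents_p = 0.8975978922
ents_r = 0.8996163997

ents_f = f_score(ents_p, ents_r)

# The README table reports scores as percentages with two decimals.
print(f"ENTS_F = {ents_f * 100:.2f}")
```

Under that assumption the result agrees with the `0.8986060124` stored in the model-index and the `89.86` shown in the accuracy table.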
config.cfg CHANGED
@@ -1,6 +1,6 @@
1
  [paths]
2
- train = "/content/drive/MyDrive/DS Bert/SpacyTrain/spacy_train_cd.spacy"
3
- dev = "/content/drive/MyDrive/DS Bert/SpacyTrain/spacy_test_cd.spacy"
4
  vectors = null
5
  init_tok2vec = null
6
 
@@ -11,12 +11,13 @@ seed = 0
11
  [nlp]
12
  lang = "en"
13
  pipeline = ["transformer","tagger","ner"]
14
- batch_size = 128
15
  disabled = []
16
  before_creation = null
17
  after_creation = null
18
  after_pipeline_creation = null
19
  tokenizer = {"@tokenizers":"spacy.Tokenizer.v1"}
 
20
 
21
  [components]
22
 
@@ -44,6 +45,7 @@ upstream = "*"
44
 
45
  [components.tagger]
46
  factory = "tagger"
 
47
  neg_prefix = "!"
48
  overwrite = false
49
  scorer = {"@scorers":"spacy.tagger_scorer.v1"}
@@ -106,13 +108,14 @@ train_corpus = "corpora.train"
106
  seed = ${system.seed}
107
  gpu_allocator = ${system.gpu_allocator}
108
  dropout = 0.1
109
- patience = 1600
110
- max_epochs = 0
111
- max_steps = 20000
112
- eval_frequency = 200
113
  frozen_components = []
114
  annotating_components = []
115
  before_to_disk = null
 
116
 
117
  [training.batcher]
118
  @batchers = "spacy.batch_by_padded.v1"
@@ -137,13 +140,13 @@ eps = 0.00000001
137
 
138
  [training.optimizer.learn_rate]
139
  @schedules = "warmup_linear.v1"
140
- warmup_steps = 250
141
- total_steps = 20000
142
  initial_rate = 0.00005
143
 
144
  [training.score_weights]
145
- tag_acc = 0.5
146
- ents_f = 0.5
147
  ents_p = 0.0
148
  ents_r = 0.0
149
  ents_per_type = null
@@ -164,14 +167,14 @@ after_init = null
164
 
165
  [initialize.components.ner.labels]
166
  @readers = "spacy.read_labels.v1"
167
- path = "\"/content/drive/MyDrive/DS Bert/SpacyTrain/ner-sample.json"
168
  require = false
169
 
170
  [initialize.components.tagger]
171
 
172
  [initialize.components.tagger.labels]
173
  @readers = "spacy.read_labels.v1"
174
- path = "/content/drive/MyDrive/DS Bert/SpacyTrain/tagger-sample.json"
175
  require = false
176
 
177
  [initialize.tokenizer]
 
1
  [paths]
2
+ train = "spacy_train_05.spacy"
3
+ dev = "spacy_dev_05.spacy"
4
  vectors = null
5
  init_tok2vec = null
6
 
 
11
  [nlp]
12
  lang = "en"
13
  pipeline = ["transformer","tagger","ner"]
14
+ batch_size = 32
15
  disabled = []
16
  before_creation = null
17
  after_creation = null
18
  after_pipeline_creation = null
19
  tokenizer = {"@tokenizers":"spacy.Tokenizer.v1"}
20
+ vectors = {"@vectors":"spacy.Vectors.v1"}
21
 
22
  [components]
23
 
 
45
 
46
  [components.tagger]
47
  factory = "tagger"
48
+ label_smoothing = 0.0
49
  neg_prefix = "!"
50
  overwrite = false
51
  scorer = {"@scorers":"spacy.tagger_scorer.v1"}
 
108
  seed = ${system.seed}
109
  gpu_allocator = ${system.gpu_allocator}
110
  dropout = 0.1
111
+ patience = 20000
112
+ max_epochs = -1
113
+ max_steps = 30000
114
+ eval_frequency = 500
115
  frozen_components = []
116
  annotating_components = []
117
  before_to_disk = null
118
+ before_update = null
119
 
120
  [training.batcher]
121
  @batchers = "spacy.batch_by_padded.v1"
 
140
 
141
  [training.optimizer.learn_rate]
142
  @schedules = "warmup_linear.v1"
143
+ warmup_steps = 500
144
+ total_steps = 25000
145
  initial_rate = 0.00005
146
 
147
  [training.score_weights]
148
+ tag_acc = 0.4
149
+ ents_f = 0.6
150
  ents_p = 0.0
151
  ents_r = 0.0
152
  ents_per_type = null
 
167
 
168
  [initialize.components.ner.labels]
169
  @readers = "spacy.read_labels.v1"
170
+ path = "ner.json"
171
  require = false
172
 
173
  [initialize.components.tagger]
174
 
175
  [initialize.components.tagger.labels]
176
  @readers = "spacy.read_labels.v1"
177
+ path = "tagger.json"
178
  require = false
179
 
180
  [initialize.tokenizer]
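
The new `[training.optimizer.learn_rate]` values (`warmup_steps = 500`, `total_steps = 25000`, `initial_rate = 0.00005`) can be explored with a small stand-alone sketch. It assumes the schedule ramps linearly from zero to `initial_rate` during warmup and then decays linearly to zero at `total_steps`; the function `warmup_linear_rate` is a hypothetical illustration, not the library's implementation.

```python
def warmup_linear_rate(
    step: int,
    initial_rate: float = 0.00005,
    warmup_steps: int = 500,
    total_steps: int = 25000,
) -> float:
    """Learning rate at `step` under an assumed linear warmup / linear decay."""
    if step < warmup_steps:
        # Ramp up from 0 to initial_rate over the warmup period.
        factor = step / max(1, warmup_steps)
    else:
        # Decay linearly, reaching 0 at total_steps and staying there.
        factor = max(0.0, (total_steps - step) / max(1.0, total_steps - warmup_steps))
    return factor * initial_rate


# Sample a few steps, including max_steps = 30000 from [training].
for step in (0, 250, 500, 12750, 25000, 30000):
    print(step, warmup_linear_rate(step))
```

If the schedule does behave this way, the rate would already be zero for the steps between `total_steps = 25000` and `max_steps = 30000`, which may be worth keeping in mind when reading the training settings in this commit.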
en_docusco_spacy_cd_trf-any-py3-none-any.whl ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:86ef06b7e24e61928327ae520998138f37ccf11b878ef1da7f848d83aabad23f
+size 466599656
meta.json CHANGED
@@ -1,14 +1,14 @@
1
  {
2
  "lang":"en",
3
- "name":"docusco_spacy_fc_trf",
4
- "version":"1.1",
5
- "description":"English pipeline for part-of-speech and rhetorical tagging.",
6
  "author":"David Brown",
7
  "email":"dwb2@andrew.cmu.edu",
8
- "url":"https://browndw.github.io/docuscope-docs/",
9
  "license":"MIT",
10
- "spacy_version":">=3.4.3,<3.5.0",
11
- "spacy_git_version":"Unknown",
12
  "vectors":{
13
  "width":0,
14
  "vectors":0,
@@ -56,6 +56,9 @@
56
  "DD2",
57
  "DDQ",
58
  "DDQGE",
 
 
 
59
  "DDQV",
60
  "DDQV31",
61
  "DDQV32",
@@ -84,6 +87,10 @@
84
  "JJ31",
85
  "JJ32",
86
  "JJ33",
 
 
 
 
87
  "JJR",
88
  "JJT",
89
  "JK",
@@ -112,10 +119,8 @@
112
  "NN22",
113
  "NN221",
114
  "NN222",
115
- "NN231",
116
- "NN232",
117
- "NN233",
118
  "NN31",
 
119
  "NN33",
120
  "NNA",
121
  "NNB",
@@ -124,12 +129,17 @@
124
  "NNO",
125
  "NNO2",
126
  "NNT1",
 
 
 
127
  "NNT2",
128
  "NNU",
129
  "NNU1",
130
  "NNU2",
131
  "NNU21",
132
  "NNU22",
 
 
133
  "NP",
134
  "NP1",
135
  "NP2",
@@ -149,6 +159,9 @@
149
  "PNQS32",
150
  "PNQS33",
151
  "PNQV",
 
 
 
152
  "PNX1",
153
  "PPGE",
154
  "PPH1",
@@ -180,6 +193,10 @@
180
  "RG",
181
  "RG21",
182
  "RG22",
 
 
 
 
183
  "RGQ",
184
  "RGQV",
185
  "RGQV31",
@@ -190,6 +207,9 @@
190
  "RL",
191
  "RL21",
192
  "RL22",
 
 
 
193
  "RP",
194
  "RPK",
195
  "RR",
@@ -307,112 +327,112 @@
307
 
308
  ],
309
  "performance":{
310
- "tag_acc":0.9838746739,
311
- "ents_f":0.886203716,
312
- "ents_p":0.889028963,
313
- "ents_r":0.8833963688,
314
  "ents_per_type":{
315
  "ActorsFirstPerson":{
316
- "p":0.9048672566,
317
- "r":0.9176833544,
318
- "f":0.9112302444
319
  },
320
- "ActorsAbstractions":{
321
- "p":0.8884982639,
322
- "r":0.8868047132,
323
- "f":0.8876506808
324
  },
325
- "SentimentPositive":{
326
- "p":0.8560008306,
327
- "r":0.827811245,
328
- "f":0.8416700694
 
 
 
 
 
329
  },
330
  "ActorsPeople":{
331
- "p":0.9271072667,
332
- "r":0.9305028034,
333
- "f":0.9288019317
334
  },
335
  "SignpostingMetadiscourse":{
336
- "p":0.9420750336,
337
- "r":0.9215780036,
338
- "f":0.9317138023
 
 
 
 
 
339
  },
340
  "OrganizationReasoning":{
341
- "p":0.9138317376,
342
- "r":0.8952304929,
343
- "f":0.9044354839
344
  },
345
- "SentimentNegative":{
346
- "p":0.8280952381,
347
- "r":0.8157206996,
348
- "f":0.8218613915
349
  },
350
- "OrganizationNarrative":{
351
- "p":0.8888659154,
352
- "r":0.8726276261,
353
- "f":0.8806719246
354
  },
355
- "ActorsPublicEntities":{
356
- "p":0.913087316,
357
- "r":0.8978782471,
358
- "f":0.9054189162
359
  },
360
- "ConfidenceHedged":{
361
- "p":0.9044895625,
362
- "r":0.9052661706,
363
- "f":0.9048776999
364
  },
365
- "StanceEmphatic":{
366
- "p":0.864783265,
367
- "r":0.9101225601,
368
- "f":0.8868738251
369
  },
370
  "ConfidenceHigh":{
371
- "p":0.8696095076,
372
- "r":0.8573819886,
373
- "f":0.8634524612
374
- },
375
- "PlanningFuture":{
376
- "p":0.8828828829,
377
- "r":0.8994576554,
378
- "f":0.8910932011
379
  },
380
- "SignpostingAcademicWritingMoves":{
381
- "p":0.7609090909,
382
- "r":0.7678899083,
383
- "f":0.7643835616
384
- },
385
- "PlanningStrategy":{
386
- "p":0.8513819665,
387
- "r":0.8205240175,
388
- "f":0.8356682233
389
  },
390
- "CitationAuthority":{
391
- "p":0.8544839255,
392
- "r":0.8265139116,
393
- "f":0.840266223
394
  },
395
- "StanceModerated":{
396
- "p":0.8590852905,
397
- "r":0.8853503185,
398
- "f":0.8720200753
399
  },
400
- "CitationNeutral":{
401
- "p":0.8832214765,
402
- "r":0.8832214765,
403
- "f":0.8832214765
404
  },
405
- "CitationControversy":{
406
- "p":0.8739837398,
407
- "r":0.8884297521,
408
- "f":0.881147541
409
  }
410
  },
411
- "transformer_loss":23198.0035903843,
412
- "tagger_loss":6697.7777622218,
413
- "ner_loss":20484.2334804777
414
  },
415
  "requirements":[
416
- "spacy-transformers>=1.1.8,<1.2.0"
417
  ]
418
  }
 
1
  {
2
  "lang":"en",
3
+ "name":"docusco_spacy_cd_trf",
4
+ "version":"1.3",
5
+ "description":"English pipeline for part-of-speech and rhetorical tagging using a smaller 'common dictionary'.",
6
  "author":"David Brown",
7
  "email":"dwb2@andrew.cmu.edu",
8
+ "url":"https://docuscope.github.io",
9
  "license":"MIT",
10
+ "spacy_version":">=3.7.4,<3.8.0",
11
+ "spacy_git_version":"bff8725f4",
12
  "vectors":{
13
  "width":0,
14
  "vectors":0,
 
56
  "DD2",
57
  "DDQ",
58
  "DDQGE",
59
+ "DDQGE31",
60
+ "DDQGE32",
61
+ "DDQGE33",
62
  "DDQV",
63
  "DDQV31",
64
  "DDQV32",
 
87
  "JJ31",
88
  "JJ32",
89
  "JJ33",
90
+ "JJ41",
91
+ "JJ42",
92
+ "JJ43",
93
+ "JJ44",
94
  "JJR",
95
  "JJT",
96
  "JK",
 
119
  "NN22",
120
  "NN221",
121
  "NN222",
 
 
 
122
  "NN31",
123
+ "NN32",
124
  "NN33",
125
  "NNA",
126
  "NNB",
 
129
  "NNO",
130
  "NNO2",
131
  "NNT1",
132
+ "NNT131",
133
+ "NNT132",
134
+ "NNT133",
135
  "NNT2",
136
  "NNU",
137
  "NNU1",
138
  "NNU2",
139
  "NNU21",
140
  "NNU22",
141
+ "NNU221",
142
+ "NNU222",
143
  "NP",
144
  "NP1",
145
  "NP2",
 
159
  "PNQS32",
160
  "PNQS33",
161
  "PNQV",
162
+ "PNQV31",
163
+ "PNQV32",
164
+ "PNQV33",
165
  "PNX1",
166
  "PPGE",
167
  "PPH1",
 
193
  "RG",
194
  "RG21",
195
  "RG22",
196
+ "RG41",
197
+ "RG42",
198
+ "RG43",
199
+ "RG44",
200
  "RGQ",
201
  "RGQV",
202
  "RGQV31",
 
207
  "RL",
208
  "RL21",
209
  "RL22",
210
+ "RL31",
211
+ "RL32",
212
+ "RL33",
213
  "RP",
214
  "RPK",
215
  "RR",
 
327
 
328
  ],
329
  "performance":{
330
+ "tag_acc":0.9860324848,
331
+ "ents_f":0.8986060124,
332
+ "ents_p":0.8975978922,
333
+ "ents_r":0.8996163997,
334
  "ents_per_type":{
335
  "ActorsFirstPerson":{
336
+ "p":0.9297243488,
337
+ "r":0.9421626555,
338
+ "f":0.9359021772
339
  },
340
+ "OrganizationNarrative":{
341
+ "p":0.8982249764,
342
+ "r":0.9052289888,
343
+ "f":0.901713382
344
  },
345
+ "ConfidenceHedged":{
346
+ "p":0.9133998382,
347
+ "r":0.925173412,
348
+ "f":0.9192489282
349
+ },
350
+ "StanceEmphatic":{
351
+ "p":0.9163952226,
352
+ "r":0.9306501792,
353
+ "f":0.9234676931
354
  },
355
  "ActorsPeople":{
356
+ "p":0.9048275066,
357
+ "r":0.9085233815,
358
+ "f":0.9066716777
359
  },
360
  "SignpostingMetadiscourse":{
361
+ "p":0.9521945378,
362
+ "r":0.9343999277,
363
+ "f":0.9432133122
364
+ },
365
+ "PlanningStrategy":{
366
+ "p":0.867487328,
367
+ "r":0.8729657518,
368
+ "f":0.8702179177
369
  },
370
  "OrganizationReasoning":{
371
+ "p":0.9162113643,
372
+ "r":0.913893106,
373
+ "f":0.9150507669
374
  },
375
+ "ActorsAbstractions":{
376
+ "p":0.8978776116,
377
+ "r":0.9052445851,
378
+ "f":0.9015460488
379
  },
380
+ "SentimentPositive":{
381
+ "p":0.8603518268,
382
+ "r":0.8566270255,
383
+ "f":0.8584853859
384
  },
385
+ "SentimentNegative":{
386
+ "p":0.8577821301,
387
+ "r":0.8418267418,
388
+ "f":0.8497295439
389
  },
390
+ "CitationAuthority":{
391
+ "p":0.8555627846,
392
+ "r":0.8453873353,
393
+ "f":0.8504446241
394
  },
395
+ "StanceModerated":{
396
+ "p":0.8848971874,
397
+ "r":0.9246727587,
398
+ "f":0.9043478261
399
  },
400
  "ConfidenceHigh":{
401
+ "p":0.8963930348,
402
+ "r":0.9093432591,
403
+ "f":0.9028217093
 
 
 
 
 
404
  },
405
+ "CitationControversy":{
406
+ "p":0.8772563177,
407
+ "r":0.9109653233,
408
+ "f":0.8937931034
 
 
 
 
 
409
  },
410
+ "CitationNeutral":{
411
+ "p":0.9121713201,
412
+ "r":0.9254675468,
413
+ "f":0.9187713311
414
  },
415
+ "PlanningFuture":{
416
+ "p":0.891873065,
417
+ "r":0.915613826,
418
+ "f":0.9035875319
419
  },
420
+ "ActorsPublicEntities":{
421
+ "p":0.9129542262,
422
+ "r":0.9113132257,
423
+ "f":0.9121329879
424
  },
425
+ "SignpostingAcademicWritingMoves":{
426
+ "p":0.7986216171,
427
+ "r":0.8133881185,
428
+ "f":0.8059372349
429
  }
430
  },
431
+ "transformer_loss":46711.3121389837,
432
+ "tagger_loss":14058.3003581261,
433
+ "ner_loss":41682.5447270232
434
  },
435
  "requirements":[
436
+ "spacy-transformers>=1.3.5,<1.4.0"
437
  ]
438
  }
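
The `[training.score_weights]` change in `config.cfg` (`tag_acc = 0.4`, `ents_f = 0.6`) determines how the metrics in `performance` are combined into a single model-selection score. A minimal sketch, assuming the combined score is a plain weighted sum of the named metrics; `weighted_score` is a hypothetical helper, not a function from the repository.

```python
# Weights from the updated [training.score_weights] block (null entries omitted).
score_weights = {"tag_acc": 0.4, "ents_f": 0.6, "ents_p": 0.0, "ents_r": 0.0}

# Overall metrics from the updated "performance" block in meta.json.
performance = {
    "tag_acc": 0.9860324848,
    "ents_f": 0.8986060124,
    "ents_p": 0.8975978922,
    "ents_r": 0.8996163997,
}


def weighted_score(scores: dict, weights: dict) -> float:
    """Weighted sum of the metrics that carry a non-zero weight."""
    return sum(scores.get(name, 0.0) * weight for name, weight in weights.items() if weight)


final = weighted_score(performance, score_weights)
print(f"combined score = {final:.4f}")
```

With the previous weights (`0.5` / `0.5`) the tagger and NER metrics counted equally; the new weighting shifts model selection toward NER F score.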
ner/model CHANGED
Binary files a/ner/model and b/ner/model differ
 
ner/moves CHANGED
@@ -1 +1 @@
1
- ��moves�P{"0":{},"1":{"ActorsAbstractions":574627,"SentimentNegative":505726,"ActorsPeople":489704,"SentimentPositive":329499,"OrganizationNarrative":327796,"SignpostingMetadiscourse":285541,"ActorsFirstPerson":242622,"OrganizationReasoning":182971,"StanceEmphatic":148905,"ActorsPublicEntities":141386,"ConfidenceHedged":130515,"ConfidenceHigh":119696,"PlanningFuture":91199,"PlanningStrategy":77436,"SignpostingAcademicWritingMoves":45355,"CitationNeutral":28827,"StanceModerated":24981,"CitationAuthority":24697,"CitationControversy":7780},"2":{"ActorsAbstractions":574627,"SentimentNegative":505726,"ActorsPeople":489704,"SentimentPositive":329499,"OrganizationNarrative":327796,"SignpostingMetadiscourse":285541,"ActorsFirstPerson":242622,"OrganizationReasoning":182971,"StanceEmphatic":148905,"ActorsPublicEntities":141386,"ConfidenceHedged":130515,"ConfidenceHigh":119696,"PlanningFuture":91199,"PlanningStrategy":77436,"SignpostingAcademicWritingMoves":45355,"CitationNeutral":28827,"StanceModerated":24981,"CitationAuthority":24697,"CitationControversy":7780},"3":{"ActorsAbstractions":574627,"SentimentNegative":505726,"ActorsPeople":489704,"SentimentPositive":329499,"OrganizationNarrative":327796,"SignpostingMetadiscourse":285541,"ActorsFirstPerson":242622,"OrganizationReasoning":182971,"StanceEmphatic":148905,"ActorsPublicEntities":141386,"ConfidenceHedged":130515,"ConfidenceHigh":119696,"PlanningFuture":91199,"PlanningStrategy":77436,"SignpostingAcademicWritingMoves":45355,"CitationNeutral":28827,"StanceModerated":24981,"CitationAuthority":24697,"CitationControversy":7780},"4":{"ActorsAbstractions":574627,"SentimentNegative":505726,"ActorsPeople":489704,"SentimentPositive":329499,"OrganizationNarrative":327796,"SignpostingMetadiscourse":285541,"ActorsFirstPerson":242622,"OrganizationReasoning":182971,"StanceEmphatic":148905,"ActorsPublicEntities":141386,"ConfidenceHedged":130515,"ConfidenceHigh":119696,"PlanningFuture":91199,"PlanningStrategy":77436,"SignpostingAcademi
cWritingMoves":45355,"CitationNeutral":28827,"StanceModerated":24981,"CitationAuthority":24697,"CitationControversy":7780,"":1},"5":{"":1}}�cfg��neg_key�
 
1
+ ��moves�l{"0":{},"1":{"ActorsPeople":1591194,"ActorsAbstractions":1564271,"SentimentNegative":1302786,"OrganizationNarrative":871730,"SentimentPositive":863940,"SignpostingMetadiscourse":697430,"ActorsFirstPerson":650913,"OrganizationReasoning":427949,"StanceEmphatic":377909,"ActorsPublicEntities":354014,"ConfidenceHedged":320162,"ConfidenceHigh":296184,"PlanningFuture":224817,"PlanningStrategy":197087,"SignpostingAcademicWritingMoves":113408,"CitationNeutral":68527,"StanceModerated":60423,"CitationAuthority":56832,"CitationControversy":16582},"2":{"ActorsPeople":1591194,"ActorsAbstractions":1564271,"SentimentNegative":1302786,"OrganizationNarrative":871730,"SentimentPositive":863940,"SignpostingMetadiscourse":697430,"ActorsFirstPerson":650913,"OrganizationReasoning":427949,"StanceEmphatic":377909,"ActorsPublicEntities":354014,"ConfidenceHedged":320162,"ConfidenceHigh":296184,"PlanningFuture":224817,"PlanningStrategy":197087,"SignpostingAcademicWritingMoves":113408,"CitationNeutral":68527,"StanceModerated":60423,"CitationAuthority":56832,"CitationControversy":16582},"3":{"ActorsPeople":1591194,"ActorsAbstractions":1564271,"SentimentNegative":1302786,"OrganizationNarrative":871730,"SentimentPositive":863940,"SignpostingMetadiscourse":697430,"ActorsFirstPerson":650913,"OrganizationReasoning":427949,"StanceEmphatic":377909,"ActorsPublicEntities":354014,"ConfidenceHedged":320162,"ConfidenceHigh":296184,"PlanningFuture":224817,"PlanningStrategy":197087,"SignpostingAcademicWritingMoves":113408,"CitationNeutral":68527,"StanceModerated":60423,"CitationAuthority":56832,"CitationControversy":16582},"4":{"ActorsPeople":1591194,"ActorsAbstractions":1564271,"SentimentNegative":1302786,"OrganizationNarrative":871730,"SentimentPositive":863940,"SignpostingMetadiscourse":697430,"ActorsFirstPerson":650913,"OrganizationReasoning":427949,"StanceEmphatic":377909,"ActorsPublicEntities":354014,"ConfidenceHedged":320162,"ConfidenceHigh":296184,"PlanningFuture":224817,"PlanningStrategy":
197087,"SignpostingAcademicWritingMoves":113408,"CitationNeutral":68527,"StanceModerated":60423,"CitationAuthority":56832,"CitationControversy":16582,"":1},"5":{"":1}}�cfg��neg_key�
tagger/cfg CHANGED
@@ -1,4 +1,5 @@
1
  {
 
2
  "labels":[
3
  "APPGE",
4
  "AT",
@@ -36,6 +37,9 @@
36
  "DD2",
37
  "DDQ",
38
  "DDQGE",
 
 
 
39
  "DDQV",
40
  "DDQV31",
41
  "DDQV32",
@@ -64,6 +68,10 @@
64
  "JJ31",
65
  "JJ32",
66
  "JJ33",
 
 
 
 
67
  "JJR",
68
  "JJT",
69
  "JK",
@@ -92,10 +100,8 @@
92
  "NN22",
93
  "NN221",
94
  "NN222",
95
- "NN231",
96
- "NN232",
97
- "NN233",
98
  "NN31",
 
99
  "NN33",
100
  "NNA",
101
  "NNB",
@@ -104,12 +110,17 @@
104
  "NNO",
105
  "NNO2",
106
  "NNT1",
 
 
 
107
  "NNT2",
108
  "NNU",
109
  "NNU1",
110
  "NNU2",
111
  "NNU21",
112
  "NNU22",
 
 
113
  "NP",
114
  "NP1",
115
  "NP2",
@@ -129,6 +140,9 @@
129
  "PNQS32",
130
  "PNQS33",
131
  "PNQV",
 
 
 
132
  "PNX1",
133
  "PPGE",
134
  "PPH1",
@@ -160,6 +174,10 @@
160
  "RG",
161
  "RG21",
162
  "RG22",
 
 
 
 
163
  "RGQ",
164
  "RGQV",
165
  "RGQV31",
@@ -170,6 +188,9 @@
170
  "RL",
171
  "RL21",
172
  "RL22",
 
 
 
173
  "RP",
174
  "RPK",
175
  "RR",
 
1
  {
2
+ "label_smoothing":0.0,
3
  "labels":[
4
  "APPGE",
5
  "AT",
 
37
  "DD2",
38
  "DDQ",
39
  "DDQGE",
40
+ "DDQGE31",
41
+ "DDQGE32",
42
+ "DDQGE33",
43
  "DDQV",
44
  "DDQV31",
45
  "DDQV32",
 
68
  "JJ31",
69
  "JJ32",
70
  "JJ33",
71
+ "JJ41",
72
+ "JJ42",
73
+ "JJ43",
74
+ "JJ44",
75
  "JJR",
76
  "JJT",
77
  "JK",
 
100
  "NN22",
101
  "NN221",
102
  "NN222",
 
 
 
103
  "NN31",
104
+ "NN32",
105
  "NN33",
106
  "NNA",
107
  "NNB",
 
110
  "NNO",
111
  "NNO2",
112
  "NNT1",
113
+ "NNT131",
114
+ "NNT132",
115
+ "NNT133",
116
  "NNT2",
117
  "NNU",
118
  "NNU1",
119
  "NNU2",
120
  "NNU21",
121
  "NNU22",
122
+ "NNU221",
123
+ "NNU222",
124
  "NP",
125
  "NP1",
126
  "NP2",
 
140
  "PNQS32",
141
  "PNQS33",
142
  "PNQV",
143
+ "PNQV31",
144
+ "PNQV32",
145
+ "PNQV33",
146
  "PNX1",
147
  "PPGE",
148
  "PPH1",
 
174
  "RG",
175
  "RG21",
176
  "RG22",
177
+ "RG41",
178
+ "RG42",
179
+ "RG43",
180
+ "RG44",
181
  "RGQ",
182
  "RGQV",
183
  "RGQV31",
 
188
  "RL",
189
  "RL21",
190
  "RL22",
191
+ "RL31",
192
+ "RL32",
193
+ "RL33",
194
  "RP",
195
  "RPK",
196
  "RR",
tagger/model CHANGED
Binary files a/tagger/model and b/tagger/model differ
 
tokenizer CHANGED
The diff for this file is too large to render. See raw diff
 
transformer/model CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:22576948b84086ef0634f91f089cb600a12dcf97e5c37e27caf9ddf1d2cebfb8
-size 502030632
+oid sha256:fe6a244db75cb192f39add536b26df730b2a4d4eb7ea298436e2f907f54a6d91
+size 502027402
vocab/strings.json CHANGED
The diff for this file is too large to render. See raw diff