lucafrost committed
Commit cfc16a2 • 1 Parent(s): 427d3ec

Update README.md

Files changed (1): README.md (+64 −10)

README.md CHANGED
@@ -1,7 +1,5 @@
 ---
- license: apache-2.0
- tags:
- - generated_from_trainer
 metrics:
 - precision
 - recall
@@ -10,14 +8,23 @@ metrics:
 model-index:
 - name: directquote-variedStyles
   results: []
 ---

- <!-- This model card has been generated automatically according to the information the Trainer had access to. You
- should probably proofread and complete it, then remove this comment. -->

- # directquote-variedStyles
-
- This model is a fine-tuned version of [distilbert-base-cased](https://huggingface.co/distilbert-base-cased) on the None dataset.
 It achieves the following results on the evaluation set:
 - Loss: 0.2339
 - Precision: 0.7440
@@ -27,7 +34,8 @@ It achieves the following results on the evaluation set:

 ## Model description

- More information needed

 ## Intended uses & limitations

@@ -35,10 +43,49 @@ More information needed

 ## Training and evaluation data

- More information needed
 ## Training procedure

 ### Training hyperparameters

 The following hyperparameters were used during training:
@@ -67,3 +114,10 @@ The following hyperparameters were used during training:
 - Pytorch 1.10.2+cu113
 - Datasets 2.8.0
 - Tokenizers 0.13.2
 ---
+ license: agpl-3.0
 metrics:
 - precision
 - recall

 model-index:
 - name: directquote-variedStyles
   results: []
+ datasets:
+ - DirectQuote
+ language:
+ - en
+ pipeline_tag: token-classification
+ library_name: transformers
 ---
+ <!--- WHISP DEVELOPMENT LOGO ~ RESPONSIVE TO LIGHT/DARK MODE --->
+ <picture>
+   <source media="(prefers-color-scheme: dark)" srcset="https://i.imgur.com/eO4igg9.png" height="37" style="height: 37px">
+   <img src="https://i.imgur.com/ihiqdVt.png" height="37" style="height: 37px">
+ </picture>

+ # quote extraction & attribution on the [DirectQuote](https://arxiv.org/abs/2110.07827) dataset with BERT-based token classification 💬
+ **This repository stores the code to train and perform inference with a DistilBERT model on the DirectQuote corpus (Zhang et al., 2021).**

+ **directquote-variedStyles 💬** is a fine-tuned [distilbert-base-cased](https://huggingface.co/distilbert-base-cased) model that performs token classification on a modified version of the [DirectQuote](https://arxiv.org/abs/2110.07827) dataset.

 It achieves the following results on the evaluation set:
 - Loss: 0.2339
 - Precision: 0.7440

 ## Model description

+ **directquote-variedStyles** performs Quote Extraction and Attribution (QEA) on text, enabling NLP applications to suitably process the quotations found in documents and corpora. Further applications of QEA have been proposed in the realm of 'modular journalism' (see ['Talking sense: using machine learning to understand quotes'](https://www.theguardian.com/info/2021/nov/25/talking-sense-using-machine-learning-to-understand-quotes)).
+

 ## Intended uses & limitations


 ## Training and evaluation data

+ The **[DirectQuote dataset](https://arxiv.org/abs/2110.07827)** presented by Zhang et al. (2021) is a corpus of 19,760 paragraphs containing 10,279 direct quotations. This manually annotated corpus is, per the authors, "the largest and most complete corpus focusing on direct quotations in news texts" [1].
+ **DirectQuote: distribution of data sources**
+
+ | Region      | Name                                | Count  |
+ |-------------|-------------------------------------|--------|
+ | U.S.        | Associated Press                    | 438    |
+ |             | Cable News Network                  | 627    |
+ |             | American Broadcasting Company       | 240    |
+ |             | New York Times                      | 5,642  |
+ |             | CBS Broadcasting                    | 4,890  |
+ | UK          | British Broadcasting Corporation    | 926    |
+ |             | Reuters                             | 5,836  |
+ |             | The Guardian                        | 4,302  |
+ | Canada      | The Globe and Mail                  | 1,955  |
+ |             | The Star                            | 13,769 |
+ | New Zealand | NZ Herald                           | 115    |
+ | Australia   | Australian Broadcasting Corporation | 312    |
+ |             | Sydney Morning Herald               | 93     |
+ Quote extraction and attribution appears to be an underserved area of NLP; only a small handful of systems perform this task, most notably Stanford's [CoreNLP model bundle](https://stanfordnlp.github.io/CoreNLP/quote.html) [2]. Quote Extraction and Attribution (QEA) solutions generally fall into one of two broad categories: (1) rule-based systems that identify quotation marks and the common verbiage associated with a quotation (see [Textacy QEA](https://textacy.readthedocs.io/en/latest/api_reference/extract.html#textacy.extract.triples.direct_quotations) [3]), and (2) probabilistic, model-based approaches that typically rely on LSTMs and other neural network architectures.
+
+ Existing solutions in both categories lack the speed and accuracy of newer, transformer-based systems. CoreNLP, for instance, _does not_ support GPU-optimised inference. Rule-based systems such as Textacy are significantly faster but lacking in precision (Textacy refused to process 28% of documents from a 1,000-document sample of the Whisp corpus). This issue is compounded by the wide array of quotation-mark 'styles' available within Unicode: as noted below, there are well over a dozen distinct quotation marks.
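To make the quotation-mark problem concrete, here is a minimal rule-based extractor in the spirit of category (1). This is a sketch only, not Textacy's actual implementation, and `QUOTE_PAIRS` covers just a subset of the styles discussed below:

```python
import re

# A subset of the quotation-mark pairs that appear in real-world text
# (ASCII straight quotes plus common "smart" and international styles).
QUOTE_PAIRS = {
    '"': '"',            # ASCII straight double quotes
    "\u201c": "\u201d",  # “ ” smart double quotes
    "\u2018": "\u2019",  # ‘ ’ smart single quotes
    "\u00ab": "\u00bb",  # « » guillemets (French)
    "\u201e": "\u201c",  # „ “ low-high style (German)
}

# One alternation: each branch matches an opener, a lazy body, and its closer.
_PATTERN = re.compile(
    "|".join(
        f"{re.escape(o)}([^{re.escape(o)}{re.escape(c)}]*?){re.escape(c)}"
        for o, c in QUOTE_PAIRS.items()
    )
)

def extract_quotes(text: str) -> list[str]:
    """Return the contents of every quoted span found in `text`."""
    return [g for m in _PATTERN.finditer(text) for g in m.groups() if g is not None]
```

A rule-based pass like this silently misses any style absent from its table, which is exactly the brittleness described above.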
+
+ ### Modifications to the DirectQuote Corpus
+ As the [CoreNLP documentation](https://stanfordnlp.github.io/CoreNLP/quote.html) on quote extraction and attribution (QEA) notes, there is a multitude of quotation styles (12+), any of which may appear in texts ingested by Whisp. For the reasons outlined in the introduction, it was necessary to adapt the DirectQuote dataset to represent a wider range of quotation styles.
+
+ > Considers regular ascii (“”, ‘’, ``’’, and `’) as well as “smart” and international quotation marks as follows: “”, ‘’, «», ‹›, 「」, 『』, „”, and ‚’.
+ >
+ > **From CoreNLP Docs ~ Pipeline > Quote Extraction And Attribution**
+
+ I have included 11 quotation 'sets' to replace or populate the pre-existing quotation marks in the DirectQuote dataset. These styles include both ASCII and Unicode quotation marks, with a small variety of international styles, largely those used by French- and German-speaking populations in Europe. Chinese-style quotation marks have not been included due to the limited overlap in publishing between Mandarin and English content.
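The restyling step might look something like the following sketch. The exact 11 sets used for directquote-variedStyles are not enumerated in this card, so `QUOTE_STYLES` below is an illustrative assumption, as is the function name:

```python
import random

# Illustrative subset of quotation styles; the card mentions 11 such sets,
# but this particular list is an assumption for demonstration purposes.
QUOTE_STYLES = [
    ('"', '"'),            # ASCII straight quotes
    ("\u201c", "\u201d"),  # “ ” smart quotes
    ("\u00ab", "\u00bb"),  # « » guillemets
    ("\u201e", "\u201c"),  # „ “ German style
    ("\u2039", "\u203a"),  # ‹ › single guillemets
]

def restyle_quotes(text: str, rng: random.Random) -> str:
    """Replace each ASCII double-quoted pair with one randomly sampled style.

    Assumes quotes in `text` are balanced. Every replacement is a single
    character, so character offsets are preserved.
    """
    out, close_mark = [], None
    for ch in text:
        if ch == '"':
            if close_mark is None:   # opening quote: pick a style for the pair
                open_mark, close_mark = rng.choice(QUOTE_STYLES)
                out.append(open_mark)
            else:                    # closing quote: finish the pair
                out.append(close_mark)
                close_mark = None
        else:
            out.append(ch)
    return "".join(out)
```

Because each mark is swapped one-for-one, token boundaries (and therefore any token labels aligned to them) are unaffected by the augmentation.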

 ## Training procedure

+ **Token Labels**
+ The DirectQuote corpus provides the following five labels, following the IOB1 format:
+
+ * LeftSpeaker: a quotation whose speaker appears in the preceding text
+ * RightSpeaker: a quotation whose speaker appears in the following text
+ * Unknown: a quotation with no identifiable speaker
+ * Speaker: the speaker of a quotation
+ * Out: none of the above
+
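Given per-token predictions in this scheme, decoding them into spans is straightforward. A sketch (the function name is mine, not from this repository):

```python
def group_spans(tokens, tags):
    """Collapse per-token IOB1 tags into (entity_type, tokens) spans.

    In IOB1, a chunk is a run of I-X tokens; a B-X tag appears only to
    separate two adjacent chunks of the same type X.
    """
    spans, cur_type, cur_toks = [], None, []
    for tok, tag in zip(tokens, tags):
        if tag == "O":
            prefix, etype = None, None
        else:
            prefix, etype = tag.split("-", 1)
        # start a new span on a type change, or on an explicit B- boundary
        if etype != cur_type or prefix == "B":
            if cur_type is not None:
                spans.append((cur_type, cur_toks))
            cur_type, cur_toks = etype, []
        if etype is not None:
            cur_toks.append(tok)
    if cur_type is not None:
        spans.append((cur_type, cur_toks))
    return spans
```

A LeftSpeaker span would then be attributed to the nearest preceding Speaker span, and a RightSpeaker span to the nearest following one.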
 ### Training hyperparameters

 The following hyperparameters were used during training:

 - Pytorch 1.10.2+cu113
 - Datasets 2.8.0
 - Tokenizers 0.13.2
+
+ ## References
+ [1] Zhang, Y., & Liu, Y. (2021). DirectQuote: A dataset for direct quotation extraction and attribution in news articles. arXiv. https://arxiv.org/abs/2110.07827
+
+ [2] Manning, C. D., Surdeanu, M., Bauer, J., Finkel, J., Bethard, S. J., & McClosky, D. (2014). The Stanford CoreNLP natural language processing toolkit. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations (pp. 55-60).
+
+ [3] Chartbeat Labs, & DeWilde, B. (2016). Information extraction. Textacy: NLP, before and after spaCy. https://textacy.readthedocs.io/en/latest/api_reference/extract.html#textacy.extract.triples.direct_quotations