dgiofre committed on
Commit
5fbd09f
1 Parent(s): 84b3db1

Update README.md

  1. README.md +172 -0
README.md CHANGED
@@ -7,3 +7,175 @@ language:
7
  library_name: transformers
8
  ---
9
 
10
+
11
+
12
+ # Model Card for budgetlongformer-diverse-base
13
+
14
+ <!-- Provide a quick summary of what the model is/does. [Optional] -->
15
+ A legal-domain language model pretrained with the Replaced Token Detection (RTD) task on the Pile-of-Law dataset, using a context window of 4096 tokens.
16
+
17
+
18
+
19
+ # Model Details
20
+
21
+ ## Model Description
22
+
23
+ <!-- Provide a longer summary of what this model is/does. -->
24
+ A legal-domain language model pretrained with the ELECTRA objective on the Pile-of-Law dataset, using a context window of 4096 tokens.
25
+
26
+ - **Developed by:** Joel Niklaus, Daniele Giofré
27
+ - **Model type:** Language model
28
+ - **Language(s) (NLP):** en
29
+ - **License:** other
30
+ - **Resources for more information:**
31
+
32
+ - [Associated Paper](https://arxiv.org/abs/2211.17135)
33
+
34
+ # Uses
35
+
36
+ <!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
37
+
38
+ ## Direct Use
39
+
40
+ <!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
41
+ <!-- If the user enters content, print that. If not, but they enter a task in the list, use that. If neither, say "more info needed." -->
42
+
43
+ The model can be used directly to generate embeddings, for example for similarity search. It likely works best on US-focused legal data.
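
A minimal sketch of the embedding use case, under stated assumptions: `MODEL_ID` is a placeholder (not the confirmed Hub path of this model), and mean pooling over non-padding tokens is one common way to turn token states into a document vector, not a method this card prescribes. `embed` needs `torch` and `transformers` installed; the two helper functions are plain Python.

```python
import math
from typing import List, Sequence

# Hypothetical repository id: replace with the actual Hub path of this model.
MODEL_ID = "your-namespace/budgetlongformer-diverse-base"


def mean_pool(token_vectors: Sequence[Sequence[float]], mask: Sequence[int]) -> List[float]:
    """Average the token vectors whose mask entry is 1 (padding is skipped)."""
    kept = [vec for vec, m in zip(token_vectors, mask) if m]
    if not kept:
        raise ValueError("mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def embed(texts: List[str], model_id: str = MODEL_ID) -> List[List[float]]:
    """Embed texts with the encoder (assumed pooling: masked mean)."""
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModel.from_pretrained(model_id)
    model.eval()
    batch = tokenizer(
        texts, padding=True, truncation=True, max_length=4096, return_tensors="pt"
    )
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state  # (batch, seq_len, hidden)
    return [
        mean_pool(h.tolist(), m.tolist())
        for h, m in zip(hidden, batch["attention_mask"])
    ]
```

For similarity search, one would call `embed` once on the corpus and once on the query, then rank corpus documents by `cosine_similarity` against the query vector.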
44
+
45
+
46
+ ## Downstream Use
47
+
48
+ <!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
49
+ <!-- If the user enters content, print that. If not, but they enter a task in the list, use that. If neither, say "more info needed." -->
50
+
51
+ The model can be fine-tuned for any NLU task or, when coupled with a decoder, for generative tasks as well. In our experiments on summarization with the BillSum dataset, we found that randomly initializing the decoder improved performance.
52
+
53
+
54
+
55
+ ## Out-of-Scope Use
56
+
57
+ <!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
58
+ <!-- If the user enters content, print that. If not, but they enter a task in the list, use that. If neither, say "more info needed." -->
59
+
60
+ This model will likely work worse on non-legal text, on non-English text, and on text originating from outside the US.
61
+
62
+
63
+ # Bias, Risks, and Limitations
64
+
65
+ <!-- This section is meant to convey both technical and sociotechnical limitations. -->
66
+
67
+ Significant research has explored bias and fairness issues with language models (see, e.g., [Sheng et al. (2021)](https://aclanthology.org/2021.acl-long.330.pdf) and [Bender et al. (2021)](https://dl.acm.org/doi/pdf/10.1145/3442188.3445922)). Predictions generated by the model may include disturbing and harmful stereotypes across protected classes; identity characteristics; and sensitive, social, and occupational groups.
68
+
69
+
70
+ ## Recommendations
71
+
72
+ <!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
73
+
74
+ As with any large LM, there is a risk of biased or unfair output. Researchers using the model should put appropriate safeguards in place to identify biased and/or toxic language.
75
+
76
+
77
+
78
+ # Training Details
79
+
80
+ ## Training Data
81
+
82
+ <!-- This should link to a Data Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
83
+
84
+ The diverse model was trained on caselaw (“Court Listener Opinions” & “Court Listener Docket Entry Documents”), legislation (“US Code”, “State Codes” & “EURLEX”) and contracts (“Atticus Contracts” & “EDGAR Contracts”) from the public Pile-of-Law dataset. To balance the training data, we limited the number of documents to 500K (this affects Court Listener Opinions, Court Listener Docket Entry Documents and EDGAR Contracts).
85
+
86
+
87
+ ## Training Procedure
88
+
89
+ <!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
90
+
91
+ ### Preprocessing
92
+
93
+ More information needed
94
+
95
+ ### Speeds, Sizes, Times
96
+
97
+ <!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
98
+
99
+ More information needed
100
+
101
+ # Evaluation
102
+
103
+ <!-- This section describes the evaluation protocols and provides the results. -->
104
+
105
+ ## Testing Data, Factors & Metrics
106
+
107
+ ### Testing Data
108
+
109
+ <!-- This should link to a Data Card if possible. -->
110
+
111
+ We tested the model on the BillSum and PubMed summarization datasets, achieving SotA ROUGE scores for the respective parameter sizes as of August 2022.
112
+
113
+
114
+ ### Factors
115
+
116
+ <!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
117
+
118
+ More information needed
119
+
120
+ ### Metrics
121
+
122
+ <!-- These are the evaluation metrics being used, ideally with a description of why. -->
123
+
124
+ Following standard practice in summarization research, we report ROUGE-1, ROUGE-2 and ROUGE-L.
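
To make the three metrics concrete, here is a simplified sketch of ROUGE-1, ROUGE-2 and ROUGE-L as F1 scores over whitespace tokens. It is an illustration only: it assumes lowercasing and whitespace tokenization, omits stemming, and is not the scoring implementation used for the reported results.

```python
from collections import Counter
from typing import List


def _ngrams(tokens: List[str], n: int) -> Counter:
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def _f1(overlap: int, cand_total: int, ref_total: int) -> float:
    """Harmonic mean of precision (overlap/cand) and recall (overlap/ref)."""
    if overlap == 0 or cand_total == 0 or ref_total == 0:
        return 0.0
    precision = overlap / cand_total
    recall = overlap / ref_total
    return 2 * precision * recall / (precision + recall)


def rouge_n(candidate: str, reference: str, n: int) -> float:
    """ROUGE-N F1: clipped n-gram overlap between candidate and reference."""
    cand = _ngrams(candidate.lower().split(), n)
    ref = _ngrams(reference.lower().split(), n)
    overlap = sum((cand & ref).values())  # Counter & keeps the minimum counts
    return _f1(overlap, sum(cand.values()), sum(ref.values()))


def _lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence, row-by-row dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(candidate: str, reference: str) -> float:
    """ROUGE-L F1 based on the longest common subsequence of tokens."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    return _f1(_lcs_length(cand, ref), len(cand), len(ref))
```

ROUGE-1 and ROUGE-2 reward shared words and word pairs, while ROUGE-L rewards shared word order without requiring the matches to be contiguous.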
125
+
126
+ ## Results
127
+
128
+ More information needed
129
+
130
+
131
+ # Environmental Impact
132
+
133
+ <!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
134
+
135
+ Carbon emissions estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
136
+
137
+ - **Hardware Type:** 4 x 16GB NVIDIA V100
138
+ - **Hours used:** 144
139
+ - **Cloud Provider:** AWS
140
+ - **Compute Region:** US East
141
+ - **Carbon Emitted:** 15.98
142
+
143
+
144
+ ## Model Architecture and Objective
145
+
146
+ We used a Longformer with an attention window of 256 for both the generator and the discriminator. The generator model was three times smaller than the discriminator model. In particular, we reduced the generator’s depth (number of hidden layers) instead of its width (embedding size, hidden size and intermediate size). We used an MLM probability of 25% for the generator.
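
The replaced-token-detection setup can be illustrated with a toy sketch. It assumes a fixed count of masked positions (25% of the sequence, rounded) rather than per-token sampling, and `generator_fill` is a hypothetical stand-in for the small generator model; neither detail is taken from the training code.

```python
import random
from typing import Callable, List, Tuple

MLM_PROBABILITY = 0.25  # masking rate stated in this card


def make_rtd_example(
    tokens: List[str],
    generator_fill: Callable[[List[str], int], str],
    rng: random.Random,
    mlm_probability: float = MLM_PROBABILITY,
) -> Tuple[List[str], List[int], List[int]]:
    """Build one discriminator training example.

    Returns (corrupted tokens, labels, masked positions). A label is 1 when
    the token differs from the original and 0 otherwise, so a generator
    guess that happens to match the original still counts as "original".
    """
    if not tokens:
        return [], [], []
    n_masked = max(1, round(len(tokens) * mlm_probability))
    masked_positions = sorted(rng.sample(range(len(tokens)), n_masked))
    corrupted = list(tokens)
    for pos in masked_positions:
        # The generator proposes a token for each masked position.
        corrupted[pos] = generator_fill(tokens, pos)
    labels = [int(c != t) for c, t in zip(corrupted, tokens)]
    return corrupted, labels, masked_positions
```

The discriminator is trained on every position of `corrupted` against `labels`, which is why this objective extracts a training signal from all tokens and not only the masked ones.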
147
+
148
+ ## Compute Infrastructure
149
+
150
+ Amazon SageMaker Notebooks.
151
+
152
+ ### Hardware
153
+
154
+ 4 x 16GB NVIDIA V100
155
+
156
+ ### Software
157
+
158
+ transformers
159
+
160
+ # Citation
161
+
162
+ <!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
163
+
164
+ **BibTeX:**
165
+
166
+ @misc{niklaus2022budgetlongformer,
167
+ title={BudgetLongformer: Can we Cheaply Pretrain a SotA Legal Language Model From Scratch?},
168
+ author={Joel Niklaus and Daniele Giofré},
169
+ year={2022},
170
+ eprint={2211.17135},
171
+ archivePrefix={arXiv},
172
+ primaryClass={cs.CL}
173
+ }
174
+
175
+
176
+
177
+ # Model Card Authors
178
+
179
+ <!-- This section provides another layer of transparency and accountability. Whose views is this model card representing? How many voices were included in its construction? Etc. -->
180
+
181
+ Joel Niklaus, Daniele Giofré