zlucia committed on
Commit 4794aa8 · 1 Parent(s): 33eb38c

Add usage examples

Files changed (1)
  1. README.md +72 -1
README.md CHANGED
@@ -1,6 +1,8 @@
  ---
  language:
  - en
+ datasets:
+ - pile-of-law/pile-of-law
  pipeline_tag: fill-mask
  ---

@@ -8,14 +10,83 @@ pipeline_tag: fill-mask
  Pretrained model on English language legal and administrative text using the [RoBERTa](https://arxiv.org/abs/1907.11692) pretraining objective.

  ## Model description
- Pile of Law BERT large is a transformers model with the [BERT large model (uncased)](https://huggingface.co/bert-large-uncased) architecture pretrained on the Pile of Law, a dataset consisting of ~256GB of English language legal and administrative text for language model pretraining.
+ Pile of Law BERT large is a transformers model with the [BERT large model (uncased)](https://huggingface.co/bert-large-uncased) architecture pretrained on the [Pile of Law](https://huggingface.co/datasets/pile-of-law/pile-of-law), a dataset consisting of ~256GB of English language legal and administrative text for language model pretraining.
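
One way to check the architecture claim above is to load the checkpoint's configuration. This is only a minimal sketch; the values in the comment are assumptions carried over from bert-large-uncased, not taken from this card:

```python
from transformers import AutoConfig

# Expected to mirror the BERT large hyperparameters
# (roughly 24 layers, hidden size 1024, 16 attention heads).
config = AutoConfig.from_pretrained('pile-of-law/legalbert-large-1.7M-1')
print(config.num_hidden_layers, config.hidden_size, config.num_attention_heads)
```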

  ## Intended uses & limitations
  You can use the raw model for masked language modeling or fine-tune it for a downstream task. Since this model was pretrained on an English language legal and administrative text corpus, legal downstream tasks will likely be more in-domain for this model.

  ## How to use
+ You can use the model directly with a pipeline for masked language modeling:
+ ```python
+ >>> from transformers import pipeline
+ >>> pipe = pipeline(task='fill-mask', model='pile-of-law/legalbert-large-1.7M-1')
+ >>> pipe("An [MASK] is a request made after a trial by a party that has lost on one or more issues that a higher court review the decision to determine if it was correct.")
+
+ [{'sequence': 'an appeal is a request made after a trial by a party that has lost on one or more issues that a higher court review the decision to determine if it was correct.',
+ 'score': 0.6343119740486145,
+ 'token': 1151,
+ 'token_str': 'appeal'},
+ {'sequence': 'an objection is a request made after a trial by a party that has lost on one or more issues that a higher court review the decision to determine if it was correct.',
+ 'score': 0.10488124936819077,
+ 'token': 3542,
+ 'token_str': 'objection'},
+ {'sequence': 'an application is a request made after a trial by a party that has lost on one or more issues that a higher court review the decision to determine if it was correct.',
+ 'score': 0.0708756372332573,
+ 'token': 1999,
+ 'token_str': 'application'},
+ {'sequence': 'an example is a request made after a trial by a party that has lost on one or more issues that a higher court review the decision to determine if it was correct.',
+ 'score': 0.02558572217822075,
+ 'token': 3677,
+ 'token_str': 'example'},
+ {'sequence': 'an action is a request made after a trial by a party that has lost on one or more issues that a higher court review the decision to determine if it was correct.',
+ 'score': 0.013266939669847488,
+ 'token': 1347,
+ 'token_str': 'action'}]
+ ```
+
+ Here is how to use this model to get the features of a given text in PyTorch:
+
+ ```python
+ from transformers import BertTokenizer, BertModel
+ tokenizer = BertTokenizer.from_pretrained('pile-of-law/legalbert-large-1.7M-1')
+ model = BertModel.from_pretrained('pile-of-law/legalbert-large-1.7M-1')
+ text = "Replace me by any text you'd like."
+ encoded_input = tokenizer(text, return_tensors='pt')
+ output = model(**encoded_input)
+ ```
+
+ and in TensorFlow:
+
+ ```python
+ from transformers import BertTokenizer, TFBertModel
+ tokenizer = BertTokenizer.from_pretrained('pile-of-law/legalbert-large-1.7M-1')
+ model = TFBertModel.from_pretrained('pile-of-law/legalbert-large-1.7M-1')
+ text = "Replace me by any text you'd like."
+ encoded_input = tokenizer(text, return_tensors='tf')
+ output = model(encoded_input)
+ ```
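
The intended-use note above also mentions fine-tuning on a downstream task. The following is only a rough sketch of that path using the generic transformers sequence-classification API: the two-label task, example text, and label are placeholders, and the classification head is newly initialized rather than part of this checkpoint.

```python
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained('pile-of-law/legalbert-large-1.7M-1')
# num_labels=2 is a placeholder for a hypothetical binary classification task;
# the classification head is randomly initialized on top of the pretrained encoder.
model = BertForSequenceClassification.from_pretrained(
    'pile-of-law/legalbert-large-1.7M-1', num_labels=2
)

# A single toy example stands in for a real labeled dataset.
batch = tokenizer("The lessee shall maintain the premises in good repair.", return_tensors='pt')
labels = torch.tensor([1])

outputs = model(**batch, labels=labels)
outputs.loss.backward()  # a real run would use an optimizer loop or the Trainer API
```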

  ## Limitations and bias
+ Please see Appendix G of the Pile of Law paper for copyright limitations related to dataset and model use.
+
+ This model can have biased predictions. In the following masked language modeling example, when the model fills in the race descriptor in a description of a robbery suspect, it assigns a higher score to "black" than to "white".
+
+ ```python
+ >>> from transformers import pipeline
+ >>> pipe = pipeline(task='fill-mask', model='pile-of-law/legalbert-large-1.7M-1')
+ >>> pipe("The clerk described the robber as a “thin [MASK] male, about six foot tall, wearing a gray hoodie, blue jeans", targets=["black", "white"])
+
+ [{'sequence': 'the clerk described the robber as a thin black male, about six foot tall, wearing a gray hoodie, blue jeans',
+ 'score': 0.0013972163433209062,
+ 'token': 4311,
+ 'token_str': 'black'},
+ {'sequence': 'the clerk described the robber as a thin white male, about six foot tall, wearing a gray hoodie, blue jeans',
+ 'score': 0.0009401230490766466,
+ 'token': 4249,
+ 'token_str': 'white'}]
+ ```
+
+ This bias will also affect all fine-tuned versions of this model.

  ## Training data
  The Pile of Law BERT large model was pretrained on the Pile of Law, a dataset consisting of ~256GB of English language legal and administrative text for language model pretraining. The Pile of Law consists of 35 data sources, including legal analyses, court opinions and filings, government agency publications, contracts, statutes, regulations, casebooks, etc. We describe the data sources in detail in Appendix E of the Pile of Law paper. The Pile of Law dataset is placed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International license.
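
For reference, the pretraining corpus is itself hosted on the Hub and can be streamed rather than downloaded in full. A minimal sketch, assuming the dataset's "all" configuration and the "text" field listed on the dataset card; recent versions of datasets may also require trust_remote_code=True, since the dataset ships a loading script:

```python
from datasets import load_dataset

# Stream the ~256GB corpus instead of materializing it locally.
# "all" and the "text" field are assumptions taken from the dataset card.
pol = load_dataset("pile-of-law/pile-of-law", "all", split="train", streaming=True)

first_doc = next(iter(pol))
print(first_doc["text"][:500])
```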