patrickvonplaten committed on
Commit ca39186 (1 parent: fe3d5c1)

Update README.md

  1. README.md +1 -100
README.md CHANGED
@@ -36,7 +36,7 @@ the art or competitive performance to predominant approaches.*
 
  ## Intended uses & limitations
 
- You can use the raw model for masked language modeling, but it's mostly intended to be fine-tuned on a downstream task.
+ The model is intended to be fine-tuned on a downstream task.
  See the [model hub](https://huggingface.co/models?filter=data2vec-text) to look for fine-tuned versions on a task that
  interests you.
 
@@ -44,105 +44,6 @@ Note that this model is primarily aimed at being fine-tuned on tasks that use th
  to make decisions, such as sequence classification, token classification or question answering. For tasks such as text
  generation you should look at model like GPT2.
 
- ### How to use
-
- You can use this model directly with a pipeline for masked language modeling:
-
- ```python
- >>> from transformers import pipeline
- >>> unmasker = pipeline('fill-mask', model='facebook/data2vec-text-base')
- >>> unmasker("Hello I'm a <mask> model.")
-
- [{'sequence': "<s>Hello I'm a male model.</s>",
- 'score': 0.3306540250778198,
- 'token': 2943,
- 'token_str': 'Ġmale'},
- {'sequence': "<s>Hello I'm a female model.</s>",
- 'score': 0.04655390977859497,
- 'token': 2182,
- 'token_str': 'Ġfemale'},
- {'sequence': "<s>Hello I'm a professional model.</s>",
- 'score': 0.04232972860336304,
- 'token': 2038,
- 'token_str': 'Ġprofessional'},
- {'sequence': "<s>Hello I'm a fashion model.</s>",
- 'score': 0.037216778844594955,
- 'token': 2734,
- 'token_str': 'Ġfashion'},
- {'sequence': "<s>Hello I'm a Russian model.</s>",
- 'score': 0.03253649175167084,
- 'token': 1083,
- 'token_str': 'ĠRussian'}]
- ```
-
- Here is how to use this model to get the features of a given text in PyTorch:
-
- ```python
- from transformers import AutoTokenizer, AutoModel
- tokenizer = AutoTokenizer.from_pretrained('facebook/data2vec-text-base')
- model = AutoModel.from_pretrained('facebook/data2vec-text-base')
- text = "Replace me by any text you'd like."
- encoded_input = tokenizer(text, return_tensors='pt')
- output = model(**encoded_input)
- ```
-
- ### Limitations and bias
-
- The training data used for this model contains a lot of unfiltered content from the internet, which is far from
- neutral. Therefore, the model can have biased predictions:
-
- ```python
- >>> from transformers import pipeline
- >>> unmasker = pipeline('fill-mask', model='facebook/data2vec-text-base')
- >>> unmasker("The man worked as a <mask>.")
-
- [{'sequence': '<s>The man worked as a mechanic.</s>',
- 'score': 0.08702439814805984,
- 'token': 25682,
- 'token_str': 'Ġmechanic'},
- {'sequence': '<s>The man worked as a waiter.</s>',
- 'score': 0.0819653645157814,
- 'token': 38233,
- 'token_str': 'Ġwaiter'},
- {'sequence': '<s>The man worked as a butcher.</s>',
- 'score': 0.073323555290699,
- 'token': 32364,
- 'token_str': 'Ġbutcher'},
- {'sequence': '<s>The man worked as a miner.</s>',
- 'score': 0.046322137117385864,
- 'token': 18678,
- 'token_str': 'Ġminer'},
- {'sequence': '<s>The man worked as a guard.</s>',
- 'score': 0.040150221437215805,
- 'token': 2510,
- 'token_str': 'Ġguard'}]
-
- >>> unmasker("The Black woman worked as a <mask>.")
-
- [{'sequence': '<s>The Black woman worked as a waitress.</s>',
- 'score': 0.22177888453006744,
- 'token': 35698,
- 'token_str': 'Ġwaitress'},
- {'sequence': '<s>The Black woman worked as a prostitute.</s>',
- 'score': 0.19288744032382965,
- 'token': 36289,
- 'token_str': 'Ġprostitute'},
- {'sequence': '<s>The Black woman worked as a maid.</s>',
- 'score': 0.06498628109693527,
- 'token': 29754,
- 'token_str': 'Ġmaid'},
- {'sequence': '<s>The Black woman worked as a secretary.</s>',
- 'score': 0.05375480651855469,
- 'token': 2971,
- 'token_str': 'Ġsecretary'},
- {'sequence': '<s>The Black woman worked as a nurse.</s>',
- 'score': 0.05245552211999893,
- 'token': 9008,
- 'token_str': 'Ġnurse'}]
- ```
-
- This bias will also affect all fine-tuned versions of this model.
-
  ## Training data
 
  The RoBERTa model was pretrained on the reunion of five datasets: