Tristan committed on
Commit cdf87ed
1 Parent(s): 6cf19d8

Update README.md

Files changed (1)
  1. README.md +6 -5
README.md CHANGED
@@ -32,11 +32,12 @@ set a seed for reproducibility:
  >>> # the previous text.
  >>> generator = pipeline('text-generation', model='olm/olm-gpt2-dec-2022', bad_words_ids=[[0,2]])
  >>> set_seed(42)
- >>> # This example also illustrates that sometimes our model generates
- >>> # bloggy/spammy/webb-y things, even though it gets higher evaluation results
- >>> # than the original GPT-2 accross a variety of benchmarks. See the first output.
  >>> generator("Hello, I'm a language model,", max_length=30, num_return_sequences=5)
- TODO
+ [{'generated_text': "Hello, I'm a language model, but you want to know if I have a language in that language. Is this possible? Please explain"},
+ {'generated_text': "Hello, I'm a language model, and here's some useful news for you all: The C++ API is becoming more and more popular for"},
+ {'generated_text': "Hello, I'm a language model, I'm not trying to learn or understand a new tool, my job is to be as happy as"},
+ {'generated_text': "Hello, I'm a language model, a language analyst, and a language system designer. I'm just a curious guy.\n"},
+ {'generated_text': "Hello, I'm a language model, I'm not doing anything that needs to be done for the current time (or previous)."}]
  ```

  Here is how to use this model to get the features of a given text in PyTorch:
@@ -52,7 +53,7 @@ output = model(**encoded_input)

  ## Dataset

- The model and tokenizer were trained with this [December 2022 cleaned Common Crawl dataset](TODO) plus this [December 2022 cleaned Wikipedia dataset](TODO).\
+ The model and tokenizer were trained with this [December 2022 cleaned Common Crawl dataset](https://huggingface.co/datasets/olm/olm-CC-MAIN-2022-49-sampling-ratio-olm-0.15114822547) plus this [December 2022 cleaned Wikipedia dataset](https://huggingface.co/datasets/olm/olm-wikipedia-20221220).\
  The tokenized version of these concatenated datasets is [here](https://huggingface.co/datasets/olm/olm-december-2022-tokenized-1024).\
  The datasets were created with this [repo](https://github.com/huggingface/olm-datasets).
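
For readers who want to run the generation snippet this diff touches end to end, here is a minimal, self-contained sketch. The imports and the loop at the end are assumptions based on the standard `transformers` pipeline API; the model id, `bad_words_ids`, seed, and generation arguments come from the diff itself:

```python
# Minimal sketch (assumed setup) for the text-generation snippet shown in the diff.
from transformers import pipeline, set_seed

# bad_words_ids=[[0, 2]] is taken from the README snippet; it prevents those token ids
# from appearing in the generated text.
generator = pipeline('text-generation', model='olm/olm-gpt2-dec-2022', bad_words_ids=[[0, 2]])
set_seed(42)

# Returns a list of 5 dicts with a 'generated_text' key, like the outputs added in this commit.
outputs = generator("Hello, I'm a language model,", max_length=30, num_return_sequences=5)
for out in outputs:
    print(out['generated_text'])
```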
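The context line "Here is how to use this model to get the features of a given text in PyTorch:" introduces a code block that falls outside the diff hunks; only its final line, `output = model(**encoded_input)`, is visible in the second hunk header. A plausible sketch of that block, assuming the usual model-card pattern with `AutoTokenizer`/`AutoModel` (the exact README code may differ):

```python
# Sketch of the feature-extraction block referenced by the diff (assumed, not shown in the hunks).
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained('olm/olm-gpt2-dec-2022')
model = AutoModel.from_pretrained('olm/olm-gpt2-dec-2022')

text = "Replace me by any text you'd like."            # placeholder input
encoded_input = tokenizer(text, return_tensors='pt')   # PyTorch tensors
output = model(**encoded_input)                         # output.last_hidden_state holds the features
```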
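As a usage note (not part of the commit), the tokenized dataset linked in the Dataset section can presumably be pulled with the Hugging Face `datasets` library; a sketch, where the split name is an assumption:

```python
# Sketch: loading the tokenized pretraining data linked in the Dataset section.
from datasets import load_dataset

tokenized = load_dataset("olm/olm-december-2022-tokenized-1024", split="train")
print(tokenized)            # column names and number of examples
print(tokenized[0].keys())  # per-example fields (e.g. token ids for 1024-token chunks)
```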