---
license: mit
language:
- en
pipeline_tag: text-generation
tags:
- legal
- news
library_name: transformers
---
# GPT-Neo-1.3B SimCTG for Conditional News Generation
A [SimCTG](https://github.com/yxuansu/SimCTG) model (released by Su et al. in this [paper](https://arxiv.org/abs/2202.06417)), built on the [GPT-Neo-1.3B](https://huggingface.co/EleutherAI/gpt-neo-1.3B) large language model and fine-tuned for publisher-conditioned news generation.

## Data Details
The model was trained on a large news corpus containing roughly 3.2 million articles from 19 different publishers. The per-publisher breakdown is as follows:

|       Publisher       | Articles |
| :-------------------: | :------: |
|       Guardian        | 250,000  |
|          BBC          | 240,872  |
|    WashingtonPost     | 167,401  |
|       USAToday        | 234,648  |
|        Reuters        | 822,110  |
| NYT (New York Times)  | 245,150  |
|         CNBC          | 231,060  |
|         Hill          | 205,410  |
|        People         | 132,630  |
|          CNN          | 121,760  |
|         Vice          |  97,750  |
|       Mashable        |  91,100  |
|       Refinery        |  84,100  |
| BI (Business Insider) |  53,014  |
|      TechCrunch       |  49,040  |
|         Verge         |  48,327  |
|          TMZ          |  46,490  |
|         Axios         |  44,280  |
|          Vox          |  44,120  |

## Training Details
We use the prompt template `Publisher: {publisher} article: ` for training, where `{publisher}` is the lowercased publisher name (e.g. `vox`). The model was trained for about 3 epochs on 3 NVIDIA A40 GPUs.
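
For illustration, here is a minimal sketch of how a training example would be assembled under this template. The `format_training_example` helper and the sample article are hypothetical and not part of the released training code:

```python
def format_training_example(publisher: str, article: str, bos: str, eos: str) -> str:
    # Lowercase the publisher tag, matching the inference prompt used below.
    return f"{bos}Publisher: {publisher.lower()} article: {article}{eos}"

# GPT-Neo's tokenizer uses <|endoftext|> as both BOS and EOS; appending EOS
# to each example is an assumption about the training setup.
text = format_training_example(
    publisher="Vox",
    article="Local officials announced a new transit plan on Monday ...",
    bos="<|endoftext|>",
    eos="<|endoftext|>",
)
print(text)
```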

## How to use
```python
>>> from transformers import GPTNeoForCausalLM, AutoTokenizer

>>> tokenizer = AutoTokenizer.from_pretrained("PahaII/gpt-neo-1.3b-simctg-NewsCtrlGen")
>>> model = GPTNeoForCausalLM.from_pretrained("PahaII/gpt-neo-1.3b-simctg-NewsCtrlGen")

>>> publisher = "Reuters"
>>> assert publisher in ["Reuters", "NYT", "CNBC", "Hill", "People", "CNN", "Vice", "Mashable", "Refinery", "BI", "TechCrunch", "Verge", "TMZ", "Axios", "Vox", "Guardian", "BBCNews", "WashingtonPost", "USAToday"]
>>> prompt = f"{tokenizer.bos_token}Publisher: {publisher.lower()} article: Local police is dealing with a car accident"

>>> inputs = tokenizer(prompt, return_tensors="pt")
>>> # Contrastive search; top_k=4 is the value paired with penalty_alpha=0.6 in the
>>> # SimCTG paper, and max_new_tokens here is illustrative
>>> out = model.generate(**inputs, penalty_alpha=0.6, top_k=4, max_new_tokens=256)
>>> print(tokenizer.batch_decode(out, skip_special_tokens=True)[0])

## Publisher: reuters article: Local police is dealing with a car accident that killed two people and injured several others. The incident happened in the town of Dharamshala,
## where an SUV crashed into a truck on Sunday evening. According to eyewitnesses, the vehicle was traveling at high speed when it collided with another vehicle.
## The driver of the SUV then tried to flee the scene but could not do so due to the large number of onlookers. Police officers are now searching for the driver of the SUV who they suspect may have been driving
## under the influence of alcohol or drugs. It’s unclear what caused the crash. ... ...
```
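
The publisher tag can also be swept to compare house styles for the same lead. The loop below is a usage sketch assuming the checkpoint and decoding settings from the example above:

```python
from transformers import GPTNeoForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("PahaII/gpt-neo-1.3b-simctg-NewsCtrlGen")
model = GPTNeoForCausalLM.from_pretrained("PahaII/gpt-neo-1.3b-simctg-NewsCtrlGen")

lead = "Local police is dealing with a car accident"
for publisher in ["Reuters", "CNN", "Vox"]:
    # Same prompt format as above: BOS token, lowercased publisher tag, then the lead.
    prompt = f"{tokenizer.bos_token}Publisher: {publisher.lower()} article: {lead}"
    inputs = tokenizer(prompt, return_tensors="pt")
    out = model.generate(**inputs, penalty_alpha=0.6, top_k=4, max_new_tokens=128)
    print(tokenizer.batch_decode(out, skip_special_tokens=True)[0])
```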