GPT-J-Title-Teaser-10k

gptj-title-teaser-10k
Version 1.0 / 22 December 2022

A fine-tuned version of the GPT-J-6B-8bit model for generating titles and teasers for news.

Model Details

Model Description

Test generation capabilities here: https://snipaid.tech

A GPT-J model finetuned on German-language news using a causal language modeling (CLM) objective.

GPT-J is a transformer model pretrained in a self-supervised fashion on the Pile, a very large corpus of English data. This means it was pretrained on raw texts only, with no humans labelling them in any way (which is why it can use lots of publicly available data), using an automatic process to generate inputs and labels from those texts. More precisely, it was trained to guess the next word in a sentence.

Inputs are sequences of continuous text of a certain length, and the targets are the same sequences shifted one token (a word or piece of a word) to the right. Internally, the model uses a masking mechanism to make sure the prediction for token i only uses the inputs from 1 to i and not any future tokens.
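The shifted targets and the mask described above can be illustrated with a minimal sketch in plain Python. The token ids and the helper names `shift_for_clm` and `causal_mask` are hypothetical and chosen only for illustration; this is not the model's actual implementation.

```python
def shift_for_clm(token_ids):
    """Return (inputs, targets) where targets[i] is the token that follows inputs[i]."""
    if len(token_ids) < 2:
        raise ValueError("need at least two tokens to form an input/target pair")
    return token_ids[:-1], token_ids[1:]


def causal_mask(n):
    """mask[i][j] is True when position i may look at position j, i.e. j <= i."""
    return [[j <= i for j in range(n)] for i in range(n)]


tokens = [11, 42, 7, 99]  # hypothetical token ids, not from a real tokenizer
inputs, targets = shift_for_clm(tokens)
mask = causal_mask(len(inputs))

for position, row in enumerate(mask):
    visible = [tok for tok, allowed in zip(inputs, row) if allowed]
    print(f"position {position}: sees {visible}, predicts {targets[position]}")
```

Each position only sees itself and earlier tokens, which is what keeps the next-token objective from leaking future information.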

The pretrained model learns an inner representation of the English language that can then be used to extract features useful for downstream tasks. However, the model is best at what it was pretrained for, which is generating text from a prompt. A prompt is a piece of text inserted in the input examples so that the original task can be formulated as a (masked) language modeling problem.

To fit the model to the domain of German news for the downstream task of title and teaser generation, it was finetuned on a dataset of 10,000 German news articles in a multi-task fashion. Hence the finetuned model's name derives from the model it was finetuned from (gptj), the downstream generation tasks (title, teaser) and the size of the finetuning dataset (10k).

Developed by: snipaid
Model type: gptj
Language(s) (NLP): de
License: MIT
Finetuned from model: GPT-J-6B-8bit

Uses

The model is intended for generating titles and teasers of news documents.

News document: A news story's fulltext in plain text.
Title: A few words that reflect the essence of the news story, also known as headline.
Teaser: A few sentences that spark curiosity about the "best of the rest" of the news story.

Direct Use and how to get started with the model

The model is built on GPT-J-6B-8bit to make it usable and fine-tunable on a single GPU with ~11 GB of memory. Running it requires some utility code for the 8-bit quantization and the LoRA adapters.

Here's how to get started:

Out-of-Scope Use

Misuse:

Generating and spreading misinformation
Generating content that is discriminating, violent or otherwise harmful

Use cases the model will not work well for:

Generating snippets other than title and teaser

Bias, Risks, and Limitations

The base model GPT-J was trained on the Pile, a dataset scraped from many different websites. This dataset is known to contain profanity, lewd, and otherwise abrasive language alongside certain biases. Fine-tuning does not eliminate those risks and biases. Depending upon the input, gptj-title-teaser-10k may produce socially unacceptable output. To learn more about biases in the Pile, see Sections 5 and 6 of the Pile paper.

Recommendations

When generating text with the model, please keep in mind that the statistically most likely next token or word often does not produce the most "accurate" text. Never depend upon this model to produce factually accurate output! We recommend having a human curate or filter the outputs before releasing them, both to censor undesirable content and to ensure the quality of the generated output. For further information, see the limitations and biases of GPT-J.

Training Details

Training Data

The model was finetuned on a collection of 10,000 news items scraped from different German-language online news outlets*.

* Namely: Speedweek, n-tv, Welt, Tagesspiegel, Faz, Merkur, Bild, Focus, Rp-Online, Freie Presse, Weser-Kurier, Tz, Stern, Kicker, Taz, Schwäbische Zeitung, Frankfurter Rundschau, Stuttgarter Zeitung, Abendzeitung, Donaukurier, Hessische Niedersächsische Allgemeine, Kreiszeitung, Heise Online, Augsburger Allgemeine, SPOX, Nordbayern, Offenbach Post Online, inFranken, Westfälischer Anzeiger, Tagesschau, Nordkurier, Wallstreet online, Computer Bild, Die Rheinlandpfalz, Morgenweb, Bunte, Sport1, LR-Online, Gala, Wirtschaftswoche, Chip, Brigitte, NWZ Online.

For each news item the dataset contains title, teaser and fulltext.

[
  {
    "title": ...,
    "teaser": ...,
    "fulltext": ...
  },
]

The dataset contains news items from the categories sports, politics, panorama, culture, technology, health, knowledge, cars, travel, economy and other, in equal proportions.

Training Procedure

The model was finetuned using a causal language modeling (CLM) objective for multitask finetuning.

Preprocessing

For each news item, two inputs were built by concatenation, as shown below.

f"[Text]: {item.fulltext} \n [Title]: {item.title}"
f"[Text]: {item.fulltext} \n [Teaser]: {item.teaser}"

This results in one input per task for each news item.

Note: The inserted prompt "[Text]:" marks the beginning of the news item's fulltext.
In the same manner "[Title]:" prompts the news item's title and "[Teaser]:" the news item's teaser.
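The concatenation above can be sketched as a small helper. This is only an illustration under stated assumptions: items are taken to be dictionaries with the keys shown in the dataset format ("title", "teaser", "fulltext"), and the function names `build_training_inputs` and `build_prompt` are hypothetical. The inference prompt in `build_prompt` is likewise an assumption: it mirrors the training format and stops after the task marker so that the model can continue from there.

```python
def build_training_inputs(item):
    """Return the two training inputs (title task, teaser task) for one news item."""
    text = item["fulltext"]
    return [
        f"[Text]: {text} \n [Title]: {item['title']}",
        f"[Text]: {text} \n [Teaser]: {item['teaser']}",
    ]


def build_prompt(fulltext, task):
    """Assumed inference prompt: the training format, cut off after the task marker."""
    if task not in ("Title", "Teaser"):
        raise ValueError("task must be 'Title' or 'Teaser'")
    return f"[Text]: {fulltext} \n [{task}]:"


# Placeholder item, not taken from the dataset.
example_item = {
    "title": "Beispieltitel",
    "teaser": "Beispielteaser.",
    "fulltext": "Beispieltext.",
}

training_inputs = build_training_inputs(example_item)
teaser_prompt = build_prompt(example_item["fulltext"], "Teaser")
```

With 10,000 news items this yields 20,000 training inputs, one per task and item.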

Environmental Impact

Carbon emissions were estimated using the Machine Learning Impact calculator presented in Lacoste et al. (2019).

Hardware Type: A100 SXM4
Hours used: 27h 42min
Cloud Provider: Vast.ai
Compute Region: Unknown
Carbon Emitted: ~4.79 kg CO2e
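The reported figure can be roughly reproduced as energy used times carbon intensity. The card does not state the calculator inputs, so both constants below are assumptions: 400 W is the nominal maximum power of an A100 SXM4, and 0.432 kg CO2e per kWh is an assumed grid carbon intensity, chosen here because the compute region is unknown and because it matches the reported value.

```python
# Training duration from the card: 27 h 42 min.
hours = 27 + 42 / 60

# Assumption: the GPU draws its nominal maximum power of 400 W throughout.
power_kw = 0.400

# Assumption: grid carbon intensity in kg CO2e per kWh (region is unknown).
carbon_intensity = 0.432

energy_kwh = power_kw * hours
emissions_kg = energy_kwh * carbon_intensity

print(f"{energy_kwh:.2f} kWh, {emissions_kg:.2f} kg CO2e")
```

Under these assumptions the estimate comes to about 11.08 kWh and 4.79 kg CO2e, in line with the value listed above.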

Glossary

News Document, plain text form of a news article or news item.
News Item, aka news article. A particular piece of news, usually from a journalistic source.
Snippet, a small section of text that is related to a news document.
Title, aka headline. A few words that reflect the essence of the news story.
Teaser, aka lede. A few sentences that spark curiosity about the "best of the rest" of the news story.