
Tweety Face

A finetuned language model based on GPT-2 that generates Tweets in a user's style.

Model description

Tweety Face is a transformer model finetuned from GPT-2 on Tweets from various Twitter users. It was created to generate a Tweet for a given user that mimics that user's specific writing style. It accepts a prompt naming a user and completes the text.

This finetuned model uses the smallest version of GPT-2, with 124M parameters.
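As a quick sanity check, the checkpoint can be loaded with the standard GPT-2 classes and its parameter count inspected. This is a minimal sketch assuming the published checkpoint follows the stock GPT-2 architecture:

>>> from transformers import AutoModelForCausalLM
>>> model = AutoModelForCausalLM.from_pretrained('ML-Projects-Kiel/tweetyface')
>>> model.num_parameters()  # on the order of 124M for the smallest GPT-2 variant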

Intended uses & limitations

This model was created to experiment with prompt inputs and is not intended to create real Tweets. The generated text is not a real representation of the given user's opinions, political affiliation, behaviour, etc. Do not use this model to impersonate a user.

How to use

You can use this model directly with a pipeline for text generation. Since the generation relies on some randomness, we set a seed for reproducibility:

>>> from transformers import pipeline, set_seed
>>> generator = pipeline('text-generation', model='ML-Projects-Kiel/tweetyface')
>>> set_seed(42)
>>> generator("User: elonmusk\nTweet: Twitter is", max_length=30, num_return_sequences=5)

[{'generated_text': 'User: elonmusk\nTweet: Twitter is more active than ever. Even though you can’t see your entire phone list, your'},
 {'generated_text': 'User: elonmusk\nTweet: Twitter is just in a few hours until an announcement which has been approved by President. This should be a'},
 {'generated_text': 'User: elonmusk\nTweet: Twitter is currently down to a minimum of 13 tweets per day, a decline that was significantly worse than Twitter'},
 {'generated_text': 'User: elonmusk\nTweet: Twitter is a great investment to us. Will go above his legal fees to join Twitter in many countries,'},
 {'generated_text': 'User: elonmusk\nTweet: Twitter is not doing something like this – they are not using Twitter to give out their content – other than'}]
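If more control over decoding is needed than the pipeline offers, the tokenizer and model can also be used directly. The following is a minimal sketch assuming the checkpoint works with the standard AutoTokenizer and AutoModelForCausalLM classes; the sampling parameters are illustrative, not the settings used during training:

>>> from transformers import AutoTokenizer, AutoModelForCausalLM
>>> tokenizer = AutoTokenizer.from_pretrained('ML-Projects-Kiel/tweetyface')
>>> model = AutoModelForCausalLM.from_pretrained('ML-Projects-Kiel/tweetyface')
>>> prompt = "User: elonmusk\nTweet: Twitter is"
>>> inputs = tokenizer(prompt, return_tensors='pt')
>>> outputs = model.generate(**inputs, max_length=30, do_sample=True, top_p=0.95, pad_token_id=tokenizer.eos_token_id)
>>> tokenizer.decode(outputs[0], skip_special_tokens=True)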

Training data

The training data used for this model has been released as a dataset that can be browsed here. The raw data can be found in our GitHub repository in two versions: all data on the develop branch is used for a debugging dataset, and all data on the qa branch is used for the final dataset.
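If the dataset is published on the Hugging Face Hub, it can be loaded with the datasets library. The repository id below is an assumption based on the model name and may need to be adjusted:

>>> from datasets import load_dataset
>>> ds = load_dataset('ML-Projects-Kiel/tweetyface')  # dataset id assumed; adjust if it differs
>>> ds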

Training procedure

Preprocessing

For training, all retweets (RT) were first removed. Next, the newline characters ("\n") were replaced with white spaces and all URLs were replaced with the word URL.
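The following is a small sketch of what such a cleaning step could look like; it illustrates the rules described above and is not the authors' actual preprocessing code:

>>> import re
>>> def clean_tweet(text):
...     if text.startswith('RT'):                     # drop retweets entirely
...         return None
...     text = text.replace('\n', ' ')                # newlines become white spaces
...     text = re.sub(r'https?://\S+', 'URL', text)   # URLs become the word URL
...     return text
...
>>> clean_tweet('Launching soon!\nDetails: https://example.com/info')
'Launching soon! Details: URL'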

The texts are tokenized using a byte-level version of Byte Pair Encoding (BPE), which can encode arbitrary Unicode characters.
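Because the tokenizer inherits GPT-2's byte-level BPE, tweets containing emoji or non-Latin scripts can be encoded without unknown tokens. A minimal sketch, assuming the model ships with the standard GPT-2 tokenizer:

>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained('ML-Projects-Kiel/tweetyface')
>>> tokens = tokenizer.tokenize("Tweet: rockets 🚀 to Mars")  # byte-level pieces, no <unk> tokens
>>> tokenizer.convert_tokens_to_ids(tokens)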
