Indian Political Tweets LM
Note: This model is based on GPT2. If you want a bigger model based on GPT2-medium and fine-tuned on the same data, please take a look at the IndianPoliticalTweetsLMMedium model.
This is a GPT2 language model with an LM head, fine-tuned on tweets crawled from handles that belong predominantly to Indian politics. For more information about the crawled data, see this blog post.
This fine-tuned model can be used to generate tweets related to Indian politics.
    from transformers import AutoTokenizer, AutoModelWithLMHead, pipeline

    tokenizer = AutoTokenizer.from_pretrained("bagdaebhishek/IndianPoliticalTweetsLM")
    model = AutoModelWithLMHead.from_pretrained("bagdaebhishek/IndianPoliticalTweetsLM")

    text_generator = pipeline("text-generation", model=model, tokenizer=tokenizer)

    init_sentence = "India will always be"
    print(text_generator(init_sentence))
I used the pre-trained gpt2 model from the Hugging Face transformers repository and fine-tuned it on a custom dataset crawled from Twitter. The method used to identify the political handles is described in detail in a blog post. I used tweets from both the Pro-BJP and Anti-BJP clusters mentioned in the blog.
For pre-processing, I removed tweets from handles that are not very influential in their cluster. To do this, I calculated eigenvector centrality on the Twitter graph and pruned handles whose centrality fell below a threshold. The threshold was set manually after experimenting with different values.
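The pruning step above can be illustrated with a small sketch. This is not the code used for this model: the graph, the handle names, the undirected-edge assumption, and the threshold value of 0.5 are all hypothetical, and the centrality is computed with a plain power iteration instead of whatever library was actually used.

```python
from typing import Dict, Iterable, Set, Tuple


def eigenvector_centrality(
    edges: Iterable[Tuple[str, str]],
    max_iter: int = 200,
    tol: float = 1e-9,
) -> Dict[str, float]:
    """Power iteration on an undirected graph; scores are scaled so the maximum is 1."""
    neighbours: Dict[str, Set[str]] = {}
    for a, b in edges:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    if not neighbours:
        return {}

    scores = {node: 1.0 for node in neighbours}
    for _ in range(max_iter):
        # Iterate with (A + I) instead of A: same eigenvectors, but the shift
        # avoids oscillation on bipartite graphs.
        updated = {
            node: scores[node] + sum(scores[n] for n in neighbours[node])
            for node in neighbours
        }
        largest = max(updated.values())
        updated = {node: value / largest for node, value in updated.items()}
        delta = max(abs(updated[node] - scores[node]) for node in neighbours)
        scores = updated
        if delta < tol:
            break
    return scores


def prune_handles(scores: Dict[str, float], threshold: float) -> Set[str]:
    """Keep only handles whose centrality is at or above the threshold."""
    return {handle for handle, score in scores.items() if score >= threshold}


# Hypothetical toy graph: one well-connected handle and three others.
example_edges = [
    ("handle_hub", "handle_a"),
    ("handle_hub", "handle_b"),
    ("handle_hub", "handle_c"),
    ("handle_a", "handle_b"),
]
scores = eigenvector_centrality(example_edges)
kept = prune_handles(scores, threshold=0.5)  # threshold chosen for illustration only
```

In this toy graph, `handle_c` is connected only to the hub and ends up with the lowest score, so it is the one that gets pruned. On a real follower graph the edges would likely be directed, which changes how the centrality should be computed.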
I then separated the tweets from these handles by language and trained the LM on the English tweets from both clusters.
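How the language split was done is not stated here. One possible approach, sketched below, is to group tweets by the `lang` field that Twitter attaches to each tweet object. The sample tweets and the function name `split_by_language` are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def split_by_language(tweets: Iterable[dict]) -> Dict[str, List[str]]:
    """Group tweet texts by their language code; 'und' marks an undetermined language."""
    by_lang: Dict[str, List[str]] = defaultdict(list)
    for tweet in tweets:
        by_lang[tweet.get("lang", "und")].append(tweet["text"])
    return dict(by_lang)


# Hypothetical sample data shaped like simplified tweet objects.
sample_tweets = [
    {"text": "Parliament session begins today.", "lang": "en"},
    {"text": "आज संसद का सत्र शुरू हो रहा है।", "lang": "hi"},
    {"text": "New policy announced this morning.", "lang": "en"},
    {"text": "#update"},
]
grouped = split_by_language(sample_tweets)
english_tweets = grouped.get("en", [])  # only these would go into LM fine-tuning
```

If the crawled data does not carry a `lang` field, a language-identification library would be needed instead, which this sketch does not cover.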
This model took roughly 36 hours to fine-tune.