---
language: "en"
thumbnail: "https://bagdeabhishek.github.io/twitterAnalysis_files/networkfin.jpg"
tags:
- India
- politics
- tweets
- BJP
- Congress
- AAP
- pytorch
- gpt2
- lm-head
- text-generation
license: "apache-2.0"
datasets:
- Twitter
- IndianPolitics
---

# Indian Political Tweets LM

## Model description

This is a GPT-2 language model with an LM head, fine-tuned on tweets crawled from handles that belong predominantly to Indian politics. For more information about the crawled data, see this [blog](https://bagdeabhishek.github.io/twitterAnalysis) post.

## Intended uses & limitations
This fine-tuned model can be used to generate tweets related to Indian politics.

#### How to use

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

tokenizer = AutoTokenizer.from_pretrained("bagdaebhishek/IndianPoliticalTweetsLM")
model = AutoModelForCausalLM.from_pretrained("bagdaebhishek/IndianPoliticalTweetsLM")

text_generator = pipeline("text-generation", model=model, tokenizer=tokenizer)

init_sentence = "India will always be"

print(text_generator(init_sentence))
```

#### Limitations and bias
1. The tweets used to train the model were not manually labelled, so the generated text may not always be in English. I cleaned the data to remove non-English tweets, but the model may still generate "Hinglish" text, so no assumptions should be made about the language of the generated output.
2. I took care to remove tweets from Twitter handles that are not very influential, but since the data is not curated by hand there may be some artefacts such as "-sent via NamoApp".
3. Like any language model trained on real-world data, this model also exhibits some biases which unfortunately are a part of the political discourse on Twitter. Please keep this in mind while using the output of this model.

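If such artefacts matter downstream, generated text can be post-filtered before use. A minimal sketch, assuming a simple regex-based cleanup (the pattern list and helper name are illustrative and not part of the model):

```python
import re

# Illustrative patterns for client-signature artefacts seen in crawled tweets;
# extend this list for your own data.
ARTEFACT_PATTERNS = [
    re.compile(r"-?\s*sent via \w+\s*$", re.IGNORECASE),
]

def strip_artefacts(text: str) -> str:
    """Remove known client-signature artefacts from the end of a generated tweet."""
    for pattern in ARTEFACT_PATTERNS:
        text = pattern.sub("", text)
    return text.strip()

print(strip_artefacts("Great rally today! -sent via NamoApp"))  # → "Great rally today!"
```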
## Training data
I used the pre-trained gpt2 model from the Hugging Face Transformers repository and fine-tuned it on a custom dataset crawled from Twitter. The method used to identify the political handles is described in detail in a [blog](https://bagdeabhishek.github.io/twitterAnalysis) post. I used tweets from both the Pro-BJP and Anti-BJP clusters mentioned in the blog.

## Training procedure

For pre-processing, I removed tweets from handles that are not very influential in their cluster. I did this by computing the eigenvector centrality of each handle in the Twitter graph and pruning handles whose centrality was below a certain threshold. The threshold was set manually after experimenting with different values.

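The pruning step above can be sketched as follows. This is a minimal illustration using power iteration on a toy adjacency matrix; the graph, the handle names, and the threshold value are all made up for the example, not taken from the actual crawl:

```python
import numpy as np

def eigenvector_centrality(adj: np.ndarray, iters: int = 100, tol: float = 1e-9) -> np.ndarray:
    """Approximate eigenvector centrality via power iteration on the adjacency matrix."""
    n = adj.shape[0]
    x = np.ones(n) / n
    for _ in range(iters):
        x_new = adj @ x
        x_new /= np.linalg.norm(x_new)
        if np.linalg.norm(x_new - x) < tol:
            x = x_new
            break
        x = x_new
    return x

# Toy undirected graph over four hypothetical handles:
# a, b, c form a triangle; d hangs off c, so d is the least central.
handles = ["handle_a", "handle_b", "handle_c", "handle_d"]
adj = np.array([
    [0, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
], dtype=float)

centrality = eigenvector_centrality(adj)
threshold = 0.3  # set manually in the original procedure
kept = [h for h, c in zip(handles, centrality) if c >= threshold]
print(kept)  # handle_d is pruned
```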
I then separated the tweets from these handles by language and trained the LM on the English tweets from both clusters.

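The language split can be approximated in several ways. The original post does not specify which detector was used, so the sketch below uses a crude script-based heuristic as a stand-in; a proper language-identification library would do better on "Hinglish" written in Latin script:

```python
def looks_english(tweet: str) -> bool:
    """Crude heuristic: treat a tweet as English if it contains no Devanagari
    characters and is mostly ASCII. This is an assumption for illustration;
    the detector used in the original pipeline is not specified."""
    if any("\u0900" <= ch <= "\u097f" for ch in tweet):  # Devanagari block
        return False
    ascii_chars = sum(1 for ch in tweet if ord(ch) < 128)
    return ascii_chars / max(len(tweet), 1) > 0.9

tweets = [
    "India will always be a great democracy",
    "भारत हमेशा महान रहेगा",
]
english_tweets = [t for t in tweets if looks_english(t)]
print(english_tweets)
```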
### Hardware
1. GPU: GTX 1080 Ti
2. CPU: Ryzen 3900X
3. RAM: 32 GB

This model took roughly 36 hours to fine-tune.