ankitkupadhyay committed on
Commit
cf016e3
1 Parent(s): 39fe891

Upload copy_of_coding_challenge_for_fatima_fellowship_1.py

copy_of_coding_challenge_for_fatima_fellowship_1.py ADDED

# -*- coding: utf-8 -*-
"""Copy of Coding Challenge for Fatima Fellowship

Automatically generated by Colaboratory.

Original file is located at
https://colab.research.google.com/drive/1fCgH-E1EykMl_2gkpitYV7fXPjNRqMcv

# Fatima Fellowship Quick Coding Challenge (Pick 1)

Thank you for applying to the Fatima Fellowship. To help us select the Fellows and assess your ability to do machine learning research, we are asking that you complete a short coding challenge. Please pick **1 of these 5** coding challenges, whichever is most aligned with your interests.

**Due date: 1 week**

**How to submit**: Please make a copy of this colab notebook, add your code and results, and submit your colab notebook to the submission link below. If you have never used a colab notebook, [check out this video](https://www.youtube.com/watch?v=i-HnvsehuSw).

**Submission link**: https://airtable.com/shrXy3QKSsO2yALd3

# 1. Deep Learning for Vision

**Upside down detector**: Train a model to detect if images are upside down

* Pick a dataset of natural images (we suggest looking at datasets on the [Hugging Face Hub](https://huggingface.co/datasets?task_categories=task_categories:image-classification&sort=downloads))
* Synthetically turn some of the images upside down. Create a training and test set.
* Build a neural network (using TensorFlow, PyTorch, or any framework you like)
* Train it to classify image orientation until a reasonable accuracy is reached
* [Upload the model to the Hugging Face Hub](https://huggingface.co/docs/hub/adding-a-model), and add a link to your model below.
* Look at some of the images that were classified incorrectly. Please explain what you might do to improve your model's performance on these images in the future (you do not need to implement these suggestions)

**Submission instructions**: Please write your code below and include some examples of images that were classified incorrectly.
"""

### WRITE YOUR CODE TO TRAIN THE MODEL HERE
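
# Not part of this submission (challenge 2 was chosen below); a minimal sketch
# of one possible approach, assuming torchvision's CIFAR-10 as the image source.
# Each image is rotated 180 degrees with probability 0.5 and labeled 1 if
# flipped, 0 otherwise; a small CNN then learns the binary orientation task.
import torch
import torch.nn as nn
import torchvision
import torchvision.transforms as T

class UpsideDownDataset(torch.utils.data.Dataset):
    """Wraps an image dataset and turns each image upside down with p=0.5."""
    def __init__(self, base):
        self.base = base

    def __len__(self):
        return len(self.base)

    def __getitem__(self, idx):
        img, _ = self.base[idx]                # the original class label is ignored
        flipped = torch.rand(1).item() < 0.5
        if flipped:
            img = torch.flip(img, dims=[1, 2])  # 180-degree rotation: flip H and W
        return img, torch.tensor(int(flipped))

base = torchvision.datasets.CIFAR10(root="data", train=True, download=True,
                                    transform=T.ToTensor())
loader = torch.utils.data.DataLoader(UpsideDownDataset(base), batch_size=64, shuffle=True)

net = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
                    nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
                    nn.Flatten(), nn.Linear(32 * 8 * 8, 2))
opt = torch.optim.Adam(net.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for imgs, labels in loader:                    # one epoch shown for illustration
    opt.zero_grad()
    loss = loss_fn(net(imgs), labels)
    loss.backward()
    opt.step()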

"""**Write up**:
* Link to the model on Hugging Face Hub:
* Include some examples of misclassified images. Please explain what you might do to improve your model's performance on these images in the future (you do not need to implement these suggestions)

# 2. Deep Learning for NLP

**Fake news classifier**: Train a text classification model to detect fake news articles!

* Download the dataset here: https://www.kaggle.com/clmentbisaillon/fake-and-real-news-dataset
* Develop an NLP model for classification that uses a pretrained language model
* Finetune your model on the dataset, and generate an AUC curve of your model on the test set of your choice.
* [Upload the model to the Hugging Face Hub](https://huggingface.co/docs/hub/adding-a-model), and add a link to your model below.
* *Answer the following question*: Look at some of the news articles that were classified incorrectly. Please explain what you might do to improve your model's performance on these news articles in the future (you do not need to implement these suggestions)
"""

### WRITE YOUR CODE TO TRAIN THE MODEL HERE

"""# **Downloading the dataset from Kaggle to my drive and setting the appropriate permissions**"""

!pip install kaggle

# upload the kaggle.json API token to the drive first

!mkdir ~/.kaggle

!cp kaggle.json ~/.kaggle/

!chmod 600 ~/.kaggle/kaggle.json

!kaggle datasets download clmentbisaillon/fake-and-real-news-dataset

!unzip fake-and-real-news-dataset.zip

"""# **Setting up the developer environment for using transformers**"""

# setting up the developer environment
!pip install datasets transformers[sentencepiece]

import os
os.environ["WANDB_MODE"] = "online"
import numpy as np
import pandas as pd
import torch
from tqdm import tqdm
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

from transformers import Trainer, TrainingArguments
from transformers import DistilBertTokenizer, DistilBertForSequenceClassification
# from transformers import (RobertaForSequenceClassification, RobertaTokenizer, AdamW)
# I wanted to try a RoBERTa model too, but Colab seemed to have reached its RAM limit.

!pip install wandb

import wandb
wandb.login()

# Commented out IPython magic to ensure Python compatibility.
# %env WANDB_PROJECT=fake_news_classifier

data_path = os.path.join("..", "content")
true_df = pd.read_csv(os.path.join(data_path, "True.csv"))
fake_df = pd.read_csv(os.path.join(data_path, "Fake.csv"))

# adding labels: 0 = fake, 1 = true
fake_df["label"] = [0] * len(fake_df)
true_df["label"] = [1] * len(true_df)

fake_df

# combine and shuffle the two datasets
df = pd.concat([true_df, fake_df]).sample(frac=1).reset_index(drop=True)

"""# **Merging title and body under one column**"""

df["body"] = df["title"] + ' ' + df['text']

"""# **Removing erroneous data elements**"""

df.drop(["title", "text", "subject", "date"], axis=1, inplace=True)

# dropping duplicates
df.drop_duplicates(inplace=True)

# dropping empty rows
df.dropna(inplace=True)

df.head(15)

"""# **Preparing the training, validation, and test sets**"""

# 60/20/20 train/validation/test split
train, valid, test = np.split(df.sample(frac=1, random_state=42), [int(.6 * len(df)), int(.8 * len(df))])
X_train, y_train = train.body, train.label
X_valid, y_valid = valid.body, valid.label
X_test, y_test = test.body, test.label

assert len(X_train) == len(y_train) and len(X_valid) == len(y_valid) and len(X_test) == len(y_test)

assert len(df) == len(X_train) + len(X_valid) + len(X_test)

len(X_train)

len(y_train)

"""# **Using the "distilbert-base-uncased" pretrained model**"""

model_name = "distilbert-base-uncased"
tokenizer = DistilBertTokenizer.from_pretrained(model_name, do_lower_case=True)
# num_labels=1 gives a single-logit head; a sigmoid is applied at evaluation time
model = DistilBertForSequenceClassification.from_pretrained(model_name, num_labels=1)

def encode_samples(samples, tokenizer, max_length):
    """
    Converts words to (BERT) tokens.
    Words can be composed of multiple tokens.

    Parameters
    ----------
    samples: list(str)
        A list of strings where each string is an article.
    tokenizer: transformers.PreTrainedTokenizer
        BERT's pre-trained tokenizer.

    Returns
    -------
    X: dict
        A dict of "input_ids" and "attention_mask" lists, where each input id
        is a sub-word index according to the `tokenizer`.
    """
    X = {"input_ids": [], "attention_mask": []}
    # encode in batches of 50 samples to keep memory usage bounded
    for i in tqdm(range(len(samples) // 50 + 1), "Encoding"):
        batch = samples[i * 50:50 * (i + 1)]
        tokens_batch = tokenizer(batch, truncation=True, padding=True, max_length=max_length)
        X["input_ids"].extend(tokens_batch.data["input_ids"])
        X["attention_mask"].extend(tokens_batch.data["attention_mask"])
    return X

max_length = 512
X_train_tokens = encode_samples(X_train.tolist(), tokenizer, max_length)
X_valid_tokens = encode_samples(X_valid.tolist(), tokenizer, max_length)
X_test_tokens = encode_samples(X_test.tolist(), tokenizer, max_length)

class KaggleNewsDataset(torch.utils.data.Dataset):
    def __init__(self, samples, labels):
        self.samples = samples
        self.labels = labels

    def __getitem__(self, idx):
        item = {k: torch.tensor(v[idx]) for k, v in self.samples.items()}
        item["labels"] = torch.tensor([self.labels[idx]], dtype=torch.float)
        return item

    def __len__(self):
        return len(self.samples["input_ids"])

train_dataset = KaggleNewsDataset(X_train_tokens, y_train.tolist())
valid_dataset = KaggleNewsDataset(X_valid_tokens, y_valid.tolist())
test_dataset = KaggleNewsDataset(X_test_tokens, y_test.tolist())

"""# **Computing metrics**"""

def compute_metrics(pred, threshold=0.5):
    labels = pred.label_ids
    # convert raw logits to probabilities, then threshold
    preds = torch.nn.Sigmoid()(torch.from_numpy(pred.predictions)) > threshold
    acc = accuracy_score(labels, preds)
    precision, recall, f1, _ = precision_recall_fscore_support(labels, preds, average='binary')
    return {
        'accuracy': acc,
        'f1': f1,
        'precision': precision,
        'recall': recall
    }

# defining epochs and batch size
epochs = 5
batch_size = 16

# emptying the GPU cache
torch.cuda.empty_cache()

trainer = Trainer(
    model=model,                                 # Transformers model to be trained
    train_dataset=train_dataset,                 # training dataset
    eval_dataset=valid_dataset,                  # validation set used as evaluation dataset
    compute_metrics=compute_metrics,             # computes metrics
    args=TrainingArguments(
        output_dir='./results',                  # output directory
        num_train_epochs=epochs,                 # number of training epochs
        per_device_train_batch_size=batch_size,  # batch size per device
        per_device_eval_batch_size=batch_size,   # batch size for evaluation
        warmup_steps=500,                        # number of warmup steps for the learning rate scheduler
        weight_decay=0.01,                       # strength of weight decay
        logging_dir='./logs',                    # directory for storing logs
        load_best_model_at_end=True,             # load the best model when finished training (default metric is loss)
        logging_steps=400,                       # log & save weights every `logging_steps`
        save_steps=400,
        evaluation_strategy="steps",             # evaluate every `logging_steps`
        report_to="wandb",
    )
)

# training the model
trainer.train()

test_predictions = trainer.predict(test_dataset)

test_predictions

test_predictions[0]

test_pred = test_predictions[0].round(2)

test_pred

# the model outputs raw logits, so sigmoid(logit) > 0.5 is equivalent to logit > 0
y_pred = np.where(test_predictions[0] > 0, 1, 0)
print(y_pred)

df_1 = pd.DataFrame(test_predictions[0], columns=['prediction'])

df_1

test_predictions[1]

df_2 = pd.DataFrame(test_predictions[1], columns=['actual'])

df_2

result = pd.concat([df_1, df_2], axis=1, join='inner')
display(result)

"""# **Reading misclassified news articles**"""

# print the test-set positions of the misclassified articles
for i in range(len(result)):
    if y_pred[i] != test_predictions[1][i]:
        print(i)

# predicted label
result.prediction[338]

# actual label
result.actual[338]

# the indices are positions within the test set, so read from X_test
for index in [338, 1657, 3838, 7601]:
    print(X_test.iloc[index])
    print("\n")

result.prediction[1657]

result.actual[1657]

result.prediction[3838]

result.actual[3838]

"""# **Plotting the confusion matrix**"""

from sklearn.metrics import confusion_matrix

cf = confusion_matrix(y_test, y_pred)

import seaborn as sns
import matplotlib.pyplot as plt
ax = sns.heatmap(cf, annot=True, cmap='Blues')
plt.show()

"""# **Generating the AUC curve**"""

from sklearn import metrics
from sklearn.metrics import roc_curve

labels = np.array(test_dataset.labels)
# convert logits to probabilities before computing the ROC curve
predictions = torch.nn.Sigmoid()(torch.from_numpy(test_predictions.predictions)).numpy()
fpr, tpr, thresholds = roc_curve(labels, predictions)
auc = metrics.roc_auc_score(y_test, predictions)

import matplotlib.pyplot as plt

# create the ROC curve
plt.plot(fpr, tpr, label="AUC=" + str(auc))
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.legend(loc=4)
plt.show()

"""# **Generating related graphs**"""

wandb.finish()

"""# **Link to the model on Hugging Face Hub:** [link to model](https://huggingface.co/ankitkupadhyay/fake_news_classifier)

# **Possible reasons for misclassification**

1. Donald Trump appears with higher frequency in the Fake dataset, so there is a possibility of bias in the news articles that talk about Donald Trump. This may be one of the reasons for misclassification.

2. News articles containing hashtags (#) are mostly present in the Fake dataset, so any true news article containing hashtags has a higher probability of being classified as fake. A quick frequency check of both hypotheses is sketched below.
"""
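
# A rough check of the two hypotheses above (a sketch, reusing the fake_df and
# true_df frames loaded earlier; the keyword and the hashtag test are assumptions).
for name, frame in [("fake", fake_df), ("true", true_df)]:
    mentions_trump = frame["title"].str.contains("Trump", case=False, na=False).mean()
    has_hashtag = frame["text"].str.contains("#", regex=False, na=False).mean()
    print(f"{name}: {mentions_trump:.1%} of titles mention Trump, "
          f"{has_hashtag:.1%} of bodies contain a hashtag")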

"""**Write up**:
* Link to the model on Hugging Face Hub:
* Include some examples of misclassified news articles. Please explain what you might do to improve your model's performance on these news articles in the future (you do not need to implement these suggestions)

# 3. Deep RL / Robotics

**RL for Classical Control:** Using any of the [classical control](https://github.com/openai/gym/blob/master/docs/environments.md#classic-control) environments from OpenAI's `gym`, implement a deep NN that learns an optimal policy which maximizes the reward of the environment.

* Describe the NN you implemented and the behavior you observe from the agent as the model converges (or diverges).
* Plot the reward as a function of steps (or epochs). Compare your results to a random agent.
* Discuss whether you think your model has learned the optimal policy, potential methods for improving it, and/or where it might fail.
* (Optional) [Upload the model to the Hugging Face Hub](https://huggingface.co/docs/hub/adding-a-model), and add a link to your model below.

You may use any frameworks you like, but you must implement your NN on your own (no pre-defined/trained models like [`stable_baselines`](https://stable-baselines.readthedocs.io/en/master/)).

You may use any simulator other than `gym`, _however_:
* The environment has to be similar to the classical control environments (or more complex, like [`robosuite`](https://github.com/ARISE-Initiative/robosuite)).
* You cannot choose a game/Atari/text-based environment. The purpose of this challenge is to demonstrate an understanding of basic kinematic/dynamic systems.
"""

### WRITE YOUR CODE TO TRAIN THE MODEL HERE
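
# Not part of this submission; a minimal REINFORCE sketch for CartPole-v1 as a
# starting point. The environment choice, network size, learning rate, and the
# classic (pre-0.26) gym API are all assumptions.
import gym
import torch
import torch.nn as nn

env = gym.make("CartPole-v1")
policy = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 2))
opt = torch.optim.Adam(policy.parameters(), lr=1e-2)
gamma = 0.99

for episode in range(500):
    obs = env.reset()
    log_probs, rewards, done = [], [], False
    while not done:
        dist = torch.distributions.Categorical(
            logits=policy(torch.as_tensor(obs, dtype=torch.float32)))
        action = dist.sample()
        log_probs.append(dist.log_prob(action))
        obs, reward, done, _ = env.step(action.item())
        rewards.append(reward)
    # discounted returns, accumulated backwards from the end of the episode
    returns, G = [], 0.0
    for r in reversed(rewards):
        G = r + gamma * G
        returns.insert(0, G)
    returns = torch.tensor(returns)
    returns = (returns - returns.mean()) / (returns.std() + 1e-8)  # variance reduction
    loss = -(torch.stack(log_probs) * returns).sum()               # policy-gradient loss
    opt.zero_grad()
    loss.backward()
    opt.step()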

"""**Write up**:
* (Optional) link to the model on Hugging Face Hub:
* Discuss whether you think your model has learned the optimal policy, potential methods for improving it, and/or where it might fail.

# 4. Theory / Linear Algebra

**Implement Contrastive PCA**: Read [this paper](https://www.nature.com/articles/s41467-018-04608-8) and implement contrastive PCA in Python.

* First, please discuss what kind of dataset it would make sense to use this method on
* Implement the method in Python (do not use previous implementations of the method if they already exist)
* Then create a synthetic dataset and apply the method to the synthetic data. Compare with standard PCA.

**Write up**: Discuss what kind of dataset it would make sense to use Contrastive PCA on
"""

### WRITE YOUR CODE HERE
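
# Not part of this submission; a short sketch of contrastive PCA following the
# paper linked above: take the top eigenvectors of C_target - alpha * C_background.
# The synthetic setup (shared background variance plus a target-only signal in
# the last dimension) and the alpha value are assumptions chosen for illustration.
import numpy as np

def contrastive_pca(target, background, alpha, n_components=2):
    """Return the top `n_components` contrastive directions."""
    c_target = np.cov(target, rowvar=False)
    c_background = np.cov(background, rowvar=False)
    # eigendecomposition of the (symmetric) contrastive covariance
    eigvals, eigvecs = np.linalg.eigh(c_target - alpha * c_background)
    order = np.argsort(eigvals)[::-1]            # largest eigenvalues first
    return eigvecs[:, order[:n_components]]

rng = np.random.default_rng(0)
d = 10
background = rng.normal(size=(500, d)) * np.linspace(3, 1, d)  # shared structure
target = rng.normal(size=(500, d)) * np.linspace(3, 1, d)
target[:250, -1] += 4.0                          # signal present only in the target data

directions = contrastive_pca(target, background, alpha=2.0)
projected = target @ directions                  # 2-D contrastive projection
print(projected[:5])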

"""# 5. Systems

**Inference on the edge**: Measure inference times in various computationally-constrained settings

* Pick a few different speech detection models (we suggest looking at models on the [Hugging Face Hub](https://huggingface.co/models?pipeline_tag=automatic-speech-recognition&sort=downloads))
* Simulate different memory constraints and CPU allocations that are realistic for edge devices that might run such models, such as smart speakers or microcontrollers, and measure the average inference time of the models under these conditions
* How does the inference time vary with (1) choice of model, (2) available system memory, (3) available CPU, and (4) size of input?

Are there any surprising discoveries? (Note that this coding challenge is fairly open-ended, so we will be considering the amount of effort invested in discovering something interesting here.)
"""

### WRITE YOUR CODE HERE
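
# Not part of this submission; a minimal timing sketch. The checkpoint below,
# CPU-only inference, thread-count limits as a stand-in for smaller CPU
# allocations, and random 16 kHz audio as input are all assumptions.
import time
import numpy as np
import torch
from transformers import pipeline

asr = pipeline("automatic-speech-recognition", model="facebook/wav2vec2-base-960h")

for n_threads in (1, 2, 4):
    torch.set_num_threads(n_threads)             # simulate a smaller CPU allocation
    for seconds in (1, 5, 10):
        audio = np.random.randn(16000 * seconds).astype(np.float32)  # dummy 16 kHz audio
        start = time.perf_counter()
        asr(audio)
        elapsed = time.perf_counter() - start
        print(f"threads={n_threads} input={seconds}s inference={elapsed:.2f}s")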

"""**Write up**: What surprising discoveries do you see?"""