{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Data Preparation" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "I found three fake-news classification datasets on Kaggle. All of them are in English, so to support Russian-language articles we will use wmt19-ru-en, a model trained specifically for translating news.\n", "\n", "Selected datasets:\n", "* https://www.kaggle.com/c/fake-news/data\n", "* https://www.kaggle.com/c/fakenewskdd2020/data\n", "* https://www.kaggle.com/c/classifying-the-fake-news/data" ] }, { "cell_type": "code", "execution_count": 95, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "\n", "df1_train = pd.read_csv('./data1/train.csv')" ] }, { "cell_type": "code", "execution_count": 96, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
 | id | title | author | text | label |
---|---|---|---|---|---|
0 | 0 | House Dem Aide: We Didn’t Even See Comey’s Let... | Darrell Lucus | House Dem Aide: We Didn’t Even See Comey’s Let... | 1 |
1 | 1 | FLYNN: Hillary Clinton, Big Woman on Campus - ... | Daniel J. Flynn | Ever get the feeling your life circles the rou... | 0 |
2 | 2 | Why the Truth Might Get You Fired | Consortiumnews.com | Why the Truth Might Get You Fired October 29, ... | 1 |
3 | 3 | 15 Civilians Killed In Single US Airstrike Hav... | Jessica Purkiss | Videos 15 Civilians Killed In Single US Airstr... | 1 |
4 | 4 | Iranian woman jailed for fictional unpublished... | Howard Portnoy | Print \\nAn Iranian woman has been sentenced to... | 1 |
... | ... | ... | ... | ... | ... |
20795 | 20795 | Rapper T.I.: Trump a ’Poster Child For White S... | Jerome Hudson | Rapper T. I. unloaded on black celebrities who... | 0 |
20796 | 20796 | N.F.L. Playoffs: Schedule, Matchups and Odds -... | Benjamin Hoffman | When the Green Bay Packers lost to the Washing... | 0 |
20797 | 20797 | Macy’s Is Said to Receive Takeover Approach by... | Michael J. de la Merced and Rachel Abrams | The Macy’s of today grew from the union of sev... | 0 |
20798 | 20798 | NATO, Russia To Hold Parallel Exercises In Bal... | Alex Ansary | NATO, Russia To Hold Parallel Exercises In Bal... | 1 |
20799 | 20799 | What Keeps the F-35 Alive | David Swanson | David Swanson is an author, activist, journa... | 1 |
20800 rows × 5 columns

 | text | label |
---|---|---|
0 | Get the latest from TODAY Sign up for our news... | 1 |
1 | 2d Conan On The Funeral Trump Will Be Invited... | 1 |
2 | It’s safe to say that Instagram Stories has fa... | 0 |
3 | Much like a certain Amazon goddess with a lass... | 0 |
4 | At a time when the perfect outfit is just one ... | 0 |
... | ... | ... |
4982 | The storybook romance of WWE stars John Cena a... | 0 |
4983 | The actor told friends he’s responsible for en... | 0 |
4984 | Sarah Hyland is getting real. The Modern Fami... | 0 |
4985 | Production has been suspended on the sixth and... | 0 |
4986 | A jury ruled against Bill Cosby in his sexual ... | 0 |
4986 rows × 2 columns

 | text | label |
---|---|---|
0 | House Dem Aide: We Didn’t Even See Comey’s Let... | 1 |
1 | FLYNN: Hillary Clinton, Big Woman on Campus - ... | 0 |
2 | Why the Truth Might Get You Fired.Why the Trut... | 1 |
3 | 15 Civilians Killed In Single US Airstrike Hav... | 1 |
4 | Iranian woman jailed for fictional unpublished... | 1 |
... | ... | ... |
57209 | CHICAGO TRUMP RALLY CANCELLED: Radicals And BL... | 1 |
57210 | Trump supports completion of Dakota Access Pip... | 0 |
57211 | Obama Can’t Stop Winning As New Jobs Report S... | 1 |
57212 | Turkey bank regulator dismisses 'rumors' after... | 0 |
57213 | California mayors ask for governor's support f... | 0 |
57214 rows × 2 columns

Epoch | Training Loss | Validation Loss | Accuracy |
---|---|---|---|
1 | 1.124500 | 0.655170 | 0.631423 |
2 | 0.635900 | 0.616928 | 0.696435 |
3 | 0.617400 | 0.592879 | 0.727019 |
4 | 0.591200 | 0.577941 | 0.734533 |
5 | 0.577100 | 0.564665 | 0.747466 |
6 | 0.569300 | 0.556096 | 0.749913 |
7 | 0.563200 | 0.551389 | 0.755330 |
8 | 0.559900 | 0.546756 | 0.754981 |
9 | 0.554800 | 0.544496 | 0.759000 |
10 | 0.554000 | 0.543604 | 0.760398 |
"
],
"text/plain": [
"