{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "lawNHLqffR_m"
},
"source": [
"# SCC0633/SCC5908 - Processamento de Linguagem Natural\n",
"> **Docente:** Thiago Alexandre Salgueiro Pardo \\\n",
"> **Estagiário PAE:** Germano Antonio Zani Jorge\n",
"\n",
"\n",
"# Integrantes do Grupo: GPTrouxas\n",
"> André Guarnier De Mitri - 11395579 \\\n",
"> Daniel Carvalho - 10685702 \\\n",
"> Fernando - 11795342 \\\n",
"> Lucas Henrique Sant'Anna - 10748521 \\\n",
"> Magaly L Fujimoto - 4890582"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Abordagem Estatístico\n",
"A arquitetura da solução estatística/neural envolve duas abordagens que\n",
"serão descritas neste documento. A primeira abordagem envolve utilizar\n",
"TF-IDF e Naive Bayes. E a segunda abordagem irá utilizar Word2Vec e um\n",
"modelo transformers pré-treinado da família BERT, realizando finetuning do\n",
"modelo.\n",
"\n",
"Na primeira abordagem, utilizaremos o TF-IDF, que leva em consideração a\n",
"frequência de ocorrência dos termos em um corpus e gera uma sequência de\n",
"vetores que serão fornecidos ao Naive Bayes para classificação da review como\n",
"positiva ou negativa.\n",
"\n",
"\n",
"Na segunda abordagem, utilizaremos o Word2Vec para vetorizar as reviews.\n",
"Após dividir em treino e teste, faremos o fine tuning de um modelo do tipo BERT\n",
"para o nosso problema e dataset específico. Com o BERT adaptado, faremos a\n",
"classificação de nossos textos, medindo o seu desempenho com F1 score e\n",
"acurácia.\n",
"\n",
"![alt text](../imagens/BERT_TDIDF.png)"
]
},
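{
"cell_type": "markdown",
"metadata": {},
"source": [
"For reference, the standard TF-IDF weight of a term $t$ in a document $d$ is\n",
"\n",
"$$\\mathrm{tfidf}(t, d) = \\mathrm{tf}(t, d) \\cdot \\log \\frac{N}{\\mathrm{df}(t)},$$\n",
"\n",
"where $\\mathrm{tf}(t, d)$ is the number of occurrences of $t$ in $d$, $N$ is\n",
"the number of documents, and $\\mathrm{df}(t)$ is the number of documents\n",
"containing $t$. Note that scikit-learn's `TfidfVectorizer` computes a\n",
"smoothed variant by default, $\\log\\frac{1 + N}{1 + \\mathrm{df}(t)} + 1$, and\n",
"L2-normalizes each document vector."
]
},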
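{
"cell_type": "markdown",
"metadata": {},
"source": [
"The second approach is implemented later in this notebook. As a minimal,\n",
"self-contained sketch of what it involves (not the exact implementation),\n",
"the cell below trains Word2Vec embeddings with `gensim` and fine-tunes a\n",
"BERT-family model with the Hugging Face `transformers` and `datasets`\n",
"libraries. The model name (`bert-base-uncased`) and all hyperparameters are\n",
"illustrative assumptions."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Hedged sketch of the second approach; model name and hyperparameters are\n",
"# illustrative assumptions, not the values used later in this notebook.\n",
"import pandas as pd\n",
"from sklearn.model_selection import train_test_split\n",
"from gensim.models import Word2Vec\n",
"from datasets import Dataset\n",
"from transformers import (\n",
"    AutoModelForSequenceClassification,\n",
"    AutoTokenizer,\n",
"    Trainer,\n",
"    TrainingArguments,\n",
")\n",
"\n",
"df = pd.read_csv('../data/imdb_reviews.csv')\n",
"df['label'] = df['sentiment'].map({'negative': 0, 'positive': 1})\n",
"train_df, test_df = train_test_split(df, test_size=0.2, random_state=12)\n",
"\n",
"# Word2Vec expects tokenized sentences; whitespace tokens suffice for a\n",
"# sketch. The embeddings could feed a separate classifier; shown here only\n",
"# to illustrate the vectorization step.\n",
"w2v = Word2Vec(sentences=[t.split() for t in train_df['review']],\n",
"               vector_size=100, window=5, min_count=2, workers=4)\n",
"\n",
"# Fine-tune a pre-trained BERT-family model for binary sentiment classification\n",
"tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')\n",
"model = AutoModelForSequenceClassification.from_pretrained(\n",
"    'bert-base-uncased', num_labels=2)\n",
"\n",
"def tokenize(batch):\n",
"    return tokenizer(batch['review'], truncation=True, max_length=256,\n",
"                     padding='max_length')\n",
"\n",
"train_ds = Dataset.from_pandas(train_df[['review', 'label']]).map(tokenize, batched=True)\n",
"test_ds = Dataset.from_pandas(test_df[['review', 'label']]).map(tokenize, batched=True)\n",
"\n",
"trainer = Trainer(\n",
"    model=model,\n",
"    args=TrainingArguments(output_dir='bert_imdb', num_train_epochs=2,\n",
"                           per_device_train_batch_size=16),\n",
"    train_dataset=train_ds,\n",
"    eval_dataset=test_ds,\n",
")\n",
"trainer.train()"
]
},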
{
"cell_type": "markdown",
"metadata": {
"id": "vfP54aryxZBg"
},
"source": [
"\n",
"## # Etapas da Abordagem Estatística\n",
"\n",
"1. **Bibliotecas**: Importamos as bibliotecas necessárias, considerando pandas para manipulação de dados, train_test_split para dividir o conjunto de dados em conjuntos de treinamento e teste, TfidfVectorizer para vetorização de texto usando TF-IDF, MultinomialNB para implementar o classificador Naive Bayes Multinomial e algumas métricas de avaliação.\n",
"\n",
"2. **Conjunto de dados**: Carregar o conjunto de dados e armazená-lo em um dataframe usando pandas.\n",
"\n",
"3. **Dividir o conjunto de dados**: Usamos `train_test_split` para dividir o DataFrame em conjuntos de treinamento e teste.\n",
"\n",
"4. **TF-IDF**: Usamos `TfidfVectorizer` para converter as revisões de texto em vetores numéricos usando a técnica TF-IDF. Em seguida, ajustamos e transformamos tanto o conjunto de treinamento quanto o conjunto de teste.\n",
"\n",
"5. **Naive Bayes**: Treinamos um classificador Naive Bayes Multinomial e usamos o modelo treinado para prever os sentimentos no conjunto de teste usando `predict`.\n",
"\n",
"6. **Avaliação e Resultados**: Salvamos os resultados em um novo dataframe `results_df` contendo as revisões do conjunto de teste, os sentimentos originais e os sentimentos previstos pelo modelo. Além disso, avaliamos o modelo verificando algumas métricas e a matriz de confusão.\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "TbLraa4UhWDJ"
},
"source": [
"\n",
"## # Baixando, Carregando os dados e Pré Processamento\n",
"\n",
"1. Transformar todos os textos em lowercase \\\\\n",
"2. Remoção de caracteres especiais \\\\\n",
"3. Remoção de stop words \\\\\n",
"4. Lematização (Lemmatization) \\\\\n",
"5. Tokenização \\\\"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"id": "bIWmIe0qfTbE"
},
"outputs": [],
"source": [
"import pandas as pd"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 206
},
"id": "Wf0n2yPdAn4C",
"outputId": "37eb3c4d-40c1-41a0-9b1a-d93ed6e272f3"
},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
"
\n",
"
\n",
"
review
\n",
"
sentiment
\n",
"
\n",
" \n",
" \n",
"
\n",
"
0
\n",
"
One of the other reviewers has mentioned that ...
\n",
"
positive
\n",
"
\n",
"
\n",
"
1
\n",
"
A wonderful little production. <br /><br />The...
\n",
"
positive
\n",
"
\n",
"
\n",
"
2
\n",
"
I thought this was a wonderful way to spend ti...
\n",
"
positive
\n",
"
\n",
"
\n",
"
3
\n",
"
Basically there's a family where a little boy ...
\n",
"
negative
\n",
"
\n",
"
\n",
"
4
\n",
"
Petter Mattei's \"Love in the Time of Money\" is...
\n",
"
positive
\n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" review sentiment\n",
"0 One of the other reviewers has mentioned that ... positive\n",
"1 A wonderful little production.
The... positive\n",
"2 I thought this was a wonderful way to spend ti... positive\n",
"3 Basically there's a family where a little boy ... negative\n",
"4 Petter Mattei's \"Love in the Time of Money\" is... positive"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"db = pd.read_csv('../data/imdb_reviews.csv')\n",
"db.head(5)"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "6PlfPScGMF1_",
"outputId": "2a0bd4a1-e22a-429d-82a4-5984eeab7b9d"
},
"outputs": [
{
"data": {
"text/plain": [
"sentiment\n",
"positive 25000\n",
"negative 25000\n",
"Name: count, dtype: int64"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"db['sentiment'].value_counts()"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "Kev0EaSmMa4N",
"outputId": "eab73a61-ba36-4d72-e4f2-82236f9f2880"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Quantidade de valores faltantes para cada variável do dataset:\n",
"review 0\n",
"sentiment 0\n",
"dtype: int64\n"
]
}
],
"source": [
"valores_ausentes = db.isnull().sum(axis=0)\n",
"print('Quantidade de valores faltantes para cada variável do dataset:')\n",
"print(valores_ausentes)"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 276
},
"id": "1AI3rN0KMuUq",
"outputId": "7ea5c91b-362e-49eb-82a7-6e8535f0e591"
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"[nltk_data] Downloading package stopwords to\n",
"[nltk_data] C:\\Users\\andre\\AppData\\Roaming\\nltk_data...\n",
"[nltk_data] Package stopwords is already up-to-date!\n",
"[nltk_data] Downloading package wordnet to\n",
"[nltk_data] C:\\Users\\andre\\AppData\\Roaming\\nltk_data...\n",
"[nltk_data] Package wordnet is already up-to-date!\n"
]
},
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
"
\n",
"
\n",
"
review
\n",
"
sentiment
\n",
"
\n",
" \n",
" \n",
"
\n",
"
0
\n",
"
one reviewer mentioned watching 1 oz episode h...
\n",
"
positive
\n",
"
\n",
"
\n",
"
1
\n",
"
wonderful little production filming technique ...
\n",
"
positive
\n",
"
\n",
"
\n",
"
2
\n",
"
thought wonderful way spend time hot summer we...
\n",
"
positive
\n",
"
\n",
"
\n",
"
3
\n",
"
basically family little boy jake think zombie ...
\n",
"
negative
\n",
"
\n",
"
\n",
"
4
\n",
"
petter mattei love time money visually stunnin...
\n",
"
positive
\n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" review sentiment\n",
"0 one reviewer mentioned watching 1 oz episode h... positive\n",
"1 wonderful little production filming technique ... positive\n",
"2 thought wonderful way spend time hot summer we... positive\n",
"3 basically family little boy jake think zombie ... negative\n",
"4 petter mattei love time money visually stunnin... positive"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import re\n",
"import nltk\n",
"from nltk.corpus import stopwords\n",
"from nltk.stem import PorterStemmer\n",
"from nltk.stem import WordNetLemmatizer\n",
"\n",
"def lowercase_text(text):\n",
" return text.lower()\n",
"\n",
"def remove_html(text):\n",
" return re.sub(r'<[^<]+?>', '', text)\n",
"\n",
"def remove_url(text):\n",
" return re.sub(r'http[s]?://\\S+|www\\.\\S+', '', text)\n",
"\n",
"def remove_punctuations(text):\n",
" tokens_list = '!\"#$%&\\'()*+,-./:;<=>?@[\\\\]^_`{|}~'\n",
" for char in text:\n",
" if char in tokens_list:\n",
" text = text.replace(char, ' ')\n",
"\n",
" return text\n",
"\n",
"def remove_emojis(text):\n",
" emojis = re.compile(\"[\"\n",
" u\"\\U0001F600-\\U0001F64F\"\n",
" u\"\\U0001F300-\\U0001F5FF\"\n",
" u\"\\U0001F680-\\U0001F6FF\"\n",
" u\"\\U0001F1E0-\\U0001F1FF\"\n",
" u\"\\U00002500-\\U00002BEF\"\n",
" u\"\\U00002702-\\U000027B0\"\n",
" u\"\\U00002702-\\U000027B0\"\n",
" u\"\\U000024C2-\\U0001F251\"\n",
" u\"\\U0001f926-\\U0001f937\"\n",
" u\"\\U00010000-\\U0010ffff\"\n",
" u\"\\u2640-\\u2642\"\n",
" u\"\\u2600-\\u2B55\"\n",
" u\"\\u200d\"\n",
" u\"\\u23cf\"\n",
" u\"\\u23e9\"\n",
" u\"\\u231a\"\n",
" u\"\\ufe0f\"\n",
" u\"\\u3030\"\n",
" \"]+\", re.UNICODE)\n",
"\n",
" text = re.sub(emojis, '', text)\n",
" return text\n",
"\n",
"def remove_stop_words(text):\n",
" stop_words = stopwords.words('english')\n",
" new_text = ''\n",
" for word in text.split():\n",
" if word not in stop_words:\n",
" new_text += ''.join(f'{word} ')\n",
"\n",
" return new_text.strip()\n",
"\n",
"def lem_words(text):\n",
" lemma = WordNetLemmatizer()\n",
" new_text = ''\n",
" for word in text.split():\n",
" new_text += ''.join(f'{lemma.lemmatize(word)} ')\n",
"\n",
" return new_text\n",
"\n",
"def preprocess_text(text):\n",
" text = lowercase_text(text)\n",
" text = remove_html(text)\n",
" text = remove_url(text)\n",
" text = remove_punctuations(text)\n",
" text = remove_emojis(text)\n",
" text = remove_stop_words(text)\n",
" text = lem_words(text)\n",
"\n",
" return text\n",
"\n",
"nltk.download('stopwords')\n",
"nltk.download('wordnet')\n",
"db['review'] = db['review'].apply(preprocess_text)\n",
"db.head()"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "QgufZpgHnPa4"
},
"source": [
"# **Conjunto de Treino e teste**"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"id": "s0lJ6Q0tnPka"
},
"outputs": [],
"source": [
"from sklearn.model_selection import train_test_split\n",
"\n",
"X= db['review']\n",
"y= db['sentiment']\n",
"\n",
"X_train, X_test, y_train, y_test = train_test_split(X, y, test_size= 0.2, random_state= 12)"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "nz4erCEJuD4-",
"outputId": "88d57536-66e7-4d9b-e016-bf40183d4c45"
},
"outputs": [
{
"data": {
"text/plain": [
"35235 disagree people saying lousy horror film good ...\n",
"36936 husband wife doctor team carole nile nelson mo...\n",
"46486 like cast pretty much however story sort unfol...\n",
"27160 movie awful bad bear expend anything word avoi...\n",
"19490 purchased blood castle dvd ebay buck knowing s...\n",
" ... \n",
"36482 strange thing see film scene work rather weakl...\n",
"40177 saw cheap dvd release title entity force since...\n",
"19709 one peculiar oft used romance movie plot one s...\n",
"38555 nothing positive say meandering nonsense huffi...\n",
"14155 low moment life bewildered depressed sitting r...\n",
"Name: review, Length: 40000, dtype: object"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"X_train"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "6LX-6e-QlioJ"
},
"source": [
"# **TD-IDF e Naive Bayes**"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"id": "gscB9-obNusA"
},
"outputs": [],
"source": [
"from sklearn.metrics import confusion_matrix,classification_report\n",
"from sklearn.feature_extraction.text import TfidfVectorizer\n",
"from sklearn.preprocessing import StandardScaler as encoder\n",
"from sklearn.metrics import (\n",
" accuracy_score,\n",
" confusion_matrix,\n",
" ConfusionMatrixDisplay,\n",
" f1_score,\n",
")\n",
"\n",
"\n",
"tfidf = TfidfVectorizer()\n",
"tfidf_train = tfidf.fit_transform(X_train)\n",
"tfidf_test = tfidf.transform(X_test)\n",
"\n",
"from sklearn.naive_bayes import MultinomialNB\n",
"\n",
"naive_bayes = MultinomialNB()\n",
"\n",
"naive_bayes.fit(tfidf_train, y_train)\n",
"y_pred = naive_bayes.predict(tfidf_test)\n",
"\n",
"\n"
]
},
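{
"cell_type": "markdown",
"metadata": {},
"source": [
"The cells that follow present the actual results. As a compact reference for\n",
"step 6, the metrics imported above can be applied to `y_pred` as sketched\n",
"below; the layout of `results_df` here is an assumption of how the described\n",
"DataFrame might be built."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch of step 6: assemble a results DataFrame and compute metrics.\n",
"# The results_df layout is an illustrative assumption.\n",
"import matplotlib.pyplot as plt\n",
"\n",
"results_df = pd.DataFrame({'review': X_test,\n",
"                           'original': y_test,\n",
"                           'predicted': y_pred})\n",
"\n",
"print('Accuracy:', accuracy_score(y_test, y_pred))\n",
"print('F1 score:', f1_score(y_test, y_pred, pos_label='positive'))\n",
"\n",
"# Confusion matrix with rows/columns ordered by the classifier's classes\n",
"cm = confusion_matrix(y_test, y_pred, labels=naive_bayes.classes_)\n",
"ConfusionMatrixDisplay(cm, display_labels=naive_bayes.classes_).plot()\n",
"plt.show()"
]
},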
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 206
},
"id": "RfJ7AHMZvAb8",
"outputId": "685701e1-b1e8-47fb-9dc5-1bc04dd3894b"
},
"outputs": [
{
"data": {
"text/html": [
"