{
 "cells": [
  {
   "metadata": {},
   "cell_type": "markdown",
   "source": [
    "# OptimAbstract\n",
    "OptimAbstract builds a meta-model on top of T5 that chooses which T5 variant to use according to the complexity of the text to summarize.\n",
    "\n",
    "During the learning phase:\n",
    "1. Find features that capture text complexity and are cheap to compute.\n",
    "2. Run the candidate models on each text and select the best one with respect to a fixed criterion (BERTScore).\n",
    "3. Fit a classifier that predicts the best model from the features.\n",
    "\n",
    "At inference time, run the classifier on the features of the input text and summarize with the model it selects."
   ],
   "id": "bec99a2ab93de91b"
  },
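  {
   "metadata": {},
   "cell_type": "markdown",
   "source": [
    "The steps above can be sketched with toy stand-ins. This is only an illustration under stated assumptions: `toy_features` and `fit_nearest_neighbour` are hypothetical helpers, the labels are made up, and the 1-nearest-neighbour rule merely stands in for the real classifier.\n",
    "\n",
    "```python\n",
    "def toy_features(text):\n",
    "    # Cheap complexity proxies (hypothetical): word count and mean word length.\n",
    "    words = text.split()\n",
    "    return [len(words), sum(len(w) for w in words) / max(len(words), 1)]\n",
    "\n",
    "\n",
    "def fit_nearest_neighbour(X, y):\n",
    "    # Toy 1-nearest-neighbour classifier standing in for a real one.\n",
    "    def predict(x):\n",
    "        distances = [sum((a - b) ** 2 for a, b in zip(x, row)) for row in X]\n",
    "        return y[distances.index(min(distances))]\n",
    "\n",
    "    return predict\n",
    "\n",
    "\n",
    "# Learning phase: features plus the label of the best model for each text.\n",
    "train_texts = ['short text', 'a much longer text with many more words in it']\n",
    "best_models = ['small', 'large']  # made-up labels; step 2 would provide them\n",
    "predict = fit_nearest_neighbour([toy_features(t) for t in train_texts], best_models)\n",
    "\n",
    "# Inference: predict which model to use from the features of a new text.\n",
    "print(predict(toy_features('tiny input')))\n",
    "```\n",
    "\n",
    "In the notebook itself, these roles are played by `extract_features`, `get_best_model` and the classifier passed to `MetaModel`."
   ],
   "id": "sketch-meta-model-pipeline"
  },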
  {
   "metadata": {},
   "cell_type": "code",
   "source": [
    "import numpy as np\n",
    "import pandas as pd\n",
    "from bert_score import score\n",
    "from datasets import load_dataset\n",
    "from sklearn.ensemble import RandomForestClassifier\n",
    "\n",
    "# Project-local helpers defined in the model module\n",
    "from model import MetaModel, T5Model, extract_features, get_best_model, save_object"
   ],
   "id": "5d14705fffbcfb64",
   "outputs": [],
   "execution_count": null
  },
  {
   "metadata": {},
   "cell_type": "markdown",
   "source": [
    "## Data loading\n",
    "\n",
    "As a first experiment, we work on a very small subset of the data."
   ],
   "id": "3bffb33f36f005c2"
  },
  {
   "metadata": {},
   "cell_type": "code",
   "source": "dataset = load_dataset(\"cnn_dailymail\", \"3.0.0\", split=\"train\")",
   "id": "4c35f3d88583bb80",
   "outputs": [],
   "execution_count": null
  },
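  {
   "metadata": {},
   "cell_type": "code",
   "source": [
    "# Sort articles by character length, then sample evenly across the sorted order\n",
    "# to cover a wide range of lengths (used here as a proxy for complexity).\n",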
    "train_dataset = dataset.map(lambda x: {\"text_length\": len(x[\"article\"])})\n",
    "train_dataset = train_dataset.sort(\"text_length\")\n",
    "num_samples = 500\n",
    "indices = np.linspace(0, len(train_dataset) - 1, num_samples, dtype=int)\n",
    "selected_samples = train_dataset.select(indices)\n",
    "print([ex[\"text_length\"] for ex in selected_samples])"
   ],
   "id": "622dbc95cf5e0cc5",
   "outputs": [],
   "execution_count": null
  },
  {
   "metadata": {},
   "cell_type": "code",
   "source": "selected_samples",
   "id": "f038b84f4f9aee51",
   "outputs": [],
   "execution_count": null
  },
  {
   "metadata": {},
   "cell_type": "code",
   "source": "model_names = [\"google-t5/t5-small\", \"google-t5/t5-base\", \"google-t5/t5-large\"]",
   "id": "dbecc71e2eb0df4b",
   "outputs": [],
   "execution_count": null
  },
  {
   "metadata": {},
   "cell_type": "markdown",
   "source": "## Exploring features and classifier",
   "id": "b3fe4e5fd705255"
  },
  {
   "metadata": {},
   "cell_type": "code",
   "source": [
    "models = {name: T5Model(name) for name in model_names}\n",
    "train_texts = selected_samples[\"article\"]\n",
    "train_summaries = selected_samples[\"highlights\"]"
   ],
   "id": "187ade3986ce0021",
   "outputs": [],
   "execution_count": null
  },
  {
   "metadata": {},
   "cell_type": "code",
   "source": "features_name = list(extract_features(train_texts[0]).keys())",
   "id": "2337a1e10f568364",
   "outputs": [],
   "execution_count": null
  },
  {
   "metadata": {},
   "cell_type": "code",
   "source": "X = np.array([list(extract_features(text).values()) for text in train_texts])",
   "id": "b6d795089320dd18",
   "outputs": [],
   "execution_count": null
  },
  {
   "metadata": {},
   "cell_type": "code",
   "source": "y = get_best_model(models, train_texts, train_summaries, tolerance=0)",
   "id": "aeaa0061b8d274a5",
   "outputs": [],
   "execution_count": null
  },
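  {
   "metadata": {},
   "cell_type": "markdown",
   "source": [
    "How `get_best_model` uses `tolerance` is not shown in this notebook. One plausible reading, sketched here purely as an assumption, is that the smallest model whose score lies within an absolute `tolerance` of the best score is preferred. The helper `pick_model` and the example scores are hypothetical and are not part of the `model` module.\n",
    "\n",
    "```python\n",
    "def pick_model(scores, tolerance=0.0):\n",
    "    # scores: list of (model_name, score) ordered from smallest to largest model.\n",
    "    best = max(score for _, score in scores)\n",
    "    for name, score in scores:\n",
    "        # Return the first (smallest) model that is close enough to the best score.\n",
    "        if score >= best - tolerance:\n",
    "            return name\n",
    "\n",
    "\n",
    "# Made-up scores for a single text, ordered by model size.\n",
    "example = [('small', 0.865), ('base', 0.868), ('large', 0.870)]\n",
    "print(pick_model(example, tolerance=0.0))\n",
    "print(pick_model(example, tolerance=0.01))\n",
    "```\n",
    "\n",
    "With these made-up scores, a tolerance of 0 selects `large`, while a tolerance of 0.01 selects `small`, trading a small loss in score for a cheaper model."
   ],
   "id": "sketch-tolerance-selection"
  },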
  {
   "metadata": {},
   "cell_type": "code",
   "source": [
    "# Build the frame from the numeric features first so they keep a numeric dtype;\n",
    "# concatenating them with the model names would cast every value to a string.\n",
    "df = pd.DataFrame(X, columns=features_name)\n",
    "df.insert(0, \"best_model_name\", y)"
   ],
   "id": "a01e40f0d6915fdb",
   "outputs": [],
   "execution_count": null
  },
  {
   "metadata": {},
   "cell_type": "code",
   "source": [
    "import seaborn as sns\n",
    "import matplotlib.pyplot as plt\n",
    "\n",
    "# Two plots per row; round up so an odd number of features still fits the grid.\n",
    "n_cols = 2\n",
    "n_rows = (len(features_name) + n_cols - 1) // n_cols\n",
    "plt.figure(figsize=(10, 30))\n",
    "for i, feature in enumerate(features_name):\n",
    "    plt.subplot(n_rows, n_cols, i + 1)\n",
    "    sns.boxplot(x=\"best_model_name\", y=feature, data=df)\n",
    "    plt.xticks(rotation=45)\n",
    "    plt.yticks(rotation=0)\n",
    "    plt.locator_params(axis=\"y\", nbins=6)\n",
    "    plt.title(feature)\n",
    "plt.tight_layout()\n",
    "plt.show()"
   ],
   "id": "3c3ed90a12128ce6",
   "outputs": [],
   "execution_count": null
  },
  {
   "metadata": {},
   "cell_type": "markdown",
   "source": [
    "The features do not seem to discriminate between the best models; they need further work.\n",
    "\n",
    "## MetaModel"
   ],
   "id": "c9f46b201ff00ec7"
  },
  {
   "metadata": {},
   "cell_type": "code",
   "source": [
    "import warnings\n",
    "warnings.filterwarnings(\"ignore\", message=\".*Some weights of RobertaModel.*\")\n",
    "meta_model = MetaModel(model_names, base_classifier=RandomForestClassifier(), tolerance=0.01)\n",
    "meta_model.fit(selected_samples[\"article\"], selected_samples[\"highlights\"])\n",
    "save_object(meta_model, \"first_model.pkl\")"
   ],
   "id": "6d68f234e372396d",
   "outputs": [],
   "execution_count": null
  },
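  {
   "metadata": {},
   "cell_type": "code",
   "source": [
    "# Note: this evaluation set is drawn from the same train split as selected_samples,\n",
    "# so it may overlap with the articles used to fit the MetaModel.\n",
    "test_dataset = dataset.shuffle(seed=42).select(range(100))"
   ],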
   "id": "59f0d58080ac7b44",
   "outputs": [],
   "execution_count": null
  },
  {
   "metadata": {},
   "cell_type": "code",
   "source": [
    "meta_model_scores = []\n",
    "meta_model_times = []\n",
    "model_scores = {name: [] for name in model_names}\n",
    "model_times = {name: [] for name in model_names}\n",
    "\n",
    "for sample in test_dataset:\n",
    "    # MetaModel: summarize with the model selected by the classifier.\n",
    "    predicted_summary, meta_time = meta_model.summarize(sample[\"article\"])\n",
    "    _, _, F1 = score([predicted_summary], [sample[\"highlights\"]], lang=\"en\", verbose=False)\n",
    "    meta_model_scores.append(F1.item())\n",
    "    meta_model_times.append(meta_time)\n",
    "\n",
    "    # Baselines: run every candidate model on the same article.\n",
    "    for model_name in model_names:\n",
    "        model = meta_model.models[model_name]\n",
    "        summary, elapsed_time = model.summarize(sample[\"article\"])\n",
    "        _, _, F1 = score([summary], [sample[\"highlights\"]], lang=\"en\", verbose=False)\n",
    "        model_scores[model_name].append(F1.item())\n",
    "        model_times[model_name].append(elapsed_time)"
   ],
   "id": "6fd91b97e4b6e588",
   "outputs": [],
   "execution_count": null
  },
  {
   "metadata": {
    "ExecuteTime": {
     "end_time": "2025-02-19T08:05:59.621340Z",
     "start_time": "2025-02-19T08:05:59.616976Z"
    }
   },
   "cell_type": "code",
   "source": [
    "print(\"\\n===== Model Evaluation =====\")\n",
    "for model_name in model_names:\n",
    "    avg_score = np.mean(model_scores[model_name])\n",
    "    avg_time = np.mean(model_times[model_name])\n",
    "    print(f\"{model_name}: BERTScore={avg_score:.4f}, Time={avg_time:.4f}s\")\n",
    "\n",
    "print(\n",
    "    f\" MetaModel : BERTScore={np.mean(meta_model_scores):.4f}, \"\n",
    "    f\"Time={np.mean(meta_model_times):.4f}s\"\n",
    ")"
   ],
   "id": "ffa22ef5f39d30bf",
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "===== Model Evaluation =====\n",
      "google-t5/t5-small: BERTScore=0.8639, Time=2.0048s\n",
      "google-t5/t5-base: BERTScore=0.8720, Time=5.2173s\n",
      "google-t5/t5-large: BERTScore=0.8664, Time=15.8678s\n",
      " MetaModel : BERTScore=0.8681, Time=3.2380s\n"
     ]
    }
   ],
   "execution_count": 9
  },
  {
   "metadata": {},
   "cell_type": "markdown",
   "source": [
    "17/02/25: The results are better with the tolerance set to 1%. Next steps for the rerun:\n",
    "- Include the feature computation time in the MetaModel time cost.\n",
    "- Analyze the features and the classifier performance in more depth.\n",
    "- Change the MetaModel structure, because the saved object is too large to be committed (4 GB)."
   ],
   "id": "3785e798f9dfaa6d"
  },
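  {
   "metadata": {},
   "cell_type": "markdown",
   "source": [
    "Counting the feature computation in the time cost could look like the sketch below: a single timer wraps feature extraction, model choice and summarization. The function `timed_meta_summarize` and the toy callables are hypothetical stand-ins, not the actual `MetaModel` API.\n",
    "\n",
    "```python\n",
    "import time\n",
    "\n",
    "\n",
    "def timed_meta_summarize(text, extract, choose, summarizers):\n",
    "    # Time the whole pipeline: features, model choice, then summarization.\n",
    "    start = time.perf_counter()\n",
    "    features = extract(text)\n",
    "    name = choose(features)\n",
    "    summary = summarizers[name](text)\n",
    "    elapsed = time.perf_counter() - start\n",
    "    return summary, name, elapsed\n",
    "\n",
    "\n",
    "# Toy stand-ins, for illustration only.\n",
    "def toy_extract(text):\n",
    "    return {'n_words': len(text.split())}\n",
    "\n",
    "\n",
    "def toy_choose(features):\n",
    "    return 'small' if features['n_words'] < 5 else 'large'\n",
    "\n",
    "\n",
    "toy_summarizers = {\n",
    "    'small': lambda text: text.split()[0],\n",
    "    'large': lambda text: ' '.join(text.split()[:3]),\n",
    "}\n",
    "\n",
    "summary, name, elapsed = timed_meta_summarize(\n",
    "    'a short text', toy_extract, toy_choose, toy_summarizers\n",
    ")\n",
    "print(name, summary, elapsed >= 0.0)\n",
    "```\n",
    "\n",
    "Measuring with one timer around all three steps keeps the comparison with the individual models fair, since their times only cover summarization."
   ],
   "id": "sketch-timing-with-features"
  },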
  {
   "metadata": {},
   "cell_type": "code",
   "outputs": [],
   "execution_count": null,
   "source": "",
   "id": "d5f89bd659f54ed2"
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}