{ "cells": [ { "metadata": {}, "cell_type": "markdown", "source": [ "# OptimAbstract\n", "It aims at building a meta-model on top of T5 model in order to adapt the model choice relatively to the complextity of the text to compress.\n", "\n", "Several steps. During learning phase:\n", "1. Find relevant features that represents the complexity with low computational time\n", "2. Apply the candidate models and select the best with regard with a fixed criteria (BertScore)\n", "3. Fit a classifier to predict, from the features, the best model\n", "In the inference: simply predict the classifier, and choose the right model." ], "id": "bec99a2ab93de91b" }, { "metadata": {}, "cell_type": "code", "source": [ "import numpy as np\n", "import pandas as pd\n", "from datasets import load_dataset\n", "from bert_score import score\n", "from sklearn.ensemble import RandomForestClassifier\n", "\n", "from model import MetaModel, save_object\n", "import time\n", "from model import T5Model, extract_features, get_best_model" ], "id": "5d14705fffbcfb64", "outputs": [], "execution_count": null }, { "metadata": {}, "cell_type": "markdown", "source": [ "## Data loading\n", "\n", "For the first idea, let us work on a very small amount of data" ], "id": "3bffb33f36f005c2" }, { "metadata": {}, "cell_type": "code", "source": "dataset = load_dataset(\"cnn_dailymail\", \"3.0.0\", split=\"train\")", "id": "4c35f3d88583bb80", "outputs": [], "execution_count": null }, { "metadata": {}, "cell_type": "code", "source": [ "# I want a wide diversity of complexity\n", "train_dataset = dataset.map(lambda x: {\"text_length\": len(x[\"article\"])})\n", "train_dataset = train_dataset.sort(\"text_length\")\n", "num_samples = 500\n", "indices = np.linspace(0, len(train_dataset) - 1, num_samples, dtype=int)\n", "selected_samples = train_dataset.select(indices)\n", "print([ex[\"text_length\"] for ex in selected_samples])" ], "id": "622dbc95cf5e0cc5", "outputs": [], "execution_count": null }, { "metadata": {}, "cell_type": "code", "source": "selected_samples", "id": "f038b84f4f9aee51", "outputs": [], "execution_count": null }, { "metadata": {}, "cell_type": "code", "source": "model_names = [\"google-t5/t5-small\", \"google-t5/t5-base\", \"google-t5/t5-large\"]", "id": "dbecc71e2eb0df4b", "outputs": [], "execution_count": null }, { "metadata": {}, "cell_type": "markdown", "source": "## Exploring features and classifier", "id": "b3fe4e5fd705255" }, { "metadata": {}, "cell_type": "code", "source": [ "models = {name: T5Model(name) for name in model_names}\n", "train_texts = selected_samples[\"article\"]\n", "train_summaries = selected_samples[\"highlights\"]" ], "id": "187ade3986ce0021", "outputs": [], "execution_count": null }, { "metadata": {}, "cell_type": "code", "source": "features_name = list(extract_features(train_texts[0]).keys())", "id": "2337a1e10f568364", "outputs": [], "execution_count": null }, { "metadata": {}, "cell_type": "code", "source": "X = np.array([list(extract_features(text).values()) for text in train_texts])", "id": "b6d795089320dd18", "outputs": [], "execution_count": null }, { "metadata": {}, "cell_type": "code", "source": "y = get_best_model(models, train_texts, train_summaries, tolerance=0)", "id": "aeaa0061b8d274a5", "outputs": [], "execution_count": null }, { "metadata": {}, "cell_type": "code", "source": [ "import pandas as pd\n", "\n", "df = pd.DataFrame(\n", " columns=[\"best_model_name\"] + features_name, data=np.concatenate((y.reshape(-1, 1), X), axis=1)\n", ")" ], "id": "a01e40f0d6915fdb", "outputs": [], "execution_count": null }, { "metadata": {}, "cell_type": "code", "source": [ "import seaborn as sns\n", "import matplotlib.pyplot as plt\n", "\n", "plt.figure(figsize=(10, 30))\n", "for i, feature in enumerate(features_name):\n", " plt.subplot(len(features_name) // 2, len(features_name) // 2, i + 1)\n", " sns.boxplot(x=\"best_model_name\", y=feature, data=df)\n", " plt.xticks(rotation=45)\n", " plt.yticks(rotation=0)\n", " plt.locator_params(axis=\"y\", nbins=6)\n", " plt.title(feature)\n", "plt.tight_layout()\n", "plt.show()" ], "id": "3c3ed90a12128ce6", "outputs": [], "execution_count": null }, { "metadata": {}, "cell_type": "markdown", "source": [ "The features do not seem to be relevant, I have to work further.\n", "\n", "## MetaModel" ], "id": "c9f46b201ff00ec7" }, { "metadata": {}, "cell_type": "code", "source": [ "import warnings\n", "warnings.filterwarnings(\"ignore\", message=\".*Some weights of RobertaModel.*\")\n", "meta_model = MetaModel(model_names, base_classifier=RandomForestClassifier(), tolerance=0.01)\n", "meta_model.fit(selected_samples[\"article\"], selected_samples[\"highlights\"])\n", "save_object(meta_model, \"first_model.pkl\")" ], "id": "6d68f234e372396d", "outputs": [], "execution_count": null }, { "metadata": {}, "cell_type": "code", "source": "test_dataset = dataset.shuffle(seed=42).select(range(100))", "id": "59f0d58080ac7b44", "outputs": [], "execution_count": null }, { "metadata": {}, "cell_type": "code", "source": [ "meta_model_scores = []\n", "meta_model_times = []\n", "model_scores = {name: [] for name in model_names}\n", "model_times = {name: [] for name in model_names}\n", "\n", "for i, dataset_ in enumerate(test_dataset):\n", " predicted_summary, meta_time = meta_model.summarize(dataset_[\"article\"])\n", " P, R, F1 = score([predicted_summary], [dataset_[\"highlights\"]], lang=\"en\", verbose=False)\n", " meta_model_scores.append(F1.item())\n", " meta_model_times.append(meta_time)\n", "\n", " model_results = []\n", " for model_name in model_names:\n", " model = meta_model.models[model_name]\n", " summary, elapsed_time = model.summarize(dataset_[\"article\"])\n", " P, R, F1 = score([summary], [dataset_[\"highlights\"]], lang=\"en\", verbose=False)\n", " f1_score = F1.item()\n", "\n", " model_scores[model_name].append(f1_score)\n", " model_times[model_name].append(elapsed_time)\n", " model_results.append((model_name, f1_score, elapsed_time))\n", "\n" ], "id": "6fd91b97e4b6e588", "outputs": [], "execution_count": null }, { "metadata": {}, "cell_type": "markdown", "source": "", "id": "9f8cc25886f69b6c" }, { "metadata": { "ExecuteTime": { "end_time": "2025-02-19T08:05:59.621340Z", "start_time": "2025-02-19T08:05:59.616976Z" } }, "cell_type": "code", "source": [ "print(\"\\n===== Model Evaluation =====\")\n", "for model_name in model_names:\n", " avg_score = np.mean(model_scores[model_name])\n", " avg_time = np.mean(model_times[model_name])\n", " print(f\"{model_name}: BERTScore={avg_score:.4f}, Time={avg_time:.4f}s\")\n", "\n", "print(\n", " f\" MetaModel : BERTScore={np.mean(meta_model_scores):.4f}, \"\n", " f\"Time={np.mean(meta_model_times):.4f}s\"\n", ")" ], "id": "ffa22ef5f39d30bf", "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "===== Model Evaluation =====\n", "google-t5/t5-small: BERTScore=0.8639, Time=2.0048s\n", "google-t5/t5-base: BERTScore=0.8720, Time=5.2173s\n", "google-t5/t5-large: BERTScore=0.8664, Time=15.8678s\n", " MetaModel : BERTScore=0.8681, Time=3.2380s\n" ] } ], "execution_count": 9 }, { "metadata": {}, "cell_type": "markdown", "source": [ "17/02/25 : The results are better with tol at 1%. I should rerun with :\n", "- add the feature computation in the meta model time cost\n", "- analyze more deeply the features and the classifier performances\n", "- Should change the MetaModel structure because it is too large to be commited (4GB)" ], "id": "3785e798f9dfaa6d" }, { "metadata": {}, "cell_type": "code", "outputs": [], "execution_count": null, "source": "", "id": "d5f89bd659f54ed2" } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.6" } }, "nbformat": 4, "nbformat_minor": 5 }