uartimcs committed
Commit: dacdaca
1 Parent(s): 6d74987

Upload donut_simple.ipynb

Files changed (1)
  donut_simple.ipynb (+12, -10)
donut_simple.ipynb CHANGED
@@ -3,11 +3,13 @@
  {
  "cell_type": "markdown",
  "source": [
- "1. Download the donut folder from Github https://github.com/clovaai/donut\n",
- "2. Copy a config file in folder and change the name to hold your configuration.\n",
- "3. Place your dataset (train, validation, test) along with JSONL files on the dataset folder.\n",
- "4. Refer to donut_training.ipynb to train your model. Use A-100/V-100 GPU to avoid troublesome settings / slow training time.\n",
- "5. Run the trained model using this ipynb file."
+ "1. Download the repo from GitHub (https://github.com/clovaai/donut) with a git command or as a direct download.\n",
+ "2. The base model configs for the document classification / document parsing / document Q&A tasks are stored under /config.\n",
+ "3. Copy one of the YAML files, rename it as you like and set your parameters.\n",
+ "4. Prepare your dataset (train, validation, test) along with JSONL files in the /dataset folder. You can write a program to generate the JSONL files from CSV files. Mind the format: one record per line, and one JSONL file in each split folder (train/validation/test).\n",
+ "5. Refer to donut_training.ipynb to train your model. Use an A100/V100 GPU to avoid troublesome settings and slow training. The trained model is stored under the /result folder.\n",
+ "6. Run the trained model using this ipynb file.\n",
+ "7. Don't change the versions of transformers and timm; it is a nightmare unless you know exactly what you are doing."
  ],
  "metadata": {
  "id": "L5U1ACZZBxfh"
@@ -47,7 +49,8 @@
  "# import necessary modules\n",
  "from donut import DonutModel\n",
  "from PIL import Image\n",
- "import torch"
+ "import torch\n",
+ "import argparse"
  ],
  "metadata": {
  "id": "gSatjcDn5S89"
@@ -58,11 +61,11 @@
  {
  "cell_type": "code",
  "source": [
- "# Test the model with testing data. Just to initiate model.\n",
- "!python test.py --task_name Booking --dataset_name_or_path dataset/Booking --pretrained_model_name_or_path ./result/train_Booking/donut-booking-extract"
+ "# Input the default arguments\n",
+ "parser = argparse.ArgumentParser()"
  ],
  "metadata": {
- "id": "dyOv9Omo8dJU"
+ "id": "RZSmy3Riz7ia"
  },
  "execution_count": null,
  "outputs": []
@@ -70,7 +73,6 @@
  {
  "cell_type": "code",
  "source": [
- "\n",
  "model = DonutModel.from_pretrained(\"./result/train_Booking/donut-booking-extract\")\n",
  "if torch.cuda.is_available():\n",
  " model.half()\n",