replace-wrong-weights

by gardari - opened Mar 19

base: refs/heads/main ← from: refs/pr/3

+196

-165

Files changed (6)

README.md +20 -7
config.json +1 -1
handler.py +0 -120
run_model.py +175 -29
spiece.model +0 -3
tokenizer_config.json +0 -5

README.md CHANGED

@@ -10,21 +10,34 @@ ICELANDIC GPT-SW3 FOR SPELL AND GRAMMAR CHECKING
 This is a model for correcting spelling and grammar errors in Icelandic text. It is a GPT-SW3 model (https://huggingface.co/AI-Sweden-Models/gpt-sw3-6.7b) finetuned on Icelandic and particularly on the spell and grammar checking task.
-Provided here is the model along with a script for running it through a Hugging Face endpoint. An authorized Hugging Face API key is required to do so. Once you have retrieved an API key and it has been authorized, add it to you environment as "HF_API_KEY".
-To run the model you will need a python3 environment. Install the required dependencies by running
 > pip install -r requirements.txt
-The current version of transformers includes a bug in the GPTSw3Tokenizer class which causes it to use the wrong BOS and PAD tokens if the tokenizer is loaded through `AI-Sweden-Models/gpt-sw3-6.7b`. Load the tokenizer through `mideind/icelandic-gpt-sw3-6.7b-gec` instead to avoid this bug.
-The model is fine-tuned on the following three tasks. Output examples for each task are shown in ./example_outputs.
   - Task 1: The model evaluates one text with regards to e.g. grammar and spelling, and returns all errors in the input text as a list, with their position in the text and their corrections.
   - Task 2: The model evaluates two texts and chooses which one is better with regards to e.g. grammar and spelling.
   - Task 3: The model evaluates one text with regards to e.g. grammar and spelling, and returns a corrected version of the text.
-Run the model with
-> python run_model.py
-Input text(s) and the task type need to be specified in the script.

 This is a model for correcting spelling and grammar errors in Icelandic text. It is a GPT-SW3 model (https://huggingface.co/AI-Sweden-Models/gpt-sw3-6.7b) finetuned on Icelandic and particularly on the spell and grammar checking task.
+Provided here is the model along with a script for running it locally.
+To run the script you will need a python3 environment. Install the required dependencies by running
 > pip install -r requirements.txt
+The current version of transformers includes a bug which has to be fixed in the user's environment before the model can be run. To fix it, change "gpt-sw3-7b" in line no. 138 in transformers/models/gpt_sw3/tokenization_gpt_sw3.py to "gpt-sw3-6.7b".
+After that you can run the script with an input file consisting of text to correct.
+The model is fine-tuned on the following three tasks:
   - Task 1: The model evaluates one text with regards to e.g. grammar and spelling, and returns all errors in the input text as a list, with their position in the text and their corrections.
   - Task 2: The model evaluates two texts and chooses which one is better with regards to e.g. grammar and spelling.
   - Task 3: The model evaluates one text with regards to e.g. grammar and spelling, and returns a corrected version of the text.
+The script which runs the model takes the following three arguments:
+  - --task: A number (1-3) representing the intended task. The script includes prompts for each task.
+  - --input-file: A file containing text to be evaluated. The format of the input file differs between tasks, and is described further below.
+  - --output-file: A path to a desired output file to be created by the script. The format of the file differs between tasks, and is described further below.
+An input file for tasks 1 and 3 should be a .txt file with one text per line. Example input files for both tasks can be found under ./example_inputs.
+An input file for task 2 should be a .jsonl file, where each line is a dictionary object showing two texts. Keys in the dictionary are "a" and "b" and texts to be evaluated are their values. An example of this file can be found under ./example_inputs.
+All output files are .txt files and output examples for each task are shown in ./example_outputs. An output file for task 1 shows each text which was evaluated, followed by a list of corrections. Text outputs are separated by an empty line. An output file for task 2 shows 'A' or 'B' for which text is preferred, one choice per line. An output file for task 3 shows the corrected text, one text per line.
+Run the script with
+> python run_model.py --task 3 --input-file example_inputs/task3_example.txt --output-file example_outputs/task3_example.txt
+The script we provide uses a CUDA GPU when PyTorch detects one and otherwise runs on the CPU, so it should work on most systems that have enough RAM to load the model. Users who wish to accelerate their corrections with specialized hardware (e.g. GPUs) will need to install the appropriate support packages for their hardware; see the PyTorch documentation: https://pytorch.org/get-started/locally/ . The script calls the model directly rather than through a pipeline. Users who build their own pipeline can pass the `device` parameter to the pipeline constructor. See the HuggingFace documentation (https://huggingface.co/docs/transformers/main_classes/pipelines) for more details.
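The input formats described in the README above can be produced with a few lines of Python. The sketch below is not part of the PR: the helper names `write_text_input` and `write_pair_input` are hypothetical, and it only assumes what the README states (one text per line for tasks 1 and 3, one JSON object with keys "a" and "b" per line for task 2).

```python
import json
from pathlib import Path


def write_text_input(texts, path):
    """Write a task 1 or task 3 input file: one text per line."""
    for text in texts:
        # A text containing a newline would be read back as several inputs.
        if "\n" in text:
            raise ValueError("each text must fit on a single line")
    Path(path).write_text("\n".join(texts) + "\n", encoding="utf-8")


def write_pair_input(pairs, path):
    """Write a task 2 input file: one JSON object with keys "a" and "b" per line."""
    with open(path, "w", encoding="utf-8") as file:
        for text_a, text_b in pairs:
            # ensure_ascii=False keeps Icelandic characters readable in the file.
            record = {"a": text_a, "b": text_b}
            file.write(json.dumps(record, ensure_ascii=False) + "\n")
```

For instance, `write_pair_input([("first version", "second version")], "my_task2_input.jsonl")` would create a file that could then be passed to `--input-file` together with `--task 2`; the file name is only an illustration.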

config.json CHANGED

@@ -1,5 +1,5 @@
 {
-  "_name_or_path": "mideind/icelandic-gpt-sw3-6.7b-gec",
   "activation_function": "gelu",
   "apply_query_key_layer_scaling": true,
   "architectures": [

 {
+  "_name_or_path": "mideind/icelandic-gpt-sw3",
   "activation_function": "gelu",
   "apply_query_key_layer_scaling": true,
   "architectures": [

handler.py DELETED

@@ -1,120 +0,0 @@
-from typing import Dict, List, Any
-from transformers import AutoModelForCausalLM, AutoTokenizer
-import torch
-import logging
-logging.basicConfig(level=logging.INFO)
-LOGGER = logging.getLogger(__name__)
-# Prompts for the different tasks
-START_PROMPT_TASK1 = "Hér er texti sem ég vil að þú skoðir vel og vandlega. Þú skalt skoða hvert einasta orð, orðasamband, og setningu og meta hvort þér finnist eitthvað athugavert, til dæmis hvað varðar málfræði, stafsetningu, skringilega merkingu og svo framvegis.\nHér er textinn:\n\n"
-END_PROMPT_TASK1 = "Sérðu eitthvað sem mætti betur fara í textanum? Búðu til lista af öllum slíkum tilvikum þar sem hver lína tilgreinir hver villan er, hvar hún er, og hvað væri gert í staðinn fyrir villuna.\n\n"
-START_PROMPT_TASK2 = "Hér er texti sem ég vil að þú skoðir vel og vandlega. Þú skalt skoða hvert einasta orð, orðasamband, og setningu og meta hvort þér finnist eitthvað athugavert, til dæmis hvað varðar málfræði, stafsetningu, skringilega merkingu og svo framvegis.Ég er með tvær útgáfur af textanum, A og B, og önnur þeirra gæti verið betri en hin á einhvern hátt, t.d. hvað varðar stafsetningu, málfræði o.s.frv.\nHér er texti A:\n\n"
-MIDDLE_PROMPT_TASK2 = "Hér er texti B:\n\n"
-END_PROMPT_TASK2 = "Hvorn textann líst þér betur á?\n\n"
-START_PROMPT_TASK3 = "Hér er texti sem ég vil að þú skoðir vel og vandlega. Þú skalt skoða hvert einasta orð, orðasamband, og setningu og meta hvort þér finnist eitthvað athugavert, til dæmis hvað varðar málfræði, stafsetningu, skringilega merkingu og svo framvegis.\nHér er textinn:\n\n"
-END_PROMPT_TASK3 = "Reyndu nú að laga textann þannig að hann líti betur út, eins og þér finnst best við hæfi.\n\n"
-START_PROMPT_TASK = {
-    1: START_PROMPT_TASK1,
-    2: START_PROMPT_TASK2,
-    3: START_PROMPT_TASK3,
-}
-END_PROMPT_TASK = {1: END_PROMPT_TASK1, 2: END_PROMPT_TASK2, 3: END_PROMPT_TASK3}
-SEP = "\n\n"
-class EndpointHandler:
-    def __init__(self, path=""):
-        self.model = AutoModelForCausalLM.from_pretrained(
-            path, device_map="auto", torch_dtype=torch.bfloat16
-        )
-        self.tokenizer = AutoTokenizer.from_pretrained(path)
-        LOGGER.info(f"Inference model loaded from {path}")
-        LOGGER.info(f"Model device: {self.model.device}")
-    def check_valid_inputs(self, input_a: str, input_b: str, task: int) -> bool:
-        """
-        Check if the inputs are valid
-        """
-        if task not in [1, 2, 3]:
-            return False
-        if task == 1 or task == 3:
-            if input_a is None:
-                return False
-        elif task == 2:
-            if input_a is None or input_b is None:
-                return False
-        return True
-    def tokenize_input(self, input_a: str, input_b: str, task: int) -> List[int]:
-        """
-        Tokenize the input
-        """
-        if task == 1 or task == 3:
-            tokenized_start = self.tokenizer(START_PROMPT_TASK[task])["input_ids"]
-            tokenized_end = self.tokenizer(END_PROMPT_TASK[task])["input_ids"]
-            tokenized_sentence = self.tokenizer(input_a + SEP)["input_ids"]
-            concatted_data = (
-                [self.tokenizer.bos_token_id]
-                + tokenized_start
-                + tokenized_sentence
-                + tokenized_end
-            )
-        elif task == 2:
-            tokenized_start = self.tokenizer(START_PROMPT_TASK[task])["input_ids"]
-            tokenized_middle = self.tokenizer(MIDDLE_PROMPT_TASK2)["input_ids"]
-            tokenized_end = self.tokenizer(END_PROMPT_TASK[task])["input_ids"]
-            tokenized_sentence_a = self.tokenizer(input_a + SEP)["input_ids"]
-            tokenized_sentence_b = self.tokenizer(input_b + SEP)["input_ids"]
-            concatted_data = (
-                [self.tokenizer.bos_token_id]
-                + tokenized_start
-                + tokenized_sentence_a
-                + tokenized_middle
-                + tokenized_sentence_b
-                + tokenized_end
-            )
-        return concatted_data
-    def __call__(self, data: Dict[str, Any]) -> List[Dict[str, Any]]:
-        """
-         data args:
-              inputs (:obj: `str` | `PIL.Image` | `np.array`)
-              kwargs
-        Return:
-              A :obj:`list` | `dict`: will be serialized and returned
-        """
-        LOGGER.info(f"Received data: {data}")
-        # Get inputs
-        input_a = data.pop("input_a", None)
-        input_b = data.pop("input_b", None)
-        task = data.pop("task", None)
-        parameters = data.pop("parameters", {})
-        # Check valid inputs
-        if not self.check_valid_inputs(input_a, input_b, task):
-            return [{"error": "Invalid inputs"}]
-        if "max_new_tokens" not in parameters and "max_length" not in parameters:
-            parameters["max_new_tokens"] = 512
-        # Tokenize the input
-        tokenized_input = self.tokenize_input(input_a, input_b, task)
-        # Move the input to the device
-        input_ids = torch.tensor(tokenized_input).to(self.model.device)
-        input_ids = input_ids.unsqueeze(0)
-        # Generate the output
-        output = self.model.generate(input_ids, **parameters)
-        # Decode only the new part of the output
-        decoded_output = self.tokenizer.decode(
-            output[0][len(tokenized_input) :], skip_special_tokens=True
-        ).strip()
-        return [{"output": decoded_output}]
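The order in which `tokenize_input` joins the prompt pieces can be illustrated on plain strings. This is a sketch rather than code from the PR: `build_prompt_text` is a hypothetical helper, and the prompt constants are shortened placeholders, not the real Icelandic prompts. The real code concatenates token ids with a leading BOS id, and tokenizing the pieces separately need not give the same ids as tokenizing the joined string.

```python
SEP = "\n\n"

# Shortened placeholders standing in for the real Icelandic prompts (assumption).
START = {1: "<start 1>\n\n", 2: "<start 2, text A follows>\n\n", 3: "<start 3>\n\n"}
MIDDLE_TASK2 = "<text B follows>\n\n"
END = {1: "<list the errors>\n\n", 2: "<which is better?>\n\n", 3: "<fix the text>\n\n"}


def build_prompt_text(task, text_a, text_b=None):
    """Return the prompt as one string, mirroring the concatenation order."""
    if task in (1, 3):
        # start prompt, the text to check, then the instruction
        return START[task] + text_a + SEP + END[task]
    if task == 2:
        if text_b is None:
            raise ValueError("task 2 needs two texts")
        # start prompt, text A, middle prompt, text B, then the question
        return START[2] + text_a + SEP + MIDDLE_TASK2 + text_b + SEP + END[2]
    raise ValueError("task must be 1, 2 or 3")
```

The model's answer is whatever it generates after the final instruction, which is why the code decodes only the tokens that follow the input.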

run_model.py CHANGED

@@ -1,34 +1,180 @@
-"""
-Script for running the model using the Hugging Face endpoint. An authorized Hugging Face API key is required.
-"""
-import requests
-import os
-API_URL = "https://otaf5w2ge8huxngl.eu-west-1.aws.endpoints.huggingface.cloud"
-# Set your Hugging Face API key as an environment variable
-api_key = os.environ.get("HF_API_KEY")
-headers = {
-    "Accept": "application/json",
-    "Authorization": f"Bearer {api_key}",
-    "Content-Type": "application/json",
 }
-def query(payload):
-    response = requests.post(API_URL, headers=headers, json=payload)
-    return response.json()
-output = query(
-    {
-        "inputs": "",  # Can be left empty.
-        "input_a": "<text A>",  # Required for all tasks.
-        "input_b": "<text B>",  # Required for task 2 but not for task 1 or 3.
-        "task": 1 | 2 | 3,  # Choose the task number.
-        "parameters": {
-            # Can be left empty
-        },
-    }
-)
-print(output)

+"""This script runs the trained model on data and saves the predictions to a file."""
+from transformers import AutoModelForCausalLM, AutoTokenizer
+import torch
+import logging
+import random
+import tqdm
+import json
+import argparse
+# Set the logging level to info
+logging.basicConfig(level=logging.INFO)
+# Set the device to GPU if available
+device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
+logging.info(f"Device: {device}")
+# Prompts for the different tasks
+START_PROMPT_TASK1 = "Hér er texti sem ég vil að þú skoðir vel og vandlega. Þú skalt skoða hvert einasta orð, orðasamband, og setningu og meta hvort þér finnist eitthvað athugavert, til dæmis hvað varðar málfræði, stafsetningu, skringilega merkingu og svo framvegis.\nHér er textinn:\n\n"
+END_PROMPT_TASK1 = "Sérðu eitthvað sem mætti betur fara í textanum? Búðu til lista af öllum slíkum tilvikum þar sem hver lína tilgreinir hver villan er, hvar hún er, og hvað væri gert í staðinn fyrir villuna.\n\n"
+START_PROMPT_TASK2 = "Hér er texti sem ég vil að þú skoðir vel og vandlega. Þú skalt skoða hvert einasta orð, orðasamband, og setningu og meta hvort þér finnist eitthvað athugavert, til dæmis hvað varðar málfræði, stafsetningu, skringilega merkingu og svo framvegis.Ég er með tvær útgáfur af textanum, A og B, og önnur þeirra gæti verið betri en hin á einhvern hátt, t.d. hvað varðar stafsetningu, málfræði o.s.frv.\nHér er texti A:\n\n"
+MIDDLE_PROMPT_TASK2 = "Hér er texti B:\n\n"
+END_PROMPT_TASK2 = "Hvorn textann líst þér betur á?\n\n"
+START_PROMPT_TASK3 = "Hér er texti sem ég vil að þú skoðir vel og vandlega. Þú skalt skoða hvert einasta orð, orðasamband, og setningu og meta hvort þér finnist eitthvað athugavert, til dæmis hvað varðar málfræði, stafsetningu, skringilega merkingu og svo framvegis.\nHér er textinn:\n\n"
+END_PROMPT_TASK3 = "Reyndu nú að laga textann þannig að hann líti betur út, eins og þér finnst best við hæfi.\n\n"
+START_PROMPT_TASK = {
+    1: START_PROMPT_TASK1,
+    2: START_PROMPT_TASK2,
+    3: START_PROMPT_TASK3,
 }
+END_PROMPT_TASK = {1: END_PROMPT_TASK1, 2: END_PROMPT_TASK2, 3: END_PROMPT_TASK3}
+SEP = "\n\n"
+def set_seed(seed):
+    """Set the random seed for reproducibility."""
+    torch.manual_seed(seed)
+    if torch.cuda.is_available():
+        torch.cuda.manual_seed_all(seed)
+    torch.backends.cudnn.deterministic = True
+    torch.backends.cudnn.benchmark = False
+    random.seed(seed)
+def tokenize_data(tokenizer, data, task, max_length):
+    """Tokenize the data and return the input_ids and attention_mask."""
+    tokenized_start = tokenizer(START_PROMPT_TASK[task])["input_ids"]
+    tokenized_end = tokenizer(END_PROMPT_TASK[task])["input_ids"]
+    if task == 2:
+        tokenized_middle = tokenizer(MIDDLE_PROMPT_TASK2)["input_ids"]
+    # Tokenize the data
+    tokenized_data = []
+    if task == 1 or task == 3:
+        for sentence in data:
+            tokenized_sentence = tokenizer(sentence + SEP)["input_ids"]
+            # Concatenate the tokenized data
+            concatted_data = (
+                [tokenizer.bos_token_id]
+                + tokenized_start
+                + tokenized_sentence
+                + tokenized_end
+            )
+            # Truncate the data
+            concatted_data = concatted_data[:max_length]
+            tokenized_data.append(concatted_data)
+    elif task == 2:
+        for line in data:
+            data_a = line["a"]
+            data_b = line["b"]
+            tokenized_sentence_a = tokenizer(data_a + SEP)["input_ids"]
+            tokenized_sentence_b = tokenizer(data_b + SEP)["input_ids"]
+            # Concatenate the tokenized data
+            concatted_data = (
+                [tokenizer.bos_token_id]
+                + tokenized_start
+                + tokenized_sentence_a
+                + tokenized_middle
+                + tokenized_sentence_b
+                + tokenized_end
+            )
+            # Truncate the data
+            concatted_data = concatted_data[:max_length]
+            tokenized_data.append(concatted_data)
+    return tokenized_data
+def run_model_on_data(model_path, tokenizer_name, arguments):
+    """Run the model on the data and save the predictions to a file."""
+    # Load the model and tokenizer
+    model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.bfloat16)
+    model.to(device)
+    model.eval()
+    tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)
+    # Load the data (UTF-8 is set explicitly because the texts are Icelandic)
+    with open(arguments.input_file, "r", encoding="utf-8") as file:
+        data = file.read().splitlines()
+    if arguments.task == 2:
+        # Task 2 input is JSONL: one {"a": ..., "b": ...} object per line
+        data = [json.loads(line) for line in data]
+    # Tokenize the data
+    data_tokenized = tokenize_data(
+        tokenizer, data, arguments.task, tokenizer.model_max_length
+    )
+    logging.info(f"Number of examples: {len(data_tokenized)}")
+    # Run the model on the data
+    predictions = []
+    progress_bar = tqdm.tqdm(total=len(data_tokenized), desc="Running model on data")
+    for input_ids in data_tokenized:
+        progress_bar.update(1)
+        # Generate the predictions (autocast on whichever device is in use)
+        with torch.autocast(device_type=device.type, dtype=torch.bfloat16):
+            input_ids_tensor = torch.tensor(input_ids).unsqueeze(0).to(device)
+            output = model.generate(
+                input_ids=input_ids_tensor, max_new_tokens=500, num_return_sequences=1
+            )
+            # Only get the part of the prediction that was generated
+            prediction = tokenizer.decode(
+                output[0][len(input_ids) :], skip_special_tokens=True
+            )
+            predictions.append(prediction)
+    progress_bar.close()
+    # Save the predictions to a file
+    with open(arguments.output_file, "w", encoding="utf-8") as file:
+        if arguments.task == 1:
+            # We want to include the original text in the output file
+            for text, prediction in zip(data, predictions):
+                file.write(text + "\n")
+                file.write(prediction.split("\n\n")[0] + "\n\n")
+        else:
+            for prediction in predictions:
+                file.write(prediction.split("\n\n")[0] + "\n")
+    logging.info(f"Predictions written to file: {arguments.output_file}")
+if __name__ == "__main__":
+    # Parse the arguments
+    parser = argparse.ArgumentParser()
+    parser.add_argument(
+        "--task",
+        type=int,
+        choices=[1, 2, 3],
+        required=True,
+        help="The task type (1, 2, or 3)",
+    )
+    parser.add_argument(
+        "--input-file",
+        type=str,
+        required=True,
+        help="The path to the input file with data to be corrected",
+    )
+    parser.add_argument(
+        "--output-file",
+        type=str,
+        required=True,
+        help="The path to the output file where the corrected data will be saved",
+    )
+    args = parser.parse_args()
+    model_path = "./gpt-sw3-model"
+    tokenizer_name = "AI-Sweden-Models/gpt-sw3-6.7b"
+    set_seed(42)
+    run_model_on_data(model_path, tokenizer_name, args)
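The task 1 output written by the script above (the original text, then the model's list of corrections, then an empty line) can be read back into pairs. The following is a minimal sketch, not part of the PR: `parse_task1_output` is a hypothetical helper, and it assumes that no input text was an empty line, since an empty text would make the first correction line look like the original text.

```python
def parse_task1_output(content):
    """Split task 1 output into (original_text, corrections) pairs."""
    results = []
    # Each evaluated text forms one block; blocks are separated by an empty line.
    for block in content.split("\n\n"):
        lines = [line for line in block.splitlines() if line.strip()]
        if not lines:
            continue  # skip empty blocks, such as the one after the last separator
        # The first line is the evaluated text, the rest are the corrections.
        results.append((lines[0], lines[1:]))
    return results
```

A block with no lines after the first one would mean the model returned no corrections for that text.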

spiece.model DELETED

@@ -1,3 +0,0 @@
-version https://git-lfs.github.com/spec/v1
-oid sha256:8a76244a65ab35adda1b1cdb7b49be970d143bcc489d7b05d87551a12de78878
-size 1071963

tokenizer_config.json DELETED

@@ -1,5 +0,0 @@
-{
-  "name_or_path": "AI-Sweden-Models/gpt-sw3-6.7b",
-  "bos_token": "<|endoftext|>",
-  "pad_token": "<unk>"
-}
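Since this PR removes the file that pinned the BOS and PAD tokens, it may be worth checking which special tokens a loaded tokenizer actually reports. The sketch below is an assumption-laden illustration, not part of the PR: `special_token_mismatches` is a hypothetical helper, the expected values are simply those from the deleted tokenizer_config.json, and whether they are still the right values after this change is not confirmed by the diff. It relies only on the `bos_token` and `pad_token` attributes that Hugging Face tokenizers expose.

```python
from types import SimpleNamespace

# Values taken from the deleted tokenizer_config.json shown above.
EXPECTED_SPECIAL_TOKENS = {"bos_token": "<|endoftext|>", "pad_token": "<unk>"}


def special_token_mismatches(tokenizer, expected=EXPECTED_SPECIAL_TOKENS):
    """Return {name: (actual, expected)} for every special token that differs."""
    mismatches = {}
    for name, wanted in expected.items():
        actual = getattr(tokenizer, name, None)
        if actual != wanted:
            mismatches[name] = (actual, wanted)
    return mismatches


# Stand-in object for demonstration; a real tokenizer exposes the same attributes.
fake_tokenizer = SimpleNamespace(bos_token="<s>", pad_token="<unk>")
print(special_token_mismatches(fake_tokenizer))
# prints {'bos_token': ('<s>', '<|endoftext|>')}
```

An empty dictionary means the tokenizer reports the same BOS and PAD tokens as the deleted configuration.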