---
license: gemma
language:
- 'no'
---

# PIIMask-NOR Model

The PIIMask-NOR model is a specialized language model fine-tuned for redacting Personally Identifiable Information (PII) in Norwegian Bokmål. It is based on the "google/gemma-1.1-2b-it" model and is trained to identify and redact various types of PII in text while preserving the grammatical structure of sentences.

## Model Description

- **Model Name:** PIIMask-NOR
- **Base Model:** [google/gemma-1.1-2b-it](https://huggingface.co/google/gemma-1.1-2b-it)
- **Quantization:** 4-bit quantization using NF4 with double quantization and float16 compute dtype.
- **Training Steps:** Checkpoints are available after 1, 2, 3, and 4 epochs.

## Methodology

The PIIMask-NOR model was fine-tuned on the ai4privacy/pii-masking-65k dataset, which was machine translated into Norwegian Bokmål. Training ran for four epochs, with a checkpoint saved after each one, to improve the model's ability to redact PII accurately. The quantization configuration was applied to make the model more efficient to deploy.

## Usage

### Installation

To use the PIIMask-NOR model, you need the `transformers` and `datasets` libraries. The code example below additionally relies on `torch`, `peft`, `bitsandbytes`, and `accelerate`, and on the optional `flash-attn` package if you keep `attn_implementation="flash_attention_2"`. You can install the core dependencies using pip:

```bash
pip install transformers datasets torch peft bitsandbytes accelerate
```

### Code Example

Here is a code example that loads the PIIMask-NOR model and uses it for PII redaction:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

# Quantization configuration: 4-bit NF4 with double quantization
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
)

# System instructions for PII redaction (kept in Norwegian, as the model expects).
# Rough English gloss: replace the listed PII types with '[REDACTED]', preserve
# the grammatical structure, and return only the rewritten text.
system_instructions = """Erstatt følgende typer personopplysninger i teksten nedenfor med '[REDACTED]': [FIRST_NAME_x], [CITY_x], [COUNTRY_x]. Sørg for at hver type informasjon erstattes på en måte som opprettholder den grammatiske strukturen i setningen.
Du skal kun returnere den nye teksten med de relevante erstatningene utført, uten den opprinnelige teksten eller noen tilleggsannotasjoner. Input:"""

example_prompt = "Jeg heter Clara og bor i Bergen, Norge."


def load_model(repo, step):
    """Load the quantized base model with the adapter stored in checkpoint-{step}."""
    return AutoModelForCausalLM.from_pretrained(
        repo,
        device_map="cuda:0",
        trust_remote_code=True,
        quantization_config=bnb_config,
        adapter_kwargs={"subfolder": f"checkpoint-{step}"},
        attn_implementation="flash_attention_2",  # requires the flash-attn package
    )


# Initialize tokenizer
device = "cuda"
tokenizer = AutoTokenizer.from_pretrained("google/gemma-1.1-2b-it", use_fast=True)

# Gemma's chat template does not accept a "system" role, so the instructions are
# prepended to the user turn. This is an assumption about the prompt format used
# during fine-tuning; adjust it if your results look off.
chat = [
    {"role": "user", "content": f"{system_instructions}\n{example_prompt}"},
]
inputs = tokenizer.apply_chat_template(
    chat,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
)

# `step` is the training step of the checkpoint; 579 corresponds to epoch 1
# (see the Checkpoints list below).
model = load_model("UlrikKoren/PIIMask-NOR", step=579)
outputs = model.generate(input_ids=inputs["input_ids"].to(device), max_new_tokens=2048)
decoded_outputs = [tokenizer.decode(output, skip_special_tokens=False) for output in outputs]
print(decoded_outputs[0])
```

### Checkpoints

The model checkpoints for the different training epochs can be accessed as follows:

- **Epoch 1:** `UlrikKoren/PIIMask-NOR/checkpoint-579`
- **Epoch 2:** `UlrikKoren/PIIMask-NOR/checkpoint-1159`
- **Epoch 3:** `UlrikKoren/PIIMask-NOR/checkpoint-1739`
- **Epoch 4:** `UlrikKoren/PIIMask-NOR/checkpoint-2316`

## Compliance with Gemma Terms of Use

This model is a derivative of the "google/gemma-1.1-2b-it" model and complies with the Gemma Terms of Use:

- **Distribution:** Any distribution of this model or its derivatives must include the use restrictions specified in the Gemma Terms of Use and provide notice to subsequent users.
- **Notices:** The model is distributed with the following notice: “Gemma is provided under and subject to the Gemma Terms of Use found at ai.google.dev/gemma/terms”.
- **Modifications:** Any modified files carry prominent notices stating the modifications made.
- **Prohibited Uses:** The use of this model is subject to the restrictions outlined in the Gemma Prohibited Use Policy.
- **Trademarks:** This distribution does not grant any rights to use Google’s trademarks, trade names, or logos.

## License

The PIIMask-NOR model is distributed under the same terms as the base model. For more details, please refer to the [Gemma Terms of Use](https://ai.google.dev/gemma/terms).
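
As a practical note on the checkpoints listed under Checkpoints: the adapter subfolders are named by training step rather than by epoch, so a small lookup can help avoid requesting a folder that does not exist. The sketch below is only an illustration; the names `EPOCH_TO_STEP` and `checkpoint_subfolder` are hypothetical and not part of the repository, and the step numbers are copied from the checkpoint list in this card.

```python
# Hypothetical helper: maps a training epoch to its checkpoint subfolder.
# Step numbers are taken from the Checkpoints list in this model card.
EPOCH_TO_STEP = {1: 579, 2: 1159, 3: 1739, 4: 2316}


def checkpoint_subfolder(epoch: int) -> str:
    """Return the adapter subfolder name for the given training epoch."""
    if epoch not in EPOCH_TO_STEP:
        valid = sorted(EPOCH_TO_STEP)
        raise ValueError(f"No checkpoint for epoch {epoch}; choose from {valid}")
    return f"checkpoint-{EPOCH_TO_STEP[epoch]}"


print(checkpoint_subfolder(2))  # checkpoint-1159
```

The returned string could then be passed as the `subfolder` entry of `adapter_kwargs` when loading the model.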