PEFT
Safetensors
Japanese
English
h-iida's picture
Update README.md
8a77cc2 verified
|
raw
history blame
4.54 kB
metadata
base_model:
  - meta-llama/Llama-2-7b-hf
library_name: peft
license: apache-2.0
datasets:
  - wikimedia/wikipedia
language:
  - ja
  - en

Model Info

This is a model that applies LLM2Vec to Llama-2. Only the PEFT Adapter is distributed. LLM2Vec is fine-tuned on two tasks: MNTP and SimCSE, and this repository contains the results of applying SimCSE after MNTP. For the MNTP Adapter, please refer to this link.

Model Details

Model Description

  • Model type: PEFT
  • Language(s) (NLP): Japanese
  • License: Apache2.0
  • Finetuned from model: Llama-2-7b-hf

Model Sources [optional]

Usage

BenchMark

  • Followings are summaries. Details are here

MTEB(Japansese)

Classification Clustering PairClassification Reranking BitextMining Retrieval STS AVG
Llama2-Llm2vec-eng 0.527 0.258 0.501 0.217 0.275 0.296 0.765 0.408
Llama2-Llm2vec-jpn (This repo) 0.570 0.365 0.510 0.349 0.470 0.417 0.795 0.498
Swallow-Llm2vec-jpn 0.621 0.391 0.510 0.475 0.475 0.491 0.832 0.523

MTEB(English)

Classification Clustering Pair_Classification Reranking Retrieval STS 平均
Llama2-Llm2vec-eng 0.709 0.386 0.780 0.588 0.329 0.723 0.586
Llama2-Llm2vec-jpn (This repo) 0.722 0.428 0.785 0.594 0.371 0.717 0.603
Swallow-Llm2vec-jpn 0.695 0.385 0.751 0.576 0.318 0.710 0.572

Training Details

Training Data

  • Make Corpus from SimCSE from Wikipedia
  • Script for making SimCSE Corpus
import argparse
import random
import re
from pathlib import Path
from datasets import load_dataset
from tqdm import tqdm

def main(args):
    random.seed(args.seed)
    wiki_ds = load_dataset("wikimedia/wikipedia", "20231101.ja")
    sampled_index = random.sample(range(len(wiki_ds["train"])), args.N)
    sample_wiki = wiki_ds["train"][sampled_index]
    output_texts = []
    for title, text in tqdm(zip(sample_wiki["title"], sample_wiki["text"])):
        output_texts.append(title)
        sentences = re.split("[\n。]", text)
        for sentence in sentences:
            if len(sentence) > args.min_sentence_len: 
                output_texts.append(sentence.strip()+"。")
    with args.output_path.open(mode="w") as f:
        for line in output_texts:
            f.write(line)
            f.write("\n")


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--N", default=200000, type=int)
    parser.add_argument("--seed", default=42, type=int)
    parser.add_argument("-o", "--output_path", type=Path)
    parser.add_argument("--min_sentence_len", default=50, type=int)

    args = parser.parse_args()
    main(args)

Training Hyperparameter

  • simcse_dropout: 0.3
  • bidirectional: true
  • pooling_mode: "mean"
  • remove_unused_columns: false
  • learning_rate: 3e-5
  • loss_scale: 20
  • batch_size: 256
  • gradient_accumulation_steps: 1
  • max_seq_length: 128
  • lora_r: 16
  • torch_dtype: "bfloat16"
  • attn_implementation: "flash_attention_2"
  • seed: 42
  • bf16: true
  • gradient_checkpointing: true

Accelerator Settings

  • deepspeed_config:
    • gradient_accumulation_steps: 1
    • gradient_clipping: 1.0
    • offload_optimizer_device: nvme
    • offload_optimizer_nvme_path: /nvme
    • zero3_save_16bit_model: true
    • zero_stage: 2
  • distributed_type: DEEPSPEED
  • downcast_bf16: 'no'
  • dynamo_config:
    • dynamo_backend: INDUCTOR
    • dynamo_mode: default
    • dynamo_use_dynamic: true
    • dynamo_use_fullgraph: true
  • enable_cpu_affinity: false
  • machine_rank: 0
  • main_training_function: main
  • mixed_precision: bf16
  • num_machines: 1
  • num_processes: 2
  • rdzv_backend: static
  • same_network: true
  • quse_cpu: false

Framework versions

  • Python: 3.12.3
  • PEFT 0.11.1
  • Sentence Transformers: 3.0.1
  • Transformers: 4.41.0
  • PyTorch: 2.3.0
  • Accelerate: 0.30.1
  • Datasets: 2.20.0
  • Tokenizers: 0.19.1
  • MTEB: 1.13.0