metadata
library_name: peft
base_model: microsoft/phi-2
Model Card for Model ID
phi-2-mongodb is a fine-tuned version of microsoft/phi-2 to generate MongoDB pipeline queries. It was fine-tuned on a custom curated natural language to MongoDB queries dataset, I'll be releasing that next week.
Model Details
Further details about fine-tuned model can be found at : https://github.com/Chirayu-Tripathi/nl2query. It can also be used via nl2query library.
Model Description
- Fine-tuned by:
Chirayu Tripathi
- Developed by: [
Microsoft
] - Language(s) (NLP): English
- License: MIT
- Finetuned from model:
microsoft/phi-2
Prompt Template
prompt_template = f"""<s>
Task Description:
Your task is to create a MongoDB query that accurately fulfills the provided Instruct while strictly adhering to the given MongoDB schema. Ensure that the query solely relies on keys and columns present in the schema. Minimize the usage of lookup operations wherever feasible to enhance query efficiency.
MongoDB Schema:
{db_schema}
### Instruct:
{text}
### Output:
"""
How to Get Started with the Model
Use the code sample provided in the original post to interact with the model.
from transformers import (
AutoTokenizer,
AutoModelForCausalLM,
BitsAndBytesConfig,
)
import torch
from peft import PeftModel
db_schema = '''{
"collections": [
{
"name": "shipwrecks",
"indexes": [
{
"key": {
"_id": 1
}
},
{
"key": {
"feature_type": 1
}
},
{
"key": {
"chart": 1
}
},
{
"key": {
"latdec": 1,
"londec": 1
}
}
],
"uniqueIndexes": [],
"document": {
"properties": {
"_id": {
"bsonType": "string"
},
"recrd": {
"bsonType": "string"
},
"vesslterms": {
"bsonType": "string"
},
"feature_type": {
"bsonType": "string"
},
"chart": {
"bsonType": "string"
},
"latdec": {
"bsonType": "double"
},
"londec": {
"bsonType": "double"
},
"gp_quality": {
"bsonType": "string"
},
"depth": {
"bsonType": "string"
},
"sounding_type": {
"bsonType": "string"
},
"history": {
"bsonType": "string"
},
"quasou": {
"bsonType": "string"
},
"watlev": {
"bsonType": "string"
},
"coordinates": {
"bsonType": "array",
"items": {
"bsonType": "double"
}
}
}
}
}
],
"version": 1
}'''
text = ''''Find the count of shipwrecks for each unique combination of "latdec" and "longdec"'''
prompt = f"""<s>
Task Description:
Your task is to create a MongoDB query that accurately fulfills the provided Instruct while strictly adhering to the given MongoDB schema. Ensure that the query solely relies on keys and columns present in the schema. Minimize the usage of lookup operations wherever feasible to enhance query efficiency.
MongoDB Schema:
{db_schema}
### Instruct:
{text}
### Output:
"""
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
base_model_id = "microsoft/phi-2"
tokenizer = AutoTokenizer.from_pretrained(base_model_id, use_fast=True)
compute_dtype = getattr(torch, "float16")
bnb_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_quant_type="nf4",
bnb_4bit_compute_dtype=compute_dtype,
bnb_4bit_use_double_quant=True,
)
model = AutoModelForCausalLM.from_pretrained(
base_model_id,
trust_remote_code=True,
quantization_config=bnb_config,
revision="refs/pr/23",
device_map={"": 0},
torch_dtype="auto",
flash_attn=True,
flash_rotary=True,
fused_dense=True,
)
adapter = 'Chirayu/phi-2-mongodb'
model = PeftModel.from_pretrained(model, adapter).to(device)
model_inputs = tokenizer(prompt, return_tensors="pt").to(device)
output = model.generate(
**model_inputs,
max_length=1024,
no_repeat_ngram_size=10,
repetition_penalty=1.02,
pad_token_id=tokenizer.eos_token_id,
eos_token_id=tokenizer.eos_token_id,
)[0]
prompt_length = model_inputs['input_ids'].shape[1]
query = tokenizer.decode(output[prompt_length:], skip_special_tokens=False)
try:
stop_idx = query.index("</s>")
except Exception as e:
print(e)
stop_idx = len(query)
print(query[: stop_idx].strip())
- PEFT 0.10.0