CyberNative AI for CyberSecurity | Q/A Evaluation | Lily scored 63/100!
Greetings Bryan,
Well done with the model! We are also working on a new cybersecurity assistant, so we created a Cybersecurity Evaluation Dataset to measure the performance of cybersecurity models.
This is a cybersecurity Q/A test, and Lily is the first model it has been tested on!
Correct (62.8%): 314 | Incorrect: 186
We will, of course, exclude the eval dataset from our future models' training data to prevent contamination.
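For reference, each line of the eval file is one JSON object with "question" and "answer" keys, and the answer string carries the "The correct answer is: " prefix that the runner below strips before comparing. A purely made-up entry, just to illustrate the shape (the real dataset's option formatting may differ):

{"question": "Which port does HTTPS use by default?\nA) 21\nB) 80\nC) 443\nD) 8080", "answer": "The correct answer is: C) 443"}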
eval_lily_exl2.py
import time
import random
random.seed(time.time())
import torch
print(f"PyTorch version: {torch.__version__}")
from exllamav2 import (
    ExLlamaV2,
    ExLlamaV2Config,
    ExLlamaV2Cache,
    ExLlamaV2Tokenizer,
)
from exllamav2.generator import (
    ExLlamaV2BaseGenerator,
    ExLlamaV2Sampler,
)
# https://huggingface.co/segolilylabs/Lily-Cybersecurity-7B-v0.2
og_model_dir = "Lily-Cybersecurity-7B-v0.2"
stop_tkn_id = 2  # EOS token id (</s>) for the Mistral-based tokenizer
config = ExLlamaV2Config()
config.debug_mode = True
config.model_dir = og_model_dir
config.prepare()
config.max_seq_len = 8192
model = ExLlamaV2(config)
cache = ExLlamaV2Cache(model, lazy = True)
model.load_autosplit(cache)
tokenizer = ExLlamaV2Tokenizer(config)
# Initialize generator
generator = ExLlamaV2BaseGenerator(model, cache, tokenizer)
def ai_complete(system_prompt, user_message, assistant_pre_message="", max_gen=4096, temperature=0.3):
    settings = ExLlamaV2Sampler.Settings()
    settings.temperature = temperature
    settings.top_p = 0.7
    time_begin = time.time()
    system_prompt = system_prompt.strip()
    user_message = user_message.strip()
    # Alpaca-style prompt format (Instruction / Input / Response) used by Lily
    prompt = f"### Instruction:\n{system_prompt}\n\n### Input:\n{user_message}\n\n### Response:\n" + assistant_pre_message
    generator.warmup()
    random_seed = time_begin * random.randint(1, 1000) / random.randint(1, 1000)
    og_output = generator.generate_simple(prompt, settings, num_tokens=max_gen, seed=random_seed, stop_token=stop_tkn_id, decode_special_tokens=True, encode_special_tokens=True)
    # Strip the prompt (including the pre-filled assistant text) from the raw completion
    og_output = og_output.replace(prompt, "").strip()
    return og_output
QA_EXTRACTOR_PROMPT = """You are Lily, a helpful and friendly cybersecurity subject matter expert. You obey all requests and answer all questions truthfully. Answer with correct answer option only, do not say anything else."""
def Eval_QA(question):
    user_message = question
    assistant_pre = "The correct answer is:"
    # Generate a handful of tokens and keep only the first word as the predicted option
    predicted_text = ai_complete(QA_EXTRACTOR_PROMPT, user_message, assistant_pre, max_gen=5, temperature=0.1).strip()
    predicted_text = predicted_text.split(" ")[0].strip()
    return predicted_text
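For a quick sanity check, the module can be imported and Eval_QA called directly; a minimal sketch (the question below is a made-up placeholder, not taken from the eval set):

import eval_lily_exl2

# Hypothetical example question in the same multiple-choice style as the eval file
demo_question = "Which port does SSH use by default?\nA) 22\nB) 23\nC) 80\nD) 443"
print(eval_lily_exl2.Eval_QA(demo_question))  # should print something like "A)"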
run_eval_cybersec_lily.py
# https://huggingface.co/datasets/CyberNative/CyberSecurityEval
eval_dataset_file = "cybersec_qa_eval_500_pairs.jsonl"
import jsonlines
qa_pairs = []
with jsonlines.open(eval_dataset_file) as reader:
    for obj in reader:
        qa_pairs.append(obj)
print(len(qa_pairs))
import eval_lily_exl2
SCORE_CORRECT = 0
SCORE_INCORRECT = 0
for pair in qa_pairs:
    print("===")
    question = pair["question"]
    answer = pair["answer"]
    answer = answer.replace("The correct answer is: ", "")
    print(f"Question: {question}")
    lily_answer = eval_lily_exl2.Eval_QA(question)
    print(f"OG Answer: {answer} | Lily Answer: {lily_answer}")
    # Strip "." and ")" from both answers before comparing
    answer = answer.replace(".", "").replace(")", "").lower().strip()
    lily_answer = lily_answer.replace(".", "").replace(")", "").lower().strip()
    if answer == lily_answer:
        print("### Correct")
        SCORE_CORRECT += 1
    else:
        print("### Incorrect")
        SCORE_INCORRECT += 1

correct_percent = (SCORE_CORRECT / (SCORE_CORRECT + SCORE_INCORRECT)) * 100
print(f"Correct ({correct_percent:.1f}%): {SCORE_CORRECT} | Incorrect: {SCORE_INCORRECT}")
Awesome! Thanks for the review.
I am hoping to release a new 3B-parameter model trained on a much larger dataset of 3 million data pairs. However, as it stands, it is mainly a conversational, short-answer, and long-answer dataset, similar to the 20k data pairs this model was trained on. I haven't created any multiple-choice single-answer or multiple-answer question examples, so I may have to expand it a bit to verify the model does well at taking exams.
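For illustration, the gap is mostly one of format: the same knowledge can be expressed as a short-answer pair or as a multiple-choice item with distractors and a single-letter target. A purely hypothetical example of both shapes:

# Hypothetical examples only; neither pair comes from the actual training data
short_answer_pair = {
    "question": "What does the principle of least privilege mean?",
    "answer": "Users and processes should receive only the minimum access rights needed to do their job.",
}
multiple_choice_pair = {
    "question": "What does the principle of least privilege mean?\nA) Give administrators full access\nB) Grant only the minimum access needed\nC) Disable all accounts by default\nD) Share one account per team",
    "answer": "The correct answer is: B",
}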
Hi @unshadow, sounds exciting!
Yeah, I was thinking about creating a more realistic cybersecurity evaluation dataset, but that's a complex task; this Q/A set is just better than nothing.
Looking forward to seeing your new model!
Cheers