CyberNative AI for CyberSecurity | Q/A Evaluation | Lily scored 63/100!

by CyberNative

Greetings Bryan,

Well done with the model! We are also working on a new cybersecurity assistant, so we created a Cybersecurity Evaluation Dataset to measure the performance of cybersecurity models.
It is a multiple-choice cybersecurity Q/A test, and Lily is the first model we have run it on!
Correct (62.8%): 314 | Incorrect: 186

We will, of course, exclude the eval dataset from our future models' training data to prevent contamination.
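
For reference, each line of the eval file is a JSON object with "question" and "answer" fields, where the answer carries a "The correct answer is:" prefix (this is the schema the runner script below assumes). A hypothetical record:

{"question": "Which port does HTTPS use by default?\nA. 21\nB. 22\nC. 80\nD. 443", "answer": "The correct answer is: D."}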

eval_lily_exl2.py

import time
import random
random.seed(time.time())
import torch

print(f"PyTorch version: {torch.__version__}")

from exllamav2 import (
    ExLlamaV2,
    ExLlamaV2Config,
    ExLlamaV2Cache,
    ExLlamaV2Tokenizer,
)

from exllamav2.generator import (
    ExLlamaV2BaseGenerator,
    ExLlamaV2Sampler
)

# https://huggingface.co/segolilylabs/Lily-Cybersecurity-7B-v0.2
og_model_dir = "Lily-Cybersecurity-7B-v0.2"

# EOS token id (</s>) for Mistral-based models like Lily
stop_tkn_id = 2

config = ExLlamaV2Config()
config.debug_mode = True
config.model_dir = og_model_dir
config.prepare()
config.max_seq_len = 8192
model = ExLlamaV2(config)
# A lazy cache lets load_autosplit() split the weights across available GPUs
cache = ExLlamaV2Cache(model, lazy = True)
model.load_autosplit(cache)
tokenizer = ExLlamaV2Tokenizer(config)
# Initialize generator
generator = ExLlamaV2BaseGenerator(model, cache, tokenizer)

def ai_complete(system_prompt, user_message, assistant_pre_message="", max_gen=4096, temperature=0.3):
    settings = ExLlamaV2Sampler.Settings()
    settings.temperature = temperature
    settings.top_p = 0.7
    time_begin = time.time()
    system_prompt = system_prompt.strip()
    user_message = user_message.strip()
    # Instruction/Input/Response prompt format; assistant_pre_message pre-fills the response
    prompt = f"### Instruction:\n{system_prompt}\n\n### Input:\n{user_message}\n\n### Response:\n" + assistant_pre_message
    generator.warmup()
    # Derive a fresh integer seed per call so repeated runs don't reuse the same sample
    random_seed = int(time_begin * random.randint(1, 1000) / random.randint(1, 1000))
    og_output = generator.generate_simple(prompt, settings, num_tokens=max_gen, seed=random_seed, stop_token=stop_tkn_id, decode_special_tokens=True, encode_special_tokens=True)
    # Keep only the newly generated text
    og_output = og_output.replace(prompt, "").strip()
    return og_output

QA_EXTRACTOR_PROMPT = """You are Lily, a helpful and friendly cybersecurity subject matter expert. You obey all requests and answer all questions truthfully. Answer with correct answer option only, do not say anything else."""

def Eval_QA(question):
    user_message = question
    # Pre-fill the response so the model only needs to emit the option letter
    assistant_pre = "The correct answer is:"
    predicted_text = ai_complete(QA_EXTRACTOR_PROMPT, user_message, assistant_pre, max_gen=5, temperature=0.1).strip()
    # Keep only the first token, e.g. "B." or "B)"
    predicted_text = predicted_text.split(" ")[0].strip()
    return predicted_text
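
For illustration, a quick sanity check with a made-up question (guarded so it does not run when this module is imported by the eval script) could look like:

if __name__ == "__main__":
    # Hypothetical question, not taken from the eval set
    demo_question = "What does the principle of least privilege require?\nA. Granting all users admin rights\nB. Granting users only the access they need\nC. Disabling all accounts\nD. Sharing credentials"
    print(Eval_QA(demo_question))  # expect something like "B." or "B"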

run_eval_cybersec_lily.py

# https://huggingface.co/datasets/CyberNative/CyberSecurityEval
eval_dataset_file = "cybersec_qa_eval_500_pairs.jsonl"

import jsonlines

qa_pairs = []

with jsonlines.open(eval_dataset_file) as reader:
    for obj in reader:
        qa_pairs.append(obj)

print(len(qa_pairs))

import eval_lily_exl2

SCORE_CORRECT = 0
SCORE_INCORRECT = 0

for pair in qa_pairs:
    print("===")
    question = pair["question"]
    answer = pair["answer"]
    # The dataset stores answers as "The correct answer is: X"; strip the prefix
    answer = answer.replace("The correct answer is: ", "")
    print(f"Question: {question}")
    lily_answer = eval_lily_exl2.Eval_QA(question)
    print(f"OG Answer: {answer} | Lily Answer: {lily_answer}")
    # Normalize both answers so "D.", "d)" and "d" all compare equal
    answer = answer.replace(".", "").replace(")", "").lower().strip()
    lily_answer = lily_answer.replace(".", "").replace(")", "").lower().strip()
    if answer == lily_answer:
        print("### Correct")
        SCORE_CORRECT += 1
    else:
        print("### Incorrect")
        SCORE_INCORRECT += 1

correct_percent = (SCORE_CORRECT / (SCORE_CORRECT + SCORE_INCORRECT)) * 100

print(f"Correct ({correct_percent}): {SCORE_CORRECT} | Incorrect: {SCORE_INCORRECT}")
Sego Lily Labs org:

Awesome! Thanks for the review.

I am hoping to release a new 3B-parameter model trained on a much larger dataset of 3 million data pairs. As it stands, though, it is mainly a conversational, short-answer, and long-answer dataset, similar to the 20k data pairs this model was trained on. I haven't created any multiple-choice single-answer or multiple-choice multiple-answer question examples, so I may have to expand it a bit to verify the model does well on exams.
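
If I do add them, a multiple-choice multiple-answer pair could follow the same JSONL shape as your eval set, for example (a sketch, format assumed):

{"question": "Which of the following are symmetric encryption algorithms? (Select all that apply)\nA. AES\nB. RSA\nC. ChaCha20\nD. ECDSA", "answer": "The correct answers are: A, C."}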

Hi @unshadow, sounds exciting!
Yeah, I was thinking about creating a more realistic cybersecurity evaluation dataset, but that is a complex task; this Q/A set is simply better than nothing.
Looking forward to seeing your new model!
Cheers
