import torch
import joblib
import numpy as np
import pandas as pd
import gradio as gr
from nltk.data import load as nltk_load
from transformers import AutoTokenizer, AutoModelForCausalLM
print("Loading model & Tokenizer...")
model_id = 'gpt2'
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)
print("Loading NLTL & and scikit-learn model...")
NLTK = nltk_load('data/english.pickle')
sent_cut_en = NLTK.tokenize
clf = joblib.load('data/gpt2-small-model')
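# The pickled classifier was trained offline. Per the description in the UI below, it is a
# scikit-learn VotingClassifier over XGBoost, LightGBM, CatBoost and RandomForest models.
# A minimal sketch of how such an ensemble could be assembled is kept here as a comment,
# since training is not part of this Space (X_train / y_train are hypothetical and the exact
# training setup is an assumption; voting='soft' is required for predict_proba, used below):
# from sklearn.ensemble import RandomForestClassifier, VotingClassifier
# from xgboost import XGBClassifier
# from lightgbm import LGBMClassifier
# from catboost import CatBoostClassifier
# clf = VotingClassifier(
#     estimators=[('xgb', XGBClassifier()), ('lgbm', LGBMClassifier()),
#                 ('cat', CatBoostClassifier()), ('rf', RandomForestClassifier())],
#     voting='soft',
# ).fit(X_train, y_train)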
# reduction='none' keeps one cross-entropy value per token, which the sentence-level and
# step-level perplexity statistics in gpt2_features rely on.
CROSS_ENTROPY = torch.nn.CrossEntropyLoss(reduction='none')
example = """\
The perplexity (PPL) is commonly used as a metric for evaluating the performance of language models (LM). It is defined as the \
exponential of the negative average log-likelihood of the text under the LM. A lower PPL indicates that the language model is more confident \
in its predictions, and is therefore considered to be a better model. The training of LMs is carried out on large-scale text corpora, it can \
be considered that it has learned some common language patterns and text structures. Therefore, PPL can be used to measure how well a text \
conforms to common characteristics.
I used all variants of the open-source GPT-2 model except xl size to compute the PPL (both text-level and sentence-level PPLs) of the collected \
texts. It is observed that, regardless of whether it is at the text level or the sentence level, the content generated by LLMs have relatively \
lower PPLs compared to the text written by humans. LLM captured common patterns and structures in the text it was trained on, and is very good at \
reproducing them. As a result, text generated by LLMs have relatively concentrated low PPLs.\
"""
def gpt2_features(text, tokenizer, model, sent_cut):
    # Tokenize sentence by sentence so that sentence boundaries (offsets) are known,
    # truncating once the sequence would exceed the model's maximum input length.
    input_max_length = tokenizer.model_max_length - 2
    token_ids, offsets = list(), list()
    sentences = sent_cut(text)
    for s in sentences:
        tokens = tokenizer.tokenize(s)
        ids = tokenizer.convert_tokens_to_ids(tokens)
        difference = len(token_ids) + len(ids) - input_max_length
        if difference > 0:
            ids = ids[:-difference]
        offsets.append((len(token_ids), len(token_ids) + len(ids)))
        token_ids.extend(ids)
        if difference >= 0:
            break
    # Forward pass: per-token negative log-likelihood under GPT-2
    input_ids = torch.tensor([tokenizer.bos_token_id] + token_ids)
    logits = model(input_ids).logits
    # Shift so that tokens < n predict token n
    shift_logits = logits[:-1].contiguous()
    shift_target = input_ids[1:].contiguous()
    loss = CROSS_ENTROPY(shift_logits, shift_target)

    # GLTR Test-2 feature: rank of each observed token in the predicted distribution,
    # bucketed into Top-10, Top-100, Top-1000 and 1000+ counts
    all_probs = torch.softmax(shift_logits, dim=-1)
    sorted_ids = torch.argsort(all_probs, dim=-1, descending=True)  # stable=True
    expanded_tokens = shift_target.unsqueeze(-1).expand_as(sorted_ids)
    indices = torch.where(sorted_ids == expanded_tokens)
    rank = indices[-1]
    counter = [
        rank < 10,
        (rank >= 10) & (rank < 100),
        (rank >= 100) & (rank < 1000),
        rank >= 1000
    ]
    counter = [c.long().sum(-1).item() for c in counter]
    # Compute perplexity at different levels: whole text, per sentence, and
    # cumulative token-by-token ("step") PPL
    text_ppl = loss.mean().exp().item()
    sent_ppl = list()
    for start, end in offsets:
        nll = loss[start: end].sum() / (end - start)
        sent_ppl.append(nll.exp().item())
    max_sent_ppl = max(sent_ppl)
    sent_ppl_avg = sum(sent_ppl) / len(sent_ppl)
    if len(sent_ppl) > 1:
        sent_ppl_std = torch.std(torch.tensor(sent_ppl)).item()
    else:
        sent_ppl_std = 0

    mask = torch.tensor([1] * loss.size(0))
    step_ppl = loss.cumsum(dim=-1).div(mask.cumsum(dim=-1)).exp()
    max_step_ppl = step_ppl.max(dim=-1)[0].item()
    step_ppl_avg = step_ppl.sum(dim=-1).div(loss.size(0)).item()
    if step_ppl.size(0) > 1:
        step_ppl_std = step_ppl.std().item()
    else:
        step_ppl_std = 0

    ppls = [
        text_ppl, max_sent_ppl, sent_ppl_avg, sent_ppl_std,
        max_step_ppl, step_ppl_avg, step_ppl_std
    ]
    return ppls + counter  # type: ignore
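# For reference: the 11 features returned by gpt2_features, in order. The names are informal
# labels added here for readability (they are not used elsewhere); the classifier consumes
# the raw list positionally.
GPT2_FEATURE_NAMES = [
    'text_ppl', 'max_sent_ppl', 'sent_ppl_avg', 'sent_ppl_std',
    'max_step_ppl', 'step_ppl_avg', 'step_ppl_std',
    'rank_lt_10', 'rank_10_to_100', 'rank_100_to_1000', 'rank_ge_1000',
]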
def predict_out(features, classifier, id_to_label):
    x = np.asarray([features])
    pred = classifier.predict(x)[0]
    prob = classifier.predict_proba(x)[0, pred]
    return [id_to_label[pred], prob]
def predict(text):
    with torch.no_grad():
        feats = gpt2_features(text, tokenizer, model, sent_cut_en)
        out = predict_out(feats, clf, ['Human Written', 'LLM Generated'])
    return out
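# Minimal usage sketch (assumes the GPT-2 weights and the data/ artifacts above loaded
# successfully); kept as a comment so that running this file only launches the Gradio app:
# label, prob = predict(example)
# print(label, f"{prob:.3f}")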
with gr.Blocks() as demo:
    gr.Markdown(
        """\
## Detect text generated using LLMs 🤖

Linguistic features such as perplexity, together with SOTA methods such as GLTR, were used to classify texts as Human Written or \
LLM Generated. This solution scored a ROC-AUC of 0.956 and placed 8th in the DAIGT LLM Competition on Kaggle.
- Source & Credits: [https://github.com/Hello-SimpleAI/chatgpt-comparison-detection](https://github.com/Hello-SimpleAI/chatgpt-comparison-detection)
- Competition: [https://www.kaggle.com/competitions/llm-detect-ai-generated-text/leaderboard](https://www.kaggle.com/competitions/llm-detect-ai-generated-text/leaderboard)
- Solution WriteUp: [https://www.kaggle.com/competitions/llm-detect-ai-generated-text/discussion/470224](https://www.kaggle.com/competitions/llm-detect-ai-generated-text/discussion/470224)\
"""
    )
    with gr.Row():
        gr.Markdown(
            """\
### Linguistic Analysis: Language Model Perplexity

Perplexity (PPL) is commonly used as a metric for evaluating the performance of language models (LMs). It is defined as the exponential \
of the negative average log-likelihood of the text under the LM. A lower PPL indicates that the language model is more confident in its \
predictions, and it is therefore considered a better model. Since LMs are trained on large-scale text corpora, they can be considered to \
have learned common language patterns and text structures. Therefore, PPL can be used to measure how well a text conforms to these common \
characteristics.

I used all variants of the open-source GPT-2 model except the xl size to compute the PPL (both text-level and sentence-level) of the \
collected texts. It is observed that, at both the text level and the sentence level, the content generated by LLMs has relatively lower \
PPL than text written by humans. LLMs capture the common patterns and structures of the text they were trained on and are very good at \
reproducing them. As a result, text generated by LLMs tends to have concentrated, low PPLs.

Humans can express themselves in a wide variety of ways, depending on the context, audience, and purpose of the text they are writing. \
This can include creative or imaginative elements, such as metaphors, similes, and unique word choices, which are harder for GPT-2 to predict.

### GLTR: Giant Language Model Test Room

This idea originates from the paper [arxiv.org/pdf/1906.04043.pdf](https://arxiv.org/pdf/1906.04043.pdf). It studies three tests to compute \
features of an input text. The major assumption is that, to generate fluent and natural-looking text, most decoding strategies sample \
high-probability tokens from the head of the distribution. I selected the most powerful Test-2 feature: the number of tokens in the Top-10, \
Top-100, Top-1000, and 1000+ ranks of the LM-predicted probability distributions.

### Modelling

Scikit-learn's VotingClassifier consisting of XGBClassifier, LGBMClassifier, CatBoostClassifier, and RandomForestClassifier, all with default parameters.\
"""
        )
        with gr.Column():
            a1 = gr.Textbox(lines=7, label='Text', value=example)
            button1 = gr.Button("🤖 Predict!")
            gr.Markdown("Prediction:")
            label1 = gr.Textbox(lines=1, label='Predicted Label')
            score1 = gr.Textbox(lines=1, label='Predicted Probability')

    button1.click(predict, inputs=[a1], outputs=[label1, score1])

demo.launch()