metadata
license: apache-2.0
language:
- zh
pipeline_tag: text-classification
library_name: transformers
risk-model-zh-v0.1
Introduction
This is a BERT model fine-tuned on a high-quality Chinese financial dataset. It outputs a security risk score for each input text, which can be used to identify and remove risky samples from financial datasets and so reduce the proportion of illegal or undesirable data. For the complete data cleaning process, please refer to YiZhao.
Quickstart
Here is an example code snippet for generating security risk scores using this model.
import torch
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Paths and column name; adjust to your own data.
model_name = "risk-model-zh-v0.1"
dataset_file = "your_dataset.jsonl"
text_column = "text"
output_file = "your_output.jsonl"

# Load the tokenizer and the classifier in bfloat16, using a GPU when available.
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, torch_dtype=torch.bfloat16)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

# Read the JSON Lines file as a single "train" split.
dataset = load_dataset('json', data_files=dataset_file, cache_dir="cache/", split='train', num_proc=12)
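# Score one batch of texts and attach the result as a "risk_score" column.
def compute_scores(batch):
    # Pad to the longest text in the batch and truncate overlong inputs.
    inputs = tokenizer(batch[text_column], return_tensors="pt", padding="longest", truncation=True).to(device)
    with torch.no_grad():
        outputs = model(**inputs)
        # squeeze(-1) drops the trailing dimension (this assumes one output logit per text);
        # cast to float32 first because NumPy has no bfloat16 dtype.
        logits = outputs.logits.squeeze(-1).float().cpu().numpy()
    batch["risk_score"] = logits.tolist()
    return batch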
dataset = dataset.map(compute_scores, batched=True, batch_size=512)
dataset.to_json(output_file)
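Once the scores are written, the removal step mentioned in the introduction could be sketched roughly as follows. This is a minimal illustration, not the procedure used by the authors: the helper name `filter_by_risk`, the threshold value, and the assumption that a higher `risk_score` means a riskier sample are all hypothetical and should be checked against a manually reviewed sample of your own data. The `keep_below` flag lets you flip the direction if the score turns out to run the other way.

```python
import json


def filter_by_risk(in_path, out_path, threshold, keep_below=True):
    """Copy records from a scored JSON Lines file, dropping those judged risky.

    Hypothetical helper: the threshold and the score direction are assumptions.
    With keep_below=True, records with risk_score < threshold are kept.
    Returns a (kept, dropped) tuple of record counts.
    """
    kept = dropped = 0
    with open(in_path, encoding="utf-8") as src, \
            open(out_path, "w", encoding="utf-8") as dst:
        for line in src:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            record = json.loads(line)
            score = record["risk_score"]
            keep = score < threshold if keep_below else score >= threshold
            if keep:
                # ensure_ascii=False keeps Chinese text readable in the output
                dst.write(json.dumps(record, ensure_ascii=False) + "\n")
                kept += 1
            else:
                dropped += 1
    return kept, dropped
```

For example, `filter_by_risk("your_output.jsonl", "your_filtered.jsonl", threshold=0.5)` would keep only records scoring below 0.5; the value 0.5 is a placeholder, and a suitable cutoff depends on the score distribution of your dataset.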