Korcen

korcen-ml is a project that uses deep learning to push accuracy further, addressing the main weakness of the existing keyword-based korcen: it is easy to bypass.
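To illustrate why plain keyword matching is easy to bypass, here is a minimal sketch. It is not korcen's actual implementation: the block list `BLOCKED` and the function `naive_filter` are hypothetical names invented for this example, and "badword" is only a placeholder term.

```python
# Hypothetical block list; "badword" is a placeholder, not a term taken from korcen.
BLOCKED = {"badword"}


def naive_filter(text: str) -> bool:
    """Return True if any blocked keyword appears verbatim in the text."""
    lowered = text.lower()
    return any(word in lowered for word in BLOCKED)


print(naive_filter("this is a BadWord"))        # True: exact keyword match
print(naive_filter("this is a b a d w o r d"))  # False: inserted spaces break the match
print(naive_filter("this is a bad.word"))       # False: punctuation breaks the match
```

A learned classifier scores the sentence as a whole instead of searching for exact substrings, which is the motivation given above for moving to deep learning.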

Only some of the models are public; the model files can be found here.

If you would like to download more model files or the training data, please get in touch.

	Sentences in dataset
VDCNN(23.4.30)	200,000
VDCNN_KOGPT2(23.5.28)	2,000,000
VDCNN_LLAMA2(23.9.30)	5,000,000
VDCNN_LLAMA2_V2(24.1.29)	10,000,000

Existing keyword-based library: py version, ts version

Support Discord server

Model validation

Each dataset defines profanity differently, so keep in mind that the scores below carry some error.

	korean-malicious-comments-dataset	Curse-detection-data	kmhas_korean_hate_speech	Korean Extremist Website Womad Hate Speech Data
korcen(v0.3.5)	0.7121	0.8415	0.6800	0.6305
VDCNN(23.4.30)	0.6900	0.4885		0.4885
VDCNN_KOGPT2(23.6.15)	0.7545	0.7824		0.7055
VDCNN_LLAMA2(23.9.30)	0.7762	0.8104	0.7296	replaced by V2
VDCNN_LLAMA2_V2(24.1.29)	0.8322	0.8410	0.7837	0.7120
badword_check(23.10.1)	0.5829	0.6761
CurseDetector(24.1.10)	0.5679	not tested (too time-consuming)		0.5785
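The source does not state which metric the table reports. As a rough sketch, assuming the figures are accuracy at a 0.5 threshold (an assumption, not something the source confirms), a comparison of this kind could be computed as follows. The names `accuracy` and `toy_score` and both sample datasets are hypothetical; `toy_score` merely stands in for a real scorer such as `predict_text` from the example below.

```python
from typing import Callable, Iterable, Tuple


def accuracy(score_fn: Callable[[str], float],
             samples: Iterable[Tuple[str, int]],
             threshold: float = 0.5) -> float:
    """Fraction of samples whose thresholded score matches the label (1 = abusive)."""
    samples = list(samples)
    if not samples:
        raise ValueError("samples must not be empty")
    correct = sum(
        int(score_fn(text) >= threshold) == label for text, label in samples
    )
    return correct / len(samples)


def toy_score(text: str) -> float:
    """Stand-in scorer for illustration only: flags the placeholder 'badword'."""
    return 0.9 if "badword" in text.lower() else 0.1


# Two hypothetical datasets that label the same borderline sentence differently.
strict_labels = [("hello", 0), ("you badword", 1), ("that was dumb", 1), ("nice", 0)]
lenient_labels = [("hello", 0), ("you badword", 1), ("that was dumb", 0), ("nice", 0)]

print(accuracy(toy_score, strict_labels))   # 0.75
print(accuracy(toy_score, lenient_labels))  # 1.0
```

The same scorer gets different results only because the two datasets disagree on one borderline sentence, which is the kind of labeling difference the caveat above the table refers to.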

Example

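# Environment: Python 3.10, TensorFlow 2.10
import pickle

import tensorflow as tf
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Maximum sequence length fed to the model
maxlen = 1000

model_path = 'vdcnn_model.h5'
tokenizer_path = "tokenizer.pickle"

# Load the trained model and the pickled tokenizer.
# Note: encode_plus (used below) is a Hugging Face tokenizer method, so the
# pickled object is presumably one and transformers likely needs to be installed.
model = tf.keras.models.load_model(model_path)
with open(tokenizer_path, "rb") as f:
    tokenizer = pickle.load(f)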

def preprocess_text(text):
    # Lowercase the input before tokenizing
    return text.lower()

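def predict_text(text):
    # Return the model's score for one sentence (compared against 0.5 below)
    sentence = preprocess_text(text)
    # Tokenize to token IDs, padded or truncated to maxlen
    encoded_sentence = tokenizer.encode_plus(sentence,
                                             max_length=maxlen,
                                             padding="max_length",
                                             truncation=True)['input_ids']
    # Build a batch of one sequence with length maxlen
    sentence_seq = pad_sequences([encoded_sentence], maxlen=maxlen, truncating="post")
    # model.predict returns one row per input; take the single output value
    prediction = model.predict(sentence_seq)[0][0]
    return prediction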
    
while True:
    text = input("Enter the sentence you want to test: ")
    result = predict_text(text)
    if result >= 0.5:
        print("This sentence contains abusive language.")
    else:
        print("It's a normal sentence.")

Maker

Tanat

github:   Tanat05
discord:  Tanat05
email:    tanat@tanat.kr