|
--- |
|
language: |
|
- ko |
|
tags: |
|
- classification |
|
license: mit |
|
datasets: |
|
- nsmc |
|
widget: |
|
- text: "불후의 명작입니다! 이렇게 감동적인 내용은 처음이에요" |
|
example_title: "Positive" |
|
- text: "시간이 정말 아깝습니다. 10점 만점에 1점도 아까워요.." |
|
example_title: "Negative" |
|
metrics: |
|
- accuracy |
|
- f1 |
|
- precision |
|
- recall- accuracy |
|
--- |
|
|
|
# Sentiment Binary Classification (fine-tuning with KoELECTRA-Small-v3 model and Naver Sentiment Movie Corpus dataset) |
|
|
|
## Usage (Amazon SageMaker inference applicable) |
|
It uses the interface of the SageMaker Inference Toolkit as is, so it can be easily deployed to SageMaker Endpoint. |
|
|
|
### inference_nsmc.py |
|
|
|
```python |
|
import json |
|
import sys |
|
import logging |
|
import torch |
|
from torch import nn |
|
from transformers import ElectraConfig |
|
from transformers import ElectraModel, AutoTokenizer, ElectraTokenizer, ElectraForSequenceClassification |
|
|
|
logging.basicConfig( |
|
level=logging.INFO, |
|
format='[{%(filename)s:%(lineno)d} %(levelname)s - %(message)s', |
|
handlers=[ |
|
logging.FileHandler(filename='tmp.log'), |
|
logging.StreamHandler(sys.stdout) |
|
] |
|
) |
|
logger = logging.getLogger(__name__) |
|
|
|
max_seq_length = 128 |
|
classes = ['Neg', 'Pos'] |
|
|
|
tokenizer = AutoTokenizer.from_pretrained("daekeun-ml/koelectra-small-v3-nsmc") |
|
device = torch.device("cuda" if torch.cuda.is_available() else "cpu") |
|
|
|
|
|
def model_fn(model_path=None): |
|
#### |
|
# If you have your own trained model |
|
# Huggingface pre-trained model: 'monologg/koelectra-small-v3-discriminator' |
|
#### |
|
#config = ElectraConfig.from_json_file(f'{model_path}/config.json') |
|
#model = ElectraForSequenceClassification.from_pretrained(f'{model_path}/model.pth', config=config) |
|
|
|
# Download model from the Huggingface hub |
|
model = ElectraForSequenceClassification.from_pretrained('daekeun-ml/koelectra-small-v3-nsmc') |
|
model.to(device) |
|
return model |
|
|
|
|
|
def input_fn(input_data, content_type="application/jsonlines"): |
|
data_str = input_data.decode("utf-8") |
|
jsonlines = data_str.split("\n") |
|
transformed_inputs = [] |
|
|
|
for jsonline in jsonlines: |
|
text = json.loads(jsonline)["text"][0] |
|
logger.info("input text: {}".format(text)) |
|
encode_plus_token = tokenizer.encode_plus( |
|
text, |
|
max_length=max_seq_length, |
|
add_special_tokens=True, |
|
return_token_type_ids=False, |
|
padding="max_length", |
|
return_attention_mask=True, |
|
return_tensors="pt", |
|
truncation=True, |
|
) |
|
transformed_inputs.append(encode_plus_token) |
|
|
|
return transformed_inputs |
|
|
|
|
|
def predict_fn(transformed_inputs, model): |
|
predicted_classes = [] |
|
|
|
for data in transformed_inputs: |
|
data = data.to(device) |
|
output = model(**data) |
|
|
|
softmax_fn = nn.Softmax(dim=1) |
|
softmax_output = softmax_fn(output[0]) |
|
_, prediction = torch.max(softmax_output, dim=1) |
|
|
|
predicted_class_idx = prediction.item() |
|
predicted_class = classes[predicted_class_idx] |
|
score = softmax_output[0][predicted_class_idx] |
|
logger.info("predicted_class: {}".format(predicted_class)) |
|
|
|
prediction_dict = {} |
|
prediction_dict["predicted_label"] = predicted_class |
|
prediction_dict['score'] = score.cpu().detach().numpy().tolist() |
|
|
|
jsonline = json.dumps(prediction_dict) |
|
logger.info("jsonline: {}".format(jsonline)) |
|
predicted_classes.append(jsonline) |
|
|
|
predicted_classes_jsonlines = "\n".join(predicted_classes) |
|
return predicted_classes_jsonlines |
|
|
|
|
|
def output_fn(outputs, accept="application/jsonlines"): |
|
return outputs, accept |
|
``` |
|
|
|
### test.py |
|
```python |
|
>>> from inference_nsmc import model_fn, input_fn, predict_fn, output_fn |
|
>>> with open('samples/nsmc.txt', mode='rb') as file: |
|
>>> model_input_data = file.read() |
|
>>> model = model_fn() |
|
>>> transformed_inputs = input_fn(model_input_data) |
|
>>> predicted_classes_jsonlines = predict_fn(transformed_inputs, model) |
|
>>> model_outputs = output_fn(predicted_classes_jsonlines) |
|
>>> print(model_outputs[0]) |
|
|
|
[{inference_nsmc.py:47} INFO - input text: 이 영화는 최고의 영화입니다 |
|
[{inference_nsmc.py:47} INFO - input text: 최악이에요. 배우의 연기력도 좋지 않고 내용도 너무 허접합니다 |
|
[{inference_nsmc.py:77} INFO - predicted_class: Pos |
|
[{inference_nsmc.py:84} INFO - jsonline: {"predicted_label": "Pos", "score": 0.9619030952453613} |
|
[{inference_nsmc.py:77} INFO - predicted_class: Neg |
|
[{inference_nsmc.py:84} INFO - jsonline: {"predicted_label": "Neg", "score": 0.9994170665740967} |
|
{"predicted_label": "Pos", "score": 0.9619030952453613} |
|
{"predicted_label": "Neg", "score": 0.9994170665740967} |
|
``` |
|
|
|
### Sample data (samples/nsmc.txt) |
|
``` |
|
{"text": ["이 영화는 최고의 영화입니다"]} |
|
{"text": ["최악이에요. 배우의 연기력도 좋지 않고 내용도 너무 허접합니다"]} |
|
``` |
|
|
|
## References |
|
- KoELECTRA: https://github.com/monologg/KoELECTRA |
|
- Naver Sentiment Movie Corpus Dataset: https://github.com/e9t/nsmc |