---
language: ja
license: cc-by-sa-4.0
tags:
- generated_from_trainer
- text-classification
metrics:
- accuracy
widget:
- text: 💪(^ω^ 🍤)
  example_title: Facemark 1
- text: (੭ु∂∀6)੭ु⁾⁾ ஐ•*¨*•.¸¸
  example_title: Facemark 2
- text: ':-P'
  example_title: Facemark 3
- text: (o.o)
  example_title: Facemark 4
- text: (10/7~)
  example_title: Non-facemark 1
- text: ??<<「ニャア(しゃーねぇな)」プイッ
  example_title: Non-facemark 2
- text: (0.01)
  example_title: Non-facemark 3
base_model: cl-tohoku/bert-base-japanese-whole-word-masking
---

# Facemark Detection

This model classifies a given text as a facemark (1) or not (0).

It is a fine-tuned version of [cl-tohoku/bert-base-japanese-whole-word-masking](https://huggingface.co/cl-tohoku/bert-base-japanese-whole-word-masking) on an original facemark dataset.
It achieves the following results on the evaluation set:
- Loss: 0.1301
- Accuracy: 0.9896

## Model description

This model classifies a given text as a facemark (1) or not (0).

## Intended uses & limitations

Extract the facemark-prone portion of a text and feed that portion to the model. Facemark candidates can be extracted with a regex, but such matches usually include many non-facemarks, which is what this classifier filters out. For example, I used the following Perl regex pattern to extract facemark-prone text (a rough Python port is sketched at the end of this card):

```perl
my $input_text = "facemark-prone text";

my $text          = '[0-9A-Za-zぁ-ヶ一-龠]';
my $non_text      = '[^0-9A-Za-zぁ-ヶ一-龠]';
my $allow_text    = '[ovっつ゜ニノ三二]';
my $hw_kana       = '[ヲ-゚]';
my $open_bracket  = '[\(∩꒰(]';
my $close_bracket = '[\)∩꒱)]';

my $around_face = '(?:' . $non_text . '|' . $allow_text . ')*';
my $face        = '(?!(?:' . $text . '|' . $hw_kana . '){3,8}).{3,8}';
my $face_char   = $around_face . $open_bracket . $face . $close_bracket . $around_face;

my $facemark;
if ($input_text =~ /($face_char)/) {
    $facemark = $1;
}
```

Examples of facemarks:
```
(^U^)←
。\n\n⊂( *・ω・ )⊃
っ(。>﹏<)
タカ( ˘ω' ) ヤスゥ…
。(’↑▽↑)
……💰( ˘ω˘ )💰
ーーー(*´꒳`*)!(
…(o:∇:o)
!!…(;´Д`)?
(*´﹃ `*)✿
```

Examples of non-facemarks:
```
(3,000円)
:
(1/3)
(@nVApO)
(10/7~)
?<<「ニャア(しゃーねぇな)」プイッ
(残り 51字)
(-0.1602)
(25-0)
(コーヒー飲んだ)
(※軽トラ)
```

This model is intended for facemark-prone text like the above.

## Training and evaluation data

The facemark data was collected manually and automatically from Twitter timelines (an assumed CSV layout is sketched at the end of this card):

* train.csv: 35,591 samples (29,911 facemark, 5,680 non-facemark)
* test.csv: 3,954 samples (3,315 facemark, 639 non-facemark)

## Training procedure

```bash
python ./examples/pytorch/text-classification/run_glue.py \
  --model_name_or_path=cl-tohoku/bert-base-japanese-whole-word-masking \
  --do_train --do_eval \
  --max_seq_length=128 --per_device_train_batch_size=32 \
  --use_fast_tokenizer=False --learning_rate=2e-5 --num_train_epochs=50 \
  --output_dir=facemark_classify \
  --save_steps=1000 --save_total_limit=3 \
  --train_file=train.csv \
  --validation_file=test.csv
```

### Training hyperparameters

The following hyperparameters were used during training:
- learning_rate: 2e-05
- train_batch_size: 32
- eval_batch_size: 8
- seed: 42
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- num_epochs: 50.0

### Training results

The model achieves the following results on the evaluation set:
- Loss: 0.1301
- Accuracy: 0.9896

### Framework versions

- Transformers 4.26.0.dev0
- Pytorch 1.11.0+cu102
- Datasets 2.7.1
- Tokenizers 0.13.2
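
## Usage sketch

A minimal inference sketch using the Hugging Face Transformers pipeline. The model ID below is a placeholder for this repository's actual name, and the `LABEL_0`/`LABEL_1` names assume the defaults produced by `run_glue.py` fine-tuning (0 = non-facemark, 1 = facemark). The underlying Japanese BERT tokenizer additionally requires `fugashi` and `ipadic` to be installed.

```python
from transformers import pipeline

# "<this-repo-id>" is a placeholder; substitute this model's actual repository ID.
classifier = pipeline("text-classification", model="<this-repo-id>")

print(classifier("💪(^ω^ 🍤)"))  # expected: LABEL_1 (facemark)
print(classifier("(10/7~)"))    # expected: LABEL_0 (non-facemark)
```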
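
Since the intended workflow is "extract a candidate span with the regex, then let the model filter out non-facemarks", a rough Python port of the Perl extraction pattern above may be convenient. This is an untested sketch; the character classes are copied verbatim from the Perl version.

```python
import re

# Character classes copied from the Perl extraction pattern above.
TEXT          = '[0-9A-Za-zぁ-ヶ一-龠]'
NON_TEXT      = '[^0-9A-Za-zぁ-ヶ一-龠]'
ALLOW_TEXT    = '[ovっつ゜ニノ三二]'
HW_KANA       = '[ヲ-゚]'
OPEN_BRACKET  = r'[\(∩꒰(]'
CLOSE_BRACKET = r'[\)∩꒱)]'

AROUND_FACE = f'(?:{NON_TEXT}|{ALLOW_TEXT})*'
FACE        = f'(?!(?:{TEXT}|{HW_KANA}){{3,8}}).{{3,8}}'
FACE_RE     = re.compile(AROUND_FACE + OPEN_BRACKET + FACE + CLOSE_BRACKET + AROUND_FACE)

def extract_facemark_candidate(text):
    """Return the first facemark-prone span of `text`, or None if nothing matches."""
    m = FACE_RE.search(text)
    return m.group(0) if m else None
```

A candidate returned by `extract_facemark_candidate` can then be passed to the classifier above.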
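
## Training data format (assumption)

The training command passes `train.csv`/`test.csv` directly to `run_glue.py`, which expects a `label` column alongside the text column(s); the exact column names used for this model are not stated in this card, so the layout below is an assumption for illustration only.

```python
import csv

# Assumed CSV layout for run_glue.py: one text column plus a "label" column
# (1 = facemark, 0 = non-facemark). The column name "sentence1" is hypothetical.
rows = [
    ("(^U^)←", 1),
    ("っ(。>﹏<)", 1),
    ("(3,000円)", 0),
    ("(10/7~)", 0),
]
with open("train.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["sentence1", "label"])
    writer.writerows(rows)
```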