Initial commit
- README.md +95 -3
- config.json +58 -0
- model_args.json +1 -0
- optimizer.pt +3 -0
- pytorch_model.bin +3 -0
- scheduler.pt +3 -0
- special_tokens_map.json +1 -0
- tokenizer_config.json +1 -0
- training_args.bin +3 -0
- vocab.txt +0 -0
README.md
CHANGED
@@ -1,3 +1,95 @@
----
-
-
+---
+language:
+- en
+tags:
+- punctuation
+license: mit
+datasets:
+- yelp_polarity
+metrics:
+- f1
+---
+# ✨ bert-restore-punctuation
+[![forthebadge](https://forthebadge.com/images/badges/gluten-free.svg)]()
+
+This is a bert-base-uncased model fine-tuned for punctuation restoration on [Yelp Reviews](https://www.tensorflow.org/datasets/catalog/yelp_polarity_reviews).
+
+The model predicts the punctuation and upper-casing of plain, lower-cased text. Example use cases include ASR output or any other text that has lost its punctuation.
+
+This model is intended for direct use as a punctuation restoration model for general English. Alternatively, you can fine-tune it further on domain-specific texts for punctuation restoration tasks.
+
+The model restores the following punctuation marks -- **[! ? . , - : ; ' ]**
+
+The model also restores the upper-casing of words.
+
+-----------------------------------------------
+## 🚋 Usage
+**Below is a quick way to get up and running with the model.**
+1. First, install the package.
+```bash
+pip install rpunct
+```
+2. Sample python code.
+```python
+from rpunct import RestorePuncts
+# The default language is 'english'
+rpunct = RestorePuncts()
+rpunct.punctuate("""in 2018 cornell researchers built a high-powered detector that in combination with an algorithm-driven process called ptychography set a world record
+by tripling the resolution of a state-of-the-art electron microscope as successful as it was that approach had a weakness it only worked with ultrathin samples that were
+a few atoms thick anything thicker would cause the electrons to scatter in ways that could not be disentangled now a team again led by david muller the samuel b eckert
+professor of engineering has bested its own record by a factor of two with an electron microscope pixel array detector empad that incorporates even more sophisticated
+3d reconstruction algorithms the resolution is so fine-tuned the only blurring that remains is the thermal jiggling of the atoms themselves""")
+# Outputs the following:
+# In 2018, Cornell researchers built a high-powered detector that, in combination with an algorithm-driven process called Ptychography, set a world record by tripling the
+# resolution of a state-of-the-art electron microscope. As successful as it was, that approach had a weakness. It only worked with ultrathin samples that were a few atoms
+# thick. Anything thicker would cause the electrons to scatter in ways that could not be disentangled. Now, a team again led by David Muller, the Samuel B.
+# Eckert Professor of Engineering, has bested its own record by a factor of two with an Electron microscope pixel array detector empad that incorporates even more
+# sophisticated 3d reconstruction algorithms. The resolution is so fine-tuned the only blurring that remains is the thermal jiggling of the atoms themselves.
+```
+
+**This model works on arbitrarily large English text and uses a GPU if one is available.**
+
+-----------------------------------------------
+## 📡 Training data
+
+Here is the number of product reviews we used for fine-tuning the model:
+
+| Language | Number of text samples |
+| -------- | ---------------------- |
+| English  | 560,000                |
+
+We found the best convergence around _**3 epochs**_, which is what is presented here and available for download.
+
+-----------------------------------------------
+## 🎯 Accuracy
+The fine-tuned model obtained the following accuracy on 45,990 held-out text samples:
+
+| Accuracy | Overall F1 | Eval Support |
+| -------- | ---------- | ------------ |
+| 91%      | 90%        | 45,990       |
+
+Below is a breakdown of the performance of the model by each label:
+
+| label       | precision | recall | f1-score | support |
+| ----------- | --------- | ------ | -------- | ------- |
+| **!**       | 0.45      | 0.17   | 0.24     | 424     |
+| **!+Upper** | 0.43      | 0.34   | 0.38     | 98      |
+| **'**       | 0.60      | 0.27   | 0.37     | 11      |
+| **,**       | 0.59      | 0.51   | 0.55     | 1522    |
+| **,+Upper** | 0.52      | 0.50   | 0.51     | 239     |
+| **-**       | 0.00      | 0.00   | 0.00     | 18      |
+| **.**       | 0.69      | 0.84   | 0.75     | 2488    |
+| **.+Upper** | 0.65      | 0.52   | 0.57     | 274     |
+| **:**       | 0.52      | 0.31   | 0.39     | 39      |
+| **:+Upper** | 0.36      | 0.62   | 0.45     | 16      |
+| **;**       | 0.00      | 0.00   | 0.00     | 17      |
+| **?**       | 0.54      | 0.48   | 0.51     | 46      |
+| **?+Upper** | 0.40      | 0.50   | 0.44     | 4       |
+| **none**    | 0.96      | 0.96   | 0.96     | 35352   |
+| **Upper**   | 0.84      | 0.82   | 0.83     | 5442    |
+
+-----------------------------------------------
+## ☕ Contact
+Contact [Daulet Nurmanbetov](mailto:daulet.nurmanbetov@gmail.com) for questions, feedback, and/or requests for similar models.
+
+-----------------------------------------------
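Note: the README's usage example above drives the checkpoint through `rpunct`. As a minimal alternative sketch (not part of this commit), the files added here can also be loaded directly with `transformers` as a token-classification model; the local path below is a placeholder for a checkout of this repository, and the raw predictions come back as the generic `LABEL_0` to `LABEL_14` ids from config.json, which `rpunct` normally maps to punctuation and casing decisions.

```python
# Minimal sketch, assuming a local checkout of this repository; this is not the rpunct API.
from transformers import AutoModelForTokenClassification, AutoTokenizer, pipeline

model_path = "."  # placeholder: directory containing config.json, pytorch_model.bin, vocab.txt
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForTokenClassification.from_pretrained(model_path)

# "ner" is the transformers token-classification pipeline task.
nlp = pipeline("ner", model=model, tokenizer=tokenizer)

# Input is expected to be plain, lower-cased text, as described in the README.
print(nlp("in 2018 cornell researchers built a high-powered detector"))
# Each token comes back tagged with one of the 15 generic labels (LABEL_0 .. LABEL_14).
```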
config.json
ADDED
@@ -0,0 +1,58 @@
+{
+  "_name_or_path": "bert-base-uncased",
+  "architectures": [
+    "BertForTokenClassification"
+  ],
+  "attention_probs_dropout_prob": 0.1,
+  "gradient_checkpointing": false,
+  "hidden_act": "gelu",
+  "hidden_dropout_prob": 0.1,
+  "hidden_size": 768,
+  "id2label": {
+    "0": "LABEL_0",
+    "1": "LABEL_1",
+    "2": "LABEL_2",
+    "3": "LABEL_3",
+    "4": "LABEL_4",
+    "5": "LABEL_5",
+    "6": "LABEL_6",
+    "7": "LABEL_7",
+    "8": "LABEL_8",
+    "9": "LABEL_9",
+    "10": "LABEL_10",
+    "11": "LABEL_11",
+    "12": "LABEL_12",
+    "13": "LABEL_13",
+    "14": "LABEL_14"
+  },
+  "initializer_range": 0.02,
+  "intermediate_size": 3072,
+  "label2id": {
+    "LABEL_0": 0,
+    "LABEL_1": 1,
+    "LABEL_10": 10,
+    "LABEL_11": 11,
+    "LABEL_12": 12,
+    "LABEL_13": 13,
+    "LABEL_14": 14,
+    "LABEL_2": 2,
+    "LABEL_3": 3,
+    "LABEL_4": 4,
+    "LABEL_5": 5,
+    "LABEL_6": 6,
+    "LABEL_7": 7,
+    "LABEL_8": 8,
+    "LABEL_9": 9
+  },
+  "layer_norm_eps": 1e-12,
+  "max_position_embeddings": 512,
+  "model_type": "bert",
+  "num_attention_heads": 12,
+  "num_hidden_layers": 12,
+  "pad_token_id": 0,
+  "position_embedding_type": "absolute",
+  "transformers_version": "4.6.0",
+  "type_vocab_size": 2,
+  "use_cache": true,
+  "vocab_size": 30522
+}
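The `id2label`/`label2id` maps above use the generic `LABEL_i` names that `transformers` writes by default. One hypothetical post-processing step, if you want the checkpoint to be self-describing, is to copy the human-readable tags from `labels_list` in model_args.json (added below); this assumes the label ids follow the order of that list, which the commit itself does not state.

```python
import json

# Hypothetical helper: rewrite the generic LABEL_i entries in config.json with the tags
# from model_args.json. Assumes label ids were assigned by position in labels_list.
with open("model_args.json") as f:
    labels_list = json.load(f)["labels_list"]  # e.g. ["OU", "OO", ".O", ...]

with open("config.json") as f:
    config = json.load(f)

config["id2label"] = {str(i): tag for i, tag in enumerate(labels_list)}
config["label2id"] = {tag: i for i, tag in enumerate(labels_list)}

with open("config.json", "w") as f:
    json.dump(config, f, indent=2)
```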
model_args.json
ADDED
@@ -0,0 +1 @@
+{"adafactor_beta1": null, "adafactor_clip_threshold": 1.0, "adafactor_decay_rate": -0.8, "adafactor_eps": [1e-30, 0.001], "adafactor_relative_step": true, "adafactor_scale_parameter": true, "adafactor_warmup_init": true, "adam_epsilon": 1e-08, "best_model_dir": "outputs/best_model", "cache_dir": "cache_dir/", "config": {}, "cosine_schedule_num_cycles": 0.5, "custom_layer_parameters": [], "custom_parameter_groups": [], "dataloader_num_workers": 0, "do_lower_case": false, "dynamic_quantize": false, "early_stopping_consider_epochs": false, "early_stopping_delta": 0, "early_stopping_metric": "eval_loss", "early_stopping_metric_minimize": true, "early_stopping_patience": 3, "encoding": null, "eval_batch_size": 8, "evaluate_during_training": false, "evaluate_during_training_silent": true, "evaluate_during_training_steps": 2000, "evaluate_during_training_verbose": false, "evaluate_each_epoch": true, "fp16": true, "gradient_accumulation_steps": 1, "learning_rate": 4e-05, "local_rank": -1, "logging_steps": 50, "manual_seed": null, "max_grad_norm": 1.0, "max_seq_length": 512, "model_name": "bert-base-uncased", "model_type": "bert", "multiprocessing_chunksize": -1, "n_gpu": 1, "no_cache": false, "no_save": false, "not_saved_args": [], "num_train_epochs": 3, "optimizer": "AdamW", "output_dir": "outputs/", "overwrite_output_dir": true, "polynomial_decay_schedule_lr_end": 1e-07, "polynomial_decay_schedule_power": 1.0, "process_count": 2, "quantized_model": false, "reprocess_input_data": true, "save_best_model": true, "save_eval_checkpoints": true, "save_model_every_epoch": true, "save_optimizer_and_scheduler": true, "save_steps": 2000, "scheduler": "linear_schedule_with_warmup", "silent": false, "skip_special_tokens": true, "tensorboard_dir": null, "thread_count": null, "tokenizer_name": null, "tokenizer_type": null, "train_batch_size": 8, "train_custom_parameters_only": false, "use_cached_eval_features": false, "use_early_stopping": false, "use_hf_datasets": false, "use_multiprocessing": true, "use_multiprocessing_for_evaluation": true, "wandb_kwargs": {}, "wandb_project": null, "warmup_ratio": 0.06, "warmup_steps": 1520, "weight_decay": 0.0, "model_class": "NERModel", "classification_report": false, "labels_list": ["OU", "OO", ".O", "!O", ",O", ".U", "!U", ",U", ":O", ";O", ":U", "'O", "-O", "?O", "?U"], "lazy_loading": true, "lazy_loading_start_line": 0, "onnx": false, "special_tokens_list": []}
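model_args.json records a simpletransformers `NERModel` run (3 epochs, learning rate 4e-05, max sequence length 512, batch size 8, fp16, linear schedule with warmup). A rough sketch of that setup is below; the training file path is a placeholder and dataset preparation is omitted.

```python
# Sketch of the fine-tuning setup implied by model_args.json ("model_class": "NERModel"),
# using the simpletransformers library. The training file path is a placeholder.
from simpletransformers.ner import NERModel, NERArgs

labels = ["OU", "OO", ".O", "!O", ",O", ".U", "!U", ",U",
          ":O", ";O", ":U", "'O", "-O", "?O", "?U"]  # labels_list from model_args.json

args = NERArgs()
args.num_train_epochs = 3
args.learning_rate = 4e-05
args.max_seq_length = 512
args.train_batch_size = 8
args.fp16 = True
args.lazy_loading = True  # as in model_args.json; expects a token/label file on disk

model = NERModel("bert", "bert-base-uncased", labels=labels, args=args)
model.train_model("train.txt")  # placeholder path: one "token label" pair per line
```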
optimizer.pt
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:baa29e1cb9109755c6f437118966fbdeb80728f7892b2fa3560d95bb39fdc333
+size 871350381
pytorch_model.bin
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:fb8efcdafa21bf982d03fd1aa86f95227353e7ea624ca08afe2cac7d726a8cb2
+size 435701303
scheduler.pt
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:2d97f7651f224325f0d4b9f605d2296d6e56a594684edf6f268483074481ebdd
+size 623
special_tokens_map.json
ADDED
@@ -0,0 +1 @@
+{"unk_token": "[UNK]", "sep_token": "[SEP]", "pad_token": "[PAD]", "cls_token": "[CLS]", "mask_token": "[MASK]"}
tokenizer_config.json
ADDED
@@ -0,0 +1 @@
+{"do_lower_case": false, "do_basic_tokenize": true, "never_split": null, "unk_token": "[UNK]", "sep_token": "[SEP]", "pad_token": "[PAD]", "cls_token": "[CLS]", "mask_token": "[MASK]", "tokenize_chinese_chars": true, "strip_accents": null, "model_max_length": 512, "special_tokens_map_file": null, "tokenizer_file": "/home/ubuntu/.cache/huggingface/transformers/534479488c54aeaf9c3406f647aa2ec13648c06771ffe269edabebd4c412da1d.7f2721073f19841be16f41b0a70b600ca6b880c8f3df6f3535cbc704371bdfa4", "name_or_path": "bert-base-uncased"}
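Together with special_tokens_map.json, this is a standard bert-base-uncased tokenizer configuration, except that `do_lower_case` is false, so callers are expected to pass text that is already lower-cased (matching the README's description of the input). A quick check, assuming a local checkout of this repository:

```python
# Minimal sketch: inspect the tokenizer from a local checkout (path is a placeholder).
from transformers import BertTokenizer

tok = BertTokenizer.from_pretrained(".")  # placeholder path to this repository
print(tok.basic_tokenizer.do_lower_case)  # False: input should already be lower-cased
print(tok.model_max_length)               # 512
print(tok.tokenize("in 2018 cornell researchers built a high-powered detector"))
```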
training_args.bin
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:adf90f507e762161db2c536304e314215145aaed6db3be60a9e93ba1e68c7073
+size 2991
vocab.txt
ADDED
The diff for this file is too large to render.