gfigueroa committed on
Commit
f818329
1 Parent(s): f5f4961

Initial commit

README.md CHANGED
@@ -1,3 +1,95 @@
- ---
- license: apache-2.0
- ---
+ ---
+ language:
+ - en
+ tags:
+ - punctuation
+ license: mit
+ datasets:
+ - yelp_polarity
+ metrics:
+ - f1
+ ---
+ # ✨ bert-restore-punctuation
+ [![forthebadge](https://forthebadge.com/images/badges/gluten-free.svg)]()
+
+ This is a bert-base-uncased model fine-tuned for punctuation restoration on [Yelp Reviews](https://www.tensorflow.org/datasets/catalog/yelp_polarity_reviews).
+
+ The model predicts the punctuation and upper-casing of plain, lower-cased text. A typical use case is restoring ASR output, or any other text that has lost its punctuation.
+
+ This model is intended for direct use as a punctuation restoration model for general English. Alternatively, you can fine-tune it further on domain-specific texts for punctuation restoration tasks (see the loading sketch below).
+
+ The model restores the following punctuation marks: **[! ? . , - : ; ']**
+
+ The model also restores the upper-casing of words.
+
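+ For further fine-tuning or for inspecting the checkpoint, the model can also be loaded directly with Hugging Face `transformers` as a token-classification model. The snippet below is a minimal sketch rather than a workflow documented by this card: it assumes the files in this repository have been downloaded to a local folder (shown as `./bert-restore-punctuation`, a hypothetical path) and that `LABEL_0` … `LABEL_14` in `config.json` follow the order of `labels_list` in `model_args.json`.
+
+ ```python
+ # Hedged sketch: load this checkpoint for token classification / further fine-tuning.
+ # Assumptions (not confirmed by this card): local path and label ordering.
+ from transformers import AutoTokenizer, AutoModelForTokenClassification
+
+ model_dir = "./bert-restore-punctuation"  # hypothetical local copy of this repo
+
+ # Label order taken from labels_list in model_args.json; each entry appears to pair
+ # a punctuation decision (O = none) with a casing decision (U = capitalize, O = keep).
+ labels = ["OU", "OO", ".O", "!O", ",O", ".U", "!U", ",U",
+           ":O", ";O", ":U", "'O", "-O", "?O", "?U"]
+ id2label = {i: lab for i, lab in enumerate(labels)}
+ label2id = {lab: i for i, lab in enumerate(labels)}
+
+ tokenizer = AutoTokenizer.from_pretrained(model_dir)
+ model = AutoModelForTokenClassification.from_pretrained(
+     model_dir, id2label=id2label, label2id=label2id
+ )
+ ```
+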
+ -----------------------------------------------
+ ## 🚋 Usage
+ **Below is a quick way to get up and running with the model.**
+ 1. First, install the package.
+ ```bash
+ pip install rpunct
+ ```
+ 2. Sample Python code.
+ ```python
+ from rpunct import RestorePuncts
+ # The default language is 'english'
+ rpunct = RestorePuncts()
+ rpunct.punctuate("""in 2018 cornell researchers built a high-powered detector that in combination with an algorithm-driven process called ptychography set a world record
+ by tripling the resolution of a state-of-the-art electron microscope as successful as it was that approach had a weakness it only worked with ultrathin samples that were
+ a few atoms thick anything thicker would cause the electrons to scatter in ways that could not be disentangled now a team again led by david muller the samuel b eckert
+ professor of engineering has bested its own record by a factor of two with an electron microscope pixel array detector empad that incorporates even more sophisticated
+ 3d reconstruction algorithms the resolution is so fine-tuned the only blurring that remains is the thermal jiggling of the atoms themselves""")
+ # Outputs the following:
+ # In 2018, Cornell researchers built a high-powered detector that, in combination with an algorithm-driven process called Ptychography, set a world record by tripling the
+ # resolution of a state-of-the-art electron microscope. As successful as it was, that approach had a weakness. It only worked with ultrathin samples that were a few atoms
+ # thick. Anything thicker would cause the electrons to scatter in ways that could not be disentangled. Now, a team again led by David Muller, the Samuel B.
+ # Eckert Professor of Engineering, has bested its own record by a factor of two with an Electron microscope pixel array detector empad that incorporates even more
+ # sophisticated 3d reconstruction algorithms. The resolution is so fine-tuned the only blurring that remains is the thermal jiggling of the atoms themselves.
+ ```
+
+ **This model works on arbitrarily large English text and uses a GPU if available.**
+
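+ For example, a whole ASR transcript can be punctuated in one call. The small usage sketch below only relies on the `RestorePuncts.punctuate` call shown above; `transcript.txt` is a hypothetical lower-cased, punctuation-free transcript file.
+
+ ```python
+ # Hedged sketch: restore punctuation and casing for an entire ASR transcript.
+ from rpunct import RestorePuncts
+
+ rpunct = RestorePuncts()
+
+ with open("transcript.txt", encoding="utf-8") as f:  # hypothetical input file
+     raw_text = f.read()
+
+ restored = rpunct.punctuate(raw_text)
+ print(restored)
+ ```
+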
+ -----------------------------------------------
+ ## 📡 Training data
+
+ Here is the number of product reviews we used for fine-tuning the model:
+
+ | Language | Number of text samples |
+ | -------- | ---------------------- |
+ | English  | 560,000                |
+
+ We found the best convergence around _**3 epochs**_, which is what is presented here and available for download.
+
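+ The exact preprocessing pipeline is not documented in this card. Below is a minimal sketch of how lower-cased, punctuation-free training inputs could be derived from the `yelp_polarity` training split with the Hugging Face `datasets` library; it illustrates the set-up described above and is not the author's actual script.
+
+ ```python
+ # Hedged sketch: build (stripped input, original text) pairs from yelp_polarity.
+ import re
+ from datasets import load_dataset
+
+ ds = load_dataset("yelp_polarity", split="train")  # 560,000 reviews
+
+ def strip_punct_and_case(example):
+     # Remove the punctuation the model is trained to restore, then lower-case.
+     stripped = re.sub(r"[!?.,\-:;']", "", example["text"]).lower()
+     return {"input_text": " ".join(stripped.split()), "target_text": example["text"]}
+
+ pairs = ds.map(strip_punct_and_case)
+ print(pairs[0]["input_text"][:120])
+ ```
+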
+ -----------------------------------------------
+ ## 🎯 Accuracy
+ The fine-tuned model obtained the following accuracy on 45,990 held-out text samples:
+
+ | Accuracy | Overall F1 | Eval Support |
+ | -------- | ---------- | ------------ |
+ | 91%      | 90%        | 45,990       |
+
+ Below is a breakdown of the performance of the model by each label:
+
+ | label       | precision | recall | f1-score | support |
+ | ----------- | --------- | ------ | -------- | ------- |
+ | **!**       | 0.45      | 0.17   | 0.24     | 424     |
+ | **!+Upper** | 0.43      | 0.34   | 0.38     | 98      |
+ | **'**       | 0.60      | 0.27   | 0.37     | 11      |
+ | **,**       | 0.59      | 0.51   | 0.55     | 1522    |
+ | **,+Upper** | 0.52      | 0.50   | 0.51     | 239     |
+ | **-**       | 0.00      | 0.00   | 0.00     | 18      |
+ | **.**       | 0.69      | 0.84   | 0.75     | 2488    |
+ | **.+Upper** | 0.65      | 0.52   | 0.57     | 274     |
+ | **:**       | 0.52      | 0.31   | 0.39     | 39      |
+ | **:+Upper** | 0.36      | 0.62   | 0.45     | 16      |
+ | **;**       | 0.00      | 0.00   | 0.00     | 17      |
+ | **?**       | 0.54      | 0.48   | 0.51     | 46      |
+ | **?+Upper** | 0.40      | 0.50   | 0.44     | 4       |
+ | **none**    | 0.96      | 0.96   | 0.96     | 35352   |
+ | **Upper**   | 0.84      | 0.82   | 0.83     | 5442    |
+
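+ A per-label report in this format can be produced with scikit-learn's `classification_report`, assuming gold and predicted token labels have been flattened into parallel lists. The snippet below is a sketch with toy data, not the evaluation code behind the numbers above.
+
+ ```python
+ # Hedged sketch: per-label precision / recall / F1 in the spirit of the table above.
+ # y_true / y_pred would hold one label per token, e.g. "none", "Upper", ".", ".+Upper".
+ from sklearn.metrics import classification_report
+
+ y_true = ["none", "Upper", ".", "none", ",+Upper", "none"]   # toy gold labels
+ y_pred = ["none", "Upper", ",", "none", ",+Upper", "Upper"]  # toy predictions
+
+ print(classification_report(y_true, y_pred, digits=2, zero_division=0))
+ ```
+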
+ -----------------------------------------------
+ ## ☕ Contact
+ Contact [Daulet Nurmanbetov](mailto:daulet.nurmanbetov@gmail.com) for questions, feedback and/or requests for similar models.
+
+ -----------------------------------------------
config.json ADDED
@@ -0,0 +1,58 @@
+ {
+   "_name_or_path": "bert-base-uncased",
+   "architectures": [
+     "BertForTokenClassification"
+   ],
+   "attention_probs_dropout_prob": 0.1,
+   "gradient_checkpointing": false,
+   "hidden_act": "gelu",
+   "hidden_dropout_prob": 0.1,
+   "hidden_size": 768,
+   "id2label": {
+     "0": "LABEL_0",
+     "1": "LABEL_1",
+     "2": "LABEL_2",
+     "3": "LABEL_3",
+     "4": "LABEL_4",
+     "5": "LABEL_5",
+     "6": "LABEL_6",
+     "7": "LABEL_7",
+     "8": "LABEL_8",
+     "9": "LABEL_9",
+     "10": "LABEL_10",
+     "11": "LABEL_11",
+     "12": "LABEL_12",
+     "13": "LABEL_13",
+     "14": "LABEL_14"
+   },
+   "initializer_range": 0.02,
+   "intermediate_size": 3072,
+   "label2id": {
+     "LABEL_0": 0,
+     "LABEL_1": 1,
+     "LABEL_10": 10,
+     "LABEL_11": 11,
+     "LABEL_12": 12,
+     "LABEL_13": 13,
+     "LABEL_14": 14,
+     "LABEL_2": 2,
+     "LABEL_3": 3,
+     "LABEL_4": 4,
+     "LABEL_5": 5,
+     "LABEL_6": 6,
+     "LABEL_7": 7,
+     "LABEL_8": 8,
+     "LABEL_9": 9
+   },
+   "layer_norm_eps": 1e-12,
+   "max_position_embeddings": 512,
+   "model_type": "bert",
+   "num_attention_heads": 12,
+   "num_hidden_layers": 12,
+   "pad_token_id": 0,
+   "position_embedding_type": "absolute",
+   "transformers_version": "4.6.0",
+   "type_vocab_size": 2,
+   "use_cache": true,
+   "vocab_size": 30522
+ }
model_args.json ADDED
@@ -0,0 +1 @@
+ {"adafactor_beta1": null, "adafactor_clip_threshold": 1.0, "adafactor_decay_rate": -0.8, "adafactor_eps": [1e-30, 0.001], "adafactor_relative_step": true, "adafactor_scale_parameter": true, "adafactor_warmup_init": true, "adam_epsilon": 1e-08, "best_model_dir": "outputs/best_model", "cache_dir": "cache_dir/", "config": {}, "cosine_schedule_num_cycles": 0.5, "custom_layer_parameters": [], "custom_parameter_groups": [], "dataloader_num_workers": 0, "do_lower_case": false, "dynamic_quantize": false, "early_stopping_consider_epochs": false, "early_stopping_delta": 0, "early_stopping_metric": "eval_loss", "early_stopping_metric_minimize": true, "early_stopping_patience": 3, "encoding": null, "eval_batch_size": 8, "evaluate_during_training": false, "evaluate_during_training_silent": true, "evaluate_during_training_steps": 2000, "evaluate_during_training_verbose": false, "evaluate_each_epoch": true, "fp16": true, "gradient_accumulation_steps": 1, "learning_rate": 4e-05, "local_rank": -1, "logging_steps": 50, "manual_seed": null, "max_grad_norm": 1.0, "max_seq_length": 512, "model_name": "bert-base-uncased", "model_type": "bert", "multiprocessing_chunksize": -1, "n_gpu": 1, "no_cache": false, "no_save": false, "not_saved_args": [], "num_train_epochs": 3, "optimizer": "AdamW", "output_dir": "outputs/", "overwrite_output_dir": true, "polynomial_decay_schedule_lr_end": 1e-07, "polynomial_decay_schedule_power": 1.0, "process_count": 2, "quantized_model": false, "reprocess_input_data": true, "save_best_model": true, "save_eval_checkpoints": true, "save_model_every_epoch": true, "save_optimizer_and_scheduler": true, "save_steps": 2000, "scheduler": "linear_schedule_with_warmup", "silent": false, "skip_special_tokens": true, "tensorboard_dir": null, "thread_count": null, "tokenizer_name": null, "tokenizer_type": null, "train_batch_size": 8, "train_custom_parameters_only": false, "use_cached_eval_features": false, "use_early_stopping": false, "use_hf_datasets": false, "use_multiprocessing": true, "use_multiprocessing_for_evaluation": true, "wandb_kwargs": {}, "wandb_project": null, "warmup_ratio": 0.06, "warmup_steps": 1520, "weight_decay": 0.0, "model_class": "NERModel", "classification_report": false, "labels_list": ["OU", "OO", ".O", "!O", ",O", ".U", "!U", ",U", ":O", ";O", ":U", "'O", "-O", "?O", "?U"], "lazy_loading": true, "lazy_loading_start_line": 0, "onnx": false, "special_tokens_list": []}
optimizer.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:baa29e1cb9109755c6f437118966fbdeb80728f7892b2fa3560d95bb39fdc333
+ size 871350381
pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:fb8efcdafa21bf982d03fd1aa86f95227353e7ea624ca08afe2cac7d726a8cb2
+ size 435701303
scheduler.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:2d97f7651f224325f0d4b9f605d2296d6e56a594684edf6f268483074481ebdd
+ size 623
special_tokens_map.json ADDED
@@ -0,0 +1 @@
+ {"unk_token": "[UNK]", "sep_token": "[SEP]", "pad_token": "[PAD]", "cls_token": "[CLS]", "mask_token": "[MASK]"}
tokenizer_config.json ADDED
@@ -0,0 +1 @@
+ {"do_lower_case": false, "do_basic_tokenize": true, "never_split": null, "unk_token": "[UNK]", "sep_token": "[SEP]", "pad_token": "[PAD]", "cls_token": "[CLS]", "mask_token": "[MASK]", "tokenize_chinese_chars": true, "strip_accents": null, "model_max_length": 512, "special_tokens_map_file": null, "tokenizer_file": "/home/ubuntu/.cache/huggingface/transformers/534479488c54aeaf9c3406f647aa2ec13648c06771ffe269edabebd4c412da1d.7f2721073f19841be16f41b0a70b600ca6b880c8f3df6f3535cbc704371bdfa4", "name_or_path": "bert-base-uncased"}
training_args.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:adf90f507e762161db2c536304e314215145aaed6db3be60a9e93ba1e68c7073
+ size 2991
vocab.txt ADDED
The diff for this file is too large to render.