---
license: apache-2.0
library_name: generic
tags:
  - text2text-generation
  - punctuation
  - sentence-boundary-detection
  - truecasing
  - true-casing
language:
  - af
  - am
  - ar
  - bg
  - bn
  - de
  - el
  - en
  - es
  - et
  - fa
  - fi
  - fr
  - gu
  - hi
  - hr
  - hu
  - id
  - is
  - it
  - ja
  - kk
  - kn
  - ko
  - ky
  - lt
  - lv
  - mk
  - ml
  - mr
  - nl
  - or
  - pa
  - pl
  - ps
  - pt
  - ro
  - ru
  - rw
  - so
  - sr
  - sw
  - ta
  - te
  - tr
  - uk
  - zh
---

# Model Overview

This is a fine-tuned xlm-roberta model that restores punctuation, true-cases (capitalizes), and detects sentence boundaries (full stops) in 47 languages.

## Tokenizer

Instead of the hacky wrapper used by FairSeq, and strangely ported (but not fixed) by HuggingFace, the `xlm-roberta` SentencePiece model was adjusted to encode the text correctly on its own. Per HF's comments:

```python
# Original fairseq vocab and spm vocab must be "aligned":
# Vocab    |    0    |    1    |   2    |    3    |  4  |  5  |  6  |   7   |   8   |  9
# -------- | ------- | ------- | ------ | ------- | --- | --- | --- | ----- | ----- | ----
# fairseq  | '<s>'   | '<pad>' | '</s>' | '<unk>' | ',' | '.' | '▁' | 's'   | '▁de' | '-'
# spm      | '<unk>' | '<s>'   | '</s>' | ','     | '.' | '▁' | 's' | '▁de' | '-'   | '▁a'
```
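The alignment in that comment can be reproduced with plain lists (a toy vocabulary, not the real 250k-piece one): every ordinary piece in the fairseq vocab sits one index above its SentencePiece id, which is exactly the offset the wrapper papers over.

```python
# Toy spm vocab, in the order shown in the comment above.
spm = ["<unk>", "<s>", "</s>", ",", ".", "▁", "s", "▁de", "-", "▁a"]

# The fairseq vocab pins four specials to ids 0-3, keeps the ordinary
# pieces (dropping spm's own <unk>, <s>, </s>), and appends <mask>.
fairseq = ["<s>", "<pad>", "</s>", "<unk>"] + spm[3:] + ["<mask>"]

# Ordinary pieces are shifted by exactly one id relative to spm.
assert fairseq[4] == spm[3] == ","
assert all(fairseq[i] == spm[i - 1] for i in range(4, len(spm) + 1))
```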

The SP model was un-hacked with the following snippet (SentencePiece experts, let me know if there is a problem here):

```python
from sentencepiece.sentencepiece_model_pb2 import ModelProto

# Load the original xlm-roberta SentencePiece model.
m = ModelProto()
m.ParseFromString(open("/path/to/xlmroberta/sentencepiece.bpe.model", "rb").read())

# Rebuild the piece list so that piece ids match the fairseq vocab:
# <s>, <pad>, </s>, <unk> first, then the ordinary pieces (dropping the
# original <unk>, <s>, </s> at ids 0-2), with <mask> appended at the end.
pieces = list(m.pieces)
pieces = (
    [
        ModelProto.SentencePiece(piece="<s>", type=ModelProto.SentencePiece.Type.CONTROL),
        ModelProto.SentencePiece(piece="<pad>", type=ModelProto.SentencePiece.Type.CONTROL),
        ModelProto.SentencePiece(piece="</s>", type=ModelProto.SentencePiece.Type.CONTROL),
        ModelProto.SentencePiece(piece="<unk>", type=ModelProto.SentencePiece.Type.UNKNOWN),
    ]
    + pieces[3:]
    + [ModelProto.SentencePiece(piece="<mask>", type=ModelProto.SentencePiece.Type.USER_DEFINED)]
)
del m.pieces[:]
m.pieces.extend(pieces)

# Write out the re-aligned model.
with open("/path/to/new/sp.model", "wb") as f:
    f.write(m.SerializeToString())
```

## Post-Punctuation Tokens

This model predicts the following set of punctuation tokens after each subtoken:

| Token | Description | Relevant Languages |
| --- | --- | --- |
| `<NULL>` | No punctuation | All |
| `<ACRONYM>` | Every character in this subword is followed by a period | Primarily English, some European |
| `.` | Latin full stop | Many |
| `,` | Latin comma | Many |
| `?` | Latin question mark | Many |
| `？` | Full-width question mark | Chinese, Japanese |
| `，` | Full-width comma | Chinese, Japanese |
| `。` | Full-width full stop | Chinese, Japanese |
| `、` | Ideographic comma | Chinese, Japanese |
| `・` | Middle dot | Japanese |
| `।` | Danda | Hindi, Bengali, Oriya |
| `؟` | Arabic question mark | Arabic |
| `;` | Greek question mark | Greek |
| `።` | Ethiopic full stop | Amharic |
| `፣` | Ethiopic comma | Amharic |
| `፧` | Ethiopic question mark | Amharic |

## Pre-Punctuation Tokens

This model predicts the following set of punctuation tokens before each subword:

| Token | Description | Relevant Languages |
| --- | --- | --- |
| `<NULL>` | No punctuation | All |
| `¿` | Inverted question mark | Spanish |
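Reassembling text from these predictions is straightforward. The sketch below is hypothetical post-processing (not the model's actual inference code, and ignoring true-casing): given words plus one predicted pre- and post-token each, it inserts the punctuation, expanding `<ACRONYM>` into a period after every character.

```python
def apply_punctuation(words, pre_tokens, post_tokens):
    """Attach one predicted pre- and post-punctuation token to each word."""
    out = []
    for word, pre, post in zip(words, pre_tokens, post_tokens):
        if post == "<ACRONYM>":
            # e.g. "usa" -> "u.s.a."
            word = "".join(c + "." for c in word)
            post = "<NULL>"
        prefix = "" if pre == "<NULL>" else pre
        suffix = "" if post == "<NULL>" else post
        out.append(prefix + word + suffix)
    return " ".join(out)

print(apply_punctuation(
    ["hola", "mundo", "cómo", "estás"],
    ["<NULL>", "<NULL>", "¿", "<NULL>"],
    [",", "<NULL>", "<NULL>", "?"],
))
# hola, mundo ¿cómo estás?
```

In reality the model predicts per subtoken, so predictions are first aggregated to word level before a step like this.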

## Training Details

This model was trained in the NeMo framework.

### Training Data

This model was trained with News Crawl data from WMT.

1M lines of text were used for each language, except for a few low-resource languages, which may have used less.

Languages were chosen based on whether the News Crawl corpus contained enough reliable-quality data as judged by the author.

## Limitations

This model was trained on news data, and may not perform well on conversational or informal data.

Further, this model is unlikely to be of production quality. It was trained with "only" 1M lines per language, and the dev sets may have been noisy due to the nature of web-scraped news data.

## Evaluation

In these metrics, keep in mind that

  1. The data is noisy

  2. Sentence boundaries and true-casing are conditioned on predicted punctuation, which is the most difficult task and is sometimes incorrect. When conditioning on reference punctuation, true-casing and sentence boundary detection are practically 100% for most languages.

  3. Punctuation can be subjective. E.g.,

    Hola mundo, ¿cómo estás?

    or

    Hola mundo. ¿Cómo estás?

    When the sentences are longer and more practical, these ambiguities abound and affect all three metrics.

### Test Data and Example Generation

Each test example was generated using the following procedure:

  1. Concatenate 10 random sentences
  2. Lower-case the concatenated sentence
  3. Remove all punctuation
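The three steps above can be sketched as follows; this is a plausible reconstruction, not the actual evaluation script (the `sentences` input stands in for held-out News Crawl lines, and Unicode category `P*` stands in for whatever punctuation inventory was actually stripped).

```python
import random
import unicodedata

def make_example(sentences, n=10, seed=0):
    """Concatenate n random sentences, lower-case, and strip punctuation."""
    rng = random.Random(seed)
    picked = rng.sample(sentences, n)
    text = " ".join(picked).lower()
    # Remove every Unicode punctuation character (categories Pc, Pd, Pe, ...).
    return "".join(c for c in text if not unicodedata.category(c).startswith("P"))

print(make_example(["Hola mundo.", "¿Cómo estás?", "Muy bien, gracias."], n=3))
```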

The data is a held-out portion of News Crawl, which has been deduplicated. 3,000 lines of data per language were used, generating 3,000 unique examples of 10 sentences each. The last 4 sentences of each example were randomly sampled from the 3,000 and may be duplicated.

Examples longer than the model's maximum length were truncated. The number of affected sentences can be estimated from the "full stop" support: with 3,000 examples and 10 sentences per example, we expect 30,000 full stop targets total.

### Selected Language Evaluation Reports