---
license: apache-2.0
library_name: generic
tags:
  - text2text-generation
  - punctuation
  - sentence-boundary-detection
  - truecasing
  - true-casing
language:
  - af
  - am
  - ar
  - bg
  - bn
  - de
  - el
  - en
  - es
  - et
  - fa
  - fi
  - fr
  - gu
  - hi
  - hr
  - hu
  - id
  - is
  - it
  - ja
  - kk
  - kn
  - ko
  - ky
  - lt
  - lv
  - mk
  - ml
  - mr
  - nl
  - or
  - pa
  - pl
  - ps
  - pt
  - ro
  - ru
  - rw
  - so
  - sr
  - sw
  - ta
  - te
  - tr
  - uk
  - zh
---

# Model Overview

This is a fine-tuned xlm-roberta model that restores punctuation, true-cases (capitalizes), and detects sentence boundaries (full stops) in 47 languages.

## Tokenizer

Instead of the hacky wrapper used by FairSeq, and strangely ported (but not fixed) by HuggingFace, the `xlm-roberta` SentencePiece model was adjusted to encode the text correctly on its own. Per HF's comments:

```python
# Original fairseq vocab and spm vocab must be "aligned":
# Vocab    |    0    |    1    |   2    |    3    |  4  |  5  |  6  |   7   |   8   |  9
# -------- | ------- | ------- | ------ | ------- | --- | --- | --- | ----- | ----- | ----
# fairseq  | '<s>'   | '<pad>' | '</s>' | '<unk>' | ',' | '.' | '▁' | 's'   | '▁de' | '-'
# spm      | '<unk>' | '<s>'   | '</s>' | ','     | '.' | '▁' | 's' | '▁de' | '-'   | '▁a'
```
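The alignment in that comment can be reproduced with plain lists (a toy vocabulary, not the real 250k-piece one): every ordinary piece in the fairseq vocab sits one index above its SentencePiece id, which is exactly the offset the wrapper papers over.

```python
# Toy spm vocab, in the order shown in the comment above.
spm = ["<unk>", "<s>", "</s>", ",", ".", "▁", "s", "▁de", "-", "▁a"]

# The fairseq vocab pins four specials to ids 0-3, keeps the ordinary
# pieces (dropping spm's own <unk>, <s>, </s>), and appends <mask>.
fairseq = ["<s>", "<pad>", "</s>", "<unk>"] + spm[3:] + ["<mask>"]

# Ordinary pieces are shifted by exactly one id relative to spm.
assert fairseq[4] == spm[3] == ","
assert all(fairseq[i] == spm[i - 1] for i in range(4, len(spm) + 1))
```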

The SP model was un-hacked with the following snippet (SentencePiece experts, let me know if there is a problem here):

```python
from sentencepiece.sentencepiece_model_pb2 import ModelProto

# Load the original xlm-roberta SentencePiece model.
m = ModelProto()
m.ParseFromString(open("/path/to/xlmroberta/sentencepiece.bpe.model", "rb").read())

# Rebuild the piece list so that piece ids match the fairseq vocab:
# <s>, <pad>, </s>, <unk> first, then the ordinary pieces (dropping the
# original <unk>, <s>, </s> at ids 0-2), with <mask> appended at the end.
pieces = list(m.pieces)
pieces = (
    [
        ModelProto.SentencePiece(piece="<s>", type=ModelProto.SentencePiece.Type.CONTROL),
        ModelProto.SentencePiece(piece="<pad>", type=ModelProto.SentencePiece.Type.CONTROL),
        ModelProto.SentencePiece(piece="</s>", type=ModelProto.SentencePiece.Type.CONTROL),
        ModelProto.SentencePiece(piece="<unk>", type=ModelProto.SentencePiece.Type.UNKNOWN),
    ]
    + pieces[3:]
    + [ModelProto.SentencePiece(piece="<mask>", type=ModelProto.SentencePiece.Type.USER_DEFINED)]
)
del m.pieces[:]
m.pieces.extend(pieces)

# Write out the re-aligned model.
with open("/path/to/new/sp.model", "wb") as f:
    f.write(m.SerializeToString())
```

## Post-Punctuation Tokens

This model predicts the following set of punctuation tokens after each subtoken:

| Token | Description | Relevant Languages |
| --- | --- | --- |
| `<NULL>` | No punctuation | All |
| `<ACRONYM>` | Every character in this subword is followed by a period | Primarily English, some European |
| `.` | Latin full stop | Many |
| `,` | Latin comma | Many |
| `?` | Latin question mark | Many |
| `？` | Full-width question mark | Chinese, Japanese |
| `，` | Full-width comma | Chinese, Japanese |
| `。` | Full-width full stop | Chinese, Japanese |
| `、` | Ideographic comma | Chinese, Japanese |
| `・` | Middle dot | Japanese |
| `।` | Danda | Hindi, Bengali, Oriya |
| `؟` | Arabic question mark | Arabic |
| `;` | Greek question mark | Greek |
| `።` | Ethiopic full stop | Amharic |
| `፣` | Ethiopic comma | Amharic |
| `፧` | Ethiopic question mark | Amharic |

## Pre-Punctuation Tokens

This model predicts the following set of punctuation tokens before each subword:

| Token | Description | Relevant Languages |
| --- | --- | --- |
| `<NULL>` | No punctuation | All |
| `¿` | Inverted question mark | Spanish |
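Reassembling text from these predictions is straightforward. The sketch below is hypothetical post-processing (not the model's actual inference code, and ignoring true-casing): given words plus one predicted pre- and post-token each, it inserts the punctuation, expanding `<ACRONYM>` into a period after every character.

```python
def apply_punctuation(words, pre_tokens, post_tokens):
    """Attach one predicted pre- and post-punctuation token to each word."""
    out = []
    for word, pre, post in zip(words, pre_tokens, post_tokens):
        if post == "<ACRONYM>":
            # e.g. "usa" -> "u.s.a."
            word = "".join(c + "." for c in word)
            post = "<NULL>"
        prefix = "" if pre == "<NULL>" else pre
        suffix = "" if post == "<NULL>" else post
        out.append(prefix + word + suffix)
    return " ".join(out)

print(apply_punctuation(
    ["hola", "mundo", "cómo", "estás"],
    ["<NULL>", "<NULL>", "¿", "<NULL>"],
    [",", "<NULL>", "<NULL>", "?"],
))
# hola, mundo ¿cómo estás?
```

In reality the model predicts per subtoken, so predictions are first aggregated to word level before a step like this.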

## Training Details

This model was trained in the NeMo framework.

### Training Data

This model was trained with News Crawl data from WMT.

1M lines of text were used for each language, except for a few low-resource languages, which may have used less.

Languages were chosen based on whether the News Crawl corpus contained enough reliable-quality data as judged by the author.

## Limitations

This model was trained on news data, and may not perform well on conversational or informal data.

Further, this model is unlikely to be of production quality. It was trained with "only" 1M lines per language, and the dev sets may have been noisy due to the nature of web-scraped news data.

## Evaluation

In these metrics, keep in mind that

  1. The data is noisy

  2. Sentence boundaries and true-casing are conditioned on predicted punctuation, which is the most difficult task and is sometimes incorrect. When conditioning on reference punctuation, true-casing and sentence boundary detection are practically 100% for most languages.

  3. Punctuation can be subjective. E.g.,

    Hola mundo, ¿cómo estás?

    or

    Hola mundo. ¿Cómo estás?

    When the sentences are longer and more practical, these ambiguities abound and affect all three metrics.

### Test Data and Example Generation

Each test example was generated using the following procedure:

  1. Concatenate 10 random sentences
  2. Lower-case the concatenated sentence
  3. Remove all punctuation
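The three steps above can be sketched as follows; this is a plausible reconstruction, not the actual evaluation script (the `sentences` input stands in for held-out News Crawl lines, and Unicode category `P*` stands in for whatever punctuation inventory was actually stripped).

```python
import random
import unicodedata

def make_example(sentences, n=10, seed=0):
    """Concatenate n random sentences, lower-case, and strip punctuation."""
    rng = random.Random(seed)
    picked = rng.sample(sentences, n)
    text = " ".join(picked).lower()
    # Remove every Unicode punctuation character (categories Pc, Pd, Pe, ...).
    return "".join(c for c in text if not unicodedata.category(c).startswith("P"))

print(make_example(["Hola mundo.", "¿Cómo estás?", "Muy bien, gracias."], n=3))
```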

The data is a held-out portion of News Crawl, which has been deduplicated. 3,000 lines of data per language were used, generating 3,000 unique examples of 10 sentences each. The last 4 sentences of each example were randomly sampled from the 3,000 and may be duplicated.

Examples longer than the model's maximum length were truncated. The number of affected sentences can be estimated from the "full stop" support: with 3,000 examples and 10 sentences per example, we expect 30,000 full stop targets total.

### Selected Language Evaluation Reports