
RuTaBERT (base model)

RuTaBERT is a model for Column Type Annotation. It is based on a pre-trained language model (BERT) fine-tuned on a corpus of Russian-language tables. The original repo can be found here.

Model description

RuTaBERT is a 12-layer multilingual BERT (bert-base-multilingual-cased) language model fine-tuned for Column Type Annotation (CTA).

We trained RuTaBERT on a labeled tabular dataset, Russian Web Tables (RWT). RWT was built from the Russian-language Wikipedia as of September 13, 2021 and contains 1.2 million tables (7.4 million columns). The corpus was labeled automatically using a set of 356 semantic types (classes, data properties, and object properties) taken from the general-purpose knowledge graph DBpedia and translated into Russian. Table preprocessing yielded a dataset of 1,441,349 labeled columns, in which only 170 of the 356 semantic types (labels) are used.

Intended Uses

An input table is a two-dimensional array of rows and columns. Each cell holds information that can be represented as text, numbers, dates, and so on. You can use raw vertical tables (e.g., tables in CSV format) as input data. A vertical table is organized in columns, each of which may have a header. Columns in such tables fall into two types:

  1. A named entity (categorical) column contains entity mentions from some domain (e.g., persons, organizations, events).
  2. A literal column contains values of simple data types (e.g., date, time, cardinal number).

Assumption 1. The first row of a source table is a header containing attribute (column) names.

Assumption 2. All cell values within a column of a source table share the same entity type or data type.

Thus, RuTaBERT predicts semantic types for each column in a vertical table.
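As a rough illustration of the input format described above, the following sketch splits a CSV table into columns under Assumption 1 (the first row is the header). It uses only the Python standard library and is not part of the RuTaBERT pipeline. The function names `read_columns` and `looks_literal` and the sample data are hypothetical. The numeric check is a simple heuristic for telling literal columns from named entity columns, not the method the model uses.

```python
import csv
import io


def read_columns(csv_text: str) -> dict[str, list[str]]:
    """Split a vertical CSV table into columns keyed by header name.

    Hypothetical helper: assumes the first row is the header (Assumption 1)
    and that header names are unique.
    """
    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows:
        return {}
    header, body = rows[0], rows[1:]
    columns: dict[str, list[str]] = {name: [] for name in header}
    for row in body:
        # zip stops at the shorter sequence, so ragged rows do not raise.
        for name, cell in zip(header, row):
            columns[name].append(cell.strip())
    return columns


def looks_literal(values: list[str]) -> bool:
    """Rough heuristic: True if every non-empty cell parses as a number."""
    cells = [v for v in values if v]
    if not cells:
        return False
    for cell in cells:
        try:
            float(cell.replace(",", "."))
        except ValueError:
            return False
    return True


# Illustrative sample table (made-up data): a name column and an age column.
sample = "Имя,Возраст\nАнна,34\nИван,27\n"
table = read_columns(sample)
for name, values in table.items():
    kind = "literal" if looks_literal(values) else "named entity"
    print(f"{name}: {values} -> {kind}")
```

Each resulting list of cell values corresponds to one column for which a semantic type would be predicted.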

How to Use

An example of using this model is given in the rutabert_pipeline folder:

  • data contains an example table in CSV format for testing;
  • dataset, model, sem_types, and pipeline contain the implementation of a custom pipeline for RuTaBERT;
  • inference_example contains an example of registering the pipeline and running inference with it.

Authors

Model size: 167M params (Safetensors, tensor type F32)