How do I load the model's tokenizer? What is the input format for each model?

by spiesg - opened

Hi, I want to test the original model on the dataset, but I don't know the expected input format. Here are my questions:

  1. What is the correspondence between entities and the ENTITY_ID tokens in add_tokens.json, and likewise between relations and the RELATION_ID tokens?
  2. Is the input for each model "[CLS][ENTITY_i][SEP][RELATION_i][SEP][MASK][SEP]" or "[CLS][ENTITY_i][RELATION_i][MASK][SEP]", with [MASK] used as the query?
  3. When [MASK] is used for prediction, how many candidate categories are there: only the number of entities, or the number of entities plus BERT's original vocabulary size?
  1. The ENTITY_ID and RELATION_ID tokens are assigned IDs in the order they are read from relations.txt and entity2textlong.txt.
  2. Yes, we use the [MASK] token for prediction.
  3. Our candidate set includes only entities, not BERT's original vocabulary. Specifically, we use the entity portion of the expanded vocabulary as the candidate set.
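The answers above can be sketched in plain Python. This is a minimal illustration, not the repository's actual code: the base vocabulary size, the token spellings, and the assumption that relations are appended before entities are all mine; only the file-order ID assignment, the input layout, and the entity-only candidate set come from the answers.

```python
# Assumed original vocabulary size of bert-base-uncased.
BASE_VOCAB = 30522

def build_added_token_ids(num_relations, num_entities, base=BASE_VOCAB):
    """Assign IDs to the added tokens in the order they are read from
    relations.txt and entity2textlong.txt, appended after BERT's vocabulary.
    (Relations-before-entities ordering is an assumption here.)"""
    token_to_id = {}
    next_id = base
    for i in range(num_relations):
        token_to_id[f"[RELATION_{i}]"] = next_id
        next_id += 1
    for i in range(num_entities):
        token_to_id[f"[ENTITY_{i}]"] = next_id
        next_id += 1
    return token_to_id

token_to_id = build_added_token_ids(num_relations=2, num_entities=3)

# Input form from answer 2, with [MASK] as the query position:
query = ["[CLS]", "[ENTITY_0]", "[SEP]", "[RELATION_1]", "[SEP]", "[MASK]", "[SEP]"]

# Answer 3: the candidate set at the [MASK] position is the entity portion of
# the expanded vocabulary only, never BERT's original word pieces.
candidate_ids = sorted(
    tid for tok, tid in token_to_id.items() if tok.startswith("[ENTITY_")
)
print(candidate_ids)  # the only IDs that would be scored at [MASK]
```

In practice you would load the tokenizer with `BertTokenizer.from_pretrained` on the released checkpoint (which already contains the added tokens) and slice the masked-LM logits at the [MASK] position down to `candidate_ids` before taking the argmax.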
