metadata

language:
  - en
tags:
  - masked-lm
  - pytorch
pipeline-tag: fill-mask
mask-token: '[MASK]'
widget:
  - text: >-
      The present [MASK] provides a torque sensor that is small and highly rigid
      and for which high production efficiency is possible.
  - text: >-
      The present invention relates to [MASK] accessories and pertains
      particularly to a brake light unit for bicycles.
  - text: >-
      The present invention discloses a space-bound-free [MASK] and its
      coordinate determining circuit for determining a coordinate of a stylus
      pen.
  - text: >-
      The illuminated [MASK] includes a substantially translucent canopy
      supported by a plurality of ribs pivotally swingable towards and away from
      a shaft.
license: apache-2.0
metrics:
  - perplexity

Motivation

This model is based on anferico/bert-for-patents, a BERT_LARGE model (see the next section for details). By default, a pre-trained model outputs embeddings of size 768 (base models) or 1024 (large models). When you store millions of embeddings, this can require quite a lot of memory or storage. The embedding dimension has therefore been reduced to 64, i.e. 1/16th of 1024, using Principal Component Analysis (PCA), and the model still gives comparable performance. In this setting, PCA gave better performance than NMF. Note: This process improves neither the runtime nor the memory required to run the model. It only reduces the space needed to store embeddings, for example for semantic search using vector databases.
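
As an illustration of the idea, here is a minimal sketch of reducing 1024-dimensional vectors to 64 dimensions with PCA, written with NumPy only. It is not necessarily how the projection was fitted or applied for this model: the helper names `fit_pca` and `reduce_embeddings` are hypothetical, and the random vectors are stand-ins for real embeddings.

```python
import numpy as np


def fit_pca(embeddings, n_components):
    """Fit a PCA projection; return the mean and the top principal directions."""
    mean = embeddings.mean(axis=0)
    centered = embeddings - mean
    # Rows of vt are the principal directions, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return mean, vt[:n_components]


def reduce_embeddings(embeddings, mean, components):
    """Project embeddings onto the fitted principal directions."""
    return (embeddings - mean) @ components.T


# Stand-in data (assumption): 2,000 random float32 vectors of size 1024.
rng = np.random.default_rng(0)
full = rng.standard_normal((2000, 1024)).astype(np.float32)

mean, components = fit_pca(full, n_components=64)
reduced = reduce_embeddings(full, mean, components)

print(full.shape, reduced.shape)
print("storage ratio:", full.nbytes // reduced.nbytes)
```

The projection is fitted once and then applied to every new embedding, so only the 64-dimensional vectors need to be stored. With the same numeric type, that is 1/16th of the original storage.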

BERT for Patents

BERT for Patents is a model trained by Google on 100M+ patents (not just US patents).

If you want to learn more about the model, check out the blog post, white paper and GitHub page containing the original TensorFlow checkpoint.

Projects using this model (or variants of it):

Patents4IPPC (carried out by Pi School and commissioned by the Joint Research Centre (JRC) of the European Commission)