Your Name committed on
Commit 07c9257 • 1 Parent(s): 3f3cd0d

Initial Commit

{bert-for-patents-64d/1_Pooling → 1_Pooling}/config.json RENAMED
File without changes
{bert-for-patents-64d/2_Dense → 2_Dense}/config.json RENAMED
File without changes
{bert-for-patents-64d/2_Dense → 2_Dense}/pytorch_model.bin RENAMED
File without changes
README.md CHANGED
@@ -1,38 +1,38 @@
- ---
- language:
- - en
- tags:
- - masked-lm
- - pytorch
- pipeline-tag:
- - "fill-mask"
- mask-token:
- - "[MASK]"
- widget:
- - text: "The present [MASK] provides a torque sensor that is small and highly rigid and for which high production efficiency is possible."
- - text: "The present invention relates to [MASK] accessories and pertains particularly to a brake light unit for bicycles."
- - text: "The present invention discloses a space-bound-free [MASK] and its coordinate determining circuit for determining a coordinate of a stylus pen."
- - text: "The illuminated [MASK] includes a substantially translucent canopy supported by a plurality of ribs pivotally swingable towards and away from a shaft."
- license: apache-2.0
- metrics:
- - perplexity
-
- ---
-
- # Motivation
-
-
- This model is based on anferico/bert-for-patents, a BERT<sub>LARGE</sub> model (see details below). By default, a pre-trained model outputs embeddings of size 768 (base models) or 1024 (large models). However, storing millions of embeddings at that size can require a lot of memory and storage. We have therefore reduced the embedding dimension to 64, i.e. 1/16th of 1024, using Principal Component Analysis (PCA), and it still gives comparable performance. PCA also gives better performance than NMF. Note: this process improves neither the runtime nor the memory required to run the model. It only reduces the space needed to store embeddings, for example for semantic search using vector databases.
-
-
-
- # BERT for Patents
-
- BERT for Patents is a model trained by Google on 100M+ patents (not just US patents).
-
- If you want to learn more about the model, check out the [blog post](https://cloud.google.com/blog/products/ai-machine-learning/how-ai-improves-patent-analysis), [white paper](https://services.google.com/fh/files/blogs/bert_for_patents_white_paper.pdf) and [GitHub page](https://github.com/google/patents-public-data/blob/master/models/BERT%20for%20Patents.md) containing the original TensorFlow checkpoint.
-
- ---
-
- ### Projects using this model (or variants of it):
- - [Patents4IPPC](https://github.com/ec-jrc/Patents4IPPC) (carried out by [Pi School](https://picampus-school.com/) and commissioned by the [Joint Research Centre (JRC)](https://ec.europa.eu/jrc/en) of the European Commission)
+ ---
+ language:
+ - en
+ tags:
+ - masked-lm
+ - pytorch
+ pipeline-tag:
+ - "fill-mask"
+ mask-token:
+ - "[MASK]"
+ widget:
+ - text: "The present [MASK] provides a torque sensor that is small and highly rigid and for which high production efficiency is possible."
+ - text: "The present invention relates to [MASK] accessories and pertains particularly to a brake light unit for bicycles."
+ - text: "The present invention discloses a space-bound-free [MASK] and its coordinate determining circuit for determining a coordinate of a stylus pen."
+ - text: "The illuminated [MASK] includes a substantially translucent canopy supported by a plurality of ribs pivotally swingable towards and away from a shaft."
+ license: apache-2.0
+ metrics:
+ - perplexity
+
+ ---
+
+ # Motivation
+
+
+ This model is based on anferico/bert-for-patents, a BERT<sub>LARGE</sub> model (see details below). By default, a pre-trained model outputs embeddings of size 768 (base models) or 1024 (large models). However, storing millions of embeddings at that size can require a lot of memory and storage. We have therefore reduced the embedding dimension to 64, i.e. 1/16th of 1024, using Principal Component Analysis (PCA), and it still gives comparable performance. PCA also gives better performance than NMF. Note: this process improves neither the runtime nor the memory required to run the model. It only reduces the space needed to store embeddings, for example for semantic search using vector databases.
+
+
+
+ # BERT for Patents
+
+ BERT for Patents is a model trained by Google on 100M+ patents (not just US patents).
+
+ If you want to learn more about the model, check out the [blog post](https://cloud.google.com/blog/products/ai-machine-learning/how-ai-improves-patent-analysis), [white paper](https://services.google.com/fh/files/blogs/bert_for_patents_white_paper.pdf) and [GitHub page](https://github.com/google/patents-public-data/blob/master/models/BERT%20for%20Patents.md) containing the original TensorFlow checkpoint.
+
+ ---
+
+ ### Projects using this model (or variants of it):
+ - [Patents4IPPC](https://github.com/ec-jrc/Patents4IPPC) (carried out by [Pi School](https://picampus-school.com/) and commissioned by the [Joint Research Centre (JRC)](https://ec.europa.eu/jrc/en) of the European Commission)
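
The Motivation paragraph in README.md describes reducing 1024-dimensional embeddings to 64 dimensions with PCA in order to save storage. The following NumPy sketch illustrates that idea under stated assumptions; it is not the procedure used to build this model. The random matrix is a stand-in for real embeddings, the helper names `fit_pca_components`, `project`, and `storage_bytes` are hypothetical, and projecting without mean subtraction is assumed only because the `Dense` layer listed in the model architecture has `bias: False`.

```python
import numpy as np


def fit_pca_components(embeddings: np.ndarray, n_components: int) -> np.ndarray:
    """Return the top principal directions as an (n_components, dim) matrix."""
    centered = embeddings - embeddings.mean(axis=0, keepdims=True)
    # Rows of vt are principal directions, ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[:n_components]


def project(embeddings: np.ndarray, components: np.ndarray) -> np.ndarray:
    """Apply the components as a bias-free linear layer (assumption: no centering)."""
    return embeddings @ components.T


def storage_bytes(n_vectors: int, dim: int, bytes_per_value: int = 4) -> int:
    """Raw storage for n_vectors dense vectors (float32 by default)."""
    return n_vectors * dim * bytes_per_value


rng = np.random.default_rng(0)
# Stand-in for real 1024-dimensional sentence embeddings.
full = rng.standard_normal((500, 1024)).astype(np.float32)

components = fit_pca_components(full, n_components=64)
reduced = project(full, components)
print(reduced.shape)  # (500, 64)

# Storage for one million float32 vectors before and after the reduction.
print(storage_bytes(1_000_000, 1024))  # 4096000000
print(storage_bytes(1_000_000, 64))    # 256000000
```

With float32 values, one million vectors take about 4.1 GB at 1024 dimensions and about 0.26 GB at 64, which is the 1/16th ratio the README mentions. The model itself is no smaller or faster, since the projection is applied after the full BERT forward pass.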
bert-for-patents-64d/README.md DELETED
@@ -1,55 +0,0 @@
- ---
- pipeline_tag: sentence-similarity
- tags:
- - sentence-transformers
- - feature-extraction
- - sentence-similarity
- ---
-
- # {MODEL_NAME}
-
- This is a [sentence-transformers](https://www.SBERT.net) model: It maps sentences & paragraphs to a 64 dimensional dense vector space and can be used for tasks like clustering or semantic search.
-
- <!--- Describe your model here -->
-
- ## Usage (Sentence-Transformers)
-
- Using this model becomes easy when you have [sentence-transformers](https://www.SBERT.net) installed:
-
- ```
- pip install -U sentence-transformers
- ```
-
- Then you can use the model like this:
-
- ```python
- from sentence_transformers import SentenceTransformer
- sentences = ["This is an example sentence", "Each sentence is converted"]
-
- model = SentenceTransformer('{MODEL_NAME}')
- embeddings = model.encode(sentences)
- print(embeddings)
- ```
-
-
-
- ## Evaluation Results
-
- <!--- Describe how your model was evaluated -->
-
- For an automated evaluation of this model, see the *Sentence Embeddings Benchmark*: [https://seb.sbert.net](https://seb.sbert.net?model_name={MODEL_NAME})
-
-
-
- ## Full Model Architecture
- ```
- SentenceTransformer(
- (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel
- (1): Pooling({'word_embedding_dimension': 1024, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
- (dense): Dense({'in_features': 1024, 'out_features': 64, 'bias': False, 'activation_function': 'torch.nn.modules.linear.Identity'})
- )
- ```
-
- ## Citing & Authors
-
- <!--- Describe where people can find more information -->
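
The "Full Model Architecture" listing in the deleted README shows three stages: a BERT transformer, mean pooling over token embeddings, and a bias-free `Dense` layer from 1024 to 64 features with an identity activation. The sketch below reproduces the last two stages in plain NumPy to make the shapes concrete. It is an illustration only: the token embeddings and the weight matrix are random placeholders, and the function names `mean_pool` and `dense_no_bias` are hypothetical rather than part of sentence-transformers.

```python
import numpy as np


def mean_pool(token_embeddings: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Average token vectors over non-padding positions.

    token_embeddings: (batch, seq_len, dim); attention_mask: (batch, seq_len) of 0/1.
    """
    mask = attention_mask[..., None].astype(token_embeddings.dtype)  # (batch, seq_len, 1)
    summed = (token_embeddings * mask).sum(axis=1)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)  # guard against all-padding rows
    return summed / counts


def dense_no_bias(pooled: np.ndarray, weight: np.ndarray) -> np.ndarray:
    """Linear map with identity activation; weight is (out_features, in_features)."""
    return pooled @ weight.T


rng = np.random.default_rng(0)
tokens = rng.standard_normal((2, 5, 1024))   # placeholder for BERT token outputs
mask = np.array([[1, 1, 1, 0, 0],            # first sentence: 3 real tokens, 2 padding
                 [1, 1, 1, 1, 1]])           # second sentence: 5 real tokens
weight = rng.standard_normal((64, 1024))     # placeholder for the stored projection

sentence_vectors = dense_no_bias(mean_pool(tokens, mask), weight)
print(sentence_vectors.shape)  # (2, 64)
```

Because the dense stage has no bias and no nonlinearity, it is a single matrix multiplication, which is why a PCA component matrix can be stored directly as its weights.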
bert-for-patents-64d/config.json → config.json RENAMED
File without changes
bert-for-patents-64d/config_sentence_transformers.json → config_sentence_transformers.json RENAMED
File without changes
bert-for-patents-64d/modules.json → modules.json RENAMED
File without changes
bert-for-patents-64d/pytorch_model.bin → pytorch_model.bin RENAMED
File without changes
bert-for-patents-64d/sentence_bert_config.json → sentence_bert_config.json RENAMED
File without changes
bert-for-patents-64d/special_tokens_map.json → special_tokens_map.json RENAMED
File without changes
bert-for-patents-64d/tokenizer.json → tokenizer.json RENAMED
File without changes
bert-for-patents-64d/tokenizer_config.json → tokenizer_config.json RENAMED
File without changes
bert-for-patents-64d/vocab.txt → vocab.txt RENAMED
File without changes