christofid committed on
Commit
2324bad
·
1 Parent(s): 7c6533a

Update README.md

  1. README.md +56 -0
README.md CHANGED
---
license: mit
---
### PGT

PGT is a prompt-based GPT-2 model trained to support three patent generation tasks: *part-of-patent generation*, *part-of-patent editing* and *patent coherence check*. For more information about the dataset and the training procedure, we refer the reader to [our paper](https://openreview.net/pdf?id=dLHtwZKvJmE).

The task is specified by appending a short prompt to the end of a given input. The general format is:

`input <|sep|> task specific prompt <|sep|>`

In all cases, the generated output ends with the special token `<|endoftext|>` to facilitate postprocessing.
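
The general format can be assembled and post-processed with two small string helpers. This is a minimal sketch: the names `build_prompt` and `strip_output` are hypothetical and not part of the model or of any library, and the single-space padding around `<|sep|>` simply mirrors the examples in this README.

```python
SEP = "<|sep|>"
EOS = "<|endoftext|>"


def build_prompt(input_text: str, task_prompt: str) -> str:
    """Return `input <|sep|> task specific prompt <|sep|>` (hypothetical helper)."""
    return f"{input_text.strip()} {SEP} {task_prompt.strip()} {SEP}"


def strip_output(decoded: str, prompt: str = "") -> str:
    """Keep the text before the first end-of-text token.

    If `prompt` is given and the decoded text starts with exactly that
    string, the echoed prompt is removed as well. This is an assumption
    about how the text was decoded, not a guarantee.
    """
    text = decoded.split(EOS)[0]
    if prompt and text.startswith(prompt):
        text = text[len(prompt):]
    return text.strip()
```

For instance, `build_prompt("An interesting patent abstract.", "Given the above abstract, suggest a title")` yields a string in the format shown above.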

### Supported tasks

**Part-of-patent generation** generates one part of a patent from another, already existing part. The model has been trained to perform title-to-abstract and abstract-to-claim generation, as well as their inverses. For claims, the model was only exposed to independent claims during training. Input example for part-of-patent generation in the abstract-to-title case:

`An interesting patent abstract. <|sep|> Given the above abstract, suggest a title <|sep|>`
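
A prompt for any of the four trained directions could be built along the following lines. This is a sketch under one loud assumption: only the title and abstract wordings appear in this README, so the claim-related prompts are assumed to follow the same `Given the above X, suggest a Y` pattern. The function name `generation_prompt` is hypothetical.

```python
# Directions the README states the model was trained on.
SUPPORTED_DIRECTIONS = {
    ("title", "abstract"),
    ("abstract", "title"),
    ("abstract", "claim"),
    ("claim", "abstract"),
}


def generation_prompt(text: str, source: str, target: str) -> str:
    """Build a part-of-patent generation prompt (hypothetical helper)."""
    if (source, target) not in SUPPORTED_DIRECTIONS:
        raise ValueError(f"untrained direction: {source} -> {target}")
    article = "an" if target[0] in "aeiou" else "a"
    # ASSUMPTION: prompts involving claims use the same wording as the
    # title/abstract prompts shown in the README.
    task = f"Given the above {source}, suggest {article} {target}"
    return f"{text.strip()} <|sep|> {task} <|sep|>"
```

Other directions, such as title-to-claim, can still be tried in a zero-shot fashion by writing the prompt by hand.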

**Part-of-patent editing** suggests alternatives for highlighted parts of a patent abstract or claim. These parts are marked in the input with the special `[MASK]` token. Each masked part is expected to range from a single word to a short phrase. If more than one mask is given in the input, the generated suggestions are separated in the output by the special `<|mask_sep|>` token. Input example for part-of-patent editing on a claim:

`An interesting patent claim with a [MASK] part. <|sep|> Replace the [MASK] tokens in the above claim <|sep|>`
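
The suggestions for several masks can be split and written back into the masked text roughly as follows. This sketch assumes that the generated string contains only the model's continuation (the prompt has already been removed) and that suggestions come in the same order as the masks. Both helper names are hypothetical.

```python
from typing import List

MASK = "[MASK]"
MASK_SEP = "<|mask_sep|>"
EOS = "<|endoftext|>"


def split_suggestions(generated: str) -> List[str]:
    """Split the generated continuation into one suggestion per mask."""
    body = generated.split(EOS)[0]
    parts = [part.strip() for part in body.split(MASK_SEP)]
    # Drop empty pieces, e.g. from a trailing separator.
    return [part for part in parts if part]


def fill_masks(masked_text: str, suggestions: List[str]) -> str:
    """Replace each [MASK] with the suggestion at the same position."""
    pieces = masked_text.split(MASK)
    if len(pieces) - 1 != len(suggestions):
        raise ValueError("number of suggestions does not match number of masks")
    out = [pieces[0]]
    for suggestion, piece in zip(suggestions, pieces[1:]):
        out.append(suggestion)
        out.append(piece)
    return "".join(out)
```

Checking the suggestion count before filling guards against outputs in which the model produced more or fewer suggestions than there are masks.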

The **coherence check** assesses the quality of a patent by examining whether two given parts could belong to the same patent in terms of content and syntax. The input parts can be a title, an abstract or a claim. The expected output is Yes or No. Input example for the coherence check with a title and a claim as input:

`A patent title <|sep|> An interesting patent claim. <|sep|> Do the above title and claim belong to the same patent? <|sep|>`
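
A coherence-check prompt and its Yes/No answer might be handled as sketched below. Only the title-and-claim wording is given in this README; using the same question template for other pairs of parts is an assumption, and both function names are hypothetical.

```python
from typing import Optional


def coherence_prompt(first: str, first_kind: str, second: str, second_kind: str) -> str:
    """Build a coherence-check prompt for two patent parts."""
    # ASSUMPTION: every pair of parts uses the question wording that the
    # README shows for the title/claim case.
    question = f"Do the above {first_kind} and {second_kind} belong to the same patent?"
    return f"{first.strip()} <|sep|> {second.strip()} <|sep|> {question} <|sep|>"


def parse_coherence(answer: str) -> Optional[bool]:
    """Map a generated answer to True (Yes), False (No) or None (unclear)."""
    text = answer.split("<|endoftext|>")[0].strip().lower()
    first_word = text.split()[0].strip(".,!") if text else ""
    if first_word == "yes":
        return True
    if first_word == "no":
        return False
    return None
```

Returning `None` for anything other than a clear Yes or No keeps unexpected generations from being silently counted as a verdict.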

Further prompts and tasks can be tried in a zero-shot fashion.

The model and the tasks are also integrated and available via the [GT4SD Python library](https://github.com/GT4SD/gt4sd-core/blob/main/notebooks/explore-pgt.ipynb).

### Example

A full example of part-of-patent generation:

```python
import os

from transformers import AutoModelForCausalLM, AutoTokenizer

# Read the access token from the environment rather than hard-coding it.
# Newer transformers releases name this argument `token`.
hf_token = os.environ.get("HF_TOKEN")

tokenizer = AutoTokenizer.from_pretrained("christofid/pgt", use_auth_token=hf_token)
model = AutoModelForCausalLM.from_pretrained("christofid/pgt", use_auth_token=hf_token)

# Title-to-abstract prompt: input, separator, task prompt, separator.
text = "Automated patent generation <|sep|> Given the above title, suggest an abstract <|sep|>"

text_encoded = tokenizer.encode(text, return_tensors="pt")

# Sample three candidate sequences with top-k sampling.
generated = model.generate(text_encoded, do_sample=True, top_k=50, num_return_sequences=3, max_length=512)

# Decode and keep only the text before the first end-of-text token.
generated_text = [tokenizer.decode(case).split("<|endoftext|>")[0].strip() for case in generated]
```

### BibTeX entry and citation info
```
@inproceedings{christofidellis2022pgt,
  title={PGT: a prompt based generative transformer for the patent domain},
  author={Christofidellis, Dimitrios and Torres, Antonio Berrios and Dave, Ashish and Roveri, Manuel and Schmidt, Kristin and Swaminathan, Sarath and Vandierendonck, Hans and Zubarev, Dmitry and Manica, Matteo},
  booktitle={ICML 2022 Workshop on Knowledge Retrieval and Language Models},
  year={2022}
}
```