law-ai committed on
Commit 5cd5ee0
1 Parent(s): 86d70c1

Update README.md

Files changed (1):
  1. README.md +25 -15
README.md CHANGED
@@ -6,7 +6,8 @@ tags:
 license: mit
 ---
 ### InCaseLawBERT
-Model and tokenizer files for the InCaseLawBERT model.
+Model and tokenizer files for the InCaseLawBERT model from the paper [Pre-training Transformers on Indian Legal Text](https://arxiv.org/abs/2209.06049).
+
 
 ### Training Data
 For building the pre-training corpus of Indian legal text, we collected a large corpus of case documents from the Indian Supreme Court and many High Courts of India.
@@ -19,34 +20,43 @@ This model is initialized with the [Legal-BERT model](https://huggingface.co/zlu
 We further train this model on our data for 300K steps on the Masked Language Modeling (MLM) and Next Sentence Prediction (NSP) tasks.
 
 ### Model Overview
+This model uses the same tokenizer as [CaseLawBERT](https://huggingface.co/zlucia/legalbert).
 This model has the same configuration as the [bert-base-uncased model](https://huggingface.co/bert-base-uncased):
 12 hidden layers, 768 hidden dimensionality, 12 attention heads, ~110M parameters.
 
 ### Usage
-Using the tokenizer (same as [CaseLawBERT](https://huggingface.co/zlucia/legalbert))
+Using the model to get embeddings/representations for a piece of text
 ```python
-from transformers import AutoTokenizer
+from transformers import AutoTokenizer, AutoModel
 tokenizer = AutoTokenizer.from_pretrained("law-ai/InCaseLawBERT")
-```
-Using the model to get embeddings/representations for a sentence
-```python
-from transformers import AutoModel
+text = "Replace this string with yours"
+encoded_input = tokenizer(text, return_tensors="pt")
 model = AutoModel.from_pretrained("law-ai/InCaseLawBERT")
+output = model(**encoded_input)
+last_hidden_state = output.last_hidden_state
 ```
 
 ### Fine-tuning Results
+We have fine-tuned all pre-trained models on 3 legal tasks with Indian datasets:
+* Legal Statute Identification ([ILSI Dataset](https://arxiv.org/abs/2112.14731)) [Multi-label Text Classification]: Identifying relevant statutes (law articles) based on the facts of a court case
+* Semantic Segmentation ([ISS Dataset](https://arxiv.org/abs/1911.05405)) [Sentence Tagging]: Segmenting the document into 7 functional parts (semantic segments) such as Facts, Arguments, etc.
+* Court Judgment Prediction ([ILDC Dataset](https://arxiv.org/abs/2105.13562)) [Binary Text Classification]: Predicting whether the claims/petitions of a court case will be accepted/rejected
+
+InCaseLawBERT performs close to CaseLawBERT across the three tasks, but not as well as InLegalBERT. For details, see our [paper](https://arxiv.org/abs/2209.06049).
 
 ### Citation
 ```
-@inproceedings{paul-2022-ptinlegal,
-title = "Pre-training Transformers on Indian Legal Text",
-author = "Paul, Shounak and
-Mandal, Arpan and
-Goyal, Pawan and
-Ghosh, Saptarshi",
-eprinttype = {arXiv}
+@article{paul-2022-pretraining,
+doi = {10.48550/ARXIV.2209.06049},
+url = {https://arxiv.org/abs/2209.06049},
+author = {Paul, Shounak and Mandal, Arpan and Goyal, Pawan and Ghosh, Saptarshi},
+title = {Pre-training Transformers on Indian Legal Text},
+publisher = {arXiv},
+year = {2022},
+copyright = {Creative Commons Attribution 4.0 International}
 }
 ```
+
 ### About Us
 We are a group of researchers from the Department of Computer Science and Technology, Indian Institute of Technology, Kharagpur.
 Our research interests are primarily ML and NLP applications for the legal domain, with a special focus on the challenges and opportunities for the Indian legal scenario.
@@ -55,4 +65,4 @@ We have, and are currently working on several legal tasks such as:
 * semantic segmentation of legal documents
 * legal statute identification from facts, court judgment prediction
 * legal document matching
-You can find our publicly available codes and datasets [here](https://github.com/Law-AI)
+You can find our publicly available codes and datasets [here](https://github.com/Law-AI).
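The updated Usage snippet stops at `output.last_hidden_state`, which is one vector per token. To reduce that to a single fixed-size embedding per text, one common convention (our suggestion, not something the model card prescribes) is mean pooling over the non-padding tokens. A minimal sketch, assuming `torch` is installed; `mean_pool` is a hypothetical helper name:

```python
import torch

def mean_pool(last_hidden_state: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Average token embeddings, ignoring padding positions."""
    # Expand the mask to (batch, seq_len, 1) so it broadcasts over the hidden dimension
    mask = attention_mask.unsqueeze(-1).type_as(last_hidden_state)
    summed = (last_hidden_state * mask).sum(dim=1)   # (batch, hidden)
    counts = mask.sum(dim=1).clamp(min=1e-9)         # avoid division by zero
    return summed / counts

# With the README's variables, the 768-dimensional text embedding would be:
# embedding = mean_pool(output.last_hidden_state, encoded_input["attention_mask"])
```

The `[CLS]` vector (`last_hidden_state[:, 0]`) is another common choice; which works better is task-dependent.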