adlumal committed
Commit bcb1ccd
Parent: d9171f8

Update README.md

Files changed (1): README.md (+2, -0)
@@ -22,9 +22,11 @@ This is a [sentence-transformers](https://www.SBERT.net) model: It maps sentence
 This model is a fine-tune of [BAAI/bge-small-en](https://huggingface.co/BAAI/bge-small-en) using the HCA case law in the [Open Australian Legal Corpus](https://huggingface.co/datasets/umarbutler/open-australian-legal-corpus) by Umar Butler. The PDF/OCR cases were not used.
 
 The cases were split into < 512 context chunks using the bge-small-en tokeniser and [semchunk](https://github.com/umarbutler/semchunk).
+
 [mistralai/Mixtral-8x7B-Instruct-v0.1](https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1) was used to generate a legal question for each context chunk.
 
 129,137 context-question pairs were used for training.
+
 14,348 context-question pairs were used for evaluation (see the table below for results).
 
 Using a 10% subset of the val dataset the following hit-rate performance was reached and is compared to the base model and OpenAI's default ada embedding model.
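The chunking step described in the README (splitting cases into < 512-token chunks with the bge-small-en tokeniser and semchunk) could look roughly like the sketch below. This is a minimal illustration assuming semchunk's `chunkerify` API and a Hugging Face tokenizer; it is not the author's actual preprocessing code, and corpus loading is omitted.

```python
# Sketch: split a case's text into chunks of < 512 bge-small-en tokens.
import semchunk
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-small-en")

# semchunk.chunkerify accepts a transformers tokenizer and a token budget;
# 512 here mirrors the README's "< 512 context chunks" (special tokens may
# need a small margin in practice).
chunker = semchunk.chunkerify(tokenizer, chunk_size=512)

def chunk_case(case_text: str) -> list[str]:
    # Returns the case split into semantically coherent chunks.
    return chunker(case_text)
```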
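Question generation with Mixtral-8x7B-Instruct-v0.1 might be sketched as follows. The prompt wording, decoding settings, and helper function are assumptions for illustration only; the commit does not publish the prompt actually used.

```python
# Sketch: generate one legal question per context chunk with Mixtral-8x7B-Instruct.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "mistralai/Mixtral-8x7B-Instruct-v0.1"
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto")

def generate_question(context: str) -> str:
    # Hypothetical prompt; the original prompt is not given in this commit.
    messages = [{
        "role": "user",
        "content": (
            "Write one legal question that the following passage from an "
            "Australian High Court judgment answers.\n\n" + context
        ),
    }]
    input_ids = tokenizer.apply_chat_template(messages, return_tensors="pt").to(model.device)
    output_ids = model.generate(input_ids, max_new_tokens=64, do_sample=False)
    # Strip the prompt tokens and return only the generated question.
    return tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True)
```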
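The hit-rate comparison mentioned in the README can be reproduced in spirit with a retrieval check like the one below: for each generated question, retrieve the top-k chunks and count a hit when the chunk the question was generated from appears among them. The model ID and top-k value are placeholders, not the author's exact evaluation setup.

```python
# Sketch: hit-rate evaluation over question/context pairs.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("BAAI/bge-small-en")  # swap in the fine-tuned checkpoint to compare

def hit_rate(questions: list[str], contexts: list[str], k: int = 2) -> float:
    # questions[i] was generated from contexts[i], so index i is the gold label.
    q_emb = model.encode(questions, convert_to_tensor=True, normalize_embeddings=True)
    c_emb = model.encode(contexts, convert_to_tensor=True, normalize_embeddings=True)
    hits = 0
    for i, result in enumerate(util.semantic_search(q_emb, c_emb, top_k=k)):
        if any(entry["corpus_id"] == i for entry in result):
            hits += 1
    return hits / len(questions)
```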