Just tested with some legal text - and here's my opinion/human eval.

#1
by KrishnaKaasyap - opened

I tried the model on two tasks: summarising a 1,000-token legal case law and creating Q&A pairs from a 1,500-token case law.

In summarising the case law, the model performed on par with 🤗's Zephyr 7B (a Mistral fine-tune). But in creating Q&A pairs, it performed worse than Zephyr 7B.

Compared with LLaMA 2 70B chat (which is an RLHF model), your model's summarisation is very similar to LLaMA 2 70B's.

LLaMA 2 70B is visibly better at creating Q&A pairs than both Zephyr and your model. That's to be expected: it is an RLHF model, and it is a whopping 10 times bigger.

But what amazes me is that yours is probably the only 32k-context 7B model out there apart from @vipul's Together AI LLaMA 7B 32K model. And yours is the only one trained almost entirely on synthetic data!

Since you're only in the first epoch of training, I'm expecting visible improvements after the third or fourth epoch.

Great job dude. πŸ‘ŒπŸΌ

Thanks so much for your positive feedback, I'm glad you enjoyed the experience.

The direct comparison with Zephyr is great for me to keep in mind.

I will re-run your suggested experiments after the second, third, etc. epochs.


Do you use the default system prompt? If not, what prompts give you better results?


I used the default system prompt. I don't think any 7B models, or even 70B models, out there can be properly steered or controlled by changing the system prompt.

Even GPT-3.5 couldn't be steered properly until late July; only after the model was improved and function calling was added did it start treating the system message as the user intended.

Now, for summarisation, I employed a specific style of prompt engineering: giving the model the exact output format it needed, with clear markers at the start and end of the inserted text.

Across the five different examples I tested, it produced results very similar to the LLaMA 2 70B chat (RLHF) model.

Here's the prompt for summarisation -

You're a professor and expert on Indian income tax laws. Now explain the below case law in the given format -

  1. Case name
  2. In favour of
  3. Judges
  4. Counsel name
  5. Decision date
  6. Appeal no
  7. Short summary of headnote
  8. Detailed analysis of headnote
  9. Sections and their relative acts mentioned in the case law
  10. Final judgement given and the reason for giving such judgement

---Text Starts---

{Inserted text from my legal case law which is almost 1000 - 1200 tokens}

---Text Ends---

Be specific and accurate. Take a deep breath and work on this problem step by step.
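If it helps anyone reproduce this, here's a minimal Python sketch of how the prompt above could be assembled programmatically. The field list and marker strings are copied from the prompt in this post; the function and constant names are my own, not from any library.

```python
# Hypothetical helper that assembles the summarisation prompt described above.
# The wording is taken from the post; names like build_summary_prompt are mine.
SUMMARY_FIELDS = [
    "Case name",
    "In favour of",
    "Judges",
    "Counsel name",
    "Decision date",
    "Appeal no",
    "Short summary of headnote",
    "Detailed analysis of headnote",
    "Sections and their relative acts mentioned in the case law",
    "Final judgement given and the reason for giving such judgement",
]

def build_summary_prompt(case_text: str) -> str:
    """Wrap a case-law excerpt in the format-first prompt from the post."""
    numbered = "\n".join(f"  {i}. {field}" for i, field in enumerate(SUMMARY_FIELDS, 1))
    return (
        "You're a professor and expert on Indian income tax laws. "
        "Now explain the below case law in the given format -\n\n"
        f"{numbered}\n\n"
        "---Text Starts---\n\n"
        f"{case_text}\n\n"
        "---Text Ends---\n\n"
        "Be specific and accurate. Take a deep breath and work on this "
        "problem step by step."
    )
```

The resulting string can then be passed to whatever chat or completion endpoint you're testing.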

β€’β€’β€’β€’β€’β€’β€’

Now, coming to generating Q&A pairs from the text given as output:

LLaMA 2 70B did it far better than any open-source model, including this model and Zephyr. GPT-3.5 via the OpenAI playground did it even better than LLaMA 2 70B.

Here's the prompt -

Now create five (5) detailed and complex scholarly question and answer sets based on the case law and explanation given above.

Mention technical details like - case name, sections and their relative acts - in each and every question & answer pair.

Take a deep breath and work on this problem step by step.

Answer in the below format -

{"prompt": "", "completion": ""}

{"prompt": "", "completion": ""}

{"prompt": "", "completion": ""}

Most of the time it didn't even follow the given format, and even when it did, the generated Q&A pairs were bland and unoriginal.
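Since the models often ignore the JSON format, one workaround is to parse the output defensively instead of expecting clean JSON lines. Below is a small sketch (my own, not anything from the post) that pulls any flat `{"prompt": ..., "completion": ...}` objects out of free-form model output and validates each with `json.loads`.

```python
# Defensive parser sketch for the Q&A output format above.
# Assumption: the objects are flat (no nested braces), as in the template.
import json
import re

def extract_qa_pairs(raw_output: str) -> list:
    """Return every valid {"prompt": ..., "completion": ...} dict found in raw text."""
    pairs = []
    # Non-greedy match on brace-delimited spans; nested braces are not handled.
    for match in re.finditer(r"\{.*?\}", raw_output, flags=re.DOTALL):
        try:
            obj = json.loads(match.group(0))
        except json.JSONDecodeError:
            continue  # skip malformed fragments mixed into the prose
        if isinstance(obj, dict) and {"prompt", "completion"} <= obj.keys():
            pairs.append(obj)
    return pairs
```

This at least salvages the well-formed pairs from an otherwise messy generation, though it does nothing about the blandness problem.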

Hope this helps @gnomealone , thanks. πŸ‘πŸ»
