turboderp commited on
Commit
59cdf57
1 Parent(s): dd0cc25

Update README.md

Browse files
Files changed (1) hide show
  1. README.md +31 -8
README.md CHANGED
@@ -1,17 +1,35 @@
 
 
 
1
  # Turbcat 8b
 
 
 
 
 
 
2
  # Release notes
3
-
4
- This is a direct upgrade over cat 70B, with 2x the dataset size, added Chinese support.
5
 
6
  # Data Generation
7
- The data generation process is largely the same. Except additional Chinese data were added. 20 postdocs participated in the annotation process and standard model training for embedding scoring.
 
 
 
 
 
8
 
9
  # Task coverage
10
- In addition to standard assistant and roleplay data, the following tasks are targeted:
11
- * GRE
12
- * SAT
13
- * MCAT
14
- * Chinese Kaoyan
 
 
 
 
15
 
16
  # Thirdparty dataset
17
  Thanks to the following people for their tremendous support for dataset generation:
@@ -21,9 +39,14 @@ Thanks to the following people for their tremendous support for dataset generati
21
 
22
  # Prompt format for 8b:
23
  llama3
 
 
 
24
 
25
  # Prompt format for 72b:
26
  chatml
 
 
27
 
28
  # Support
29
  Please join https://discord.gg/DwGz54Mz for model support
 
1
+ ---
2
+ license: llama3
3
+ ---
4
  # Turbcat 8b
5
+ ![image/png](3.png)
6
+ ![image/png](4.png)
7
+ ![image/png](5.png)
8
+ ![image/png](6.png)
9
+ ![image/png](7.png)
10
+ ![image/png](8.png)
11
  # Release notes
12
+ This is a direct upgrade over cat 70B, with 2x the dataset size(2GB-> 5GB), added Chinese support with quality on par with the original English dataset.
13
+ The medical COT portion of the dataset has been sponsored by steelskull, and the action packed character play portion was donated by Gryphe's(aesir dataset). Note that 8b is based on llama3 with limited Chinese support due to base model choice. The chat format in 8b is llama3. The 72b has more comprehensive Chinese support and the format will be chatml.
14
 
15
  # Data Generation
16
+ In addition to the specified fortifications above, the data generation process is largely the same. Except for added Chinese Ph. D. Entrance exam, Traditional Chinese and Chinese story telling data.
17
+
18
+ ## Special Highlights
19
+ * 20 postdocs (10 Chinese, 10 English speaking doctors specialized in computational biology, biomed, biophysics and biochemistry)participated in the annotation process.
20
+ * GRE and MCAT/Kaoyan questions were manually answered by the participants using strictly COT and BERT judges producing embeddings were trained based on the provided annotation. For an example of BERT embedding visualization and scoring, please refer to https://huggingface.co/turboderp/Cat-Llama-3-70B-instruct
21
+ * Initial support of roleplay as api usage. When roleplaying as an API or function, the model does not produce irrelevant content that's not specified by the system prompt.
22
 
23
  # Task coverage
24
+
25
+ ## Chinese tasks on par with English data
26
+ ![image/png](1.png)
27
+ For the Chinese portion of the dataset, we strictly kept its distribution and quality comparable to the English counterpart, as visualized by the close distance of the doublets. The overall QC is visualized by PCA after bert embedding
28
+
29
+ ## Individual tasks Quality Checked by doctors
30
+ For each cluster, we QC using BERT embeddings on an umap:
31
+ ![image/png](2.png)
32
+ The outliers have been manually checked by doctors.
33
 
34
  # Thirdparty dataset
35
  Thanks to the following people for their tremendous support for dataset generation:
 
39
 
40
  # Prompt format for 8b:
41
  llama3
42
+ Example raw prompt:
43
+ #FIXME
44
+
45
 
46
  # Prompt format for 72b:
47
  chatml
48
+ Example raw prompt:
49
+ #FIXME
50
 
51
  # Support
52
  Please join https://discord.gg/DwGz54Mz for model support