TroyDoesAI committed
Commit 967a36e
1 Parent(s): 05e5799

Update README.md

Files changed (1):
  1. README.md +35 -17

README.md CHANGED
@@ -7,9 +7,16 @@ Base Model : microsoft/Phi-3-mini-128k-instruct
  Overview
  This model is meant to enhance adherence to provided context (e.g., for RAG applications) and reduce hallucinations, inspired by the airoboros context-obedient question-answer format.

- Dataset format
- The format for a contextual prompt is as follows:
  BEGININPUT
  BEGINCONTEXT
  [key0: value0]
@@ -22,20 +29,21 @@ ENDINPUT
  BEGININSTRUCTION
  [insert your instruction(s). The model was tuned with single questions, paragraph format, lists, etc.]
  ENDINSTRUCTION

  I know it's a bit verbose and annoying, but after much trial and error, using these explicit delimiters helps the model understand where to find the responses and how to associate specific sources with them.

- BEGININPUT - denotes a new input block
- BEGINCONTEXT - denotes the block of context (metadata key/value pairs) to associate with the current input block
- ENDCONTEXT - denotes the end of the metadata block for the current input
- [text] - Insert whatever text you want for the input block, as many paragraphs as can fit in the context.
- ENDINPUT - denotes the end of the current input block
- [repeat as many input blocks in this format as you want]
- BEGININSTRUCTION - denotes the start of the instruction(s) to respond to for all of the input blocks above.
- [instruction(s)]
- ENDINSTRUCTION - denotes the end of the instruction set
  Here's a trivial, but important example to prove the point:
-
  BEGININPUT
  BEGINCONTEXT
  date: 2021-01-01
@@ -46,25 +54,35 @@ ENDINPUT
  BEGININSTRUCTION
  What color are blueberries? Source?
  ENDINSTRUCTION

  And the expected response:
-
  Blueberries are now green.
  Source:
  date: 2021-01-01
  url: https://web.site/123

- References in response
  As shown in the example, the dataset includes many examples of including source details in the response when the question asks for a source/citation/reference.

- Why do this? Well, the R in RAG seems to be the weakest link in the chain. Retrieval accuracy, depending on many factors including the overall dataset size, can be quite low. This accuracy increases when retrieving more documents, but then you have the issue of actually using the retrieved documents in prompts. If you use one prompt per document (or document chunk), you know exactly which document the answer came from, so there's no issue. If, however, you include multiple chunks in a single prompt, it's useful to include the specific reference chunk(s) used to generate the response, rather than naively including references to all of the chunks included in the prompt.

  For example, suppose I have two documents:
-
  url: http://foo.bar/1
  Strawberries are tasty.

  url: http://bar.foo/2
  The cat is blue.

- If the question being asked is What color is the cat?, I would only expect the 2nd document to be referenced in the response, as the other link is irrelevant.
  Overview
  This model is meant to enhance adherence to provided context (e.g., for RAG applications) and reduce hallucinations, inspired by the airoboros context-obedient question-answer format.

+ ---
+ license: cc-by-4.0
+ ---
+
+ # Contextual DPO
+
+ ## Overview

+ The format for a contextual prompt is as follows:
+ ```
  BEGININPUT
  BEGINCONTEXT
  [key0: value0]

  BEGININSTRUCTION
  [insert your instruction(s). The model was tuned with single questions, paragraph format, lists, etc.]
  ENDINSTRUCTION
+ ```

  I know it's a bit verbose and annoying, but after much trial and error, using these explicit delimiters helps the model understand where to find the responses and how to associate specific sources with them.
+ - `BEGININPUT` - denotes a new input block
+ - `BEGINCONTEXT` - denotes the block of context (metadata key/value pairs) to associate with the current input block
+ - `ENDCONTEXT` - denotes the end of the metadata block for the current input
+ - [text] - Insert whatever text you want for the input block, as many paragraphs as can fit in the context.
+ - `ENDINPUT` - denotes the end of the current input block
+ - [repeat as many input blocks in this format as you want]
+ - `BEGININSTRUCTION` - denotes the start of the instruction(s) to respond to for all of the input blocks above.
+ - [instruction(s)]
+ - `ENDINSTRUCTION` - denotes the end of the instruction set
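The delimiter grammar in the list above is mechanical enough to generate programmatically. The sketch below (function and variable names are illustrative assumptions, not part of this model's tooling) assembles a contextual prompt from (metadata, text) chunks:

```python
# Illustrative sketch: build a context-obedient prompt from (metadata, text)
# chunks using the BEGININPUT/BEGINCONTEXT delimiters described above.
def build_prompt(chunks: list[tuple[dict, str]], instruction: str) -> str:
    parts = []
    for metadata, text in chunks:
        parts.append("BEGININPUT")
        parts.append("BEGINCONTEXT")
        for key, value in metadata.items():
            parts.append(f"{key}: {value}")
        parts.append("ENDCONTEXT")
        parts.append(text)
        parts.append("ENDINPUT")
    parts.append("BEGININSTRUCTION")
    parts.append(instruction)
    parts.append("ENDINSTRUCTION")
    return "\n".join(parts)

prompt = build_prompt(
    [
        ({"url": "http://foo.bar/1"}, "Strawberries are tasty."),
        ({"url": "http://bar.foo/2"}, "The cat is blue."),
    ],
    "What color is the cat?",
)
print(prompt)
```

Each chunk becomes one input block with its own metadata, and a single instruction section closes the prompt, matching the "repeat as many input blocks as you want" rule.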
  Here's a trivial, but important example to prove the point:
+ ```
  BEGININPUT
  BEGINCONTEXT
  date: 2021-01-01

  BEGININSTRUCTION
  What color are blueberries? Source?
  ENDINSTRUCTION
+ ```

  And the expected response:
+ ```
  Blueberries are now green.
  Source:
  date: 2021-01-01
  url: https://web.site/123
+ ```
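The expected response above is regular enough to parse mechanically. As a hypothetical sketch (the helper name and approach are my own, not shipped with the model), the `Source:` block can be read back into key/value pairs:

```python
# Hypothetical helper: pull the metadata the model echoes back after a
# "Source:" line into a dict, assuming one "key: value" pair per line.
def parse_source(response: str) -> dict:
    meta = {}
    in_source = False
    for line in response.splitlines():
        if line.strip() == "Source:":
            in_source = True
            continue
        if in_source and ":" in line:
            key, _, value = line.partition(":")
            meta[key.strip()] = value.strip()
    return meta

response = """Blueberries are now green.
Source:
date: 2021-01-01
url: https://web.site/123"""
print(parse_source(response))  # {'date': '2021-01-01', 'url': 'https://web.site/123'}
```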
+
+ ### References in response

  As shown in the example, the dataset includes many examples of including source details in the response when the question asks for a source/citation/reference.

+ Why do this? Well, the R in RAG seems to be the weakest link in the chain.
+ Retrieval accuracy, depending on many factors including the overall dataset size, can be quite low.
+ This accuracy increases when retrieving more documents, but then you have the issue of actually using
+ the retrieved documents in prompts. If you use one prompt per document (or document chunk), you know
+ exactly which document the answer came from, so there's no issue. If, however, you include multiple
+ chunks in a single prompt, it's useful to include the specific reference chunk(s) used to generate the
+ response, rather than naively including references to all of the chunks included in the prompt.
  For example, suppose I have two documents:
+ ```
  url: http://foo.bar/1
  Strawberries are tasty.

  url: http://bar.foo/2
  The cat is blue.

+ ```

+ If the question being asked is `What color is the cat?`, I would only expect the 2nd document to be referenced in the response, as the other link is irrelevant.
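One way to sanity-check that behavior downstream (a hypothetical helper, not part of the model card) is to compare the URLs the response actually cites against the chunks that were packed into the prompt; only the relevant chunk should appear:

```python
# Hypothetical check: given a model response and the URLs of the chunks that
# went into the prompt, report which chunk URLs the response cites.
def cited_urls(response: str, chunk_urls: list[str]) -> list[str]:
    return [url for url in chunk_urls if url in response]

chunks = ["http://foo.bar/1", "http://bar.foo/2"]
response = "The cat is blue.\nSource:\nurl: http://bar.foo/2"
print(cited_urls(response, chunks))  # ['http://bar.foo/2']
```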