Fine-tuned Merlinite-7B on OpenShift 4.15 documentation using 28606 Q&A pairs.
Method
The Q&A corpus was generated using the following methodology:
- Generated 5 Q&A pairs for each page of the OpenShift (OCP) 4.15 PDFs longer than 1500 characters. The length threshold was chosen to exclude title pages and pages with little content.
- Mistral-7B-Instruct-v0.2 was used to generate the Q&A pairs for each page.
- Mixtral-8x22B-Instruct-v0.1 was used to evaluate the quality of each Q&A pair and to remove low-quality entries.
- Removed Q&A pairs whose questions contained phrases or words like "this example", "this context", "this document", "trademark" and "copyright".
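The two rule-based filters above (page length and banned phrases) can be sketched as follows. This is a minimal illustration, not the code used to build the corpus: the function names, the `question` dictionary key, and the case-insensitive matching are all assumptions, and the model-based generation and quality-scoring steps are omitted.

```python
# Hypothetical sketch of the rule-based filters; names and data layout are assumptions.
MIN_PAGE_CHARS = 1500
BANNED_PHRASES = (
    "this example",
    "this context",
    "this document",
    "trademark",
    "copyright",
)


def keep_page(page_text: str) -> bool:
    """Keep only pages longer than the 1500-character threshold."""
    return len(page_text) > MIN_PAGE_CHARS


def keep_pair(question: str) -> bool:
    """Drop pairs whose question mentions a banned phrase.

    Matching is case-insensitive here, which is an assumption.
    """
    lowered = question.lower()
    return not any(phrase in lowered for phrase in BANNED_PHRASES)


def filter_pairs(pairs: list[dict]) -> list[dict]:
    """Return the pairs whose 'question' field passes keep_pair."""
    return [pair for pair in pairs if keep_pair(pair["question"])]
```

A question such as "What does this example show?" would be dropped, since it only makes sense alongside the source page.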
The resulting corpus contains 28606 Q&A pairs. The corpus was divided into training (25745 Q&A pairs) and eval (2861 Q&A pairs) sets.
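The split sizes correspond to roughly 90% training and 10% eval. A minimal sketch of such a split is shown below, assuming a seeded random shuffle; the card does not say how pairs were assigned to each set, so `split_corpus` and the seed are purely illustrative, and integers stand in for the actual Q&A pairs.

```python
import random


def split_corpus(pairs, eval_size, seed=0):
    """Shuffle a copy of the corpus and hold out eval_size items for eval.

    The random shuffle and fixed seed are assumptions, not the documented method.
    """
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    return shuffled[eval_size:], shuffled[:eval_size]


# Placeholder corpus: integers stand in for the 28606 Q&A pairs.
corpus = list(range(28606))
train, eval_set = split_corpus(corpus, eval_size=2861)
print(len(train), len(eval_set))  # 25745 2861
```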
The model was trained for 300 iterations.
Bias, Risks, and Limitations
The model has not been aligned to human preferences, so the model might produce problematic output. The model might also maintain the limitations and constraints that arise from the base model.
The model was trained on synthetic data, so it may inherit both the advantages and the limitations of the underlying data generation methods.
In the absence of adequate safeguards and RLHF, there exists a risk of malicious utilization of these models for generating disinformation or harmful content. Caution is urged against complete reliance on a specific language model for crucial decisions or impactful information, as preventing these models from fabricating content is not straightforward. Additionally, it remains uncertain whether smaller models might exhibit increased susceptibility to hallucination in ungrounded generation scenarios due to their reduced sizes and memorization capacities. This aspect is currently an active area of research, and we anticipate more rigorous exploration, comprehension, and mitigations in this domain.