Dataset Card: OCI Copilot Jr Dataset
Overview
This dataset contains 13,196 high-quality training examples for fine-tuning a Large Language Model into a specialist in Oracle Cloud Infrastructure (OCI), the "OCI Copilot Jr".
The dataset was synthetically generated using prompt templates with OCI CLI commands and real-world enterprise scenarios in Brazilian Portuguese (PT-BR).
| Split | Examples | Percentage |
|---|---|---|
| Train | 9,897 | 75% |
| Valid | 1,979 | 15% |
| Eval | 1,320 | 10% |
| Total | 13,196 | 100% |
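The split counts in the table are consistent with cutting 75% and 15% of the total (truncated to whole examples) and assigning the remainder to the eval split. The sketch below only illustrates that arithmetic; the function name `split_sizes` and the truncation rule are assumptions, not the documented behavior of `build_dataset_fixed.py`.

```python
def split_sizes(total: int, train_frac: float = 0.75, valid_frac: float = 0.15) -> dict:
    """Return train/valid/eval sizes; eval receives whatever remains."""
    train = int(total * train_frac)  # truncate to a whole number of examples
    valid = int(total * valid_frac)
    return {"train": train, "valid": valid, "eval": total - train - valid}


sizes = split_sizes(13196)
print(sizes)  # {'train': 9897, 'valid': 1979, 'eval': 1320}
```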
Dataset Structure
Schema (Chat Format)
{
  "messages": [
    {
      "role": "system",
      "content": "Você é um arquiteto e especialista experiente em OCI focado no domínio de {category}. Forneça orientações técnicas, profundas e definitivas."
    },
    {
      "role": "user",
      "content": "Para o ambiente {environment} do nosso projeto {project}, precisamos realizar: {task}. Quais as melhores estratégias e comandos no OCI considerando a restrição: {restriction}?"
    },
    {
      "role": "assistant",
      "content": "## {task} — OCI Step-by-Step\n\n**Cenário**: {company}, projeto {project}, ambiente {environment}\n\n[detailed technical response with OCI CLI commands, Terraform, and best practices]"
    }
  ]
}
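A record in this chat format can be checked with a few lines of standard-library Python. The following is a minimal sketch of such a check, assuming every example holds exactly one system, one user, and one assistant message in that order; the helper name `is_valid_example` and the sample messages are hypothetical and do not come from the dataset or from `validate_jsonl.py`.

```python
import json

EXPECTED_ROLES = ["system", "user", "assistant"]


def is_valid_example(line: str) -> bool:
    """Check that one JSONL line matches the three-message chat schema."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError:
        return False
    if not isinstance(record, dict):
        return False
    messages = record.get("messages")
    if not isinstance(messages, list) or len(messages) != len(EXPECTED_ROLES):
        return False
    for message, role in zip(messages, EXPECTED_ROLES):
        if not isinstance(message, dict) or message.get("role") != role:
            return False
        content = message.get("content")
        # Every message must carry non-empty text.
        if not isinstance(content, str) or not content.strip():
            return False
    return True


# Hypothetical sample record, written only to exercise the check.
sample = json.dumps({"messages": [
    {"role": "system", "content": "Você é um arquiteto e especialista experiente em OCI."},
    {"role": "user", "content": "Como organizar compartments para o ambiente de staging?"},
    {"role": "assistant", "content": "Separe os compartments por ambiente e por projeto."},
]})
print(is_valid_example(sample))
```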
Categories (88 OCI Domains)
| Pillar | Categories |
|---|---|
| Compute | instances, custom-images, scaling |
| Container | instances, OKE |
| Database | autonomous, autonomous-json, exadata, exadata-cloud, MySQL, NoSQL, PostgreSQL |
| DevOps | artifacts, CI/CD, resource-manager, secrets |
| FinOps | cost-optimization, rightsizing, showback-chargeback, storage-tiering |
| Governance | audit-readiness, budgets-cost, compartments, compliance, landing-zone, policies-guardrails, resource-discovery, tagging |
| Load Balancer | load-balancer |
| Migration | aws-database, azure-compute, azure-database, azure-storage, data-transfer, gcp-compute, gcp-database, gcp-storage, onprem-compute, onprem-database, onprem-storage, onprem-vmware |
| Networking | connectivity, security, VCN |
| Observability | APM, logging, monitoring, stack-monitoring |
| Platform | backup-governance, SRE-operations |
| Security | cloud-guard, dynamic-groups, encryption, federation, IAM-basics, policies, posture-management, vault-keys, vault-secrets, WAF, zero-trust |
| Serverless | api-gateway, functions |
| Storage | block, file, object |
| Terraform | compute, container, database, devops, load-balancer, networking, observability, provider, security, serverless, state, storage |
| Troubleshooting | authentication, compute, connectivity, database, functions, OKE, performance, storage |
Data Generation Pipeline
flowchart LR
A["generate_v7_combined.py\n(88 cats × 150 ex)"] --> B["validate_jsonl.py"]
B --> C["clean_dataset.py"]
C --> D["dedupe_embedding.py\n(threshold 0.97)"]
D --> E["build_dataset_fixed.py\n(75/15/10 split)"]
E --> F["train.jsonl\nvalid.jsonl\neval.jsonl"]
Generation Process
Template-based generation: prompt templates are filled with varied values for:
- Company names (realistic Brazilian enterprises)
- Project names
- Environments (greenfield, brownfield, production, staging)
- Personas (SRE, Platform Engineer, FinOps Analyst, Architect)
- Restrictions (budget-limited, no-downtime, rollback-15min, etc.)
- Regions and compartments
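The slot-filling step described above can be sketched as follows. The two templates are the system and user strings from the schema; the environment and restriction pools come from the list above, while the project names, the example task, and the function name `build_prompt` are hypothetical placeholders. This is an illustration of the idea, not the code in `generate_v7_combined.py`.

```python
import random

SYSTEM_TEMPLATE = (
    "Você é um arquiteto e especialista experiente em OCI focado no domínio de "
    "{category}. Forneça orientações técnicas, profundas e definitivas."
)
USER_TEMPLATE = (
    "Para o ambiente {environment} do nosso projeto {project}, precisamos "
    "realizar: {task}. Quais as melhores estratégias e comandos no OCI "
    "considerando a restrição: {restriction}?"
)

# Environments and restrictions are taken from the dataset card;
# the project names are hypothetical.
ENVIRONMENTS = ["greenfield", "brownfield", "production", "staging"]
RESTRICTIONS = ["budget-limited", "no-downtime", "rollback-15min"]
PROJECTS = ["projeto-alfa", "projeto-beta"]


def build_prompt(category: str, task: str, rng: random.Random) -> list:
    """Fill the system and user templates with randomly chosen slot values."""
    user_content = USER_TEMPLATE.format(
        environment=rng.choice(ENVIRONMENTS),
        project=rng.choice(PROJECTS),
        task=task,
        restriction=rng.choice(RESTRICTIONS),
    )
    return [
        {"role": "system", "content": SYSTEM_TEMPLATE.format(category=category)},
        {"role": "user", "content": user_content},
    ]


rng = random.Random(0)  # fixed seed so the sketch is reproducible
for message in build_prompt("storage/object", "configurar políticas de ciclo de vida", rng):
    print(message["role"], "->", message["content"])
```

Passing an explicit `random.Random` instance keeps generation reproducible, which makes it easier to regenerate or audit a specific batch of prompts.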
Quality Validation:
- JSONL schema validation
- Content cleaning (removes generic template output and incorrect CLI commands)
- Semantic deduplication using embeddings (threshold 0.97)
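Embedding-based deduplication with a similarity threshold can be sketched as a greedy pass: an example is kept only if it is not too similar to any example already kept. The version below assumes cosine similarity and a "drop when similarity is at or above the threshold" rule; both are assumptions, since the card only states the threshold (0.97). The toy vectors stand in for real embeddings, and the function names are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def dedupe(embeddings, threshold=0.97):
    """Return indices of kept items, dropping near-duplicates of earlier kept items."""
    kept = []
    for i, vec in enumerate(embeddings):
        if all(cosine(vec, embeddings[j]) < threshold for j in kept):
            kept.append(i)
    return kept


# Toy 2-D "embeddings": the second vector is nearly parallel to the first.
vectors = [[1.0, 0.0], [0.999, 0.02], [0.0, 1.0]]
print(dedupe(vectors))  # [0, 2]
```

This pass compares each candidate against every kept item, so it is quadratic in the worst case; for a dataset of this size that is usually acceptable, and an approximate nearest-neighbor index would be the natural replacement at larger scale.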
Token Statistics
| Metric | Value |
|---|---|
| Average tokens/example | 883 |
| Min tokens | 410 |
| Max tokens | 934 |
Fine-Tuning Results
After fine-tuning Qwen 2.5 Coder 7B Instruct (4-bit) with LoRA on this dataset, the fine-tuned model scored higher than the base model on every metric rated by an external judge model:
External Judge Evaluation (mlx-community/Meta-Llama-3.1-8B-Instruct-4bit) - 200 samples
| Metric | Base Model | Fine-Tuned | Delta |
|---|---|---|---|
| technical_correctness | 3.00 | 3.73 | +0.72 |
| depth | 3.06 | 3.82 | +0.76 |
| structure | 3.50 | 4.63 | +1.14 |
| hallucination | 3.62 | 4.46 | +0.84 |
| clarity | 3.20 | 3.98 | +0.77 |
| Overall | 3.27 | 4.12 | +0.85 |
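The deltas can be recomputed directly from the rounded scores in the table. The sketch below does so and also averages the five metrics; it assumes that "Overall" is the unweighted mean of the metrics, which the card does not state. Values recomputed this way may differ from the listed ones by 0.01, which is what one would expect if the listed figures were derived from unrounded scores (again an assumption).

```python
# (base, fine-tuned) scores copied from the table above.
scores = {
    "technical_correctness": (3.00, 3.73),
    "depth": (3.06, 3.82),
    "structure": (3.50, 4.63),
    "hallucination": (3.62, 4.46),
    "clarity": (3.20, 3.98),
}

# Per-metric improvement, rounded to two decimals like the table.
deltas = {name: round(tuned - base, 2) for name, (base, tuned) in scores.items()}

# Unweighted means across the five metrics (assumed definition of "Overall").
base_mean = sum(base for base, _ in scores.values()) / len(scores)
tuned_mean = sum(tuned for _, tuned in scores.values()) / len(scores)

for name, delta in deltas.items():
    print(f"{name}: +{delta:.2f}")
print(f"mean base={base_mean:.3f} mean fine-tuned={tuned_mean:.3f}")
```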
Top Gains by Topic
- troubleshooting/performance: +3.80
- storage/object: +3.60
- observability/apm: +3.40
- security/dynamic-groups: +3.40
- database/postgresql: +3.40
Model Files
| Resource | URL |
|---|---|
| Safetensors | https://huggingface.co/otavio-lemos/oci-copilot-jr-safetensors |
| GGUF | https://huggingface.co/otavio-lemos/oci-copilot-jr-gguf |
Use and Limitations
Intended Use
This dataset is designed for:
- Fine-tuning LLMs for Oracle Cloud Infrastructure (OCI) operations
- Training technical assistants specialized in OCI CLI, Terraform, and best practices
- Building domain-specific RAG systems for cloud operations
Limitations
- Language: Only Brazilian Portuguese (PT-BR)
- Generated data: Not human-annotated; may contain occasional inaccuracies
- Knowledge cutoff: Based on OCI documentation available up to April 2026
- Scope: Focused on operational tasks, not development or architecture planning
Citation
@dataset{lemos_2026_oci_copilot_jr,
author = {Otavio Lemos},
title = {OCI Copilot Jr Dataset},
year = {2026},
publisher = {HuggingFace},
url = {https://huggingface.co/datasets/otavio-lemos/oci-copilot-jr-dataset}
}
License
MIT License - See LICENSE
Dataset generated using the MLX-Tune pipeline on an Apple Silicon M3 Pro