Evaluation Tasks
This directory contains evaluation tasks organized by use case.
Structure
tasks/
├── sql_generation/           # SQL generation tasks
│   └── nyc_taxi_small/       # NYC Taxi dataset
├── code_generation/          # Code generation tasks
│   ├── python_algorithms/    # Python algorithm tasks
│   └── go_algorithms/        # Go algorithm tasks
└── documentation/            # Documentation generation tasks
    ├── technical_docs/       # Technical documentation tasks
    └── api_documentation/    # API documentation tasks
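As a rough illustration of how this layout can be consumed, the sketch below walks a `tasks/` root and groups task directories by use case. The function name `discover_tasks` is hypothetical and not part of this repository; it assumes only the two-level layout shown above.

```python
from pathlib import Path


def discover_tasks(root: Path) -> dict[str, list[str]]:
    """Map each use-case directory under `root` to its task directory names.

    Hypothetical helper: assumes the use_case/task layout shown above.
    Plain files at either level are ignored.
    """
    tasks: dict[str, list[str]] = {}
    for use_case in sorted(p for p in root.iterdir() if p.is_dir()):
        tasks[use_case.name] = sorted(
            t.name for t in use_case.iterdir() if t.is_dir()
        )
    return tasks
```

Calling `discover_tasks(Path("tasks"))` on the tree above would return a dictionary keyed by `sql_generation`, `code_generation`, and `documentation`.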
Use Cases
1. SQL Generation
- Purpose: Evaluate models on natural language to SQL query generation
- Datasets: NYC Taxi Small
- Dialects: Presto, BigQuery, Snowflake
- Metrics: Correctness, execution success, result matching, dialect compliance
2. Code Generation
- Purpose: Evaluate models on natural language to source code generation
- Languages: Python, Go, JavaScript, Java
- Datasets: Algorithm implementations, web services, data structures
- Metrics: Syntax correctness, compilation success, execution success, code quality
3. Documentation Generation
- Purpose: Evaluate models on generating technical documentation from natural language descriptions
- Formats: Markdown, HTML, JSON, YAML
- Datasets: API docs, technical guides, installation instructions
- Metrics: Accuracy, completeness, clarity, format compliance
Task Structure
Each task directory contains:
Required Files
- cases.yaml: Test cases with questions and reference outputs
- loader.py: Data loading and test execution utilities
- schema.sql: Database schema (for SQL tasks)
- test_data.json: Test data for evaluation (for code/doc tasks)
Optional Files
- README.md: Task-specific documentation
- requirements.txt: Task-specific dependencies
- config.yaml: Task-specific configuration
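A quick way to confirm that a task directory is complete is to check it against the required files listed above. The following is a minimal sketch: the helper `missing_files` is hypothetical, and the per-use-case split is an assumption inferred from the "(for SQL tasks)" and "(for code/doc tasks)" notes.

```python
from pathlib import Path

# Files every task needs, per the Required Files list.
COMMON_REQUIRED = ("cases.yaml", "loader.py")

# Assumed mapping from use-case directory to its extra required files.
EXTRA_REQUIRED = {
    "sql_generation": ("schema.sql",),
    "code_generation": ("test_data.json",),
    "documentation": ("test_data.json",),
}


def missing_files(task_dir: Path, use_case: str) -> list[str]:
    """Return the required file names that are absent from `task_dir`."""
    required = COMMON_REQUIRED + EXTRA_REQUIRED.get(use_case, ())
    return [name for name in required if not (task_dir / name).is_file()]
```

An empty list means the directory has everything the use case requires; optional files are not checked.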
Adding New Tasks
- Create a new directory under the appropriate use case
- Add the required files (cases.yaml, loader.py)
- Define test cases with questions and reference outputs
- Implement data loading and evaluation logic
- Update the main configuration files
Evaluation Metrics
SQL Generation
- Correctness: Exact match with reference SQL
- Execution Success: SQL executes without errors
- Result Matching: F1 score comparing query results
- Dialect Compliance: SQL transpiles correctly to the target dialect (Presto, BigQuery, Snowflake)
- Readability: SQL structure and formatting
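To make the execution-success and result-matching metrics concrete, here is a minimal sketch. It uses an in-memory SQLite database as a stand-in engine, which is an assumption: the tasks here target Presto, BigQuery, and Snowflake. The names `run_query` and `result_f1` are hypothetical. Rows are compared as multisets, so row order is ignored.

```python
import sqlite3
from collections import Counter


def run_query(conn: sqlite3.Connection, sql: str):
    """Return (rows, None) on success or (None, error message) on failure."""
    try:
        return conn.execute(sql).fetchall(), None
    except sqlite3.Error as exc:
        return None, str(exc)


def result_f1(predicted_rows, reference_rows) -> float:
    """F1 score over result rows treated as multisets (order ignored)."""
    pred, ref = Counter(predicted_rows), Counter(reference_rows)
    if not pred and not ref:
        return 1.0  # both queries returned nothing: treat as a match
    overlap = sum((pred & ref).values())  # rows present in both results
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```

For example, if the reference query returns rows `(2,)` and `(3,)` and the generated query returns only `(3,)`, precision is 1.0, recall is 0.5, and the F1 score is 2/3.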
Code Generation
- Syntax Correctness: Code parses without syntax errors
- Compilation Success: Code builds successfully
- Execution Success: Code runs and produces expected output
- Code Quality: Follows language best practices
- Performance: Code efficiency and optimization
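For Python candidates, the syntax and execution checks could look roughly like the sketch below. The function names are hypothetical, and the approach (byte-compiling the source, then running it in a subprocess and comparing stdout) is one assumed way to implement these metrics. Other languages would need their own toolchains.

```python
import subprocess
import sys


def syntax_ok(source: str) -> bool:
    """True if the Python source compiles to bytecode without a syntax error."""
    try:
        compile(source, "<candidate>", "exec")
        return True
    except (SyntaxError, ValueError):
        return False


def runs_with_output(source: str, expected_stdout: str, timeout: float = 5.0) -> bool:
    """Run the source in a fresh interpreter and compare its trimmed stdout.

    Note: this executes the candidate code directly; real evaluations of
    untrusted model output should run inside a sandbox.
    """
    try:
        proc = subprocess.run(
            [sys.executable, "-c", source],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return False
    return proc.returncode == 0 and proc.stdout.strip() == expected_stdout.strip()
```

The timeout guards against candidates that loop forever, which would otherwise stall the whole evaluation run.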
Documentation Generation
- Accuracy: Content matches reference documentation
- Completeness: Covers all required information
- Clarity: Easy to understand and follow
- Format Compliance: Follows specified documentation format
- Technical Correctness: Technically accurate information
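Two of these metrics lend themselves to simple automated checks. The sketch below shows one assumed approach: format compliance for JSON output by attempting to parse it, and completeness as the fraction of required items mentioned in the document. Both function names and the substring-matching heuristic are illustrative assumptions, not the scoring used by this project.

```python
import json


def json_format_ok(text: str) -> bool:
    """True if `text` parses as valid JSON."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


def completeness(doc: str, required_items: list[str]) -> float:
    """Fraction of required items found in the document (case-insensitive)."""
    if not required_items:
        return 1.0
    lowered = doc.lower()
    found = sum(1 for item in required_items if item.lower() in lowered)
    return found / len(required_items)
```

A document that covers "Installation" and "Usage" but omits "Authentication" would score 2/3 on completeness against those three required items. Accuracy and clarity are harder to reduce to string checks and typically need a reference comparison or a judge.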