Prompt Evaluation: Rubrics and Acceptance Criteria
Enterprise-grade framework for consistent prompt quality and performance assessment
TL;DR - Quick Answer
Use a 4-dimension rubric: Clarity (specificity, context), Quality (accuracy, relevance), Safety (no sensitive data, bias check), and Performance (consistency, efficiency). Set acceptance criteria with scoring thresholds, quality gates, and measurable outcomes for enterprise deployment.
Evaluation Framework Facts
- 4 Core Dimensions and Targets: Clarity (target 9-10/10), Quality (target 8-10/10), Safety (Pass/Fail), Performance (80%+ consistency)
- Enterprise Threshold: Minimum 8/10 overall score with no dimension below 7/10 for production use
- Testing Requirements: 10-20 sample runs per scenario (10 at minimum), diverse input scenarios, edge case validation
- Review Frequency: Monthly for high-usage, quarterly for standard, immediate after model updates
- Documentation: Version control, change logs, test results, and approval trails required
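The review cadence above can be sketched as a small scheduling helper. This is a minimal illustration, not a prescribed implementation: the function name, the tier labels "high" and "standard", and the 30-day and 90-day intervals (approximating "monthly" and "quarterly") are all assumptions made for the example.

```python
from datetime import date, timedelta
from typing import Optional

# Assumed interval lengths: the rubric says "monthly" and "quarterly";
# 30 and 90 days are approximations chosen for this sketch.
REVIEW_INTERVAL_DAYS = {"high": 30, "standard": 90}


def next_review_date(
    last_review: date,
    usage_tier: str,
    model_updated_on: Optional[date] = None,
) -> date:
    """Return the date on which a prompt is next due for review."""
    # A model update after the last review triggers an immediate review.
    if model_updated_on is not None and model_updated_on > last_review:
        return model_updated_on
    return last_review + timedelta(days=REVIEW_INTERVAL_DAYS[usage_tier])
```

For example, a high-usage prompt last reviewed on 1 January would be due again 30 days later, unless the underlying model changed in the meantime, in which case the review is due on the update date.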
4-Dimension Prompt Evaluation Rubric
Comprehensive framework for enterprise prompt assessment
1. Clarity (Weight: 25%)
Scoring Criteria (1-10):
- 9-10: Crystal clear, specific instructions with examples
- 7-8: Clear but could use more specificity
- 5-6: Understandable but ambiguous in places
- 1-4: Vague, confusing, or incomplete instructions
Assessment Points:
- Task definition clarity
- Context and constraints specified
- Expected output format defined
- Edge case instructions included
2. Quality (Weight: 30%)
Scoring Criteria (1-10):
- 9-10: Consistently accurate, highly relevant outputs
- 7-8: Generally accurate with minor inconsistencies
- 5-6: Acceptable quality with notable issues
- 1-4: Poor quality, frequent errors
Assessment Points:
- Output accuracy and relevance
- Consistency across multiple runs
- Completeness of responses
- Professional tone and style
3. Safety (Weight: 25%) - Pass/Fail
Pass Criteria (All Required):
- ✓ No sensitive data exposure
- ✓ Bias check passed
- ✓ Compliance with data policies
- ✓ No harmful content generation
Fail Indicators (Any One):
- ✗ Prompts contain PII or secrets
- ✗ Discriminatory outputs detected
- ✗ Policy violations identified
- ✗ Security vulnerabilities present
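Because Safety is all-or-nothing, it can be modeled as a gate rather than a score. The sketch below records one boolean per pass criterion and fails the gate if any single one is false. The class and function names are hypothetical, and how each boolean gets determined (manual review, automated scanning, and so on) is outside what this example covers.

```python
from dataclasses import dataclass


@dataclass
class SafetyChecks:
    # One flag per pass criterion in the rubric; True means the check passed.
    no_sensitive_data_exposure: bool
    bias_check_passed: bool
    complies_with_data_policies: bool
    no_harmful_content: bool


def safety_passes(checks: SafetyChecks) -> bool:
    """Pass only if every criterion passed; any single failure fails the gate."""
    return all(vars(checks).values())


def failed_criteria(checks: SafetyChecks) -> list:
    """List the names of the criteria that did not pass, for the review record."""
    return [name for name, passed in vars(checks).items() if not passed]
```

Returning the failed criteria by name makes it easier to write the result into the change log or approval trail.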
4. Performance (Weight: 20%)
Scoring Criteria (1-10):
- 9-10: 90%+ consistency, fast response
- 7-8: 80-89% consistency, good speed
- 5-6: 70-79% consistency, acceptable speed
- 1-4: <70% consistency, slow/unreliable
Assessment Points:
- Output consistency across runs
- Response time and efficiency
- Error handling and fallbacks
- Resource consumption
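The consistency bands in the Performance scoring criteria can be expressed as a simple lookup. This is a sketch under two assumptions: it covers only the consistency half of each band, since the rubric gives no numeric thresholds for speed, and it treats fractional values such as 89.5% as belonging to the lower band. The function name is hypothetical.

```python
def performance_band(consistency_pct: float) -> str:
    """Map a measured consistency percentage to the rubric's score band."""
    if not 0 <= consistency_pct <= 100:
        raise ValueError("consistency_pct must be between 0 and 100")
    if consistency_pct >= 90:
        return "9-10"
    if consistency_pct >= 80:
        return "7-8"
    if consistency_pct >= 70:
        return "5-6"
    return "1-4"
```

An evaluator would still pick the exact score within the band after weighing response time, error handling, and resource consumption.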
Enterprise Acceptance Criteria
Quality gates for production deployment
✓ Ready for Production
- Overall score: 8.0/10 or higher
- No dimension below 7.0/10
- Safety: Pass (100% compliance)
- Consistency: 80%+ across test runs
- Documentation complete
- Stakeholder approval obtained
⚠ Requires Improvement
- Overall score: 6.0-7.9/10
- One dimension below 7.0/10
- Safety: Pass but with warnings
- Consistency: 60-79%
- Minor documentation gaps
- Additional testing needed
✗ Not Ready - Major Issues
- Overall score: Below 6.0/10
- Multiple dimensions below 7.0/10
- Safety: Fail on any criteria
- Consistency: Below 60%
- Significant quality issues
- Requires complete revision
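The three quality gates can be sketched as a classification function. The rubric does not say how a pass/fail Safety result enters a numeric overall score, so this example makes an explicit assumption: Safety acts as a gate, and the overall score is the weighted average of Clarity, Quality, and Performance with their weights (25%, 30%, 20%) renormalized to sum to one. All names, the "pass"/"warn"/"fail" labels, and the returned status strings are hypothetical.

```python
from dataclasses import dataclass

# Weights from the rubric for the three scored dimensions. Safety (25%) is
# pass/fail, so this sketch treats it as a gate and renormalizes the rest.
# That treatment is an assumption, not something the rubric specifies.
WEIGHTS = {"clarity": 0.25, "quality": 0.30, "performance": 0.20}


@dataclass
class Evaluation:
    scores: dict            # 1-10 score for each key in WEIGHTS
    safety: str             # "pass", "warn" (pass with warnings), or "fail"
    consistency_pct: float  # measured across test runs
    docs_complete: bool
    approved: bool          # stakeholder approval obtained


def overall_score(scores: dict) -> float:
    """Weighted average of the scored dimensions on the 1-10 scale."""
    total_weight = sum(WEIGHTS.values())
    return sum(scores[dim] * w for dim, w in WEIGHTS.items()) / total_weight


def classify(ev: Evaluation) -> str:
    overall = overall_score(ev.scores)
    low_dims = [dim for dim in WEIGHTS if ev.scores[dim] < 7.0]
    # Any major issue blocks the prompt outright.
    if (ev.safety == "fail" or overall < 6.0
            or len(low_dims) > 1 or ev.consistency_pct < 60):
        return "not_ready"
    # Every production condition must hold at once.
    if (overall >= 8.0 and not low_dims and ev.safety == "pass"
            and ev.consistency_pct >= 80 and ev.docs_complete and ev.approved):
        return "ready"
    return "requires_improvement"
```

Under these assumptions, scores of 9, 8, and 8 give an overall score of about 8.33, which is "ready" only if Safety passes cleanly, consistency is at least 80%, and documentation and approval are in place.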
📋 Testing Requirements
- Minimum 10 test runs per scenario
- Multiple evaluators (2-3 minimum)
- Edge case and stress testing
- Different input variations
- Cross-model validation
- Performance benchmarking
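A minimal harness for the run-count and evaluator requirements might look like the following. The rubric does not define how consistency is measured, so this sketch assumes one simple definition: the share of runs whose normalized output matches the most common output. The `run_prompt` callable stands in for whatever model call your stack uses, and every name here is hypothetical.

```python
from collections import Counter
from statistics import mean
from typing import Callable, Sequence

MIN_RUNS = 10        # rubric minimum per scenario
MIN_EVALUATORS = 2   # rubric minimum number of evaluators


def consistency_pct(
    run_prompt: Callable[[str], str], scenario: str, runs: int = MIN_RUNS
) -> float:
    """Percentage of runs that produced the most common output.

    Assumed definition: outputs are compared after trimming whitespace
    and lowercasing. Free-form outputs may need a looser comparison.
    """
    if runs < MIN_RUNS:
        raise ValueError(f"at least {MIN_RUNS} runs are required per scenario")
    outputs = [run_prompt(scenario).strip().lower() for _ in range(runs)]
    most_common_count = Counter(outputs).most_common(1)[0][1]
    return 100.0 * most_common_count / runs


def mean_evaluator_score(scores: Sequence[float]) -> float:
    """Average the 1-10 scores given by independent evaluators."""
    if len(scores) < MIN_EVALUATORS:
        raise ValueError(f"at least {MIN_EVALUATORS} evaluators are required")
    return mean(scores)
```

Exact-match comparison suits classification or extraction prompts; for longer generated text, a similarity measure or evaluator judgment would be a more realistic choice.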
When to Use This Rubric
Skip Formal Evaluation When