Chapter 4: AI Governance and Evaluation
Welcome to the final chapter! Here we explore AI governance, evaluation, and monitoring - essential skills for deploying AI agents responsibly in production environments.
Learning Objectives
By the end of this chapter, you'll be able to:
- Implement syntactic and semantic evaluation for AI agents
- Set up monitoring and tracing for agent performance
- Build RAG evaluation systems using the RAG Triad methodology
- Create governance frameworks for AI agent deployment
- Visualize agent performance metrics and traces
- Establish responsible AI practices for production systems
Why AI Governance Matters
AI governance is critical for:
- Trust and reliability in AI systems
- Regulatory compliance with AI regulations
- Risk mitigation for business operations
- Performance optimization through continuous monitoring
- Ethical AI deployment with bias detection and mitigation
- Transparency and explainability for stakeholders
Evaluation Framework
1. Syntactic Evaluation (1. syntactic.py)
Objective: Evaluate basic agent functionality and response consistency
What you'll learn:
- Basic response evaluation metrics
- Consistency testing across similar queries
- Response format validation
Test prompts for evaluation:
"What's Google's stock price?""What's IBM's stock price?"
Key concepts:
- Response consistency: Similar queries should produce similar response structures
- Format validation: Ensure responses meet expected formats
- Basic functionality: Verify core agent capabilities work reliably
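A minimal sketch of these checks, assuming plain-string agent responses and toy regex-based format rules (the actual checks in 1. syntactic.py may differ):

```python
import re

def evaluate_response_format(response: str) -> dict:
    """Toy syntactic checks on a single agent response."""
    return {
        "non_empty": len(response.strip()) > 0,
        "contains_price": bool(re.search(r"\$\d+(\.\d+)?", response)),
        "reasonable_length": 10 <= len(response) <= 2000,
    }

def evaluate_consistency(responses: list[str]) -> float:
    """Structural consistency across similar queries (0.0 - 1.0):
    the fraction of format checks on which all responses agree."""
    results = [evaluate_response_format(r) for r in responses]
    keys = results[0].keys()
    agreeing = sum(1 for k in keys if len({r[k] for r in results}) == 1)
    return agreeing / len(keys)

# Canned responses standing in for real agent output
responses = [
    "Google's stock price is $175.32.",
    "IBM's stock price is $226.10.",
]
for r in responses:
    print(evaluate_response_format(r))
print("consistency:", evaluate_consistency(responses))
```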
2. RAG Triad Evaluation (2. ragtriad.py)
Objective: Comprehensive evaluation of RAG (Retrieval Augmented Generation) systems
What you'll learn:
- Context Relevance: How relevant is the retrieved context?
- Groundedness: How well does the answer stay grounded in the context?
- Answer Relevance: How relevant is the answer to the original question?
The RAG Triad Framework:
┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│     Context     │    │  Groundedness   │    │     Answer      │
│    Relevance    │───►│   Evaluation    │───►│    Relevance    │
│                 │    │                 │    │                 │
└─────────────────┘    └─────────────────┘    └─────────────────┘
         ▲                      ▲                      ▲
         │                      │                      │
Query ──────────────────────────────────────────────────────────► Response
Test prompts for evaluation:
"What's the limit on overseas assistance?""What is the definition of epidemic?"
Evaluation metrics:
- Context Relevance Score: 0.0 - 1.0 (higher is better)
- Groundedness Score: 0.0 - 1.0 (higher is better)
- Answer Relevance Score: 0.0 - 1.0 (higher is better)
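A minimal scoring sketch that uses embedding cosine similarity as a stand-in for the LLM-judged scores typically used in practice; the model name and example texts are assumptions, not necessarily what 2. ragtriad.py uses:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

def rag_triad_scores(query: str, context: str, answer: str) -> dict:
    q, c, a = model.encode([query, context, answer], convert_to_tensor=True)
    clamp = lambda x: max(0.0, float(x))  # map cosine similarity into 0.0 - 1.0
    return {
        "context_relevance": clamp(util.cos_sim(q, c)),  # query vs. retrieved context
        "groundedness": clamp(util.cos_sim(a, c)),        # answer vs. retrieved context
        "answer_relevance": clamp(util.cos_sim(a, q)),    # answer vs. original query
    }

print(rag_triad_scores(
    query="What's the limit on overseas assistance?",
    context="Overseas assistance claims are capped at $50,000 per trip.",
    answer="The policy limits overseas assistance to $50,000 per trip.",
))
```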
3. Visualization and Monitoring (3. plotlychart.py)
Objective: Create interactive dashboards for monitoring agent performance
What you'll learn:
- Performance metrics visualization
- Real-time monitoring dashboards
- Trace analysis and debugging
- Historical performance trends
Generated outputs:
- graph.png: Performance metrics visualization
- rag_graph.png: RAG-specific evaluation charts
- Interactive Plotly dashboards
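A minimal Plotly sketch of the kind of chart 3. plotlychart.py produces; the metric values here are made up, and exporting to PNG requires the kaleido package:

```python
import plotly.graph_objects as go

# Hypothetical per-metric scores for a single evaluation run
metrics = {"context_relevance": 0.85, "groundedness": 0.92, "answer_relevance": 0.88}

fig = go.Figure(go.Bar(x=list(metrics.keys()), y=list(metrics.values())))
fig.update_layout(title="RAG Triad scores", yaxis=dict(range=[0, 1], title="score"))
fig.write_image("rag_graph.png")  # static chart (requires kaleido)
fig.show()                        # interactive view in the browser/notebook
```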
Key Governance Concepts
Evaluation Metrics
Response Quality
- Accuracy: Factual correctness of responses
- Relevance: How well the response addresses the query
- Completeness: Whether the response fully answers the question
- Clarity: How understandable the response is
System Performance
- Response Time: How quickly the agent responds
- Throughput: How many queries can be handled per minute
- Availability: System uptime and reliability
- Cost: Resource consumption per query
Safety and Ethics
- Bias Detection: Identifying unfair or discriminatory responses
- Toxicity Screening: Filtering harmful content
- Privacy Compliance: Protecting user data
- Hallucination Detection: Identifying false or fabricated information
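As a toy illustration of toxicity and privacy screening, the sketch below uses a keyword blocklist and regex patterns; production systems would rely on dedicated classifiers rather than hand-written rules like these:

```python
import re

BLOCKLIST = {"idiot", "stupid"}               # stand-in for a toxicity classifier
PII_PATTERNS = [r"\b\d{3}-\d{2}-\d{4}\b",     # SSN-like pattern
                r"\b\d{16}\b"]                # bare card-number-like pattern

def safety_screen(response: str) -> dict:
    words = set(re.findall(r"[a-z']+", response.lower()))
    return {
        "toxicity_flag": bool(words & BLOCKLIST),
        "pii_flag": any(re.search(p, response) for p in PII_PATTERNS),
    }

print(safety_screen("Your card number 4111111111111111 is on file."))
```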
Monitoring and Tracing
Trace Data Structure
{
"trace_id": "uuid",
"timestamp": "2025-01-08T10:30:00Z",
"query": "User question",
"response": "Agent response",
"context": ["Retrieved documents"],
"metrics": {
"context_relevance": 0.85,
"groundedness": 0.92,
"answer_relevance": 0.88,
"response_time_ms": 1500
}
}
Trace Storage
Traces are stored in:
- trace.json: Individual trace files
- wxgov_traces/: Directory with experiment traces
- ChromaDB: Vector database for similarity search
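A minimal sketch of writing a trace file and indexing it in ChromaDB for similarity search; the storage path and collection name are assumptions:

```python
import json
import uuid
import chromadb

trace = {
    "trace_id": str(uuid.uuid4()),
    "timestamp": "2025-01-08T10:30:00Z",
    "query": "What's the limit on overseas assistance?",
    "response": "The policy caps overseas assistance at $50,000 per trip.",
    "metrics": {"context_relevance": 0.85, "groundedness": 0.92,
                "answer_relevance": 0.88, "response_time_ms": 1500},
}

with open("trace.json", "w") as f:                     # individual trace file
    json.dump(trace, f, indent=2)

client = chromadb.PersistentClient(path="trace_db")    # hypothetical storage path
collection = client.get_or_create_collection("traces")
collection.add(
    ids=[trace["trace_id"]],
    documents=[trace["query"] + "\n" + trace["response"]],
    metadatas=[trace["metrics"]],
)

# Retrieve traces similar to a new query
print(collection.query(query_texts=["overseas assistance limits"], n_results=3))
```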
Governance Framework
1. Pre-deployment Evaluation
- Comprehensive testing across diverse scenarios
- Bias and fairness assessment
- Performance benchmarking
- Safety validation
2. Production Monitoring
- Real-time performance tracking
- Anomaly detection
- User feedback collection
- Continuous evaluation
3. Post-deployment Analysis
- Performance trend analysis
- Issue investigation and resolution
- Model improvement recommendations
- Compliance reporting
Governance Best Practices
1. Evaluation Strategy
- Multi-dimensional evaluation: Don't rely on a single metric
- Human-in-the-loop: Combine automated and human evaluation
- Continuous testing: Regular evaluation, not just pre-deployment
- Diverse test cases: Cover edge cases and various scenarios
2. Monitoring Implementation
- Real-time alerts: Set up alerts for performance degradation
- Baseline establishment: Define normal performance ranges
- Trend analysis: Look for gradual performance changes
- User feedback integration: Collect and analyze user satisfaction
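A toy baseline-and-alert check over trace metrics like those shown earlier; the three-sigma threshold, sample history, and alerting via print are all placeholders:

```python
import statistics

def build_baseline(history: list[dict]) -> dict:
    """Mean and standard deviation per metric over historical traces."""
    baseline = {}
    for key in history[0]["metrics"]:
        values = [t["metrics"][key] for t in history]
        baseline[key] = (statistics.mean(values), statistics.pstdev(values))
    return baseline

def check_trace(trace: dict, baseline: dict, n_sigma: float = 3.0) -> list[str]:
    """Return alert messages for metrics outside the normal range."""
    alerts = []
    for key, (mean, stdev) in baseline.items():
        value = trace["metrics"][key]
        if abs(value - mean) > n_sigma * max(stdev, 1e-9):
            alerts.append(f"{key}={value} deviates from baseline {mean:.2f} (+/- {stdev:.2f})")
    return alerts

history = [{"metrics": {"groundedness": 0.90 + 0.01 * i, "response_time_ms": 1400 + 20 * i}}
           for i in range(10)]
incoming = {"metrics": {"groundedness": 0.55, "response_time_ms": 4800}}
for alert in check_trace(incoming, build_baseline(history)):
    print("ALERT:", alert)
```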
3. Compliance and Ethics
- Documentation: Maintain detailed records of evaluations
- Audit trails: Track all changes and decisions
- Bias testing: Regularly test for unfair outcomes
- Privacy protection: Ensure user data is handled appropriately
4. Incident Response
- Escalation procedures: Clear steps when issues are detected
- Rollback capabilities: Ability to revert to previous versions
- Root cause analysis: Investigate and learn from failures
- Communication plans: Keep stakeholders informed during incidents
Production Deployment Checklist
- Evaluation Framework: Comprehensive testing suite established
- Monitoring Setup: Real-time performance tracking active
- Alert Systems: Automated notifications for issues
- Governance Policies: Clear guidelines and procedures
- Compliance Check: Regulatory requirements met
- Security Validation: Security testing completed
- Performance Baseline: Expected performance ranges defined
- Rollback Plan: Procedure for reverting changes
- Documentation: Complete system documentation
- Training: Team trained on monitoring and response
Course Completion
Congratulations! You've completed all four chapters of the Watsonx Agentic AI course:
- Chapter 1: No-code agents with Orchestrate
- Chapter 2: Low-code visual development with Langflow
- Chapter 3: Pro-code development with LangGraph
- Chapter 4: AI governance and evaluation
Next Steps
- Apply these concepts to your own projects
- Join the AI agent development community
- Keep up with the latest developments in agentic AI
- Share your experiences and learnings
Related Files
Governance implementation files:
4. Governance/
├── 1. syntactic.py       # Basic evaluation
├── 2. ragtriad.py        # RAG evaluation
├── 3. plotlychart.py     # Visualization
├── graph.png             # Performance charts
├── rag_graph.png         # RAG evaluation charts
├── trace.json            # Sample trace data
├── README.md             # Setup instructions
├── credit_card/          # Sample vector database
└── wxgov_traces/         # Experiment traces
    ├── experiment_traces_*.json
    └── experiment_traces_*.log
Additional Resources
- Responsible AI Guidelines - IBM's responsible AI principles
- AI Governance Framework - NIST AI Risk Management Framework