Chapter 4: AI Governance and Evaluation

Welcome to the final chapter! Here we explore AI governance, evaluation, and monitoring: essential skills for deploying AI agents responsibly in production environments.

Learning Objectives

By the end of this chapter, you'll be able to:

  • Implement syntactic and semantic evaluation for AI agents
  • Set up monitoring and tracing for agent performance
  • Build RAG evaluation systems using the RAG Triad methodology
  • Create governance frameworks for AI agent deployment
  • Visualize agent performance metrics and traces
  • Establish responsible AI practices for production systems

Why AI Governance Matters

AI governance is critical for:

  • Trust and reliability in AI systems
  • Regulatory compliance with AI regulations
  • Risk mitigation for business operations
  • Performance optimization through continuous monitoring
  • Ethical AI deployment with bias detection and mitigation
  • Transparency and explainability for stakeholders

Evaluation Framework

1. Syntactic Evaluation (1. syntactic.py)

Objective: Evaluate basic agent functionality and response consistency

What you'll learn:

  • Basic response evaluation metrics
  • Consistency testing across similar queries
  • Response format validation

Test prompts for evaluation:

  1. "What's Google's stock price?"
  2. "What's IBM's stock price?"

Key concepts:

  • Response consistency: Similar queries should produce similar response structures
  • Format validation: Ensure responses meet expected formats
  • Basic functionality: Verify core agent capabilities work reliably
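
The exact checks live in 1. syntactic.py; as a rough illustration only, a minimal consistency and format check could look like the sketch below. The ask_agent function is a hypothetical stand-in for whatever agent client you are evaluating, and the price pattern is just one example of a format rule.

import re

def ask_agent(prompt: str) -> str:
    # Hypothetical stand-in: replace with a call to your deployed agent.
    return "Example response: the stock is trading at $123.45."

# Format validation: the response should mention a dollar amount.
PRICE_PATTERN = re.compile(r"\$\d+(?:\.\d{2})?")

def format_ok(response: str) -> bool:
    return bool(PRICE_PATTERN.search(response))

def consistency_report(prompts):
    # Consistency testing: similar queries should produce similarly structured answers.
    responses = {p: ask_agent(p) for p in prompts}
    return {
        "all_formats_valid": all(format_ok(r) for r in responses.values()),
        "responses": responses,
    }

if __name__ == "__main__":
    report = consistency_report([
        "What's Google's stock price?",
        "What's IBM's stock price?",
    ])
    print(report["all_formats_valid"])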

2. RAG Triad Evaluation (2. ragtriad.py)

Objective: Comprehensive evaluation of RAG (Retrieval Augmented Generation) systems

What you'll learn:

  • Context Relevance: How relevant is the retrieved context?
  • Groundedness: How well does the answer stay grounded in the context?
  • Answer Relevance: How relevant is the answer to the original question?

The RAG Triad Framework:

┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│ Context         │    │ Groundedness    │    │ Answer          │
│ Relevance       │◄──►│ Evaluation      │◄──►│ Relevance       │
│                 │    │                 │    │                 │
└─────────────────┘    └─────────────────┘    └─────────────────┘
      ▲                         ▲                         ▲
      │                         │                         │
   Query ───────────────────────────────────────────────────────► Response

Test prompts for evaluation:

  1. "What's the limit on overseas assistance?"
  2. "What is the definition of epidemic?"

Evaluation metrics:

  • Context Relevance Score: 0.0 - 1.0 (higher is better)
  • Groundedness Score: 0.0 - 1.0 (higher is better)
  • Answer Relevance Score: 0.0 - 1.0 (higher is better)
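
The course file 2. ragtriad.py contains the actual evaluation; as a lightweight proxy for the same three scores, the sketch below approximates each one with embedding cosine similarity (clamped to 0.0 - 1.0). It assumes the sentence-transformers package and a placeholder query, context, and answer, so treat the numbers as illustrative rather than equivalent to the course's implementation.

from sentence_transformers import SentenceTransformer, util  # pip install sentence-transformers

model = SentenceTransformer("all-MiniLM-L6-v2")  # any small sentence-embedding model works

def similarity(a: str, b: str) -> float:
    # Cosine similarity between two texts, clamped to the 0.0 - 1.0 range used above.
    emb_a, emb_b = model.encode([a, b], convert_to_tensor=True)
    return max(0.0, float(util.cos_sim(emb_a, emb_b).item()))

def rag_triad(query: str, context: str, answer: str) -> dict:
    return {
        "context_relevance": similarity(query, context),   # is the retrieved context on topic?
        "groundedness": similarity(context, answer),        # does the answer stay close to the context?
        "answer_relevance": similarity(query, answer),      # does the answer address the question?
    }

print(rag_triad(
    query="What's the limit on overseas assistance?",
    context="<retrieved policy passage about overseas assistance limits>",
    answer="<agent's answer about the overseas assistance limit>",
))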

3. Visualization and Monitoring (3. plotlychart.py)

Objective: Create interactive dashboards for monitoring agent performance

What you'll learn:

  • Performance metrics visualization
  • Real-time monitoring dashboards
  • Trace analysis and debugging
  • Historical performance trends

Generated outputs:

  • graph.png: Performance metrics visualization
  • rag_graph.png: RAG-specific evaluation charts
  • Interactive Plotly dashboards
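
3. plotlychart.py produces the charts listed above; a stripped-down sketch of the same idea, assuming a trace.json file with the structure shown under Monitoring and Tracing below, might look like this (static export requires the kaleido package):

import json
import plotly.graph_objects as go  # pip install plotly kaleido

# Load one trace in the format described in the Trace Data Structure section.
with open("trace.json") as f:
    trace = json.load(f)

names = ["context_relevance", "groundedness", "answer_relevance"]
scores = [trace["metrics"][n] for n in names]

fig = go.Figure(go.Bar(x=names, y=scores))
fig.update_layout(title="RAG Triad scores for one trace", yaxis_range=[0, 1])

fig.write_image("rag_graph.png")       # static chart, similar to the generated outputs above
fig.write_html("rag_dashboard.html")   # interactive dashboard view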

Key Governance Concepts

Evaluation Metrics

Response Quality

  • Accuracy: Factual correctness of responses
  • Relevance: How well the response addresses the query
  • Completeness: Whether the response fully answers the question
  • Clarity: How understandable the response is

System Performance

  • Response Time: How quickly the agent responds
  • Throughput: How many queries can be handled per minute
  • Availability: System uptime and reliability
  • Cost: Resource consumption per query
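
A simple way to capture the first two of these metrics is to time the agent call itself. The sketch below uses only the standard library and assumes a hypothetical ask_agent callable.

import time

def timed_call(ask_agent, prompt: str):
    # Response time: wall-clock latency of a single agent call, in milliseconds.
    start = time.perf_counter()
    response = ask_agent(prompt)
    elapsed_ms = (time.perf_counter() - start) * 1000
    return response, elapsed_ms

def estimate_throughput(latencies_ms):
    # Throughput: queries per minute implied by the average observed latency.
    avg_ms = sum(latencies_ms) / len(latencies_ms)
    return 60_000 / avg_ms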

Safety and Ethics

  • Bias Detection: Identifying unfair or discriminatory responses
  • Toxicity Screening: Filtering harmful content
  • Privacy Compliance: Protecting user data
  • Hallucination Detection: Identifying false or fabricated information

Monitoring and Tracing

Trace Data Structure

{
  "trace_id": "uuid",
  "timestamp": "2025-01-08T10:30:00Z",
  "query": "User question",
  "response": "Agent response",
  "context": ["Retrieved documents"],
  "metrics": {
    "context_relevance": 0.85,
    "groundedness": 0.92,
    "answer_relevance": 0.88,
    "response_time_ms": 1500
  }
}
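
A minimal helper for recording traces in this shape needs nothing beyond the Python standard library; the field names follow the sample above, while the output path and metric values are whatever your own evaluation produces.

import json
import uuid
from datetime import datetime, timezone

def record_trace(query, response, context, metrics, path="trace.json"):
    # Build a trace entry matching the structure above and write it to disk.
    trace = {
        "trace_id": str(uuid.uuid4()),
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "query": query,
        "response": response,
        "context": context,
        "metrics": metrics,
    }
    with open(path, "w") as f:
        json.dump(trace, f, indent=2)
    return trace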

Trace Storage

Traces are stored in:

  • trace.json: Individual trace files
  • wxgov_traces/: Directory with experiment traces
  • ChromaDB: Vector database for similarity search

Governance Framework

1. Pre-deployment Evaluation

  • Comprehensive testing across diverse scenarios
  • Bias and fairness assessment
  • Performance benchmarking
  • Safety validation

2. Production Monitoring

  • Real-time performance tracking
  • Anomaly detection
  • User feedback collection
  • Continuous evaluation

3. Post-deployment Analysis

  • Performance trend analysis
  • Issue investigation and resolution
  • Model improvement recommendations
  • Compliance reporting

Governance Best Practices

1. Evaluation Strategy

  • Multi-dimensional evaluation: Don't rely on a single metric
  • Human-in-the-loop: Combine automated and human evaluation
  • Continuous testing: Regular evaluation, not just pre-deployment
  • Diverse test cases: Cover edge cases and various scenarios

2. Monitoring Implementation

  • Real-time alerts: Set up alerts for performance degradation
  • Baseline establishment: Define normal performance ranges
  • Trend analysis: Look for gradual performance changes
  • User feedback integration: Collect and analyze user satisfaction
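
As an illustration of the first two points, a threshold-style check over recent groundedness scores could be as simple as the sketch below; the baseline and margin values are assumptions you would replace with figures from your own pre-deployment runs.

import statistics

BASELINE_GROUNDEDNESS = 0.85   # assumed baseline from pre-deployment evaluation
ALERT_MARGIN = 0.10            # tolerated drop before an alert fires

def degraded(recent_scores) -> bool:
    # Compare the recent average against the established baseline.
    return statistics.mean(recent_scores) < BASELINE_GROUNDEDNESS - ALERT_MARGIN

if degraded([0.71, 0.68, 0.74, 0.70]):
    print("ALERT: groundedness has drifted below baseline; inspect recent traces")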

3. Compliance and Ethics

  • Documentation: Maintain detailed records of evaluations
  • Audit trails: Track all changes and decisions
  • Bias testing: Regularly test for unfair outcomes
  • Privacy protection: Ensure user data is handled appropriately

4. Incident Response

  • Escalation procedures: Clear steps when issues are detected
  • Rollback capabilities: Ability to revert to previous versions
  • Root cause analysis: Investigate and learn from failures
  • Communication plans: Keep stakeholders informed during incidents

Production Deployment Checklist

  • Evaluation Framework: Comprehensive testing suite established
  • Monitoring Setup: Real-time performance tracking active
  • Alert Systems: Automated notifications for issues
  • Governance Policies: Clear guidelines and procedures
  • Compliance Check: Regulatory requirements met
  • Security Validation: Security testing completed
  • Performance Baseline: Expected performance ranges defined
  • Rollback Plan: Procedure for reverting changes
  • Documentation: Complete system documentation
  • Training: Team trained on monitoring and response

Course Completion

Congratulations! You've completed all four chapters of the Watsonx Agentic AI course:

  • Chapter 1: No-code agents with Orchestrate
  • Chapter 2: Low-code visual development with Langflow
  • Chapter 3: Pro-code development with LangGraph
  • Chapter 4: AI governance and evaluation

Next Steps

  • Apply these concepts to your own projects
  • Join the AI agent development community
  • Keep up with the latest developments in agentic AI
  • Share your experiences and learnings

Governance implementation files:

4. Governance/
├── 1. syntactic.py                    # Basic evaluation
├── 2. ragtriad.py                     # RAG evaluation
├── 3. plotlychart.py                  # Visualization
├── graph.png                          # Performance charts
├── rag_graph.png                      # RAG evaluation charts
├── trace.json                         # Sample trace data
├── README.md                          # Setup instructions
├── credit_card/                       # Sample vector database
└── wxgov_traces/                      # Experiment traces
    ├── experiment_traces_*.json
    └── experiment_traces_*.log

Additional Resources