Chapter 4: AI Governance and Evaluation

Welcome to the final chapter! Here we explore AI governance, evaluation, and monitoring: essential skills for deploying AI agents responsibly in production environments.

Learning Objectives

By the end of this chapter, you'll be able to:

  • Implement syntactic and semantic evaluation for AI agents
  • Set up monitoring and tracing for agent performance
  • Build RAG evaluation systems using the RAG Triad methodology
  • Create governance frameworks for AI agent deployment
  • Visualize agent performance metrics and traces
  • Establish responsible AI practices for production systems

Why AI Governance Matters

AI governance is critical for:

  • Trust and reliability in AI systems
  • Regulatory compliance with AI regulations
  • Risk mitigation for business operations
  • Performance optimization through continuous monitoring
  • Ethical AI deployment with bias detection and mitigation
  • Transparency and explainability for stakeholders

Evaluation Framework

1. Syntactic Evaluation (1. syntactic.py)

Objective: Evaluate basic agent functionality and response consistency

What you'll learn:

  • Basic response evaluation metrics
  • Consistency testing across similar queries
  • Response format validation

Test prompts for evaluation:

  1. "What's Google's stock price?"
  2. "What's IBM's stock price?"

Key concepts:

  • Response consistency: Similar queries should produce similar response structures
  • Format validation: Ensure responses meet expected formats
  • Basic functionality: Verify core agent capabilities work reliably
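
The exact checks live in 1. syntactic.py; as a rough illustration only, a minimal consistency and format check could look like the sketch below. The ask_agent function is a hypothetical stand-in for whatever agent client you are evaluating, and the price pattern is just one example of a format rule.

import re

def ask_agent(prompt: str) -> str:
    # Hypothetical stand-in: replace with a call to your deployed agent.
    return "Example response: the stock is trading at $123.45."

# Format validation: the response should mention a dollar amount.
PRICE_PATTERN = re.compile(r"\$\d+(?:\.\d{2})?")

def format_ok(response: str) -> bool:
    return bool(PRICE_PATTERN.search(response))

def consistency_report(prompts):
    # Consistency testing: similar queries should produce similarly structured answers.
    responses = {p: ask_agent(p) for p in prompts}
    return {
        "all_formats_valid": all(format_ok(r) for r in responses.values()),
        "responses": responses,
    }

if __name__ == "__main__":
    report = consistency_report([
        "What's Google's stock price?",
        "What's IBM's stock price?",
    ])
    print(report["all_formats_valid"])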

2. RAG Triad Evaluation (2. ragtriad.py)

Objective: Comprehensive evaluation of RAG (Retrieval Augmented Generation) systems

What you'll learn:

  • Context Relevance: How relevant is the retrieved context?
  • Groundedness: How well does the answer stay grounded in the context?
  • Answer Relevance: How relevant is the answer to the original question?

The RAG Triad Framework:

┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│ Context         │    │ Groundedness    │    │ Answer          │
│ Relevance       │◄──►│ Evaluation      │◄──►│ Relevance       │
│                 │    │                 │    │                 │
└─────────────────┘    └─────────────────┘    └─────────────────┘
      ▲                         ▲                         ▲
      │                         │                         │
   Query ───────────────────────────────────────────────────────► Response

Test prompts for evaluation:

  1. "What's the limit on overseas assistance?"
  2. "What is the definition of epidemic?"

Evaluation metrics:

  • Context Relevance Score: 0.0 - 1.0 (higher is better)
  • Groundedness Score: 0.0 - 1.0 (higher is better)
  • Answer Relevance Score: 0.0 - 1.0 (higher is better)
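
The course file 2. ragtriad.py contains the actual evaluation; as a lightweight proxy for the same three scores, the sketch below approximates each one with embedding cosine similarity (clamped to 0.0 - 1.0). It assumes the sentence-transformers package and a placeholder query, context, and answer, so treat the numbers as illustrative rather than equivalent to the course's implementation.

from sentence_transformers import SentenceTransformer, util  # pip install sentence-transformers

model = SentenceTransformer("all-MiniLM-L6-v2")  # any small sentence-embedding model works

def similarity(a: str, b: str) -> float:
    # Cosine similarity between two texts, clamped to the 0.0 - 1.0 range used above.
    emb_a, emb_b = model.encode([a, b], convert_to_tensor=True)
    return max(0.0, float(util.cos_sim(emb_a, emb_b).item()))

def rag_triad(query: str, context: str, answer: str) -> dict:
    return {
        "context_relevance": similarity(query, context),   # is the retrieved context on topic?
        "groundedness": similarity(context, answer),        # does the answer stay close to the context?
        "answer_relevance": similarity(query, answer),      # does the answer address the question?
    }

print(rag_triad(
    query="What's the limit on overseas assistance?",
    context="<retrieved policy passage about overseas assistance limits>",
    answer="<agent's answer about the overseas assistance limit>",
))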

3. Visualization and Monitoring (3. plotlychart.py)

Objective: Create interactive dashboards for monitoring agent performance

What you'll learn:

  • Performance metrics visualization
  • Real-time monitoring dashboards
  • Trace analysis and debugging
  • Historical performance trends

Generated outputs:

  • graph.png: Performance metrics visualization
  • rag_graph.png: RAG-specific evaluation charts
  • Interactive Plotly dashboards
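
3. plotlychart.py produces the charts listed above; a stripped-down sketch of the same idea, assuming a trace.json file with the structure shown under Monitoring and Tracing below, might look like this (static export requires the kaleido package):

import json
import plotly.graph_objects as go  # pip install plotly kaleido

# Load one trace in the format described in the Trace Data Structure section.
with open("trace.json") as f:
    trace = json.load(f)

names = ["context_relevance", "groundedness", "answer_relevance"]
scores = [trace["metrics"][n] for n in names]

fig = go.Figure(go.Bar(x=names, y=scores))
fig.update_layout(title="RAG Triad scores for one trace", yaxis_range=[0, 1])

fig.write_image("rag_graph.png")       # static chart, similar to the generated outputs above
fig.write_html("rag_dashboard.html")   # interactive dashboard view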

Key Governance Concepts

Evaluation Metrics

Response Quality

  • Accuracy: Factual correctness of responses
  • Relevance: How well the response addresses the query
  • Completeness: Whether the response fully answers the question
  • Clarity: How understandable the response is

System Performance

  • Response Time: How quickly the agent responds
  • Throughput: How many queries can be handled per minute
  • Availability: System uptime and reliability
  • Cost: Resource consumption per query
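
A simple way to capture the first two of these metrics is to time the agent call itself. The sketch below uses only the standard library and assumes a hypothetical ask_agent callable.

import time

def timed_call(ask_agent, prompt: str):
    # Response time: wall-clock latency of a single agent call, in milliseconds.
    start = time.perf_counter()
    response = ask_agent(prompt)
    elapsed_ms = (time.perf_counter() - start) * 1000
    return response, elapsed_ms

def estimate_throughput(latencies_ms):
    # Throughput: queries per minute implied by the average observed latency.
    avg_ms = sum(latencies_ms) / len(latencies_ms)
    return 60_000 / avg_ms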

Safety and Ethics

  • Bias Detection: Identifying unfair or discriminatory responses
  • Toxicity Screening: Filtering harmful content
  • Privacy Compliance: Protecting user data
  • Hallucination Detection: Identifying false or fabricated information

Monitoring and Tracing

Trace Data Structure

{
  "trace_id": "uuid",
  "timestamp": "2025-01-08T10:30:00Z",
  "query": "User question",
  "response": "Agent response",
  "context": ["Retrieved documents"],
  "metrics": {
    "context_relevance": 0.85,
    "groundedness": 0.92,
    "answer_relevance": 0.88,
    "response_time_ms": 1500
  }
}
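
A minimal helper for recording traces in this shape needs nothing beyond the Python standard library; the field names follow the sample above, while the output path and metric values are whatever your own evaluation produces.

import json
import uuid
from datetime import datetime, timezone

def record_trace(query, response, context, metrics, path="trace.json"):
    # Build a trace entry matching the structure above and write it to disk.
    trace = {
        "trace_id": str(uuid.uuid4()),
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "query": query,
        "response": response,
        "context": context,
        "metrics": metrics,
    }
    with open(path, "w") as f:
        json.dump(trace, f, indent=2)
    return trace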

Trace Storage

Traces are stored in:

  • trace.json: Individual trace files
  • wxgov_traces/: Directory with experiment traces
  • ChromaDB: Vector database for similarity search

Governance Framework

1. Pre-deployment Evaluation

  • Comprehensive testing across diverse scenarios
  • Bias and fairness assessment
  • Performance benchmarking
  • Safety validation

2. Production Monitoring

  • Real-time performance tracking
  • Anomaly detection
  • User feedback collection
  • Continuous evaluation

3. Post-deployment Analysis

  • Performance trend analysis
  • Issue investigation and resolution
  • Model improvement recommendations
  • Compliance reporting

Governance Best Practices

1. Evaluation Strategy

  • Multi-dimensional evaluation: Don't rely on a single metric
  • Human-in-the-loop: Combine automated and human evaluation
  • Continuous testing: Regular evaluation, not just pre-deployment
  • Diverse test cases: Cover edge cases and various scenarios

2. Monitoring Implementation

  • Real-time alerts: Set up alerts for performance degradation
  • Baseline establishment: Define normal performance ranges
  • Trend analysis: Look for gradual performance changes
  • User feedback integration: Collect and analyze user satisfaction
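
As an illustration of the first two points, a threshold-style check over recent groundedness scores could be as simple as the sketch below; the baseline and margin values are assumptions you would replace with figures from your own pre-deployment runs.

import statistics

BASELINE_GROUNDEDNESS = 0.85   # assumed baseline from pre-deployment evaluation
ALERT_MARGIN = 0.10            # tolerated drop before an alert fires

def degraded(recent_scores) -> bool:
    # Compare the recent average against the established baseline.
    return statistics.mean(recent_scores) < BASELINE_GROUNDEDNESS - ALERT_MARGIN

if degraded([0.71, 0.68, 0.74, 0.70]):
    print("ALERT: groundedness has drifted below baseline; inspect recent traces")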

3. Compliance and Ethics

  • Documentation: Maintain detailed records of evaluations
  • Audit trails: Track all changes and decisions
  • Bias testing: Regularly test for unfair outcomes
  • Privacy protection: Ensure user data is handled appropriately

4. Incident Response

  • Escalation procedures: Clear steps when issues are detected
  • Rollback capabilities: Ability to revert to previous versions
  • Root cause analysis: Investigate and learn from failures
  • Communication plans: Keep stakeholders informed during incidents

Production Deployment Checklist

  • Evaluation Framework: Comprehensive testing suite established
  • Monitoring Setup: Real-time performance tracking active
  • Alert Systems: Automated notifications for issues
  • Governance Policies: Clear guidelines and procedures
  • Compliance Check: Regulatory requirements met
  • Security Validation: Security testing completed
  • Performance Baseline: Expected performance ranges defined
  • Rollback Plan: Procedure for reverting changes
  • Documentation: Complete system documentation
  • Training: Team trained on monitoring and response

Course Completion

Congratulations! You've completed all four chapters of the Watsonx Agentic AI course:

  • Chapter 1: No-code agents with Orchestrate
  • Chapter 2: Low-code visual development with Langflow
  • Chapter 3: Pro-code development with LangGraph
  • Chapter 4: AI governance and evaluation

Next Steps

  • Apply these concepts to your own projects
  • Join the AI agent development community
  • Keep up with the latest developments in agentic AI
  • Share your experiences and learnings

Governance implementation files:

4. Governance/
├── 1. syntactic.py                    # Basic evaluation
├── 2. ragtriad.py                     # RAG evaluation
├── 3. plotlychart.py                  # Visualization
├── graph.png                          # Performance charts
├── rag_graph.png                      # RAG evaluation charts
├── trace.json                         # Sample trace data
├── README.md                          # Setup instructions
├── credit_card/                       # Sample vector database
└── wxgov_traces/                      # Experiment traces
    ├── experiment_traces_*.json
    └── experiment_traces_*.log

Additional Resources