IaC Testing Standards
This document defines the organization-wide testing philosophy and standards for Infrastructure as Code (IaC). These principles apply across all IaC tools including Terraform, Terragrunt, Ansible, Kubernetes manifests, and other infrastructure automation technologies.
1. Testing Philosophy¶
Why Test Infrastructure Code¶
Infrastructure code controls production systems, data, and availability. Bugs in infrastructure code can cause:
- Production outages: Misconfigured networking, load balancers, or DNS
- Security vulnerabilities: Exposed ports, weak encryption, misconfigured IAM
- Data loss: Incorrect database configurations, backup failures
- Cost overruns: Improperly scaled resources, orphaned infrastructure
- Compliance violations: Missing audit logs, inadequate access controls
Testing infrastructure code is not optional—it's a fundamental requirement for production readiness.
Shift-Left Testing¶
Catch issues as early as possible in the development cycle:
Developer's Commit CI Integration Production
Machine Hook Pipeline Testing Deployment
| | | | |
▼ ▼ ▼ ▼ ▼
[Static] [Lint] [Unit Tests] [Integration] [Smoke Tests]
$0 cost $1 $10 $100 $1000+
Cost of finding bugs increases exponentially as code moves right →
Key Principles:
- Fast Feedback Loops: Developers get results in seconds/minutes, not hours
- Automated Validation: No manual approval gates before basic checks
- Local Testing: Tests run on developer machines before commit
- Pre-Commit Hooks: Block invalid code from entering version control
- CI Pipeline Gating: Automated quality gates at every stage
The Cost of Production Bugs¶
Real-world impact of infrastructure bugs:
| Severity | Example | Impact | Cost |
|---|---|---|---|
| Critical | Security group exposed to 0.0.0.0/0 | Data breach | $100K - $10M+ |
| High | Incorrect database configuration | Data loss | $50K - $500K |
| Medium | Misconfigured load balancer | Service degradation | $10K - $50K |
| Low | Suboptimal resource sizing | Cost inefficiency | $1K - $10K |
Testing prevents these issues before they reach production.
2. Testing Pyramid for IaC¶
IaC testing follows a modified testing pyramid optimized for infrastructure validation:
╱╲ Smoke Tests (Production)
╱ ╲ < 5% of test effort
╱────╲
╱ Comp ╲ Compliance & Security
╱ liance╲ 10-15% of test effort
╱──────────╲
╱Integration╲ Integration Tests
╱ Tests ╲ 20-30% of test effort
╱──────────────╲
╱ Unit Tests ╲ Module/Role Tests
╱ ╲ 50-60% of test effort
────────────────────
Static Analysis Lint, Format, Security Scans
(Pre-requisite) Run on every commit
Static Analysis (Tier 0)¶
Purpose: Catch syntax errors, style violations, and obvious security issues
Tools:
- Terraform:
terraform fmt,terraform validate,tflint,tfsec,checkov - Ansible:
ansible-lint,yamllint - Kubernetes:
kubeval,kube-linter
Execution: < 30 seconds, runs on every commit
Example Coverage:
- Syntax validation
- Formatting consistency
- Deprecated API usage
- Common security misconfigurations
- Secret detection
Unit Tests (Tier 1)¶
Purpose: Verify individual modules/roles work as specified
Characteristics:
- Fast (< 10 minutes)
- Isolated (no external dependencies)
- Deterministic (same input = same output)
- Test module contracts and guarantees
Tools:
- Terraform: Terratest (Go), Native Terraform Tests (.tftest.hcl)
- Ansible: Molecule with Docker driver
- Kubernetes: Unit (Go testing framework)
What to Test:
- Resource Creation: Expected resources are created
- Input Validation: Invalid inputs are rejected
- Output Correctness: Outputs match expected values
- Idempotency: Multiple runs produce identical results
- Conditional Logic: All code paths are exercised
- Error Handling: Graceful handling of failures
Example: Testing a VPC Terraform module
✓ Creates VPC with correct CIDR
✓ Creates 2 public subnets across AZs
✓ Creates 2 private subnets across AZs
✓ Creates Internet Gateway
✓ Creates NAT Gateways (one per AZ)
✓ Configures route tables correctly
✓ Rejects invalid CIDR blocks
✓ Validates AZ count >= 2
✓ Idempotent on re-apply
Integration Tests (Tier 2)¶
Purpose: Verify multiple components work together
Characteristics:
- Slower (< 60 minutes)
- Uses real infrastructure (test environments)
- Tests cross-module interactions
- Validates end-to-end workflows
What to Test:
- Multi-Module Deployments: VPC + EKS + RDS together
- Network Connectivity: Subnets can reach each other
- Service Integration: Application can connect to database
- DNS Resolution: Service discovery works correctly
- Load Balancer Routing: Traffic flows to correct targets
Example: Testing a three-tier application stack
✓ VPC created successfully
✓ RDS instance accessible from private subnet
✓ EKS cluster deployed and healthy
✓ Application pods can connect to database
✓ Load balancer routes traffic to pods
✓ DNS resolves to load balancer
✓ HTTPS certificate validates correctly
✓ Health checks pass for all services
Compliance Tests (Tier 3)¶
Purpose: Verify infrastructure meets security and regulatory requirements
Tools:
- InSpec: Compliance validation framework
- Chef InSpec: Security and compliance testing
- Open Policy Agent (OPA): Policy enforcement
- Cloud Custodian: Cloud governance
What to Test:
- Security Baselines: CIS benchmarks, STIG compliance
- Access Controls: Proper IAM permissions, least privilege
- Encryption: Data encrypted at rest and in transit
- Audit Logging: CloudTrail, audit logs enabled
- Network Security: No public access to sensitive resources
- Compliance Standards: SOC2, PCI-DSS, HIPAA requirements
Example: CIS AWS Foundations Benchmark
✓ IAM password policy configured
✓ MFA enabled on root account
✓ CloudTrail enabled in all regions
✓ S3 buckets have encryption enabled
✓ VPC Flow Logs enabled
✓ Security groups don't allow 0.0.0.0/0 on sensitive ports
✓ EBS volumes encrypted
✓ RDS instances have backup enabled
Smoke Tests (Tier 4)¶
Purpose: Verify deployed infrastructure is functioning
Characteristics:
- Runs in production (or production-like environment)
- Tests actual deployed resources
- Validates end-to-end functionality
- Runs post-deployment
What to Test:
- Service Health: All services respond to health checks
- Connectivity: External services can reach endpoints
- Authentication: Auth flows work correctly
- Critical Paths: Key user journeys function
- Performance: Response times within SLAs
Example: Post-deployment smoke tests
✓ HTTPS endpoint responds (200 OK)
✓ Health check endpoint healthy
✓ Database connection pool active
✓ Cache service responding
✓ Message queue accepting messages
✓ Monitoring and alerting active
✓ Backup jobs scheduled
3. Contract-Based Development¶
What is a Contract?¶
A contract is an explicit, testable promise about what infrastructure code will create and how it will behave.
Contracts define:
- Purpose: What problem this module/role solves
- Inputs: Required and optional parameters
- Outputs: Values provided for use by other modules
- Resources: What infrastructure will be created
- Behavior: Guarantees about how infrastructure will function
- Compatibility: Which platforms, versions are supported
CONTRACT.md Purpose¶
Every reusable module/role must include a CONTRACT.md file that explicitly states its guarantees.
Benefits:
- Testability: Contracts are directly testable
- Documentation: Self-documenting modules
- Versioning: Clear breaking change policies
- Trust: Consumers know exactly what to expect
- Quality: Forces thoughtful module design
Contract Structure¶
# Module Contract: [Name]
## Purpose
[One paragraph describing what this module does]
## Guarantees
### Resources Created
- [List of infrastructure resources that will be created]
### Behavior Guarantees
1. [Specific, testable promise about behavior]
2. [Another guarantee]
### Input Requirements
[Document all inputs with validation rules]
### Output Guarantees
[Document all outputs and their format]
### Platform Support Matrix
[Which platforms/versions are supported]
## Breaking Changes Policy
[Semantic versioning rules and deprecation timeline]
## Testing Requirements
- [List of tests that must pass]
- [Coverage requirements]
See: CONTRACT.md Template (issue #169) for complete example
Writing Testable Contracts¶
BAD (vague, untestable):
"This module creates networking resources"
GOOD (specific, testable):
"This module creates:
- Exactly 1 VPC with DNS hostnames enabled
- N public subnets (min 2, configurable)
- N private subnets (min 2, configurable)
- Subnets distributed across at least 2 availability zones
- 1 Internet Gateway attached to public subnets
- 1 NAT Gateway per availability zone for private subnets"
Every statement in the "GOOD" example can be verified with automated tests.
Versioning and Compatibility¶
Use Semantic Versioning for infrastructure modules:
- MAJOR (v1.0.0 → v2.0.0): Breaking changes to interface, resources, or behavior
- MINOR (v1.0.0 → v1.1.0): New features, backward-compatible changes
- PATCH (v1.0.0 → v1.0.1): Bug fixes, no functional changes
Breaking Change Examples:
- Renaming input variables
- Removing output values
- Changing resource names (causes recreation)
- Removing resources
- Changing default values that affect behavior
Breaking Change Policy:
- Announce in CHANGELOG.md at least 2 minor versions in advance
- Mark deprecated features with warnings
- Provide migration guides with examples
- Maintain deprecated features for minimum 2 minor versions
4. Coverage Standards¶
What to Measure¶
Coverage is not just about lines of code—it's about guarantees tested.
Coverage Dimensions:
- Guarantee Coverage: % of contract promises verified by tests
- Resource Coverage: % of resource types exercised in tests
- Input Coverage: % of input variables tested (including edge cases)
- Output Coverage: % of outputs validated
- Conditional Coverage: % of conditional logic paths tested
- Platform Coverage: % of supported platforms tested
Minimum Coverage Thresholds¶
| Coverage Type | Minimum | Target | Critical Modules |
|---|---|---|---|
| Guarantee Coverage | 100% | 100% | 100% |
| Resource Coverage | 80% | 90% | 100% |
| Input Coverage | 70% | 85% | 95% |
| Output Coverage | 100% | 100% | 100% |
| Conditional Coverage | 80% | 90% | 95% |
| Platform Coverage | 2+ platforms | All supported | All supported |
Critical Modules include:
- Security-related modules (IAM, networking, encryption)
- Data storage modules (databases, object storage)
- Publicly published modules
- Modules used across multiple teams/projects
Risk-Based Coverage Decisions¶
Not all code requires equal coverage. Use risk assessment:
High Risk (requires maximum coverage):
- Production infrastructure
- Security configurations
- Data storage and backups
- Network access controls
- Compliance-related resources
Medium Risk (requires standard coverage):
- Development/staging environments
- Non-critical applications
- Internal tools
- Monitoring and logging
Low Risk (can have reduced coverage):
- Temporary test environments
- Proof-of-concept code
- Development utilities
- Documentation-only changes
Platform/OS Coverage Requirements¶
Minimum Platform Coverage: Test on at least 2 different platforms/OS distributions
Platform Selection Guidelines:
- Primary Platform: Most commonly used in production
- Secondary Platform: Second most common or most different architecture
- Edge Case Platform: If claiming support, must test
Examples:
- Ansible Roles: Ubuntu 22.04 (primary) + RHEL 9 (secondary) + Windows (if supported)
- Terraform Modules: AWS (primary) + Azure (if multi-cloud) + GCP (if supported)
- Kubernetes: EKS (primary) + GKE (secondary) + on-prem (if supported)
5. Test Organization¶
Directory Structure Standards¶
Organize tests in a consistent, discoverable structure:
Terraform Modules¶
modules/
└── vpc/
├── main.tf
├── variables.tf
├── outputs.tf
├── CONTRACT.md
├── README.md
├── examples/
│ └── complete/
│ └── main.tf
└── tests/
├── unit/
│ └── vpc_test.go
├── integration/
│ └── full_stack_test.go
├── compliance/
│ └── security_baseline.rb
└── fixtures/
└── test_vpc.tfvars
Ansible Roles¶
roles/
└── webserver/
├── tasks/
│ └── main.yml
├── defaults/
│ └── main.yml
├── meta/
│ └── main.yml
├── CONTRACT.md
├── README.md
└── molecule/
├── default/
│ ├── molecule.yml
│ ├── converge.yml
│ └── verify.yml
├── compliance/
│ ├── molecule.yml
│ └── tests/
│ └── test_security.rb
└── multi-platform/
└── molecule.yml
Naming Conventions¶
Test Files:
- Unit tests:
*_test.go,test_*.py,*_spec.rb - Integration tests:
*_integration_test.go,test_*_integration.py - Compliance tests:
*_compliance.rb,security_baseline.rb
Test Functions/Methods:
- Use descriptive names:
TestVPCCreatesCorrectSubnets(notTestVPC) - Follow pattern:
Test[Module][What]ortest_[module]_[what] - Be specific:
TestWebserverInstallsNginxPackage(notTestInstall)
Test Data Management¶
Principles:
- No Secrets in Test Data: Use fake credentials, placeholder values
- Realistic But Safe: Test data resembles production but is clearly fake
- Version Controlled: Test fixtures in git, not environment variables
- Isolated: Each test uses independent test data
- Repeatable: Same test data produces same results
Test Data Locations:
- Fixtures:
tests/fixtures/*.tfvars,molecule/default/vars.yml - Mock Responses:
tests/mocks/*.json - Test Certificates:
tests/fixtures/certs/*.pem(self-signed)
Example Test Fixture:
# tests/fixtures/test_config.yml
vpc_cidr: "10.0.0.0/16"
environment: "test"
project: "test-project"
# DO NOT use real values
aws_account_id: "123456789012" # Fake account
Fixture and Mock Usage¶
When to Use Fixtures:
- Consistent test data across multiple tests
- Complex configuration structures
- Known-good examples for regression testing
When to Use Mocks:
- External API calls (AWS API, cloud providers)
- Expensive operations (avoid real resource creation in unit tests)
- Non-deterministic responses (random values, timestamps)
Example Mock:
# Mock AWS API responses in unit tests
@mock.patch('boto3.client')
def test_s3_bucket_creation(mock_boto):
mock_s3 = mock_boto.return_value
mock_s3.create_bucket.return_value = {'Location': '/test-bucket'}
# Test code that creates S3 bucket
result = create_bucket('test-bucket')
assert result['Location'] == '/test-bucket'
mock_s3.create_bucket.assert_called_once()
6. Quality Gates¶
Tiered Enforcement Approach¶
Introduce quality gates progressively to avoid disrupting development:
Phase 1 Phase 2 Phase 3
Warning Only Advisory Strict Enforcement
(Weeks 1-2) (Weeks 3-4) (Week 5+)
┌─────────┐ ┌─────────┐ ┌─────────┐
│ Lint │ │ Lint │ │ Lint │
│ Fails │──▶ │ Fails │──▶ │ Fails │──▶ ❌ Block Merge
│ │ │ │ │ │
│ ⚠️ Warn │ │ 🔧 Fix │ │ 🚫 Block│
│ Continue│ │ + Comment│ │ Merge │
└─────────┘ └─────────┘ └─────────┘
Allow Merge Allow Merge Block Merge
+ Warning + MR Comment Hard Requirement
Phase 1: Warning Only (Weeks 1-2)¶
Goal: Build awareness without blocking work
Behavior:
- Tests run on every PR
- Failures logged but don't block merge
- Metrics collected on failure rates
- Team sees quality status but not forced to fix
Implementation:
# GitLab CI
lint:terraform:
script: terraform fmt -check || echo "⚠️ Formatting issues detected"
allow_failure: true
# GitHub Actions
- name: Lint Terraform
run: terraform fmt -check
continue-on-error: true
Success Criteria: Failure rate < 20% before moving to Phase 2
Phase 2: Advisory (Weeks 3-4)¶
Goal: Provide automated fixes and guidance
Behavior:
- Tests run and report failures
- Automated fix suggestions posted to MR/PR
- Failures still don't block merge (yet)
- Dashboard shows quality trends
Implementation:
- Post MR/PR comments with fix instructions
- Provide automated fix commands
- Link to documentation and examples
- Show quality trend (improving/degrading)
Example MR Comment:
## 🔧 Terraform Formatting Issues
Formatting issues detected in 3 files. Run this command to fix:
```bash
terraform fmt -recursive
```
**Files affected**:
- modules/vpc/main.tf
- modules/rds/variables.tf
**Documentation**: [Terraform Style Guide](link)
Success Criteria: Failure rate < 10% before moving to Phase 3
Phase 3: Strict Enforcement (Week 5+)¶
Goal: Enforce quality standards
Behavior:
- Tests run on every PR
- Failures block merge
- Exceptions require explicit approval
- Always enforced on main/production branches
Implementation:
# GitLab CI
lint:terraform:
script: terraform fmt -check
allow_failure: false
rules:
- if: $ENFORCEMENT_PHASE == "strict"
- if: $CI_COMMIT_BRANCH == $CI_DEFAULT_BRANCH # Always strict on main
# GitHub Actions
- name: Lint Terraform
run: terraform fmt -check
# No continue-on-error = blocks on failure
Success Criteria: < 5% failure rate, minimal exceptions needed
Exceptions and Override Policies¶
When Exceptions Are Allowed:
- Emergency Fixes: Production incidents requiring immediate fix
- External Dependencies: Third-party module failures outside our control
- Tool Bugs: Known issues in linting/testing tools
- Deprecation Periods: Temporary bypass during breaking changes
Exception Process:
- Create exception request (GitHub issue, Jira ticket)
- Document reason and mitigation plan
- Obtain approval from tech lead or higher
- Set expiration date (max 30 days)
- Track exceptions in dashboard
- Review and close or extend before expiration
Exception Request Template:
## Quality Gate Exception Request
**Requester**: [Name]
**Date**: [YYYY-MM-DD]
**Expiration**: [YYYY-MM-DD] (max 30 days)
### What quality gate is being bypassed?
[Specific check/test being skipped]
### Why is this exception needed?
[Detailed justification]
### What is the risk?
[Impact if bypassed check would have failed]
### Mitigation Plan
[How will risk be addressed?]
### Approval
- [ ] Tech Lead: [Name]
- [ ] Security Team (if security-related): [Name]
Emergency Bypass Procedures¶
For Production Incidents Only:
- Verbal Approval: Tech lead or on-call approves verbally
- Skip Quality Gates: Merge with
[emergency-bypass]in commit message - Create Incident Ticket: Document incident and bypass
- Post-Incident Review: Within 24 hours, review what was bypassed
- Remediation: Fix bypassed checks within 7 days
Automated Detection:
# Detect emergency bypasses in CI
emergency-bypass:
script:
- |
if echo "$CI_COMMIT_MESSAGE" | grep -q "\[emergency-bypass\]"; then
echo "🚨 Emergency bypass detected"
# Post to Slack, create ticket
./scripts/notify_emergency_bypass.sh
fi
allow_failure: true
7. CI/CD Integration Patterns¶
Pipeline Stage Organization¶
Organize CI/CD pipelines in three tiers matching the testing pyramid:
Tier 1: Validate (Fast Feedback)
├─ Lint (YAML, Terraform, Ansible)
├─ Format Check
├─ Security Scan (Static)
└─ Secret Detection
⏱️ < 2 minutes
Tier 2: Test (Unit & Module)
├─ Unit Tests (Terratest, Molecule)
├─ Module Contract Verification
└─ Parallel Platform Tests
⏱️ < 10 minutes
Tier 3: Integration (Full Stack)
├─ Integration Tests
├─ Compliance Verification (InSpec)
└─ Smoke Tests
⏱️ < 60 minutes
🕒 Nightly or Pre-Release
Artifact Generation and Retention¶
Generated Artifacts:
- Test Reports: JUnit XML, JSON results
- Coverage Reports: Cobertura, LCOV formats
- Compliance Evidence: InSpec JSON, audit logs
- Plan Files: Terraform plans for review
- Logs: Detailed execution logs for debugging
Retention Policy:
| Artifact Type | Retention | Justification |
|---|---|---|
| Test Results | 7 days | Short-term debugging |
| Coverage Reports | 30 days | Trend analysis |
| Compliance Evidence | 90 days | Audit requirements |
| Release Artifacts | 365 days | Production traceability |
| Failed Test Logs | 14 days | Debugging failures |
Storage Optimization:
- Compress large artifacts (tar.gz)
- Store only on failure for debugging artifacts
- Archive to long-term storage (S3 Glacier) after retention period
Reporting and Dashboards¶
Required Dashboards:
- Test Coverage Dashboard:
- Coverage trends over time
- Per-module coverage breakdown
-
Platform coverage matrix
-
Quality Gates Dashboard:
- Pass/fail rates by gate
- Enforcement phase status
-
Exception tracking
-
Pipeline Performance Dashboard:
- Average pipeline duration
- Test execution times
-
Flaky test tracking
-
Compliance Dashboard:
- Compliance test results
- CIS benchmark scores
- Security scan findings
Tool Recommendations:
- Grafana with GitLab/GitHub metrics
- SonarQube for code quality
- Allure for test reporting
- Custom dashboards with Prometheus + Grafana
Feedback Mechanisms¶
Merge/Pull Request Comments:
- Test result summary
- Coverage metrics with trends
- Failed test details with logs
- Fix suggestions with commands
- Links to dashboards and documentation
Badges:
- Coverage badge with percentage
- Build status badge
- Compliance status badge
- Latest release badge
Notifications:
- Slack/Teams notifications for failures
- Email for critical compliance failures
- GitHub/GitLab notifications for reviewers
- Weekly digest of quality metrics
8. Developer Experience¶
Pre-Commit Hooks¶
Required Pre-Commit Checks:
# .pre-commit-config.yaml
repos:
- repo: https://github.com/antonbabenko/pre-commit-terraform
rev: v1.83.0
hooks:
- id: terraform_fmt
- id: terraform_validate
- id: terraform_docs
- repo: https://github.com/ansible/ansible-lint
rev: v6.20.0
hooks:
- id: ansible-lint
- repo: https://github.com/adrienverge/yamllint
rev: v1.32.0
hooks:
- id: yamllint
- repo: https://github.com/trufflesecurity/trufflehog
rev: v3.54.0
hooks:
- id: trufflehog
Best Practices:
- Keep hooks fast (< 10 seconds total)
- Only run checks on changed files
- Provide auto-fix where possible
- Allow bypass for emergencies (
git commit --no-verify) - Track bypass usage for metrics
Local Testing Capabilities¶
Developers Must Be Able to:
- Run All Tests Locally: No "CI-only" tests
- Test Individual Modules: Don't require full stack
- Use Mocked Dependencies: Avoid real cloud resources for unit tests
- Get Fast Feedback: Unit tests complete in < 2 minutes locally
- Debug Failures: Clear error messages and logs
Local Testing Tools:
- Terraform:
terraform test, Terratest with local Docker - Ansible:
molecule testwith Docker driver - GitLab CI:
gitlab-ci-localfor running pipelines locally - GitHub Actions:
actfor local action testing
Example Local Test Commands:
# Terraform module
cd modules/vpc
terraform test # Run native Terraform tests
cd tests && go test -v ./... # Run Terratest
# Ansible role
cd roles/webserver
molecule test # Run full test sequence
molecule test -s compliance # Run compliance tests
# GitLab CI pipeline
gitlab-ci-local # Run full pipeline
gitlab-ci-local --job lint:terraform # Run specific job
Fast Feedback Mechanisms¶
Feedback Speed Targets:
- Pre-Commit Hooks: < 10 seconds
- Lint Stage: < 2 minutes
- Unit Tests: < 10 minutes
- Integration Tests: < 60 minutes
- Full Pipeline: < 90 minutes
Optimization Strategies:
- Parallel Execution: Run tests across multiple runners
- Change Detection: Only test changed modules
- Caching: Cache dependencies (pip, npm, Go modules)
- Incremental Testing: Run smoke tests first, full tests later
- Sharding: Split large test suites across runners
Documentation Requirements¶
Every Module/Role Must Have:
- README.md: Usage examples, inputs, outputs
- CONTRACT.md: Explicit guarantees and promises
- CHANGELOG.md: Version history and breaking changes
- Testing Section: How to run tests, what they verify
README.md Template:
# Module Name
[One-sentence description]
## Usage
[Minimal working example]
## Testing
[How to run tests locally]
## Inputs
| Name | Type | Required | Description |
|------|------|----------|-------------|
## Outputs
| Name | Type | Description |
|------|------|-------------|
## Platform Support
- Platform 1: Tested
- Platform 2: Tested
## See Also
- [CONTRACT.md](CONTRACT.md) - Module guarantees
- [CHANGELOG.md](CHANGELOG.md) - Version history
9. Compliance & Governance¶
Evidence Generation¶
Required Evidence:
- Test Execution Logs: Prove tests were run
- Test Results: JUnit XML, JSON reports
- Compliance Reports: InSpec JSON, CIS benchmark results
- Coverage Reports: Code and guarantee coverage
- Audit Trails: Who approved, when, what changed
Evidence Format:
- Machine-Readable: JSON, XML for automated processing
- Human-Readable: HTML, PDF reports for auditors
- Signed/Verified: Cryptographic signatures for tamper-proofing
- Timestamped: Precise execution timestamps
- Traceable: Linked to git commits and PRs
Example Compliance Evidence:
{
"report_type": "compliance_verification",
"timestamp": "2024-01-15T10:30:00Z",
"module": "vpc-module-v1.2.0",
"commit_sha": "abc123def456",
"pull_request": "https://github.com/org/repo/pull/123",
"tests_executed": {
"total": 45,
"passed": 45,
"failed": 0,
"skipped": 0
},
"compliance_checks": {
"cis_aws_foundations": {
"total": 25,
"passed": 25,
"failed": 0,
"score": "100%"
},
"pci_dss": {
"total": 15,
"passed": 15,
"failed": 0,
"score": "100%"
}
},
"evidence_files": [
"artifacts/junit-report.xml",
"artifacts/inspec-results.json",
"artifacts/coverage-report.xml"
],
"approved_by": "tech-lead@example.com",
"verification_signature": "sha256:..."
}
Audit Trail Maintenance¶
What to Track:
- Code Changes: Git commits, PRs, approvals
- Test Results: All test executions with results
- Quality Gate Bypasses: Who, when, why, approval
- Deployment Events: What was deployed, when, by whom
- Access Changes: IAM modifications, permission grants
Audit Trail Storage:
- Git History: Permanent record of code changes
- CI/CD Logs: Retained per retention policy
- Centralized Logging: CloudWatch, Splunk, ELK stack
- Compliance Databases: Long-term audit storage
Audit Trail Query Examples:
-- Find all quality gate bypasses in last 30 days
SELECT commit_sha, author, bypass_reason, approved_by, timestamp
FROM audit_log
WHERE event_type = 'quality_gate_bypass'
AND timestamp > NOW() - INTERVAL '30 days'
ORDER BY timestamp DESC;
-- Find all deployments that failed compliance checks
SELECT deployment_id, module, compliance_score, deployer, timestamp
FROM deployments
WHERE compliance_score < 100
ORDER BY timestamp DESC;
Regulatory Requirements¶
SOC 2 Requirements:
- Automated security testing in CI/CD
- Evidence of test execution
- Access control audit logs
- Change management records
- Incident response documentation
PCI-DSS Requirements:
- Network segmentation testing
- Encryption verification
- Access control validation
- Vulnerability scanning
- Quarterly compliance reviews
HIPAA Requirements:
- Data encryption verification
- Access audit logs
- Security risk assessments
- Breach notification procedures
- Business associate agreements
Compliance Testing Integration:
# Compliance-specific test job
compliance:pci:
stage: compliance
script:
- inspec exec compliance/pci-dss.rb --reporter json:pci-results.json
artifacts:
reports:
junit: pci-results.json
expire_in: 365 days # Long retention for audit
rules:
- if: $CI_COMMIT_BRANCH == $CI_DEFAULT_BRANCH
- if: $CI_PIPELINE_SOURCE == "schedule"
Compliance-Ready Reporting¶
Report Requirements:
- Executive Summary: High-level compliance status
- Detailed Findings: Specific pass/fail results
- Evidence Links: Artifacts, logs, screenshots
- Remediation Plans: For any failures
- Trend Analysis: Compliance score over time
Report Generation:
# Generate compliance report
inspec exec compliance/ \
--reporter html:compliance-report.html \
--reporter json:compliance-report.json \
--reporter cli
# Upload to compliance dashboard
aws s3 cp compliance-report.html s3://compliance-reports/$(date +%Y-%m-%d)/
10. Continuous Improvement¶
Metrics to Track¶
Quality Metrics:
- Test Coverage: Overall and per-module
- Test Pass Rate: % of tests passing
- Flaky Test Rate: % of tests with inconsistent results
- Bug Escape Rate: Bugs found in production vs. testing
- Mean Time to Detection (MTTD): Time from bug introduction to detection
Performance Metrics:
- Pipeline Duration: Total time for full pipeline
- Test Execution Time: Time per test suite
- Feedback Loop Time: Commit to test results
- Build Success Rate: % of pipelines passing
- Deployment Frequency: How often we deploy
Process Metrics:
- Quality Gate Bypass Rate: % of merges bypassing gates
- Exception Request Rate: How many exceptions needed
- Pre-Commit Hook Bypass Rate: % of commits without hooks
- Documentation Completeness: % of modules with CONTRACT.md
- Platform Coverage: % of modules tested on all platforms
Review Cycles¶
Weekly Reviews:
- Test failure trends
- Flaky test identification
- Quality gate bypass analysis
- Pipeline performance
Monthly Reviews:
- Coverage trend analysis
- Compliance status review
- Exception requests review
- Tool and process improvements
Quarterly Reviews:
- Contract maintenance (update guarantees)
- Platform support review (add/remove platforms)
- Testing strategy assessment
- Regulatory compliance audit
Annual Reviews:
- Comprehensive testing standards review
- Tool evaluation and upgrades
- Security standards updates
- Industry best practices alignment
Contract Maintenance¶
Quarterly Contract Review:
- Accuracy Check: Do guarantees match current behavior?
- Completeness Check: Are all behaviors documented?
- Platform Update: Add/remove supported platforms
- Deprecation Planning: Mark features for removal
- Test Alignment: Do tests verify all guarantees?
Contract Update Process:
## Contract Review Checklist
- [ ] Reviewed all "Guarantees" sections
- [ ] Verified platform support matrix is current
- [ ] Updated breaking changes policy
- [ ] Confirmed test coverage matches guarantees
- [ ] Updated examples and usage documentation
- [ ] Checked for deprecated features
- [ ] Planned next version changes
Platform Expansion¶
When to Add Platform Support:
- Business Need: New platform used in production
- Customer Request: External users need different platform
- Risk Reduction: Avoid vendor lock-in
- Compliance: Regulatory requirement for specific platform
Platform Addition Process:
- Evaluate Feasibility: Can module work on new platform?
- Update Contract: Add platform to support matrix
- Add Platform Tests: Create test scenarios
- Document Differences: Platform-specific behavior
- Update CI/CD: Add platform to test matrix
- Announce Support: Update README, release notes
Example Platform Addition:
# Before: Only testing Ubuntu
test:ansible:
matrix:
- PLATFORM: [ubuntu-22.04]
# After: Added RHEL support
test:ansible:
matrix:
- PLATFORM: [ubuntu-22.04, rhel-9]
Summary¶
This document defines the organization-wide standards for Infrastructure as Code testing. Key principles:
- Test Everything: IaC bugs are expensive; testing prevents them
- Shift Left: Catch issues early for lowest cost
- Contract-Driven: Test what you promise
- Progressive Enforcement: Start permissive, tighten gradually
- Evidence-Based: Generate compliance-ready artifacts
- Developer-Friendly: Fast feedback, local testing, good DX
- Continuous Improvement: Regular reviews, metric tracking
See Also:
- Terraform Testing Guide
- Ansible Testing Guide
- GitLab CI Pipeline Architecture
- CONTRACT.md Template
- Testing Documentation Template
This living document is reviewed quarterly and updated based on lessons learned, industry best practices, and regulatory requirements.