Blocking RPC calls in critical request path cause request latency and timeouts

## Summary

Tyk Gateway makes **blocking RPC calls to MDCB in the critical request path**, causing request latency and timeouts when MDCB is slow or unavailable. This is a fundamental architectural issue: network I/O should **never block user requests**.

## Architecture Problem

### Current (Broken) Design

```
User Request
    ↓
[Request Processing]
    ↓
Analytics Recording (handler_success.go:305)
    ↓
OrgSessionExpiry() - BLOCKS ON RPC CALL
    ↓
[User waits for network I/O...]
    ↓
Response
```

**Result:** User requests block waiting for MDCB responses.

### Correct Design

```
User Request                Background Worker
    ↓                              ↓
[Request Processing]          Periodic Refresh
    ↓                              ↓
Read Cached Org Data         Call MDCB (async)
    ↓                              ↓
Response (fast!)             Update Cache
```

**Result:** User requests never wait for network I/O.

## Root Causes

### 1. OrgSessionExpiry() Called in Request Thread

**File:** `gateway/handler_success.go:302-312`

```go
expiresAfter := s.Spec.ExpireAnalyticsAfter

if s.Spec.GlobalConfig.EnforceOrgDataAge {
    orgExpireDataTime := s.OrgSessionExpiry(s.Spec.OrgID)  // BLOCKS REQUEST!
    
    if orgExpireDataTime > 0 {
        expiresAfter = orgExpireDataTime
    }
}

record.SetExpiry(expiresAfter)
```

**Also in:** `gateway/handler_error.go:272-280`

**Problem:** Analytics recording happens in the request goroutine, before the record is handed to the async worker. The `OrgSessionExpiry()` call therefore blocks the request until it returns.

### 2. OrgSession() Makes Synchronous RPC Call

**File:** `gateway/middleware.go:278-364` (v5.2.6, similar in v5.8.6)

```go
func (t BaseMiddleware) OrgSession(orgID string) (user.SessionState, bool) {
    if rpc.IsEmergencyMode() {
        return user.SessionState{}, false
    }
    
    // Try and get the session from the session store
    session, found := t.Spec.OrgSessionManager.SessionDetail(orgID, orgID, false)  // RPC CALL!
    if found && t.Spec.GlobalConfig.EnforceOrgDataAge {
        t.Logger().Debug("Setting data expiry: ", orgID)
        t.Gw.ExpiryCache.Set(session.OrgID, session.DataExpires, cache.DefaultExpiration)
    }
    
    session.SetKeyHash(storage.HashKey(orgID, t.Gw.GetConfig().HashKeys))
    return session.Clone(), found
}
```

**Calls:** `RPCStorageHandler.SessionDetail()` → `FuncClientSingleton()` → Network I/O

### 3. OrganizationMonitor Middleware Blocks All Requests

**File:** `gateway/mw_organisation_activity.go:52-93`

```go
func (k *OrganizationMonitor) ProcessRequest(w http.ResponseWriter, r *http.Request, _ interface{}) (error, int) {
    // ...code omitted...
    
    // try to get from Redis
    if !found {
        // not found in in-app cache, let's read from Redis
        orgSession, found = k.OrgSession(k.Spec.OrgID)  // CAN BLOCK ON RPC!
        if !found {
            k.setOrgHasNoSession(true)
            return nil, http.StatusOK
        }
    }
    // ...
}
```

**Enabled for:** ALL APIs when `EnforceOrgQuotas: true` (including keyless APIs!)

## Impact

### When MDCB is Slow or Down

**v5.2.6:**
- Request blocks for ~30 milliseconds (fast-fail)
- Happens every ~10 minutes when cache expires
- Thundering herd: all concurrent requests block simultaneously

**v5.8.6:**
- Request blocks for up to ~90 seconds (3 × 30s timeout)
- Happens every ~10 minutes when cache expires
- Gateway appears completely unresponsive

### Affected Traffic

- ✅ **ALL APIs** with `EnforceOrgQuotas: true`
- ✅ **ALL APIs** with `EnforceOrgDataAge: true` (forced in RPC mode)
- ✅ **Keyless APIs** (no authentication required!)
- ✅ **Every request** when cache expires (every 10 minutes)

### Production Impact

- Users experience intermittent 30-90 second timeouts
- Load balancer health checks may fail
- Cascading failures if multiple gateways hit empty cache simultaneously
- No way to disable in RPC mode (`EnforceOrgDataAge` is forced to `true`, see #7429)

## Proposed Solution

### Phase 1: Move RPC Calls Out of Request Path (High Priority)

#### 1.1. Background Org Session Refresh

**New component:** `OrgSessionRefresher`

```go
type OrgSessionRefresher struct {
    cache           *cache.Cache
    rpcStore        storage.Handler
    orgIDs          sync.Map // track active orgs
    refreshInterval time.Duration
}

func (r *OrgSessionRefresher) Start(ctx context.Context) {
    ticker := time.NewTicker(r.refreshInterval)
    defer ticker.Stop()
    
    for {
        select {
        case <-ctx.Done():
            return
        case <-ticker.C:
            r.refreshAllOrgs()
        }
    }
}

func (r *OrgSessionRefresher) refreshAllOrgs() {
    r.orgIDs.Range(func(orgID, _ interface{}) bool {
        go r.refreshOrgSession(orgID.(string))  // Async, non-blocking
        return true
    })
}

func (r *OrgSessionRefresher) refreshOrgSession(orgID string) {
    session, found := r.rpcStore.SessionDetail(orgID, orgID, false)
    if found {
        r.cache.Set(orgID, session, cache.DefaultExpiration)
    }
}
```

**Usage in request path:**

```go
func (t BaseMiddleware) OrgSession(orgID string) (user.SessionState, bool) {
    // Register this org for background refresh
    t.Gw.OrgRefresher.Track(orgID)
    
    // ONLY read from cache - never block
    if session, found := t.Gw.OrgSessionCache.Get(orgID); found {
        return session.(user.SessionState).Clone(), true
    }
    
    // Not in cache - use defaults, background refresh will populate it
    return user.SessionState{}, false
}
```

**Result:** Requests **never** block on RPC calls.
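
The snippets above call `Track` without defining it, and `refreshAllOrgs` starts one goroutine per org with no upper bound. A minimal self-contained sketch of both pieces follows. It rests on assumptions: `orgRefresher`, `sessionFetcher`, and the `limit` field are hypothetical names, and sessions are modelled as plain strings rather than `user.SessionState`.

```go
package main

import (
	"sync"
)

// sessionFetcher is a hypothetical stand-in for the RPC-backed lookup.
type sessionFetcher func(orgID string) (session string, found bool)

// orgRefresher tracks active orgs and refreshes their sessions in the background.
type orgRefresher struct {
	orgIDs sync.Map // orgID -> struct{}
	cache  sync.Map // orgID -> session
	fetch  sessionFetcher
	limit  int // maximum number of concurrent fetches
}

// Track registers orgID for background refresh. It is cheap and idempotent,
// so it can be called on every request.
func (r *orgRefresher) Track(orgID string) {
	r.orgIDs.LoadOrStore(orgID, struct{}{})
}

// Get reads the cached session only. It never calls fetch.
func (r *orgRefresher) Get(orgID string) (string, bool) {
	v, ok := r.cache.Load(orgID)
	if !ok {
		return "", false
	}
	return v.(string), true
}

// refreshAll refreshes every tracked org, running at most r.limit fetches
// at a time, and returns once all of them have finished.
func (r *orgRefresher) refreshAll() {
	limit := r.limit
	if limit < 1 {
		limit = 1
	}
	sem := make(chan struct{}, limit)
	var wg sync.WaitGroup
	r.orgIDs.Range(func(key, _ any) bool {
		orgID := key.(string)
		sem <- struct{}{} // wait for a free slot
		wg.Add(1)
		go func() {
			defer wg.Done()
			defer func() { <-sem }()
			if s, ok := r.fetch(orgID); ok {
				r.cache.Store(orgID, s)
			}
		}()
		return true
	})
	wg.Wait()
}
```

Bounding the fan-out keeps a slow MDCB from accumulating goroutines when many orgs are tracked. Waiting for the batch to finish also keeps consecutive ticks from overlapping.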

#### 1.2. Move OrgSessionExpiry to Analytics Worker

**File:** `gateway/handler_success.go:302-312`

**Before (blocks request):**
```go
expiresAfter := s.Spec.ExpireAnalyticsAfter

if s.Spec.GlobalConfig.EnforceOrgDataAge {
    orgExpireDataTime := s.OrgSessionExpiry(s.Spec.OrgID)  // BLOCKS!
    if orgExpireDataTime > 0 {
        expiresAfter = orgExpireDataTime
    }
}

record.SetExpiry(expiresAfter)
```

**After (non-blocking):**
```go
expiresAfter := s.Spec.ExpireAnalyticsAfter

// Use cached value only - no blocking calls
if s.Spec.GlobalConfig.EnforceOrgDataAge {
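    if cachedExpiry, found := s.Gw.OrgExpiryCache.Get(s.Spec.OrgID); found {
        // Checked assertion avoids a panic on an unexpected type; keep the > 0 guard from the original code
        if expiry, ok := cachedExpiry.(int64); ok && expiry > 0 {
            expiresAfter = expiry
        }
    }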
}

record.SetExpiry(expiresAfter)
```

**Background worker** updates `OrgExpiryCache` periodically.

### Phase 2: Circuit Breaker Pattern (Medium Priority)

Implement circuit breaker for RPC calls:

```go
type RPCCircuitBreaker struct {
    state       atomic.Value // closed, open, half-open
    failures    atomic.Int64
    lastFailure atomic.Value // time.Time
    threshold   int64        // int64 so it can be compared with failures.Load()
    timeout     time.Duration
}

func (cb *RPCCircuitBreaker) Call(fn func() error) error {
    if cb.IsOpen() {
        return ErrCircuitOpen
    }
    
    err := fn()
    if err != nil {
        cb.RecordFailure()
        if cb.failures.Load() >= cb.threshold {
            cb.Open()
        }
        return err
    }
    
    cb.RecordSuccess()
    return nil
}
```

**Benefits:**
- Stop trying to reach MDCB after N failures
- Fail fast during outages
- Automatic recovery when MDCB comes back

### Phase 3: Graceful Degradation (Medium Priority)

When MDCB is unavailable:

1. **Use last known good values** from cache
2. **Extend cache TTL** automatically (e.g., from 10 min to 1 hour)
3. **Log warnings** but don't block requests
4. **Emit metrics** for monitoring

```go
func (t BaseMiddleware) OrgSessionExpiry(orgID string) int64 {
    // Try cache first
    if expiry, found := t.Gw.ExpiryCache.Get(orgID); found {
        return expiry.(int64)
    }
    
    // Cache miss (MDCB unavailable or entry not yet refreshed): use a safe default and extend the cache TTL
    defaultExpiry := DEFAULT_ORG_SESSION_EXPIRATION  // 7 days
    t.Gw.ExpiryCache.Set(orgID, defaultExpiry, 1*time.Hour)  // Cache for 1 hour instead of 10 min
    
    log.WithField("orgID", orgID).Warning("Org expiry not in cache (MDCB may be unavailable), using default expiry")
    
    return defaultExpiry
}
```

### Phase 4: Configuration Options (Low Priority)

Allow users to tune behavior:

```json
{
  "enforce_org_data_age": false,
  "org_session_refresh_interval": 30,
  "org_cache_ttl": 600,
  "rpc_circuit_breaker": {
    "enabled": true,
    "failure_threshold": 5,
    "timeout": 60
  }
}
```

## Success Criteria

After implementing these changes:

- ✅ **Zero blocking RPC calls** in request path
- ✅ Requests complete in <10ms regardless of MDCB state
- ✅ Gateway remains responsive when MDCB is down
- ✅ Automatic recovery when MDCB comes back online
- ✅ No thundering herd when cache expires
- ✅ Metrics and logging for monitoring
- ✅ Backward compatible (same behavior when MDCB is healthy)

## Testing Plan

### 1. Functional Tests

```go
func TestOrgSessionNonBlocking(t *testing.T) {
    // Stop MDCB
    mdcb.Stop()
    
    // Request should complete quickly
    start := time.Now()
    resp := makeRequest(t, "/api/endpoint")
    duration := time.Since(start)
    
    assert.Equal(t, http.StatusOK, resp.StatusCode)
    assert.Less(t, duration, 50*time.Millisecond, "Request should not block")
}
```

### 2. Load Tests

- 1000 req/s for 10 minutes
- Stop MDCB at 5-minute mark
- Verify no request latency increase

### 3. Cache Expiry Tests

- Run traffic continuously
- Wait for cache expiry (10 min)
- Verify no thundering herd
- Verify requests don't block

### 4. Recovery Tests

- Start with MDCB down
- Send requests (should work with defaults)
- Start MDCB
- Verify background refresh populates cache
- Verify latest org data is used

## Related Issues

- #7429 (EnforceOrgDataAge forced to true in RPC mode) describes the configuration problem
- **This issue** (#7432) describes the architectural problem

Both need to be fixed, but this issue is more critical as it affects all RPC users.

## Migration Path

1. **v5.8.x patch:** Apply immediate fix (fail fast) to reduce 90s → 30ms
2. **v5.9.0:** Implement Phase 1 (background refresh, move out of request path)
3. **v5.9.x:** Add Phase 2 (circuit breaker) and Phase 3 (graceful degradation)
4. **v5.10.0:** Add Phase 4 (configuration options)

## References

### Files That Need Changes

**High priority:**
- `gateway/handler_success.go:302-312` - Move OrgSessionExpiry out of request thread
- `gateway/handler_error.go:272-280` - Same as above for error path
- `gateway/middleware.go:278-364` - Make OrgSession non-blocking
- `gateway/server.go` - Add OrgSessionRefresher background worker

**Medium priority:**
- `rpc/rpc_client.go` - Add circuit breaker
- `gateway/mw_organisation_activity.go` - Use non-blocking org session lookup

**Low priority:**
- `config/config.go` - Add configuration options
- Documentation updates

### Architecture Principles

1. **Never block requests on network I/O** - cache-first, background refresh
2. **Fail gracefully** - use defaults when dependencies unavailable
3. **Monitor and alert** - emit metrics for degraded state
4. **Automatic recovery** - no manual intervention required
5. **Backward compatible** - same behavior when everything is healthy

---

This is a critical production issue affecting all Tyk Gateway deployments using MDCB. The current architecture violates basic principles of high-availability system design by blocking user requests on synchronous RPC calls.

