Blocking RPC calls in critical request path cause request latency and timeouts #7432

@buger

Summary

Tyk Gateway makes blocking RPC calls to MDCB in the critical request path, causing request latency and timeouts when MDCB is slow or unavailable. This is a fundamental architectural issue - network I/O should never block user requests.

Architecture Problem

Current (Broken) Design

User Request
    ↓
[Request Processing]
    ↓
Analytics Recording (handler_success.go:305)
    ↓
OrgSessionExpiry() - BLOCKS ON RPC CALL
    ↓
[User waits for network I/O...]
    ↓
Response

Result: User requests block waiting for MDCB responses.

Correct Design

User Request                Background Worker
    ↓                              ↓
[Request Processing]          Periodic Refresh
    ↓                              ↓
Read Cached Org Data         Call MDCB (async)
    ↓                              ↓
Response (fast!)             Update Cache

Result: User requests never wait for network I/O.

Root Causes

1. OrgSessionExpiry() Called in Request Thread

File: gateway/handler_success.go:302-312

expiresAfter := s.Spec.ExpireAnalyticsAfter

if s.Spec.GlobalConfig.EnforceOrgDataAge {
    orgExpireDataTime := s.OrgSessionExpiry(s.Spec.OrgID)  // BLOCKS REQUEST!
    
    if orgExpireDataTime > 0 {
        expiresAfter = orgExpireDataTime
    }
}

record.SetExpiry(expiresAfter)

Also in: gateway/handler_error.go:272-280

Problem: The analytics record is built in the request goroutine before it is handed to the async worker, so the OrgSessionExpiry() call blocks the entire request.

2. OrgSession() Makes Synchronous RPC Call

File: gateway/middleware.go:278-364 (v5.2.6, similar in v5.8.6)

func (t BaseMiddleware) OrgSession(orgID string) (user.SessionState, bool) {
    if rpc.IsEmergencyMode() {
        return user.SessionState{}, false
    }
    
    // Try and get the session from the session store
    session, found := t.Spec.OrgSessionManager.SessionDetail(orgID, orgID, false)  // RPC CALL!
    if found && t.Spec.GlobalConfig.EnforceOrgDataAge {
        t.Logger().Debug("Setting data expiry: ", orgID)
        t.Gw.ExpiryCache.Set(session.OrgID, session.DataExpires, cache.DefaultExpiration)
    }
    
    session.SetKeyHash(storage.HashKey(orgID, t.Gw.GetConfig().HashKeys))
    return session.Clone(), found
}

Calls: RPCStorageHandler.SessionDetail() → FuncClientSingleton() → Network I/O

3. OrganizationMonitor Middleware Blocks All Requests

File: gateway/mw_organisation_activity.go:52-93

func (k *OrganizationMonitor) ProcessRequest(w http.ResponseWriter, r *http.Request, _ interface{}) (error, int) {
    // ...code omitted...
    
    // try to get from Redis
    if !found {
        // not found in in-app cache, let's read from Redis
        orgSession, found = k.OrgSession(k.Spec.OrgID)  // CAN BLOCK ON RPC!
        if !found {
            k.setOrgHasNoSession(true)
            return nil, http.StatusOK
        }
    }
    // ...
}

Enabled for: ALL APIs when EnforceOrgQuotas: true (including keyless APIs!)

Impact

When MDCB is Slow or Down

v5.2.6:

  • Request blocks for ~30 milliseconds (fast-fail)
  • Happens every ~10 minutes when cache expires
  • Thundering herd: all concurrent requests block simultaneously

v5.8.6:

  • Request blocks for up to ~90 seconds (3 × 30s timeout)
  • Happens every ~10 minutes when cache expires
  • Gateway appears completely unresponsive

Affected Traffic

  • ALL APIs with EnforceOrgQuotas: true
  • ALL APIs with EnforceOrgDataAge: true (forced in RPC mode)
  • Keyless APIs (no authentication required!)
  • Every request when cache expires (every 10 minutes)

Production Impact

  • Users experience intermittent 30-90 second timeouts
  • Load balancer health checks may fail
  • Cascading failures if multiple gateways hit empty cache simultaneously
  • No way to disable (EnforceOrgDataAge forced to true)

Proposed Solution

Phase 1: Move RPC Calls Out of Request Path (High Priority)

1.1. Background Org Session Refresh

New component: OrgSessionRefresher

type OrgSessionRefresher struct {
    cache       *cache.Cache
    rpcStore    storage.Handler
    orgIDs      sync.Map  // track active orgs
    refreshInterval time.Duration
}

func (r *OrgSessionRefresher) Start(ctx context.Context) {
    ticker := time.NewTicker(r.refreshInterval)
    defer ticker.Stop()
    
    for {
        select {
        case <-ctx.Done():
            return
        case <-ticker.C:
            r.refreshAllOrgs()
        }
    }
}

func (r *OrgSessionRefresher) refreshAllOrgs() {
    r.orgIDs.Range(func(orgID, _ interface{}) bool {
        go r.refreshOrgSession(orgID.(string))  // Async, non-blocking
        return true
    })
}

func (r *OrgSessionRefresher) refreshOrgSession(orgID string) {
    session, found := r.rpcStore.SessionDetail(orgID, orgID, false)
    if found {
        r.cache.Set(orgID, session, cache.DefaultExpiration)
    }
}

Usage in request path:

func (t BaseMiddleware) OrgSession(orgID string) (user.SessionState, bool) {
    // Register this org for background refresh
    t.Gw.OrgRefresher.Track(orgID)
    
    // ONLY read from cache - never block
    if session, found := t.Gw.OrgSessionCache.Get(orgID); found {
        return session.(user.SessionState).Clone(), true
    }
    
    // Not in cache - use defaults, background refresh will populate it
    return user.SessionState{}, false
}

Result: Requests never block on RPC calls.

1.2. Move OrgSessionExpiry to Analytics Worker

File: gateway/handler_success.go:302-312

Before (blocks request):

expiresAfter := s.Spec.ExpireAnalyticsAfter

if s.Spec.GlobalConfig.EnforceOrgDataAge {
    orgExpireDataTime := s.OrgSessionExpiry(s.Spec.OrgID)  // BLOCKS!
    if orgExpireDataTime > 0 {
        expiresAfter = orgExpireDataTime
    }
}

record.SetExpiry(expiresAfter)

After (non-blocking):

expiresAfter := s.Spec.ExpireAnalyticsAfter

// Use cached value only - no blocking calls
if s.Spec.GlobalConfig.EnforceOrgDataAge {
    if cachedExpiry, found := s.Gw.OrgExpiryCache.Get(s.Spec.OrgID); found {
        expiresAfter = cachedExpiry.(int64)
    }
}

record.SetExpiry(expiresAfter)

Background worker updates OrgExpiryCache periodically.

Phase 2: Circuit Breaker Pattern (Medium Priority)

Implement circuit breaker for RPC calls:

type RPCCircuitBreaker struct {
    state          atomic.Value  // closed, open, half-open
    failures       atomic.Int64
    lastFailure    atomic.Value  // time.Time
    threshold      int
    timeout        time.Duration
}

func (cb *RPCCircuitBreaker) Call(fn func() error) error {
    if cb.IsOpen() {
        return ErrCircuitOpen
    }
    
    err := fn()
    if err != nil {
        cb.RecordFailure()
        if cb.failures.Load() >= int64(cb.threshold) {
            cb.Open()
        }
        return err
    }
    
    cb.RecordSuccess()
    return nil
}

Benefits:

  • Stop trying to reach MDCB after N failures
  • Fail fast during outages
  • Automatic recovery when MDCB comes back

Phase 3: Graceful Degradation (Medium Priority)

When MDCB is unavailable:

  1. Use last known good values from cache
  2. Extend cache TTL automatically (e.g., from 10 min to 1 hour)
  3. Log warnings but don't block requests
  4. Emit metrics for monitoring

func (t BaseMiddleware) OrgSessionExpiry(orgID string) int64 {
    // Try cache first
    if expiry, found := t.Gw.ExpiryCache.Get(orgID); found {
        return expiry.(int64)
    }
    
    // MDCB unavailable - use safe default and extend cache
    defaultExpiry := DEFAULT_ORG_SESSION_EXPIRATION  // 7 days
    t.Gw.ExpiryCache.Set(orgID, defaultExpiry, 1*time.Hour)  // Cache for 1 hour instead of 10 min
    
    log.WithField("orgID", orgID).Warning("MDCB unavailable, using default expiry")
    
    return defaultExpiry
}

Phase 4: Configuration Options (Low Priority)

Allow users to tune behavior:

{
  "enforce_org_data_age": false,
  "org_session_refresh_interval": 30,
  "org_cache_ttl": 600,
  "rpc_circuit_breaker": {
    "enabled": true,
    "failure_threshold": 5,
    "timeout": 60
  }
}

Success Criteria

After implementing these changes:

  • ✅ Zero blocking RPC calls in request path
  • ✅ Requests complete in <10ms regardless of MDCB state
  • ✅ Gateway remains responsive when MDCB is down
  • ✅ Automatic recovery when MDCB comes back online
  • ✅ No thundering herd when cache expires
  • ✅ Metrics and logging for monitoring
  • ✅ Backward compatible (same behavior when MDCB is healthy)

Testing Plan

1. Functional Tests

func TestOrgSessionNonBlocking(t *testing.T) {
    // Stop MDCB
    mdcb.Stop()
    
    // Request should complete quickly
    start := time.Now()
    resp := makeRequest(t, "/api/endpoint")
    duration := time.Since(start)
    
    assert.Equal(t, http.StatusOK, resp.StatusCode)
    assert.Less(t, duration, 50*time.Millisecond, "Request should not block")
}

2. Load Tests

  • 1000 req/s for 10 minutes
  • Stop MDCB at 5-minute mark
  • Verify no request latency increase

3. Cache Expiry Tests

  • Run traffic continuously
  • Wait for cache expiry (10 min)
  • Verify no thundering herd
  • Verify requests don't block

4. Recovery Tests

  • Start with MDCB down
  • Send requests (should work with defaults)
  • Start MDCB
  • Verify background refresh populates cache
  • Verify latest org data is used

Related Issues

Both need to be fixed, but this issue is more critical as it affects all RPC users.

Migration Path

  1. v5.8.x patch: Apply immediate fix (fail fast) to reduce 90s → 30ms
  2. v5.9.0: Implement Phase 1 (background refresh, move out of request path)
  3. v5.9.x: Add Phase 2 (circuit breaker) and Phase 3 (graceful degradation)
  4. v5.10.0: Add Phase 4 (configuration options)

References

Files That Need Changes

High priority:

  • gateway/handler_success.go:302-312 - Move OrgSessionExpiry out of request thread
  • gateway/handler_error.go:272-280 - Same as above for error path
  • gateway/middleware.go:278-364 - Make OrgSession non-blocking
  • gateway/server.go - Add OrgSessionRefresher background worker

Medium priority:

  • rpc/rpc_client.go - Add circuit breaker
  • gateway/mw_organisation_activity.go - Use non-blocking org session lookup

Low priority:

  • config/config.go - Add configuration options
  • Documentation updates

Architecture Principles

  1. Never block requests on network I/O - cache-first, background refresh
  2. Fail gracefully - use defaults when dependencies unavailable
  3. Monitor and alert - emit metrics for degraded state
  4. Automatic recovery - no manual intervention required
  5. Backward compatible - same behavior when everything is healthy

This is a critical production issue affecting all Tyk Gateway deployments using MDCB. The current architecture violates basic principles of high-availability system design by blocking user requests on synchronous RPC calls.
