Summary
Tyk Gateway makes blocking RPC calls to MDCB in the critical request path, causing request latency and timeouts when MDCB is slow or unavailable. This is a fundamental architectural issue - network I/O should never block user requests.
Architecture Problem
Current (Broken) Design
```
User Request
     ↓
[Request Processing]
     ↓
Analytics Recording (handler_success.go:305)
     ↓
OrgSessionExpiry() - BLOCKS ON RPC CALL
     ↓
[User waits for network I/O...]
     ↓
Response
```
Result: User requests block waiting for MDCB responses.
Correct Design
```
User Request                  Background Worker
     ↓                               ↓
[Request Processing]          Periodic Refresh
     ↓                               ↓
Read Cached Org Data          Call MDCB (async)
     ↓                               ↓
Response (fast!)              Update Cache
```
Result: User requests never wait for network I/O.
Root Causes
1. OrgSessionExpiry() Called in Request Thread
File: gateway/handler_success.go:302-312
```go
expiresAfter := s.Spec.ExpireAnalyticsAfter
if s.Spec.GlobalConfig.EnforceOrgDataAge {
	orgExpireDataTime := s.OrgSessionExpiry(s.Spec.OrgID) // BLOCKS REQUEST!
	if orgExpireDataTime > 0 {
		expiresAfter = orgExpireDataTime
	}
}
record.SetExpiry(expiresAfter)
```

Also in: gateway/handler_error.go:272-280
Problem: Analytics recording happens in the request goroutine BEFORE sending to async worker. The OrgSessionExpiry() call blocks the entire request.
2. OrgSession() Makes Synchronous RPC Call
File: gateway/middleware.go:278-364 (v5.2.6, similar in v5.8.6)
```go
func (t BaseMiddleware) OrgSession(orgID string) (user.SessionState, bool) {
	if rpc.IsEmergencyMode() {
		return user.SessionState{}, false
	}

	// Try and get the session from the session store
	session, found := t.Spec.OrgSessionManager.SessionDetail(orgID, orgID, false) // RPC CALL!
	if found && t.Spec.GlobalConfig.EnforceOrgDataAge {
		t.Logger().Debug("Setting data expiry: ", orgID)
		t.Gw.ExpiryCache.Set(session.OrgID, session.DataExpires, cache.DefaultExpiration)
	}

	session.SetKeyHash(storage.HashKey(orgID, t.Gw.GetConfig().HashKeys))
	return session.Clone(), found
}
```

Calls: RPCStorageHandler.SessionDetail() → FuncClientSingleton() → Network I/O
3. OrganizationMonitor Middleware Blocks All Requests
File: gateway/mw_organisation_activity.go:52-93
```go
func (k *OrganizationMonitor) ProcessRequest(w http.ResponseWriter, r *http.Request, _ interface{}) (error, int) {
	// ...code omitted...

	// try to get from Redis
	if !found {
		// not found in in-app cache, let's read from Redis
		orgSession, found = k.OrgSession(k.Spec.OrgID) // CAN BLOCK ON RPC!
		if !found {
			k.setOrgHasNoSession(true)
			return nil, http.StatusOK
		}
	}
	// ...
}
```

Enabled for: ALL APIs when EnforceOrgQuotas: true (including keyless APIs!)
Impact
When MDCB is Slow or Down
v5.2.6:
- Request blocks for ~30 milliseconds (fast-fail)
- Happens every ~10 minutes when cache expires
- Thundering herd: all concurrent requests block simultaneously
v5.8.6:
- Request blocks for up to ~90 seconds (3 × 30s timeout)
- Happens every ~10 minutes when cache expires
- Gateway appears completely unresponsive
Affected Traffic
- ✅ ALL APIs with EnforceOrgQuotas: true
- ✅ ALL APIs with EnforceOrgDataAge: true (forced in RPC mode)
- ✅ Keyless APIs (no authentication required!)
- ✅ Every request when cache expires (every 10 minutes)
Production Impact
- Users experience intermittent 30-90 second timeouts
- Load balancer health checks may fail
- Cascading failures if multiple gateways hit empty cache simultaneously
- No way to disable (EnforceOrgDataAge forced to true)
Proposed Solution
Phase 1: Move RPC Calls Out of Request Path (High Priority)
1.1. Background Org Session Refresh
New component: OrgSessionRefresher
```go
type OrgSessionRefresher struct {
	cache           *cache.Cache
	rpcStore        storage.Handler
	orgIDs          sync.Map // track active orgs
	refreshInterval time.Duration
}

func (r *OrgSessionRefresher) Start(ctx context.Context) {
	ticker := time.NewTicker(r.refreshInterval)
	defer ticker.Stop()

	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			r.refreshAllOrgs()
		}
	}
}

func (r *OrgSessionRefresher) refreshAllOrgs() {
	r.orgIDs.Range(func(orgID, _ interface{}) bool {
		go r.refreshOrgSession(orgID.(string)) // Async, non-blocking
		return true
	})
}

func (r *OrgSessionRefresher) refreshOrgSession(orgID string) {
	session, found := r.rpcStore.SessionDetail(orgID, orgID, false)
	if found {
		r.cache.Set(orgID, session, cache.DefaultExpiration)
	}
}
```

Usage in request path:
```go
func (t BaseMiddleware) OrgSession(orgID string) (user.SessionState, bool) {
	// Register this org for background refresh
	t.Gw.OrgRefresher.Track(orgID)

	// ONLY read from cache - never block
	if session, found := t.Gw.OrgSessionCache.Get(orgID); found {
		return session.(user.SessionState).Clone(), true
	}

	// Not in cache - use defaults, background refresh will populate it
	return user.SessionState{}, false
}
```

Result: Requests never block on RPC calls.
1.2. Move OrgSessionExpiry to Analytics Worker
File: gateway/handler_success.go:302-312
Before (blocks request):
```go
expiresAfter := s.Spec.ExpireAnalyticsAfter
if s.Spec.GlobalConfig.EnforceOrgDataAge {
	orgExpireDataTime := s.OrgSessionExpiry(s.Spec.OrgID) // BLOCKS!
	if orgExpireDataTime > 0 {
		expiresAfter = orgExpireDataTime
	}
}
record.SetExpiry(expiresAfter)
```

After (non-blocking):
```go
expiresAfter := s.Spec.ExpireAnalyticsAfter
// Use cached value only - no blocking calls
if s.Spec.GlobalConfig.EnforceOrgDataAge {
	if cachedExpiry, found := s.Gw.OrgExpiryCache.Get(s.Spec.OrgID); found {
		expiresAfter = cachedExpiry.(int64)
	}
}
record.SetExpiry(expiresAfter)
```

Background worker updates OrgExpiryCache periodically.
Phase 2: Circuit Breaker Pattern (Medium Priority)
Implement circuit breaker for RPC calls:
```go
type RPCCircuitBreaker struct {
	state       atomic.Value // closed, open, half-open
	failures    atomic.Int64
	lastFailure atomic.Value // time.Time
	threshold   int
	timeout     time.Duration
}

func (cb *RPCCircuitBreaker) Call(fn func() error) error {
	if cb.IsOpen() {
		return ErrCircuitOpen
	}

	err := fn()
	if err != nil {
		cb.RecordFailure()
		if cb.failures.Load() >= int64(cb.threshold) {
			cb.Open()
		}
		return err
	}

	cb.RecordSuccess()
	return nil
}
```

Benefits:
- Stop trying to reach MDCB after N failures
- Fail fast during outages
- Automatic recovery when MDCB comes back
Phase 3: Graceful Degradation (Medium Priority)
When MDCB is unavailable:
- Use last known good values from cache
- Extend cache TTL automatically (e.g., from 10 min to 1 hour)
- Log warnings but don't block requests
- Emit metrics for monitoring
```go
func (t BaseMiddleware) OrgSessionExpiry(orgID string) int64 {
	// Try cache first
	if expiry, found := t.Gw.ExpiryCache.Get(orgID); found {
		return expiry.(int64)
	}

	// MDCB unavailable - use safe default and extend cache
	defaultExpiry := DEFAULT_ORG_SESSION_EXPIRATION          // 7 days
	t.Gw.ExpiryCache.Set(orgID, defaultExpiry, 1*time.Hour) // Cache for 1 hour instead of 10 min
	log.WithField("orgID", orgID).Warning("MDCB unavailable, using default expiry")
	return defaultExpiry
}
```

Phase 4: Configuration Options (Low Priority)
Allow users to tune behavior:
```json
{
  "enforce_org_data_age": false,
  "org_session_refresh_interval": 30,
  "org_cache_ttl": 600,
  "rpc_circuit_breaker": {
    "enabled": true,
    "failure_threshold": 5,
    "timeout": 60
  }
}
```

Success Criteria
After implementing these changes:
- ✅ Zero blocking RPC calls in request path
- ✅ Requests complete in <10ms regardless of MDCB state
- ✅ Gateway remains responsive when MDCB is down
- ✅ Automatic recovery when MDCB comes back online
- ✅ No thundering herd when cache expires
- ✅ Metrics and logging for monitoring
- ✅ Backward compatible (same behavior when MDCB is healthy)
Testing Plan
1. Functional Tests
```go
func TestOrgSessionNonBlocking(t *testing.T) {
	// Stop MDCB
	mdcb.Stop()

	// Request should complete quickly
	start := time.Now()
	resp := makeRequest(t, "/api/endpoint")
	duration := time.Since(start)

	assert.Equal(t, http.StatusOK, resp.StatusCode)
	assert.Less(t, duration, 50*time.Millisecond, "Request should not block")
}
```

2. Load Tests
- 1000 req/s for 10 minutes
- Stop MDCB at 5-minute mark
- Verify no request latency increase
3. Cache Expiry Tests
- Run traffic continuously
- Wait for cache expiry (10 min)
- Verify no thundering herd
- Verify requests don't block
4. Recovery Tests
- Start with MDCB down
- Send requests (should work with defaults)
- Start MDCB
- Verify background refresh populates cache
- Verify latest org data is used
Related Issues
- #7429 - EnforceOrgDataAge forcibly enabled in RPC mode, breaking analytics when MDCB unavailable
- That issue describes the configuration problem; this issue describes the architectural problem.
Both need to be fixed, but this issue is more critical as it affects all RPC users.
Migration Path
- v5.8.x patch: Apply immediate fix (fail fast) to reduce 90s → 30ms
- v5.9.0: Implement Phase 1 (background refresh, move out of request path)
- v5.9.x: Add Phase 2 (circuit breaker) and Phase 3 (graceful degradation)
- v5.10.0: Add Phase 4 (configuration options)
References
Files That Need Changes
High priority:
- gateway/handler_success.go:302-312 - Move OrgSessionExpiry out of request thread
- gateway/handler_error.go:272-280 - Same as above for error path
- gateway/middleware.go:278-364 - Make OrgSession non-blocking
- gateway/server.go - Add OrgSessionRefresher background worker
Medium priority:
- rpc/rpc_client.go - Add circuit breaker
- gateway/mw_organisation_activity.go - Use non-blocking org session lookup
Low priority:
- config/config.go - Add configuration options
- Documentation updates
Architecture Principles
- Never block requests on network I/O - cache-first, background refresh
- Fail gracefully - use defaults when dependencies unavailable
- Monitor and alert - emit metrics for degraded state
- Automatic recovery - no manual intervention required
- Backward compatible - same behavior when everything is healthy
This is a critical production issue affecting all Tyk Gateway deployments using MDCB. The current architecture violates basic principles of high-availability system design by blocking user requests on synchronous RPC calls.