Replies: 2 comments 4 replies
When your cluster goes red, is it because of ILM actions or something else (watermark, etc.)?
Short answer
Not watermarks. Each time the cluster went red it was due to data-stream write indices created with a primary shard in UNASSIGNED state (INDEX_CREATED / CLUSTER_RECOVERED), most commonly for Endpoint and System streams. A manual /_rollover on the affected streams immediately returned the cluster to green.

Evidence from this node
When red (examples): _cluster/health reports "status": "red", and _cat/shards snippets show the unassigned primaries.
Watermarks / disk are not the cause: _cat/allocation (node disk.percent disk.used disk.avail) shows no disk pressure. The defaults before any changes were low: 80%, high: 85%, flood: 90%; the persistent settings we later tried (still no watermark pressure) were low: 85%, high: 92%, flood: 95%.
Manual rollover heals immediately: rollover responses show rolled_over: true and _cluster/health flips to "status": "green" with unassigned_shards: 0.
After recovery:

What seems to be happening
On this single-node SO2 install, new daily write indices for certain Fleet/Endpoint and System data streams are occasionally created with the primary shard left UNASSIGNED right after INDEX_CREATED. It doesn't correlate with disk pressure or node loss. A targeted /_rollover of the stream creates a new backing index that assigns correctly and the cluster recovers.

Why it's not watermarks

Suggestions
Immediate / practical:
Longer-term:

Repro (repeated here)

Happy to provide
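For reference, a minimal sketch of the check-then-rollover sequence described above, assuming Security Onion's so-elasticsearch-query wrapper (which passes extra arguments to curl); the data stream name is only an example, so substitute whichever stream owns the unassigned write index:

```bash
# Confirm the cluster is red and locate the unassigned primaries
so-elasticsearch-query "_cluster/health?pretty"
so-elasticsearch-query "_cat/shards?v&h=index,shard,prirep,state,unassigned.reason" | grep UNASSIGNED

# Roll over the affected data stream (example name) so a fresh write index is created
so-elasticsearch-query "logs-endpoint.events.process-default/_rollover" -XPOST

# Verify the cluster returns to green with zero unassigned shards
so-elasticsearch-query "_cluster/health?pretty"
```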
Version: 2.4.160
Installation Method: Security Onion ISO image
Description: other (please provide detail below)
Installation Type: Standalone
Location: on-prem with Internet access
Hardware Specs: Exceeds minimum requirements
CPU: 12 cores
RAM: 40 GB
Storage for /: 163 GB
Storage for /nsm: 327 GB
Network Traffic Collection: tap
Network Traffic Speeds: 1Gbps to 10Gbps
Status: No, one or more services are failed (please provide detail below)
Salt Status: No, there are no failures
Logs: No, there are no additional clues
Detail
🧩 Summary
Security Onion 2 consistently enters a status: red state over time in single-node deployments, primarily due to issues with data stream rollover, ILM tiering, and shard allocation. This has persisted across versions since at least 2022.
⸻
✅ Environment
• Deployment Type: Single-node (e.g., home/SOC lab)
• Elasticsearch Version: [Insert SO2 version — e.g., 8.11.x]
• Fleet/Elastic Agent: Active and reporting
• Typical Uptime Before Failure: 1–3 days
⸻
💥 Symptoms
• UI shows “Elasticsearch: Fault”
• Cluster health: "status": "red"
• Unassigned primary shards due to INDEX_CREATED or CLUSTER_RECOVERED (see the diagnostic sketch after this list)
• ILM rollover fails silently
• Auto-created .ds-* indices persist with no assigned primary shard
• Fleet-managed .ds-.fleet-* indices show permission issues for API keys when attempting remediation
• Note that this persists despite other countermeasures, such as a daily cron reboot
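A minimal diagnostic sketch for the unassigned-shard and silent-ILM symptoms above, again assuming the so-elasticsearch-query wrapper; the backing index name is a placeholder:

```bash
# With no request body, this explains the first unassigned shard Elasticsearch finds
so-elasticsearch-query "_cluster/allocation/explain?pretty"

# Check whether ILM considers the backing index stuck (placeholder index name)
so-elasticsearch-query ".ds-logs-system.syslog-default-2025.01.01-000042/_ilm/explain?pretty"
```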
⸻
🔍 Root Cause Analysis
1. ILM Policy Mismatch
SO2 ships with ILM policies that assume multi-tier storage (hot/warm/cold), but most home/lab users only have a hot tier. Once the warm transition is attempted, rollover and shard recovery stall (a hot-only policy sketch follows this list).
2. Unhealed Data Streams
When a new .ds- index is created but not assigned properly, no automatic fix occurs, and ingest stalls on those streams.
3. Restricted Indices Cannot Be Fixed by API Key
Attempts to reroute or adjust settings on .ds-.fleet-* or .ds-.lists-* fail with a security_exception unless the API key has manage or all, which violates the principle of least privilege in production contexts.
4. No Scheduled Auto-Rollover or Cleanup
There is no systemd/cron-based safeguard in default SO2 to roll over indices or detect red cluster states.
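For item 1, one way to keep a single-node box out of the warm transition is a hot-only lifecycle. A minimal sketch with a hypothetical policy name and illustrative thresholds; the real SO2 policy names and rollover settings should be read from the running cluster first:

```bash
# Replace the policy with a hot-only lifecycle: roll over by size/age, delete later.
# Policy name and thresholds are illustrative, not the shipped SO2 values.
# If the wrapper does not send a JSON Content-Type header, add -H 'Content-Type: application/json'
# or use curl directly with the same body.
so-elasticsearch-query "_ilm/policy/so-logs-example" -XPUT -d '
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": { "max_primary_shard_size": "50gb", "max_age": "30d" }
        }
      },
      "delete": {
        "min_age": "90d",
        "actions": { "delete": {} }
      }
    }
  }
}'
```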
⸻
🛠️ Suggested Remediation Options
✅ Fix Now
• Include so-fix-red.sh or similar in /opt/so/scripts/ (a rough sketch follows this list) that:
• Detects and logs unassigned shards
• Performs safe rollovers on stale .ds-* data streams
• Removes or compresses old, non-write indices if needed
• Optionally posts to logs or webhook
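A rough sketch of what such a script could look like; this is not an existing SO2 tool, the data-stream name parsing is a heuristic, and it assumes so-elasticsearch-query and jq are available on the box:

```bash
#!/bin/bash
# so-fix-red.sh (sketch): detect unassigned primaries and roll over their data streams.
# Hypothetical helper, not shipped with Security Onion; review before using in production.

STATUS=$(so-elasticsearch-query "_cluster/health" | jq -r '.status')
echo "$(date -Is) cluster status: ${STATUS}"
[ "${STATUS}" != "red" ] && exit 0

# Backing indices with unassigned primaries, e.g.
# .ds-logs-endpoint.events.process-default-2025.01.01-000042
so-elasticsearch-query "_cat/shards?h=index,prirep,state" |
  awk '$2 == "p" && $3 == "UNASSIGNED" {print $1}' |
  sort -u |
  while read -r INDEX; do
    # Strip the .ds- prefix and the -<date>-<generation> suffix to recover the data stream name
    STREAM=$(echo "${INDEX}" | sed -E 's/^\.ds-//; s/-[0-9]{4}\.[0-9]{2}\.[0-9]{2}-[0-9]{6}$//')
    echo "$(date -Is) rolling over data stream ${STREAM} (unassigned primary in ${INDEX})"
    so-elasticsearch-query "${STREAM}/_rollover" -XPOST
  done
```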
🔄 Fix Long-Term
• Patch ILM policies to remove/skip warm/cold phases on single-node installs
• Warn users when they’re likely to hit this based on system profile
• Document cron-based healing (see the cron sketch after this list) or ship a toggle for self-healing jobs
• Improve error surface for red state in UI — right now users only see "Elasticsearch: Fault"
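For the cron-based healing mentioned above, a minimal sketch; the schedule, log path, and script location are arbitrary, and the script itself is the hypothetical so-fix-red.sh sketched in the previous section:

```bash
# /etc/cron.d/so-fix-red (sketch): check every 15 minutes and log the outcome
*/15 * * * * root /opt/so/scripts/so-fix-red.sh >> /var/log/so-fix-red.log 2>&1
```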
⸻
🧪 Repro Steps
1. Install SO2 in a single-node setup with default ILM, a 1 Gbps TAP link, and network syslog
2. Allow the system to ingest data over 2–3 months (Endpoint, Suricata, Zeek)
3. Observe the eventual symptoms (a health-probe sketch for logging the transition follows this list):
• status: red
• 1–3 unassigned .ds-logs-* shards
• Rollover errors and cluster degradation
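To log the moment the cluster degrades during this window, a one-line probe can be run hourly from cron; it assumes the so-elasticsearch-query wrapper and jq, and the log path is arbitrary:

```bash
# Append a timestamped one-line health summary; useful for correlating the red transition with ingest
echo "$(date -Is) $(so-elasticsearch-query "_cluster/health" | jq -r '"status=\(.status) unassigned=\(.unassigned_shards)"')" >> /var/log/so-health.log
```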
⸻
🔁 Workaround We’ve Deployed
🙏 Request
Can this be escalated as a known issue or receive a patch/workaround in upcoming releases?
Happy to provide:
• logs
• full curl output
• redacted _ilm/explain response
• Elastic Agent debug