Replies: 2 comments 4 replies
When your cluster goes red, is it because of ILM actions or something else (watermark, etc.)?
Short answer
Not watermarks. Each time the cluster went red it was due to data-stream write indices created with a primary shard in UNASSIGNED state (INDEX_CREATED / CLUSTER_RECOVERED), most commonly for Endpoint and System streams. A manual /_rollover on the affected streams immediately returned the cluster to green.

Evidence from this node
When red (examples): _cluster/health reports "status": "red", and _cat/shards snippets show the unassigned primaries.
Watermarks / disk are not the cause: _cat/allocation (node disk.percent disk.used disk.avail) shows no disk pressure. The defaults before any changes were low: 80%, high: 85%, flood: 90%; the persistent settings we later tried (still no watermark pressure) were low: 85%, high: 92%, flood: 95%.
Manual rollover heals immediately: rollover responses show rolled_over: true and _cluster/health flips to "status": "green" with unassigned_shards: 0.
After recovery:

What seems to be happening
On this single-node SO2 install, new daily write indices for certain Fleet/Endpoint and System data streams are occasionally created with the primary shard left UNASSIGNED right after INDEX_CREATED. It doesn't correlate with disk pressure or node loss. A targeted /_rollover of the stream creates a new backing index that assigns correctly and the cluster recovers.

Why it's not watermarks

Suggestions
Immediate / practical:
Longer-term:

Repro (repeated here)

Happy to provide
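For reference, a minimal sketch of the check-then-rollover sequence described above, assuming Security Onion's so-elasticsearch-query wrapper (which passes extra arguments to curl); the data stream name is only an example, so substitute whichever stream owns the unassigned write index:

```bash
# Confirm the cluster is red and locate the unassigned primaries
so-elasticsearch-query "_cluster/health?pretty"
so-elasticsearch-query "_cat/shards?v&h=index,shard,prirep,state,unassigned.reason" | grep UNASSIGNED

# Roll over the affected data stream (example name) so a fresh write index is created
so-elasticsearch-query "logs-endpoint.events.process-default/_rollover" -XPOST

# Verify the cluster returns to green with zero unassigned shards
so-elasticsearch-query "_cluster/health?pretty"
```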
Version: 2.4.160
Installation Method: Security Onion ISO image
Description: other (please provide detail below)
Installation Type: Standalone
Location: on-prem with Internet access
Hardware Specs: Exceeds minimum requirements
CPU: 12 cores
RAM: 40 GB
Storage for /: 163 GB
Storage for /nsm: 327 GB
Network Traffic Collection: tap
Network Traffic Speeds: 1Gbps to 10Gbps
Status: No, one or more services are failed (please provide detail below)
Salt Status: No, there are no failures
Logs: No, there are no additional clues
Detail
🧩 Summary
Security Onion 2 consistently enters a status: red state over time in single-node deployments, primarily due to issues with data stream rollover, ILM tiering, and shard allocation. This has persisted across versions since at least 2022.
⸻
✅ Environment
• Deployment Type: Single-node (e.g., home/SOC lab)
• Elasticsearch Version: [Insert SO2 version — e.g., 8.11.x]
• Fleet/Elastic Agent: Active and reporting
• Typical Uptime Before Failure: 1–3 days
⸻
💥 Symptoms
• UI shows “Elasticsearch: Fault”
• Cluster health: "status": "red"
• Unassigned primary shards due to INDEX_CREATED or CLUSTER_RECOVERED (see the diagnostic sketch after this list)
• ILM rollover fails silently
• Auto-created .ds-* indices persist with no assigned primary shard
• Fleet-managed .ds-.fleet-* indices show permission issues for API keys when attempting remediation
• Note that this persists despite other countermeasures, such as a daily cron reboot
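A minimal diagnostic sketch for the unassigned-shard and silent-ILM symptoms above, again assuming the so-elasticsearch-query wrapper; the backing index name is a placeholder:

```bash
# With no request body, this explains the first unassigned shard Elasticsearch finds
so-elasticsearch-query "_cluster/allocation/explain?pretty"

# Check whether ILM considers the backing index stuck (placeholder index name)
so-elasticsearch-query ".ds-logs-system.syslog-default-2025.01.01-000042/_ilm/explain?pretty"
```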
⸻
🔍 Root Cause Analysis
1. ILM Policy Mismatch
SO2 ships with ILM policies that assume multi-tier storage (hot/warm/cold), but most home/lab users only have a hot tier. Once the warm transition is attempted, rollover and shard recovery stall (a hot-only policy sketch follows this list).
2. Unhealed Data Streams
When a new .ds- index is created but not assigned properly, no automatic fix occurs, and ingest stalls on those streams.
3. Restricted Indices Cannot Be Fixed by API Key
Attempts to reroute or adjust settings on .ds-.fleet-* or .ds-.lists-* fail with a security_exception unless the API key has manage or all, which violates the principle of least privilege in production contexts.
4. No Scheduled Auto-Rollover or Cleanup
There is no systemd/cron-based safeguard in default SO2 to roll over indices or detect red cluster states.
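For item 1, one way to keep a single-node box out of the warm transition is a hot-only lifecycle. A minimal sketch with a hypothetical policy name and illustrative thresholds; the real SO2 policy names and rollover settings should be read from the running cluster first:

```bash
# Replace the policy with a hot-only lifecycle: roll over by size/age, delete later.
# Policy name and thresholds are illustrative, not the shipped SO2 values.
# If the wrapper does not send a JSON Content-Type header, add -H 'Content-Type: application/json'
# or use curl directly with the same body.
so-elasticsearch-query "_ilm/policy/so-logs-example" -XPUT -d '
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": { "max_primary_shard_size": "50gb", "max_age": "30d" }
        }
      },
      "delete": {
        "min_age": "90d",
        "actions": { "delete": {} }
      }
    }
  }
}'
```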
⸻
🛠️ Suggested Remediation Options
✅ Fix Now
• Include so-fix-red.sh or similar in /opt/so/scripts/ (a rough sketch follows this list) that:
• Detects and logs unassigned shards
• Performs safe rollovers on stale .ds-* data streams
• Removes or compresses old, non-write indices if needed
• Optionally posts to logs or webhook
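A rough sketch of what such a script could look like; this is not an existing SO2 tool, the data-stream name parsing is a heuristic, and it assumes so-elasticsearch-query and jq are available on the box:

```bash
#!/bin/bash
# so-fix-red.sh (sketch): detect unassigned primaries and roll over their data streams.
# Hypothetical helper, not shipped with Security Onion; review before using in production.

STATUS=$(so-elasticsearch-query "_cluster/health" | jq -r '.status')
echo "$(date -Is) cluster status: ${STATUS}"
[ "${STATUS}" != "red" ] && exit 0

# Backing indices with unassigned primaries, e.g.
# .ds-logs-endpoint.events.process-default-2025.01.01-000042
so-elasticsearch-query "_cat/shards?h=index,prirep,state" |
  awk '$2 == "p" && $3 == "UNASSIGNED" {print $1}' |
  sort -u |
  while read -r INDEX; do
    # Strip the .ds- prefix and the -<date>-<generation> suffix to recover the data stream name
    STREAM=$(echo "${INDEX}" | sed -E 's/^\.ds-//; s/-[0-9]{4}\.[0-9]{2}\.[0-9]{2}-[0-9]{6}$//')
    echo "$(date -Is) rolling over data stream ${STREAM} (unassigned primary in ${INDEX})"
    so-elasticsearch-query "${STREAM}/_rollover" -XPOST
  done
```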
🔄 Fix Long-Term
• Patch ILM policies to remove/skip warm/cold phases on single-node installs
• Warn users when they’re likely to hit this based on system profile
• Document cron-based healing (see the cron sketch after this list) or ship a toggle for self-healing jobs
• Improve error surface for red state in UI — right now users only see "Elasticsearch: Fault"
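For the cron-based healing mentioned above, a minimal sketch; the schedule, log path, and script location are arbitrary, and the script itself is the hypothetical so-fix-red.sh sketched in the previous section:

```bash
# /etc/cron.d/so-fix-red (sketch): check every 15 minutes and log the outcome
*/15 * * * * root /opt/so/scripts/so-fix-red.sh >> /var/log/so-fix-red.log 2>&1
```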
⸻
🧪 Repro Steps
1. Install SO2 in a single-node setup with default ILM, a 1 Gbps TAP link, and network syslog
2. Allow the system to ingest data over 2–3 months (Endpoint, Suricata, Zeek)
3. Observe the eventual symptoms (a health-probe sketch for logging the transition follows this list):
• status: red
• 1–3 unassigned .ds-logs-* shards
• Rollover errors and cluster degradation
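To log the moment the cluster degrades during this window, a one-line probe can be run hourly from cron; it assumes the so-elasticsearch-query wrapper and jq, and the log path is arbitrary:

```bash
# Append a timestamped one-line health summary; useful for correlating the red transition with ingest
echo "$(date -Is) $(so-elasticsearch-query "_cluster/health" | jq -r '"status=\(.status) unassigned=\(.unassigned_shards)"')" >> /var/log/so-health.log
```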
⸻
🔁 Workaround We’ve Deployed
🙏 Request
Can this be escalated as a known issue or receive a patch/workaround in upcoming releases?
Happy to provide:
• logs
• full curl output
• redacted _ilm/explain response
• Elastic Agent debug