Issue: #1080
An implementation of the LANraragi metrics exporter, where library, API and process-level metrics are conditionally collected and served through the `/api/metrics` endpoint in the Prometheus exposition format, with Redis as the shared metrics state. Metrics data is stored in Redis at db4, but we can change to an existing db if we want.
Most of the code was written by AI, then reviewed/rewritten by me. Architectural decisions were made by me.
Already tested this on personal prod environments for about a week, will continue doing so 👌
Demo screenshots and pretty pictures
Things you can do in Prometheus/Grafana, with the metrics provided
Configuring the metrics exporter settings
3rd-party implementations and shared state
There are at least two Perl implementations of a metrics exporter, Net::Prometheus and mojolicious-plugin-prometheus. The main issue with just using them is that we need shared state across workers. Net::Prometheus is lower-level but has no shared state, while mojolicious-plugin-prometheus has shared state but is higher-level and uses IPC.
On the other hand, with LRR I want to collect all the metrics, and we might as well use Redis for the shared state since that's what it's good at anyway, so I just decided to rawdog it.
Opt-in
Metrics collection and endpoint exposure are optional and opt-in (via the `enablemetrics` setting flag from config), and instructions for enabling metrics have been documented.
OS dependent
Each OS (e.g. macOS, Windows, Linux*) needs its own implementation of the process-level metrics collection. Currently only Linux process-level metrics are supported, but people are welcome to contribute for other OSes :)
OpenMetrics and general spec stuff
This implementation was written to be compliant with the OpenMetrics 1.0 specification, with minor adjustments to conform to Prometheus server capabilities (turns out that Prometheus doesn't actually support the OpenMetrics spec, despite the spec being the first thing that shows up when you research Prom exporter specifications...!), so there are a couple of deviations from the OpenMetrics spec. Still, OpenMetrics has some good practices, so most of its rules were followed. There's also OpenTelemetry, which is another thing entirely, but we're sticking mostly to the Prometheus exposition format.
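For reference, a scrape of `/api/metrics` returns plain text in roughly this shape (the metric names and values here are purely illustrative, not the exact ones exported by this PR):

```
# HELP lanraragi_archive_count Total number of archives in the library.
# TYPE lanraragi_archive_count gauge
lanraragi_archive_count 1337
# HELP lanraragi_http_requests_total Total HTTP requests handled, by route template.
# TYPE lanraragi_http_requests_total counter
lanraragi_http_requests_total{route="/api/archives/:id",method="GET"} 42
```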
Metric collection types
There are 3 broad categories of metrics being collected: API/HTTP, library, and process.
API
API/HTTP refers to metrics collected by a mojolicious worker handling a single HTTP request/endpoint. Metrics collected include duration and bytes sent/received.
The natural way to handle this passive collection is via mojo hooks.
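As a rough sketch of the hook wiring (not a copy of the PR code; `record_http_metrics` is a hypothetical helper standing in for whatever writes to the shared Redis state):

```perl
# Sketch: passive HTTP metrics collection via Mojolicious dispatch hooks.
use Time::HiRes qw(gettimeofday tv_interval);

$app->hook(before_dispatch => sub {
    my $c = shift;
    # Remember when this request started
    $c->stash('metrics.start_time' => [gettimeofday]);
});

$app->hook(after_dispatch => sub {
    my $c = shift;
    my $start = $c->stash('metrics.start_time') or return;

    # Group by the route template (e.g. "/archives/:id") rather than the raw
    # request path, to keep label cardinality bounded.
    my $endpoint = $c->match->endpoint;
    my $route    = $endpoint ? $endpoint->pattern->unparsed : 'unknown';

    # Hypothetical helper that persists the sample to the shared Redis state
    record_http_metrics(
        route          => $route,
        method         => $c->req->method,
        duration       => tv_interval($start),
        bytes_sent     => $c->res->headers->content_length // 0,
        bytes_received => $c->req->headers->content_length // 0,
    );
});
```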
Also, API metrics group requests by endpoint type. I.e., instead of the full `/api/archives/123456...` endpoint, we use the `/api/archives/:id` route template from `Routing.pm`. This is to avoid cardinality explosion.
Library
Library/stats refers to the stats mentioned in the initial issue: how many archives, how many pages, etc. These are usually values already aggregated by another worker process during a file monitor event, so the metrics can get this data for free and we don't need a separate periodic hook. (archive byte size, on the other hand...)
And since Prometheus servers periodically scrape metrics from LRR, it doesn't make sense for a metrics scrape to also trigger expensive calls that drag the whole server down, so it's best for the metrics API to be as lean as possible.
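A sketch of what "lean" means here, assuming the aggregated counts already sit in Redis under hypothetical key names (the real keys live elsewhere in the codebase): a scrape only does cheap point reads and string formatting, never a library scan.

```perl
# Sketch: render library gauges from values another worker already aggregated.
# The key names ("archive_count"/"page_count") are illustrative placeholders.
use Redis;

sub render_library_metrics {
    my $redis = Redis->new;    # connect to the shared Redis instance
    $redis->select(4);         # metrics state lives in db4 in this PR

    my $archives = $redis->get('archive_count') // 0;   # cheap O(1) reads only,
    my $pages    = $redis->get('page_count')    // 0;   # no library scans on scrape

    return join "\n",
        '# TYPE lanraragi_archive_count gauge',
        "lanraragi_archive_count $archives",
        '# TYPE lanraragi_page_count gauge',
        "lanraragi_page_count $pages",
        '';
}
```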
Process (Minion/Shinobu)
These are the CPU/memory/FD/IO metrics that one may find in node exporter. Process metrics collection is a passive "process", so it's done as a 30s recurring task.
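For the Linux implementation, the general idea is reading `/proc` for each process of interest; a hedged sketch (field positions per proc(5), not a copy of the PR code):

```perl
# Sketch: collect a few process-level metrics for a given PID from /proc (Linux only).
sub collect_proc_metrics {
    my ($pid) = @_;
    my %m;

    # Resident memory, from /proc/<pid>/status
    if (open my $fh, '<', "/proc/$pid/status") {
        while (<$fh>) {
            $m{rss_kb} = $1 if /^VmRSS:\s+(\d+)\s+kB/;
        }
        close $fh;
    }

    # Open file descriptor count, by listing /proc/<pid>/fd
    if (opendir my $dh, "/proc/$pid/fd") {
        $m{open_fds} = grep { !/^\.\.?$/ } readdir $dh;
        closedir $dh;
    }

    # CPU time: utime + stime from /proc/<pid>/stat, in clock ticks
    if (open my $fh, '<', "/proc/$pid/stat") {
        my $line = <$fh>;
        close $fh;
        # Strip "pid (comm) " first, since comm may itself contain spaces
        $line =~ s/^\d+\s+\(.*\)\s+//;
        my @f = split ' ', $line;
        $m{cpu_ticks} = $f[11] + $f[12];   # utime + stime after the strip
    }

    return \%m;
}
```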
DB Cleanups
There are 3 ways of doing cleanups for metrics: shutdown cleanup, startup cleanup, and TTL (continuous cleanup).
TTL isn't exactly a valid approach, because it violates the OpenMetrics specification's expectation that metrics should generally exist for the lifetime of the process (and it causes metrics to disappear). That leaves startup and shutdown cleanup, but startup is generally more reliable.
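A sketch of the startup-cleanup approach, assuming the metrics keep their own dedicated Redis DB (db4, as described above) so a single FLUSHDB at boot wipes whatever state a previous run left behind; whether the PR uses FLUSHDB or per-key deletes is an implementation detail not shown here.

```perl
# Sketch: wipe stale metrics state at application startup.
# Assumes metrics have a dedicated Redis DB (db4), so nothing else is affected.
use Redis;

sub clear_metrics_on_startup {
    my $redis = Redis->new;
    $redis->select(4);      # switch to the metrics DB
    $redis->flushdb;        # drop all leftover metrics keys from the previous run
}
```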