PrometheusRemoteWrite Exporter: Fixing The Broken WAL

by Felix Dubois

Hey folks! Today, we're diving into a pretty critical issue that some of you might have encountered while using the prometheusremotewrite exporter in the OpenTelemetry Collector (otelcol): a broken Write-Ahead Log (WAL). This can lead to metrics not being properly written to Prometheus, and nobody wants that!

Understanding the PrometheusRemoteWrite Exporter and WAL

First off, let's get some context. The prometheusremotewrite exporter is a crucial component in the OpenTelemetry Collector, acting as a bridge that sends your collected metrics to Prometheus, a widely used monitoring solution. Think of it as the messenger delivering your valuable data. The Write-Ahead Log (WAL) is its safety net: the exporter first persists incoming batches to a log file on disk before sending them to the remote-write endpoint, so if something goes wrong (like a crash), it can replay the data from the WAL instead of losing it. A broken WAL is therefore a big deal, because it compromises exactly that safety net.
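
To make that concrete, here's a minimal sketch of what enabling the WAL on this exporter looks like. The endpoint and directory below are placeholders, not values from the reported setup:

exporters:
  prometheusremotewrite:
    endpoint: http://prometheus.example.local:9090/api/v1/write   # placeholder endpoint
    wal:
      directory: /var/lib/otelcol/prw-wal   # placeholder path where WAL segments are persisted

With the wal block present, batches are staged on disk in that directory and replayed toward Prometheus; remove the block and the exporter sends each batch directly.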

The Problem: "Out of Order Sample" Errors

The core issue we're tackling is that the WAL in the prometheusremotewrite exporter has been, as some have put it, "broken for years". It shows up when the exporter sends metrics to the Prometheus API (/api/v1/write) and gets back a 400 Bad Request response. The error messages point to problems processing WAL entries, specifically an "out of order sample" error, and sometimes "out of bounds" errors as well. Both indicate that the data being written to Prometheus is not in the expected sequence, leading to rejected writes and data loss. Digging into the logs, you might see something like this:

2025-08-05T14:11:50.231Z [otelcol] 2025-08-05T14:11:50.231Z error prw.wal [email protected]/wal.go:245 error processing WAL entries {"resource": {"service.instance.id": "00ba5573-5bb4-4294-b1ca-1f84b32dbf29", "service.name": "otelcol", "service.version": "0.130.1"}, "otelcol.component.id": "prometheusremotewrite/0", "otelcol.component.kind": "exporter", "otelcol.signal": "metrics", "error": "Permanent error: Permanent error: Permanent error: remote write returned HTTP status 400 Bad Request; err = %!w(<nil>): out of order sample\n; Permanent error: Permanent error: Permanent error: remote write returned HTTP status 400 Bad Request; err = %!w(<nil>): out of order sample\n; Permanent error: Permanent error: Permanent error: remote write returned HTTP status 400 Bad Request; err = %!w(<nil>): out of order sample\n", "errorCauses": [{\"error\": \"Permanent error: Permanent error: Permanent error: remote write returned HTTP status 400 Bad Request; err = %!w(<nil>): out of order sample\\n\"}, {\"error\": \"Permanent error: Permanent error: Permanent error: remote write returned HTTP status 400 Bad Request; err = %!w(<nil>): out of order sample\\n\"}, {\"error\": \"Permanent error: Permanent error: Permanent error: remote write returned HTTP status 400 Bad Request; err = %!w(<nil>): out of order sample\\n\"}]}

This log snippet clearly shows the error processing WAL entries and the dreaded out of order sample message: Prometheus is receiving samples whose timestamps don't follow the chronological order it expects within each time series, so it rejects the writes. That's especially frustrating because the WAL exists precisely to safeguard against data loss during unexpected interruptions; when it malfunctions, metrics intended for Prometheus may never be written correctly, leaving gaps in your monitoring. For a time-series database like Prometheus, out-of-order samples are a hard error, so the rejected writes translate into missing data, inaccurate dashboards, and unreliable alerts. Resolving this issue is therefore essential to keeping your monitoring pipeline trustworthy.
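
As an aside, recent Prometheus releases (2.39 and later, still marked experimental in some versions) expose a server-side knob that tolerates a bounded amount of out-of-order ingestion. It does not fix the exporter's WAL and is not part of the reported setup; this is only a sketch of the option, so check the Prometheus docs for your version before relying on it:

# prometheus.yml (Prometheus server configuration, not the collector's)
storage:
  tsdb:
    out_of_order_time_window: 30m   # accept samples up to 30 minutes out of chronological order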

The Solution: Disabling the WAL (For Now)

Now, for the good news! There's a relatively straightforward workaround. By removing the WAL configuration from your prometheusremotewrite exporter, you can often resolve this issue. This essentially tells the exporter to bypass the WAL and write directly to Prometheus. Here's how you can do it:

exporters:
  prometheusremotewrite/0:
    endpoint: http://prom-0.prom-endpoints.how-to.svc.cluster.local:9090/api/v1/write
    tls:
      insecure_skip_verify: false
-   wal:
-     directory: /otelcol

By commenting out or removing the wal section in your OpenTelemetry Collector configuration, you instruct the exporter to bypass the problematic WAL functionality. It might seem counterintuitive to disable a feature designed for data durability, but it's a pragmatic answer to the immediate problem of metrics failing to reach Prometheus. Be aware of the trade-off: without the WAL, a crash or restart of the collector can drop whatever metrics were in flight. Treat this as a temporary fix, keep an eye on your system's stability, and plan to re-enable the WAL once it can run without producing out-of-order samples. Until a proper fix is implemented, this workaround keeps your metrics flowing to Prometheus and your monitoring and alerting intact.
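
If you want to claw back a little resilience after dropping the WAL, the exporter's queue and retry settings are worth a look. This is a sketch under the assumption that your collector version supports remote_write_queue and retry_on_failure as documented for this exporter; note that the queue lives in memory only, so it does not survive a crash the way the WAL is meant to:

exporters:
  prometheusremotewrite/0:
    endpoint: http://prom-0.prom-endpoints.how-to.svc.cluster.local:9090/api/v1/write
    tls:
      insecure_skip_verify: false
    # wal: removed as the workaround
    remote_write_queue:
      enabled: true
      queue_size: 10000     # samples buffered in memory while Prometheus is slow or briefly down
      num_consumers: 5      # parallel senders draining the queue
    retry_on_failure:
      enabled: true
      initial_interval: 5s
      max_interval: 30s
      max_elapsed_time: 5m  # give up on a batch after five minutes of retries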

Reproducing the Issue

To give you a clearer picture, let's talk about how this issue was reproduced. The setup involves using Juju, a tool for deploying and managing applications, to set up the infrastructure. This typically includes deploying a metrics source (like Alertmanager), a metrics sink (like Prometheus), and configuring them within the otel-collector's receivers and exporters. In essence, it's a common monitoring pipeline setup. If you're using a similar setup, you might be susceptible to this WAL issue.

Steps to Reproduce:

  1. Deploy a metrics source (e.g., Alertmanager).
  2. Deploy a metrics sink (e.g., Prometheus).
  3. Configure these in the otel-collector receivers and exporters (a minimal wiring sketch follows this list).
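
Here's roughly what that step-3 wiring can look like, as a hedged sketch: the job name is hypothetical, and the endpoints simply reuse the cluster-local addresses from the reported setup.

receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: alertmanager-scrape   # hypothetical job name
          static_configs:
            - targets:
                - am-0.am-endpoints.how-to.svc.cluster.local:9093
exporters:
  prometheusremotewrite/0:
    endpoint: http://prom-0.prom-endpoints.how-to.svc.cluster.local:9090/api/v1/write
    wal:
      directory: /otelcol
service:
  pipelines:
    metrics:
      receivers: [prometheus]
      exporters: [prometheusremotewrite/0]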

When everything is set up, the expectation is that metrics flow seamlessly into Prometheus, with the WAL ensuring data integrity. Instead, the otel-collector logs fill with errors like the ones shown above, hinting at a broken WAL. The logs are the key indicator that something isn't right; without them, you might not realize your metrics pipeline is quietly failing. Regularly checking the otel-collector logs is therefore a crucial habit: it lets you catch errors like "out of order sample" early, understand the root cause, and take corrective action before they turn into significant data loss or monitoring gaps.

Environment Details

For those who are curious about the specific environment where this issue was observed, here are the details:

  • Collector Version: 0.130.1
  • OS: Ubuntu 24.04.2 LTS

This information can be helpful if you're trying to reproduce the issue or if you're running a similar setup. It's always good to know if the problem is specific to certain versions or environments.

OpenTelemetry Collector Configuration

Here's the OpenTelemetry Collector configuration used in the reported scenario. This configuration includes the receivers, processors, exporters, extensions, and service definitions. Pay close attention to the prometheusremotewrite/0 exporter configuration, where the WAL is defined:

connectors: {}
exporters:
  debug:
    verbosity: basic
  prometheusremotewrite/0:
    endpoint: http://prom-0.prom-endpoints.how-to.svc.cluster.local:9090/api/v1/write
    tls:
      insecure_skip_verify: false
    wal:
      directory: /otelcol
extensions:
  file_storage:
    directory: /otelcol
  health_check:
    endpoint: 0.0.0.0:13133
processors:
  attributes:
    actions:
      - action: upsert
        key: loki.attribute.labels
        value: container, job, filename, juju_application, juju_charm, juju_model, juju_model_uuid, juju_unit, snap_name, path
  resource:
    attributes:
      - action: insert
        key: loki.format
        value: raw
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
  prometheus:
    config:
      scrape_configs:
        - job_name: juju_how-to_7b30903e_otelcol_self-monitoring
          scrape_interval: 60s
          static_configs:
            - labels:
                instance: how-to_7b30903e_otelcol_otelcol/0
                juju_application: otelcol
                juju_charm: opentelemetry-collector-k8s
                juju_model: how-to
                juju_model_uuid: 7b30903e-8941-4a40-864c-0cbbf277c57f
                juju_unit: otelcol/0
              targets:
                - 0.0.0.0:8888
        - job_name: juju_how-to_7b30903e_am_prometheus_scrape
          metrics_path: /metrics
          relabel_configs:
            - regex: (.*)
              separator: _
              source_labels:
                - juju_model
                - juju_model_uuid
                - juju_application
              target_label: instance
          scheme: http
          static_configs:
            - labels:
                juju_application: am
                juju_charm: alertmanager-k8s
                juju_model: how-to
                juju_model_uuid: 7b30903e-8941-4a40-864c-0cbbf277c57f
              targets:
                - am-0.am-endpoints.how-to.svc.cluster.local:9093
          tls_config:
            insecure_skip_verify: false
service:
  extensions:
    - health_check
    - file_storage
  pipelines:
    logs:
      exporters:
        - debug
      processors:
        - resource
        - attributes
      receivers:
        - otlp
    metrics:
      exporters:
        - prometheusremotewrite/0
      receivers:
        - otlp
        - prometheus
    traces:
      exporters:
        - debug
      receivers:
        - otlp
  telemetry:
    logs:
      level: DEBUG
    metrics:
      level: normal

This configuration gives a complete picture of how the OpenTelemetry Collector is set up in this scenario: the receivers, processors, and exporters together define how metrics flow through the system. The prometheusremotewrite/0 exporter is the part to focus on, since that's where the wal section lives; its directory key tells the exporter where to store the Write-Ahead Log files. When the WAL runs into the "out of order sample" error, it stalls the entire metrics pipeline, which is why the workaround removes the wal block and lets the exporter send directly to Prometheus. As mentioned earlier, that's a temporary measure; a proper fix for the WAL is what's needed for long-term data reliability.
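
For completeness, the wal block accepts a couple of documented tuning options beyond directory. Here's a sketch of what they look like, with the caveat that the option names should be verified against your collector version and that tuning them is not reported to make the out-of-order errors go away:

exporters:
  prometheusremotewrite/0:
    # ...endpoint and tls as above...
    wal:
      directory: /otelcol
      buffer_size: 300        # optional: number of WAL entries read into memory per batch
      truncate_frequency: 1m  # optional: how often the WAL is truncated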

Log Output Snippet

To further illustrate the problem, here's a snippet of the log output that clearly shows the WAL-related errors:

2025-08-05T14:11:50.231Z [otelcol] 2025-08-05T14:11:50.231Z     error   prw.wal [email protected]/wal.go:245    error processing WAL entries    {"resource": {"service.instance.id": "00ba5573-5bb4-4294-b1ca-1f84b32dbf29", "service.name": "otelcol", "service.version": "0.130.1"}, "otelcol.component.id": "prometheusremotewrite/0", "otelcol.component.kind": "exporter", "otelcol.signal": "metrics", "error": "Permanent error: Permanent error: Permanent error: remote write returned HTTP status 400 Bad Request; err = %!w(<nil>): out of order sample\n; Permanent error: Permanent error: Permanent error: remote write returned HTTP status 400 Bad Request; err = %!w(<nil>): out of order sample\n; Permanent error: Permanent error: Permanent error: remote write returned HTTP status 400 Bad Request; err = %!w(<nil>): out of order sample\n", "errorCauses": [{\"error\": \"Permanent error: Permanent error: Permanent error: remote write returned HTTP status 400 Bad Request; err = %!w(<nil>): out of order sample\\n\"}, {\"error\": \"Permanent error: Permanent error: Permanent error: remote write returned HTTP status 400 Bad Request; err = %!w(<nil>): out of order sample\\n\"}, {\"error\": \"Permanent error: Permanent error: Permanent error: remote write returned HTTP status 400 Bad Request; err = %!w(<nil>): out of order sample\\n\"}]}

This log output is the critical piece of evidence when troubleshooting the WAL issue: it explicitly reports an error processing WAL entries together with the out of order sample message, both strong indicators that the WAL is not functioning correctly. It also carries useful context, such as the component ID (prometheusremotewrite/0), the component kind (exporter), and the signal being processed (metrics), which helps you pinpoint exactly where in your OpenTelemetry Collector setup the failure sits. With that confirmation that the WAL is the source of the problem, disabling it becomes a reasonable workaround. Keep monitoring your logs afterwards, though, so you catch any recurrence of the issue or any new problems that arise.
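
One simple troubleshooting aid, assuming you can tolerate the extra log volume: temporarily add the existing debug exporter to the metrics pipeline so the collector prints what it is trying to send alongside the WAL errors. A sketch of that pipeline change:

service:
  pipelines:
    metrics:
      receivers:
        - otlp
        - prometheus
      exporters:
        - prometheusremotewrite/0
        - debug   # temporary: surfaces the outgoing metrics stream in the collector logs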

Conclusion and Next Steps

So, there you have it! A deep dive into the prometheusremotewrite exporter's broken WAL, the "out of order sample" errors, and a temporary solution. While disabling the WAL might seem like a quick fix, it's crucial to remember that this is a workaround, not a permanent solution. The long-term goal is to have a properly functioning WAL that ensures data durability. In the meantime, keep an eye on your logs, and stay tuned for updates on a proper fix. Has anyone else encountered this issue? What solutions have you found? Let's discuss in the comments below!