Five Logstash gotchas that wedged me in production
Notes to future me after a week of cascading Logstash failures.
PQ on EFS/NFS will wedge on you
Logstash’s persistent queue uses .lock files and atomic rename() for checkpointing. Both go wrong on network filesystems: stale locks survive ungraceful task exits, and rename() can block forever in the kernel on metadata contention.
Thread dump from one wedge:
"Converge PipelineAction::Create<my-pipeline>" RUNNABLE, elapsed 1146s
at sun.nio.fs.UnixNativeDispatcher.rename0 (native)
at org.logstash.ackedqueue.io.FileCheckpointIO.write
at org.logstash.execution.AbstractPipelineExt.openQueue
19 minutes stuck in a syscall the JVM can’t interrupt. Move PQ to task-local storage and recover via sql_last_value plus a small overlap window.
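A minimal sketch of the relocation in logstash.yml, assuming the task has an ephemeral local mount at /ephemeral (the mount path is illustrative, not from the incident):

```yaml
# logstash.yml — keep the persistent queue off EFS/NFS
queue.type: persisted
# Task-local ephemeral storage: no network rename(), no stale .lock files.
# Lost on task replacement — hence the sql_last_value overlap on recovery.
path.queue: /ephemeral/logstash/queue
```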
:sql_last_value substitutes as a string
Fine for implicit coercion:
WHERE col > :sql_last_value
Breaks on arithmetic:
WHERE col > :sql_last_value - INTERVAL '60 seconds'
-- ERROR: invalid input syntax for type interval: "2026-04-21 …"
Cast it:
WHERE col > (:sql_last_value)::timestamptz - INTERVAL '60 seconds'
Epoch sentinel checks (:sql_last_value = '1970-01-01 …') stay uncasted.
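A hedged sketch of how the cast lands in the jdbc input (table and column names are made up for illustration):

```
input {
  jdbc {
    use_column_value => true
    tracking_column => "updated_at"
    tracking_column_type => "timestamp"
    # :sql_last_value substitutes as a quoted string, so cast it
    # before doing interval arithmetic on it
    statement => "SELECT * FROM events
                  WHERE updated_at > (:sql_last_value)::timestamptz - INTERVAL '60 seconds'"
  }
}
```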
DLQ input needs per-pipeline subdirs to already exist
The DLQ writer creates <path>/<pipeline_id>/ eagerly at pipeline start. The DLQ input doesn’t — it calls Files.newDirectoryStream in its constructor and throws:
Error: DLQ sub-path /usr/share/logstash/data/dead_letter_queue/<pipeline_id> does not exist
Exception: Java::JavaNioFile::NoSuchFileException
The plugin has no flag to tolerate or create a missing directory. On fresh storage, pre-create the subdirectories from the container entrypoint:
# Pull each pipeline_id out of the DLQ input config and pre-create its subdir
grep -B1 "path => '$DLQ_BASE'" "$DLQ_CONF" \
  | grep pipeline_id \
  | sed -E "s/.*pipeline_id => '([^']+)'.*/\1/" \
  | while read -r pid; do mkdir -p "$DLQ_BASE/$pid"; done
exec "$@"
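The grep above assumes each dead_letter_queue input block puts pipeline_id on the line directly above path, with single-quoted values, e.g. (pipeline name illustrative):

```
input {
  dead_letter_queue {
    pipeline_id => 'my-pipeline'
    path => '/usr/share/logstash/data/dead_letter_queue'
  }
}
```

If your config orders or quotes these differently, adjust the grep/sed to match.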
queue.max_bytes is a preflight disk check
Logstash sums it across every persisted-queue pipeline at startup and refuses to start if the total exceeds free disk:
Persistent queues require more disk space than is available:
- Total space required: 76gb
- Currently free space: 16gb
So queue.max_bytes: 1900mb × 40 pipelines = 76 GB required, even if actual usage is kilobytes. On Fargate (default 20 GB) this bites immediately. Size it to real burst capacity, not EFS-era headroom.
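A sketch of a right-sized per-pipeline entry in pipelines.yml (id, path, and sizes are illustrative):

```yaml
- pipeline.id: my-pipeline
  path.config: "/usr/share/logstash/pipeline/my-pipeline.conf"
  queue.type: persisted
  # 1900mb × 40 pipelines demanded 76 GB at startup;
  # 256mb × 40 = 10 GB passes the preflight check on a 20 GB task
  queue.max_bytes: 256mb
```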
Trailing ; + jdbc_paging_enabled = syntax error
With paging enabled, Logstash wraps your statement:
SELECT count(*) AS "count" FROM (<your statement>) AS "t1"
A trailing ; inside that subquery is a syntax error. Easy to miss when the query was developed in psql, where the trailing semicolon is habit.
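With paging on, the inner statement has to be subquery-safe. A sketch (table and column names illustrative):

```
input {
  jdbc {
    jdbc_paging_enabled => true
    jdbc_page_size => 10000
    # No trailing ';' — Logstash wraps this statement in
    # SELECT count(*) AS "count" FROM ( ... ) AS "t1"
    statement => "SELECT id, payload FROM events WHERE id > :sql_last_value"
  }
}
```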