Log Collection Guide

This guide covers log configuration, formats, and querying. For the complete observability setup including OpenTelemetry, metrics, and Grafana dashboards, see Monitoring Guide.

Quick Start

# Point to OpenTelemetry Collector (local or Grafana Cloud)
export OTLP_ENDPOINT=http://localhost:4317

wealth run

Logs, metrics, and traces are exported via OTLP gRPC. For complete OpenTelemetry setup, see Monitoring Guide - OpenTelemetry Metrics.
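
Before starting the bot, it can be worth confirming that a collector is actually listening on the configured endpoint; the check below is a minimal sketch that assumes the local default port 4317 (adjust for Grafana Cloud or a remote collector).

# Optional pre-flight check: confirm something is listening on the OTLP gRPC port
nc -zv localhost 4317 || echo "Nothing listening on 4317 - OTLP export will fail"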

Local-Only Mode

If OTLP_ENDPOINT is not set, logs are written to stdout/stderr:

# Optional: Enable file logging
export WEALTH__OBSERVABILITY__LOG_FILE=/tmp/wealth.log

wealth run
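
With file logging enabled, the file can be followed directly while the bot runs; a minimal sketch assuming the /tmp/wealth.log path set above:

# Follow the log file and surface warnings and errors as they happen
tail -f /tmp/wealth.log | grep --line-buffered -E 'ERROR|WARN'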

Log Query Examples

Basic Queries

# All logs from wealth bot
{job="wealth-bot"}

# Filter by log level
{job="wealth-bot"} |= "ERROR"
{job="wealth-bot"} |= "WARN"
{job="wealth-bot"} |= "INFO"

# Filter by level label (if parsed)
{job="wealth-bot", level="ERROR"}
{job="wealth-bot", level="INFO"}

# Filter by module
{job="wealth-bot"} |= "wealth::strategy"
{job="wealth-bot"} |= "wealth::execution"
{job="wealth-bot"} |= "wealth::market_data"

# Search for specific text
{job="wealth-bot"} |= "WebSocket"
{job="wealth-bot"} |= "arbitrage"
{job="wealth-bot"} |= "position"

Advanced Queries

# Count of errors in the last hour
count_over_time({job="wealth-bot", level="ERROR"}[1h])

# Per-second error rate, averaged over 5-minute windows
rate({job="wealth-bot", level="ERROR"}[5m])

# Count logs by level
sum by (level) (count_over_time({job="wealth-bot"}[1h]))

# Logs containing correlation_id
{job="wealth-bot"} |~ "correlation_id=\\w+"

# WebSocket connection issues
{job="wealth-bot"} |~ "WebSocket.*error|WebSocket.*failed|WebSocket.*timeout"

# Order execution logs
{job="wealth-bot"} |= "Executing arbitrage" or |= "Order placed"

# Strategy-related logs
{job="wealth-bot", module=~"wealth::strategy.*"}

All key log messages include a structured event field for precise filtering:

# Parse JSON and filter by event type
{service="wealth-bot"} | json | event="opportunity_detected"
{service="wealth-bot"} | json | event="arbitrage_executed"
{service="wealth-bot"} | json | event="position_close_succeeded"

# Trade skipped events (all use _skipped suffix)
{service="wealth-bot"} | json | event="quantity_validation_failed_skipped"
{service="wealth-bot"} | json | event="precision_mismatch_skipped"
{service="wealth-bot"} | json | event="unhedged_positions_skipped"
{service="wealth-bot"} | json | event="insufficient_balance_skipped"

# All skipped events
{service="wealth-bot"} | json | event=~".*_skipped"

# All position lifecycle events
{service="wealth-bot"} | json | event=~"position_.*"

# WebSocket and connection events
{service="wealth-bot"} | json | event=~"websocket_.*"

# Circuit breaker activity
{service="wealth-bot"} | json | event=~"circuit_breaker_.*"

# Error events requiring attention
{service="wealth-bot"} | json | event="unhedged_position_detected"
{service="wealth-bot"} | json | event="size_discrepancy_detected"

# Count opportunities vs executions
sum by (event) (count_over_time({service="wealth-bot"} | json | event=~"opportunity_detected|arbitrage_executed" [1h]))

See Loki JSON Parsing Guide for the complete list of 150+ event types.
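
For a quick command-line check of these event counts, Loki's instant-query endpoint accepts the same metric-style expressions; a sketch assuming Loki on localhost:3100 and jq installed:

# Count opportunities vs executions over the last hour from the command line
curl -s -G http://localhost:3100/loki/api/v1/query \
  --data-urlencode 'query=sum by (event) (count_over_time({service="wealth-bot"} | json | event=~"opportunity_detected|arbitrage_executed" [1h]))' \
  | jq -r '.data.result[] | "\(.metric.event): \(.value[1])"'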

Pattern Matching

# Extract values from logs using regex
{job="wealth-bot"} 
  | regexp "correlation_id=(?P<cid>\\w+)" 
  | line_format "{{.cid}}: {{__line__}}"

# Parse structured data
{job="wealth-bot"} 
  | pattern `<_> <level> <module>: <message>`
  | level = "ERROR"

Creating a Logs Dashboard

1. Log Volume Panel

Query:

sum(rate({job="wealth-bot"}[1m])) by (level)

Visualization: Time series graph showing log rate by level

2. Error Rate Panel

Query:

sum(rate({job="wealth-bot", level="ERROR"}[5m]))

Visualization: Stat panel with alert threshold at > 0

3. Recent Errors Table

Query:

{job="wealth-bot", level="ERROR"}

Visualization: Logs panel (table view)
Options: Show time, level, and message columns

4. Log Level Distribution

Query:

sum by (level) (count_over_time({job="wealth-bot"}[1h]))

Visualization: Pie chart

5. Module Activity

Query:

sum by (module) (count_over_time({job="wealth-bot"}[1h]))

Visualization: Bar chart
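
Panels like these can also be provisioned through Grafana's dashboard HTTP API instead of the UI. The request below is a hypothetical minimal example (dashboard title, credentials, and panel layout are assumptions) that creates a single "Recent Errors" logs panel, assuming Grafana on localhost:3000 with default admin credentials and Loki as the default datasource.

# Create a one-panel logs dashboard via the Grafana API (adjust credentials/URL)
curl -s -X POST http://admin:admin@localhost:3000/api/dashboards/db \
  -H 'Content-Type: application/json' \
  -d '{
    "dashboard": {
      "title": "Wealth Bot Logs",
      "panels": [{
        "type": "logs",
        "title": "Recent Errors",
        "gridPos": {"h": 10, "w": 24, "x": 0, "y": 0},
        "targets": [{"expr": "{job=\"wealth-bot\", level=\"ERROR\"}"}]
      }]
    },
    "overwrite": true
  }'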

Log File Rotation

To prevent log files from growing too large:

Using logrotate (Linux)

Create /etc/logrotate.d/wealth:

/tmp/wealth*.log {
    daily
    rotate 7
    compress
    delaycompress
    missingok
    notifempty
    create 0640 thiras thiras
    postrotate
        # Send SIGHUP to reload (if bot supports it)
        # pkill -HUP -f wealth || true
    endscript
}

Test configuration:

sudo logrotate -d /etc/logrotate.d/wealth
sudo logrotate -f /etc/logrotate.d/wealth

Using truncate

Simple script to truncate logs periodically:

#!/bin/bash
# truncate-logs.sh

LOG_FILE="/tmp/wealth.log"
MAX_SIZE_MB=100

if [ -f "$LOG_FILE" ]; then
    SIZE=$(du -m "$LOG_FILE" | cut -f1)
    if [ "$SIZE" -gt "$MAX_SIZE_MB" ]; then
        echo "Truncating $LOG_FILE (${SIZE}MB > ${MAX_SIZE_MB}MB)"
        > "$LOG_FILE"
    fi
fi

Add to crontab:

# Run every hour
0 * * * * /path/to/truncate-logs.sh

Alerting on Logs

Create Alert Rule in Grafana

  1. Go to Alerting → Alert rules → New alert rule
  2. Set query:
    sum(rate({job="wealth-bot", level="ERROR"}[5m])) > 0
    
  3. Set evaluation interval: 1m
  4. Set condition: Alert when query result > 0
  5. Add notification channel (email, Slack, etc.)

Common Alert Rules

High Error Rate:

sum(rate({job="wealth-bot", level="ERROR"}[5m])) > 0.1

WebSocket Connection Failures:

sum(count_over_time({job="wealth-bot"} |= "WebSocket" |= "failed" [5m])) > 3

No Logs Received (Bot Down):

absent_over_time({job="wealth-bot"}[5m]) == 1
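
Each expression can be evaluated against Loki before it is turned into an alert rule, to confirm it returns the value you expect; a sketch for the error-rate rule, assuming Loki on localhost:3100:

# Evaluate the error-rate expression as an instant query
curl -s -G http://localhost:3100/loki/api/v1/query \
  --data-urlencode 'query=sum(rate({job="wealth-bot", level="ERROR"}[5m]))'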

Troubleshooting

Pull Model (Promtail)

No logs appearing in Loki

  1. Check Promtail is running:

    docker compose ps promtail
    docker compose logs promtail
    
  2. Verify log file exists and is readable:

    ls -lah /tmp/wealth.log
    tail -f /tmp/wealth.log
    
  3. Check Promtail positions file:

    docker compose exec promtail cat /tmp/positions.yaml
    
  4. Test Loki directly:

    curl -s http://localhost:3100/loki/api/v1/label/job/values
    

Logs not parsing correctly

  1. Check log format matches regex:

    # Example log line
    echo "2025-11-10T01:23:45.123456Z  INFO wealth::strategy: Message" | \
      grep -oP '^\S+\s+\w+\s+[\w:]+:\s+.*$'
    
  2. View Promtail debug logs:

    docker compose logs promtail | grep -i error
    

Push Model (Loki Direct)

No logs appearing in Loki via push

  1. Check Loki is running:

    docker compose ps loki
    docker compose logs loki
    
    # Check Loki health
    curl http://localhost:3100/ready
    
  2. Verify Loki endpoint is reachable:

    # Test HTTP endpoint (note: the /loki/api/v1/push path is added by the library)
    curl -v http://localhost:3100/ready
    
  3. Check bot is configured correctly:

    # Verify environment variable is set
    echo $OTLP_ENDPOINT
    
    # Should see startup message when running bot:
    # "OpenTelemetry initialized with endpoint: http://localhost:4317"
    
  4. Check Loki logs for errors:

    docker compose logs loki | grep -i error
    docker compose logs loki | grep -i "push"
    
  5. Test OpenTelemetry Collector health:

    # Check OTLP receiver is responding
    curl http://localhost:13133/
    
    # Check metrics being exported to Prometheus
    curl http://localhost:8889/metrics | grep wealth
    

Push connection timeouts

  1. Check network connectivity:

    # Test OTLP gRPC endpoint
    telnet localhost 4317
    
    # Or check if port is listening
    nc -zv localhost 4317
    
  2. Check Docker network:

    docker network inspect wealth_monitoring
    
  3. Check OpenTelemetry Collector configuration:

    # View collector logs for errors
    docker compose logs otel-collector
    
    # Verify collector config (in compose.yml)
    docker compose config | grep -A 20 otel-collector
    

Logs delayed or missing

  1. Check OTLP export is working:

    • OpenTelemetry batches logs before sending
    • Default batch timeout is 10 seconds
    • Check bot logs for OTLP export errors
  2. Monitor OpenTelemetry Collector:

    # Check collector is receiving telemetry
    docker compose logs otel-collector | grep -i "logs"
    
    # Check collector metrics
    curl http://localhost:8888/metrics | grep otelcol_receiver
    
  3. Verify labels are correct:

    # Check available labels in Loki
    curl http://localhost:3100/loki/api/v1/labels
    
    # Check values for 'service' label
    curl http://localhost:3100/loki/api/v1/label/service/values
    

General Issues

Performance issues

  1. Check Loki disk usage:

    docker compose exec loki df -h /loki
    
  2. Limit log retention in Loki config:

    • Edit Loki config to set retention period
    • Default: unlimited (until disk full)

Advanced: JSON Logging

For better log parsing and indexing, JSON logging is supported. This is configured automatically when using OTLP export.

Update Promtail Config

In compose.yml, update the pipeline_stages:

pipeline_stages:
  - json:
      expressions:
        timestamp: timestamp
        level: level
        message: message
        module: target
        span: span
        correlation_id: fields.correlation_id
  - labels:
      level:
      module:
  - timestamp:
      source: timestamp
      format: RFC3339Nano
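
Pipeline changes can be dry-run against real log lines before restarting the stack; the sketch below assumes a local promtail binary and a standalone config file (here hypothetically named promtail-config.yaml, since in this setup the stages live in compose.yml):

# Dry-run the pipeline stages against existing log lines without sending to Loki
cat /tmp/wealth.log | promtail --stdin --dry-run --config.file=promtail-config.yaml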

Log Retention

Loki stores logs with automatic compaction. Configure retention in compose.yml:

loki:
  command: 
    - -config.file=/etc/loki/local-config.yaml
    - -config.expand-env=true
  environment:
    - LOKI_RETENTION_PERIOD=30d

Or create a custom Loki config with retention limits.
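
After changing retention settings, the effective configuration can be read back from Loki to confirm the new value was applied; a quick check assuming Loki on localhost:3100:

# Inspect the running Loki configuration for retention settings
curl -s http://localhost:3100/config | grep -i retention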

Best Practices

  1. Use the push model for production - Lower latency than file tailing and no Promtail agent to manage
  2. Keep file logging for debugging - Hybrid mode provides redundancy
  3. Use structured logging - Include correlation_id, operation, etc.
  4. Set appropriate log levels - Use DEBUG for development, INFO for production
  5. Create dashboards - Visualize key metrics from logs
  6. Set up alerts - Get notified of critical errors
  7. Index important fields - Add labels for common filters (level, module)
  8. Monitor Loki performance - Check ingestion rate and query latency
  9. Configure log retention - Balance storage costs with retention needs
  10. Use correlation IDs - Automatically included in logs for tracing

Comparison: Pull vs Push

| Aspect | Pull (Promtail) | Push (Loki Direct) |
|---|---|---|
| Setup Complexity | Simple | Simpler (no Promtail needed) |
| Latency | 5-10 seconds | < 1 second |
| Disk I/O | Required (log files) | Optional |
| Network Efficiency | Lower (file polling) | Higher (batched HTTP) |
| Reliability | File-based buffering | In-memory buffering |
| Scalability | One agent per host | Direct to Loki |
| Dependencies | Promtail service | None (built into bot) |
| Production Ready | ✓ | ✓ (recommended) |

Migration Path: Pull → Push

  1. Phase 1: Enable OpenTelemetry OTLP export

    # Keep existing file logging if desired
    export WEALTH__OBSERVABILITY__LOG_FILE=/tmp/wealth.log
    
    # Add OTLP endpoint
    export OTLP_ENDPOINT=http://localhost:4317
    
    wealth run
    
  2. Phase 2: Verify OTLP export in Grafana

    • Check logs appear in Loki via Grafana Explore
    • Verify metrics in Prometheus
    • Check traces in Tempo
    • Confirm correlation between logs/metrics/traces
  3. Phase 3: Disable file logging (optional)

    # Remove file logging for OTLP-only mode
    unset WEALTH__OBSERVABILITY__LOG_FILE
    
    # Keep OTLP export
    export OTLP_ENDPOINT=http://localhost:4317
    
    wealth run
    
  4. Phase 4: Production deployment

    # Ensure all observability services are running
    docker compose up -d
    
    # Configure bot for OTLP
    export OTLP_ENDPOINT=http://localhost:4317
    export OTEL_RESOURCE_ATTRIBUTES="service.name=wealth-bot,deployment.environment=production"
    
    wealth run
    

External Resources