Log Collection Guide

This guide covers log configuration, formats, and querying. For the complete observability setup including OpenTelemetry, metrics, and Grafana dashboards, see Monitoring Guide.

Quick Start

# Point to OpenTelemetry Collector (local or Grafana Cloud)
export OTLP_ENDPOINT=http://localhost:4317

wealth run

Logs, metrics, and traces are exported via OTLP gRPC. For complete OpenTelemetry setup, see Monitoring Guide - OpenTelemetry Metrics.
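
Before starting the bot, it can be worth confirming that a collector is actually listening on the configured endpoint; the check below is a minimal sketch that assumes the local default port 4317 (adjust for Grafana Cloud or a remote collector).

# Optional pre-flight check: confirm something is listening on the OTLP gRPC port
nc -zv localhost 4317 || echo "Nothing listening on 4317 - OTLP export will fail"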

Local-Only Mode

If OTLP_ENDPOINT is not set, logs are written to stdout/stderr:

# Optional: Enable file logging
export WEALTH__OBSERVABILITY__LOG_FILE=/tmp/wealth.log

wealth run
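
With file logging enabled, the file can be followed directly while the bot runs; a minimal sketch assuming the /tmp/wealth.log path set above:

# Follow the log file and surface warnings and errors as they happen
tail -f /tmp/wealth.log | grep --line-buffered -E 'ERROR|WARN'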

Log Query Examples

Basic Queries

# All logs from wealth bot
{job="wealth-bot"}

# Filter by log level
{job="wealth-bot"} |= "ERROR"
{job="wealth-bot"} |= "WARN"
{job="wealth-bot"} |= "INFO"

# Filter by level label (if parsed)
{job="wealth-bot", level="ERROR"}
{job="wealth-bot", level="INFO"}

# Filter by module
{job="wealth-bot"} |= "wealth::strategy"
{job="wealth-bot"} |= "wealth::execution"
{job="wealth-bot"} |= "wealth::market_data"

# Search for specific text
{job="wealth-bot"} |= "WebSocket"
{job="wealth-bot"} |= "arbitrage"
{job="wealth-bot"} |= "position"

Advanced Queries

# Count of errors in the last hour
count_over_time({job="wealth-bot", level="ERROR"}[1h])

# Per-second error rate, averaged over 5-minute windows
rate({job="wealth-bot", level="ERROR"}[5m])

# Count logs by level
sum by (level) (count_over_time({job="wealth-bot"}[1h]))

# Logs containing correlation_id
{job="wealth-bot"} |~ "correlation_id=\\w+"

# WebSocket connection issues
{job="wealth-bot"} |~ "WebSocket.*error|WebSocket.*failed|WebSocket.*timeout"

# Order execution logs
{job="wealth-bot"} |= "Executing arbitrage" or |= "Order placed"

# Strategy-related logs
{job="wealth-bot", module=~"wealth::strategy.*"}

All key log messages include a structured event field for precise filtering:

# Parse JSON and filter by event type
{service="wealth-bot"} | json | event="opportunity_detected"
{service="wealth-bot"} | json | event="arbitrage_executed"
{service="wealth-bot"} | json | event="position_close_succeeded"

# Trade skipped events (all use _skipped suffix)
{service="wealth-bot"} | json | event="quantity_validation_failed_skipped"
{service="wealth-bot"} | json | event="precision_mismatch_skipped"
{service="wealth-bot"} | json | event="unhedged_positions_skipped"
{service="wealth-bot"} | json | event="insufficient_balance_skipped"

# All skipped events
{service="wealth-bot"} | json | event=~".*_skipped"

# All position lifecycle events
{service="wealth-bot"} | json | event=~"position_.*"

# WebSocket and connection events
{service="wealth-bot"} | json | event=~"websocket_.*"

# Circuit breaker activity
{service="wealth-bot"} | json | event=~"circuit_breaker_.*"

# Error events requiring attention
{service="wealth-bot"} | json | event="unhedged_position_detected"
{service="wealth-bot"} | json | event="size_discrepancy_detected"

# Count opportunities vs executions
sum by (event) (count_over_time({service="wealth-bot"} | json | event=~"opportunity_detected|arbitrage_executed" [1h]))

See Loki JSON Parsing Guide for the complete list of 150+ event types.
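
For a quick command-line check of these event counts, Loki's instant-query endpoint accepts the same metric-style expressions; a sketch assuming Loki on localhost:3100 and jq installed:

# Count opportunities vs executions over the last hour from the command line
curl -s -G http://localhost:3100/loki/api/v1/query \
  --data-urlencode 'query=sum by (event) (count_over_time({service="wealth-bot"} | json | event=~"opportunity_detected|arbitrage_executed" [1h]))' \
  | jq -r '.data.result[] | "\(.metric.event): \(.value[1])"'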

Pattern Matching

# Extract values from logs using regex
{job="wealth-bot"} 
  | regexp "correlation_id=(?P<cid>\\w+)" 
  | line_format "{{.cid}}: {{__line__}}"

# Parse structured data
{job="wealth-bot"} 
  | pattern `<_> <level> <module>: <message>`
  | level = "ERROR"

Creating a Logs Dashboard

1. Log Volume Panel

Query:

sum(rate({job="wealth-bot"}[1m])) by (level)

Visualization: Time series graph showing log rate by level

2. Error Rate Panel

Query:

sum(rate({job="wealth-bot", level="ERROR"}[5m]))

Visualization: Stat panel with alert threshold at > 0

3. Recent Errors Table

Query:

{job="wealth-bot", level="ERROR"}

Visualization: Logs panel (table view)
Options: Show time, level, and message columns

4. Log Level Distribution

Query:

sum by (level) (count_over_time({job="wealth-bot"}[1h]))

Visualization: Pie chart

5. Module Activity

Query:

sum by (module) (count_over_time({job="wealth-bot"}[1h]))

Visualization: Bar chart
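
Panels like these can also be provisioned through Grafana's dashboard HTTP API instead of the UI. The request below is a hypothetical minimal example (dashboard title, credentials, and panel layout are assumptions) that creates a single "Recent Errors" logs panel, assuming Grafana on localhost:3000 with default admin credentials and Loki as the default datasource.

# Create a one-panel logs dashboard via the Grafana API (adjust credentials/URL)
curl -s -X POST http://admin:admin@localhost:3000/api/dashboards/db \
  -H 'Content-Type: application/json' \
  -d '{
    "dashboard": {
      "title": "Wealth Bot Logs",
      "panels": [{
        "type": "logs",
        "title": "Recent Errors",
        "gridPos": {"h": 10, "w": 24, "x": 0, "y": 0},
        "targets": [{"expr": "{job=\"wealth-bot\", level=\"ERROR\"}"}]
      }]
    },
    "overwrite": true
  }'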

Log File Rotation

To prevent log files from growing too large:

Using logrotate (Linux)

Create /etc/logrotate.d/wealth:

/tmp/wealth*.log {
    daily
    rotate 7
    compress
    delaycompress
    missingok
    notifempty
    create 0640 thiras thiras
    postrotate
        # Send SIGHUP to reload (if bot supports it)
        # pkill -HUP -f wealth || true
    endscript
}

Test configuration:

sudo logrotate -d /etc/logrotate.d/wealth
sudo logrotate -f /etc/logrotate.d/wealth

Using truncate

Simple script to truncate logs periodically:

#!/bin/bash
# truncate-logs.sh

LOG_FILE="/tmp/wealth.log"
MAX_SIZE_MB=100

if [ -f "$LOG_FILE" ]; then
    SIZE=$(du -m "$LOG_FILE" | cut -f1)
    if [ "$SIZE" -gt "$MAX_SIZE_MB" ]; then
        echo "Truncating $LOG_FILE (${SIZE}MB > ${MAX_SIZE_MB}MB)"
        > "$LOG_FILE"
    fi
fi

Add to crontab:

# Run every hour
0 * * * * /path/to/truncate-logs.sh

Alerting on Logs

Create Alert Rule in Grafana

  1. Go to Alerting → Alert rules → New alert rule
  2. Set query:
    sum(rate({job="wealth-bot", level="ERROR"}[5m])) > 0
    
  3. Set evaluation interval: 1m
  4. Set condition: Alert when query result > 0
  5. Add notification channel (email, Slack, etc.)

Common Alert Rules

High Error Rate:

sum(rate({job="wealth-bot", level="ERROR"}[5m])) > 0.1

WebSocket Connection Failures:

sum(count_over_time({job="wealth-bot"} |= "WebSocket" |= "failed" [5m])) > 3

No Logs Received (Bot Down):

absent_over_time({job="wealth-bot"}[5m]) == 1
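
Each expression can be evaluated against Loki before it is turned into an alert rule, to confirm it returns the value you expect; a sketch for the error-rate rule, assuming Loki on localhost:3100:

# Evaluate the error-rate expression as an instant query
curl -s -G http://localhost:3100/loki/api/v1/query \
  --data-urlencode 'query=sum(rate({job="wealth-bot", level="ERROR"}[5m]))'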

Troubleshooting

Pull Model (Promtail)

No logs appearing in Loki

  1. Check Promtail is running:

    docker compose ps promtail
    docker compose logs promtail
    
  2. Verify log file exists and is readable:

    ls -lah /tmp/wealth.log
    tail -f /tmp/wealth.log
    
  3. Check Promtail positions file:

    docker compose exec promtail cat /tmp/positions.yaml
    
  4. Test Loki directly:

    curl -s http://localhost:3100/loki/api/v1/label/job/values
    

Logs not parsing correctly

  1. Check log format matches regex:

    # Example log line
    echo "2025-11-10T01:23:45.123456Z  INFO wealth::strategy: Message" | \
      grep -oP '^\S+\s+\w+\s+[\w:]+:\s+.*$'
    
  2. View Promtail debug logs:

    docker compose logs promtail | grep -i error
    

Push Model (Loki Direct)

No logs appearing in Loki via push

  1. Check Loki is running:

    docker compose ps loki
    docker compose logs loki
    
    # Check Loki health
    curl http://localhost:3100/ready
    
  2. Verify Loki endpoint is reachable:

    # Test HTTP endpoint (note: the /loki/api/v1/push path is added by the library)
    curl -v http://localhost:3100/ready
    
  3. Check bot is configured correctly:

    # Verify environment variable is set
    echo $OTLP_ENDPOINT
    
    # Should see startup message when running bot:
    # "OpenTelemetry initialized with endpoint: http://localhost:4317"
    
  4. Check Loki logs for errors:

    docker compose logs loki | grep -i error
    docker compose logs loki | grep -i "push"
    
  5. Test OpenTelemetry Collector health:

    # Check OTLP receiver is responding
    curl http://localhost:13133/
    
    # Check metrics being exported to Prometheus
    curl http://localhost:8889/metrics | grep wealth
    

Push connection timeouts

  1. Check network connectivity:

    # Test OTLP gRPC endpoint
    telnet localhost 4317
    
    # Or check if port is listening
    nc -zv localhost 4317
    
  2. Check Docker network:

    docker network inspect wealth_monitoring
    
  3. Check OpenTelemetry Collector configuration:

    # View collector logs for errors
    docker compose logs otel-collector
    
    # Verify collector config (in compose.yml)
    docker compose config | grep -A 20 otel-collector
    

Logs delayed or missing

  1. Check OTLP export is working:

    • OpenTelemetry batches logs before sending
    • Default batch timeout is 10 seconds
    • Check bot logs for OTLP export errors
  2. Monitor OpenTelemetry Collector:

    # Check collector is receiving telemetry
    docker compose logs otel-collector | grep -i "logs"
    
    # Check collector metrics
    curl http://localhost:8888/metrics | grep otelcol_receiver
    
  3. Verify labels are correct:

    # Check available labels in Loki
    curl http://localhost:3100/loki/api/v1/labels
    
    # Check values for 'service' label
    curl http://localhost:3100/loki/api/v1/label/service/values
    

General Issues

Performance issues

  1. Check Loki disk usage:

    docker compose exec loki df -h /loki
    
  2. Limit log retention in Loki config:

    • Edit Loki config to set retention period
    • Default: unlimited (until disk full)

Advanced: JSON Logging

For better log parsing and indexing, JSON logging is supported. This is configured automatically when using OTLP export.

Update Promtail Config

In compose.yml, update the pipeline_stages:

pipeline_stages:
  - json:
      expressions:
        timestamp: timestamp
        level: level
        message: message
        module: target
        span: span
        correlation_id: fields.correlation_id
  - labels:
      level:
      module:
  - timestamp:
      source: timestamp
      format: RFC3339Nano
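
Pipeline changes can be dry-run against real log lines before restarting the stack; the sketch below assumes a local promtail binary and a standalone config file (here hypothetically named promtail-config.yaml, since in this setup the stages live in compose.yml):

# Dry-run the pipeline stages against existing log lines without sending to Loki
cat /tmp/wealth.log | promtail --stdin --dry-run --config.file=promtail-config.yaml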

Log Retention

Loki stores logs with automatic compaction. Configure retention in compose.yml:

loki:
  command: 
    - -config.file=/etc/loki/local-config.yaml
    - -config.expand-env=true
  environment:
    - LOKI_RETENTION_PERIOD=30d

Or create a custom Loki config with retention limits.
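
After changing retention settings, the effective configuration can be read back from Loki to confirm the new value was applied; a quick check assuming Loki on localhost:3100:

# Inspect the running Loki configuration for retention settings
curl -s http://localhost:3100/config | grep -i retention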

Best Practices

  1. Use the push model for production - Lower latency than file tailing and no Promtail agent to manage
  2. Keep file logging for debugging - Hybrid mode provides redundancy
  3. Use structured logging - Include correlation_id, operation, etc.
  4. Set appropriate log levels - Use DEBUG for development, INFO for production
  5. Create dashboards - Visualize key metrics from logs
  6. Set up alerts - Get notified of critical errors
  7. Index important fields - Add labels for common filters (level, module)
  8. Monitor Loki performance - Check ingestion rate and query latency
  9. Configure log retention - Balance storage costs with retention needs
  10. Use correlation IDs - Automatically included in logs for tracing

Comparison: Pull vs Push

| Aspect | Pull (Promtail) | Push (Loki Direct) |
|---|---|---|
| Setup Complexity | Simple | Simpler (no Promtail needed) |
| Latency | 5-10 seconds | < 1 second |
| Disk I/O | Required (log files) | Optional |
| Network Efficiency | Lower (file polling) | Higher (batched HTTP) |
| Reliability | File-based buffering | In-memory buffering |
| Scalability | One agent per host | Direct to Loki |
| Dependencies | Promtail service | None (built into bot) |
| Production Ready | ✓ | ✓ (recommended) |

Migration Path: Pull → Push

  1. Phase 1: Enable OpenTelemetry OTLP export

    # Keep existing file logging if desired
    export WEALTH__OBSERVABILITY__LOG_FILE=/tmp/wealth.log
    
    # Add OTLP endpoint
    export OTLP_ENDPOINT=http://localhost:4317
    
    wealth run
    
  2. Phase 2: Verify OTLP export in Grafana

    • Check logs appear in Loki via Grafana Explore
    • Verify metrics in Prometheus
    • Check traces in Tempo
    • Confirm correlation between logs/metrics/traces
  3. Phase 3: Disable file logging (optional)

    # Remove file logging for OTLP-only mode
    unset WEALTH__OBSERVABILITY__LOG_FILE
    
    # Keep OTLP export
    export OTLP_ENDPOINT=http://localhost:4317
    
    wealth run
    
  4. Phase 4: Production deployment

    # Ensure all observability services are running
    docker compose up -d
    
    # Configure bot for OTLP
    export OTLP_ENDPOINT=http://localhost:4317
    export OTEL_RESOURCE_ATTRIBUTES="service.name=wealth-bot,deployment.environment=production"
    
    wealth run
    

External Resources