Rebuilding a VoIP Monitoring Stack for Real-Time Call Quality

The VoIP Monitoring Stack I Wish I Had Set Up From Day One

Dialphone Limited overhauled its VoIP monitoring after realizing that basic process checks were failing to detect real outages. By shifting focus to call-experience metrics, the company reduced incident detection time by more than 96%.

Why This Matters

Infrastructure metrics such as CPU and memory often mask user-facing failures in VoIP environments, where the PBX process can stay up while call quality is unusable. Catching packet loss and jitter before they become business-impacting outages requires observing the network layer, SIP signaling, and RTP streams.

Key Insights

Synthetic SIP OPTIONS probes executed every 60 seconds provide continuous data on latency and packet loss before users notice degradation.
Mean Opinion Score (MOS) serves as a direct measure of call quality, with unacceptable quality defined as any score falling below 3.0.
Critical alerting should trigger when SIP registration failure rates exceed 5% or when active calls drop by more than 20% in 60 seconds.
Monitoring business-impact metrics, such as queue abandonment rates exceeding 15%, provides more signal than individual phone registration events.
Platforms like VestaCall provide built-in call quality analytics and real-time MOS scoring, reducing the need for custom RTP analysis layers.
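
The thresholds listed above can be expressed as a small rule check. This is a minimal sketch under stated assumptions, not the alerting code from the original setup: the `Snapshot` fields and the `evaluate_alerts` function are hypothetical names, and the severity labels for the MOS and queue rules are assumptions. Only the numeric thresholds come from the list above.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Snapshot:
    """Hypothetical one-minute health summary (field names are assumptions)."""
    reg_attempts: int          # SIP REGISTER attempts in the window
    reg_failures: int          # failed REGISTER attempts in the window
    active_calls: int          # concurrent calls now
    active_calls_60s_ago: int  # concurrent calls 60 seconds earlier
    mos: float                 # current Mean Opinion Score
    queue_offered: int         # calls that entered the queue
    queue_abandoned: int       # callers who hung up while waiting


def pct(part: int, whole: int) -> float:
    """Percentage that returns 0 when the denominator is zero."""
    return 100.0 * part / whole if whole else 0.0


def evaluate_alerts(s: Snapshot) -> List[Tuple[str, str]]:
    """Return (severity, message) pairs for every threshold that is breached."""
    alerts = []
    if pct(s.reg_failures, s.reg_attempts) > 5:
        alerts.append(("critical", "SIP registration failure rate above 5%"))
    drop = pct(s.active_calls_60s_ago - s.active_calls, s.active_calls_60s_ago)
    if drop > 20:
        alerts.append(("critical", "active calls dropped more than 20% in 60 seconds"))
    # Severity for the next two rules is an assumption, not stated in the text.
    if s.mos < 3.0:
        alerts.append(("critical", "MOS below 3.0"))
    if pct(s.queue_abandoned, s.queue_offered) > 15:
        alerts.append(("warning", "queue abandonment rate above 15%"))
    return alerts
```

Keeping the rules as plain comparisons over one snapshot makes each threshold easy to audit and tune independently of whatever system collects the underlying counters.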

Working Examples

A simplified SIP OPTIONS probe to measure Round Trip Time (RTT) and response status from a target SIP server.

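import socket
import time
import uuid

def sip_probe(target, port=5060, timeout=5.0):
    """Send one SIP OPTIONS request over UDP and time the first response."""
    # Unique token per probe so each request gets its own Call-ID and branch.
    token = uuid.uuid4().hex
    probe = (
        f"OPTIONS sip:ping@{target} SIP/2.0\r\n"
        # The branch must start with the RFC 3261 magic cookie "z9hG4bK".
        # "rport" (RFC 3581) asks the server to reply to this socket's source
        # port instead of the port advertised in the Via header.
        f"Via: SIP/2.0/UDP monitor:5060;branch=z9hG4bK{token[:16]};rport\r\n"
        "From: <sip:monitor@probe>;tag=probe123\r\n"
        f"To: <sip:ping@{target}>\r\n"
        f"Call-ID: probe-{token}@monitor\r\n"
        "CSeq: 1 OPTIONS\r\n"
        "Max-Forwards: 70\r\n"
        "Content-Length: 0\r\n\r\n"
    )
    # The context manager closes the socket even when the probe times out.
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        start = time.perf_counter()
        sock.sendto(probe.encode(), (target, port))
        try:
            data, _ = sock.recvfrom(4096)
        except socket.timeout:
            return dict(rtt_ms=None, response="TIMEOUT")
        rtt = (time.perf_counter() - start) * 1000
    # Report only the status line, for example "SIP/2.0 200 OK".
    status = data.split(b"\r\n", 1)[0].decode(errors="replace")
    return dict(rtt_ms=round(rtt, 2), response=status)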
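
A single probe gives one RTT sample; latency, jitter, and loss only become meaningful over a window of probes. The sketch below summarizes such a window. The function name `summarize_probes` is hypothetical, and the jitter figure is a simplification (mean absolute difference between consecutive RTTs) rather than the RFC 3550 interarrival jitter that RTP endpoints report. Loss here means unanswered OPTIONS requests, which is only a proxy for media packet loss.

```python
import statistics
from typing import Dict, List, Optional


def summarize_probes(rtts_ms: List[Optional[float]]) -> Dict[str, Optional[float]]:
    """Summarize a window of probe results; None marks a probe that timed out."""
    answered = [r for r in rtts_ms if r is not None]
    lost = len(rtts_ms) - len(answered)
    loss_pct = round(100.0 * lost / len(rtts_ms), 2) if rtts_ms else 0.0
    if not answered:
        return {"loss_pct": loss_pct, "avg_rtt_ms": None, "jitter_ms": None}
    # Simplified jitter: mean absolute change between consecutive answered probes.
    deltas = [abs(b - a) for a, b in zip(answered, answered[1:])]
    jitter = statistics.fmean(deltas) if deltas else 0.0
    return {
        "loss_pct": loss_pct,
        "avg_rtt_ms": round(statistics.fmean(answered), 2),
        "jitter_ms": round(jitter, 2),
    }
```

In practice the input list would be the `rtt_ms` values collected from repeated `sip_probe` calls, for example one per minute over the last ten minutes.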

Practical Applications

Use Case: Synthetic SIP probing from office locations to VoIP providers to detect latency, jitter, and packet loss on the path. Pitfall: Monitoring individual phone registrations, which is too noisy for effective alerting.
Use Case: Real-time MOS scoring to automatically escalate network issues when scores stay below 3.5 for 5 minutes. Pitfall: Relying on PBX CPU/memory metrics, which often fail to correlate with call quality.
Use Case: Tracking queue depth and abandonment rates to measure business impact. Pitfall: Using call duration distribution for incident alerting; it is useful for analytics but useless for real-time response.
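
The MOS escalation use case above can be sketched in two parts: estimating MOS from network measurements, and escalating only when the score stays low for the full window. Both names (`estimate_mos`, `SustainedLowMos`) are hypothetical. The R-factor calculation is a widely circulated heuristic, not a full E-model, so treat its output as an approximation; the final R-to-MOS conversion follows ITU-T G.107.

```python
from typing import Optional


def estimate_mos(latency_ms: float, jitter_ms: float, loss_pct: float) -> float:
    """Rough MOS estimate from latency, jitter, and loss (heuristic)."""
    # Heuristic: jitter counts double, plus 10 ms assumed for codec delay.
    effective = latency_ms + 2 * jitter_ms + 10
    if effective < 160:
        r = 93.2 - effective / 40
    else:
        r = 93.2 - (effective - 120) / 10
    r -= 2.5 * loss_pct  # assumed penalty of 2.5 R points per 1% loss
    r = max(0.0, min(100.0, r))
    # R-factor to MOS conversion from ITU-T G.107.
    mos = 1 + 0.035 * r + 7e-6 * r * (r - 60) * (100 - r)
    return round(max(1.0, mos), 2)


class SustainedLowMos:
    """Signal escalation only after MOS stays below the threshold for a full window."""

    def __init__(self, threshold: float = 3.5, window_s: float = 300.0) -> None:
        self.threshold = threshold
        self.window_s = window_s
        self.low_since: Optional[float] = None  # when MOS first dropped below threshold

    def update(self, timestamp_s: float, mos: float) -> bool:
        """Record one sample; return True when escalation is due."""
        if mos >= self.threshold:
            self.low_since = None  # recovery resets the timer
            return False
        if self.low_since is None:
            self.low_since = timestamp_s
        return timestamp_s - self.low_since >= self.window_s
```

Requiring the score to stay low for the whole window avoids paging on a single bad sample, at the cost of delaying escalation by up to five minutes.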

References:

https://vestacall.com
