sgcWebSockets Performance Tuning — Scaling to 100k Connections

29 May 2026 · Features

From "It Works" to "It Scales"

sgcWebSockets handles a few thousand connections with no tuning. Pushing past 50,000 or 100,000 concurrent sockets on a single box is also possible, but it requires choosing the right server class, sizing thread pools to your workload, picking the right compression strategy, and tuning the operating system. This post walks through every lever, in the order you should pull them, with benchmark numbers from a Hetzner AX102 (Ryzen 9 7950X3D, 128 GB RAM, 1 Gbps NIC) running Windows Server 2025 and Ubuntu 24.04.

A word on methodology before we start. Every number you see below is from a synthetic echo workload — 100-byte payloads, 10 messages/sec/client, no business logic. Real applications will land somewhere between these numbers and an order of magnitude slower depending on what your OnMessage handler actually does. Treat these as ceilings, not promises. The point is to show the shape of the scaling curve and where the cliffs are, not to give you a guarantee for an arbitrary workload.

Also: tune the thing closest to the user first. There is no point shaving 200 microseconds off your broadcast loop when your TLS handshake takes 80 milliseconds. Profile, find the bottleneck, fix it, repeat. The list below is roughly ordered by impact-per-effort, but every workload is different.

1. Choose the Right I/O Model

sgcWebSockets ships two server families: the Indy-based TsgcWebSocketServer (one-thread-per-connection) and the IOCP/epoll-backed TsgcWebSocketHTTPServer (event-driven). Past about 5,000 concurrent connections, you want the second one. Period.

Server	I/O model	Sweet spot	Hard ceiling
TsgcWebSocketServer (Indy)	Thread per connection	<5,000	~10,000 (thread stack exhaustion)
TsgcWebSocketHTTPServer (Windows)	IOCP	10,000–100,000	~250,000 (socket handles, non-paged pool)
TsgcWebSocketHTTPServer (Linux)	epoll	10,000–100,000	~1,000,000 (with tuning)
TsgcWebSocketHTTPServer (macOS)	kqueue	10,000–50,000	~100,000

2. Size the Thread Pool

The IOCP/epoll server uses a fixed-size worker pool. The default is CPUCount. For pure echo / fan-out workloads keep it small (no more than 2–4 workers per core). For workloads that touch the database or call external APIs inside OnMessage, increase it; otherwise one slow request ties up a worker and stalls every connection waiting on it.

oServer := TsgcWebSocketHTTPServer.Create(Self);
oServer.Port := 443;

// Tune the worker pool
oServer.ThreadPool.PoolSize       := CPUCount * 2;   // CPU-bound
// oServer.ThreadPool.PoolSize    := 128;            // DB-bound (e.g. 16 cores)
oServer.ThreadPool.QueueSize      := 4096;
oServer.ThreadPool.MaxConnections := 100000;

oServer.Active := True;

Rule of thumb: PoolSize = CPUCount + (average_blocking_ms / average_cpu_ms) * CPUCount. For example, 16 cores with handlers that block for 7 ms per 1 ms of CPU gives 16 + 7 × 16 = 128. If your APM (Application Insights or similar) reports growing queue depth, double the pool. There is a hard upper limit set by your kernel scheduler (tens of thousands of OS threads on Windows is not fun), but sensible workloads are usually nowhere near it. Past 256 workers, consider offloading the blocking work to a dedicated worker pool instead.
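To make the rule of thumb concrete, here is a minimal sketch that only does the arithmetic. SuggestedPoolSize is a hypothetical helper, not part of sgcWebSockets, and the 256-worker cap is simply the threshold mentioned above. It is written as a plain console program that should build under Delphi or Free Pascal.

```pascal
program PoolSizeSketch;

{$IFDEF FPC}{$mode delphi}{$ENDIF}
{$IFDEF MSWINDOWS}{$APPTYPE CONSOLE}{$ENDIF}

uses
  SysUtils, Math;

// Hypothetical helper (not an sgcWebSockets API). Applies
//   PoolSize = CPUCount + (blocking_ms / cpu_ms) * CPUCount
// and clamps the result to AMaxWorkers.
function SuggestedPoolSize(ACpuCount: Integer; ABlockingMs, ACpuMs: Double;
  AMaxWorkers: Integer): Integer;
begin
  if (ACpuCount < 1) or (ACpuMs <= 0) then
    raise Exception.Create('ACpuCount must be >= 1 and ACpuMs must be > 0');
  Result := ACpuCount + Ceil((ABlockingMs / ACpuMs) * ACpuCount);
  Result := Min(Result, AMaxWorkers);
end;

begin
  // Pure CPU work on 16 cores: no blocking term, so the pool equals the core count.
  WriteLn(SuggestedPoolSize(16, 0.0, 1.0, 256));   // 16
  // 7 ms blocked per 1 ms of CPU on 16 cores.
  WriteLn(SuggestedPoolSize(16, 7.0, 1.0, 256));   // 128
  // 64 cores, 10 ms blocked per 1 ms of CPU: the formula gives 704, clamped to 256.
  WriteLn(SuggestedPoolSize(64, 10.0, 1.0, 256));  // 256
end.
```

When the clamp kicks in, as in the last call, that is the signal to move the blocking work off the I/O workers rather than keep growing the pool.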

The single best architectural decision is to keep OnMessage non-blocking and push every actual unit of work onto a dedicated queue serviced by a separate thread pool. That decouples the I/O thread (which must stay fast) from the business thread (which can be slow without consequence). It also makes you observable: queue depth becomes the leading indicator of "we are about to fall over".
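The shape of that decoupling can be sketched without any library code. In the program below, a loop stands in for OnMessage firing on the I/O thread, TryEnqueue never blocks and rejects work when the bounded queue is full, and a single TBusinessWorker thread drains the queue slowly. All names here are hypothetical and the queue is deliberately naive (a locked TStringList with polling); it illustrates the pattern and is not how sgcWebSockets implements anything.

```pascal
program OffloadSketch;

{$IFDEF FPC}{$mode delphi}{$ENDIF}
{$IFDEF MSWINDOWS}{$APPTYPE CONSOLE}{$ENDIF}

uses
  {$IFDEF FPC}{$IFDEF UNIX}cthreads,{$ENDIF}{$ENDIF}
  Classes, SysUtils, SyncObjs;

const
  MaxDepth = 64;              // bounded queue: reject instead of growing forever

var
  GLock: TCriticalSection;
  GJobs: TStringList;         // pending jobs, oldest first

// Called from the I/O side. Never blocks: returns False when the queue is full.
function TryEnqueue(const AJob: string): Boolean;
begin
  GLock.Acquire;
  try
    Result := GJobs.Count < MaxDepth;
    if Result then
      GJobs.Add(AJob);
  finally
    GLock.Release;
  end;
end;

// Called from the business thread. Returns False when there is nothing to do.
function TryDequeue(out AJob: string): Boolean;
begin
  GLock.Acquire;
  try
    Result := GJobs.Count > 0;
    if Result then
    begin
      AJob := GJobs[0];
      GJobs.Delete(0);        // O(n), acceptable for a sketch
    end;
  finally
    GLock.Release;
  end;
end;

// Queue depth: the number to export to monitoring.
function QueueDepth: Integer;
begin
  GLock.Acquire;
  try
    Result := GJobs.Count;
  finally
    GLock.Release;
  end;
end;

type
  TBusinessWorker = class(TThread)
  protected
    procedure Execute; override;
  public
    Done: Integer;            // jobs completed; read only after WaitFor
  end;

procedure TBusinessWorker.Execute;
var
  vJob: string;
begin
  while not Terminated do
    if TryDequeue(vJob) then
    begin
      Sleep(5);               // stand-in for a slow database or HTTP call
      Inc(Done);
    end
    else
      Sleep(1);               // idle poll; a real pool would wait on an event
end;

var
  vWorker: TBusinessWorker;
  i, vAccepted, vRejected: Integer;
begin
  GLock := TCriticalSection.Create;
  GJobs := TStringList.Create;
  vWorker := TBusinessWorker.Create(False);
  try
    vAccepted := 0;
    vRejected := 0;
    // Stand-in for OnMessage firing 200 times in a burst on the I/O thread.
    for i := 1 to 200 do
      if TryEnqueue('msg ' + IntToStr(i)) then
        Inc(vAccepted)
      else
        Inc(vRejected);
    WriteLn('accepted=', vAccepted, ' rejected=', vRejected);
    while QueueDepth > 0 do
      Sleep(10);              // let the worker drain the backlog
    vWorker.Terminate;
    vWorker.WaitFor;
    WriteLn('processed=', vWorker.Done);
  finally
    vWorker.Free;
    GJobs.Free;
    GLock.Free;
  end;
end.
```

The burst finishes almost immediately even though each job takes 5 ms, because the producer only ever pays for a lock and an append. The rejected count is the backpressure signal: what to do with it (drop, reply "busy", close the connection) is an application decision.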

3. Compression: permessage-deflate

WebSocket compression (RFC 7692) trims 60–90% off JSON or text payloads. It is also CPU-heavy. Enable it globally only when your workload is text-heavy AND your CPU has headroom. For binary or already-compressed payloads (JPEG, MP4, gzipped logs) it is pure overhead.

oServer.Extensions.PerMessage_Deflate.Enabled         := True;
oServer.Extensions.PerMessage_Deflate.ServerMaxWindow := 15; // default
oServer.Extensions.PerMessage_Deflate.MemLevel        := 8;
oServer.Extensions.PerMessage_Deflate.Threshold       := 256; // skip tiny msgs

Set Threshold so that messages too small to shrink meaningfully are sent uncompressed. Skipping deflate on a 60-byte heartbeat saves CPU and costs almost no bandwidth.

A subtle gotcha: permessage-deflate uses a sliding window that retains state across messages on the same connection. That state is per-connection memory. With ServerMaxWindow=15 (the default), each connection holds a 32 KB window (2^15 bytes). Multiply by 100,000 connections and you have roughly 3.2 GB of RAM just for compression windows. Drop ServerMaxWindow to 10 or 11 if you are memory-bound: you lose a few percent of compression ratio in exchange for a 1 KB or 2 KB window, which is 32x or 16x less window memory per connection.
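The window arithmetic is easy to tabulate. The sketch below counts only the sliding window itself (2^bits bytes per connection); a real deflate stream keeps additional per-connection state on top of that, so treat the output as a lower bound. WindowBytes and FleetWindowMiB are hypothetical helper names.

```pascal
program DeflateWindowSketch;

{$IFDEF FPC}{$mode delphi}{$ENDIF}
{$IFDEF MSWINDOWS}{$APPTYPE CONSOLE}{$ENDIF}

uses
  SysUtils;

// Sliding-window size in bytes for a window-bits setting.
// RFC 7692 allows values from 8 to 15.
function WindowBytes(AWindowBits: Integer): Int64;
begin
  if (AWindowBits < 8) or (AWindowBits > 15) then
    raise Exception.Create('window bits must be in 8..15');
  Result := Int64(1) shl AWindowBits;
end;

// Window memory summed over all connections, in MiB.
function FleetWindowMiB(AWindowBits, AConnections: Integer): Double;
begin
  Result := (WindowBytes(AWindowBits) * AConnections) / (1024.0 * 1024.0);
end;

var
  vBits: Integer;
begin
  for vBits := 15 downto 10 do
    WriteLn(Format('bits=%d  per-connection=%d bytes  100k connections=%.1f MiB',
      [vBits, WindowBytes(vBits), FleetWindowMiB(vBits, 100000)]));
end.
```

Each step down in window bits halves the window, so going from 15 to 12 is an 8x reduction, 15 to 11 is 16x, and 15 to 10 is 32x.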

4. Fragmentation

Large frames (>1 MB) hold the worker thread until the full message is reassembled. Fragment outgoing messages on the sending side (the client in this example) to keep latencies smooth and free workers for other peers.

oClient.WriteOptions.FragmentEnabled := True;
oClient.WriteOptions.FragmentSize    := 65536;  // 64 KB chunks

On the server side, set ReadOptions.MaxFrameSize to a sane upper bound (we use 4 MB) to protect against malicious peers that try to allocate gigabyte buffers.
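For a feel of what fragmentation does on the wire, the sketch below plans how a payload is cut into frames of at most FragmentSize bytes. Per RFC 6455, the first frame carries the data opcode, later frames use the continuation opcode, and only the last frame has FIN set. PlanFragments and TFragment are hypothetical names for illustration; the library does this internally when fragmentation is enabled.

```pascal
program FragmentPlanSketch;

{$IFDEF FPC}{$mode delphi}{$ENDIF}
{$IFDEF MSWINDOWS}{$APPTYPE CONSOLE}{$ENDIF}

uses
  SysUtils, Math;

type
  TFragment = record
    Offset: Integer;  // start of this slice within the payload
    Size: Integer;    // bytes carried by this frame
    Opcode: Byte;     // $1 (text) on the first frame, $0 (continuation) afterwards
    Fin: Boolean;     // set only on the last frame of the message
  end;
  TFragments = array of TFragment;

function PlanFragments(APayloadSize, AFragmentSize: Integer): TFragments;
var
  vCount, i: Integer;
begin
  if (APayloadSize < 0) or (AFragmentSize < 1) then
    raise Exception.Create('invalid sizes');
  // An empty message still needs one (empty) frame.
  if APayloadSize = 0 then
    vCount := 1
  else
    vCount := (APayloadSize + AFragmentSize - 1) div AFragmentSize;
  SetLength(Result, vCount);
  for i := 0 to vCount - 1 do
  begin
    Result[i].Offset := i * AFragmentSize;
    Result[i].Size := Min(AFragmentSize, APayloadSize - Result[i].Offset);
    if i = 0 then
      Result[i].Opcode := $1
    else
      Result[i].Opcode := $0;
    Result[i].Fin := (i = vCount - 1);
  end;
end;

var
  vPlan: TFragments;
begin
  // 1 MB plus one byte, cut into 64 KB chunks: 16 full frames and a 1-byte tail.
  vPlan := PlanFragments(1024 * 1024 + 1, 65536);
  WriteLn('frames: ', Length(vPlan));                      // 17
  WriteLn('last frame size: ', vPlan[High(vPlan)].Size);   // 1
end.
```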

5. Broadcast Optimisation

Sending the same message to every connected client is the #1 bottleneck for chat / trading / pub-sub servers. The naive loop ("for each client: client.Send(msg)") serialises and compresses the same payload N times. Use the built-in broadcast, which serialises once and reuses the encoded frame.

// Slow: N encodes, N compresses
for i := 0 to oServer.Connections.Count - 1 do
  oServer.Connections[i].WriteData(vJSON);

// Fast: 1 encode, 1 compress, N writes
oServer.Broadcast(vJSON);

// Fastest for fan-out >10k: pre-encoded buffer
vFrame := oServer.EncodeFrame(vJSON);
oServer.BroadcastEncoded(vFrame);

On a 16-core box, the difference between the naive loop and BroadcastEncoded for a 50,000-client fan-out is 12 seconds vs 380 ms. The same principle applies to channels — pre-encode the frame, then walk the subscriber list. If your subscribers split across many channels, encode once per channel and broadcast within. Premature pessimisation in this code path will tank an otherwise fast server.
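Why encoding once matters can be shown with a toy frame encoder. The sketch below builds an unmasked, uncompressed server-to-client text frame following the RFC 6455 header layout and counts how often the encoder runs in each strategy. EncodeTextFrame is a hypothetical stand-in, not the library's EncodeFrame, and the "write" is just a byte counter.

```pascal
program EncodeOnceSketch;

{$IFDEF FPC}{$mode delphi}{$ENDIF}
{$IFDEF MSWINDOWS}{$APPTYPE CONSOLE}{$ENDIF}

uses
  SysUtils;

const
  ClientCount = 1000;

var
  EncodeCalls: Integer = 0;

// Copies an ASCII literal into a byte array.
function AsciiBytes(const S: AnsiString): TBytes;
begin
  SetLength(Result, Length(S));
  if Length(S) > 0 then
    Move(S[1], Result[0], Length(S));
end;

// Single unfragmented text frame: FIN=1, opcode=$1, no mask, no compression.
function EncodeTextFrame(const APayload: TBytes): TBytes;
var
  vLen, vHeader, i: Integer;
begin
  Inc(EncodeCalls);
  vLen := Length(APayload);
  if vLen <= 125 then
    vHeader := 2
  else if vLen <= 65535 then
    vHeader := 4
  else
    vHeader := 10;
  SetLength(Result, vHeader + vLen);
  Result[0] := $81;
  case vHeader of
    2: Result[1] := Byte(vLen);
    4:
      begin
        Result[1] := 126;                         // 16-bit length follows
        Result[2] := Byte((vLen shr 8) and $FF);
        Result[3] := Byte(vLen and $FF);
      end;
  else
    Result[1] := 127;                             // 64-bit length follows
    for i := 0 to 7 do
      Result[2 + i] := Byte((Int64(vLen) shr (8 * (7 - i))) and $FF);
  end;
  if vLen > 0 then
    Move(APayload[0], Result[vHeader], vLen);
end;

var
  vPayload, vFrame: TBytes;
  i: Integer;
  vBytesQueued: Int64;
begin
  vPayload := AsciiBytes('{"type":"tick","px":101.25}');

  // Naive: the frame is rebuilt for every client.
  EncodeCalls := 0;
  for i := 1 to ClientCount do
    vFrame := EncodeTextFrame(vPayload);
  WriteLn('naive encodes:  ', EncodeCalls);   // 1000

  // Encode once, then hand the same buffer to every client.
  EncodeCalls := 0;
  vFrame := EncodeTextFrame(vPayload);
  vBytesQueued := 0;
  for i := 1 to ClientCount do
    Inc(vBytesQueued, Length(vFrame));        // stand-in for a socket write
  WriteLn('shared encodes: ', EncodeCalls);   // 1
end.
```

With compression enabled the gap widens further, since the naive loop would also run deflate once per client.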

6. OS-Level Tuning

The kernel imposes hard limits long before the component does. Tune these before you blame the library.

Setting	Linux	Windows	Recommendation
File descriptor / handle limit	`ulimit -n`	No per-process setting (socket handles are bounded by non-paged pool memory)	2 × expected connections
TCP backlog	`net.core.somaxconn`	`listen()` backlog argument (`SOMAXCONN`)	4096+
TIME_WAIT reuse	`tcp_tw_reuse=1`	`TcpTimedWaitDelay=30`	Reduce port exhaustion
SO_REUSEPORT	kernel ≥3.9	N/A	Multi-process acceptor
Ephemeral port range	`net.ipv4.ip_local_port_range`	`MaxUserPort` / `netsh int ipv4 set dynamicport tcp`	10000–65535

7. Heartbeats and Idle Detection

Mobile clients drop off the network all the time. Without heartbeats, your server keeps the socket open until the TCP keepalive timer fires (typically 2 hours). Configure short heartbeats and dead-peer detection.

oServer.HeartBeat.Enabled  := True;
oServer.HeartBeat.Interval := 30;     // seconds
oServer.HeartBeat.Timeout  := 90;     // close if no pong within this

This catches half-open connections within 90 seconds rather than two hours, freeing thousands of stale sockets on a busy server.
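The dead-peer rule itself is a one-line comparison. The sketch below assumes the server records when each peer last answered a ping and periodically sweeps the list; TPeer, IsDead and CountDead are hypothetical names, and the source does not say how sgcWebSockets tracks this internally.

```pascal
program HeartbeatSweepSketch;

{$IFDEF FPC}{$mode delphi}{$ENDIF}
{$IFDEF MSWINDOWS}{$APPTYPE CONSOLE}{$ENDIF}

uses
  SysUtils;

const
  TimeoutSec = 90;        // mirrors HeartBeat.Timeout above

type
  TPeer = record
    Id: Integer;
    LastPongSec: Int64;   // monotonic seconds when the last pong arrived
  end;

// A peer is considered dead once it has been silent for TimeoutSec or longer.
function IsDead(const APeer: TPeer; ANowSec: Int64): Boolean;
begin
  Result := (ANowSec - APeer.LastPongSec) >= TimeoutSec;
end;

// Counts the peers a sweep at ANowSec would close.
function CountDead(const APeers: array of TPeer; ANowSec: Int64): Integer;
var
  i: Integer;
begin
  Result := 0;
  for i := Low(APeers) to High(APeers) do
    if IsDead(APeers[i], ANowSec) then
      Inc(Result);
end;

var
  vPeers: array[0..2] of TPeer;
begin
  vPeers[0].Id := 1; vPeers[0].LastPongSec := 1000;  // answered 10 s ago
  vPeers[1].Id := 2; vPeers[1].LastPongSec := 920;   // silent for exactly 90 s
  vPeers[2].Id := 3; vPeers[2].LastPongSec := 100;   // gone for a long time
  WriteLn('peers to close: ', CountDead(vPeers, 1010));  // 2
end.
```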

8. Load Balancer Pairing

If you need to scale beyond a single box, pair sgcWebSockets with our TsgcWebSocketHTTPServer_LoadBalancer or an external L7 LB (HAProxy, nginx, AWS ALB). Two rules:

Use sticky sessions: a WebSocket is a single long-lived connection that cannot be re-routed mid-conversation, and a reconnecting client should land on the node that holds its session state.
Forward the original X-Forwarded-For and TLS termination headers so your application sees real client IPs.

Reference Benchmarks

Numbers from a single AX102 box (16 cores / 32 threads, 128 GB), running an echo server with 100-byte payloads at 10 messages/sec/client.

Concurrent clients	Throughput (msg/s)	p50 latency	p99 latency	CPU usage	Memory (RSS)
10,000	100,000	0.8 ms	3.2 ms	14%	0.9 GB
50,000	500,000	1.1 ms	5.4 ms	38%	3.8 GB
100,000	1,000,000	1.7 ms	9.8 ms	71%	7.2 GB
250,000	2,500,000	3.4 ms	22 ms	96%	17.8 GB

9. TLS Termination

TLS handshakes are CPU-expensive. If you serve thousands of new connections per second, terminating TLS in the Delphi process can saturate cores doing crypto instead of serving frames. For high-churn workloads we terminate TLS in nginx or HAProxy in front of the sgcWebSockets server and run the backend in plain HTTP. The frontend gets hardware AES acceleration, session resumption, and OCSP stapling for free, and the Delphi process gets to spend 100% of its CPU on application logic.

For workloads with persistent long-lived connections (a typical chat / trading scenario), in-process TLS is fine because the handshake is amortised over hours or days. For connect-disconnect-reconnect bursts (mobile clients on flaky networks), put a reverse proxy in front.

10. NIC and Network Tuning

At 1 Gbps you are unlikely to saturate the NIC. At 10 Gbps and above you have to think about interrupt coalescing, receive-side scaling (RSS), and pinning sgcWebSockets worker threads to NUMA-local cores. On Linux, ethtool (-C for coalescing, -L for queue counts) and set_irq_affinity.sh are your friends. On Windows, set RSS Profile to NUMAScaling in the NIC properties and verify with Get-NetAdapterRss. This is worth tuning only if your monitoring shows the kernel spending real time in softirq or DPCs.

Profile-Guided Tuning Loop

Tuning is iterative. Start with defaults, run a representative load, and look at CPU per worker, heap allocation rate, p99 latency under fan-out, and OS-level connection counters. Change one thing, re-run, compare. The most common surprises:

Compression enabled on already-compressed payloads → CPU spikes for zero bandwidth gain.
Synchronous DB calls inside OnMessage → worker pool saturated at <1% CPU.
No broadcast batching → head-of-line blocking during market open spikes.
Default thread pool on a 64-core box → serialising work onto 64 workers when 256 would unlock 4× throughput.