From "It Works" to "It Scales"
sgcWebSockets handles a few thousand connections out of the box with no tuning. Pushing past 50,000 or 100,000 concurrent sockets on a single box is also possible — but it requires choosing the right server class, sizing thread pools to your workload, picking the right compression strategy, and tuning the operating system. This post walks through every lever, in the order you should pull them, with benchmark numbers from a Hetzner AX102 (Ryzen 9 7950X3D, 128 GB RAM, 1 Gbps NIC) running Windows Server 2025 and Ubuntu 24.04.
A word on methodology before we start. Every number you see below is from a synthetic echo workload — 100-byte payloads, 10 messages/sec/client, no business logic. Real applications will land somewhere between these numbers and an order of magnitude slower depending on what your OnMessage handler actually does. Treat these as ceilings, not promises. The point is to show the shape of the scaling curve and where the cliffs are, not to give you a guarantee for an arbitrary workload.
Also: tune the thing closest to the user first. There is no point shaving 200 microseconds off your broadcast loop when your TLS handshake takes 80 milliseconds. Profile, find the bottleneck, fix it, repeat. The list below is roughly ordered by impact-per-effort, but every workload is different.
1. Choose the Right I/O Model
sgcWebSockets ships two server families: the Indy-based TsgcWebSocketServer (one-thread-per-connection) and the IOCP/epoll-backed TsgcWebSocketHTTPServer (event-driven). Past about 5,000 concurrent connections, you want the second one. Period.
| Server | I/O model | Sweet spot | Hard ceiling |
| --- | --- | --- | --- |
| TsgcWebSocketServer (Indy) | Thread per connection | <5,000 | ~10,000 (thread stack exhaustion) |
| TsgcWebSocketHTTPServer (Windows) | IOCP | 10,000–100,000 | ~250,000 (file descriptors) |
| TsgcWebSocketHTTPServer (Linux) | epoll | 10,000–100,000 | ~1,000,000 (with tuning) |
| TsgcWebSocketHTTPServer (macOS) | kqueue | 10,000–50,000 | ~100,000 |
2. Size the Thread Pool
The IOCP/epoll server uses a fixed-size worker pool; the default is CPUCount workers. For pure echo or fan-out workloads, keep it small (2–4 per core). For workloads that touch the database or call external APIs inside OnMessage, raise it. Otherwise every worker can end up parked on a slow request while the remaining connections wait in the queue.
    oServer := TsgcWebSocketHTTPServer.Create(Self);
    oServer.Port := 443;
    // Tune the worker pool
    oServer.ThreadPool.PoolSize := CPUCount * 2;  // CPU-bound
    // oServer.ThreadPool.PoolSize := 128;        // DB-bound (e.g. 16 cores)
    oServer.ThreadPool.QueueSize := 4096;
    oServer.ThreadPool.MaxConnections := 100000;
    oServer.Active := True;
Rule of thumb: PoolSize = CPUCount + (average_blocking_ms / average_cpu_ms) * CPUCount. If your APM (Application Insights or similar) reports growing queue depth, double the pool. The kernel scheduler sets a hard upper limit, and tens of thousands of OS threads on Windows is not fun, but sensible workloads are usually nowhere near it. Past 256 workers, consider offloading the blocking work to a dedicated worker pool instead.
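To make the rule of thumb concrete, here is a minimal sketch that plugs numbers into it. EstimatePoolSize is a hypothetical helper written for this post, not part of sgcWebSockets, and the timings are illustrative rather than measured.

```pascal
program PoolSizeEstimate;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

uses
  SysUtils, Math;

// Hypothetical helper that applies the rule of thumb:
//   PoolSize = Cores + (BlockingMs / CpuMs) * Cores
// The ratio term is rounded up so the pool is never undersized.
function EstimatePoolSize(Cores: Integer; BlockingMs, CpuMs: Double): Integer;
begin
  if (Cores < 1) or (CpuMs <= 0) or (BlockingMs < 0) then
    raise EArgumentException.Create('Cores >= 1, CpuMs > 0, BlockingMs >= 0 required');
  Result := Cores + Ceil((BlockingMs / CpuMs) * Cores);
end;

begin
  // Pure echo: no blocking time, so the formula yields one worker per core.
  WriteLn('echo     : ', EstimatePoolSize(16, 0.0, 0.5));
  // 3.5 ms waiting on a database for every 0.5 ms of CPU: ratio 7.
  WriteLn('db-bound : ', EstimatePoolSize(16, 3.5, 0.5));
end.
```

With 16 cores and a blocking-to-CPU ratio of 7, the estimate is 128 workers, which is where the DB-bound figure in the configuration snippet comes from. Measure the two averages in your own handler before trusting the result.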
The single best architectural decision is to keep OnMessage non-blocking and push every actual unit of work onto a dedicated queue serviced by a separate thread pool. That decouples the I/O thread (which must stay fast) from the business thread (which can be slow without consequence). It also makes you observable: queue depth becomes the leading indicator of "we are about to fall over".
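The hand-off can be sketched with nothing but the Delphi RTL. This is a minimal illustration, not the sgcWebSockets API: HandOff stands in for whatever your OnMessage handler would call, WorkerLoop for your business logic, and the queue depth, worker count and Sleep are placeholders. It assumes a recent Delphi with System.Generics.Collections.TThreadedQueue.

```pascal
program OffloadSketch;
{$APPTYPE CONSOLE}

uses
  System.SysUtils, System.Classes, System.SyncObjs,
  System.Generics.Collections;

const
  WorkerCount = 4;
  StopMarker = '';  // an empty string tells a worker to exit (sketch only)

var
  WorkQueue: TThreadedQueue<string>;
  Processed: Integer = 0;
  Rejected: Integer = 0;

// What an OnMessage handler would call: enqueue and return at once.
procedure HandOff(const Msg: string);
begin
  // Push timeout is 0, so a full queue returns wrTimeout instead of blocking.
  if WorkQueue.PushItem(Msg) <> wrSignaled then
    TInterlocked.Increment(Rejected);  // shed load rather than stall I/O
end;

// Business worker: allowed to be slow because no I/O thread waits on it.
procedure WorkerLoop;
var
  Msg: string;
begin
  while (WorkQueue.PopItem(Msg) = wrSignaled) and (Msg <> StopMarker) do
  begin
    Sleep(2);  // stands in for a database or HTTP call
    TInterlocked.Increment(Processed);
  end;
end;

var
  Workers: array[0..WorkerCount - 1] of TThread;
  I: Integer;
begin
  // Depth 1024, push timeout 0 ms, pop waits indefinitely.
  WorkQueue := TThreadedQueue<string>.Create(1024, 0, High(Cardinal));
  try
    for I := 0 to High(Workers) do
    begin
      Workers[I] := TThread.CreateAnonymousThread(
        procedure
        begin
          WorkerLoop;
        end);
      Workers[I].FreeOnTerminate := False;  // we join and free them below
      Workers[I].Start;
    end;

    for I := 1 to 200 do
      HandOff('msg ' + IntToStr(I));
    // Queue depth is the number to export to your monitoring.
    WriteLn('queue depth after burst: ', WorkQueue.QueueSize);

    // One stop marker per worker; retry because pushes never block.
    for I := 0 to High(Workers) do
      while WorkQueue.PushItem(StopMarker) <> wrSignaled do
        Sleep(1);
    for I := 0 to High(Workers) do
    begin
      Workers[I].WaitFor;
      Workers[I].Free;
    end;
    WriteLn('processed=', Processed, ' rejected=', Rejected);
  finally
    WorkQueue.Free;
  end;
end.
```

The bounded queue is a deliberate choice: when the workers fall behind, the handler finds out immediately and can drop, reject or log, instead of letting memory grow without limit.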
3. Compression: permessage-deflate
WebSocket compression (RFC 7692) trims 60–90% off JSON or text payloads. It is also CPU-heavy. Enable it globally only when your workload is text-heavy AND your CPU has headroom. For binary or already-compressed payloads (JPEG, MP4, gzipped logs) it is pure overhead.
    oServer.Extensions.PerMessage_Deflate.Enabled := True;
    oServer.Extensions.PerMessage_Deflate.ServerMaxWindow := 15;  // default
    oServer.Extensions.PerMessage_Deflate.MemLevel := 8;
    oServer.Extensions.PerMessage_Deflate.Threshold := 256;       // skip tiny msgs
Set Threshold so that messages too small to compress well are sent as-is. Skipping deflate on a 60-byte heartbeat saves CPU and gives up almost no bandwidth, because the compressed frame would be barely smaller than the original.
A subtle gotcha: permessage-deflate uses a sliding window that retains state across messages on the same connection, and that state is per-connection memory. With ServerMaxWindow=15 (the default), each connection holds around 32 KB of dictionary. Multiply by 100,000 connections and you have roughly 3 GB of RAM just for compression state. Drop ServerMaxWindow to 10 or 11 if you are memory-bound: the window shrinks to 1 KB or 2 KB, which is 16 to 32 times less per-connection memory (12 is the setting that gives 8x), in exchange for a few percent of compression ratio.
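The window arithmetic is easy to check. The sketch below assumes only that the sliding window is 2^bits bytes, which is how the window-bits parameter is defined in RFC 7692. It counts the window alone, so real per-connection usage will be higher once the compressor's other buffers are included. WindowBytes and FleetWindowMiB are hypothetical helper names.

```pascal
program DeflateWindowMemory;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

uses
  SysUtils;

// Sliding-window size in bytes for a window-bits value (RFC 7692 allows 8..15).
function WindowBytes(WindowBits: Integer): Int64;
begin
  if (WindowBits < 8) or (WindowBits > 15) then
    raise EArgumentOutOfRangeException.Create('WindowBits must be 8..15');
  Result := Int64(1) shl WindowBits;
end;

// Window memory summed over all connections, in MiB.
function FleetWindowMiB(WindowBits, Connections: Integer): Double;
begin
  Result := WindowBytes(WindowBits) * Connections / (1024 * 1024);
end;

var
  Bits: Integer;
begin
  for Bits := 15 downto 10 do
    WriteLn(Format('bits=%d  window=%d KiB  100k connections=%.0f MiB',
      [Bits, WindowBytes(Bits) div 1024, FleetWindowMiB(Bits, 100000)]));
end.
```

Each step down in window bits halves the window, so the fleet-wide total drops from about 3,125 MiB at 15 bits to under 100 MiB at 10 bits.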
4. Fragmentation
Large frames (>1 MB) hold the worker thread until the full message is reassembled. Fragment outgoing messages to keep latencies smooth and free workers for other peers.
    oClient.WriteOptions.FragmentEnabled := True;
    oClient.WriteOptions.FragmentSize := 65536;  // 64 KB chunks
On the server side, set ReadOptions.MaxFrameSize to a sane upper bound (we use 4 MB) to protect against malicious peers that try to allocate gigabyte buffers.
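To see what a 64 KB fragment size does to a large message, here is a small sketch of the chunk layout. PlanFragments and TFragment are hypothetical names for illustration. The library does this internally when fragmentation is enabled, and this is not its implementation.

```pascal
program FragmentPlan;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

uses
  SysUtils;

type
  TFragment = record
    Offset: Integer;   // where this chunk starts in the message
    Size: Integer;     // bytes carried by this chunk
    IsFinal: Boolean;  // True only for the last chunk (FIN bit set)
  end;
  TFragments = array of TFragment;

// Hypothetical helper: lays out the chunks for a message of TotalSize bytes.
function PlanFragments(TotalSize, FragmentSize: Integer): TFragments;
var
  Count, I: Integer;
begin
  if (TotalSize < 0) or (FragmentSize <= 0) then
    raise EArgumentOutOfRangeException.Create(
      'TotalSize >= 0 and FragmentSize > 0 required');
  // Round up; an empty message still needs one (empty) final frame.
  Count := TotalSize div FragmentSize;
  if (TotalSize mod FragmentSize <> 0) or (Count = 0) then
    Inc(Count);
  SetLength(Result, Count);
  for I := 0 to Count - 1 do
  begin
    Result[I].Offset := I * FragmentSize;
    Result[I].Size := TotalSize - Result[I].Offset;
    if Result[I].Size > FragmentSize then
      Result[I].Size := FragmentSize;
    Result[I].IsFinal := (I = Count - 1);
  end;
end;

var
  Plan: TFragments;
begin
  Plan := PlanFragments(1000000, 65536);
  WriteLn('fragments: ', Length(Plan));            // 16
  WriteLn('last size: ', Plan[High(Plan)].Size);   // 16960
end.
```

A 1,000,000-byte message becomes 15 full chunks plus a 16,960-byte tail, so no single write occupies a worker for more than one 64 KB chunk.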
5. Broadcast Optimisation
Sending the same message to every connected client is the #1 bottleneck for chat / trading / pub-sub servers. The naive loop (for each client: client.Send(msg)) serialises and compresses the same payload N times. Use the built-in broadcast, which serialises once and reuses the encoded frame.
    // Slow: N encodes, N compresses
    for i := 0 to oServer.Connections.Count - 1 do
      oServer.Connections[i].WriteData(vJSON);

    // Fast: 1 encode, 1 compress, N writes
    oServer.Broadcast(vJSON);

    // Fastest for fan-out >10k: pre-encoded buffer
    vFrame := oServer.EncodeFrame(vJSON);
    oServer.BroadcastEncoded(vFrame);
On a 16-core box, the difference between the naive loop and BroadcastEncoded for a 50,000-client fan-out is 12 seconds vs 380 ms. The same principle applies to channels — pre-encode the frame, then walk the subscriber list. If your subscribers split across many channels, encode once per channel and broadcast within. Premature pessimisation in this code path will tank an otherwise fast server.
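Why pre-encoding helps is easier to see with the frame format in hand. The sketch below builds an unmasked, uncompressed text frame following RFC 6455 section 5.2 and then reuses the same bytes for every simulated recipient. EncodeTextFrame is a hypothetical stand-in, not the library's EncodeFrame, and it ignores permessage-deflate entirely.

```pascal
program EncodeOnce;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

uses
  SysUtils;

// Builds one unmasked, unfragmented text frame as a server would send it.
function EncodeTextFrame(const Payload: TBytes): TBytes;
var
  Len, HeaderLen, I: Integer;
begin
  Len := Length(Payload);
  if Len <= 125 then
    HeaderLen := 2          // length fits in the second byte
  else if Len <= 65535 then
    HeaderLen := 4          // 126 marker + 16-bit big-endian length
  else
    HeaderLen := 10;        // 127 marker + 64-bit big-endian length
  SetLength(Result, HeaderLen + Len);
  Result[0] := $81;         // FIN=1, opcode=1 (text)
  if HeaderLen = 2 then
    Result[1] := Byte(Len)
  else if HeaderLen = 4 then
  begin
    Result[1] := 126;
    Result[2] := Byte((Len shr 8) and $FF);
    Result[3] := Byte(Len and $FF);
  end
  else
  begin
    Result[1] := 127;
    for I := 0 to 7 do
      Result[2 + I] := Byte((Int64(Len) shr (8 * (7 - I))) and $FF);
  end;
  if Len > 0 then
    Move(Payload[0], Result[HeaderLen], Len);
end;

var
  Frame: TBytes;
  Queued: Int64;
  I: Integer;
begin
  // Encode once...
  Frame := EncodeTextFrame(TEncoding.UTF8.GetBytes('{"tick":1}'));
  // ...then hand the same buffer to every connection (simulated here).
  Queued := 0;
  for I := 1 to 50000 do
    Inc(Queued, Length(Frame));
  WriteLn('frame bytes: ', Length(Frame), ', first byte: ', Frame[0]);
  WriteLn('total bytes queued: ', Queued);
end.
```

Server-to-client frames carry no masking key, so the encoded bytes are identical for every recipient. That is what makes one buffer safe to share across the whole fan-out.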
6. OS-Level Tuning
The kernel imposes hard limits long before the component does. Tune these before you blame the library.
| Setting | Linux | Windows | Recommendation |
| --- | --- | --- | --- |
| File descriptor limit | ulimit -n | HKLM: MaxUserPort | 2 × expected connections |
| TCP backlog | net.core.somaxconn | TcpMaxConnectResponseRetransmissions | 4096+ |
| TIME_WAIT reuse | tcp_tw_reuse=1 | TcpTimedWaitDelay=30 | Reduce port exhaustion |
| SO_REUSEPORT | kernel ≥3.9 | N/A | Multi-process acceptor |
| Ephemeral port range | net.ipv4.ip_local_port_range | MaxUserPort | 10000–65535 |
7. Heartbeats and Idle Detection
Mobile clients drop off the network all the time. Without heartbeats, your server keeps the socket open until the TCP keepalive timer fires (typically 2 hours). Configure short heartbeats and dead-peer detection.
    oServer.HeartBeat.Enabled := True;
    oServer.HeartBeat.Interval := 30;  // seconds
    oServer.HeartBeat.Timeout := 90;   // close if no pong within this
This catches half-open connections within roughly 90 to 120 seconds (the timeout plus, at worst, one heartbeat interval) rather than two hours, freeing thousands of stale sockets on a busy server.
8. Load Balancer Pairing
If you need to scale beyond a single box, pair sgcWebSockets with our TsgcWebSocketHTTPServer_LoadBalancer or an external L7 LB (HAProxy, nginx, AWS ALB). Two rules:
- Use sticky sessions — WebSocket frames are not idempotent and cannot be re-routed mid-conversation.
- Forward the original X-Forwarded-For and TLS termination headers so your application sees real client IPs.
Reference Benchmarks
Numbers from a single AX102 box (16 cores / 32 threads, 128 GB), running an echo server with 100-byte payloads at 10 messages/sec/client.
| Concurrent clients | Throughput (msg/s) | p50 latency | p99 latency | CPU usage | Memory (RSS) |
| --- | --- | --- | --- | --- | --- |
| 10,000 | 100,000 | 0.8 ms | 3.2 ms | 14% | 0.9 GB |
| 50,000 | 500,000 | 1.1 ms | 5.4 ms | 38% | 3.8 GB |
| 100,000 | 1,000,000 | 1.7 ms | 9.8 ms | 71% | 7.2 GB |
| 250,000 | 2,500,000 | 3.4 ms | 22 ms | 96% | 17.8 GB |
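For capacity planning it helps to turn the memory column into a per-connection figure. The sketch below does that division for the four rows in the table above. It assumes decimal units (1 GB = 1,000,000 KB) and folds the process's fixed overhead into the average, so treat the result as a rough planning number. KBPerConnection is a hypothetical helper.

```pascal
program MemoryPerConnection;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

uses
  SysUtils;

const
  // Rows copied from the benchmark table.
  Clients: array[0..3] of Integer = (10000, 50000, 100000, 250000);
  RssGB: array[0..3] of Double = (0.9, 3.8, 7.2, 17.8);

// Resident memory per connection in KB, assuming 1 GB = 1,000,000 KB.
function KBPerConnection(TotalGB: Double; ClientCount: Integer): Double;
begin
  if ClientCount <= 0 then
    raise EArgumentOutOfRangeException.Create('ClientCount must be positive');
  Result := TotalGB * 1000000.0 / ClientCount;
end;

var
  I: Integer;
begin
  for I := Low(Clients) to High(Clients) do
    WriteLn(Format('%7d clients: %.1f KB per connection',
      [Clients[I], KBPerConnection(RssGB[I], Clients[I])]));
end.
```

The average falls from about 90 KB at 10,000 clients to about 71 KB at 250,000 as fixed overhead is spread over more sockets. Something in the range of 70 to 90 KB per idle echo connection is a reasonable starting estimate for this workload.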
9. TLS Termination
TLS handshakes are CPU-expensive. If you serve thousands of new connections per second, terminating TLS in the Delphi process can saturate cores doing crypto instead of serving frames. For high-churn workloads we terminate TLS in nginx or HAProxy in front of the sgcWebSockets server and run the backend in plain HTTP. The frontend gets hardware AES acceleration, session resumption, and OCSP stapling for free, and the Delphi process gets to spend 100% of its CPU on application logic.
For workloads with persistent long-lived connections (a typical chat / trading scenario), in-process TLS is fine because the handshake is amortised over hours or days. For connect-disconnect-reconnect bursts (mobile clients on flaky networks), put a reverse proxy in front.
10. NIC and Network Tuning
At 1 Gbps you are unlikely to saturate the NIC. Above 10 Gbps you have to think about interrupt coalescing, receive-side scaling (RSS), and pinning sgcWebSockets worker threads to NUMA-local cores. On Linux, ethtool -L and set_irq_affinity.sh are your friends. On Windows, set RSS Profile to NUMAScaling in the NIC properties and verify with Get-NetAdapterRss. Worth tuning only if your monitoring tells you the kernel is spending real time in softirq or DPCs.
Profile-Guided Tuning Loop
Tuning is iterative. Start with defaults, run a representative load, and look at CPU per worker, heap allocation rate, p99 latency under fan-out, and OS-level connection counters. Change one thing, re-run, compare. The most common surprises:
- Compression enabled on already-compressed payloads → CPU spikes for zero bandwidth gain.
- Synchronous DB calls inside OnMessage → worker pool saturated at <1% CPU.
- No broadcast batching → head-of-line blocking during market open spikes.
- Default thread pool on a 64-core box → serialising work onto 64 workers when 256 would unlock 4× throughput.
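The "re-run, compare" step needs a consistent way to reduce a run to one number. Below is a minimal nearest-rank percentile over latency samples, applied to two made-up runs. Percentile is a hypothetical helper, and the sample data is synthetic, chosen only to make the arithmetic easy to follow.

```pascal
program P99Compare;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

uses
  SysUtils;

// Nearest-rank percentile (Pct in 1..100) of latency samples in milliseconds.
function Percentile(const Samples: array of Double; Pct: Integer): Double;
var
  Sorted: array of Double;
  I, J, Rank: Integer;
  Key: Double;
begin
  if (Length(Samples) = 0) or (Pct < 1) or (Pct > 100) then
    raise EArgumentException.Create('Need samples and Pct in 1..100');
  SetLength(Sorted, Length(Samples));
  for I := 0 to High(Samples) do
    Sorted[I] := Samples[I];
  // Insertion sort: fine for a sketch, use a real sort for large runs.
  for I := 1 to High(Sorted) do
  begin
    Key := Sorted[I];
    J := I - 1;
    while (J >= 0) and (Sorted[J] > Key) do
    begin
      Sorted[J + 1] := Sorted[J];
      Dec(J);
    end;
    Sorted[J + 1] := Key;
  end;
  // Integer ceiling of Pct/100 * N, avoiding floating-point rounding.
  Rank := (Pct * Length(Sorted) + 99) div 100;
  Result := Sorted[Rank - 1];
end;

var
  Baseline, Tuned: array of Double;
  I: Integer;
begin
  // Synthetic samples: 1..100 ms, and the same run with latencies halved.
  SetLength(Baseline, 100);
  SetLength(Tuned, 100);
  for I := 0 to 99 do
  begin
    Baseline[I] := 100 - I;        // deliberately unsorted
    Tuned[I] := (100 - I) / 2;
  end;
  WriteLn('baseline p99 (ms): ', Percentile(Baseline, 99):0:1);
  WriteLn('tuned    p99 (ms): ', Percentile(Tuned, 99):0:1);
end.
```

Compare the same percentile, computed the same way, across runs that differ by exactly one setting. Averages hide the tail that fan-out spikes produce.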
Further Reading
If you have not picked the right server class yet, start at Which Edition. Then jump to the Load Balancer component for multi-box scaling. New to the library? The Getting Started hub walks you through installation in five minutes.