By Admin on Sunday, 20 December 2020

FastMM4 vs FastMM5 vs FastMM4-AVX

Recently a new version of FastMM, developed by Pierre le Riche, was released. The new version is called FastMM5 and has been rewritten to improve performance in multithreaded applications. It can also be configured to favor either speed or lower memory usage, among other improvements.

It supports Delphi XE3 and later compilers and can be used on Win32 and Win64.

FastMM5 is dual licensed under the GPL and a commercial license, so if you want to use it in commercial projects, you must purchase a license. More details here:

https://github.com/pleriche/FastMM5

FastMM4 has a new fork called FastMM4-AVX, developed by Maxim Masiutin, which adds some very interesting features: more efficient synchronization, AVX instructions for faster memory copy, general speed improvements and more. FastMM4-AVX is dual licensed under the MPL and the GPL. More details here:

https://github.com/maximmasiutin/FastMM4-AVX

Configuration

To test the performance with our components, a new Windows console application, sgcBenchmark.exe, has been created. It measures the performance of each memory manager using our sgcWebSockets components.

The test is very simple: a client (or more than one client) connects to a server and sends a message, and the server replies with the same message. This is repeated 100,000 times. The tests are repeated with a different number of concurrent clients: first 1, then 10, 100, 500 and 1000. The measured time is the time elapsed between the first message sent by the client and the last message received from the server, so the time used to connect to the server is not measured.
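The measurement loop described above can be sketched in a few lines of Python. This is only an illustration of the timing method, not the sgcBenchmark code: it uses a plain TCP echo on localhost instead of WebSockets, a single client, and hypothetical helper names (`recv_exact`, `echo_until_closed`, `run_echo_benchmark`).

```python
import socket
import threading
import time


def recv_exact(sock, size):
    """Read exactly `size` bytes, or fewer if the peer closes the connection."""
    chunks = []
    remaining = size
    while remaining > 0:
        chunk = sock.recv(remaining)
        if not chunk:
            break
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)


def echo_until_closed(conn):
    """Server side: send every received chunk straight back to the client."""
    with conn:
        while True:
            data = conn.recv(4096)
            if not data:
                return
            conn.sendall(data)


def run_echo_benchmark(rounds, payload=b"hello"):
    """Return (elapsed_ms, replies_matched) for one client doing `rounds` echoes."""
    listener = socket.create_server(("127.0.0.1", 0))  # port 0: the OS picks a free port
    port = listener.getsockname()[1]

    def accept_one():
        conn, _ = listener.accept()
        echo_until_closed(conn)

    server = threading.Thread(target=accept_one, daemon=True)
    server.start()

    matched = 0
    with socket.create_connection(("127.0.0.1", port)) as client:
        # The clock starts only after the connection is established,
        # so the time used to connect is not part of the measurement.
        start = time.perf_counter()
        for _ in range(rounds):
            client.sendall(payload)
            if recv_exact(client, len(payload)) == payload:
                matched += 1
        elapsed_ms = (time.perf_counter() - start) * 1000.0

    server.join(timeout=5)
    listener.close()
    return elapsed_ms, matched
```

Calling `run_echo_benchmark(100000)` would mirror the 100,000 round trips of the real test for a single client. The absolute numbers are not comparable with the Delphi results, since the protocol, runtime and memory manager are all different.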

The benchmark compares the performance of the default memory manager that ships with Delphi 10.4.1, FastMM5 and FastMM4-AVX.

Benchmark Indy WebSocket Server

In the first benchmark, the server used is the Indy WebSocket Server. This server is based on the Indy TCP Server, so every connection creates one thread.

The values are measured in milliseconds. For example, the first test, run with 1 client on the Win32 platform, takes 4135 milliseconds with the default memory manager, 4214 milliseconds with FastMM5 and 4823 milliseconds with FastMM4-AVX. The percentage is calculated against the reference value, in this case the default memory manager that ships with Delphi: the lower the percentage, the better the performance.
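As a worked example of the percentage column, the short Python sketch below recomputes the values for the 1-client Win32 row. The formula (difference from the default, divided by the default) is inferred from the published numbers, and the function name `percent_vs_baseline` is just an illustrative choice.

```python
def percent_vs_baseline(baseline_ms, candidate_ms):
    """Relative difference of a timing against the reference, in percent.

    Negative values mean the candidate was faster than the reference.
    """
    return round((candidate_ms - baseline_ms) / baseline_ms * 100.0, 2)


# Timings for 1 client on Win32, Indy WebSocket Server
default_ms, fastmm5_ms, fastmm4_avx_ms = 4135, 4214, 4823

print(percent_vs_baseline(default_ms, fastmm5_ms))      # 1.91
print(percent_vs_baseline(default_ms, fastmm4_avx_ms))  # 16.64
```

A value of -58.97, as in the 10-client Win32 row, therefore means the test finished in roughly 41% of the time taken with the default memory manager.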

The benchmark was run 3 times, and the values shown are the average of the three runs.

For the benchmark, the server used was:


The Delphi version used was Delphi 10.4.1, with the latest FastMM5 and FastMM4-AVX versions from GitHub.

The results of the benchmark are shown below.

Clients  Platform  Default (ms)  FMM5 (ms)  FMM5 (%)  FMM4-AVX (ms)  FMM4-AVX (%)
1        Win32     4135          4214       1.91%     4823           16.64%
1        Win64     4052          4520       11.55%    4328           6.81%
10       Win32     4214          1729       -58.97%   1828           -56.62%
10       Win64     4104          1875       -54.31%   1651           -59.77%
100      Win32     3958          1604       -59.47%   1583           -60.01%
100      Win64     3958          1614       -59.22%   1635           -58.69%
500      Win32     4098          1723       -57.96%   1854           -54.76%
500      Win64     5333          1791       -66.42%   1833           -65.63%
1000     Win32     5927          2208       -62.75%   2328           -60.72%
1000     Win64     8166          2229       -72.70%   2234           -72.64%

Benchmark HTTP.SYS Server

In the second benchmark, the server used is the HTTP.SYS WebSocket Server. This server is based on Microsoft's HTTP Server API, and the connections are handled by a pool of threads.

The values are measured in milliseconds. For example, the first test, run with 1 client on the Win32 platform, takes 5364 milliseconds with the default memory manager, 5182 milliseconds with FastMM5 and 5838 milliseconds with FastMM4-AVX. The percentage is calculated against the reference value, in this case the default memory manager that ships with Delphi: the lower the percentage, the better the performance.

The benchmark was run 3 times, and the values shown are the average of the three runs.

For the benchmark, the server used was:


The Delphi version used was Delphi 10.4.1, with the latest FastMM5 and FastMM4-AVX versions from GitHub.

The results of the benchmark are shown below.

Clients  Platform  Default (ms)  FMM5 (ms)  FMM5 (%)  FMM4-AVX (ms)  FMM4-AVX (%)
1        Win32     5364          5182       -3.39%    5838           8.84%
1        Win64     5507          5206       -0.61%    5135           1.54%
10       Win32     4922          1744       -64.57%   2088           -57.58%
10       Win64     4958          1770       -64.30%   1953           -60.61%
100      Win32     3359          1682       -49.93%   2244           -33.19%
100      Win64     3979          1536       -61.40%   1859           -53.28%
500      Win32     2364          1890       -20.05%   2344           -0.85%
500      Win64     2901          1666       -42.57%   1859           -35.92%
1000     Win32     3296          1968       -40.29%   2531           -23.21%
1000     Win64     4469          1989       -55.49%   2047           -54.20%

Comments about Benchmarks

Some comments on the results obtained after benchmarking the 3 memory managers:

The final decision to choose one memory manager or another depends on the project. I don't think there is a single memory manager that works best under all conditions, so before choosing one, test, test and test again to see which performs better for your needs.

Console Benchmarks  

Find below the compiled Windows console applications used to run the benchmarks. Just execute one of them and 2 console applications will start, one for the server and another for the client. Let the client run the tests; when it finishes, both applications close automatically. There is one console application for every platform and memory manager.

To execute all the tests, I used a .bat file that calls every benchmark and saves the results in a text file.

Updated 16-May-2021 

Maxim, the developer of FastMM4-AVX, asked me to repeat the benchmarks after the following improvements. His comments are below:

I have improved FastMM4-AVX to work better in multithreaded environments at the cost of single-thread performance.
The speed improvement primarily relates to 64 bits.
The essence of the improvement is the following. If the list of blocks of this size is locked on releasing a memory block, the FreeMem returns immediately, and the block will be released later, during the next FreeMem. As a result, two memory blocks will be released with a single memory lock.

FastMM5 uses a different technique: it has not a single list of blocks for each size, but several of them called "arenas", so the thread contention will occur less likely because each thread will probably get its own arena. However, this requires much more memory, and may need a much higher number of memory pages, which, in some scenarios, may be slower.
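The deferred-release idea from Maxim's comments can be illustrated with a small Python model. This is a conceptual sketch only, not FastMM4-AVX code: the class `DeferredFreePool` and its fields are hypothetical, `threading.Lock` stands in for the lock on one block-size list, and a `deque` holds the blocks whose release was postponed.

```python
import threading
from collections import deque


class DeferredFreePool:
    """Toy model of one block-size list guarded by a single lock (hypothetical)."""

    def __init__(self):
        self.lock = threading.Lock()
        self.free_blocks = []       # reusable blocks; modified only while holding the lock
        self.pending = deque()      # blocks whose release was postponed
        self.lock_acquisitions = 0  # how many times free() actually took the lock

    def free(self, block):
        """Release `block`; return False if the release was deferred."""
        if not self.lock.acquire(blocking=False):
            # The list is locked by someone else: park the block and return at once.
            self.pending.append(block)
            return False
        try:
            self.lock_acquisitions += 1
            self.free_blocks.append(block)
            # Release any blocks parked by earlier calls under the same lock.
            while self.pending:
                self.free_blocks.append(self.pending.popleft())
        finally:
            self.lock.release()
        return True


pool = DeferredFreePool()
pool.lock.acquire()              # simulate another thread holding the lock
deferred = pool.free("block A")  # returns immediately, block A is parked
pool.lock.release()
released = pool.free("block B")  # takes the lock once and releases both blocks
print(deferred, released, pool.free_blocks, pool.lock_acquisitions)
# prints: False True ['block B', 'block A'] 1
```

In this model two blocks end up released with a single lock acquisition, which is the effect described in the comments above. A real memory manager would use atomic operations rather than a Python `deque` for the parked blocks.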

I've downloaded the latest versions of FastMM5 and FastMM4-AVX and repeated the benchmarks. The results are below.

Benchmark Indy WebSocket Server 

In the first benchmark, the server used is the Indy WebSocket Server. This server is based on the Indy TCP Server, so every connection creates one thread.

The values are measured in milliseconds. For example, the first test, run with 1 client on the Win32 platform, takes 4109 milliseconds with the default memory manager, 4156 milliseconds with FastMM5 and 4255 milliseconds with FastMM4-AVX. The percentage is calculated against the reference value, in this case the default memory manager that ships with Delphi: the lower the percentage, the better the performance.

The benchmark was run 3 times, and the values shown are the average of the three runs.

For the benchmark, the server used was:


The Delphi version used was Delphi 10.4.2, with the latest FastMM5 and FastMM4-AVX versions from GitHub.

The results of the benchmark are shown below.

Clients  Platform  Default (ms)  FMM5 (ms)  FMM5 (%)  FMM4-AVX (ms)  FMM4-AVX (%)
1        Win32     4109          4156       1.14%     4255           3.55%
1        Win64     4369          4182       -4.17%    3885           -10.98%
10       Win32     3562          1541       -56.74%   1594           -55.25%
10       Win64     4130          1708       -58.64%   1630           -60.53%
100      Win32     3411          1463       -57.11%   1427           -58.16%
100      Win64     3875          1453       -62.50%   1447           -62.66%
500      Win32     2859          1693       -40.78%   1645           -42.46%
500      Win64     4864          1661       -65.85%   1672           -65.63%
1000     Win32     5744          2062       -64.10%   2119           -63.11%
1000     Win64     7203          2245       -68.83%   1963           -72.75%

Benchmark HTTP.SYS Server 

In the second benchmark, the server used is the HTTP.SYS WebSocket Server. This server is based on Microsoft's HTTP Server API, and the connections are handled by a pool of threads.

The values are measured in milliseconds. For example, the first test, run with 1 client on the Win32 platform, takes 5182 milliseconds with the default memory manager, 5255 milliseconds with FastMM5 and 5057 milliseconds with FastMM4-AVX. The percentage is calculated against the reference value, in this case the default memory manager that ships with Delphi: the lower the percentage, the better the performance.

The benchmark was run 3 times, and the values shown are the average of the three runs.

For the benchmark, the server used was:


The Delphi version used was Delphi 10.4.2, with the latest FastMM5 and FastMM4-AVX versions from GitHub.

The results of the benchmark are shown below.

Clients  Platform  Default (ms)  FMM5 (ms)  FMM5 (%)  FMM4-AVX (ms)  FMM4-AVX (%)
1        Win32     5182          5255       1.41%     5057           -2.41%
1        Win64     5177          5073       -2.01%    5047           -2.51%
10       Win32     4776          1489       -68.62%   1552           -67.50%
10       Win64     4948          1484       -70.01%   1547           -68.73%
100      Win32     3067          1328       -56.70%   1625           -47.02%
100      Win64     3552          1317       -62.92%   1557           -56.17%
500      Win32     2109          1536       -27.17%   1593           -24.47%
500      Win64     2573          1411       -45.16%   1547           -39.88%
1000     Win32     2906          1630       -43.91%   1927           -33.69%
1000     Win64     3354          1515       -54.83%   1713           -48.93%

Comments about Benchmarks 

As the benchmark results show, FastMM4-AVX has improved its performance in multithreaded environments.


Console Benchmarks

Find below the updated compiled Windows console applications used to run the benchmarks.
