FastMM4 vs FastMM5 vs FastMM4-AVX

A new version of FastMM, developed by Pierre le Riche, was recently released. It is called FastMM5 and has been rewritten to improve performance in multithreaded applications; it can also be configured to favor either speed or lower memory usage, among other changes.

It supports compilers from Delphi XE3 onwards and can be used on Win32 and Win64.

FastMM5 is dual licensed under GPL and a commercial license, so if you want to use it in commercial projects, you must purchase a license. More details here:

https://github.com/pleriche/FastMM5


FastMM4 has a new fork, FastMM4-AVX, developed by Maxim Masiutin, which adds some very interesting features: more efficient synchronization, AVX instructions for faster memory copy, speed improvements and more. FastMM4-AVX is dual licensed under MPL and GPL. More details here:

https://github.com/maximmasiutin/FastMM4-AVX

Configuration

To test the performance with our components, a new Windows console application, sgcBenchmark.exe, has been created. It measures the performance of each memory manager using our sgcWebSockets components.

The test is very simple: a client (or more than one client) connects to a server and sends a message, and the server replies to the client with the same message. This is repeated 100,000 times. The tests are repeated with a different number of concurrent clients: first 1, then 10, 100 and so on. The measured time is the time elapsed between the first message sent by the client and the last message received from the server, so the time needed to connect to the server is not measured.

The benchmark compares the performance of the default memory manager that comes with Delphi 10.4.1, FastMM5 and FastMM4-AVX.
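The measurement described above can be sketched in a few lines. This is only an illustration of the timing method, not the sgcBenchmark code: it uses a plain local socket pair instead of WebSockets, a single client, and the hypothetical names `echo_server`, `recv_exact` and `run_echo_benchmark`.

```python
import socket
import threading
import time


def echo_server(conn: socket.socket) -> None:
    """Reply to every received chunk with the same bytes until the peer closes."""
    with conn:
        while True:
            data = conn.recv(4096)
            if not data:
                break
            conn.sendall(data)


def recv_exact(conn: socket.socket, size: int) -> bytes:
    """Read exactly `size` bytes, because recv may return fewer."""
    chunks = []
    remaining = size
    while remaining:
        chunk = conn.recv(remaining)
        if not chunk:
            raise ConnectionError("peer closed early")
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)


def run_echo_benchmark(messages: int = 1000, payload: bytes = b"hello") -> float:
    """Return the elapsed milliseconds for `messages` echo round trips."""
    # The connection is set up before the timer starts, so it is not measured.
    client, server = socket.socketpair()
    worker = threading.Thread(target=echo_server, args=(server,), daemon=True)
    worker.start()
    with client:
        start = time.perf_counter()
        for _ in range(messages):
            client.sendall(payload)
            reply = recv_exact(client, len(payload))
            assert reply == payload  # the server must echo the same message
        elapsed_ms = (time.perf_counter() - start) * 1000.0
    worker.join(timeout=5)
    return elapsed_ms


if __name__ == "__main__":
    print(f"{run_echo_benchmark():.1f} ms")
```

To imitate the concurrent runs, one thread per client could call `run_echo_benchmark` at the same time; the sketch keeps a single client for brevity.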


Benchmark Indy WebSocket Server

In the first benchmark, the server used is the Indy WebSocket Server. This server is based on the Indy TCP Server, so every connection creates one thread.

The values are measured in milliseconds. For example, the first test, with 1 client on Win32, takes 4135 milliseconds with the default memory manager, 4214 milliseconds with FastMM5 and 4823 milliseconds with FastMM4-AVX. The percentage is calculated against the reference value, in this case the default memory manager that comes with Delphi: the lower the percentage, the better the performance.

The benchmark was run 3 times and the values shown are the average of the three runs.
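As a small worked example of how the numbers in the tables are derived, the following sketch averages several runs and computes the percentage against the reference. The function names `average_ms` and `relative_change` are hypothetical, and the formula is the usual relative difference, which reproduces the figures quoted above for the first test.

```python
from statistics import mean


def average_ms(runs):
    """Arithmetic mean of the elapsed times (ms) of repeated runs."""
    return mean(runs)


def relative_change(candidate_ms, reference_ms):
    """Percentage difference against the reference; negative means faster."""
    return (candidate_ms - reference_ms) / reference_ms * 100.0


if __name__ == "__main__":
    # 1 client, Win32, Indy server (values taken from the text above).
    default_ms, fmm5_ms, avx_ms = 4135, 4214, 4823
    print(f"FastMM5:     {relative_change(fmm5_ms, default_ms):+.2f}%")  # +1.91%
    print(f"FastMM4-AVX: {relative_change(avx_ms, default_ms):+.2f}%")   # +16.64%
```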

For the benchmark, the server used was:

  • Windows 2016 Server Datacenter
  • 16 Virtual Processors
  • 32 GB RAM
  • 2.2 GHz

The Delphi version used was Delphi 10.4.1, with the latest FastMM5 and FastMM4-AVX versions from GitHub.

Find below the result of the benchmark.

Clients  Platform  Default (ms)  FMM5 (ms)  FMM5 (%)  FMM4-AVX (ms)  FMM4-AVX (%)
1        Win32     4135          4214       1.91%     4823           16.64%
1        Win64     4052          4520       11.55%    4328           6.81%
10       Win32     4214          1729       -58.97%   1828           -56.62%
10       Win64     4104          1875       -54.31%   1651           -59.77%
100      Win32     3958          1604       -59.47%   1583           -60.01%
100      Win64     3958          1614       -59.22%   1635           -58.69%
500      Win32     4098          1723       -57.96%   1854           -54.76%
500      Win64     5333          1791       -66.42%   1833           -65.63%
1000     Win32     5927          2208       -62.75%   2328           -60.72%
1000     Win64     8166          2229       -72.70%   2234           -72.64%

Benchmark HTTP.SYS Server

In the second benchmark, the server used is the HTTP.SYS WebSocket Server. This server is based on the Microsoft HTTP API framework, and the connections are handled by a pool of threads.

The values are measured in milliseconds. For example, the first test, with 1 client on Win32, takes 5364 milliseconds with the default memory manager, 5182 milliseconds with FastMM5 and 5838 milliseconds with FastMM4-AVX. The percentage is calculated against the reference value, in this case the default memory manager that comes with Delphi: the lower the percentage, the better the performance.

The benchmark was run 3 times and the values shown are the average of the three runs.

For the benchmark, the server used was:

  • Windows 2016 Server Datacenter
  • 16 Virtual Processors
  • 32 GB RAM
  • 2.2 GHz

The Delphi version used was Delphi 10.4.1, with the latest FastMM5 and FastMM4-AVX versions from GitHub.

Find below the result of the benchmark.

Clients  Platform  Default (ms)  FMM5 (ms)  FMM5 (%)  FMM4-AVX (ms)  FMM4-AVX (%)
1        Win32     5364          5182       -3.39%    5838           8.84%
1        Win64     5507          5206       -0.61%    5135           1.54%
10       Win32     4922          1744       -64.57%   2088           -57.58%
10       Win64     4958          1770       -64.30%   1953           -60.61%
100      Win32     3359          1682       -49.93%   2244           -33.19%
100      Win64     3979          1536       -61.40%   1859           -53.28%
500      Win32     2364          1890       -20.05%   2344           -0.85%
500      Win64     2901          1666       -42.57%   1859           -35.92%
1000     Win32     3296          1968       -40.29%   2531           -23.21%
1000     Win64     4469          1989       -55.49%   2047           -54.20%

Comments about Benchmarks

Find below some comments about the results obtained after benchmarking the 3 memory managers:
  • In a single-threaded application, there are no big differences in performance between the default memory manager (based on FastMM4), FastMM5 and FastMM4-AVX.
  • FastMM5 and FastMM4-AVX work much better in multithreaded applications.
  • The differences between FastMM5 and FastMM4-AVX are small, at least in these benchmarks.
  • The Win32 benchmarks generally perform better than the Win64 ones. Using FastMM5 or FastMM4-AVX in a Win64 application improves performance more than in a Win32 application.

The final decision to choose one memory manager or another depends on the project. I think there is no single memory manager that is the best in all conditions, so before choosing one, test, test and test again to see which performs better for your needs.

Console Benchmarks  

Find below the compiled Windows console applications used for the benchmarks. Just run one of them and 2 console applications will start, one for the server and another for the client. Let the client run the tests; when it finishes, both applications close automatically. There is one console application for every platform and memory manager.

To execute all the tests, I used a .bat file that calls every benchmark and saves the results in a text file.

call bin\Win32\sgcBenchmark.exe >> benchmark_win32.txt
call bin\Win32\sgcBenchmark_FMM5.exe >> benchmark_win32_FMM5.txt
call bin\Win32\sgcBenchmark_FMM4_AVX.exe >> benchmark_win32_FMM4_AVX.txt

call bin\Win64\sgcBenchmark.exe >> benchmark_win64.txt
call bin\Win64\sgcBenchmark_FMM5.exe >> benchmark_win64_FMM5.txt
call bin\Win64\sgcBenchmark_FMM4_AVX.exe >> benchmark_win64_FMM4_AVX.txt

call bin\Win32\sgcBenchmark.exe -t both.sys >> benchmark_sys_win32.txt
call bin\Win32\sgcBenchmark_FMM5.exe -t both.sys >> benchmark_sys_win32_FMM5.txt
call bin\Win32\sgcBenchmark_FMM4_AVX.exe -t both.sys >> benchmark_sys_win32_FMM4_AVX.txt

call bin\Win64\sgcBenchmark.exe -t both.sys >> benchmark_sys_win64.txt
call bin\Win64\sgcBenchmark_FMM5.exe -t both.sys >> benchmark_sys_win64_FMM5.txt
call bin\Win64\sgcBenchmark_FMM4_AVX.exe -t both.sys >> benchmark_sys_win64_FMM4_AVX.txt

exit 
File Name: sgcBenchmark
File Size: 7.9 mb
Download File

Updated 16-May-2021 

Maxim, the developer of FastMM4-AVX, contacted us to report the following improvements. Find below his comments:

I have improved FastMM4-AVX to work better in multithreaded environments at the cost of single-thread performance.
The speed improvement primarily relates to 64 bits.
The essence of the improvement is the following. If the list of blocks of this size is locked on releasing a memory block, the FreeMem returns immediately, and the block will be released later, during the next FreeMem. As a result, two memory blocks will be released with a single memory lock.

FastMM5 uses a different technique: it has not a single list of blocks for each size, but several of them called "arenas", so the thread contention will occur less likely because each thread will probably get its own arena. However, this requires much more memory, and may need a much higher number of memory pages, which, in some scenarios, may be slower.
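The deferred release described in the first part of the quote can be illustrated with a toy model. This is not FastMM4-AVX code (the library is written in Pascal and assembly); it is a minimal Python sketch, with the hypothetical class name `SizeClassFreeList`, that only shows the idea: if the list is locked, `free` returns immediately and the block is released by a later call that does get the lock.

```python
import threading
from collections import deque


class SizeClassFreeList:
    """Toy free list for one block size with deferred release on contention."""

    def __init__(self):
        self.lock = threading.Lock()
        self.free_blocks = []    # only touched while holding self.lock
        self.pending = deque()   # blocks whose release was postponed

    def free(self, block):
        """Release `block`; return False if the release had to be postponed."""
        # Try to take the lock without waiting.
        if not self.lock.acquire(blocking=False):
            # The list is busy: remember the block and return immediately.
            self.pending.append(block)
            return False
        try:
            # Release any postponed blocks together with the current one,
            # so several blocks are released under a single lock.
            while self.pending:
                self.free_blocks.append(self.pending.popleft())
            self.free_blocks.append(block)
            return True
        finally:
            self.lock.release()
```

For example, if another thread holds `lock` while `free("a")` is called, the call returns `False` at once; a later `free("b")` that obtains the lock moves both `"a"` and `"b"` into `free_blocks`.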

I've downloaded the latest versions of FastMM5 and FastMM4-AVX and repeated the benchmarks. Find below the results.

Benchmark Indy WebSocket Server 

In the first benchmark, the server used is the Indy WebSocket Server. This server is based on the Indy TCP Server, so every connection creates one thread.

The values are measured in milliseconds. For example, the first test, with 1 client on Win32, takes 4109 milliseconds with the default memory manager, 4156 milliseconds with FastMM5 and 4255 milliseconds with FastMM4-AVX. The percentage is calculated against the reference value, in this case the default memory manager that comes with Delphi: the lower the percentage, the better the performance.

The benchmark was run 3 times and the values shown are the average of the three runs.

For the benchmark, the server used was:

  • Windows 2016 Server Datacenter
  • 16 Virtual Processors
  • 32 GB RAM
  • 2.2 GHz

The Delphi version used was Delphi 10.4.2, with the latest FastMM5 and FastMM4-AVX versions from GitHub.

Find below the result of the benchmark.

Clients  Platform  Default (ms)  FMM5 (ms)  FMM5 (%)  FMM4-AVX (ms)  FMM4-AVX (%)
1        Win32     4109          4156       1.14%     4255           3.55%
1        Win64     4369          4182       -4.17%    3885           -10.98%
10       Win32     3562          1541       -56.74%   1594           -55.25%
10       Win64     4130          1708       -58.64%   1630           -60.53%
100      Win32     3411          1463       -57.11%   1427           -58.16%
100      Win64     3875          1453       -62.50%   1447           -62.66%
500      Win32     2859          1693       -40.78%   1645           -42.46%
500      Win64     4864          1661       -65.85%   1672           -65.63%
1000     Win32     5744          2062       -64.10%   2119           -63.11%
1000     Win64     7203          2245       -68.83%   1963           -72.75%

Benchmark HTTP.SYS Server 

In the second benchmark, the server used is the HTTP.SYS WebSocket Server. This server is based on the Microsoft HTTP API framework, and the connections are handled by a pool of threads.

The values are measured in milliseconds. For example, the first test, with 1 client on Win32, takes 5182 milliseconds with the default memory manager, 5255 milliseconds with FastMM5 and 5057 milliseconds with FastMM4-AVX. The percentage is calculated against the reference value, in this case the default memory manager that comes with Delphi: the lower the percentage, the better the performance.

The benchmark was run 3 times and the values shown are the average of the three runs.

For the benchmark, the server used was:

  • Windows 2016 Server Datacenter
  • 16 Virtual Processors
  • 32 GB RAM
  • 2.2 GHz

The Delphi version used was Delphi 10.4.2, with the latest FastMM5 and FastMM4-AVX versions from GitHub.

Find below the result of the benchmark.

Clients  Platform  Default (ms)  FMM5 (ms)  FMM5 (%)  FMM4-AVX (ms)  FMM4-AVX (%)
1        Win32     5182          5255       1.41%     5057           -2.41%
1        Win64     5177          5073       -2.01%    5047           -2.51%
10       Win32     4776          1489       -68.62%   1552           -67.50%
10       Win64     4948          1484       -70.01%   1547           -68.73%
100      Win32     3067          1328       -56.70%   1625           -47.02%
100      Win64     3552          1317       -62.92%   1557           -56.17%
500      Win32     2109          1536       -27.17%   1593           -24.47%
500      Win64     2573          1411       -45.16%   1547           -39.88%
1000     Win32     2906          1630       -43.91%   1927           -33.69%
1000     Win64     3354          1515       -54.83%   1713           -48.93%

Comments about Benchmarks 

As the benchmark results show, the FastMM4-AVX library has improved its performance in multithreaded environments.

  • With the Indy server the results are very similar; FastMM4-AVX performs a bit better than FastMM5, especially with high concurrency and 64 bits.
  • With the HTTP.SYS server, FastMM5 performs better than FastMM4-AVX, but the differences are smaller than in the previous benchmarks.

Console Benchmarks

Find below the updated compiled Windows console applications used for the benchmarks.

File Name: sgcBenchmark_202105
File Size: 11.1 mb
Download File