Apache optimization and mitigating DoS and DDoS attacks

Denial-of-service (DoS) and distributed denial-of-service (DDoS) attacks are some of the most common cyberattacks these days. They are fairly easy to execute and the consequences can vary from annoying to very problematic, for example if a crucial web service of a company or public service becomes inaccessible. In the current geopolitical situation, DDoS attacks are a very popular method used by hacktivists.

In this guide I propose a configuration which can help against some of these attacks. I recommend implementing such measures on all web servers. Even if they might not completely prevent an attack, at least they form a basis which you can easily adapt to future attacks. These measures can also help against spammers and scrapers of your website and improve the general performance of your server.

Firewall protection

Blocking connections from known bad IP addresses

We don’t want to waste CPU time on packets coming from known bad IP addresses, so we block them as early as possible. Follow my earlier article on blocking known bad IP addresses.

Geoblocking connections

If you need a quick way to mitigate a DoS or DDoS attack you are experiencing, you can consider geoblocking connections in your firewall. For example, you could temporarily allow connections to your web server only from IP addresses of your own country, if that’s where most of your visitors come from. You can download lists of all IPv4 and IPv6 addresses per country. See the previously mentioned article about using iplists with Foomuuri to implement geoblocking in your firewall.

Rate limiting connections

Then we can rate limit new connections per source IP address in our firewall.

In Foomuuri, you can set something like this in the public-localhost section:

http  saddr_rate "3/minute burst 20" saddr_rate_name http_limit
https saddr_rate "3/minute burst 20" saddr_rate_name http_limit
http  drop counter "http_limit" log "http_limit"
https drop counter "http_limit" log "http_limit"

This allows a remote host to open a burst of 20 new connections to ports 80 and 443 combined, after which new connections from that source IP are limited to 3 per minute. Connection attempts above that are dropped, counted and logged with the prefix http_limit. You will need to adapt these numbers to what your own server requires. When under attack, I recommend removing the last 2 lines: you don’t want to waste time logging everything. When setting up rules which define limits per IP address, take into account users who share a public IP via NAT, such as on corporate networks.

The above rules still allow one source IP address to have more than 20 simultaneous connections open to your web server, because they only limit the rate at which new connections can be made. If you want to limit the total number of open connections, you can use something like this:

http  saddr_rate "ct count 20" saddr_rate_name http_limit
https saddr_rate "ct count 20" saddr_rate_name http_limit
http  drop counter "http_limit" log "http_limit"
https drop counter "http_limit" log "http_limit"

It is also possible to implement a global connection limit to the http and https ports, irrespective of the source IP address. While this can be very effective against DDoS attacks, you will also block legitimate visitors of your site. If you implement global rate limiting, add rules before it to allow your own IP address, so that you do not lock yourself out. In case of emergency, and if all else fails, you can combine global rate limiting with geoblocking: first create rules which allow connections from certain countries and regions, and after that place the global rate limit rules, which will then apply to connections from all other countries and regions.

For example:

http  saddr_rate "ct count 20" saddr_rate_name http_limit saddr @belgium4
http  saddr_rate "ct count 20" saddr_rate_name http_limit saddr @belgium6
https saddr_rate "ct count 20" saddr_rate_name http_limit saddr @belgium4
https saddr_rate "ct count 20" saddr_rate_name http_limit saddr @belgium6
http  saddr @belgium4 drop
http  saddr @belgium6 drop
https saddr @belgium4 drop
https saddr @belgium6 drop
http  global_rate "10/second burst 20"
https global_rate "10/second burst 20"

If the iplists @belgium4 and @belgium6 contain all Belgian IPv4 and IPv6 addresses, then these rules will allow up to 20 connections per source IP from Belgium. More connections per source IP from Belgium will be dropped. From other parts of the world, we allow a burst of 20 new connections, after which there is a global limit of 10 new connections per second, irrespective of the source address.

Enable SYN cookies

SYN cookies are a method that helps mitigate SYN flood attacks, in which a server is flooded with SYN requests. When the backlog queue where SYN requests are stored is full, a SYN cookie is sent in the server’s SYN + ACK response instead of creating an entry in the backlog queue. The SYN cookie contains a cryptographic encoding of the SYN entry and allows the server to reconstruct it from the client’s ACK response.

SYN cookies are enabled by default in all Debian Linux kernels because they set CONFIG_SYN_COOKIES=y. If you want to check whether SYN cookies are enabled, run

# sysctl net.ipv4.tcp_syncookies

If the value is 0, you can enable them by running

# sysctl -w net.ipv4.tcp_syncookies=1

To make this setting persist across reboots, create the file /etc/sysctl.d/syncookies.conf with this content:

net.ipv4.tcp_syncookies=1
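
You can load the file immediately, without waiting for a reboot, with:

# sysctl -p /etc/sysctl.d/syncookies.conf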

Configuring Apache to block Denial-of-Service attacks

Tuning Apache and the event MPM

In /etc/apache2/apache2.conf check these settings:

Timeout 60
KeepAlive On
MaxKeepAliveRequests 100
KeepAliveTimeout 5

These are the defaults in recent upstream Apache versions, but they might differ if you are using an older packaged Apache version; Debian, for example, still sets a much higher Timeout value by default. If Timeout and KeepAliveTimeout are set too high, clients can keep a connection open for far too long, filling up the available workers. If you are under attack, you can consider setting the Timeout value even lower and disabling KeepAlive completely. See further below for how to do the latter automatically with mod_qos.

Make sure you are using the Event MPM in Apache, because it’s the most performant. You can check with this command:

# apache2ctl -M | grep mpm
 mpm_event_module (shared)

Then we need to configure the Event MPM in /etc/apache2/mods-enabled/mpm_event.conf:

StartServers             5
ServerLimit              16
MinSpareThreads          100
MaxSpareThreads          400
ThreadLimit              120
ThreadsPerChild          100
MaxRequestWorkers        1000
MaxConnectionsPerChild   50000

We start 5 (StartServers) child server processes, with each child having up to 100 (ThreadsPerChild) threads to deal with connections. We want 100 to 400 spare threads (MinSpareThreads/MaxSpareThreads) available to handle new requests. MaxRequestWorkers sets an upper limit of 1000 requests we can handle simultaneously. ServerLimit defines the maximum number of child server processes; you only need to increase the default value of 16 if MaxRequestWorkers / ThreadsPerChild is higher than 16. ThreadLimit is the maximum value you can set ThreadsPerChild to by just restarting Apache. If you need to increase ThreadsPerChild beyond the current ThreadLimit, you will need to raise ThreadLimit too and stop and start the parent process manually. Don’t set ThreadLimit much higher than ThreadsPerChild, because it increases memory consumption. After 50000 connections (MaxConnectionsPerChild), a child process will exit and a new one will be spawned, which is useful in case of memory leaks.
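
As a quick sanity check of these numbers (this is just the arithmetic, not something to put in a configuration file):

# child processes needed at full load: MaxRequestWorkers / ThreadsPerChild = 1000 / 100 = 10
# 10 <= ServerLimit (16), so the default ServerLimit is high enough
# maximum thread capacity: ServerLimit * ThreadsPerChild = 16 * 100 = 1600 >= MaxRequestWorkers (1000)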

Enabling compression and client caching of content

Enable the deflate and brotli modules in Apache to serve compressed files to clients which support this:

# a2enmod deflate
# a2enmod brotli

Then in /etc/apache2/mods-enabled/deflate.conf put this:

<IfModule mod_filter.c>
	AddOutputFilterByType DEFLATE text/html text/plain text/xml
	AddOutputFilterByType DEFLATE text/css
	AddOutputFilterByType DEFLATE application/x-javascript application/javascript application/ecmascript
	AddOutputFilterByType DEFLATE application/rss+xml
	AddOutputFilterByType DEFLATE application/xml
	AddOutputFilterByType DEFLATE text/json application/json
	AddOutputFilterByType DEFLATE application/wasm
</IfModule>

and in /etc/apache2/mods-enabled/brotli.conf this:

<IfModule mod_filter.c>
	AddOutputFilterByType BROTLI_COMPRESS text/html text/plain text/xml
	AddOutputFilterByType BROTLI_COMPRESS text/css
	AddOutputFilterByType BROTLI_COMPRESS application/x-javascript application/javascript application/ecmascript
	AddOutputFilterByType BROTLI_COMPRESS application/rss+xml
	AddOutputFilterByType BROTLI_COMPRESS application/xml
	AddOutputFilterByType BROTLI_COMPRESS text/json application/json
	AddOutputFilterByType BROTLI_COMPRESS application/wasm
</IfModule>
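
After reloading Apache, you can verify that compression is actually applied by checking the response headers. For example, with example.com as a placeholder for your own site:

# systemctl reload apache2
# curl -s -o /dev/null -D - -H "Accept-Encoding: br,gzip" https://example.com/ | grep -i content-encoding

You should see a content-encoding: br or content-encoding: gzip header, depending on what the client advertises.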

In order to let browsers cache static files, so that they don’t have to redownload all images if a user comes back, we set some expires headers. First make sure the expires module is enabled:

# a2enmod expires

Then in /etc/apache2/mods-enabled/expires.conf set this:

<IfModule mod_expires.c>
        ExpiresActive On
        ExpiresDefault A120
        ExpiresByType image/x-icon A604800
        ExpiresByType application/x-javascript A604800
        ExpiresByType application/javascript A604800
        ExpiresByType text/css A604800
        ExpiresByType image/gif A604800
        ExpiresByType image/png A604800
        ExpiresByType image/jpeg A604800
        ExpiresByType image/webp A604800
        ExpiresByType image/avif A604800
        ExpiresByType image/x-ms-bmp A604800
        ExpiresByType image/svg+xml A604800
        ExpiresByType image/vnd.microsoft.icon A604800
        ExpiresByType text/plain A604800
        ExpiresByType application/x-shockwave-flash A604800
        ExpiresByType video/x-flv A604800
        ExpiresByType video/mp4 A604800
        ExpiresByType application/pdf A604800
        ExpiresByType application/font-woff A604800
        ExpiresByType font/woff A604800
        ExpiresByType font/woff2 A604800
        ExpiresByType application/vnd.ms-fontobject A604800
        ExpiresByType text/html A120
</IfModule>

This allows the mentioned MIME types to be cached for a week (A604800 means 604800 seconds after the file was accessed), while HTML and other files are cached for up to 2 minutes (A120).
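
You can check the caching headers on a static file in the same way as for compression; the image path below is only an example:

# curl -s -o /dev/null -D - https://example.com/wp-content/uploads/logo.png | grep -iE "cache-control|expires"

For an image this should show a max-age of 604800 seconds, for an HTML page 120 seconds.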

Modsecurity to block web attacks

Configure Modsecurity with the OWASP CoreRuleSet so you don’t waste resources on web attacks. Also block old HTTP versions: this will already block some stupid security scanners and bots.

mod_reqtimeout to mitigate Slowloris attacks

The Slowloris attack is an easy attack in which the attacker opens many connections to your web server, but intentionally completes its requests very slowly. Your server’s worker threads stay busy waiting for the requests to complete, which never happens. This way, one client can occupy all worker threads of the web server without using a lot of bandwidth.

Debian enables protection against this in Apache by default by means of mod_reqtimeout. The configuration can be found in /etc/apache2/mods-enabled/reqtimeout.conf:

RequestReadTimeout header=20-40,minrate=500
RequestReadTimeout body=10,minrate=500

The first line will limit the wait time until the first byte of the request line is received to 20 seconds, after that it will require at least 500 bytes per second, with a total limit of 40 seconds until all request headers are received.

The second line limits the wait time until the first byte of the request body is received to 10 seconds, after which 500 bytes per second are required.

If a connection is slower than the above requirements, it will be closed by the server.
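
If you want to verify this protection, you can simulate a Slowloris attack against your own server with the slowhttptest tool (package slowhttptest in Debian). A sketch, with example.com as a placeholder; -H selects slow-headers (Slowloris) mode, -c the number of connections, -r the connection rate per second and -i the interval in seconds between follow-up header data:

# apt install slowhttptest
# slowhttptest -H -c 500 -r 100 -i 10 -u https://example.com/

With mod_reqtimeout active, these slow connections should be closed by the server instead of piling up. Only run this against servers you own.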

mod_qos to limit requests on your server

First an important caveat: the libapache2-mod-qos package included in Debian 12 Bookworm is broken. If you install this one, your Apache web server will crash at startup. You can find a fixed libapache2-mod-qos package for Debian 12 in my bookworm-frehi repository.

Enable the module with

# a2enmod qos

Then we set some configuration options in /etc/apache2/mods-enabled/qos.conf:

# allows keep-alive support till the server reaches 80% of all MaxRequestWorkers
QS_SrvMaxConnClose 80%

# don't allow a single client to open more than 10 TCP connections if
# the server has more than 600 connections:
QS_SrvMaxConnPerIP 10 600

# minimum data rate (bytes/sec) when the server
# has 100 or more open TCP connections:
QS_SrvMinDataRate 1000 32000 100

# disables connection restrictions for certain clients:
QS_SrvMaxConnExcludeIP 192.0.2.
QS_SrvMaxConnExcludeIP 2001:db8::2

QS_ClientEntries 50000

We disable keepalive as soon as we reach 80% of MaxRequestWorkers, so that threads are no longer occupied by keepalive connections. You can also use an absolute number instead of a percentage if you prefer. As soon as we reach 600 simultaneous connections, we only allow 10 connections per IP address. As soon as there are more than 100 connections, we require a minimum data rate of 1000 bytes per second; the required transfer rate increases linearly up to 32000 bytes per second as the number of connections approaches MaxRequestWorkers. The QS_ClientEntries setting defines how many different IP addresses mod_qos keeps track of. By default it’s 50000. On very busy servers you will need to increase this, but keep in mind that this also increases memory usage. Use QS_SrvMaxConnExcludeIP to exclude certain IP addresses from these limitations.

Then we want to limit the number of requests a client can make, focusing on the resource-intensive ones. Very often attackers will abuse the search form, because this stresses not only your web server, but also your database and your web application itself. If one of them, or the combination of the three, can’t cope with the requests, your application will be knocked offline. Other forms, such as contact forms or authentication forms, are also often targeted.

First I add this to the VirtualHost I want to protect:

RewriteEngine on
RewriteCond "%{REQUEST_URI}" "^/$"
RewriteCond "%{QUERY_STRING}" "s="
RewriteRule ^ - [E=LimitWPSearch]

RewriteCond "%{REQUEST_URI}" "^/wp-login.php$"
RewriteCond "%{REQUEST_METHOD}" "POST"
RewriteRule ^ - [E=LimitWPLogin]

RewriteCond "%{REQUEST_URI}" "^/wp-json/"
RewriteRule ^ - [E=LimitWPJson]

RewriteCond "%{REQUEST_URI}" "^/wp-comments-post.php"
RewriteRule ^ - [E=LimitWPComment]

RewriteCond "%{REQUEST_URI}" "^/wp-json/contact-form-7/"
RewriteCond "%{REQUEST_METHOD}" "POST"
RewriteRule ^ - [E=LimitWPFormPost]

RewriteCond "%{REQUEST_URI}" "^/xmlrpc.php"
RewriteRule ^ - [E=LimitWPXmlrpc]

RewriteCond "%{REQUEST_METHOD}" "POST"
RewriteRule ^ - [E=LimitWPPost]

RewriteCond "%{REQUEST_URI}" ".*"
RewriteRule ^ - [E=LimitWPAll]

The website I want to protect runs on WordPress. I use mod_rewrite to set a variable when certain requests are made, for example LimitWPSearch when a search is done, LimitWPLogin when data is submitted to the login form, LimitWPJson when something in /wp-json/ is accessed, LimitWPComment when a comment is submitted, LimitWPXmlrpc when /xmlrpc.php is accessed, LimitWPFormPost when a form created with the Contact Form 7 plugin is posted, LimitWPPost when anything is sent via POST, and LimitWPAll on any request. A request can trigger multiple rules.

Then we go back to our global server configuration to define the exact values for these limits. You can again do this in /etc/apache2/mods-enabled/qos.conf or /etc/apache2/conf-enabled/qos.conf or something like that:

QS_ClientEventLimitCount 20 40 LimitWPPost
QS_ClientEventLimitCount 400 120 LimitWPAll
QS_ClientEventLimitCount 5 60 LimitWPSearch
QS_ClientEventLimitCount 50 600 LimitWPJson
QS_ClientEventLimitCount 10 3600 LimitWPLogin
QS_ClientEventLimitCount 3 60 LimitWPComment
QS_ClientEventLimitCount 6 60 LimitWPXmlrpc
QS_ClientEventLimitCount 3 60 LimitWPFormPost

For example, when a client hits the LimitWPSearch rule 5 times in 60 seconds, the server will return an HTTP error code. In practice, this means that a client can do up to 4 successful searches per minute. You will have to adapt these settings to your own web applications. If you get hit by an attack, you can easily adapt the values or add new limits as necessary.
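
A quick way to see such a limit in action, before running the bigger benchmarks described further below, is to fire a few search requests in a row and watch the returned status codes; example.com is a placeholder:

$ for i in $(seq 1 6); do curl -s -o /dev/null -w "%{http_code}\n" "https://example.com/?s=test"; done

The first requests should return 200, after which mod_qos starts returning an error status.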

Using Fail2ban to block attackers

Now I want to go further and block offenders of my QOS rules for a longer time in my firewall with Fail2ban.

Create /etc/fail2ban/filter.d/apache-mod_qos.conf:

[INCLUDES]

# Read common prefixes. If any customizations available -- read them from
# apache-common.local
before = apache-common.conf


[Definition]


failregex = mod_qos\(\d+\): access denied,.+\sc=<HOST>

ignoreregex = 

Then create /etc/fail2ban/jail.d/apache-mod_qos.conf:

[apache-mod_qos]

port         = http,https
backend      = pyinotify
journalmatch = 
logpath      = /var/log/apache2/*error.log
               /var/log/apache2/*/error.log
maxretry     = 3
enabled      = true

Make sure you have the package python3-pyinotify installed.
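
Before relying on the jail, you can check that the filter actually matches entries in your Apache error log, and after restarting Fail2ban verify that the jail is active:

# fail2ban-regex /var/log/apache2/error.log /etc/fail2ban/filter.d/apache-mod_qos.conf
# fail2ban-client status apache-mod_qos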

Note that a stateful firewall will by default only block new connections, so you might still see some violations over existing connections even after an IP has been banned by Fail2ban.

Optimizing your web applications

Databases (MariaDB, MySQL, PostgreSQL)

Ideally you should run your DBMS on a separate server. If you run MariaDB (or MySQL), you can use mysqltuner to get automatic performance tuning recommendations.

Generic configuration recommendations for PostgreSQL can be created on PgTune.

WordPress

If you are using WordPress, I recommend setting up the Autoptimize, WebP Express, Avif Express and WP Super Cache modules. Autoptimize can aggregate and minify your HTML, CSS and Javascript files, reducing bandwidth usage and reducing the number of requests needed to load your site. WebP Express and Avif Express will automatically convert your JPEG, GIF and PNG images to the more efficient WebP and AVIF formats, which again reduces bandwidth.

WP Super Cache can cache your pages, so that they don’t have to be dynamically generated for every request. I strongly recommend that in the settings of WP Super Cache, in the Advanced page, you choose the Expert cache delivery method. You will need to set up some rewrite rules in your .htaccess file. In this mode, Apache will directly serve the cached pages without any involvement of PHP. You can easily check this by stopping the php8.2-fpm service and visiting your website. The cached pages will just load.
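
For example, on Debian 12 with PHP-FPM 8.2 (adapt the service name to your PHP version; example.com is a placeholder for your own site):

# systemctl stop php8.2-fpm
# curl -s -o /dev/null -w "%{http_code}\n" https://example.com/
# systemctl start php8.2-fpm

If the Expert delivery method is set up correctly, cached pages still return 200 while PHP-FPM is stopped.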

Drupal

I’m not really a Drupal specialist, but at least make sure that caching and the bandwidth optimization options are enabled under Administration – Configuration – Performance.

Mediawiki

There are some performance tuning tips for Mediawiki in the documentation.

Testing

I like to use h2load to do benchmarking of web servers. It’s part of the nghttp2-client package in Debian:

# apt install nghttp2-client

Ideally, you run benchmarks before making any of the above changes and retest after each modification to see the effect. Run the tests from a host where it does not matter if it gets locked out. It can also be a good idea to stop Fail2ban, because it gets annoying if you are blocked by the firewall. Alternatively, you can run the benchmarks on the host itself, bypassing the firewall rules.

First we test our mod_qos rule which limits the number of searches we can do:

# h2load -n8 -c1 https://example.com/?s=test
starting benchmark...
spawning thread #0: 1 total client(s). 8 total requests
TLS Protocol: TLSv1.3
Cipher: TLS_AES_128_GCM_SHA256
Server Temp Key: X25519 253 bits
Application protocol: h2
progress: 12% done
progress: 25% done
progress: 37% done
progress: 50% done
progress: 62% done
progress: 75% done
progress: 87% done
progress: 100% done

finished in 277.41ms, 28.84 req/s, 544.07KB/s
requests: 8 total, 8 started, 8 done, 4 succeeded, 4 failed, 0 errored, 0 timeout
status codes: 4 2xx, 0 3xx, 4 4xx, 0 5xx
traffic: 150.93KB (154553) total, 636B (636) headers (space savings 83.13%), 150.03KB (153628) data
                     min         max         mean         sd        +/- sd
time for request:     2.97ms    210.18ms     32.38ms     71.98ms    87.50%
time for connect:    16.45ms     16.45ms     16.45ms         0us   100.00%
time to 1st byte:   226.50ms    226.50ms    226.50ms         0us   100.00%
req/s           :      28.98       28.98       28.98        0.00   100.00%

You can see that 4 of these requests succeeded (status code 2xx) and the other 4 failed (status code 4xx). This is exactly what we configured: starting from the 5th request within 60 seconds, searches are no longer allowed. In Apache’s error log we see:

[qos:error] [pid 1473291:tid 1473481] [remote 2001:db8::2] mod_qos(067): access denied, QS_ClientEventLimitCount rule: event=LimitWPSearch, max=5, current=5, age=0, c=2001:db8::2, id=Z4PiOQcCdoj4ljQCvAkxPwAAsD8

We can check the connection rate limit we configured in our firewall by opening more connections than the burst value allows:

# h2load -n10000 -c50 https://example.com/
starting benchmark...
spawning thread #0: 50 total client(s). 10000 total requests
TLS Protocol: TLSv1.3
Cipher: TLS_AES_128_GCM_SHA256
Server Temp Key: X25519 253 bits
Application protocol: h2
progress: 10% done
progress: 20% done
progress: 30% done
progress: 40% done
progress: 50% done
progress: 60% done
progress: 70% done
progress: 80% done
progress: 90% done
progress: 100% done

finished in 148.78s, 67.22 req/s, 695.57KB/s
requests: 10000 total, 10000 started, 10000 done, 798 succeeded, 9202 failed, 0 errored, 0 timeout
status codes: 798 2xx, 0 3xx, 9202 4xx, 0 5xx
traffic: 101.06MB (105968397) total, 181.31KB (185665) headers (space savings 95.80%), 100.66MB (105550608) data
                     min         max         mean         sd        +/- sd
time for request:    19.47ms    649.75ms     50.94ms     23.46ms    86.23%
time for connect:   127.26ms      69.74s      25.28s      26.08s    80.00%
time to 1st byte:   211.04ms      69.79s      25.34s      26.06s    80.00%
req/s           :       1.34       17.91        6.75        5.79    80.00%

As you can see, the “time for connect” went up to 69.74 seconds, because connections were being dropped due to the limit set in Foomuuri.

If logging of these connections is enabled in the firewall, we can see this:

http_limit IN=ens3 OUT= MAC=aa:aa:aa:aa:aa:aa:aa:aa:aa:aa:aa:aa:aa:aa SRC=2001:db8::2 DST=2001:db8::3 LEN=80 TC=0 HOPLIMIT=45 FLOWLBL=327007 PROTO=TCP SPT=32850 DPT=443 WINDOW=64800 RES=0x00 SYN URGP=0

Also, mod_qos kicks in, which results in only 798 successful requests, while the others were blocked.

Conclusion

DoS and DDoS attacks by hacktivist groups are very common and happen on a daily basis, often kicking websites of companies and public services offline. Unfortunately, there is not a lot of practical advice on the World Wide Web on how to mitigate these attacks. I hope this article provides some basic insight.

The proposed measures should protect well against a simple DoS attack. How well they protect against a DDoS attack is more difficult to judge. They will certainly make such attacks harder, but the larger the botnet executing the attack, the more difficult it becomes to mitigate it at the server level. If the attackers manage to saturate your bandwidth, there is nothing you can do on the server itself and you will need measures at the network level to prevent the traffic from hitting your server at all.

In any case, the proposed configuration should provide a basis for mitigating some attacks. You can easily adapt the rules to the attacks you are experiencing. I recommend implementing and testing this configuration before you experience a real attack.

Some various performance improvements for Debian 12 Bookworm

Here are some improvements I implemented on some of my Debian 12 Bookworm servers in order to improve performance.

zswap: use zsmalloc allocator with newer kernel

If your system has little memory, you might be using zswap already. When memory is getting full, the system will try to swap out less-used data to a compressed pool in memory instead of writing it immediately to a swap partition or swap file on slower storage. In Linux kernel 6.7 the zsmalloc allocator, which is superior to the other allocators (zbud and z3fold), became the default.

So first upgrade to a more recent kernel. You can get a recent kernel from bookworm-backports or Debian testing or unstable.

To enable zswap at boot you can create a file named /etc/default/grub.d/zswap.cfg which contains:

GRUB_CMDLINE_LINUX_DEFAULT="$GRUB_CMDLINE_LINUX_DEFAULT zswap.enabled=1 zswap.compressor=lz4 zswap.max_pool_percent=30"

If you want higher compression at the cost of more CPU time, you can replace lz4 by zstd.

You will need to add the compression module of your preference to your initramfs. So in the case of lz4, just add

lz4

to /etc/initramfs-tools/modules and then run

# update-initramfs -u

Finally execute

# update-grub

to update your Grub configuration so that these settings will become automatically active at the next boot.

I upgraded to Linux 6.11 on a VPS with 2 GB of RAM with these settings, and the system feels much snappier now.

To check the effectiveness of zswap, you can use the zswap-stats script.
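
You can also quickly verify that zswap is active and which compressor and allocator are in use by reading the module parameters:

# grep -r . /sys/module/zswap/parameters/

The enabled parameter should show Y, and compressor and zpool should match what you configured on the kernel command line.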

Update to systemd 254 or higher to improve behaviour under memory pressure

systemd 254 includes a change which makes journald and resolved flush their caches when the system is under memory pressure. This frees memory, reducing swapping. When this happens, you will see it in the logs:

systemd-journald[587721]: Under memory pressure, flushing caches.

You can find systemd 254 for Debian 12 in bookworm-backports.

Exclude cron from audit logging

On one of my systems, my audit logs were filling up rapidly. You can check whether this is happening for you by looking at the dates of the files in /var/log/audit/:

# ls -l /var/log/audit/

By default auditd writes files of up to 8 MB, after which it rotates the log. If the modification dates of these files are very close to each other, you might consider reducing the logging.

A possible cause for audit logs filling up is AppArmor messages; improve your AppArmor profiles to reduce warnings and errors. On one of my systems, the cause was the logging generated by the execution of cron jobs, especially because mailman3-web contains a cron job which is executed every single minute.

To prevent logging everything related to cron, create a file /etc/audit/rules.d/excludecron.rules:

-a exclude,always -F exe=/usr/sbin/cron

Then run

# augenrules --load

to load the new rules.
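
You can verify that the exclusion rule is loaded with:

# auditctl -l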

Enabling jumbo frames on your network

Jumbo frames are Ethernet frames with up to 9000 bytes of payload, in contrast to normal frames which carry up to 1500 bytes of payload. They are useful on fast (Gigabit Ethernet and faster) networks, because they reduce overhead. Not only does this result in higher throughput, it also reduces CPU usage.

To use jumbo frames, your whole network needs to support them. That means that your switch needs to support jumbo frames (it might need to be enabled by hand), and all connected hosts need to support jumbo frames too. Jumbo frames should also only be used on reliable networks, as the larger payload makes it more costly to resend a frame if packets get lost.

So first make sure your switch has jumbo frame support enabled. I’m using an HP ProCurve switch, and for this type of switch the procedure looks like this:

$ netcat 10.141.253.1 telnet
Username:
Password:
ProCurve Switch 2824# config
ProCurve Switch 2824(config)# show vlans
ProCurve Switch 2824(config)# vlan $VLAN_ID jumbo
ProCurve Switch 2824(config)# show vlans
ProCurve Switch 2824(config)# write memory

Now that your switch is configured properly, you can configure the hosts.

For hosts which have a static IP configured in /etc/network/interfaces you need to add the line

    mtu 9000

to the iface stanza of the interface on which you want to enable jumbo frames. This does not work for interfaces getting an IP via DHCP, because they will use the MTU value sent by the DHCP server.

To enable jumbo frames via DHCP, edit the /etc/dhcp/dhcpd.conf file on the DHCP server, and add this to the subnet stanza:

option interface-mtu 9000;

Now bring the network interface offline and online, and jumbo frames should be enabled. You can verify with the command

# ip addr show

which will show the mtu values for all network interfaces.
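
To verify that jumbo frames really work end to end, ping another jumbo-enabled host with a large packet that is not allowed to be fragmented (8972 bytes of ICMP payload plus 28 bytes of IP and ICMP headers equals 9000); the address 192.0.2.10 is a placeholder:

$ ping -M do -s 8972 -c 3 192.0.2.10

If the replies come back, jumbo frames work across the path; if not, you will typically get an error about the message being too long.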

FS-CACHE for NFS clients

FS-CACHE is a system which caches files from remote network mounts on the local disk. It is a facility that is very easy to set up and improves performance on NFS clients.

I strongly recommend a recent kernel if you want to use FS-CACHE though. I tried this with the 4.9-based Debian Stretch kernel a year ago, and it resulted in a kernel oops from time to time, so I had to disable it again. I’m currently using it again with a 4.19-based kernel, and I have not encountered any stability issues so far.

First of all, you will need a dedicated file system to store the cache on. I prefer XFS, because of its good performance and stability. Mount the file system on /var/cache/fscache.

Then install the cachefilesd package and edit the file /etc/default/cachefilesd so that it contains:

RUN=yes

Then edit the file /etc/cachefilesd.conf. It should look like this:

dir /var/cache/fscache
tag mycache
brun 10%
bcull 7%
bstop 3%
frun 10%
fcull 7%
fstop 3%

These numbers define when cache culling (making space in the cache by discarding less recently used files) happens: when the amount of available disk space or the number of available files drops below 7%, culling will start. Culling will stop when 10% is available again. If the available disk space or number of available files drops below 3%, no further cache allocation is done until more than 3% is available again. See also man cachefilesd.conf.

Start cachefilesd by running

# systemctl start cachefilesd

If it fails to start with these messages in the logs:

cachefilesd[1724]: About to bind cache
kernel: CacheFiles: Security denies permission to nominate security context: error -2
cachefilesd[1724]: CacheFiles bind failed: errno 2 (No such file or directory)
cachefilesd[1724]: Starting FilesCache daemon : cachefilesd failed!
systemd[1]: cachefilesd.service: Control process exited, code=exited status=1
systemd[1]: cachefilesd.service: Failed with result 'exit-code'.
systemd[1]: Failed to start LSB: CacheFiles daemon.

then you are hitting this bug. This happens when you are using a kernel with AppArmor enabled (like Debian’s kernel from testing). You can work around it by commenting out the line defining the security context in /etc/cachefilesd.conf:

#secctx system_u:system_r:cachefiles_kernel_t:s0

and starting cachefilesd again.

Now in /etc/fstab add the fsc option to all NFS file systems where you want to enable caching for. For example for NFS4 your fstab line might look like this:

nfs.server:/home	/home	nfs4	_netdev,fsc,noatime,vers=4.2,nodev,nosuid	0	0

Now remount the file system. Just using the remount option is probably not enough: you have to completely umount and mount the NFS file system. Check with the mount command whether the fsc option is present. You can also run

# cat /proc/fs/nfsfs/volumes 

and check whether the FSC column is set to YES.

Try copying some large files from your NFS mount to the local disk. Then you can check the statistics by running

# cat /proc/fs/fscache/stats

You should see that not all values are 0 any more. You will also see with

$ df -h /var/cache/fscache

that your file system is being filled with cached data.

Debian Stretch on AMD EPYC (ZEN) with an NVIDIA GPU for HPC

Recently at work we bought a new Dell PowerEdge R7425 server for our HPC cluster. These are some of the specifications:

  • 2 AMD EPYC 7351 16-Core Processors
  • 128 GB RAM (16 DIMMs of 8 GB)
  • Tesla V100 GPU
Dell Poweredge R7425 front with cover
Dell Poweredge R7425 front without cover
Dell Poweredge R7425 inside

Our FAI configuration automatically installed Debian stretch on it without any problem. All hardware was recognized and working. The installation of the basic operating system took less than 20 minutes. FAI also sets up Puppet on the machine. After booting the system, Puppet continues setting up the system: installing all needed software, setting up the Slurm daemon (part of the job scheduler), mounting the NFS4 shared directories, etc. Everything together, the system was automatically installed and configured in less than 1 hour.

Linux kernel upgrade

Even though the Linux 4.9 kernel of Debian Stretch works with the hardware, there are still some reasons to update to a newer kernel. Only in more recent kernel versions is the k10temp kernel module capable of reading the CPU temperature sensors. We also had problems with fscache (used for NFS4 caching) with the 4.9 kernel in the past, which are fixed in a newer kernel. Furthermore, there have been many other performance optimizations which could be interesting for HPC.

You can find a more recent kernel in Debian’s Backports repository. At the time of writing it is a 4.18 based kernel. However, I decided to build my own 4.19 based kernel.

In order to build a Debian kernel package, you will need to have the package kernel-package installed. Download the sources of the Linux kernel you want to build, and configure it (using make menuconfig or any method you prefer). Then build your kernel using this command:

$ make -j 32 deb-pkg

Replace 32 by the number of parallel jobs you want to run; the number of CPU cores you have is a good amount. You can also add the LOCALVERSION and KDEB_PKGVERSION variables to set a custom name and version number. See the Debian handbook for a more complete howto. When the build has finished successfully, you can install the linux-image and linux-headers packages using dpkg.
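
For example, to give the packages a recognizable name and version (the -epyc suffix and the version string here are just examples):

$ make -j 32 deb-pkg LOCALVERSION=-epyc KDEB_PKGVERSION=4.19.0-1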

We mentioned temperature monitoring support with the k10temp driver in newer kernels. If you want to check the temperatures of all NUMA nodes on your CPUs, use this command:

$ cat /sys/bus/pci/drivers/k10temp/*/hwmon/hwmon*/temp1_input

Divide the value by 1000 to get the temperature in degrees Celsius. Of course you can also use the sensors command of the lm-sensors package.

Kernel settings

VM dirty values

On systems with lots of RAM, you may encounter problems because the default values of vm.dirty_ratio and vm.dirty_background_ratio are too high. This can cause stalls when all dirty data in the cache is being flushed to disk or to the NFS server.

You can read more information and a solution in SuSE’s knowledge base. On Debian, you can create a file /etc/sysctl.d/dirty.conf with this content:

vm.dirty_bytes = 629145600
vm.dirty_background_bytes = 314572800

Run

# sysctl -p /etc/sysctl.d/dirty.conf

to make the settings take effect immediately.

Kernel parameters

In /etc/default/grub, I added the options

transparent_hugepage=always cgroup_enable=memory

to the GRUB_CMDLINE_LINUX variable.

Transparent hugepages can improve performance in some cases. However, they can have a very negative impact on some specific workloads too. Applications like Redis, MongoDB and Oracle DB recommend not enabling transparent hugepages. So make sure that it’s worthwhile for your workload before adding this option.

Memory cgroups are used by Slurm to prevent jobs using more memory than what they reserved. Run

# update-grub

to make sure the changes will take effect at the next boot.

I/O scheduler

If you’re using a configuration based on a recent Debian kernel configuration, you will likely be using the multi-queue block I/O queueing mechanism with mq-deadline as the default scheduler. This is great for SSDs (especially NVMe-based ones), but might not be ideal for rotational hard drives. You can use the BFQ scheduler as an alternative on such drives. Be sure to test this properly though, because with Linux 4.19 I experienced some stability problems which seemed to be related to BFQ. I’ll be reviewing this scheduler again for 4.20 or 4.21.

First, if BFQ is built as a module (which is the case in Debian’s kernel config), you will need to load it. Create a file /etc/modules-load.d/bfq.conf with contents

bfq

Then to use this scheduler by default for all rotational drives, create the file /etc/udev/rules.d/60-io-scheduler.rules with these contents:

# set scheduler for non-rotating disks
ACTION=="add|change", KERNEL=="sd[a-z]|mmcblk[0-9]*|nvme[0-9]*", ATTR{queue/rotational}=="0", ATTR{queue/scheduler}="mq-deadline"
# set scheduler for rotating disks
ACTION=="add|change", KERNEL=="sd[a-z]", ATTR{queue/rotational}=="1", ATTR{queue/scheduler}="bfq"

Run

# update-initramfs -u

to rebuild the initramfs so it includes this udev rule and it will be loaded at the next boot sequence.
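
After rebooting, you can check which scheduler is active for a disk (the active one is shown between square brackets); sda is just an example device:

$ cat /sys/block/sda/queue/scheduler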

The Arch wiki has more information on I/O schedulers.

CPU Microcode update

We want the latest microcode for the CPU to be loaded. This is needed to mitigate the Spectre vulnerabilities. Install the amd64-microcode package from stretch-backports.

Packages with AMD ZEN support

hwloc

hwloc is a utility which reads out the NUMA topology of your system. It is used by the Slurm workload manager to bind tasks to certain cores.

The version of hwloc in Stretch (1.11.5) does not have support for the AMD ZEN architecture. However, hwloc 1.11.12 is available in stretch-backports, and this version does have AMD ZEN support. So make sure you have the packages hwloc libhwloc5 libhwloc-dev libhwloc-plugins installed from stretch-backports.

BLAS libraries

There is no BLAS library in Debian Stretch which supports the AMD ZEN architecture. Unfortunately, at the moment of writing there is also no good BLAS implementation for ZEN available in stretch-backports. This will likely change in the near future though, as BLIS has now entered Debian Unstable and will likely be backported to the stretch-backports repository too.

NVIDIA drivers and CUDA libraries

NVIDIA drivers

I installed the NVIDIA drivers and CUDA libraries from the tarballs downloaded from the NVIDIA website, because at the time of writing all packages available in the Debian repositories were outdated.

First make sure you have the linux-headers package installed which corresponds with the linux-image kernel package you are running. We will be using DKMS to rebuild the driver automatically whenever we install a new kernel, so also make sure you have the dkms package installed.

Download the NVIDIA driver for your GPU from the NVIDIA website. Remove the nouveau driver with the

# rmmod nouveau

command. Then create a file /etc/modprobe.d/blacklist-nouveau.conf with this content:

blacklist nouveau

and rebuild the initramfs by running

# update-initramfs -u

This will ensure the nouveau module will not be loaded automatically.

Now install the driver, by using a command similar to this:

# NVIDIA-Linux-x86_64-410.79.run -s --dkms

This will do a silent installation, integrating the driver with DKMS so it will get built automatically every time you install a new linux-image together with its corresponding linux-headers package.

To make sure that the necessary device files in /dev exist after rebooting the system, I put this script in /usr/local/sbin/load-nvidia:

#!/bin/sh
/sbin/modprobe nvidia
if [ "$?" -eq 0 ]; then
  # Count the number of NVIDIA controllers found.
  NVDEVS=`lspci | grep -i NVIDIA`
  N3D=`echo "$NVDEVS" | grep "3D controller" | wc -l`
  NVGA=`echo "$NVDEVS" | grep "VGA compatible controller" | wc -l`
  N=`expr $N3D + $NVGA - 1`
  for i in `seq 0 $N`; do
    mknod -m 666 /dev/nvidia$i c 195 $i
  done
  mknod -m 666 /dev/nvidiactl c 195 255
else
  exit 1
fi
/sbin/modprobe nvidia-uvm
if [ "$?" -eq 0 ]; then
  # Find out the major device number used by the nvidia-uvm driver
  D=`grep nvidia-uvm /proc/devices | awk '{print $1}'`
  mknod -m 666 /dev/nvidia-uvm c $D 0
else
  exit 1
fi

In order to start this script at boot up, I created a systemd service. Create the file /etc/systemd/system/load-nvidia.service with this content:

[Unit]
Description=Load NVidia driver and create device nodes in /dev
Before=slurmd.service

[Service]
ExecStart=/usr/local/sbin/load-nvidia
Type=oneshot
RemainAfterExit=true

[Install]
WantedBy=multi-user.target

Now run these commands to enable the service:

# systemctl daemon-reload
# systemctl enable load-nvidia.service

You can verify whether everything is working by running the command

$ nvidia-smi

CUDA libraries

Download the CUDA libraries. For Debian, choose Linux x86_64 Ubuntu 18.04 runfile (local).

Then install the libraries with this command:

# cuda_10.0.130_410.48_linux --silent --override --toolkit

This will install the toolkit in silent mode. With the override option it is possible to install the toolkit on systems which don’t have an NVIDIA GPU, which might be useful for compilation purposes.

To make sure your users have the necessary binaries and libraries available in their path, create the file /etc/profile.d/cuda.sh with this content:

export CUDA_HOME=/usr/local/cuda
export PATH=$CUDA_HOME/bin:$PATH
export LD_LIBRARY_PATH=$CUDA_HOME/lib64:$LD_LIBRARY_PATH


Linux performance improvements

Two years ago I wrote an article presenting some Linux performance improvements. Those improvements are still valid, but it is time to talk about some new ones. As I am using Debian now, I will focus on that distribution, but you should be able to easily implement these things on other distributions too. Some of these improvements are best suited for desktop systems, others for server systems, and some are useful for both.

Improving Mediawiki performance

While I am on the subject of performance, here are some improvements I configured for a Mediawiki installation:

  • Make sure you run the latest Mediawiki version. Mediawiki 1.16 introduced a new localisation caching system which is supposed to improve performance, so you definitely want at least that version.
  • Create a directory where Mediawiki can store the localisation cache (make sure it is writable by your web server). By preference store it on a tmpfs (at least if you are sure it will be big enough to store the cache), and configure it in LocalSettings.php:
    $wgCacheDirectory = "/tmp/mediawiki";
    If /tmp is on a tmpfs, you might add creation of this directory with the right permissions to /etc/rc.local, so that it still exists after a reboot.
  • Enable file caching in Mediawiki’s LocalSettings.php:
    $wgFileCacheDirectory = "{$wgCacheDirectory}/html";
    $wgUseFileCache = true;
    $wgShowIPinHeader = false;
    $wgUseGzip = true;
  • Make sure you have installed some PHP accelerator for caching. I have APC installed and configured it in Mediawiki’s LocalSettings.php:
    $wgMainCacheType = CACHE_ACCEL;

Here is a benchmark before implementing the above configuration (with CACHE_NONE, but APC still installed):

$ ab -kt 30 http://site/wiki/index.php/Page
This is ApacheBench, Version 2.3 < $Revision: 655654 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/

Benchmarking site (be patient)
Finished 255 requests

Server Software: Apache/2.2.16
Server Hostname: site
Server Port: 80

Document Path: /wiki/index.php/Page
Document Length: 12750 bytes

Concurrency Level: 1
Time taken for tests: 30.084 seconds
Complete requests: 255
Failed requests: 0
Write errors: 0
Keep-Alive requests: 0
Total transferred: 3344070 bytes
HTML transferred: 3251250 bytes
Requests per second: 8.48 [#/sec] (mean)
Time per request: 117.978 [ms] (mean)
Time per request: 117.978 [ms] (mean, across all concurrent requests)
Transfer rate: 108.55 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        3    6   2.8      7      21
Processing:    88  112  11.1    112     163
Waiting:       66   90   9.1     89     125
Total:         95  118  11.9    118     170

Percentage of the requests served within a certain time (ms)
50% 118
66% 122
75% 125
80% 127
90% 132
95% 138
98% 145
99% 156
100% 170 (longest request)

And here a benchmark after implementing the changes:

$ ab -kt 30 http://site/wiki/index.php/Page
This is ApacheBench, Version 2.3 < $Revision: 655654 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/

Benchmarking site (be patient)
Finished 649 requests

Server Software: Apache/2.2.16
Server Hostname: site
Server Port: 80

Document Path: /wiki/index.php/Page
Document Length: 12792 bytes

Concurrency Level: 1
Time taken for tests: 30.015 seconds
Complete requests: 649
Failed requests: 0
Write errors: 0
Keep-Alive requests: 0
Total transferred: 8538244 bytes
HTML transferred: 8302008 bytes
Requests per second: 21.62 [#/sec] (mean)
Time per request: 46.248 [ms] (mean)
Time per request: 46.248 [ms] (mean, across all concurrent requests)
Transfer rate: 277.80 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        3    9   3.7      8      29
Processing:    23   37   6.0     37      62
Waiting:       13   23   4.9     24      41
Total:         28   46   7.8     45      82

Percentage of the requests served within a certain time (ms)
50% 45
66% 47
75% 49
80% 50
90% 56
95% 62
98% 68
99% 73
100% 82 (longest request)

So Mediawiki can now deal with more than 2.5 times as many requests.

Some people use Apache’s mod_disk_cache to cache Mediawiki pages, but I prefer Mediawiki’s own caching system because it is more standard and does not require patching Mediawiki, even if it might not get as much benefit as a real proxy or mod_disk_cache.

Improving performance by using tmpfs

Today I analyzed disk reads and writes on a server with iotop and strace and found some interesting possible optimizations.

With iotop you can check which processes are reading and writing from the disks. I always press the o, p and a keys in iotop so that it only shows me processes doing I/O and so that it will show accumulated I/O instead of the bandwidth. With the left and right arrows I select on which columns to sort the list.

Once you have identified the processes which are doing a lot of I/O, you can check what they are reading or writing with strace, for example
# strace  -f -p $PID  -e trace=open,read,write

(you can leave out read and/or write if this gives too much noise)

This way I identified some locations where processes do lots of read and write operations on temporary files.

For nagios I placed /var/lib/nagios3/spool and /var/cache/nagios3 on a tmpfs, for Amavis /var/lib/amavis/tmp and for PostgreSQL /var/lib/postgresql/8.4/main/pg_stat_tmp.

Other candidates you might want to consider: /tmp, /var/tmp and /var/lib/php5. There are probably many others, depending on which services you use.
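
To put such a directory on tmpfs, add a line like this to /etc/fstab (the path and size are examples):

tmpfs  /var/lib/amavis/tmp  tmpfs  defaults,noatime,size=512m  0  0

After mounting, restore the ownership and permissions the service expects, for example with chown amavis:amavis /var/lib/amavis/tmp, because a freshly mounted tmpfs is owned by root.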

Speeding up my Linux system

My Mandriva 2009.1 system at home had become a bit slow lately, so I decided to make some attempts to speed it up again. This is not the most powerful system anymore (Asus A8N-SLI nForce4 motherboard, Athlon 64 3500+, 3 GB RAM, 250 GB SATA-1 disk, NVidia 6600 GT graphics card), but it sometimes felt very slow because of lots of disk activity, especially during start-up. I succeeded in improving the performance noticeably: the disk activity now stops much earlier after log-in and after starting Evolution.

I made several changes at once and did not always measure the impact of each individual change, so your mileage may vary.

  • I updated from Mandriva 2009.1 to Mandriva Cooker. Actually I don’t know if this has had any direct effect on the performance. However, it’s a pre-requisite or a recommendation for some of the later changes (GNote and ext4).
  • I removed several of the GNOME panel applets, which probably helps in reducing GNOME start-up time. I removed the system monitor applet, one of the weather applets, and Deskbar.
  • I removed Tomboy (which was also active as an applet in my GNOME panel) and installed GNote. GNote looks exactly the same as Tomboy and transparently replaces it (it will immediately start showing your Tomboy notes), but it’s written in C++. The fact that now the Mono .NET runtime environment does not need to be started during GNOME start-up, might have improved the GNOME log-in performance.
  • I cleaned up my mailboxes a bit by removing old mails I don’t need anymore. After that, I manually vacuumed the sqlite database used by Evolution. To do so, close Evolution, and run the following commands in the shell (you will need to have the package sqlite3-tools installed):
    $ evolution --force-shutdown
    $ for i in $(find ~/.evolution/mail -name folders.db); do echo "VACUUM;" | sqlite3 $i; done

    This reduced the size of the folders.db for my main IMAP account from more than 300 MB to about 150 MB! After this operation much less disk activity happened while starting up Evolution and the system remained much more responsive. It seems I’m not the only one who was suffering from this problem. This is a serious regression since Evolution switched from berkeleydb to sqlite. Apart from this problem, Evolution’s IMAP implementation is currently also very slow if you have big folders, and no work seems to be done on that… I have the feeling Mutt‘s motto is correct: all mail clients suck, this one just sucks less. Still, I prefer a GUI mail client.
  • I removed Beagle from my system. All in all I didn’t use it very often, and it looks like Tracker might become much more interesting in the future.
  • I switched from Firefox 3.0 to Firefox 3.5, which is also a bit faster. Packages are available in cooker’s main/testing repository, or you can just download a build from mozilla.org. A long time ago I experienced slowdowns in Firefox, which I fixed at that time by disabling reporting of attack sites and web forgeries in Firefox’ preferences – Security. It’s better to not disable this if Firefox is working nicely for you.
  • I switched from ext3 to ext4 for my / and /usr partitions. You can just switch from ext3 to ext4 by replacing ext3 with ext4 in /etc/fstab. However, then you won’t take advantage of all the new features. To do so, switch to runlevel 1 (init 1 in the console) and umount the partition you want to migrate (if you want to migrate /, you can mount it read-only by running mount -o remount,ro /). Then run these commands on the device:
    # tune2fs -O extents,uninit_bg,dir_index /dev/device
    # fsck -pDf /dev/device

    Then reboot your system.
    Don’t migrate your /boot partition, or your / partition if you don’t have a separate /boot partition, because this might lead to an unbootable system: I’m not sure whether grub in Mandriva has complete ext4 support.
    I would also recommend running an up to date Linux kernel, because ext4 has undergone many improvements lately. Cooker’s current kernel 2.6.30.1 is working nicely for me.
    For more ext4 information, I recommend reading the Linux kernel newbies ext4 page.
  • My /home partition is using XFS. If you are using XFS, you can run xfs_fsr to defragment files.

After all these changes, my system feels much snappier now than one month ago.

Fix bad performance with NVidia 177.80 drivers

Since I upgraded to NVidia’s beta driver series, which was supposed to improve performance for KDE 4 (including the now stable version 177.80), my GNOME desktop on my system with a GeForce 6600 GT graphics card has felt a lot slower. It was most noticeable when browsing the web with Firefox. When quickly scrolling a web page with my mouse’s scroll wheel, X started eating 100% of CPU time and the image on the screen started lagging behind a lot. Also just rendering a page seemed to be much slower. Disabling smooth scrolling in Firefox did not help at all.

Searching on the web, I found out that I’m not the only one with this problem. However, setting the InitialPixmapPlacement to 0 made the Compiz/Emerald window manager crash. I found out that setting InitialPixmapPlacement to 1 also seemed to fix the problem, without compiz/emerald crashing.

So if you also suffer from bad performance in GNOME with the proprietary NVidia drivers, create a script called fix-broken-nvidia.sh in /usr/local/bin with contents:


#!/bin/sh
/usr/bin/nvidia-settings -a InitialPixmapPlacement=1

Then go to GNOME’s System – Preferences menu and start Sessions. In the Startup Programs tab, click Add and choose /usr/local/bin/fix-broken-nvidia.sh as the command. Save the settings and restart X. Firefox now works a lot faster for me: web pages appear instantaneously and I can scroll without my CPU getting overloaded.

Thanks to NVidia for bringing me such great performance with their new drivers. Out of gratefulness, I’ll make sure my next graphics card is an Intel or ATI one.