Debian Stretch on AMD EPYC (ZEN) with an NVIDIA GPU for HPC

Recently at work we bought a new Dell PowerEdge R7425 server for our HPC cluster. These are some of the specifications:

  • 2 AMD EPYC 7351 16-Core Processors
  • 128 GB RAM (16 DIMMs of 8 GB)
  • Tesla V100 GPU
[Photos: Dell PowerEdge R7425 front with cover, front without cover, and inside]

Our FAI configuration automatically installed Debian Stretch on it without any problem. All hardware was recognized and working. The installation of the basic operating system took less than 20 minutes. FAI also sets up Puppet on the machine. After the first boot, Puppet continues configuring the system: installing all needed software, setting up the Slurm daemon (part of the job scheduler), mounting the NFS4 shared directories, etc. All in all, the system was automatically installed and configured in less than 1 hour.

Linux kernel upgrade

Even though the Linux 4.9 kernel of Debian Stretch works with this hardware, there are still some reasons to update to a newer kernel. Only in more recent kernel versions is the k10temp module capable of reading out the CPU temperature sensors of these processors. We also had problems with fscache (used for NFS4 caching) with the 4.9 kernel in the past, which are fixed in a newer kernel. Furthermore, there have been many other performance optimizations which could be interesting for HPC.

You can find a more recent kernel in Debian’s Backports repository. At the time of writing it is a 4.18 based kernel. However, I decided to build my own 4.19 based kernel.

In order to build a Debian kernel package, you will need to have the package kernel-package installed. Download the sources of the Linux kernel you want to build, and configure it (using make menuconfig or any method you prefer). Then build your kernel using this command:

$ make -j 32 deb-pkg

Replace 32 with the number of parallel jobs you want to run; the number of CPU cores you have is a good choice. You can also set the LOCALVERSION and KDEB_PKGVERSION variables to give the packages a custom name and version number. See the Debian handbook for a more complete howto. When the build finishes successfully, you can install the resulting linux-image and linux-headers packages using dpkg.
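
For example (a sketch only: the -hpc suffix and the resulting .deb file names are placeholders and will differ on your system), a build with a custom name followed by the installation could look like this:

$ make -j 32 LOCALVERSION=-hpc KDEB_PKGVERSION=$(make kernelversion)-1 deb-pkg
# dpkg -i ../linux-image-4.19.0-hpc_4.19.0-1_amd64.deb ../linux-headers-4.19.0-hpc_4.19.0-1_amd64.deb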

We mentioned temperature monitoring support with the k10temp driver in newer kernels. If you want to check the temperatures of all NUMA nodes on your CPUs, use this command:

$ cat /sys/bus/pci/drivers/k10temp/*/hwmon/hwmon*/temp1_input

Divide the value by 1000 to get the temperature in degrees Celsius. Of course you can also use the sensors command of the lm-sensors package.
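
If you prefer the values printed directly in degrees Celsius, a small shell sketch like this should do (it assumes the same sysfs paths as above):

$ for f in /sys/bus/pci/drivers/k10temp/*/hwmon/hwmon*/temp1_input; do
    awk '{ printf "%s: %.1f °C\n", FILENAME, $1 / 1000 }' "$f"
  done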

Kernel settings

VM dirty values

On systems with lots of RAM, you will encounter problems because the default values of vm.dirty_ratio and vm.dirty_background_ratio are too high. This can cause stalls when all dirty data in the cache is being flushed to disk or to the NFS server.

You can read more information and a solution in SuSE’s knowledge base. On Debian, you can create a file /etc/sysctl.d/dirty.conf with this content:

vm.dirty_bytes = 629145600
vm.dirty_background_bytes = 314572800

Run

# sysctl -p /etc/sysctl.d/dirty.conf

to make the settings take effect immediately.
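
To check that the new values are active, you can query them with:

$ sysctl vm.dirty_bytes vm.dirty_background_bytes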

Kernel parameters

In /etc/default/grub, I added the options

transparent_hugepage=always cgroup_enable=memory

to the GRUB_CMDLINE_LINUX variable.
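
The resulting line looks something like this (if the variable already contains other options on your system, keep them of course):

GRUB_CMDLINE_LINUX="transparent_hugepage=always cgroup_enable=memory"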

Transparent hugepages can improve performance in some cases, but they can also have a very negative impact on some specific workloads. Applications like Redis, MongoDB and Oracle DB recommend not enabling transparent hugepages by default. So make sure it is worthwhile for your workload before adding this option.

Memory cgroups are used by Slurm to prevent jobs from using more memory than they reserved. Run

# update-grub

to make sure the changes will take effect at the next boot.

I/O scheduler

If you’re using a configuration based on a recent Debian kernel configuration, you will likely be using the Multi-Queue Block IO Queueing Mechanism with mq-deadline as the default scheduler. This is great for SSDs (especially NVMe-based ones), but might not be ideal for rotational hard drives. You can use the BFQ scheduler as an alternative on such drives. Be sure to test this properly though, because with Linux 4.19 I experienced some stability problems which seemed to be related to BFQ. I’ll be reviewing this scheduler again for 4.20 or 4.21.

First, if BFQ is built as a module (which is the case in Debian’s kernel config), you will need to load the module. Create a file /etc/modules-load.d/bfq.conf with this content:

bfq
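
To load the module immediately without waiting for a reboot, you can simply run:

# modprobe bfq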

Then, to make BFQ the default scheduler for all rotational drives (and keep mq-deadline for the non-rotational ones), create the file /etc/udev/rules.d/60-io-scheduler.rules with these contents:

# set scheduler for non-rotating disks
ACTION=="add|change", KERNEL=="sd[a-z]|mmcblk[0-9]*|nvme[0-9]*", ATTR{queue/rotational}=="0", ATTR{queue/scheduler}="mq-deadline"
# set scheduler for rotating disks
ACTION=="add|change", KERNEL=="sd[a-z]", ATTR{queue/rotational}=="1", ATTR{queue/scheduler}="bfq"

Run

# update-initramfs -u

to rebuild the initramfs so it includes this udev rule and it will be loaded at the next boot sequence.
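
After a reboot (or after re-triggering the udev rules) you can check which scheduler is in use for a given drive; sda is just an example device name, and the active scheduler is shown between square brackets:

$ cat /sys/block/sda/queue/scheduler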

The Arch wiki has more information on I/O schedulers.

CPU Microcode update

We want the latest microcode for the CPU to be loaded. This is needed to mitigate the Spectre vulnerabilities. Install the amd64-microcode package from stretch-backports.
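
Assuming stretch-backports is already enabled in your APT sources, this is a one-liner:

# apt-get install -t stretch-backports amd64-microcode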

Packages with AMD ZEN support

hwloc

hwloc is a utility which reads out the NUMA topology of your system. It is used by the Slurm workload manager to bind tasks to certain cores.

The version of hwloc in Stretch (1.11.5) does not have support for the AMD ZEN architecture. However, hwloc 1.11.12 is available in stretch-backports, and this version does have AMD ZEN support. So make sure you have the packages hwloc libhwloc5 libhwloc-dev libhwloc-plugins installed from stretch-backports.
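
Again assuming stretch-backports is enabled in your APT sources, something like this pulls in the right versions:

# apt-get install -t stretch-backports hwloc libhwloc5 libhwloc-dev libhwloc-plugins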

BLAS libraries

There is no BLAS library in Debian Stretch which supports the AMD ZEN architecture. Unfortunately, at the moment of writing there is also no good BLAS implementation for ZEN available in stretch-backports. This will likely change in the near future though, as BLIS has now entered Debian Unstable and will likely be backported to stretch-backports as well.

NVIDIA drivers and CUDA libraries

NVIDIA drivers

I installed the NVIDIA drivers and CUDA libraries from the installers downloaded from the NVIDIA website, because at the time of writing all packages available in the Debian repositories are outdated.

First make sure you have the linux-headers package installed which corresponds with the linux-image kernel package you are running. We will be using DKMS to rebuild the driver automatically whenever we install a new kernel, so also make sure you have the dkms package installed.
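
For a stock Debian kernel that boils down to something like the command below; if you are running the self-built kernel from earlier, install the linux-headers .deb you built yourself instead:

# apt-get install dkms linux-headers-$(uname -r)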

Download the NVIDIA driver for your GPU from the NVIDIA website. Remove the nouveau driver with the

# rmmod nouveau

command, and create a file /etc/modprobe.d/blacklist-nouveau.conf with this content:

blacklist nouveau

and rebuild the initramfs by running

# update-initramfs -u

This will ensure the nouveau module will not be loaded automatically.

Now install the driver by using a command similar to this:

# NVIDIA-Linux-x86_64-410.79.run -s --dkms

This will do a silent installation, integrating the driver with DKMS so it will get built automatically every time you install a new linux-image together with its corresponding linux-headers package.

To make sure that the necessary device files in /dev exist after rebooting this system, I created the following script as /usr/local/sbin/load-nvidia:

#!/bin/sh
/sbin/modprobe nvidia
if [ "$?" -eq 0 ]; then
  # Count the number of NVIDIA controllers found.
  NVDEVS=$(lspci | grep -i NVIDIA)
  N3D=$(echo "$NVDEVS" | grep -c "3D controller")
  NVGA=$(echo "$NVDEVS" | grep -c "VGA compatible controller")
  N=$((N3D + NVGA - 1))
  for i in $(seq 0 $N); do
    mknod -m 666 /dev/nvidia$i c 195 $i
  done
  mknod -m 666 /dev/nvidiactl c 195 255
else
  exit 1
fi
/sbin/modprobe nvidia-uvm
if [ "$?" -eq 0 ]; then
  # Find out the major device number used by the nvidia-uvm driver
  D=$(grep nvidia-uvm /proc/devices | awk '{print $1}')
  mknod -m 666 /dev/nvidia-uvm c $D 0
else
  exit 1
fi

In order to start this script at boot up, I created a systemd service. Create the file /etc/systemd/system/load-nvidia.service with this content:

[Unit]
Description=Load NVIDIA driver and create device nodes in /dev
Before=slurmd.service

[Service]
ExecStart=/usr/local/sbin/load-nvidia
Type=oneshot
RemainAfterExit=true

[Install]
WantedBy=multi-user.target

Now run these commands to enable the service:

# systemctl daemon-reload
# systemctl enable load-nvidia.service

You can verify whether everything is working by running the command

$ nvidia-smi

CUDA libraries

Download the CUDA libraries. For Debian, choose Linux x86_64 Ubuntu 18.04 runfile (local).

Then install the libraries with this command:

# cuda_10.0.130_410.48_linux --silent --override --toolkit

This will install the toolkit in silent mode. With the override option it is possible to install the toolkit on systems which don’t have an NVIDIA GPU, which might be useful for compilation purposes.

To make sure your users have the necessary binaries and libraries available in their path, create the file /etc/profile.d/cuda.sh with this content:

export CUDA_HOME=/usr/local/cuda
export PATH=$CUDA_HOME/bin:$PATH
export LD_LIBRARY_PATH=$CUDA_HOME/lib64:$LD_LIBRARY_PATH
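
After logging in again (or sourcing the file by hand), a quick sanity check is to ask the toolkit for its version:

$ . /etc/profile.d/cuda.sh
$ nvcc --version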


Improving Mediawiki performance

While I am on the subject of improving performance, here are some improvements I configured for a Mediawiki installation:

  • Make sure you run the latest Mediawiki version. Mediawiki 1.16 introduced a new localisation caching system which is supposed to improve performance, so you definitely want at least that version to get the best performance.
  • Create a directory where Mediawiki can store the localisation cache (make sure it is writable by your web server). Preferably store it on a tmpfs (at least if you are sure it will be big enough to hold the cache), and configure it in LocalSettings.php:
    $wgCacheDirectory = "/tmp/mediawiki";
    If /tmp is on a tmpfs, you might add creation of this directory with the right permissions to /etc/rc.local, so that it still exists after a reboot (see the sketch after this list).
  • Enable file caching in Mediawiki’s LocalSettings.php:
    $wgFileCacheDirectory = "{$wgCacheDirectory}/html";
    $wgUseFileCache = true;
    $wgShowIPinHeader = false;
    $wgUseGzip = true;
  • Make sure you have installed some PHP accelerator for caching. I have APC installed and configured it in Mediawiki’s LocalSettings.php:
    $wgMainCacheType = CACHE_ACCEL;
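
As mentioned in the list above: if /tmp is on a tmpfs, a line like this in /etc/rc.local recreates the cache directory at boot (www-data is an assumption here, use whatever user your web server runs as):

install -d -o www-data -g www-data -m 750 /tmp/mediawiki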

Here is a benchmark before implementing the above configuration (with CACHE_NONE, but APC still installed):

$ ab -kt 30 http://site/wiki/index.php/Page
This is ApacheBench, Version 2.3 < $Revision: 655654 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/

Benchmarking site (be patient)
Finished 255 requests

Server Software: Apache/2.2.16
Server Hostname: site
Server Port: 80

Document Path: /wiki/index.php/Page
Document Length: 12750 bytes

Concurrency Level: 1
Time taken for tests: 30.084 seconds
Complete requests: 255
Failed requests: 0
Write errors: 0
Keep-Alive requests: 0
Total transferred: 3344070 bytes
HTML transferred: 3251250 bytes
Requests per second: 8.48 [#/sec] (mean)
Time per request: 117.978 [ms] (mean)
Time per request: 117.978 [ms] (mean, across all concurrent requests)
Transfer rate: 108.55 [Kbytes/sec] received

Connection Times (ms)
min mean[+/-sd] median max
Connect: 3 6 2.8 7 21
Processing: 88 112 11.1 112 163
Waiting: 66 90 9.1 89 125
Total: 95 118 11.9 118 170

Percentage of the requests served within a certain time (ms)
50% 118
66% 122
75% 125
80% 127
90% 132
95% 138
98% 145
99% 156
100% 170 (longest request)

And here a benchmark after implementing the changes:

ab -kt 30 http://site/wiki/index.php/Page
This is ApacheBench, Version 2.3 < $Revision: 655654 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/

Benchmarking site (be patient)
Finished 649 requests

Server Software: Apache/2.2.16
Server Hostname: site
Server Port: 80

Document Path: /wiki/index.php/Page
Document Length: 12792 bytes

Concurrency Level: 1
Time taken for tests: 30.015 seconds
Complete requests: 649
Failed requests: 0
Write errors: 0
Keep-Alive requests: 0
Total transferred: 8538244 bytes
HTML transferred: 8302008 bytes
Requests per second: 21.62 [#/sec] (mean)
Time per request: 46.248 [ms] (mean)
Time per request: 46.248 [ms] (mean, across all concurrent requests)
Transfer rate: 277.80 [Kbytes/sec] received

Connection Times (ms)
min mean[+/-sd] median max
Connect: 3 9 3.7 8 29
Processing: 23 37 6.0 37 62
Waiting: 13 23 4.9 24 41
Total: 28 46 7.8 45 82

Percentage of the requests served within a certain time (ms)
50% 45
66% 47
75% 49
80% 50
90% 56
95% 62
98% 68
99% 73
100% 82 (longest request)

So Mediawiki can now deal with more than 2.5 times as many requests.

Some people use Apache’s mod_disk_cache to cache Mediawiki pages, but I prefer Mediawiki’s own caching system because it is more standard and does not require patching Mediawiki, even if it might not get as much benefit as a real proxy or mod_disk_cache.

Improving performance by using tmpfs

Today I analyzed disk reads and writes on a server with iotop and strace and found some interesting possible optimizations.

With iotop you can check which processes are reading and writing from the disks. I always press the o, p and a keys in iotop so that it only shows me processes doing I/O and so that it will show accumulated I/O instead of the bandwidth. With the left and right arrows I select on which columns to sort the list.
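
The same view can also be requested directly with command-line flags (these options exist in the iotop versions I have used, but check iotop --help on your system):

# iotop -o -a -P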

Once you have identified the processes which are doing a lot of I/O, you can check what they are reading or writing with strace, for example:

# strace -f -p $PID -e trace=open,read,write

(you can leave out read and/or write if this gives too much noise)

This way I identified some locations where processes do lots of read and write operations on temporary files.

For nagios I placed /var/lib/nagios3/spool and /var/cache/nagios3 on a tmpfs, for Amavis /var/lib/amavis/tmp and for PostgreSQL /var/lib/postgresql/8.4/main/pg_stat_tmp.

Other candidates you might want to consider: /tmp, /var/tmp and /var/lib/php5. There are probably many others, depending on which services you use.
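
Putting such a directory on a tmpfs comes down to an /etc/fstab entry similar to the one below (the path, size and mode are only examples, adapt them to the service in question; if the service does not run as root you will also have to chown the directory after mounting, for instance from rc.local or the service's init script):

tmpfs  /var/lib/amavis/tmp  tmpfs  defaults,noatime,size=256m,mode=750  0  0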

DHCPd failover

Last week, I set up two dhcpd servers in a fail-over configuration. The goal is that when one DHCP server goes down, the other one takes over so that clients don’t lose their network connection. I read different tutorials on the web, such as this one by a fellow blogger and this documentation published by IBM.


Migrating from Microsoft SQL Server to PostgreSQL

One of the servers I manage at work is still running Microsoft Windows 2000. This system is hosting a few old, forgotten web sites and it runs Microsoft SQL Server containing a few databases still in use. This server was already there when I started working at the university. Fortunately I never had to do much work on it and there never was a serious crash: restoring this system would have been a serious headache for me, because I do not have much knowledge about Windows and SQL Server.

During some spare time at work I decided to take a look at migrating the MS SQL Server databases to PostgreSQL. All in all it was not very complicated. Fortunately, the databases themselves are not very large and have a rather simple structure. The first thing I did was recreate the structure of the database in PostgreSQL. With a tool like pgAdmin3 this is pretty straightforward. An alternative is to export the database structure in SQL format and adapt the resulting file so that it contains valid PostgreSQL table creation statements. The latter method is explained on the PostgreSQL wiki.

To migrate the data itself, I wrote a little Perl script. Actually I do not have much knowledge about Perl but it is a fairly straightforward language, so this did not prove to be too difficult. I connect to the MS SQL Server database using DBI, the Perl Database Interface, with the ODBC and FreeTDS drivers. The script gets all records from the tables you define, and inserts them in the PostgreSQL database using the PostgreSQL DBI driver.

These are the required packages on a Debian Squeeze system:

libdbd-odbc-perl
libdbd-pg-perl
tdsodbc
unixodbc

Define the FreeTDS ODBC driver in /etc/odbcinst.ini:

[FreeTDS]
Description	= TDS driver (Sybase/MS SQL)
Driver		= /usr/lib/odbc/libtdsodbc.so
Setup		= /usr/lib/odbc/libtdsS.so
UsageCount	= 2

Then define the database in odbc.ini:

[database]
Driver = FreeTDS
Trace = No
Server = hostname
Port = 1433
Database = databasename

The script assumes that all field names in the PostgreSQL database are using lowercase names and that the MS SQL database is using ISO8859-15 encoding and the PostgreSQL database is using UTF-8 encoding. For every table, I make a migrateTable call with 3 arguments: the table name, an array containing all field names and the name of the primary key field for which you created a sequence (named fieldname_seq).

Once the database itself was migrated, the applications using the database had to be modified to use PostgreSQL. This was not too difficult, because the applications were PHP scripts using ODBC. These additional packages were needed:

odbc-postgresql
php5-odbc

To configure the PostgreSQL ODBC driver, add this in /etc/odbcinst.ini:

[PostgreSQL Unicode]
Description	= PostgreSQL ODBC driver (Unicode version)
Driver		= /usr/lib/odbc/psqlodbcw.so
Setup		= /usr/lib/odbc/libodbcpsqlS.so
Debug		= 0
CommLog		= 0
UsageCount	= 0

Then define the database in odbc.ini:

[database]
Driver              = PostgreSQL Unicode
ServerName          = localhost
Database            = dbname
Username            = username
sslmode             = require

Then be sure not to call odbc_connect with the SQL_CUR_USE_ODBC option (which was actually needed with FreeTDS to fix some weird errors), because it causes segfaults.

The result? WISE’s publications pages are now using PostgreSQL as a back-end!

Migrating mail from KMail to Evolution

At work I am busy migrating some Linux desktop users from an old Slackware 12.0 system with KDE 3.5 to Debian Squeeze with GNOME 2.30. So I had to migrate the e-mails from KMail to Evolution, a task which was not as trivial as you would hope at first.

On these systems, the mails were saved in ~/Mail. This directory did not contain a standard maildir structure or mbox files, but some weird mix of the two. I do not know whether this is typical for KMail or a peculiarity of these systems, because the e-mails had already been migrated from other e-mail clients and versions in the past.

The first step was to make a standard maildir out of this structure. I created ~/Maildir and searched for all directories called “cur” in ~/Mail. All parents of these directories are standard maildir folders, so I copied them to ~/Maildir. One of them was called inbox. In maildir, the inbox is just the parent directory and not some folder called inbox, so I moved the cur, new and tmp subdirectories of ~/Maildir/inbox to ~/Maildir itself. Then I renamed all other subfolders in ~/Maildir so that they have a dot before the actual folder name, as this is how subfolders are named in maildir.
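
For reference, these manual steps roughly correspond to the sketch below. It is only a sketch: it assumes there are no duplicate folder names and no nested subfolders, so review it before running anything like it on real mail.

mkdir -p ~/Maildir
# copy every maildir folder (any directory that contains a "cur" subdirectory)
find ~/Mail -type d -name cur | while read -r d; do
  cp -a "$(dirname "$d")" ~/Maildir/
done
# the folder that was called "inbox" becomes the top level of the maildir
mv ~/Maildir/inbox/cur ~/Maildir/inbox/new ~/Maildir/inbox/tmp ~/Maildir/
rmdir ~/Maildir/inbox
# all other subfolders get a leading dot, as maildir expects
for dir in ~/Maildir/*/; do
  name=$(basename "$dir")
  case "$name" in cur|new|tmp) continue ;; esac
  mv ~/Maildir/"$name" ~/Maildir/."$name"
done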

Now I was still left with a few mbox mail folders. These are single files which contain a whole mailbox. In order to convert them, I used the mb2md script. I ran mb2md -s ~/Mail -R ~/Maildir/ to convert all remaining mbox folders to the maildir structure.

Then I had a nice maildir structure, but Evolution cannot directly import Maildir folders. However it should be possible to create a local Maildir account in Evolution, pointing to the ~/Maildir directory and then copying over all folders. I did not try this. Instead, I installed the Dovecot IMAP server and edited /etc/dovecot.conf. I set these options:

protocols = imap
mail_location = maildir:~/Maildir

and restarted dovecot. Then I configured an IMAP+ account on localhost in Evolution, and dragged and dropped all IMAP folders to the local folders.

KMail’s address book is stored in ~/.kde/share/apps/kabc. Double clicking on the most recent vcf file there, should be enough to open and import it in Evolution.

Improving battery life time in Linux

Today I received a new battery for my Dell Latitude E6400 laptop. The old battery (a 6-cell one) lasted only about 15 minutes, 1.5 years after I acquired this system. The new battery is a 9-cell version. gnome-power-manager estimates I should now be able to run 6 hours without a charge, which corresponds with what I expected.

Now that I have a new battery, it is time to take a look again at trying to lower power consumption as much as possible. On my system I am using Mandriva 2010.1 Cooker with GNOME 2.30 and laptop-mode-tools 1.54, but this howto should apply (sometimes with slight modifications) to other distributions too.


Server migration

Two days ago, I merged the main servers used by two research laboratories at work. One server was an old Linux server which really needed a hardware upgrade, and the other one was a Mac Pro machine running a flaky OS X Leopard. The new server is of course running Linux: Debian Lenny.

It was a very interesting experience: working out procedures to migrate the mailboxes (from Dovecot on the Linux server and Cyrus on the Mac server to Cyrus on the new server), finding out how to set up one NIC in two different subnets (especially the routing is a little bit tricky), getting all services hooked up to LDAP and managed by GOSA, getting dhcpd to do exactly what we want in a shared-network set up, and much more.

The new server is an HP DL185 G5 with an AMD Opteron quad-core CPU and 8 GB of RAM, and it hosts two KVM virtual machines, one for public services and another one running internal services. The websites of the two research labs concerned are of course also hosted on this machine.

Maybe in the not too distant future, I should also try to move the services hosted on the underpowered desktop machine running this website to a virtual machine…

Updating to Debian Lenny

Last weekend, Debian Lenny 5.0 was finally released. I use Debian on most servers I manage at work. A few of them were already using Lenny when it was still the testing branch, but most are still on Debian Etch. So this morning I decided to do a test upgrade of one of the less critical Etch systems to Lenny. That system is only used to store back-up files from other systems, so it would not be a problem if the machine were off line for a couple of hours.

According to the release notes, you should use aptitude rather than apt to upgrade, so that’s what I did. All went well, until suddenly the package upgrade hung while installing new udev configuration files. I could Ctrl-C the process to continue, but from that moment on, more and more post-installation scripts started hanging and had to be interrupted.

I noticed that also simple commands, such as ps and getent passwd were hanging too and that I could not log in via SSH anymore. Fortunately, the existing SSH connections continued to work, so I was not locked out yet.

I straced getent passwd and noticed that it hung while trying to connect to the remote LDAP server. The problem was apparently that Lenny’s libnss_ldap tried to connect via LDAPS to port 389, while LDAPS uses port 636 by default. It seems that you now need to specify port 636 explicitly to make LDAPS work right, for example: ldaps://remote.host:636. I fixed this in libnss-ldap.conf and pam_ldap.conf, and then I could finish the upgrade without any problem. Apparently this is a known problem.
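
Concretely, that means the uri line in /etc/libnss-ldap.conf and /etc/pam_ldap.conf ends up looking like this (remote.host is of course a placeholder):

uri ldaps://remote.host:636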

So definitely modify your configuration first if you are using LDAP authentication and want to upgrade to Lenny. I should probably also fix my nsswitch.conf so that applications don’t start to hang if the LDAP server is unreachable…

In spite of this problem, the whole upgrade was done in less than 1 hour. Without that problem, I guess it would have taken about 20 minutes less. Quite impressive!