Enabling jumbo frames on your network

Jumbo frames are Ethernet frames with up to 9000 bytes of payload, in contrast to normal frames which have up to 1500 bytes per payload. They are useful on fast (Gigabit Ethernet and faster) networks, because they reduce the overhead. Not only will it result in a higher throughput, it will also reduce CPU usage.

To use jumbo frames, you whole network needs to support it. That means that your switch needs to support jumbo frames (it might need to be enabled by hand), and also all connected hosts need to support jumbo frames. Jumbo frames should also only be used on reliable networks, as the higher payload will make it more costly to resend a frame if packets get lost.

So first you need to make sure your switch has jumbo frame support enabled. I’m using a HP Procurve switch and for this type of switch you can find the instructions here:

$ netcat 10.141.253.1 telnet
Username:
Password:
ProCurve Switch 2824# config
ProCurve Switch 2824(config)# show vlans
ProCurve Switch 2824(config)# vlan $VLAN_ID jumbo
ProCurve Switch 2824(config)# show vlans
ProCurve Switch 2824(config)# write memory

Now that your switch is configured properly, you can configure the hosts.

For hosts which have a static IP configured in /etc/network/interfaces you need to add the line

    mtu 9000

to the iface stanza of the interface on which you want to enable jumbo frames. This does not work for interfaces getting an IP via DHCP, because they will use the MTU value sent by the DHCP server.

To enable jumbo frames via DHCP, edit the /etc/dhcp/dhcpd.conf file on the DHCP server, and add this to the subnet stanza:

option interface-mtu 9000;

Now bring the network interface offline and online, and jumbo frames should be enabled. You can verify with the command

# ip addr show

which will show the mtu values for all network interfaces.

FS-CACHE for NFS clients

FS-CACHE is a system which caches files from remote network mounts on the local disk. It is a very easy to set up facility to improve performance on NFS clients.

I strongly recommend a recent kernel if you want to use FS-CACHE though. I tried this with the 4.9 based Debian Stretch kernel a year ago, and this resulted in a kernel oops from time to time, so I had to disable it again. I’m currently using it again with a 4.19 based kernel, and I did not encounter any stability issues up to now.

First of all, you will need a dedicated file system where you will store the cache on. I prefer to use XFS, because it has nice performance and stability. Mount the file system on /var/cache/fscache.

Then install the cachefilesd package and edit the file /etc/default/cachefilesd so that it contains:

RUN=yes

Then edit the file /etc/cachefilesd.conf. It should look like this:

dir /var/cache/fscache
tag mycache
brun 10%
bcull 7%
bstop 3%
frun 10%
fcull 7%
fstop 3%

These numbers define when cache culling (making space in the cache by discarding less recently used files) happens: when the amount of available disk space or the amount of available files drops below 7%, culling will start. Culling will stop when 10% is available again. If the available disk space or available amount of files drops below 3%, no further cache allocation is done until more than 3% is available again. See also man cachefilesd.conf.

Start cachefilesd by running

# systemctl start cachefilesd

If it fails to start with these messages in the logs:

cachefilesd[1724]: About to bind cache
kernel: CacheFiles: Security denies permission to nominate security context: error -2
cachefilesd[1724]: CacheFiles bind failed: errno 2 (No such file or directory)
cachefilesd[1724]: Starting FilesCache daemon : cachefilesd failed!
systemd[1]: cachefilesd.service: Control process exited, code=exited status=1
systemd[1]: cachefilesd.service: Failed with result 'exit-code'.
systemd[1]: Failed to start LSB: CacheFiles daemon.

then you are hitting this bug. This happens when you are using a kernel with AppArmor enabled (like Debian’s kernel from testing). You can work around it by commenting out the line defining the security context in /etc/cachefilesd.conf:

#secctx system_u:system_r:cachefiles_kernel_t:s0

and starting cachefilesd again.

Now in /etc/fstab add the fsc option to all NFS file systems where you want to enable caching for. For example for NFS4 your fstab line might look like this:

nfs.server:/home	/home	nfs4	_netdev,fsc,noatime,vers=4.2,nodev,nosuid	0	0

Now remount the file system. Just using the remount option is probably not enough: you have to completely umount and mount the NFS file system. Check with the mount command whether the fsc option is present. You can also run

# cat /proc/fs/nfsfs/volumes 

and check whether the FSC column is set to YES.

Try copying some large files from your NFS mount to the local disk. Then you can check the statistics by running

# cat /proc/fs/fscache/stats

You should sea that not all values are 0 any more. You will also see with

$ df -h /var/cache/fscache

your file system is being filled with cached data.

Debian Stretch on AMD EPYC (ZEN) with an NVIDIA GPU for HPC

Recently at work we bought a new Dell PowerEdge R7425 server for our HPC cluster. These are some of the specifications:

  • 2 AMD EPYC 7351 16-Core Processors
  • 128 GB RAM (16 DIMMs of 8 GB)
  • Tesla V100 GPU
Dell Poweredge R7425 front with cover
Dell Poweredge R7425 front without cover
Dell Poweredge R7425 inside

Our FAI configuration automatically installed Debian stretch on it without any problem. All hardware was recognized and working. The installation of the basic operating system took less than 20 minutes. FAI also sets up Puppet on the machine. After booting the system, Puppet continues setting up the system: installing all needed software, setting up the Slurm daemon (part of the job scheduler), mounting the NFS4 shared directories, etc. Everything together, the system was automatically installed and configured in less than 1 hour.

Linux kernel upgrade

Even though the Linux 4.9 kernel of Debian Stretch works with the hardware, there are still some reasons to update to a newer kernel. Only in more recent kernel versions, the k10temp kernel module is capable of reading out the CPU temperature sensors. We also had problems with fscache (used for NFS4 caching) with the 4.9 kernel in the past, which are fixed in a newer kernel. Furthermore there have been many other performance optimizations which could be interesting for HPC.

You can find a more recent kernel in Debian’s Backports repository. At the time of writing it is a 4.18 based kernel. However, I decided to build my own 4.19 based kernel.

In order to build a Debian kernel package, you will need to have the package kernel-package installed. Download the sources of the Linux kernel you want to build, and configure it (using make menuconfig or any method you prefer). Then build your kernel using this command:

$ make -j 32 deb-pkg

Replace 32 by the number of parallel jobs you want to run; the number of CPU cores you have is a good amount. You can also add the LOCALVERSION and KDEB_PKGVERSION variables to set a custom name and version number. See the Debian handbook for a more complete howto. When the build is finished successfully, you can install the linux-image and linux-headers package using dpkg.

We mentioned temperature monitoring support with the k10temp driver in newer kernels. If you want to check the temperatures of all NUMA nodes on your CPUs, use this command:

$ cat /sys/bus/pci/drivers/k10temp/*/hwmon/hwmon*/temp1_input

Divide the value by 1000 to get the temperature in degrees Celsius. Of course you can also use the sensors command of the lm-sensors package.

Kernel settings

VM dirty values

On systems with lots of RAM, you will encounter problems because the default values of vm.dirty_ratio and vm.dirty_background_ratio are too high. This can cause stalls when all dirty data in the cache is being flushed to disk or to the NFS server.

You can read more information and a solution in SuSE’s knowledge base. On Debian, you can create a file /etc/sysctl.d/dirty.conf with this content:

vm.dirty_bytes = 629145600
vm.dirty_background_bytes = 314572800

Run

# systctl -p /etc/sysctl.d/dirty.conf

to make the settings take effect immediately.

Kernel parameters

In /etc/default/grub, I added the options

transparant_hugepage=always cgroup_enable=memory

to the GRUB_CMDLINE_LINUX variable.

Transparant hugepages can improve performance in some cases. However, it can have a very negative impact on some specific workloads too. Applications like Redis, MongoDB and Oracle DB recommend not enabling transparant hugepages by default. So make sure that it’s worthwhile for your workload before adding this option.

Memory cgroups are used by Slurm to prevent jobs using more memory than what they reserved. Run

# update-grub

to make sure the changes will take effect at the next boot.

I/O scheduler

If you’re using a configuration based on recent Debian’s kernel configuration, you will likely be using the Multi-Queue Block IO Queueing Mechanism with the mq-deadline scheduler as default. This is great for SSDs (especially NVME based ones), but might not be ideal for rotational hard drives. You can use the BFQ scheduler as an alternative on such drives. Be sure to test this properly tough, because with Linux 4.19 I experienced some stability problems which seemed to be related to BFQ. I’ll be reviewing this scheduler again for 4.20 or 4.21.

First if BFQ is built as a module (wich is the case in Debian’s kernel config), you will need to load the module. Create a file /etc/modules-load.d/bfq.conf with contents

bfq

Then to use this scheduler by default for all rotational drives, create the file /etc/udev/rules.d/60-io-scheduler.rules with these contents:

# set scheduler for non-rotating disks
ACTION=="add|change", KERNEL=="sd[a-z]|mmcblk[0-9]*|nvme[0-9]*", ATTR{queue/rotational}=="0", ATTR{queue/scheduler}="mq-deadline"
# set scheduler for rotating disks
ACTION=="add|change", KERNEL=="sd[a-z]", ATTR{queue/rotational}=="1", ATTR{queue/scheduler}="bfq"

Run

# update-initramfs -u

to rebuild the initramfs so it includes this udev rule and it will be loaded at the next boot sequence.

The Arch wiki has more information on I/O schedulers.

CPU Microcode update

We want the latest microcode for the CPU to be loaded. This is needed to mitigate the Spectre vulnerabilities. Install the amd64-microcode package from stretch-backports.

Packages with AMD ZEN support

hwloc

hwloc is a utility which reads out the NUMA topology of your system. It is used by the Slurm workload manager to to bind tasks to certain cores.

The version of hwloc in Stretch (1.11.5) does not have support for the AMD ZEN architecture. However, hwloc 1.11.12 is available in stretch-backports, and this version does have AMD ZEN support. So make sure you have the packages hwloc libhwloc5 libhwloc-dev libhwloc-plugins installed from stretch-backports.

BLAS libraries

There is no BLAS library in Debian Stretch which supports AMD ZEN architecture. Unfortunately, at the moment of writing there is is also no good BLAS implementation for ZEN available in stretch-backports. This will likely change in the near future though, as BLIS has now entered Debian Unstable and will likely be backported too in the stretch-backports repository.

NVIDIA drivers and CUDA libraries

NVIDIA drivers

I installed the NVIDIA drivers an CUDA libraries from the tarballs downloaded from the NVIDIA website because at the time of writing all packages available in the Debian repositories are outdated.

First make sure you have the linux-headers package installed which corresponds with the linux-image kernel package you are running. We will be using DKMS to rebuild the driver automatically whenever we install a new kernel, so also make sure you have the dkms package installed.

Download the NVIDIA driver for your GPU from the NVIDIA website. Remove the nouveau driver with the

# rmmod nouveau

command. And create a file /etc/modules-load.d/blacklist-nouveau.conf with these contents:

blacklist nouveau

and rebuild the initramfs by running

# update-initramfs -u

This will ensure the nouveau module will not be loaded automatically.

Now install the driver, by using a command similar to this:

# NVIDIA-Linux-x86_64-410.79.run -s --dkms

This will do a silent installation, integrating the driver with DKMS so it will get built automatically every time you install a new linux-image together with its corresponding linux-headers package.

To make sure that the necessary device files in /dev exist after rebooting this system, I put this script /usr/local/sbin/load-nvidia:

#!/bin/sh
/sbin/modprobe nvidia
if [ "$?" -eq 0 ]; then
  # Count the number of NVIDIA controllers found.
  NVDEVS=`lspci | grep -i NVIDIA`
  N3D=`echo "$NVDEVS" | grep "3D controller" | wc -l`
  NVGA=`echo "$NVDEVS" | grep "VGA compatible controller" | wc -l`
  N=`expr $N3D + $NVGA - 1`
  for i in `seq 0 $N`; do
    mknod -m 666 /dev/nvidia$i c 195 $i
  done
  mknod -m 666 /dev/nvidiactl c 195 255
else
  exit 1
fi
/sbin/modprobe nvidia-uvm
if [ "$?" -eq 0 ]; then
  # Find out the major device number used by the nvidia-uvm driver
  D=`grep nvidia-uvm /proc/devices | awk '{print $1}'`
  mknod -m 666 /dev/nvidia-uvm c $D 0
else
  exit 1
fi

In order to start this script at boot up, I created a systemd service. Create the file /etc/systemd/system/load-nvidia.service with this content:

[Unit]
Description=Load NVidia driver and creates nodes in /dev
Before=slurmd.service

[Service]
ExecStart=/usr/local/sbin/load-nvidia
Type=oneshot
RemainAfterExit=true

[Install]
WantedBy=multi-user.target

Now run these commands to enable the service:

# systemctl daemon-reload
# systemctl enable load-nvidia.service

You can verify whether everything is working by running the command

$ nvidia-smi

CUDA libraries

Download the CUDA libraries. For Debian, choose Linux x86_64 Ubuntu 18.04 runfile (local).

Then install the libraries with this command:

# cuda_10.0.130_410.48_linux --silent --override --toolkit

This will install the toolkit in silent mode. With the override option it is possible to install the toolkit on systems which don’t have an NVIDIA GPU, which might be useful for compilation purposes.

To make sure your users have the necessary binaries and libraries available in their path, create the file /etc/profile.d/cuda.sh with this content:

export CUDA_HOME=/usr/local/cuda
export PATH=$CUDA_HOME/bin:$PATH
export LD_LIBRARY_PATH=$CUDA_HOME/lib64:$LD_LIBRARY_PATH


Going back to my roots: testing Mageia 4 beta

Many years ago I used to be a Mandriva user and contributor, mostly active in packaging software. I stopped my contributions because I had the feeling the distribution was having more and more trouble keeping up with all new evolutions in the GNU Linux free software world and was loosing ground to other, more innovative distributions. Finally I settled for Debian myself. Even though it is not always the most innovative distribution itself, I liked its open, independent community-based nature.

Now after all this time, I was curious to see how my former favourite distribution had evolved. Mandriva was forked by former Mandriva employees and contributors, and so Mageia was born. Mageia is currently developing version 4 of its distribution and released beta 2 two weeks ago, on 13 December 2013. I decided to try out this version.
Continue reading “Going back to my roots: testing Mageia 4 beta”

Living in a surveillance state

Because of time constraints it has been a long time since I wrote something here. However, this is something I want to share with as many people as possible now: Mikko Hypponen’s talk titled “Living in a surveillance state”, last week at TEDxBrussels . If you think that you don’t have to fear the spying by the NSA, GCHQ and other state services because you have nothing to hide, or you are wondering what we can do against it, then you should definitely watch this. “Open source” is the key answer to the latter question by the way.

These are 20 very well spent minutes of your time.

wmode=opaque" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen>

Leap second causing ksoftirqd and java to use lots of cpu time

Today there was a leap second at 23:59:60 UTC. On one of my systems, this caused a high CPU load starting from around 02h00 GMT+2 (which corresponds with the time of the leap second). ksoftirqd and some java (glassfish) process where using lots of CPU time. This system was running Debian Squeeze with kernel 2.6.32-45. The problem is very easy to fix: just run

# date -s "`date`"

and everything will be fine again. I found this solution on the Linux Kernel Mailing List: http://marc.info/?l=linux-kernel&m=134113389621450&w=2. Apparently a similar problem can happen with Firefox, Thunderbird, Chrome/Chromium, Java, Mysql, Virtualbox and probably other processes.

I was a bit suprised that this problem only happened on this particular machine, because I have several other servers running similar kernel versions.

Multi-monitor support with Randr 1.3 and NVidia’s proprietary driver

I just got a second monitor at home and wanted to configure the two monitors with my NVidia graphics card. You can set up TwinView in the Nvida Settings application, however I did not like that solution: the next time I restarted X, all the settings were lost and the second monitor powered off. Also GNOME did not seem to behave correctly when the monitors went on stand by and I unlocked the desktop. The desktop appeared to be shifted over the monitors. The latter might be a bug of gnome-settings-daemon 3.2 and not Nvidia’s however.

However since the NVidia proprietary driver version 330 beta series, it finally supports Randr 1.3 so that you can configure dual screen with the configuration tools provided with your desktop. This driver is currently available in Debian Experimental. To install it (make sure you have experimental in your apt sources.list first, of course), run this command:

# apt-get install -t experimental xserver-xorg-video-nvidia

I also pulled in gnome-settings-daemon and gnome-control-center version 3.4 which appeared in Debian Sid today:

# apt-get install -t unstable gnome-settings-daemon gnome-contol-center

Now reboot your system (to be sure the new Nvidia kernel and X drivers are loaded), and then go System Tools – Preferences – System Settings (gnome-control-center in a terminal window) – Display. Enable the wo monitors, set the optimal (highest) resolution and drag them in the right position, click Apply, and confirm everything is working fine. Now you have a nice multi-monitor setup without needing to mess with NVidia’s twin view and without having to create a script to get the right settings applied automatically when X is started.

Creating your own GNOME session based on cairo-dock and Compiz

Personally I absolutely do not like the gnome-shell in GNOME 3. I actually even hate it: it is slow, messy and cumbersome to use and I have the feeling that developers are not listening to criticism. Obvious and trivial design bugs which are well known, are totally ignored (bug 662738 is an example).

For that reason, I went looking for an alternative desktop. KDE is way too bloated for a netbook with 1 GB of RAM, while XFCE is not as polished as a traditional GNOME 2.32 desktop. The best alternative I could find out right now, was to just replace the GNOME Shell by a custom panel or dock implementation. In the end I chose cairo-dock: it is written in C, so it is probably not as memory hungry as AWN (which uses Python) and Docky (which uses Mono, which I also consider as a possible patent minefield). Cairo-dock is also actively maintained. I paired cairo-dock with the compiz window manager to get some nicely looking desktop.
Continue reading “Creating your own GNOME session based on cairo-dock and Compiz”

MegaCLI: useful commands

Recently I installed a server with a Supermicro SMC2108 RAID adapter, which is actually a LSI MegaRAID SAS 9260. LSI created a command line utility called MegaCLI for Linux to manage this adapter. You can download it from their support pages. The downloaded archive contains an RPM file. I installed mc and rpm on Debian with apt-get, and then extracted the MegaCli64 binary (for x86_64) to /usr/local/sbin, and the libsysfs.so.2.0.2 from the Lib_utils RPM to /opt/lsi/3rdpartylibs/x86_64/ (that’s the location where MegaCli64 looks for this library).

Here are some useful commands:

View information about the RAID adapter

For checking the firmware version, battery back-up unit presence, installed cache memory and the capabilities of the adapter:

# MegaCli64 -AdpAllInfo -aAll

View information about the battery backup-up unit state

# MegaCli64 -AdpBbuCmd -aAll

View information about virtual disks

Useful for checking RAID level, stripe size, cache policy and RAID state:

# MegaCli64 -LDInfo -Lall -aALL

View information about physical drives

# MegaCli64 -PDList -aALL

Patrol read

Patrol read is a feature which tries to discover disk error before it is too late and data is lost. By default it is done automatically (with a delay of 168 hours between different patrol reads) and will take up to 30% of IO resources.

To see information about the patrol read state and the delay between patrol read runs:
# MegaCli64 -AdpPR -Info -aALL

To find out the current patrol read rate, execute
# MegaCli64 -AdpGetProp PatrolReadRate -aALL

To reduce patrol read resource usage to 2% in order to minimize the performance impact:
# MegaCli64 -AdpSetProp PatrolReadRate 2 -aALL

To disable automatic patrol read:
# MegaCli64 -AdpPR -Dsbl -aALL

To start a manual patrol read scan:
# MegaCli64 -AdpPR -Start -aALL

To stop a patrol read scan:
# MegaCli64 -AdpPR -Stop -aALL

You could use the above commands to run patrol read in off-peak times.

Migrate from one RAID level to another

In this example, I migrate the virtual disk 0 from RAID level 6 to RAID 5, so that the disk space of one additional disk becomes available. The second command is used to make Linux detect the new size of the RAID disk.

# /usr/local/sbin/MegaCli64 -LDRecon -Start -r5 -L0 -a0
# echo 1 > /sys/block/sda/device/rescan

Extending an existing RAID array with a new disk

./MegaCli64 -LDRecon -Start -r5 -Add -PhysDrv[32:3] -L0 -a0

Create a new RAID 5 virtual disk from a set of new hard drives

First we need to now the enclosure and slot number of the hard drives we want to use for the new RAID disk. You can find them out by the first command. Then I add a virtual disk using RAID level 5, followed by the list of drives I want to use, specified by enclosure:slot syntax.

# MegaCli64 -PDList -aALL | egrep 'Adapter|Enclosure|Slot|Inquiry'
# MegaCli64 -CfgLdAdd -r5'[252:5,252:6,252:7]' -a0

Extending an existing RAID array with a new disk

First check the enclosure device ID and the slot number of the newly added disk with the command above. Then we reconstruct the logical drive, adding the new drive. For a RAID 5 array this command is used:

# MegaCli64 -LDRecon -Start -r5 -Add -PhysDrv[32:3] -L0 -a0

View reconstruction progress

When reconstructing a RAID array, you can check its progress with this command.
# MegaCli64 -LDRecon ShowProg L0 -a0

(replace L0 by L1 for the second virtual disk, and so on)

Configure write-cache to be disabled when battery is broken

# MegaCli64 -LDSetProp NoCachedBadBBU -LALL -aALL

Change physical disk cache policy

If your system is not connected to a UPS, you should disable the physical disk cache in order to prevent data loss.

# MegaCli -LDGetProp -DskCache -LAll -aALL

To enable it (only do this if you have a UPS and redundant power supplies):

# MegaCli -LDGetProp -DskCache -LAll -aALL

More information

http://ftzdomino.blogspot.com/2009/03/some-useful-megacli-commands.html
https://twiki.cern.ch/twiki/bin/view/FIOgroup/DiskRefPerc
http://hwraid.le-vert.net/wiki/LSIMegaRAIDSAS
http://kb.lsi.com/KnowledgebaseArticle16516.aspx

Fixing grub-probe error: Couldn’t find PV, check your device.map.

Today I was getting this error when installing a new kernel on a server running Debian:

/usr/sbin/grub-probe: error: Couldn't find PV pv2. Check your device.map.

The error can be reproduce by running the update-grub command.

The day before, a new RAID disk was added to this server, so I suspected this could be the cause. The file /boot/grub/device.map contained a reference to the first RAID disk as (hd0) but did not contain a reference to the new RAID disk. I ran

# ls -l /dev/disk/by-id/

to find out which SCSI ID referred to sdb (the new RAID disk), and then added the following line to device.map:


(hd1) /dev/disk/by-id/scsi-3600304800087c4f015fb4f2e4cc7a8e5

Now installing the new kernel works fine!