Linux 5.0 Netfilter bug

On two desktop systems running Debian Buster with Linux kernel version 5.0.7, I was experiencing a problem when Shorewall6 was stopping or restarting. This kernel backtrace appeared in the logs:

 [   28.932323] WARNING: CPU: 1 PID: 169 at net/netfilter/nft_compat.c:82 nft_xt_put.part.9+0x21/0x30 [nft_compat]
[   28.932325] Modules linked in: ip6t_REJECT(E) nf_reject_ipv6(E) nft_chain_nat_ipv6(E) nf_nat_ipv6(E) nft_chain_route_ipv6(E) xt_multiport(E) nf_log_ipv6(E) xt_recent(E) xt_comment(E) xt_hashlimit(E) xt_addrtype(E) xt_mark(E) xt_CT(E) nfnetlink_log(E) xt_NFLOG(E) nf_log_ipv4(E) nf_log_common(E) xt_LOG(E) nf_nat_tftp(E) nf_nat_snmp_basic(E) nf_conntrack_snmp(E) nf_nat_sip(E) nf_nat_pptp(E) nf_nat_irc(E) nf_nat_h323(E) nf_nat_ftp(E) nf_nat_amanda(E) ts_kmp(E) nf_conntrack_amanda(E) nf_conntrack_sane(E) nf_conntrack_tftp(E) nf_conntrack_sip(E) nf_conntrack_pptp(E) nf_conntrack_proto_gre(E) nf_conntrack_netlink(E) nf_conntrack_netbios_ns(E) nf_conntrack_broadcast(E) nf_conntrack_irc(E) nf_conntrack_h323(E) nf_conntrack_ftp(E) nft_chain_route_ipv4(E) xt_CHECKSUM(E) nft_chain_nat_ipv4(E) ipt_M
 ASQUERADE(E) nf_nat_ipv4(E) nf_nat(E) xt_conntrack(E) nf_conntrack(E) nf_defrag_ipv6(E) nf_defrag_ipv4(E) ipt_REJECT(E) nf_reject_ipv4(E) nft_counter(E) xt_tcpudp(E) nft_compat(E) tun(E) bridge(E) stp(E)
[   28.932357]  llc(E) devlink(E) nf_tables(E) nfnetlink(E) msr(E) cmac(E) cpufreq_userspace(E) cpufreq_powersave(E) cpufreq_conservative(E) bnep(E) binfmt_misc(E) nls_ascii(E) nls_cp437(E) vfat(E) fat(E) ext4(E) mbcache(E) jbd2(E) fscrypto(E) intel_rapl(E) x86_pkg_temp_thermal(E) intel_powerclamp(E) kvm_intel(E) kvm(E) irqbypass(E) crct10dif_pclmul(E) crc32_pclmul(E) efi_pstore(E) ghash_clmulni_intel(E) btusb(E) mei_wdt(E) btrtl(E) btbcm(E) btintel(E) bluetooth(E) arc4(E) aesni_intel(E) snd_hda_codec_hdmi(E) drbg(E) iwldvm(E) aes_x86_64(E) ansi_cprng(E) crypto_simd(E) ecdh_generic(E) cryptd(E) glue_helper(E) crc16(E) snd_hda_codec_idt(E) mac80211(E) hp_wmi(E) snd_hda_codec_generic(E) sparse_keymap(E) joydev(E) ledtrig_audio(E) snd_hda_intel(E) iwlwifi(E) snd_hda_codec(E) intel_cstate(E) w
 mi_bmof(E) uvcvideo(E) intel_uncore(E) sg(E) serio_raw(E) intel_rapl_perf(E) snd_hda_core(E) videobuf2_vmalloc(E) tpm_infineon(E) videobuf2_memops(E) videobuf2_v4l2(E) videobuf2_common(E) snd_hwdep(E)
[   28.932408]  videodev(E) media(E) snd_pcm(E) efivars(E) snd_timer(E) iTCO_wdt(E) cfg80211(E) iTCO_vendor_support(E) rfkill(E) snd(E) tpm_tis(E) tpm_tis_core(E) soundcore(E) tpm(E) mei_me(E) mei(E) rng_core(E) evdev(E) hp_accel(E) lis3lv02d(E) input_polldev(E) pcc_cpufreq(E) hp_wireless(E) battery(E) ac(E) coretemp(E) loop(E) parport_pc(E) ppdev(E) lp(E) parport(E) bfq(E) efivarfs(E) ip_tables(E) x_tables(E) autofs4(E) btrfs(E) xor(E) zstd_decompress(E) zstd_compress(E) raid6_pq(E) libcrc32c(E) crc32c_generic(E) dm_mod(E) sr_mod(E) cdrom(E) sd_mod(E) hid_generic(E) usbhid(E) hid(E) sdhci_pci(E) cqhci(E) i915(E) ahci(E) i2c_algo_bit(E) libahci(E) sdhci(E) drm_kms_helper(E) crc32c_intel(E) mmc_core(E) xhci_pci(E) libata(E) ehci_pci(E) xhci_hcd(E) ehci_hcd(E) scsi_mod(E) psmouse(E) lpc_ich(
 E) firewire_ohci(E) firewire_core(E) crc_itu_t(E) e1000e(E) drm(E) usbcore(E) thermal(E) wmi(E) video(E) button(E)
[   28.932469] CPU: 1 PID: 169 Comm: kworker/1:2 Tainted: G            E     5.0.7 #1
[   28.932471] Hardware name: Hewlett-Packard HP EliteBook 8470p/179B, BIOS 68ICF Ver. F.31 09/24/2012
[   28.932481] Workqueue: events nf_tables_trans_destroy_work [nf_tables]
[   28.932486] RIP: 0010:nft_xt_put.part.9+0x21/0x30 [nft_compat]
[   28.932489] Code: ff ff ff f3 c3 0f 1f 40 00 0f 1f 44 00 00 48 8b 07 48 39 c7 75 14 48 83 ef 80 be 80 00 00 00 e8 f5 54 14 f6 b8 01 00 00 00 c3 <0f> 0b eb e8 90 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 53
[   28.932491] RSP: 0018:ffffb119411a3db8 EFLAGS: 00010206
[   28.932493] RAX: ffff9a33fe12b300 RBX: ffff9a33fe12b600 RCX: 0000000000000000
[   28.932495] RDX: 0000000000000000 RSI: ffff9a33fe12b678 RDI: ffff9a33fe12b600
[   28.932497] RBP: ffffffffc10e3400 R08: ffffffffc10e3180 R09: ffffffffc1288800
[   28.932498] R10: 0000000000000001 R11: 0000000000000001 R12: ffff9a34081d9e40
[   28.932500] R13: dead000000000200 R14: dead000000000100 R15: ffffffffc12a5088
[   28.932503] FS:  0000000000000000(0000) GS:ffff9a3436840000(0000) knlGS:0000000000000000
[   28.932505] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   28.932506] CR2: 0000557e2fdb5000 CR3: 00000001f6e5e002 CR4: 00000000001606e0
[   28.932508] Call Trace:
[   28.932516]  __nft_match_destroy.isra.10+0x69/0xa0 [nft_compat]
[   28.932526]  nf_tables_expr_destroy+0x1a/0x40 [nf_tables]
[   28.932533]  nf_tables_rule_destroy+0x4f/0x80 [nf_tables]
[   28.932541]  nf_tables_trans_destroy_work+0x1dd/0x200 [nf_tables]
[   28.932548]  process_one_work+0x191/0x380
[   28.932553]  worker_thread+0x204/0x3b0
[   28.932557]  ? rescuer_thread+0x340/0x340
[   28.932560]  kthread+0xf8/0x130
[   28.932563]  ? kthread_create_worker_on_cpu+0x70/0x70
[   28.932569]  ret_from_fork+0x35/0x40
[   28.932573] ---[ end trace fc35add4fa3b2bde ]---
[   29.015565] general protection fault: 0000 [#1] SMP PTI
[   29.015574] CPU: 3 PID: 2069 Comm: ip6tables-resto Tainted: G        W   E     5.0.7 #1
[   29.015577] Hardware name: Hewlett-Packard HP EliteBook 8470p/179B, BIOS 68ICF Ver. F.31 09/24/2012
[   29.015586] RIP: 0010:strcmp+0x4/0x20
[   29.015590] Code: 74 1a 49 39 d0 48 89 d0 75 e9 48 85 d2 74 05 c6 44 17 ff 00 48 c7 c0 f9 ff ff ff c3 f3 c3 f3 c3 66 0f 1f 44 00 00 48 83 c7 01 <0f> b6 47 ff 48 83 c6 01 3a 46 ff 75 07 84 c0 75 eb 31 c0 c3 19 c0
[   29.015593] RSP: 0018:ffffb119428e78e0 EFLAGS: 00010282
[   29.015597] RAX: 00000000ffffffff RBX: ffffb11941401264 RCX: 000000000000000b
[   29.015600] RDX: ffff9a33fe12b600 RSI: ffffb11941401264 RDI: 894810247c8d4849
[   29.015602] RBP: ffff9a340486c510 R08: 0000000000000003 R09: ffff9a33f6d58128
[   29.015605] R10: ffffb119428e7930 R11: 0000000000000002 R12: 0000000000000000
[   29.015607] R13: ffffffffc1294e70 R14: ffff9a340486c500 R15: 894810247c8d4838
[   29.015611] FS:  00007f26d10ba740(0000) GS:ffff9a34368c0000(0000) knlGS:0000000000000000
[   29.015614] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   29.015617] CR2: 00007f26d118a6d0 CR3: 00000001fd760003 CR4: 00000000001606e0
[   29.015619] Call Trace:
[   29.015631]  nft_match_select_ops+0x92/0x210 [nft_compat]
[   29.015646]  nf_tables_expr_parse+0x13e/0x1e0 [nf_tables]
[   29.015653]  ? kvmalloc_node+0x43/0x70
[   29.015663]  nf_tables_newrule+0x247/0x8b0 [nf_tables]
[   29.015671]  nfnetlink_rcv_batch+0x499/0x720 [nfnetlink]
[   29.015679]  ? skb_queue_tail+0x1b/0x50
[   29.015685]  ? _cond_resched+0x16/0x40
[   29.015691]  ? kmem_cache_alloc_node_trace+0x1c1/0x1f0
[   29.015695]  ? __insert_vmap_area+0x99/0x100
[   29.015702]  ? refcount_inc_checked+0x5/0x30
[   29.015707]  ? apparmor_capable+0x70/0xb0
[   29.015713]  ? __nla_parse+0x34/0x150
[   29.015719]  nfnetlink_rcv+0x113/0x136 [nfnetlink]
[   29.015725]  netlink_unicast+0x1b9/0x240
[   29.015731]  netlink_sendmsg+0x2d0/0x3c0
[   29.015735]  sock_sendmsg+0x36/0x40
[   29.015739]  ___sys_sendmsg+0x2e9/0x300
[   29.015744]  ? page_add_file_rmap+0x13/0x1f0
[   29.015750]  ? filemap_map_pages+0x183/0x380
[   29.015756]  ? __handle_mm_fault+0xb89/0x1200
[   29.015760]  ? refcount_inc_checked+0x5/0x30
[   29.015764]  ? apparmor_capable+0x70/0xb0
[   29.015768]  ? security_capable+0x35/0x50
[   29.015772]  ? release_sock+0x19/0x90
[   29.015776]  ? __sys_sendmsg+0x63/0xa0
[   29.015780]  __sys_sendmsg+0x63/0xa0
[   29.015787]  do_syscall_64+0x55/0xf0
[   29.015792]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[   29.015797] RIP: 0033:0x7f26d11bcc74
[   29.015800] Code: 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b5 0f 1f 80 00 00 00 00 48 8d 05 89 5a 0c 00 8b 00 85 c0 75 13 b8 2e 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 54 c3 0f 1f 00 41 54 41 89 d4 55 48 89 f5 53
[   29.015803] RSP: 002b:00007ffd02e15868 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
[   29.015807] RAX: ffffffffffffffda RBX: 00007ffd02e15880 RCX: 00007f26d11bcc74
[   29.015809] RDX: 0000000000000000 RSI: 00007ffd02e16900 RDI: 0000000000000003
[   29.015812] RBP: 00007ffd02e16f80 R08: 0000000000000004 R09: 0000000000000000
[   29.015814] R10: 00007ffd02e168ec R11: 0000000000000246 R12: 0000564c33d862a0
[   29.015816] R13: 00007ffd02e19850 R14: 00007ffd02e15870 R15: 00007ffd02e19888
[   29.015820] Modules linked in: ip6t_REJECT(E) nf_reject_ipv6(E) nft_chain_nat_ipv6(E) nf_nat_ipv6(E) nft_chain_route_ipv6(E) xt_multiport(E) nf_log_ipv6(E) xt_recent(E) xt_comment(E) xt_hashlimit(E) xt_addrtype(E) xt_mark(E) xt_CT(E) nfnetlink_log(E) xt_NFLOG(E) nf_log_ipv4(E) nf_log_common(E) xt_LOG(E) nf_nat_tftp(E) nf_nat_snmp_basic(E) nf_conntrack_snmp(E) nf_nat_sip(E) nf_nat_pptp(E) nf_nat_irc(E) nf_nat_h323(E) nf_nat_ftp(E) nf_nat_amanda(E) ts_kmp(E) nf_conntrack_amanda(E) nf_conntrack_sane(E) nf_conntrack_tftp(E) nf_conntrack_sip(E) nf_conntrack_pptp(E) nf_conntrack_proto_gre(E) nf_conntrack_netlink(E) nf_conntrack_netbios_ns(E) nf_conntrack_broadcast(E) nf_conntrack_irc(E) nf_conntrack_h323(E) nf_conntrack_ftp(E) nft_chain_route_ipv4(E) xt_CHECKSUM(E) nft_chain_nat_ipv4(E) ipt_M
 ASQUERADE(E) nf_nat_ipv4(E) nf_nat(E) xt_conntrack(E) nf_conntrack(E) nf_defrag_ipv6(E) nf_defrag_ipv4(E) ipt_REJECT(E) nf_reject_ipv4(E) nft_counter(E) xt_tcpudp(E) nft_compat(E) tun(E) bridge(E) stp(E)
[   29.015861]  llc(E) devlink(E) nf_tables(E) nfnetlink(E) msr(E) cmac(E) cpufreq_userspace(E) cpufreq_powersave(E) cpufreq_conservative(E) bnep(E) binfmt_misc(E) nls_ascii(E) nls_cp437(E) vfat(E) fat(E) ext4(E) mbcache(E) jbd2(E) fscrypto(E) intel_rapl(E) x86_pkg_temp_thermal(E) intel_powerclamp(E) kvm_intel(E) kvm(E) irqbypass(E) crct10dif_pclmul(E) crc32_pclmul(E) efi_pstore(E) ghash_clmulni_intel(E) btusb(E) mei_wdt(E) btrtl(E) btbcm(E) btintel(E) bluetooth(E) arc4(E) aesni_intel(E) snd_hda_codec_hdmi(E) drbg(E) iwldvm(E) aes_x86_64(E) ansi_cprng(E) crypto_simd(E) ecdh_generic(E) cryptd(E) glue_helper(E) crc16(E) snd_hda_codec_idt(E) mac80211(E) hp_wmi(E) snd_hda_codec_generic(E) sparse_keymap(E) joydev(E) ledtrig_audio(E) snd_hda_intel(E) iwlwifi(E) snd_hda_codec(E) intel_cstate(E) w
 mi_bmof(E) uvcvideo(E) intel_uncore(E) sg(E) serio_raw(E) intel_rapl_perf(E) snd_hda_core(E) videobuf2_vmalloc(E) tpm_infineon(E) videobuf2_memops(E) videobuf2_v4l2(E) videobuf2_common(E) snd_hwdep(E)
[   29.015913]  videodev(E) media(E) snd_pcm(E) efivars(E) snd_timer(E) iTCO_wdt(E) cfg80211(E) iTCO_vendor_support(E) rfkill(E) snd(E) tpm_tis(E) tpm_tis_core(E) soundcore(E) tpm(E) mei_me(E) mei(E) rng_core(E) evdev(E) hp_accel(E) lis3lv02d(E) input_polldev(E) pcc_cpufreq(E) hp_wireless(E) battery(E) ac(E) coretemp(E) loop(E) parport_pc(E) ppdev(E) lp(E) parport(E) bfq(E) efivarfs(E) ip_tables(E) x_tables(E) autofs4(E) btrfs(E) xor(E) zstd_decompress(E) zstd_compress(E) raid6_pq(E) libcrc32c(E) crc32c_generic(E) dm_mod(E) sr_mod(E) cdrom(E) sd_mod(E) hid_generic(E) usbhid(E) hid(E) sdhci_pci(E) cqhci(E) i915(E) ahci(E) i2c_algo_bit(E) libahci(E) sdhci(E) drm_kms_helper(E) crc32c_intel(E) mmc_core(E) xhci_pci(E) libata(E) ehci_pci(E) xhci_hcd(E) ehci_hcd(E) scsi_mod(E) psmouse(E) lpc_ich(
 E) firewire_ohci(E) firewire_core(E) crc_itu_t(E) e1000e(E) drm(E) usbcore(E) thermal(E) wmi(E) video(E) button(E)
[   29.015977] ---[ end trace fc35add4fa3b2bdf ]---
[   29.613482] RIP: 0010:strcmp+0x4/0x20
[   29.613486] Code: 74 1a 49 39 d0 48 89 d0 75 e9 48 85 d2 74 05 c6 44 17 ff 00 48 c7 c0 f9 ff ff ff c3 f3 c3 f3 c3 66 0f 1f 44 00 00 48 83 c7 01 <0f> b6 47 ff 48 83 c6 01 3a 46 ff 75 07 84 c0 75 eb 31 c0 c3 19 c0
[   29.613488] RSP: 0018:ffffb119428e78e0 EFLAGS: 00010282
[   29.613490] RAX: 00000000ffffffff RBX: ffffb11941401264 RCX: 000000000000000b
[   29.613492] RDX: ffff9a33fe12b600 RSI: ffffb11941401264 RDI: 894810247c8d4849
[   29.613493] RBP: ffff9a340486c510 R08: 0000000000000003 R09: ffff9a33f6d58128
[   29.613494] R10: ffffb119428e7930 R11: 0000000000000002 R12: 0000000000000000
[   29.613495] R13: ffffffffc1294e70 R14: ffff9a340486c500 R15: 894810247c8d4838
[   29.613497] FS:  00007f26d10ba740(0000) GS:ffff9a34368c0000(0000) knlGS:0000000000000000
[   29.613499] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   29.613500] CR2: 00007f26d118a6d0 CR3: 00000001fd760003 CR4: 00000000001606e0

On one of the two systems, this also prevented a clean shutdown: the kernel would hang completely when trying to shut down.

The problem is known, and can be fixed by this patch, which has been queued in the stable 5.0 tree. It will hopefully be included in the 5.0.8 version.

Linux security hardening recommendations

In a previous blog post, I wrote about how to secure OpenSSH against brute force attacks. However, what if someone manages to get a shell on your system despite all your efforts? Or what if you want to protect your system from your own users doing nasty things? Following the principle of defense in depth, it is important to harden your system further.

Software updates

Make sure you are running a supported distribution, preferably the most recent version. For example, Debian Jessie is still supported, but upgrading to Debian Stretch is strongly recommended because it offers various security improvements (a more recent kernel with new security hardening, PHP 7 with new security-related features, etc.).

Install amd64-microcode (for AMD CPUs) or intel-microcode (for Intel CPUs), which is needed to protect against hardware vulnerabilities such as Spectre, Meltdown and L1TF. I recommend installing it from stretch-backports in order to have the latest firmware.

Automatic updates and needrestart

I recommend installing unattended-upgrades. You can configure it to just download updates or to download and install them automatically. By default, unattended-upgrades will only install updates from the official security repositories, so it is relatively safe to let it do this automatically. If you have already installed it, you can run this command to reconfigure it:

# dpkg-reconfigure unattended-upgrades

When you update system libraries, you should also restart all daemons which are using these libraries to make them use the newly installed version. This is exactly what needrestart does. After you have run apt-get, it will check whether there are any daemons running with older libraries, and will offer to restart them. If you use it with unattended-upgrades, you should set this option in /etc/needrestart/needrestart.conf to make sure that all services which require a restart are indeed restarted:

$nrconf{restart} = 'a';

Up-to-date kernel

Running an up-to-date kernel is very important, because the kernel itself can be vulnerable too. In the worst case, an outdated kernel can be exploited to gain root permissions. Do not forget to reboot after updating the kernel.

Every new kernel version also contains various extra security hardening measures. Kernel developer Kees Cook has an overview of security related changes in the kernel.

In case you build your own kernel, you can use kconfig-hardened-check to get recommendations for a hardened kernel configuration.

Firewall: filtering outgoing traffic

Installing a firewall which filters incoming traffic is an obvious step. However, have you considered also filtering outgoing traffic? This is a bit more difficult to set up because you need to whitelist all hosts to which outgoing connections are needed (think of your distribution’s repositories, NTP servers, DNS servers, …), but it is a very effective measure which will help limit the damage in case a user account gets compromised, despite all your other protective efforts.
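As a rough sketch of what such an outgoing whitelist could look like with plain iptables (rather than a front end like Shorewall), the function below only prints the rules instead of applying them, so they can be reviewed first. The function name print_egress_rules and the addresses (taken from the 192.0.2.0/24 documentation range) are placeholders for illustration, not values from a real setup. It only covers IPv4; similar rules would be needed with ip6tables.

```shell
#!/bin/sh
# Print (not apply) iptables rules for a default-deny OUTPUT policy.
# Arguments: "proto:address:port" triples, e.g. udp:192.0.2.53:53
# Function name and addresses are assumptions made for this example.
print_egress_rules() {
    # Always allow loopback and packets of already established connections.
    echo "iptables -A OUTPUT -o lo -j ACCEPT"
    echo "iptables -A OUTPUT -m conntrack --ctstate ESTABLISHED,RELATED -j ACCEPT"
    for entry in "$@"; do
        proto=${entry%%:*}      # part before the first colon
        rest=${entry#*:}        # "address:port"
        addr=${rest%%:*}
        port=${rest#*:}
        echo "iptables -A OUTPUT -p $proto -d $addr --dport $port -j ACCEPT"
    done
    # Everything not whitelisted above is dropped.
    echo "iptables -P OUTPUT DROP"
}
```

Calling print_egress_rules udp:192.0.2.53:53 tcp:192.0.2.10:443 prints five commands. Once they look right, the output could be piped into sh as root.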

Ensuring strong passwords

Prevent your users from setting bad passwords by installing libpam-pwquality, together with some word lists for your language and a few common languages. These will be used for verifying that users are not using a common word as their password. libpam-pwquality will be enabled automatically after installation with some default settings.

# apt-get install libpam-pwquality wbritish wamerican wfrench wngerman wdutch

Please note that by default, libpam-pwquality will only enforce strong passwords when a non-root user changes their password. If root is setting a password, it will give a warning if a weak password is set, but will still allow it. If you want to enforce it for root too (which I recommend), then add enforce_for_root to the pam_pwquality line in /etc/pam.d/common-password:

password	requisite			pam_pwquality.so retry=3 enforce_for_root

Automatically log out inactive users

In order to log out inactive users, set a timeout of 600 seconds on the Bash shell. Create /etc/profile.d/tmout.sh:

export TMOUT=600
readonly TMOUT

Prevent creating cron jobs

Make sure users cannot set up cron jobs. In case an attacker gets a shell on your system, cron will often be used to ensure the malware continues running after a reboot. In order to prevent normal users from setting up cron jobs, create an empty /etc/cron.allow.

Protect against fork bombs and excessive logins and CPU usage

Create a new file in /etc/security/limits.d to impose some limits on user sessions. I strongly recommend setting a value for nproc in order to prevent fork bombs. maxlogins is the maximum number of logins per user, and cpu sets a limit on the CPU time a user can use (in minutes):

*	hard	nproc		1024
*	hard	maxlogins 	4
1000:	hard	cpu		180

Hiding processes from other users

By mounting the /proc filesystem with the hidepid=2 option, users cannot see the PIDs of processes by other users in /proc, and hence these processes also become invisible when using tools like top and ps. Put this in /etc/fstab to mount /proc by default with this option:

none	/proc	proc	defaults,hidepid=2	0	0

Restricting /proc/kallsyms

It is possible to restrict access to /proc/kallsyms at boot time by setting 400 permissions. Put this in /etc/rc.local:

chmod 400 /proc/kallsyms

/proc/kallsyms contains information about how the kernel’s memory is laid out. With this information it becomes easier to attack the kernel itself, so hiding this information is always a good idea. It should be noted though that attackers can get this information from other sources too, such as from the System.map files in /boot.

Harden kernel configuration with sysctl

Several kernel settings can be set at run time using sysctl. To make these settings permanent, put them in files with the .conf extension in /etc/sysctl.d.

It is possible to hide the kernel messages (which can be read with the dmesg command) from other users than root by setting the sysctl kernel.dmesg_restrict to 1. On Debian Stretch and later this should already be the default value:

kernel.dmesg_restrict = 1

From Linux kernel version 4.19 on, it’s possible to disallow opening FIFOs or regular files not owned by the user in world-writable sticky directories. This setting would have prevented vulnerabilities found in various user space programs in the last couple of years. This protection is activated automatically if you use systemd version 241 or higher with Linux 4.19 or higher. If your kernel supports this feature but you are not using systemd 241, you can activate it yourself with these sysctl settings:

fs.protected_regular = 1
fs.protected_fifos = 1

Also check whether the following sysctls have the right value in order to enable protection for hard links and symlinks. These work with Linux 3.6 and higher, and will likely already be enabled by default on your system:

fs.protected_hardlinks = 1
fs.protected_symlinks = 1

On Debian Stretch, only root can access perf events by default:

kernel.perf_event_paranoid = 3

Show kernel pointers in /proc as null for non-root users:

kernel.kptr_restrict = 1

Disable the kexec system call, which allows you to boot a different kernel without going through the BIOS and boot loader:

kernel.kexec_load_disabled = 1

Allow ptrace access (used by gdb and strace) for non-root users only to child processes. For example strace ls will still work, but strace -p 8659 will not work as non-root user:

kernel.yama.ptrace_scope = 1

The Linux kernel includes eBPF, the extended Berkeley Packet Filter, which is a VM in which unprivileged users can load and run certain code in the kernel. If you are sure no users need to call bpf(), it can be disabled for non-root users:

kernel.unprivileged_bpf_disabled = 1

In case the BPF Just-In-Time compiler is enabled (it is disabled by default, see sysctl net/core/bpf_jit_enable), it is possible to enable some extra hardening against certain vulnerabilities:

net.core.bpf_jit_harden = 2

Take a look at the Kernel Self Protection Project Recommended settings page to find an up to date list of recommended settings.
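To verify that the settings above are really active on a running system, something like the following sketch could be used. It reads lines in the same "key = value" syntax as the files in /etc/sysctl.d and compares them with the live values under /proc/sys. The function name check_sysctls and the optional root argument (handy for trying it out on a test directory) are assumptions made for this example.

```shell
#!/bin/sh
# Compare live sysctl values with expected ones.
# Input on stdin: "key = value" lines (same syntax as /etc/sysctl.d/*.conf).
# $1 is the sysctl tree root and defaults to /proc/sys.
# The function name is made up for this example.
check_sysctls() {
    root=${1:-/proc/sys}
    status=0
    while IFS='= ' read -r key expected; do
        # Skip blank lines and comments.
        case $key in ''|'#'*) continue ;; esac
        # kernel.dmesg_restrict -> $root/kernel/dmesg_restrict
        path=$root/$(echo "$key" | tr . /)
        if [ ! -r "$path" ]; then
            echo "SKIP $key (not available)"
        elif [ "$(cat "$path")" = "$expected" ]; then
            echo "OK   $key = $expected"
        else
            echo "DIFF $key is $(cat "$path"), expected $expected"
            status=1
        fi
    done
    # Non-zero exit status if at least one value differs.
    return $status
}
```

For example, with the settings collected in a file such as /etc/sysctl.d/hardening.conf (a hypothetical file name), run check_sysctls < /etc/sysctl.d/hardening.conf. Settings which the running kernel does not know are reported as skipped rather than as failures.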

Lynis

Finally I want to mention Lynis, a security auditing tool. It will check the configuration of your system, and make recommendations for further security hardening.

Further ideas

Debian Stretch on AMD EPYC (ZEN) with an NVIDIA GPU for HPC

Recently at work we bought a new Dell PowerEdge R7425 server for our HPC cluster. These are some of the specifications:

  • 2 AMD EPYC 7351 16-Core Processors
  • 128 GB RAM (16 DIMMs of 8 GB)
  • Tesla V100 GPU
Photos: Dell PowerEdge R7425 front with cover, front without cover, and inside.

Our FAI configuration automatically installed Debian Stretch on it without any problem. All hardware was recognized and working. The installation of the basic operating system took less than 20 minutes. FAI also sets up Puppet on the machine. After booting the system, Puppet continues setting up the system: installing all needed software, setting up the Slurm daemon (part of the job scheduler), mounting the NFS4 shared directories, etc. All in all, the system was automatically installed and configured in less than 1 hour.

Linux kernel upgrade

Even though the Linux 4.9 kernel of Debian Stretch works with the hardware, there are still some reasons to update to a newer kernel. Only in more recent kernel versions is the k10temp kernel module capable of reading out the CPU temperature sensors. We also had problems with fscache (used for NFS4 caching) with the 4.9 kernel in the past, which are fixed in newer kernels. Furthermore, there have been many other performance optimizations which could be interesting for HPC.

You can find a more recent kernel in Debian’s Backports repository. At the time of writing it is a 4.18 based kernel. However, I decided to build my own 4.19 based kernel.

In order to build a Debian kernel package, you will need to have the package kernel-package installed. Download the sources of the Linux kernel you want to build, and configure it (using make menuconfig or any method you prefer). Then build your kernel using this command:

$ make -j 32 deb-pkg

Replace 32 with the number of parallel jobs you want to run; the number of CPU cores you have is a good choice. You can also add the LOCALVERSION and KDEB_PKGVERSION variables to set a custom name and version number. See the Debian handbook for a more complete howto. When the build has finished successfully, you can install the linux-image and linux-headers packages using dpkg.

We mentioned temperature monitoring support with the k10temp driver in newer kernels. If you want to check the temperatures of all NUMA nodes on your CPUs, use this command:

$ cat /sys/bus/pci/drivers/k10temp/*/hwmon/hwmon*/temp1_input

Divide the value by 1000 to get the temperature in degrees Celsius. Of course you can also use the sensors command of the lm-sensors package.
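The division by 1000 can also be done in a small helper. This is only a sketch: the function names are made up for this example, the result is truncated (not rounded) to one decimal, and the optional argument of show_k10temp is an alternative sysfs root which is only useful for testing.

```shell
#!/bin/sh
# Convert a k10temp reading (millidegrees Celsius) to degrees Celsius,
# truncated to one decimal. Uses integer arithmetic only.
millideg_to_celsius() {
    echo "$(( $1 / 1000 )).$(( $1 % 1000 / 100 ))"
}

# Print one line per temp1_input file found below the given sysfs root
# (defaults to /sys). Prints nothing if the k10temp driver is not loaded.
show_k10temp() {
    for f in "${1:-/sys}"/bus/pci/drivers/k10temp/*/hwmon/hwmon*/temp1_input; do
        [ -r "$f" ] || continue
        echo "$f: $(millideg_to_celsius "$(cat "$f")") C"
    done
}
```

Running show_k10temp without arguments then lists every sensor file together with its temperature.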

Kernel settings

VM dirty values

On systems with lots of RAM, you will encounter problems because the default values of vm.dirty_ratio and vm.dirty_background_ratio are too high. This can cause stalls when all dirty data in the cache is being flushed to disk or to the NFS server.

You can read more information and a solution in SuSE’s knowledge base. On Debian, you can create a file /etc/sysctl.d/dirty.conf with this content:

vm.dirty_bytes = 629145600
vm.dirty_background_bytes = 314572800

Run

# sysctl -p /etc/sysctl.d/dirty.conf

to make the settings take effect immediately.
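The two values in dirty.conf correspond to 600 MiB and 300 MiB. If you want to try other limits, a small helper like this one can generate the file contents from values in MiB. The function names are invented for this sketch, and which limits are appropriate depends on your storage and workload.

```shell
#!/bin/sh
# Convert MiB to bytes.
mib_to_bytes() {
    echo $(( $1 * 1024 * 1024 ))
}

# Print sysctl settings for the given limits.
# $1 = dirty limit in MiB, $2 = background dirty limit in MiB.
print_dirty_conf() {
    echo "vm.dirty_bytes = $(mib_to_bytes "$1")"
    echo "vm.dirty_background_bytes = $(mib_to_bytes "$2")"
}
```

print_dirty_conf 600 300 reproduces the two lines shown earlier.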

Kernel parameters

In /etc/default/grub, I added the options

transparent_hugepage=always cgroup_enable=memory

to the GRUB_CMDLINE_LINUX variable.

Transparent hugepages can improve performance in some cases. However, they can have a very negative impact on some specific workloads too. Applications like Redis, MongoDB and Oracle DB recommend not enabling transparent hugepages by default. So make sure that it’s worthwhile for your workload before adding this option.

Memory cgroups are used by Slurm to prevent jobs using more memory than what they reserved. Run

# update-grub

to make sure the changes will take effect at the next boot.
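After the reboot, a quick check confirms that both options actually reached the kernel. The helper below is a minimal sketch. The name has_boot_param is made up, and the optional second argument (a file to read instead of /proc/cmdline) exists only to make it easy to test.

```shell
#!/bin/sh
# Succeed if the kernel command line contains exactly the parameter $1.
# $2 is the file to read and defaults to /proc/cmdline.
has_boot_param() {
    # Put every parameter on its own line, then look for an exact match.
    tr -s ' ' '\n' < "${2:-/proc/cmdline}" | grep -qxF -e "$1"
}
```

For example: has_boot_param transparent_hugepage=always && echo "THP option is set". The active transparent hugepage mode can additionally be read from /sys/kernel/mm/transparent_hugepage/enabled, where it is shown between square brackets.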

I/O scheduler

If you’re using a configuration based on a recent Debian kernel configuration, you will likely be using the Multi-Queue Block IO Queueing Mechanism with the mq-deadline scheduler as default. This is great for SSDs (especially NVMe-based ones), but might not be ideal for rotational hard drives. You can use the BFQ scheduler as an alternative on such drives. Be sure to test this properly though, because with Linux 4.19 I experienced some stability problems which seemed to be related to BFQ. I’ll be reviewing this scheduler again for 4.20 or 4.21.

First, if BFQ is built as a module (which is the case in Debian’s kernel config), you will need to load the module. Create a file /etc/modules-load.d/bfq.conf with contents

bfq

Then to use this scheduler by default for all rotational drives, create the file /etc/udev/rules.d/60-io-scheduler.rules with these contents:

# set scheduler for non-rotating disks
ACTION=="add|change", KERNEL=="sd[a-z]|mmcblk[0-9]*|nvme[0-9]*", ATTR{queue/rotational}=="0", ATTR{queue/scheduler}="mq-deadline"
# set scheduler for rotating disks
ACTION=="add|change", KERNEL=="sd[a-z]", ATTR{queue/rotational}=="1", ATTR{queue/scheduler}="bfq"

Run

# update-initramfs -u

to rebuild the initramfs so that it includes this udev rule, which will then be applied at the next boot.
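To see which scheduler each drive ended up with, you can read the queue/scheduler attribute in sysfs, where the active scheduler is shown between square brackets. The following sketch lists all block devices at once. The function names are invented for this example, and the optional argument is an alternative sysfs root for testing.

```shell
#!/bin/sh
# Extract the active scheduler from a line such as
# "mq-deadline kyber [bfq] none" (the active one is in brackets).
active_scheduler() {
    echo "$1" | sed -n 's/.*\[\(.*\)\].*/\1/p'
}

# List every block device with its rotational flag and active scheduler.
# $1 is the sysfs root and defaults to /sys.
list_schedulers() {
    for q in "${1:-/sys}"/block/*/queue; do
        [ -r "$q/scheduler" ] || continue
        dev=${q%/queue}     # strip the trailing /queue
        dev=${dev##*/}      # keep only the device name
        echo "$dev rotational=$(cat "$q/rotational") scheduler=$(active_scheduler "$(cat "$q/scheduler")")"
    done
}
```

On a system where the udev rules are applied, list_schedulers should report bfq for drives with rotational=1 and mq-deadline for the others.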

The Arch wiki has more information on I/O schedulers.

CPU Microcode update

We want the latest microcode for the CPU to be loaded. This is needed to mitigate the Spectre vulnerabilities. Install the amd64-microcode package from stretch-backports.

Packages with AMD ZEN support

hwloc

hwloc is a utility which reads out the NUMA topology of your system. It is used by the Slurm workload manager to bind tasks to certain cores.

The version of hwloc in Stretch (1.11.5) does not have support for the AMD ZEN architecture. However, hwloc 1.11.12 is available in stretch-backports, and this version does have AMD ZEN support. So make sure you have the packages hwloc libhwloc5 libhwloc-dev libhwloc-plugins installed from stretch-backports.

BLAS libraries

There is no BLAS library in Debian Stretch which supports the AMD ZEN architecture. Unfortunately, at the time of writing there is also no good BLAS implementation for ZEN available in stretch-backports. This will likely change in the near future though, as BLIS has now entered Debian Unstable and will likely be backported to the stretch-backports repository too.

NVIDIA drivers and CUDA libraries

NVIDIA drivers

I installed the NVIDIA drivers and CUDA libraries from the tarballs downloaded from the NVIDIA website because at the time of writing all packages available in the Debian repositories are outdated.

First make sure you have the linux-headers package installed which corresponds with the linux-image kernel package you are running. We will be using DKMS to rebuild the driver automatically whenever we install a new kernel, so also make sure you have the dkms package installed.

Download the NVIDIA driver for your GPU from the NVIDIA website. Remove the nouveau driver with the

# rmmod nouveau

command. Then create the file /etc/modprobe.d/blacklist-nouveau.conf with these contents:

blacklist nouveau

and rebuild the initramfs by running

# update-initramfs -u

This will ensure the nouveau module will not be loaded automatically.

Now install the driver, by using a command similar to this:

# NVIDIA-Linux-x86_64-410.79.run -s --dkms

This will do a silent installation, integrating the driver with DKMS so it will get built automatically every time you install a new linux-image together with its corresponding linux-headers package.

To make sure that the necessary device files in /dev exist after rebooting the system, I put this script in /usr/local/sbin/load-nvidia:

#!/bin/sh
/sbin/modprobe nvidia
if [ "$?" -eq 0 ]; then
  # Count the number of NVIDIA controllers found.
  NVDEVS=`lspci | grep -i NVIDIA`
  N3D=`echo "$NVDEVS" | grep "3D controller" | wc -l`
  NVGA=`echo "$NVDEVS" | grep "VGA compatible controller" | wc -l`
  N=`expr $N3D + $NVGA - 1`
  for i in `seq 0 $N`; do
    mknod -m 666 /dev/nvidia$i c 195 $i
  done
  mknod -m 666 /dev/nvidiactl c 195 255
else
  exit 1
fi
/sbin/modprobe nvidia-uvm
if [ "$?" -eq 0 ]; then
  # Find out the major device number used by the nvidia-uvm driver
  D=`grep nvidia-uvm /proc/devices | awk '{print $1}'`
  mknod -m 666 /dev/nvidia-uvm c $D 0
else
  exit 1
fi

In order to start this script at boot up, I created a systemd service. Create the file /etc/systemd/system/load-nvidia.service with this content:

[Unit]
Description=Load NVIDIA driver and create device nodes in /dev
Before=slurmd.service

[Service]
ExecStart=/usr/local/sbin/load-nvidia
Type=oneshot
RemainAfterExit=true

[Install]
WantedBy=multi-user.target

Now run these commands to enable the service:

# systemctl daemon-reload
# systemctl enable load-nvidia.service

You can verify whether everything is working by running the command

$ nvidia-smi

CUDA libraries

Download the CUDA libraries. For Debian, choose Linux x86_64 Ubuntu 18.04 runfile (local).

Then install the libraries with this command:

# cuda_10.0.130_410.48_linux --silent --override --toolkit

This will install the toolkit in silent mode. With the override option it is possible to install the toolkit on systems which don’t have an NVIDIA GPU, which might be useful for compilation purposes.

To make sure your users have the necessary binaries and libraries available in their path, create the file /etc/profile.d/cuda.sh with this content:

export CUDA_HOME=/usr/local/cuda
export PATH=$CUDA_HOME/bin:$PATH
export LD_LIBRARY_PATH=$CUDA_HOME/lib64:$LD_LIBRARY_PATH
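One small caveat with the file above: when LD_LIBRARY_PATH is empty or unset, the last line leaves a trailing colon, and the dynamic linker treats an empty entry as the current working directory. A possible variant which avoids this is sketched below. Wrapping it in a function named cuda_env is just a choice made for this example.

```shell
#!/bin/sh
# Variant of /etc/profile.d/cuda.sh which only appends ":$VAR" when the
# variable already has a value, so no empty path entry is created.
# The function name cuda_env is an assumption of this sketch.
cuda_env() {
    export CUDA_HOME=/usr/local/cuda
    export PATH="$CUDA_HOME/bin${PATH:+:$PATH}"
    export LD_LIBRARY_PATH="$CUDA_HOME/lib64${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
}
cuda_env
```

With LD_LIBRARY_PATH unset, this results in /usr/local/cuda/lib64 without a trailing colon. With an existing value, the CUDA directory is simply put in front of it.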


Leap second causing ksoftirqd and java to use lots of cpu time

Today there was a leap second at 23:59:60 UTC. On one of my systems, this caused a high CPU load starting from around 02h00 GMT+2 (which corresponds with the time of the leap second). ksoftirqd and a Java (GlassFish) process were using lots of CPU time. This system was running Debian Squeeze with kernel 2.6.32-45. The problem is very easy to fix: just run

# date -s "`date`"

and everything will be fine again. I found this solution on the Linux Kernel Mailing List: http://marc.info/?l=linux-kernel&m=134113389621450&w=2. Apparently a similar problem can happen with Firefox, Thunderbird, Chrome/Chromium, Java, MySQL, VirtualBox and probably other processes.

I was a bit surprised that this problem only happened on this particular machine, because I have several other servers running similar kernel versions.

Linux performance improvements

Two years ago I wrote an article presenting some Linux performance improvements. These performance improvements are still valid, but it is time to talk about some new improvements available. As I am using Debian now, I will focus on that distribution, but you should be able to easily implement these things on other distributions too. Some of these improvements are best suited for desktop systems, others for server systems, and some are useful for both.

Linux kernel: The battle of the CPU schedulers

For some time now, various patches that improve the CPU scheduler have been written for the Linux kernel. The CPU scheduler is the part of the kernel that is responsible for assigning CPU time to the different tasks running on your system. If you sometimes experience stuttering sound or a jerky mouse pointer while running other CPU-intensive tasks, the task scheduler is the most likely cause.

Con Kolivas has been maintaining an alternative scheduler for some time. His Staircase scheduler was designed with interactivity in mind, especially for desktop systems, where people want their system to be quickly responsive under all kinds of workloads. This scheduler has been optimized a lot through the years, and as such is very stable. Still there are some rare cases where “starvation” is possible.

At the start of March, Con Kolivas published a new scheduler, which was first called RSDL (Rotating Staircase DeadLine scheduler) and was later renamed to SD (Staircase Deadline). Based on the experience Con Kolivas gathered with his Staircase scheduler, SD is a more general-purpose scheduler that tries to give absolute fairness to the different running tasks, without favouring any process (for example, lots of other schedulers favour X). This way, no starvation issues should be possible with this scheduler. A lot of discussion followed his announcement, and it quickly became clear that a lot of people were not happy with the current scheduler in Linux. Important kernel developers like Mike Galbraith, Nick Piggin, Ingo Molnar, Willy Tarreau and Andrew Morton joined the discussion and also posted other scheduler patches, not without some trolling and flaming, as sometimes happens on such mailing lists. Con Kolivas’ scheduler was added to Andrew Morton’s mm kernel tree to get some more testing. The development of RSDL/SD had its ups and downs because of Con Kolivas’ health problems.

Ingo Molnar, who was rather critical of some of the ideas in the new scheduler at first, also recently began the development of CFS (Completely Fair Scheduler), which is actually based on the same basic concept of fairness. Con Kolivas announced that he would stop development of the SD scheduler because of his health problems, and because his ideas would now continue to be used in the CFS scheduler. But things turned out differently, and Con Kolivas continued development in the end. The result is that the SD scheduler is now at its 46th version (v. 0.46), and it seems most problems have been fixed. Based on all the testing done on the kernel mailing list, it seems SD 0.46 is more mature than CFS 6. Even Willy Tarreau, maintainer of the 2.4 Linux kernel tree, said that thanks to SD he made Linux 2.6 the default kernel on his laptop, as he had found the scheduler in mainline 2.6 too bad compared to 2.4. It is still unclear, however, which of these schedulers will finally be integrated in Linux, and when this will happen.

Personally, I think SD 0.46 should be integrated now in the Linux 2.6.22 pre-releases. There has been a lot of testing and bug fixing, and it seems there are no serious bugs open anymore. I also hope that Mandriva 2008 will come out with one of these new schedulers. The tmb kernel in Mandriva Cooker already uses the SD scheduler. People interested in this discussion can subscribe to the ck mailing list, where a lot of the discussion is happening. Sites like LWN.net and Kerneltrap also often post about progress on this subject.

Struggling with Linux’ OOM killer when building RPMs

Over the last two weeks, I have created a lot of updated packages for Mandriva 2007.1. I packaged Gnome 2.18.1, and also updated subversion snapshots of kdepim and kdegraphics. Kdepim, because it has received a lot of bug fixing love over the last two months, and kdegraphics, because it contains kpdf using new xpdf code, which should be compatible with the PDF 1.6 and 1.7 specifications. In kdepim, they also removed the KitchenSync tool, which offered synchronization options with external devices. Apparently it was too buggy to be really useful. I have the impression that Kmail is indeed also more stable than in 3.5.6. I have not yet been able to reproduce the hangs I sometimes experienced with 3.5.6.

Compiling kdepim on an AMD64 system seems to require a huge amount of memory (much more than on 32-bit x86). 1 GB of RAM and about 250 MB of swap did not prevent g++ from eating up all of my memory (even when no other services were active!). Unfortunately, this made Linux completely unstable: the hard drive was thrashing the whole time, and the system was completely unresponsive. I could only stop it by doing a hard reset. I am clearly not the only one hating this stupid Linux behaviour.

In the Mandriva Cooker channel (irc.freenode.org, -cooker), couriousous suggested to execute

# echo 2 > /proc/sys/vm/overcommit_memory

The default value is 0: when an application asks for more memory than is available, Linux will still try to allocate it, even if there is a chance that it won’t be available. This seems to be done because some applications ask for more memory than they will really use. If, in the end, no more memory or swap is available, Linux’ OOM killer will kill one or more processes to free up memory (they are chosen by a heuristic, so the choice can look random). By setting overcommit_memory to 2, the allocation of too much memory will fail immediately. The application can then react itself to the fact that not enough memory is available. The result was that instead of bringing my system to its knees, the g++ compiler just exited with the message that I was out of memory. Much nicer! I found a complete technical explanation about memory allocation in Linux on the web. To make this setting the default, I put vm.overcommit_memory = 2 in my /etc/sysctl.conf. I do not understand why it is not the default value: making a system unresponsive for several (tens of) minutes does not seem very friendly…
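To see which mode a system is currently using, something like the following sketch can help. The function name is made up for illustration; the three values are the modes the kernel documents for vm.overcommit_memory.

```shell
# Hypothetical helper: describe a vm.overcommit_memory value.
describe_overcommit() {
    case "$1" in
        0) echo "heuristic overcommit (kernel default)" ;;
        1) echo "always overcommit" ;;
        2) echo "strict accounting, no overcommit" ;;
        *) echo "unknown mode: $1" ;;
    esac
}

# Show the current mode, if the proc file is readable on this system.
if [ -r /proc/sys/vm/overcommit_memory ]; then
    describe_overcommit "$(cat /proc/sys/vm/overcommit_memory)"
fi
```

After editing /etc/sysctl.conf, running sysctl -p as root applies the file without a reboot.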

Update 16 April 2007: It seems like this setting has severe problems as well. On my system with 384 MB RAM, which I use as server and desktop, several applications randomly crashed because they did not get the memory they wanted (Evolution, Tilda,…). Changing the setting back to 0 made these applications work correctly again. I suppose playing with the overcommit_ratio value, as explained in the article which I mentioned above, can improve this behaviour. But anyway, it sucks that such things are so difficult to get right. This should really be working nicely out of the box.
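A possible explanation for these crashes, as a sketch under assumptions: in mode 2 the kernel limits the total committed memory to the swap size plus overcommit_ratio percent of the RAM, and the default ratio is 50. On a machine with little RAM, that leaves only a small budget. The helper below is hypothetical and ignores huge pages and the overcommit_kbytes setting. The 512 MB of swap is a made-up figure, not the real size on my system.

```shell
# Hypothetical helper: approximate commit limit in kB for mode 2.
# Arguments: RAM in kB, swap in kB, overcommit_ratio in percent.
# Assumption: huge pages and vm.overcommit_kbytes are not in use.
commit_limit_kb() {
    echo $(( $2 + $1 * $3 / 100 ))
}

# 384 MB of RAM as in the update above; the swap size is invented.
commit_limit_kb $((384 * 1024)) $((512 * 1024)) 50   # kernel default ratio
commit_limit_kb $((384 * 1024)) $((512 * 1024)) 90   # a higher ratio
```

The limit the kernel really uses, and how much of it is already committed, can be read with grep '^Commit' /proc/meminfo.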

Anyway, in the end I got kdepim compiled on AMD64 by adding some more swap. From now on, I’ll be creating bigger swap partitions in Linux; it can really be useful, even when you think you have enough memory…

I also built a freetype 2.3.4 RPM for Mandriva. Font rendering on my flat panel is now much nicer compared with freetype 2.3.1, which is included in Mandriva 2007.1! Not that it was ugly before, but it’s a nice surprise to still see such big improvements, especially from a minor update.