Linux 5.0 Netfilter bug

On two desktop systems running Debian Buster with Linux kernel version 5.0.7, I was experiencing a problem when Shorewall6 was stopping or restarting. This kernel backtrace appeared in the logs:

 [   28.932323] WARNING: CPU: 1 PID: 169 at net/netfilter/nft_compat.c:82 nft_xt_put.part.9+0x21/0x30 [nft_compat]
[   28.932325] Modules linked in: ip6t_REJECT(E) nf_reject_ipv6(E) nft_chain_nat_ipv6(E) nf_nat_ipv6(E) nft_chain_route_ipv6(E) xt_multiport(E) nf_log_ipv6(E) xt_recent(E) xt_comment(E) xt_hashlimit(E) xt_addrtype(E) xt_mark(E) xt_CT(E) nfnetlink_log(E) xt_NFLOG(E) nf_log_ipv4(E) nf_log_common(E) xt_LOG(E) nf_nat_tftp(E) nf_nat_snmp_basic(E) nf_conntrack_snmp(E) nf_nat_sip(E) nf_nat_pptp(E) nf_nat_irc(E) nf_nat_h323(E) nf_nat_ftp(E) nf_nat_amanda(E) ts_kmp(E) nf_conntrack_amanda(E) nf_conntrack_sane(E) nf_conntrack_tftp(E) nf_conntrack_sip(E) nf_conntrack_pptp(E) nf_conntrack_proto_gre(E) nf_conntrack_netlink(E) nf_conntrack_netbios_ns(E) nf_conntrack_broadcast(E) nf_conntrack_irc(E) nf_conntrack_h323(E) nf_conntrack_ftp(E) nft_chain_route_ipv4(E) xt_CHECKSUM(E) nft_chain_nat_ipv4(E) ipt_M
 ASQUERADE(E) nf_nat_ipv4(E) nf_nat(E) xt_conntrack(E) nf_conntrack(E) nf_defrag_ipv6(E) nf_defrag_ipv4(E) ipt_REJECT(E) nf_reject_ipv4(E) nft_counter(E) xt_tcpudp(E) nft_compat(E) tun(E) bridge(E) stp(E)
[   28.932357]  llc(E) devlink(E) nf_tables(E) nfnetlink(E) msr(E) cmac(E) cpufreq_userspace(E) cpufreq_powersave(E) cpufreq_conservative(E) bnep(E) binfmt_misc(E) nls_ascii(E) nls_cp437(E) vfat(E) fat(E) ext4(E) mbcache(E) jbd2(E) fscrypto(E) intel_rapl(E) x86_pkg_temp_thermal(E) intel_powerclamp(E) kvm_intel(E) kvm(E) irqbypass(E) crct10dif_pclmul(E) crc32_pclmul(E) efi_pstore(E) ghash_clmulni_intel(E) btusb(E) mei_wdt(E) btrtl(E) btbcm(E) btintel(E) bluetooth(E) arc4(E) aesni_intel(E) snd_hda_codec_hdmi(E) drbg(E) iwldvm(E) aes_x86_64(E) ansi_cprng(E) crypto_simd(E) ecdh_generic(E) cryptd(E) glue_helper(E) crc16(E) snd_hda_codec_idt(E) mac80211(E) hp_wmi(E) snd_hda_codec_generic(E) sparse_keymap(E) joydev(E) ledtrig_audio(E) snd_hda_intel(E) iwlwifi(E) snd_hda_codec(E) intel_cstate(E) w
 mi_bmof(E) uvcvideo(E) intel_uncore(E) sg(E) serio_raw(E) intel_rapl_perf(E) snd_hda_core(E) videobuf2_vmalloc(E) tpm_infineon(E) videobuf2_memops(E) videobuf2_v4l2(E) videobuf2_common(E) snd_hwdep(E)
[   28.932408]  videodev(E) media(E) snd_pcm(E) efivars(E) snd_timer(E) iTCO_wdt(E) cfg80211(E) iTCO_vendor_support(E) rfkill(E) snd(E) tpm_tis(E) tpm_tis_core(E) soundcore(E) tpm(E) mei_me(E) mei(E) rng_core(E) evdev(E) hp_accel(E) lis3lv02d(E) input_polldev(E) pcc_cpufreq(E) hp_wireless(E) battery(E) ac(E) coretemp(E) loop(E) parport_pc(E) ppdev(E) lp(E) parport(E) bfq(E) efivarfs(E) ip_tables(E) x_tables(E) autofs4(E) btrfs(E) xor(E) zstd_decompress(E) zstd_compress(E) raid6_pq(E) libcrc32c(E) crc32c_generic(E) dm_mod(E) sr_mod(E) cdrom(E) sd_mod(E) hid_generic(E) usbhid(E) hid(E) sdhci_pci(E) cqhci(E) i915(E) ahci(E) i2c_algo_bit(E) libahci(E) sdhci(E) drm_kms_helper(E) crc32c_intel(E) mmc_core(E) xhci_pci(E) libata(E) ehci_pci(E) xhci_hcd(E) ehci_hcd(E) scsi_mod(E) psmouse(E) lpc_ich(
 E) firewire_ohci(E) firewire_core(E) crc_itu_t(E) e1000e(E) drm(E) usbcore(E) thermal(E) wmi(E) video(E) button(E)
[   28.932469] CPU: 1 PID: 169 Comm: kworker/1:2 Tainted: G            E     5.0.7 #1
[   28.932471] Hardware name: Hewlett-Packard HP EliteBook 8470p/179B, BIOS 68ICF Ver. F.31 09/24/2012
[   28.932481] Workqueue: events nf_tables_trans_destroy_work [nf_tables]
[   28.932486] RIP: 0010:nft_xt_put.part.9+0x21/0x30 [nft_compat]
[   28.932489] Code: ff ff ff f3 c3 0f 1f 40 00 0f 1f 44 00 00 48 8b 07 48 39 c7 75 14 48 83 ef 80 be 80 00 00 00 e8 f5 54 14 f6 b8 01 00 00 00 c3 <0f> 0b eb e8 90 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 53
[   28.932491] RSP: 0018:ffffb119411a3db8 EFLAGS: 00010206
[   28.932493] RAX: ffff9a33fe12b300 RBX: ffff9a33fe12b600 RCX: 0000000000000000
[   28.932495] RDX: 0000000000000000 RSI: ffff9a33fe12b678 RDI: ffff9a33fe12b600
[   28.932497] RBP: ffffffffc10e3400 R08: ffffffffc10e3180 R09: ffffffffc1288800
[   28.932498] R10: 0000000000000001 R11: 0000000000000001 R12: ffff9a34081d9e40
[   28.932500] R13: dead000000000200 R14: dead000000000100 R15: ffffffffc12a5088
[   28.932503] FS:  0000000000000000(0000) GS:ffff9a3436840000(0000) knlGS:0000000000000000
[   28.932505] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   28.932506] CR2: 0000557e2fdb5000 CR3: 00000001f6e5e002 CR4: 00000000001606e0
[   28.932508] Call Trace:
[   28.932516]  __nft_match_destroy.isra.10+0x69/0xa0 [nft_compat]
[   28.932526]  nf_tables_expr_destroy+0x1a/0x40 [nf_tables]
[   28.932533]  nf_tables_rule_destroy+0x4f/0x80 [nf_tables]
[   28.932541]  nf_tables_trans_destroy_work+0x1dd/0x200 [nf_tables]
[   28.932548]  process_one_work+0x191/0x380
[   28.932553]  worker_thread+0x204/0x3b0
[   28.932557]  ? rescuer_thread+0x340/0x340
[   28.932560]  kthread+0xf8/0x130
[   28.932563]  ? kthread_create_worker_on_cpu+0x70/0x70
[   28.932569]  ret_from_fork+0x35/0x40
[   28.932573] ---[ end trace fc35add4fa3b2bde ]---
[   29.015565] general protection fault: 0000 [#1] SMP PTI
[   29.015574] CPU: 3 PID: 2069 Comm: ip6tables-resto Tainted: G        W   E     5.0.7 #1
[   29.015577] Hardware name: Hewlett-Packard HP EliteBook 8470p/179B, BIOS 68ICF Ver. F.31 09/24/2012
[   29.015586] RIP: 0010:strcmp+0x4/0x20
[   29.015590] Code: 74 1a 49 39 d0 48 89 d0 75 e9 48 85 d2 74 05 c6 44 17 ff 00 48 c7 c0 f9 ff ff ff c3 f3 c3 f3 c3 66 0f 1f 44 00 00 48 83 c7 01 <0f> b6 47 ff 48 83 c6 01 3a 46 ff 75 07 84 c0 75 eb 31 c0 c3 19 c0
[   29.015593] RSP: 0018:ffffb119428e78e0 EFLAGS: 00010282
[   29.015597] RAX: 00000000ffffffff RBX: ffffb11941401264 RCX: 000000000000000b
[   29.015600] RDX: ffff9a33fe12b600 RSI: ffffb11941401264 RDI: 894810247c8d4849
[   29.015602] RBP: ffff9a340486c510 R08: 0000000000000003 R09: ffff9a33f6d58128
[   29.015605] R10: ffffb119428e7930 R11: 0000000000000002 R12: 0000000000000000
[   29.015607] R13: ffffffffc1294e70 R14: ffff9a340486c500 R15: 894810247c8d4838
[   29.015611] FS:  00007f26d10ba740(0000) GS:ffff9a34368c0000(0000) knlGS:0000000000000000
[   29.015614] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   29.015617] CR2: 00007f26d118a6d0 CR3: 00000001fd760003 CR4: 00000000001606e0
[   29.015619] Call Trace:
[   29.015631]  nft_match_select_ops+0x92/0x210 [nft_compat]
[   29.015646]  nf_tables_expr_parse+0x13e/0x1e0 [nf_tables]
[   29.015653]  ? kvmalloc_node+0x43/0x70
[   29.015663]  nf_tables_newrule+0x247/0x8b0 [nf_tables]
[   29.015671]  nfnetlink_rcv_batch+0x499/0x720 [nfnetlink]
[   29.015679]  ? skb_queue_tail+0x1b/0x50
[   29.015685]  ? _cond_resched+0x16/0x40
[   29.015691]  ? kmem_cache_alloc_node_trace+0x1c1/0x1f0
[   29.015695]  ? __insert_vmap_area+0x99/0x100
[   29.015702]  ? refcount_inc_checked+0x5/0x30
[   29.015707]  ? apparmor_capable+0x70/0xb0
[   29.015713]  ? __nla_parse+0x34/0x150
[   29.015719]  nfnetlink_rcv+0x113/0x136 [nfnetlink]
[   29.015725]  netlink_unicast+0x1b9/0x240
[   29.015731]  netlink_sendmsg+0x2d0/0x3c0
[   29.015735]  sock_sendmsg+0x36/0x40
[   29.015739]  ___sys_sendmsg+0x2e9/0x300
[   29.015744]  ? page_add_file_rmap+0x13/0x1f0
[   29.015750]  ? filemap_map_pages+0x183/0x380
[   29.015756]  ? __handle_mm_fault+0xb89/0x1200
[   29.015760]  ? refcount_inc_checked+0x5/0x30
[   29.015764]  ? apparmor_capable+0x70/0xb0
[   29.015768]  ? security_capable+0x35/0x50
[   29.015772]  ? release_sock+0x19/0x90
[   29.015776]  ? __sys_sendmsg+0x63/0xa0
[   29.015780]  __sys_sendmsg+0x63/0xa0
[   29.015787]  do_syscall_64+0x55/0xf0
[   29.015792]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[   29.015797] RIP: 0033:0x7f26d11bcc74
[   29.015800] Code: 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b5 0f 1f 80 00 00 00 00 48 8d 05 89 5a 0c 00 8b 00 85 c0 75 13 b8 2e 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 54 c3 0f 1f 00 41 54 41 89 d4 55 48 89 f5 53
[   29.015803] RSP: 002b:00007ffd02e15868 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
[   29.015807] RAX: ffffffffffffffda RBX: 00007ffd02e15880 RCX: 00007f26d11bcc74
[   29.015809] RDX: 0000000000000000 RSI: 00007ffd02e16900 RDI: 0000000000000003
[   29.015812] RBP: 00007ffd02e16f80 R08: 0000000000000004 R09: 0000000000000000
[   29.015814] R10: 00007ffd02e168ec R11: 0000000000000246 R12: 0000564c33d862a0
[   29.015816] R13: 00007ffd02e19850 R14: 00007ffd02e15870 R15: 00007ffd02e19888
[   29.015820] Modules linked in: ip6t_REJECT(E) nf_reject_ipv6(E) nft_chain_nat_ipv6(E) nf_nat_ipv6(E) nft_chain_route_ipv6(E) xt_multiport(E) nf_log_ipv6(E) xt_recent(E) xt_comment(E) xt_hashlimit(E) xt_addrtype(E) xt_mark(E) xt_CT(E) nfnetlink_log(E) xt_NFLOG(E) nf_log_ipv4(E) nf_log_common(E) xt_LOG(E) nf_nat_tftp(E) nf_nat_snmp_basic(E) nf_conntrack_snmp(E) nf_nat_sip(E) nf_nat_pptp(E) nf_nat_irc(E) nf_nat_h323(E) nf_nat_ftp(E) nf_nat_amanda(E) ts_kmp(E) nf_conntrack_amanda(E) nf_conntrack_sane(E) nf_conntrack_tftp(E) nf_conntrack_sip(E) nf_conntrack_pptp(E) nf_conntrack_proto_gre(E) nf_conntrack_netlink(E) nf_conntrack_netbios_ns(E) nf_conntrack_broadcast(E) nf_conntrack_irc(E) nf_conntrack_h323(E) nf_conntrack_ftp(E) nft_chain_route_ipv4(E) xt_CHECKSUM(E) nft_chain_nat_ipv4(E) ipt_M
 ASQUERADE(E) nf_nat_ipv4(E) nf_nat(E) xt_conntrack(E) nf_conntrack(E) nf_defrag_ipv6(E) nf_defrag_ipv4(E) ipt_REJECT(E) nf_reject_ipv4(E) nft_counter(E) xt_tcpudp(E) nft_compat(E) tun(E) bridge(E) stp(E)
[   29.015861]  llc(E) devlink(E) nf_tables(E) nfnetlink(E) msr(E) cmac(E) cpufreq_userspace(E) cpufreq_powersave(E) cpufreq_conservative(E) bnep(E) binfmt_misc(E) nls_ascii(E) nls_cp437(E) vfat(E) fat(E) ext4(E) mbcache(E) jbd2(E) fscrypto(E) intel_rapl(E) x86_pkg_temp_thermal(E) intel_powerclamp(E) kvm_intel(E) kvm(E) irqbypass(E) crct10dif_pclmul(E) crc32_pclmul(E) efi_pstore(E) ghash_clmulni_intel(E) btusb(E) mei_wdt(E) btrtl(E) btbcm(E) btintel(E) bluetooth(E) arc4(E) aesni_intel(E) snd_hda_codec_hdmi(E) drbg(E) iwldvm(E) aes_x86_64(E) ansi_cprng(E) crypto_simd(E) ecdh_generic(E) cryptd(E) glue_helper(E) crc16(E) snd_hda_codec_idt(E) mac80211(E) hp_wmi(E) snd_hda_codec_generic(E) sparse_keymap(E) joydev(E) ledtrig_audio(E) snd_hda_intel(E) iwlwifi(E) snd_hda_codec(E) intel_cstate(E) w
 mi_bmof(E) uvcvideo(E) intel_uncore(E) sg(E) serio_raw(E) intel_rapl_perf(E) snd_hda_core(E) videobuf2_vmalloc(E) tpm_infineon(E) videobuf2_memops(E) videobuf2_v4l2(E) videobuf2_common(E) snd_hwdep(E)
[   29.015913]  videodev(E) media(E) snd_pcm(E) efivars(E) snd_timer(E) iTCO_wdt(E) cfg80211(E) iTCO_vendor_support(E) rfkill(E) snd(E) tpm_tis(E) tpm_tis_core(E) soundcore(E) tpm(E) mei_me(E) mei(E) rng_core(E) evdev(E) hp_accel(E) lis3lv02d(E) input_polldev(E) pcc_cpufreq(E) hp_wireless(E) battery(E) ac(E) coretemp(E) loop(E) parport_pc(E) ppdev(E) lp(E) parport(E) bfq(E) efivarfs(E) ip_tables(E) x_tables(E) autofs4(E) btrfs(E) xor(E) zstd_decompress(E) zstd_compress(E) raid6_pq(E) libcrc32c(E) crc32c_generic(E) dm_mod(E) sr_mod(E) cdrom(E) sd_mod(E) hid_generic(E) usbhid(E) hid(E) sdhci_pci(E) cqhci(E) i915(E) ahci(E) i2c_algo_bit(E) libahci(E) sdhci(E) drm_kms_helper(E) crc32c_intel(E) mmc_core(E) xhci_pci(E) libata(E) ehci_pci(E) xhci_hcd(E) ehci_hcd(E) scsi_mod(E) psmouse(E) lpc_ich(
 E) firewire_ohci(E) firewire_core(E) crc_itu_t(E) e1000e(E) drm(E) usbcore(E) thermal(E) wmi(E) video(E) button(E)
[   29.015977] ---[ end trace fc35add4fa3b2bdf ]---
[   29.613482] RIP: 0010:strcmp+0x4/0x20
[   29.613486] Code: 74 1a 49 39 d0 48 89 d0 75 e9 48 85 d2 74 05 c6 44 17 ff 00 48 c7 c0 f9 ff ff ff c3 f3 c3 f3 c3 66 0f 1f 44 00 00 48 83 c7 01 <0f> b6 47 ff 48 83 c6 01 3a 46 ff 75 07 84 c0 75 eb 31 c0 c3 19 c0
[   29.613488] RSP: 0018:ffffb119428e78e0 EFLAGS: 00010282
[   29.613490] RAX: 00000000ffffffff RBX: ffffb11941401264 RCX: 000000000000000b
[   29.613492] RDX: ffff9a33fe12b600 RSI: ffffb11941401264 RDI: 894810247c8d4849
[   29.613493] RBP: ffff9a340486c510 R08: 0000000000000003 R09: ffff9a33f6d58128
[   29.613494] R10: ffffb119428e7930 R11: 0000000000000002 R12: 0000000000000000
[   29.613495] R13: ffffffffc1294e70 R14: ffff9a340486c500 R15: 894810247c8d4838
[   29.613497] FS:  00007f26d10ba740(0000) GS:ffff9a34368c0000(0000) knlGS:0000000000000000
[   29.613499] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   29.613500] CR2: 00007f26d118a6d0 CR3: 00000001fd760003 CR4: 00000000001606e0

On one of the two systems, this also prevented a clean shutdown: the kernel hung completely while shutting down.

The problem is known, and can be fixed by this patch, which has been queued in the stable 5.0 tree. It will hopefully be included in the 5.0.8 version.
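To see whether a machine already runs a kernel at or beyond the release that is expected to carry the fix, the running version can be compared with it. This is only a rough sketch: it assumes the fix really lands in 5.0.8 as hoped, it relies on GNU sort -V, and version_ge is a helper name I made up.

```shell
# Return success if version $1 is greater than or equal to version $2.
# Relies on the version sort (-V) of GNU sort.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Assumption: the fix is included in 5.0.8.
fixed_in="5.0.8"

# Strip a distribution suffix such as "-amd64" from the running version.
running="$(uname -r | cut -d- -f1)"

if version_ge "$running" "$fixed_in"; then
    echo "kernel $running: at or beyond $fixed_in"
else
    echo "kernel $running: older than $fixed_in"
fi
```

Note that kernels older than 5.0 are also reported as older than 5.0.8, although I only saw the problem on 5.0.7.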

Which DNS server to use?

Update 4 August 2020: replaced the CHAOS class by CH in the dig commands so they work with kdig too; Quad9 now supports QNAME minimisation and has alternative servers available with ECS support but without QNAME minimisation; Google now also does QNAME minimisation.

DNS is a crucial part of the Internet. However, DNS traffic is usually not encrypted and can leak lots of interesting information. Originally DNS also did not provide data integrity, making it vulnerable to DNS spoofing.

These days, improvements are being made to fix these problems. Data integrity is provided by DNSSEC, and the privacy part is being tackled by the DNS Privacy project, which proposes solutions like DNS-over-TLS (all data between resolver and client is encrypted) and QNAME minimisation (not sending the FQDN but only the relevant part to each DNS server when doing recursive resolving). More information about the DNS Privacy project can be found in this Fosdem 2018 talk.

There are basically 3 options for DNS on your client systems:

  1. You forward all requests to your ISP’s DNS servers (which is what is usually done by default).
  2. You forward all requests to a public global DNS service, like Cloudflare’s 1.1.1.1, Quad9 or Google DNS.
  3. You set up your own DNS recursor which connects itself to authoritative DNS servers.

ISP’s default DNS servers

Quite often the problem with your ISP’s DNS servers is that they don’t support DNSSEC and QNAME minimisation. There is an online test to check whether your DNS server does DNSSEC validation. To test whether QNAME minimisation is enabled for your current resolver, use this command:

$ dig +nodnssec +short TXT qnamemintest.internet.nl

(replace dig by kdig if you are using Knot’s DNS utils)
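Both checks can also be scripted. The sketch below defines two small helpers that read dig (or kdig) output from standard input. The helper names are mine, and the first one assumes that the test record answers with a text starting with "HOORAY" when QNAME minimisation is enabled and with "NO - " when it is not. The second one looks for the ad (authenticated data) flag in the response header, which a validating resolver sets for signed domains.

```shell
# Classify the TXT answer of qnamemintest.internet.nl read from stdin.
# Assumption: a positive answer contains "HOORAY", a negative one "NO - ".
qnamemin_status() {
    answer="$(cat)"
    case "$answer" in
        *HOORAY*)  echo "enabled" ;;
        *"NO - "*) echo "disabled" ;;
        *)         echo "unknown" ;;
    esac
}

# Succeed if the flags line of the dig/kdig output on stdin has the ad flag.
has_ad_flag() {
    grep -Eiq '^;; flags:[^;]* ad[ ;]'
}
```

They could be used like this (I assume here that internet.nl is a DNSSEC-signed domain):

$ dig +nodnssec +short TXT qnamemintest.internet.nl | qnamemin_status
$ dig +dnssec internet.nl | has_ad_flag && echo "resolver validates DNSSEC"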

Some (mostly American) ISPs serve redirect pages when you enter a non-existent domain name, and they often block hosts with content which is illegal in your country (child pornography, sites helping with copyright infringement, illegal gambling sites,…). In less democratic countries local DNS servers are abused for censorship.

These might all be reasons not to use your ISP’s DNS servers.

But at least in Europe, the GDPR should restrict ISPs from selling DNS data. And when everyone uses their own ISP’s DNS servers, there is no single point of failure, no single point of data collection, and no single point of censorship. So there are advantages too.

Public global DNS services

The popular global public DNS services all support DNSSEC by default and you can connect to them using encrypted DNS-over-TLS. Some also do QNAME minimisation.

These public global DNS providers are often praised for their speed. You can find results of benchmarks of public DNS resolvers on dnsperf.com. You can also use namebench to benchmark different DNS servers. For example:

$ namebench 1.1.1.1 8.8.8.8 9.9.9.9 -x -O

You want your DNS resolver to be as close to you as possible, especially if your DNS server does not support EDNS Client Subnet (ECS). This is a method which allows a DNS recursor to send the subnet of the client to the authoritative DNS server. This is used by content delivery networks to provide you with the IP of the nearest server serving the requested content. Many privacy-oriented DNS services do not support ECS, so the only information the authoritative DNS server has is the location of your recursor. If that recursor is far away from you, the client will be sent to a faraway server of the content delivery network, leading to much slower access to the content. For this reason, you should rather not use a DNS server in a foreign country, but use one which is as close as possible to your location. You can check how many hops a server is from you using the traceroute or mtr command, for example:

$ mtr --report-wide 1.1.1.1

More information about this issue can be found in the blog post “Using Cloudflare’s 1.1.1.1 might lead to slower CDN performance” by Sajal Kayan.

For privacy reasons you would also prefer to have one in your own country, so that it’s not subject to the legislation of a foreign country. Often, countries have more relaxed legislation regarding spying on foreign connections.

That brings us to the last, but not least consideration: the privacy policy of the DNS service you are using. Are DNS requests being logged, for how long, and are they shared with a third party? On the PowerDNS blog there is a more elaborate article on the risks of using global DNS providers.

Your own DNS recursor

Then there is the third alternative to using a global public DNS service or your ISP’s DNS servers: running your own local recursive DNS resolver, for example with Knot Resolver. If your DNS server is well configured, it will provide you with DNSSEC validation and QNAME minimisation. However, this has a serious privacy disadvantage too, because it will reveal your own IP to all authoritative servers you connect to. Furthermore, connections to authoritative DNS servers currently are always unencrypted, so your ISP and anyone between you and the authoritative server can see your DNS queries.

Overview of public DNS servers

In case you have decided for whatever reason you do not want to use your ISP’s DNS servers and also don’t want to do recursion yourself, there are many public DNS recursors. On the dnsprivacy.org website you can also find a list of public DNS resolvers and experimental servers with support for DNS-over-TLS. I will review a few of the most important ones here.

Cloudflare

Cloudflare’s DNS service running on the 1.1.1.1 IP address appears to be the fastest in most cases. Unlike some other services, they do have a local server here in Brussels, which likely contributes to the great performance here. You can check which server you would be using by running this command:

$ dig +short CH TXT id.server @1.1.1.1

You can look up the three-letter code on https://www.cloudflarestatus.com/ (these are IATA codes of nearby airports). Cloudflare does not support EDNS Client Subnet, so make sure there is a server nearby when using 1.1.1.1. Cloudflare claims privacy to be one of their main advantages, but in their privacy policy they admit that they share certain anonymous data with APNIC (the organization managing the IP addresses in Asia and the Pacific) for research. Cloudflare does support DNSSEC and QNAME minimisation.

Quad9

Quad9, running on the 9.9.9.9 IP address, is a DNS service set up by a nonprofit organization supported by the Global Cyber Alliance, IBM and PCH together in collaboration with other security partners. Their main feature is that they block malicious hosts (like phishing sites), improving security for your devices. Like Cloudflare, Quad9 shares some anonymized data with their threat-intelligence partners for security analysis.

In 2020, Quad9 had servers in 150 locations in 90 countries. Unfortunately there is no server in Belgium. Probably because of this, resolving domains whose DNS servers are located in Belgium was slower in the few tests I did. Because it also does not support ECS, it might not direct you to the nearest content location. You can check which Quad9 DNS server is in use for your location with the same command as for Cloudflare:

$ dig +short CH TXT id.server @9.9.9.9

You can see all locations where Quad9 does have a DNS cluster on the Quad9 website.

Quad9 does DNSSEC validation and now also supports QNAME minimisation.

However, Quad9 also has alternative DNS servers (9.9.9.11) available which support ECS. You can use these to get nearby hosts if Quad9 does not have a server in your country. Note, however, that these alternative servers do not support QNAME minimisation. So by using ECS and disabling QNAME minimisation, more information is leaked to DNS servers.

Google

Google Public DNS, running on 8.8.8.8, appears to be the most used public DNS service globally. According to SIDN (the registry maintaining the .nl domain), 15% of the requests come from Google’s public DNS servers. That’s probably because it has been around longer than many others (it started in December 2009). Also, systemd-resolved uses Google’s DNS as a fallback if there are no working default DNS servers set up. This is configured in /etc/systemd/resolved.conf.
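If you would rather not have systemd-resolved fall back to Google, the fallback list can be overridden with a drop-in file. The following is a minimal sketch: the function name and the file name no-fallback.conf are made up for illustration, and it assumes that an empty FallbackDNS= assignment disables the built-in fallback servers in your systemd version.

```shell
# Write a drop-in that sets an empty fallback server list for systemd-resolved.
# Assumption: an empty FallbackDNS= disables the compiled-in fallback servers.
# $1 is the drop-in directory; the default is the system-wide location.
write_no_fallback() {
    dir="${1:-/etc/systemd/resolved.conf.d}"
    mkdir -p "$dir" &&
        printf '[Resolve]\nFallbackDNS=\n' > "$dir/no-fallback.conf"
}
```

Run write_no_fallback as root without arguments and restart systemd-resolved afterwards to apply it.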

Google supports ECS. The list of locations where it has servers can be found in the FAQ.

Google keeps request logs a bit longer than some other providers, and keeps some of them permanently. These days more and more people also distrust Google with their private data. Google DNS does DNSSEC validation, and now also QNAME minimisation.

OpenDNS

OpenDNS was launched back in 2006 and was acquired by Cisco in 2015. They used to redirect non-existent domains to a custom search page with advertisements, but stopped doing so in 2014. OpenDNS has optional filtering of adult domains and other unwanted content.

OpenDNS seems to have fewer servers worldwide than the other services, but they do support ECS.

Conclusion

Which of all these options to choose, is a personal decision. Personally I think that running your own recursor on your own computer is a bad idea. All authoritative name servers will see your personal IP, and your unencrypted queries can be easily monitored by your ISP. I think this should only be considered if you are setting up a DNS server for a fairly large number of clients.

My own ISP does not support DNSSEC and QNAME minimisation. I think these two are crucial features to protect the user’s privacy and for this reason I prefer to use one of the public DNS services. I have set up Knot Resolver to forward DNS requests to Cloudflare’s DNS service over TLS. Not only does it support QNAME minimisation in addition to DNSSEC and DNS-over-TLS, it is also fast and has a local server in Belgium. I combine this with the abuse.ch urlhaus RPZ file to add some protection from malicious domains. More details about this can be found in my previous blog post Secure and private DNS with Knot Resolver. I also use this setup on the network I manage at work.

Secure and private DNS with Knot Resolver

Update 5 March 2018: this post was updated to work around a problem with the RPZ file from abuse.ch being ignored because it contains CRLF instead of LF where Knot Resolver does not expect them (bug 453) and to fix an error in the configuration of the predict module.

Knot Resolver is a modern, feature-rich recursive DNS server. It is used by Cloudflare for its 1.1.1.1 public DNS service.

To install it on Debian, run:

# apt-get install knot-resolver knot-dnsutils lua-cqueues

The knot-dnsutils package contains the kdig command, which is useful for testing your DNS server. lua-cqueues is needed for automatic detection of changes in the RPZ file.

By default the kresd daemon will listen on localhost only (see /lib/systemd/system/kresd.socket). If you want it to be available on other addresses, you will need to override the kresd.socket file. Execute

# systemctl edit kresd.socket

This will create the file /etc/systemd/system/kresd.socket.d/override.conf. Add a ListenStream and ListenDatagram line for all addresses you want it to listen on. For example:

[Socket]
ListenStream=127.0.0.1:53
ListenDatagram=127.0.0.1:53
 
ListenStream=[::1]:53
ListenDatagram=[::1]:53
 
ListenStream=192.168.0.1:53
ListenDatagram=192.168.0.1:53

If you want to listen on all interfaces, it is enough to put this in the file:

ListenStream=53
ListenDatagram=53

You can do the same with kresd-tls.socket to define the addresses on which to listen for DNS-over-TLS requests (port 853).
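As an illustration, this small helper prints the contents of such an override for kresd-tls.socket. It is only a sketch: kresd_tls_override is a name I made up, and it assumes that DNS-over-TLS only needs ListenStream lines because it runs over TCP. IPv6 addresses have to be passed in square brackets.

```shell
# Print a [Socket] override with one ListenStream line on port 853
# for every address given as an argument.
kresd_tls_override() {
    echo "[Socket]"
    for addr in "$@"; do
        echo "ListenStream=${addr}:853"
    done
}

kresd_tls_override 127.0.0.1 '[::1]' 192.168.0.1
```

The output can be pasted into the editor opened by systemctl edit kresd-tls.socket.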

Knot Resolver’s configuration file is /etc/knot-resolver/kresd.conf. I give an example configuration file with comments:

-- Default empty Knot DNS Resolver configuration in -*- lua -*-
-- Switch to unprivileged user --
user('knot-resolver','knot-resolver')

-- Set the size of the cache to 1 GB
cache.size = 1*GB

-- Uncomment this only if you need to debug problems.
-- verbose(true)

-- Enable optional modules
modules = {
  'policy',
  'view',
  'hints',
  'serve_stale < cache',
  'workarounds < iterate',
  'stats',
  'predict'
}

-- Accept all requests from these subnets
view:addr('127.0.0.0/8', function (req, qry) return policy.PASS end)
view:addr('::1/128', function (req, qry) return policy.PASS end)
view:addr('134.184.26.1/24', function (req, qry) return policy.PASS end)

-- Drop everything that hasn't matched
view:addr('0.0.0.0/0', function (req, qry) return policy.DROP end)

-- Use the urlhaus.abuse.ch RPZ list
policy.add(policy.rpz(policy.DENY, '/etc/knot-resolver/abuse.ch.rpz',true))


-- Forward all requests for example.com to 192.168.0.2 and 192.168.0.3
policy.add(policy.suffix(policy.FORWARD({'192.168.0.2', '192.168.0.3'}), {todname('example.com')}))

-- Uncomment one of the following stanzas in case you want to forward all requests to 1.1.1.1 or 9.9.9.9 via DNS-over-TLS.

-- policy.add(policy.all(policy.TLS_FORWARD({
--          { '1.1.1.1', hostname='cloudflare-dns.com', ca_file='/etc/ssl/certs/ca-certificates.crt' },
--          { '2606:4700:4700::1111', hostname='cloudflare-dns.com', ca_file='/etc/ssl/certs/ca-certificates.crt' },
-- 
-- })))

-- policy.add(policy.all(policy.TLS_FORWARD({
--           { '9.9.9.9', hostname='dns.quad9.net', ca_file='/etc/ssl/certs/ca-certificates.crt' },
--           { '2620:fe::fe', hostname='dns.quad9.net', ca_file='/etc/ssl/certs/ca-certificates.crt' },
-- })))

-- Prefetch learning (20-minute blocks over 24 hours)
predict.config({ window = 20, period = 72 })
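The relation between the two predict settings is simple arithmetic: window is the length of one sampling block in minutes and period is the number of such blocks that are tracked. A quick check of the values used above (predict_hours is just a throwaway helper name):

```shell
# Hours of history covered by the predict module:
# $1 = window in minutes, $2 = period in number of windows.
predict_hours() {
    echo $(( $1 * $2 / 60 ))
}

predict_hours 20 72   # 72 blocks of 20 minutes, prints 24
```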

I use the urlhaus.abuse.ch RPZ file, which contains a blacklist of malicious domains. You will have to download it first:

# cd /etc/knot-resolver
# curl https://urlhaus.abuse.ch/downloads/rpz/ | sed -e 's/\r$//' -e '/raw.githubusercontent.com/d'> /etc/knot-resolver/abuse.ch.rpz

I use sed to convert CRLF in LF (otherwise Knot Resolver fails to parse the file), and I filter out raw.githubusercontent.com. According to urlhaus.abuse.ch it hosts some malware, but there is too much useful stuff there too to block the domain completely.
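To see what the sed filter does without downloading anything, it can be run on a few hand-written lines. The sample records below are invented for the demonstration, clean_rpz is just a name I chose for the filter, and the \r escape assumes GNU sed, which is what Debian ships.

```shell
# Normalise RPZ data read from stdin:
#  - strip the carriage return from CRLF line endings
#  - drop every line that mentions raw.githubusercontent.com
clean_rpz() {
    sed -e 's/\r$//' -e '/raw\.githubusercontent\.com/d'
}

# Invented sample input with CRLF line endings.
printf 'bad.example.org CNAME .\r\nraw.githubusercontent.com CNAME .\r\nevil.example.net CNAME .\r\n' |
    clean_rpz
```

Only the two example lines remain in the output, now with plain LF line endings.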

In order to update it automatically, create /etc/systemd/system/update-urlhaus-abuse-ch.service:

[Unit]
Description=Update RPZ file from urlhaus.abuse.ch for Knot Resolver

[Service]
Type=oneshot
ExecStart=/bin/sh -c "curl https://urlhaus.abuse.ch/downloads/rpz/ | sed -e 's/\\r$$//' -e '/raw.githubusercontent.com/d' > /etc/knot-resolver/abuse.ch.rpz"

and then create a timer which will run the service approximately every 10-15 minutes. Create /etc/systemd/system/update-urlhaus-abuse-ch.timer:

[Unit]
Description=Update RPZ file from urlhaus.abuse.ch for Knot Resolver

[Timer]
OnCalendar=*:0/10
Persistent=true
RandomizedDelaySec=300

[Install]
WantedBy=timers.target

Use the first two commands to enable and start the timer. You can check the status using the last command:

# systemctl enable update-urlhaus-abuse-ch.timer
# systemctl start update-urlhaus-abuse-ch.timer
# systemctl list-timers

Now you need to enable and start one or more instances of kresd. kresd is single-threaded, so if you want to make use of all of your CPU cores, you can start as many instances as the number of cores you have. For example, in order to enable and start 4 instances, run this command:

# systemctl enable --now kresd@{1..4}.service
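If you do not want to hard-code the number of instances, the unit names can be generated from the number of cores reported by nproc. A small sketch, with kresd_units being a helper name of my own:

```shell
# Print one kresd instance unit name per CPU core.
# $1 overrides the core count (defaults to the output of nproc).
kresd_units() {
    cores="${1:-$(nproc)}"
    i=1
    while [ "$i" -le "$cores" ]; do
        printf 'kresd@%s.service\n' "$i"
        i=$((i + 1))
    done
}
```

The result can then be passed to systemctl:

# systemctl enable --now $(kresd_units)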

More information

Importing a VMWare virtual machine in qemu/kvm/libvirtd

So you have a VMWare virtual machine and you want to migrate it to a Qemu/KVM setup managed by libvirt? This is very easy, using libguestfs.

You will need libguestfs 1.37.10 or higher, which unfortunately is not available for Debian Stretch. The libguestfs-tools package in Debian Buster is fine though.

The command you need is this:

$ virt-v2v -i vmx /mnt/storage/vmware/vm/vm.vmx -o libvirt -of qcow2 -os storage-pool -n network

Replace storage-pool with the name of the libvirt storage pool where you want to store the new VM, and network by the network name. In this example the disk images will be converted to qemu’s qcow2 format.

To get a list of all available storage pools, use this:

$ virsh pool-list

This command will show all available networks:

$ virsh net-list

Running different PHP applications as different users

Often you run different web applications on the same web server. For security reasons, it is strongly recommended to run them in separate PHP-FPM processes under different user accounts. This way permissions can be set so that the user account of one PHP application cannot access the files of another PHP application. Also, open_basedir can be set so that accessing any files outside the base directory becomes impossible.

To create a separate PHP-FPM process for a PHP application on Debian Stretch with PHP 7.0, create a file /etc/php/7.0/fpm/pool.d/webapp.conf with these contents:

[webapp]
user = webapp_php
group = webapp_php
listen = /run/php/php7.0-webapp-fpm.sock
listen.owner = www-data
listen.group = www-data
pm = dynamic
pm.max_children = 12
pm.start_servers = 1
pm.min_spare_servers = 1
pm.max_spare_servers = 2
pm.max_requests = 5000
rlimit_core = unlimited
php_admin_value[open_basedir] = /home/webapp/public_html

Replace webapp by a unique name for your web application. You can actually copy the default www.conf file and adapt it to your needs.

Create the webapp_php user and group, with /bin/false as shell and login disabled to secure it against login attacks:

# adduser --system --group --disabled-login --shell /bin/false --no-create-home --home /home/webapp webapp_php

In the above example the webapp is located in /home/webapp, but you can of course also use a directory somewhere in /var/www.

I strongly recommend against making all your PHP files in /home/webapp owned by webapp_php. This is a dangerous situation, because PHP can overwrite the code itself. This makes it possible for malware to overwrite your PHP files with malicious code. Only make the directories where PHP really needs to be able to write into (for example a directory where files uploaded in your web applications are stored), writable for the webapp_php user. Your code itself should be owned by a different user than webapp_php. It can be a dedicated user account, or just root.
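A quick way to audit this is to list everything in the web root that is owned by the PHP user. Ideally that list only contains the directories PHP really has to write to. A minimal sketch (owned_by is a helper name I made up):

```shell
# List everything under directory $2 that is owned by user $1
# (a user name or a numeric user ID).
owned_by() {
    find "$2" -user "$1" -print
}
```

For the example application above you would run:

# owned_by webapp_php /home/webapp/public_html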

Finally we need to configure Apache to contact the right php-fpm instance for the web application. Create a file /etc/apache2/conf-available/php7.0-webapp-fpm.conf:

<Directory /home/webapp/public_html>

# Redirect to local php-fpm if mod_php is not available
    <IfModule proxy_fcgi_module>
        # Enable http authorization headers
        <IfModule setenvif_module>
        SetEnvIfNoCase ^Authorization$ "(.+)" HTTP_AUTHORIZATION=$1
        </IfModule>

        <FilesMatch ".+\.ph(p[3457]?|t|tml)$">
            SetHandler "proxy:unix:/run/php/php7.0-webapp-fpm.sock|fcgi://localhost-webapp"
        </FilesMatch>
        <FilesMatch ".+\.phps$">
            # Deny access to raw php sources by default
            # To re-enable it's recommended to enable access to the files
            # only in specific virtual host or directory
            Require all denied
        </FilesMatch>
        # Deny access to files without filename (e.g. '.php')
        <FilesMatch "^\.ph(p[3457]?|t|tml|ps)$">
            Require all denied
        </FilesMatch>
    </IfModule>
</Directory>

This file is based on the default php7.0-fpm.conf. You will need to create a symlink to make sure this gets activated:

# cd /etc/apache2/conf-enabled
# ln -s ../conf-available/php7.0-webapp-fpm.conf .

Now restart your Apache and PHP-FPM services and you should be ready. The output of the phpinfo() function shows which user your code in /home/webapp/public_html is running as.


Linux security hardening recommendations

In a previous blog post, I wrote how to secure OpenSSH against brute force attacks. However, what if someone manages to get a shell on your system, despite all your efforts? Or what if you want to protect your system from your own users doing nasty things? It is important to harden your system further, according to the principle of defense in depth.

Software updates

Make sure you are running a supported distribution, preferably its most recent version. For example, Debian Jessie is still supported; however, upgrading to Debian Stretch is strongly recommended, because it offers various security improvements (a more recent kernel with new security hardening, PHP 7 with new security-related features, etc.).

Install amd64-microcode (for AMD CPUs) or intel-microcode (for Intel CPUs), which is needed to protect against hardware vulnerabilities such as Spectre, Meltdown and L1TF. I recommend installing it from stretch-backports in order to have the latest firmware.

Automatic updates and needrestart

I recommend installing unattended-upgrades. You can configure it to just download updates or to download and install them automatically. By default, unattended-upgrades will only install updates from the official security repositories. This way it is relatively safe to let it do this automatically. If you have already installed it, you can run this command to reconfigure it:

# dpkg-reconfigure unattended-upgrades

When you update system libraries, you should also restart all daemons which are using these libraries to make them use the newly installed version. This is exactly what needrestart does. After you have run apt-get, it will check whether there are any daemons running with older libraries, and will offer to restart them. If you use it with unattended-upgrades, you should set this option in /etc/needrestart/needrestart.conf to make sure that all services which require a restart are indeed restarted:

$nrconf{restart} = 'a';

Up-to-date kernel

Running an up-to-date kernel is very important, because the kernel itself can be vulnerable too. In the worst case, an outdated kernel can be exploited to gain root permissions. Do not forget to reboot after updating the kernel.

Every new kernel version also contains various extra security hardening measures. Kernel developer Kees Cook has an overview of security related changes in the kernel.

In case you build your own kernel, you can use kconfig-hardened-check to get recommendations for a hardened kernel configuration.

Firewall: filtering outgoing traffic

Installing a firewall which filters incoming traffic is an obvious step. However, have you considered also filtering outgoing traffic? This is a bit more difficult to set up because you need to whitelist all outgoing hosts to which connections are needed (think of your distribution’s repositories, NTP servers, DNS servers,…), but it is a very effective measure which will help limit the damage in case a user account gets compromised, despite all your other protective efforts.

Ensuring strong passwords

Prevent your users from setting bad passwords by installing libpam-pwquality, together with some word lists for your language and a few common languages. These will be used for verifying that users are not using a common word as their password. libpam-pwquality will be enabled automatically after installation with some default settings.

# apt-get install libpam-pwquality wbritish wamerican wfrench wngerman wdutch

Please note that by default, libpam-pwquality will only enforce strong passwords when a non-root user changes its password. If root is setting a password, it will give a warning if a weak password is set, but will still allow it. If you want to enforce it for root too (which I recommend), then add enforce_for_root in the pam_pwquality line in /etc/pam.d/common-password:

password	requisite			pam_pwquality.so retry=3 enforce_for_root

Automatically log out inactive users

In order to log out inactive users, set a timeout of 600 seconds on the Bash shell. Create /etc/profile.d/tmout.sh:

export TMOUT=600
readonly TMOUT

Prevent creating cron jobs

Make sure users cannot set up cron jobs. In case an attacker gets a shell on your system, cron will often be used to ensure the malware continues running after a reboot. In order to prevent normal users from setting up cron jobs, create an empty /etc/cron.allow.

Protect against fork bombs and excessive logins and CPU usage

Create a new file in /etc/security/limits.d to impose some limits on user sessions. I strongly recommend setting a value for nproc, in order to prevent fork bombs. maxlogins is the maximum number of logins per user, and cpu is used to set a limit on the CPU time a user can use (in minutes):

*	hard	nproc		1024
*	hard	maxlogins 	4
1000:	hard	cpu		180

Hiding processes from other users

By mounting the /proc filesystem with the hidepid=2 option, users cannot see the PIDs of processes by other users in /proc, and hence these processes also become invisible when using tools like top and ps. Put this in /etc/fstab to mount /proc by default with this option:

none	/proc	proc	defaults,hidepid=2	0	0

Restricting /proc/kallsyms

It is possible to restrict access to /proc/kallsyms at boot time by setting 400 permissions. Put this in /etc/rc.local:

chmod 400 /proc/kallsyms

/proc/kallsyms contains information about how the kernel’s memory is laid out. With this information it becomes easier to attack the kernel itself, so hiding this information is always a good idea. It should be noted though that attackers can get this information from other sources too, such as from the System.map files in /boot.

Harden kernel configuration with sysctl

Several kernel settings can be set at run time using sysctl. To make these settings permanent, put them in files with the .conf extension in /etc/sysctl.d.

It is possible to hide the kernel messages (which can be read with the dmesg command) from other users than root by setting the sysctl kernel.dmesg_restrict to 1. On Debian Stretch and later this should already be the default value:

kernel.dmesg_restrict = 1

From Linux kernel version 4.19 on it’s possible to disallow opening FIFOs or regular files not owned by the user in world writable sticky directories. This setting would have prevented vulnerabilities found in different user space programs the last couple of years. This protection is activated automatically if you use systemd version 241 or higher with Linux 4.19 or higher. If your kernel supports this feature but you are not using systemd 241, you can activate it yourself by setting the right sysctl settings:

fs.protected_regular = 1
fs.protected_fifos = 1

Also check whether the following sysctls have the right value in order to enable protection for hard links and symlinks. These work with Linux 3.6 and higher, and will likely already be enabled by default on your system:

fs.protected_hardlinks = 1
fs.protected_symlinks = 1

Also by default on Debian Stretch only root users can access perf events:

kernel.perf_event_paranoid = 3

Show kernel pointers in /proc as null for non-root users:

kernel.kptr_restrict = 1

Disable the kexec system call, which allows you to boot a different kernel without going through the BIOS and boot loader:

kernel.kexec_load_disabled = 1

Allow ptrace access (used by gdb and strace) for non-root users only to child processes. For example strace ls will still work, but strace -p 8659 will not work as non-root user:

kernel.yama.ptrace_scope = 1

The Linux kernel includes eBPF, the extended Berkeley Packet Filter, which is a VM in which unprivileged users can load and run certain code in the kernel. If you are sure no users need to call bpf(), it can be disabled for non-root users:

kernel.unprivileged_bpf_disabled = 1

In case the BPF Just-In-Time compiler is enabled (it is disabled by default, see sysctl net/core/bpf_jit_enable), it is possible to enable some extra hardening against certain vulnerabilities:

net.core.bpf_jit_harden = 2

Take a look at the Kernel Self Protection Project Recommended settings page to find an up to date list of recommended settings.
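To see quickly which of the settings from this section are already active, the values can be read back from /proc/sys. The script below is only a sketch: check_sysctl and the PROC_SYS variable are names I made up for this example, and the expected values are simply the ones listed above. Some entries (for example kernel.perf_event_paranoid = 3 or the BPF settings) may not exist or may not be readable on every kernel or for every user, in which case they are reported as skipped.

```shell
#!/bin/sh
# Compare current sysctl values with the values recommended in this section.
# PROC_SYS is a hypothetical override for testing; it defaults to /proc/sys.
PROC_SYS="${PROC_SYS:-/proc/sys}"

# check_sysctl KEY EXPECTED: print OK, DIFF or SKIP for one setting.
check_sysctl() {
    key="$1"
    expected="$2"
    # kernel.yama.ptrace_scope -> $PROC_SYS/kernel/yama/ptrace_scope
    path="$PROC_SYS/$(printf '%s' "$key" | tr . /)"
    if [ ! -r "$path" ]; then
        echo "SKIP $key (not available or not readable)"
        return 0
    fi
    actual="$(cat "$path")"
    if [ "$actual" = "$expected" ]; then
        echo "OK   $key = $actual"
    else
        echo "DIFF $key = $actual (recommended: $expected)"
    fi
}

check_sysctl kernel.dmesg_restrict 1
check_sysctl fs.protected_regular 1
check_sysctl fs.protected_fifos 1
check_sysctl fs.protected_hardlinks 1
check_sysctl fs.protected_symlinks 1
check_sysctl kernel.perf_event_paranoid 3
check_sysctl kernel.kptr_restrict 1
check_sysctl kernel.kexec_load_disabled 1
check_sysctl kernel.yama.ptrace_scope 1
check_sysctl kernel.unprivileged_bpf_disabled 1
check_sysctl net.core.bpf_jit_harden 2
```

Run it as root to avoid skipped entries that are only readable by root.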

Lynis

Finally I want to mention Lynis, a security auditing tool. It will check the configuration of your system, and make recommendations for further security hardening.

Further ideas

Securing OpenSSH

Security hardening the OpenSSH server is one of the first things that should be done on any newly installed system. Brute force attacks on the SSH daemon are very common and unfortunately I see it going wrong all too often. That’s why I think it’s useful to give a recapitulation here with some best practices, even though this should be basic knowledge for any system administrator.

Firewall

The first thing to think about: should the SSH server be accessible from the whole world, or can we limit it to certain IP addresses or subnets? This is the simplest and most effective form of protection: if your SSH daemon is only accessible from specific IP addresses, then there is no longer any risk from attacks coming from elsewhere.

I prefer to use Shorewall as a firewall, as it’s easy to configure. Be sure to also configure shorewall6 if you have an IPv6 address.

However as defense in depth is an essential security practice, you should not stop here even if you protected your SSH daemon with a firewall. Maybe your firewall one day fails to come up at boot automatically, leaving your system unprotected. Or maybe one day the danger comes from within your own network. That’s why in any case you need to carefully review the next recommendations too.

SSHd configuration

Essential security settings

The SSH server configuration can be found in the file /etc/ssh/sshd_config. We review some essential settings:

  • PermitRootLogin: I strongly recommend setting this to No. This way, you always log in to your system with a normal user account. If you need root access, use su or sudo. The root account is then protected from brute force attacks. And you can always easily find out who used the root account.
  • PasswordAuthentication: This setting really should be No. You will first need to add your SSH public key to your ~/.ssh/authorized_keys . Disabling password authentication is the most effective protection against brute force attacks.
  • X11Forwarding: set this to No, except if your users need to be able to run X11 (graphical) applications remotely.
  • AllowTcpForwarding: I strongly recommend setting this to No. If this is allowed, any user who can ssh into your system can establish connections from the client to any other system using your host as a proxy. This is the case even if your users can only use SFTP to transfer files. I have seen this being abused in the past to connect to the local MTA and send spam via the host this way.
  • PermitOpen: this allows you to set the hosts to which TCP forwarding is allowed. Use this if you enable AllowTcpForwarding, to limit the hosts to which TCP forwarding is possible.
  • ClientAliveInterval, ClientAliveCountMax: These values will determine when a connection will be interrupted when it’s unresponsive (for example in case of network problems). I set ClientAliveInterval to 600 and ClientAliveCountMax to 0. Note that this does not drop the connection when the user is simply inactive. If you want to set a timeout for that, you can set the TMOUT environment variable in Bash.
  • MaxAuthTries: the maximum number of authentication attempts permitted per connection. Set this to 3.
  • AllowUsers: only the users in this space separated list are allowed to log in. I strongly recommend using this (or AllowGroups) to whitelist users that can log in by SSH. It protects against possible disasters when a test user or a system user with a weak password is created.
  • AllowGroups: only the users from the groups in this space separated list are allowed to log in.
  • DenyUsers: users in this space separated list are not allowed to log in.
  • DenyGroups: users from the groups in this space separated list are not allowed to log in.
  • These values should already be fine by default, but I recommend verifying them: PermitEmptyPasswords (no), UsePrivilegeSeparation (sandbox), UsePAM (yes), PermitUserEnvironment (no), StrictModes (yes), IgnoreRhosts (yes)

So definitely disable PasswordAuthentication and TCP and X11 forwarding by default and use the AllowUsers or AllowGroups to whitelist who is allowed to log in by SSH.

Match conditional blocks

With Match conditional blocks you can modify some of the default settings for certain users, groups or IP addresses. I give a few examples to illustrate the usage of Match blocks.

To allow TCP forwarding to a specific host for one user:

Match User username
        AllowTcpForwarding yes
        PermitOpen 192.168.0.120:8080

To allow PasswordAuthentication for a trusted IP address (make sure the user has a strong password, even if you trust the host!):

Match Address 192.168.0.20
        PasswordAuthentication yes

The Address can also be a subnet in CIDR notation, such as 192.168.0.0/24.

To only allow SFTP access for a group of users, disabling TCP, X11 and streamlocal forwarding:

Match group sftponly
        ForceCommand internal-sftp
        AllowTcpForwarding no
        X11Forwarding no
        AllowStreamLocalForwarding no

chroot

You can chroot users to a certain directory, so that they cannot see and access what’s on the file system outside that directory. This is a great way to protect your system from users who only need SFTP access to a certain location. For this to work, the user’s home directory needs to be owned by root:root. This means they cannot write directly in their home directory. You can create a subdirectory within the user’s home directory, with the appropriate ownership and permissions, into which the user can write. Then you can use a Match block to apply this configuration to certain users or groups:

Match Group chrootsftp
        ChrootDirectory %h
        ForceCommand internal-sftp
        AllowTcpForwarding no
        X11Forwarding no
        AllowStreamLocalForwarding no

If you use authentication with keys, you will have to set a custom location for the authorized_keys file:

AuthorizedKeysFile /etc/ssh/authorized_keys/%u .ssh/authorized_keys

Then the keys for every user have to be installed in a file /etc/ssh/authorized_keys/username.
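The ownership requirement for the chroot directory is easy to get wrong: as far as I know from the sshd_config manual page, every component of the ChrootDirectory path has to be a root-owned directory that is not writable by group or others. The following sketch checks this for a given directory. check_chroot_dir is a name made up for this example, and it assumes GNU stat and readlink as found on Debian.

```shell
#!/bin/sh
# Check that a directory and all its parent directories are owned by root
# (uid 0) and not writable by group or others.
# check_chroot_dir is a hypothetical helper, not part of OpenSSH.
check_chroot_dir() {
    dir="$(readlink -f "$1")" || return 1
    while :; do
        owner="$(stat -c '%u' "$dir")" || return 1
        mode="$(stat -c '%a' "$dir")" || return 1
        # 022 is the octal mask for group-writable and world-writable
        if [ "$owner" != 0 ] || [ $(( 0$mode & 022 )) -ne 0 ]; then
            echo "not suitable: $dir (uid $owner, mode $mode)"
            return 1
        fi
        [ "$dir" = / ] && break
        dir="$(dirname "$dir")"
    done
    echo "ok"
}
```

For example, check_chroot_dir /home/username prints ok when the directory can be used with ChrootDirectory %h, and otherwise names the first path component that does not meet the requirement.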

Fail2ban

Fail2ban is a utility which monitors your log files for failed logins, and will block IPs if too many failed login attempts are made within a specified time. It can not only watch for failed login attempts on the SSH daemon, but also watch other services, like mail (IMAP, SMTP, etc.) services, Apache and others. It is a useful protection against brute force attacks. However, versions of Fail2ban before 0.10.0 only support IPv4, and so don’t offer any protection against attacks from IPv6 addresses. Furthermore, attackers often slow down their brute force attacks so that they don’t trigger the Fail2ban threshold. And then there are distributed attacks: by using many different source IPs, Fail2ban will never be triggered. For these reasons, you should not rely on Fail2ban alone to protect against brute force attacks.

If you want to use Fail2ban on Debian Stretch, I strongly recommend using the one from stretch-backports, because this version has IPv6 support.

# apt-get install -t stretch-backports fail2ban python3-pyinotify python3-systemd

I install python3-systemd in order to read the log messages from Systemd’s Journal, while python3-pyinotify is needed to efficiently watch log files.

First we will increase the value for dbpurgeage which is set to 1 day in /etc/fail2ban/fail2ban.conf. We can do this by creating the file /etc/fail2ban/fail2ban.d/local.conf:

[Definition]
dbpurgeage = 10d

This lets us ban an IP for a much longer time than 1 day.

Then the services to protect, the thresholds and the action to take when these are exceeded are defined in /etc/fail2ban/jail.conf. By default all jails, except the sshd jail, are disabled and you have to enable the ones you want to use. This can be done by creating a file /etc/fail2ban/jail.d/local.conf:

[DEFAULT]
banaction = iptables-multiport
banaction_allports = iptables-allports
destemail = email@example.com
sender = root@example.com

[sshd]
mode = aggressive
enabled = true
backend = systemd

[sshd-slow]
filter   = sshd[mode=aggressive]
maxretry = 10
findtime = 3h
bantime  = 8h
backend = systemd
enabled = true

[recidive]
enabled=true
maxretry = 3
action = %(action_mwl)s

First we override some default settings valid for all jails. We configure it to use iptables to block banned users. If you use Shorewall as firewall, then set banaction and banaction_allports to shorewall in order to use the blacklist functionality of Shorewall. In that case, read the instructions in /etc/fail2ban/action.d/shorewall.conf to configure Shorewall to also block established blacklisted connections. Other commonly used values for banaction and banaction_allports are ufw and firewallcmd-ipset, if you use UFW or Firewalld respectively. We also define the sender address and destination address where emails should be sent when a host is banned.

Then we set up 3 jails. The sshd and recidive jails are already defined in /etc/fail2ban/jail.conf and we enable them here. The sshd jail will give a 10 minute ban to IPs which make 5 unsuccessful login attempts on the SSH server in a time span of 10 minutes. The recidive jail gives a one week ban to IPs getting banned 3 times by another Fail2ban jail in a time span of 1 day. Furthermore I define another jail, sshd-slow, which gives an 8 hour ban to IPs making 10 failed attempts on the SSH server in a time span of 3 hours. This catches many attackers who try to evade the default Fail2ban settings by slowing down their brute force attack. In both the sshd and sshd-slow jails I use the aggressive mode which catches more errors, such as probes without trying to log in, and attempts with wrong (outdated) ciphers. See /etc/fail2ban/filter.d/sshd.conf for the complete list of log messages it will search for. The recidive jail will send a mail to the defined address in case a host gets banned. I enable this only for recidive in order not to receive too many e-mails.

Two-factor authentication

It is possible to enable two-factor authentication (2FA) using the libpam-google-authenticator package. Then you can use an application like FreeOTP+ (Android), AndOTP (Android), Authenticator (iOS), KeepassXC (Linux) to generate the time based token you need to log in.

First install the required PAM module on your SSH server:

# apt-get install libpam-google-authenticator

Then edit the file /etc/ssh/sshd_config:

ChallengeResponseAuthentication yes
AuthenticationMethods publickey keyboard-interactive:pam

You can also put this in a Match block to only enable this for certain users or groups.

This will allow you to log in either with key-based authentication, or with your password and your time-based token.

Now you need to set up a new secret for the user account that will use OTP authentication, using the google-authenticator command. Run it as that user. Choose time-based authentication tokens, disallow multiple uses of the same authentication token, do not increase the time window to 4 minutes, and enable rate limiting.

$ google-authenticator
Do you want authentication tokens to be time-based (y/n) y
                                                          
Do you want me to update your "/home/username/.google_authenticator" file (y/n) y

Do you want to disallow multiple uses of the same authentication
token? This restricts you to one login about every 30s, but it increases
your chances to notice or even prevent man-in-the-middle attacks (y/n) y

By default, tokens are good for 30 seconds. In order to compensate for
possible time-skew between the client and the server, we allow an extra
token before and after the current time. If you experience problems with
poor time synchronization, you can increase the window from its default
size of +-1min (window size of 3) to about +-4min (window size of
17 acceptable tokens).
Do you want to do so? (y/n) n

If the computer that you are logging into isn't hardened against brute-force
login attempts, you can enable rate-limiting for the authentication module.
By default, this limits attackers to no more than 3 login attempts every 30s.
Do you want to enable rate-limiting (y/n) y

Now enter the code given by this command in your OTP client or scan the QR code.

Edit the file /etc/pam.d/sshd and add this line:

auth required pam_google_authenticator.so noskewadj

You need to make sure to add this line before the line

@include common-auth

Otherwise an attacker can still brute force the password, and then abuse it on other services. That is because of the auth requisite pam_deny.so line in common-auth: this will immediately return a failure message when the password is wrong. The time-based token would only be asked when the password is correct.

The noskewadj option increases security by disabling the option to automatically detect and adjust time skew between client and server.

Now restart the sshd service, and in another shell, try the OTP authentication. Don’t close your existing SSH connection yet, because otherwise you might lock yourself out if something went wrong.

The biggest disadvantage of pam_google_authenticator is that it allows every individual user to set values for the window size, rate limiting, whether to use HOTP or TOTP, etc. By modifying some of these, the user can reduce the security of the one-time password. For this reason, I recommend only enabling this for users you trust.

Enabling jumbo frames on your network

Jumbo frames are Ethernet frames with up to 9000 bytes of payload, in contrast to normal frames which have up to 1500 bytes of payload. They are useful on fast (Gigabit Ethernet and faster) networks, because they reduce the overhead. Not only will this result in a higher throughput, it will also reduce CPU usage.

To use jumbo frames, your whole network needs to support them. That means that your switch needs to support jumbo frames (it might need to be enabled by hand), and also all connected hosts need to support jumbo frames. Jumbo frames should also only be used on reliable networks, as the higher payload will make it more costly to resend a frame if packets get lost.

So first you need to make sure your switch has jumbo frame support enabled. I’m using a HP Procurve switch and for this type of switch you can find the instructions here:

$ netcat 10.141.253.1 telnet
Username:
Password:
ProCurve Switch 2824# config
ProCurve Switch 2824(config)# show vlans
ProCurve Switch 2824(config)# vlan $VLAN_ID jumbo
ProCurve Switch 2824(config)# show vlans
ProCurve Switch 2824(config)# write memory

Now that your switch is configured properly, you can configure the hosts.

For hosts which have a static IP configured in /etc/network/interfaces you need to add the line

    mtu 9000

to the iface stanza of the interface on which you want to enable jumbo frames. This does not work for interfaces getting an IP via DHCP, because they will use the MTU value sent by the DHCP server.

To enable jumbo frames via DHCP, edit the /etc/dhcp/dhcpd.conf file on the DHCP server, and add this to the subnet stanza:

option interface-mtu 9000;

Now bring the network interface offline and online, and jumbo frames should be enabled. You can verify with the command

# ip addr show

which will show the mtu values for all network interfaces.
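The MTU shown by ip only tells you what the local interface is configured for. To test whether jumbo frames really pass through the switch to another host, you can send a ping that is not allowed to be fragmented. Its payload has to be the MTU minus 20 bytes for the IPv4 header (without options) and 8 bytes for the ICMP header. The helper names below are made up for this sketch.

```shell
#!/bin/sh
# Largest ICMP echo payload that fits in one unfragmented IPv4 packet:
# MTU minus 20 bytes IPv4 header (no options) minus 8 bytes ICMP header.
# ping_payload_size is a hypothetical helper name.
ping_payload_size() {
    mtu="$1"
    echo $((mtu - 20 - 8))
}

# MTU currently configured on a local interface, read from sysfs.
iface_mtu() {
    cat "/sys/class/net/$1/mtu"
}
```

With these, ping -M do -c 3 -s "$(ping_payload_size 9000)" followed by the address of another jumbo-frame host should get replies. The -M do option of the iputils ping forbids fragmentation, so if some device on the path does not accept frames of this size, the ping fails instead of silently being fragmented.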

FS-CACHE for NFS clients

FS-CACHE is a system which caches files from remote network mounts on the local disk. It is very easy to set up and improves performance on NFS clients.

I strongly recommend a recent kernel if you want to use FS-CACHE though. I tried this with the 4.9 based Debian Stretch kernel a year ago, and this resulted in a kernel oops from time to time, so I had to disable it again. I’m currently using it again with a 4.19 based kernel, and I did not encounter any stability issues up to now.

First of all, you will need a dedicated file system on which to store the cache. I prefer to use XFS, because it has nice performance and stability. Mount the file system on /var/cache/fscache.

Then install the cachefilesd package and edit the file /etc/default/cachefilesd so that it contains:

RUN=yes

Then edit the file /etc/cachefilesd.conf. It should look like this:

dir /var/cache/fscache
tag mycache
brun 10%
bcull 7%
bstop 3%
frun 10%
fcull 7%
fstop 3%

These numbers define when cache culling (making space in the cache by discarding less recently used files) happens: when the amount of available disk space or the number of available files drops below 7%, culling will start. Culling will stop when 10% is available again. If the available disk space or the number of available files drops below 3%, no further cache allocation is done until more than 3% is available again. See also man cachefilesd.conf.
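As an illustration of these thresholds, the sketch below maps a percentage of available space to the behaviour described above. It is a simplification under my own assumptions: cache_state and avail_percent are made-up names, the percentage from df is rounded and may differ slightly from what cachefilesd calculates itself, and only disk space (not the number of files) is considered.

```shell
#!/bin/sh
# Thresholds as configured in /etc/cachefilesd.conf (percent available).
BRUN=10
BCULL=7
BSTOP=3

# cache_state PERCENT_AVAILABLE: describe what cachefilesd is expected to do.
cache_state() {
    avail="$1"
    if [ "$avail" -lt "$BSTOP" ]; then
        echo "allocation stopped"
    elif [ "$avail" -lt "$BCULL" ]; then
        echo "culling"
    elif [ "$avail" -lt "$BRUN" ]; then
        echo "between bcull and brun"
    else
        echo "no culling"
    fi
}

# Approximate percentage of available space on the file system holding a
# directory, derived from the "Capacity" column of df -P.
avail_percent() {
    df -P "$1" | awk 'NR == 2 { sub("%", "", $5); print 100 - $5 }'
}
```

For example, cache_state "$(avail_percent /var/cache/fscache)" reports the current state of the cache file system. In the range between bcull and brun, culling only continues if it had already started.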

Start cachefilesd by running

# systemctl start cachefilesd

If it fails to start with these messages in the logs:

cachefilesd[1724]: About to bind cache
kernel: CacheFiles: Security denies permission to nominate security context: error -2
cachefilesd[1724]: CacheFiles bind failed: errno 2 (No such file or directory)
cachefilesd[1724]: Starting FilesCache daemon : cachefilesd failed!
systemd[1]: cachefilesd.service: Control process exited, code=exited status=1
systemd[1]: cachefilesd.service: Failed with result 'exit-code'.
systemd[1]: Failed to start LSB: CacheFiles daemon.

then you are hitting this bug. This happens when you are using a kernel with AppArmor enabled (like Debian’s kernel from testing). You can work around it by commenting out the line defining the security context in /etc/cachefilesd.conf:

#secctx system_u:system_r:cachefiles_kernel_t:s0

and starting cachefilesd again.

Now in /etc/fstab add the fsc option to all NFS file systems for which you want to enable caching. For example for NFS4 your fstab line might look like this:

nfs.server:/home	/home	nfs4	_netdev,fsc,noatime,vers=4.2,nodev,nosuid	0	0

Now remount the file system. Just using the remount option is probably not enough: you have to completely umount and mount the NFS file system. Check with the mount command whether the fsc option is present. You can also run

# cat /proc/fs/nfsfs/volumes 

and check whether the FSC column is set to YES.

Try copying some large files from your NFS mount to the local disk. Then you can check the statistics by running

# cat /proc/fs/fscache/stats

You should see that not all values are 0 any more. You will also see with

$ df -h /var/cache/fscache

your file system is being filled with cached data.

Debian Stretch on AMD EPYC (ZEN) with an NVIDIA GPU for HPC

Recently at work we bought a new Dell PowerEdge R7425 server for our HPC cluster. These are some of the specifications:

  • 2 AMD EPYC 7351 16-Core Processors
  • 128 GB RAM (16 DIMMs of 8 GB)
  • Tesla V100 GPU

[Photos: Dell PowerEdge R7425 front with cover, front without cover, and inside]

Our FAI configuration automatically installed Debian stretch on it without any problem. All hardware was recognized and working. The installation of the basic operating system took less than 20 minutes. FAI also sets up Puppet on the machine. After booting the system, Puppet continues setting up the system: installing all needed software, setting up the Slurm daemon (part of the job scheduler), mounting the NFS4 shared directories, etc. Everything together, the system was automatically installed and configured in less than 1 hour.

Linux kernel upgrade

Even though the Linux 4.9 kernel of Debian Stretch works with the hardware, there are still some reasons to update to a newer kernel. Only in more recent kernel versions is the k10temp kernel module capable of reading out the CPU temperature sensors. We also had problems with fscache (used for NFS4 caching) with the 4.9 kernel in the past, which are fixed in newer kernels. Furthermore there have been many other performance optimizations which could be interesting for HPC.

You can find a more recent kernel in Debian’s Backports repository. At the time of writing it is a 4.18 based kernel. However, I decided to build my own 4.19 based kernel.

In order to build a Debian kernel package, you will need to have the package kernel-package installed. Download the sources of the Linux kernel you want to build, and configure it (using make menuconfig or any method you prefer). Then build your kernel using this command:

$ make -j 32 deb-pkg

Replace 32 by the number of parallel jobs you want to run; the number of CPU cores you have is a good amount. You can also add the LOCALVERSION and KDEB_PKGVERSION variables to set a custom name and version number. See the Debian handbook for a more complete howto. When the build is finished successfully, you can install the linux-image and linux-headers package using dpkg.

We mentioned temperature monitoring support with the k10temp driver in newer kernels. If you want to check the temperatures of all NUMA nodes on your CPUs, use this command:

$ cat /sys/bus/pci/drivers/k10temp/*/hwmon/hwmon*/temp1_input

Divide the value by 1000 to get the temperature in degrees Celsius. Of course you can also use the sensors command of the lm-sensors package.
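The division can also be done directly in the shell. This is a small sketch using integer arithmetic only; millideg_to_celsius is a name made up for this example, and the result is truncated to one decimal.

```shell
#!/bin/sh
# Convert a hwmon reading in millidegrees Celsius to degrees Celsius with
# one decimal (truncated, integer arithmetic only).
# millideg_to_celsius is a hypothetical helper name.
millideg_to_celsius() {
    v="$1"
    echo "$((v / 1000)).$((v % 1000 / 100))"
}

# Print one line per k10temp sensor. Nothing is printed if the k10temp
# driver is not loaded or exposes no temp1_input files.
for f in /sys/bus/pci/drivers/k10temp/*/hwmon/hwmon*/temp1_input; do
    [ -r "$f" ] || continue
    echo "$f: $(millideg_to_celsius "$(cat "$f")") C"
done
```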

Kernel settings

VM dirty values

On systems with lots of RAM, you will encounter problems because the default values of vm.dirty_ratio and vm.dirty_background_ratio are too high. This can cause stalls when all dirty data in the cache is being flushed to disk or to the NFS server.

You can read more information and a solution in SuSE’s knowledge base. On Debian, you can create a file /etc/sysctl.d/dirty.conf with this content:

vm.dirty_bytes = 629145600
vm.dirty_background_bytes = 314572800

Run

# sysctl -p /etc/sysctl.d/dirty.conf

to make the settings take effect immediately.
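The two values in dirty.conf are byte counts and correspond to 600 MiB and 300 MiB. If you want to try other sizes, a small helper like the following (mib_to_bytes is a made-up name) saves you the arithmetic:

```shell
#!/bin/sh
# Convert a size in MiB to bytes, the unit used by vm.dirty_bytes and
# vm.dirty_background_bytes. mib_to_bytes is a hypothetical helper name.
mib_to_bytes() {
    echo $(($1 * 1024 * 1024))
}
```

mib_to_bytes 600 gives the value used for vm.dirty_bytes above, and mib_to_bytes 300 the one for vm.dirty_background_bytes.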

Kernel parameters

In /etc/default/grub, I added the options

transparent_hugepage=always cgroup_enable=memory

to the GRUB_CMDLINE_LINUX variable.

Transparent hugepages can improve performance in some cases. However, they can also have a very negative impact on some specific workloads. Applications like Redis, MongoDB and Oracle DB recommend not enabling transparent hugepages by default. So make sure that it’s worthwhile for your workload before adding this option.

Memory cgroups are used by Slurm to prevent jobs using more memory than what they reserved. Run

# update-grub

to make sure the changes will take effect at the next boot.
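After the reboot you can verify that both options were picked up. The kernel marks the active transparent hugepage mode with square brackets in /sys/kernel/mm/transparent_hugepage/enabled, and the boot parameters can be read from /proc/cmdline. The following is a sketch only; active_choice is a name made up for this example.

```shell
#!/bin/sh
# Print the bracketed (active) value from a sysfs file such as
# /sys/kernel/mm/transparent_hugepage/enabled, whose content looks like
# "[always] madvise never". active_choice is a hypothetical helper name.
active_choice() {
    sed -n 's/.*\[\([^]]*\)\].*/\1/p' "$1"
}

thp=/sys/kernel/mm/transparent_hugepage/enabled
if [ -r "$thp" ]; then
    echo "transparent hugepages: $(active_choice "$thp")"
fi

# Show the cgroup_enable option if it is on the kernel command line.
grep -o 'cgroup_enable=[^ ]*' /proc/cmdline 2>/dev/null \
    || echo "cgroup_enable is not set on the kernel command line"
```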

I/O scheduler

If you’re using a configuration based on a recent Debian kernel configuration, you will likely be using the Multi-Queue Block IO Queueing Mechanism with the mq-deadline scheduler as default. This is great for SSDs (especially NVMe based ones), but might not be ideal for rotational hard drives. You can use the BFQ scheduler as an alternative on such drives. Be sure to test this properly though, because with Linux 4.19 I experienced some stability problems which seemed to be related to BFQ. I’ll be reviewing this scheduler again for 4.20 or 4.21.

First, if BFQ is built as a module (which is the case in Debian’s kernel config), you will need to load the module. Create a file /etc/modules-load.d/bfq.conf with contents

bfq

Then to use this scheduler by default for all rotational drives, create the file /etc/udev/rules.d/60-io-scheduler.rules with these contents:

# set scheduler for non-rotating disks
ACTION=="add|change", KERNEL=="sd[a-z]|mmcblk[0-9]*|nvme[0-9]*", ATTR{queue/rotational}=="0", ATTR{queue/scheduler}="mq-deadline"
# set scheduler for rotating disks
ACTION=="add|change", KERNEL=="sd[a-z]", ATTR{queue/rotational}=="1", ATTR{queue/scheduler}="bfq"

Run

# update-initramfs -u

to rebuild the initramfs so that it includes this udev rule, which will then be applied at the next boot.
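To see which scheduler each drive ended up with, you can read queue/scheduler in sysfs, where the active scheduler is shown between square brackets. A minimal sketch follows. The helper name is an invention for this example.

```shell
#!/bin/sh
# Print the active scheduler from a line like "mq-deadline kyber [bfq] none".
active_scheduler() {
  echo "$1" | sed -n 's/.*\[\(.*\)\].*/\1/p'
}

# Report the active scheduler of every block device that exposes one.
for f in /sys/block/*/queue/scheduler; do
  [ -r "$f" ] || continue
  dev=${f#/sys/block/}
  dev=${dev%%/*}
  echo "$dev: $(active_scheduler "$(cat "$f")")"
done
```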

The Arch wiki has more information on I/O schedulers.

CPU Microcode update

We want the latest microcode for the CPU to be loaded. This is needed to mitigate the Spectre vulnerabilities. Install the amd64-microcode package from stretch-backports.
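To see which microcode revision is active after a reboot, you can look at the microcode field in /proc/cpuinfo. This is just a sketch. The function name and its optional file argument are made up for this example, and the field is only present on x86 systems.

```shell
#!/bin/sh
# Print the microcode revision of the first CPU listed in a cpuinfo-style file.
# $1 is optional and defaults to /proc/cpuinfo.
microcode_revision() {
  awk -F': *' '/^microcode/ { print $2; exit }' "${1:-/proc/cpuinfo}"
}

[ -r /proc/cpuinfo ] && microcode_revision
```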

Packages with AMD ZEN support

hwloc

hwloc is a utility which reads out the NUMA topology of your system. It is used by the Slurm workload manager to bind tasks to certain cores.

The version of hwloc in Stretch (1.11.5) does not have support for the AMD ZEN architecture. However, hwloc 1.11.12 is available in stretch-backports, and this version does have AMD ZEN support. So make sure you have the packages hwloc libhwloc5 libhwloc-dev libhwloc-plugins installed from stretch-backports.
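A quick way to check that you did not end up with the Stretch version is to compare version numbers. The sketch below relies on the -V option of sort. The function name is made up, and using 1.11.12 as the threshold is my assumption based on the two versions mentioned above.

```shell
#!/bin/sh
# Return 0 if version $1 is greater than or equal to version $2.
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

if version_ge 1.11.5 1.11.12; then
  echo "1.11.5 is new enough"
else
  echo "1.11.5 is too old"
fi
```

On a real system you could pass in the upstream part of the installed package version, for example the output of dpkg-query -W -f='${Version}' hwloc | cut -d- -f1.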

BLAS libraries

There is no BLAS library in Debian Stretch which supports the AMD ZEN architecture. Unfortunately, at the time of writing there is also no good BLAS implementation for ZEN available in stretch-backports. This will likely change in the near future though, as BLIS has now entered Debian Unstable and will likely be backported to the stretch-backports repository too.

NVIDIA drivers and CUDA libraries

NVIDIA drivers

I installed the NVIDIA drivers and CUDA libraries from the files downloaded from the NVIDIA website, because at the time of writing all packages available in the Debian repositories are outdated.

First make sure you have the linux-headers package installed which corresponds with the linux-image kernel package you are running. We will be using DKMS to rebuild the driver automatically whenever we install a new kernel, so also make sure you have the dkms package installed.

Download the NVIDIA driver for your GPU from the NVIDIA website. Unload the nouveau driver with the command

# rmmod nouveau

Then create a file /etc/modprobe.d/blacklist-nouveau.conf with these contents:

blacklist nouveau

and rebuild the initramfs by running

# update-initramfs -u

This will ensure the nouveau module will not be loaded automatically.
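After the next reboot you can check that nouveau really stayed out by looking at /proc/modules, which lists one loaded module per line. A minimal sketch follows. The function name and its optional second argument are hypothetical.

```shell
#!/bin/sh
# Return 0 if module $1 appears in the module list.
# $2 is optional and defaults to /proc/modules.
module_loaded() {
  awk -v m="$1" '$1 == m { found = 1 } END { exit !found }' "${2:-/proc/modules}"
}

if [ -r /proc/modules ]; then
  if module_loaded nouveau; then
    echo "nouveau is still loaded"
  else
    echo "nouveau is not loaded"
  fi
fi
```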

Now install the driver by using a command similar to this:

# sh NVIDIA-Linux-x86_64-410.79.run -s --dkms

This will do a silent installation, integrating the driver with DKMS so it will get built automatically every time you install a new linux-image together with its corresponding linux-headers package.

To make sure that the necessary device files in /dev exist after rebooting the system, I created this script, /usr/local/sbin/load-nvidia:

#!/bin/sh
/sbin/modprobe nvidia
if [ "$?" -eq 0 ]; then
  # Count the number of NVIDIA controllers found.
  NVDEVS=`lspci | grep -i NVIDIA`
  N3D=`echo "$NVDEVS" | grep "3D controller" | wc -l`
  NVGA=`echo "$NVDEVS" | grep "VGA compatible controller" | wc -l`
  N=`expr $N3D + $NVGA - 1`
  for i in `seq 0 $N`; do
    mknod -m 666 /dev/nvidia$i c 195 $i
  done
  mknod -m 666 /dev/nvidiactl c 195 255
else
  exit 1
fi
/sbin/modprobe nvidia-uvm
if [ "$?" -eq 0 ]; then
  # Find out the major device number used by the nvidia-uvm driver
  D=`grep nvidia-uvm /proc/devices | awk '{print $1}'`
  mknod -m 666 /dev/nvidia-uvm c $D 0
else
  exit 1
fi
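The counting logic of the script can be tried out without touching /dev. The following is only a dry-run sketch: the function names are made up, it prints the mknod commands instead of running them, and unlike the script above it skips nodes that already exist.

```shell
#!/bin/sh
# Count NVIDIA GPUs in lspci-style text passed as $1.
count_nvidia_gpus() {
  printf '%s\n' "$1" | grep -i nvidia |
    grep -c -e '3D controller' -e 'VGA compatible controller'
}

# Print the mknod commands needed for $1 GPUs, skipping existing nodes.
plan_nodes() {
  i=0
  while [ "$i" -lt "$1" ]; do
    [ -e "/dev/nvidia$i" ] || echo "mknod -m 666 /dev/nvidia$i c 195 $i"
    i=$((i + 1))
  done
  [ -e /dev/nvidiactl ] || echo "mknod -m 666 /dev/nvidiactl c 195 255"
}
```

With lspci installed, plan_nodes "$(count_nvidia_gpus "$(lspci)")" shows what would be created on the current machine.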

In order to start this script at boot up, I created a systemd service. Create the file /etc/systemd/system/load-nvidia.service with this content:

[Unit]
Description=Load NVIDIA driver and create nodes in /dev
Before=slurmd.service

[Service]
ExecStart=/usr/local/sbin/load-nvidia
Type=oneshot
RemainAfterExit=true

[Install]
WantedBy=multi-user.target

Now run these commands to enable the service:

# systemctl daemon-reload
# systemctl enable load-nvidia.service

You can verify whether everything is working by running the command

$ nvidia-smi

CUDA libraries

Download the CUDA libraries. For Debian, choose Linux x86_64 Ubuntu 18.04 runfile (local).

Then install the libraries with this command:

# sh cuda_10.0.130_410.48_linux --silent --override --toolkit

This will install the toolkit in silent mode. With the override option it is possible to install the toolkit on systems which don’t have an NVIDIA GPU, which might be useful for compilation purposes.

To make sure your users have the necessary binaries and libraries available in their path, create the file /etc/profile.d/cuda.sh with this content:

export CUDA_HOME=/usr/local/cuda
export PATH=$CUDA_HOME/bin:$PATH
export LD_LIBRARY_PATH=$CUDA_HOME/lib64${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}
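To check from a fresh login shell that the profile script was picked up, you can test whether the CUDA directories are in the search paths. This is a small sketch. The helper name is made up for this example.

```shell
#!/bin/sh
# Return 0 if the colon-separated list in $1 contains the directory $2.
path_contains() {
  case ":$1:" in
    *":$2:"*) return 0 ;;
    *) return 1 ;;
  esac
}

if path_contains "$PATH" /usr/local/cuda/bin; then
  echo "CUDA binaries are in PATH"
else
  echo "CUDA binaries are not in PATH"
fi
```

If the directory is present, running nvcc --version should then print the installed toolkit version.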