.. Copyright © 2018 ANSSI.
   CLIP OS is a trademark of the French Republic.
   Content licensed under the Open License version 2.0 as published by Etalab
   (French task force for Open Data).

.. _kernel:

Kernel
======

The CLIP OS kernel is based on Linux. It also integrates:

* existing hardening patches that are not upstream yet and that we consider
  relevant to our security model;
* developments made for previous CLIP OS versions that we have not upstreamed
  yet (or that cannot be);
* entirely new functionalities that have not been upstreamed yet (or that
  cannot be).

Objectives
----------

As the core of a hardened operating system, the CLIP OS kernel is particularly
responsible for:

* providing **robust security mechanisms** to higher levels of the operating
  system, such as reliable isolation primitives;
* maintaining maximal **trust in hardware resources**;
* guaranteeing its **own protection** against various threats.

Configuration
-------------

In this section we discuss our security-relevant configuration choices for
the CLIP OS kernel. Before starting, it is worth mentioning that:

* We do our best to **limit the number of kernel modules**.

  In other words, as many modules as possible should be built-in. Modules are
  only used when needed either for the initramfs or to ease the automation of
  the deployment of CLIP OS on multiple different machines (for the moment, we
  only target a QEMU-KVM guest). This is particularly important as module
  loading is disabled after CLIP OS startup.

* We **focus on a secure configuration**. The rest of the configuration is
  minimal, and it is your job to tune it for your machines and use cases.

* CLIP OS only supports the x86-64 architecture for now.

* Running 32-bit programs is deliberately unsupported. Should you change that
  in your custom kernel, keep in mind that it requires further attention when
  configuring it (e.g., ensure that ``CONFIG_COMPAT_VDSO=n``).

* Many options that are not useful to us are disabled in order to cut attack
  surface. As they are not all detailed below, please see
  ``src/portage/clip/sys-kernel/clipos-kernel/files/config.d/blacklist`` for an
  exhaustive list of the ones we **explicitly** disable.
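
As an illustration of how such configuration choices can be checked, the
following minimal sketch compares the content of a kernel ``.config`` file
against a set of expected values. It is not part of the CLIP OS tooling: the
function names and the sample of options are assumptions made for this example.
It relies only on the usual ``.config`` syntax, i.e., ``CONFIG_FOO=value``
lines and ``# CONFIG_FOO is not set`` comments for disabled options.

.. code-block:: python

   import re

   # Hypothetical sample of options taken from this page; extend as needed.
   EXPECTED = {
       "CONFIG_AUDIT": "y",
       "CONFIG_KALLSYMS": "n",
       "CONFIG_USER_NS": "n",
       "CONFIG_SLAB_FREELIST_RANDOM": "y",
   }

   _SET = re.compile(r"^(CONFIG_\w+)=(.*)$")
   _UNSET = re.compile(r"^# (CONFIG_\w+) is not set$")


   def parse_config(text):
       """Map option names to values; options that are not set map to "n"."""
       options = {}
       for line in text.splitlines():
           line = line.strip()
           match = _SET.match(line)
           if match:
               # String values are quoted in .config files; drop the quotes.
               options[match.group(1)] = match.group(2).strip('"')
               continue
           match = _UNSET.match(line)
           if match:
               options[match.group(1)] = "n"
       return options


   def check_config(text, expected=EXPECTED):
       """Return {option: (expected, actual)} for every mismatch."""
       actual = parse_config(text)
       # An option that is absent from the file is treated as disabled.
       return {
           name: (want, actual.get(name, "n"))
           for name, want in expected.items()
           if actual.get(name, "n") != want
       }

Calling ``check_config()`` on the text of a generated ``.config`` returns an
empty dictionary when all the listed options have the expected value.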

General setup
~~~~~~~~~~~~~

.. describe:: CONFIG_AUDIT=y

   CLIP OS will need the auditing infrastructure.

.. describe:: CONFIG_IKCONFIG=n
              CONFIG_IKHEADERS=n

   We do not need ``.config`` to be available at runtime, nor do we need
   access to kernel headers through *sysfs*.

.. describe:: CONFIG_KALLSYMS=n

   Symbols are only useful for debug and attack purposes.

.. describe:: CONFIG_USERFAULTFD=n

   The ``userfaultfd()`` system call adds attack surface and can `make heap
   sprays easier <https://duasynt.com/blog/linux-kernel-heap-spray>`_. Note
   that the ``vm.unprivileged_userfaultfd`` sysctl can also be used to restrict
   the use of this system call to privileged users.

.. describe:: CONFIG_EXPERT=y

   This unlocks additional configuration options we need.

.. ---

.. describe:: CONFIG_USER_NS=n

   User namespaces can be useful for some use cases but are even more useful
   to an attacker. We choose to disable them for the moment, but we could also
   enable them and use the ``kernel.unprivileged_userns_clone`` sysctl provided
   by linux-hardened to disable their unprivileged use.

.. ---

.. describe:: CONFIG_SLUB_DEBUG=y

   Allow allocator validation checking to be enabled.

.. describe:: CONFIG_SLAB_MERGE_DEFAULT=n

   Merging SLAB caches can make heap exploitation easier.

.. describe:: CONFIG_SLAB_FREELIST_RANDOM=y

   Randomize allocator freelists.

.. describe:: CONFIG_SLAB_FREELIST_HARDENED=y

   Harden slab metadata.

.. describe:: CONFIG_SLAB_CANARY=y

   Place canaries at the end of slab allocations. [linux-hardened]_

.. ---

.. describe:: CONFIG_SHUFFLE_PAGE_ALLOCATOR=y

   Page allocator randomization is primarily a performance improvement for
   direct-mapped memory-side-cache utilization, but it does reduce the
   predictability of page allocations and thus complements
   ``SLAB_FREELIST_RANDOM``. The ``page_alloc.shuffle=1`` parameter needs to be
   added to the kernel command line.

.. ---

.. describe:: CONFIG_COMPAT_BRK=n

   Enabling this would disable brk ASLR.

.. ---

.. describe:: CONFIG_GCC_PLUGINS=y

   Enable GCC plugins, some of which are security-relevant; at least GCC 4.7
   is required.

   .. describe:: CONFIG_GCC_PLUGIN_LATENT_ENTROPY=y

      Instrument some kernel code to gather additional (but not
      cryptographically secure) entropy at boot time.

   .. describe:: CONFIG_GCC_PLUGIN_STRUCTLEAK=y
                 CONFIG_GCC_PLUGIN_STRUCTLEAK_BYREF_ALL=y

      Prevent potential information leakage by forcing zero-initialization of:

        - structures on the stack containing userspace addresses;
        - any stack variable (thus including structures) that may be passed by
          reference and has not already been explicitly initialized.

      This is particularly important to prevent trivial bypassing of KASLR.

   .. describe:: CONFIG_GCC_PLUGIN_RANDSTRUCT=y

      Randomize layout of sensitive kernel structures. Exploits targeting such
      structures then require an additional information leak vulnerability.

   .. describe:: CONFIG_GCC_PLUGIN_RANDSTRUCT_PERFORMANCE=n

      Do not weaken structure randomization.

.. ---

.. describe:: CONFIG_ARCH_MMAP_RND_BITS=32

   Use maximum number of randomized bits for the mmap base address on x86_64.
   Note that thanks to a linux-hardened patch, this also impacts the number of
   randomized bits for the stack base address.
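
   To give an order of magnitude, the randomized bits are applied at page
   granularity, so the range covered by the random offset is the number of
   possible values multiplied by the page size. The short computation below
   is only a sketch: it assumes 4 KiB base pages, and the helper name is ours
   and purely illustrative.

   .. code-block:: python

      PAGE_SIZE = 4096  # assumption: 4 KiB base pages, as on x86-64


      def mmap_random_span(rnd_bits, page_size=PAGE_SIZE):
          """Size in bytes of the range covered by a page-granular offset."""
          return (1 << rnd_bits) * page_size


      # 2**32 possible page offsets of 2**12 bytes each, i.e., 2**44 bytes.
      span = mmap_random_span(32)
      print(span == 2**44)
      # The same span expressed in TiB (2**40 bytes).
      print(span // 2**40)

   Under these assumptions, 32 bits spread the mmap base over 16 TiB of
   address space.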

.. ---

.. describe:: CONFIG_STACKPROTECTOR=y
              CONFIG_STACKPROTECTOR_STRONG=y

   Use ``-fstack-protector-strong`` for best stack canary coverage; at least
   GCC 4.9 is required.

.. describe:: CONFIG_VMAP_STACK=y

   Virtually-mapped stacks benefit from guard pages, thus making kernel stack
   overflows harder to exploit.

.. describe:: CONFIG_REFCOUNT_FULL=y

   Do extensive checks on reference counting to prevent use-after-free
   conditions. Without this option, on x86, there already is a fast
   assembly-based protection based on the PaX implementation but it does not
   cover all cases.

.. ---

.. describe:: CONFIG_STRICT_MODULE_RWX=y

   Enforce strict memory mapping permissions for loadable kernel modules.

.. ---

Although CLIP OS stores kernel modules in a read-only rootfs whose integrity is
guaranteed by dm-verity, we still enable and enforce module signing as an
additional layer of security:

 .. describe:: CONFIG_MODULE_SIG=y
               CONFIG_MODULE_SIG_FORCE=y
               CONFIG_MODULE_SIG_ALL=y
               CONFIG_MODULE_SIG_SHA512=y
               CONFIG_MODULE_SIG_HASH="sha512"
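
For illustration, a signed module file can be recognized by the marker string
that is appended after the signature data. The sketch below only checks for
the presence of this trailer: it does not verify the signature itself, which
remains the job of the kernel at load time. The function names are
hypothetical, and the check is only meaningful for uncompressed module files.

.. code-block:: python

   # Marker appended at the very end of a signed kernel module.
   MODULE_SIG_MARKER = b"~Module signature appended~\n"


   def has_signature_trailer(module_bytes):
       """Tell whether raw module content ends with the signature marker."""
       return module_bytes.endswith(MODULE_SIG_MARKER)


   def file_has_signature_trailer(path):
       """Same check on a file, reading only its last bytes."""
       with open(path, "rb") as module:
           module.seek(0, 2)  # jump to the end to learn the file size
           size = module.tell()
           module.seek(max(0, size - len(MODULE_SIG_MARKER)))
           return has_signature_trailer(module.read())

Such a check can help spot unsigned modules at build time, before
``CONFIG_MODULE_SIG_FORCE`` rejects them at load time.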

.. ---

.. describe:: CONFIG_INIT_STACK_ALL=n

   This option requires compiler support that is currently only available in
   Clang.

Processor type and features
~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. describe:: CONFIG_RETPOLINE=y

   Retpolines are needed to protect against Spectre v2. GCC 7.3.0 or higher is
   required.

.. describe:: CONFIG_LEGACY_VSYSCALL_NONE=y

   The vsyscall table is not required anymore by libc and is a fixed-position
   potential source of ROP gadgets.

.. describe:: CONFIG_X86_VSYSCALL_EMULATE=n
              CONFIG_LEGACY_VSYSCALL_XONLY=n

   See above.

.. describe:: CONFIG_MICROCODE=y

   Needed to benefit from microcode updates and thus security fixes (e.g.,
   additional Intel pseudo-MSRs to be used by the kernel as a mitigation for
   various speculative execution vulnerabilities).

.. describe:: CONFIG_X86_MSR=n
              CONFIG_X86_CPUID=n

   Enabling those features would only present userspace with more attack
   surface.

.. describe:: CONFIG_KSM=n

   Enabling this feature can make cache side-channel attacks such as
   FLUSH+RELOAD much easier to carry out.

.. ---

.. describe:: CONFIG_DEFAULT_MMAP_MIN_ADDR=65536

   This should in particular be non-zero to prevent the exploitation of kernel
   NULL pointer bugs.

.. describe:: CONFIG_MTRR=y

   Memory Type Range Registers can make speculative execution bugs a bit harder
   to exploit.

.. describe:: CONFIG_X86_PAT=y

   Page Attribute Tables are the modern equivalents of MTRRs, which we
   described above.

.. describe:: CONFIG_ARCH_RANDOM=y

   Enable the RDRAND instruction to benefit from a secure hardware RNG if
   supported. See also ``CONFIG_RANDOM_TRUST_CPU``.

.. describe:: CONFIG_X86_SMAP=y

   Enable Supervisor Mode Access Prevention to prevent ret2usr exploitation
   techniques.

.. describe:: CONFIG_X86_INTEL_UMIP=y

   Enable User Mode Instruction Prevention. Note that hardware supporting this
   feature is not common yet.

.. describe:: CONFIG_X86_INTEL_MPX=n

   Intel Memory Protection Extensions add hardware assistance to memory
   protection. Compiler support is required but is deprecated in GCC 8 and will
   probably be dropped in GCC 9.

.. describe:: CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS=n

   Memory Protection Keys are a promising feature but they are still not
   supported on current hardware.

.. describe:: CONFIG_X86_INTEL_TSX_MODE_OFF=y

   Set the default value of the ``tsx`` kernel parameter to ``off``.

.. ---

Enable the **seccomp** BPF userspace API for syscall attack surface reduction:

  .. describe:: CONFIG_SECCOMP=y
                CONFIG_SECCOMP_FILTER=y

.. ---

.. describe:: CONFIG_RANDOMIZE_BASE=y

   While this may be seen as a `controversial
   <https://grsecurity.net/kaslr_an_exercise_in_cargo_cult_security.php>`_
   feature, it makes sense for CLIP OS. Indeed, KASLR may be defeated thanks to
   the kernel interfaces that are available to an attacker, or through attacks
   leveraging hardware vulnerabilities such as speculative and out-of-order
   execution ones. However, CLIP OS follows the *defense in depth* principle
   and an attack surface reduction approach. Thus, the following points make
   KASLR relevant in the CLIP OS kernel:

   * KASLR was initially designed to counter remote attacks but the strong
     security model of CLIP OS (e.g., no sysfs mounts in most containers,
     minimal procfs, no arbitrary code execution) makes a local attack
     more complex to carry out.
   * STRUCTLEAK, STACKLEAK, kptr_restrict and
     ``CONFIG_SECURITY_DMESG_RESTRICT`` are enabled in CLIP OS.
   * The CLIP OS kernel is custom-compiled (at least for a given deployment),
     its image is unreadable to all users including privileged ones and updates
     are end-to-end encrypted. This makes both the content and addresses of the
     kernel image secret. Note that, however, the production kernel image is
     currently part of an EFI binary and is not encrypted, causing it to be
     accessible to a physical attacker. This will change in the future as we
     will only use the kernel included in the EFI binary to boot and then
     *kexec* to the real production kernel whose image will be located on an
     encrypted disk partition.
   * We enable ``CONFIG_PANIC_ON_OOPS`` by default so that the kernel
     cannot recover from failed exploit attempts, thus preventing any brute
     forcing.
   * We enable Kernel Page Table Isolation, mitigating Meltdown and potentially
     other hardware information leaks. Variant 3a (Rogue System Register
     Read), however, remains an important threat to KASLR.

.. ---

.. describe:: CONFIG_RANDOMIZE_MEMORY=y

   Most of the above explanations also apply to this feature.

.. describe:: CONFIG_KEXEC=n
              CONFIG_KEXEC_FILE=n

   Disable the ``kexec()`` system call to prevent an already-root attacker from
   rebooting on an untrusted kernel.

.. describe:: CONFIG_CRASH_DUMP=n

   A crash dump can potentially provide an attacker with useful information.
   However, since we disabled the ``kexec()`` system calls above, this
   configuration option should have no impact anyway.

.. ---

.. describe:: CONFIG_MODIFY_LDT_SYSCALL=n

   This is not supposed to be needed by userspace applications and only
   increases the kernel attack surface.

Power management and ACPI options
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. describe:: CONFIG_HIBERNATION=n

   The CLIP OS swap partition is encrypted with an ephemeral key and thus
   cannot support suspend to disk.

Firmware Drivers
~~~~~~~~~~~~~~~~

.. describe:: CONFIG_RESET_ATTACK_MITIGATION=n

   In order to work properly, this mitigation requires userspace support that
   is currently not available in CLIP OS. Moreover, due to our use of Secure
   Boot, Trusted Boot and the fact that machines running CLIP OS are expected
   to lock their BIOS with a password, the type of *cold boot attacks* this
   mitigation is supposed to thwart should not be an issue.

Executable file formats / Emulations
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. describe:: CONFIG_BINFMT_MISC=n

   We do not want our kernel to support miscellaneous binary classes. ELF
   binaries and interpreted scripts starting with a shebang are enough.

.. describe:: CONFIG_COREDUMP=n

   Core dumps can provide an attacker with useful information.

Networking support
~~~~~~~~~~~~~~~~~~

.. describe:: CONFIG_SYN_COOKIES=y

   Enable TCP syncookies.

Device Drivers
~~~~~~~~~~~~~~

.. describe:: CONFIG_HW_RANDOM_TPM=y

   Expose the TPM's Random Number Generator (RNG) as a Hardware RNG (HWRNG)
   device, allowing the kernel to collect randomness from it. See documentation
   of ``CONFIG_RANDOM_TRUST_CPU`` and the ``rng_core.default_quality`` command
   line parameter for supplementary information.

.. describe:: CONFIG_TCG_TPM=y

   CLIP OS leverages the TPM to ensure :ref:`boot integrity <trusted_boot>`.

.. describe:: CONFIG_DEVMEM=n

   The ``/dev/mem`` device should not be required by any user application
   nowadays.

   .. note::

      If you must enable it, at least enable ``CONFIG_STRICT_DEVMEM`` and
      ``CONFIG_IO_STRICT_DEVMEM`` to restrict access to this device as much as
      possible.

.. describe:: CONFIG_DEVKMEM=n

   This virtual device is only useful for debug purposes and is very dangerous
   as it allows direct kernel memory writing (particularly useful for
   rootkits).

.. describe:: CONFIG_LEGACY_PTYS=n

   Use the modern PTY interface only.

.. describe:: CONFIG_LDISC_AUTOLOAD=n

   Do not automatically load any line discipline that is in a kernel module
   when an unprivileged user asks for it.

.. describe:: CONFIG_DEVPORT=n

   The ``/dev/port`` device should not be used anymore by userspace, and it
   could increase the kernel attack surface.

.. describe:: CONFIG_RANDOM_TRUST_CPU=n

   Do not **credit** entropy generated by the CPU manufacturer's HWRNG and
   included in Linux's entropy pool. Fast and robust initialization of Linux's
   CSPRNG is instead achieved thanks to the TPM's HWRNG (see documentation of
   ``CONFIG_HW_RANDOM_TPM`` and the ``rng_core.default_quality`` command line
   parameter).

The IOMMU allows for protecting the system's main memory from arbitrary
accesses from devices (e.g., DMA attacks). Note that this is related to
hardware features. On a recent Intel machine, we enable the following:

  .. describe:: CONFIG_IOMMU_SUPPORT=y
                CONFIG_INTEL_IOMMU=y
                CONFIG_INTEL_IOMMU_SVM=y
                CONFIG_INTEL_IOMMU_DEFAULT_ON=y

File systems
~~~~~~~~~~~~

.. describe:: CONFIG_PROC_KCORE=n

   Enabling this would provide an attacker with valuable information on the
   running kernel.

Kernel hacking
~~~~~~~~~~~~~~

.. describe:: CONFIG_MAGIC_SYSRQ=n

   This should only be needed for debugging.

.. describe:: CONFIG_DEBUG_KERNEL=y

   This is useful even in a production kernel to enable further configuration
   options that have security benefits.

.. describe:: CONFIG_DEBUG_VIRTUAL=y

   Enable sanity checks in virtual to page code.

.. describe:: CONFIG_STRICT_KERNEL_RWX=y

   Ensure kernel page tables have strict permissions.

.. describe:: CONFIG_DEBUG_WX=y

   Check and report any dangerous memory mapping permissions, i.e., both
   writable and executable kernel pages.

.. describe:: CONFIG_DEBUG_FS=n

   The debugfs virtual file system is only useful for debugging and protecting
   it would require additional work.

.. describe:: CONFIG_SLUB_DEBUG_ON=n

   Using the ``slub_debug`` command line parameter provides more fine-grained
   control.

.. describe:: CONFIG_PANIC_ON_OOPS=y
              CONFIG_PANIC_TIMEOUT=-1

   Prevent potential further exploitation of a bug by immediately panicking the
   kernel.

The following options add additional checks and validation for various
commonly targeted kernel structures:

  .. describe:: CONFIG_DEBUG_CREDENTIALS=y
                CONFIG_DEBUG_NOTIFIERS=y
                CONFIG_DEBUG_LIST=y
                CONFIG_DEBUG_SG=y
  .. describe:: CONFIG_BUG_ON_DATA_CORRUPTION=y

     Note that linux-hardened patches add more places where this configuration
     option has an impact.

  .. describe:: CONFIG_SCHED_STACK_END_CHECK=y
  .. describe:: CONFIG_PAGE_POISONING=n

     We choose to poison pages with zeroes and thus prefer using
     ``init_on_free`` in combination with linux-hardened's
     ``PAGE_SANITIZE_VERIFY``.

Security
~~~~~~~~

.. describe:: CONFIG_SECURITY_DMESG_RESTRICT=y

   Prevent unprivileged users from gathering information from the kernel log
   buffer via ``dmesg(8)``. Note that this can still be overridden through the
   ``kernel.dmesg_restrict`` sysctl.

.. describe:: CONFIG_PAGE_TABLE_ISOLATION=y

   Enable KPTI to prevent Meltdown attacks and, more generally, reduce the
   number of hardware side channels.

.. ---

.. describe:: CONFIG_INTEL_TXT=n

   CLIP OS does not use Intel Trusted Execution Technology.

.. ---

.. describe:: CONFIG_HARDENED_USERCOPY=y

   Harden data copies between kernel and user spaces, preventing classes of
   heap overflow exploits and information leaks.

.. describe:: CONFIG_HARDENED_USERCOPY_FALLBACK=n

   Use strict whitelisting mode, i.e., do not ``WARN()``.

.. describe:: CONFIG_FORTIFY_SOURCE=y

   Leverage the compiler to detect buffer overflows.

.. describe:: CONFIG_FORTIFY_SOURCE_STRICT_STRING=n

   This extends ``FORTIFY_SOURCE`` to intra-object overflow checking. It is
   useful to find bugs but not recommended for a production kernel yet.
   [linux-hardened]_

.. describe:: CONFIG_STATIC_USERMODEHELPER=y

   This makes the kernel route all usermode helper calls to a single binary
   that cannot have its name changed. Without this, the kernel can be tricked
   into calling an attacker-controlled binary (e.g. to bypass SMAP, cf.
   `exploitation <https://seclists.org/oss-sec/2016/q4/621>`_ of
   CVE-2016-8655).

   .. describe:: CONFIG_STATIC_USERMODEHELPER_PATH=""

      Currently, we have no need for usermode helpers, so we simply disable
      them. If we ever need some, this path will need to be set to a custom
      trusted binary in charge of filtering and choosing what real helpers
      should then be called.

.. ---

.. describe:: CONFIG_SECURITY=y

   Enable us to choose different security modules.

.. describe:: CONFIG_SECURITY_SELINUX=y

   CLIP OS intends to leverage SELinux in its security model.

.. describe:: CONFIG_SECURITY_SELINUX_BOOTPARAM=n

   We do not need to be able to disable SELinux from the kernel command line.

.. describe:: CONFIG_SECURITY_SELINUX_DISABLE=n

   We do not want SELinux to be disabled. In addition, this would prevent LSM
   structures such as security hooks from being marked as read-only.

.. describe:: CONFIG_SECURITY_SELINUX_DEVELOP=y

   For now, but will eventually be ``n``.

.. ---

.. describe:: CONFIG_LSM="yama"

   SELinux shall be stacked too once CLIP OS uses it.

.. ---

.. describe:: CONFIG_SECURITY_YAMA=y

   The Yama LSM currently provides ptrace scope restriction (which might be
   redundant with CLIP-LSM in the future).

.. ---

.. describe:: CONFIG_INTEGRITY=n

   The integrity subsystem provides several components, the security benefits
   of which are already enforced by CLIP OS (e.g., read-only mounts for all
   parts of the system containing executable programs).

.. ---

.. describe:: CONFIG_SECURITY_PERF_EVENTS_RESTRICT=y

   See documentation about the ``kernel.perf_event_paranoid`` sysctl below.
   [linux-hardened]_

.. ---

.. describe:: CONFIG_SECURITY_TIOCSTI_RESTRICT=y

   This prevents unprivileged users from using the TIOCSTI ioctl to inject
   commands into other processes that share a tty session. [linux-hardened]_

.. ---

.. describe:: CONFIG_GCC_PLUGIN_STACKLEAK=y
              CONFIG_STACKLEAK_TRACK_MIN_SIZE=100
              CONFIG_STACKLEAK_METRICS=n
              CONFIG_STACKLEAK_RUNTIME_DISABLE=n

   ``STACKLEAK`` erases the kernel stack before returning from system calls,
   leaving it initialized to a poison value. This reduces both the information
   that kernel stack leak bugs can reveal and the exploitability of
   uninitialized stack variables. However, it does not cover functions
   reaching the same stack depth as prior functions during the same system
   call.

   It used to also block kernel stack depth overflows caused by ``alloca()``,
   such as Stack Clash attacks. We maintained this functionality for our kernel
   for a while but eventually `dropped it
   <https://github.com/clipos/src_external_linux/commit/3e5f9114fc2f70f6d2ae5d10db10869e0564eb03>`_.

.. describe:: CONFIG_INIT_ON_FREE_DEFAULT_ON=y
              CONFIG_INIT_ON_ALLOC_DEFAULT_ON=y

   These set ``init_on_free=1`` and ``init_on_alloc=1`` on the kernel command
   line. See the documentation of these kernel parameters for details.

.. describe:: CONFIG_PAGE_SANITIZE_VERIFY=y
              CONFIG_SLAB_SANITIZE_VERIFY=y

   Verify that newly allocated pages and slab allocations are zeroed to detect
   write-after-free bugs. This works in concert with ``init_on_free`` and is
   adjusted to not be redundant with ``init_on_alloc``.
   [linux-hardened]_

.. ---

We incorporated most of the *Lockdown* patch series into the CLIP OS kernel,
though it may be merged into the mainline kernel in the near future.
Basically, *Lockdown* tries to disable many mechanisms that could allow the
superuser to eventually run untrusted code in kernel mode (note that a
significant portion of them are already disabled in the CLIP OS kernel due to
our custom configuration). This work is of interest to CLIP OS as we want
to avoid persistence on a compromised machine even in the case of an
already-root attacker. Among the several configuration options brought by
*Lockdown*, we enable the following ones:

  .. describe:: CONFIG_LOCK_DOWN_KERNEL=y
                CONFIG_LOCK_DOWN_MANDATORY=y


Compilation
-----------

GCC version 7.3.0 or higher is required to fully benefit from retpolines
(``-mindirect-branch=thunk-extern``).


Sysctl Security Tuning
----------------------

Many sysctls are not security-relevant or only play a role if some kernel
configuration options are enabled/disabled. In other words, the following is
tightly related to the CLIP OS kernel configuration detailed above.

.. describe:: dev.tty.ldisc_autoload = 0

   See ``CONFIG_LDISC_AUTOLOAD`` above, which serves as a default value for
   this sysctl.

.. describe:: kernel.kptr_restrict = 2

   Hide kernel addresses in ``/proc`` and other interfaces, even from
   privileged users.

.. describe:: kernel.yama.ptrace_scope = 3

   Enable the strictest ptrace scope restriction provided by the Yama LSM.

.. describe:: kernel.perf_event_paranoid = 3

   This completely disallows unprivileged access to the ``perf_event_open()``
   system call. This is actually not needed as we already enable
   ``CONFIG_SECURITY_PERF_EVENTS_RESTRICT``. [linux-hardened]_

   Note that this requires a patch included in linux-hardened (see `here
   <https://lwn.net/Articles/696216/>`_ for the reason why it is not upstream).
   Indeed, on a mainline kernel without such a patch, the above is equivalent
   to setting this sysctl to ``2``, which would still allow the profiling of
   user processes.

.. describe:: kernel.tiocsti_restrict = 1

   This is already forced by the ``CONFIG_SECURITY_TIOCSTI_RESTRICT`` kernel
   configuration option that we enable. [linux-hardened]_

The following two sysctls help mitigate TOCTOU vulnerabilities by preventing
users from creating symbolic or hard links to files they do not own or have
read/write access to:

  .. describe:: fs.protected_symlinks = 1
                fs.protected_hardlinks = 1

In addition, the following two sysctls impose restrictions on the opening of
FIFOs and regular files in order to make similar spoofing attacks harder:

  .. describe:: fs.protected_fifos = 2
                fs.protected_regular = 2

We do not simply disable the BPF Just-in-Time (JIT) compiler as CLIP OS plans
on using it. Instead, we restrict and harden it:

  .. describe:: kernel.unprivileged_bpf_disabled = 1

     Prevent unprivileged users from using BPF.

  .. describe:: net.core.bpf_jit_harden = 2

     Trades off performance but helps mitigate JIT spraying.

.. describe:: kernel.deny_new_usb = 0

   The management of USB devices is handled at a higher level by CLIP OS.
   [linux-hardened]_

.. describe:: kernel.device_sidechannel_restrict = 1

   Restrict device timing side channels. [linux-hardened]_

.. describe:: fs.suid_dumpable = 0

   Do not create core dumps of setuid executables. Note that we already
   disable all core dumps by setting ``CONFIG_COREDUMP=n``.

.. describe:: kernel.pid_max = 65536

   Increase the space for PID values.

.. describe:: kernel.modules_disabled = 1

   Disable module loading once systemd has loaded the ones required for the
   running machine according to a profile (i.e., a predefined and
   hardware-specific list of modules).
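
The values documented above can be compared with those of a running system by
reading the corresponding files under ``/proc/sys``. The following is a
minimal sketch under stated assumptions: the function names are ours, only a
subset of the sysctls is listed, and the name-to-path translation simply
replaces dots with path separators (which is enough for the keys used here but
not for keys that embed dots, such as some network interface names).

.. code-block:: python

   from pathlib import Path

   # Subset of the values documented on this page (illustrative only).
   EXPECTED_SYSCTLS = {
       "kernel.kptr_restrict": "2",
       "kernel.yama.ptrace_scope": "3",
       "fs.protected_symlinks": "1",
       "fs.protected_hardlinks": "1",
       "kernel.unprivileged_bpf_disabled": "1",
       "net.core.bpf_jit_harden": "2",
   }


   def sysctl_path(name, root="/proc/sys"):
       """Translate a dotted sysctl name into its path under ``root``."""
       return Path(root).joinpath(*name.split("."))


   def check_sysctls(expected=EXPECTED_SYSCTLS, root="/proc/sys"):
       """Return {name: (expected, actual)} for mismatching keys."""
       mismatches = {}
       for name, want in expected.items():
           try:
               actual = sysctl_path(name, root).read_text().strip()
           except OSError:
               actual = None  # missing key or insufficient privileges
           if actual != want:
               mismatches[name] = (want, actual)
       return mismatches

Keys provided by linux-hardened (e.g., ``kernel.tiocsti_restrict``) are
reported with an actual value of ``None`` on kernels that lack the
corresponding patches.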

Pure network sysctls (``net.ipv4.*`` and ``net.ipv6.*``) will be detailed in a
separate place.


Command line parameters
-----------------------

We pass the following command line parameters to the kernel:

.. describe:: extra_latent_entropy

   This parameter, provided by a linux-hardened patch (based on the PaX
   implementation), enables a very simple form of latent entropy extracted
   during system start-up and added to the entropy obtained with
   ``GCC_PLUGIN_LATENT_ENTROPY``. [linux-hardened]_

.. describe:: pti=on

   This force-enables KPTI even on CPUs claiming to be safe from Meltdown.

.. describe:: spectre_v2=on

   Same reasoning as above but for the Spectre v2 vulnerability. Note that this
   implies ``spectre_v2_user=on``, which enables the mitigation against user
   space to user space task attacks (namely IBPB and STIBP when available and
   relevant).

.. describe:: spec_store_bypass_disable=seccomp

   Same reasoning as above but for the Spectre v4 vulnerability. Note that this
   mitigation requires updated microcode for Intel processors.


.. describe:: mds=full,nosmt

   This parameter controls optional mitigations for the Microarchitectural Data
   Sampling (MDS) class of Intel CPU vulnerabilities. Not specifying this
   parameter is equivalent to setting ``mds=full``, which leaves SMT enabled
   and therefore is not a complete mitigation. Note that this mitigation
   requires an Intel microcode update and also addresses the TSX Asynchronous
   Abort (TAA) Intel CPU vulnerability on systems that are affected by MDS.

.. describe:: iommu=force

   Even if we correctly enable the IOMMU in the kernel configuration, the
   kernel can still decide for various reasons to not initialize it at boot.
   Therefore, we force it with this parameter. Note that with some Intel
   chipsets, you may need to add ``intel_iommu=igfx_off`` to allow your GPU to
   access the physical memory directly without going through the DMA Remapping.

.. describe:: slub_debug=F

   The ``F`` option adds many sanity checks to various slab operations. Other
   interesting options that we considered but eventually chose to not use are:

    * The ``P`` option, which enables poisoning on slab cache allocations,
      disables the ``init_on_free`` and ``SLAB_SANITIZE_VERIFY`` features. As
      they respectively poison with zeroes on object freeing and check the
      zeroing on object allocations, we prefer enabling them instead of using
      ``slub_debug=P``.
    * The ``Z`` option enables red zoning, i.e., it adds extra areas around
      slab objects that detect when one is overwritten past its real size.
      This can help detect overflows but we already rely on ``SLAB_CANARY``
      provided by linux-hardened. A canary is much better than a simple red
      zone as it is supposed to be random.

.. describe:: page_alloc.shuffle=1

   See ``CONFIG_SHUFFLE_PAGE_ALLOCATOR``.

.. describe:: rng_core.default_quality=512

   Increase trust in the TPM's HWRNG to quickly and robustly initialize
   Linux's CSPRNG by **crediting** half of the entropy it provides.
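
   As a worked example, the sketch below computes how many bits of entropy
   are credited for a given amount of data read from the HWRNG. It assumes
   that the quality value counts entropy bits per 1024 bits of input, which is
   consistent with ``512`` meaning "half"; the helper name is hypothetical.

   .. code-block:: python

      # Assumption: quality is expressed in entropy bits per 1024 input bits.
      QUALITY_SCALE = 1024


      def credited_bits(input_bits, quality):
          """Entropy bits credited for ``input_bits`` read from the HWRNG."""
          return input_bits * quality // QUALITY_SCALE


      # With default_quality=512, half of the bits read are credited.
      print(credited_bits(256, 512))

   Under this assumption, reading 256 bits from the TPM credits 128 bits of
   entropy.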

Also, note that:

* ``slub_nomerge`` is not used as we already set
  ``CONFIG_SLAB_MERGE_DEFAULT=n`` in the kernel configuration.
* ``l1tf``: The built-in PTE Inversion mitigation is sufficient to mitigate
  the L1TF vulnerability as long as CLIP OS is not used as a hypervisor with
  untrusted guest VMs. If it were to be someday, ``l1tf=full,force`` should be
  used to force-enable VMX unconditional cache flushes and force-disable SMT
  (note that an Intel microcode update is not required for this mitigation to
  work but improves performance by providing a way to invalidate caches with a
  finer granularity).
* ``tsx=off``: This parameter is already set by default thanks to
  ``CONFIG_X86_INTEL_TSX_MODE_OFF``. It deactivates the Intel TSX feature on
  CPUs that support TSX control (i.e. are recent enough or received a microcode
  update) and that are not already vulnerable to MDS, therefore mitigating the
  TSX Asynchronous Abort (TAA) Intel CPU vulnerability.
* ``tsx_async_abort``: This parameter controls optional mitigations for the TSX
  Asynchronous Abort (TAA) Intel CPU vulnerability. Due to our use of
  ``mds=full,nosmt`` in addition to ``CONFIG_X86_INTEL_TSX_MODE_OFF``, CLIP OS
  is already protected against this vulnerability as long as the CPU microcode
  has been updated, whether or not the CPU is affected by MDS. For the record,
  if we wanted to keep TSX activated, we could specify
  ``tsx_async_abort=full,nosmt``. Not specifying this parameter is equivalent
  to setting ``tsx_async_abort=full``, which leaves SMT enabled and therefore
  is not a complete mitigation. Note that this mitigation requires an Intel
  microcode update and has no effect on systems that are already affected by
  MDS and enable mitigations against it, nor on systems that disable TSX.
* ``kvm.nx_huge_pages``: This parameter controls the KVM hypervisor iTLB
  multihit mitigations. Such mitigations are not needed as long as CLIP OS is
  not used as a hypervisor with untrusted guest VMs. If it were to be
  someday, ``kvm.nx_huge_pages=force`` should be used to ensure that guests
  cannot exploit the iTLB multihit erratum to crash the host.
* ``mitigations``: This parameter controls optional mitigations for CPU
  vulnerabilities in an arch-independent and more coarse-grained way. For now,
  we keep using arch-specific options for the sake of explicitness. Not setting
  this parameter is equivalent to setting it to ``auto``, which by itself does
  not change anything.
* ``init_on_free=1`` is automatically set due to ``INIT_ON_FREE_DEFAULT_ON``. It
  zero-fills page and slab allocations on free to reduce risks of information
  leaks and help mitigate a subset of use-after-free vulnerabilities.
* ``init_on_alloc=1`` is automatically set due to ``INIT_ON_ALLOC_DEFAULT_ON``.
  The purpose of this functionality is to eliminate several kinds of
  *uninitialized heap memory* flaws by zero-filling:

  * all page allocator and slab allocator memory when allocated: this is
    already guaranteed by our use of ``init_on_free`` in combination with
    ``PAGE_SANITIZE_VERIFY`` and ``SLAB_SANITIZE_VERIFY`` from linux-hardened,
    and thus has no effect;
  * a few more *special* objects when allocated: these are the ones for which
    we enable ``init_on_alloc`` as they are not covered by the aforementioned
    combination of ``init_on_free`` and ``SANITIZE_VERIFY`` features.
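
The parameters that this section lists as passed to the kernel can be compared
with an actual command line (for instance the content of ``/proc/cmdline``).
The following is a minimal sketch rather than a CLIP OS tool: the function
names are assumptions, and the parsing is deliberately simple (it does not
treat ``-`` and ``_`` as equivalent in parameter names, and the last occurrence
of a repeated parameter wins).

.. code-block:: python

   import shlex

   # Parameters documented in this section as passed to the CLIP OS kernel.
   REQUIRED_PARAMS = {
       "extra_latent_entropy": None,  # flag without a value
       "pti": "on",
       "spectre_v2": "on",
       "spec_store_bypass_disable": "seccomp",
       "mds": "full,nosmt",
       "iommu": "force",
       "slub_debug": "F",
       "page_alloc.shuffle": "1",
       "rng_core.default_quality": "512",
   }


   def parse_cmdline(cmdline):
       """Split a kernel command line into a {name: value-or-None} mapping."""
       params = {}
       for token in shlex.split(cmdline):
           name, sep, value = token.partition("=")
           params[name] = value if sep else None
       return params


   def missing_params(cmdline, required=REQUIRED_PARAMS):
       """Return the sorted names that are absent or carry another value."""
       actual = parse_cmdline(cmdline)
       return sorted(
           name
           for name, want in required.items()
           if name not in actual or actual[name] != want
       )

For example, ``missing_params("pti=on mds=full iommu=force")`` reports ``mds``
among the missing parameters because its value differs from
``full,nosmt``.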

.. rubric:: Citations and origin of some items

.. [linux-hardened]
   This item is provided by the ``linux-hardened`` patches.

.. vim: set tw=79 ts=2 sts=2 sw=2 et: