Merge pull request #23200 from keszybz/oomd-docs

Extend the documentation for oomd a bit
update TODO
2026-04-25 16:34:50 +02:00 · 2022-04-28 17:46:03 +02:00 · 2022-04-28 17:16:33 +02:00 · 2022-04-28 17:16:33 +02:00 · 2022-04-28 15:46:44 +02:00 · 2022-04-28 15:46:44 +02:00
7 changed files with 190 additions and 60 deletions
--- a/91
+++ b/91
@@ -4,7 +4,7 @@ CHANGES WITH 251:

        Backwards-incompatible changes:

-        * The minimum kernel version required has been bumped from 3.13 to 3.15,
+        * The minimum kernel version required has been bumped from 3.13 to 4.15,
          and CLOCK_BOOTTIME is now assumed to always exist.

        * C11 with GNU extensions (aka "gnu11") is now used to build our
@@ -204,6 +204,19 @@ CHANGES WITH 251:
          similar to sd_id128_to_string() but formats the ID in RFC 4122 UUID
          format instead of simple series of hex characters.

+        * The sd-device API gained two new calls sd_device_new_from_devname()
+          and sd_device_new_from_path() which permit allocating an sd_device
+          object from a device node name or file system path.
+
+        * sd-device also gained a new call sd_device_open() which will open the
+          device node associated with a device for which an sd_device object
+          has been allocated. The call is supposed to address races around
+          device nodes being removed/recycled due to hotplug events, or media
+          change events: the call checks internally whether the major/minor of
+          the device node and the "diskseq" (in case of block devices) match
+          with the metadata loaded in the sd_device object, thus ensuring that
+          the device once opened really matches the provided sd_device object.
+
        Changes in PID1, systemctl, and systemd-oomd:

        * A new set of service monitor environment variables will be passed to
@@ -280,6 +293,32 @@ CHANGES WITH 251:
          necessary to fix this aspect. Absolute links are interpreted as
          before, and it is still possible to create them via other means.

+        * A new "taint" flag named "old-kernel" is introduced which is set when
+          the kernel systemd runs on is older than the current baseline version
+          (see above). The flag is shown in "systemctl status" output.
+
+        * Two additional taint flags "short-uid-range" and "short-gid-range"
+          have been added as well, which are set when systemd notices it is run
+          within a user namespace that does not define the full 0…65535 UID or
+          GID range, respectively.
+
+        * A new "unmerged-usr" taint flag has been added that is set whenever
+          running on systems where /bin/ + /sbin/ are *not* symlinks to their
+          counterparts in /usr/, i.e. on systems where the /usr/-merge has not
+          been completed.
+
+        * Generators invoked by PID 1 will now have a couple of useful
+          environment variables set describing the execution context a
+          bit. $SYSTEMD_SCOPE encodes whether the generator is called from the
+          system service manager, or from the per-user service
+          manager. $SYSTEMD_IN_INITRD encodes whether the generator is invoked
+          in initrd context or on the host. $SYSTEMD_FIRST_BOOT encodes whether
+          systemd considers the current boot to be a "first"
+          boot. $SYSTEMD_VIRTUALIZATION encodes whether virtualization is
+          detected and which type of hypervisor/container
+          manager. $SYSTEMD_ARCHITECTURE indicates which architecture the
+          kernel is built for.
+
        Changes in systemd-journald:

        * The journal JSON export format has been added to the list of stable
@@ -311,6 +350,32 @@ CHANGES WITH 251:
          already-initialized devices, and only devices which haven't been
          initialized yet, respectively.

+        * udevadm gained a new "wait" command for safely waiting for a specific
+          device to show up in the udev device database. This is useful in
+          scripts that asynchronously allocate a block device (e.g. through
+          repartitioning, or allocating a loopback device or similar) and need
+          to synchronize on the creation to complete.
+
+        * udevadm gained a new "lock" command for locking one or more block
+          devices while formatting them or writing a partition table to them.
+          It is an implementation of https://systemd.io/BLOCK_DEVICE_LOCKING and
+          is usable in scripts dealing with block devices.
+
+        * udevadm info will show a couple of additional device fields in its
+          output, and will now apply a limited set of coloring to line types.
+
+        * udevadm info --tree will now show a tree of objects (i.e. devices and
+          suchlike) in the /sys/ hierarchy.
+
+        * Block devices will now get a new set of device symlinks in
+          /dev/disk/by-diskseq/<nr>, which may be used to reference block
+          device nodes via the kernel's "diskseq" value. Note that opening a
+          device via such a symlink does not by itself guarantee that the
+          opened device actually matches the specified diskseq value. To be
+          safe against races, the actual diskseq value of the opened device
+          (BLKGETDISKSEQ ioctl()) must still be compared with the one in the
+          symlink path.
+
        * .link files gained support for setting MDI/MDI-X on a link.

        * .link files gained support for [Match] Firmware= setting to match on
@@ -377,6 +442,10 @@ CHANGES WITH 251:
          used, to ensure that communication between CPU and discrete TPM chips
          cannot be eavesdropped to acquire disk encryption keys.

+        * A new switch --fido2-credential-algorithm= has been added to
+          systemd-cryptenroll allowing selection of the credential algorithm to
+          use when binding encryption to FIDO2 tokens.
+
        Changes in systemd-hostnamed:

        * HARDWARE_VENDOR= and HARDWARE_MODEL= can be set in /etc/machine-info
@@ -387,7 +456,9 @@ CHANGES WITH 251:
          hostnamed.

        * hostnamed's D-Bus interface gained a new method GetHardwareSerial()
-          for reading the hardware serial number, as reportd by DMI.
+          for reading the hardware serial number, as reported by DMI. It also
+          exposes a new D-Bus property FirmwareVersion that encodes the
+          firmware version of the system.

        Changes in other components:

@@ -404,6 +475,22 @@ CHANGES WITH 251:
          used to set the default shell for user records and nspawn shell
          invocations (instead of the default /bin/bash).

+        * systemd-timesyncd now provides a D-Bus API for receiving NTP server
+          information dynamically at runtime via IPC.
+
+        * The systemd-creds tool gained a new "has-tpm2" verb, which reports
+          whether a functioning TPM2 infrastructure is available, i.e. if
+          firmware, kernel driver and systemd all have TPM2 support enabled and
+          a device is found.
+
+        * The systemd-creds tool gained support for generating encrypted
+          credentials that are using an empty encryption key. While this
+          provides neither integrity nor confidentiality, it's useful to implement
+          codeflows that work the same on TPM2-ful and TPM2-less systems. The
+          service manager will only accept credentials "encrypted" that way if
+          a TPM2 device cannot be detected, to ensure that credentials
+          "encrypted" like that cannot be used to trick TPM2 systems.
+
        Experimental features:

        * sd-boot gained a new *experimental* setting "reboot-for-bitlocker" in
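
The generator environment variables described in the NEWS hunk above can be consumed from a generator written in C roughly as follows. This is a minimal sketch, not code from the systemd tree: the helper names are hypothetical, and the value conventions ("system"/"user" for $SYSTEMD_SCOPE, "1"/"0" for $SYSTEMD_IN_INITRD) are assumptions not stated in the NEWS text, so check them against systemd.generator(7). Treating an unset $SYSTEMD_SCOPE as "system" is likewise an assumption, made for service managers that do not set the variable.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: true if the generator was invoked by the system
 * service manager. Assumes $SYSTEMD_SCOPE is "system" or "user"; an unset
 * variable is treated as "system" (an assumption of this sketch). */
static bool generator_is_system_scope(void) {
        const char *scope = getenv("SYSTEMD_SCOPE");

        return !scope || strcmp(scope, "system") == 0;
}

/* Hypothetical helper: true if the generator runs in initrd context.
 * Assumes $SYSTEMD_IN_INITRD is "1" in the initrd and "0" on the host. */
static bool generator_in_initrd(void) {
        const char *v = getenv("SYSTEMD_IN_INITRD");

        return v && strcmp(v, "1") == 0;
}
```

A generator could call these early in main() and skip work that only makes sense in one context, for example emitting host-only units when generator_in_initrd() returns false.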
--- a/18
+++ b/18
@@ -78,6 +78,24 @@ Janitorial Clean-ups:

 Features:

+* TPM2: add auth policy for signed PCR values to make updates easy. i.e. do
+  what tpm2_policyauthorize tool does.  To be truly useful scheme needs to be a
+  bit more elaborate though: policy probably must take some nvram based
+  generation counter into account that can only monotonically increase and can
+  be used to invalidate old PCR signatures. Otherwise people could downgrade to
+  old signed PCR sets whenever they want. Usecase: encrypt the rootfs with LUKS
+  with a key that can only be unlocked via a pristine pre-built Fedora
+  kernel+initrd.
+
+* update HACKING.md to suggest developing systemd with the ideas from:
+  https://0pointer.net/blog/testing-my-system-code-in-usr-without-modifying-usr.html
+  https://0pointer.net/blog/running-an-container-off-the-host-usr.html
+
+* add a clear concept how the initrd can make up credentials on its own to
+  pass to the system when transitioning into the host OS. usecase: things like
+  cloud-init/ignition and similar can parameterize the host with data they
+  acquire.
+
 * Add ConditionCredentialExists= or so, that allows conditionalizing services
  depending on whether a specific system credential is set. Usecase: a service
  similar to the ssh keygen service that installs any SSH host key supplied via
--- a/man/systemd-oomd.service.xml
+++ b/man/systemd-oomd.service.xml
@@ -29,23 +29,36 @@
  <refsect1>
    <title>Description</title>

-    <para><command>systemd-oomd</command> is a system service that uses cgroups-v2 and pressure stall information (PSI)
-    to monitor and take action on processes before an OOM occurs in kernel space.</para>
+    <para><command>systemd-oomd</command> is a system service that uses cgroups-v2 and pressure stall
+    information (PSI) to monitor and take corrective action before an OOM occurs in the kernel space.</para>

-    <para>You can enable monitoring and actions on units by setting <varname>ManagedOOMSwap=</varname> and/or
-    <varname>ManagedOOMMemoryPressure=</varname> to the appropriate value. <command>systemd-oomd</command> will
-    periodically poll enabled units' cgroup data to detect when corrective action needs to occur. When an action needs
-    to happen, it will only be performed on the descendant cgroups of the enabled units. More precisely, only cgroups with
-    <filename>memory.oom.group</filename> set to <constant>1</constant> and leaf cgroup nodes are eligible candidates.
-    Action will be taken recursively on all of the processes under the chosen candidate.</para>
+    <para>You can enable monitoring and actions on units by setting <varname>ManagedOOMSwap=</varname> and
+    <varname>ManagedOOMMemoryPressure=</varname> in the unit configuration, see
+    <citerefentry><refentrytitle>systemd.resource-control</refentrytitle><manvolnum>5</manvolnum></citerefentry>.
+    <command>systemd-oomd</command> retrieves information about such units from <command>systemd</command>
+    when it starts and watches for subsequent changes.</para>

-    <para>See
-    <citerefentry><refentrytitle>oomd.conf</refentrytitle><manvolnum>5</manvolnum></citerefentry>
+    <para>Cgroups of units with <varname>ManagedOOMSwap=</varname> or
+    <varname>ManagedOOMMemoryPressure=</varname> set to <option>kill</option> will be monitored.
+    <command>systemd-oomd</command> periodically polls PSI statistics for the system and those cgroups to
+    decide when to take action. If the configured limits are exceeded, <command>systemd-oomd</command> will
+    select a cgroup to terminate, and send <constant>SIGKILL</constant> to all processes in it. Note that
+    only descendant cgroups are eligible candidates for killing; the unit with its property set to
+    <option>kill</option> is not a candidate (unless one of its ancestors set their property to
+    <option>kill</option>). Also only leaf cgroups and cgroups with <filename>memory.oom.group</filename> set
+    to <constant>1</constant> are eligible candidates; see <varname>OOMPolicy=</varname> in
+    <citerefentry><refentrytitle>systemd.service</refentrytitle><manvolnum>5</manvolnum></citerefentry>.
+    </para>
+
+    <para><citerefentry><refentrytitle>oomctl</refentrytitle><manvolnum>1</manvolnum></citerefentry> can
+    be used to list monitored cgroups and pressure information.</para>
+
+    <para>See <citerefentry><refentrytitle>oomd.conf</refentrytitle><manvolnum>5</manvolnum></citerefentry>
    for more information about the configuration of this service.</para>
  </refsect1>

  <refsect1>
-    <title>Setup Information</title>
+    <title>System requirements and configuration</title>

    <para>The system must be running systemd with a full unified cgroup hierarchy for the expected cgroups-v2 features.
    Furthermore, memory accounting must be turned on for all units monitored by <command>systemd-oomd</command>.
@@ -53,23 +66,25 @@
    is set to <constant>true</constant> in
    <citerefentry><refentrytitle>systemd-system.conf</refentrytitle><manvolnum>5</manvolnum></citerefentry>.</para>

-    <para>You will need a kernel compiled with PSI support. This is available in Linux 4.20 and above.</para>
+    <para>The kernel must be compiled with PSI support. This is available in Linux 4.20 and above.</para>

-    <para>It is highly recommended for the system to have swap enabled for <command>systemd-oomd</command> to function
-    optimally. With swap enabled, the system spends enough time swapping pages to let <command>systemd-oomd</command> react.
-    Without swap, the system enters a livelocked state much more quickly and may prevent <command>systemd-oomd</command>
-    from responding in a reasonable amount of time. See
-    <ulink url="https://chrisdown.name/2018/01/02/in-defence-of-swap.html">"In defence of swap: common misconceptions"</ulink>
-    for more details on swap. Any swap-based actions on systems without swap will be ignored. While
-    <command>systemd-oomd</command> can perform pressure-based actions on a system without swap, the pressure increases
-    will be more abrupt and may require more tuning to get the desired thresholds and behavior.</para>
+    <para>It is highly recommended for the system to have swap enabled for <command>systemd-oomd</command> to
+    function optimally. With swap enabled, the system spends enough time swapping pages to let
+    <command>systemd-oomd</command> react.  Without swap, the system enters a livelocked state much more
+    quickly and may prevent <command>systemd-oomd</command> from responding in a reasonable amount of
+    time. See <ulink url="https://chrisdown.name/2018/01/02/in-defence-of-swap.html">"In defence of swap:
+    common misconceptions"</ulink> for more details on swap. Any swap-based actions on systems without swap
+    will be ignored. While <command>systemd-oomd</command> can perform pressure-based actions on such a
+    system, the pressure increases will be more abrupt and may require more tuning to get the desired
+    thresholds and behavior.</para>

    <para>Be aware that if you intend to enable monitoring and actions on <filename>user.slice</filename>,
-    <filename>user-$UID.slice</filename>, or their ancestor cgroups, it is highly recommended that your programs be
-    managed by the systemd user manager to prevent running too many processes under the same session scope (and thus
-    avoid a situation where memory intensive tasks trigger <command>systemd-oomd</command> to kill everything under the
-    cgroup). If you're using a desktop environment like GNOME, it already spawns many session components with the
-    systemd user manager.</para>
+    <filename>user-$UID.slice</filename>, or their ancestor cgroups, it is highly recommended that your
+    programs be managed by the systemd user manager to prevent running too many processes under the same
+    session scope (and thus avoid a situation where memory intensive tasks trigger
+    <command>systemd-oomd</command> to kill everything under the cgroup). If you're using a desktop
+    environment like GNOME or KDE, it already spawns many session components with the systemd user manager.
+    </para>
  </refsect1>

  <refsect1>
@@ -79,11 +94,11 @@
    <filename>-.slice</filename>, and allowing all descendant cgroups to be eligible candidates may make the most
    sense.</para>

-    <para><varname>ManagedOOMMemoryPressure=</varname> tends to work better on the cgroups below the root slice
-    <filename>-.slice</filename>. For units which tend to have processes that are less latency sensitive (e.g.
-    <filename>system.slice</filename>), a higher limit like the default of 60% may be acceptable, as those processes
-    can usually ride out slowdowns caused by lack of memory without serious consequences. However, something like
-    <filename>user@$UID.service</filename> may prefer a much lower value like 40%.</para>
+    <para><varname>ManagedOOMMemoryPressure=</varname> tends to work better on the cgroups below the root
+    slice. For units which tend to have processes that are less latency sensitive (e.g.
+    <filename>system.slice</filename>), a higher limit like the default of 60% may be acceptable, as those
+    processes can usually ride out slowdowns caused by lack of memory without serious consequences. However,
+    something like <filename>user@$UID.service</filename> may prefer a much lower value like 40%.</para>
  </refsect1>

  <refsect1>
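
The man page text above says systemd-oomd periodically polls PSI statistics and acts when configured limits are exceeded. The sketch below illustrates that idea on a single line in the format the kernel exposes in /proc/pressure/memory and in per-cgroup memory.pressure files. It is not the oomd implementation: the function names are hypothetical, only the avg10 field is read, and the requirement that pressure stay above the limit for a configured duration is left out.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: extract the 10-second average from one PSI line, e.g.
 *   "full avg10=1.23 avg60=0.50 avg300=0.10 total=12345"
 * 'kind' selects the expected line type ("some" or "full").
 * Returns 0 on success, -1 if the line does not match. */
static int psi_parse_avg10(const char *line, const char *kind, double *ret) {
        char type[8];
        double avg10;

        assert(line);
        assert(kind);
        assert(ret);

        if (sscanf(line, "%7s avg10=%lf", type, &avg10) != 2)
                return -1;
        if (strcmp(type, kind) != 0)
                return -1;

        *ret = avg10;
        return 0;
}

/* Hypothetical helper: compare a pressure percentage against a limit such as
 * the 60% default mentioned in the man page. */
static int psi_exceeds_limit(double avg10_percent, double limit_percent) {
        return avg10_percent > limit_percent;
}
```

With a line reporting avg10=61.50, psi_exceeds_limit() is true for a 60% limit and false for a higher one, which is the kind of comparison that tuning ManagedOOMMemoryPressure= limits per slice affects.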
--- a/man/systemd.resource-control.xml
+++ b/man/systemd.resource-control.xml
@@ -1108,24 +1108,24 @@ DeviceAllow=/dev/loop-control
          <citerefentry><refentrytitle>systemd-oomd.service</refentrytitle><manvolnum>8</manvolnum></citerefentry>
          will act on this unit's cgroups. Defaults to <option>auto</option>.</para>

-          <para>When set to <option>kill</option>, <command>systemd-oomd</command> will actively monitor this unit's
-          cgroup metrics to decide whether it needs to act. If the cgroup passes the limits set by
-          <citerefentry><refentrytitle>oomd.conf</refentrytitle><manvolnum>5</manvolnum></citerefentry> or its
-          overrides, <command>systemd-oomd</command> will send a <constant>SIGKILL</constant> to all of the processes
-          under the chosen candidate cgroup. Note that only descendant cgroups can be eligible candidates for killing;
-          the unit that set its property to <option>kill</option> is not a candidate (unless one of its ancestors set
-          their property to <option>kill</option>). You can find more details on candidates and kill behavior at
+          <para>When set to <option>kill</option>, the unit becomes a candidate for monitoring by
+          <command>systemd-oomd</command>. If the cgroup passes the limits set by
+          <citerefentry><refentrytitle>oomd.conf</refentrytitle><manvolnum>5</manvolnum></citerefentry> or
+          the unit configuration, <command>systemd-oomd</command> will select a descendant cgroup and send
+          <constant>SIGKILL</constant> to all of the processes under it. You can find more details on
+          candidates and kill behavior at
          <citerefentry><refentrytitle>systemd-oomd.service</refentrytitle><manvolnum>8</manvolnum></citerefentry>
-          and <citerefentry><refentrytitle>oomd.conf</refentrytitle><manvolnum>5</manvolnum></citerefentry>. Setting
-          either of these properties to <option>kill</option> will also automatically acquire
-          <varname>After=</varname> and <varname>Wants=</varname> dependencies on
-          <filename>systemd-oomd.service</filename> unless <varname>DefaultDependencies=no</varname>.
-        </para>
+          and
+          <citerefentry><refentrytitle>oomd.conf</refentrytitle><manvolnum>5</manvolnum></citerefentry>.</para>

-          <para>When set to <option>auto</option>, <command>systemd-oomd</command> will not actively use this cgroup's
-          data for monitoring and detection. However, if an ancestor cgroup has one of these properties set to
-          <option>kill</option>, a unit with <option>auto</option> can still be an eligible candidate for
-          <command>systemd-oomd</command> to act on.</para>
+          <para>Setting either of these properties to <option>kill</option> will also result in
+          <varname>After=</varname> and <varname>Wants=</varname> dependencies on
+          <filename>systemd-oomd.service</filename> unless <varname>DefaultDependencies=no</varname>.</para>
+
+          <para>When set to <option>auto</option>, <command>systemd-oomd</command> will not actively use this
+          cgroup's data for monitoring and detection. However, if an ancestor cgroup has one of these
+          properties set to <option>kill</option>, a unit with <option>auto</option> can still be a candidate
+          for <command>systemd-oomd</command> to terminate.</para>
        </listitem>
      </varlistentry>

--- a/man/systemd.service.xml
+++ b/man/systemd.service.xml
@@ -1123,15 +1123,25 @@
      <varlistentry>
        <term><varname>OOMPolicy=</varname></term>

-        <listitem><para>Configure the Out-Of-Memory (OOM) killer policy. On Linux, when memory becomes scarce
-        the kernel might decide to kill a running process in order to free up memory and reduce memory
+        <listitem><para>Configure the out-of-memory (OOM) kernel killer policy. Note that the userspace OOM
+        killer
+        <citerefentry><refentrytitle>systemd-oomd.service</refentrytitle><manvolnum>8</manvolnum></citerefentry>
+        is a more flexible solution that aims to prevent out-of-memory situations for the userspace, not just
+        the kernel.</para>
+
+        <para>On Linux, when memory becomes scarce to the point that the kernel has trouble allocating memory
+        for itself, it might decide to kill a running process in order to free up memory and reduce memory
        pressure. This setting takes one of <constant>continue</constant>, <constant>stop</constant> or
        <constant>kill</constant>. If set to <constant>continue</constant> and a process of the service is
        killed by the kernel's OOM killer this is logged but the service continues running. If set to
        <constant>stop</constant> the event is logged but the service is terminated cleanly by the service
        manager. If set to <constant>kill</constant> and one of the service's processes is killed by the OOM
-        killer the kernel is instructed to kill all remaining processes of the service, too. Defaults to the
-        setting <varname>DefaultOOMPolicy=</varname> in
+        killer the kernel is instructed to kill all remaining processes of the service too, by setting the
+        <filename>memory.oom.group</filename> attribute to <constant>1</constant>; also see <ulink
+        url="https://www.kernel.org/doc/html/latest/admin-guide/cgroup-v2.html">kernel documentation</ulink>.
+        </para>
+
+        <para>Defaults to the value that <varname>DefaultOOMPolicy=</varname> in
        <citerefentry><refentrytitle>systemd-system.conf</refentrytitle><manvolnum>5</manvolnum></citerefentry>
        is set to, except for services where <varname>Delegate=</varname> is turned on, where it defaults to
        <constant>continue</constant>.</para>
@@ -1142,9 +1152,9 @@
        <citerefentry><refentrytitle>systemd.exec</refentrytitle><manvolnum>5</manvolnum></citerefentry> for
        details.</para>

-        <para>This setting also applies to <command>systemd-oomd</command>, similar to kernel OOM kills
-        this setting determines the state of the service after <command>systemd-oomd</command> kills a cgroup associated
-        with the service.</para></listitem>
+        <para>This setting also applies to <command>systemd-oomd</command>. Similar to kernel OOM kills,
+        this setting determines the state of the service after <command>systemd-oomd</command> kills a cgroup
+        associated with the service.</para></listitem>
      </varlistentry>

    </variablelist>
--- a/src/oom/oomd-manager.c
+++ b/src/oom/oomd-manager.c
@@ -180,13 +180,13 @@ finish:
        return r;
 }

-/* Fill `new_h` with `path`'s descendent OomdCGroupContexts. Only include descendent cgroups that are possible
+/* Fill 'new_h' with 'path's descendant OomdCGroupContexts. Only include descendant cgroups that are possible
 * candidates for action. That is, only leaf cgroups or cgroups with memory.oom.group set to "1".
 *
- * This function ignores most errors in order to handle cgroups that may have been cleaned up while populating
- * the hashmap.
+ * This function ignores most errors in order to handle cgroups that may have been cleaned up while
+ * populating the hashmap.
 *
- * `new_h` is of the form { key: cgroup paths -> value: OomdCGroupContext } */
+ * 'new_h' is of the form { key: cgroup paths -> value: OomdCGroupContext } */
 static int recursively_get_cgroup_context(Hashmap *new_h, const char *path) {
        _cleanup_free_ char *subpath = NULL;
        _cleanup_closedir_ DIR *d = NULL;
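
The comment in the hunk above states the candidate rule: only leaf cgroups, or cgroups with memory.oom.group set to "1", are eligible. The sketch below applies that rule to a hypothetical in-memory tree instead of walking the cgroup filesystem. The struct and function names are invented for illustration. Two choices here are assumptions of the sketch rather than facts from the diff: it does not descend below a cgroup that has memory.oom.group set, and it never reports the starting cgroup itself.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical in-memory model of a cgroup; not a systemd type. */
struct cgroup_node {
        const char *path;
        bool oom_group;                 /* memory.oom.group reads "1" */
        struct cgroup_node **children;
        size_t n_children;
};

/* The rule from the comment: leaf cgroups or cgroups with memory.oom.group set. */
static bool cgroup_is_candidate(const struct cgroup_node *n) {
        return n->n_children == 0 || n->oom_group;
}

/* Collect up to 'max' candidates among the descendants of 'node' into 'out'
 * and return how many were stored. 'node' itself is never reported. */
static size_t collect_candidates(const struct cgroup_node *node,
                                 const struct cgroup_node **out, size_t max) {
        size_t n = 0;

        assert(node);
        assert(out);

        for (size_t i = 0; i < node->n_children; i++) {
                const struct cgroup_node *c = node->children[i];

                if (cgroup_is_candidate(c)) {
                        if (n < max)
                                out[n++] = c;
                } else
                        /* Inner cgroup without memory.oom.group: look further down. */
                        n += collect_candidates(c, out + n, max - n);
        }

        return n;
}
```

For a slice containing a leaf service, a service with memory.oom.group set that has its own child, and a nested slice with one leaf, the sketch reports the leaf service, the grouped service as a whole, and the nested leaf.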
--- a/src/oom/oomd.c
+++ b/src/oom/oomd.c
@@ -170,7 +170,7 @@ static int run(int argc, char *argv[]) {
        assert_se(sigprocmask_many(SIG_BLOCK, NULL, SIGTERM, SIGINT, -1) >= 0);

        if (arg_mem_pressure_usec > 0 && arg_mem_pressure_usec < 1 * USEC_PER_SEC)
-                log_error_errno(SYNTHETIC_ERRNO(EINVAL), "DefaultMemoryPressureDurationSec= must be 0 or at least 1s");
+                return log_error_errno(SYNTHETIC_ERRNO(EINVAL), "DefaultMemoryPressureDurationSec= must be 0 or at least 1s");

        r = manager_new(&m);
        if (r < 0)
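
The one-word fix in the hunk above matters because systemd's log_error_errno() evaluates to the negative errno value it was given, so prefixing it with "return" both logs the message and aborts run(), whereas the old code logged and carried on. A minimal standalone sketch of the same check follows. The function name is hypothetical, and an output string stands in for the logging call.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define USEC_PER_SEC UINT64_C(1000000)

/* Hypothetical stand-in for the check in run(): accepts 0 or any value of at
 * least one second. On failure, stores a message in *ret_msg and returns
 * -EINVAL so the caller can propagate the error instead of continuing. */
static int check_mem_pressure_duration(uint64_t usec, const char **ret_msg) {
        assert(ret_msg);

        if (usec > 0 && usec < 1 * USEC_PER_SEC) {
                *ret_msg = "DefaultMemoryPressureDurationSec= must be 0 or at least 1s";
                return -EINVAL;
        }

        *ret_msg = NULL;
        return 0;
}
```

A caller that writes "r = check_mem_pressure_duration(v, &msg); if (r < 0) return r;" gets the behavior the commit restores: a bad configuration value stops startup.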
Author	SHA1	Message	Date
Zbigniew Jędrzejewski-Szmek	6ef00eb846	Merge pull request #23200 from keszybz/oomd-docs Extend the documentation for oomd a bit	2022-04-28 17:46:03 +02:00
Lennart Poettering	98045d12f6	update TODO	2022-04-28 17:16:33 +02:00
Lennart Poettering	61ade25782	NEWS: updates for 251-rc2	2022-04-28 17:16:33 +02:00
Zbigniew Jędrzejewski-Szmek	4d620b90d9	oomd: "descendent" → "descendant" The latter is the common spelling apparently.	2022-04-28 15:46:44 +02:00
Zbigniew Jędrzejewski-Szmek	3b18f3017c	man: direct users to systemd-oomd if they read about OOMPolicy OOMPolicy remains valid, but let's push users for the userspace solution.	2022-04-28 15:46:44 +02:00
Zbigniew Jędrzejewski-Szmek	6f83ea60e9	man: beef up the description of systemd-oomd.service The gist of the description is moved from systemd.resource-control to systemd-oomd man page. Cross-references to OOMPolicy, memory.oom.group, oomctl, ManagedOOMSwap and ManagedOOMMemoryPressure are added in all places. The descriptions are also more down-to-earth: instead of talking about "taking action" let's just say "kill". We might add configuration for different actions in the future, but we're not there yet, so let's just describe what we do now.	2022-04-28 15:46:44 +02:00
Zbigniew Jędrzejewski-Szmek	c0a96b1b1d	oomd: actually fail if configuration is bad Follow-up for a858355e4a7168625ec1b9e5d17fdb6a11dfecb8.	2022-04-26 08:54:39 +02:00