mirror of https://github.com/systemd/systemd synced 2026-04-25 16:34:50 +02:00

Compare commits

...

7 Commits

Author SHA1 Message Date
Zbigniew Jędrzejewski-Szmek
6ef00eb846
Merge pull request #23200 from keszybz/oomd-docs
Extend the documentation for oomd a bit
2022-04-28 17:46:03 +02:00
Lennart Poettering
98045d12f6 update TODO 2022-04-28 17:16:33 +02:00
Lennart Poettering
61ade25782 NEWS: updates for 251-rc2 2022-04-28 17:16:33 +02:00
Zbigniew Jędrzejewski-Szmek
4d620b90d9 oomd: "descendent" → "descendant"
The latter is the common spelling apparently.
2022-04-28 15:46:44 +02:00
Zbigniew Jędrzejewski-Szmek
3b18f3017c man: direct users to systemd-oomd if they read about OOMPolicy
OOMPolicy remains valid, but let's push users for the userspace solution.
2022-04-28 15:46:44 +02:00
Zbigniew Jędrzejewski-Szmek
6f83ea60e9 man: beef up the description of systemd-oomd.service
The gist of the description is moved from systemd.resource-control
to systemd-oomd man page. Cross-references to OOMPolicy, memory.oom.group,
oomctl, ManagedOOMSwap and ManagedOOMMemoryPressure are added in all
places.

The descriptions are also more down-to-earth: instead of talking
about "taking action" let's just say "kill". We *might* add configuration
for different actions in the future, but we're not there yet, so let's
just describe what we do now.
2022-04-28 15:46:44 +02:00
Zbigniew Jędrzejewski-Szmek
c0a96b1b1d oomd: actually fail if configuration is bad
Follow-up for a858355e4a7168625ec1b9e5d17fdb6a11dfecb8.
2022-04-26 08:54:39 +02:00
7 changed files with 190 additions and 60 deletions

91
NEWS

@ -4,7 +4,7 @@ CHANGES WITH 251:
Backwards-incompatible changes:
-* The minimum kernel version required has been bumped from 3.13 to 3.15,
+* The minimum kernel version required has been bumped from 3.13 to 4.15,
and CLOCK_BOOTTIME is now assumed to always exist.
* C11 with GNU extensions (aka "gnu11") is now used to build our
@ -204,6 +204,19 @@ CHANGES WITH 251:
similar to sd_id128_to_string() but formats the ID in RFC 4122 UUID
format instead of simple series of hex characters.
* The sd-device API gained two new calls sd_device_new_from_devname()
and sd_device_new_from_path() which permit allocating an sd_device
object from a device node name or file system path.
* sd-device also gained a new call sd_device_open() which will open the
device node associated with a device for which an sd_device object
has been allocated. The call is supposed to address races around
device nodes being removed/recycled due to hotplug events, or media
change events: the call checks internally whether the major/minor of
the device node and the "diskseq" (in case of block devices) match
with the metadata loaded in the sd_device object, thus ensuring that
the device once opened really matches the provided sd_device object.
Changes in PID1, systemctl, and systemd-oomd:
* A new set of service monitor environment variables will be passed to
@ -280,6 +293,32 @@ CHANGES WITH 251:
necessary to fix this aspect. Absolute links are interpreted as
before, and it is still possible to create them via other means.
* A new "taint" flag named "old-kernel" is introduced which is set when
the kernel systemd runs on is older than the current baseline version
(see above). The flag is shown in "systemctl status" output.
* Two additional taint flags "short-uid-range" and "short-gid-range"
have been added as well, which are set when systemd notices it is run
within a userns namespace that does not define the full 0…65535 UID
range.
* A new "unmerged-usr" taint flag has been added that is set whenever
running on systems where /bin/ + /sbin/ are *not* symlinks to their
counterparts in /usr/, i.e. on systems where the /usr/-merge has not
been completed.
* Generators invoked by PID 1 will now have a couple of useful
environment variables set describing the execution context a
bit. $SYSTEMD_SCOPE encodes whether the generator is called from the
system service manager, or from the per-user service
manager. $SYSTEMD_IN_INITRD encodes whether the generator is invoked
in initrd context or on the host. $SYSTEMD_FIRST_BOOT encodes whether
systemd considers the current boot to be a "first"
boot. $SYSTEMD_VIRTUALIZATION encodes whether virtualization is
detected and which type of hypervisor/container
manager. $SYSTEMD_ARCHITECTURE indicates which architecture the
kernel is built for.
Changes in systemd-journald:
* The journal JSON export format has been added to the list of stable
@ -311,6 +350,32 @@ CHANGES WITH 251:
already-initialized devices, and only devices which haven't been
initialized yet, respectively.
* udevadm gained a new "wait" command for safely waiting for a specific
device to show up in the udev device database. This is useful in
scripts that asynchronously allocate a block device (e.g. through
repartitioning, or allocating a loopback device or similar) and need
to wait for the creation to complete.
* udevadm gained a new "lock" command for locking one or more block
devices while formatting them or writing a partition table to them. It is
an implementation of https://systemd.io/BLOCK_DEVICE_LOCKING and
usable in scripts dealing with block devices.
* udevadm info will show a couple of additional device fields in its
output, and will now apply a limited set of coloring to line types.
* udevadm info --tree will now show a tree of objects (i.e. devices and
suchlike) in the /sys/ hierarchy.
* Block devices will now get a new set of device symlinks in
/dev/disk/by-diskseq/<nr>, which may be used to reference block
device nodes via the kernel's "diskseq" value. Note that opening a
device through such a symlink does not by itself guarantee that the
opened device actually matches the specified diskseq value. To be
safe against races, the actual diskseq value of the opened device
(BLKGETDISKSEQ ioctl()) must still be compared with the one in the
symlink path.
* .link files gained support for setting MDI/MDI-X on a link.
* .link files gained support for [Match] Firmware= setting to match on
@ -377,6 +442,10 @@ CHANGES WITH 251:
used, to ensure that communication between CPU and discrete TPM chips
cannot be eavesdropped to acquire disk encryption keys.
* A new switch --fido2-credential-algorithm= has been added to
systemd-cryptenroll allowing selection of the credential algorithm to
use when binding encryption to FIDO2 tokens.
Changes in systemd-hostnamed:
* HARDWARE_VENDOR= and HARDWARE_MODEL= can be set in /etc/machine-info
@ -387,7 +456,9 @@ CHANGES WITH 251:
hostnamed.
* hostnamed's D-Bus interface gained a new method GetHardwareSerial()
-for reading the hardware serial number, as reportd by DMI.
+for reading the hardware serial number, as reported by DMI. It also
+exposes a new D-Bus property FirmwareVersion that encodes the
+firmware version of the system.
Changes in other components:
@ -404,6 +475,22 @@ CHANGES WITH 251:
used to set the default shell for user records and nspawn shell
invocations (instead of the default /bin/bash).
* systemd-timesyncd now provides a D-Bus API for receiving NTP server
information dynamically at runtime via IPC.
* The systemd-creds tool gained a new "has-tpm2" verb, which reports
whether a functioning TPM2 infrastructure is available, i.e. if
firmware, kernel driver and systemd all have TPM2 support enabled and
a device is found.
* The systemd-creds tool gained support for generating encrypted
credentials that are using an empty encryption key. While this
provides neither integrity nor confidentiality, it's useful for implementing
code flows that work the same on systems with and without a TPM2 device. The
service manager will only accept credentials "encrypted" that way if
a TPM2 device cannot be detected, to ensure that credentials
"encrypted" like that cannot be used to trick TPM2 systems.
Experimental features:
* sd-boot gained a new *experimental* setting "reboot-for-bitlocker" in
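
The /dev/disk/by-diskseq/<nr> note in the udev changes above asks callers to re-check the diskseq after opening. A minimal sketch of that check in C follows. The helper names parse_diskseq_path() and open_by_diskseq() are hypothetical and not part of systemd, the sketch only handles whole-disk symlinks of the exact form shown in the text, and the -ESTALE return code on a mismatch is an arbitrary choice for illustration.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/fs.h>

#ifndef BLKGETDISKSEQ
/* Fallback for kernel headers that predate the ioctl. */
#define BLKGETDISKSEQ _IOR(0x12, 128, uint64_t)
#endif

/* Hypothetical helper: extract <nr> from "/dev/disk/by-diskseq/<nr>".
 * Returns 0 on success, -EINVAL if the path does not have that exact form. */
int parse_diskseq_path(const char *path, uint64_t *ret) {
        static const char prefix[] = "/dev/disk/by-diskseq/";
        const size_t n = sizeof(prefix) - 1;
        char *end = NULL;

        if (strncmp(path, prefix, n) != 0)
                return -EINVAL;

        const char *p = path + n;
        if (*p < '0' || *p > '9')
                return -EINVAL;

        errno = 0;
        unsigned long long v = strtoull(p, &end, 10);
        if (errno != 0 || *end != '\0')
                return -EINVAL;

        *ret = (uint64_t) v;
        return 0;
}

/* Hypothetical helper: open the symlink, then compare the kernel's diskseq
 * for the opened device with the number encoded in the path.
 * Returns an open fd on success, a negative errno-style value otherwise. */
int open_by_diskseq(const char *path) {
        uint64_t expected = 0, actual = 0;
        int r, fd;

        r = parse_diskseq_path(path, &expected);
        if (r < 0)
                return r;

        fd = open(path, O_RDONLY | O_CLOEXEC | O_NONBLOCK);
        if (fd < 0)
                return -errno;

        if (ioctl(fd, BLKGETDISKSEQ, &actual) < 0) {
                r = -errno;
                close(fd);
                return r;
        }

        if (actual != expected) {
                /* The device node was recycled between lookup and open. */
                close(fd);
                return -ESTALE;
        }

        return fd;
}
```

A caller would treat any negative return as "device gone or replaced" and look the device up again rather than writing to the descriptor.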

18
TODO

@ -78,6 +78,24 @@ Janitorial Clean-ups:
Features:
* TPM2: add auth policy for signed PCR values to make updates easy. i.e. do
what tpm2_policyauthorize tool does. To be truly useful scheme needs to be a
bit more elaborate though: policy probably must take some nvram based
generation counter into account that can only monotonically increase and can
be used to invalidate old PCR signatures. Otherwise people could downgrade to
old signed PCR sets whenever they want. Usecase: encrypt the rootfs with LUKS
with a key that can only be unlocked via a pristine pre-built Fedora
kernel+initrd.
* update HACKING.md to suggest developing systemd with the ideas from:
https://0pointer.net/blog/testing-my-system-code-in-usr-without-modifying-usr.html
https://0pointer.net/blog/running-an-container-off-the-host-usr.html
* add a clear concept how the initrd can make up credentials on their own to
pass to the system when transitioning into the host OS. usecase: things like
cloud-init/ignition and similar can parameterize the host with data they
acquire.
* Add ConditionCredentialExists= or so, that allows conditionalizing services
depending on whether a specific system credential is set. Usecase: a service
similar to the ssh keygen service that installs any SSH host key supplied via


@ -29,23 +29,36 @@
<refsect1>
<title>Description</title>
<para><command>systemd-oomd</command> is a system service that uses cgroups-v2 and pressure stall information (PSI)
to monitor and take action on processes before an OOM occurs in kernel space.</para>
<para><command>systemd-oomd</command> is a system service that uses cgroups-v2 and pressure stall
information (PSI) to monitor and take corrective action before an OOM occurs in the kernel space.</para>
<para>You can enable monitoring and actions on units by setting <varname>ManagedOOMSwap=</varname> and/or
<varname>ManagedOOMMemoryPressure=</varname> to the appropriate value. <command>systemd-oomd</command> will
periodically poll enabled units' cgroup data to detect when corrective action needs to occur. When an action needs
to happen, it will only be performed on the descendant cgroups of the enabled units. More precisely, only cgroups with
<filename>memory.oom.group</filename> set to <constant>1</constant> and leaf cgroup nodes are eligible candidates.
Action will be taken recursively on all of the processes under the chosen candidate.</para>
<para>You can enable monitoring and actions on units by setting <varname>ManagedOOMSwap=</varname> and
<varname>ManagedOOMMemoryPressure=</varname> in the unit configuration, see
<citerefentry><refentrytitle>systemd.resource-control</refentrytitle><manvolnum>5</manvolnum></citerefentry>.
<command>systemd-oomd</command> retrieves information about such units from <command>systemd</command>
when it starts and watches for subsequent changes.</para>
<para>See
<citerefentry><refentrytitle>oomd.conf</refentrytitle><manvolnum>5</manvolnum></citerefentry>
<para>Cgroups of units with <varname>ManagedOOMSwap=</varname> or
<varname>ManagedOOMMemoryPressure=</varname> set to <option>kill</option> will be monitored.
<command>systemd-oomd</command> periodically polls PSI statistics for the system and those cgroups to
decide when to take action. If the configured limits are exceeded, <command>systemd-oomd</command> will
select a cgroup to terminate, and send <constant>SIGKILL</constant> to all processes in it. Note that
only descendant cgroups are eligible candidates for killing; the unit with its property set to
<option>kill</option> is not a candidate (unless one of its ancestors set their property to
<option>kill</option>). Also only leaf cgroups and cgroups with <filename>memory.oom.group</filename> set
to <constant>1</constant> are eligible candidates; see <varname>OOMPolicy=</varname> in
<citerefentry><refentrytitle>systemd.service</refentrytitle><manvolnum>5</manvolnum></citerefentry>.
</para>
<para><citerefentry><refentrytitle>oomctl</refentrytitle><manvolnum>1</manvolnum></citerefentry> can
be used to list monitored cgroups and pressure information.</para>
<para>See <citerefentry><refentrytitle>oomd.conf</refentrytitle><manvolnum>5</manvolnum></citerefentry>
for more information about the configuration of this service.</para>
</refsect1>
<refsect1>
<title>Setup Information</title>
<title>System requirements and configuration</title>
<para>The system must be running systemd with a full unified cgroup hierarchy for the expected cgroups-v2 features.
Furthermore, memory accounting must be turned on for all units monitored by <command>systemd-oomd</command>.
@ -53,23 +66,25 @@
is set to <constant>true</constant> in
<citerefentry><refentrytitle>systemd-system.conf</refentrytitle><manvolnum>5</manvolnum></citerefentry>.</para>
<para>You will need a kernel compiled with PSI support. This is available in Linux 4.20 and above.</para>
<para>The kernel must be compiled with PSI support. This is available in Linux 4.20 and above.</para>
<para>It is highly recommended for the system to have swap enabled for <command>systemd-oomd</command> to function
optimally. With swap enabled, the system spends enough time swapping pages to let <command>systemd-oomd</command> react.
Without swap, the system enters a livelocked state much more quickly and may prevent <command>systemd-oomd</command>
from responding in a reasonable amount of time. See
<ulink url="https://chrisdown.name/2018/01/02/in-defence-of-swap.html">"In defence of swap: common misconceptions"</ulink>
for more details on swap. Any swap-based actions on systems without swap will be ignored. While
<command>systemd-oomd</command> can perform pressure-based actions on a system without swap, the pressure increases
will be more abrupt and may require more tuning to get the desired thresholds and behavior.</para>
<para>It is highly recommended for the system to have swap enabled for <command>systemd-oomd</command> to
function optimally. With swap enabled, the system spends enough time swapping pages to let
<command>systemd-oomd</command> react. Without swap, the system enters a livelocked state much more
quickly and may prevent <command>systemd-oomd</command> from responding in a reasonable amount of
time. See <ulink url="https://chrisdown.name/2018/01/02/in-defence-of-swap.html">"In defence of swap:
common misconceptions"</ulink> for more details on swap. Any swap-based actions on systems without swap
will be ignored. While <command>systemd-oomd</command> can perform pressure-based actions on such a
system, the pressure increases will be more abrupt and may require more tuning to get the desired
thresholds and behavior.</para>
<para>Be aware that if you intend to enable monitoring and actions on <filename>user.slice</filename>,
<filename>user-$UID.slice</filename>, or their ancestor cgroups, it is highly recommended that your programs be
managed by the systemd user manager to prevent running too many processes under the same session scope (and thus
avoid a situation where memory intensive tasks trigger <command>systemd-oomd</command> to kill everything under the
cgroup). If you're using a desktop environment like GNOME, it already spawns many session components with the
systemd user manager.</para>
<filename>user-$UID.slice</filename>, or their ancestor cgroups, it is highly recommended that your
programs be managed by the systemd user manager to prevent running too many processes under the same
session scope (and thus avoid a situation where memory intensive tasks trigger
<command>systemd-oomd</command> to kill everything under the cgroup). If you're using a desktop
environment like GNOME or KDE, it already spawns many session components with the systemd user manager.
</para>
</refsect1>
<refsect1>
@ -79,11 +94,11 @@
<filename>-.slice</filename>, and allowing all descendant cgroups to be eligible candidates may make the most
sense.</para>
<para><varname>ManagedOOMMemoryPressure=</varname> tends to work better on the cgroups below the root slice
<filename>-.slice</filename>. For units which tend to have processes that are less latency sensitive (e.g.
<filename>system.slice</filename>), a higher limit like the default of 60% may be acceptable, as those processes
can usually ride out slowdowns caused by lack of memory without serious consequences. However, something like
<filename>user@$UID.service</filename> may prefer a much lower value like 40%.</para>
<para><varname>ManagedOOMMemoryPressure=</varname> tends to work better on the cgroups below the root
slice. For units which tend to have processes that are less latency sensitive (e.g.
<filename>system.slice</filename>), a higher limit like the default of 60% may be acceptable, as those
processes can usually ride out slowdowns caused by lack of memory without serious consequences. However,
something like <filename>user@$UID.service</filename> may prefer a much lower value like 40%.</para>
</refsect1>
<refsect1>
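
The man page text above says systemd-oomd polls PSI statistics and compares them with configured limits such as the 60% default. As a rough illustration of what such a poll involves, the sketch below parses one line in the kernel's PSI format (as found in /proc/pressure/memory or a cgroup's memory.pressure file) and compares it with a limit. The struct and function names are hypothetical, and the choice of the avg10 field is an assumption made for the example; the text does not say which average systemd-oomd evaluates.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>

/* One line of PSI output, e.g.
 *   full avg10=61.50 avg60=12.25 avg300=3.00 total=123456 */
struct psi_line {
        char kind[8];             /* "some" or "full" */
        double avg10;             /* percent of time stalled, 10s window */
        double avg60;             /* same, 60s window */
        double avg300;            /* same, 300s window */
        unsigned long long total; /* total stall time in microseconds */
};

/* Hypothetical helper: parse a single PSI line.
 * Returns 0 on success, -EINVAL if the line is not in the expected format. */
int parse_psi_line(const char *line, struct psi_line *ret) {
        struct psi_line p;

        int n = sscanf(line, "%7s avg10=%lf avg60=%lf avg300=%lf total=%llu",
                       p.kind, &p.avg10, &p.avg60, &p.avg300, &p.total);
        if (n != 5)
                return -EINVAL;

        *ret = p;
        return 0;
}

/* Assumption for illustration: compare the 10s average with the limit. */
bool pressure_exceeds(const struct psi_line *p, double limit_percent) {
        return p->avg10 > limit_percent;
}
```

The real service additionally requires the pressure to stay above the limit for a configured duration, which this sketch leaves out.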


@ -1108,24 +1108,24 @@ DeviceAllow=/dev/loop-control
<citerefentry><refentrytitle>systemd-oomd.service</refentrytitle><manvolnum>8</manvolnum></citerefentry>
will act on this unit's cgroups. Defaults to <option>auto</option>.</para>
<para>When set to <option>kill</option>, <command>systemd-oomd</command> will actively monitor this unit's
cgroup metrics to decide whether it needs to act. If the cgroup passes the limits set by
<citerefentry><refentrytitle>oomd.conf</refentrytitle><manvolnum>5</manvolnum></citerefentry> or its
overrides, <command>systemd-oomd</command> will send a <constant>SIGKILL</constant> to all of the processes
under the chosen candidate cgroup. Note that only descendant cgroups can be eligible candidates for killing;
the unit that set its property to <option>kill</option> is not a candidate (unless one of its ancestors set
their property to <option>kill</option>). You can find more details on candidates and kill behavior at
<para>When set to <option>kill</option>, the unit becomes a candidate for monitoring by
<command>systemd-oomd</command>. If the cgroup passes the limits set by
<citerefentry><refentrytitle>oomd.conf</refentrytitle><manvolnum>5</manvolnum></citerefentry> or
the unit configuration, <command>systemd-oomd</command> will select a descendant cgroup and send
<constant>SIGKILL</constant> to all of the processes under it. You can find more details on
candidates and kill behavior at
<citerefentry><refentrytitle>systemd-oomd.service</refentrytitle><manvolnum>8</manvolnum></citerefentry>
and <citerefentry><refentrytitle>oomd.conf</refentrytitle><manvolnum>5</manvolnum></citerefentry>. Setting
either of these properties to <option>kill</option> will also automatically acquire
<varname>After=</varname> and <varname>Wants=</varname> dependencies on
<filename>systemd-oomd.service</filename> unless <varname>DefaultDependencies=no</varname>.
</para>
and
<citerefentry><refentrytitle>oomd.conf</refentrytitle><manvolnum>5</manvolnum></citerefentry>.</para>
<para>When set to <option>auto</option>, <command>systemd-oomd</command> will not actively use this cgroup's
data for monitoring and detection. However, if an ancestor cgroup has one of these properties set to
<option>kill</option>, a unit with <option>auto</option> can still be an eligible candidate for
<command>systemd-oomd</command> to act on.</para>
<para>Setting either of these properties to <option>kill</option> will also result in
<varname>After=</varname> and <varname>Wants=</varname> dependencies on
<filename>systemd-oomd.service</filename> unless <varname>DefaultDependencies=no</varname>.</para>
<para>When set to <option>auto</option>, <command>systemd-oomd</command> will not actively use this
cgroup's data for monitoring and detection. However, if an ancestor cgroup has one of these
properties set to <option>kill</option>, a unit with <option>auto</option> can still be a candidate
for <command>systemd-oomd</command> to terminate.</para>
</listitem>
</varlistentry>


@ -1123,15 +1123,25 @@
<varlistentry>
<term><varname>OOMPolicy=</varname></term>
<listitem><para>Configure the Out-Of-Memory (OOM) killer policy. On Linux, when memory becomes scarce
the kernel might decide to kill a running process in order to free up memory and reduce memory
<listitem><para>Configure the out-of-memory (OOM) kernel killer policy. Note that the userspace OOM
killer
<citerefentry><refentrytitle>systemd-oomd.service</refentrytitle><manvolnum>8</manvolnum></citerefentry>
is a more flexible solution that aims to prevent out-of-memory situations for the userspace, not just
the kernel.</para>
<para>On Linux, when memory becomes scarce to the point that the kernel has trouble allocating memory
for itself, it might decide to kill a running process in order to free up memory and reduce memory
pressure. This setting takes one of <constant>continue</constant>, <constant>stop</constant> or
<constant>kill</constant>. If set to <constant>continue</constant> and a process of the service is
killed by the kernel's OOM killer this is logged but the service continues running. If set to
<constant>stop</constant> the event is logged but the service is terminated cleanly by the service
manager. If set to <constant>kill</constant> and one of the service's processes is killed by the OOM
killer the kernel is instructed to kill all remaining processes of the service, too. Defaults to the
setting <varname>DefaultOOMPolicy=</varname> in
killer the kernel is instructed to kill all remaining processes of the service too, by setting the
<filename>memory.oom.group</filename> attribute to <constant>1</constant>; also see <ulink
url="https://www.kernel.org/doc/html/latest/admin-guide/cgroup-v2.html">kernel documentation</ulink>.
</para>
<para>Defaults to the setting <varname>DefaultOOMPolicy=</varname> in
<citerefentry><refentrytitle>systemd-system.conf</refentrytitle><manvolnum>5</manvolnum></citerefentry>
is set to, except for services where <varname>Delegate=</varname> is turned on, where it defaults to
<constant>continue</constant>.</para>
@ -1142,9 +1152,9 @@
<citerefentry><refentrytitle>systemd.exec</refentrytitle><manvolnum>5</manvolnum></citerefentry> for
details.</para>
<para>This setting also applies to <command>systemd-oomd</command>, similar to kernel OOM kills
this setting determines the state of the service after <command>systemd-oomd</command> kills a cgroup associated
with the service.</para></listitem>
<para>This setting also applies to <command>systemd-oomd</command>. Similar to the kernel OOM kills,
this setting determines the state of the service after <command>systemd-oomd</command> kills a cgroup
associated with the service.</para></listitem>
</varlistentry>
</variablelist>


@ -180,13 +180,13 @@ finish:
return r;
}
-/* Fill `new_h` with `path`'s descendent OomdCGroupContexts. Only include descendent cgroups that are possible
+/* Fill 'new_h' with 'path's descendant OomdCGroupContexts. Only include descendant cgroups that are possible
  * candidates for action. That is, only leaf cgroups or cgroups with memory.oom.group set to "1".
  *
- * This function ignores most errors in order to handle cgroups that may have been cleaned up while populating
- * the hashmap.
+ * This function ignores most errors in order to handle cgroups that may have been cleaned up while
+ * populating the hashmap.
  *
- * `new_h` is of the form { key: cgroup paths -> value: OomdCGroupContext } */
+ * 'new_h' is of the form { key: cgroup paths -> value: OomdCGroupContext } */
static int recursively_get_cgroup_context(Hashmap *new_h, const char *path) {
_cleanup_free_ char *subpath = NULL;
_cleanup_closedir_ DIR *d = NULL;
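
The comment above states the candidate rule: only leaf cgroups or cgroups with memory.oom.group set to "1" are eligible. The following self-contained sketch applies that rule to an in-memory tree instead of the real cgroup filesystem. The cg_node type and collect_candidates() are hypothetical stand-ins, not the systemd implementation, and the decision to stop descending once a cgroup with memory.oom.group=1 is found is an assumption of this sketch.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical in-memory stand-in for a cgroup directory. */
struct cg_node {
        const char *path;               /* cgroup path */
        bool oom_group;                 /* memory.oom.group == 1 */
        const struct cg_node *children; /* array of child cgroups */
        size_t n_children;
};

/* Collect the paths of eligible descendants of 'node' into 'out' (at most
 * 'max' entries). 'node' itself is never added, matching the documented
 * rule that only descendant cgroups are candidates. Returns the number of
 * entries written. */
size_t collect_candidates(const struct cg_node *node, const char **out, size_t max) {
        size_t n = 0;

        for (size_t i = 0; i < node->n_children; i++) {
                const struct cg_node *c = &node->children[i];

                if (c->n_children == 0 || c->oom_group) {
                        /* Leaf, or a group that is killed as a whole. */
                        if (n < max)
                                out[n++] = c->path;
                        continue; /* assumption: do not descend further */
                }

                /* Inner node without memory.oom.group: look at its children. */
                n += collect_candidates(c, out + n, max - n);
        }

        return n;
}
```

Unlike the function in the diff, this sketch has no error handling, since nothing here can disappear while the tree is being walked.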


@ -170,7 +170,7 @@ static int run(int argc, char *argv[]) {
assert_se(sigprocmask_many(SIG_BLOCK, NULL, SIGTERM, SIGINT, -1) >= 0);
 if (arg_mem_pressure_usec > 0 && arg_mem_pressure_usec < 1 * USEC_PER_SEC)
-log_error_errno(SYNTHETIC_ERRNO(EINVAL), "DefaultMemoryPressureDurationSec= must be 0 or at least 1s");
+return log_error_errno(SYNTHETIC_ERRNO(EINVAL), "DefaultMemoryPressureDurationSec= must be 0 or at least 1s");
r = manager_new(&m);
if (r < 0)
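
The one-word fix in this hunk matters because a log call on its own only reports the problem, so execution continued with the bad value. The standalone sketch below shows the same pattern outside systemd. log_error_and_return() is a hypothetical stand-in for log_error_errno(), and USEC_PER_SEC is redefined locally so the example compiles on its own.

```c
#include <errno.h>
#include <stdint.h>
#include <stdio.h>

#define USEC_PER_SEC ((uint64_t) 1000000)

/* Hypothetical stand-in for a logging helper that prints a message and
 * hands back the negative error code so callers can write
 * "return log_error_and_return(...)". */
int log_error_and_return(int error, const char *message) {
        fprintf(stderr, "%s\n", message);
        return -error;
}

/* Accepts 0 (disabled) or any duration of at least one second.
 * Returns 0 if the value is acceptable, -EINVAL otherwise. */
int validate_mem_pressure_duration(uint64_t usec) {
        if (usec > 0 && usec < 1 * USEC_PER_SEC)
                /* Without "return" the error would only be logged. */
                return log_error_and_return(EINVAL,
                                "DefaultMemoryPressureDurationSec= must be 0 or at least 1s");

        return 0;
}
```

With the return in place, a value such as 500ms is rejected before the manager is created, which is the behavior the commit message "actually fail if configuration is bad" describes.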