mirror of https://github.com/systemd/systemd synced 2025-10-06 20:24:45 +02:00

Compare commits

13 Commits

Author SHA1 Message Date
Daan De Meyer
9120022587
vmspawn: Run auxiliary daemons inside scope instead of separate service (#38047)
Currently, vmspawn is in this really weird state where vmspawn itself
and qemu will inherit the caller's execution environment but the auxiliary
daemons it spawns will run in a fully pristine environment in the service
manager. In practice, this causes issues as checks for whether auxiliary
daemons are installed happen in the caller's execution environment but they
might not exist in the spawned service's execution environment.

A good example of where this causes issues is trying to use systemd-vmspawn
in our CI. We use mkosi in CI to run systemd-vmspawn in a custom userspace
with all the necessary tools available, but systemd-vmspawn then tries to
spawn services that run these tools using the host userspace, where the
tools are not available or too old and hence systemd-vmspawn fails to start.

Let's make things more consistent and allow using systemd-vmspawn in CI at
the same time by having systemd-vmspawn spawn auxiliary daemons itself
instead of having the service manager spawn them. We use
systemd-socket-activate to still have socket activation for these services,
even though we now spawn them ourselves. To make sure we wait for
systemd-socket-activate to bind to its socket before continuing, we use the
new general fork_notify() helper.

Why not support both "online" and "offline" operation? systemd-vmspawn is not
well tested as is and supporting two completely separate modes for spawning
auxiliary daemons will drastically increase the surface area for bugs. Given
there doesn't seem to be a major benefit to running daemons in services, it
seems better to only support offline operation and not both. Should we want
separate resource control for the auxiliary daemons in the future, we can
move them into separate scopes if needed.
2025-07-14 16:51:18 +02:00
DaanDeMeyer
852de7ed70 nspawn: Prepare --bind-user= logic for reuse in systemd-vmspawn
Aside from the usual boilerplate of moving the shared logic to shared/,
we also rework the implementation of --bind-user= to be similar to what
we'll do in systemd-vmspawn. Instead of messing with the nspawn container
user namespace, we use idmapped mounts to map the user's home directory on
the host to the mapped uid in the container.

Ideally we'd also use the "userdb.transient" credentials to provision the
user records, but this would only work for booted containers, whereas the
current logic works for non-booted containers as well.

Aside from being similar to how we'll implement --bind-user= in vmspawn,
using idmapped mounts also allows supporting --bind-user= without having to
use --private-users=.
2025-07-14 16:25:22 +02:00
DaanDeMeyer
c81fa16ddf vmspawn: Run auxiliary daemons inside scope instead of separate service
Currently, vmspawn is in this really weird state where vmspawn itself
and qemu will inherit the caller's execution environment but the auxiliary
daemons it spawns will run in a fully pristine environment in the service
manager. In practice, this causes issues as checks for whether auxiliary
daemons are installed happen in the caller's execution environment but they
might not exist in the spawned service's execution environment.

A good example of where this causes issues is trying to use systemd-vmspawn
in our CI. We use mkosi in CI to run systemd-vmspawn in a custom userspace
with all the necessary tools available, but systemd-vmspawn then tries to
spawn services that run these tools using the host userspace, where the
tools are not available or too old and hence systemd-vmspawn fails to start.

Let's make things more consistent and allow using systemd-vmspawn in CI at
the same time by having systemd-vmspawn spawn auxiliary daemons itself
instead of having the service manager spawn them. We use
systemd-socket-activate to still have socket activation for these services,
even though we now spawn them ourselves. To make sure we wait for
systemd-socket-activate to bind to its socket before continuing, we use the
new general fork_notify() helper.

Why not support both "online" and "offline" operation? systemd-vmspawn is not
well tested as is and supporting two completely separate modes for spawning
auxiliary daemons will drastically increase the surface area for bugs. Given
there doesn't seem to be a major benefit to running daemons in services, it
seems better to only support offline operation and not both. Should we want
separate resource control for the auxiliary daemons in the future, we can
move them into separate scopes if needed.

As a bonus, this approach allows us to get rid of the extra complexity of
having to fork off the qemu process first so we can allocate a scope for it
that the other services bind to. This means large parts of
0fc45c8d20ad46ab9be0d8f29b16e606e0dd44ca are reverted by this commit.
2025-07-14 15:07:48 +02:00
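The "wait until the child has bound its socket" step described in this commit message can be sketched in plain C. This is not systemd's fork_notify() (its signature does not appear in this listing); `spawn_and_wait_ready()`, `terminate_child()` and the two `setup_*` callbacks are hypothetical names, and a pipe stands in for the notification socket. The sketch only shows the ordering guarantee: the parent does not continue until the child reports readiness, and it notices when the child dies first.

```c
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical stand-ins for "bind the listening socket" in the child. */
static int setup_ok(void) { return 0; }
static int setup_fail(void) { return -1; }

/* Fork a child, run setup() in it and block until it reports "READY=1".
 * Returns the child's PID on success, -1 if it never became ready. */
static pid_t spawn_and_wait_ready(int (*setup)(void)) {
        int pipefd[2];
        char buf[16] = { 0 };
        ssize_t n;

        assert(setup);

        if (pipe(pipefd) < 0)
                return -1;

        pid_t pid = fork();
        if (pid < 0) {
                close(pipefd[0]);
                close(pipefd[1]);
                return -1;
        }

        if (pid == 0) { /* child */
                static const char msg[] = "READY=1";

                close(pipefd[0]);
                if (setup() < 0)
                        _exit(1); /* pipe closes without a message, parent sees EOF */
                if (write(pipefd[1], msg, sizeof(msg) - 1) < 0)
                        _exit(1);
                close(pipefd[1]);
                pause(); /* stand-in for serving requests until terminated */
                _exit(0);
        }

        /* parent: close our copy of the write end so EOF is visible */
        close(pipefd[1]);
        do
                n = read(pipefd[0], buf, sizeof(buf) - 1);
        while (n < 0 && errno == EINTR);
        close(pipefd[0]);

        if (n <= 0 || strcmp(buf, "READY=1") != 0) {
                kill(pid, SIGKILL);
                waitpid(pid, NULL, 0);
                return -1;
        }

        return pid;
}

/* Ask the child to stop and reap it. */
static void terminate_child(pid_t pid) {
        kill(pid, SIGTERM);
        waitpid(pid, NULL, 0);
}
```

A caller would invoke `spawn_and_wait_ready()` once per auxiliary daemon and keep the returned PID around so the daemon can be terminated when the VM exits.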
DaanDeMeyer
e4691ebb49 fork-journal: Generalize to fork-notify
Most of the logic isn't journalctl specific, let's generalize a bit
so we can reuse this for other commands as well.
2025-07-14 15:07:48 +02:00
DaanDeMeyer
be3f7aaf44 fork-journal: Don't log if process is already gone in journal_terminate() 2025-07-14 15:07:48 +02:00
Daan De Meyer
43d0d111d2
core/cgroup: always submit unit to realize queue if all controllers are being invalidated (#38194) 2025-07-14 15:07:16 +02:00
DaanDeMeyer
a79e94aa58 vmspawn: Pass credentials via files
Credentials data can get potentially very large. Passing it all via
the command line is rather messy. Let's pass all the credential data
via files instead to both make the final command line less verbose
and reduce the chance of us running into command line size limits if
many or large credentials are used.
2025-07-14 14:54:19 +02:00
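As a rough illustration of why files help here: the size of the data no longer affects the command line, only a short path does. The sketch below is an assumption-laden stand-in, not vmspawn's code; `write_credential_file()` and the `/tmp/credential-XXXXXX` template are hypothetical, and how the resulting path is handed to qemu is not shown in this listing.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: store credential data in a freshly created private
 * temporary file and return its path in path_buf. Returns 0 on success,
 * -1 on failure. */
static int write_credential_file(const void *data, size_t size, char *path_buf, size_t path_size) {
        const char *p = data;

        assert(data || size == 0);
        assert(path_buf);

        int n = snprintf(path_buf, path_size, "/tmp/credential-XXXXXX");
        if (n < 0 || (size_t) n >= path_size)
                return -1;

        int fd = mkstemp(path_buf); /* replaces XXXXXX with a unique suffix */
        if (fd < 0)
                return -1;

        while (size > 0) {
                ssize_t k = write(fd, p, size);
                if (k < 0) {
                        if (errno == EINTR)
                                continue;
                        close(fd);
                        unlink(path_buf);
                        return -1;
                }
                p += k;
                size -= (size_t) k;
        }

        if (close(fd) < 0) {
                unlink(path_buf);
                return -1;
        }

        return 0;
}
```

Whatever argument ends up referencing the file then has a length bounded by the path, regardless of whether the credential is a few bytes or several megabytes.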
DaanDeMeyer
e19e17df57 mkosi: Disable systemd-timesyncd by default
It causes quite a bit of debug log noise through repeated DNS lookups, so
let's disable it by default.
2025-07-14 14:51:52 +02:00
DaanDeMeyer
1408505318 meson: Fix missing test dependencies
These tests would fail when executed directly with meson test before
doing a build because the required dependencies are not declared. Let's
fix that.
2025-07-14 13:07:29 +01:00
DaanDeMeyer
b955051244 nspawn: Don't clear idmapping if we're not doing an idmapped mount
We only need to clear the existing idmapping if we're going to be
replacing it with another idmapping. Otherwise we should keep the
existing idmapping in place.
2025-07-14 11:54:56 +01:00
Mike Yuan
e0d3213e09
core/cgroup: always submit unit to realize queue if all controllers are being invalidated
Alternative to #38190
Fixes #38112
2025-07-12 17:41:52 +02:00
Mike Yuan
77af13ffdb
core/cgroup: remove deserialization for "cpuacct-usage-base"
This has been superseded by "cpu-usage-base" ever since
the introduction of cgroup v2. With upgrading and thus
deserializing from cgroup v1 systems becoming impossible,
it is eligible for removal.
2025-07-12 17:41:04 +02:00
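The effect of dropping the legacy key can be sketched with a simplified stand-in for the key matching. This is not the MATCH_DESERIALIZE_IMMEDIATE() macro (its definition is not part of this listing); `deserialize_cpu_usage_base()` and its return convention are assumptions made for illustration only.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: returns 1 if the key matched and the value parsed,
 * 0 if the key is not recognized, -1 if the value is not a valid number. */
static int deserialize_cpu_usage_base(const char *key, const char *value, uint64_t *ret) {
        char *end;

        assert(key);
        assert(value);
        assert(ret);

        /* Only the current key is accepted; "cpuacct-usage-base" falls through. */
        if (strcmp(key, "cpu-usage-base") != 0)
                return 0;

        /* strtoull() would accept leading whitespace and signs, so insist on a digit. */
        if (value[0] < '0' || value[0] > '9')
                return -1;

        errno = 0;
        unsigned long long v = strtoull(value, &end, 10);
        if (errno != 0 || *end != '\0')
                return -1;

        *ret = (uint64_t) v;
        return 1;
}
```

With this shape, a serialized line using the old key is simply treated as unknown rather than being mapped onto the same field.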
Mike Yuan
db54d1a6b7
core/exec-invoke: pass "/" instead of "" for cgroup root 2025-07-12 17:32:32 +02:00
26 changed files with 827 additions and 1125 deletions

View File

@ -1605,10 +1605,8 @@ After=sys-subsystem-net-devices-ens1.device</programlisting>
<orderedlist>
<listitem><para>The user's home directory is bind mounted from the host into
<filename>/run/host/home/</filename>.</para></listitem>
<listitem><para>An additional UID/GID mapping is added that maps the host user's UID/GID to a
container UID/GID, allocated from the 60514…60577 range.</para></listitem>
<filename>/run/host/home/</filename>, using an idmapped mount to map the host user's UID/GID to its
assigned UID/GID in the container.</para></listitem>
<listitem><para>A JSON user and group record is generated in <filename>/run/userdb/</filename> that
describes the mapped user. It contains a minimized representation of the host's user record,
@ -1644,9 +1642,6 @@ After=sys-subsystem-net-devices-ens1.device</programlisting>
the container's <filename>/etc/passwd</filename> and <filename>/etc/group</filename>, and thus might
not detect existing accounts in other databases.</para>
<para>This operation is only supported in combination with
<option>--private-users=</option>/<option>-U</option>.</para>
<xi:include href="version-info.xml" xpointer="v249"/></listitem>
</varlistentry>

View File

@ -32,8 +32,8 @@ disable dnf-makecache.*
# We have journald to receive audit data so let's make sure we're not running auditd as well
disable auditd.service
# systemd-timesyncd is not enabled by default in the default systemd preset so enable it here instead.
enable systemd-timesyncd.service
# systemd-timesyncd causes quite a bit of debug log noise so let's disable it by default.
disable systemd-timesyncd.service
# Enabled by default on OpenSUSE and not conditioned out in containers, so let's disable these here instead.
disable iscsi.service

View File

@ -291,6 +291,8 @@ typedef struct ImagePolicy ImagePolicy;
typedef struct InstallInfo InstallInfo;
typedef struct LookupPaths LookupPaths;
typedef struct LoopDevice LoopDevice;
typedef struct MachineBindUserContext MachineBindUserContext;
typedef struct MachineCredentialContext MachineCredentialContext;
typedef struct MountOptions MountOptions;
typedef struct OpenFile OpenFile;
typedef struct Pkcs11EncryptedKey Pkcs11EncryptedKey;

View File

@ -1,7 +1,7 @@
# SPDX-License-Identifier: LGPL-2.1-or-later
efi_config_h_dir = meson.current_build_dir()
efi_addon = ''
efi_addon = []
libefitest = static_library(
'efitest',
@ -466,12 +466,12 @@ foreach efi_elf_binary : efi_elf_binaries
# This is supposed to match exactly one time
if name == 'addon@0@.efi.stub'.format(efi_arch)
efi_addon = exe.full_path()
efi_addon = [exe]
endif
test('check-alignment-@0@'.format(name),
check_efi_alignment_py,
args : exe.full_path(),
args : exe,
suite : 'boot')
endforeach

View File

@ -3950,7 +3950,11 @@ bool unit_invalidate_cgroup(Unit *u, CGroupMask m) {
if (!crt)
return false;
if (FLAGS_SET(crt->cgroup_invalidated_mask, m)) /* NOP? */
/* If all controllers shall be invalidated, let's unconditionally submit the unit to realize queue.
* We initialize the field to _CGROUP_MASK_ALL after all, and semantically it makes sense to use
* it as a special signal to forcibly re-realize cgroup. */
if (m != _CGROUP_MASK_ALL &&
FLAGS_SET(crt->cgroup_invalidated_mask, m)) /* NOP? */
return false;
crt->cgroup_invalidated_mask |= m;
@ -4380,8 +4384,7 @@ int cgroup_runtime_deserialize_one(Unit *u, const char *key, const char *value,
if (!UNIT_HAS_CGROUP_CONTEXT(u))
return 0;
if (MATCH_DESERIALIZE_IMMEDIATE(u, "cpu-usage-base", key, value, safe_atou64, cpu_usage_base) ||
MATCH_DESERIALIZE_IMMEDIATE(u, "cpuacct-usage-base", key, value, safe_atou64, cpu_usage_base))
if (MATCH_DESERIALIZE_IMMEDIATE(u, "cpu-usage-base", key, value, safe_atou64, cpu_usage_base))
return 1;
if (MATCH_DESERIALIZE_IMMEDIATE(u, "cpu-usage-last", key, value, safe_atou64, cpu_usage_last))
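The changed early-return condition in unit_invalidate_cgroup() can be exercised in isolation. The sketch below is a reduced model under stated assumptions: `Mask`, `MASK_ALL` and `invalidate()` are hypothetical stand-ins for CGroupMask, _CGROUP_MASK_ALL and the real function, the FLAGS_SET() definition is assumed, and returning true stands for "the unit gets queued for realization", which is the part of the function this diff does not show.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t Mask;

/* Hypothetical stand-in for _CGROUP_MASK_ALL: all controller bits set. */
#define MASK_ALL ((Mask) 0xff)

/* Assumed definition: true if every bit of 'flags' is set in 'v'. */
#define FLAGS_SET(v, flags) (((v) & (flags)) == (flags))

/* Returns true if the caller should (re-)queue the unit. */
static bool invalidate(Mask *invalidated, Mask m) {
        assert(invalidated);

        /* Requesting everything always goes through, even if all bits are
         * already marked (which is also the initial state of the field). */
        if (m != MASK_ALL && FLAGS_SET(*invalidated, m))
                return false; /* NOP: these bits are already marked */

        *invalidated |= m;
        return true;
}
```

Before the change, a field initialized to "all" would have turned a request to invalidate everything into a no-op; with the extra comparison it acts as a forced re-realize signal.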

View File

@ -5552,7 +5552,7 @@ int exec_invoke(
* memory_pressure_path != NULL in the conditional below. */
if (memory_pressure_path && needs_sandboxing && exec_needs_cgroup_namespace(context)) {
memory_pressure_path = mfree(memory_pressure_path);
r = cg_get_path("memory", "", "memory.pressure", &memory_pressure_path);
r = cg_get_path("memory", "/", "memory.pressure", &memory_pressure_path);
if (r < 0) {
*exit_status = EXIT_MEMORY;
return log_oom();

View File

@ -1,338 +1,21 @@
/* SPDX-License-Identifier: LGPL-2.1-or-later */
#include <grp.h>
#include <pwd.h>
#include <unistd.h>
#include "sd-json.h"
#include "alloc-util.h"
#include "chase.h"
#include "fd-util.h"
#include "fileio.h"
#include "format-util.h"
#include "json-util.h"
#include "log.h"
#include "nspawn-mount.h"
#include "nspawn.h"
#include "machine-bind-user.h"
#include "nspawn-bind-user.h"
#include "user-record.h"
#include "group-record.h"
#include "path-util.h"
#include "string-util.h"
#include "strv.h"
#include "user-util.h"
#include "userdb.h"
static int check_etc_passwd_collisions(
const char *directory,
const char *name,
uid_t uid) {
_cleanup_fclose_ FILE *f = NULL;
int r;
assert(directory);
assert(name || uid_is_valid(uid));
r = chase_and_fopen_unlocked("/etc/passwd", directory, CHASE_PREFIX_ROOT, "re", NULL, &f);
if (r == -ENOENT)
return 0; /* no user database? then no user, hence no collision */
if (r < 0)
return log_error_errno(r, "Failed to open /etc/passwd of container: %m");
for (;;) {
struct passwd *pw;
r = fgetpwent_sane(f, &pw);
if (r < 0)
return log_error_errno(r, "Failed to iterate through /etc/passwd of container: %m");
if (r == 0) /* EOF */
return 0; /* no collision */
if (name && streq_ptr(pw->pw_name, name))
return 1; /* name collision */
if (uid_is_valid(uid) && pw->pw_uid == uid)
return 1; /* UID collision */
}
}
static int check_etc_group_collisions(
const char *directory,
const char *name,
gid_t gid) {
_cleanup_fclose_ FILE *f = NULL;
int r;
assert(directory);
assert(name || gid_is_valid(gid));
r = chase_and_fopen_unlocked("/etc/group", directory, CHASE_PREFIX_ROOT, "re", NULL, &f);
if (r == -ENOENT)
return 0; /* no group database? then no group, hence no collision */
if (r < 0)
return log_error_errno(r, "Failed to open /etc/group of container: %m");
for (;;) {
struct group *gr;
r = fgetgrent_sane(f, &gr);
if (r < 0)
return log_error_errno(r, "Failed to iterate through /etc/group of container: %m");
if (r == 0)
return 0; /* no collision */
if (name && streq_ptr(gr->gr_name, name))
return 1; /* name collision */
if (gid_is_valid(gid) && gr->gr_gid == gid)
return 1; /* gid collision */
}
}
static int convert_user(
const char *directory,
UserRecord *u,
GroupRecord *g,
uid_t allocate_uid,
const char *shell,
bool shell_copy,
UserRecord **ret_converted_user,
GroupRecord **ret_converted_group) {
_cleanup_(group_record_unrefp) GroupRecord *converted_group = NULL;
_cleanup_(user_record_unrefp) UserRecord *converted_user = NULL;
_cleanup_free_ char *h = NULL;
sd_json_variant *p, *hp = NULL, *ssh = NULL;
int r;
assert(u);
assert(g);
assert(user_record_gid(u) == g->gid);
if (shell_copy)
shell = u->shell;
r = check_etc_passwd_collisions(directory, u->user_name, UID_INVALID);
if (r < 0)
return r;
if (r > 0)
return log_error_errno(SYNTHETIC_ERRNO(EBUSY),
"Sorry, the user '%s' already exists in the container.", u->user_name);
r = check_etc_group_collisions(directory, g->group_name, GID_INVALID);
if (r < 0)
return r;
if (r > 0)
return log_error_errno(SYNTHETIC_ERRNO(EBUSY),
"Sorry, the group '%s' already exists in the container.", g->group_name);
h = path_join("/run/host/home/", u->user_name);
if (!h)
return log_oom();
/* Acquire the source hashed password array as-is, so that it retains the JSON_VARIANT_SENSITIVE flag */
p = sd_json_variant_by_key(u->json, "privileged");
if (p) {
hp = sd_json_variant_by_key(p, "hashedPassword");
ssh = sd_json_variant_by_key(p, "sshAuthorizedKeys");
}
r = user_record_build(
&converted_user,
SD_JSON_BUILD_OBJECT(
SD_JSON_BUILD_PAIR("userName", SD_JSON_BUILD_STRING(u->user_name)),
SD_JSON_BUILD_PAIR("uid", SD_JSON_BUILD_UNSIGNED(allocate_uid)),
SD_JSON_BUILD_PAIR("gid", SD_JSON_BUILD_UNSIGNED(allocate_uid)),
SD_JSON_BUILD_PAIR_CONDITION(u->disposition >= 0, "disposition", SD_JSON_BUILD_STRING(user_disposition_to_string(u->disposition))),
SD_JSON_BUILD_PAIR("homeDirectory", SD_JSON_BUILD_STRING(h)),
SD_JSON_BUILD_PAIR("service", JSON_BUILD_CONST_STRING("io.systemd.NSpawn")),
JSON_BUILD_PAIR_STRING_NON_EMPTY("shell", shell),
SD_JSON_BUILD_PAIR("privileged", SD_JSON_BUILD_OBJECT(
SD_JSON_BUILD_PAIR_CONDITION(!strv_isempty(u->hashed_password), "hashedPassword", SD_JSON_BUILD_VARIANT(hp)),
SD_JSON_BUILD_PAIR_CONDITION(!!ssh, "sshAuthorizedKeys", SD_JSON_BUILD_VARIANT(ssh))))));
if (r < 0)
return log_error_errno(r, "Failed to build container user record: %m");
r = group_record_build(
&converted_group,
SD_JSON_BUILD_OBJECT(
SD_JSON_BUILD_PAIR("groupName", SD_JSON_BUILD_STRING(g->group_name)),
SD_JSON_BUILD_PAIR("gid", SD_JSON_BUILD_UNSIGNED(allocate_uid)),
SD_JSON_BUILD_PAIR_CONDITION(g->disposition >= 0, "disposition", SD_JSON_BUILD_STRING(user_disposition_to_string(g->disposition))),
SD_JSON_BUILD_PAIR("service", JSON_BUILD_CONST_STRING("io.systemd.NSpawn"))));
if (r < 0)
return log_error_errno(r, "Failed to build container group record: %m");
*ret_converted_user = TAKE_PTR(converted_user);
*ret_converted_group = TAKE_PTR(converted_group);
return 0;
}
static int find_free_uid(const char *directory, uid_t max_uid, uid_t *current_uid) {
int r;
assert(directory);
assert(current_uid);
for (;; (*current_uid)++) {
if (*current_uid > MAP_UID_MAX || *current_uid > max_uid)
return log_error_errno(
SYNTHETIC_ERRNO(EBUSY),
"No suitable available UID in range " UID_FMT "…" UID_FMT " in container detected, can't map user.",
MAP_UID_MIN, MAP_UID_MAX);
r = check_etc_passwd_collisions(directory, NULL, *current_uid);
if (r < 0)
return r;
if (r > 0) /* already used */
continue;
/* We want to use the UID also as GID, hence check for it in /etc/group too */
r = check_etc_group_collisions(directory, NULL, (gid_t) *current_uid);
if (r <= 0)
return r;
}
}
BindUserContext* bind_user_context_free(BindUserContext *c) {
if (!c)
return NULL;
FOREACH_ARRAY(d, c->data, c->n_data) {
user_record_unref(d->host_user);
group_record_unref(d->host_group);
user_record_unref(d->payload_user);
group_record_unref(d->payload_group);
}
return mfree(c);
}
int bind_user_prepare(
const char *directory,
char **bind_user,
const char *bind_user_shell,
bool bind_user_shell_copy,
uid_t uid_shift,
uid_t uid_range,
CustomMount **custom_mounts,
size_t *n_custom_mounts,
BindUserContext **ret) {
_cleanup_(bind_user_context_freep) BindUserContext *c = NULL;
uid_t current_uid = MAP_UID_MIN;
int r;
assert(custom_mounts);
assert(n_custom_mounts);
assert(ret);
/* This resolves the users specified in 'bind_user', generates a minimalized JSON user + group record
* for it to stick in the container, allocates a UID/GID for it, and updates the custom mount table,
* to include an appropriate bind mount mapping.
*
* This extends the passed custom_mounts/n_custom_mounts with the home directories, and allocates a
* new BindUserContext for the user records */
if (strv_isempty(bind_user)) {
*ret = NULL;
return 0;
}
c = new0(BindUserContext, 1);
if (!c)
return log_oom();
STRV_FOREACH(n, bind_user) {
_cleanup_(user_record_unrefp) UserRecord *u = NULL, *cu = NULL;
_cleanup_(group_record_unrefp) GroupRecord *g = NULL, *cg = NULL;
_cleanup_free_ char *sm = NULL, *sd = NULL;
r = userdb_by_name(*n, /* match= */ NULL, USERDB_DONT_SYNTHESIZE_INTRINSIC|USERDB_DONT_SYNTHESIZE_FOREIGN, &u);
if (r < 0)
return log_error_errno(r, "Failed to resolve user '%s': %m", *n);
/* For now, let's refuse mapping the root/nobody users explicitly. The records we generate
* are strictly additive, nss-systemd is typically placed last in /etc/nsswitch.conf. Thus
* even if we wanted, we couldn't override the root or nobody user records. Note we also
* check for name conflicts in /etc/passwd + /etc/group later on, which would usually filter
* out root/nobody too, hence these checks might appear redundant but they actually are
* not, as we want to support environments where /etc/passwd and /etc/group are non-existent,
* and the user/group databases fully synthesized at runtime. Moreover, the name of the
* user/group name of the "nobody" account differs between distros, hence a check by numeric
* UID is safer. */
if (user_record_is_root(u))
return log_error_errno(SYNTHETIC_ERRNO(EINVAL), "Mapping 'root' user not supported, sorry.");
if (user_record_is_nobody(u))
return log_error_errno(SYNTHETIC_ERRNO(EINVAL), "Mapping 'nobody' user not supported, sorry.");
if (!uid_is_valid(u->uid))
return log_error_errno(SYNTHETIC_ERRNO(EINVAL), "Cannot bind user with no UID, refusing.");
if (u->uid >= uid_shift && u->uid < uid_shift + uid_range)
return log_error_errno(
SYNTHETIC_ERRNO(EINVAL),
"UID "UID_FMT" of user '%s' to map is already in container UID range ("UID_FMT" - "UID_FMT"), refusing.",
u->uid, u->user_name, uid_shift, uid_shift + uid_range);
r = groupdb_by_gid(user_record_gid(u), /* match= */ NULL, USERDB_DONT_SYNTHESIZE_INTRINSIC|USERDB_DONT_SYNTHESIZE_FOREIGN, &g);
if (r < 0)
return log_error_errno(r, "Failed to resolve group of user '%s': %m", u->user_name);
if (g->gid >= uid_shift && g->gid < uid_shift + uid_range)
return log_error_errno(SYNTHETIC_ERRNO(EINVAL), "GID of group '%s' to map is already in container GID range, refusing.", g->group_name);
/* We want to synthesize exactly one user + group from the host into the container. This only
* makes sense if the user on the host has its own private group. We can't reasonably check
* this, so we just check of the name of user and group match.
*
* One of these days we might want to support users in a shared/common group too, but it's
* not clear to me how this would have to be mapped, precisely given that the common group
* probably already exists in the container. */
if (!streq(u->user_name, g->group_name))
return log_error_errno(SYNTHETIC_ERRNO(EOPNOTSUPP),
"Sorry, mapping users without private groups is currently not supported.");
r = find_free_uid(directory, uid_range, &current_uid);
if (r < 0)
return r;
r = convert_user(directory, u, g, current_uid, bind_user_shell, bind_user_shell_copy, &cu, &cg);
if (r < 0)
return r;
if (!GREEDY_REALLOC(c->data, c->n_data + 1))
return log_oom();
sm = strdup(user_record_home_directory(u));
if (!sm)
return log_oom();
sd = strdup(user_record_home_directory(cu));
if (!sd)
return log_oom();
if (!GREEDY_REALLOC(*custom_mounts, *n_custom_mounts + 1))
return log_oom();
(*custom_mounts)[(*n_custom_mounts)++] = (CustomMount) {
.type = CUSTOM_MOUNT_BIND,
.source = TAKE_PTR(sm),
.destination = TAKE_PTR(sd),
};
c->data[c->n_data++] = (BindUserData) {
.host_user = TAKE_PTR(u),
.host_group = TAKE_PTR(g),
.payload_user = TAKE_PTR(cu),
.payload_group = TAKE_PTR(cg),
};
current_uid++;
}
*ret = TAKE_PTR(c);
return 1;
}
static int write_and_symlink(
const char *root,
@ -384,10 +67,7 @@ static int write_and_symlink(
return 0;
}
int bind_user_setup(
const BindUserContext *c,
const char *root) {
int bind_user_setup(const MachineBindUserContext *c, const char *root) {
static const UserRecordLoadFlags strip_flags = /* Removes privileged info */
USER_RECORD_LOAD_MASK_PRIVILEGED|
USER_RECORD_PERMISSIVE;

View File

@ -1,29 +1,5 @@
/* SPDX-License-Identifier: LGPL-2.1-or-later */
#pragma once
#include "forward.h"
typedef struct CustomMount CustomMount;
typedef struct BindUserData {
/* The host's user/group records */
UserRecord *host_user;
GroupRecord *host_group;
/* The mapped records to place into the container */
UserRecord *payload_user;
GroupRecord *payload_group;
} BindUserData;
typedef struct BindUserContext {
BindUserData *data;
size_t n_data;
} BindUserContext;
BindUserContext* bind_user_context_free(BindUserContext *c);
DEFINE_TRIVIAL_CLEANUP_FUNC(BindUserContext*, bind_user_context_free);
int bind_user_prepare(const char *directory, char **bind_user, const char *bind_user_shell, bool bind_user_shell_copy, uid_t uid_shift, uid_t uid_range, CustomMount **custom_mounts, size_t *n_custom_mounts, BindUserContext **ret);
int bind_user_setup(const BindUserContext *c, const char *root);
int bind_user_setup(const MachineBindUserContext *c, const char *root);

View File

@ -25,6 +25,7 @@
#include "string-util.h"
#include "strv.h"
#include "tmpfile-util.h"
#include "user-util.h"
CustomMount* custom_mount_add(CustomMount **l, size_t *n, CustomMountType t) {
CustomMount *ret;
@ -41,7 +42,8 @@ CustomMount* custom_mount_add(CustomMount **l, size_t *n, CustomMountType t) {
(*n)++;
*ret = (CustomMount) {
.type = t
.type = t,
.destination_uid = UID_INVALID,
};
return ret;
@ -829,7 +831,7 @@ static int mount_bind(const char *dest, CustomMount *m, uid_t uid_shift, uid_t u
m->source,
OPEN_TREE_CLONE|OPEN_TREE_CLOEXEC,
&(struct mount_attr) {
.attr_clr = MOUNT_ATTR_IDMAP,
.attr_clr = idmapping != REMOUNT_IDMAPPING_NONE ? MOUNT_ATTR_IDMAP : 0,
});
if (ERRNO_IS_NEG_NOT_SUPPORTED(fd_clone))
/* We can only clear idmapped mounts with open_tree_attr(), but there might not be one in
@ -849,7 +851,7 @@ static int mount_bind(const char *dest, CustomMount *m, uid_t uid_shift, uid_t u
if (stat(where, &dest_st) < 0)
return log_error_errno(errno, "Failed to stat %s: %m", where);
dest_uid = dest_st.st_uid;
dest_uid = uid_is_valid(m->destination_uid) ? uid_shift + m->destination_uid : dest_st.st_uid;
if (S_ISDIR(source_st.st_mode) && !S_ISDIR(dest_st.st_mode))
return log_error_errno(SYNTHETIC_ERRNO(EINVAL),
@ -880,7 +882,7 @@ static int mount_bind(const char *dest, CustomMount *m, uid_t uid_shift, uid_t u
if (chown(where, uid_shift, uid_shift) < 0)
return log_error_errno(errno, "Failed to chown %s: %m", where);
dest_uid = uid_shift;
dest_uid = uid_shift + (uid_is_valid(m->destination_uid) ? m->destination_uid : 0);
}
if (move_mount(fd_clone, "", AT_FDCWD, where, MOVE_MOUNT_F_EMPTY_PATH) < 0)
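The new destination UID selection in mount_bind() reduces to a small pure function, sketched here. `pick_dest_uid()` is a hypothetical name, and the `uid_is_valid()` below is a simplified assumption that only rejects the all-ones value; the real helper may reject more.

```c
#include <assert.h>
#include <stdbool.h>
#include <sys/types.h>

#define UID_INVALID ((uid_t) -1)

/* Simplified assumption of what "valid" means for this sketch. */
static bool uid_is_valid(uid_t uid) {
        return uid != UID_INVALID;
}

/* Hypothetical helper mirroring the changed expression: prefer an explicitly
 * requested container UID (shifted into the host range), otherwise fall back
 * to the owner of the existing mount point. */
static uid_t pick_dest_uid(uid_t uid_shift, uid_t destination_uid, uid_t stat_uid) {
        return uid_is_valid(destination_uid) ? uid_shift + destination_uid : stat_uid;
}
```

For a --bind-user= home directory the custom mount carries the payload user's UID, so the idmapped mount targets that user instead of whoever owns the mount point in the image.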

View File

@ -38,6 +38,7 @@ typedef struct CustomMount {
bool read_only;
char *source; /* for overlayfs this is the upper directory */
char *destination;
uid_t destination_uid;
char *options;
char *work_dir;
char **lower;

View File

@ -34,7 +34,6 @@
#include "capability-list.h"
#include "capability-util.h"
#include "cgroup-setup.h"
#include "cgroup-util.h"
#include "chase.h"
#include "common-signal.h"
#include "constants.h"
@ -55,7 +54,6 @@
#include "format-util.h"
#include "fs-util.h"
#include "gpt.h"
#include "group-record.h"
#include "hexdecoct.h"
#include "hostname-setup.h"
#include "hostname-util.h"
@ -66,6 +64,7 @@
#include "log.h"
#include "loop-util.h"
#include "loopback-setup.h"
#include "machine-bind-user.h"
#include "machine-credential.h"
#include "main-func.h"
#include "mkdir.h"
@ -1731,9 +1730,6 @@ static int verify_arguments(void) {
return log_error_errno(SYNTHETIC_ERRNO(EINVAL), "AmbientCapability= setting is not useful for boot mode.");
}
if (arg_userns_mode == USER_NAMESPACE_NO && !strv_isempty(arg_bind_user))
return log_error_errno(SYNTHETIC_ERRNO(EINVAL), "--bind-user= requires --private-users");
/* Drop duplicate --bind-user= entries */
strv_uniq(arg_bind_user);
@ -3878,7 +3874,6 @@ static int outer_child(
int netns_fd,
const char *unix_export_path) {
_cleanup_(bind_user_context_freep) BindUserContext *bind_user_context = NULL;
_cleanup_strv_free_ char **os_release_pairs = NULL;
bool idmap = false;
ssize_t l;
@ -4043,38 +4038,41 @@ static int outer_child(
if (r < 0)
return r;
r = bind_user_prepare(
_cleanup_(machine_bind_user_context_freep) MachineBindUserContext *bind_user_context = NULL;
r = machine_bind_user_prepare(
directory,
arg_bind_user,
arg_bind_user_shell,
arg_bind_user_shell_copy,
chown_uid,
chown_range,
&arg_custom_mounts, &arg_n_custom_mounts,
&bind_user_context);
if (r < 0)
return r;
if (arg_userns_mode != USER_NAMESPACE_NO && bind_user_context) {
/* Send the user maps we determined to the parent, so that it installs it in our user
* namespace UID map table */
if (bind_user_context)
FOREACH_ARRAY(bind_user, bind_user_context->data, bind_user_context->n_data) {
_cleanup_free_ char *sm = strdup(user_record_home_directory(bind_user->host_user));
if (!sm)
return log_oom();
FOREACH_ARRAY(d, bind_user_context->data, bind_user_context->n_data) {
uid_t map[] = {
d->payload_user->uid,
d->host_user->uid,
(uid_t) d->payload_group->gid,
(uid_t) d->host_group->gid,
_cleanup_free_ char *sd = strdup(user_record_home_directory(bind_user->payload_user));
if (!sd)
return log_oom();
if (!GREEDY_REALLOC(arg_custom_mounts, arg_n_custom_mounts + 1))
return log_oom();
char *options = strdup("owneridmap");
if (!options)
return log_oom();
arg_custom_mounts[arg_n_custom_mounts++] = (CustomMount) {
.type = CUSTOM_MOUNT_BIND,
.source = TAKE_PTR(sm),
.destination = TAKE_PTR(sd),
.options = TAKE_PTR(options),
.destination_uid = bind_user->payload_user->uid,
};
l = send(fd_outer_socket, map, sizeof(map), MSG_NOSIGNAL);
if (l < 0)
return log_error_errno(errno, "Failed to send user UID map: %m");
if (l != sizeof(map))
return log_error_errno(SYNTHETIC_ERRNO(EIO),
"Short write while sending user UID map.");
}
}
r = mount_custom(
directory,
@ -4492,69 +4490,6 @@ static int uid_shift_pick(uid_t *shift, LockFile *ret_lock_file) {
}
}
static int add_one_uid_map(
char **p,
uid_t container_uid,
uid_t host_uid,
uid_t range) {
return strextendf(p,
UID_FMT " " UID_FMT " " UID_FMT "\n",
container_uid, host_uid, range);
}
static int make_uid_map_string(
const uid_t bind_user_uid[],
size_t n_bind_user_uid,
size_t offset,
char **ret) {
_cleanup_free_ char *s = NULL;
uid_t previous_uid = 0;
int r;
assert(n_bind_user_uid == 0 || bind_user_uid);
assert(IN_SET(offset, 0, 2)); /* used to switch between UID and GID map */
assert(ret);
/* The bind_user_uid[] array is a series of 4 uid_t values, for each --bind-user= entry one
* quadruplet, consisting of host and container UID + GID. */
for (size_t i = 0; i < n_bind_user_uid; i++) {
uid_t payload_uid = bind_user_uid[i*4+offset],
host_uid = bind_user_uid[i*4+offset+1];
assert(previous_uid <= payload_uid);
assert(payload_uid < arg_uid_range);
/* Add a range to close the gap to previous entry */
if (payload_uid > previous_uid) {
r = add_one_uid_map(&s, previous_uid, arg_uid_shift + previous_uid, payload_uid - previous_uid);
if (r < 0)
return r;
}
/* Map this specific user */
r = add_one_uid_map(&s, payload_uid, host_uid, 1);
if (r < 0)
return r;
previous_uid = payload_uid + 1;
}
/* And add a range to close the gap to finish the range */
if (arg_uid_range > previous_uid) {
r = add_one_uid_map(&s, previous_uid, arg_uid_shift + previous_uid, arg_uid_range - previous_uid);
if (r < 0)
return r;
}
assert(s);
*ret = TAKE_PTR(s);
return 0;
}
static int setup_uid_map(
const PidRef *pid,
const uid_t bind_user_uid[],
@ -4567,8 +4502,7 @@ static int setup_uid_map(
assert(pidref_is_set(pid));
assert(pid->pid > 1);
/* Build the UID map string */
if (make_uid_map_string(bind_user_uid, n_bind_user_uid, 0, &s) < 0) /* offset=0 contains the UID pair */
if (asprintf(&s, "0 " UID_FMT " " UID_FMT "\n", arg_uid_shift, arg_uid_range) < 0)
return log_oom();
xsprintf(uid_map, "/proc/" PID_FMT "/uid_map", pid->pid);
@ -4576,11 +4510,6 @@ static int setup_uid_map(
if (r < 0)
return log_error_errno(r, "Failed to write UID map: %m");
/* And now build the GID map string */
s = mfree(s);
if (make_uid_map_string(bind_user_uid, n_bind_user_uid, 2, &s) < 0) /* offset=2 contains the GID pair */
return log_oom();
xsprintf(uid_map, "/proc/" PID_FMT "/gid_map", pid->pid);
r = write_string_file(uid_map, s, WRITE_STRING_FILE_DISABLE_BUFFER);
if (r < 0)
@ -5314,26 +5243,6 @@ static int run_container(
if (l != sizeof arg_uid_shift)
return log_error_errno(SYNTHETIC_ERRNO(EIO), "Short write while writing UID shift.");
}
n_bind_user_uid = strv_length(arg_bind_user);
if (n_bind_user_uid > 0) {
/* Right after the UID shift, we'll receive the list of UID mappings for the
* --bind-user= logic. Always a quadruplet of payload and host UID + GID. */
bind_user_uid = new(uid_t, n_bind_user_uid*4);
if (!bind_user_uid)
return log_oom();
for (size_t i = 0; i < n_bind_user_uid; i++) {
l = recv(fd_outer_socket_pair[0], bind_user_uid + i*4, sizeof(uid_t)*4, 0);
if (l < 0)
return log_error_errno(errno, "Failed to read user UID map pair: %m");
if (l != sizeof(uid_t)*4)
return log_full_errno(l == 0 ? LOG_DEBUG : LOG_WARNING,
SYNTHETIC_ERRNO(EIO),
"Short read while reading bind user UID pairs.");
}
}
}
/* Wait for the outer child. */
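With the per-user quadruplets gone, the map written to /proc/PID/uid_map and /proc/PID/gid_map is a single "inside outside length" line. A minimal sketch of that formatting follows; `format_id_map()` is a hypothetical helper that uses snprintf() where the diff uses asprintf().

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>

/* Hypothetical helper: format one ID map line that maps container ID 0 and
 * the following 'range' IDs onto host IDs starting at 'shift'.
 * Returns 0 on success, -1 if the buffer is too small. */
static int format_id_map(char *buf, size_t size, uid_t shift, uid_t range) {
        assert(buf);

        int n = snprintf(buf, size, "0 %u %u\n", (unsigned) shift, (unsigned) range);
        if (n < 0 || (size_t) n >= size)
                return -1;

        return 0;
}
```

Because the bound user is now mapped by an idmapped mount on the home directory rather than by a hole punched into the namespace map, the same single line serves for both the UID and the GID map.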

View File

@ -12,9 +12,11 @@ in_files = [
# The last two don't get installed anywhere, one of them needs to included in
# the rpm spec file definition instead.
rpm_depends = []
foreach tuple : in_files
file = tuple[0]
custom_target(
rpm_depends += custom_target(
input : file + '.in',
output : file,
command : [jinja2_cmdline, '@INPUT@', '@OUTPUT@'],

View File

@ -34,7 +34,7 @@
#include "exec-util.h"
#include "exit-status.h"
#include "fd-util.h"
#include "fork-journal.h"
#include "fork-notify.h"
#include "format-table.h"
#include "format-util.h"
#include "fs-util.h"
@ -2438,7 +2438,7 @@ static int start_transient_service(sd_bus *bus) {
return r;
peer_fd = safe_close(peer_fd);
_cleanup_(journal_terminate) PidRef journal_pid = PIDREF_NULL;
_cleanup_(fork_notify_terminate) PidRef journal_pid = PIDREF_NULL;
if (arg_verbose)
(void) journal_fork(arg_runtime_scope, STRV_MAKE(c.unit), &journal_pid);
@ -2517,7 +2517,7 @@ static int start_transient_service(sd_bus *bus) {
return log_error_errno(r, "Failed to run event loop: %m");
/* Close the journal watch logic before we output the exit summary */
journal_terminate(&journal_pid);
fork_notify_terminate(&journal_pid);
if (arg_wait && !arg_quiet)
run_context_show_result(&c);

@ -1,8 +0,0 @@
/* SPDX-License-Identifier: LGPL-2.1-or-later */
#pragma once
#include "forward.h"
int journal_fork(RuntimeScope scope, char * const *units, PidRef *ret_pidref);
void journal_terminate(PidRef *pidref);

@ -7,7 +7,7 @@
#include "escape.h"
#include "event-util.h"
#include "exit-status.h"
#include "fork-journal.h"
#include "fork-notify.h"
#include "log.h"
#include "notify-recv.h"
#include "parse-util.h"
@ -27,13 +27,13 @@ static int on_child_exit(sd_event_source *s, const siginfo_t *si, void *userdata
if (si->si_code == CLD_EXITED) {
if (si->si_status == EXIT_SUCCESS)
log_debug("journalctl " PID_FMT " exited successfully.", si->si_pid);
log_debug("Child process " PID_FMT " exited successfully.", si->si_pid);
else
log_debug("journalctl " PID_FMT " died with a failure exit status %i, ignoring.", si->si_pid, si->si_status);
log_debug("Child process " PID_FMT " died with a failure exit status %i, ignoring.", si->si_pid, si->si_status);
} else if (si->si_code == CLD_KILLED)
log_debug("journalctl " PID_FMT " was killed by signal %s, ignoring.", si->si_pid, signal_to_string(si->si_status));
log_debug("Child process " PID_FMT " was killed by signal %s, ignoring.", si->si_pid, signal_to_string(si->si_status));
else if (si->si_code == CLD_DUMPED)
log_debug("journalctl " PID_FMT " dumped core by signal %s, ignoring.", si->si_pid, signal_to_string(si->si_status));
log_debug("Child process " PID_FMT " dumped core by signal %s, ignoring.", si->si_pid, signal_to_string(si->si_status));
else
log_debug("Got unexpected exit code %i via SIGCHLD, ignoring.", si->si_code);
@ -87,19 +87,15 @@ static int on_child_notify(sd_event_source *s, int fd, uint32_t revents, void *u
return 0;
}
int journal_fork(RuntimeScope scope, char * const *units, PidRef *ret_pidref) {
int fork_notify(char * const *argv, PidRef *ret_pidref) {
int r;
assert(scope >= 0);
assert(scope < _RUNTIME_SCOPE_MAX);
assert(!strv_isempty(argv));
assert(ret_pidref);
if (!is_main_thread())
return -EPERM;
if (strv_isempty(units))
return 0;
_cleanup_(sd_event_unrefp) sd_event *event = NULL;
r = sd_event_new(&event);
if (r < 0)
@ -123,22 +119,6 @@ int journal_fork(RuntimeScope scope, char * const *units, PidRef *ret_pidref) {
if (r < 0)
return r;
_cleanup_strv_free_ char **argv = strv_new(
"journalctl",
"-q",
"--follow",
"--no-pager",
"--lines=1",
"--synchronize-on-exit=yes");
if (!argv)
return log_oom_debug();
STRV_FOREACH(u, units)
if (strv_extendf(&argv,
scope == RUNTIME_SCOPE_SYSTEM ? "--unit=%s" : "--user-unit=%s",
*u) < 0)
return log_oom_debug();
if (DEBUG_LOGGING) {
_cleanup_free_ char *l = quote_command_line(argv, SHELL_ESCAPE_EMPTY);
log_debug("Invoking '%s' as child.", strnull(l));
@ -147,7 +127,7 @@ int journal_fork(RuntimeScope scope, char * const *units, PidRef *ret_pidref) {
BLOCK_SIGNALS(SIGCHLD);
r = pidref_safe_fork_full(
"(journalctl)",
"(fork-notify)",
(const int[3]) { -EBADF, STDOUT_FILENO, STDERR_FILENO },
/* except_fds= */ NULL,
/* n_except_fds= */ 0,
@ -164,7 +144,7 @@ int journal_fork(RuntimeScope scope, char * const *units, PidRef *ret_pidref) {
}
r = invoke_callout_binary(argv[0], argv);
log_debug_errno(r, "Failed to invoke journalctl: %m");
log_debug_errno(r, "Failed to invoke %s: %m", argv[0]);
_exit(EXIT_EXEC);
}
@ -177,7 +157,7 @@ int journal_fork(RuntimeScope scope, char * const *units, PidRef *ret_pidref) {
if (r < 0)
return r;
(void) sd_event_source_set_description(child_event_source, "fork-journal-child");
(void) sd_event_source_set_description(child_event_source, "fork-notify-child");
r = sd_event_loop(event);
if (r < 0)
@ -189,16 +169,66 @@ int journal_fork(RuntimeScope scope, char * const *units, PidRef *ret_pidref) {
return 0;
}
void journal_terminate(PidRef *pidref) {
static void fork_notify_terminate_internal(PidRef *pidref) {
int r;
if (!pidref_is_set(pidref))
return;
r = pidref_kill(pidref, SIGTERM);
if (r < 0)
log_debug_errno(r, "Failed to send SIGTERM to journalctl child " PID_FMT ", ignoring: %m", pidref->pid);
if (r < 0 && r != -ESRCH)
log_debug_errno(r, "Failed to send SIGTERM to child " PID_FMT ", ignoring: %m", pidref->pid);
(void) pidref_wait_for_terminate_and_check("journalctl", pidref, /* flags= */ 0);
(void) pidref_wait_for_terminate_and_check(/* name= */ NULL, pidref, /* flags= */ 0);
}
void fork_notify_terminate(PidRef *pidref) {
fork_notify_terminate_internal(pidref);
pidref_done(pidref);
}
void fork_notify_terminate_many(sd_event_source **array, size_t n) {
int r;
assert(array || n == 0);
FOREACH_ARRAY(s, array, n) {
PidRef child;
r = event_source_get_child_pidref(*s, &child);
if (r >= 0)
fork_notify_terminate_internal(&child);
else
log_debug_errno(r, "Could not get pidref for event source: %m");
sd_event_source_unref(*s);
}
free(array);
}
int journal_fork(RuntimeScope scope, char * const* units, PidRef *ret_pidref) {
assert(scope >= 0);
assert(scope < _RUNTIME_SCOPE_MAX);
if (strv_isempty(units))
return 0;
_cleanup_strv_free_ char **argv = strv_new(
"journalctl",
"-q",
"--follow",
"--no-pager",
"--lines=1",
"--synchronize-on-exit=yes");
if (!argv)
return log_oom_debug();
STRV_FOREACH(u, units)
if (strv_extendf(&argv,
scope == RUNTIME_SCOPE_SYSTEM ? "--unit=%s" : "--user-unit=%s",
*u) < 0)
return log_oom_debug();
return fork_notify(argv, ret_pidref);
}
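
The termination path in this file (send SIGTERM, tolerate a child that is already gone, then wait for it) can be tried out with plain POSIX calls. The following is a minimal sketch, not the systemd implementation: it uses `pid_t` instead of `PidRef`, and `terminate_and_reap` and `spawn_sleeper` are hypothetical names. Unlike the code above, which only logs a failed kill and waits anyway, the sketch returns early on kill errors other than ESRCH.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Sends SIGTERM to `pid` and reaps it. ESRCH (process already gone) is
 * treated as benign. Stores the wait status in *ret_status if non-NULL.
 * Returns 0 on success, a negative errno value on failure. */
int terminate_and_reap(pid_t pid, int *ret_status) {
        int status;

        assert(pid > 0);

        if (kill(pid, SIGTERM) < 0 && errno != ESRCH)
                return -errno;

        for (;;) {
                if (waitpid(pid, &status, 0) >= 0)
                        break;
                if (errno != EINTR) /* retry only if interrupted by a signal */
                        return -errno;
        }

        if (ret_status)
                *ret_status = status;

        return 0;
}

/* Hypothetical helper for experiments: forks a child that blocks until
 * it receives a signal. Returns the child PID, or -1 on fork failure. */
pid_t spawn_sleeper(void) {
        pid_t pid = fork();

        if (pid == 0) {
                pause();
                _exit(0);
        }

        return pid;
}
```

Calling `terminate_and_reap(spawn_sleeper(), &status)` should leave a status for which `WIFSIGNALED()` is true, since the child keeps the default SIGTERM disposition.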

src/shared/fork-notify.h

@ -0,0 +1,12 @@
/* SPDX-License-Identifier: LGPL-2.1-or-later */
#pragma once
#include "forward.h"
int fork_notify(char * const *cmdline, PidRef *ret_pidref);
void fork_notify_terminate(PidRef *pidref);
void fork_notify_terminate_many(sd_event_source **array, size_t n);
int journal_fork(RuntimeScope scope, char * const *units, PidRef *ret_pidref);

@ -0,0 +1,302 @@
/* SPDX-License-Identifier: LGPL-2.1-or-later */
#include <grp.h>
#include <pwd.h>
#include <unistd.h>
#include "alloc-util.h"
#include "chase.h"
#include "fd-util.h"
#include "format-util.h"
#include "json-util.h"
#include "log.h"
#include "machine-bind-user.h"
#include "path-util.h"
#include "string-util.h"
#include "strv.h"
#include "user-util.h"
#include "userdb.h"
static int check_etc_passwd_collisions(
const char *directory,
const char *name,
uid_t uid) {
_cleanup_fclose_ FILE *f = NULL;
int r;
assert(name || uid_is_valid(uid));
if (!directory)
return 0;
r = chase_and_fopen_unlocked("/etc/passwd", directory, CHASE_PREFIX_ROOT, "re", NULL, &f);
if (r == -ENOENT)
return 0; /* no user database? then no user, hence no collision */
if (r < 0)
return log_error_errno(r, "Failed to open /etc/passwd of container: %m");
for (;;) {
struct passwd *pw;
r = fgetpwent_sane(f, &pw);
if (r < 0)
return log_error_errno(r, "Failed to iterate through /etc/passwd of container: %m");
if (r == 0) /* EOF */
return 0; /* no collision */
if (name && streq_ptr(pw->pw_name, name))
return 1; /* name collision */
if (uid_is_valid(uid) && pw->pw_uid == uid)
return 1; /* UID collision */
}
}
static int check_etc_group_collisions(
const char *directory,
const char *name,
gid_t gid) {
_cleanup_fclose_ FILE *f = NULL;
int r;
assert(name || gid_is_valid(gid));
if (!directory)
return 0;
r = chase_and_fopen_unlocked("/etc/group", directory, CHASE_PREFIX_ROOT, "re", NULL, &f);
if (r == -ENOENT)
return 0; /* no group database? then no group, hence no collision */
if (r < 0)
return log_error_errno(r, "Failed to open /etc/group of container: %m");
for (;;) {
struct group *gr;
r = fgetgrent_sane(f, &gr);
if (r < 0)
return log_error_errno(r, "Failed to iterate through /etc/group of container: %m");
if (r == 0)
return 0; /* no collision */
if (name && streq_ptr(gr->gr_name, name))
return 1; /* name collision */
if (gid_is_valid(gid) && gr->gr_gid == gid)
return 1; /* gid collision */
}
}
static int convert_user(
const char *directory,
UserRecord *u,
GroupRecord *g,
uid_t allocate_uid,
const char *shell,
bool shell_copy,
UserRecord **ret_converted_user,
GroupRecord **ret_converted_group) {
_cleanup_(group_record_unrefp) GroupRecord *converted_group = NULL;
_cleanup_(user_record_unrefp) UserRecord *converted_user = NULL;
_cleanup_free_ char *h = NULL;
sd_json_variant *p, *hp = NULL, *ssh = NULL;
int r;
assert(u);
assert(g);
assert(user_record_gid(u) == g->gid);
if (shell_copy)
shell = u->shell;
r = check_etc_passwd_collisions(directory, u->user_name, UID_INVALID);
if (r < 0)
return r;
if (r > 0)
return log_error_errno(SYNTHETIC_ERRNO(EBUSY),
"Sorry, the user '%s' already exists in the container.", u->user_name);
r = check_etc_group_collisions(directory, g->group_name, GID_INVALID);
if (r < 0)
return r;
if (r > 0)
return log_error_errno(SYNTHETIC_ERRNO(EBUSY),
"Sorry, the group '%s' already exists in the container.", g->group_name);
h = path_join("/run/host/home/", u->user_name);
if (!h)
return log_oom();
/* Acquire the source hashed password array as-is, so that it retains the JSON_VARIANT_SENSITIVE flag */
p = sd_json_variant_by_key(u->json, "privileged");
if (p) {
hp = sd_json_variant_by_key(p, "hashedPassword");
ssh = sd_json_variant_by_key(p, "sshAuthorizedKeys");
}
r = user_record_build(
&converted_user,
SD_JSON_BUILD_OBJECT(
SD_JSON_BUILD_PAIR("userName", SD_JSON_BUILD_STRING(u->user_name)),
SD_JSON_BUILD_PAIR("uid", SD_JSON_BUILD_UNSIGNED(allocate_uid)),
SD_JSON_BUILD_PAIR("gid", SD_JSON_BUILD_UNSIGNED(allocate_uid)),
SD_JSON_BUILD_PAIR_CONDITION(u->disposition >= 0, "disposition", SD_JSON_BUILD_STRING(user_disposition_to_string(u->disposition))),
SD_JSON_BUILD_PAIR("homeDirectory", SD_JSON_BUILD_STRING(h)),
SD_JSON_BUILD_PAIR("service", JSON_BUILD_CONST_STRING("io.systemd.NSpawn")),
JSON_BUILD_PAIR_STRING_NON_EMPTY("shell", shell),
SD_JSON_BUILD_PAIR("privileged", SD_JSON_BUILD_OBJECT(
SD_JSON_BUILD_PAIR_CONDITION(!strv_isempty(u->hashed_password), "hashedPassword", SD_JSON_BUILD_VARIANT(hp)),
SD_JSON_BUILD_PAIR_CONDITION(!!ssh, "sshAuthorizedKeys", SD_JSON_BUILD_VARIANT(ssh))))));
if (r < 0)
return log_error_errno(r, "Failed to build container user record: %m");
r = group_record_build(
&converted_group,
SD_JSON_BUILD_OBJECT(
SD_JSON_BUILD_PAIR("groupName", SD_JSON_BUILD_STRING(g->group_name)),
SD_JSON_BUILD_PAIR("gid", SD_JSON_BUILD_UNSIGNED(allocate_uid)),
SD_JSON_BUILD_PAIR_CONDITION(g->disposition >= 0, "disposition", SD_JSON_BUILD_STRING(user_disposition_to_string(g->disposition))),
SD_JSON_BUILD_PAIR("service", JSON_BUILD_CONST_STRING("io.systemd.NSpawn"))));
if (r < 0)
return log_error_errno(r, "Failed to build container group record: %m");
*ret_converted_user = TAKE_PTR(converted_user);
*ret_converted_group = TAKE_PTR(converted_group);
return 0;
}
static int find_free_uid(const char *directory, uid_t *current_uid) {
int r;
assert(current_uid);
for (;; (*current_uid)++) {
if (*current_uid > MAP_UID_MAX)
return log_error_errno(
SYNTHETIC_ERRNO(EBUSY),
"No suitable available UID in range " UID_FMT "…" UID_FMT " in container detected, can't map user.",
MAP_UID_MIN, MAP_UID_MAX);
r = check_etc_passwd_collisions(directory, NULL, *current_uid);
if (r < 0)
return r;
if (r > 0) /* already used */
continue;
/* We want to use the UID also as GID, hence check for it in /etc/group too */
r = check_etc_group_collisions(directory, NULL, (gid_t) *current_uid);
if (r <= 0)
return r;
}
}
MachineBindUserContext* machine_bind_user_context_free(MachineBindUserContext *c) {
if (!c)
return NULL;
FOREACH_ARRAY(d, c->data, c->n_data) {
user_record_unref(d->host_user);
group_record_unref(d->host_group);
user_record_unref(d->payload_user);
group_record_unref(d->payload_group);
}
return mfree(c);
}
int machine_bind_user_prepare(
const char *directory,
char **bind_user,
const char *bind_user_shell,
bool bind_user_shell_copy,
MachineBindUserContext **ret) {
_cleanup_(machine_bind_user_context_freep) MachineBindUserContext *c = NULL;
uid_t current_uid = MAP_UID_MIN;
int r;
assert(ret);
/* This resolves the users specified in 'bind_user', generates a minimal JSON user + group record
* for each of them to stick in the container, and allocates a UID/GID for it.
*
* The records are returned in a newly allocated MachineBindUserContext. Note that no mount table is
* touched here: the function takes no custom_mounts/n_custom_mounts parameters. */
if (strv_isempty(bind_user)) {
*ret = NULL;
return 0;
}
c = new0(MachineBindUserContext, 1);
if (!c)
return log_oom();
STRV_FOREACH(n, bind_user) {
_cleanup_(user_record_unrefp) UserRecord *u = NULL, *cu = NULL;
_cleanup_(group_record_unrefp) GroupRecord *g = NULL, *cg = NULL;
r = userdb_by_name(*n, /* match= */ NULL, USERDB_DONT_SYNTHESIZE_INTRINSIC|USERDB_DONT_SYNTHESIZE_FOREIGN, &u);
if (r < 0)
return log_error_errno(r, "Failed to resolve user '%s': %m", *n);
/* For now, let's refuse mapping the root/nobody users explicitly. The records we generate
* are strictly additive, nss-systemd is typically placed last in /etc/nsswitch.conf. Thus
* even if we wanted, we couldn't override the root or nobody user records. Note we also
* check for name conflicts in /etc/passwd + /etc/group later on, which would usually filter
* out root/nobody too, hence these checks might appear redundant but they actually are
* not, as we want to support environments where /etc/passwd and /etc/group are non-existent,
* and the user/group databases fully synthesized at runtime. Moreover, the user/group name of
* the "nobody" account differs between distros, hence a check by numeric UID is safer. */
if (user_record_is_root(u))
return log_error_errno(SYNTHETIC_ERRNO(EINVAL), "Mapping 'root' user not supported, sorry.");
if (user_record_is_nobody(u))
return log_error_errno(SYNTHETIC_ERRNO(EINVAL), "Mapping 'nobody' user not supported, sorry.");
if (!uid_is_valid(u->uid))
return log_error_errno(SYNTHETIC_ERRNO(EINVAL), "Cannot bind user with no UID, refusing.");
r = groupdb_by_gid(user_record_gid(u), /* match= */ NULL, USERDB_DONT_SYNTHESIZE_INTRINSIC|USERDB_DONT_SYNTHESIZE_FOREIGN, &g);
if (r < 0)
return log_error_errno(r, "Failed to resolve group of user '%s': %m", u->user_name);
/* We want to synthesize exactly one user + group from the host into the container. This only
* makes sense if the user on the host has its own private group. We can't reasonably check
* this, so we just check if the names of user and group match.
*
* One of these days we might want to support users in a shared/common group too, but it's
* not clear to me how this would have to be mapped, precisely given that the common group
* probably already exists in the container. */
if (!streq(u->user_name, g->group_name))
return log_error_errno(SYNTHETIC_ERRNO(EOPNOTSUPP),
"Sorry, mapping users without private groups is currently not supported.");
r = find_free_uid(directory, &current_uid);
if (r < 0)
return r;
r = convert_user(directory, u, g, current_uid, bind_user_shell, bind_user_shell_copy, &cu, &cg);
if (r < 0)
return r;
if (!GREEDY_REALLOC(c->data, c->n_data + 1))
return log_oom();
c->data[c->n_data++] = (MachineBindUserData) {
.host_user = TAKE_PTR(u),
.host_group = TAKE_PTR(g),
.payload_user = TAKE_PTR(cu),
.payload_group = TAKE_PTR(cg),
};
current_uid++;
}
*ret = TAKE_PTR(c);
return 1;
}
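
The allocation strategy in this file (scan the container's passwd and group tables, then advance a candidate ID until neither table uses it) can be illustrated without userdb or chase helpers. The sketch below is an illustration only: it works on in-memory text in passwd(5)/group(5) format rather than on files inside the container, and `id_db_collides` and `find_free_id` are hypothetical names. The range bounds are passed in by the caller; the actual MAP_UID_MIN/MAP_UID_MAX values are not shown in this diff.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Scans "name:password:id:..." lines. Returns true if any entry carries
 * the given name (if non-NULL) or the given numeric ID (if id >= 0).
 * Lines with fewer than two ':' separators are skipped. */
bool id_db_collides(const char *text, const char *name, long id) {
        assert(text);
        assert(name || id >= 0);

        for (const char *line = text; *line;) {
                const char *eol = strchr(line, '\n');
                size_t len = eol ? (size_t) (eol - line) : strlen(line);

                /* Locate the first two separators of the entry. */
                const char *c1 = memchr(line, ':', len);
                const char *c2 = c1 ? memchr(c1 + 1, ':', len - (size_t) (c1 + 1 - line)) : NULL;

                if (c2) {
                        size_t name_len = (size_t) (c1 - line);

                        if (name && strlen(name) == name_len && memcmp(line, name, name_len) == 0)
                                return true;

                        /* Require a leading digit so strtol() cannot skip a
                         * newline and read the following entry. */
                        if (id >= 0 && isdigit((unsigned char) c2[1]) && strtol(c2 + 1, NULL, 10) == id)
                                return true;
                }

                line += len + (eol ? 1 : 0);
        }

        return false;
}

/* Advances *current until it names an ID unused in both tables.
 * Returns 0 on success, -1 if no ID up to and including `max` is free. */
int find_free_id(const char *passwd_text, const char *group_text, long *current, long max) {
        assert(current);

        for (; *current <= max; (*current)++)
                if (!id_db_collides(passwd_text, NULL, *current) &&
                    !id_db_collides(group_text, NULL, *current))
                        return 0;

        return -1;
}
```

Checking the group table with the same number mirrors the choice made above to use the allocated UID as the GID of the private group as well.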

@ -0,0 +1,30 @@
/* SPDX-License-Identifier: LGPL-2.1-or-later */
#pragma once
#include "forward.h"
typedef struct MachineBindUserData {
/* The host's user/group records */
UserRecord *host_user;
GroupRecord *host_group;
/* The mapped records to place into the container */
UserRecord *payload_user;
GroupRecord *payload_group;
} MachineBindUserData;
typedef struct MachineBindUserContext {
MachineBindUserData *data;
size_t n_data;
} MachineBindUserContext;
MachineBindUserContext* machine_bind_user_context_free(MachineBindUserContext *c);
DEFINE_TRIVIAL_CLEANUP_FUNC(MachineBindUserContext*, machine_bind_user_context_free);
int machine_bind_user_prepare(
const char *directory,
char **bind_user,
const char *bind_user_shell,
bool bind_user_shell_copy,
MachineBindUserContext **ret);

@ -78,7 +78,7 @@ shared_sources = files(
'find-esp.c',
'firewall-util-nft.c',
'firewall-util.c',
'fork-journal.c',
'fork-notify.c',
'format-table.c',
'fstab-util.c',
'generator.c',
@ -119,6 +119,7 @@ shared_sources = files(
'loop-util.c',
'loopback-setup.c',
'lsm-util.c',
'machine-bind-user.c',
'machine-credential.c',
'machine-id-setup.c',
'machine-pool.c',

@ -10,7 +10,7 @@
#include "bus-util.h"
#include "bus-wait-for-jobs.h"
#include "bus-wait-for-units.h"
#include "fork-journal.h"
#include "fork-notify.h"
#include "pidref.h"
#include "runtime-scope.h"
#include "special.h"
@ -390,7 +390,7 @@ int verb_start(int argc, char *argv[], void *userdata) {
return log_error_errno(r, "Failed to allocate unit watch context: %m");
}
_cleanup_(journal_terminate) PidRef journal_pid = PIDREF_NULL;
_cleanup_(fork_notify_terminate) PidRef journal_pid = PIDREF_NULL;
if (arg_marked)
ret = enqueue_marked_jobs(bus, w);
else {

@ -11,17 +11,14 @@ test_hashmap_ordered_c = custom_target(
generated_sources += test_hashmap_ordered_c
path = run_command(sh, '-c', 'echo "$PATH"', check: true).stdout().strip()
test_env = environment()
test_env = {
'PATH' : meson.project_build_root() + ':' + path,
'PROJECT_BUILD_ROOT' : meson.project_build_root(),
'SYSTEMD_SLOW_TESTS' : want_slow_tests ? '1' : '0',
'PYTHONDONTWRITEBYTECODE' : '1',
}
if conf.get('ENABLE_LOCALED') == 1
test_env.set('SYSTEMD_LANGUAGE_FALLBACK_MAP', language_fallback_map)
endif
test_env.set('PATH', meson.project_build_root() + ':' + path)
test_env.set('PROJECT_BUILD_ROOT', meson.project_build_root())
test_env.set('SYSTEMD_SLOW_TESTS', want_slow_tests ? '1' : '0')
test_env.set('PYTHONDONTWRITEBYTECODE', '1')
if efi_addon != ''
test_env.set('EFI_ADDON', efi_addon)
test_env += {'SYSTEMD_LANGUAGE_FALLBACK_MAP' : language_fallback_map}
endif
############################################################

@ -15,7 +15,8 @@ if want_ukify and want_tests != 'false'
test('test-ukify',
files('test_ukify.py'),
args: args,
env : test_env,
env : test_env + {'EFI_ADDON' : efi_addon.length() > 0 ? efi_addon[0].full_path() : ''},
timeout : 120,
suite : 'ukify')
suite : 'ukify',
depends : efi_addon)
endif

@ -7,16 +7,12 @@
#include "bus-unit-util.h"
#include "bus-util.h"
#include "bus-wait-for-jobs.h"
#include "escape.h"
#include "event-util.h"
#include "log.h"
#include "pidref.h"
#include "random-util.h"
#include "socket-util.h"
#include "special.h"
#include "string-util.h"
#include "strv.h"
#include "unit-def.h"
#include "unit-name.h"
#include "vmspawn-scope.h"
static int append_controller_property(sd_bus *bus, sd_bus_message *m) {
@ -41,15 +37,17 @@ int allocate_scope(
sd_bus *bus,
const char *machine_name,
const PidRef *pid,
sd_event_source **auxiliary,
size_t n_auxiliary,
const char *scope,
const char *slice,
char **properties,
bool allow_pidfd,
char **ret_scope) {
bool allow_pidfd) {
_cleanup_(bus_wait_for_jobs_freep) BusWaitForJobs *w = NULL;
_cleanup_(sd_bus_error_free) sd_bus_error error = SD_BUS_ERROR_NULL;
_cleanup_(sd_bus_message_unrefp) sd_bus_message *reply = NULL, *m = NULL;
_cleanup_free_ char *scope = NULL, *description = NULL;
_cleanup_free_ char *description = NULL;
const char *object;
int r;
@ -62,10 +60,6 @@ int allocate_scope(
if (r < 0)
return log_error_errno(r, "Could not watch job: %m");
r = unit_name_mangle_with_suffix(machine_name, "as machine name", /* flags= */ 0, ".scope", &scope);
if (r < 0)
return log_error_errno(r, "Failed to mangle scope name: %m");
description = strjoin("Virtual Machine ", machine_name);
if (!description)
return log_oom();
@ -87,6 +81,18 @@ int allocate_scope(
if (r < 0)
return bus_log_create_error(r);
FOREACH_ARRAY(aux, auxiliary, n_auxiliary) {
PidRef pidref;
r = event_source_get_child_pidref(*aux, &pidref);
if (r < 0)
return log_error_errno(r, "Could not get pidref for event source: %m");
r = bus_append_scope_pidref(m, &pidref, allow_pidfd);
if (r < 0)
return bus_log_create_error(r);
}
r = sd_bus_message_append(m, "(sv)(sv)(sv)(sv)",
"Description", "s", description,
"CollectMode", "s", "inactive-or-failed",
@ -125,10 +131,12 @@ int allocate_scope(
bus,
machine_name,
pid,
auxiliary,
n_auxiliary,
scope,
slice,
properties,
/* allow_pidfd= */ false,
ret_scope);
/* allow_pidfd= */ false);
return log_error_errno(r, "Failed to start transient scope unit: %s", bus_error_message(&error, r));
}
@ -137,32 +145,17 @@ int allocate_scope(
if (r < 0)
return bus_log_parse_error(r);
r = bus_wait_for_jobs_one(
return bus_wait_for_jobs_one(
w,
object,
BUS_WAIT_JOBS_LOG_ERROR,
/* extra_args= */ NULL);
if (r < 0)
return r;
if (ret_scope)
*ret_scope = TAKE_PTR(scope);
return 0;
}
int terminate_scope(
sd_bus *bus,
const char *machine_name) {
int terminate_scope(sd_bus *bus, const char *scope) {
_cleanup_(sd_bus_error_free) sd_bus_error error = SD_BUS_ERROR_NULL;
_cleanup_free_ char *scope = NULL;
int r;
r = unit_name_mangle_with_suffix(machine_name, "to terminate", /* flags= */ 0, ".scope", &scope);
if (r < 0)
return log_error_errno(r, "Failed to mangle scope name: %m");
r = bus_call_method(bus, bus_systemd_mgr, "AbandonScope", &error, /* ret_reply= */ NULL, "s", scope);
if (r < 0) {
log_debug_errno(r, "Failed to abandon scope '%s', ignoring: %s", scope, bus_error_message(&error, r));
@ -190,197 +183,3 @@ int terminate_scope(
return 0;
}
static int message_add_commands(sd_bus_message *m, const char *exec_type, char ***commands, size_t n_commands) {
int r;
assert(m);
assert(exec_type);
assert(commands || n_commands == 0);
/* A small helper for adding an ExecStart / ExecStopPost / etc. property to an sd_bus_message */
r = sd_bus_message_open_container(m, 'r', "sv");
if (r < 0)
return bus_log_create_error(r);
r = sd_bus_message_append(m, "s", exec_type);
if (r < 0)
return bus_log_create_error(r);
r = sd_bus_message_open_container(m, 'v', "a(sasb)");
if (r < 0)
return bus_log_create_error(r);
r = sd_bus_message_open_container(m, 'a', "(sasb)");
if (r < 0)
return bus_log_create_error(r);
FOREACH_ARRAY(cmd, commands, n_commands) {
char **cmdline = *cmd;
r = sd_bus_message_open_container(m, 'r', "sasb");
if (r < 0)
return bus_log_create_error(r);
r = sd_bus_message_append(m, "s", cmdline[0]);
if (r < 0)
return bus_log_create_error(r);
r = sd_bus_message_append_strv(m, cmdline);
if (r < 0)
return bus_log_create_error(r);
r = sd_bus_message_append(m, "b", 0);
if (r < 0)
return bus_log_create_error(r);
r = sd_bus_message_close_container(m);
if (r < 0)
return bus_log_create_error(r);
}
r = sd_bus_message_close_container(m);
if (r < 0)
return bus_log_create_error(r);
r = sd_bus_message_close_container(m);
if (r < 0)
return bus_log_create_error(r);
r = sd_bus_message_close_container(m);
if (r < 0)
return bus_log_create_error(r);
return 0;
}
void socket_service_pair_done(SocketServicePair *p) {
assert(p);
p->exec_start_pre = strv_free(p->exec_start_pre);
p->exec_start = strv_free(p->exec_start);
p->exec_stop_post = strv_free(p->exec_stop_post);
p->unit_name_prefix = mfree(p->unit_name_prefix);
p->listen_address = mfree(p->listen_address);
p->socket_type = 0;
}
int start_socket_service_pair(sd_bus *bus, const char *scope, SocketServicePair *p) {
_cleanup_(bus_wait_for_jobs_freep) BusWaitForJobs *w = NULL;
_cleanup_(sd_bus_error_free) sd_bus_error error = SD_BUS_ERROR_NULL;
_cleanup_(sd_bus_message_unrefp) sd_bus_message *m = NULL, *reply = NULL;
_cleanup_free_ char *service_desc = NULL, *service_name = NULL, *socket_name = NULL;
const char *object, *socket_type_str;
int r;
/* Starts a socket/service unit pair bound to the given scope. */
assert(bus);
assert(scope);
assert(p);
assert(p->unit_name_prefix);
assert(p->exec_start);
assert(p->listen_address);
r = bus_wait_for_jobs_new(bus, &w);
if (r < 0)
return log_error_errno(r, "Could not watch job: %m");
socket_name = strjoin(p->unit_name_prefix, ".socket");
if (!socket_name)
return log_oom();
service_name = strjoin(p->unit_name_prefix, ".service");
if (!service_name)
return log_oom();
service_desc = quote_command_line(p->exec_start, SHELL_ESCAPE_EMPTY);
if (!service_desc)
return log_oom();
socket_type_str = socket_address_type_to_string(p->socket_type);
if (!socket_type_str)
return log_error_errno(SYNTHETIC_ERRNO(EOPNOTSUPP), "Invalid socket type: %d", p->socket_type);
r = bus_message_new_method_call(bus, &m, bus_systemd_mgr, "StartTransientUnit");
if (r < 0)
return bus_log_create_error(r);
r = sd_bus_message_append(m, "ssa(sv)",
/* ss - name, mode */
socket_name, "fail",
/* a(sv) - Properties */
5,
"Description", "s", p->listen_address,
"AddRef", "b", true,
"BindsTo", "as", 1, scope,
"Listen", "a(ss)", 1, socket_type_str, p->listen_address,
"CollectMode", "s", "inactive-or-failed",
"RemoveOnStop", "b", true);
if (r < 0)
return bus_log_create_error(r);
/* aux */
r = sd_bus_message_open_container(m, 'a', "(sa(sv))");
if (r < 0)
return bus_log_create_error(r);
r = sd_bus_message_open_container(m, 'r', "sa(sv)");
if (r < 0)
return bus_log_create_error(r);
r = sd_bus_message_append(m, "s", service_name);
if (r < 0)
return bus_log_create_error(r);
r = sd_bus_message_open_container(m, 'a', "(sv)");
if (r < 0)
return bus_log_create_error(r);
r = sd_bus_message_append(m, "(sv)(sv)(sv)(sv)",
"Description", "s", service_desc,
"AddRef", "b", 1,
"BindsTo", "as", 1, scope,
"CollectMode", "s", "inactive-or-failed");
if (r < 0)
return bus_log_create_error(r);
if (p->exec_start_pre) {
r = message_add_commands(m, "ExecStartPre", &p->exec_start_pre, 1);
if (r < 0)
return r;
}
r = message_add_commands(m, "ExecStart", &p->exec_start, 1);
if (r < 0)
return r;
if (p->exec_stop_post) {
r = message_add_commands(m, "ExecStopPost", &p->exec_stop_post, 1);
if (r < 0)
return r;
}
r = sd_bus_message_close_container(m);
if (r < 0)
return bus_log_create_error(r);
r = sd_bus_message_close_container(m);
if (r < 0)
return bus_log_create_error(r);
r = sd_bus_message_close_container(m);
if (r < 0)
return bus_log_create_error(r);
r = sd_bus_call(bus, m, 0, &error, &reply);
if (r < 0)
return log_error_errno(r, "Failed to start %s as transient unit: %s", p->exec_start[0], bus_error_message(&error, r));
r = sd_bus_message_read(reply, "o", &object);
if (r < 0)
return bus_log_parse_error(r);
return bus_wait_for_jobs_one(w, object, /* quiet */ false, NULL);
}

@ -14,8 +14,15 @@ typedef struct SocketServicePair {
void socket_service_pair_done(SocketServicePair *p);
int allocate_scope(sd_bus *bus, const char *machine_name, const PidRef *pid, const char *slice, char **properties, bool allow_pidfd, char **ret_scope);
int allocate_scope(
sd_bus *bus,
const char *machine_name,
const PidRef *pid,
sd_event_source **auxiliary,
size_t n_auxiliary,
const char *scope,
const char *slice,
char **properties,
bool allow_pidfd);
int terminate_scope(sd_bus *bus, const char *machine_name);
int start_socket_service_pair(sd_bus *bus, const char *scope, SocketServicePair *p);
int terminate_scope(sd_bus *bus, const char *scope);

@ -21,7 +21,6 @@
#include "bus-internal.h"
#include "bus-locator.h"
#include "bus-util.h"
#include "bus-wait-for-jobs.h"
#include "capability-util.h"
#include "common-signal.h"
#include "copy.h"
@ -32,6 +31,7 @@
#include "event-util.h"
#include "extract-word.h"
#include "fd-util.h"
#include "fork-notify.h"
#include "format-util.h"
#include "fs-util.h"
#include "gpt.h"
@ -39,8 +39,6 @@
#include "hostname-setup.h"
#include "hostname-util.h"
#include "id128-util.h"
#include "io-util.h"
#include "iovec-util.h"
#include "log.h"
#include "machine-credential.h"
#include "main-func.h"
@ -48,7 +46,6 @@
#include "namespace-util.h"
#include "netif-util.h"
#include "nsresource.h"
#include "nulstr-util.h"
#include "osc-context.h"
#include "pager.h"
#include "parse-argument.h"
@ -1010,7 +1007,36 @@ fallback:
}
static int on_child_exit(sd_event_source *s, const siginfo_t *si, void *userdata) {
sd_event_exit(sd_event_source_get_event(s), 0);
assert(si);
/* Let's first do some logging about the exit status of the child. */
int ret;
if (si->si_code == CLD_EXITED) {
if (si->si_status == EXIT_SUCCESS)
log_debug("Child process " PID_FMT " exited successfully.", si->si_pid);
else
log_error("Child process " PID_FMT " died with a failure exit status %i.", si->si_pid, si->si_status);
ret = si->si_status;
} else if (si->si_code == CLD_KILLED)
ret = log_error_errno(SYNTHETIC_ERRNO(EPROTO),
"Child process " PID_FMT " was killed by signal %s.",
si->si_pid, signal_to_string(si->si_status));
else if (si->si_code == CLD_DUMPED)
ret = log_error_errno(SYNTHETIC_ERRNO(EPROTO),
"Child process " PID_FMT " dumped core by signal %s.",
si->si_pid, signal_to_string(si->si_status));
else
ret = log_error_errno(SYNTHETIC_ERRNO(EPROTO),
"Got unexpected exit code %i via SIGCHLD.",
si->si_code);
/* Regardless of whether the main qemu process or an auxiliary process died, let's exit either way
* as it's very likely that the main qemu process won't be able to operate properly anymore if one
* of the auxiliary processes died. */
sd_event_exit(sd_event_source_get_event(s), ret);
return 0;
}
@ -1036,7 +1062,9 @@ static int cmdline_add_vsock(char ***cmdline, int vsock_fd) {
return 0;
}
static int cmdline_add_kernel_cmdline(char ***cmdline, const char *kernel) {
static int cmdline_add_kernel_cmdline(char ***cmdline, const char *kernel, const char *smbios_dir) {
int r;
assert(cmdline);
if (strv_isempty(arg_kernel_cmdline_extra))
@ -1055,28 +1083,32 @@ static int cmdline_add_kernel_cmdline(char ***cmdline, const char *kernel) {
return 0;
}
_cleanup_free_ char *escaped_kcl = NULL;
escaped_kcl = escape_qemu_value(kcl);
if (!escaped_kcl)
return log_oom();
FOREACH_STRING(id, "io.systemd.stub.kernel-cmdline-extra", "io.systemd.boot.kernel-cmdline-extra") {
_cleanup_free_ char *p = path_join(smbios_dir, id);
if (!p)
return log_oom();
if (strv_extend(cmdline, "-smbios") < 0)
return log_oom();
r = write_string_filef(
p,
WRITE_STRING_FILE_CREATE|WRITE_STRING_FILE_AVOID_NEWLINE|WRITE_STRING_FILE_MODE_0600,
"%s=%s", id, kcl);
if (r < 0)
return log_error_errno(r, "Failed to write smbios kernel command line to file: %m");
if (strv_extendf(cmdline, "type=11,value=io.systemd.stub.kernel-cmdline-extra=%s", escaped_kcl) < 0)
return log_oom();
if (strv_extend(cmdline, "-smbios") < 0)
return log_oom();
if (strv_extend(cmdline, "-smbios") < 0)
return log_oom();
if (strv_extendf(cmdline, "type=11,value=io.systemd.boot.kernel-cmdline-extra=%s", escaped_kcl) < 0)
return log_oom();
if (strv_extendf(cmdline, "type=11,path=%s", p) < 0)
return log_oom();
}
}
return 0;
}
static int cmdline_add_smbios11(char ***cmdline) {
static int cmdline_add_smbios11(char ***cmdline, const char* smbios_dir) {
int r;
assert(cmdline);
if (strv_isempty(arg_smbios11))
@ -1088,15 +1120,22 @@ static int cmdline_add_smbios11(char ***cmdline) {
}
STRV_FOREACH(i, arg_smbios11) {
_cleanup_free_ char *escaped = NULL;
escaped = escape_qemu_value(*i);
if (!escaped)
return log_oom();
_cleanup_(unlink_and_freep) char *p = NULL;
r = tempfn_random_child(smbios_dir, "smbios11", &p);
if (r < 0)
return r;
r = write_string_file(
p, *i,
WRITE_STRING_FILE_CREATE|WRITE_STRING_FILE_AVOID_NEWLINE|WRITE_STRING_FILE_MODE_0600);
if (r < 0)
return log_error_errno(r, "Failed to write smbios data to smbios file %s: %m", p);
if (strv_extend(cmdline, "-smbios") < 0)
return log_oom();
if (strv_extendf(cmdline, "type=11,value=%s", escaped) < 0)
if (strv_extendf(cmdline, "type=11,path=%s", p) < 0)
return log_oom();
}
@ -1104,15 +1143,15 @@ static int cmdline_add_smbios11(char ***cmdline) {
}
static int start_tpm(
sd_bus *bus,
const char *scope,
const char *swtpm,
const char *runtime_dir,
char **ret_listen_address) {
const char *sd_socket_activate,
char **ret_listen_address,
PidRef *ret_pidref) {
int r;
assert(bus);
assert(scope);
assert(swtpm);
assert(runtime_dir);
@ -1122,16 +1161,8 @@ static int start_tpm(
if (r < 0)
return log_error_errno(r, "Failed to strip .scope suffix from scope: %m");
_cleanup_(socket_service_pair_done) SocketServicePair ssp = {
.socket_type = SOCK_STREAM,
};
ssp.unit_name_prefix = strjoin(scope_prefix, "-tpm");
if (!ssp.unit_name_prefix)
return log_oom();
ssp.listen_address = path_join(runtime_dir, "tpm.sock");
if (!ssp.listen_address)
_cleanup_free_ char *listen_address = path_join(runtime_dir, "tpm.sock");
if (!listen_address)
return log_oom();
_cleanup_free_ char *transient_state_dir = NULL;
@ -1139,7 +1170,11 @@ static int start_tpm(
if (arg_tpm_state_path)
state_dir = arg_tpm_state_path;
else {
transient_state_dir = path_join(runtime_dir, ssp.unit_name_prefix);
_cleanup_free_ char *dirname = strjoin(scope_prefix, "-tpm");
if (!dirname)
return log_oom();
transient_state_dir = path_join(runtime_dir, dirname);
if (!transient_state_dir)
return log_oom();
@ -1155,74 +1190,88 @@ static int start_tpm(
if (r < 0)
return log_error_errno(r, "Failed to find swtpm_setup binary: %m");
ssp.exec_start_pre = strv_new(swtpm_setup, "--tpm-state", state_dir, "--tpm2", "--pcr-banks", "sha256", "--not-overwrite");
if (!ssp.exec_start_pre)
_cleanup_strv_free_ char **argv = strv_new(swtpm_setup, "--tpm-state", state_dir, "--tpm2", "--pcr-banks", "sha256", "--not-overwrite");
if (!argv)
return log_oom();
ssp.exec_start = strv_new(swtpm, "socket", "--tpm2", "--tpmstate");
if (!ssp.exec_start)
r = safe_fork("(swtpm-setup)", FORK_CLOSE_ALL_FDS|FORK_LOG|FORK_WAIT, NULL);
if (r == 0) {
/* Child */
execvp(argv[0], argv);
log_error_errno(errno, "Failed to execute '%s': %m", argv[0]);
_exit(EXIT_FAILURE);
}
strv_free(argv);
argv = strv_new(sd_socket_activate, "--listen", listen_address, swtpm, "socket", "--tpm2", "--tpmstate");
if (!argv)
return log_oom();
r = strv_extendf(&ssp.exec_start, "dir=%s", state_dir);
r = strv_extendf(&argv, "dir=%s", state_dir);
if (r < 0)
return log_oom();
r = strv_extend_many(&ssp.exec_start, "--ctrl", "type=unixio,fd=3");
r = strv_extend_many(&argv, "--ctrl", "type=unixio,fd=3");
if (r < 0)
return log_oom();
r = start_socket_service_pair(bus, scope, &ssp);
r = fork_notify(argv, ret_pidref);
if (r < 0)
return r;
if (ret_listen_address)
*ret_listen_address = TAKE_PTR(ssp.listen_address);
*ret_listen_address = TAKE_PTR(listen_address);
return 0;
}
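The `fd=3` passed to swtpm above (and `--fd 3` for virtiofsd further down) relies on the socket activation convention: the first passed socket is file descriptor 3, and the launcher announces it through the `LISTEN_PID` and `LISTEN_FDS` environment variables. A minimal sketch of the receiving side, assuming only that convention. The helper name `count_listen_fds` is hypothetical; real programs would normally call `sd_listen_fds()` from libsystemd, which does more (for example, it sets `FD_CLOEXEC` on the passed descriptors).

```c
#include <assert.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define LISTEN_FDS_START 3 /* same value as SD_LISTEN_FDS_START in sd-daemon.h */

/* Hypothetical helper: returns how many sockets were passed to this process and
 * stores the number of the first one in *ret_first_fd. Returns 0 if the
 * environment variables are missing, malformed, or meant for another PID. */
static int count_listen_fds(int *ret_first_fd) {
        assert(ret_first_fd);

        const char *pid_s = getenv("LISTEN_PID"), *fds_s = getenv("LISTEN_FDS");
        if (!pid_s || !fds_s)
                return 0;

        char *end;
        long pid = strtol(pid_s, &end, 10);
        if (end == pid_s || *end != '\0' || pid != (long) getpid())
                return 0; /* the variables were meant for a different process */

        long n = strtol(fds_s, &end, 10);
        if (end == fds_s || *end != '\0' || n < 0 || n > INT_MAX - LISTEN_FDS_START)
                return 0;

        *ret_first_fd = LISTEN_FDS_START;
        return (int) n;
}
```

The `LISTEN_PID` check is what keeps a grandchild that inherited the environment from mistaking unrelated descriptors for passed sockets.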
static int start_systemd_journal_remote(
sd_bus *bus,
const char *scope,
unsigned port,
const char *sd_journal_remote,
char **ret_listen_address) {
const char *sd_socket_activate,
char **ret_listen_address,
PidRef *ret_pidref) {
int r;
assert(bus);
assert(scope);
assert(sd_journal_remote);
_cleanup_free_ char *scope_prefix = NULL;
r = unit_name_to_prefix(scope, &scope_prefix);
if (r < 0)
return log_error_errno(r, "Failed to strip .scope suffix from scope: %m");
_cleanup_(socket_service_pair_done) SocketServicePair ssp = {
.socket_type = SOCK_STREAM,
};
ssp.unit_name_prefix = strjoin(scope_prefix, "-forward-journal");
if (!ssp.unit_name_prefix)
_cleanup_free_ char *listen_address = NULL;
if (asprintf(&listen_address, "vsock:2:%u", port) < 0)
return log_oom();
if (asprintf(&ssp.listen_address, "vsock:2:%u", port) < 0)
return log_oom();
_cleanup_free_ char *sd_journal_remote = NULL;
r = find_executable_full(
"systemd-journal-remote",
/* root = */ NULL,
STRV_MAKE(LIBEXECDIR),
/* use_path_envvar = */ true, /* systemd-journal-remote should be installed in
* LIBEXECDIR, but also search $PATH to support fancy setups. */
&sd_journal_remote,
/* ret_fd = */ NULL);
if (r < 0)
return log_error_errno(r, "Failed to find systemd-journal-remote binary: %m");
ssp.exec_start = strv_new(
_cleanup_strv_free_ char **argv = strv_new(
sd_socket_activate,
"--listen", listen_address,
sd_journal_remote,
"--output", arg_forward_journal,
"--split-mode", endswith(arg_forward_journal, ".journal") ? "none" : "host");
if (!ssp.exec_start)
if (!argv)
return log_oom();
r = start_socket_service_pair(bus, scope, &ssp);
r = fork_notify(argv, ret_pidref);
if (r < 0)
return r;
if (ret_listen_address)
*ret_listen_address = TAKE_PTR(ssp.listen_address);
*ret_listen_address = TAKE_PTR(listen_address);
return 0;
}
@ -1291,17 +1340,16 @@ static int find_virtiofsd(char **ret) {
}
static int start_virtiofsd(
sd_bus *bus,
const char *scope,
const char *directory,
bool uidmap,
const char *runtime_dir,
char **ret_listen_address) {
const char *sd_socket_activate,
char **ret_listen_address,
PidRef *ret_pidref) {
static unsigned virtiofsd_instance = 0;
int r;
assert(bus);
assert(scope);
assert(directory);
assert(runtime_dir);
@ -1316,45 +1364,46 @@ static int start_virtiofsd(
if (r < 0)
return log_error_errno(r, "Failed to strip .scope suffix from scope: %m");
_cleanup_(socket_service_pair_done) SocketServicePair ssp = {
.socket_type = SOCK_STREAM,
};
if (asprintf(&ssp.unit_name_prefix, "%s-virtiofsd-%u", scope_prefix, virtiofsd_instance++) < 0)
return log_oom();
if (asprintf(&ssp.listen_address, "%s/sock-%"PRIx64, runtime_dir, random_u64()) < 0)
_cleanup_free_ char *listen_address = NULL;
if (asprintf(&listen_address, "%s/sock-%"PRIx64, runtime_dir, random_u64()) < 0)
return log_oom();
/* QEMU doesn't support submounts so don't announce them */
ssp.exec_start = strv_new(virtiofsd, "--shared-dir", directory, "--xattr", "--fd", "3", "--no-announce-submounts");
if (!ssp.exec_start)
_cleanup_strv_free_ char **argv = strv_new(
sd_socket_activate,
"--listen", listen_address,
virtiofsd,
"--shared-dir", directory,
"--xattr",
"--fd", "3",
"--no-announce-submounts");
if (!argv)
return log_oom();
if (uidmap && arg_uid_shift != UID_INVALID) {
r = strv_extend(&ssp.exec_start, "--uid-map");
r = strv_extend(&argv, "--uid-map");
if (r < 0)
return log_oom();
r = strv_extendf(&ssp.exec_start, ":0:" UID_FMT ":" UID_FMT ":", arg_uid_shift, arg_uid_range);
r = strv_extendf(&argv, ":0:" UID_FMT ":" UID_FMT ":", arg_uid_shift, arg_uid_range);
if (r < 0)
return log_oom();
r = strv_extend(&ssp.exec_start, "--gid-map");
r = strv_extend(&argv, "--gid-map");
if (r < 0)
return log_oom();
r = strv_extendf(&ssp.exec_start, ":0:" GID_FMT ":" GID_FMT ":", arg_uid_shift, arg_uid_range);
r = strv_extendf(&argv, ":0:" GID_FMT ":" GID_FMT ":", arg_uid_shift, arg_uid_range);
if (r < 0)
return log_oom();
}
r = start_socket_service_pair(bus, scope, &ssp);
r = fork_notify(argv, ret_pidref);
if (r < 0)
return r;
if (ret_listen_address)
*ret_listen_address = TAKE_PTR(ssp.listen_address);
*ret_listen_address = TAKE_PTR(listen_address);
return 0;
}
@ -1598,134 +1647,6 @@ static int on_request_stop(sd_bus_message *m, void *userdata, sd_bus_error *erro
return 0;
}
static int datagram_read_cmdline_and_exec(int _fd /* always taking possession, even on error */) {
_cleanup_close_ int fd = TAKE_FD(_fd);
int r;
assert(fd >= 0);
/* The first datagram contains the cmdline */
r = fd_wait_for_event(fd, POLLIN, USEC_INFINITY);
if (r < 0)
return log_error_errno(r, "Failed to wait for command line: %m");
ssize_t n = next_datagram_size_fd(fd);
if (n < 0)
return log_error_errno(n, "Failed to determine datagram size: %m");
n += 1; /* extra byte to validate that the size we determined here was correct */
_cleanup_free_ char *p = malloc(n);
if (!p)
return log_oom();
ssize_t m = recv(fd, p, n, /* flags= */ 0);
if (m < 0)
return log_error_errno(errno, "Failed to read datagram: %m");
if (m >= n)
return log_error_errno(SYNTHETIC_ERRNO(EBADMSG), "Unexpected message size.");
_cleanup_strv_free_ char **a = strv_parse_nulstr(p, m);
if (!a)
return log_oom();
if (strv_isempty(a))
return log_error_errno(SYNTHETIC_ERRNO(EBADMSG), "Invalid command line.");
/* The second datagram contains an integer array of the intended fd numbers, and then an SCM_RIGHTS fd
* list along with it, matching that. */
r = fd_wait_for_event(fd, POLLIN, USEC_INFINITY);
if (r < 0)
return log_error_errno(r, "Failed to wait for file descriptors: %m");
n = next_datagram_size_fd(fd);
if (n < 0)
return log_error_errno(n, "Failed to determine datagram size: %m");
n += 1; /* extra byte to validate that the size we determined here was correct */
_cleanup_free_ int *f = malloc(n);
if (!f)
return log_oom();
struct iovec iov = {
.iov_base = f,
.iov_len = n,
};
int *fds = NULL;
size_t n_fds = 0;
CLEANUP_ARRAY(fds, n_fds, close_many_and_free);
m = receive_many_fds_iov(
fd,
&iov, /* iovlen= */ 1,
&fds,
&n_fds,
/* flags= */ MSG_TRUNC);
if (m < 0)
return log_error_errno(m, "Failed to read datagram: %m");
if (m >= n || (size_t) m != n_fds * sizeof(int))
return log_error_errno(SYNTHETIC_ERRNO(EBADMSG), "Unexpected message size.");
fd = safe_close(fd);
/* At this point the fds[] contains the file descriptors we got, and f[] contains the numbers we want
* for them. Let's rearrange things. */
/* 1. Determine largest number we want */
int max_fd = 2;
for (size_t k = 0; k < n_fds; k++)
max_fd = MAX(max_fd, f[k]);
/* 2. Move all fds we got above that */
for (size_t k = 0; k < n_fds; k++) {
if (fds[k] > max_fd)
continue;
_cleanup_close_ int copy = fcntl(fds[k], F_DUPFD_CLOEXEC, max_fd+1);
if (copy < 0)
return log_error_errno(errno, "Failed to duplicate file descriptor: %m");
safe_close(fds[k]);
fds[k] = TAKE_FD(copy);
assert(fds[k] > max_fd);
}
log_close();
r = close_all_fds(fds, n_fds);
if (r < 0)
return log_error_errno(r, "Failed to close remaining file descriptors: %m");
/* 3. Move into place (this also disables O_CLOEXEC) */
for (size_t k = 0; k < n_fds; k++) {
if (dup2(fds[k], f[k]) < 0)
return log_error_errno(errno, "Failed to move file descriptor: %m");
safe_close(fds[k]);
fds[k] = f[k];
}
execv(a[0], a);
return log_error_errno(errno, "Failed to execve %s: %m", a[0]);
}
_noreturn_ static void child(int cmdline_fd) {
assert(cmdline_fd >= 0);
/* Set LANG if it is missing */
if (setenv("LANG", "C.UTF-8", /* override= */ 0) < 0) {
log_oom();
goto fail;
}
/* Now wait for the command line from the parent, and then execute it */
(void) datagram_read_cmdline_and_exec(TAKE_FD(cmdline_fd));
fail:
_exit(EXIT_FAILURE);
}
static int run_virtual_machine(int kvm_device_fd, int vhost_device_fd) {
_cleanup_(ovmf_config_freep) OvmfConfig *ovmf_config = NULL;
_cleanup_free_ char *qemu_binary = NULL, *mem = NULL, *kernel = NULL;
@ -1733,10 +1654,13 @@ static int run_virtual_machine(int kvm_device_fd, int vhost_device_fd) {
_cleanup_close_ int notify_sock_fd = -EBADF;
_cleanup_strv_free_ char **cmdline = NULL;
_cleanup_free_ int *pass_fds = NULL;
size_t n_pass_fds = 0;
sd_event_source **children = NULL;
size_t n_children = 0, n_pass_fds = 0;
const char *accel;
int r;
CLEANUP_ARRAY(children, n_children, fork_notify_terminate_many);
polkit_agent_open();
/* Registration always happens on the system bus */
@ -1770,76 +1694,6 @@ static int run_virtual_machine(int kvm_device_fd, int vhost_device_fd) {
runtime_bus = sd_bus_ref(user_bus);
}
assert_se(sigprocmask_many(SIG_BLOCK, /* ret_old_mask=*/ NULL, SIGCHLD) >= 0);
_cleanup_close_pair_ int cmdline_socket[2] = EBADF_PAIR;
if (socketpair(AF_UNIX, SOCK_DGRAM | SOCK_CLOEXEC | SOCK_NONBLOCK, 0, cmdline_socket) < 0)
return log_error_errno(errno, "Failed to allocate command line socket pair: %m");
/* Fork off child early on, as we need to assign it to a scope unit, which we can generate
* dependencies towards for swtpm, virtiofsd and so on. It's just going to hang until we have fully
* prepared a command line */
_cleanup_(pidref_done) PidRef child_pidref = PIDREF_NULL;
r = pidref_safe_fork_full(
"(qemu)",
/* stdio_fds= */ NULL,
cmdline_socket + 0, 1,
FORK_RESET_SIGNALS|FORK_CLOSE_ALL_FDS|FORK_DEATHSIG_SIGTERM|FORK_LOG|FORK_CLOEXEC_OFF|FORK_RLIMIT_NOFILE_SAFE,
&child_pidref);
if (r < 0)
return r;
if (r == 0) {
cmdline_socket[1] = -EBADF; /* closed due to FORK_CLOSE_ALL_FDS */
child(cmdline_socket[0]);
assert_not_reached();
}
cmdline_socket[0] = safe_close(cmdline_socket[0]);
if (!arg_keep_unit) {
/* When a new scope is created for this container, then we'll be registered as its controller, in which
* case PID 1 will send us a friendly RequestStop signal, when it is asked to terminate the
* scope. Let's hook into that, and cleanly shut down the container, and print a friendly message. */
r = sd_bus_match_signal_async(
runtime_bus,
/* ret= */ NULL,
"org.freedesktop.systemd1",
/* path= */ NULL,
"org.freedesktop.systemd1.Scope",
"RequestStop",
on_request_stop,
/* install_callback= */ NULL,
/* userdata= */ NULL);
if (r < 0)
return log_error_errno(r, "Failed to request RequestStop match: %m");
}
_cleanup_free_ char *unit = NULL;
bool scope_allocated = false;
if (!arg_keep_unit && (!arg_register || !arg_privileged)) {
r = allocate_scope(
runtime_bus,
arg_machine,
&child_pidref,
arg_slice,
arg_property,
/* allow_pidfd= */ true,
&unit);
if (r < 0)
return r;
scope_allocated = true;
} else {
if (arg_privileged)
r = cg_pid_get_unit(0, &unit);
else
r = cg_pid_get_user_unit(0, &unit);
if (r < 0)
return log_error_errno(r, "Failed to get our own unit: %m");
}
bool use_kvm = arg_kvm > 0;
if (arg_kvm < 0) {
r = qemu_check_kvm_support();
@ -2272,18 +2126,51 @@ static int run_virtual_machine(int kvm_device_fd, int vhost_device_fd) {
return r;
}
assert_se(sigprocmask_many(SIG_BLOCK, /* ret_old_mask=*/ NULL, SIGCHLD) >= 0);
_cleanup_(sd_event_unrefp) sd_event *event = NULL;
r = sd_event_new(&event);
if (r < 0)
return log_error_errno(r, "Failed to get default event loop: %m");
(void) sd_event_set_watchdog(event, true);
_cleanup_free_ char *unit = NULL;
r = unit_name_mangle_with_suffix(arg_machine, "as machine name", /* flags= */ 0, ".scope", &unit);
if (r < 0)
return log_error_errno(r, "Failed to mangle scope name: %m");
_cleanup_free_ char *sd_socket_activate = NULL;
r = find_executable("systemd-socket-activate", &sd_socket_activate);
if (r < 0)
return log_error_errno(r, "Failed to find systemd-socket-activate binary: %m");
if (arg_directory) {
_cleanup_free_ char *listen_address = NULL;
_cleanup_(fork_notify_terminate) PidRef child = PIDREF_NULL;
if (!GREEDY_REALLOC(children, n_children + 1))
return log_oom();
r = start_virtiofsd(
runtime_bus,
unit,
arg_directory,
/* uidmap= */ true,
runtime_dir,
&listen_address);
sd_socket_activate,
&listen_address,
&child);
if (r < 0)
return r;
_cleanup_(sd_event_source_unrefp) sd_event_source *source = NULL;
r = event_add_child_pidref(event, &source, &child, WEXITED, on_child_exit, /* userdata= */ NULL);
if (r < 0)
return r;
pidref_done(&child);
children[n_children++] = TAKE_PTR(source);
_cleanup_free_ char *escaped_listen_address = escape_qemu_value(listen_address);
if (!escaped_listen_address)
return log_oom();
@ -2347,16 +2234,30 @@ static int run_virtual_machine(int kvm_device_fd, int vhost_device_fd) {
FOREACH_ARRAY(mount, arg_runtime_mounts.mounts, arg_runtime_mounts.n_mounts) {
_cleanup_free_ char *listen_address = NULL;
_cleanup_(fork_notify_terminate) PidRef child = PIDREF_NULL;
if (!GREEDY_REALLOC(children, n_children + 1))
return log_oom();
r = start_virtiofsd(
runtime_bus,
unit,
mount->source,
/* uidmap= */ false,
runtime_dir,
&listen_address);
sd_socket_activate,
&listen_address,
&child);
if (r < 0)
return r;
_cleanup_(sd_event_source_unrefp) sd_event_source *source = NULL;
r = event_add_child_pidref(event, &source, &child, WEXITED, on_child_exit, /* userdata= */ NULL);
if (r < 0)
return r;
pidref_done(&child);
children[n_children++] = TAKE_PTR(source);
_cleanup_free_ char *escaped_listen_address = escape_qemu_value(listen_address);
if (!escaped_listen_address)
return log_oom();
@ -2386,11 +2287,16 @@ static int run_virtual_machine(int kvm_device_fd, int vhost_device_fd) {
return log_oom();
}
r = cmdline_add_kernel_cmdline(&cmdline, kernel);
_cleanup_(rm_rf_physical_and_freep) char *smbios_dir = NULL;
r = mkdtemp_malloc("/var/tmp/vmspawn-smbios-XXXXXX", &smbios_dir);
if (r < 0)
return log_error_errno(r, "Failed to create temporary directory: %m");
r = cmdline_add_kernel_cmdline(&cmdline, kernel, smbios_dir);
if (r < 0)
return r;
r = cmdline_add_smbios11(&cmdline);
r = cmdline_add_smbios11(&cmdline, smbios_dir);
if (r < 0)
return r;
@ -2444,11 +2350,12 @@ static int run_virtual_machine(int kvm_device_fd, int vhost_device_fd) {
_cleanup_free_ char *tpm_socket_address = NULL;
if (swtpm) {
r = start_tpm(runtime_bus,
unit,
swtpm,
runtime_dir,
&tpm_socket_address);
_cleanup_(fork_notify_terminate) PidRef child = PIDREF_NULL;
if (!GREEDY_REALLOC(children, n_children + 1))
return log_oom();
r = start_tpm(unit, swtpm, runtime_dir, sd_socket_activate, &tpm_socket_address, &child);
if (r < 0) {
/* only bail if the user asked for a tpm */
if (arg_tpm > 0)
@ -2456,6 +2363,14 @@ static int run_virtual_machine(int kvm_device_fd, int vhost_device_fd) {
log_debug_errno(r, "Failed to start tpm, ignoring: %m");
}
_cleanup_(sd_event_source_unrefp) sd_event_source *source = NULL;
r = event_add_child_pidref(event, &source, &child, WEXITED, on_child_exit, /* userdata= */ NULL);
if (r < 0)
return r;
pidref_done(&child);
children[n_children++] = TAKE_PTR(source);
}
if (tpm_socket_address) {
@ -2501,28 +2416,24 @@ static int run_virtual_machine(int kvm_device_fd, int vhost_device_fd) {
}
if (arg_forward_journal) {
_cleanup_free_ char *sd_journal_remote = NULL, *listen_address = NULL, *cred = NULL;
_cleanup_free_ char *listen_address = NULL, *cred = NULL;
r = find_executable_full(
"systemd-journal-remote",
/* root = */ NULL,
STRV_MAKE(LIBEXECDIR),
/* use_path_envvar = */ true, /* systemd-journal-remote should be installed in
* LIBEXECDIR, but also search $PATH to support fancy setups. */
&sd_journal_remote,
/* ret_fd = */ NULL);
if (r < 0)
return log_error_errno(r, "Failed to find systemd-journal-remote binary: %m");
if (!GREEDY_REALLOC(children, n_children + 1))
return log_oom();
r = start_systemd_journal_remote(
runtime_bus,
unit,
child_cid,
sd_journal_remote,
&listen_address);
_cleanup_(fork_notify_terminate) PidRef child = PIDREF_NULL;
r = start_systemd_journal_remote(unit, child_cid, sd_socket_activate, &listen_address, &child);
if (r < 0)
return r;
_cleanup_(sd_event_source_unrefp) sd_event_source *source = NULL;
r = event_add_child_pidref(event, &source, &child, WEXITED, on_child_exit, /* userdata= */ NULL);
if (r < 0)
return r;
pidref_done(&child);
children[n_children++] = TAKE_PTR(source);
cred = strjoin("journal.forward_to_socket:", listen_address);
if (!cred)
return log_oom();
@ -2586,18 +2497,29 @@ static int run_virtual_machine(int kvm_device_fd, int vhost_device_fd) {
if (ARCHITECTURE_SUPPORTS_SMBIOS)
FOREACH_ARRAY(cred, arg_credentials.credentials, arg_credentials.n_credentials) {
_cleanup_free_ char *cred_data_b64 = NULL;
_cleanup_free_ char *p = NULL, *cred_data_b64 = NULL;
ssize_t n;
n = base64mem(cred->data, cred->size, &cred_data_b64);
if (n < 0)
return log_oom();
p = path_join(smbios_dir, cred->id);
if (!p)
return log_oom();
r = write_string_filef(
p,
WRITE_STRING_FILE_CREATE|WRITE_STRING_FILE_AVOID_NEWLINE|WRITE_STRING_FILE_MODE_0600,
"io.systemd.credential.binary:%s=%s", cred->id, cred_data_b64);
if (r < 0)
return log_error_errno(r, "Failed to write smbios credential file %s: %m", p);
r = strv_extend(&cmdline, "-smbios");
if (r < 0)
return log_oom();
r = strv_extendf(&cmdline, "type=11,value=io.systemd.credential.binary:%s=%s", cred->id, cred_data_b64);
r = strv_extendf(&cmdline, "type=11,path=%s", p);
if (r < 0)
return log_oom();
}
@ -2631,6 +2553,77 @@ static int run_virtual_machine(int kvm_device_fd, int vhost_device_fd) {
log_debug("Executing: %s", joined);
}
assert_se(sigprocmask_many(SIG_BLOCK, /* ret_old_mask=*/ NULL, SIGCHLD) >= 0);
_cleanup_(pidref_done) PidRef child_pidref = PIDREF_NULL;
r = pidref_safe_fork_full(
qemu_binary,
/* stdio_fds= */ NULL,
pass_fds, n_pass_fds,
FORK_RESET_SIGNALS|FORK_CLOSE_ALL_FDS|FORK_DEATHSIG_SIGTERM|FORK_LOG|FORK_CLOEXEC_OFF|FORK_RLIMIT_NOFILE_SAFE,
&child_pidref);
if (r < 0)
return r;
if (r == 0) {
if (setenv("LANG", "C.UTF-8", 0) < 0) {
log_oom();
goto fail;
}
execv(qemu_binary, cmdline);
log_error_errno(errno, "Failed to execve %s: %m", qemu_binary);
fail:
_exit(EXIT_FAILURE);
}
/* Close relevant fds we passed to qemu in the parent. We don't need them anymore. */
child_vsock_fd = safe_close(child_vsock_fd);
tap_fd = safe_close(tap_fd);
if (!arg_keep_unit) {
/* When a new scope is created for this container, then we'll be registered as its controller, in which
* case PID 1 will send us a friendly RequestStop signal, when it is asked to terminate the
* scope. Let's hook into that, and cleanly shut down the container, and print a friendly message. */
r = sd_bus_match_signal_async(
runtime_bus,
/* ret= */ NULL,
"org.freedesktop.systemd1",
/* path= */ NULL,
"org.freedesktop.systemd1.Scope",
"RequestStop",
on_request_stop,
/* install_callback= */ NULL,
/* userdata= */ NULL);
if (r < 0)
return log_error_errno(r, "Failed to request RequestStop match: %m");
}
bool scope_allocated = false;
if (!arg_keep_unit && (!arg_register || !arg_privileged)) {
r = allocate_scope(
runtime_bus,
arg_machine,
&child_pidref,
children,
n_children,
unit,
arg_slice,
arg_property,
/* allow_pidfd= */ true);
if (r < 0)
return r;
scope_allocated = true;
} else {
if (arg_privileged)
r = cg_pid_get_unit(0, &unit);
else
r = cg_pid_get_user_unit(0, &unit);
if (r < 0)
return log_error_errno(r, "Failed to get our own unit: %m");
}
bool registered = false;
if (arg_register) {
char vm_address[STRLEN("vsock/") + DECIMAL_STR_MAX(unsigned)];
@ -2652,33 +2645,6 @@ static int run_virtual_machine(int kvm_device_fd, int vhost_device_fd) {
registered = true;
}
_cleanup_free_ char *nulstr = NULL;
size_t nulstr_size = 0;
if (strv_make_nulstr(cmdline, &nulstr, &nulstr_size) < 0)
return log_oom();
/* First datagram: the command line to execute */
ssize_t n = send(cmdline_socket[1], nulstr, nulstr_size, /* flags= */ 0);
if (n < 0)
return log_error_errno(errno, "Failed to send command line: %m");
/* Second datagram: the file descriptor array and the fds inside it */
n = send_many_fds_iov(
cmdline_socket[1],
pass_fds, n_pass_fds, /* both as payload … */
&IOVEC_MAKE(pass_fds, n_pass_fds * sizeof(int)), /* … and as auxiliary fds */
/* iovlen= */ 1,
/* flags= */ 0);
if (n < 0)
return log_error_errno(n, "Failed to send file descriptors to child: %m");
/* We submitted the command line now, qemu is running now */
cmdline_socket[1] = safe_close(cmdline_socket[1]);
/* Close relevant fds we passed to qemu in the parent. We don't need them anymore. */
child_vsock_fd = safe_close(child_vsock_fd);
tap_fd = safe_close(tap_fd);
/* Report that the VM is now set up */
(void) sd_notifyf(/* unset_environment= */ false,
"STATUS=VM started.\n"
@ -2695,12 +2661,6 @@ static int run_virtual_machine(int kvm_device_fd, int vhost_device_fd) {
polkit_agent_close();
_cleanup_(sd_event_source_unrefp) sd_event_source *notify_event_source = NULL;
_cleanup_(sd_event_unrefp) sd_event *event = NULL;
r = sd_event_new(&event);
if (r < 0)
return log_error_errno(r, "Failed to get default event source: %m");
(void) sd_event_set_watchdog(event, true);
if (system_bus) {
r = sd_bus_attach_event(system_bus, event, 0);


@ -220,7 +220,8 @@ if rpm.found() and rpmspec.found()
test('test-rpm-macros',
test_rpm_macros,
suite : 'dist',
args : [meson.project_build_root()])
args : [meson.project_build_root()],
depends : rpm_depends)
endif
else
message('Skipping test-rpm-macros since rpm and/or rpmspec are not available')