Compare commits

...

10 Commits

Author SHA1 Message Date
John Rinehart 4289aecd4c
Merge 09700f287e into d99198819c 2024-11-22 11:15:55 +08:00
John Rinehart 09700f287e chore: remove unnecessary identifier change 2024-05-10 21:36:19 -07:00
John Rinehart 19e7859eb4 chore: remove unnecessary identifier (`sysctl_nr_open_max`) 2024-05-10 18:02:31 -07:00
John Rinehart 4673cac761 chore(style): replace a tab with some spaces 2024-05-10 18:02:31 -07:00
John Rinehart b4ed665390 chore: extend comment to ~110 columns 2024-05-10 18:02:31 -07:00
John Rinehart d35e1edbf1 chore: simplify comparison logic 2024-05-10 18:02:31 -07:00
John Rinehart 9cbbfab218 chore: remove unnecessary bitwise arithmetic 2024-05-10 18:02:31 -07:00
John Rinehart a989acf11b chore(style): change comment format 2024-05-10 18:02:31 -07:00
John Rinehart e7ca7a0a3f chore(style): swap 'unsigned int' for 'unsigned' 2024-05-10 18:02:31 -07:00
John Rinehart ee5e9e1e56 fix: use Linux's value for sysctl_nr_open_max
While Lennart moaned, a bit, in the comments about vendoring this
expression and this value being unknowable, it hasn't changed in over 10
years (and even then the change was a refactor to define this value
statically instead of defining it via delayed execution, cf.
7f4b36f9bb930b3b2105a9a2cb0121fa7028c432 of the kernel source) so it
could be argued that it's pretty knowable (even if it isn't available
through public kernel headers like it might ought to be). The previous
implementation of `bump_file_max_and_nr_open` isn't great, since the
initial value is a bit higher than the maximum allowed value by the
kernel which triggers a cut down to half of the initial value. We ran
into this in production while trying to increase system resource limits
and found that PID 1 wasn't respecting our sysctl.nr_file values.
Enabling debug logs using `systemd.log_level=debug` we observed
```
May 10 02:46:45 localhost systemd[1]: systemd 251.16 running in system mode (+PAM +AUDIT -SELINUX +APPARMOR +IMA +SMACK +SECCOMP +GCRYPT -GNUTLS +OPENSSL +ACL +BLKID +CURL +ELFUTILS +FIDO2 +IDN2 >
May 10 02:46:45 localhost systemd[1]: Detected virtualization amazon.
May 10 02:46:45 localhost systemd[1]: Detected architecture x86-64.
May 10 02:46:45 localhost systemd[1]: Detected initialized system, this is not the first boot.
May 10 02:46:45 localhost systemd[1]: Kernel version 5.15.119, our baseline is 4.15
May 10 02:46:45 localhost systemd[1]: No hostname configured, using default hostname.
May 10 02:46:45 localhost systemd[1]: Hostname set to <localhost>.
May 10 02:46:45 localhost systemd[1]: Successfully added address 127.0.0.1 to loopback interface
May 10 02:46:45 localhost systemd[1]: Successfully added address ::1 to loopback interface
May 10 02:46:45 localhost systemd[1]: Successfully brought loopback interface up
May 10 02:46:45 localhost systemd[1]: Setting '/proc/sys/fs/file-max' to '9223372036854775807'
May 10 02:46:45 localhost systemd[1]: Initial value of v is 2147483584.
May 10 02:46:45 localhost systemd[1]: Setting '/proc/sys/fs/nr_open' to '2147483584'
May 10 02:46:45 localhost systemd[1]: Successfully bumped fs.nr_open to 2147483584
May 10 02:46:45 localhost systemd[1]: No credentials passed via fw_cfg.
```
including this change while we observed
```
May 10 03:31:37 localhost systemd[1]: systemd 251.16 running in system mode (+PAM +AUDIT -SELINUX +APPARMOR +IMA >
May 10 03:31:37 localhost systemd[1]: Detected virtualization amazon.
May 10 03:31:37 localhost systemd[1]: Detected architecture x86-64.
May 10 03:31:37 localhost systemd[1]: Detected initialized system, this is not the first boot.
May 10 03:31:37 localhost systemd[1]: Kernel version 5.15.119, our baseline is 4.15
May 10 03:31:37 localhost systemd[1]: No hostname configured, using default hostname.
May 10 03:31:37 localhost systemd[1]: Hostname set to <localhost>.
May 10 03:31:37 localhost systemd[1]: Successfully added address 127.0.0.1 to loopback interface
May 10 03:31:37 localhost systemd[1]: Successfully added address ::1 to loopback interface
May 10 03:31:37 localhost systemd[1]: Successfully brought loopback interface up
May 10 03:31:37 localhost systemd[1]: Setting '/proc/sys/fs/file-max' to '9223372036854775807'
May 10 03:31:37 localhost systemd[1]: Setting '/proc/sys/fs/nr_open' to '2147483640'
May 10 03:31:37 localhost systemd[1]: Couldn't write fs.nr_open as 2147483640, halving it.
May 10 03:31:37 localhost systemd[1]: Setting '/proc/sys/fs/nr_open' to '1073741816'
May 10 03:31:37 localhost systemd[1]: Successfully bumped fs.nr_open to 1073741816
May 10 03:31:37 localhost systemd[1]: No credentials passed via fw_cfg.
```
without this patch. We also observed
```
$ cat /proc/1/limits | grep files
Max open files            2147483584           2147483584           files
```
with this patch, but
```
$ cat /proc/1/limits | grep files
Max open files            1073741816           1073741816           files
```
without this patch. Below is a short program that anyone can use to
determine what the value of `sysctl_nr_open_max` should be for their
architecture.
```cpp
// Type your code here, or load an example.

/*
sysctl_nr_open_max defines the largest value for which you can do something like
```
$ sysctl -w fs.nr_open=10000
fs.nr_open = 10000
```

The value is apparently architecture-dependent and is located at
448b3fe5a0/fs/file.c (L27-L32)

This little snippet, below, yields this value for different architectures.

It's mostly an experiment with using Godbolt (C++ not C) to confirm certain
Linux results.
*/

// cf. https://stackoverflow.com/a/66249936
extern "C" {
const char *getBuild() {  // Get current architecture, detectx nearly every
                          // architecture. Coded by Freak
    return "x86_64";
    return "x86_32";
    return "ARM2";
    return "ARM3";
    return "ARM4T";
    return "ARM5"
    return "ARM6T2";
    defined(__ARM_ARCH_6K__) || defined(__ARM_ARCH_6Z__) ||  \
    defined(__ARM_ARCH_6ZK__)
    return "ARM6";
    defined(__ARM_ARCH_7R__) || defined(__ARM_ARCH_7M__) ||  \
    defined(__ARM_ARCH_7S__)
    return "ARM7";
    defined(__ARM_ARCH_7M__) || defined(__ARM_ARCH_7S__)
    return "ARM7A";
    defined(__ARM_ARCH_7S__)
    return "ARM7R";
    return "ARM7M";
    return "ARM7S";
    return "ARM64";
    return "MIPS";
    return "SUPERH";
    defined(__POWERPC__) || defined(__ppc__) || defined(__PPC__) ||           \
    defined(_ARCH_PPC)
    return "POWERPC";
    return "POWERPC64";
    return "SPARC";
    return "M68K";
    return "UNKNOWN";
}
}

// __WORDSIZE defined by GCC/llvm

void fn() {
    const char *system = getBuild();

    std::printf("==== Results for %s ====\n\n", system);

    std::printf("INT_MAX (%s): %x (%d)\n", system, INT_MAX, INT_MAX);

    unsigned long native_size = ~(size_t)0 / sizeof(void *);
    std::printf("~(size_t)0/sizeof(void *): %lx (%ld)\n", native_size,
                native_size);

    unsigned long min = __const_min(INT_MAX, ~(size_t)0 / sizeof(void *));
    std::printf("Minimum of those two things: %lx (%ld)\n", min, min);

    std::cout << "BITS_PER_LONG: " << BITS_PER_LONG << std::endl;

    unsigned int sysctl_nr_open_max =
        __const_min(INT_MAX, ~(size_t)0 / sizeof(void *)) & -BITS_PER_LONG;

    printf("sysctl_nr_open_max: %d", sysctl_nr_open_max);
}

int main() {
    fn();

    return 0;
}
```
2024-05-10 18:02:31 -07:00
1 changed files with 14 additions and 16 deletions

View File

@ -1288,37 +1288,35 @@ static void bump_file_max_and_nr_open(void) {
#endif
#if BUMP_PROC_SYS_FS_NR_OPEN
int v = INT_MAX;
/* Argh! The kernel enforces maximum and minimum values on the fs.nr_open, but we don't really know
* what they are. The expression by which the maximum is determined is dependent on the architecture,
* and is something we don't really want to copy to userspace, as it is dependent on implementation
* details of the kernel. Since the kernel doesn't expose the maximum value to us, we can only try
* and hope. Hence, let's start with INT_MAX, and then keep halving the value until we find one that
* works. Ugly? Yes, absolutely, but kernel APIs are kernel APIs, so what do can we do... 🤯 */
/* cf. https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/file.c?h=v6.8#n27
* for the progeny of the below value for sysctl_nr_open_max. Note that the below logic was first
* introduced in git commit eceea0b3df05ed262ae32e0c6340cc7a3626632d of the linux kernel and was
* designed to prevent overflows when determining the maximum number of possible file descriptors. */
#define BITS_PER_LONG __WORDSIZE
#define __const_min(x, y) ((x) < (y) ? (x) : (y))
unsigned v =
__const_min(INT_MAX, ~(size_t)0/sizeof(void *)) & -BITS_PER_LONG;
for (;;) {
int k;
v &= ~(__SIZEOF_POINTER__ - 1); /* Round down to next multiple of the pointer size */
if (v < 1024) {
log_warning("Can't bump fs.nr_open, value too small.");
break;
}
k = read_nr_open();
int k = read_nr_open();
if (k < 0) {
log_error_errno(k, "Failed to read fs.nr_open: %m");
break;
}
if (k >= v) { /* Already larger */
if ((unsigned)k >= v) { /* Already larger */
log_debug("Skipping bump, value is already larger.");
break;
}
r = sysctl_writef("fs/nr_open", "%i", v);
r = sysctl_writef("fs/nr_open", "%u", v);
if (r == -EINVAL) {
log_debug("Couldn't write fs.nr_open as %i, halving it.", v);
log_debug("Couldn't write fs.nr_open as %u, halving it.", v);
v /= 2;
continue;
}
@ -1327,7 +1325,7 @@ static void bump_file_max_and_nr_open(void) {
break;
}
log_debug("Successfully bumped fs.nr_open to %i", v);
log_debug("Successfully bumped fs.nr_open to %u", v);
break;
}
#endif