FKIE_CVE-2026-46223
Vulnerability from fkie_nvd - Published: 2026-05-28 10:16 - Updated: 2026-06-17 10:53
Severity
Summary
In the Linux kernel, the following vulnerability has been resolved:
cgroup: Defer css percpu_ref kill on rmdir until cgroup is depopulated
A chain of commits going back to v7.0 reworked rmdir to satisfy the
controller invariant that a subsystem's ->css_offline() must not run while
tasks are still doing kernel-side work in the cgroup.
[1] d245698d727a ("cgroup: Defer task cgroup unlink until after the task is done switching out")
[2] a72f73c4dd9b ("cgroup: Don't expose dead tasks in cgroup")
[3] 1b164b876c36 ("cgroup: Wait for dying tasks to leave on rmdir")
[4] 4c56a8ac6869 ("cgroup: Fix cgroup_drain_dying() testing the wrong condition")
[5] 13e786b64bd3 ("cgroup: Increment nr_dying_subsys_* from rmdir context")
[1] moved task cset unlink from do_exit() to finish_task_switch() so a
task's cset link drops only after the task has fully stopped scheduling.
That made tasks past exit_signals() linger on cset->tasks until their final
context switch, which led to a series of problems as what userspace expected
to see after rmdir diverged from what the kernel needs to wait for. [2]-[5]
tried to bridge that divergence: [2] filtered the exiting tasks from
cgroup.procs; [3] had rmdir(2) sleep in TASK_UNINTERRUPTIBLE for them; [4]
fixed the wait's condition; [5] made nr_dying_subsys_* visible
synchronously.
The cgroup_drain_dying() wait in [3] turned out to be a dead end. When the
rmdir caller is also the reaper of a zombie that pins a pidns teardown (e.g.
host PID 1 systemd reaping orphan pids that were re-parented to it during
the same teardown), rmdir blocks in TASK_UNINTERRUPTIBLE waiting for those
pids to free, the pids can't free because PID 1 is the reaper and it's stuck
in rmdir, and the system A-A deadlocks. No internal lock ordering breaks
this; the wait itself is the bug.
The css killing side that drove the original reorder, however, can be made
cleanly asynchronous: ->css_offline() is already async, run from
css_killed_work_fn() driven by percpu_ref_kill_and_confirm(). The fix is to
make that chain start only after all tasks have left the cgroup. rmdir's
user-visible side then returns as soon as cgroup.procs and friends are
empty, while ->css_offline() still runs only after the cgroup is fully
drained.
Verified by the original reproducer (pidns teardown + zombie reaper, runs
under vng) which hangs vanilla and succeeds here, and by per-commit
deterministic repros for [2], [3], [4], [5] with a boot parameter that
widens the post-exit_signals() window so each state is reliably reachable.
Some stress tests on top of that.
cgroup_apply_control_disable() has the same shape of pre-existing race:
when a controller is disabled via subtree_control, kill_css() ran
synchronously while tasks past exit_signals() could still be linked to
the cgroup's csets, and ->css_offline() could fire before they drained.
This patch preserves the existing synchronous behavior at that call site
(kill_css_sync() + kill_css_finish() back-to-back) and a follow-up patch
will defer kill_css_finish() there using a per-css trigger.
This seems like the right approach and I don't see problems with it. The
changes are somewhat invasive but not excessively so, so backporting to
-stable should be okay. If something does turn out to be wrong, the fallback
is to revert the entire chain ([1]-[5]) and rework in the development branch
instead.
v2: Pin cgrp across the deferred destroy work with explicit
cgroup_get()/cgroup_put() around queue_work() and the work_fn. v1
wasn't actually broken (ordered cgroup_offline_wq + queue_work order
in cgroup_task_dead() saved it) but the explicit ref removes the
dependency on those non-obvious invariants. Also note the
pre-existing cgroup_apply_control_disable() race in the description;
a follow-up will defer kill_css_finish() there.
References
Impacted products
| Vendor | Product | Version | |
|---|---|---|---|
| linux | linux_kernel | * | |
| linux | linux_kernel | * | |
| linux | linux_kernel | 7.0 | |
| linux | linux_kernel | 7.0 | |
| linux | linux_kernel | 7.1 | |
| linux | linux_kernel | 7.1 |
{
"affected": [
{
"affectedData": [
{
"defaultStatus": "unaffected",
"product": "Linux",
"programFiles": [
"include/linux/cgroup-defs.h",
"kernel/cgroup/cgroup.c"
],
"repo": "https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git",
"vendor": "Linux",
"versions": [
{
"lessThan": "33fa2e6b1507a0a377a151a8826438bedad1d0b0",
"status": "affected",
"version": "1b164b876c36c3eb5561dd9b37702b04401b0166",
"versionType": "git"
},
{
"lessThan": "93618edf753838a727dbff63c7c291dee22d656b",
"status": "affected",
"version": "1b164b876c36c3eb5561dd9b37702b04401b0166",
"versionType": "git"
},
{
"status": "affected",
"version": "78c72bce4a87819126211c0d24e18350010604fb",
"versionType": "git"
},
{
"lessThan": "6.20",
"status": "affected",
"version": "6.19.12",
"versionType": "semver"
}
]
},
{
"defaultStatus": "affected",
"product": "Linux",
"programFiles": [
"include/linux/cgroup-defs.h",
"kernel/cgroup/cgroup.c"
],
"repo": "https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git",
"vendor": "Linux",
"versions": [
{
"status": "affected",
"version": "7.0"
},
{
"lessThan": "7.0",
"status": "unaffected",
"version": "0",
"versionType": "semver"
},
{
"lessThanOrEqual": "7.0.*",
"status": "unaffected",
"version": "7.0.9",
"versionType": "semver"
},
{
"lessThanOrEqual": "*",
"status": "unaffected",
"version": "7.1",
"versionType": "original_commit_for_fix"
}
]
}
],
"source": "416baaa9-dc9f-4396-8d5f-8c081fb06d67"
}
],
"configurations": [
{
"nodes": [
{
"cpeMatch": [
{
"criteria": "cpe:2.3:o:linux:linux_kernel:*:*:*:*:*:*:*:*",
"matchCriteriaId": "3E8EC0D8-1F19-40EC-8F9E-9E4FCF84758C",
"versionEndExcluding": "7.0",
"versionStartIncluding": "6.19.12",
"vulnerable": true
},
{
"criteria": "cpe:2.3:o:linux:linux_kernel:*:*:*:*:*:*:*:*",
"matchCriteriaId": "D623DD8A-11FD-499C-9170-FDCF8C2AB363",
"versionEndExcluding": "7.0.9",
"versionStartIncluding": "7.0.1",
"vulnerable": true
},
{
"criteria": "cpe:2.3:o:linux:linux_kernel:7.0:-:*:*:*:*:*:*",
"matchCriteriaId": "EF897730-3F1E-47A2-8B07-22535202C487",
"vulnerable": true
},
{
"criteria": "cpe:2.3:o:linux:linux_kernel:7.0:rc7:*:*:*:*:*:*",
"matchCriteriaId": "512EE3A8-A590-4501-9A94-5D4B268D6138",
"vulnerable": true
},
{
"criteria": "cpe:2.3:o:linux:linux_kernel:7.1:rc1:*:*:*:*:*:*",
"matchCriteriaId": "B1EF7059-E670-45F4-B422-54C40FA86390",
"vulnerable": true
},
{
"criteria": "cpe:2.3:o:linux:linux_kernel:7.1:rc2:*:*:*:*:*:*",
"matchCriteriaId": "0D38F0BF-A728-4133-A358-D44A2F7EE6D6",
"vulnerable": true
}
],
"negate": false,
"operator": "OR"
}
]
}
],
"cveTags": [],
"descriptions": [
{
"lang": "en",
"value": "In the Linux kernel, the following vulnerability has been resolved:\n\ncgroup: Defer css percpu_ref kill on rmdir until cgroup is depopulated\n\nA chain of commits going back to v7.0 reworked rmdir to satisfy the\ncontroller invariant that a subsystem\u0027s -\u003ecss_offline() must not run while\ntasks are still doing kernel-side work in the cgroup.\n\n[1] d245698d727a (\"cgroup: Defer task cgroup unlink until after the task is done switching out\")\n[2] a72f73c4dd9b (\"cgroup: Don\u0027t expose dead tasks in cgroup\")\n[3] 1b164b876c36 (\"cgroup: Wait for dying tasks to leave on rmdir\")\n[4] 4c56a8ac6869 (\"cgroup: Fix cgroup_drain_dying() testing the wrong condition\")\n[5] 13e786b64bd3 (\"cgroup: Increment nr_dying_subsys_* from rmdir context\")\n\n[1] moved task cset unlink from do_exit() to finish_task_switch() so a\ntask\u0027s cset link drops only after the task has fully stopped scheduling.\nThat made tasks past exit_signals() linger on cset-\u003etasks until their final\ncontext switch, which led to a series of problems as what userspace expected\nto see after rmdir diverged from what the kernel needs to wait for. [2]-[5]\ntried to bridge that divergence: [2] filtered the exiting tasks from\ncgroup.procs; [3] had rmdir(2) sleep in TASK_UNINTERRUPTIBLE for them; [4]\nfixed the wait\u0027s condition; [5] made nr_dying_subsys_* visible\nsynchronously.\n\nThe cgroup_drain_dying() wait in [3] turned out to be a dead end. When the\nrmdir caller is also the reaper of a zombie that pins a pidns teardown (e.g.\nhost PID 1 systemd reaping orphan pids that were re-parented to it during\nthe same teardown), rmdir blocks in TASK_UNINTERRUPTIBLE waiting for those\npids to free, the pids can\u0027t free because PID 1 is the reaper and it\u0027s stuck\nin rmdir, and the system A-A deadlocks. No internal lock ordering breaks\nthis; the wait itself is the bug.\n\nThe css killing side that drove the original reorder, however, can be made\ncleanly asynchronous: -\u003ecss_offline() is already async, run from\ncss_killed_work_fn() driven by percpu_ref_kill_and_confirm(). The fix is to\nmake that chain start only after all tasks have left the cgroup. rmdir\u0027s\nuser-visible side then returns as soon as cgroup.procs and friends are\nempty, while -\u003ecss_offline() still runs only after the cgroup is fully\ndrained.\n\nVerified by the original reproducer (pidns teardown + zombie reaper, runs\nunder vng) which hangs vanilla and succeeds here, and by per-commit\ndeterministic repros for [2], [3], [4], [5] with a boot parameter that\nwidens the post-exit_signals() window so each state is reliably reachable.\nSome stress tests on top of that.\n\ncgroup_apply_control_disable() has the same shape of pre-existing race:\nwhen a controller is disabled via subtree_control, kill_css() ran\nsynchronously while tasks past exit_signals() could still be linked to\nthe cgroup\u0027s csets, and -\u003ecss_offline() could fire before they drained.\nThis patch preserves the existing synchronous behavior at that call site\n(kill_css_sync() + kill_css_finish() back-to-back) and a follow-up patch\nwill defer kill_css_finish() there using a per-css trigger.\n\nThis seems like the right approach and I don\u0027t see problems with it. The\nchanges are somewhat invasive but not excessively so, so backporting to\n-stable should be okay. If something does turn out to be wrong, the fallback\nis to revert the entire chain ([1]-[5]) and rework in the development branch\ninstead.\n\nv2: Pin cgrp across the deferred destroy work with explicit\n cgroup_get()/cgroup_put() around queue_work() and the work_fn. v1\n wasn\u0027t actually broken (ordered cgroup_offline_wq + queue_work order\n in cgroup_task_dead() saved it) but the explicit ref removes the\n dependency on those non-obvious invariants. Also note the\n pre-existing cgroup_apply_control_disable() race in the description;\n a follow-up will defer kill_css_finish() there."
}
],
"id": "CVE-2026-46223",
"lastModified": "2026-06-17T10:53:21.750",
"metrics": {
"cvssMetricV31": [
{
"cvssData": {
"attackComplexity": "LOW",
"attackVector": "LOCAL",
"availabilityImpact": "HIGH",
"baseScore": 5.5,
"baseSeverity": "MEDIUM",
"confidentialityImpact": "NONE",
"integrityImpact": "NONE",
"privilegesRequired": "LOW",
"scope": "UNCHANGED",
"userInteraction": "NONE",
"vectorString": "CVSS:3.1/AV:L/AC:L/PR:L/UI:N/S:U/C:N/I:N/A:H",
"version": "3.1"
},
"exploitabilityScore": 1.8,
"impactScore": 3.6,
"source": "nvd@nist.gov",
"type": "Primary"
}
]
},
"published": "2026-05-28T10:16:37.913",
"references": [
{
"source": "416baaa9-dc9f-4396-8d5f-8c081fb06d67",
"tags": [
"Patch"
],
"url": "https://git.kernel.org/stable/c/33fa2e6b1507a0a377a151a8826438bedad1d0b0"
},
{
"source": "416baaa9-dc9f-4396-8d5f-8c081fb06d67",
"tags": [
"Patch"
],
"url": "https://git.kernel.org/stable/c/93618edf753838a727dbff63c7c291dee22d656b"
}
],
"sourceIdentifier": "416baaa9-dc9f-4396-8d5f-8c081fb06d67",
"vulnStatus": "Analyzed",
"weaknesses": [
{
"description": [
{
"lang": "en",
"value": "CWE-667"
}
],
"source": "nvd@nist.gov",
"type": "Primary"
}
]
}
Loading…
Loading…
Experimental. This forecast is provided for visualization only and may change without notice. Do not use it for operational decisions.
Forecast uses a logistic model when the trend is rising, or an exponential decay model when the trend is falling. Fitted via linearized least squares.
Sightings
| Author | Source | Type | Date | Other |
|---|
Nomenclature
- Seen: The vulnerability was mentioned, discussed, or observed by the user.
- Confirmed: The vulnerability has been validated from an analyst's perspective.
- Published Proof of Concept: A public proof of concept is available for this vulnerability.
- Exploited: The vulnerability was observed as exploited by the user who reported the sighting.
- Patched: The vulnerability was observed as successfully patched by the user who reported the sighting.
- Not exploited: The vulnerability was not observed as exploited by the user who reported the sighting.
- Not confirmed: The user expressed doubt about the validity of the vulnerability.
- Not patched: The vulnerability was not observed as successfully patched by the user who reported the sighting.
Loading…
Loading…