ghsa-38vr-8qrj-3x4h
Vulnerability from github
In the Linux kernel, the following vulnerability has been resolved:
xhci: Handle TD clearing for multiple streams case
When multiple streams are in use, multiple TDs might be in flight when an endpoint is stopped. We need to issue a Set TR Dequeue Pointer for each, to ensure everything is reset properly and the caches cleared. Change the logic so that any N>1 TDs found active for different streams are deferred until after the first one is processed, calling xhci_invalidate_cancelled_tds() again from xhci_handle_cmd_set_deq() to queue another command until we are done with all of them. Also change the error/"should never happen" paths to ensure we at least clear any affected TDs, even if we can't issue a command to clear the hardware cache, and complain loudly with an xhci_warn() if this ever happens.
This problem case dates back to commit e9df17eb1408 ("USB: xhci: Correct assumptions about number of rings per endpoint.") early on in the XHCI driver's life, when stream support was first added. It was then identified but not fixed nor made into a warning in commit 674f8438c121 ("xhci: split handling halted endpoints into two steps"), which added a FIXME comment for the problem case (without materially changing the behavior as far as I can tell, though the new logic made the problem more obvious).
Then later, in commit 94f339147fc3 ("xhci: Fix failure to give back some cached cancelled URBs."), it was acknowledged again.
[Mathias: commit 94f339147fc3 ("xhci: Fix failure to give back some cached cancelled URBs.") was a targeted regression fix to the previously mentioned patch. Users reported issues with usb stuck after unmounting/disconnecting UAS devices. This rolled back the TD clearing of multiple streams to its original state.]
Apparently the commit author was aware of the problem (yet still chose to submit it): It was still mentioned as a FIXME, an xhci_dbg() was added to log the problem condition, and the remaining issue was mentioned in the commit description. The choice of making the log type xhci_dbg() for what is, at this point, a completely unhandled and known broken condition is puzzling and unfortunate, as it guarantees that no actual users would see the log in production, thereby making it nigh undebuggable (indeed, even if you turn on DEBUG, the message doesn't really hint at there being a problem at all).
It took me months of random xHC crashes to finally find a reliable repro and be able to do a deep dive debug session, which could all have been avoided had this unhandled, broken condition been actually reported with a warning, as it should have been as a bug intentionally left in unfixed (never mind that it shouldn't have been left in at all).
Another fix to solve clearing the caches of all stream rings with cancelled TDs is needed, but not as urgent.
3 years after that statement and 14 years after the original bug was introduced, I think it's finally time to fix it. And maybe next time let's not leave bugs unfixed (that are actually worse than the original bug), and let's actually get people to review kernel commits please.
Fixes xHC crashes and IOMMU faults with UAS devices when handling
errors/faults. Easiest repro is to use hdparm
to mark an early sector
(e.g. 1024) on a disk as bad, then cat /dev/sdX > /dev/null
in a loop.
At least in the case of JMicron controllers, the read errors end up
having to cancel two TDs (for two queued requests to different streams)
and the one that didn't get cleared properly ends up faulting the xHC
entirely when it tries to access DMA pages that have since been unmapped,
referred to by the stale TDs. This normally happens quickly (after two
or three loops). After this fix, I left the cat
in a loop running
overnight and experienced no xHC failures, with all read errors
recovered properly. Repro'd and tested on an Apple M1 Mac Mini
(dwc3 host).
On systems without an IOMMU, this bug would instead silently corrupt freed memory, making this a ---truncated---
{ "affected": [], "aliases": [ "CVE-2024-40927" ], "database_specific": { "cwe_ids": [], "github_reviewed": false, "github_reviewed_at": null, "nvd_published_at": "2024-07-12T13:15:15Z", "severity": null }, "details": "In the Linux kernel, the following vulnerability has been resolved:\n\nxhci: Handle TD clearing for multiple streams case\n\nWhen multiple streams are in use, multiple TDs might be in flight when\nan endpoint is stopped. We need to issue a Set TR Dequeue Pointer for\neach, to ensure everything is reset properly and the caches cleared.\nChange the logic so that any N\u003e1 TDs found active for different streams\nare deferred until after the first one is processed, calling\nxhci_invalidate_cancelled_tds() again from xhci_handle_cmd_set_deq() to\nqueue another command until we are done with all of them. Also change\nthe error/\"should never happen\" paths to ensure we at least clear any\naffected TDs, even if we can\u0027t issue a command to clear the hardware\ncache, and complain loudly with an xhci_warn() if this ever happens.\n\nThis problem case dates back to commit e9df17eb1408 (\"USB: xhci: Correct\nassumptions about number of rings per endpoint.\") early on in the XHCI\ndriver\u0027s life, when stream support was first added.\nIt was then identified but not fixed nor made into a warning in commit\n674f8438c121 (\"xhci: split handling halted endpoints into two steps\"),\nwhich added a FIXME comment for the problem case (without materially\nchanging the behavior as far as I can tell, though the new logic made\nthe problem more obvious).\n\nThen later, in commit 94f339147fc3 (\"xhci: Fix failure to give back some\ncached cancelled URBs.\"), it was acknowledged again.\n\n[Mathias: commit 94f339147fc3 (\"xhci: Fix failure to give back some cached\ncancelled URBs.\") was a targeted regression fix to the previously mentioned\npatch. Users reported issues with usb stuck after unmounting/disconnecting\nUAS devices. This rolled back the TD clearing of multiple streams to its\noriginal state.]\n\nApparently the commit author was aware of the problem (yet still chose\nto submit it): It was still mentioned as a FIXME, an xhci_dbg() was\nadded to log the problem condition, and the remaining issue was mentioned\nin the commit description. The choice of making the log type xhci_dbg()\nfor what is, at this point, a completely unhandled and known broken\ncondition is puzzling and unfortunate, as it guarantees that no actual\nusers would see the log in production, thereby making it nigh\nundebuggable (indeed, even if you turn on DEBUG, the message doesn\u0027t\nreally hint at there being a problem at all).\n\nIt took me *months* of random xHC crashes to finally find a reliable\nrepro and be able to do a deep dive debug session, which could all have\nbeen avoided had this unhandled, broken condition been actually reported\nwith a warning, as it should have been as a bug intentionally left in\nunfixed (never mind that it shouldn\u0027t have been left in at all).\n\n\u003e Another fix to solve clearing the caches of all stream rings with\n\u003e cancelled TDs is needed, but not as urgent.\n\n3 years after that statement and 14 years after the original bug was\nintroduced, I think it\u0027s finally time to fix it. And maybe next time\nlet\u0027s not leave bugs unfixed (that are actually worse than the original\nbug), and let\u0027s actually get people to review kernel commits please.\n\nFixes xHC crashes and IOMMU faults with UAS devices when handling\nerrors/faults. Easiest repro is to use `hdparm` to mark an early sector\n(e.g. 1024) on a disk as bad, then `cat /dev/sdX \u003e /dev/null` in a loop.\nAt least in the case of JMicron controllers, the read errors end up\nhaving to cancel two TDs (for two queued requests to different streams)\nand the one that didn\u0027t get cleared properly ends up faulting the xHC\nentirely when it tries to access DMA pages that have since been unmapped,\nreferred to by the stale TDs. This normally happens quickly (after two\nor three loops). After this fix, I left the `cat` in a loop running\novernight and experienced no xHC failures, with all read errors\nrecovered properly. Repro\u0027d and tested on an Apple M1 Mac Mini\n(dwc3 host).\n\nOn systems without an IOMMU, this bug would instead silently corrupt\nfreed memory, making this a\n---truncated---", "id": "GHSA-38vr-8qrj-3x4h", "modified": "2024-07-12T15:31:28Z", "published": "2024-07-12T15:31:28Z", "references": [ { "type": "ADVISORY", "url": "https://nvd.nist.gov/vuln/detail/CVE-2024-40927" }, { "type": "WEB", "url": "https://git.kernel.org/stable/c/26460c1afa311524f588e288a4941432f0de6228" }, { "type": "WEB", "url": "https://git.kernel.org/stable/c/5ceac4402f5d975e5a01c806438eb4e554771577" }, { "type": "WEB", "url": "https://git.kernel.org/stable/c/61593dc413c3655e4328a351555235bc3089486a" }, { "type": "WEB", "url": "https://git.kernel.org/stable/c/633f72cb6124ecda97b641fbc119340bd88d51a9" }, { "type": "WEB", "url": "https://git.kernel.org/stable/c/949be4ec5835e0ccb3e2a8ab0e46179cb5512518" } ], "schema_version": "1.4.0", "severity": [] }
- Seen: The vulnerability was mentioned, discussed, or seen somewhere by the user.
- Confirmed: The vulnerability is confirmed from an analyst perspective.
- Exploited: This vulnerability was exploited and seen by the user reporting the sighting.
- Patched: This vulnerability was successfully patched by the user reporting the sighting.
- Not exploited: This vulnerability was not exploited or seen by the user reporting the sighting.
- Not confirmed: The user expresses doubt about the veracity of the vulnerability.
- Not patched: This vulnerability was not successfully patched by the user reporting the sighting.