fastparquet: out-of-bounds read in the BYTE_ARRAY (string) column decoder

This repo contains a proof-of-concept malicious Parquet file that triggers an out-of-bounds read in fastparquet when the file is loaded, crashing the process (SIGSEGV) and potentially disclosing adjacent heap memory.

Security PoC for a huntr Model File Format report ("Parquet"). The file is intentionally malformed and the repo is gated.

Affected

  • fastparquet, confirmed on 2026.5.0 (current); the vulnerable code is unchanged on main.
  • Reachable from the documented read APIs:
    • fastparquet.ParquetFile(path).to_pandas()
    • pandas.read_parquet(path, engine="fastparquet")
  • Note: pandas.read_parquet uses pyarrow by default; fastparquet is used when selected via engine="fastparquet" or when pyarrow is not installed.

Root cause

The bug is in unpack_byte_array_arrow() in fastparquet/speedups.pyx. A PLAIN-encoded BYTE_ARRAY value is a 4-byte little-endian length prefix followed by that many bytes of data. The decoder reads the length and then advances and copies that many bytes without validating either the length or the running offset against the end of the input buffer. The line numbers below refer to the Cython-generated speedups.c:

  • speedups.c:21722 reads the next value's 4-byte length prefix past the buffer.
  • speedups.c:22042 reads/copies the string bytes past the buffer.

The only loop guard is the value count, not the bytes consumed. A single oversized length walks the read pointer off the end of the allocation.
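The layout and the missing check can be illustrated with a small pure-Python model. This is a sketch for explanation only, not fastparquet code: encode_plain_byte_array and walk_unchecked are hypothetical names, and because Python slicing clamps at the buffer end, the model only computes where an unchecked native decoder would end up rather than actually over-reading.

```python
import struct


def encode_plain_byte_array(values):
    """Encode byte strings as PLAIN BYTE_ARRAY: 4-byte LE length, then data."""
    return b"".join(struct.pack("<I", len(v)) + v for v in values)


def walk_unchecked(buf, num_values):
    """Return the final read offset of a decoder guarded only by value count.

    A result greater than len(buf) means a native decoder following the
    same logic would have read past the end of the allocation.
    """
    pos = 0
    for _ in range(num_values):  # the only guard is the value count
        # Python slicing clamps at the buffer end; native code would not.
        prefix = buf[pos:pos + 4].ljust(4, b"\x00")
        (length,) = struct.unpack("<I", prefix)
        pos += 4 + length  # no comparison against len(buf)
    return pos


# Two well-formed values: the walk ends exactly at the buffer end.
page = encode_plain_byte_array([b"abcdefg", b"xy"])
assert walk_unchecked(page, 2) == len(page)

# Same bytes with the first length prefix replaced by 0x7FFFFFFF:
# after one value the offset is already far beyond the buffer.
bad = struct.pack("<I", 0x7FFFFFFF) + page[4:]
assert walk_unchecked(bad, 1) > len(bad)
```

The model makes the failure mode concrete: nothing in the loop ever compares pos with the buffer size, so one oversized prefix is enough.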

Proof of concept

poc.parquet is an 848-byte Parquet file with a single string column. It is well-formed except that the 4-byte length prefix of its first BYTE_ARRAY value has been patched from 7 to 0x7FFFFFFF. Loading it makes the decoder attempt to read ~2 GiB starting inside a small heap block.

pip install fastparquet pandas
python make_poc.py     # writes seed.parquet and poc.parquet
python verify.py       # loads poc.parquet in a child process; reports the crash

Observed on fastparquet 2026.5.0 (Linux): the child process terminates with SIGSEGV (exit -11). Under valgrind:

Invalid read of size 1
   at __pyx_pf_11fastparquet_8speedups_6unpack_byte_array_arrow (speedups.c:22042)
 Address 0x... is 0 bytes after a block of size 286 alloc'd
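
Loading the file in a child process keeps the crash from taking down the calling interpreter. The following is a minimal sketch of that pattern, assuming a POSIX system where a negative return code denotes termination by a signal. It is not the repository's verify.py, and run_in_child is a hypothetical helper name.

```python
import subprocess
import sys


def run_in_child(code, timeout=60):
    """Run a Python snippet in a separate interpreter and describe the outcome."""
    proc = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True,
        timeout=timeout,
    )
    rc = proc.returncode
    if rc == 0:
        return "ok"
    if rc < 0:
        # On POSIX, subprocess reports death by signal N as return code -N.
        return f"killed by signal {-rc}"
    return f"exited with status {rc}"
```

Passing a snippet that calls fastparquet.ParquetFile("poc.parquet").to_pandas() to this helper should, on a setup matching the one above, report signal 11, which corresponds to the observed exit code of -11.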

Impact

Loading an untrusted .parquet through fastparquet triggers an out-of-bounds read in native (Cython-compiled) code. At minimum this is a reliable crash / denial of service at file-read time. Because the over-read bytes are copied into the returned string column values, it can also disclose adjacent heap memory (information leak) for reads that do not immediately fault.

Fix

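In unpack_byte_array_arrow (and unpack_byte_array), validate each value against the remaining buffer size twice: before reading the length prefix and before copying the data. For example, reject when src_pos + 4 > buflen or src_pos + 4 + length > buflen. If these quantities are held in fixed-width integers, write the second check as length > buflen - src_pos - 4 so that a length such as 0x7FFFFFFF cannot overflow the sum, and reject negative lengths if the prefix is read as a signed value.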
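
The two checks can be sketched in pure Python as follows. This is an illustration of the validation logic under stated assumptions, not a patch for speedups.pyx: unpack_byte_array_checked is a hypothetical name, the prefix is read as a signed 32-bit integer so that negative values can be rejected explicitly, and both comparisons are written as subtractions to mirror what an overflow-safe native version would do.

```python
import struct


def unpack_byte_array_checked(buf, num_values):
    """Decode PLAIN BYTE_ARRAY values, rejecting any out-of-bounds length."""
    buflen = len(buf)
    pos = 0
    out = []
    for i in range(num_values):
        # Check 1: the 4-byte length prefix itself must fit in the buffer.
        if buflen - pos < 4:
            raise ValueError(
                f"value {i}: truncated length prefix at offset {pos}"
            )
        (length,) = struct.unpack_from("<i", buf, pos)
        pos += 4
        # Check 2: the data must fit in what remains after the prefix.
        remaining = buflen - pos
        if length < 0 or length > remaining:
            raise ValueError(
                f"value {i}: length {length} exceeds remaining {remaining} bytes"
            )
        out.append(bytes(buf[pos:pos + length]))
        pos += length
    return out
```

With these checks, a buffer whose first prefix is 0x7FFFFFFF raises ValueError on the first value instead of being read, and a value count larger than the buffer can supply is caught by the prefix check.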
