Parsing a large model needs > 200GB RAM? OOM -> killed

I am trying to push a large publicly available model (facebook/vjepa2-vitl-fpc64-256 on HF) through the conversion pipeline, to see if it can work.

It’s “only” a 326M param model - is that beyond anything I can ever expect to run? I can run it on CPU, with peak RAM ~ a few GB.

But I am quite stuck: the hailo parser always hits an OOM error, even on a rented system with 200GB of RAM.

I am getting the model as:

from transformers import AutoModel
model = AutoModel.from_pretrained(hf_repo).encoder

(note - only getting the encoder)

I am converting to onnx like this:

import torch

torch.onnx.export(
    model,
    torch.zeros(1, 64, 3, 256, 256, device="cpu"),  # exact input size (batch, frames, channels, H, W)
    "model.onnx",
    input_names=["input"],
    output_names=["output"],
    opset_version=14,
    dynamic_axes=None,  # fully static shapes
)

And then trying to parse like this:

hailo parser onnx model.onnx  \
           --har-path model.har \
           --hw-arch  hailo8   

Any ideas? Impossible?

I can do hailomz parse vit_base, which works fine - although that model has roughly a quarter of the parameter count of the one I am trying to convert.

Hey @Andrew_Pullin,

Welcome to the Hailo Community!

So here's what's happening with your model - our parser has to unroll all 64 frames into one big static computation graph during parsing, and that's where things get messy. Your model weights (~326M parameters) only take a few gigabytes, but the real memory hog is the intermediate activation buffers. Think feature maps times 64 time steps - during ONNX shape inference and graph building, these can easily blow up to tens or hundreds of GB. That's why you're running out of memory even with 200GB available.
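To put rough numbers on that (a hypothetical back-of-envelope, not the parser's exact bookkeeping - the head count below is an assumption, the hidden size and depth are standard ViT-Large values):

# Rough, hypothetical estimate of the unrolled intermediate tensors
# (illustration only, not the parser's actual memory accounting).
tokens = (256 // 16) ** 2 * (64 // 2)  # 8192 tokens for a 64-frame 256x256 clip
hidden, layers, heads = 1024, 24, 16   # ViT-Large-style config (assumed)
fp32 = 4                               # bytes per value

# Per-layer hidden states are modest...
hidden_states_gb = tokens * hidden * fp32 * layers / 1e9
# ...but attention score matrices are tokens x tokens, per head, per layer:
attention_gb = tokens * tokens * heads * fp32 * layers / 1e9

print(f"hidden states, all layers:  ~{hidden_states_gb:.1f} GB")  # ~0.8 GB
print(f"attention maps, all layers: ~{attention_gb:.1f} GB")      # ~103 GB

Once the parser's own graph bookkeeping and temporary copies are layered on top of tensors that size, 200GB disappears quickly.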

The harsh reality is that a full 64-frame ViT-Large model just won’t fit in the Hailo-8’s memory in one go. But here’s what you can try:

Parse it in chunks

Break down the parsing by targeting specific parts of your graph using our CLI flags:

hailo parser onnx model.onnx \
  --start-node-names input.1 \
  --end-node-names 192 \
  --net-input-shapes input.1:1,64,3,256,256 \
  --har-path model_part1.har \
  --hw-arch hailo8

This only parses from “input.1” to node “192” (you’ll need to figure out the right boundary nodes for your model), which cuts down the peak memory usage significantly.
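If you're not sure which node names to use as boundaries, you can dump them from the exported graph with the onnx Python package (or browse the graph in Netron):

import onnx

# Print node names and output tensor names so you can pick boundary points
# for --start-node-names / --end-node-names.
m = onnx.load("model.onnx")
for i, node in enumerate(m.graph.node):
    print(i, node.op_type, node.name, "->", list(node.output))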

Bottom line - parsing a 64-frame ViT-Large as one static graph is just too much for the Hailo-8. I’d start with the segmented approach or maybe rework it to process frames individually. Those tend to be the most realistic options for this hardware.
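If you'd rather split the ONNX file itself before parsing, onnx.utils.extract_model can carve out a sub-graph between named tensors - a minimal sketch, where "block_12_output" is a placeholder you'd replace with a real intermediate tensor name from your graph:

import onnx.utils

# Hypothetical two-way split at an intermediate tensor; substitute real
# tensor names taken from the node listing above.
onnx.utils.extract_model("model.onnx", "model_part1.onnx",
                         input_names=["input"],
                         output_names=["block_12_output"])
onnx.utils.extract_model("model.onnx", "model_part2.onnx",
                         input_names=["block_12_output"],
                         output_names=["output"])

Each part can then be fed to the parser separately.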

Hope this helps!

Interesting. My understanding was that V-JEPA2, in this case, turns a 64-frame 256x256 image stack into 16x16x2 tubelets, so it should ultimately be (256/16)**2 * (64/2) = 8192 tokens into the transformer stack.

So it's not quite "rolling out" over frames, in the sense of repeated inferences of the same model over a sequence of inputs.
For example, running the stock code on the HF model page gives an output of size torch.Size([1, 8192, 1024]).

Still pretty big, but I'm surprised we're getting > 200GB for the graph here.
Maybe it has something to do with having all the transformer blocks materialized (24 for this model) at once?

I'll try the partitioning technique, though. (Also pretty interesting - it implies that any model could be "pipelined" into a lot of smaller models.)

So: Is there public information on what the maximum memory is? And is it the case for the Hailo8 & family that the whole model (graph, weights, intermediate product) must all fit into the RAM on the module?

I’d recommend checking out the DFC and MZ guides - they contain most of our compilation documentation. You can find them here: https://hailo.ai/developer-zone/documentation/

And yes, we regularly pipeline models when they’re large. We handle it in various ways, but it’s definitely feasible to do what you’re looking for.