Can I use 3dCNN for motion detection

I’m currently working on a project where we need to detect whether a bus is stationary or in motion, using video input from an onboard camera. The idea is to analyze short video clips (a few consecutive frames) and classify them as either “moving” or “stationary”. From my research, a 3D CNN seems like a good fit for this task, since it can learn both spatial and temporal features directly from video data (unlike a 2D CNN, which looks at single frames).
However, I found in the Hailo Dataflow Compiler User Guide (pages 132–133) that:

  1. Models containing Conv3D layers must have rank-4 inputs/outputs, so the Conv3D must reside inside a “2D” model.

  2. The input to the first Conv3D needs to be created using a Concat layer after Unsqueeze.

  3. The last Conv3D must have output_features = 1, followed by Squeeze, then Conv2D or Resize.
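
If I’m reading this correctly, the pattern would look roughly like the sketch below in PyTorch. All layer names and shapes here are my own illustration of the guide’s constraints, not Hailo reference code:

```python
import torch
import torch.nn as nn

class Conv3DInside2DModel(nn.Module):
    """Illustrative only: Conv3D wrapped so the model's inputs/outputs stay rank-4."""
    def __init__(self, in_ch=3, num_frames=4):
        super().__init__()
        # 3x3x3 kernels, 1x1x1 strides (the preview-supported configuration)
        self.conv3d_a = nn.Conv3d(in_ch, 16, kernel_size=3, padding=1)
        # Last Conv3D must have output_features = 1
        self.conv3d_b = nn.Conv3d(16, 1, kernel_size=3, padding=1)
        # After Squeeze, the time axis acts as channels for a Conv2D
        self.conv2d = nn.Conv2d(num_frames, 8, kernel_size=3, padding=1)

    def forward(self, *frames):                 # each frame is rank-4: (N, C, H, W)
        # Unsqueeze each frame to add a depth axis, then Concat along it
        vol = torch.cat([f.unsqueeze(2) for f in frames], dim=2)  # (N, C, T, H, W)
        x = self.conv3d_a(vol)
        x = self.conv3d_b(x)                    # (N, 1, T, H, W)
        x = x.squeeze(1)                        # Squeeze back to rank-4: (N, T, H, W)
        return self.conv2d(x)                   # followed by Conv2D, per the guide
```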

Given these limitations, I wanted to ask:

  1. Is it practically feasible to implement a 3D CNN model for this kind of motion classification task on Hailo?

  2. Alternatively, would a CNN + LSTM or frame-difference-based method be more suitable given current Hailo SDK support?

Hey @Teja_tejaram432 ,

The Challenge with 3D CNNs

Conv3D operations are currently in preview with significant limitations. You can only use specific configurations (3x3x3 or 3x3xAny kernels, 1x1x1 or 1x1xAny strides, and limited padding options). Unfortunately, this means standard 3D CNN architectures with rank-5 inputs (batch, time, height, width, channels) won’t work directly. Full Conv3D support isn’t on the roadmap for this year, as it requires changes across the entire Hailo Dataflow Compiler and runtime.

Recommended Approach: CNN + LSTM/RNN

For your motion classification task, I’d suggest a hybrid architecture that’s better supported (see the sketch after this list):

  1. Spatial Feature Extraction: Use a 2D CNN backbone (ResNet, MobileNet, or even YOLO Seg) to extract features from individual frames
  2. Temporal Modeling: Connect an LSTM or GRU layer to capture motion patterns across frames
  3. Keep Sequences Short: Limit your sequence length to 10 frames or fewer (5 for bidirectional) to avoid performance issues
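
To make this concrete, here’s a minimal sketch of such a hybrid model in PyTorch. The backbone choice (MobileNetV3-Small), hidden size, and 8-frame clip length are illustrative assumptions, not a Hailo reference design:

```python
import torch
import torch.nn as nn
import torchvision.models as models

class MotionClassifier(nn.Module):
    """Illustrative sketch: per-frame 2D CNN features + LSTM over time."""
    def __init__(self, hidden=128):
        super().__init__()
        backbone = models.mobilenet_v3_small(weights="DEFAULT")
        self.features = backbone.features        # 2D CNN applied to each frame
        self.pool = nn.AdaptiveAvgPool2d(1)
        # 576 = channel count of MobileNetV3-Small's final feature map
        self.lstm = nn.LSTM(input_size=576, hidden_size=hidden, batch_first=True)
        self.head = nn.Linear(hidden, 2)          # "moving" vs "stationary"

    def forward(self, clip):                      # clip: (N, T, C, H, W), T <= 10
        n, t, c, h, w = clip.shape
        x = self.features(clip.reshape(n * t, c, h, w))
        x = self.pool(x).flatten(1).reshape(n, t, -1)  # (N, T, 576)
        out, _ = self.lstm(x)                          # temporal modeling
        return self.head(out[:, -1])                   # logits from the last step
```

Since recurrent layers get unrolled at compile time (see the caveat below), keeping T small directly bounds how many layers the sequence adds.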

What’s Supported

Hailo supports both GRU and LSTM layers (PyTorch and ONNX only), including:

  • Forward LSTM/RNN (uses information from previous frames)
  • Bidirectional LSTM (uses both past and future context; brief example below)
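
In plain PyTorch terms, the two variants differ only in the `bidirectional` flag (sizes here are arbitrary):

```python
import torch
import torch.nn as nn

seq = torch.randn(1, 8, 64)                # (batch, time, features): an 8-frame clip

fwd = nn.LSTM(64, 32, batch_first=True)    # forward-only: sees past frames
out_f, _ = fwd(seq)                        # out_f: (1, 8, 32)

bi = nn.LSTM(64, 32, batch_first=True, bidirectional=True)  # past + future context
out_b, _ = bi(seq)                         # out_b: (1, 8, 64), directions concatenated
```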

The main caveat is that sequences are unrolled, so longer sequences add many layers and can degrade performance.

Alternative to Consider

You could also explore frame-difference methods combined with 2D CNNs as a simpler approach to capture motion without temporal layers.
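
For instance, here’s a minimal sketch, assuming grayscale 5-frame clips (all layer sizes are illustrative): absolute differences between consecutive frames become the input channels of a plain 2D CNN, so motion shows up spatially and no temporal layers are needed.

```python
import torch
import torch.nn as nn

class FrameDiffClassifier(nn.Module):
    """Illustrative sketch: frame differences as channels of a 2D CNN."""
    def __init__(self, num_diffs=4):                # 5 frames -> 4 differences
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(num_diffs, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, 2),                       # "moving" vs "stationary"
        )

    def forward(self, clip):                        # clip: (N, T, H, W), grayscale
        diffs = (clip[:, 1:] - clip[:, :-1]).abs()  # (N, T-1, H, W)
        return self.net(diffs)
```

The differencing could also be done on the host before inference, so the network you actually compile for Hailo is an ordinary 2D CNN.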

Hope this helps!