SegFault on hailo_configure_vdevice in C

I’m trying to build a Zig program that uses the HailoRT C headers, because Zig’s C interop is so good. I’ve basically copied the initialization code from the vstreams example. However, when I run the program I get a segmentation fault inside libhailort.so, and since no unwind information is available for the library I can’t see a stack trace to figure out why it failed. See the program output below:

[HailoRT] [info] OS Version: Linux 6.6.47+rpt-rpi-v8 #1 SMP PREEMPT Debian 1:6.6.47-1+rpt1 (2024-09-02) aarch64
[HailoRT] [info] firmware_version is: 4.18.0
[HailoRT] [info] VDevice Infos: 0000:01:00.0
[HailoRT] [info] Planned internal buffer memory: CMA memory 0, user memory 9014272. memory to edge layer usage factor is 0.72102547
[HailoRT] [info] Default Internal buffer planner executed successfully
[HailoRT] [info] Configuring HEF took 54.316362 milliseconds
[HailoRT] [info] Configuring HEF on VDevice took 65.092267 milliseconds
Segmentation fault at address 0x1021408
???:?:?: 0x7ff77852d8 in ??? (libhailort.so.4.18.0)
Unwind information for `libhailort.so.4.18.0:0x7ff77852d8` was not available, trace may be incomplete

...../src/main.zig:56:41: 0x1046227 in main (zig-gst)
    status = hlo.hailo_configure_vdevice(vdevice, hef, @ptrCast(&config_params), &network_group, @constCast(&network_group_size));

It looks like the device is initializing correctly, the hef file is fine (I also tested the hef file with hailortcli test or whatever the command for validating hef files is), and the device params are all defaults from hailo_init_configure_params_by_vdevice.
Any debugging tips or ideas?

I ran this program with valgrind, and I got 12.5k lines of output mostly looking like this:

==7332== 57,118,736 bytes in 1 blocks are still reachable in loss record 850 of 850
==7332==    at 0x4888864: operator new[](unsigned long, std::nothrow_t const&) (in /nix/store/84599gzafh8c0xayd0glm9r3dn899dpf-valgrind-3.23.0/libexec/valgrind/vgpreload_memcheck-arm64-linux.so)
==7332==    by 0x49D072B: hailort::HeapStorage::create(unsigned long) (in /usr/lib/libhailort.so.4.18.0)
==7332==    by 0x49D0B4F: hailort::BufferStorage::create(unsigned long, hailort::BufferStorageParams const&) (in /usr/lib/libhailort.so.4.18.0)
==7332==    by 0x49D386B: hailort::Buffer::create(unsigned long, hailort::BufferStorageParams const&) (in /usr/lib/libhailort.so.4.18.0)
==7332==    by 0x4C234E3: hailort::read_binary_file(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, hailort::BufferStorageParams const&) (in /usr/lib/libhailort.so.4.18.0)
==7332==    by 0x4ACC3F7: hailort::Hef::Impl::parse_hef_file(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) (in /usr/lib/libhailort.so.4.18.0)

with a summary of:

==7332== LEAK SUMMARY:
==7332==    definitely lost: 27 bytes in 1 blocks
==7332==    indirectly lost: 0 bytes in 0 blocks
==7332==      possibly lost: 59,138 bytes in 553 blocks
==7332==    still reachable: 58,352,226 bytes in 20,566 blocks
==7332==         suppressed: 0 bytes in 0 blocks
==7332==
==7332== For lists of detected and suppressed errors, rerun with: -s
==7332== ERROR SUMMARY: 34 errors from 34 contexts (suppressed: 0 from 0)

But those are just memory leaks, which may be related to Hailo-8L device memory, and there are so many reports that I doubt I would be able to figure out from them why it is segfaulting.

Hey @nicholas.young

Welcome to the Hailo Community!

It sounds like you’re facing a segmentation fault issue when using libhailort.so in your Zig program, likely during the configuration of the Hailo device. Based on the provided error logs, it seems that the device is initializing correctly, and the HEF file is properly validated, but the issue occurs after configuring the virtual device (VDevice). Here are a few debugging steps and ideas that could help pinpoint the root cause of the segmentation fault:

Steps to Debug and Fix the Segmentation Fault:

1. Check for Null Pointers and Memory Management Issues

Since the error points to memory management (Segmentation fault at address 0x1021408), it could be related to how Zig handles memory interop with C, especially when calling libhailort.so functions. You should verify that all pointers passed to the Hailo C API are properly initialized and allocated.

Specifically:

  • Check that the vdevice and hef handles are non-null, and that config_params and network_group_size are initialized, before you call hailo_configure_vdevice.
  • Use Zig’s std.debug.assert to check the handles. The address-of expressions &config_params and &network_group_size are non-optional pointers, so Zig rejects comparing them to null and they need no check:
    std.debug.assert(vdevice != null);
    std.debug.assert(hef != null);
    

2. Ensure Proper Memory Alignment

In C interop, improperly aligned structs or memory regions can cause segmentation faults. A variable declared with the struct type (e.g., config_params) already has the alignment the Hailo C API expects, so @alignCast only matters when the pointer was derived from a less-aligned one, such as a raw byte buffer.

Example (Zig 0.11 or later, where @alignCast takes the target alignment from the result type):

const config_params_ptr: *hlo.hailo_configure_params_t = @alignCast(&config_params);

3. Review the Hailo C API Documentation

Make sure the hailo_configure_vdevice function is used correctly and that the config_params structure is set up according to the latest HailoRT documentation. Even though you copied the initialization code from the vstreams example, there might be version-specific changes to the API.

Check:

  • Structure initialization: Ensure that hailo_init_configure_params_by_vdevice properly initializes config_params.
  • HEF file: You mentioned that the HEF file is valid, but it’s worth confirming that the HEF is compatible with the version of HailoRT you’re using (4.18.0).

4. Increase Logging Level for libhailort

Enable more verbose logging for libhailort.so to help trace the exact point where the segmentation fault occurs. This might give you a better idea of what happens right before the crash.

To increase logging, you can usually set environment variables. For example:

export HAILO_LOG_LEVEL=debug

Then rerun your Zig program and check if more detailed logs point to where the issue originates.

5. Valgrind Insights

The Valgrind output you provided points to memory leaks but doesn’t clearly show the root cause of the segmentation fault. Memory leaks in libhailort.so might not directly lead to the segmentation fault. However, you should focus on:

  • The definitely lost block of 27 bytes: Although small, this might be worth investigating, especially if it’s related to device memory allocation.
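  • Use Valgrind’s --track-origins=yes option to see where any uninitialised values that Memcheck reports were created. Invalid reads and writes are reported by Memcheck by default, so look for those in the output as well: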
    valgrind --track-origins=yes ./your_program
    

6. Zig’s Error Handling

Ensure that your Zig code is properly handling errors, especially when dealing with external C libraries. For instance, make sure you’re checking the return status of all libhailort.so calls:

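const status = hlo.hailo_configure_vdevice(vdevice, hef, @ptrCast(&config_params), &network_group, @constCast(&network_group_size));
if (status != hlo.HAILO_SUCCESS) {
    // hailo_status is a C enum, so print it as an integer.
    std.debug.print("hailo_configure_vdevice failed with status {d}\n", .{status});
    return; // or propagate an error if the enclosing function returns one
}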
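
The same check-every-call pattern can be sketched in C, the language of the HailoRT examples. Everything named here (demo_status, DEMO_SUCCESS, CHECK, fake_call, run_sequence) is a hypothetical stand-in rather than HailoRT API. The sketch only shows how an early return on the first failing status keeps later calls from running on a half-initialized state.

```c
#include <assert.h>
#include <stdio.h>

typedef int demo_status;              /* hypothetical stand-in for hailo_status */
enum { DEMO_SUCCESS = 0 };

/* Evaluate a call once; on failure, report the location and return the status. */
#define CHECK(call)                                                  \
    do {                                                             \
        demo_status s_ = (call);                                     \
        if (s_ != DEMO_SUCCESS) {                                    \
            fprintf(stderr, "%s:%d: %s failed with status %d\n",     \
                    __FILE__, __LINE__, #call, s_);                  \
            return s_;                                               \
        }                                                            \
    } while (0)

/* Placeholder for an API call: counts invocations and returns `result`. */
static demo_status fake_call(demo_status result, int *calls)
{
    assert(calls != NULL);
    ++*calls;
    return result;
}

/* Three calls in sequence; the third is skipped if the second fails. */
static demo_status run_sequence(demo_status second_result, int *calls)
{
    CHECK(fake_call(DEMO_SUCCESS, calls));
    CHECK(fake_call(second_result, calls));
    CHECK(fake_call(DEMO_SUCCESS, calls));
    return DEMO_SUCCESS;
}
```

A status check like this catches reported errors only. It cannot catch a crash inside the library, which is why a segfault can still occur with every status checked.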

7. Check for Compatibility Between HailoRT and Your Zig Program

Ensure that your HailoRT version is compatible with your device and the Linux kernel you’re using (6.6.47+rpt-rpi-v8). Segmentation faults can occur if there’s a mismatch in versions or if HailoRT depends on specific kernel features.

  • Test with a Different Kernel Version: If possible, try running your program on a slightly older or more stable kernel (e.g., 5.x series), as the 6.x series might introduce some incompatibilities with the Hailo drivers.
  • HailoRT version: Ensure you’re using the correct version of libhailort.so (4.18.0 in your case) that is compatible with the firmware (4.18.0), which seems correct based on your logs.

8. Segmentation Fault Debugging with GDB

You can use GDB to get more insights into the segmentation fault. Run the program with GDB attached and check the stack trace:

gdb ./your_program
run

When the segmentation fault occurs, use backtrace to see the exact point where the crash happened:

backtrace

This might provide additional clues about the failure inside libhailort.so.

9. Check HailoRT Examples for Version Compatibility

Since you mentioned copying the code from the vstreams example, make sure that the example code is compatible with HailoRT 4.18.0. Sometimes, examples from earlier versions of HailoRT may not work with the latest version due to changes in the API.

Try running the unmodified vstreams example provided by Hailo to see if the problem persists.

Conclusion:

  • Check for null pointers and ensure proper memory initialization and alignment in your Zig program.
  • Verify that your pointers and memory structures passed to hailo_configure_vdevice are correct.
  • Enable verbose logging to trace the issue in more detail.
  • Use GDB and Valgrind to further debug the segmentation fault and check memory-related issues.
  • Ensure that HailoRT 4.18.0 and the Linux kernel version are compatible with your device.

These steps should help you identify and resolve the root cause of the segmentation fault. Let me know if any of these debugging steps help, or if you need more detailed guidance on any specific part!

Best Regards,
Omri

  1. I was unable to use asserts as in the code given, however when I printed out the contents of all of the arguments of the erroring hailo function, they all had values except for network_group which was null as expected:
vDevice: cimport.struct__hailo_vdevice@1126e50
hef: cimport.struct__hailo_hef@11271f0
network_group_size: 1
network_group: null
config_params: cimport.hailo_configure_params_t{ .network_group_params_count = 1, .network_group_params = {
    cimport.hailo_configure_network_group_params_t{ .name = { ... }, .batch_size = 0, .power_mode = 0, .latency = 0, .stream_params_by_name_count = 4, .stream_params_by_name = { ... }, .network_params_by_name_count = 1, .network_params_by_name = { ... } },
    cimport.hailo_configure_network_group_params_t{ .name = { ... }, .batch_size = 43690, .power_mode = 2863311530, .latency = 2863311530, .stream_params_by_name_count = 12297829382473034410, .stream_params_by_name = { ... }, .network_params_by_name_count = 12297829382473034410, .network_params_by_name = { ... } },
    cimport.hailo_configure_network_group_params_t{ .name = { ... }, .batch_size = 43690, .power_mode = 2863311530, .latency = 2863311530, .stream_params_by_name_count = 12297829382473034410, .stream_params_by_name = { ... }, .network_params_by_name_count = 12297829382473034410, .network_params_by_name = { ... } },
    cimport.hailo_configure_network_group_params_t{ .name = { ... }, .batch_size = 43690, .power_mode = 2863311530, .latency = 2863311530, .stream_params_by_name_count = 12297829382473034410, .stream_params_by_name = { ... }, .network_params_by_name_count = 12297829382473034410, .network_params_by_name = { ... } },
    cimport.hailo_configure_network_group_params_t{ .name = { ... }, .batch_size = 43690, .power_mode = 2863311530, .latency = 2863311530, .stream_params_by_name_count = 12297829382473034410, .stream_params_by_name = { ... }, .network_params_by_name_count = 12297829382473034410, .network_params_by_name = { ... } },
    cimport.hailo_configure_network_group_params_t{ .name = { ... }, .batch_size = 43690, .power_mode = 2863311530, .latency = 2863311530, .stream_params_by_name_count = 12297829382473034410, .stream_params_by_name = { ... }, .network_params_by_name_count = 12297829382473034410, .network_params_by_name = { ... } },
    cimport.hailo_configure_network_group_params_t{ .name = { ... }, .batch_size = 43690, .power_mode = 2863311530, .latency = 2863311530, .stream_params_by_name_count = 12297829382473034410, .stream_params_by_name = { ... }, .network_params_by_name_count = 12297829382473034410, .network_params_by_name = { ... } },
    cimport.hailo_configure_network_group_params_t{ .name = { ... }, .batch_size = 43690, .power_mode = 2863311530, .latency = 2863311530, .stream_params_by_name_count = 12297829382473034410, .stream_params_by_name = { ... }, .network_params_by_name_count = 12297829382473034410, .network_params_by_name = { ... } }
} }
  2. I tried performing an @alignCast on the config_params that were passed into the function, however that did not change any of the output or logs.

  3. I looked at the API docs, and everything I saw matched what I had written.

  4. I tried exporting the env var you specified, but it did not change the level of output. I checked the HailoRT environment vars, and it said that the most verbose setting is to set HAILORT_CONSOLE_LOGGER_LEVEL to info, which it already was set to in the output I gave above.

  5. When I inspected the valgrind output as you suggested, the 27 lost bytes were in this context:

==7332== 27 bytes in 1 blocks are definitely lost in loss record 209 of 850
==7332==    at 0x4885CDC: malloc (in /nix/store/84599gzafh8c0xayd0glm9r3dn899dpf-valgrind-3.23.0/libexec/valgrind/vgpreload_memcheck-arm64-linux.so)
==7332==    by 0x4C1CB9F: hailort::TempFile::create(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) (in /usr/lib/libhailort.so.4.18.0)
==7332==    by 0x49C8553: hailort::MonitorHandler::open_temp_mon_file() (in /usr/lib/libhailort.so.4.18.0)
==7332==    by 0x49C8B33: hailort::MonitorHandler::start_mon(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) (in /usr/lib/libhailort.so.4.18.0)
==7332==    by 0x4A056DB: hailort::VDeviceBase::create(hailo_vdevice_params_t const&) (in /usr/lib/libhailort.so.4.18.0)
==7332==    by 0x4A0B2A3: hailort::VDeviceHandle::create(hailo_vdevice_params_t const&) (in /usr/lib/libhailort.so.4.18.0)
==7332==    by 0x4A0C4DF: hailort::VDevice::create(hailo_vdevice_params_t const&) (in /usr/lib/libhailort.so.4.18.0)
==7332==    by 0x4A0C6A7: hailort::VDevice::create() (in /usr/lib/libhailort.so.4.18.0)
==7332==    by 0x4985FDB: hailo_create_vdevice (in /usr/lib/libhailort.so.4.18.0)
==7332==    by 0x104619F: main.main (main.zig:43)
==7332==    by 0x1045873: callMain (start.zig:612)
==7332==    by 0x1045873: callMainWithArgs (start.zig:581)
==7332==    by 0x1045873: main (start.zig:596)
  6. I am performing an assert that status == hlo.HAILO_SUCCESS after every call, and I am not tripping any of the asserts during runtime.

  7. According to everything that I could find, it looks like I am using the correct firmware and kernel versions.
    I ran hailortcli fw-control identify and got this output:

[HailoRT] [info] OS Version: Linux 6.6.47+rpt-rpi-v8 #1 SMP PREEMPT Debian 1:6.6.47-1+rpt1 (2024-09-02) aarch64
[HailoRT] [info] firmware_version is: 4.18.0
Executing on device: 0000:01:00.0
[HailoRT] [info] firmware_version is: 4.18.0
Identifying board
Control Protocol Version: 2
Firmware Version: 4.18.0 (release,app,extended context switch buffer)
Logger Version: 0
Board Name: Hailo-8
Device Architecture: HAILO8L

I also ran hailortcli parse-hef on the file I downloaded from hailo.ai:

Architecture HEF was compiled for: HAILO8L
Network group name: yolov7, Multi Context - Number of contexts: 7
    Network name: yolov7/yolov7
        VStream infos:
            Input  yolov7/input_layer1 UINT8, NHWC(640x640x3)
            Output yolov7/yolov5_nms_postprocess FLOAT32, HAILO NMS(number of classes: 80, maximum bounding boxes per class: 80, maximum frame size: 128320)
            Operation:
  8. When I tried to use gdb, I was unable to get a stack trace from the libhailort.so because it was compiled without stack traces and debug information.

  9. I looked through the diff on the example (which had been updated for 4.19) and there was no change between the 4.18 example and the 4.19 example code, and it should have worked fine.
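
A note on the config_params dump under item 1: the values 43690, 2863311530 and 12297829382473034410 are 0xAAAA, 0xAAAAAAAA and 0xAAAAAAAAAAAAAAAA. Zig writes 0xAA bytes over memory declared as undefined in Debug builds, so, assuming config_params was declared undefined as in the vstreams-style setup, the entries after the first are most likely just untouched array slots rather than corruption. With network_group_params_count = 1, only the first entry should be meaningful. The helper names in this small C sketch are hypothetical. It reproduces the three numbers and checks whether a field still holds the fill pattern.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Value of an unsigned integer of `width` bytes whose bytes are all 0xAA. */
static uint64_t fill_value(size_t width)
{
    assert(width <= sizeof(uint64_t));
    uint64_t v = 0;
    for (size_t i = 0; i < width; ++i) {
        v = (v << 8) | 0xAA;
    }
    return v;
}

/* Returns 1 when every byte of the object at p is still the 0xAA fill byte. */
static int is_debug_fill(const void *p, size_t len)
{
    assert(p != NULL);
    const unsigned char *bytes = p;
    for (size_t i = 0; i < len; ++i) {
        if (bytes[i] != 0xAA) {
            return 0;
        }
    }
    return 1;
}
```

Here fill_value(2), fill_value(4) and fill_value(8) give the batch_size, power_mode/latency and *_count values seen in the dump.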

Is there a recommended way to compile the hailo runtime library from source so that I could compile it with debug flags to get the whole stack trace from gdb on the segfaults?

I built the library from source, and the Seg-faulting code in the library is:

Segmentation fault at address 0x10215c8
/home/nixolas/Documents/hailort/hailort/libhailort/src/hailort.cpp:2408:31: 0x7ff773d158 in hailo_configure_vdevice (/home/nixolas/Documents/hailort/hailort/libhailort/src/hailort.cpp)
    *number_of_network_groups = added_net_groups.value().size();
                              ^
/home/nixolas/Documents/HailoImageProcessor/src/main.zig:69:41: 0x1047f1f in main (zig-gst)
    status = hlo.hailo_configure_vdevice(vdevice, hef, &config_params, &network_group, @constCast(&network_group_size));

I’ll debug this more tomorrow, but it appears that there is an input type/memory format mis-match caused by Zig.

I have passed the configure vdevice function successfully, and I am now failing on the init_input_vstreams function. Once I get a basic program up, I will post my code for future googlers.

Huzza! I have started and stopped the device with an HEF file from zig without segfaults! I have attached my successful code that emulates most of the vstreams C example init function (no actual inference happens yet. Just initialization).

I don’t know what exactly it was that made this work, but I think getting rid of a lot of the @constCast and having all of the variables actually be variables helped.
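
One plausible explanation, consistent with the crash line in hailort.cpp (*number_of_network_groups = ...): the count argument is an in/out parameter that the library writes back through. If the count is a Zig const, the compiler may place it in read-only memory, and @constCast only removes const from the pointer type without making the storage writable, so the store can fault. The C sketch below shows the same shape. fill_groups and configure_with_mutable_count are hypothetical stand-ins, not HailoRT functions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in with an in/out count: on entry *count is the capacity
 * of out[], on return it is the number of entries actually written. */
static int fill_groups(int *out, size_t *count)
{
    assert(out != NULL && count != NULL);
    const size_t available = 1;   /* pretend one network group was configured */
    if (*count < available) {
        return -1;                /* caller's array is too small */
    }
    out[0] = 42;                  /* placeholder handle value */
    *count = available;           /* this store is why *count must be writable */
    return 0;
}

/* Correct caller: the count lives in an ordinary mutable variable. */
static size_t configure_with_mutable_count(void)
{
    int groups[4] = {0};
    size_t count = sizeof groups / sizeof groups[0];
    int rc = fill_groups(groups, &count);
    return rc == 0 ? count : 0;
}

/* What to avoid (kept as a comment because it is undefined behaviour):
 *
 *     static const size_t count = 4;
 *     fill_groups(groups, (size_t *)&count);   // the store may fault
 */
```

The same reasoning applies to any output array or handle passed to the API: it has to be declared as a variable so the library can write into it.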


const std = @import("std");
const assert = std.debug.assert; // the listing calls assert() unqualified
const hlo = @cImport({
    @cInclude("hailort.h");
});

const hef_file = "yolov7.hef";
const max_edge_layers = 32;

pub fn main() void {
    std.fs.cwd().access(hef_file, .{ }) catch |e| {
        std.debug.panic("Could not open hef file! '{any}'", .{ e });
    };

    var status: hlo.hailo_status = undefined;
    var vdevice      : hlo.hailo_vdevice = undefined;
    var hef          : hlo.hailo_hef = undefined;
    var config_params: hlo.hailo_configure_params_t = undefined;
    var network_group: hlo.hailo_configured_network_group = undefined;
    var network_group_size     : usize = 1;
    var input_vstream_params   : [max_edge_layers]hlo.hailo_input_vstream_params_by_name_t = undefined;
    var output_vstream_params  : [max_edge_layers]hlo.hailo_output_vstream_params_by_name_t = undefined;
    var output_vstreams        : [max_edge_layers]hlo.hailo_output_vstream = undefined;
    var input_vstreams         : [max_edge_layers]hlo.hailo_input_vstream = undefined;
    var input_vstream_size       : usize = max_edge_layers;
    var output_vstream_size      : usize = max_edge_layers;

    status = hlo.hailo_create_vdevice(null, &vdevice);
    assert(status == hlo.HAILO_SUCCESS);

    status = hlo.hailo_create_hef_file(&hef, hef_file);
    assert(status == hlo.HAILO_SUCCESS);

    std.debug.print("vdevice inited, hef file created!\n", .{});
    
    status = hlo.hailo_init_configure_params_by_vdevice(hef, vdevice, &config_params);
    assert(status == hlo.HAILO_SUCCESS);

    std.debug.print("Init configure params complete\n", .{});

    assert(vdevice != null);
    assert(hef != null);
    assert(network_group_size != 0);

    status = hlo.hailo_configure_vdevice(vdevice, hef, &config_params, &network_group, &network_group_size);
    assert(status == hlo.HAILO_SUCCESS);

    std.debug.print("Configure vdevice complete!\n", .{});

    status = hlo.hailo_make_input_vstream_params(network_group, false, hlo.HAILO_FORMAT_TYPE_AUTO,
        &input_vstream_params, &input_vstream_size);

    assert(status == hlo.HAILO_SUCCESS);

    std.debug.print("Input vstream params initialized\n", .{});

    status = hlo.hailo_make_output_vstream_params(network_group, true, hlo.HAILO_FORMAT_TYPE_AUTO,
        &output_vstream_params, &output_vstream_size);
    assert(status == hlo.HAILO_SUCCESS);

    std.debug.print("Output vstream params initialized\n", .{});

    assert(input_vstream_size <= max_edge_layers);

    status = hlo.hailo_create_input_vstreams(network_group, &input_vstream_params, input_vstream_size, &input_vstreams);
    assert(status == hlo.HAILO_SUCCESS);

    std.debug.print("Input vstreams initialized\n", .{});

    status = hlo.hailo_create_output_vstreams(network_group, &output_vstream_params, output_vstream_size, &output_vstreams);
    assert(status == hlo.HAILO_SUCCESS);

    std.debug.print("Output vstreams initialized\n", .{});

    _ = hlo.hailo_release_output_vstreams(&output_vstreams, output_vstream_size);
    _ = hlo.hailo_release_input_vstreams(&input_vstreams, input_vstream_size);
    _ = hlo.hailo_release_hef(hef);
    _ = hlo.hailo_release_vdevice(vdevice);
}

I hope this helps someone else and saves them some pain!

I actually ran a test, and it was the @constCast(&input_vstream_params) and the like that caused a lot of my seg faults. Good to know!

Great job!

Thank you for sharing this. If you need any further assistance, please don’t hesitate to ask.