Skip to content

Error handling improvements #159

@javierhonduco

Description

@javierhonduco

👋 . Been using libbpfgo recently and found some problems with the way errors are being handled. Would love to hear your opinion and comments on this subject 😄 .

Context

While helping @kakkoyun debug #147 to make our map retrieval and deletion path more efficient, we were puzzled as the kernel did not seem to return -EPERM in the code path we were executing.

This can be seen with this bpftrace script:

$ sudo ./bpfsnoop_delete_batch.sh `pidof parca-agent`
Attaching 4 probes...

return from BPF syscall
        syscall+40
        0xfffffffff000
        runtime.asmcgocall.abi0+124
 with -2
return from libbpf
        _cgo_3f570fc81fe2_C2func_bpf_map_lookup_and_delete_batch+1196
        runtime.asmcgocall.abi0+124
 with -1

As we can see here, the "raw" (without any massaging of the return code by libc) BPF syscall is returning -2 (ENOENT), while libbpf is seeing -1 (EPERM). Why is this happening?

libc's syscall(2), which libbpf calls here changes the return code to -1 on error, and sets errno to the actual return value of the "raw" syscall.

The syscall(2) manpage:

The return value is defined by the system call being invoked. In
general, a 0 return value indicates success. A -1 return value
indicates an error, and an error number is stored in errno.

🤔 So it seems we thought we were getting EPERM when in reality, it was "just" -1 indicating that the function failed (*) and that we had to check errno to see what the actual error is. Interpreting the return value as errno is not correct here.

As far as I understand, cgo calls have two return values, the return code, and errno (https://pkg.go.dev/cmd/cgo). In several of libbpfgo's calls to libbpf, only one return value is checked, which if I am not mistaken, it would be the return code of the function.

How this is affecting libbpfs APIs right now

Besides the batch APIs not returning the data correctly, even though we might have processed it just fine, this is also affecting some other APIs.

Let's take a look at this fragment extracted from our codebase. We want to delete keys from a map, but ignore if the key is non-existent. After taking a look at the bpf manpage we saw that ENOENT means exactly this, so let's ignore it and fail otherwise.

err := m.stacks.DeleteKey(unsafe.Pointer(&stackId))
if err != nil {
  if !errors.Is(err, syscall.ENOENT) {
    return fmt.Errorf("failed to delete stack trace: %w", err)
  }
}

Unfortunately, this code isn't correct and will return with an error, not just when the key doesn't exist. As DeleteKey is just returning one argument, the return value is -1 which indicates a generic failure. We are interpreting it as if it were an errno value (EPERM), which is not correct.

By reading errno and passing it to the callee, we would know which error this is and be able to handle it accordingly. A possible fix for DeleteKey could be:

func (b *BPFMap) DeleteKey(key unsafe.Pointer) error {
	ret, errC := C.bpf_map_delete_elem(b.fd, key)
	if ret != -1 {
		return fmt.Errorf("failed to get lookup key %d from map %s: %w", key, b.name, errC)
	}
	return nil
}

The road to Libbpf 1.0

Another possibility to fix this is to turn on LIBBPF_STRICT_DIRECT_ERRS (https://github.com/libbpf/libbpf/blob/b69f8ee93ef6aa3518f8fbfd9d1df6c2c84fd08f/src/libbpf_internal.h#L492) with libbpf_set_strict_mode as described in https://github.com/libbpf/libbpf/wiki/Libbpf-1.0-migration-guide.

While this might fix the problems we are experiencing, perhaps it would be a bit of a big change that could potentially break things, so careful testing should be done. For example, if we are currently checking for -1 for errors, we should check for negative values instead.

At the same time, at some point, the v1.0 migration will happen, so maybe it could be a good excuse to start migrating to the new behavior?

Related issues / PRs

Thanks!! Looking forward to your thoughts!

cc/ @kakkoyun @v-thakkar @Sylfrena

(*) There's one exception that I am aware of, batch APIs, which might return -1, but depending on the errno value we might have achieved a partial success. More context on #157.

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't working

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions