Error handling improvements

👋 . Been using libbpfgo recently and found some problems with the way errors are being handled. Would love to hear your opinion and comments on this subject 😄 .

## Context

While helping @kakkoyun debug https://github.com/aquasecurity/libbpfgo/issues/147 to make our map retrieval and deletion path more efficient, we were puzzled as the kernel did not seem to return `-EPERM` in the code path we were executing.

This can be seen with this [bpftrace script](https://github.com/javierhonduco/bpf-playground/blob/master/bpfsnoop_delete_batch.sh):

```
$ sudo ./bpfsnoop_delete_batch.sh `pidof parca-agent`
Attaching 4 probes...

return from BPF syscall
        syscall+40
        0xfffffffff000
        runtime.asmcgocall.abi0+124
 with -2
return from libbpf
        _cgo_3f570fc81fe2_C2func_bpf_map_lookup_and_delete_batch+1196
        runtime.asmcgocall.abi0+124
 with -1
```

As we can see here, the "raw" (without any massaging of the return code by libc) BPF syscall is returning `-2` (`ENOENT`), while libbpf is seeing `-1` (`EPERM`). Why is this happening?

libc's `syscall(2)`, which libbpf calls [here](https://github.com/libbpf/libbpf/blob/3a4e26307d0f9b227e3ebd28b443a1a715e4e17d/src/bpf.c#L75) changes the return code to `-1` on error, and sets `errno` to the actual return value of the "raw" syscall.

The [`syscall(2)`](https://man7.org/linux/man-pages/man2/syscall.2.html) manpage:
> The return value is defined by the system call being invoked.  In
>        general, a 0 return value indicates success.  A -1 return value
>       indicates an error, and an error number is stored in [errno](https://man7.org/linux/man-pages/man3/errno.3.html).

🤔  So it seems we thought we were getting `EPERM` when in reality, it was "just" `-1` indicating that the function failed (*) and that we had to check `errno` to see what the actual error is. Interpreting the return value as `errno` is not correct here.

As far as I understand, cgo calls have two return values, the return code, and `errno` (https://pkg.go.dev/cmd/cgo). In several of libbpfgo's calls to libbpf, only one return value is checked, which if I am not mistaken, it would be the return code of the function.

## How this is affecting libbpfs APIs right now
Besides the batch APIs not returning the data correctly, even though we might have processed it just fine, this is also affecting some other APIs.

Let's take a look at this fragment extracted from our codebase. We want to delete keys from a map, but ignore if the key is non-existent. After taking a look at the [bpf manpage](https://man7.org/linux/man-pages/man2/bpf.2.html) we saw that `ENOENT` means exactly this, so let's ignore it and fail otherwise.

```go
err := m.stacks.DeleteKey(unsafe.Pointer(&stackId))
if err != nil {
  if !errors.Is(err, syscall.ENOENT) {
    return fmt.Errorf("failed to delete stack trace: %w", err)
  }
}
```

Unfortunately, this code isn't correct and will return with an error, not just when the key doesn't exist. As `DeleteKey` is just returning one argument, the return value is `-1` which indicates a generic failure. We are interpreting it as if it were an `errno` value (`EPERM`), which is not correct. 

By reading `errno` and passing it to the callee, we would know which error this is and be able to handle it accordingly. A possible fix for `DeleteKey` could be:
```go
func (b *BPFMap) DeleteKey(key unsafe.Pointer) error {
	ret, errC := C.bpf_map_delete_elem(b.fd, key)
	if ret != -1 {
		return fmt.Errorf("failed to get lookup key %d from map %s: %w", key, b.name, errC)
	}
	return nil
}
```

## The road to Libbpf 1.0

Another possibility to fix this is to turn on `LIBBPF_STRICT_DIRECT_ERRS` (https://github.com/libbpf/libbpf/blob/b69f8ee93ef6aa3518f8fbfd9d1df6c2c84fd08f/src/libbpf_internal.h#L492) with `libbpf_set_strict_mode` as described in https://github.com/libbpf/libbpf/wiki/Libbpf-1.0-migration-guide.

While this might fix the problems we are experiencing, perhaps it would be a bit of a big change that could potentially break things, so careful testing should be done. For example, if we are currently checking for -1 for errors, we should check for negative values instead.

At the same time, at some point, the v1.0 migration will happen, so maybe it could be a good excuse to start migrating to the new behavior? 

## Related issues / PRs
- https://github.com/aquasecurity/libbpfgo/issues/147 
- https://github.com/aquasecurity/libbpfgo/pull/101
- https://github.com/aquasecurity/libbpfgo/pull/112
- https://github.com/aquasecurity/libbpfgo/pull/100


Thanks!! Looking forward to your thoughts! 

cc/ @kakkoyun @v-thakkar @Sylfrena

(*) There's one exception that I am aware of, batch APIs, which might return `-1`, but depending on the `errno` value we might have achieved a partial success. More context on https://github.com/aquasecurity/libbpfgo/pull/157.


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Error handling improvements #159

Context

How this is affecting libbpfs APIs right now

The road to Libbpf 1.0

Related issues / PRs

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Error handling improvements #159

Description

Context

How this is affecting libbpfs APIs right now

The road to Libbpf 1.0

Related issues / PRs

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions