Donate to e Foundation | Murena handsets with /e/OS | Own a part of Murena! Learn more

Commit dfee07cc authored by Alexei Starovoitov's avatar Alexei Starovoitov Committed by David S. Miller
Browse files

net: filter: doc: expand and improve BPF documentation



In particular, this patch tries to clarify internal BPF calling
convention and adds internal BPF examples, JIT guide, use cases.

Signed-off-by: default avatarAlexei Starovoitov <ast@plumgrid.com>
Signed-off-by: default avatarDaniel Borkmann <dborkman@redhat.com>
Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
parent 24901551
Loading
Loading
Loading
Loading
+154 −4
Original line number Diff line number Diff line
@@ -613,7 +613,7 @@ Some core changes of the new internal format:

  Therefore, BPF calling convention is defined as:

    * R0	- return value from in-kernel function
    * R0	- return value from in-kernel function, and exit value for BPF program
    * R1 - R5	- arguments from BPF program to in-kernel function
    * R6 - R9	- callee saved registers that in-kernel function will preserve
    * R10	- read-only frame pointer to access stack
@@ -659,9 +659,140 @@ Some core changes of the new internal format:
- Introduces bpf_call insn and register passing convention for zero overhead
  calls from/to other kernel functions:

  After a kernel function call, R1 - R5 are reset to unreadable and R0 has a
  return type of the function. Since R6 - R9 are callee saved, their state is
  preserved across the call.
  Before an in-kernel function call, the internal BPF program needs to
  place function arguments into R1 to R5 registers to satisfy calling
  convention, then the interpreter will take them from registers and pass
  to in-kernel function. If R1 - R5 registers are mapped to CPU registers
  that are used for argument passing on given architecture, the JIT compiler
  doesn't need to emit extra moves. Function arguments will be in the correct
  registers and BPF_CALL instruction will be JITed as single 'call' HW
  instruction. This calling convention was picked to cover common call
  situations without performance penalty.

  After an in-kernel function call, R1 - R5 are reset to unreadable and R0 has
  a return value of the function. Since R6 - R9 are callee saved, their state
  is preserved across the call.

  For example, consider three C functions:

  u64 f1() { return (*_f2)(1); }
  u64 f2(u64 a) { return f3(a + 1, a); }
  u64 f3(u64 a, u64 b) { return a - b; }

  GCC can compile f1, f3 into x86_64:

  f1:
    movl $1, %edi
    movq _f2(%rip), %rax
    jmp  *%rax
  f3:
    movq %rdi, %rax
    subq %rsi, %rax
    ret

  Function f2 in BPF may look like:

  f2:
    bpf_mov R2, R1
    bpf_add R1, 1
    bpf_call f3
    bpf_exit

  If f2 is JITed and the pointer stored to '_f2'. The calls f1 -> f2 -> f3 and
  returns will be seamless. Without JIT, __sk_run_filter() interpreter needs to
  be used to call into f2.

  For practical reasons all BPF programs have only one argument 'ctx' which is
  already placed into R1 (e.g. on __sk_run_filter() startup) and the programs
  can call kernel functions with up to 5 arguments. Calls with 6 or more arguments
  are currently not supported, but these restrictions can be lifted if necessary
  in the future.

  On 64-bit architectures all register map to HW registers one to one. For
  example, x86_64 JIT compiler can map them as ...

    R0 - rax
    R1 - rdi
    R2 - rsi
    R3 - rdx
    R4 - rcx
    R5 - r8
    R6 - rbx
    R7 - r13
    R8 - r14
    R9 - r15
    R10 - rbp

  ... since x86_64 ABI mandates rdi, rsi, rdx, rcx, r8, r9 for argument passing
  and rbx, r12 - r15 are callee saved.

  Then the following internal BPF pseudo-program:

    bpf_mov R6, R1 /* save ctx */
    bpf_mov R2, 2
    bpf_mov R3, 3
    bpf_mov R4, 4
    bpf_mov R5, 5
    bpf_call foo
    bpf_mov R7, R0 /* save foo() return value */
    bpf_mov R1, R6 /* restore ctx for next call */
    bpf_mov R2, 6
    bpf_mov R3, 7
    bpf_mov R4, 8
    bpf_mov R5, 9
    bpf_call bar
    bpf_add R0, R7
    bpf_exit

  After JIT to x86_64 may look like:

    push %rbp
    mov %rsp,%rbp
    sub $0x228,%rsp
    mov %rbx,-0x228(%rbp)
    mov %r13,-0x220(%rbp)
    mov %rdi,%rbx
    mov $0x2,%esi
    mov $0x3,%edx
    mov $0x4,%ecx
    mov $0x5,%r8d
    callq foo
    mov %rax,%r13
    mov %rbx,%rdi
    mov $0x2,%esi
    mov $0x3,%edx
    mov $0x4,%ecx
    mov $0x5,%r8d
    callq bar
    add %r13,%rax
    mov -0x228(%rbp),%rbx
    mov -0x220(%rbp),%r13
    leaveq
    retq

  Which is in this example equivalent in C to:

    u64 bpf_filter(u64 ctx)
    {
        return foo(ctx, 2, 3, 4, 5) + bar(ctx, 6, 7, 8, 9);
    }

  In-kernel functions foo() and bar() with prototype: u64 (*)(u64 arg1, u64
  arg2, u64 arg3, u64 arg4, u64 arg5); will receive arguments in proper
  registers and place their return value into '%rax' which is R0 in BPF.
  Prologue and epilogue are emitted by JIT and are implicit in the
  interpreter. R0-R5 are scratch registers, so BPF program needs to preserve
  them across the calls as defined by calling convention.

  For example the following program is invalid:

    bpf_mov R1, 1
    bpf_call foo
    bpf_mov R0, R1
    bpf_exit

  After the call the registers R1-R5 contain junk values and cannot be read.
  In the future a BPF verifier can be used to validate internal BPF programs.

Also in the new design, BPF is limited to 4096 insns, which means that any
program will terminate quickly and will only call a fixed number of kernel
@@ -676,6 +807,25 @@ A program, that is translated internally consists of the following elements:

  op:16, jt:8, jf:8, k:32    ==>    op:8, a_reg:4, x_reg:4, off:16, imm:32

So far 87 internal BPF instructions were implemented. 8-bit 'op' opcode field
has room for new instructions. Some of them may use 16/24/32 byte encoding. New
instructions must be multiple of 8 bytes to preserve backward compatibility.

Internal BPF is a general purpose RISC instruction set. Not every register and
every instruction are used during translation from original BPF to new format.
For example, socket filters are not using 'exclusive add' instruction, but
tracing filters may do to maintain counters of events, for example. Register R9
is not used by socket filters either, but more complex filters may be running
out of registers and would have to resort to spill/fill to stack.

Internal BPF can used as generic assembler for last step performance
optimizations, socket filters and seccomp are using it as assembler. Tracing
filters may use it as assembler to generate code from kernel. In kernel usage
may not be bounded by security considerations, since generated internal BPF code
may be optimizing internal code path and not being exposed to the user space.
Safety of internal BPF can come from a verifier (TBD). In such use cases as
described, it may be used as safe instruction set.

Just like the original BPF, the new format runs within a controlled environment,
is deterministic and the kernel can easily prove that. The safety of the program
can be determined in two steps: first step does depth-first-search to disallow