The Go low-level calling convention on x86-64 (updated)
=======================================================

What's new in 2020 and in Go 1.15
+++++++++++++++++++++++++++++++++

:Author: Raphael ‘kena’ Poss
:Date: November 2020
:modified: 2020-12-01
:subtitle: What's new in 2020 and in Go 1.15
:slug: go-calling-convention-x86-64-2020
:category: Programming
:tags: golang, compilers, analysis, programming languages, c++
:series: Go low-level code analysis

.. raw:: latex

   \clearpage

.. contents::

.. raw:: latex

   \clearpage

.. note::
   The latest version of this document can be found online at
   https://dr-knz.net/go-calling-convention-x86-64-2020.html.
   Alternate formats:
   `Source <https://dr-knz.net/go-calling-convention-x86-64-2020.txt>`_,
   `PDF <https://dr-knz.net/go-calling-convention-x86-64-2020.pdf>`_.

Introduction
------------

Two years ago, `this article
<{filename}go-calling-convention-x86-64.rst>`_ reviewed the low-level
code generation of the Go compiler, as of version 1.10. A few things
have changed since, and so an update is in order.

As previously, all tests below are performed using the
``freebsd/amd64`` target; this time using Go 1.15.5.  The assembly
listings are produced with ``go tool objdump`` and use the `Go
assembler syntax`__.

.. __: https://golang.org/doc/asm

Calling convention
------------------

Arguments and return value
~~~~~~~~~~~~~~~~~~~~~~~~~~

The mechanisms for passing arguments and return values remain largely
unchanged since Go 1.10: they are passed via memory, on the stack.

Call sequence: how a function gets called
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

As in Go 1.10, in 1.15 a function places arguments for its callee into
its activation record, and makes space for the return value there as
well. The callee writes the return value into the caller's activation
record.

As before, a side effect of this design is that when a function returns the
same value as one of its callees, it needs to read the callee's return
value from its own activation record, then copy it into the return
value slot in its caller's activation record. `Tail call optimizations (TCO)
<https://en.wikipedia.org/wiki/Tail_call>`_ thus remain impossible.

Additionally, function prologues remain largely unchanged:

- a function that uses local variables needs to set up an activation
  record by adjusting the SP register, and always does this in
  its prologue.

- as before, every function also sets up a frame pointer in the BP
  register to facilitate exception unwinds.

- as before, a function that uses more than a few words of stack, or
  that performs a function call, also needs to check the remaining
  size of the stack upfront and allocate more stack if needed. This is
  because Go allocates tiny stacks to goroutines by default.

Naturally, the epilogue un-does these operations.

Here is an example function prologue and epilogue, taken from one of
the Go runtime's internal functions:

.. code-block:: asm

   internal/cpu.Initialize:
          ; Check remaining stack size:
          MOVQ FS:0xfffffff8, CX
          CMPQ 0x10(CX), SP ; at least 24 bytes on the stack?
          JBE 0x401047      ; no: go to block at end of function below

          ; Allocate activation record:
          SUBQ $0x18, SP    ; 24 bytes in activation record

          ; Set up the frame pointer
          MOVQ BP, 0x10(SP) ; BP is callee-save: store it
          LEAQ 0x10(SP), BP ; set up new frame pointer
          ...

          MOVQ 0x10(SP), BP ; restore the caller's frame pointer
          ADDQ $0x18, SP    ; deallocate the activation record
          RET               ; return

  0x401047:
          CALL runtime.morestack_noctxt(SB) ; alloc more stack
          JMP internal/cpu.Initialize(SB) ; restart

Callee-save registers—or not
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Go 1.15 did not evolve from 1.10: there are still no callee-save
registers. All temporaries are spilled to the stack upon a function
call.

The cost of pointers and interfaces
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The layout of pointers and interfaces remains unchanged:

- a pointer takes one word.
- an interface takes two words: one for the vtable, one for the
  reference to the object.
- strings have two words: the length and a pointer to the string's
  bytes.
- slices have three words: the length, the capacity and a pointer to
  the data.
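
These sizes can be checked from Go itself. The following is a minimal
sketch, not taken from the original analysis; the helper name
``words`` is made up for illustration. It divides the size of each
kind of value by the size of a machine word.

.. code-block:: go

    package main

    import (
        "fmt"
        "unsafe"
    )

    // wordSize is the size of one machine word (8 bytes on amd64).
    const wordSize = unsafe.Sizeof(uintptr(0))

    // words (a hypothetical helper) reports how many machine words a
    // pointer, an interface value, a string and a slice occupy.
    func words() (ptr, iface, str, slice uintptr) {
        var (
            p *int
            i interface{}
            s string
            b []byte
        )
        return unsafe.Sizeof(p) / wordSize,
            unsafe.Sizeof(i) / wordSize,
            unsafe.Sizeof(s) / wordSize,
            unsafe.Sizeof(b) / wordSize
    }

    func main() {
        ptr, iface, str, slice := words()
        fmt.Println(ptr, iface, str, slice) // prints "1 2 2 3"
    }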

As before, interface references to empty structs use a zero pointer as
object reference; the entirety of the implementation is decided by the
vtable pointer.

When an object implements an interface by value, *in the general case*
converting the object to an interface reference moves the object to
the heap.

To see how this happens, we can use the following code:

.. code-block:: go

    // Define the interface.
    type Foo interface{ foo() }
    // Define a struct type implementing the interface by value.
    type bar struct{ x int }
    func (bar) foo() {}
    // Define a global variable so we don't use the heap allocator.
    var y bar

    // Make an interface value.
    func MakeInterface2() Foo { return y }

This gives us:

.. code-block:: asm

   MakeInterface2:
      ; <function prologue>

      ; write y to 0(SP), as an argument to runtime.convT64
      0x45c55d                488b057cc70900          MOVQ main.y(SB), AX
      0x45c564                48890424                MOVQ AX, 0(SP)
      ; call runtime.convT64, this converts the object to a heap reference
      0x45c568                e833c5faff              CALL runtime.convT64(SB)
      ; extract the return value
      0x45c56d                488b442408              MOVQ 0x8(SP), AX
      ; take the vtable pointer
      0x45c572                488d0d07c00200          LEAQ go.itab.main.bar,main.Foo(SB), CX
      ; write both to the return value slot for MakeInterface2
      0x45c579                48894c2420              MOVQ CX, 0x20(SP)
      0x45c57e                4889442428              MOVQ AX, 0x28(SP)

      ; <function epilogue>
      0x45c58c                c3                      RET

This code generation in v1.15 is slightly different from what it was in v1.10; back then we would see instead:

.. code-block:: asm


   MakeInterface2:
      ; <function prologue>

      ; take the vtable pointer
      0x4805dd              488d053c020400          LEAQ go.itab.src.bar,src.Foo(SB), AX
      ; pass it as argument to convT2I64
      0x4805e4              48890424                MOVQ AX, 0(SP)
      ; take the address of y
      0x4805e8              488d05e9f10b00          LEAQ main.y(SB), AX
      ; pass it as argument to convT2I64
      0x4805ef              4889442408              MOVQ AX, 0x8(SP)
      ; convert to interface reference
      0x4805f4              e8e7b2f8ff              CALL runtime.convT2I64(SB)
      ; copy the return value from runtime.convT2I64 to the return slot of MakeInterface2
      0x4805f9              488b442410              MOVQ 0x10(SP), AX
      0x4805fe              488b4c2418              MOVQ 0x18(SP), CX
      0x480603              4889442430              MOVQ AX, 0x30(SP)
      0x480608              48894c2438              MOVQ CX, 0x38(SP)

      ; <function epilogue>
      0x480616              c3                      RET

This is a new optimization: the conversion of an object to an interface
reference now costs 7 instructions instead of 9 previously. The main
change is that previously, ``runtime.convT2I64`` was responsible both
for moving the object to the heap and attaching the vtable pointer;
whereas in v1.15 ``runtime.convT64`` just moves the object to the heap
and returns a naked pointer, and the caller is responsible for
attaching the vtable pointer.


Additionally, another optimization is performed inside the ``convT64``
function: for certain specific values, no heap allocation is
performed.  In v1.10, this optimization was restricted to the case of
a single-word value or struct that was initialized to its default (all
zero bytes). In v1.15, the optimization was extended to include all
integer values smaller than 256 (i.e. 0x00-0xFF).

This optimization is available for all word-sized types or
smaller. For example, it works with an integer type implementing the
interface directly, as well as for a struct with a single integer
field.
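
The effect of this optimization can be observed without reading
assembly, by counting heap allocations. The following is a rough
sketch under stated assumptions: the helper ``allocsFor`` and the
global ``sink`` are made up for illustration, and the measurement
relies on ``testing.AllocsPerRun`` from the standard library. The
``//go:noinline`` directive is there so that the compiler sees ``v``
as a run-time value and not as a constant.

.. code-block:: go

    package main

    import (
        "fmt"
        "testing"
    )

    type Foo interface{ foo() }

    type bar struct{ x int }

    func (bar) foo() {}

    // sink is a global, so the interface value stored in it escapes.
    var sink Foo

    // allocsFor (a hypothetical helper) reports the average number of
    // heap allocations needed to convert bar{x: v} to a Foo.
    //
    //go:noinline
    func allocsFor(v int) float64 {
        return testing.AllocsPerRun(1000, func() {
            sink = bar{x: v}
        })
    }

    func main() {
        fmt.Println("allocations for x=7:   ", allocsFor(7))
        fmt.Println("allocations for x=1000:", allocsFor(1000))
    }

If the behavior described above holds for the toolchain in use, the
first conversion should report no allocation and the second should
report one.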

Vararg calls
~~~~~~~~~~~~

Go supports variable numbers of arguments, via the ``...``
construct. In a nutshell, the caller prepares a slice object on the
stack and makes it point at the positional arguments (also on the
stack), then passes that slice as fixed-position argument to the
callee.

In addition to this, if the vararg list was declared with an interface
type (which is a common case, for example ``fmt.Printf`` has
``...interface{}``), a conversion from each argument value to an
interface reference must also take place. This conversion moves each
argument to the heap in the general case, with a “small numbers
optimization” as described in the previous section.

Let us see what this looks like. First we can look at the case of a
vararg list that is *not* an interface type:

.. code-block:: go

   func f(...int) {}

   var x,y,z,w int
   func caller() {
      f(x,y,z,w)
   }

This gives us:

.. code-block:: asm

   caller:
       ; <function prologue>

       ; fill the slice:
       XORPS X0, X0          ; set 2 words (128 bit) to zero in X0
       MOVUPS X0, 0x18(SP)   ; zero the 1st and 2nd elements of the slice
       MOVUPS X0, 0x28(SP)   ; zero the 3rd and 4th elements of the slice
       MOVQ main.x(SB), AX
       MOVQ AX, 0x18(SP)     ; store x into 1st position
       MOVQ main.y(SB), AX
       MOVQ AX, 0x20(SP)     ; store y into 2nd position
       MOVQ main.z(SB), AX
       MOVQ AX, 0x28(SP)     ; store z into 3rd position
       MOVQ main.w(SB), AX
       MOVQ AX, 0x30(SP)     ; store w into 4th position

       ; prepare the slice as outgoing argument:
       LEAQ 0x18(SP), AX     ; store the base address
       MOVQ AX, 0(SP)
       MOVQ $0x4, 0x8(SP)    ; store the length
       MOVQ $0x4, 0x10(SP)   ; store the capacity

       CALL main.f(SB)       ; call the function

       ; <function epilogue>
       RET

So far, no surprises. It may be interesting to note that Go always
wastes instructions to pre-initialize the vararg slice to zero, even
though it immediately populates it afterwards with the argument
values. A C++ compiler would not do that for vararg calls and would
simply write the arguments directly to their final spots.
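
The slice built by the caller in the listing above is also visible
from the callee's side. Here is a small sketch (the function name
``shape`` is made up for illustration) that reports the length and
capacity of the slice received for the variadic parameter:

.. code-block:: go

    package main

    import "fmt"

    // shape (a hypothetical helper) reports the length and capacity of
    // the slice that the callee receives for its variadic parameter,
    // and whether that slice is nil.
    func shape(args ...int) (length, capacity int, isNil bool) {
        return len(args), cap(args), args == nil
    }

    func main() {
        // Four positional arguments: a fresh 4-element slice.
        fmt.Println(shape(1, 2, 3, 4)) // prints "4 4 false"

        // No arguments: no slice is built, the callee sees nil.
        fmt.Println(shape()) // prints "0 0 true"

        // An existing slice passed with "..." is handed over as-is.
        s := make([]int, 2, 8)
        fmt.Println(shape(s...)) // prints "2 8 false"
    }

With positional arguments, the length and capacity both equal the
number of arguments, which matches the two ``MOVQ $0x4`` stores in
the listing.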

We can then compare what happens when the function takes its arguments
using an interface type:

.. code-block:: go

   // note: now we have an interface type.
   func f(...interface{}) {}

   var x,y,z,w int
   func caller() {
      f(x,y,z,w)
   }

This gives us:

.. code-block:: asm

   caller:
       ; <function prologue>

       ; fill the slice:
       XORPS X0, X0         ; zero out the slice
       MOVUPS X0, 0x38(SP)
       MOVUPS X0, 0x48(SP)
       MOVUPS X0, 0x58(SP)
       MOVUPS X0, 0x68(SP)

       MOVQ main.x(SB), AX
       MOVQ AX, 0x30(SP)    ; copy x on the stack, out of the slice
       LEAQ 0x7995(IP), AX
       MOVQ AX, 0x38(SP)    ; place x's interface{} vtable ptr in the slice
       LEAQ 0x30(SP), CX
       MOVQ CX, 0x40(SP)    ; place the address of x's copy in the slice

       MOVQ main.y(SB), CX
       MOVQ CX, 0x28(SP)    ; copy y on the stack, out of the slice
       MOVQ AX, 0x48(SP)    ; place the same vtable ptr as x in the slice
       LEAQ 0x28(SP), CX
       MOVQ CX, 0x50(SP)    ; place the address of y's copy in the slice

       MOVQ main.z(SB), CX
       MOVQ CX, 0x20(SP)    ; copy z on the stack, out of the slice
       MOVQ AX, 0x58(SP)    ; place the same vtable ptr as x in the slice
       LEAQ 0x20(SP), CX
       MOVQ CX, 0x60(SP)    ; place the address of z's copy in the slice

       MOVQ main.w(SB), CX
       MOVQ CX, 0x18(SP)    ; copy w on the stack, out of the slice
       MOVQ AX, 0x68(SP)    ; place the same vtable ptr as x in the slice
       LEAQ 0x18(SP), AX
       MOVQ AX, 0x70(SP)    ; place the address of w's copy in the slice

       LEAQ 0x38(SP), AX
       MOVQ AX, 0(SP)       ; set the slice base address as argument
       MOVQ $0x4, 0x8(SP)   ; the slice's size
       MOVQ $0x4, 0x10(SP)  ; the slice's capacity

       CALL main.f(SB)      ; call the function

       ; <function epilogue>
       RET

Here are the main differences:

- Each position in the vararg slice now has two words instead of just one.
- Each value passed must be passed by reference. Simple object types
  (such as the integers here) are merely copied into the caller's
  activation record, and the address of their stack copy is added to
  the slice.

  (The reason why the object is first copied to the stack, instead of
  placing the address to the global variable directly in the slice, is
  that Go must preserve the sequential semantics that the value is
  sampled at the time the call is performed, and will not change in
  the slice even if the global variable is modified in the callee or
  another goroutine.)

- If the interface were non-trivial, we would also see a call to
  ``runtime.convT`` for each argument.
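
The sampling semantics mentioned in the list above can be illustrated
with a short sketch. The function ``observe`` is made up for this
example: it overwrites the global variable before reading its own
argument list.

.. code-block:: go

    package main

    import "fmt"

    var x = 1

    // observe (a hypothetical helper) modifies the global x, then
    // returns the first element of its argument slice.
    func observe(args ...interface{}) interface{} {
        x = 99
        return args[0]
    }

    func main() {
        got := observe(x)
        // The argument still holds the value x had at the time of the
        // call, even though x itself has changed since.
        fmt.Println(got, x) // prints "1 99"
    }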

Exception handling
------------------

Implementation of ``defer``
~~~~~~~~~~~~~~~~~~~~~~~~~~~

``defer`` is the keyword by which a programmer can specify one or more
callback functions to call on every return path, including exception
unwinds. This helps implement RAII__ patterns in Go.

.. __: https://en.wikipedia.org/wiki/Resource_acquisition_is_initialization

In Go 1.10, each use of ``defer`` was translated to a *call* to
``runtime.deferproc`` which would register the deferred call onto the
current goroutine's unwind stack. Because this was done via a call,
every function containing the ``defer`` keyword also had to check its
stack size and prepare an activation record, *even for functions that
did not perform function calls otherwise*. In other words, the cost of
using ``defer`` in “leaf” functions was rather high.

Additionally, in Go 1.10, the compiler would place a call to
``runtime.deferreturn`` on every return path, and that runtime
function was responsible for performing the defer calls.

`An example of these previous mechanisms is given in the previous analysis <https://dr-knz.net/go-calling-convention-x86-64.html#implementation-of-defer>`_.

In contrast, Go 1.15 contains two optimizations that make the
implementation of ``defer`` rather different in the case
when ``defer`` is used unconditionally—i.e. it is always reached
from the function's entry point. In that case:

- when there are 8 uses of ``defer`` or fewer, the Go compiler
  optimizes them by writing the callbacks to the function's activation
  record directly. There is no need for a call to
  ``runtime.deferproc`` any more in that case. During exception
  unwinding, the unwinding code knows where to look for defers in each
  activation record.

- separately, the compiler also emits the full call sequences to the
  deferred functions in every return path, so that the natural return
  control flow performs these calls, and there is no call to
  ``runtime.deferreturn``.

This way, a function containing 8 or fewer unconditional ``defer``
statements that call functions that can themselves be inlined does not
pay the overhead of setting up a caller context if it does not
otherwise perform function calls.

(Naturally, these optimizations do not work if a ``defer`` is
conditional, or occurs inside a loop.)
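
To make the distinction concrete, here is a sketch of the two shapes
of code; the function names are made up for illustration. Going by
the rules above, ``straight`` has the shape that qualifies for the
optimization, while ``looped`` does not. Whether a given compiler
actually applies it cannot be seen from the program's output: both
functions run their deferred calls in last-in, first-out order.

.. code-block:: go

    package main

    import "fmt"

    // straight uses two unconditional defers: both are always reached
    // from the function's entry point.
    func straight() (log []string) {
        defer func() { log = append(log, "first") }()
        defer func() { log = append(log, "second") }()
        return nil
    }

    // looped registers its defers from inside a loop, so their number
    // is only known at run time.
    func looped(n int) (log []int) {
        for i := 0; i < n; i++ {
            i := i // per-iteration copy, captured by the closure
            defer func() { log = append(log, i) }()
        }
        return nil
    }

    func main() {
        fmt.Println(straight()) // prints "[second first]"
        fmt.Println(looped(3))  // prints "[2 1 0]"
    }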

Here is an example:

.. code-block:: go

     func Defer1() int { defer f(); return 123 }

This compiles to:

.. code-block:: asm

   Defer1:
          ; <function prologue>

          0x45c4bd         MOVQ $0x0, AX
          0x45c4c4         MOVQ AX, 0x8(SP)   ; clear the slot that will hold the deferred function
          0x45c4c9         MOVB $0x0, 0x7(SP) ; clear the defer flag byte: no defer is active yet

          ; write zero to the return value slot
          0x45c4ce         MOVQ $0x0, 0x20(SP)

          ; defer the call to f()
          0x45c4d7         LEAQ 0x1b672(IP), AX
          0x45c4de         MOVQ AX, 0x8(SP)      ; write the address of f
          0x45c4e3         MOVB $0x1, 0x7(SP)    ; let the runtime know there is 1 defer

          ; write the return value 123
          0x45c4e8         MOVQ $0x7b, 0x20(SP)

          ; un-defer
          0x45c4f1         MOVB $0x0, 0x7(SP)    ; let the runtime know there is no more defer
          ; final call to f() on the return path
          0x45c4f6         CALL main.f(SB)

          ; <function epilogue>
          0x45c504         RET

          ; the following code is called during unwinds after a recover,
          ; not in the common case:
          0x45c505         CALL runtime.deferreturn(SB)
          0x45c50a         MOVQ 0x10(SP), BP
          0x45c50f         ADDQ $0x18, SP
          0x45c513         RET

In this example, the ``f()`` function is non-inlinable so the call to
``f()`` remains explicit in the generated code. If ``f()`` had been
inlinable, then the return paths would be simplified and the
``Defer1()`` function would not need an activation record.

Implementation of ``panic``
~~~~~~~~~~~~~~~~~~~~~~~~~~~

Throwing an already-built value as an exception within a function body
works in Go 1.15 very much like in Go 1.10: as a call to the function
``runtime.gopanic``. This function takes an argument of type
``interface{}``; therefore, whatever value is passed must be promoted
to an interface reference as explained above.

Here is an example:

.. code-block:: go

    func Panic() { panic(123) }

This compiles to:

.. code-block:: asm

   Panic:
      ; <function prologue>

      ; load the vtable for interface{}:
      LEAQ 0x78dc(IP), AX
      MOVQ AX, 0(SP)

      ; load the address of a static copy of the
      ; integer value 123:
      LEAQ 0x2afa9(IP), AX
      MOVQ AX, 0x8(SP)

      ; call gopanic:
      CALL runtime.gopanic(SB)
      NOPL
      ; note: function epilogue omitted in this case

There is no surprise here—the compiler knows that the function never
returns and thus the return path (and, in this case, the entirety of
the function's epilogue) is omitted.

There is a small optimization in v1.15 compared to v1.10: the padding
instruction after the call used to be an undefined 2-byte opcode
0x0F0B (disassembled as UD2); this is now generated as a regular NOP,
which is smaller (just 1 byte, 0x90).

Catching exceptions: ``defer`` + ``recover``
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The mechanism for catching exceptions has not changed: any use of
``recover()`` in the source is compiled as a regular function call to
``runtime.gorecover``. As in previous versions, this halts the
exception propagation and outputs the panic object as return value.
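
At the source level, the pattern looks as follows. This is a minimal
sketch; the helper ``catch`` is made up for illustration. It shows
that the value given to ``panic`` comes back out of ``recover()`` as
an ``interface{}``.

.. code-block:: go

    package main

    import "fmt"

    // catch (a hypothetical helper) runs f and returns the value that
    // was passed to panic, or nil when f returns normally.
    func catch(f func()) (v interface{}) {
        defer func() { v = recover() }()
        f()
        return nil
    }

    func main() {
        v := catch(func() { panic(123) })
        n, isInt := v.(int)
        fmt.Println(n, isInt) // prints "123 true"

        // Without a panic, recover() returns nil.
        fmt.Println(catch(func() {}) == nil) // prints "true"
    }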

Summary and conclusions
-----------------------

In this chapter, we revisited the findings from two years ago.

To a large extent, the low-level calling convention in Go v1.15 is not
much different from what it was in v1.10; the `previous observations <https://dr-knz.net/go-calling-convention-x86-64.html#id39>`_
thus largely remain unchanged:

- arguments and return values are still passed via memory.
- activation records are still registered dynamically, instead of using
  static unwinding tables as is done in e.g. C++.
- pointers occupy one word; ``string`` values and interface
  references, two; and slices occupy three words.
- the promotion of objects to interface references, when they
  implement the interface by value, requires a move to the heap
  via a call to a runtime function in the general case.

The notable changes are as follows:

- converting a value to an interface reference has become simpler, as
  the caller does not pass the vtable pointer to the runtime ``conv``
  functions any more. This saves instructions on the way in and out of
  the conversion.

- the “small value” optimization, which aims to avoid a heap
  allocation when promoting a value to an interface reference, has been
  extended to all 1-word values from zero up to and including 255.

- when a function contains 8 or fewer non-conditional uses of
  ``defer``, an optimization kicks in that prevents calls to
  ``runtime.deferproc`` and ``runtime.deferreturn`` entirely. In that
  case, the deferred callback information is stored in the function's
  activation record. The exception unwinding code is now equipped to
  find deferred callbacks there, in addition to the goroutine
  struct. This greatly reduces the runtime overhead in the common case
  when a function only uses ``defer`` once or twice, in the main
  control path.

Additionally, this time we visited a more detailed example of calling
a vararg function, with the step-by-step construction of the argument
slice.

Because of the lack of major changes, `the open question from last
time
<https://dr-knz.net/go-calling-convention-x86-64.html#an-interesting-question-error-vs-panic>`_
is just as valid with Go 1.15 as it was in 1.10:

  What is cheaper: handling exceptions via ``panic`` / ``recover``, or
  passing and testing error results with ``if err := ...; err != nil {
  return err }``?

This question is non-trivial because the cost of a ``panic`` call and
the top-level error recovery with ``defer`` and ``recover`` can be
amortized across a workload. Where would the inflection point lie?
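
For reference, the two styles being compared can be sketched as
follows. All names here are made up for illustration, and the sketch
only shows the shape of the code, not its relative cost: ``sumErr``
tests an error result at every step, while ``sumPanic`` pays for a
single ``defer`` and ``recover`` at the top level.

.. code-block:: go

    package main

    import (
        "errors"
        "fmt"
    )

    var errNegative = errors.New("negative input")

    // stepErr reports failure through an explicit error result.
    func stepErr(n int) (int, error) {
        if n < 0 {
            return 0, errNegative
        }
        return n * 2, nil
    }

    // sumErr tests the error result after every call.
    func sumErr(inputs []int) (int, error) {
        total := 0
        for _, n := range inputs {
            v, err := stepErr(n)
            if err != nil {
                return 0, err
            }
            total += v
        }
        return total, nil
    }

    // stepPanic reports failure by panicking.
    func stepPanic(n int) int {
        if n < 0 {
            panic(errNegative)
        }
        return n * 2
    }

    // sumPanic recovers once, at the top level.
    func sumPanic(inputs []int) (total int, err error) {
        defer func() {
            if r := recover(); r != nil {
                e, ok := r.(error)
                if !ok {
                    panic(r) // not ours: keep propagating
                }
                total, err = 0, e
            }
        }()
        for _, n := range inputs {
            total += stepPanic(n)
        }
        return total, nil
    }

    func main() {
        fmt.Println(sumErr([]int{1, 2, 3}))   // prints "12 <nil>"
        fmt.Println(sumPanic([]int{1, 2, 3})) // prints "12 <nil>"
        fmt.Println(sumErr([]int{1, -1}))     // prints "0 negative input"
        fmt.Println(sumPanic([]int{1, -1}))   // prints "0 negative input"
    }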

We will revisit this question in the next part.

Also in the series:

- `The Go low-level calling convention on x86-64`__

  .. __: https://dr-knz.net/go-calling-convention-x86-64.html

- `Measuring argument passing in Go and C++`__

  .. __: https://dr-knz.net/measuring-argument-passing-in-go-and-cpp.html

- `Measuring multiple return values in Go and C++`__

  .. __: https://dr-knz.net/measuring-multiple-return-values-in-go-and-cpp.html

- `Measuring errors vs. exceptions in Go and C++`__

  .. __: https://dr-knz.net/measuring-errors-vs-exceptions-in-go-and-cpp.html