This week, I collaborated with my friend Martin at Daiteq to add a feature to the UT-LEON3 processor and support it in our variant of the C language called SL.

In short, we extended the processor’s thread creation protocol to support a wider variety of synchronization modes, and created a new way to pass values to new threads. The goal was to provide more flexibility to users of the platform, in particular students of a graduate program where Martin is teaching.

We designed the ISA change together, Martin worked on adapting the VHDL specification of UT-LEON3 accordingly, and I worked on the compiler changes. I will be relating this experience below.

❦❦❦

What changed in the platform

There were really two changes:

  1. an extension of which threads can join (wait for termination of) another thread.
  2. a new way to pass arguments to newly created threads.

Background: UT-LEON3, like the other processors of the Microgrid family (including the various microthreaded designs from the Apple-CORE project and LEON2-MT designed by Daiteq), provides custom hardware for thread management and synchronization. Its programming model provides a large number of very simple virtual cores, and a logical create operation available to programmers that combines reserving one or more virtual processors in bulk (a bulk is called a family), assigning an initial PC and some arguments to all of them simultaneously, and firing their execution. The virtual processors then execute their shared thread program to completion, using separate PCs and local registers (MIMD execution).

Synchronization on termination

Before the change, on UT-LEON3, only the creating thread (that issued a creation) could wait on termination of a thread family.

Specifically, the low-level create instruction would both fire execution of the thread family and also start waiting on its termination. This wait was performed by binding a local register in the parent thread to the termination future. In this design, further uses of the register cannot proceed until the child family terminates, thereby limiting the ability to join to the parent thread.

This severely limits usability as many interesting programming problems call for more flexible synchronization patterns.

This limitation was also unique to UT-LEON3: the other microthreaded processors provide a “detached” synchronization protocol:

  • create merely fires execution,
  • a separate instruction, called sync or fence, starts waiting for termination of a family identified by a numeric ID. It is possible and valid to issue this instruction in an unrelated thread.

So, this week, Martin and I worked to extend UT-LEON3 accordingly. UT-LEON3 now recognizes the same sync instruction previously defined for MT-SPARC (see reference, appendix D).

An interesting pitfall was to properly handle the case where a family terminates before another thread starts to wait on its termination. A naive implementation would let it finish and also deallocate its data structures, turning any subsequent sync into undefined behavior. This specific problem was encountered for the first time in 2010 (if my memory serves me right), while implementing the MGSim simulator, and it was amusing to encounter it again nearly 10 years afterwards.
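To make the pitfall concrete, here is a minimal sketch in plain C of one way a family table could defer deallocation until the join has happened. Everything in it (the `family_entry` structure, the state names, the table size and the function names) is hypothetical and chosen for illustration; it does not describe the actual UT-LEON3 hardware or MGSim.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of one entry in a family table. */
enum family_state { FAM_FREE, FAM_RUNNING, FAM_DONE };

struct family_entry {
    enum family_state state;
    bool detached;  /* true: nobody will sync, clean up automatically */
};

#define NUM_FAMILIES 4
static struct family_entry table[NUM_FAMILIES];

/* Reserve a free entry and mark it running; returns its ID, or -1 if full. */
int family_create(bool detached)
{
    for (int fid = 0; fid < NUM_FAMILIES; fid++) {
        if (table[fid].state == FAM_FREE) {
            table[fid].state = FAM_RUNNING;
            table[fid].detached = detached;
            return fid;
        }
    }
    return -1;
}

/* Called when the last thread of the family finishes. */
void family_terminate(int fid)
{
    assert(fid >= 0 && fid < NUM_FAMILIES);
    assert(table[fid].state == FAM_RUNNING);
    /* Release the entry right away only if no sync can follow;
       otherwise keep it around in the DONE state. */
    table[fid].state = table[fid].detached ? FAM_FREE : FAM_DONE;
}

/* Non-blocking model of sync: returns true and releases the entry if the
   family has terminated, false if the caller must keep waiting. */
bool family_try_sync(int fid)
{
    assert(fid >= 0 && fid < NUM_FAMILIES);
    assert(table[fid].state != FAM_FREE);
    if (table[fid].state != FAM_DONE)
        return false;
    table[fid].state = FAM_FREE;
    return true;
}
```

The point of the sketch is the `FAM_DONE` state: a family that terminates early stays allocated until some thread syncs on it, so a late sync still finds a valid entry.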

Extra arguments for thread families

The “Microthreading way” to communicate a value to all threads in a family is to give it a global parameter, shared by all threads in the family and populated just once when creating the family. A valuable feature of the architecture is that this parameter is implemented using a single physical register in hardware that is visible from the architectural register window of every logical thread in the family, thereby properly factoring its hardware cost.

In all microthreaded processors except UT-LEON3, the global parameters are physically shared by all threads of a family but are physically separate from the registers of the parent thread. This makes it possible to let execution in the parent thread proceed independently of the child family.

In UT-LEON3 however, for historical reasons that are out of scope here, the global parameters are also physically shared with the parent thread. The effect of this (mis)design is that these particular registers become pinned and unusable in the parent thread during the entire execution of the child family.

It also conflicts with the desire from the previous section to create more asynchrony and enable waiting on a child family in an unrelated thread.

So, this week, we worked on this problem.

Unfortunately, we found out that changing the register mapping in UT-LEON3 to mirror the semantics of the other microthreaded processors was too complex, because decoupling a family’s globals from its parent thread would require additional complexity (both in hardware and the compiler) to allocate separate physical registers during creation and provision their initial value. Given that the UT-LEON3 allocation logic has grown to become rather complex and hard to understand, we deemed it too risky to change it further.

Instead, we opted for an ad-hoc solution specific to UT-LEON3:

  1. we’re extending the hardware data structure for family parameters by an additional 32-bit field called the “extra argument”.
  2. upon initialization of each thread in the family, the extra argument is copied to the 2nd local register (the 1st local register already receives the logical index of the thread in the family).
  3. we’re adding a new setarg instruction for a parent thread to provision the new field in the family entry.
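The three steps above can be modelled in a few lines of plain C. This is only a sketch: the structure layouts, the `NUM_LOCALS` constant and the function names are assumptions made for illustration, and the real design is specified in VHDL rather than C.

```c
#include <assert.h>
#include <stdint.h>

#define NUM_LOCALS 8  /* hypothetical number of local registers per thread */

/* Hypothetical family entry, reduced to the fields relevant here. */
struct family_entry {
    uint32_t pc;         /* initial PC shared by all threads */
    uint32_t extra_arg;  /* the new 32-bit "extra argument" field */
};

/* Hypothetical per-thread context. */
struct thread_ctx {
    uint32_t pc;
    uint32_t local[NUM_LOCALS];
};

/* Model of setarg: the parent provisions the field in the family entry. */
void setarg(struct family_entry *fam, uint32_t value)
{
    assert(fam);
    fam->extra_arg = value;
}

/* Model of what happens when a thread of the family is initialized. */
void thread_init(struct thread_ctx *t, const struct family_entry *fam,
                 uint32_t index)
{
    assert(t && fam);
    t->pc = fam->pc;
    t->local[0] = index;           /* 1st local register: thread index */
    t->local[1] = fam->extra_arg;  /* 2nd local register: extra argument */
}
```

Because the value is copied into each thread's own local register, nothing stays shared with the parent thread once the family has been created.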

Of course, this approach is a hack and does not come close to the full generality and flexibility of the concept of family parameters (multiple parameters + dataflow synchronization) supported by the other processors. Nonetheless, it will do as a small incremental improvement in UT-LEON3, for use in combination with the new asynchrony defined in the previous section: as a means to provide at least one argument to asynchronous families.

❦❦❦

Language changes

Background: to exploit the special threading support of the various microthreaded processors, I designed an extension to the C language called “SL” a dozen years ago. The accompanying SL compiler is able to generate code for the 4 published processor designs based off the SPARC and Alpha ISAs, 3 forms of software simulation, an experimental MIPS-based design, and many derived variants.

This provides the following three constructs to create a thread family:

  1. bulk creation and synchronization (traditional fork/join parallelism):

    sl_create([fid], [pid], [start], [limit], [step], [block], fn, [args...]);
    ...
    sl_sync();
    
  2. bulk creation for an asynchronous family that can be joined in an unrelated thread:

    sl_spawndecl(fid);
    sl_create(fid, [pid], [start], [limit], [step], [block], fn, [args...]);
    ...
    sl_forcespawn(fid);
    

    (This couples with a sl_spawnsync(fid) primitive to join on the family’s termination.)

  3. bulk creation for a fully asynchronous family that cannot be joined afterwards:

    sl_create([fid], [pid], [start], [limit], [step], [block], fn, [args...]);
    ...
    sl_detach();
    

Prior to this week, only the sl_create/sl_sync pair was supported on UT-LEON3, whereas all three would be supported for the other microthreaded processors.

Code generation for asynchronous families

Previously, the code generation for UT-LEON3 was implemented as follows:

  • sl_create would expand to a copy of the various creation parameters into hidden local variables in the current C scope.
  • sl_sync would expand to a family creation proper, that is an adequate sequence of allocate, set* and create instructions. It would also issue the wait on the termination future produced by create.

The astute reader may notice that despite the name “sl_create”, no actual thread creation was occurring at that point. This differs from the code generation for the other microthreaded processors, where creation does actually happen at that point.

This restriction is unique to UT-LEON3 and is mandated by the sharing of physical registers between parent thread and child family: I found it impossible (in fact, proved impossible, see reference, appendix G) to teach the register allocator in the C compiler about the registers that become pinned until termination of the child threads.

With the availability of the new sync instruction in UT-LEON3 it became possible to support the two variants sl_create/sl_detach and sl_create/sl_forcespawn.

I thus changed the code generation for the base case to become like this:

  • sl_create now expands to uses of the allocate and set* instructions, which prepare but do not fire bulk creation.
  • sl_sync now expands to create and a join on termination.
  • sl_detach now expands to create and no join, but including a marker for the family to automatically clean up.
  • sl_forcespawn now expands to create and no join, and no marker for the family to automatically clean up.
  • sl_spawnsync now expands to an explicit sync instruction to perform the join (and family clean-up).
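The mapping in this list can be summarized as a small table in plain C. This is a sketch for illustration only: the `sl_construct` enumeration, the `expansion` structure and the `expand` function are hypothetical names, not part of the SL compiler.

```c
#include <assert.h>
#include <stdbool.h>

/* The five SL constructs discussed above. */
enum sl_construct {
    SL_CREATE, SL_SYNC, SL_DETACH, SL_FORCESPAWN, SL_SPAWNSYNC
};

/* What each construct's expansion does, per the list above. */
struct expansion {
    bool prepares;      /* emits allocate and set* */
    bool fires;         /* emits create */
    bool joins;         /* waits for the family to terminate */
    bool auto_cleanup;  /* marks the family to clean up by itself */
};

struct expansion expand(enum sl_construct c)
{
    struct expansion e = { false, false, false, false };
    switch (c) {
    case SL_CREATE:     e.prepares = true;                    break;
    case SL_SYNC:       e.fires = true; e.joins = true;        break;
    case SL_DETACH:     e.fires = true; e.auto_cleanup = true; break;
    case SL_FORCESPAWN: e.fires = true;                        break;
    case SL_SPAWNSYNC:  e.joins = true;                        break;
    default:            assert(!"unknown construct");
    }
    return e;
}
```

Read this way, every closing construct except sl_spawnsync fires the creation, and only sl_sync and sl_spawnsync ever wait.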

Obviously, since UT-LEON3 still shares physical registers for parameters between parent thread and child family, the SL compiler refuses to compile the two pairs sl_create/sl_detach and sl_create/sl_forcespawn if there are any family parameters declared (as it would be impossible to generate code otherwise).

This sets the motivation for the next exercise: to provide another way to pass values to asynchronous families via the new “extra argument” feature specific to UT-LEON3.

Language change and code generation for the extra parameter

To give access to the new family “extra argument” field from C/SL, I decided to extend the language semantics as follows:

  • a thread program can be declared with the sl__extra function attribute. For example:

    sl_def(foo, sl__extra) ...
    
  • inside thread programs marked with sl__extra, it is possible to declare a variable using sl_extra_parameter(x) and this makes x receive the value of the extra argument of the current family automatically. For example:

    sl_def(foo, sl__extra) {
         sl_extra_parameter(x);
    
         int y = x + x;
    } sl_enddef
    
  • when creating a family, the sl__extra creation attribute must be specified too, as well as a sl_xarg to actually pass the value. For example:

    sl_create(,,,,,, sl__extra, foo, sl_xarg(123));
    

The presence of sl__extra in the attributes triggers a separate code generation path for the calling convention that connects thread functions/programs with sl_create.

Transparently, sl_extra_parameter maps to use of the second local register in the thread (as defined in the ISA change above), and sl_xarg expands to the new setarg instruction.

For simplicity, the extra argument is always typed as a C long. The programmer can recast the value to a different type if so desired.
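As an example of recasting, a single long can carry more than one small value. The sketch below, in plain C, packs two 16-bit quantities into the low 32 bits of a long and extracts them again. The helper names are hypothetical and not part of SL. The conversion of a 32-bit pattern with the top bit set to a 32-bit long is implementation-defined in C, and the sketch assumes the usual two's-complement wrap-around.

```c
#include <assert.h>
#include <stdint.h>

/* Pack two 16-bit values into the low 32 bits of a long. */
long pack_xarg(unsigned hi, unsigned lo)
{
    assert(hi <= 0xFFFFu && lo <= 0xFFFFu);
    return (long)(((uint32_t)hi << 16) | (uint32_t)lo);
}

/* Recover the upper 16-bit value from the extra argument. */
unsigned xarg_hi(long x)
{
    return (unsigned)(((uint32_t)x >> 16) & 0xFFFFu);
}

/* Recover the lower 16-bit value from the extra argument. */
unsigned xarg_lo(long x)
{
    return (unsigned)((uint32_t)x & 0xFFFFu);
}
```

A parent would pass the result of `pack_xarg` through sl_xarg, and the thread program would apply `xarg_hi` and `xarg_lo` to the variable declared with sl_extra_parameter.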

For now, I have changed the SL compiler to support these new language features when compiling for UT-LEON3 and when flattening the code for pure sequential execution. Using them for any of the other compilation targets will produce an error.
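To give an idea of what flattening to sequential execution means, here is a minimal sketch in plain C where a family becomes an ordinary loop over the thread indices. It is not the code the SL compiler actually emits. The function signature, the `state` pointer, the exclusive `limit` and the positive `step` are all assumptions made for this illustration.

```c
#include <assert.h>

/* Hypothetical signature of a flattened thread program:
   thread index, extra argument, and a pointer to shared state. */
typedef void (*thread_fn)(long index, long extra, void *state);

/* Run a family sequentially and return the number of threads run.
   Assumes an exclusive limit and a positive step. */
long run_family_flat(long start, long limit, long step,
                     thread_fn fn, long extra, void *state)
{
    assert(step > 0);
    long count = 0;
    for (long i = start; i < limit; i += step) {
        fn(i, extra, state);  /* every thread sees the same extra value */
        count++;
    }
    return count;
}

/* Example thread program mirroring the body of foo: y = x + x. */
void foo(long index, long extra, void *state)
{
    long *out = state;
    out[index] = extra + extra;
}
```

Calling `run_family_flat(0, 4, 1, foo, 123, out)` on an array of four longs runs four "threads" in order, each storing 246 at its own index.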

❦❦❦

Next steps

Testing, loads of testing, of course!

The teaching period starts next month. We may need to improve this for next year.

❦❦❦

Lessons learned

  • I found it fascinating how program sources released so many years ago (from 2011-2012, for the oldest dependencies used by the SL toolchain) still build fine on modern systems. The promises made by GNU Autoconf and Automake are holding strong, despite their detractors.
  • The lesson “Ingenuity and expediency are effective complements to analysis and synthesis” that I had learned seven years ago still holds true.
  • However, I found it profoundly satisfying to confirm the wisdom of thinking a lot about the compiler’s design upfront. I knew back in 2009 that I had to think about it ahead of time in proportion to its expected shelf life; I foresaw that its lifecycle would span many years, and I invested accordingly. It continues to pay off, again, again, again, and again to this day.

❦❦❦

References

Raphael ‘kena’ Poss is a computer scientist and software engineer specialized in compiler construction, computer architecture, operating systems and databases.