Blog

Coroutine demonstration with setjmp/longjmp (STM32)

Hi Reader,

Embedded programmers working without an RTOS for whatever reason are frequently rolling their own solutions for multitasking. Protothreads are a de-facto default for ultra-lightweight threading but since they are stackless, they are extremely limited.

Here’s a demonstration of a technique to allow cooperative multitasking by abusing the ancient setjmp/longjmp routines in C. This implementation is for STM32 and compiled with GCC and is not portable (though it is pretty easy to port). I used a similar technique on the Mooshimeter for the thread which has to deal with the SD card and filesystem (it has too much state and too many blocking points to easily handle stackless…).

What are setjmp/longjmp? In the long-long ago, they were used for exception handling in C. setjmp allows you to store your execution context in a jmp_buf structure, and longjmp allows you to restore that context from a jmp_buf. Context here means your position in the code, your CPU registers and your stack position. The original idea was that you’d call setjmp in front of some error handling code, then do whatever your application needed to do, and if something went wrong deep within the code the offending function could call longjmp and pop right back out to the top level error handling code rather unwinding the call stack in the conventional way, with every function checking and forwarding the return value of its callees.

AFAIK almost nobody does this anymore but the functions still exist in the C standard, so we can abuse it as a starting point for a coroutine library.

Goals of the library:

  • Make it easy to run a function as a coroutine with its own stack
  • Allow data passing between the coroutine and the main function
  • Facilitate an deferred-processing iterator pattern

First we generate typedefs to store all the information we need:

typedef struct {
  bool init_done; // Has this coroutine been initialized?
  bool finished;  // Has the base function of the coroutine hit CR_END?
  void* arg;      // Used for data exchange between coroutine and main function
  jmp_buf invoking_env; // main context
  jmp_buf cr_env;       // coroutine context
} coroutine_t;

// This is the prototype for the function we will run as a coroutine
typedef void(*coroutine_func_t)(coroutine_t*);

Now some basic glue functions for swapping between contexts. setjmp returns zero when it loads the jmp_buf on its first call. It returns non-zero when you are landing there from a remote longjmp.

void swapJmpBufs(jmp_buf* from, jmp_buf* to) {
  if(0 == setjmp(*from)) {
    longjmp(*to, 1);
  }
}

void switchToCoroutine(coroutine_t* cr) {
  swapJmpBufs(&cr->invoking_env, &cr->cr_env);
}

void yieldFromCoroutine(coroutine_t* cr) {
  swapJmpBufs(&cr->cr_env, &cr->invoking_env);
}

Next we have macros that will allow us to write our coroutine functions. CR_START does the first call to setjmp from within the target coroutine and returns immediately. CR_YIELD yields control back to the main thread. CR_END just sets the finished flag and yields forever. The __is_base_cr boolean is there to allow calling coroutine functions from within other coroutines, and it prevents the sub-coroutine from overwriting the jmp_buf and also allows it to return normally at the end of execution.

// CR_START does the initial jmp_buf setup only if this is the base coroutine (we check using the init_done member)
#define CR_START(cr) bool __is_base_cr = !cr->init_done; do{if(!cr->init_done && (0 == setjmp(cr->cr_env))) { return; }}while(0)
#define CR_YIELD(cr) yieldFromCoroutine(cr)
#define CR_END(cr) while(__is_base_cr){cr->finished = true; CR_YIELD(cr);}

There’s only one more function to the coroutine library but it’s the most dangerous. initCoroutine does the initial setup of a coroutine_t so that calling switchToCoroutine will properly jump to the desired function operating on its own stack. Basically what we’re doing is:

  • Calling in to the function we want to run as the coroutine (func) and letting it hit its CR_START macro, which will set up a jmp_buf and return.
  • Examine that jmp_buf to figure out how much stack was used to get to the CR_START macro
  • Copy that portion of stack over to the user-provided stack buffer for the new coroutine
  • Redirect the stack pointer entry of the jmp_buf so that when the context is restored with longjmp, it will point at the user-provided stack buffer.

Note that the position of the stack pointer within the jmp_buf is specific to the C libraries you’re using. Also this code has STM32 specific IRQ toggling built in.

// Initialize a coroutine
// After this function is run on a coroutine_t structure, feeding that structure to switchToCoroutine will invoke the
// function provided to the "func" arg on the stack buffer provided by the "stack" arg.
void initCoroutine(coroutine_t* cr_to_load, coroutine_func_t func, void* func_arg, uint8_t* stack, size_t stack_len) {
  // THIS IS NOT PORTABLE BETWEEN ARCHITECTURES
  register void* SP asm ("sp");
  void* local_sp = NULL;
  void* remote_sp = NULL;
  size_t stack_used = 0;
  uint32_t i = 0;

  cr_to_load->init_done = false;
  cr_to_load->finished = false;
  cr_to_load->arg = func_arg;

  bool interrupts_masked = __get_PRIMASK();

  // Call the coroutine function directly. It will yield immediately on the first run and load the cr_env jump_buf
  // BEGIN CRITICAL SECTION - WE ARE ABUSING THE STACK
  // If an IRQ were to strike here, we would hardfault down the road when restoring context
  __disable_irq();
  func(cr_to_load);
  cr_to_load->init_done = true;

  // Manipulate the jmp_buf
  static const uint32_t jmp_buf_sp_i = 8;

  // First figure out how much stack the function used before its initial call to setjmp
  local_sp = (void*)cr_to_load->cr_env[jmp_buf_sp_i];
  stack_used = (size_t)((uint32_t)SP - (uint32_t)local_sp);

  // Offsets within the jmp_buf for stack pointer and program counter derived from observation in debugger
  // Set the stack pointer to the user-provided stack buffer, expecting it to grow down
  remote_sp = (void*)(stack + stack_len - stack_used);
  cr_to_load->cr_env[jmp_buf_sp_i] = (int)remote_sp;

  // Copy the stack over to the new context
  for(i = 0; i < stack_used; i++) {
    ((uint8_t*)remote_sp)[i] = ((uint8_t*)local_sp)[i];
  }

  // END CRITICAL SECTION
  if(!interrupts_masked) {
    __enable_irq();
  }
  debug_print("Transposed %d bytes of stack\r\n", stack_used);

  return;
}

That’s it for the library itself, let’s write some tests. Let’s start with a really basic function that just interprets the shared argument as an int and prints it out. So the coroutine function is:

void myCoroutine(coroutine_t* cr_arg) {
  CR_START(cr_arg);
  while(1) {
    debug_print("CO: %u\r\n", *(uint32_t*)cr_arg->arg);
    CR_YIELD(cr_arg);
  }
  CR_END(cr_arg);
}

And the code to run it (from within main) is:

  uint8_t cr_stack[1024]; // yo dawg, I heard you like stacks, so...

  // First example: Super basic demo, just transition back and forth to a coroutine a few times.
  debug_print("\n\rTest 1: Switch back and forth\r\n");
  coroutine_t cr_test;
  uint32_t cr_arg = 0;
  initCoroutine(&cr_test, myCoroutine, (void*)&cr_arg, cr_stack, sizeof(cr_stack));

  for(uint32_t i = 0; i < 4; i++) {
    // Feed arbitrary example data to the coroutine.
    debug_print("MAIN: Switching to coroutine\r\n");
    cr_arg = i*i;
    switchToCoroutine(&cr_test);
  }

Note that debug_print behaves like printf on my setup and just pushes data out a serial port where I can see it. Running this test gives me:

Test 1: Switch back and forth
Transposed 24 bytes of stack
MAIN: Switching to coroutine
CO: 0
MAIN: Switching to coroutine
CO: 1
MAIN: Switching to coroutine
CO: 4
MAIN: Switching to coroutine
CO: 9

Nice. We are swapping back and forth between two different stacks and sensibly exchanging data. Let’s try something a little more involved: an iterator pattern. Let’s make an iterator which is initialized with a list of numbers, and when you call next it will tell you whether the next number is prime or not. I’m keeping this demo all in C, but you can see that I’m basically encapsulating the coroutine_t data in an object which is almost a class (and would be really easy to implement as such).

bool isPrime(uint32_t num) {
  // Simple brute force prime check
  for(uint32_t i = 2; i*i < num; i++) {
    if((num % i) == 0) {
      return false;
    }
  }
  return true;
}

typedef struct {
  bool is_prime_out;
  uint32_t* num_in;
  size_t n_num;
  coroutine_t cr;
} isprime_iter_member_t;

void isPrime_coroutine(coroutine_t* cr_arg) {
  CR_START(cr_arg);
  isprime_iter_member_t* arg = (isprime_iter_member_t*)cr_arg->arg;
  uint32_t i = 0;
  while(true) {
    uint32_t n = arg->num_in[i];
    arg->is_prime_out = isPrime(n);
    debug_print("CO: Is %u prime?\r\n", n);
    if(++i == arg->n_num) {
      break;
    } else {
      CR_YIELD(cr_arg);
    }
  }
  CR_END(cr_arg);
}

void isPrime_init(isprime_iter_member_t* to_fill, uint32_t* num_to_check, size_t n_num_to_check, uint8_t* stack, size_t stack_len) {
  to_fill->num_in = num_to_check;
  to_fill->n_num = n_num_to_check;
  initCoroutine(&to_fill->cr, isPrime_coroutine, (void*)to_fill, stack, stack_len);
}

bool isPrime_next(isprime_iter_member_t* arg) {
  switchToCoroutine(&arg->cr);
  return arg->is_prime_out;
}

bool isPrime_empty(isprime_iter_member_t* arg) {
  return arg->cr.finished;
}

And the invocation in main:

  // Second example: Set up a coroutine to iterate over a sequence and determine whether elements are prime
  uint32_t nums_to_check[] = {500, 521, 2025, 2027};
  isprime_iter_member_t isprime_iter_member;
  isPrime_init(&isprime_iter_member, nums_to_check, ARRAY_LEN(nums_to_check), cr_stack, sizeof(cr_stack));
  while(!isPrime_empty(&isprime_iter_member)) {
    bool is_next_prime = isPrime_next(&isprime_iter_member);
    debug_print("MAIN: %s\r\n", is_next_prime?"YES!":"NO!");
  }

Which yields:

Test 2: Deferred processing iterator
Transposed 32 bytes of stack
CO: Is 500 prime?
MAIN: NO!
CO: Is 521 prime?
MAIN: YES!
CO: Is 2025 prime?
MAIN: NO!
CO: Is 2027 prime?
MAIN: YES!

Nice. The main routine and coroutine are exchanging data well and the coroutine is providing a friendly way to defer processing on a sequence, a common iterator use case. The last thing I want to demonstrate is nesting. If we’re writing functions adhering to the coroutine_func_t prototype, we want to be able to call each other and yield from more than one layer deep in the call stack. So as a trivial test, let’s build a coroutine which just runs the prime number checker twice:

void doPrimeCoroutineTwice(coroutine_t* cr_arg) {
  CR_START(cr_arg);
  debug_print("FIRST SUBCOROUTINE CALL\r\n");
  isPrime_coroutine(cr_arg);
  // Extra yield necessary because isPrime_coroutine expects CR_END to yield final value
  CR_YIELD(cr_arg);
  debug_print("SECOND SUBCOROUTINE CALL\r\n");
  isPrime_coroutine(cr_arg);
  CR_END(cr_arg);
}

void setupPrimeCoroutineTwice(isprime_iter_member_t* to_fill, uint32_t* num_to_check, size_t n_num_to_check, uint8_t* stack, size_t stack_len) {
  to_fill->num_in = num_to_check;
  to_fill->n_num = n_num_to_check;
  initCoroutine(&to_fill->cr, doPrimeCoroutineTwice, (void*)to_fill, stack, stack_len);
}

And the corresponding code in main:

  // Third example: Demonstrate yielding from within a nested coroutine
  setupPrimeCoroutineTwice(&isprime_iter_member, nums_to_check, ARRAY_LEN(nums_to_check), cr_stack, sizeof(cr_stack));
  while(!isPrime_empty(&isprime_iter_member)) {
    bool is_next_prime = isPrime_next(&isprime_iter_member);
    debug_print("MAIN: %s\r\n", is_next_prime?"YES!":"NO!");
  }

Which yields:

Test 3: Sub-coroutine
Transposed 24 bytes of stack
FIRST SUBCOROUTINE CALL
CO: Is 500 prime?
MAIN: NO!
CO: Is 521 prime?
MAIN: YES!
CO: Is 2025 prime?
MAIN: NO!
CO: Is 2027 prime?
MAIN: YES!
SECOND SUBCOROUTINE CALL
CO: Is 500 prime?
MAIN: NO!
CO: Is 521 prime?
MAIN: YES!
CO: Is 2025 prime?
MAIN: NO!
CO: Is 2027 prime?
MAIN: YES!

Great! We have a functional, if unpolished, stackful coroutine library for at least some flavor of STM32 setup in about 50 lines of actual code. Full file pasted below:

/***************************
 * It is the opinion of the author that nobody should use this code for anything.
 * The code is provided with no warranties of any kind and if you try to run it
 * you will probably end up mildly annoyed.
 *
 * That said, if you find it useful, I'd like to hear about it!
 * James Whong 2020/07/13
 * james@moosh.im
 */

#include <setjmp.h> // for setjmp, longjmp
#include <stddef.h>
#include <inttypes.h>
#include <stdbool.h>

#include "cmsis_gcc.h"
#include "debug.h" // Project specific for debug_print

#define ARRAY_LEN(x) (sizeof(x)/sizeof(x[0]))

typedef struct {
  bool init_done; // Has this coroutine been initialized?
  bool finished;  // Has the base function of the coroutine hit CR_END?
  void* arg;      // Used for data exchange between coroutine and main function
  jmp_buf invoking_env; // main context
  jmp_buf cr_env;       // coroutine context
} coroutine_t;

// A coroutine function must start with CR_START in the body and end with CR_END.
// It can call CR_YIELD as many times as it likes. It can also feed its couroutine_t*
// arg to subroutines and yield from within subroutines.
typedef void(*coroutine_func_t)(coroutine_t*);

void swapJmpBufs(jmp_buf* from, jmp_buf* to) {
  if(0 == setjmp(*from)) {
    longjmp(*to, 1);
  }
}

void switchToCoroutine(coroutine_t* cr) {
  swapJmpBufs(&cr->invoking_env, &cr->cr_env);
}

void yieldFromCoroutine(coroutine_t* cr) {
  swapJmpBufs(&cr->cr_env, &cr->invoking_env);
}

// CR_START does the initial jmp_buf setup only if this is the base coroutine (we check using the init_done member)
#define CR_START(cr) bool __is_base_cr = !cr->init_done; do{if(!cr->init_done && (0 == setjmp(cr->cr_env))) { return; }}while(0)
#define CR_YIELD(cr) yieldFromCoroutine(cr)
#define CR_END(cr) while(__is_base_cr){cr->finished = true; CR_YIELD(cr);}

// Initialize a coroutine
// After this function is run on a coroutine_t structure, feeding that structure to switchToCoroutine will invoke the
// function provided to the "func" arg on the stack buffer provided by the "stack" arg.
void initCoroutine(coroutine_t* cr_to_load, coroutine_func_t func, void* func_arg, uint8_t* stack, size_t stack_len) {
  // THIS IS NOT PORTABLE BETWEEN ARCHITECTURES
  register void* SP asm ("sp");
  void* local_sp = NULL;
  void* remote_sp = NULL;
  size_t stack_used = 0;
  uint32_t i = 0;

  cr_to_load->init_done = false;
  cr_to_load->finished = false;
  cr_to_load->arg = func_arg;

  bool interrupts_masked = __get_PRIMASK();

  // Call the coroutine function directly. It will yield immediately on the first run and load the cr_env jump_buf
  // BEGIN CRITICAL SECTION - WE ARE ABUSING THE STACK
  // If an IRQ were to strike here, we would hardfault down the road when restoring context
  __disable_irq();
  func(cr_to_load);
  cr_to_load->init_done = true;

  // Manipulate the jmp_buf
  static const uint32_t jmp_buf_sp_i = 8;

  // First figure out how much stack the function used before its initial call to setjmp
  local_sp = (void*)cr_to_load->cr_env[jmp_buf_sp_i];
  stack_used = (size_t)((uint32_t)SP - (uint32_t)local_sp);

  // Offsets within the jmp_buf for stack pointer and program counter derived from observation in debugger
  // Set the stack pointer to the user-provided stack buffer, expecting it to grow down
  remote_sp = (void*)(stack + stack_len - stack_used);
  cr_to_load->cr_env[jmp_buf_sp_i] = (int)remote_sp;

  // Copy the stack over to the new context
  for(i = 0; i < stack_used; i++) {
    ((uint8_t*)remote_sp)[i] = ((uint8_t*)local_sp)[i];
  }

  // END CRITICAL SECTION
  if(!interrupts_masked) {
    __enable_irq();
  }
  debug_print("Transposed %d bytes of stack\r\n", stack_used);

  return;
}

////////////////////////////////////////////////////////////////////
// BELOW HERE ARE ALL TEST FUNCTIONS FOR THE COROUTINE LIBRARY ABOVE
////////////////////////////////////////////////////////////////////

void myCoroutine(coroutine_t* cr_arg) {
  CR_START(cr_arg);
  while(1) {
    debug_print("CO: %u\r\n", *(uint32_t*)cr_arg->arg);
    CR_YIELD(cr_arg);
  }
  CR_END(cr_arg);
}

bool isPrime(uint32_t num) {
  // Simple brute force prime check
  for(uint32_t i = 2; i*i < num; i++) {
    if((num % i) == 0) {
      return false;
    }
  }
  return true;
}

typedef struct {
  bool is_prime_out;
  uint32_t* num_in;
  size_t n_num;
  coroutine_t cr;
} isprime_iter_member_t;

void isPrime_coroutine(coroutine_t* cr_arg) {
  CR_START(cr_arg);
  isprime_iter_member_t* arg = (isprime_iter_member_t*)cr_arg->arg;
  uint32_t i = 0;
  while(true) {
    uint32_t n = arg->num_in[i];
    arg->is_prime_out = isPrime(n);
    debug_print("CO: Is %u prime?\r\n", n);
    if(++i == arg->n_num) {
      break;
    } else {
      CR_YIELD(cr_arg);
    }
  }
  CR_END(cr_arg);
}

void isPrime_init(isprime_iter_member_t* to_fill, uint32_t* num_to_check, size_t n_num_to_check, uint8_t* stack, size_t stack_len) {
  to_fill->num_in = num_to_check;
  to_fill->n_num = n_num_to_check;
  initCoroutine(&to_fill->cr, isPrime_coroutine, (void*)to_fill, stack, stack_len);
}

bool isPrime_next(isprime_iter_member_t* arg) {
  switchToCoroutine(&arg->cr);
  return arg->is_prime_out;
}

bool isPrime_empty(isprime_iter_member_t* arg) {
  return arg->cr.finished;
}

void doPrimeCoroutineTwice(coroutine_t* cr_arg) {
  CR_START(cr_arg);
  debug_print("FIRST SUBCOROUTINE CALL\r\n");
  isPrime_coroutine(cr_arg);
  // Extra yield necessary because isPrime_coroutine expects CR_END to yield final value
  CR_YIELD(cr_arg);
  debug_print("SECOND SUBCOROUTINE CALL\r\n");
  isPrime_coroutine(cr_arg);
  CR_END(cr_arg);
}

void setupPrimeCoroutineTwice(isprime_iter_member_t* to_fill, uint32_t* num_to_check, size_t n_num_to_check, uint8_t* stack, size_t stack_len) {
  to_fill->num_in = num_to_check;
  to_fill->n_num = n_num_to_check;
  initCoroutine(&to_fill->cr, doPrimeCoroutineTwice, (void*)to_fill, stack, stack_len);
}

void runCoroutineDemo() {
  debug_print("\n\n\n\rHello World\r\n");

  uint8_t cr_stack[1024];

  // First example: Super basic demo, just transition back and forth to a coroutine a few times.
  debug_print("\n\rTest 1: Switch back and forth\r\n");
  coroutine_t cr_test;
  uint32_t cr_arg = 0;
  initCoroutine(&cr_test, myCoroutine, (void*)&cr_arg, cr_stack, sizeof(cr_stack));

  for(uint32_t i = 0; i < 4; i++) {
    // Feed arbitrary example data to the coroutine.
    debug_print("MAIN: Switching to coroutine\r\n");
    cr_arg = i*i;
    switchToCoroutine(&cr_test);
  }

  debug_print("\n\rTest 2: Deferred processing iterator\r\n");

  // Second example: Set up a coroutine to iterate over a sequence and determine whether elements are prime
  uint32_t nums_to_check[] = {500, 521, 2025, 2027};
  isprime_iter_member_t isprime_iter_member;
  isPrime_init(&isprime_iter_member, nums_to_check, ARRAY_LEN(nums_to_check), cr_stack, sizeof(cr_stack));
  while(!isPrime_empty(&isprime_iter_member)) {
    bool is_next_prime = isPrime_next(&isprime_iter_member);
    debug_print("MAIN: %s\r\n", is_next_prime?"YES!":"NO!");
  }

  debug_print("\n\rTest 3: Sub-coroutine\r\n");

  // Third example: Demonstrate yielding from within a nested coroutine
  setupPrimeCoroutineTwice(&isprime_iter_member, nums_to_check, ARRAY_LEN(nums_to_check), cr_stack, sizeof(cr_stack));
  while(!isPrime_empty(&isprime_iter_member)) {
    bool is_next_prime = isPrime_next(&isprime_iter_member);
    debug_print("MAIN: %s\r\n", is_next_prime?"YES!":"NO!");
  }

  debug_print("Done!");
  for(;;) { vTaskDelay(10); }
}

And the output from the full test:

Hello World

Test 1: Switch back and forth
Transposed 24 bytes of stack
MAIN: Switching to coroutine
CO: 0
MAIN: Switching to coroutine
CO: 1
MAIN: Switching to coroutine
CO: 4
MAIN: Switching to coroutine
CO: 9

Test 2: Deferred processing iterator
Transposed 32 bytes of stack
CO: Is 500 prime?
MAIN: NO!
CO: Is 521 prime?
MAIN: YES!
CO: Is 2025 prime?
MAIN: NO!
CO: Is 2027 prime?
MAIN: YES!

Test 3: Sub-coroutine
Transposed 24 bytes of stack
FIRST SUBCOROUTINE CALL
CO: Is 500 prime?
MAIN: NO!
CO: Is 521 prime?
MAIN: YES!
CO: Is 2025 prime?
MAIN: NO!
CO: Is 2027 prime?
MAIN: YES!
SECOND SUBCOROUTINE CALL
CO: Is 500 prime?
MAIN: NO!
CO: Is 521 prime?
MAIN: YES!
CO: Is 2025 prime?
MAIN: NO!
CO: Is 2027 prime?
MAIN: YES!
Done!

This could all be made more elegant by writing some custom assembly, but that’s not really the point of this post. Maybe next time, since the setjmp and longjmp routines are only a few assembly statements long and I can use them as a basis. Hope you found this interesting or useful! Thanks for reading!

Tags: ,

No comments yet.

Leave a Reply