Implementing state-machines: switch() vs. method pointers

Posted: Saturday, 2015-10-17 15:51 | Tags: Programming, C++

Finite-state machines are useful for many purposes: Say we have a long running computation that goes back and forth between different phases and is executed chunk-by-chunk in a main loop that does other things in between (e.g. updating some user interface). Yes, you scream "threads", but there are some situations where those are just not necessary or wanted for whatever reason.

Another common use is building a lexer/parser that reads some kind of input and interprets the incoming characters differently depending on "in what state we are" e.g. whether we just read a '' or not might influence how we interpret the current character.

Most state-machine code I've seen so far is built with giant switch()-blocks and recently I've come to wonder whether this is always the way to go.

First, lets implement a simple state machine using such a switch-statement so you know what I am talking about.

#include <iostream>

struct SwitchStateMachine {

  SwitchStateMachine() : current_state(0) { }

  void run_state() {
    switch(current_state) {
      case 0:
        std::cout << "s0\n";
        current_state = 1;
        break;

      case 1:
        std::cout << "s1\n";
        current_state = 2;
        break;

      case 2:
        std::cout << "s2\n";
        current_state = 3;
        break;

      case 3:
        std::cout << "s3\n";
        current_state = 4;
        break;
    }
  }
  int current_state;
};

int main() {
  SwitchStateMachine ss;
  ss.run_state();
  ss.run_state();
  ss.run_state();
  ss.run_state();
  return 0;
}

Whenever run_state() is called, it will look up the current state (stored as integer value in current_state) and using a switch-statement on this value, decide what action to perform. After each such action the current_state variable is updated to program the state for the next call.

None of this is new, you probably have seen and/or implemented it a bazillion times and know its uses.

Now for the fun part, lets compile this with a reasonable amount of optimization turned on:

g++ -O2 ./state_machine_impl_switch.cpp; objdump -d ./a.out

The main() method holds little surprise, and visibly reflects the 4 calls to run_state():

00000000004006a0 <main>:
  4006a0:   48 83 ec 18             sub    $0x18,%rsp
  4006a4:   48 89 e7                mov    %rsp,%rdi
  4006a7:   c7 04 24 00 00 00 00    movl   $0x0,(%rsp)
  4006ae:   e8 5d 01 00 00          callq  400810 <_ZN18SwitchStateMachine9run_stateEv>
  4006b3:   48 89 e7                mov    %rsp,%rdi
  4006b6:   e8 55 01 00 00          callq  400810 <_ZN18SwitchStateMachine9run_stateEv>
  4006bb:   48 89 e7                mov    %rsp,%rdi
  4006be:   e8 4d 01 00 00          callq  400810 <_ZN18SwitchStateMachine9run_stateEv>
  4006c3:   48 89 e7                mov    %rsp,%rdi
  4006c6:   e8 45 01 00 00          callq  400810 <_ZN18SwitchStateMachine9run_stateEv>
  4006cb:   31 c0                   xor    %eax,%eax
  4006cd:   48 83 c4 18             add    $0x18,%rsp
  4006d1:   c3                      retq
  4006d2:   0f 1f 40 00             nopl   0x0(%rax)
  4006d6:   66 2e 0f 1f 84 00 00    nopw   %cs:0x0(%rax,%rax,1)
  4006dd:   00 00 00

If you ever wondered what a compiler does with a switch() statement, look at this: run_state uses a number of jumps to decide what piece of code to execute:

0000000000400810 <_ZN18SwitchStateMachine9run_stateEv>:
  400810:   53                      push   %rbx
  400811:   8b 07                   mov    (%rdi),%eax
  400813:   48 89 fb                mov    %rdi,%rbx
  400816:   83 f8 01                cmp    $0x1,%eax
  400819:   74 75                   je     400890 <_ZN18SwitchStateMachine9run_stateEv+0x80>
  40081b:   7e 4b                   jle    400868 <_ZN18SwitchStateMachine9run_stateEv+0x58>
  40081d:   83 f8 02                cmp    $0x2,%eax
  400820:   74 26                   je     400848 <_ZN18SwitchStateMachine9run_stateEv+0x38>
  400822:   83 f8 03                cmp    $0x3,%eax
  400825:   75 1a                   jne    400841 <_ZN18SwitchStateMachine9run_stateEv+0x31>
  400827:   ba 03 00 00 00          mov    $0x3,%edx
  40082c:   be 40 09 40 00          mov    $0x400940,%esi
  400831:   bf 40 0d 60 00          mov    $0x600d40,%edi
  400836:   e8 55 fe ff ff          callq  400690 <_ZSt16__ostream_insertIcSt11char_traitsIcEERSt13basic_ostreamIT_T0_ES6_PKS3_l@plt>
  40083b:   c7 03 03 00 00 00       movl   $0x3,(%rbx)
  400841:   5b                      pop    %rbx
  400842:   c3                      retq
  400843:   0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
  400848:   ba 03 00 00 00          mov    $0x3,%edx
  40084d:   be 3c 09 40 00          mov    $0x40093c,%esi
  400852:   bf 40 0d 60 00          mov    $0x600d40,%edi
  400857:   e8 34 fe ff ff          callq  400690 <_ZSt16__ostream_insertIcSt11char_traitsIcEERSt13basic_ostreamIT_T0_ES6_PKS3_l@plt>
  40085c:   c7 03 03 00 00 00       movl   $0x3,(%rbx)
  400862:   5b                      pop    %rbx
  400863:   c3                      retq
  400864:   0f 1f 40 00             nopl   0x0(%rax)
  400868:   85 c0                   test   %eax,%eax
  40086a:   75 d5                   jne    400841 <_ZN18SwitchStateMachine9run_stateEv+0x31>
  40086c:   ba 03 00 00 00          mov    $0x3,%edx
  400871:   be 34 09 40 00          mov    $0x400934,%esi
  400876:   bf 40 0d 60 00          mov    $0x600d40,%edi
  40087b:   e8 10 fe ff ff          callq  400690 <_ZSt16__ostream_insertIcSt11char_traitsIcEERSt13basic_ostreamIT_T0_ES6_PKS3_l@plt>
  400880:   c7 03 01 00 00 00       movl   $0x1,(%rbx)
  400886:   5b                      pop    %rbx
  400887:   c3                      retq
  400888:   0f 1f 84 00 00 00 00    nopl   0x0(%rax,%rax,1)
  40088f:   00
  400890:   ba 03 00 00 00          mov    $0x3,%edx
  400895:   be 38 09 40 00          mov    $0x400938,%esi
  40089a:   bf 40 0d 60 00          mov    $0x600d40,%edi
  40089f:   e8 ec fd ff ff          callq  400690 <_ZSt16__ostream_insertIcSt11char_traitsIcEERSt13basic_ostreamIT_T0_ES6_PKS3_l@plt>
  4008a4:   c7 03 02 00 00 00       movl   $0x2,(%rbx)
  4008aa:   5b                      pop    %rbx
  4008ab:   c3                      retq
  4008ac:   0f 1f 40 00             nopl   0x0(%rax)

This roughly translates to this (pseudo code):

run_state {
  compare $state with 0x1
  if equal: goto state 1
  elif lower or equal: goto state 0

  compare $state with 0x2
  if equal: goto state 2

  compare $state with 0x3
  if equal: goto state 3

  state 0:
    compare state with 0
    if not equal: return
    // do state 0

  state 1:
    // do state 1

  state 2:
    // do state 2

  state 3:
    // do state 3
}

Basically what it does is a number of comparisons to figure out where to actually jump to. In this case they pretty much work like a concatenation of if-statements, resulting in a number of comparisons that is linear to the number of states.

However depending on the number of cases, and the distribution of the tested values, I have also seen g++ build decision trees out of compare/jump statements such that only a logarithmic number of statements w.r.t. number of cases need be executed. The logic would go something like this:

switch(v) {
  if v < 8 {
    if v < 4 {
      if v < 2 {
        if v == 0 { ... }
        else { ... } // v == 1
      }
      else {
        if v == 2 { ... }
        else { ... } // v == 3
      }
    }
    else {
      // ...
    }
  }
  else {
   // ...
  }
}

Still, if we have a lot of cases/states, this seems like an awful lot of comparisons to be done just to figure out what we want to do (which, in a sense, we already knew at compile time!).

Here is a different approach: Instead of encoding the piece of code we want to execute in an integer that has to be translated to a code address by many jumps, why don't we just store the address of the code we want to execute directly?

Luckily C and C++ both offer function pointers :) In this case I will use pointer-to-member-functions. That is, we restrict ourselves to only ever call methods of a certain class and agree that we will have to provide the implicit this pointer for the call.

#include <iostream>

struct FPtrStateMachine {

  typedef void (FPtrStateMachine::*State)();

  FPtrStateMachine() : current_state(&FPtrStateMachine::state0) {
  }

  void run_state() {
    (this->*current_state)();
  }

  void state0() {
    std::cout << "s0\n";
    current_state = &FPtrStateMachine::state1;
  }

  void state1() {
    std::cout << "s1\n";
    current_state = &FPtrStateMachine::state2;
  }

  void state2() {
    std::cout << "s2\n";
    current_state = &FPtrStateMachine::state3;
  }

  void state3() {
    std::cout << "s3\n";
    current_state = &FPtrStateMachine::state3;
  }

  State current_state;
};

int main() {
  FPtrStateMachine ss;

  ss.run_state();
  ss.run_state();
  ss.run_state();
  ss.run_state();

  return 0;
}

Even if you are still somewhat weary of the syntax function pointers (and the probably somewhat hardar-to-like syntax for pointers to member functions), you are probably agreeing that this code overall does look quite readable: Every state has its own member function, and instead of integers or enum values, we provide the address of one of those member functions to encode what we want to do next.

Now lets compile and disassemble:

g++ -O2 ./state_machine_impl_fptr.cpp; objdump -d ./a.out

First thing we notice, main() has grown a little longer, that is because the compiler decided that run_state() is simple enough to be inlined:

00000000004006a0 <main>:
  4006a0:     48 83 ec 18             sub    $0x18,%rsp
  4006a4:     48 89 e7                mov    %rsp,%rdi
  4006a7:     48 c7 04 24 b0 08 40    movq   $0x4008b0,(%rsp)
  4006ae:     00
  4006af:     48 c7 44 24 08 00 00    movq   $0x0,0x8(%rsp)
  4006b6:     00 00
  4006b8:     e8 f3 01 00 00          callq  4008b0 <_ZN16FPtrStateMachine6state0Ev>
  4006bd:     48 8b 04 24             mov    (%rsp),%rax
  4006c1:     48 8b 7c 24 08          mov    0x8(%rsp),%rdi
  4006c6:     a8 01                   test   $0x1,%al
  4006c8:     74 09                   je     4006d3 <main+0x33>
  4006ca:     48 8b 14 3c             mov    (%rsp,%rdi,1),%rdx
  4006ce:     48 8b 44 02 ff          mov    -0x1(%rdx,%rax,1),%rax
  4006d3:     48 01 e7                add    %rsp,%rdi
  4006d6:     ff d0                   callq  *%rax
  4006d8:     48 8b 04 24             mov    (%rsp),%rax
  4006dc:     48 8b 7c 24 08          mov    0x8(%rsp),%rdi
  4006e1:     a8 01                   test   $0x1,%al
  4006e3:     74 09                   je     4006ee <main+0x4e>
  4006e5:     48 8b 14 3c             mov    (%rsp,%rdi,1),%rdx
  4006e9:     48 8b 44 02 ff          mov    -0x1(%rdx,%rax,1),%rax
  4006ee:     48 01 e7                add    %rsp,%rdi
  4006f1:     ff d0                   callq  *%rax
  4006f3:     48 8b 04 24             mov    (%rsp),%rax
  4006f7:     48 8b 7c 24 08          mov    0x8(%rsp),%rdi
  4006fc:     a8 01                   test   $0x1,%al
  4006fe:     74 09                   je     400709 <main+0x69>
  400700:     48 8b 14 3c             mov    (%rsp,%rdi,1),%rdx
  400704:     48 8b 44 02 ff          mov    -0x1(%rdx,%rax,1),%rax
  400709:     48 01 e7                add    %rsp,%rdi
  40070c:     ff d0                   callq  *%rax
  40070e:     31 c0                   xor    %eax,%eax
  400710:     48 83 c4 18             add    $0x18,%rsp
  400714:     c3                      retq
  400715:     90                      nop
  400716:     66 2e 0f 1f 84 00 00    nopw   %cs:0x0(%rax,%rax,1)
  40071d:     00 00 00

This is what the individual states look like:

0000000000400850 <_ZN16FPtrStateMachine6state3Ev>:
  400850:     53                      push   %rbx
  400851:     ba 03 00 00 00          mov    $0x3,%edx
  400856:     48 89 fb                mov    %rdi,%rbx
  400859:     be 94 09 40 00          mov    $0x400994,%esi
  40085e:     bf 20 0e 60 00          mov    $0x600e20,%edi
  400863:     e8 28 fe ff ff          callq  400690 <_ZSt16__ostream_insertIcSt11char_traitsIcEERSt13basic_ostreamIT_T0_ES6_PKS3_l@plt>
  400868:     48 c7 03 50 08 40 00    movq   $0x400850,(%rbx)
  40086f:     48 c7 43 08 00 00 00    movq   $0x0,0x8(%rbx)
  400876:     00
  400877:     5b                      pop    %rbx
  400878:     c3                      retq
  400879:     0f 1f 80 00 00 00 00    nopl   0x0(%rax)

0000000000400880 <_ZN16FPtrStateMachine6state2Ev>:
  400880:     53                      push   %rbx
  400881:     ba 03 00 00 00          mov    $0x3,%edx
  400886:     48 89 fb                mov    %rdi,%rbx
  400889:     be 98 09 40 00          mov    $0x400998,%esi
  40088e:     bf 20 0e 60 00          mov    $0x600e20,%edi
  400893:     e8 f8 fd ff ff          callq  400690 <_ZSt16__ostream_insertIcSt11char_traitsIcEERSt13basic_ostreamIT_T0_ES6_PKS3_l@plt>
  400898:     48 c7 03 50 08 40 00    movq   $0x400850,(%rbx)
  40089f:     48 c7 43 08 00 00 00    movq   $0x0,0x8(%rbx)
  4008a6:     00
  4008a7:     5b                      pop    %rbx
  4008a8:     c3                      retq
  4008a9:     0f 1f 80 00 00 00 00    nopl   0x0(%rax)

00000000004008b0 <_ZN16FPtrStateMachine6state0Ev>:
  4008b0:     53                      push   %rbx
  4008b1:     ba 03 00 00 00          mov    $0x3,%edx
  4008b6:     48 89 fb                mov    %rdi,%rbx
  4008b9:     be 9c 09 40 00          mov    $0x40099c,%esi
  4008be:     bf 20 0e 60 00          mov    $0x600e20,%edi
  4008c3:     e8 c8 fd ff ff          callq  400690 <_ZSt16__ostream_insertIcSt11char_traitsIcEERSt13basic_ostreamIT_T0_ES6_PKS3_l@plt>
  4008c8:     48 c7 03 e0 08 40 00    movq   $0x4008e0,(%rbx)
  4008cf:     48 c7 43 08 00 00 00    movq   $0x0,0x8(%rbx)
  4008d6:     00
  4008d7:     5b                      pop    %rbx
  4008d8:     c3                      retq
  4008d9:     0f 1f 80 00 00 00 00    nopl   0x0(%rax)

00000000004008e0 <_ZN16FPtrStateMachine6state1Ev>:
  4008e0:     53                      push   %rbx
  4008e1:     ba 03 00 00 00          mov    $0x3,%edx
  4008e6:     48 89 fb                mov    %rdi,%rbx
  4008e9:     be a0 09 40 00          mov    $0x4009a0,%esi
  4008ee:     bf 20 0e 60 00          mov    $0x600e20,%edi
  4008f3:     e8 98 fd ff ff          callq  400690 <_ZSt16__ostream_insertIcSt11char_traitsIcEERSt13basic_ostreamIT_T0_ES6_PKS3_l@plt>
  4008f8:     48 c7 03 80 08 40 00    movq   $0x400880,(%rbx)
  4008ff:     48 c7 43 08 00 00 00    movq   $0x0,0x8(%rbx)
  400906:     00
  400907:     5b                      pop    %rbx
  400908:     c3                      retq
  400909:     0f 1f 80 00 00 00 00    nopl   0x0(%rax)

Due to the inlining, we are still at one 'call(q)' instruction per state call, so nothing really changed there. This call is now indirect (which, I assume, could make it more expensive), however on the other hand we save some compare/jump instructions. Especially I would like to stress that the cost for the callq will not change if we add more states, but the cost of a switch() statement would increase at least logarithmically.

At this point I would have loved to present some numbers on performance to proove to you that the pointer approach is really better. It turns out however, that g++ with -O2 is pretty smart already and when I write a loop with many calls to run_state it will analyze it and find out that what I'm actually doing is printing s0, s1, s2 and then printing a lot of s3's. It thus seems to remove almost all of the state-machine logic and actually writes the print instructions and a simple loop into the binary.

Thats pretty awesome actually, but in any real application that wouldn't work: If you could have done everything in one, stateless piece of code, you wouldnt have written a state machine in the first place!

I might go and write a small lexer or similar with both approaches and compare their performance against each other.