Compiler IV: the runtime environment

This is the last installment in the compiler series. I will be reviewing how NGEDIT implements the stack-based machine that executes the bytecodes generated by the compiler, and actually breaths life into the code written in the script language.

The first design decision is how to store the values that the runtime environment manipulates. For one, we have the variables that may hold values – and object members as well, which are just another kind of variable. As the other important group, we have the values that are stored in the stack.

We know that variables may hold a value of any type, which in the case of NGS means:

  • char: single byte signed quantities as in regular C/C++
  • int: native machine integer, that is, 32-bit, signed quantities
  • unsigned: same as the previous one, but unsigned
  • float: regular 32-bit floating point number
  • string: byte-per-character strings
  • object: reference to an NGS object
  • function: reference to an NGS function
  • built-in function: reference to a builtin NGEDIT provided function
  • nil: both a type and a value of its own, the “non-value”

There were some design decisions in coming up with this list. Even if the main editor interface is provided via objects, I decided not to provide a special type for built-in objects. Instead, I chose to use regular script objects with member functions initialized to built-in functions, and some “opaque” data to access the internal application data. This makes things simpler and fits nicely in the overall architecture.

I also decided not to provide a boolean type. Bytecodes that generate a logical boolean, such as comparisons, push either integer 1 or integer 0. Instructions that test a value (conditional jumps), take nil and any numeric zero as false, and anything else as true. As an important case, an empty string evaluates as true (you have to check its length if you want to check whether a string is non-empty). When adding values involving strings, which is implemented as string concatenation, nil values are promoted to the empty string, so string manipulation is quite comfortable.

Another decision that I took is to initially support only byte-based encoded strings. That means Unicode UTF-16 (almost the same as UCS-2, or what Windows simply calls “Unicode”) is not supported as a native NGS value. Even if the editor is fully-enabled to handling all sorts of encodings, I didn’t want to overcomplicate the internal script system. Please take into account that this does not mean the script cannot manipulate text in other encodings – this is done via NGEDIT’s built-in functions, it only means NGS’s native variables cannot hold this type of text. I’ll probably add a second string type to support unicode strings in the future.

How does one implement these values? Via a good’ole union inside a C++ class, which holds separately the “type” of the value:

enum EAtomType
{
  SA_NIL,
  SA_CHAR,
  SA_INT,
  SA_UNSIGNED,
  SA_FLOAT,
  SA_BUILTIN_FN,  // Built-in global functions
                // (::MesageBox(), ...)
  SA_FUNCTION,    // Identifies a script function
  SA_STRING,
  SA_OBJECT,
};

class TAtom
{
  public:
    TAtom () { m_Type = SA_NIL; }
    ...
    void SetFloat( float fVal )
    {
      m_Type = SA_FLOAT;
      m_Data.f = fVal;
    }
    ...
    EAtomType   GetType     () const
      { return m_Type; }
    char        GetVal_char () const
      { ASSERT(m_Type == SA_CHAR); return m_Data.c; }
    ...

  private:
    EAtomType m_Type;
    union
    {
      char     c;
      int      i;
      unsigned u;
      float    f;
      char     *psz;
      SObject  *pObj;
    } m_Data;
};

You get the idea. Each value then occupies exactly 8 bytes. 3 of them are wasted, but that’s the price of alignment.

The operand stack, then, is simply an array of TAtom values together with a pointer to the TOS element. Variables are just TAtom’s. The members of an object or the elements in an array are just TAtom’s.

By the way, pardon my French in borrowing the ‘atom’ and ‘nil’ term from lisp terminology. ‘Atom’ was quite clear to me, and I preferred ‘nil’ over ‘null’, given that it represents a non-value rather than a null pointer.

As I have already commented, I will probably evolve NGS towards Javascript so some of this terminology will probably change.

So, now that we know how to structure the stack and the array pool, we just need to start interpreting the bytecodes. From the rest of NGEDIT, a script function is invoked with a call similar to this:

  TScrExec exec;
  TScript *ps = FindScript(...);
  TAtom arg, retval;

  arg.SetInt(key.Code);
  exec.AddArg( &arg );

  arg.SetInt(key.ModifKeyState.Val);
  exec.AddArg( &arg );

  exec.InvokeFunction(
    ps,
    ps->FindFunction("IProcessChar"), NULL, &retval
  );

  return CastInt(retval) ? true : false;

TScrExec is the class that executes a script function by interpreting its bytecode. Right now, scripts are non-concurrent, and as such the function runs until it returns or it generates a “trap”. I will be adding a concurrent script engine which will help in using the scripts more efficiently and, importantly, in being able to debug them by executing step-by-step.

Leaving apart the interfacing code to pass in the arguments and get the result back, the core function is InvokeFunction. It is the function that actually performs the work. How does interpreting the bytecodes work? Quite simply, we need to keep a pointer into the bytecode stream pointing to the next instruction to execute – we can perform a large switch statement on the bytecode, which branches into the code to execute each instruction. Each instruction is so simple that they are quite straightforward to execute. Push, pop, dup, etc… Quite a lot of error-checking goes in here, as we have to take care that the stack doesn’t underflow, invalid codes are not present in the bytecode stream, operations are not performed on values of the wrong type, etc:

TRet TScrExec::InvokeFunction(
  TScript *ps, unsigned uFn,
  const TAtom *pThisVal, TAtom *pRetVal
)
{
  // Push 'this' and args on the stack
  ...

  // Prepare the run context
  ...
  m_pRC->uPC = 0; // Next instruction pointer
  ...

  // Execute instructions
  while (!IsFinished())
  {
    ExecInstr();
  }

  ...
}

void TScrExec::ExecInstr()
{
  // Trap on invalid address
  if (!IsValidPC())
  {
    Trap( TRAP_INVALID_ADDRESS );
    return;
  }
  ...

  // Read the opcode
  byte opcode = pCode->GetByte( m_pRC->uPC );

  // Immediate handling & decoding
  unsigned immediate = 0;
  unsigned imm_bytes = GetByteCodeImmBytes ( opcode );
  unsigned instr_bytes = imm_bytes + 1;

  if (imm_bytes)
  {
    ...

    // Read immediate
    if (imm_bytes == 1)
      immediate = pCode->GetByte( m_pRC->uPC + 1 );
    else if (imm_bytes == 2)
      immediate = pCode->GetWord( m_pRC->uPC + 1 );
  }

  // Advance PC to next instruction
  m_pRC->uPC += instr_bytes;

  // Instruction execution
  switch  (opcode)
  {
    case BC_NULL:
      Trap( TRAP_USER );
      break;

    case BC_PUSH_CONST:
    {
      TAtom atom;
      ps->GetConstant(&atom, immediate);
      m_Stack.Add(atom);
      break;
    }

    ... Really long switch statement ...

  }
}

There is one thing we haven’t considered here: functions in the script will be calling other functions. Values to those functions are passed on the stack, so it’s just a matter of pushing the arguments before calling the function. The return value is returned on the stack as well, so that part’s simple. But there is the need of storing the return address in order to know where to jump when we hit on the RET bytecode in the called function.

Apart from the need for the return address, we need to reserve space for local variables. Sure, we could use the regular stack for them, but that would make code generation a bit more complex. What we do is we have a separate “run-context” stack. A run-context holds the return address, the local variables, and some other useful info. When we find a CALL instruction, a new run-context is generated and pushed on what I call the “call-stack”. The variable m_pRC shown above always points to the top of the run-context stack.

There is one last thing we haven’t talked about: the calling convention. Even if we have already talked about the calling function pushing the args on the stack, and the called function popping them and leaving the result on the stack, there are some details that have to be taken care of. For once, we need to decide in what order the arguments are pushed: is the leftmost one pushed first or last? Where does the implicit “this” pointer get pushed?

Given that NGS handles values, and functions are called via pushing references to them on the stack (maybe even taken from members of objects), there is in general no easy mechanism to check at compile time whether the number of arguments passed matches the number expected by the called function. We could check it at runtime and fail if the number is not right. I preferred to implement a calling mechanism that pushes the actual number of arguments passed on the stack as well, after pushing all the regular arguments. The called function, then, can use that number to know how many arguments have been actually passed and pop them in local variables. If more arguments have been passed, the extra ones are ignored. If less than the expected number have been passed, their corresponding local variables are set to nil. The ENTER instruction, which is the first instruction in every function, does exactly this.

This is the last part of our review of NGEDIT’s scripting architecture. I hope it has been worthy as an overview of how a full scripting systems is developed. I will cover some other compiling issues in the future, but I will also be covering other general editor areas such as text memory management, etc.

Feel free to comment or ask whatever you are interested on.

Leave a Reply