Compiler Implementation (Part 1)

Introduction

I remember reading in an Andrew Appel book1:

The learning experience of coming to the right abstraction by several iterations of think–implement–redesign is one that should not be missed.

And oh boy have I done several iterations…

The problem with most compiler and language design resources is that they force a relatively absurd learning path, that is:

\(\text{Lexing} \rightarrow \text{Parsing} \rightarrow \text{Semantic Analysis} \rightarrow \text{IR} \rightarrow \text{Codegen}\)

This might look like a non-issue at first glance, and in truth there is nothing inherently wrong with this top-down approach. But it forces you into a way of thinking that is only good for reading, not for actually doing. Let me put it this way: it’s okay to read about the structure of a compiler in the order of the standard compilation stages, but would you really want to implement one in the same order? Before creating the parser, shouldn’t you have a concrete idea of what the AST will look like to begin with? Is your AST design even viable or fast enough for your use case? What if you don’t care about a specific compilation stage?

I would argue that a compiler pipeline can be so ridiculously enormous, with so many moving parts, that it’s downright impossible to design it upfront. In this series I want to present a simpler (in my opinion) approach: less top-down, more middle/bottom-up-ish.

We are going to essentially start off with the second-to-last IR (the last being LLVM IR) of a yet-unknown list of previous intermediate representations. The design goals of this IR are that it must be self-contained and extensible, and that it assumes no previous IRs exist, though I want to keep my options open. This will allow us to handle all the dirty work of explaining the LLVM API and code generation before moving on to the language frontend, which is arguably the easy part.

My goal is to describe the problems I have faced, and am currently facing, on my hobbyist compiler project. Writing things down helps me retain the things I learn, and maybe you can also benefit from reading about my mistakes. My language of choice, at least in the beginning, is going to be C++ because of its LLVM support. I have a love-hate relationship with this language, and I don’t consider myself proficient at it by any means, so any feedback and possible corrections to mistakes I may have made are very much appreciated.

For now we are going to split the blog-series into 4 parts:

  1. Defining the AST and Visitors. (you are here)
  2. Visitors for emitting LLVM IR.
  3. More AST’s and Visitor extensions.
  4. Grammars and Parsing (LL, LR, possibly Earley).

All of this being said, let’s create the Abstract Syntax Tree!

Abstract Syntax Tree

Technically we do not need yet another AST to use the LLVM API responsible for generating IR. However, removing it from the discussion would hide a structural component of our language’s design: the point where LLVM’s external interface connects to our compiler’s internal structure. Even though there is nothing stopping us from emitting IR directly (e.g., from parsing, or just from plain code), most language implementations introduce an AST to separate front-end concerns (syntax, semantics, type checking) from back-end concerns (optimization, codegen). This separation allows us to “ignore” certain stages of compilation, because right now, for this blog post, we only care about creating a small backend.

There are many documented ways to represent abstract syntax trees in C++. For this overview, the focus is on the two simplest and most common approaches (though we will end up using only the first):

  1. ASTs via subtype polymorphism.
  2. ASTs via compile time polymorphism.

Effectively, all of these approaches are just ways of expressing “this node is one of A, B, C, D”. For example, an Expression node in an AST can be a Literal, something like a plain number (e.g., 23), or a BinaryExpression (e.g., 23 + x) that contains two other nodes and an operator.

To me, choosing one of the above was the biggest hurdle, and it was the inspiration for this blog post. With the goal of helping you avoid the mistakes I have made, I will try to convince you to choose the first approach.

The latter approach, compile-time polymorphism, usually means representing the AST as a tagged union (std::variant or equivalent). Each possible node becomes an alternative in the variant, and operations on the tree are expressed through std::visit or a layer of function overloading. While this model provides static exhaustiveness checking and often leads to very explicit code generation (you always know what you are transforming and what you are transforming into), it also imposes constraints that become noticeable as soon as the AST evolves.

First of all, C++ does not offer a compact, ergonomic form of pattern matching, and even though std::variant emulates it, there are no first-class sum types to cleanly represent a tree. The boilerplate grows extremely quickly as the number of node types increases, and making the slightest change of behavior requires a nontrivial amount of refactoring. This is the main reason you see people recommending the ML family of languages, or maybe Rust or Haskell, for compiler design. Sum types and pattern matching are extremely useful for expressing the kinds of data structures and transformations that compilers desperately need, yet C++ does not provide an equivalent mechanism.
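
To make the trade-off concrete, here is a minimal sketch of the variant approach. All names here (Lit, Var, Add, eval) are throwaway illustrations, not part of the AST we build below. Note how every operation must enumerate every alternative:

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <memory>
#include <string>
#include <type_traits>
#include <variant>

struct Lit;
struct Var;
struct Add;
using Expr = std::variant<Lit, Var, Add>;

struct Lit { int64_t value; };
struct Var { std::string name; };
struct Add { std::unique_ptr<Expr> lhs, rhs; };  // heap indirection breaks the recursion

// Every operation repeats the full list of alternatives; adding a node
// type means revisiting each visitor like this one.
int64_t eval(const Expr& e, const std::map<std::string, int64_t>& env) {
  return std::visit(
      [&](const auto& n) -> int64_t {
        using T = std::decay_t<decltype(n)>;
        if constexpr (std::is_same_v<T, Lit>) return n.value;
        else if constexpr (std::is_same_v<T, Var>) return env.at(n.name);
        else return eval(*n.lhs, env) + eval(*n.rhs, env);
      },
      e);
}
```

Evaluating 23 + x with x bound to 19 then yields 42, and the compiler checks exhaustiveness for us; the price is that every new node type touches every such visitor.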

A second issue is how tightly the variant approach couples front-end design to complete knowledge of the AST. The entire std::variant definition must be known upfront, and every consumer of the AST must be recompiled when new node types are added. This is manageable for compilers whose syntax is largely fixed, but less so for exploratory or evolving language designs. Changing the shape of the tree forces a global update: every std::visit and every “match arm” must be revisited.

In contrast, subtype polymorphism shifts the cost profile. I did not mention it before, but the variant approach has a performance advantage: its dispatch avoids virtual function calls (typically reducing to a jump table on the variant index), whereas in this model dispatch occurs at runtime through virtual calls. That said, the cost is rarely significant for our purposes; we care about extending our AST cleanly to learn about LLVM, not about maximizing performance. Here we take advantage of C++’s object-oriented facilities by defining an abstract base class, effectively a type that declares virtual functions intended to be overridden by concrete node types. Every subclass inherits the common interface and any shared members, one of which is a virtual method responsible for performing a compiler operation such as type checking, constant folding or, the one we care about, IR generation.

The Type Node

Let’s be more specific. We declare an abstract class Type.

class Type {
 public:
  Type() = default;
  virtual ~Type() = default; 
  virtual void accept(TypeVisitor&) const = 0;
};

Some note-worthy points:

  • Type is a base class that is going to be used through a pointer to a derived class object. This means we need to declare the destructor virtual, because otherwise deleting a derived object through a Type* is undefined behavior!
  • void accept(TypeVisitor&) const is the compiler operation we talked about above. We will implement it later, but note that it’s a read-only operation.
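
A quick illustration of the first bullet, under throwaway names (Base and Derived are not part of our AST): because Base declares a virtual destructor, deleting a Derived through a Base pointer correctly runs ~Derived; remove the virtual keyword and the very same delete becomes undefined behavior.

```cpp
#include <cassert>
#include <memory>

struct Base {
  virtual ~Base() = default;  // without `virtual`, the reset() below is UB
};

struct Derived : Base {
  bool* destroyed;
  explicit Derived(bool* flag) : destroyed(flag) {}
  ~Derived() override { *destroyed = true; }
};

bool derived_destructor_ran() {
  bool flag = false;
  std::unique_ptr<Base> p = std::make_unique<Derived>(&flag);
  p.reset();    // deletes through Base*; virtual dispatch reaches ~Derived
  return flag;  // true: the derived destructor actually ran
}
```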

So now we have a concept of what a Type is. Let’s create the concrete types of our language!

A PrimitiveType is a Type that holds a PrimTy:

class PrimitiveType : public Type {
 public:
  PrimitiveType(PrimTy);
  ~PrimitiveType() override = default;
  PrimTy getBase() const;
  void accept(TypeVisitor& v) const override;
private:
  PrimTy base_;
};

where PrimTy is:

enum class PrimTy {
 i64,
 u64,
 Bool,
 Void,
};

A Struct is a Type that holds a name and a number of Types determined at runtime:

class Struct : public Type {
 public:
  Struct(std::string,std::vector<std::unique_ptr<Type>>);
  ~Struct() override = default;
  void accept(TypeVisitor& v) const override;
  const std::vector<std::unique_ptr<Type>>& getInner() const;
  const std::string& getName() const;
 private:
  std::vector<std::unique_ptr<Type>> inner_;
  std::string name_;
};

An Array is a Type that holds a Type and a length:

class Array : public Type {
 public:
  Array(std::unique_ptr<Type>,std::size_t);
  ~Array() override = default;
  void accept(TypeVisitor& v) const override;
  std::size_t getLen() const;
  const Type& getInner() const;
 private:
  std::size_t len_;
  std::unique_ptr<Type> inner_;
};

A Vector (in the LLVM sense: a fixed-length SIMD vector, not a std::vector) is also a Type that holds a Type and a length:

class Vector : public Type {
 public:
  Vector(std::unique_ptr<Type>,std::size_t);
  ~Vector() override = default;
  void accept(TypeVisitor& v) const override;
  std::size_t getLen() const;
  const Type& getInner() const;
 private:
  std::size_t len_;
  std::unique_ptr<Type> inner_;
};

A Pointer is a Type that holds a Type:

class Pointer : public Type {
 public:
  Pointer(std::unique_ptr<Type>);
  ~Pointer() override = default;
  void accept(TypeVisitor& v) const override;
  const Type& getInner() const;
 private:
  std::unique_ptr<Type> inner_;
};

The Expr Node

We have just defined one layer of the AST. Continuing the bottom-up approach, next comes the expression node.

An Expr node is just a node that holds a Type:

class Expr {
 public:
  Expr() = default;
  Expr(std::unique_ptr<type::Type> ty);
  virtual ~Expr() = default;
  virtual void accept(ExprVisitor&) const = 0;
  const type::Type& getType() const;
 private:
  std::unique_ptr<type::Type> type_;
};

An ExprId is an Expr that holds a name:

class ExprId : public Expr {
 public:
  ExprId(std::string name,std::unique_ptr<type::Type> ty);
  ~ExprId() override = default;
  void accept(ExprVisitor& v) const override;
  const std::string& getName() const;
 private:
  std::string name_;
};

A BinaryExpr is an Expr that holds two Exprs and a BinaryOp:

class BinaryExpr : public Expr {
 public:
   BinaryExpr(op::BinaryOp, std::unique_ptr<Expr>, std::unique_ptr<Expr>,
              std::unique_ptr<type::Type> ty);
  ~BinaryExpr() override = default;
  void accept(ExprVisitor& v) const override;
  op::BinaryOp getOp() const;
  Expr& getLhs() const;
  Expr& getRhs() const;
 private:
  op::BinaryOp op_;
  std::unique_ptr<Expr> lhs_;
  std::unique_ptr<Expr> rhs_;
};

where BinaryOp is:

enum class BinaryOp {
  Add,
  Sub,
  Mul,
  Div,
  Eq,
  Ne,
  Lt,
  Le,
  Gt,
  Ge,
  And,
  Or,
};

An ExprLiteral is an Expr that holds an ExprLiteralVar:

class ExprLiteral : public Expr {
 public:
  ExprLiteral(ExprLiteralVar value, std::unique_ptr<type::Type>);
  ~ExprLiteral() override = default;
  void accept(ExprVisitor& v) const override;
  const ExprLiteralVar& getValue() const;
 private:
  ExprLiteralVar inner_;
};

where ExprLiteralVar is:

using ExprLiteralVar = std::variant<uint64_t,int64_t,bool>;
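
Literals are one place where a variant is the right tool: the set of literal kinds is small and fixed. A throwaway helper (show is not part of the AST) demonstrates how a single generic lambda handles all three alternatives, which is exactly how the Printer will consume getValue() later:

```cpp
#include <cassert>
#include <cstdint>
#include <sstream>
#include <string>
#include <variant>

using ExprLiteralVar = std::variant<uint64_t, int64_t, bool>;

// Throwaway helper: render whichever alternative is currently active.
std::string show(const ExprLiteralVar& v) {
  std::ostringstream os;
  std::visit([&](auto&& x) { os << x; }, v);  // one lambda, three alternatives
  return os.str();
}
```

show(int64_t{-7}) yields "-7". Note that an unsuffixed literal like 23 is ambiguous between the two integer alternatives, hence the explicit int64_t{...} / uint64_t{...} at construction time.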

The Stmt Node

We could theoretically get away with only expressions, but I like the idea of compartmentalizing the control-flow structures away from the typed structures (although functions also have types in LLVM, but I digress).

A Stmt is an empty node that can be traversed:

class Stmt {
 public:
  virtual ~Stmt() = default;
  Stmt() = default;
  virtual void accept(StmtVisitor&) = 0;
};

A StmtAssign is a Stmt that holds an Expr and a name:

class StmtAssign : public Stmt {
 public:
  StmtAssign(std::string,std::unique_ptr<expr::Expr>);
  ~StmtAssign() override = default;
  void accept(StmtVisitor&) override;
  const std::string& getName() const;
  const expr::Expr& getExpr() const;
 private:
  std::string name_;
  std::unique_ptr<expr::Expr> expr_;
};

A StmtLabel is a Stmt that holds a name and a number of Stmts:

class StmtLabel : public Stmt {
 public:
  StmtLabel(std::string,std::vector<std::unique_ptr<Stmt>>);
  ~StmtLabel() override = default;
  void accept(StmtVisitor&) override;
  const std::string& getName() const;
  const std::vector<std::unique_ptr<Stmt>>& getInner() const;
 private:
  std::string name_;
  std::vector<std::unique_ptr<Stmt>> inner_;
};

A StmtBr is a Stmt that holds a labelName. This is essentially an unconditional branch. We could reference a StmtLabel directly, but just holding a string is good enough:

class StmtBr : public Stmt {
 public:
  StmtBr(std::string);
  ~StmtBr() override = default;
  void accept(StmtVisitor&) override;
  const std::string& getLabelName() const;
 private:
  std::string labelName_;
};

Same idea for StmtCondBr, the conditional branch:

class StmtCondBr : public Stmt {
 public:
  StmtCondBr(std::string,std::unique_ptr<expr::Expr>);
  ~StmtCondBr() override = default;
  void accept(StmtVisitor&) override;
  const expr::Expr& getCondition() const;
  const std::string& getLabelName() const;
 private: 
  std::unique_ptr<expr::Expr> cond_;
  std::string labelName_;
};

A StmtId is just a Stmt that holds an Expr:

class StmtId : public Stmt {
 public: 
  StmtId(std::unique_ptr<expr::Expr>);
  ~StmtId() override = default;
  void accept(StmtVisitor&) override;
  const expr::Expr& getExpr() const;
 private:
  std::unique_ptr<expr::Expr> expr_;
};

The Function Node

The recursive types end here, finally! Unless we decide to add more top-level language constructs and abstract them in a similar way. Right now, though, we only define functions (along with their parameters):

class Function {
 public:
  Function(std::unique_ptr<type::Type>, std::string, std::vector<std::unique_ptr<Parameter>>,
            std::vector<std::unique_ptr<stmt::Stmt>>);
  ~Function() = default;
  void accept(FunctionVisitor&) const;
  const std::vector<std::unique_ptr<stmt::Stmt>>& getBody() const;
  const std::vector<std::unique_ptr<Parameter>>& getParams() const;
  const std::string& getName() const;
  const type::Type& getType() const;
 private:
  std::unique_ptr<type::Type> returnType_;
  std::string name_;
  std::vector<std::unique_ptr<stmt::Stmt>> body_;
  std::vector<std::unique_ptr<Parameter>> params_;
};

class Parameter {
 public:
  Parameter(std::unique_ptr<type::Type>,std::string);
  ~Parameter() = default;
  void accept(FunctionVisitor&) const;
  const type::Type& getType() const;
  const std::string& getName() const;
 private:
  std::unique_ptr<type::Type> type_;
  std::string name_;
};

The Module Node

And obviously the module, which holds all of the top-level constructs we may or may not add. This is mainly for completeness:

class Module {
 public:
  Module(std::string,std::vector<std::unique_ptr<fun::Function>>);
  ~Module() = default;
 private:
  std::string name_;
  std::vector<std::unique_ptr<fun::Function>> functions_;
};

Visitor Operations

Notice how easy it is to define an AST node this way? You can pick and choose whichever node class or subclass you want to implement in your own AST, decide how to compartmentalize them into .hpp files, etc. All you have to do is make sure you implement the accept(Visitor&) method, where Visitor is the corresponding visitor class.

Speaking of which, the visitor is how we apply all of our operations/transformations to our AST. This includes displaying, optimizing, or generating code, and it is an amazing way of separating our concerns (the existence of nodes vs. operating on those existing nodes). For this part, let’s implement a simple, naive visitor: Printer.

For starters, let’s define the Type visitor:

struct TypeVisitor {
  TypeVisitor() = default;
  virtual ~TypeVisitor() = default;
  virtual void visit(const PrimitiveType&) = 0;
  virtual void visit(const Struct&) = 0;
  virtual void visit(const Array&) = 0;
  virtual void visit(const Vector&) = 0;
  virtual void visit(const Pointer&) = 0;
};

struct Printer : TypeVisitor {
  void visit(const PrimitiveType&) override;
  void visit(const Struct&) override;
  void visit(const Array&) override;
  void visit(const Vector&) override;
  void visit(const Pointer&) override;
};

Notice that the visitor itself is not an operation; the implementation of a visitor is.

Its implementation looks like this:

void Printer::visit(const PrimitiveType& n) {
  switch (n.getBase()) {
  case PrimTy::i64: std::print("i64"); break;
  case PrimTy::u64: std::print("u64"); break;
  case PrimTy::Bool: std::print("bool"); break;
  case PrimTy::Void: std::print("void"); break;
  }
}

void Printer::visit(const Struct &n) {
  std::print("{}",n.getName());
}

void Printer::visit(const Array &n) {
  std::print("[");
  n.getInner().accept(*this);
  std::print(";");
  std::print("{0}]",n.getLen());
}

void Printer::visit(const Vector &n) {
  std::print("{{");
  n.getInner().accept(*this);
  std::print(";");
  std::print("{0}}}",n.getLen());
}

void Printer::visit(const Pointer& n) {
  n.getInner().accept(*this);
  std::print(" ptr");
}

The accept method is for now the same for every possible node:

void Array::accept(TypeVisitor &v) const {
  v.visit(*this);
}

void Vector::accept(TypeVisitor &v) const {
  v.visit(*this);
}

void Struct::accept(TypeVisitor &v) const {
  v.visit(*this);
}

void Pointer::accept(TypeVisitor &v) const {
  v.visit(*this);
}

void PrimitiveType::accept(TypeVisitor &v) const {
  v.visit(*this);
}

You can probably find a way to reduce this duplication using templates, but I am fine with it for now.
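
One such option, sketched here under throwaway names (Node, Visitable, A, B are illustrative, not our AST), is a CRTP base that stamps out accept() once per subclass, collapsing the five identical bodies above into a single template:

```cpp
#include <cassert>
#include <string>

struct A;
struct B;

// Mini visitor mirroring TypeVisitor, over two placeholder node types.
struct Visitor {
  virtual ~Visitor() = default;
  virtual void visit(const A&) = 0;
  virtual void visit(const B&) = 0;
};

struct Node {
  virtual ~Node() = default;
  virtual void accept(Visitor&) const = 0;
};

// CRTP: each subclass names itself as the template argument, and the
// statically known Derived type selects the right visit() overload.
template <typename Derived>
struct Visitable : Node {
  void accept(Visitor& v) const override {
    v.visit(static_cast<const Derived&>(*this));
  }
};

struct A : Visitable<A> {};  // accept() comes for free
struct B : Visitable<B> {};

struct NameCollector : Visitor {
  std::string out;
  void visit(const A&) override { out += "A"; }
  void visit(const B&) override { out += "B"; }
};
```

Visiting an A then a B through this collector accumulates "AB", with no hand-written accept() anywhere.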

For completeness, let’s define all of the remaining visitors.

Expr visitor:

struct ExprVisitor {
  virtual ~ExprVisitor() = default;
  virtual void visit(const BinaryExpr&) = 0;
  virtual void visit(const ExprId&) = 0;
  virtual void visit(const ExprLiteral&) = 0;
};

struct Printer : ExprVisitor {
  void visit(const BinaryExpr&) override;
  void visit(const ExprId&) override;
  void visit(const ExprLiteral&) override;
};

with implementation:

void Printer::visit(const BinaryExpr& binexpr) {
  std::print("(");
  std::print("{0}",op::to_string(binexpr.getOp()));
  std::print(" ");
  binexpr.getLhs().accept(*this);
  std::print(" ");
  binexpr.getRhs().accept(*this);
  std::print(")");
}

void Printer::visit(const ExprId& idexpr) {
  std::print("{}",idexpr.getName());
}

void Printer::visit(const ExprLiteral &literal) {
  std::visit([](auto&& v){
    std::print("{}",v);
  }, literal.getValue());
}
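
Putting the pieces together, here is a condensed, self-contained version of the expression nodes and their Printer. To keep it testable it writes to a std::ostringstream instead of std::print, stores the operator as a plain string in place of op::BinaryOp/op::to_string, and omits the Type member for brevity; everything else mirrors the classes above:

```cpp
#include <cstdint>
#include <memory>
#include <sstream>
#include <string>
#include <variant>

struct BinaryExpr; struct ExprId; struct ExprLiteral;

struct ExprVisitor {
  virtual ~ExprVisitor() = default;
  virtual void visit(const BinaryExpr&) = 0;
  virtual void visit(const ExprId&) = 0;
  virtual void visit(const ExprLiteral&) = 0;
};

struct Expr {
  virtual ~Expr() = default;
  virtual void accept(ExprVisitor&) const = 0;
};

struct ExprId : Expr {
  std::string name;
  explicit ExprId(std::string n) : name(std::move(n)) {}
  void accept(ExprVisitor& v) const override { v.visit(*this); }
};

struct ExprLiteral : Expr {
  std::variant<uint64_t, int64_t, bool> value;
  explicit ExprLiteral(int64_t v) : value(v) {}
  void accept(ExprVisitor& v) const override { v.visit(*this); }
};

struct BinaryExpr : Expr {
  std::string op;  // stands in for op::BinaryOp + op::to_string
  std::unique_ptr<Expr> lhs, rhs;
  BinaryExpr(std::string o, std::unique_ptr<Expr> l, std::unique_ptr<Expr> r)
      : op(std::move(o)), lhs(std::move(l)), rhs(std::move(r)) {}
  void accept(ExprVisitor& v) const override { v.visit(*this); }
};

// Same shape as the Printer above, buffering into `out`.
struct Printer : ExprVisitor {
  std::ostringstream out;
  void visit(const BinaryExpr& e) override {
    out << "(" << e.op << " ";
    e.lhs->accept(*this);
    out << " ";
    e.rhs->accept(*this);
    out << ")";
  }
  void visit(const ExprId& e) override { out << e.name; }
  void visit(const ExprLiteral& e) override {
    std::visit([this](auto&& v) { out << v; }, e.value);
  }
};
```

Building the tree for 23 + x and accepting a Printer leaves "(+ 23 x)" in the buffer, the same prefix form the real Printer emits.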

Stmt visitor:

struct StmtVisitor {
  virtual ~StmtVisitor() = default;
  virtual void visit(const StmtAssign&) = 0;
  virtual void visit(const StmtLabel&) = 0;
  virtual void visit(const StmtBr&) = 0;
  virtual void visit(const StmtCondBr&) = 0;
  virtual void visit(const StmtId&) = 0;
};

struct Printer : StmtVisitor {
  void visit(const StmtAssign& v) override;
  void visit(const StmtLabel& v) override;
  void visit(const StmtBr& v) override;
  void visit(const StmtCondBr& v) override;
  void visit(const StmtId& v) override;
};

with implementation:

void Printer::visit(const StmtAssign &v) {
  std::print("(=");
  std::print("{0} ",v.getName());
  expr::Printer exprPrinter;
  const expr::Expr& e = v.getExpr();
  e.accept(exprPrinter);
  std::println(")");
}

void Printer::visit(const StmtLabel &v) {
  std::println("(deflabel \"{0}\"",v.getName());
  auto& stmts = v.getInner();

  for (auto &stmt : stmts) {
    stmt->accept(*this);
  }

  std::println(")");
}

void Printer::visit(const StmtBr &v) {
  std::println("(branch {0})",v.getLabelName());
}

void Printer::visit(const StmtCondBr &v) {
  std::print("(branch ");
  expr::Printer exprPrinter;
  auto& e = v.getCondition();
  std::print(":if ");
  e.accept(exprPrinter);
  std::print(" :then {0}", v.getLabelName());
  std::println(")");  // close the form; the original left it unbalanced
}

void Printer::visit(const StmtId &v) {
  expr::Printer exprPrinter;
  v.getExpr().accept(exprPrinter);
}

Function visitor:

struct FunctionVisitor {
  virtual ~FunctionVisitor() = default;
  virtual void visit(const Function&) = 0;
  virtual void visit(const Parameter&) = 0;
};

struct Printer : FunctionVisitor {
  void visit(const Function&) override;
  void visit(const Parameter&) override;
};

with implementation:

void Printer::visit(const Function &f) {
  auto& name = f.getName();
  std::print("(defun {0} :type ", name);
  auto& returnType = f.getType();
  type::Printer v;
  returnType.accept(v);
  auto& params = f.getParams();
  for (auto &param : params) {
    param->accept(*this);
  }
  auto& body = f.getBody();
  stmt::Printer stmtPrinter;
  for (auto &stmt : body) {
    stmt->accept(stmtPrinter);
  }
  std::println(")");
}

void Printer::visit(const Parameter &p) {
  auto& type = p.getType();
  auto &name = p.getName();
  type::Printer v;
  type.accept(v);
  std::print("({0})", name);
}

Conclusion

With the tree and visitors defined, code generation becomes a local problem: each node describes exactly the information LLVM needs, and each visitor method becomes a translation step from our language to LLVM’s IR. The remaining work is therefore not architectural but mechanical: implementing an IR-producing visitor that walks the tree, emits types, creates functions, and lowers expressions and statements into LLVM constructs.

In other words, the groundwork is (for the most part) done. The next stage is to replace the printing logic with actual LLVM IR builders and begin constructing a backend that can compile real programs. Then we can talk about more versatile designs for our ASTs.

Footnotes:

1. Modern Compiler Implementation in C, Andrew W. Appel.
