Regex II: The Shunting Yard Algorithm

02 July 2015

Introduction

I want to start by apologising for a lack of posts for the last couple of weeks. I was scheduled to give a talk at the local SQL Saturday conference here in Vancouver, and that ended up taking up most of my time, and unfortunately blogging was the item that had to suffer. I'm back now though, and intend to get back to my weekly blogging habits with a few extras thrown in to make up for lost time.

Today I'm going to talk a little about the shunting-yard algorithm and how to apply it to the problem of parsing the language of a Regex.

Infix vs Postfix Notation

To understand how the shunting-yard alogirthm fits into the design of a Regex engine and a compiler, one first needs to understand about a couple of different ways to express equations. The two notations we're going to talk about today are the infix and postfix notation.

Infix notation is the one you're most likely to be familiar with. When one generally thinks of a mathematical equation, they will think of it in terms of an infix notation.

For example: (5 + 3) * 7 is written in infix notation and can be read as "five plus three times seven".

While it's clearly apparent that most people would very comfortably understand and be able to parse an equation in infix notation, it's actually rather more difficult for a computer to do so.

A notation that is much easier for a computer to undestand is referred to as postfix notation. Where the operands are written before the operations that are performed on them.

The same example from above, written in postfix notation: 5 3 + 7 * can be read as "take five and three, add them to get eight then take seven, finally multiply seven and eight".

Operating on Postfix Notation

The reason that it's much easier to work on postfix notation is that one can very simply use a stack to quickly compute the operation. Let's try with the example we've been using throughout the blog post:

5 3 + 7 *

We have two operations available to use while calculating the result here:

  • Push an operand onto the stack
  • Pop the top two operands off the stack, apply the operator, push the result onto the stack

Let's now apply these to the above calulation:

  1. The first character of the calculation is 5, which, being an operand, is pushed onto the stack. Stack: 5

  2. The second character of the calculation is 3, which, being an operand, is pushed onto the stack. Stack: 5, 3

  3. The third character of the calculation is +, which, being an operator, is applied to the first two values to be popped off the stack. 5 added with 3 is 8 which is pushed onto the stack. Stack: 8

  4. The fourth character is 7, which is pushed onto the stack. Stack: 8, 7

  5. The fifth character of the calculation is *, which, being an operator, is applied to the first two values to be popped off the stack. 8 is multiplied with 7, which is pushed onto the stack. Stack: 56

  6. There are no more characters left in the calculation, so we're at the end of the calculation. There should be only one value left on the stack if this is a valid calculation, which there is, and therefore 56 is the result.

A Note on the Style of Regex Language

To make it easier to parse (this is a demonstration afterall), I'm using the . character to represent the concatenation operation, rather than the normal absence of a operator.

Converting Between Infix and Postfix

Looking at the language of a Regex, the language can be considered to be in infix notation. To make it easier for us to parse the Regex into a DFA for processing, we can first turn the infix notation into a postfix notation.

The basic premise of the shunting-yard algorithm is quite simple, but it can be tricky to get right at first. I think the best way to explain the algorithm as I've implemented it is to lead with an example, one that I found a little tricky to get right the first time, and one that highlights the concepts at play quite well.

To start, let's define the basic components of the shunting-yard algorithm:

  • The equation we're converting
  • A sequence of operators that we're currently working with
  • An output in postfix notation

Each of these basic components can be represented as either a string or a sequence of characters, in the code I've decided to represent these as strings as they are easier to use in the Intellij debugger, though you could choose whatever fit your representation best.

Order of Precedence and Operator Testing

First we have to define the order of precedence of our operators. Just like the operators in a general mathematical equation, the Regex operators also have a particular order of precedence. Consider how in general mathematical equations multiplication binds tighter than addition which binds tighter than subtraction, in our Regex:

  • Iteration (*) binds most tightly
  • Concatenation (.) binds less tightly than iteration
  • Union (|) binds less tightly than iteration

To represent this in our code, we specify a precedence map:

Map('*' -> 0, '.' -> 1, '|' -> 2)

Using that precedence map, we then define a function to test if a character is an operator and if one operator binds more tightly than another:

def isOperator(x: Char) = operatorPrecedence.contains(x)

def bindsMoreTightly(a: Char, b: Char) = operatorPrecedence(a) >operatorPrecedence(b)

The Algorithm

Now that we've specified our help methods and precedence map, let's talk about the main algorithm.

Because we're doing this with Scala, which is a functional-esque language, we're going to define the function recursively. This means that the first thing we need to do is establish a base case.

Because we're consuming characters from our infix notation, the base-case we'll use here is that the input is out of characters and is the empty-string ("").

We'll come back to this base case in a moment, but first, let's consider the non-base case.

When we still have characters on our input, let's grab one. This character can be an operator or an operand. Using our isOperator function defined above, we now have a decision to make based on the state of the character.

If the character is not an operator, then our decision is easy, we just append the character to the end of our output.

If the character is an operator, then we need to make a couple more decisions.

Firstly, if our operator stack is non-empty, and the character binds less tightly than the operator on the top of the operator stack, then we pop all of the operators from the stack that bind more tightly and append them to the end of our output.

Conversely, if our operator stack is empty or the chracter binds more tightly than the operator on the top of the operator stack, then we do nothing.

Finally, we append our looser binding operator to the beginning of the operator stack.

Going back to the base case, once we have reached the end of the input, we take all of the remaining operators from the operator stack and append them to the end of the output.

Example

Let's now look at the example from above: a.b**|c

Step Input Operators Output Comment
1 a.b**|c a The character 'a' is a operand, so append it to the output
2 .b**|c . a There are no other operators in the list, so push '.' onto the list
3 b**|c . ab The character 'b' is a operand, so append it to the output
4 **|c *. ab The character '*' has higher precedence than '.' so we push it onto the operator stack
5 *|c **. ab The character '*' has equal precedence as '*' so we push it onto the operator stack
6 |c | ab**. The operator '|' has lower precedence, so we pop all the higher precendence operators from the operator stack and push '|' onto the stack
7 c | ab**.c The character 'c' is a operand, so append it to the output
8 | ab**.c| Now we start popping the remaining operators off the end of the stack
9 ab**.c|

The Code

Finally, let's look at the code to perform this process:

class ShuntingYard(val operatorPrecedence: Map[Char, Int]) {

  def toPostfix(s: String): String = toPostfix(s, "", "")

  def isOperator(x: Char) = operatorPrecedence.contains(x)

  def hasLowerPrecedence(a: Char, b: Char) = operatorPrecedence(a) < operatorPrecedence(b)

  private def toPostfix(input: String, operators: String, output: String): String = {
    if (input == "") {
      if (operators.nonEmpty) {
        toPostfix(input, operators.tail, output :+ operators.head)
      } else {
        output
      }
    } else {
      if (isOperator(input.head)) {
        if (operators.nonEmpty && hasLowerPrecedence(input.head, operators.head)) {
          val operatorsWithHigherPrecedence = operators.filter(o => !hasLowerPrecedence(input.head, o))
          val operatorsWithLowerPrecedence = operators.filter(o => hasLowerPrecedence(input.head, o))
          toPostfix(input.tail, input.head +: operatorsWithHigherPrecedence, output ++ operatorsWithLowerPrecedence)
        } else {
          toPostfix(input.tail, input.head +: operators, output)
        }
      } else {
        toPostfix(input.tail, operators, output :+ input.head)
      }
    }
  }
}

The code for this can also be found on GitHub.

In Conclusion

That's all for today, we're getting close to an actual Regex implementation. Coming up next will be taking the postfix notation and using that to construct an actual NFA, from which we can build our DFA - the implementation of our Regex.

Stay tuned.

Tags: Compilers | Automata | Scala | Regex
Tweet