Regex III: Getting From Postfix Notation to NFA

07 July 2015

Introduction

Last time I showed how we could take an infix notation Regex pattern and convert it to a postfix notation Regex pattern, so that it would be easier for us to parse and build our implementation. This time, I'm going to show how to take the postfix notation Regex pattern and convert that into an NFA. In future posts we'll take that NFA and convert it to a DFA, which will allow us to test a string against our Regex implementation.

The Algorithm

We discussed the algorithm for parsing a postfix notation Regex in the previous post, so this time let's start with the code:

private def toNFA(s: Seq[Char], stack: Seq[NFA[Char]]): NFA[Char] =
    s match {
      case Nil => stack(0)
      case '|' :: t => toNFA(t, Regex.union(stack.tail.head, stack.head) +: stack.takeRight(stack.length - 2))
      case '*' :: t => toNFA(t, Regex.iteration(stack.head) +: stack.takeRight(stack.length - 1))
      case '.' :: t => toNFA(t, Regex.concatenation(stack.tail.head, stack.head) +: stack.takeRight(stack.length - 2))
      case c :: t => toNFA(t, Regex.single(c) +: stack)
    }

Pattern Matching

This algorithm makes heavy use of a concept called "Pattern Matching". In short, you can think of pattern matching like the switch/case statement you might see in a language like C# on steroids.

In C#, a switch/case statement is a flow control structure that allows you to change the behavior of your program by matching a primitive variable to a set of possible values.

For example (from MSDN):

int caseSwitch = 1;
switch (caseSwitch)
{
    case 1:
        Console.WriteLine("Case 1");
        break;
    case 2:
        Console.WriteLine("Case 2");
        break;
    default:
        Console.WriteLine("Default case");
        break;
}

Pattern matching, by contrast, allows you to compare a variable to a "pattern".

Lots of people have written a lot of excellent content on pattern matching, and I'm not going to rehash it here, except to say that its very useful in cases where you can only get at data structures from the outside (using something equivalent to a POCO in C#), which seems to be quite common in functional programming.

If you want to learn more about pattern matching, and I would very much recommend that you do so, check out this article which is a conversation with Martin Odersky, the designer of Scala.

The Algorithm

In the toNFA algorithm above, the input variable is a sequence of characters (ie. a string) and each of our case statements matches this to a "pattern", that is, a potential form that the list may be in.

  • The first pattern is the empty list, in which case we want to return the top value of our NFA stack.

  • The second pattern is a list that begins with the union character. case '|' :: t says "match if the list starts with a | and has values afterwards (the tail t). In this case, we want to recursively call toNFA, passing in the tail of the list, and popping the top two values off the top of our NFA stack, unioning them, and pushing the result onto the NFA stack.

  • The third pattern is a list that begins with the iteration character. In this case, we want to call toNFA recursively, passing in the tail of the list, and popping the top value off of our NFA stack, applying the iterator NFA, and pushing the result onto the NFA stack.

  • The fourth pattern is a list that begins with the concatenation character. In this case, we want to call toNFA recursively, passing in the tail of the list, and popping the top two values off of our NFA stack, concatenating them, and pushing the result onto the NFA stack.

  • Finally, the fifth pattern handles a list starting with a non-operator character, for which we just push the single NFA onto the stack.

Some Cool Regexes as NFAs

Now that we finally have something that can take our rudimentary Regex syntax and turn it into a NFA, we can start to visualize what some cool Regexes might look like.

Let's start with something simple like a.b|c.d:

val regex = new Regex("a.b|c.d")
val nfa = regex.toNFA()

new GraphPrinter(nfa).outputGraphViz("output/a_dot_b_pipe_c_dot_d.dot")
"dot -T png output/a_dot_b_pipe_c_dot_d.dot -o output/a_dot_b_pipe_c_dot_d.png" !

a.b|c.d

How about a*.b*?:

val regex = new Regex("a*.b*")
val nfa = regex.toNFA()

new GraphPrinter(nfa).outputGraphViz("output/a_star_dot_b_star.dot")
"dot -T png output/a_star_and_b_star.dot -o output/a_star_dot_b_star.png" !

a*.b*

Or a*|b*?:

val regex = new Regex("a*|b*")
val nfa = regex.toNFA()

new GraphPrinter(nfa).outputGraphViz("output/a_star_pipe_b_star.dot")
"dot -T png output/a_star_pipe_b_star.dot -o output/a_star_pipe_b_star.png" !

a*|b*

These are the Regex patterns I can think of right now that generate nice pretty NFA diagrams for a blog post. There are a lot of other interesting ones without such symmetrical and pretty diagrams.

In Conclusion

Well that was fun! We're now ready for the last (and most difficult) step, which is to convert our NFAs to a DFA so that we can test a string against our Regex. That will be the topic of the next blog post or two.

Tags: Compilers | Automata | Scala | Regex
Tweet