Regex IV: From NFA to DFA

20 July 2015

Introduction

Let me start by saying that today, we're finally going to get a working, albeit rudimentary, Regex engine that we can test against a string. Woohoo!

fireworks

Now that we've all settled down, let's talk about what it took to get to this stage.

In the previous post we looked at calculating the epsilon closure of an NFA state, and today we're going to look at how we can put together the epsilon closures to convert our NFA into an equivalent DFA.

Backing Up a Little Bit

Before we go there though, as it's been quite a while since we've talked about all of the components of a Regex, let's quickly make sure that we have our terms down correctly.

  • DFA: is a deterministic finite automata - a state machine for which each transition for a state is uniquely determined by it's input symbol as each state can only have a single move per input symbol.
  • NFA: is a non-deterministic finite automata - a state machine that can have states with multiple transitions for the same input symbol, and that can make "epsilon" moves, for no input at all.

To compare, let's look at the NFA and DFA for the simple Regex a* which matches the character 'a' zero or more times:

The NFA looks like so:

a*

Whereas the DFA looks like:

a*

You'll notice that if you treat the epsilon (ε) move in the NFA as a possible move without consuming any input and the accepting states (those that have a double circle node) as states that, if you land on and have no more input, then the string is accepted as matching the pattern, you'll see that both of these state-machines represent the same pattern. The DFA, however, is much simpler and deterministic (at each state you have two options per input symbol - fail or move to another state) and this is what makes it much easier to use these for a Regex implementation.

The NFA to DFA Algorithm

This is actually rather tricky, and took me a few goes to get right. Having spent most of my life writing object-oriented imperative code, when the going gets tough I still find it a bit easier to write algorithms in an imperative manner, and then convert them to a functional algorithm once I understand better what I'm supposed to be doing.

Once I got it though, the algorithm made much more sense, but it's still somewhat complex.

I'm going to begin by just showing each method, and then talk through what it does at each stage.

Setting an Accepting State

private def setDFAAccepting(closure: Seq[Int], dfa: DFA[T], dfaState: Int): DFA[T] =
  if (closure.contains(states.length - 1)) dfa.setAccepting(dfa.acceptingStates :+ dfaState)
  else dfa

This is the simplest part of the equation. It checks to see if the current epsilon closure we're dealing with contains the accepting state of the NFA. If it does, then the current state that we're working with in the DFA must also be an accepting state, so we add this state to the DFA accepting states.

Getting Transitions for an Epsilon Closure

private def getEpsilonClosureTransitions(closure: Seq[Int]) =
  alpha.take(alpha.length - 1)
    .map(character => character -> closure.flatMap(cs => states(cs)(getAlphaIndex(character)).filter(_ != NFA.DeadState)))
    .toMap
    .filter(m => m._2.nonEmpty)

This is a little more tricky. In this function we're trying to get a map of the transitions for each character in the epsilon-closure. For example, in the below closure for an Regex a*.b*:

a*.b*

The transitions for the highlighted epsilon closure are:

Character States
a 3
b 6

Building a DFA

def toDFA(trans: Map[T, Seq[Int]] = getEpsilonClosureTransitions(getEpsilonClosure(1)),
          dfa: DFA[T] = setDFAAccepting(getEpsilonClosure(1), new DFA[T](alpha).addState(), 1),
          thisClosure: Seq[Int] = getEpsilonClosure(1),
          dfaState: Int = 1): DFA[T] = {

  if(trans.isEmpty) {
    dfa
  } else {
    // We flat map because we could have more than one transition within our closure for the same character
    val nextClosure = trans.head._2.flatMap(i => getEpsilonClosure(i))
    if (nextClosure == thisClosure) { // we're looping back on ourselves
      toDFA(trans.tail, dfa.addTransition(dfaState, trans.head._1, dfaState), thisClosure, dfaState)
    } else {
      val dfaWithAccepting = setDFAAccepting(nextClosure, dfa, dfa.length() + 1)
      // we're moving to a new closure
      val dfaWithTrans = dfaWithAccepting.addState().addTransition(dfaState, trans.head._1, dfa.length + 1)
      val epsilonClosureTransitions = getEpsilonClosureTransitions(nextClosure)
      val nextClosureDFA = toDFA(epsilonClosureTransitions, dfaWithTrans, nextClosure, dfa.length + 1)
      // and making sure to add the transition for this closure
      toDFA(trans.tail, nextClosureDFA, thisClosure, dfaState)
    }
  }
}

Now, taking the previous two functions, we can build our DFA from our NFA.

We start with:

  • A list of non-epsilon character transitions for the epsilon-closure of state 1
  • A new DFA with a single state which also may be an accepting state
  • The current closure, which is the epsilon-closure of state 1
  • The current DFA state, which is obviously 1

This is a recursive function, and our base-case is that we have no more transitions to work on. In this case, we simply return the DFA we're working on, and we're done.

The next closure that we're working on is the epsilon-closure of the state of the first transition in our list of transitions. If this closure is the current closure that we're working on (that is, we're looping back on ourselves as happens in the iteration case) then we simply add a transition to the current DFA state and move to the next transition.

Otherwise, we're moving to a new closure and things get a little more interesting.

  1. We add a new state to the DFA, as we have a new epsilon-closure to work with, and each new epsilon-closure corresponds to a state in the DFA
  2. We also check here whether our new closure is an accepting state and set it as such in the DFA if it is
  3. We add a transition to this new state, and call ourselves recursively for this new closure
  4. Finally we call ourselves recursively for the next transition for the current closure

Now For Some Cool Examples

Let's start with my name, with any number of included vowels:

val regex = new Regex("j.a*.c.o*.b")
val nfa = regex.toNFA()
val dfa = nfa.toDFA()

The NFA:

a*.b*

The DFA:

a*.b*

Some Tests:

dfa.testString("jacob") // true
dfa.testString("jcb") // true
dfa.testString("jaaaaaacoooooooooob") // true
dfa.testString("jacb") // true
dfa.testString("jcooooob") // true
dfa.testString("jacib") // false
dfa.testString("jaaaaacoeb") // false

How about two alternate spellings of my name with any number of vowels?:

val regex = new Regex("j.a*.c.o*.b|j.a*.k.e*.b")
val nfa = regex.toNFA()
val dfa = nfa.toDFA()

The NFA:

a*.b*

The DFA:

a*.b*

Some Tests:

dfa.testString("jacob") // true
dfa.testString("jakeb") // true
dfa.testString("jackeb") // false
dfa.testString("jaceb") // false
dfa.testString("jaaaaacoooob") // true
dfa.testString("jaaaakeeeb") // true
dfa.testString("jkeb") // true
dfa.testString("jkb") // true
dfa.testString("jcb") // true

Conclusion

Well this has been a lot of fun, but it took quite a bit of effort to get here.

Some useful resources I found while trying to figure this out are:

Stay tuned, next time we'll look at implementing some shortcuts for building complex Regexes.

Tags: Compilers | Automata | Scala | Regex
Tweet