Regex I: Four Components of a Regex

12 June 2015

Introduction

Now that we've learned and built the basics of an NFA and DFA, it's time to get our hands dirty and start down the path of building a simple Regex engine.

There are four base types of Regex graphs that we can produce.

Note that you've probably seen a greater range of Regex syntax, we'll cover more of that in a future blog post.

You'll most likely recognise the Regex operations listed below as being names of set operations. That's a good way of thinking about them, and how I tend to get my head around it.

Single

The single graph, represented in Regex form as /a/ describes a string that matches only one character. In this case, the Regex /a/ matches only the string "a". You can consider this Regex as matching the set of a single character Union

To create this regex, we use the following code:

object Regex {
    def single(subject: Char) = new NFA('a' to 'z', ').addState().addTransition(1, subject, 2)
}

and this generates the following NFA graph:

Single

Concatenation

The concatenation graph, represented in Regex form as /ab/ describes a string in which each character occurs after the other. In this case, the Regex /ab/ matches only the string "ab". You can consider this Regex as matching the set of the single string Union

To create this regex, we use the following code:

object Regex {
    def concatenation(from: NFA[Char], to: NFA[Char]) = from.combine(to)
}

and this generates the following NFA graph:

Concatenation

Union

The union graph, represented in Regex form as /a|b/ describes a string in which either one character (or Regex) or another will match. In this case, the Regex /a|b/ matches either the string "a" or the string "b". You can consider this Regex as matching the set of the two possible strings Union

To create this regex, we use the following code:

object Regex {
    def union(a: NFA[Char], b: NFA[Char]) =
        new NFA('a' to 'z', ')
            .addState()
            .addTransition(1, ', 2)
            .combine(a)
            .addState()
            .combine(b)
            .addTransition(1, ', a.length + 2)
            .addState()
            .addTransition(a.length + 1, ', b.length + a.length + 2)
            .addTransition(b.length + a.length + 1, ', b.length + a.length + 2)
}

and this generates the following NFA graph:

Union

Iteration

The iteration graph, represented in Regex form at /a*/ describes a string in which a character (or Regex) is repeated zero or many times. In this case, the Regex /a*/ matches the strings "", "a", "aa", "aaa", "aaaa", and so on. You can consider this Regex as matching the set of strings Iteration Equation

To create this regex, we use the following code:

object Regex {
    def iteration(subject: NFA[Char]) = 
        new NFA('a' to 'z', ')
            .addState()
            .addTransition(1, ', 2)
            .combine(subject)
            .addTransition(subject.length + 1, ', subject.length + 2)
            .addTransition(subject.length + 1, ', 2)
            .addState()
            .addTransition(subject.length + 2, ', 1)
}

and this generates the following NFA graph:

Iteration

Refactored NFA Code

All of this work made me realise that there's a few ways to clean up the NFA code I'd written so far, and keep only the operations that were required.

Firstly, I made the public interface of the NFA class much simpler by moving the printing functionality to another class. While printing is important for writing this blog and some other ideas I have going forward, it's not integral to the actual execution of the NFA, so it doesn't really belong in the NFA class. I also simplified the public interface to only methods that were required, these methods are:

  • addState
  • addTransition (simplified to only have one signature where the to and from states are known)
  • combine
  • length

The new NFA code is shown below:

class NFA[T](val alphabet: Seq[T], val states: Seq[Seq[Seq[Int]]]) {

  def this(alphabet: Seq[T], epsilon: T) = {
    this(alphabet :+ epsilon, Seq.fill(2)(Seq.fill(alphabet.length + 1)(Seq(NFA.DeadState))))
  }

  def addState(): NFA[T] = new NFA(alphabet, states ++ Seq.fill(1)(Seq.fill(states(0).length)(Seq(NFA.DeadState))))

  def addTransition(from: Int, key: T, to: Int): NFA[T] = {
    val newTransitions = states(from)(getalphaIndex(key)).filter(_ != NFA.DeadState) :+ to
    val newStates = states.updated(from, states(from).updated(getalphaIndex(key), newTransitions))
    new NFA(alphabet, newStates)
  }

  def combine(to: NFA[T]): NFA[T] = {
    new NFA(alphabet, states.dropRight(1) ++ to.states.tail.map(_ .map(_ .map(z => if(z == NFA.DeadState) NFA.DeadState else z + states.length - 2))))
  }

  def length() = states.length - 1

  private def getalphaIndex(item: T) = alphabet.zipWithIndex.filter(a => a._1 == item)(0)._2
}

object NFA {
  val DeadState = 0
}

Much cleaner and simpler, even if I do say so myself.

In Conclusion

Things are starting to get interesting, and the next key part of the Regex engine to build will be to take a Regex string and parse it into these particular states. That will be the focus of at least the next couple of blog posts.

Tags: Compilers | Automata | Scala | Regex
Tweet