art with code

2009-02-22

Generic regexps

You might like this: A small polymorphic regexp library.

To explain, four bullet points of theory:
  • Formal languages are sets of words.
  • Words are strings of characters from an alphabet.
  • To define a formal language, you usually use a formal grammar.
  • Regular expressions are formal grammars of regular languages.
So when you are matching a string against a regexp, you are actually testing whether the string is a member of the regular language defined by the regexp.

The beauty of that is that the strings and the alphabet can be anything you can come up with, they don't have to mean the usual "array of 8-bit integers." It could just as well be a list of UTF-8 characters, an array of floats or a tree of lists.[1]

All you need is a fold over the data structure and comparison operators for your alphabet (equality for character comparisons, < and > for range matching.) And, as you likely don't want to write out the regexp AST by hand, a parser for expressions in your alphabet might be helpful.

The library linked above contains parsers for regular expressions of chars, ints and floats, and a simple polymorphic NFA-interpreting regexp engine for running the regexps. It's totally unoptimized, so don't expect magnificent feats of performance from it.

In terms of missing features, the engine doesn't have backreferences as it's an NFA. I also haven't implemented negative ranges (e.g. [^a-z] does not parse.) And it doesn't do greedy matching (e.g. "fo+" matches only the first two characters of "fooooo".)

Examples:

(* boring string regexp *)
pat_match "(fo.|bar)+baz$" "afooforbarbarfozbaz"
- : Some (1, 19)

(* int array regexp, _ is the wildcard, ; separates the numbers *)
int_match "[1..10]; _; _; [11..15; 17; 19..25]$" [|17; 4; 0; -1; 17|]
- : Some (1, 5)

(* float array regexp *)
float_match "^2e3;[1.5..2.2]+;17.0" [|2000.0; 1.7; 2.1; 17.0; 8.0|]
- : Some (0, 4)


Here's what happens under the covers: The regular expression is parsed into an 'a regex, which is then turned into an 'a nfa. To execute the NFA, the matcher calls execute_nfa nfa getter, where the getter returns an option value for a given index. That's... probably not all too clear, so here's the implementation of int_match:


(* Creates a getter for the array.
The getter is a function that returns
None for out-of-bounds indices and
Some value for valid ones.
*)
let array_getter arr =
let len = Array.length arr in
fun i -> if i < 0 || i >= len then None else Some arr.(i)

let int_match pat arr =
let re = int_regex pat in (* make the regexp AST *)
let nfa = nfa_of_regex re in (* turn it into an NFA *)
let getter = array_getter arr in (* make the getter *)
execute_nfa nfa getter (* and run the NFA on the getter *)



[1] If you really want to meta it up, your string could be a regular expression and your alphabet could be regular expressions, so you'd write regular expressions of regular expressions to see if a regular expression of regular expressions matches your regular expression.

No comments:

Blog Archive