art with code

2009-02-22

Generic regexps

You might like this: A small polymorphic regexp library.

To explain, four bullet points of theory:
  • Formal languages are sets of words.
  • Words are strings of characters from an alphabet.
  • To define a formal language, you usually use a formal grammar.
  • Regular expressions are formal grammars of regular languages.
So when you are matching a string against a regexp, you are actually testing whether the string is a member of the regular language defined by the regexp.

The beauty of that is that the strings and the alphabet can be anything you can come up with, they don't have to mean the usual "array of 8-bit integers." It could just as well be a list of UTF-8 characters, an array of floats or a tree of lists.[1]

All you need is a fold over the data structure and comparison operators for your alphabet (equality for character comparisons, < and > for range matching.) And, as you likely don't want to write out the regexp AST by hand, a parser for expressions in your alphabet might be helpful.

The library linked above contains parsers for regular expressions of chars, ints and floats, and a simple polymorphic NFA-interpreting regexp engine for running the regexps. It's totally unoptimized, so don't expect magnificent feats of performance from it.

In terms of missing features, the engine doesn't have backreferences as it's an NFA. I also haven't implemented negative ranges (e.g. [^a-z] does not parse.) And it doesn't do greedy matching (e.g. "fo+" matches only the first two characters of "fooooo".)

Examples:

(* boring string regexp *)
pat_match "(fo.|bar)+baz$" "afooforbarbarfozbaz"
- : Some (1, 19)

(* int array regexp, _ is the wildcard, ; separates the numbers *)
int_match "[1..10]; _; _; [11..15; 17; 19..25]$" [|17; 4; 0; -1; 17|]
- : Some (1, 5)

(* float array regexp *)
float_match "^2e3;[1.5..2.2]+;17.0" [|2000.0; 1.7; 2.1; 17.0; 8.0|]
- : Some (0, 4)


Here's what happens under the covers: The regular expression is parsed into an 'a regex, which is then turned into an 'a nfa. To execute the NFA, the matcher calls execute_nfa nfa getter, where the getter returns an option value for a given index. That's... probably not all too clear, so here's the implementation of int_match:


(* Creates a getter for the array.
The getter is a function that returns
None for out-of-bounds indices and
Some value for valid ones.
*)
let array_getter arr =
let len = Array.length arr in
fun i -> if i < 0 || i >= len then None else Some arr.(i)

let int_match pat arr =
let re = int_regex pat in (* make the regexp AST *)
let nfa = nfa_of_regex re in (* turn it into an NFA *)
let getter = array_getter arr in (* make the getter *)
execute_nfa nfa getter (* and run the NFA on the getter *)



[1] If you really want to meta it up, your string could be a regular expression and your alphabet could be regular expressions, so you'd write regular expressions of regular expressions to see if a regular expression of regular expressions matches your regular expression.

Status update

Reading through Clutter sources for fun. Drawing a comic script for fun (now on page 46, 3x5cm pages. On reaching page 100, celebrating with pizza.) Procrastinating doing logic curse homework. Installed Linux on an USB stick as homework.

Installing Linux on the same stick as your installer is on is not such a great idea. Syslinux (or having a FAT partition at the start of the stick or something) makes Grub not work, so I had to delete the Syslinux partition (and change the Grub root drive number and edit menu.lst and update-grub.) Plus the USB stick write speed was slow, so it took more than three hours to do a Debian desktop net install.

qemu-kvm is nifty.

2009-02-17

Fedora 10 + Compiz steal your desktop edges?

Using Fedora 10? GL desktop enabled? Can't just throw the mouse to the top/bottom edge of the screen and click on gnome panel items? The culprit is the Compiz wall plugin. You need to use gconf-editor and go to apps/compiz/plugins/wall/allscreens/options, then erase the values for flip_down_edge and flip_up_edge.

2009-02-10

The Kindle 2 looks pretty cool


The Amazon Kindle 2, available for pre-order.
But it's, like, a book? Huh? It does seem to have a simple web browser though. The GUI reminds me of MacOS 7, what with them rounded B&W corners. I do like the design, very sleek.

Started writing documentation for prelude.ml, and it will take forever (err, say a month or two?)

I don't really know what to do with the IO layer. What I'd like to have is files, pipes and sockets under the same banner, pipes and sockets using a folding style of IO, files having folding and an mmap-like approach. Maybe something like Enum in Batteries Included.

And some structured IO marshaller on top so that you could e.g. send a list of float * string -tuples over a TCP socket. Or open a file over HTTP and write it to a local file, like with Ruby's open-uri.

Or maybe I just leave it as is and do something more productive instead.

2009-02-02

Caught on Last.fm, also some OSS config

9mm Parabellum Bullet - Japanese rock, Star One - prog rock, Кино - Soviet rock, Mumiy Troll - Russian rock

If you wish to use OSS4 on Fedora 10 Rhythmbox with Intel HDA integrated sound:
  1. download and install the rpm, fiddle with ossxmix to mute all line-ins and to maybe fix the jack mappings
  2. yum erase pulseaudio
  3. Use gconf-editor to change system/gstreamer/0.10/default *audiosink etc. to osssink (they default to autoaudiosink, which necessitates audioresample and 30% cpu use. You don't want that.)
  4. vmixctl rate 44100 /dev/dsp to make the OSS vmix take 44100 Hz audio (which most of your audio is)
  5. Oh hey, now Rhythmbox uses only 2% CPU instead of 6% it used with Pulseaudio. And it doesn't skip either. And you can play several sounds at once! Welcome to a Windows computer circa 1995.

Alternative for steps 3-5:
  • Use gconf-editor to change the *audiosink to speexresample ! audio/x-raw-float,rate=48000 ! audioconvert ! oss4sink (speexresample uses 6% cpu, same as Pulseaudio (as it uses speex as well))


(Why are Pulseaudio, artsd and esd a bad idea? Because they try to solve a driver-level problem by adding a new driver on top, instead of fixing the driver.

It's like you had a network driver that allowed only one process to use it at a time, and people scrambled to provide multiplexers that add a userspace daemon that let several programs access the network concurrently. And they don't bother to make them fast or low-latency. Or compatible.

So you have PulseIP, eIPd and artIPd, and every app must decide which one to support, or take a leap of faith and trust that the network driver does non-exclusive network access. And there are two driver APIs, the Open IP System and the Advanced Linux IP Architecture, both incompatible, you need to choose which one to use.

Retarded? Yes. How about just fixing the driver? Which, as it happens, is what OSS4 does.

There are some benefits to user-space sound daemons you say? Network-transparent audio? Application-specific volume control? While nice, they're not vital. What really matters is that basic audio works right.

And you should do the fancy stuff iptables-style, i.e. with a transparent sound shaper between the driver and the applications. Push the solutions down the stack, that way more people can benefit from them.)

Blog Archive

About Me

My photo

Built art installations, web sites, graphics libraries, web browsers, mobile apps, desktop apps, media player themes, many nutty prototypes, much bad code, much bad art.

Have freelanced for Verizon, Google, Mozilla, Warner Bros, Sony Pictures, Yahoo!, Microsoft, Valve Software, TDK Electronics.

Ex-Chrome Developer Relations.