The Origins of Regular Expressions

Where did regular expressions come from, and why the funny name? Rather interestingly, they grew from original research work on artificial intelligence, dating back nearly 60 years and preceding the era of electronic computing.

The Early History

In the 1940s, Warren McCulloch and Walter Pitts modeled neuron-like finite state machines to mimic the human nervous system, in an effort to help build a Turing machine. After being introduced to this research by John von Neumann, the mathematician Stephen Kleene later described these models in a notation that he called regular sets. The term regular expression made its initial debut in Kleene's 1951 paper, from the University of Wisconsin at Madison, titled "Representation of Events in Nerve Nets and Finite Automata." Regexes supplemented Kleene's Princeton University doctoral work, dating from the 1930s, on recursive algorithms,[1] a fundamental contribution that helped make electronic computing possible in the first place. Kleene's work also complemented related work by the more famous British scientist Alan Turing, who like Kleene was a doctoral student (1936-1938) of the enigmatic Princeton mathematician Alonzo Church. Church himself had extended the earlier recursive work of Vienna's Gödel, who had briefly lectured Kleene at Princeton in 1934 before returning to Austria. After World War II broke out, Gödel escaped from Hitler's Germany and returned to Princeton in 1940, via Russia and Japan. After Turing completed his own doctorate, and some hurried studies on ciphers, he returned to England in 1938 and went on to crack the Nazis' ENIGMA code through repetitive symbolic manipulations. In these various ways, an early form of regular expressions grew from the mathematical culture dish of Princeton.
After the war ended, research continued toward regular expressions proper and the creation of their backbone components. The most famous of these is the asterisk wildcard, still technically known as the Kleene star, which was adopted by many different computing applications. From these beginnings, Kleene's regular expressions gradually made their way into a wide range of programming languages, helping develop many other technologies along the way.

qed, ed, and vi

Jumping ahead a number of decades, Bell Labs' Ken Thompson incorporated Kleene's regular set notation into Butler Lampson and Peter Deutsch's original Berkeley qed editor program. This was the distant ancestor of Thompson's ed and of Bill Joy's personal interpretation, vi. With the help of Dennis Ritchie, regexes were also popularized via the Unix grep program, and from this historical point regexes inveigled their way into sed, lex, awk, nawk, gawk, and a host of other programs, including the venerable CHANGE command-line editor within SQL*Plus.

Enter Perl

When Perl bubbled spontaneously from its primeval soup in 1987, consisting mainly of amino acids stripped from sed and awk, the rest was pure biological determinism, swimming along in the moonlight with the general tide of Unix. If you add Larry Wall's linguistic origins to this tidal churn, the strong relationship between Perl and regexes was an almost inevitable development (or a successful pre-adaptation, as Darwinist Stephen Jay Gould might have put it). The relationship with regexes has remained central to Perl ever since, as Figure C-1 illustrates.

Perl vs. grep

Regular expressions are found within many Unix tools to pattern-match groups of characters within input data. Such data usually comes in through files, though it can be any kind of data held within a scalar variable.
Although many such tools exist — for example, grep and awk[2] — for finding patterns of characters, it is within Perl that pattern matching has been most strongly developed. Perl contains the greatest range of operators and metacharacters for finding patterns and substituting them with something else. Compared to grep, regexes in Perl have three major advantages:
Figure C-1. Languages within languages