The Whinery: Emacs Lisp Regular Expressions

In introducing the concept of strings, The GNU Emacs Lisp Reference Manual comments that they're a good place for storing regular expressions. This is, in fact, complete nonsense. It's probably true that elisp has nothing better for this purpose, but it's clear that its strings are a, uh, bad match.

Consider first of all that, in keeping with the general concept of regular expressions, emacs regexps are different from everyone else's. A notable, shall we say, oddity is the fact that "|", "(" and ")" are all just ordinary characters. If you want alternation you need to escape the vertical bar with a backslash: "\|", ditto with grouping via parentheses: "\(" and "\)". I can only speculate about this design decision, but I would guess that it was expected that one of the major things you were going to want to hack with elisp regexps was elisp itself, and you were going to be matching parentheses a hell of a lot. In any case, elisp regexps are pretty heavy on backslashes.
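A quick sketch of that oddity, for evaluating in a scratch buffer. The sample strings are made up for illustration, and the backslashes show up doubled here only because the regexps are typed as string literals, which is the subject of the next few paragraphs.

```elisp
;; A bare "(" is an ordinary character, handy for matching Lisp code.
(string-match "(defun" "(defun foo ())")       ; => 0

;; Backslashed parens group, and a backslashed bar alternates.
(string-match "\\(cat\\|dog\\)s" "hot dogs")   ; => 4
(match-string 1 "hot dogs")                    ; => "dog"

;; A bare "|" only matches a literal vertical bar, so this finds nothing.
(string-match "cat|dog" "dog")                 ; => nil
```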

But wait, there's more! You enter strings in elisp, conventionally enough, by sticking them within double-quotes, and also, as is not unusual, you can use backslashes in these strings to escape characters for different purposes. A not irrelevant example: "\t" is how you enter the tab character (without literally typing in a TAB, which, after all, you can do in emacs, usually by doing something like "C-q TAB").

But we're not up to the funny part yet. Now we're getting near the punchline: if you want to enter a backslash into one of these strings, you need to escape it with another backslash.
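To see what the reader actually does with those escapes, here is a small sketch you can evaluate one form at a time. Nothing in it is specific to regexps; it only shows how many characters end up in the string.

```elisp
;; "\t" in a string literal is ONE character, a real tab.
(length "\t")                 ; => 1
(string= "\t" (string ?\t))   ; => t

;; "\\" is also ONE character: a single backslash.
(length "\\")                 ; => 1

;; So the two-character regexp \| takes three characters to type
;; between the quotes.
(length "\\|")                ; => 2
```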

A real-world example for you. In the documentation strings for elisp functions, if you want to refer to another function, you bracket the name like so: \\[function-name]. (This gets turned into a hyperlink inside the emacs help system.) For good and sufficient reasons, I wanted to write some elisp to extract function names from docstrings. The regexp to do this works out to:

Pretty whacky stuff, eh?

(For the life of me, I can't remember why 6 leading backslashes is the magic number. Trying to think it through I keep coming up with 8, but there are 6 in my code and it seems to work, so let's hope that that's right.)
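Here is one way the count can come out to 6. This is a sketch under an assumption: that the pattern is matched against a docstring held as a string value, after the reader has already collapsed the \\[ in the source into the two characters \[. The names my-example-doc, my-docstring-key-regexp and my-docstring-commands are hypothetical, and the regexp is an illustration, not necessarily the one from my code.

```elisp
;; A made-up docstring, typed the way it appears in source.
;; After reading, the string holds \[forward-word] with ONE backslash.
(defvar my-example-doc
  "Run \\[forward-word] to move, or \\[kill-line] to delete.")

;; The regexp has to be:   \\  \[  \( [^]]+ \)  ]
;;   \\   matches the literal backslash
;;   \[   matches the literal open bracket
;; That is three leading backslashes in the regexp, and doubling each
;; one for the string literal gives six.
(defvar my-docstring-key-regexp "\\\\\\[\\([^]]+\\)]")

(defun my-docstring-commands (doc)
  "Return the bracketed command names found in DOC, in order."
  (let ((pos 0) names)
    (while (string-match my-docstring-key-regexp doc pos)
      (push (match-string 1 doc) names)
      (setq pos (match-end 0)))
    (nreverse names)))

(my-docstring-commands my-example-doc)   ; => ("forward-word" "kill-line")
```

If the pattern were instead run over the source text in a buffer, where both backslashes are still sitting there, the count would go up again, which may be where the larger number comes from.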

But wait, there's *still* more. Remember the "\t" I mentioned? Suppose you want to capture some whitespace with a regexp. The pattern you're after is something like:

  \([ \t]*\)

And when you enter this in a string, it becomes:

  "\\([ \t]*\\)"

Look at that closely. See anything unusual? \( becomes \\( and \) becomes \\)... but what about \t? Why does it just stay \t?

Don't ask me. I know that leaving it as "\t" works, and I would guess that what happens is something like this: the \t interpolates into a literal TAB inside the string, so the regexp engine gets fed a pattern with a literal TAB in it, and thankfully that works. But on the other hand, if you stuck a "\\t" in the string, and that got converted into a "\t", what's wrong with that? Why wouldn't the regexp engine understand a "\t" as a TAB? Would *you* write a regexp engine that couldn't deal with it?
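A sketch that checks the guess. As far as I can tell from the documented regexp syntax, the engine has no \t escape of its own: a backslash in front of an ordinary character just matches that character, and inside square brackets a backslash isn't special at all. The test strings below are made up.

```elisp
;; "\t" in the literal: the engine sees a real tab inside the brackets.
(string-match "\\([ \t]*\\)x" " \t x")   ; => 0
(match-string 1 " \t x")                 ; => " \t " (space, tab, space)

;; "\\t" in the literal: the engine sees backslash + t.  Inside
;; brackets that makes the set {space, backslash, t}, with no tab.
(string-match "[ \\t]" "\t")             ; => nil
(string-match "[ \\t]" "t")              ; => 0
(string-match "[ \\t]" "\\")             ; => 0

;; Outside brackets, \t just matches the letter t.
(string-match "\\t" "\t")                ; => nil
(string-match "\\t" "t")                 ; => 0
```

So the doubled version doesn't fail loudly. It quietly matches the wrong characters, which is arguably worse.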

Anyway, there it stands. There's a simple rule of thumb for sticking a regexp in a string: double-up all the (already plentiful) backwhacks, except when you don't.

appendix: possible fix?

It would seem like it shouldn't be that hard to write a function that does this additional backwhacking for you (emacs-regexp-whack-off-string).

In principle you could implement any number of variant regexps with some (relatively) simple translations.
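A minimal sketch of one such translation, assuming the variant you want is the one where bare ( ) | are the operators and the backslashed forms are the literal characters. The function name my-regexp-from-bare is hypothetical, and the sketch deliberately ignores bracket expressions, so something like [(|)] will come out wrong.

```elisp
(defun my-regexp-from-bare (regexp)
  "Translate REGEXP from bare-operator syntax to Emacs regexp syntax.
Hypothetical sketch: bracket expressions are not handled."
  (let ((i 0) (len (length regexp)) (out '()))
    (while (< i len)
      (let ((c (aref regexp i)))
        (cond
         ;; Backslash followed by another character: look at the pair.
         ((and (eq c ?\\) (< (1+ i) len))
          (let ((next (aref regexp (1+ i))))
            (if (memq next '(?\( ?\) ?|))
                (push (string next) out)     ; escaped operator becomes literal
              (push (string c next) out))    ; any other escape passes through
            (setq i (1+ i))))
         ;; Bare operator: add the backslash that Emacs wants.
         ((memq c '(?\( ?\) ?|))
          (push (string ?\\ c) out))
         ;; Everything else is copied unchanged.
         (t (push (string c) out))))
      (setq i (1+ i)))
    (apply #'concat (nreverse out))))

(my-regexp-from-bare "(cat|dog)s")                          ; => "\\(cat\\|dog\\)s"
(string-match (my-regexp-from-bare "(cat|dog)s") "hot dogs") ; => 4
```

Note that this only trims the regexp-level backslashes. Anything that still needs one, like \w, has to be doubled in the string literal just as before.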

But this wouldn't fix the real problem: the default doesn't work in anything like an intuitive way. I flailed around with this problem occasionally for years before finally deciding to figure out what the hell was going wrong with my regexps.

This is an issue that comes up repeatedly in the emacs world: everything can be customized, so in theory all problems can be fixed, but often a "fix" is no good if you need to be an expert to know about the fix.

For more whining, see the rest of The Whinery

Joseph Brenner, 28 Feb 2004