Tuesday, August 11, 2009

Regular Expressions and Perl

Today's News:


30 songs: $30? $4,000? $675,000?

Jury awards RIAA $675,000 from a grad student for DLing and sharing 30 songs. RIAA demanded $4k before trial. Grad student. Now he'll have to scrape up enough to pay a bankruptcy lawyer. RIAA won't get money, but it's got a hell of a legal precedent.

For the record, it took 42 seconds for me to find the file-sharing service my kids use.

Perl


First, Perl is a very kludgy language. The Perl programmer does a lot of work for the benefit of the interpreter. That is backwards. Scalar variable names must start with "$". Array names start with "@". If you have an array, @arr, and want its first element, you refer to $arr[0].
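A minimal sketch of those sigil rules, with variable names made up for illustration:

```perl
use strict;
use warnings;

my $count = 3;               # scalar: the name starts with "$"
my @arr   = ( 10, 20, 30 );  # array: the name starts with "@"

# A single element of @arr is a scalar, so it is written with "$".
my $first = $arr[0];

# An array evaluated in scalar context yields its length.
my $size = scalar @arr;

print "first=$first size=$size count=$count\n";
```

The sigil follows the kind of value you are asking for, not the kind of variable you declared, which is why @arr turns into $arr[0].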

I assigned myself a task: learn the three "P"s of the LAMP stack. That's Linux, the Apache server, the MySQL database and one of Perl, PHP or Python. I got far enough in Perl to decide not to use it for CGI code. Too kludgy.

Consider this. Arguments are passed in the global variable @_. Suppose your second argument, $_[1], is a reference to a list. The reference is a scalar. The list is not. So you say: my $ref = $_[1]; my @list = @$ref;. You now have a list variable, @list.

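If that's not bad enough, consider this simpler substitute: my @list = @$_[1];. It looks equivalent, but it doesn't work. Perl reads it as a slice of the array that $_ refers to, not as a dereference of $_[1]; the form that works is @{$_[1]}. Ugh. I quit.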
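Here is a small sketch of the difference. The sub names braced and unbraced are hypothetical, and the global $_ is set by hand so the result is predictable:

```perl
use strict;
use warnings;

# Braces make the intent explicit: dereference the scalar in $_[1].
sub braced { my @list = @{ $_[1] }; return @list; }

# The line from the post. The sigil binds tighter than the subscript,
# so Perl treats this as a one-element slice of whatever $_ refers to.
sub unbraced { my @list = @$_[1]; return @list; }

# The global $_, which has nothing to do with the argument array @_.
$_ = [ 'x0', 'x1' ];

my @good = braced( 'first', [ 'a', 'b', 'c' ] );
my @bad  = unbraced( 'first', [ 'a', 'b', 'c' ] );

print "braced:   @good\n";
print "unbraced: @bad\n";
```

The braced version returns a, b, c. The unbraced one returns x1, which came out of $_ and never looked at the argument.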

Regular Expressions

Perl did one thing well: regular expressions. It puts regular expression constants into the core syntax, just as most languages have string constants. You have "string constant" or /regular expression constant/.

A Perl statement has this format:

[optional statement part] [# optional comment part]

A regular expression to recognize this is /[^#]*#.*/. In English, that's any single character other than "#", repeated zero or more times, followed by "#", followed by zero or more characters of any kind. Add parentheses: /([^#]*)(#.*)/ and you can refer to the parts in your code as $1 and $2.
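As a sketch, here is the parenthesized pattern applied to one sample line, writing the negated character class as [^#]. The sample line and variable names are made up for illustration:

```perl
use strict;
use warnings;

# Single quotes, so nothing inside the sample line is interpolated.
my $line = 'my $total = $x + $y;  # add the two values';

# In list context a match returns the captured groups.
my ( $statement, $comment ) = $line =~ /([^#]*)(#.*)/;

print "statement: [$statement]\n";
print "comment:   [$comment]\n";
```

The statement part keeps its trailing spaces, since [^#]* takes everything up to the first "#".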

For many text processing chores, regular expressions are so valuable that you forgive their unreadability. The programmer can partially solve this problem with extensive comments. Since Perl, other languages, including Ruby (designed to be a better Perl) and JavaScript, have adopted regex constants delimited by forward slashes. Oddly, Python has not. It provides regular expressions through a library module, re, which is somewhat less convenient.
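One way to put those comments inside the pattern itself is Perl's /x modifier, which ignores unescaped whitespace and treats an unescaped "#" outside a character class as the start of a comment. A sketch, with the variable name $split chosen for illustration:

```perl
use strict;
use warnings;

# With the x modifier, the literal hash sign must be escaped.
# Inside the bracketed class it is already literal.
my $split = qr{
    ( [^#]* )   # group 1: the statement, zero or more non-hash characters
    ( \# .* )   # group 2: the comment, a hash sign and the rest of the line
}x;

my ( $statement, $comment ) = 'print "hi";  # greet the user' =~ $split;

print "statement: [$statement]\n";
print "comment:   [$comment]\n";
```

The comments inside the pattern should avoid "$", "@" and the delimiter characters, because Perl still scans them for interpolation and for the end of the pattern.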

Unfortunately, that sample regular expression won't handle itself as input. It would have /([^ as the statement part and #]*)(#.*)/ as the comment part.
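That failure is easy to reproduce. A sketch that feeds the pattern its own text, with the negated class written as [^#]:

```perl
use strict;
use warnings;

# The pattern's own source text, held in a single-quoted string.
my $line = '/([^#]*)(#.*)/';

my ( $statement, $comment ) = $line =~ /([^#]*)(#.*)/;

print "statement: [$statement]\n";
print "comment:   [$comment]\n";
```

The split lands on the "#" inside the character class, because the pattern has no idea that it is looking at a regex literal.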

The correct statement definition is:

Zero or more characters, including the "#" character if it is part of a string literal or a regex literal, optionally followed by a "#" character and zero or more comment characters.

There's a neat Perl program on my board that reads Perl source and emits HTML output. As the comments are output unprocessed, you can use HTML in the comments: lists, tables, images, ...

However, the processing that splits the Perl source into statement and comment does not use Perl's regex. It reads like a C program: got a quote character? Read forward until the matching close quote. Got a forward slash? Set the "in regex" flag. In regex, got a "["? Set the "in character class" flag. Got a "\"? The next char is escaped. And so it goes.
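A rough sketch of that style of scanner follows. It is not the actual pl2html code. The sub name split_statement_comment is hypothetical, and it assumes that every "/" outside a string opens a regex, so a division operator would confuse it:

```perl
use strict;
use warnings;

# Returns ( statement, comment ) for one line of Perl source.
sub split_statement_comment {
    my ($line)   = @_;
    my $quote    = '';    # the open quote character, or '' outside a string
    my $in_regex = 0;     # the "in regex" flag
    my $in_class = 0;     # the "in character class" flag
    my @chars    = split //, $line;

    for ( my $i = 0 ; $i < @chars ; $i++ ) {
        my $c = $chars[$i];

        # Inside a string or regex, a backslash escapes the next char.
        if ( $c eq '\\' && ( $quote ne '' || $in_regex ) ) {
            $i++;
            next;
        }
        if ( $quote ne '' ) {
            $quote = '' if $c eq $quote;    # matching close quote
        }
        elsif ($in_regex) {
            if    ($in_class)   { $in_class = 0 if $c eq ']' }
            elsif ( $c eq '[' ) { $in_class = 1 }
            elsif ( $c eq '/' ) { $in_regex = 0 }
        }
        elsif ( $c eq '"' || $c eq "'" ) { $quote    = $c }
        elsif ( $c eq '/' )              { $in_regex = 1 }
        elsif ( $c eq '#' ) {
            # First "#" outside any string or regex starts the comment.
            return ( substr( $line, 0, $i ), substr( $line, $i ) );
        }
    }
    return ( $line, '' );    # no comment on this line
}

my $src = 'if ( $s =~ /([^#]*)(#.*)/ ) { go() }  # split it';
my ( $statement, $comment ) = split_statement_comment($src);

print "statement: [$statement]\n";
print "comment:   [$comment]\n";
```

Unlike the one-line pattern, this gets past the "#" characters inside the regex literal and only splits at the real comment.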

Challenge:

If you love your Perl and your Tim Toady, meet this challenge: write my "pl2html" program using Perl's regex. If you can get that done, I'll substantially revise this post.
