Chapter 7
Perl Regular Expressions

Perl has been featured prominently in this book, and with good reason. It is popular, extremely rich with regular expressions, freely and readily obtainable, easily approachable by the beginner, and available for a wide variety of platforms, including Amiga, DOS, MacOS, NT, OS/2, Windows, VMS, and virtually all flavors of Unix.

Some of Perl's programming constructs superficially resemble those of C or other traditional programming languages, but the resemblance stops there. The way you wield Perl to solve a problem -- The Perl Way -- is very different from traditional languages. The overall layout of a Perl program often uses the traditional structured and object-oriented concepts, but data processing relies very heavily on regular expressions. In fact, I believe it is safe to say that regular expressions play a key role in virtually all Perl programs. This includes everything from huge 100,000-line systems, right down to simple one-liners, like

 % perl -pi -e 's{([-+]?\d+(\.\d*)?)F\b}{sprintf "%.0fC", ($1-32) * 5/9}eg' *.txt

which goes through *.txt files and replaces Fahrenheit values with Celsius ones (reminiscent of the first example from Chapter 2).
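To see the same substitution at work without touching any files, here is a small sketch that applies it to a single string. The helper name f_to_c is made up for illustration; the regex and replacement are the ones from the one-liner.

```perl
use strict;
use warnings;

# Hypothetical helper: apply the one-liner's substitution to one string.
sub f_to_c {
    my ($line) = @_;
    # /e evaluates the replacement as Perl code; /g repeats for every match.
    $line =~ s{([-+]?\d+(\.\d*)?)F\b}{sprintf "%.0fC", ($1 - 32) * 5/9}eg;
    return $line;
}

print f_to_c("Water boils at 212F and freezes at 32F.\n");
```

With the sample line above, both temperatures are rewritten in place and the rest of the text is left alone.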

In This Chapter

In this chapter we'll look at everything regex about Perl -- the details of its regex flavor and the operators that put them to use. This chapter presents the regex-relevant details from the ground up, but I assume that you have at least a basic familiarity with Perl. (If you've read Chapter 2, you're probably already familiar enough to at least start using this chapter.) I'll often use, in passing, concepts that have not yet been examined in detail, and won't dwell much on non-regex aspects of the language. It might be a good idea to keep the Perl manpage handy, or perhaps O'Reilly's Perl 5 Desktop Reference (see Appendix A).

Perhaps more important than your current knowledge of Perl is your desire to understand more. This chapter is not light reading by any measure. Because it's not my aim to teach Perl from scratch, I am afforded a luxury that general books about Perl do not have: I don't have to omit important details in favor of weaving one coherent story that progresses unbroken through the whole chapter. What remains coherent throughout is the drive for a total understanding. Some of the issues are complex, and the details thick; don't be worried if you can't take it all in at once. I recommend first reading through to get the overall picture, and returning in the future to use as a reference as needed.

Ideally, it would be nice if I could cleanly separate the discussion of the regex flavor from the discussion on how to apply them, but with Perl the two are inextricably intertwined. To help guide your way, here's a quick rundown of how this chapter is organized:

The Perl Way

Table 7-1 summarizes Perl's extremely rich regular-expression flavor. If you're new to Perl after having worked with another regex-endowed tool, many items will be unfamiliar. Unfamiliar, yes, but tantalizing! Perl's regular-expression flavor is perhaps the most powerful among today's popular tools. One key to more than a superficial understanding of Perl regular expressions is knowing that it uses a Traditional NFA regex engine. This means that all the NFA techniques described in the previous chapters can be brought to bear in Perl.

Yet, hacker does not live by metacharacters alone. Regular expressions are worthless without a means to apply them, and Perl does not let you down here either. In this respect, Perl certainly lives up to its motto ``There's more than one way to do it.''


Table 7-1: Overview of Perl's Regular-Expression Language
  . (dot)                       any byte except newline (any byte at all
                                with the /s modifier [1])
  |                             alternation
  (...)                         normal grouping and capturing
  (?:...)                       pure grouping only [1]
  (?=...)                       positive lookahead [1]
  (?!...)                       negative lookahead [1]

  greedy quantifiers            *  +  ?  {n}  {min,}  {min,max}
  non-greedy quantifiers [1]    *?  +?  ??  {n}?  {min,}?  {min,max}?

  (?#...)                       comment [1]
  #...  (with /x mod)           comment [2] until newline or end of regex
  inlined modifiers [1]         (?mods)  mods from among i, x, m, and s

  anchors
  \b  \B                        word/non-word anchors [3]
  ^  $                          start/end of string (or start and end of
                                logical line)
  \A  \Z                        start/end of string [1]
  \G                            end of previous match [1]

  \1, \2, etc.                  text previously matched by associated set
                                of capturing parentheses
  [...]  [^...]                 normal and inverted character classes

  (the items below are also valid within a character class)
  character shorthands          \b [3]  \t  \n  \r  \f  \a  \e  \num  \xnum  \cchar
  class shorthands              \w  \W  \s  \S  \d  \D
  \l  \u  \L  \U  \Q [1]  \E    on-the-fly text modification

  [1] Not available before Version 5.000
  [2] Not reliably available before Version 5.002
  [3] \b is an anchor outside of a character class, a shorthand inside

Regular Expressions as a Language Component

An attractive feature of Perl is that regular expression support is so deftly built in as part of the language. Rather than providing stand-alone functions for applying regular expressions, Perl provides regular-expression operators that mesh well with the rich set of other operators and constructs that make up the Perl language. Table 7-2 briefly notes the language components that have some direct connection with regular expressions.

Perhaps you never considered ``... =~ m/.../'' to be an operator, but just as addition's + is an operator which takes two operands and returns a sum, the match is an operator that takes two operands, a regex operand and a target-string operand, and returns a value. As discussed in Chapter 5's ``Functions vs. Integrated Features vs. Objects'' (=>159), the main difference between a function and an operator is that operators can treat their operands in magical ways that a function normally can't (footnote 4). And believe me, there's lots of magic surrounding Perl's regex operators. But remember what I said in Chapter 1: There's nothing magic about magic if you understand what's happening. This chapter is your guide.

4  The waters are muddied in Perl, where functions and procedures can also respond to their context, and can even modify their arguments. I try to draw a sharp distinction with the regex features because I wish to highlight characteristics which can easily lead to misunderstandings.

There are some not-so-subtle differences between a regular expression and a regular-expression operand. You provide a raw regex operand in your script, Perl cooks it a bit, then gives the result to the regex search engine. The preprocessing (cooking) is similar to, but not exactly the same as, what's done for doublequoted strings. For a superficial understanding, these distinctions are not important -- which is why I'll explain and highlight the differences at every opportunity!

Don't let the shortness of Table 7-2's ``regex-related operators'' section fool you. Each of those three operators really packs a punch with a variety of options and special-case uses.


Table 7-2: Overview of Perl's Regex-Related Items
  Regex-Related Operators
    m/regex/mods *
    s/regex/subst/mods *
    split(...)
    * operates on $_ unless related via =~ or !~

  Modifiers                     modifies how...
    /x [5]  /o                  regex is interpreted
    /s [5]  /m [5]  /i          engine considers target text
    /g  /e                      other

  Related Variables
    $_                          default search target
    $*                          obsolete multi-line mode

  After-Match Variables
    $1, $2, etc.                captured text
    $+                          highest filled $1, $2, ...
    $`  $&  $'                  text before, of, and after match
                                (best to avoid -- see ``Perl Efficiency Issues'')

  Related Functions
    pos   study   quotemeta   lc   lcfirst   uc   ucfirst

  [5] Not available before Version 5.000


Perl's Greatest Strength

The richness of variety and options among the operators and functions is perhaps Perl's greatest feature. They can change their behavior depending on the context in which they're used, often doing just what the author naturally intends in each differing situation. In fact, the Second Edition of O'Reilly's Programming Perl goes so far as to boldly state ``In general, Perl operators do exactly what you want....'' The regex match operator m/regex/, for example, offers a wide variety of different functionality depending upon where, how, and with which modifiers it is used. The flexibility is amazing.

Perl's Greatest Weakness

This concentrated richness in expressive power is also one of Perl's least-attractive features. There are innumerable special cases, conditions, and contexts that seem to change out from under you without warning when you make a subtle change in your code -- you've just hit another special case you weren't aware of (footnote 6). The Programming Perl quote continues ``...unless you want consistency.'' Certainly, when it comes to computer science, there is a certain art-like appreciation to boring, consistent, dependable interfaces. Perl's power can be a devastating weapon in the hands of a skilled user, but it seems with Perl, you become skilled by repeatedly shooting yourself in the foot.

6  That they're innumerable doesn't stop this chapter from trying!

In the Spring 1996 issue of The Perl Journal (footnote 7), Larry Wall wrote:

One of the ideas I keep stressing in the design of Perl is that
things that ARE different should LOOK different.

7  See http://tpj.com/ or staff@tpj.com

This is a good point, but with the regular expression operators, differences unfortunately aren't always readily apparent. Even skilled hackers can get hung up on the myriad of options and special cases. If you consider yourself an expert, don't tell me you've never wasted way too much time trying to understand why

if (m/.../g) {
 :
wasn't working, because I probably wouldn't believe you. Everyone does it eventually. (If you don't consider yourself an expert and don't understand what's wrong with the example, have no fear: this book is here to change that!)
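For the impatient, here is a minimal sketch of one way such a test can go wrong. It is only my guess at the kind of surprise meant here, not the whole story: in a scalar context, a successful m/.../g remembers where it stopped, and the next m/.../g on the same string resumes from that point rather than from the beginning. The sample text and variable names are invented for illustration.

```perl
use strict;
use warnings;

my $text = "alpha beta";
my @log;

if ($text =~ m/beta/g) {        # succeeds, and /g leaves pos($text) just past "beta"
    push @log, "first test matched, pos=" . pos($text);
}
if ($text =~ m/alpha/g) {       # resumes from that pos, so "alpha" is never seen
    push @log, "second test matched";
} else {
    push @log, "second test failed";
}
print "$_\n" for @log;
```

Dropping the /g from a plain true/false test like these avoids the issue, since without /g each match starts at the beginning of the string.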


In the same article, Larry also wrote:

In trying to make programming predictable, computer scientists have
mostly succeeded in making it boring.

This is also true, but it struck me as rather funny since I'd written ``there is a certain art-like appreciation to boring, consistent, dependable interfaces'' for this section's introduction only a week before reading Larry's article! My idea of ``art'' usually involves more engineering than paint, so what do I know? In any case, I highly recommend Larry's entertaining and thought-provoking article for some extremely insightful comments on Perl, languages, and yes, art.

A Chapter, a Chicken, and The Perl Way

Perl regular expressions provide so many interrelated concepts to talk about that it's the classic chicken and egg situation. Since the Chapter 2 introduction to Perl's approach to regular expressions was light, I'd like to present a meaty example to get the ball rolling before diving into the gory intricacies of each regex operator and metacharacter. I hope it will become familiar ground, because it discusses many concepts addressed throughout this chapter. It illustrates The Perl Way to approach a problem, points out pitfalls you might fall into, and may even hatch a few chickens from thin air.

An Introductory Example: Parsing CSV Text

Let's say that you have some $text from a CSV (Comma Separated Values) file, as might be output by dBASE, Excel, and so on. That is, a file with lines such as:
"earth",1,,"moon",9.374

This line represents five fields. It's reasonable to want this information as an array, say @fields, such that $fields[0] was `earth', $fields[1] was `1', $fields[2] was undefined, and so forth. This means not only splitting the data into fields, but removing the quotes from quoted fields. Your first instinct might be to use split, along the lines of:

@fields = split(/,/, $text);

This finds places in $text where , matches, and fills @fields with the snippets that those matches delimit (as opposed to the snippets that the matches match).

Unfortunately, while split is quite useful, it is not the proper hammer for this nail. It's inadequate because, for example, matching only the comma as the delimiter leaves those doublequotes we want to remove. Using "?,"? helps to solve this, but there are still other problems. For example, a quoted field is certainly allowed to have commas within it, and we don't want to treat those as field delimiters, but there's no way to tell split to leave them alone.
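A quick sketch makes the shortcoming concrete. The sample line is invented for illustration; it has three fields, the first of which contains a comma inside its quotes.

```perl
use strict;
use warnings;

my $text = '"Smith, John",42,"x"';        # three fields; the first has an embedded comma

my @fields = split(/"?,"?/, $text);       # the "improved" delimiter from the text

print scalar(@fields), " pieces\n";
print "[$_]\n" for @fields;
```

The split produces four pieces rather than three, because the comma inside the quoted name is treated as a delimiter, and the very first and very last doublequotes are left in place because no comma sits next to them.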


Perl's full toolbox offers many solutions; here's one I've come up with:

@fields = (); # initialize @fields to be empty
while ($text =~ m/"([^"\\]*(\\.[^"\\]*)*)",?|([^,]+),?|,/g) {
   push(@fields, defined($1) ? $1 : $3); # add the just-matched field
}
push(@fields, undef) if $text =~ m/,$/; # account for an empty last field
# Can now access the data via @fields ...

Even experienced Perl hackers might need more than a second glance to fully grasp this snippet, so let's go over it a bit.

Regex operator context

The regular expression here is the somewhat imposing:

  "([^"\\]*(\\.[^"\\]*)*)",?|([^,]+),?|,

Frankly, looking at the expression is not meaningful until you also consider how it is used. In this case, it is applied via the match operator, using the /g modifier, as the conditional of a while. This is discussed in detail in the meat of the chapter, but the crux of it is that the match operator behaves differently depending on how and where it is used. In this case, the body of the while loop is executed once each time the regex matches in $text. Within that body are available any $&, $1, $2, etc., set by each respective match.

Details on the regular expression itself

That regular expression isn't really as imposing as it looks. From a top-level view, it is just three alternatives. Let's look at what these expressions mean from a local, standalone point of view. The three alternatives are:

 "([^"\\]*(\\.[^"\\]*)*)",?
This is our friend from Chapter 5 to match a doublequoted string, here with ,? appended. The outer parentheses (the pair just inside the doublequotes) add no meaning to the regular expression itself, so they are apparently used only to capture text to $1. This alternative obviously deals with doublequoted fields of the CSV data.

 ([^,]+),?
This matches a non-empty sequence of non-commas, optionally followed by a comma. Like the first alternative, the parentheses are used only to capture matched text -- this time everything up to a comma (or the end of text). This alternative deals with non-quoted fields of the CSV data.

 ,  
Not much to say here -- it simply matches a comma.


These are easy enough to understand as individual components, except perhaps the significance of the ,?, which I'll get to in a bit. However, they do not tell us much individually -- we need to look at how they are combined and how they work with the rest of the program.

How the expression is actually applied

Using the combination of while and m/.../g to apply the regex repeatedly, we want the expression to match once for each field of the CSV line. First, let's consider how it matches just the first time it's applied to any given line, as if the /g modifier were not used.

The three alternatives represent the three types of fields: quoted, unquoted, and empty. You'll notice that there's nothing in the second alternative to stop it from matching a quoted field. There's no need to disallow it from matching what the first alternative can match since Perl's Traditional NFA's non-greedy alternation guarantees the first alternative will match whatever it should, never leaving a quoted field for the second alternative to match against our wishes.
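The importance of the ordering can be seen in a small sketch that applies the same three alternatives in two different orders. The sample text and variable names are invented for illustration, and I've wrapped the whole regex in one set of capturing parentheses (with (?:...) inside) simply to grab the overall match.

```perl
use strict;
use warnings;

my $text = '"a,b",c';

# Quoted alternative first, as in the chapter's regex.
my ($quoted_first) =
    $text =~ m/("[^"\\]*(?:\\.[^"\\]*)*",?|[^,]+,?|,)/;

# Same alternatives, but with the unquoted one moved to the front.
my ($unquoted_first) =
    $text =~ m/([^,]+,?|"[^"\\]*(?:\\.[^"\\]*)*",?|,)/;

print "quoted first:   $quoted_first\n";
print "unquoted first: $unquoted_first\n";
```

With the quoted alternative first, the entire quoted field (embedded comma and all) is consumed. With the order reversed, the unquoted alternative grabs everything up to the embedded comma, which is exactly the mistake the original ordering avoids.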

You'll also notice that by whichever alternative the first field is matched, the expression always matches through to the field-separating comma. This has the benefit of leaving the current position of the /g modifier right at the start of the next field. Thus, when the while plus m/.../g combination iterates and the regular expression is applied repeatedly, we are always sure to begin the match at the start of a field. This ``keeping in synch'' concept can be quite important in many situations where the /g modifier is used. In fact, it is exactly this reason why I made sure the first two alternatives ended with ,?. (The question mark is needed because the final field on the line, of course, will not have a trailing comma.) We'll definitely be seeing this keeping-in-synch concept again.
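One way to watch the keeping-in-synch effect is to record pos($text) after each iteration. This is just an illustrative sketch using the sample line from earlier; the array name is my own.

```perl
use strict;
use warnings;

my $text = '"earth",1,,"moon",9.374';
my @next_start;

while ($text =~ m/"([^"\\]*(\\.[^"\\]*)*)",?|([^,]+),?|,/g) {
    # pos() reports where the next /g attempt will begin
    push @next_start, pos($text);
}

print "next attempt begins at: @next_start\n";
```

Each recorded offset falls immediately after a field-separating comma (or at the end of the string for the final field), so every iteration starts cleanly at the beginning of a field.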

Now that we can match each field, how do we fill @fields with the data from each match? Let's look at $1 and such. If the field is quoted, the first alternative matches, capturing the text within the quotes to $1. However, if the field is unquoted, the first alternative fails and the second one matches, leaving $1 undefined and capturing the field's text to $3. Finally, on an empty field, the third alternative is the one that matches, leaving both $1 and $3 undefined. This all crystallizes to:

push(@fields, defined($1) ? $1 : $3);

The conditional expression defined($1) ? $1 : $3 says ``Use $1 if it is defined, or $3 otherwise.'' If neither is defined, the result is $3's undef, which is just what we want from an empty field. Thus, in all cases, the value that gets added to @fields is exactly what we desire, and the combination of the while loop, the /g modifier, and keeping-in-synch allows us to process all the fields.

Well, almost all the fields. If the last field is empty (as demonstrated by a line-ending comma), our program accounts for it not with the main regex, but after the fact with a separate line to add an undef to the list. In these cases, you might think that we can just let the main regular expression match the nothingness at the end of the line. This would work for lines that do end with an empty field, but would tack on a phantom empty field to lines that don't, since all lines have nothingness at the end, so to speak.
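Pulling the pieces together, the snippet can be wrapped in a subroutine and tried on a few lines. This is only a sketch: the name parse_csv_line and the '<undef>' placeholder used for printing are invented for illustration.

```perl
use strict;
use warnings;

# Hypothetical wrapper around the chapter's loop.
sub parse_csv_line {
    my ($text) = @_;
    my @fields = ();
    while ($text =~ m/"([^"\\]*(\\.[^"\\]*)*)",?|([^,]+),?|,/g) {
        push(@fields, defined($1) ? $1 : $3);   # quoted text, unquoted text, or undef
    }
    push(@fields, undef) if $text =~ m/,$/;     # account for an empty last field
    return @fields;
}

my @fields = parse_csv_line('"earth",1,,"moon",9.374');
print join('|', map { defined($_) ? $_ : '<undef>' } @fields), "\n";

my @trailing = parse_csv_line('a,b,');          # line ending with an empty field
print scalar(@trailing), " fields\n";
```

The first call yields five fields with the third undefined; the second shows the after-the-fact check adding the empty final field that the main regex cannot see.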

Regular Expressions and The Perl Way

Despite getting just a bit bogged down with the discussion of the particular regular expression used, I believe that this has served as an excellent example for introducing regular expressions and The Perl Way:

Perl Unleashed

Larry Wall released Perl to the world in December 1987, and it has been continuously updated since. Version 1 used a regular-expression engine based upon Larry's own news browser rn, which in turn was based upon the regex engine in James Gosling's version of Emacs (the first Emacs for Unix). Not a particularly powerful regex flavor, it was replaced in Version 2 by an enhanced version of Henry Spencer's widely used regular-expression package. With a powerful regular expression flavor at its disposal, the Perl language and regular expression features soon became intertwined.

Version 5, Perl5 for short, was officially released in October 1994 and represented a major upgrade. Much of the language was redesigned, and many regular expression features were added or modified. One problem Larry Wall faced when creating the new features was the desire for backward compatibility. There was little room for Perl's regular expression language to grow, so he had to fit in the new notations where he could. The results are not always pretty, with many new constructs looking ungainly to novice and expert alike. Ugly yes, but as we will see, extremely powerful.

Perl4 vs. Perl5

Because so many things were updated in version 5.000, many continued to use the long-stable version 4.036 (Perl4 for short). Despite Perl5's maturity, at the time of this writing Perl4 still remains, perhaps unfortunately (footnote 9), in use. This poses a problem for someone (at the time of this writing, me) writing about Perl. You can write code that will work with either version, but doing so causes you to lose out on many of the great features new with Perl5. For example, a modern version of Perl allows the main part of the CSV example to be written as:

push(@fields, $+) while $text =~ m{
    "([^\"\\]*(?:\\.[^\"\\]*)*)",?  # Standard quoted string (with possible comma)
  | ([^,]+),?                       # or  up to next comma (with possible comma)
  | ,                               # or  just a comma.
}gx;

Those comments are actually part of the regular expression... much more readable, don't you think? As we'll see, various other features of Perl5 make this solution more appealing -- we'll re-examine this example in ``Putting It All Together'' (=>290) once we've gone over enough details to make sense of it all.

9  Tom Christiansen suggested that I use ``dead flea-bitten camel carcasses'' instead of ``Perl4'', to highlight that anything before version 5 is dead and has been abandoned by most. I was tempted.

I'd like to concentrate on Perl5, but ignoring Perl4 ignores a lingering reality. I'll mention important Perl4 regex-related differences using markings that look like this: . This indicates that the first Perl4 note, related to whatever comment was just made, can be found on page 305. Since Perl4 is so old, I won't feel the need to recap everything in Perl4's manpage -- for the most part, the Perl4 notes are short and to the point, targeting those unfortunate enough to have to maintain code for both versions, such as a site's Perlmaster. Those just starting out with Perl should definitely not be concerned with an old version like Perl4.

Perl5 vs. Perl5

To make matters more complex, during the heady days surrounding Perl5's early releases, discussions on what is now USENET's comp.lang.perl.misc resulted in a number of important changes to the language and to its regular expressions. For example, one day I was responding to a post with an extraordinarily long regular expression, so I ``pretty-printed'' it, breaking it across lines and otherwise adding whitespace to make it more readable. Larry Wall saw the post, thought that one really should be able to express regexes that way, and right there and then added Perl5's /x modifier, which causes most whitespace to be ignored in the associated regular expression.

Around the same time, Larry added the (?#...) construct to allow comments to be embedded within a regex. A few months later, though, after discussions in the newsgroup, a raw `#' was also made to start a comment if the /x modifier were used. This appeared in version 5.002 (footnote 10). There were other bug fixes and changes as well -- if you are using an early release, you might run into incompatibilities as you follow along in this book. I recommend version 5.002 or later.

10  Actually, it appeared in some earlier versions, but did not work reliably.

As this second printing goes to press, version 5.004 is about to hit beta, with a final release targeted for Spring 1997. Regex-related changes in the works include enhanced locale support, modified pos support, and a variety of updates and optimizations. (For example, Table 7-10 will likely become almost empty.) Once 5.004 is officially released, this book's home page (see Appendix A) will note the changes.

Regex-Related Perlisms

There are a variety of concepts that are part of Perl in general, but are of particular interest to our study of regular expressions. The next few sections discuss:

Expression Context

The notion of context is important throughout the Perl language, and in particular, to the match operator. In short, an expression might find itself in one of two contexts: A list context (footnote 11) is one where a list of values is expected, while a scalar context is where a single value is expected.

11  ``List context'' used to be called ``array context''. The change recognizes that the underlying data is a list that applies equally well to an @array, a %hash, and (an, explicit, list).

Consider the two assignments:

$s = expression one;
@a = expression two;

Because $s is a simple scalar variable (holds a single value, not a list), it expects a simple scalar value, so the first expression, whatever it may be, finds itself in a scalar context. Similarly, because @a is an array variable and expects a list of values, the second expression finds itself in a list context. Even though the two expressions might be exactly the same, they might (depending on the case) return completely different values, and cause completely different side effects while they're at it.

Sometimes, the type of an expression doesn't exactly match the type of value expected of it, so Perl does one of two things to make the square peg fit into a round hole: 1) it allows the expression to respond to its context, returning a type appropriate to the expectation, or 2) it contorts the value to make it fit.

Allowing an expression to respond to its context

A simple example is a file I/O operator, such as <MYDATA>. In a list context, it returns a list of all (remaining) lines from the file. In a scalar context, it simply returns the next line.
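Here is a small sketch of the two behaviors. To keep it self-contained it reads from a string rather than a real file, which assumes a Perl recent enough to support in-memory filehandles (5.8 or later); the variable names and sample data are invented.

```perl
use strict;
use warnings;

my $data = "line one\nline two\nline three\n";
open(my $fh, '<', \$data) or die "can't open in-memory file: $!";

my $first = <$fh>;    # scalar context: just the next line
my @rest  = <$fh>;    # list context: all the remaining lines
close($fh);

print "first: $first";
print "remaining lines: ", scalar(@rest), "\n";
```

The same operator, written the same way, hands back one line in the first assignment and everything left in the second, purely because of what sits on the left of the equals sign.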

Many Perl constructs respond to their context, and the regex operators are no different. The match operator m/.../, for example, sometimes returns a simple true/false value, and sometimes a list of certain match results. All the details are found later in this chapter.
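As a taste of what's to come, this sketch applies a match in the three most common situations. The date string and variable names are merely illustrative, and the full rules are more involved than these few lines suggest.

```perl
use strict;
use warnings;

my $date = "1997-03-15";

# Scalar context: a simple true/false result.
my $ok = ($date =~ m/^(\d+)-(\d+)-(\d+)$/);

# List context: the text captured by each set of parentheses.
my ($year, $mon, $day) = $date =~ m/^(\d+)-(\d+)-(\d+)$/;

# List context with /g and no capturing parentheses: every match.
my @numbers = $date =~ m/\d+/g;

print "matched\n" if $ok;
print "year=$year mon=$mon day=$day\n";
print "numbers: @numbers\n";
```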

Contorting an expression

If a list context still ends up getting a scalar value, a list containing the single value is made on the fly. Thus, @a = 42 is the same as @a = (42). On the other hand, there's no general rule for converting a list to a scalar. If a literal list is given, such as with
$var = ($this, &is, 0xA, 'list');
the comma-operator returns the last element, 'list', for $var. If an array is given, as with $var = @array, the length of the array is returned.
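A short sketch of these three conversions follows; the variable names are invented. Warnings are deliberately left off here, because Perl would otherwise complain about the discarded constants in the literal-list case.

```perl
use strict;

my @a = 42;                               # scalar in a list context: same as (42)

my @array = ('x', 'y', 'z');
my $count = @array;                       # array in a scalar context: its length

my $last = ('this', 'is', 'a', 'list');   # literal list: comma operator yields the last item

print scalar(@a), " element in \@a\n";
print "count=$count last=$last\n";
```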

Some words used to describe how other languages deal with this issue are cast, promote, coerce, and convert, but I feel they are a bit too consistent (boring?) to describe Perl's attitude in this respect.

Dynamic Scope and Regex Match Effects

Global and private variables

On a broad scale, Perl offers two types of variables: global and private (footnote 12). Not available before Perl5, private variables are declared using my(...). Global variables are not declared, but just pop into existence when you use them. Global variables are visible from anywhere within the program, while private variables are visible, lexically, only to the end of their enclosing block. That is, the only Perl code that can access the private variable is the code that falls between the my declaration and the end of the block of code that encloses the my.

12  Perl allows the names of global variables to be partitioned into groups called packages, but the variables are still global.

Dynamically scoped values

Dynamic scoping is an interesting concept that many programming languages don't provide. We'll see the relevance to regular expressions soon, but in a nutshell, you can have Perl save a copy of a global variable that you intend to modify, and restore the original copy automatically at the time when the enclosing block ends. Saving a copy is called creating a new dynamic scope. There are a number of reasons you might want to do this, including:

The first use listed is less important these days because Perl now offers truly local variables via the my directive. Using my creates a new variable completely distinct from and utterly unrelated to any other variable anywhere else in the program. Only the code that lies between the my and the end of the enclosing block can have direct access to the variable.

The extremely ill-named function local creates a new dynamic scope. Let me say up front that the call to local does not create a new variable. Given a global variable, local does three things:

  1. saves an internal copy of the variable's value; and

  2. copies a new value into the variable (either undef, or a value assigned to the local); and

  3. slates the variable to have the original value restored when execution runs off the end of the block enclosing the local.

This means that ``local'' refers only to how long any changes to the variable will last. The global variable whose value you've copied is still visible from anywhere within the program -- if you make a subroutine call after creating a new dynamic scope, that subroutine (wherever it might be located in the script) will see any changes you've made. This is just like any normal global variable. The difference here is that when execution of the enclosing block finally ends, the previous value is automatically restored.

An automatic save and restore of a global variable's value -- that's pretty much all there is to local. For all the misunderstanding that has accompanied local, it's no more complex than the snippet on the right of Table 7-3 illustrates.



Table 7-3: The meaning of local
Normal Perl                                 Equivalent Meaning
{                                           {
  local($SomeVar); #save copy                 my $TempCopy = $SomeVar;
  $SomeVar = 'My Value';                      $SomeVar = undef;
  :                                           $SomeVar = 'My Value';
  :                                           :
  :                                           $SomeVar = $TempCopy;
} #automatically restore $SomeVar           }


(As a matter of convenience, you can assign a value to local($SomeVar), which is exactly the same as assigning to $SomeVar in place of the undef assignment. Also, the parentheses can be omitted to force a scalar context.)

References to $SomeVar while within the block, or within a subroutine called from the block, or within a signal handler invoked while within the block -- any reference from the time the local is called to the time the block is exited -- references `My Value'. If the code in the block (or anyone else, for that matter) modifies $SomeVar, everyone (including the code in the block) sees the modification, but it is lost when the block exits and the original copy is automatically restored.
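The behavior just described can be watched in a small sketch. The report subroutine and the @seen array are invented for illustration, and the global is declared with our only to keep use strict happy, which assumes a more recent Perl than the versions discussed here.

```perl
use strict;
use warnings;

our $SomeVar = 'original';
my @seen;

# Reads the global $SomeVar, wherever it happens to be called from.
sub report { push @seen, $SomeVar }

report();                          # before the block
{
    local $SomeVar = 'My Value';
    report();                      # the callee sees the temporary value
    $SomeVar = 'changed';
    report();                      # later modifications are visible as well
}
report();                          # after the block, the old value is back

print join(', ', @seen), "\n";
```

The subroutine never mentions local, yet what it sees changes with the caller's dynamic scope, and the last call shows the automatic restore.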

As a practical example, consider having to call a function in a poorly written library that generates a lot of ``Use of uninitialized value'' warnings. You use Perl's -w option, as all good Perl programmers should, but the library author apparently didn't. You are exceedingly annoyed by the warnings, but if you can't change the library, what can you do short of giving up -w altogether? Well, you could set a local value of $^W, the in-code debugging flag (the variable name ^W can be either the two characters, caret and `W', or an actual control-W character):

{
    local $^W = 0; # ensure debugging is off.
    &unruly_function(...);
}
# exiting the block restores the original value of $^W

The call to local saves an internal copy of the previous value of the global variable $^W, whatever it might have been. Then that same $^W receives the new value of zero that we immediately scribble in. When unruly_function is executing, Perl checks $^W and sees the zero we wrote, so doesn't issue warnings. When the function returns, our value of zero is still in effect.

So far, everything appears to work just as if you didn't use local. However, when the block is exited right after the subroutine returns, the saved value of $^W is restored. Your change of the value was local, in time, to the lifetime of the block. You'd get the same effect by making and restoring a copy yourself, as in Table 7-3, but local conveniently takes care of it for you.
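To try this out, here is a sketch with a stand-in for the unruly library routine. Everything named here is invented for illustration, and there is one important assumption: the code must not be under a lexical use warnings (a pragma added to Perl after the versions discussed here), because that pragma takes precedence over $^W.

```perl
use strict;
# Deliberately no "use warnings": the lexical pragma would override $^W.

my @warnings;
$SIG{__WARN__} = sub { push @warnings, $_[0] };   # collect warnings rather than print them

# Hypothetical stand-in for the poorly written library routine.
sub unruly_function {
    my $never_set;
    return "value: " . $never_set;    # uses an uninitialized value
}

$^W = 1;                              # turn run-time warnings on, much as -w would

my $r = unruly_function();            # warns
{
    local $^W = 0;                    # ensure debugging is off
    $r = unruly_function();           # silent
}
$r = unruly_function();               # warns again: $^W has been restored

print scalar(@warnings), " warnings collected\n";
```

Only the call made inside the block is quiet; the calls before and after it each add a warning to the list.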

For completeness, let's consider what happens if I use my instead of local (footnote 13). Using my creates a new variable with an initially undefined value. It is visible only within the lexical block it is declared in (that is, visible only by the code written between the my and the end of the enclosing block). It does not change, modify, or in any other way refer to or affect other variables, including any global variable of the same name that might exist. The newly created variable is not visible elsewhere in the program, including from within unruly_function. In our example snippet, the new $^W is immediately set to zero but is never again used or referenced, so it's pretty much a waste of effort. (While executing unruly_function and deciding whether to issue warnings, Perl checks the unrelated global variable $^W.)

13  Perl doesn't allow the use of my with this special variable name, so the comparison is only academic.
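Since $^W itself can't be used with my, the comparison can be sketched with an ordinary global instead. The variable $level and the subroutine current_level are invented for illustration, and the our declaration assumes a more recent Perl than the versions discussed here.

```perl
use strict;
use warnings;

our $level = 'global';

# Always reads the global $level, never a caller's private variable.
sub current_level { return $level }

my ($with_local, $with_my);
{
    local $level = 'temporary';    # a new value for the same global variable
    $with_local = current_level();
}
{
    my $level = 'private';         # a brand-new variable, unrelated to the global
    $with_my = current_level();
}

print "with local: $with_local\n";
print "with my:    $with_my\n";
```

The subroutine sees the value set with local, but is oblivious to the variable created with my, which is visible only to the code inside that second block.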

A better analogy: clear transparencies

A useful analogy for local is that it provides a clear transparency over a variable on which you scribble your own changes. You (and anyone else that happens to look, such as subroutines and interrupt handlers) will see the new values. They shadow the previous value until the point in time that the block is finally exited. At that point, the transparency is automatically removed, in effect, removing any changes that might have been made since the local.

This analogy is actually much closer to reality than the original ``an internal copy is made'' description. Using local doesn't have Perl actually make a copy, but instead puts your new value earlier in the list of those checked whenever a variable's value is accessed (that is, it shadows the original). Exiting a block removes any shadowing values added since the block started. Values are added manually, with local, but some variables have their values automatically dynamically scoped. Before getting into that important regex-related concern, I'd like to present an extended example illustrating manual dynamic scoping.

An extended dynamic-scope example

As an extended, real life, Perl Way example of dynamic scoping, refer to the listing on page 215. The main function is ProcessFile. When given a filename, it opens it and processes commands line by line. In this simple example, there are only three types of commands, processed at [6], [7], and [8]. Of interest here are the global variables $filename, $command, $., and %HaveRead, as well as the global filehandle FILE. When ProcessFile is called, all but %HaveRead have their values dynamically scoped by the local at [3].

Dynamic Scope Example


# Process ``this'' command
sub DoThis                                             # [1]
{
   print "$filename line $.: processing $command";
   # ...
}

# Process ``that'' command
sub DoThat                                             # [2]
{
   print "$filename line $.: processing $command";
   # ...
}

# Given a filename, open file and process commands
sub ProcessFile
{
   local($filename) = @_;                              # [3]
   local(*FILE, $command, $.);

   open(FILE, $filename) || die qq/can't open "$filename": $!\n/;

   $HaveRead{$filename} = 1;                           # [4]

   while ($command = <FILE>)
   {
       if ($command =~ m/^#include "(.*)"$/) {         # [5]
          if (defined $HaveRead{$1}) {
              warn qq/$filename $.: ignoring repeat include of "$1"\n/;
          } else {
              ProcessFile($1);                         # [6]
          }
       } elsif ($command =~ m/^do-this/) {
          DoThis;                                      # [7]
       } elsif ($command =~ m/^do-that/) {
          DoThat;                                      # [8]
       } else {
          warn "$filename $.: unknown command: $command";
       }
   }
   close(FILE);
}                                                      # [9]

When a do-this command is found (at [7]), the DoThis function is called to process it. You can see at [1] that the function refers to the global variables $filename, $., and $command. The DoThis function doesn't know (nor care), but the values of these variables that it sees were written in ProcessFile.

The #include command's processing begins with the filename being plucked from the line at [5]. After making sure the file hasn't been processed already, we call ProcessFile recursively, at [6]. With the new call, the global variables $filename, $command, and $., as well as the filehandle FILE, are again overlaid with a transparency that is soon updated to reflect the status and commands of the second file. When commands of the new file are processed within ProcessFile and the two subroutines, $filename and friends are visible, just as before.

Nothing at this point appears to be different from straight global variables.

The benefits of dynamic scoping are apparent when the second file has been processed and the related call of ProcessFile exits. When execution falls off the block at [9], the related local transparencies laid down at [3] are removed, restoring the original file's values of $filename and such. This includes the filehandle FILE now referring to the first file, and no longer to the second.

Finally, let's look at %HaveRead, used to keep track of files we've seen ([4] and [5]). It is specifically not dynamically scoped because we really do need it to be global across the entire time the script runs. Otherwise, included files would be forgotten each time ProcessFile exits.

Regex side-effects and dynamic-scoping

What does all this about dynamic scope have to do with regular expressions? A lot. Several variables are automatically set as a side effect of a successful match. Discussed in detail in the next section, they are variables like $& (refers to the text matched) and $1 (refers to the text matched by the first parenthesized subexpression). These variables have their value automatically dynamically scoped upon entry to every block.

To see the benefit of this, realize that each call to a subroutine involves starting a new block. For these variables, that means a new dynamic scope is created. Because the values before the block are restored when the block exits (that is, when the subroutine returns), the subroutine can't change the values that the caller sees.

As an example, consider:

if (m/(...)/)
{
    &do_some_other_stuff();
    print "the matched text was $1.\n";
}

Because the value of $1 is dynamically scoped automatically upon entering each block, this code snippet neither cares, nor needs to care, whether the function do_some_other_stuff changes the value of $1 or not. Any changes to $1 by the function are contained within the block that the function defines, or perhaps within a sub-block of the function. Therefore, they can't affect the value this snippet sees with the print after the function returns.
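A small sketch of that snippet, filled in so it can actually be run. The body of do_some_other_stuff is an assumption for illustration; all that matters is that it performs a successful match of its own.

```perl
use strict;
use warnings;

# Hypothetical helper that performs its own successful match.
sub do_some_other_stuff {
    "xyz" =~ m/(y)/;            # sets $1 to "y", but only inside this sub
    return $1;
}

my ($inner, $after);
if ("abcdef" =~ m/(...)/) {
    $inner = do_some_other_stuff();
    $after = $1;                # still the caller's value
}
print "inner saw $inner; caller still sees $after\n";
```

The function sees its own $1 of y, while the caller's $1 is still abc once the function returns.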

The automatic dynamic scoping can be helpful even when not so apparent:

if ($result =~ m/ERROR=(.*)/) {
   warn "Hey, tell $Config{perladmin} about $1!\n";
}

(The standard library module Config defines an associative array %Config, of which the member $Config{perladmin} holds the email address of the local Perlmaster.) This code could be very surprising if $1 were not automatically dynamically scoped. You see, %Config is actually a tied variable, which means that any reference to it involves a behind-the-scenes subroutine call. Config's subroutine to fetch the appropriate value when $Config{...} is used invokes a regex match. That match lies between your match and your use of $1, so without dynamic scoping it would trash the $1 you were about to use. As it is, any changes in the $Config{...} subroutine are safely hidden by dynamic scoping.

Dynamic scoping vs. lexical scoping

Dynamic scoping provides many rewards if used effectively, but haphazard dynamic scoping with local can create a maintenance nightmare. As I mentioned, the my(...) declaration creates a private variable with lexical scope. A private variable's lexical scope is the opposite of a global variable's global scope, but it has little to do with dynamic scoping (except that you can't local the value of a my variable). Remember, local is an action, while my is an action and a declaration.

Special Variables Modified by a Match

A successful match or substitution sets a variety of global, automatically-dynamic-scoped, read-only variables.14 These values never change when a match attempt is unsuccessful, and are always set when a match is successful. As the case may be, they may be set to the empty string (a string with no characters in it), or undefined (a ``no value here'' value very similar to, yet testably distinct from, an empty string). In all cases, however, they are indeed set.
14  As described on page 209, this  notation means that Perl4 note #1 is found on page 305.

$&
A copy of the text successfully matched by the regex. This variable (along with $` and $' below) is best avoided. (See ``Unsociable $& and Friends'' on page 273.) $& is never undefined after a successful match.

$`
A copy of the target text in front of (to the left of) the match's start. When used in conjunction with the /g modifier, you sometimes wish $` to be the text from the start of the match attempt, not the whole string. Unfortunately, it doesn't work that way. If you need to mimic such behavior, you can try using \G([\x00-\xff]*?) at the front of the regex and then refer to $1. $` is never undefined after a successful match.

$'
A copy of the target text after (to the right of) the successfully matched text. After a successful match, the string "$`$&$'" is always a copy of the original target text.15 $' is never undefined after a successful match.

15  Actually, if the original target is undefined, but the match successful (unlikely, but possible), "$`$&$'" would be an empty string, not undefined. This is the only situation where the two differ.
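To make the three variables concrete, here is a short sketch using a target string made up for the purpose. It is meant only as an illustration; as noted above, these variables are best avoided in real code.

```perl
use strict;
use warnings;

my $target = "Subject: regex fun";
my ($pre, $match, $post, $rebuilt);
if ($target =~ m/regex/) {
    ($pre, $match, $post) = ($`, $&, $');
    $rebuilt = "$`$&$'";        # always a copy of the original target
}
print "pre=<$pre> match=<$match> post=<$post>\n";
```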

$1, $2, $3, etc.
The text matched by the 1st, 2nd, 3rd, etc., set of capturing parentheses. (Note that $0 is not included here -- it is a copy of the script name and not related to regexes). These are guaranteed to be undefined if they refer to a set of parentheses that doesn't exist in the regex, or to a set that wasn't actually involved in the match.

These variables are available after a match, including in the replacement operand of s/.../.../. But it makes no sense to use them within the regex itself. (That's what \1 and friends are for.) See ``Using $1 Within a Regex?'' on page 219.

The difference between (\w+) and (\w)+ can be seen in how these variables are set. Both regexes match exactly the same text, but they differ in what is matched within the parentheses. Matching against the string tubby, the first leaves $1 holding tubby, while the second leaves it holding y: the plus is outside the parentheses, so each iteration causes them to start capturing anew.

Also, realize the difference between (x)? and (x?). With the former, the parentheses and what they enclose are optional, so $1 would be either x or undefined. But with (x?), the parentheses enclose a match -- what is optional are the contents. If the overall regex matches, the contents match something, although that something might be the nothingness x? allows. Thus, with (x?) the possible values of $1 are x and an empty string.
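Both distinctions can be checked with a few lines. This is a minimal sketch; the target strings other than tubby are assumptions picked to exercise each case.

```perl
use strict;
use warnings;

my ($whole) = "tubby" =~ m/^(\w+)$/;    # parentheses enclose the whole loop
my ($last)  = "tubby" =~ m/^(\w)+$/;    # parentheses restart on each iteration

"ab" =~ m/^a(x)?b$/;
my $outer_optional = $1;                # parentheses took no part in the match
"ab" =~ m/^a(x?)b$/;
my $inner_optional = $1;                # parentheses matched nothingness

printf "%s %s %s '%s'\n", $whole, $last,
    defined $outer_optional ? "defined" : "undef", $inner_optional;
```

With (x)? the variable is undefined; with (x?) it is defined but empty.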

Perl4 and Perl5 treat unusual cases involving parentheses and iteration via star and friends slightly differently. It shouldn't matter to most of you, but I should at least mention it. Basically, the difference has to do with what $2 will receive when something like (main(OPT)?)+ matches only main and not OPT during the last successful iteration of the plus. With Perl5, because (OPT) did not match during the last successful match of its enclosing subexpression, $2 becomes (rightly, I think) undefined. In such a case, Perl4 leaves $2 as what it had been set to the last time (OPT) actually matched. Thus, with Perl4, $2 is OPT if it had matched any time during the overall match.

$+
A copy of the highest numbered $1, $2, etc. explicitly set during the match. If there are no capturing parentheses in the regex (or none used during the match), it becomes undefined. However, Perl does not issue a warning when an undefined $+ is used.
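One place $+ earns its keep is with alternation, where you don't know in advance which set of parentheses will take part in the match. A sketch, using a simplified quoted-string regex and sample lines that are assumptions for illustration:

```perl
use strict;
use warnings;

my @names;
for my $line (q{name='fred'}, q{name="barney"}) {
    # Either $1 or $2 is set, depending on the kind of quote used.
    if ($line =~ m/^name=(?:'([^']*)'|"([^"]*)")$/) {
        push @names, $+;        # whichever set actually captured
    }
}
print "@names\n";               # prints "fred barney"
```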

When a regex is applied repeatedly with the /g modifier, each iteration sets these variables afresh. This is why, for instance, you can use $1 within the replacement operand of s/.../.../g and have it represent a new slice of text each time. (Unlike the regex operand, the replacement operand is re-evaluated for each iteration; =>255.)

Using $1 within a regex?

The Perl manpage makes a concerted effort to point out that \1 is not available as a backreference outside of a regex. (Use the variable $1 instead.) \1 is much more than a simple notational convenience -- the variable $1 refers to a string of static text matched during some previously completed successful match. On the other hand, \1 is a true regex metacharacter to match text similar to that matched within the first parenthesized subexpression at the time that the regex-directed NFA reaches the \1. What \1 matches might change over the course of an attempt as the NFA tracks and backtracks in search of a match.

A related question is whether $1 is available within a regex operand. The answer is ``Yes, but not the way you might think.'' A $1 appearing in a regex operand is treated exactly like any other variable: its value is interpolated (the subject of the next section) before the match or substitution operation even begins. Thus, as far as the regex is concerned, the value of $1 has nothing to do with the current match, but remains left over from some previous match somewhere else.

In particular, with something like s/.../.../g, the regex operand is evaluated, compiled once (also discussed in the next section), and then used by all iterations via /g. This is exactly opposite of the replacement operand, which is re-evaluated after each match. Thus, a $1 within the replacement operand makes complete sense, but in a regex operand it makes virtually none.
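The following sketch contrasts the three uses. The strings are made up for illustration; the point is only where each $1 or \1 gets its value.

```perl
use strict;
use warnings;

# Replacement operand: $1 and $2 are re-evaluated for each match.
my $text = "a1 b2 c3";
(my $swapped = $text) =~ s/(\w)(\d)/$2$1/g;

# Regex operand: $1 is interpolated before matching begins, so it holds
# whatever a previous match left behind; \1 is the true backreference.
"x" =~ m/(x)/;                                  # leaves $1 as "x"
my $with_var     = "yy" =~ m/(y)$1/ ? 1 : 0;    # regex is really (y)x
my $with_backref = "yy" =~ m/(y)\1/ ? 1 : 0;    # matches the doubled y

print "$swapped $with_var $with_backref\n";     # prints "1a 2b 3c 0 1"
```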

``Doublequotish Processing'' and Variable Interpolation

Strings as operators

Most people think of strings as constants, and in practice, a string like
$month = "January";
is just that. $month gets the same value each time the statement is executed because "January" never changes. However, Perl can interpolate variables within doublequoted strings (that is, have the variable's value inserted in place of its name). For example, in
$message = "Report for $month:";
the value that $message gets depends on the value of $month, and in fact potentially changes each time the program does this assignment. The doublequoted string "Report for $month:" is exactly the same as:
'Report for ' . $month . ':'

(In a general expression, a lone period is Perl's string-concatenation operator; concatenation is implicit within a doublequoted string.)


The doublequotes are really operators that enclose operands. A string such as

"the month is $MonthName[&GetMonthNum]!"
is the same as the expression
'the month is ' . $MonthName[&GetMonthNum] . '!'
including the call to GetMonthNum each time the string is evaluated. Yes, you really can call functions from within doublequoted strings -- because doublequotes are operators! To create a true constant, Perl provides singlequoted strings, used here in the code snippet equivalences.

One of Perl's unique features is that a doublequoted string doesn't have to actually be delimited by doublequotes. The qq/.../ notation provides the same functionality as "...", so qq/Report for $month:/ is a doublequoted string. You can also choose your own delimiters. The following example uses qq{...} to delimit the doublequoted string:

warn qq{"$ARGV" line $.: $ErrorMessage\n};

Singlequoted strings use q/.../, rather than qq/.../. Regular expressions use m/.../ and s/.../.../ for match and substitution, respectively. The ability to pick your own delimiters for regular expressions, however, is not unique to Perl: ed and its descendants have supported it for over 25 years.
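To see that a doublequoted string really is re-evaluated each time, consider this sketch. The array contents and the counting version of GetMonthNum are assumptions invented for the demonstration.

```perl
use strict;
use warnings;

my @MonthName = qw(Jan Feb Mar);
my $calls = 0;
sub GetMonthNum { return $calls++ }     # hypothetical: returns 0, then 1, ...

# The same string text yields a different value each time it is evaluated.
my $first  = "the month is $MonthName[&GetMonthNum]!";
my $second = "the month is $MonthName[&GetMonthNum]!";

# qq{...} is the same operator with different delimiters; the explicit
# concatenation spells out what the interpolation does.
my $with_qq  = qq{the month is $MonthName[2]!};
my $by_parts = 'the month is ' . $MonthName[2] . '!';

print "$first\n$second\n";
```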

Regular expressions as strings, strings as regular expressions

All this is relevant to regular expressions because the regex operators treat their regex operands pretty much (but not exactly) like doublequoted strings, including support for variable interpolation:
$field = "From";
 :
if ($headerline =~ m/^$field:/) {
 :
}

The $field in the regex operand is taken as a variable reference and is replaced by the variable's value, resulting in ^From: being the regular expression actually used. ^$field: is the regex operand; ^From: is the actual regex after cooking. This seems similar to what happens with doublequoted strings, but there are a few differences (details follow shortly), so it is often said that a regex operand receives ``doublequotish'' processing.

One might view the mathematical expression ($F - 32) * 5/9 as
  Divide( Multiply( Subtract($F, 32), 5), 9 )
to show how the evaluation logically progresses. It might be instructive to see $headerline =~ m/^$field:/ presented in a similar way:
  RegexMatch( $headerline, DoubleQuotishProcessing(``^$field:'') )

Thus, you might consider ^$field: to be a match operand only indirectly, as it must pass through doublequotish processing first. Let's look at a more involved example:

$single = qq{'([^'\\\\]*(?:\\\\.[^'\\\\]*)*)'}; # to match a singlequoted string
$double = qq{"([^"\\\\]*(?:\\\\.[^"\\\\]*)*)"}; # to match a doublequoted string
$string = "(?:$single|$double)";  # to match either kind of string
 :
while (<CONFIG>) {
    if (m/^name=$string$/o) {
        $config{name} = $+;
    } else {
 :

This method of building up the variable $string and then using it in the regular expression is much more readable than writing the whole regex directly:

if (m/^name=(?:'([^'\\]*(?:\\.[^'\\]*)*)'|"([^"\\]*(?:\\.[^"\\]*)*)")$/o) {

Several important issues surround this method of building regular expressions within strings, such as all those extra backslashes, the /o modifier, and the use of (?:...) non-capturing parentheses. See ``Matching an Email Address'' (=>294) for a heady example. For the moment, let's concentrate on just how a regex operand finds its way to the regex search engine. To get to the bottom of this, let's follow along as Perl parses the program snippet:

$header =~ m/^\Q$dir\E # base directory
             \/        # separating slash
             (.*)      # grab the rest of the filename
            /xgm;

For the example, we'll assume that $dir contains `~/.bin'.

The \Q...\E that wraps the reference to $dir is a feature of doublequotish string processing that is particularly convenient for regex operands. It puts a backslash in front of most symbol characters. When the result is used as a regex, it matches the literal text the \Q...\E encloses, even if that literal text contains what would otherwise be considered regex metacharacters (with our example of `~/.bin', the first three characters are escaped, although only the dot requires it).
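The effect of \Q...\E can be inspected with the built-in quotemeta function, which applies the same escaping. In this sketch the $header values are assumptions; $dir is the value from the running example.

```perl
use strict;
use warnings;

my $dir    = '~/.bin';
my $header = '~/.bin/perl';

my $escaped = quotemeta($dir);          # what \Q...\E produces for $dir
my ($rest)  = $header =~ m/^\Q$dir\E\/(.*)/;

# Without \Q, the dot is a metacharacter and matches any character.
my $loose = '~/xbin/perl' =~ m/^$dir\//     ? 1 : 0;
my $exact = '~/xbin/perl' =~ m/^\Q$dir\E\// ? 1 : 0;

print "$escaped $rest $loose $exact\n";
```

The unprotected version happily accepts ~/xbin, which is exactly the kind of quiet mismatch that \Q...\E prevents.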

Also pertinent to this example are the free whitespace and comments within the regex. Starting with Perl version 5.002, a regex operand subject to the /x modifier can use whitespace liberally, and may have raw comments. Raw comments start with # and continue to the next newline (or to the end of the regex). This is but one way that doublequotish processing of regex operands differs from real doublequoted strings: the latter has no such /x modifier.

Figure 7-1 on page 223 illustrates the path from unparsed script, to regex operand, to real regex, and finally to the use in a search. Not all the phases are necessarily done at the same time. The lexical analysis (examining the script and deciding what's a statement, what's a string, what's a regex operand, and so on) is done just once when the script is first loaded (or in the case of eval with a string operand, each time the eval is re-evaluated). That's the first phase of Figure 7-1. The other phases can be done at different times, and perhaps even multiple times. Let's look at the details.

Phase A -- identifying the match operand

The first phase simply tries to recognize the lexical extent of the regex operand. Perl finds m, the match operator, so it knows to scan a regex operand. At point 1 in the figure, it recognizes the slash as the operand delimiter, then searches for the closing delimiter, finding it at point 4. For this first phase, strings and such get the same treatment, so there's no special regex-related processing here. One transformation that takes place in this phase is that the backslash of an escaped closing delimiter is removed, as at point 3.



Figure 7-1: Perl parsing, from program text to regular expression engine

Phase B -- doublequotish processing

The second phase of regex-operand parsing treats the isolated operand much like a doublequoted string. Variables are interpolated, and \Q...\E and such are processed. (The full list of these constructs is given in Table 7-8 on page 245.) With our example, the value of $dir is interpolated under the influence of \Q, so `\~\/\.bin' is actually inserted into the operand.

Although similar, differences between regex operands and doublequoted strings become apparent in this phase. Phase B realizes it is working with a regex operand, so processes a few things differently. For example, \b and \3 in a doublequoted string always represent a backspace and an octal escape, respectively. But in a regex, they could also be the word-boundary metacharacter or a backreference, depending on where in the regex they're located -- Phase B therefore leaves them undisturbed for the regex engine to later interpret as it sees fit. Another difference involves what is and isn't considered a variable reference. Something like $| will always be a variable reference within a string, but it is left for the regex engine to interpret as the metacharacters $ and |. Similarly, a string always interprets $var[2-7] as a reference to element -5 of the array @var (which means the fifth element from the end), but as a regex operand, it is interpreted as a reference to $var followed by a character class. You can use the ${...} notation for variable interpolation to force an array reference if you wish:

  ${var[2-7]}.

Due to variable interpolation, the result of this phase can depend on the value of variables (which can change as the program runs). In such a case, Phase B doesn't take place until the match code is reached during runtime. ``Perl Efficiency Issues'' (=>265) expands on this important point.

As discussed in the match and substitution operator sections later in this chapter, using a singlequote as the regex-operand delimiter invokes singlequotish processing. In such a case, this Phase B is skipped.

Phase C -- /x processing

Phase C concerns only regex operands applied with /x. Whitespace (except within a character class) and comments are removed. Because this happens after Phase B's variable interpolation, whitespace and comments brought in from a variable end up being removed as well. This is certainly convenient, but there's one trap to watch out for. The # comments continue to the next newline or the end of the operand -- this is different from ``until the end of an interpolated string.'' Consider if we'd included a comment at the end of page 221's $single:
$single = qq{'(...regex here...)'  # for singlequoted strings};

This value makes its way to $string and then to the regex operand. After Phase B, the operand is:

  ^name=(?:'(...)'·#·for·singlequoted·strings|"(...)")$

The intended comment is only `#·for·singlequoted·strings', but the real comment runs from the # all the way to the end of the operand.

Surprise! The comment intended only for $single ends up wiping out what follows because we forgot to end the comment with a newline.


If, instead, we use:

$single = qq{'(...regex here...)'  # for singlequoted strings\n};
everything is fine because the \n is interpreted by the doublequoted string, providing an actual newline character to the regex when it's used. If you use a singlequoted q{...} instead, the regex receives the raw \n, which matches a newline, but is not a newline. Therefore, it doesn't end the comment and is removed with the rest of the comment.16

16  Figure 7-1 is a model describing the complex multi-level parsing of Perl, but those digging around in Perl internals will find that the processing actually done by Perl is slightly different. For example, what I call Phase C is not a separate step, but is actually part of both Phases B and D.

I feel it is more clear to present it as a separate step, so have done so. (I spent a considerable amount of time coming up with the model that Figure 7-1 illustrates -- I hope you'll find it helpful.) However, in the unlikely event of variable references within a comment, my model and reality could conceivably differ.

Consider m/regex #comment $var/x. In my model, $var is interpolated into the operand at Phase B, the results of which are removed with the rest of the comment in Phase C. In reality, the comment and variable reference are removed in Phase B before the variable is interpolated. The end results appear the same... almost always. If the variable in question contains a newline, that newline does not end the comment as my model would indicate. Also, any side effects of the variable interpolation, such as function calls and the like, are not done, since in reality the variable is not interpolated.

These situations are farfetched and rare, so they will almost certainly never matter to you, but I felt I should at least mention it.
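The comment trap described under Phase C can be reproduced on a small scale. This is a sketch with made-up variable names and a much simpler regex than $single, chosen so that the damaged regex still compiles and the effect shows up as a wrong match.

```perl
use strict;
use warnings;

my $unterminated = q{abc  # three letters};      # comment has no newline
my $terminated   = qq{abc  # three letters\n};   # newline ends the comment

# Under /x, the unterminated comment also swallows "def $", leaving
# only ^abc, so a string that should fail is accepted.
my $trap = "abcXYZ" =~ m/^ $unterminated def $/x ? 1 : 0;
my $ok   = "abcXYZ" =~ m/^ $terminated   def $/x ? 1 : 0;
my $good = "abcdef" =~ m/^ $terminated   def $/x ? 1 : 0;

print "$trap $ok $good\n";      # prints "1 0 1"
```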

Phase D -- regex compilation

The result of Phase C is the true regular expression that the engine uses for the match. The engine doesn't apply it directly, but instead compiles it into an internal form. I call this Phase D. If there was no variable interpolation in Phase B, the same compiled form can be reused each time the match operator is executed during the course of the program, and so each phase needs to be performed only once -- a real time savings. On the other hand, variable interpolation means that the regex can change each time the operator is reached, so Phases B through D must be performed again, each time. The important effects of this are the focus of ``Regex Compilation, the /o Modifier, and Efficiency'' (=>268). Also see the related discussion in Chapter 5's ``Compile Caching'' (=>158).

Perl's Regex Flavor

Now that we've gotten some important tangential issues out of the way, let's look at Perl's regex flavor itself. Perl uses a Traditional NFA engine with a base set of metacharacters superficially resembling egrep's, but the resemblance to egrep stops there. The biggest differences are due to Perl's NFA, and to the many extra metacharacters that provide both convenient shorthands and additional raw power. The Perl manpage only summarizes Perl's regex flavor, so I will try to provide much more depth.

Quantifiers -- Greedy and Lazy

Perl provides normal greedy quantifiers, with Perl5 adding the non-greedy counterparts as mentioned in Chapter 4. Greedy quantifiers are also said to be ``maximal matching,'' while non-greedy are also called ``lazy,'' ``ungreedy,'' and ``minimal matching,'' among others.17 Table 7-4 summarizes:

17  Larry Wall's preferred perlance uses minimal matching and maximal matching.

Table 7-4: Perl's Quantifiers (Greedy and Lazy)

                                                    Traditional Greedy    Lazy (Non-greedy)18
Number of matches                                   (maximal matching)    (minimal matching)
Any number (zero, one, or more)                     *                     *?
One or more                                         +                     +?
Optional (zero or one)                              ?                     ??
Specified limits (at least min; no more than max)   {min,max}             {min,max}?
Lower limit (at least min)                          {min,}                {min,}?
Exactly num19                                       {num}                 {num}?

18  Not available before Perl5.
19  There is nothing optional to be matched by an ``exactly num'' construct, so both versions are identical except for efficiency -- they co-exist only as a matter of notational convenience and consistency.


The non-greedy versions are examples of the ``ungainly looking'' additions to the regex flavor that appeared with Perl5. Traditionally, something like *? makes no sense in a regex. In fact, in Perl4 it is a syntax error. So, Larry was free to give it new meaning. There was some thought that the non-greedy versions might be **, ++, and such, which has certain appeal, but the problematic notation for {min,max} led Larry to choose the appended question mark. This also leaves ** and the like for future expansion.
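As a quick illustration of how the two columns of Table 7-4 differ in practice, here is a sketch with sample strings made up for the purpose:

```perl
use strict;
use warnings;

my $markup = "<b>bold</b>";
my ($greedy) = $markup =~ m/(<.+>)/;    # takes as much as it can
my ($lazy)   = $markup =~ m/(<.+?>)/;   # takes as little as it can

my ($max) = "aaaa" =~ m/(a{2,3})/;      # prefers the upper limit
my ($min) = "aaaa" =~ m/(a{2,3}?)/;     # prefers the lower limit

print "$greedy $lazy $max $min\n";      # prints "<b>bold</b> <b> aaa aa"
```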

Non-greedy efficiency

Many of the effects that non-greedy quantifiers have on efficiency were discussed in Chapter 5's ``A Detailed Look at the Performance Effects of Parentheses and Backtracking,'' starting on page 151, and ``Simple Repetition'' on page 155. Other effects stem primarily from the regex-directed nature of Perl's NFA engine.

It's not common that you have a free choice between using greedy and non-greedy quantifiers, since they have such different meanings, but if you do, the choice is highly dependent upon the situation. Thinking about the backtracking that either must do with the data you expect should lead you to a choice. For an example, I recommend my article in the Autumn 1996 (Volume 1, Issue 3) issue of The Perl Journal in which I take one simple problem and investigate a variety of solutions, including those with greedy and non-greedy quantifiers.

A non-greedy construct vs. a negated character class

I find that people often use a non-greedy construct as an easily typed replacement for a negated character-class, such as using <(.+?)> instead of <([^>]+)>. Sometimes this type of replacement works, although it is less efficient -- the implied loop of the star or plus must keep stopping to see whether the rest of the regex can match. Particularly in this example, it involves temporarily leaving the parentheses which, as Chapter 5 points out, has its own performance penalty (=>150). Even though the non-greedy constructs are easier to type and perhaps easier to read, make no mistake: what they match can be very different.

First of all, of course, if the /s modifier (=>234) is not used, the dot of .+? doesn't match a newline, while the negated class in [^>]+ does.

A bigger problem that doesn't always manifest itself clearly can be shown with a simple text markup system that uses <...> to indicate emphasis, as with:

Fred was very, <very> angry. <Angry!> I tell you.
Let's say you want to specially process marked items that end with an exclamation point (perhaps to add extra emphasis). One regex to match those occurrences and capture the marked text is <([^>]*!)>. Using <(.*?!)>, even with the /s modifier, is very different. The former matches only `<Angry!>', while the latter matches `<very> angry. <Angry!>'.

The point to remember is that the negated class in [^>]*> never matches `>', while the non-greedy construct in .*?> does if that is what it takes to achieve a match. If nothing after the lazy construct could force the regex engine to backtrack, it's not an issue. However, as the exclamation point of this example illustrates, the desire for a match allows the lazy construct past a point you really don't want it to exceed (and that a negated character class can't).

The non-greedy constructs are without a doubt the most powerful Perl5 additions to the regex flavor, but you must use them with care. A non-greedy .*? is almost never a reasonable substitute for [^...]* -- one might be proper for a particular situation, but due to their vastly different meaning, the other is likely incorrect.
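The markup example can be run directly to see the two captures side by side. A minimal sketch, using the sample sentence from above:

```perl
use strict;
use warnings;

my $text = 'Fred was very, <very> angry. <Angry!> I tell you.';

my ($by_class) = $text =~ m/<([^>]*!)>/;    # can never cross a '>'
my ($by_lazy)  = $text =~ m/<(.*?!)>/s;     # crosses '>' to reach the '!'

print "class: $by_class\n";     # prints "class: Angry!"
print "lazy:  $by_lazy\n";      # prints "lazy:  very> angry. <Angry!"
```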

Grouping

As I've noted often, parentheses traditionally have had two uses: for grouping and for capturing matched text to $1, $2, and friends. Within the same regular expression, the metacharacters \1, \2, and so on, are used instead. As pointed out on page 219, the difference is much more than just notational. Except within a character class where a backreference makes no sense, \1 through \9 are always backreferences. Additional backreferences (\10, \11, ...) become available as the number of capturing parentheses warrants (=>243).

Uniquely, Perl provides two kinds of parentheses: the traditional (...) for both grouping and capturing, and new with Perl5, the admittedly unsightly (?:...) for grouping only. With (?:...), the ``opening parenthesis'' is really the three-character sequence `(?:', while the ``closing parenthesis'' is the usual `)'.

Like the notation for the non-greedy quantifiers, the sequence `(?' was previously a syntax error. Starting with version 5, it is used for a number of regex language extensions, of which (?:...) is but one. We'll meet the others soon.

Capturing vs. non-capturing parentheses

The benefits in a grouping-only, non-capturing construct include:

This last item is probably the one that provides the most benefit from a user's point of view: Recall in the CSV example how because the first alternative used two sets of parentheses, those in the second alternative captured to $3 (=>204; =>207). Any time the first alternative changes the number of parentheses it uses, all subsequent references must be changed accordingly. It's a real maintenance nightmare that non-capturing parentheses can greatly alleviate.20

20  Even better still would be named subexpressions, such as Python provides through its symbolic group names, but Perl doesn't offer this... yet.

For related reasons, the number of capturing parentheses becomes important when m/.../ is used in a list context, or at any time with split. We will see specific examples in each of the relevant sections later (=>252; =>264).
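A short sketch of how the choice of parentheses changes the numbering, and with it what a list-context match returns. The sample string and regex are assumptions for illustration.

```perl
use strict;
use warnings;

my $line = "colour=red";

# Grouping with capturing parentheses: the value lands in $2.
my @with_capture    = $line =~ m/^(colou?r|hue)=(\w+)$/;

# Grouping only: the value is $1, however the alternation is later changed.
my @without_capture = $line =~ m/^(?:colou?r|hue)=(\w+)$/;

print scalar(@with_capture), " vs ", scalar(@without_capture), "\n";   # prints "2 vs 1"
```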

Lookahead

Perl5 also added the special (?=...) and (?!...) lookahead constructs. Like normal non-capturing parentheses, the positive lookahead (?=subexpression) is true if the subexpression matches. The twist is that the subexpression does not actually ``consume'' any of the target string -- lookahead matches a position in the string similar to the way a word boundary does. This means that it doesn't add to what $& or any enclosing capturing parentheses contain. It's a way of peeking ahead without taking responsibility for it.

Similarly, the negative lookahead (?!subexpression) is true when the subexpression does not match. Superficially, negative lookahead is the logical counterpart to a negated character class, but there are two major differences:


A few examples of lookahead:

Bill(?=·The·Cat|·Clinton)
Matches Bill, but only if followed by `·The·Cat' or `·Clinton'.

\d+(?!\.)
Matches a number if not followed by a period.

\d+(?=[^.])
Matches a number if followed by something other than a period. Make sure you understand the difference between this and the previous item -- consider a number at the very end of a string. Actually, I'll just put that question to you now. Which of these two will match the string `OH·44272', and where? ¤ Think carefully, then turn the page to check your answer.

Positive lookahead vs. negative lookahead

¤ Answer to the question on page 228

Both \d+(?!\.) and \d+(?=[^.]) can match `OH·44272'. The first matches the entire number, 44272, while the second matches only 4427.

Remember, greediness always defers in favor of an overall match. Since \d+(?=[^.]) requires a non-period after the matched number, it will give up part of the number to become the non-period if need be.

We don't know what these might be used for, but they should probably be written as \d+(?![\d.]) and \d+(?=[^.\d]).
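The answer can be checked with a short sketch. It assumes an ordinary space where the text shows `·', and the variable names are only illustrative:

```perl
use strict;
use warnings;

my $target = "OH 44272";

# A failed match in list context leaves the variable undefined.
my ($negative) = $target =~ m/(\d+(?!\.))/;       # whole number
my ($positive) = $target =~ m/(\d+(?=[^.]))/;     # gives up the final digit

# The stricter forms suggested above.
my ($strict_negative) = $target =~ m/(\d+(?![\d.]))/;
my ($strict_positive) = $target =~ m/(\d+(?=[^.\d]))/;

for my $result ($negative, $positive, $strict_negative, $strict_positive) {
    print defined $result ? "matched '$result'\n" : "no match\n";
}
```

The last expression fails outright here: with a number at the very end of the string there is nothing left to serve as the required non-period, non-digit character.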

^(?![A-Z]*$)[a-zA-Z]*$
Matches if the item contains only letters, but not all uppercase.

^(?=.*?this)(?=.*?that)
A rather ingenious (if not somewhat silly) way of checking whether this and that can match on the line. (A more logical solution that is mostly comparable is the dual-regex /this/ && /that/.)21

21  There are quite a variety of ways to implement ``this and that'', and as many ways to compare them. Major issues of such a comparison include understanding exactly what is matched, match efficiency, and understandability. I have an article in the Autumn 1996 (Volume 1, Issue 3) issue of The Perl Journal that examines this in great detail, providing, I hope, additional insight about what the regex engine's actions mean in the real world. By the way, the ingenious solution shown here comes from a post by Randal Schwartz, and it fared rather well in the benchmarks.
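Here is a small sketch that exercises the last two expressions. The test strings are my own, chosen only to show which items pass:

```perl
use strict;
use warnings;

# Only letters, but not all uppercase.
my @mixed = grep { m/^(?![A-Z]*$)[a-zA-Z]*$/ } qw(Perl PERL perl Perl5);

my @lines = ("that and this", "only this", "this, then that");

# Both words must be matchable somewhere on the line.
my @lookahead = grep { m/^(?=.*?this)(?=.*?that)/ } @lines;

# The dual-regex version, for comparison.
my @dual = grep { /this/ && /that/ } @lines;

print "mixed:     @mixed\n";
print "lookahead: ", join(" | ", @lookahead), "\n";
print "dual:      ", join(" | ", @dual), "\n";
```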

Other illustrative examples can be found at ``Keeping the Match in Synch with Expectations'' (=>237). For your entertainment, here's a particularly heady example copied from ``Adding Commas to a Number'' (=>292):

s<
   (\d{1,3})       # before a comma: one to three digits
   (?=             # followed by, but not part of what's matched...
      (?:\d\d\d)+  #    some number of triplets...
      (?!\d)       #    ...not followed by another digit
   )               #    (in other words, which ends the number)><$1,>gx;
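To watch that substitution at work, here is a minimal sketch that applies it to a copy of a sample sentence (the sentence and the variable names are assumptions for illustration):

```perl
use strict;
use warnings;

my $text = "The population is 281421906 and the area is 3794083 sq mi.";

# Copy $text, then commafy the copy in place.
(my $commafied = $text) =~ s<
   (\d{1,3})       # before a comma: one to three digits
   (?=             # followed by, but not part of what's matched...
      (?:\d\d\d)+  #    some number of triplets...
      (?!\d)       #    ...not followed by another digit
   )
><$1,>gx;

print "$commafied\n";
```

Because the lookahead consumes nothing, each /g iteration resumes right after the digits just matched, so the following triplets are still available to be matched in their turn.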

Lookahead parentheses do not capture text, so they do not count as a set of parentheses for numbering purposes. However, they may contain raw parentheses to capture the phantomly matched text. Although I don't recommend this often, it can perhaps be useful. For example, (.*?)(?=<(strong|em)\s*>) matches everything up to but not including a <strong> or <em> HTML tag on the line. The consumed text is put into $1 (and $& as well, of course), while the `strong' or `em' that allows the match to stop is placed into $2. If you don't need to know which tag allows the match to stop successfully, the ...(strong|em)... is better written as ...(?:strong|em)... to eliminate the needless capturing. As another example, I use an appended (?=(.*)) on page 277 as part of mimicking $'. (Due to the $&-penalty, we want to avoid $' if at all possible; =>273.)
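A short sketch of capturing inside a lookahead, using a made-up line of HTML:

```perl
use strict;
use warnings;

my $html = "Please read this <em>carefully</em> before use.";

my ($before, $tag);
if ($html =~ m/(.*?)(?=<(strong|em)\s*>)/) {
    # $1 is the consumed text; $2 was captured by the lookahead,
    # which matched without consuming anything.
    ($before, $tag) = ($1, $2);
}

print "text before the tag: '$before'\n";
print "tag that stopped the match: $tag\n";
```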

Using capturing parentheses within a negative lookahead construct makes absolutely no sense, since the negative lookahead construct matches only when its enclosed subexpression does not.

The perlre manpage rightly cautions that lookahead is very different from lookbehind. Lookahead ensures that the condition (matching or not matching the given subexpression) is true starting at the current location and looking, as normal, toward the right. Lookbehind, were it supported, would somehow look back toward the left.

For example, (?!000)\d\d\d means ``so long as they're not 000, match three digits,'' which is a reasonable thing to want to do. However, it is important to realize that it specifically does not mean ``match three digits that are not preceded by 000.'' This would be lookbehind, which is not supported in Perl or any other regex flavor that I know of. Well, actually, I suppose that a leading anchor (either type: string or word) can be considered a limited form of lookbehind.

You often use lookahead at the end of an expression to disallow it from matching when followed (or not followed) by certain things. Although its use at the beginning of an expression might well indicate a mistaken attempt at lookbehind, leading lookahead can sometimes be used to make a general expression more specific. The 000 is one example, and the use of (?!0+\.0+\.0+\.0+\b) to disallow null IP addresses in Chapter 4 (=>125) is another. And as we saw in Chapter 5's ``A Global View of Backtracking'' (=>149), lookahead at the beginning of an expression can be an effective way to speed up a match.

Otherwise, be careful with leading negative lookahead. The expression \w+ happily matches the first word in the string, but prepending (?!cat) is not enough to ensure ``the first word not beginning with cat.'' (?!cat)\w+ can't match at the start of cattle, but can still match cattle. To get the desired effect, you need additional precautions, such as \b(?!cat)\w+.
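The cattle problem is easy to reproduce. A minimal sketch, with a two-word sample string of my own choosing:

```perl
use strict;
use warnings;

my $text = "cattle graze";

# Without \b, the bump-along lets the match begin one character in.
my ($loose) = $text =~ m/((?!cat)\w+)/;

# With \b, the match may begin only at the start of a word.
my ($strict) = $text =~ m/(\b(?!cat)\w+)/;

print "loose:  $loose\n";
print "strict: $strict\n";
```

The first expression settles for `attle', while the second skips the whole word and finds `graze'.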

Comments within a regular expression

The (?#...) construct is taken as a comment and ignored. Its content is not entirely free-form, as any copy of the regex-operand delimiter must still be escaped.22

22  (?#...) comments are removed very early in the parsing, effectively between Phase A and Phase B of page 223's Figure 7-1. A bit of trivia for you: As far as I can tell, the closing parenthesis of a (?#...) comment is the only item in Perl's regular expression language that cannot be escaped. The first closing parentheses after the `(?#' ends the comment, period.

(?#...) appeared in version 5.000, but as of 5.002, the /x modifier enables the unadorned # comment that runs to the next newline (or the end of the regex), pretty much just like in regular code. (We saw an example of this earlier in the commafication snippet on page 229.) The /x modifier also causes most whitespace to be ignored, so you can write the

$text =~ m/"([^"\\]*(\\.[^"\\]*)*)",?|([^,]+),?|,/g
from the CSV example as the more-easily understandable:
$text =~ m/                # A field is one of three types of things:
                           #   1) DOUBLEQUOTED STRING
  "([^"\\]*(\\.[^"\\]*)*)" #      - Standard doublequoted string (nab to $1).
  ,?                       #      - Eat any trailing comma.
        |                  # -OR-
                           #   2) NORMAL FIELD
  ([^,]+)                  #      - Capture (to $3) up to next comma.
  ,?                       #      - (and including comma if there is one)
        |                  # -OR-
                           #   3) EMPTY FIELD
  ,                        #      just match the comma.
/gx;

The underlying regular expression is exactly the same. As with (?#...) comments, the regex's closing delimiter still counts, so these comments are not entirely free-form either. As with most other regex metacharacters, the # and whitespace metacharacters that become active with /x are not available within a character class, so there can be no comments or ignored whitespace within classes. Like other regex metacharacters, you can escape # and whitespace within an /x regex:


m{
  ^            # Start of line.
  (?:          # Followed by one of:
      From     #    `From'
     |Subject  #    `Subject'
     |Date     #    `Date'
  )            #
  :            # All followed by a colon...
  \ *          # .. and any number of spaces. (note escaped space)
  (.*)         # Capture rest of line (except newline) to $1.
}x;
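As a sketch of how the escaped space behaves in practice (the header line is invented for the example), the first match below uses the expression above, and the second shows an unescaped space being ignored:

```perl
use strict;
use warnings;

my $line = "Subject:   regex question\n";

my ($value) = $line =~ m{
  ^                             # Start of line.
  (?: From | Subject | Date )   # One of the header names.
  :                             # Followed by a colon...
  \ *                           # ...and any number of (escaped) spaces.
  (.*)                          # Capture rest of line (except newline).
}x;

# Here the space after the colon is unescaped, so /x discards it and the
# space in the data ends up inside the captured text.
my ($raw) = "Subject: hi" =~ m{ Subject: (.*) }x;

print "value: '$value'\n";
print "raw:   '$raw'\n";
```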

Other (?...) constructs

The special (?modifiers) notation uses the `(?' notation mentioned in this section, but for a different purpose. Traditionally, case-insensitive matching is invoked with the /i modifier. You can accomplish the same thing by putting (?i) anywhere in the regex (usually at the beginning). You can specify /i (case insensitive), /m (multi-line mode), /s (single-line mode), and /x (free formatting) using this mechanism. They can be combined; (?si) is the same as using both /i and /s.
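A brief sketch of the notation, with the modifiers placed at the start of the regex (the usual spot, and the only placement this example assumes anything about):

```perl
use strict;
use warnings;

my $header = "SUBJECT: Hello\nWorld";

my $with_flag  = $header =~ m/^subject:/i        ? 1 : 0;
my $embedded   = $header =~ m/(?i)^subject:/     ? 1 : 0;

# (?si) combines /s and /i, so dot can match the newline.
my $combined   = $header =~ m/(?si)hello.world/  ? 1 : 0;
my $i_only     = $header =~ m/(?i)hello.world/   ? 1 : 0;

print "/i: $with_flag, (?i): $embedded, (?si): $combined, (?i) alone: $i_only\n";
```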

String Anchors

Anchors are indispensable for creating bullet-proof expressions, and Perl provides several flavors.

Logical lines vs. raw text

Perl provides the traditional string anchors, caret and dollar,23 but their meaning is a bit more complex than simply ``start- and end-of-string.'' With the typically simple use of a regular expression within a while (<>) loop (and the input record separator $/ at its default), you know that the text being checked is exactly one logical line, so the distinction between ``start of logical line'' and ``start of the string'' is irrelevant.

23  A dollar sign can also indicate variable interpolation, but whether it represents that or the end-of-line metacharacter is rarely ambiguous. Details are in ``Doublequotish Processing and Variable Interpolation'' (=>222).

However, if a string contains embedded newlines, it's reasonable to consider the one string to be a collection of multiple logical lines. There are many ways to create strings with embedded newlines (such as "this\nway") -- when applying a regex to such a string, however it might have acquired its data, should ^Subject: find `Subject:' at the start of any of the logical lines, or only at the start of the entire multi-line string?

Perl allows you to do either. Actually, there are four distinct modes, summarized in Table 7-5.


Table 7-5: Overview of Newline-Related Match Modes

mode               ^ and $ anchors consider target text as         dot
default mode       a single string, without regard to newlines     doesn't match newline
single-line mode   a single string, without regard to newlines     matches all characters
multi-line mode    multiple logical lines separated by newlines    (unchanged from default)
clean multi-line   multiple logical lines separated by newlines    matches all characters


When there are no embedded newlines in the target string, all modes are equal.24 You'll notice that Table 7-5 doesn't mention...


In fact, meticulous study of Table 7-5 reveals that these modes are concerned only with how the three metacharacters, caret, dollar, and dot, consider newlines.

The default behavior of caret, dollar, and dot

Perl's default is that caret can match only at the beginning of the string. Dollar can match at the end of the string, or just before a newline at the end of the string. This last provision might seem a bit strange, but it makes sense in the context of how Perl is normally used: input is read line by line, with the trailing newline kept as part of the data. Since dot doesn't match a newline in the default mode, this rule allows .*$ to consume everything up to, but not including, the newline. My term for this default mode is the default mode. You can quote me.

/m and the ill-named ``multi-line mode''

Use the /m modifier to invoke a multi-line mode match. This allows caret to match at the beginning of any logical line (at the start of the string, as well as after any embedded newline), and dollar to match at the end of any logical line (that is, before any newline, as well as at the end of the string). The use of /m does not affect what dot does or doesn't match, so in the typical case where /m is used alone, dot retains its default behavior of not matching a newline. (As we'll soon see, /m can be combined with /s to create a clean multi-line mode where dot matches anything at all.)


Let me say again:

The /m modifier influences only how ^ and $ treat newlines.

The /m modifier affects only regular-expression matching, and in particular, only the caret and dollar metacharacters. It has nothing whatsoever to do with anything else. Perhaps ``multi-line mode'' is better named ``line-anchors-notice-embedded-newlines mode.'' The /m and multi-line mode are debatably the most misunderstood simple features of Perl, so please allow me to get on the soapbox for a moment to make it clear that the /m modifier has nothing to do with...

The /m modifier was added in Perl5 -- Perl4 used the now-obsolete special variable $* to indicate multi-line mode for all matches. When $* is true, caret and dollar behave as if /m were specified. This is less powerful than explicitly indicating that you want multi-line mode on a per-use basis, so modern programs should not use $*. However, modern programmers should worry if some old or unruly library does, so the complement of /m is the /s modifier.

Single-line mode

The /s modifier forces caret and dollar to not consider newlines as special, even if $* somehow gets turned on. It also affects dot: with /s, dot matches any character. The rationale is that if you go to the trouble to use the /s modifier to indicate that you are not interested in logical lines, a newline should not get special treatment with dot, either.

Clean multi-line mode

Using the /m and /s modifiers together creates what I call a ``clean'' multi-line mode. It's the same as normal multi-line mode except dot can match any character (the influence added by /s). I feel that removing the special case of dot not matching newline makes for cleaner, more simple behavior, hence the name.
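The four modes can be compared side by side by applying the same expression, ^(.*)$, to one string with embedded newlines. This is only a sketch with a made-up three-line string; the hypothetical show helper just makes newlines visible:

```perl
use strict;
use warnings;

my $text = "one\ntwo\nthree";

# A failed match in list context leaves the variable undefined.
my ($default)     = $text =~ m/^(.*)$/;     # default mode
my ($single)      = $text =~ m/^(.*)$/s;    # single-line mode
my ($multi)       = $text =~ m/^(.*)$/m;    # multi-line mode
my ($clean_multi) = $text =~ m/^(.*)$/ms;   # clean multi-line mode

# Hypothetical helper: render newlines as \n for display.
sub show {
    my ($str) = @_;
    return "(no match)" unless defined $str;
    (my $visible = $str) =~ s/\n/\\n/g;
    return "'$visible'";
}

print "default:          ", show($default), "\n";
print "single-line:      ", show($single), "\n";
print "multi-line:       ", show($multi), "\n";
print "clean multi-line: ", show($clean_multi), "\n";
```

In the default mode the match fails entirely: dot stops at the first newline, and dollar cannot match before a newline that is not at the end of the string.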

Explicit start- and end-of-string

To allow greater flexibility, Perl5 also provides \A and \Z to match the beginning and end of the string. They are never concerned with embedded newlines. They are exactly the same as the default and /s versions of caret and dollar, but they can be used even with the /m modifier, or when $* is true.

Newline is always special in one respect

There is one situation where newline always gets preferential treatment. Regardless of the mode, both $ and \Z are always allowed to match before a text-ending newline. In Perl4, a regex cannot require the absolute end of the string. In Perl5, you can use ...(?!\n)$ as needed. On the other hand, if you want to force a trailing newline, simply use ...\n$ in any version of Perl.
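A minimal sketch of the trailing-newline rule, using a sample line that still carries its newline:

```perl
use strict;
use warnings;

my $line = "data\n";

my $dollar       = $line =~ m/data$/        ? 1 : 0;  # $ may match before the final newline
my $big_z        = $line =~ m/data\Z/       ? 1 : 0;  # so may \Z
my $absolute_end = $line =~ m/data(?!\n)$/  ? 1 : 0;  # insists on the true end of string
my $needs_nl     = $line =~ m/data\n$/      ? 1 : 0;  # insists on a trailing newline

print "\$: $dollar  \\Z: $big_z  (?!\\n)\$: $absolute_end  \\n\$: $needs_nl\n";
```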

Eliminating warnings about $*

Perl scripts should generally not use $*, but sometimes the same code needs to support Perl4 as well. Perl5 issues a warning if it sees $* when warnings are turned on (as they generally should be). The warnings can be quite annoying, but rather than turning them off for the entire script, I recommend:
{ local($^W) = 0;  eval '$* = 1' }

This turns off warnings while $* is modified, yet when done leaves warnings on if they were on -- I explained this technique in detail in ``Dynamic Scope'' (=>213).

/m vs. (?m), /s vs. /m

As mentioned briefly on page 231, you can specify certain modifiers using the (?mod) notation within the regex itself, such as using (?m) as a substitute for using /m. There are no fancy rules regarding how the /m modifier might conflict with (?m), or where (?m) can appear in the regex. Simply using either /m or (?m) (anywhere at all in the regex) enables multi-line mode for the entire match.

Although you may be tempted to want something like (?m)...(?s)...(?m)... to change the mode mid-stream, the line mode is an all-or-nothing characteristic for the entire match. It makes no difference how the mode is specified.

Combining both /s and /m has /m taking precedence with respect to caret and dollar. Still, the use or non-use of /m has no bearing on whether a dot matches a newline or not -- only the explicit use of /s changes dot's default behavior. Thus, combining both modes creates the clean multi-line mode.

All these modes and permutations might seem confusing, but Table 7-6 should keep things straight. Basically, they can be summarized with ``/m means multi-line, /s means dot matches newline.''


Table 7-6: Summary of Anchor and Dot Modes

Mode                  Specified With                     ^       $       \A, \Z   Dot Matches Newline
default               neither /s nor /m, $* false        string  string  string   no
single-line           /s ($* irrelevant)                 string  string  string   yes
multi-line            /m ($* irrelevant)                 line    line    string   no (default)
clean multi-line      both /m and /s ($* irrelevant)     line    line    string   yes
obsolete multi-line   neither /s nor /m; $* true         line    line    string   no (default)

string -- cannot anchor to an embedded newline.
line -- can anchor to an embedded newline.


All other constructs are unaffected. \n always matches a newline. A character class can be used to match or exclude the newline characters at any time. An inverted character class such as [^x] always matches a newline (unless \n is included, of course). Keep this in mind if you want to change something like .* to the seemingly more restrictive [^...]*.
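The difference between dot and an inverted class is easy to overlook, so here is a small sketch (the two-line string is invented):

```perl
use strict;
use warnings;

my $text = "key: value\nnext line";

# Dot stops at the newline (default mode)...
my ($with_dot) = $text =~ m/:(.*)/;

# ...but [^;] happily matches the newline and keeps going.
my ($with_class) = $text =~ m/:([^;]*)/;

# Listing \n in the class restores the line-at-a-time behavior.
my ($with_nl_excluded) = $text =~ m/:([^;\n]*)/;

print length($with_dot), " vs ", length($with_class), " characters\n";
```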

Multi-Match Anchor

Perl5 adds the \G anchor, which is related to \A, but is geared for use with /g. It matches at the point where the previous match left off. For the first attempt of a /g match, or when /g is not used, it is the same as \A.

An example with \G

Let's look at a lengthy example that might seem a bit contrived, but which illustrates some excellent points. Let's say that your data is a series of five-digit US postal codes (ZIP codes) that are run together, and that you need to retrieve all that begin with, say, 44. Here is a sample line of data; the target codes are 44182 and 44272:
03824531449411615213441829503544272752010217443235
As a starting point, consider that we can use @zips = m/\d\d\d\d\d/g; to create a list with one ZIP code per element (assuming, of course, that the data is in the default search variable $_). The regular expression matches one ZIP code each time /g iteratively applies it. A point whose importance will soon become apparent: the regex never fails until the entire list has been parsed -- there are absolutely no bump-and-retries by the transmission. (I'm assuming we'll have only proper data, an assumption that is sometimes valid in the real world -- but usually not.)

So, it should be apparent that changing \d\d\d\d\d to 44\d\d\d in an attempt to find only ZIP codes starting with 44 is silly -- once a match attempt fails, the transmission bumps along one character, thus putting the match for the 44 out of synch with the start of each ZIP code. Using 44\d\d\d incorrectly finds 44941 (the tail of 53144 run together with the head of 94116) as the first match.

You could, of course, put a caret or \A at the head of the regex, but they allow a target ZIP code to match only if it's the first in the string. We need to keep the regex engine in synch manually by writing our regex to pass over undesired ZIP codes as needed. The key here is that it must pass over full ZIP codes, not single characters as with the automatic bump-along.

Keeping the match in synch with expectations

I can think of several ways to have the regex pass over undesired ZIP codes. Any of the following inserted at the head of the regex achieves the desired effect:

(?:[^4]\d\d\d\d|\d[^4]\d\d\d)*...
This brute-force method actively skips ZIP codes that start with something other than 44. (Well, it's probably better to use [0-35-9] instead of [^4], but as I said earlier, I am assuming properly formatted data.) By the way, we can't use (?:[^4][^4]\d\d\d)*, as it does not pass over undesired ZIP codes like 43210.

(?:(?!44)\d\d\d\d\d)*...
This similar method actively skips ZIP codes that do not start with 44. This English description sounds virtually identical to the one above, but when rendered into a regular expression looks quite different. Compare the two descriptions and related expressions. In this case, a desired ZIP code (beginning with 44) causes (?!44) to fail, thus causing the skipping to stop.

(?:\d\d\d\d\d)*?...
This method skips ZIP codes only when needed (that is, when a later subexpression describing what we do want fails). Because of the minimal-matching, (?:\d\d\d\d\d) is not even attempted until whatever follows has failed (and is repeatedly attempted until whatever follows finally does match, thus effectively skipping only what is absolutely needed).

Combining this last method with (44\d\d\d) gives us

@zips = m/(?:\d\d\d\d\d)*?(44\d\d\d)/g;
and picks out the desired `44xxx' codes, actively skipping undesired ones that intervene. (When used in a list context, m/.../g returns a list of the text matched by subexpressions within capturing parentheses from each match; =>253.)

This regex can work with /g because we know each match always leaves the ``current match position'' at the start of the next ZIP code, thereby priming the next match (via /g) to start at the beginning of a ZIP code as the regex expects. You might remember that we used this same keeping-in-synch technique with the CSV example (=>206).

Hopefully, these techniques are enlightening, but we still haven't seen the \G we're supposed to be looking at. Continuing the analysis of the problem, we find a use for \G quickly.

Maintaining synch after a non-match as well

Have we really ensured that the regex will always be applied only at the start of a ZIP code? No! We have manually skipped intervening undesired ZIP codes, but once there are no more desired ones, the regex will finally fail. As always, the bump-along-and-retry happens, thereby starting the match from a position within a ZIP code! This is a common concern that we've seen before, in Chapter 5's Tcl program to remove C comments (=>173).

Let's look at our sample data again:

03824 53144 94116 15213 [44182] 95035 [44272] |7|5|2|0| 10217 [44323] 5
Here, the matched codes are in brackets (the third of which is undesired), the unbracketed five-digit groups are the codes we actively skipped, and the characters skipped via bump-along-and-retry are set off by bars. After the match of 44272, no more target codes are able to be matched, so the subsequent attempt fails. Does the whole match attempt end? Of course not. The transmission bumps along to apply the regex at the next character, putting us out of synch with the real ZIP codes. After the fourth such bump-along, the regex skips 10217 as it matches the ``ZIP code'' 44323.

Our regex works smoothly so long as it's applied at the start of a ZIP code, but the transmission's bump-along defeats it. This is where \G comes in:

@zips = m/\G(?:\d\d\d\d\d)*?(44\d\d\d)/g;

\G matches the point where the previous /g match ended (or the start of the string during the first attempt). Because we crafted the regex to explicitly end on a ZIP code boundary, we're assured that any subsequent match beginning with \G will start on that same ZIP code boundary. If the subsequent match fails, we're done for good because a bump-along match is impossible -- \G requires that we start from where we had last left off. In other words, this use of \G effectively disables the bump along.

In fact, the transmission is optimized to actually disable the bump-along in common situations. If a match must start with \G, a bump-along can never yield a match, so it is done away with altogether. The optimizer can be tricked, so you need to be careful. For example, this optimization isn't activated with something like \Gthis|\Gthat, even though it is effectively the same as \G(?:this|that) (which it does optimize).
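The whole progression can be run against the sample data. The sketch below holds the data in a hypothetical $data variable rather than $_, and compares the naive regex, the in-synch regex, and the \G version:

```perl
use strict;
use warnings;

my $data = "03824531449411615213441829503544272752010217443235";

# Naive: bump-along lets matches straddle ZIP code boundaries.
my @naive = $data =~ m/44\d\d\d/g;

# In synch while matching, but not after the final failure.
my @in_synch = $data =~ m/(?:\d\d\d\d\d)*?(44\d\d\d)/g;

# \G pins every attempt to where the previous match ended.
my @anchored = $data =~ m/\G(?:\d\d\d\d\d)*?(44\d\d\d)/g;

print "naive:    @naive\n";
print "in synch: @in_synch\n";
print "with \\G:  @anchored\n";
```

Only the \G version returns exactly the two desired codes; the other two pick up the bogus 44323, and the naive one finds 44941 as well.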

\G in perspective

\G is not used often, but when needed, it is indispensable. As enlightening as I hope this example has been, I can actually see a way to solve it without \G. In the interest of study, I'd like to mention that after successfully matching a 44xxx ZIP code, we can use either of the first two ``skip undesired ZIP codes'' subexpressions to bypass any trailing undesired ZIP codes as well (the third bypasses only when forced, so would not be appropriate here):
@zips = m/(?:\d\d\d\d\d)*?(44\d\d\d)(?:(?!44)\d\d\d\d\d)*/g;

After the last desired ZIP code has been matched, the added subexpression consumes the rest of the string, if any, and the m/.../g is finished.

These methods work, but frankly, it is often prudent to take some of the work out of the regular expression using other regular expressions or other language features. The following two examples are easier to understand and maintain:

@zips = grep {defined} m/(44\d\d\d)|\d\d\d\d\d/g;
@zips = grep {m/^44/} m/\d\d\d\d\d/g;

In Perl4, you can't do it all in one regex because it doesn't have many of the constructs we've used, so you need to use a different approach regardless.

Priming a /g match

Even if you don't use \G, the way Perl remembers the ``end of the previous match'' is a concern. In Perl4, it is associated with a particular regex operator, but in Perl5 it is associated with the matched data (the target string) itself. In fact, this position can be accessed using the pos(...) function. This means that one regex can actually pick up where a different one left off, in effect allowing multiple regular expressions to do a tag-team match. As a simple example, consider using
@nums = $data =~ m/\d+/g;
to pick apart a string, returning in @nums a list of all numbers in the data. Now, let's suppose that if the special value <xx> appears on the line, you want only numbers after it. An easy way to do it is:
$data =~ m/<xx>/g;  # prime the /g start. pos($data) now points to just after the <xx>.
@nums = $data =~ m/\d+/g;

The match of <xx> is in a scalar context, so /g doesn't perform multiple matches (=>253). Rather, it sets the pos of $data, the ``end of the last match'' position where the next /g-governed match of the same data will start. I call this technique priming the /g. Once done, m/\d+/g picks up the match at the primed point. If <xx> can't match in the first place, the subsequent m/\d+/g starts at the beginning of the string, as usual.

Two important points allow this example to work. First, the first match is in a scalar context. In a list context, the <xx> is applied repeatedly until it fails, and failure of a /g-governed match resets pos to the start of the string. Secondly, the first match must use /g. Matches that don't use /g never access pos.

And here is something interesting: Because you can assign to pos, you can prime the /g start manually:

pos($data) = $i if $i = index($data,"<xx>"), $i > 0;
@nums = $data =~ m/\d+/g;

If index finds <xx> in the string, it sets the start of the next /g-governed match of $data to begin there. This is slightly different from the previous example -- here we prime it to start at the <xx>, not after it as we did before. It turns out, though, that it doesn't matter in this case.
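Both ways of priming can be tried in one short sketch. The sample string is made up, and $primed_at exists only to show where pos ends up:

```perl
use strict;
use warnings;

my $data = "10 20 <xx> 30 40";

# Prime with a scalar-context /g match: pos now sits just after the <xx>.
$data =~ m/<xx>/g;
my $primed_at = pos($data);
my @after = $data =~ m/\d+/g;      # picks up from the primed position

# The list-context match above ran to failure, which reset pos,
# so an unprimed match starts from the beginning again.
my @all = $data =~ m/\d+/g;

# Prime manually: this time the match starts at the <xx>, not after it.
my $i = index($data, "<xx>");
pos($data) = $i if $i > 0;
my @manual = $data =~ m/\d+/g;

print "primed at offset $primed_at: @after\n";
print "unprimed: @all\n";
print "manually primed at offset $i: @manual\n";
```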

This example is simple, but you can imagine how these techniques could be quite useful if used carefully in limited cases. You can also imagine them creating a maintenance nightmare if used carelessly.

Word Anchors

Perl lacks separate start-of-word and end-of-word anchors that are commonly found in other tools (Table 3-1 =>63; Table 6-1 =>182). Instead,26 Perl has word-boundary and non-word-boundary anchors \b and \B. (Note: in character classes, and in doublequoted strings for that matter, \b is a shorthand for a backspace.) A word boundary is any position where the character on one side matches \w and the character on the other matches \W (with the ends of the string being considered \W for the purposes of this definition). Note that unlike most tools, Perl includes the underscore in \w. It can also include additional characters if a locale is defined in the user environment (=>65, 242).

26  The two styles are not mutually exclusive. GNU Emacs, for example, provides both.

An important warning about \b

There's one particular danger surrounding \b that I sometimes run into.27 As part of a Web search interface, given an $item to find, I was using m/\b\Q$item\E\b/ to do the search. I wrapped the item in \b...\b because I knew I wanted to find only whole-word matches of whatever $item was. Given an item such as `3.75', the search regex would become \b3\.75\b, finding things like `price is 3.75 plus tax' as expected.

27  Most recently, actually, just before feeling the need to write this section!

However, if the $item were `$3.75', the regex would become \b\$3\.75\b, which requires a word boundary before the dollar sign. (The only way a word boundary can fall before a \W character like a dollar sign is if a word ends there.) I'd prepended \b with the thought that it would force the match to start where $item started its own word. But now that the start of $item can't start its own word, (`$' isn't matched by \w, so cannot possibly start a word) the \b is an impediment. The regex doesn't even match `... is $3.75 plus ...'.

We don't even want to begin the match if the character before can match \w, but this is lookbehind, something Perl doesn't have. One way to address this issue is to add \b only when the $item starts or ends with something matching \w:


$regex = "\Q$item\E";                       # make $item ``safe''
$regex = '\b' . $regex if $regex =~ m/^\w/; # if can start word, ensure it does
$regex = $regex . '\b' if $regex =~ m/\w$/; # if can end word, ensure it does

This ensures that \b doesn't go where it will cause problems, but it still doesn't address the situations where it does (that is, where the $item begins or ends with a \W character). For example, an $item of -998 still matches in `800-998-9938'. If we don't mind matching more text than we really want (something that's not an option when embedded within a larger subexpression, and often not an option when applied using the /g modifier), we can use the simple, but effective, (?:\W|^)\Q$item\E(?!\w).
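Here is a sketch that pulls the pieces of this section together. The whole_word_regex helper is a hypothetical wrapper around the three lines shown earlier (using quotemeta, which is what \Q...\E does), and the sample strings are the ones discussed in the text:

```perl
use strict;
use warnings;

# Hypothetical helper: add \b only where the item can begin or end a word.
sub whole_word_regex {
    my ($item) = @_;
    my $regex = quotemeta($item);
    $regex = '\b' . $regex if $regex =~ m/^\w/;
    $regex = $regex . '\b' if $regex =~ m/\w$/;
    return $regex;
}

my $text = 'price is $3.75 plus tax';
my $item = '$3.75';

# Unconditional \b...\b: the leading \b can never match before the `$'.
my $naive_hit = $text =~ m/\b\Q$item\E\b/ ? 1 : 0;

my $regex       = whole_word_regex($item);
my $guarded_hit = $text =~ m/$regex/ ? 1 : 0;

# The remaining weakness, and the heavier-handed alternative.
my $phone      = '800-998-9938';
my $neg        = '-998';
my $neg_regex  = whole_word_regex($neg);
my $phone_hit  = $phone =~ m/$neg_regex/ ? 1 : 0;
my $strict_hit = $phone =~ m/(?:\W|^)\Q$neg\E(?!\w)/ ? 1 : 0;

print "naive: $naive_hit, guarded: $guarded_hit\n";
print "phone (guarded): $phone_hit, phone (strict): $strict_hit\n";
```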

Dealing with a lack of separate start and end anchors

At first glance, Perl's lack of separate start- and end-of-word anchors might seem to be a major shortcoming, but it's not so bad since \b's use in the regex almost always disambiguates it. I've never seen a situation where a specific start-only or end-only anchor was actually needed, but if you ever do bump into one, you can use \b(?=\w) and \b(?!\w) to mimic them. For example,
s/\b(?!\w).*\b(?=\w)//
removes everything between the first and last word in the string. As a comparison, in modern versions of sed that support the separate \< and \> word boundaries, the same command would be s/\>.*\<//.
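A small sketch of the mimicked anchors, using a three-word string of my own. Besides the substitution, it collects the offsets at which each anchor matches:

```perl
use strict;
use warnings;

my $text = "alpha beta gamma";

# Remove everything between the first and last word.
(my $trimmed = $text) =~ s/\b(?!\w).*\b(?=\w)//;

# Offsets where the mimicked start-of-word and end-of-word anchors match.
my (@starts, @ends);
push @starts, pos($text) while $text =~ m/\b(?=\w)/g;
push @ends,   pos($text) while $text =~ m/\b(?!\w)/g;

print "trimmed: $trimmed\n";
print "word starts at: @starts\n";
print "word ends at:   @ends\n";
```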

Convenient Shorthands and Other Notations

We've already seen many of Perl's convenient shorthands for common constructs. Table 7-7 shows the full28 list.

28  The phantom \v (vertical tab) has been omitted. The manpage and other documentation listed it for years, but it was never actually added to the language! I doubt it will be missed now that it has finally been removed from the documentation.


Table 7-7: Regex Shorthands and Special-Character Encodings

Byte Notations
  \num     character specified in octal
  \xnum    character specified in hexadecimal
  \cchar   control character

Shorthand for Common Classes
  \d           digit [0-9]
  \s           whitespace, usually [·\f\n\r\t]
  \w           word character, usually [a-zA-Z0-9_]
  \D, \S, \W   complement of \d, \s, \w

Machine-dependent Control Characters
  \a   alarm (bell)
  \f   formfeed
  \e   escape
  \n   newline
  \r   carriage return
  \t   tab
  \b   backspace (only within a class)


The \n and other shorthands probably seem familiar -- these machine-dependent (=>72) notations for common control characters are also available in doublequoted strings. I feel it is important to maintain the mental distinction that these regular expression metacharacters themselves are not available within strings, but that strings just happen to have their own metacharacters which parallel the regex ones in this area (=>41).


If you compare m/(?:\r\n)+$/ with

$regex = "(\r\n)+";
m/$regex$/;
you'll find that they produce exactly the same results. But make no mistake, they are not the same. Naïvely use something like "\b[+\055/*]\d+\b" in the same situation and you'll find a number of surprises. When you assign to a string, it's a string -- your intention to use it as a regex is irrelevant to how the string processes it. The two \b in this example are intended to be word boundaries, but to a doublequoted string they're shorthands for a backspace. The regex will never see \b, but instead raw backspaces (which are unspecial to a regex and simply match backspaces). Surprise!

On a different front, both a regex and a doublequoted string convert \055 to a dash (055 is the ASCII code for a dash), but if the regex does the conversion, it doesn't see the result as a metacharacter. The string doesn't either, but the resulting [+-/*] that the regex eventually receives has a dash that will be interpreted as part of a class range. Surprise!

Finally, \d is not a known metasequence to a doublequoted string, so it simply removes the backslash. The regex sees d. Surprise!

I want to emphasize all this because you must know what are and aren't metacharacters (and whose metacharacters they are, and when, and in what order they're processed) when building a string that you later intend to use as a regex. It can certainly be confusing at first. An extended example is presented in ``Matching an Email Address'' (=>294).
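The three surprises are easy to reproduce. The following is only a sketch: the variable names are made up for the illustration, and `use warnings` is deliberately left off because a current Perl would otherwise complain about the unrecognized \d in the doublequoted string.

```perl
use strict;

# Doublequoted: the string's own metacharacters are processed first.
my $as_string     = "\b[+\055/*]\d+\b";
# Singlequoted: the backslashes survive for the regex engine to interpret.
my $as_regex_text = '\b[+\055/*]\d+\b';

# What the regex engine receives from the doublequoted version:
# backspace, a class that now contains a range, a literal d, +, backspace.
my $received = $as_string;
$received =~ s/\x08/<BS>/g;          # make the raw backspaces visible
print "$received\n";

my $intended  = ("2+3" =~ m/$as_regex_text/)     ? 1 : 0;  # word boundaries, +, digit
my $surprise1 = ("2+3" =~ m/$as_string/)         ? 1 : 0;  # no backspaces in the target
my $surprise2 = ("\x08,dd\x08" =~ m/$as_string/) ? 1 : 0;  # comma lies in the +-/ range
print "$intended $surprise1 $surprise2\n";
```

Building the regex text with singlequotes (or doubling each backslash in a doublequoted string) is one way to keep the string's processing from getting to the metacharacters first.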

Perl and the POSIX locale

As ``Backdoor Support for Locales'' (=>65) briefly noted, Perl's support for POSIX locales is very limited. It knows nothing about collating elements and the like, so ASCII-based character-class ranges don't include any of the locale's non-ASCII characters. (String comparisons aren't locale-based either, so neither is sorting.)

If compiled with appropriate libraries, though, Perl uses the ``is this a letter?'' routines (isalpha, isupper, and so on), as well as the mappings between uppercase and lowercase. This affects /i and the items mentioned in Table 7-8 (=>245). It also allows \w, \W, \s, \S, (but not \d), and word boundaries to respond to the locale. (Perl's regex flavor, however, explicitly does not support [:digit:] and the other POSIX bracket expression character classes listed on page 80.)

Related standard modules

Among Perl's standard modules are the POSIX module and Jarkko Hietaniemi's I18N::Collate module. (I18n is the common abbreviation for internationalization -- why is an exercise for your free time.) Although they don't provide regular-expression support, you might find them useful if locales are a concern. The POSIX module is huge, but the documentation is relatively brief -- for additional documentation, try the corresponding C library manpages, or perhaps Donald Lewine's POSIX Programmer's Guide (published by O'Reilly & Associates).

Byte notations

Perl provides methods to easily insert bytes using their raw values. Two- or three-digit octal values may be given like \33 and \177, and one- or two-digit hexadecimal values like \xA and \xFF.

Perl strings also allow one-digit octal escapes, but Perl regexes generally don't because something like \1 is usually taken as a backreference. In fact, multiple-digit backreferences are possible if there are enough capturing parentheses. Thus, \12 is a backreference if the expression has at least 12 sets of capturing parentheses, an octal escape (for decimal 10) otherwise. Upon reading a draft of this chapter, Wayne Berke offered a suggestion that I wholeheartedly agree with: never use a two-digit octal escape such as \12, but rather the full three-digit \012. Why? Perl will never interpret \012 as a backreference, while \12 is in danger of suddenly becoming a backreference if the number of capturing parentheses warrants.

There are two special cases: A backreference within a character class makes no sense so single-digit octal escapes are just peachy within character classes (which is why I wrote generally don't in the previous paragraph). Secondly, \0 is an octal escape everywhere, since it makes no sense as a backreference.
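The shifting meaning of \12 can be checked with a small experiment. This is a sketch under the assumption of a current Perl 5 and an ASCII system (where octal 012 is the newline); the twelve-parentheses regex is built artificially just to trigger the backreference reading.

```perl
use strict;
use warnings;

# No capturing parentheses: \12 can only be an octal escape.
my $octal_short = ("\n" =~ m/^\12\z/)  ? 1 : 0;
my $octal_full  = ("\n" =~ m/^\012\z/) ? 1 : 0;

# Twelve sets of capturing parentheses change the meaning of \12 ...
my $twelve   = '(.)' x 12;
my $backref  = ("abcdefghijkll"  =~ m/^${twelve}\12\z/)  ? 1 : 0;  # \12 repeats the 12th capture
my $not_lf   = ("abcdefghijkl\n" =~ m/^${twelve}\12\z/)  ? 1 : 0;  # so a newline no longer matches
# ... but never the meaning of the three-digit form.
my $still_lf = ("abcdefghijkl\n" =~ m/^${twelve}\012\z/) ? 1 : 0;

print "$octal_short $octal_full $backref $not_lf $still_lf\n";
```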

Bytes vs. characters, newline vs. linefeed

As Chapter 1's ``Regex Nomenclature'' (=>26) explained, exactly which characters these bytes represent is context-dependent. Usually the context is ASCII, but it can change with the whim of the user or the data. A related concern is that the value imparted by \n and friends is not defined by Perl, but is system-dependent (=>72).

Character Classes

Perl's class sublanguage is unique among regex flavors because it fully supports backslash escapes. For instance, [\-\],] is a single class to match a dash, right bracket, and comma. (It might take a bit for [\-\],] to sink in -- parse it carefully and it should make sense.) Many other regex flavors do not support backslashes within classes, which is too bad because being able to escape class metacharacters is not only logical, but beneficial. Furthermore, it's great to be allowed to escape items even when not strictly necessary, as it can enhance readability.29

29  I generally use GNU Emacs when writing Perl code, and use cperl-mode to provide automatic indenting and smart colorization. I often escape quotes within a regex because otherwise they confuse cperl-mode, which doesn't understand that quotes within a regex don't start strings.

As I've cautioned throughout this book, metacharacters recognized within a character class are distinct from those recognized outside. Perl is no exception to this rule, although many metacharacters have the same meaning in both situations. In fact, the dual-personality of \b aside, everything in Table 7-7 is also supported within character classes, and I find this extremely convenient.

Many normal regex metacharacters, however, are either unspecial or utterly different within character classes. Things like star, plus, parentheses, dot, alternation, anchors and the like are all meaningless within a character class. We've seen that \b, \3, and ^ have special meanings within a class, but they are unrelated to their meaning outside of a class. Both - and ] are unspecial outside of a class, but special inside (usually).

Character classes and non-ASCII data

Octal and hexadecimal escapes can be quite convenient in a character class, particularly with ranges. For example, the traditional class to match a ``viewable'' ASCII character (that is, not whitespace or a control character) has been [!-~]. (An exclamation point is the first such character in ASCII, a tilde the last.) It might be less cryptic to spell it out exactly: [\x21-\x7e]. Someone who already knew what you were attempting would understand either method, but someone happening upon [!-~] for the first time would be confused, to say the least. Using [\x21-\x7e] at least offers a clue that it is a character-encoding range.

Lacking real POSIX locale support, octal and hexadecimal escapes are quite useful for working with non-ASCII text. For instance, when working with the Latin-1 (ISO-8859-1) encoding popular on the Web, you need to consider that u might also appear as ù, ú, û, or ü (using the character encodings \xf9 through \xfc). Thus, to match any of these u's, you can use [u\xf9-\xfc]. The uppercase versions are encoded from \xd9 through \xdc, so a case-insensitive match is [uU\xf9-\xfc\xd9-\xdc]. (Using the /i modifier applies only to the ASCII u, so it is just as easy to include U directly and save ourselves the woe of the /i penalty; =>278.)
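As a quick illustration of such a class at work, here is a sketch that counts every kind of u in a string of Latin-1 bytes. The sample text is hypothetical, and the string is assumed to hold raw Latin-1 bytes rather than some other encoding.

```perl
use strict;
use warnings;

# Latin-1: accented lowercase u's are \xf9-\xfc, uppercase ones \xd9-\xdc.
my $any_u = qr/[uU\xf9-\xfc\xd9-\xdc]/;

# "deja vu, Uber, menu" with Latin-1 accents written as byte escapes.
my $sample = "d\xe9j\xe0 vu, \xdcber, men\xfc";

my $count = () = $sample =~ m/$any_u/g;   # list-context /g, then count the items
print "$count\n";
```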

Sorting is a practical use of this technique. Normally, to sort @items, you simply use sort @items, but this sorts based on raw byte values and puts u (with ASCII value \x75) far away from û and the like. If we make a copy of each item to use as a sort key (conveniently associating them with an associative array), we can modify the key so that it will work directly with sort and yield the results we want. We can then map back to the unmodified key while keeping the sorted order.


Here's a simple implementation:

foreach $item (@Items) {
  $key = lc $item;                    # Copy $item, forcing ASCII to lowercase.
  $key =~ s/[\xd9-\xdc\xf9-\xfc]/u/g; # All types of accented u become plain u
  # ... same treatment for other accented letters ...
  $pair{$item} = $key;                # Remember the item->sortkey relation
}

# Sort the items based upon their key.
@SortedItems = sort { $pair{$a} cmp $pair{$b} } @Items;

(lc is a convenient Perl5 feature, but the example is easily rewritten in Perl4.) In reality, this is only the beginning of a solution since each language has its own particular sorting requirements, but it's a step in the right direction. Another step is to use the I18N::Collate module mentioned earlier (in ``Related standard modules'').

Modification with \Q and Friends: True Lies

Look in any documentation on Perl regular expressions, including the bottom of Table 7-1, and you are likely to find \L, \E, \u, and the other items listed here in Table 7-8. Yet, it might surprise you that they are not really regular-expression metacharacters. The regex engine understands that `*' means ``any number'' and that `[' begins a character class, but it knows nothing about `\E'. So why have I included them here?


Table 7-8: String and Regex-Operand Case-Modification Constructs

In-string Construct   Meaning                                                 Built-in Function
\L, \U                lower, raise case of text until \E (30)                 lc(...), uc(...)
\l, \u                lower, raise case of next character (31)                lcfirst(...), ucfirst(...)
\Q                    add an escape before all non-alphanumerics until \E     quotemeta(...)

Special Combinations
\u\L                  Raise case of first character; lower rest until \E or end of text
\l\U                  Lower case of first character; raise rest until \E or end of text

30  In all these \E cases, ``until the end of the string or regex'' applies if no \E is given.
31  In Perl5, ignored within \L...\E and \U...\E unless right after the \L or \U.


For most practical purposes, they appear to be normal regex metacharacters. When used in a regex operand, the doublequotish processing of Figure 7-1's Phase B handles them, so they normally never reach the regex engine (=>222). But because of cases where it does matter, I call these Table 7-8 items second-class metacharacters.

Second-class metacharacters

As far as regex operands are concerned, variable interpolation and the metacharacters in Table 7-8 are special because they are recognized only during Phase B of Figure 7-1, not later after the interpolation has taken place. If you don't understand this, you would be confused when the following didn't work:

$ForceCase = $WantUpper ? '\U' : '\L';
if (m/$ForceCase$RestOfRegex/) {
 :

Because the \U or \L of $ForceCase are in the interpolated text (which is not processed further until Phase C), nothing recognizes them as special. Well, the regex engine recognizes the backslash, as always, so \U is treated as the general case of an unknown escape: the escape is simply ignored. If $RestOfRegex contains Path and $WantUpper is true, the search would be for the literal text UPath, not PATH as had been desired.

Another effect of the ``one-level'' rule is that something like m/([a-z])...\U\1 doesn't work. Ostensibly, the goal is to match a lowercase letter, eventually followed by an uppercase version of the same letter. But \U works only on the text in the regular expression itself, and \1 represents text matched by (some other part of the) expression, and that's not known until the match attempt is carried out. (I suppose a match attempt can be considered a ``Phase E'' of Figure 7-1.)
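The $ForceCase problem, and one way around it, can be sketched as follows. The fix shown (changing the case of the text with uc or lc before it is interpolated) is just one possibility, not the only one, and the target strings are invented. The `no warnings 'regexp'` line is there because a current Perl complains about the stray \U that reaches the regex engine.

```perl
use strict;
use warnings;
no warnings 'regexp';   # silence the complaint about the unrecognized \U below

my $RestOfRegex = 'Path';
my $WantUpper   = 1;

# Broken: the \U arrives via interpolation, too late to be recognized.
my $ForceCase   = $WantUpper ? '\U' : '\L';
my $found_PATH  = ("my PATH"  =~ m/$ForceCase$RestOfRegex/) ? 1 : 0;
my $found_UPath = ("my UPath" =~ m/$ForceCase$RestOfRegex/) ? 1 : 0;

# One alternative: change the case of the text itself, then interpolate it.
my $cased = $WantUpper ? uc($RestOfRegex) : lc($RestOfRegex);
my $fixed = ("my PATH" =~ m/$cased/) ? 1 : 0;

print "$found_PATH $found_UPath $fixed\n";
```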

The Match Operator

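The basic match is the core of Perl regular-expression use. Fortunately, the match operator is quite flexible; unfortunately, that makes mastering it more complex. In looking at what m/.../ offers, I take a divide-and-conquer approach, looking in turn at how the match operator and its regex operand are specified, how the target string operand is specified, the match's other side effects, the value a match returns, and the outside influences that affect a match.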

The Perl regular-expression match is an operator that takes two operands (a target string operand and a regex operand) and returns a value, although exactly what kind of value depends on context. Also, there are optional modifiers which change how the match is done. (Actually, I suppose these can be considered operands as well.)

Match-Operand Delimiters

It's common to use a slash as the match-operand delimiter, but you can choose any symbol you like. (Specifically, any non-alphanumeric, non-whitespace character may be used.) This is perhaps one of the most bizarre aspects of Perl syntax, although arguably one of the most useful for creating readable programs.

For example, to apply the expression ^/(?:[^/]+/)+Perl$ using the match operator with standard delimiters, m/^\/(?:[^\/]+\/)+Perl$/ is required. As presented in Phase A of Figure 7-1, a closing delimiter that appears within the regular expression must be escaped to hide the character's delimiter status. (These escapes are bold in the example.) Characters escaped for this reason are passed through to the regex as if no escape were present. Rather than suffer from backslashitis, it's more readable to use a different delimiter -- two examples of the same regex are m!^/(?:[^/]+/)+Perl$! and m,^/(?:[^/]+/)+Perl$,. Other common delimiters are m|...|, m#...#, and m%...%.
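To see that the choice of delimiter changes only the spelling, here is a small sketch applying the same regex three ways. The subroutine names and test paths are hypothetical.

```perl
use strict;
use warnings;

# Three spellings of one regex; only the delimiter (and the escaping) differs.
sub with_slashes { return $_[0] =~ m/^\/(?:[^\/]+\/)+Perl$/ ? 1 : 0 }
sub with_bangs   { return $_[0] =~ m!^/(?:[^/]+/)+Perl$!     ? 1 : 0 }
sub with_braces  { return $_[0] =~ m{^/(?:[^/]+/)+Perl$}     ? 1 : 0 }

for my $path ('/usr/local/lib/Perl', '/Perl') {
    print join(' ', $path, with_slashes($path), with_bangs($path), with_braces($path)), "\n";
}
```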

There are several special-case delimiters:

The substitution operator s/.../.../ has other special delimiters that we'll talk about later (=>255); the above are the special cases for the match operator.

As yet another special case, if the delimiter is either the commonly used slash, or the special match-only-once question mark, the m itself becomes optional. It's common to use /.../ for matches.

Finally, the generic pattern target =~ expression (with no delimiters and no m) is supported. The expression is evaluated as a generic Perl expression, taken as a string, and finally fed to the regex engine. This allows you to use something like $text =~ &GetRegex() instead of the longer:

my $temp_regex = &GetRegex();
... $text =~ m/$temp_regex/ ...

Similarly, using $text =~ "...string..." could be useful if you wanted real doublequote processing, rather than the doublequoteish processing discussed earlier. But frankly, I would leave this to an Obfuscated Perl Contest.

The default regex

If no regex is given, such as with m// (or with m/$regex/ where the variable $regex is empty or undefined), Perl reuses the regular expression most recently used successfully within the enclosing dynamic scope. In this case, any match modifiers (discussed in the next section) are completely ignored. This includes even the /g and /i modifiers. The modifiers used with the default expression remain in effect.

The default regex is never recompiled (even if the original had been built via variable interpolation without the /o modifier). This can be used to your advantage for creating efficient tests. There's an example in ``The /o Modifier'' (=>270).
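The default regex is easy to trip over by accident, as this sketch (run under the assumption of a current Perl 5, with invented target strings) tries to show. An empty variable does not yield a regex that matches anything; it yields whatever regex last matched successfully.

```perl
use strict;
use warnings;

my $first  = ("abc" =~ m/b/) ? 1 : 0;   # succeeds, so b becomes the default regex
my $reused = ("xbx" =~ m//)  ? 1 : 0;   # empty regex: b is applied again
my $miss   = ("xyz" =~ m//)  ? 1 : 0;   # still b, so no match here

# The same thing happens, perhaps unintentionally, when a variable is empty.
my $regex = '';
my $oops  = ("xyz" =~ m/$regex/) ? 1 : 0;   # b once more, so this fails too

print "$first $reused $miss $oops\n";
```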

Match Modifiers

The match operator supports a number of modifiers (options), which influence:

You can group several modifier letters together and place them in any order after the closing delimiter,32 whatever it might be. For example, m/<code>/i applies the regular expression <code> with the /i modifier, resulting in a case-insensitive match. Do keep in mind that the slash is not part of the modifier -- you could write this example as m|<code>|i or perhaps m{<code>}i or even m<<code>>i.

32  Because match-operator modifiers can appear in any order, a large portion of a programmer's time is spent adjusting the order to achieve maximal cuteness. For example, learn/by/osmosis is valid code (assuming you have a function called learn). The osmosis are the modifiers -- repetition of match-operator modifiers (but not the substitution-operator's /e) is allowed, but meaningless.

As discussed earlier (=>231), the modifiers /x, /i, /m, and /s can also appear within a regular expression itself using the (?...) construct. Allowing these options to be indicated in the regular expression directly is extremely convenient when one operator is applying different expressions at different times (usually due to variable interpolation). When you use /i, for example, every application of an expression via the match operator in question is case-insensitive. By allowing each regular expression to choose its own options, you get more general-purpose code.

A great example is a search engine on a Web page that offers a full Perl regular-expression lookup. Most search engines offer very simple search specifications that leave power users frustrated, so offering a full regex option is appealing. In such a case, the user could use (?i) and the like, as needed, without the CGI having to provide special options to activate these modifiers.
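A minimal sketch of the idea, with a hypothetical search_lines routine and made-up page titles standing in for the search engine's data:

```perl
use strict;
use warnings;

# The caller supplies the regex text, and with it any options embedded
# via (?i) and friends. The match operator here needs no /i of its own.
sub search_lines {
    my ($user_regex, @lines) = @_;
    return grep { m/$user_regex/ } @lines;
}

my @pages = ('Learning Perl', 'PERL FAQ', 'perl one-liners', 'Python notes');
my @exact = search_lines('Perl',     @pages);   # case-sensitive, as given
my @loose = search_lines('(?i)perl', @pages);   # the user asked for case-insensitivity

print scalar(@exact), " ", scalar(@loose), "\n";
```

A real search form would also want to guard against malformed user regexes, for instance by wrapping the match in an eval block, since a bad regex is a fatal error when it is compiled.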

m/.../g with a regex that can match nothingness

Normally, subsequent match attempts via the /g modifier start where the previous match ended, but what if there is a way for the regex to match the null string? As a simple example, consider the admittedly silly m/^/g. It matches at the start of the string, but doesn't actually consume any characters, so the first match ends at the beginning of the string. If the next attempt starts there as well, it will match there as well. Repeat forever and you start to see a problem.

Perl version 5.000 is broken in this respect, and indeed repeats until you run out of electrons. Perl4 and later versions of Perl5 work, although differently. They both begin the match where the previous one left off unless the previous match was of no text, in which case a special bump-along happens and the match is re-applied one character further down. Thus, each match after the first is guaranteed to progress down the string at least one character, and the infinite loop is avoided.

Except when used with the substitution operator, Perl5 takes the additional step of disallowing any match that ends at the same position as the previous match -- in such a case the automatic one-character heave-ho is done (if not already at the end of the string). This difference from Perl4 can be important -- Table 7-9 shows a few simple examples. (This table is not light reading by any means -- it might take a while to absorb.) Things are quite different with the substitution operator, but that's saved for ``The Substitution Operator'' (=>255).


Table 7-9: Examples of m/.../g with a Can-Match-Nothing Regex

regex         Perl4 matches         count    Perl5 (34) matches    count
\d*           123|                  2        123                   1
\d*           123|x|                3        123x|                 2
x|\d*         |a123|wx|y|z456|      8        |a123wxy|z456         5
\d*|x (33)    |a123|w|x|y|z456|     8        |a123w|x|y|z456       6

(Each match shown via either an underline, or as  |  for a zero-width match)
33  As an aside, note that since \d* can never fail, the |x is never used, and so is meaningless.
34  From version 5.001 on.
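Since the behavior differs among Perl4, version 5.000, and later versions, it may be worth checking what your own Perl does rather than relying on any table. The sketch below uses a hypothetical show_matches helper that reports the number of matches and brackets each one, so zero-width matches show up as []. No expected output is given, precisely because it depends on the version.

```perl
use strict;
use warnings;

# Apply a regex with /g in list context; report the count and each match.
sub show_matches {
    my ($regex, $target) = @_;
    my @matches = $target =~ m/(?:$regex)/g;   # no capturing parens: whole matches
    return scalar(@matches) . ': ' . join('', map { "[$_]" } @matches);
}

print show_matches('\d*',   '123'),         "\n";
print show_matches('\d*',   '123x'),        "\n";
print show_matches('x|\d*', 'a123wxyz456'), "\n";
print show_matches('\d*|x', 'a123wxyz456'), "\n";
```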


Specifying the Match Target Operand

Fortunately, the target string operand is simpler to describe than the regex operand. The normal way to indicate ``This is the string to search'' is using =~, as with $line =~ m/.../. Remember that =~ is not an assignment operator, nor is it a comparison operator. It is merely an odd way of providing the match operator with one of its operands. (The notation was adapted from awk.)

Since the whole ``expr =~ m/.../'' is an expression itself, you can use it wherever an expression is allowed. Some examples (each separated by a dotted line):

$text =~ m/.../;   # just do it, presumably, for the side effects.
 . . . . . . . . . . . .
if ($text =~ m/.../) {
  ## do code if match successful
 :
 . . . . . . . . . . . .
$result = ( $text   =~ m/.../ ); # set $result to result of match against $text
$result =   $text   =~ m/.../  ; # same thing; =~ has higher precedence than = 
 . . . . . . . . . . . .
$result =   $text;                # copy $text to $result...
$result             =~ m/.../  ; # ...and perform match on $result
( $result =   $text ) =~ m/.../  ; # Same thing in one expression

If the target operand is the variable $_, you can omit the ``$_ =~'' altogether. In other words, the default target operand is $_.

Something like $line =~ m/regex/ means ``Apply regex to the text in $line, ignoring the return value but doing the side effects.'' If you forget the `~', the resulting $line = m/regex/ becomes ``Apply regex to the text in $_, returning a true or false value that is then assigned to $line.'' In other words, the following are the same:

$line =        m/regex/
$line = ($_ =~ m/regex/)

You can also use !~ instead of =~ to logically negate the return value. (Return values and side effects are discussed soon.) $var !~ m/.../ is effectively the same as not ($var =~ m/.../). All the normal side effects, such as the setting of $1 and the like, still happen. It is merely a convenience in an ``If this doesn't match'' situation. Although you can use !~ in a list context, it doesn't make much sense.
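The forgotten-tilde mistake is worth seeing once in action. In this sketch the strings are invented; the point is only which variable gets searched and what ends up in $line.

```perl
use strict;
use warnings;

$_ = "the default target";
my $line = "some other text";

my $intended = ($line =~ m/other/) ? 1 : 0;   # searches $line
my $negated  = ($line !~ m/other/) ? 1 : 0;   # same search, result negated

# The typo: = instead of =~. The match is applied to $_, and its
# true/false result overwrites the text that was in $line.
$line = m/default/;

print "$intended $negated [$line]\n";
```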

Other Side Effects of the Match Operator

Often, more important than the actual return value of the match are the resulting side effects. In fact, it is quite common to use the match operator without using its return value just to obtain the side effects. (The default context is scalar.) I've already discussed most of the side effects ($&, $1, $+, and so on; =>217), so here I'll look at the remaining side effects a match attempt can have.

Two involve ``invisible status.'' First, if the match is specified using m?...?, a successful match dooms future matches to fail, at least until the next reset (=>247). Of course, if you use m?...? explicitly, this particular side effect is probably the main effect desired. Second, the regex in question becomes the default regex until the dynamic scope ends, or another regex matches (=>248).

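Finally, for matches with the /g modifier, the pos of the target string is updated to reflect the index into the string of the match's end. (A failed attempt always resets pos.) The next /g-governed match attempt of the same string starts at that position, unless the string has been modified in the meantime (which resets pos), pos has been assigned to (in which case the attempt starts at the assigned position), or the previous match did not match at least one character.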

As discussed earlier (=>249), in order to avoid an infinite loop, a successful match that doesn't actually match characters causes the next match to begin one character further into the string. During the begun-further attempt, pos properly reflects the end of the match because it's at the start of the subsequent attempt that the anti-loop movement is done. During such a next match, \G still refers to the true end of the previous match, so cannot be successful. (This is the only situation where \G doesn't mean ``the start of the attempt.'')
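The way pos tracks /g matches can be followed step by step. A short sketch, with an invented target string:

```perl
use strict;
use warnings;

my $text = "aaa bbb ccc";

$text =~ m/\w+/g;                  # matches aaa
my $after_first = pos($text);      # just past aaa

$text =~ m/\w+/g;                  # continues from there, matches bbb
my $after_second = pos($text);     # just past bbb

pos($text) = 0;                    # assigning to pos moves the starting point
$text =~ m/\w+/g;                  # so aaa is found again
my $after_rewind = pos($text);

$text =~ m/zzz/g;                  # a failed attempt ...
my $after_failure = pos($text);    # ... resets pos, leaving it undefined

print "$after_first $after_second $after_rewind ",
      defined($after_failure) ? $after_failure : 'undef', "\n";
```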

Match Operator Return Value

Far more than a simple true/false, the match operator can return a wide variety of information. (It can also return a simple true/false value if you like.) The exact information and the way it's returned depends on two main factors: context and the /g modifier.

Scalar context, without the /g modifier

Scalar context without the /g modifier is the ``normal situation.'' If a match is found, a Boolean true is returned:
if ($target =~ m/.../) {
    # processing if match found
 :
} else {
    # processing if no match found
 :
}

On failure, it returns an empty string (which is considered a Boolean false).
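A tiny sketch makes the two return values concrete. The distinction between an empty string and an undefined value rarely matters in a conditional, but it can when the result is stored or printed.

```perl
use strict;
use warnings;

my $hit  = ("Perl" =~ m/er/);    # scalar context, match found
my $miss = ("Perl" =~ m/xyz/);   # scalar context, no match

# $hit is true; $miss is false yet still defined, being the empty string.
my $miss_is_empty_string = (defined($miss) && $miss eq '') ? 1 : 0;

print "hit=[$hit] miss=[$miss] $miss_is_empty_string\n";
```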

List context, without the /g modifier

A list context without /g is the normal way to pluck information from a string. The return value is a list with an element for each set of capturing parentheses in the regex. A simple example is processing a date of the form 69/8/31, using:
($year, $month, $day) = $date =~ m{^ (\d+) / (\d+) / (\d+) $}x;

The three matched numbers are then available in the three variables (and $1 and such as well). There is one element in the return-value list for each set of capturing parentheses, or an empty list upon failure. Of course, it is possible for a set or sets to have not been part of a match, as is certainly guaranteed with one in m/(this)|(that)/. List elements for such sets exist, but are undefined. If there are no sets of capturing parentheses to begin with, a successful list-context non-/g match returns the list (1).

Expanding a bit on the date example, using a match expression as the conditional of an if (...) can be useful. Because of the assignment to ($year, ...), the match operator finds itself in a list context and returns the values for the variables. But since that whole assignment expression is used in the scalar context of the if's conditional, it is then contorted into the count of items in the list. Conveniently, this is interpreted as a Boolean false if there were no matches, true if there were.

if ( ($year, $month, $day) = $date =~ m{^ (\d+) / (\d+) / (\d+) $}x ) {
    # Process for when we have a match: $year and such have new values
} else {
    # Process for when no match: $year and such have been newly cleared to undefined
}
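The various list-context return values described above can be collected in one sketch. The target strings other than the date are made up.

```perl
use strict;
use warnings;

# One list element per set of capturing parentheses.
my ($year, $month, $day) = "69/8/31" =~ m{^ (\d+) / (\d+) / (\d+) $}x;

# A set that took no part in the match still gets its (undefined) element.
my @either = "that" =~ m/(this)|(that)/;

# With no capturing parentheses, success is reported as the list (1).
my @plain = "abc" =~ m/b/;

# Failure is an empty list, so the count of items is 0, which is false.
my $count = () = "no date here" =~ m{^ (\d+) / (\d+) / (\d+) $}x;

print "$year $month $day / ", scalar(@either), " / @plain / $count\n";
```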

List context, with the /g modifier

Available since version 4.036, this useful construct returns a list of all text matched within capturing parentheses (or if there are no capturing parentheses, the text matched by the whole expression), not only for one match as with the non-/g list-context, but for all matches in the string. For example, consider having the entire text of a Unix mailbox alias file in a single string, where logical lines look like:

alias  jeff      jfriedl@ora.com
alias  perlbug   perl5-porters@perl.org
alias  prez      president@whitehouse

You can use something like m/^alias\s+(\S+)\s+(.+)/ to pluck the alias and full address from a single logical line. It returns a list of two elements, such as ('jeff', 'jfriedl@ora.com') for the first line. Now consider working with all the lines in one string. You can do all the matches all at once by using /g (and /m, to allow caret to match at the beginning of each logical line), returning a list such as:

( 'jeff', 'jfriedl@ora.com', 'perlbug',
  'perl5-porters@perl.org', 'prez', 'president@whitehouse' )

If it happens that the items returned fit the key/value pair pattern as in this example, you can actually assign it directly to an associative array. After running

%alias = $text =~ m/^alias\s+(\S+)\s+(.+)/mg;

you could access the full address of `jeff' with $alias{'jeff'}.
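Put together as a complete program, the alias example might look like the following sketch. The singlequoted here-document is just a convenient way to get the sample lines into one string without Perl trying to interpolate the @ signs.

```perl
use strict;
use warnings;

# The sample lines from the text, held in one string.
my $text = <<'END';
alias  jeff      jfriedl@ora.com
alias  perlbug   perl5-porters@perl.org
alias  prez      president@whitehouse
END

# List-context /g returns alias, address, alias, address, ...
my %alias = $text =~ m/^alias\s+(\S+)\s+(.+)/mg;

print "$_ => $alias{$_}\n" for sort keys %alias;
```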

Scalar context, with the /g modifier

A scalar-context m/.../g is a special construct quite different from the other three situations. Like a normal m/.../, it does only one match, but like a list-context m/.../g, it pays attention to where previous matches occurred. Each time a scalar-context m/.../g is reached, such as in a loop, it finds the ``next'' match. Once it fails, the next check starts again from the beginning of the string.

This is quite convenient as the conditional of a while loop. Consider:

while ($ConfigData =~ m/^(\w+)=(.*)/mg) {
    my($key, $value) = ($1, $2);
      :
}

All matches are eventually found, but the body of the while loop is executed between the matches (well, after each match). Once an attempt fails, the result is false and the while loop finishes. Also, upon failure, the /g state (given by pos) is reset. Finally, be careful not to modify the target data within the loop unless you really know what you're doing: it resets pos.
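Here is the loop filled out into a runnable sketch. The configuration text and the %config hash are hypothetical; the loop itself is the one shown above.

```perl
use strict;
use warnings;

# Hypothetical configuration text, one key=value pair per logical line.
my $ConfigData = "host=example.com\nport=8080\nuser=jeff\n";

my %config;
while ($ConfigData =~ m/^(\w+)=(.*)/mg) {
    my ($key, $value) = ($1, $2);
    $config{$key} = $value;          # the body runs once per match
}

# The final, failed attempt ended the loop and reset pos.
my $pos_after_loop = pos($ConfigData);

print "$_=$config{$_}\n" for sort keys %config;
```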

Outside Influences on the Match Operator

We've just spent a fair amount of time covering the various options, special cases, and side effects of the match operator. Since many of the influences are not visually connected with the application of a match operator (that is, they are used or noticed elsewhere in the program), I'd like to summarize the hidden effects:

Keeping your mind in context (and context in mind)

Before leaving the match operator, I'll put a question to you. Particularly when changing among the while, if, and foreach control constructs, you really need to keep your wits about you. What do you expect the following to print?
while ("Larry Curly Moe" =~ m/\w+/g) {
   print "WHILE stooge is $&.\n";
}
print "\n";

if ("Larry Curly Moe" =~ m/\w+/g) {
   print "IF stooge is $&.\n";
}
print "\n";

foreach ("Larry Curly Moe" =~ m/\w+/g) {
   print "FOREACH stooge is $&.\n";
}

It's a bit tricky. Turn the page to check your answer.

while vs. foreach vs. if

Answer to the question on page 254.

The results differ depending on your version of Perl:

Perl4:

WHILE stooge is Larry.
WHILE stooge is Curly.
WHILE stooge is Moe.

IF stooge is Larry.

FOREACH stooge is .
FOREACH stooge is .
FOREACH stooge is .

Perl5:

WHILE stooge is Larry.
WHILE stooge is Curly.
WHILE stooge is Moe.

IF stooge is Larry.

FOREACH stooge is Moe.
FOREACH stooge is Moe.
FOREACH stooge is Moe.


Note that if the print within the foreach loop had referred to $_ rather than $&, its results would have been the same as the while's. In this foreach case, however, the result returned by the m/.../g, ('Larry', 'Curly', 'Moe'), goes unused. Rather, the side effect $& is used, which almost certainly indicates a programming mistake, as the side effects of a list-context m/.../g are not often useful.
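For comparison, a sketch of the foreach written the way it was presumably intended, using the list that m/.../g returns rather than its side effects:

```perl
use strict;
use warnings;

my @stooges;
foreach ("Larry Curly Moe" =~ m/\w+/g) {   # list context: all matches found up front
    push @stooges, $_;                     # $_ is each returned item in turn
    print "FOREACH stooge is $_.\n";
}
```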

The Substitution Operator

Perl's substitution operator s/regex/replacement/ extends the idea of matching text to match-and-replace. The regex operand is the same as with the match operator, but the replacement operand used to replace matched text adds a new, useful twist. Most concerns of the substitution operator are shared with the match operator and are covered in that section (starting on page 246). New concerns include:

The Replacement Operand

With the normal s/.../.../, the replacement operand immediately follows the regex operand, using a total of three instances of the delimiter rather than the two of m/.../. If the regex uses balanced delimiters (such as <...>), the replacement operand then has its own independent pair of delimiters (yielding four delimiters). In such cases, the two sets may be separated by whitespace, and if so, by comments as well.  Balanced delimiters are commonly used with /x or /e:
$test =~ s{
  ...some big regex here, with lots of comments and such...
} {
  ...a perl code snippet to be evaluated to produce the replacement text...
}ex

Perl normally provides true doublequoted processing of the replacement operand, although there are a few special-case delimiters. The processing happens after the match (with /g, after each match), so $1 and the like are available to refer to the proper match slice.
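Both points (balanced delimiters, and doublequoted processing of the replacement after each match) can be seen in a short sketch. The sample strings are invented.

```perl
use strict;
use warnings;

my $text = "jeff=ora perlbug=perl";

# The replacement is processed like a doublequoted string after each match,
# so $1 and $2 refer to the match just made.
$text =~ s/(\w+)=(\w+)/$2 has $1/g;

# Balanced delimiters: two independent pairs, here separated by
# whitespace and a comment.
my $name = "regular expressions";
$name =~ s{ \b (\w) }   # a letter at the start of a word
          {\u$1}gx;     # doublequoted processing, so \u applies to $1

print "$text\n$name\n";
```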

Special delimiters of the replacement operand are:

Remember that this replacement-operand processing is all quite distinct from regex-operand processing, which usually gets doublequotish processing and has its own set of special-case delimiters.

The /e Modifier

Only the substitution operator allows the use of the /e modifier. When used, the replacement operand is evaluated as if with eval {...} (including the load-time syntax check), the result of which is substituted for the matched text. The replacement operand does not undergo any processing before the eval (except to determine its lexical extent, as outlined in Figure 7-1's Phase A), not even singlequotish processing. The actual evaluation, however, is redone upon each match.

As an example, you can encode special characters of a World Wide Web URL using % followed by their two-digit hexadecimal representation. To encode all non-alphanumerics this way, you can use

$url =~ s/([^a-zA-Z0-9])/sprintf('%%%02x', ord($1))/ge;

and to decode it back, you can use:

$url =~ s/%([0-9a-f][0-9a-f])/pack("C",hex($1))/ige;

In short, pack("C", value) converts from a numeric value to the character with that value, while sprintf('%%%02x', ord(character)) does the opposite; see your favorite Perl documentation for more information. (Also, see the footnote on page 66 for more on this example.)

The moving target of interpretation

Particularly with the /e modifier, you should understand exactly who interprets what -- and when. It's not too confusing, but it does take some effort to keep things straight. For example, even with something as simple as s/.../`echo $$`/e, the question arises whether it's Perl or the shell that interprets the $$. To Perl and many shells, $$ is the process ID (of Perl or the shell, as the case may be). You must consider several levels of interpretation. First, the replacement-operand has no pre-eval processing in Perl5, but in Perl4 has singlequotish processing. When the result is evaluated, the backquotes provide doublequoted-string processing. (This is when Perl interpolates $$ -- it may be escaped to prevent this interpolation.) Finally, the result is sent to the shell, which then runs the echo command. (If the $$ had been escaped, it would be passed to the shell unescaped, resulting in the shell's interpolation of $$.)

To add to the fray, if using the /g modifier as well, should `echo $$` be evaluated just once (with the result being used for all replacements), or should it be done after each match? When $1 and such appear in the replacement operand, it obviously must be evaluated on a per-match basis so that the $1 properly reflects its after-match status. Other situations are less clear. With this echo example, Perl version 5.000 does only one evaluation, while other versions before and after evaluate on a per-match basis.
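Whether your Perl evaluates the replacement once or once per match is easy to test without involving the shell. This sketch uses a hypothetical next_label routine whose side effect (a counter) makes each evaluation visible; with a version that evaluates on a per-match basis, every x gets its own label.

```perl
use strict;
use warnings;

my $calls = 0;
sub next_label { return 'item' . ++$calls }   # each call is visible in the result

my $list = "x, x, x";
$list =~ s/x/next_label()/ge;    # the replacement code contains no $1 at all

print "$list ($calls evaluations)\n";
```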

/eieio

Perhaps useful only in an Obfuscated Perl Contest, it is interesting to note that the replacement operand will be evaluated multiple times if /e is specified more than once. (It is the only modifier for which repetition matters.) This is what Larry Wall calls an accidental feature, and was ``discovered'' in early 1991. During the ensuing comp.lang.perl discussion, Randal Schwartz offered one of his patented JAPH signatures:35
$Old_MacDonald = q#print #; $had_a_farm = (q-q:Just another Perl hacker,:-);
s/^/q[Sing it, boys and girls...],$Old_MacDonald.$had_a_farm/eieio;

35  My thanks to Hans Mulder for providing the historical background for this section, and to Randal for being Just Another Perl Hacker with a sense of humor.

The eval due to the first /e sees

q[Sing it, boys and girls...],$Old_MacDonald.$had_a_farm
whose execution results in print q:Just another Perl hacker,: which then prints Randal's ``Just another Perl hacker'' signature when evaluated due to the second /e.

Actually, this kind of construct is sometimes useful. Consider wanting to interpolate variables into a string manually (such as if the string is read from a configuration file). A simple approach uses $data =~ s/(\$[a-zA-Z_]\w*)/$1/eeg;. Applying this to `option=$var', the regex matches only the `$var' portion. The first eval simply sees the snippet $1 as provided in the replacement-operand, which in this case expands to $var. Due to the second /e, this result is evaluated again, resulting in whatever value the variable $var has at the time. This then replaces the matched `$var', in effect interpolating the variable.
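
A minimal sketch of this double evaluation follows. The variable $var and its value are hypothetical stand-ins for whatever a configuration file might refer to.

```perl
use strict;
use warnings;

our $var = 'verbose';        # hypothetical variable to be interpolated
my $data = 'option=$var';    # singlequoted, so the $var here is literal text

# First /e: the snippet $1 evaluates to the text '$var'.
# Second /e: that text is evaluated as Perl, giving the variable's value.
$data =~ s/(\$[a-zA-Z_]\w*)/$1/eeg;

print "$data\n";   # option=verbose
```

Because the regex admits only a dollar sign followed by a plain identifier, the second evaluation never sees anything more elaborate than a variable name.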

I actually use something like this with my personal Web pages -- most of them are written in a pseudo Perl/HTML code that gets run through a CGI when pulled by a remote client. It allows me to calculate things on the fly, such as to remind readers how few shopping days are left until my birthday.36

36  If you, too, would like to see how many days are left until my birthday, just load
  http://omrongw.wg.omron.co.jp/cgi-bin/j-e/jfriedl.html
or perhaps one of its mirrors (see Appendix A).

Context and Return Value

Recall that the match operator returns different values based upon the particular combination of context and /g. The substitution operator, however, has none of these complexities -- it returns the same type of information regardless of either concern.

The return value is either the number of substitutions performed or, if none were done, an empty string. When interpreted as a Boolean (such as for the conditional of an if), the return value conveniently interprets as true if any substitutions were done, false if not.
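
The following sketch shows both kinds of return value. The sample text and variable names are assumptions for illustration.

```perl
use strict;
use warnings;

my $text = 'a-b-c';

my $count = ($text =~ s/-/+/g);   # two substitutions performed
my $none  = ($text =~ s/x/y/g);   # nothing matched, nothing replaced

print "count=$count text=$text\n";          # count=2 text=a+b+c
print "no substitutions\n" if $none eq '';  # empty string, false as a Boolean

# The return value can be used directly as a conditional.
print "changed one\n" if $text =~ s/\+/-/;
```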

Using /g with a Regex That Can Match Nothingness

The earlier section on the match operator presented a detailed look at the special concerns when using a regex that can match nothingness. Different versions of Perl act differently, which muddies the water considerably. Fortunately, all versions' substitution works the same in this respect. Everything presented earlier in Table 7-9 (=>250) applies to how s/.../.../g matches as well, but the entries marked ``Perl5'' are for the match operator only. The entries marked ``Perl4'' apply to Perl4's match operator, and all versions' substitution operator.

The Split Operator

The multifaceted split operator (often called a function in casual conversation) is commonly used as the converse of a list-context m/.../g (=>253). The latter returns text matched by the regex, while a split with the same regex returns text separated by matches. The normal match $text =~ m/:/g applied against a $text of `IO.SYS:225558:95-10-03:-a-sh:optional', returns the four-element list
  (':', ':', ':', ':')
which doesn't seem useful. On the other hand, split(/:/, $text) returns the five-element list:
  ('IO.SYS', '225558', '95-10-03', '-a-sh', 'optional')
Both examples reflect that : matches four times. With split, those four matches partition a copy of the target into five chunks which are returned as a list of five strings.
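
The contrast can be reproduced with a short sketch (the array names are assumptions):

```perl
use strict;
use warnings;

my $text = 'IO.SYS:225558:95-10-03:-a-sh:optional';

my @matched = $text =~ m/:/g;      # text matched by the regex
my @between = split(/:/, $text);   # text separated by the matches

print scalar(@matched), " matches\n";           # 4 matches
print scalar(@between), " chunks: @between\n";  # 5 chunks: IO.SYS 225558 95-10-03 -a-sh optional
```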

In its simplest form with simple data like this, split is as easy to understand as it is useful. However, when the use of split or the data are complicated, understanding is less clear-cut. First, I'll quickly cover some of the basics.

Basic Split

split is an operator that looks like a function, and takes up to three operands:
split(match operand, target string, chunk-limit operand)

(The parentheses are optional with Perl5.) Default values, discussed below, are provided for operands left off the end.

Basic match operand

The match operand has several special-case situations, but normally it is a simple regular expression match such as /:/ or m/\s*<P>\s*/i. Conventionally, /.../ rather than m/.../ is used, although it doesn't really matter. The /g modifier is not needed (and is ignored) because split itself provides the iteration for matching in multiple places.

There is a default match operand if one is not provided, but it is one of the complex special cases discussed later.

Target string operand

The target string is inspected, and is never modified by split. The content of $_ is the default if no target string is provided.

Basic chunk-limit operand

In its primary role, the chunk-limit operand specifies a limit to the number of chunks that split partitions the string into. For example, with our sample data, split(/:/, $text, 3) returns:
( 'IO.SYS', '225558', '95-10-03:-a-sh:optional' )

This shows that split stopped after /:/ matched twice, resulting in the requested three-chunk partition. It could have matched additional times, but that's irrelevant here because of the chunk-limit. The limit is an upper bound, so no more than that many elements will ever be returned, but note that it doesn't guarantee that many elements -- no extra are produced to ``fill the count'' if the data can't be partitioned enough to begin with. split(/:/, $text, 1234) still returns only a five-element list. Still, there is an important difference between split(/:/, $text) and split(/:/, $text, 1234) which does not manifest itself with this example -- keep this in mind for when the details are discussed later.

Remember that the chunk-limit operand is not a match-limit operand. Had it been, the three matches allowed in the example above would have partitioned the string to

('IO.SYS', '225558', '95-10-03', '-a-sh:optional')
which is not what actually happens.
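
A short sketch of the chunk-limit behavior described above, using the same sample data (the array names are assumptions):

```perl
use strict;
use warnings;

my $text = 'IO.SYS:225558:95-10-03:-a-sh:optional';

my @three = split(/:/, $text, 3);      # stops after two matches
my @many  = split(/:/, $text, 1234);   # limit is only an upper bound

print scalar(@three), ": $three[2]\n";   # 3: 95-10-03:-a-sh:optional
print scalar(@many), "\n";               # 5
```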

One comment on efficiency: Let's say you intended to fetch only the first few fields, such as with

($filename, $size, $date) = split(/:/, $text)
you need four chunks -- filename, size, date, and ``everything else.'' You don't even want the ``everything else'' except that if it weren't a separate chunk, the $date chunk would contain it. So, you'd want to use a limit of 4 so that Perl doesn't waste time finding further partitions. Indeed, you can use a chunk-limit of 4, but if you don't provide one, Perl provides an appropriate default so that you get the performance enhancement without changing the results you actually see. 

Advanced Split

Because split is an operator and not a function, it can interpret its operands in magical ways not restricted by normal function-calling conventions. That's why, for example, split can recognize whether the first operand is a match operator, as opposed to some general expression that gets evaluated independently before the ``function'' is called.


Although useful, split is not straightforward to master. Some important points to consider are:

Split can return empty elements

The basic premise of split is that it returns the text between matches. If the match operator matches twice in a row, the nothingness between the matches is returned. Applying m/:/ to the sample string
:IO.SYS:225558:::95-10-03:-a-sh:
finds seven matches, one at each colon. With split, a match always37 results in a split between two items, including even that first match at the start of the string separating '' (an empty string, i.e., nothingness) from 'IO.SY...'. Similarly, the fourth match separates two empty strings. All in all, the seven matches partition the target into the eight strings:
('', 'IO.SYS', '225558', '', '', '95-10-03', '-a-sh', '')

37  There is one special case where this is not true. It is detailed a bit later during the discussion about advanced split's match operand.

However, this is not what split(/:/, $text) returns. Surprised?

Trailing empty elements are not (normally) returned

When the chunk-limit operand is not specified, as it often isn't, Perl strips trailing empty items from the list before it is returned. (Why? Your guess is as good as mine, but the feature is indeed documented.) Only empty items at the end of the list are stripped; others remain. You can, however, stop Perl from removing the trailing empty items, and it involves a special use of the chunk-limit operand.

The chunk-limit operand's second job

In addition to possibly limiting the number of chunks, any non-zero chunk-limit operand also eliminates the stripping of trailing empty items. (A chunk-limit given as zero is exactly the same as if no chunk limit is given at all.)

If you don't want to limit the number of chunks returned, but instead only want to leave trailing empty items intact, simply choose a very large limit. Also, a negative chunk-limit is taken as an arbitrarily large limit: split(/:/, $text, -1) returns all elements, including any trailing empty ones.

At the other extreme, if you want to remove all empty items, you could put grep {length} before the split. The grep lets pass only list elements with non-zero lengths (in other words, elements that aren't empty).
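
The three behaviors just described (default stripping of trailing empty items, a negative chunk-limit, and grep {length}) can be compared side by side in a small sketch. The array names are assumptions.

```perl
use strict;
use warnings;

my $text = ':IO.SYS:225558:::95-10-03:-a-sh:';

my @default  = split(/:/, $text);                 # trailing empty item stripped
my @all      = split(/:/, $text, -1);             # trailing empty item kept
my @nonempty = grep {length} split(/:/, $text);   # every empty item removed

print scalar(@default),  "\n";   # 7
print scalar(@all),      "\n";   # 8
print scalar(@nonempty), "\n";   # 5
```

Note that the leading empty item survives in both @default and @all; only grep removes it.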

Advanced Split's Match Operand

Most of the complexity of the split operator lies in the many personalities of the match operand. There are four distinct styles of the match operand:

Split's match-operator match operand

Like all the split examples we've seen so far, the most common use of split uses a match operator for the match operand. However, there are a number of important differences between a match operand and a real match operator:

Special match operand: a string with a single space

A match operand that is a string (not a regex) consisting of exactly one space is a special case. It's almost the same as /\s+/ except that leading whitespace is skipped. (This is meant to simulate the default field splitting that awk does with its input, although it can certainly be quite useful for general use.)

For instance, the call split('·', "···this···is·a·····test") returns the four-element list ('this', 'is', 'a', 'test'). As a contrast to '·', consider using m/\s+/ directly. This bypasses the leading-whitespace removal and returns ('', 'this', 'is', 'a', 'test').

Finally, both of these are quite different from using m/·/, which matches each individual space and returns:

('','','','this','','','is','a','','','','','test')
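
The three variations can be checked with a sketch like the following, in which ordinary spaces stand in for the · marks used above (the array names are assumptions):

```perl
use strict;
use warnings;

my $s = '   this   is a     test';

my @awkish = split(' ',    $s);   # special case: leading whitespace skipped
my @spaces = split(/\s+/,  $s);   # leading empty item is returned
my @single = split(/ /,    $s);   # every individual space is a separator

print scalar(@awkish), "\n";   # 4
print scalar(@spaces), "\n";   # 5
print scalar(@single), "\n";   # 13
```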

Any general scalar expression as the match operand

Any other general Perl expression as the match operand is evaluated independently, taken as a string, and interpreted as a regular expression. For example, split(/\s+/, ...) is the same as split('\s+', ...) except the former's regex is compiled only once, the latter's each time the split is executed.  

The default match operand

With Perl5, the default match operand when none is specified (which is different from // or ''), is identical to using '·'. Thus, a raw split without any operands is the same as split('·', $_, 0).  

Scalar-Context Split

Perl4 supported a scalar-context split, which returned the number of chunks instead of the list itself, causing the variable @_ to receive the list of chunks as a side effect. Although Perl5 currently supports it, this feature has been deprecated and will likely disappear in the future. Its use generates a warning when warnings are enabled, as they generally should be.

Split's Match Operand with Capturing Parentheses

Using capturing parentheses in the match-operand regex changes the whole face of split. In such a case, the returned array has additional, independent elements interjected for the item(s) captured by the parentheses. This means that text normally elided entirely by split is now included in the returned list.

In Perl4, this was more of a pain than a useful feature because it meant you could never (easily) use a regex that happened to require parentheses merely for grouping. If grouping is the only intent, the littering of extra elements in the return list is definitely not a feature. Now that you can select the style of parentheses -- capturing or not -- it's a great feature. For example, as part of HTML processing, split(/(<[^>]*>)/) turns

...·and·<B>very·<FONT·color=red>very</FONT>·much</B>·effort...
into
( '...·and·', '<B>', 'very·', '<FONT·color=red>',
  'very', '</FONT>', '·much', '</B>', '·effort...' )
which might be easier to deal with. This example so far works with Perl4 as well, but if you want to extend the regex to be cognizant of, say, simple "[^"]*" doublequoted strings (probably necessary for real HTML work), you run into problems as the full regex becomes something along the lines of:38

  (<[^>"]*("[^"]*"[^>"]*)*>)

38  You might recognize this as being a partially unrolled version of <("[^"]*"|[^>"])*>, with a normal of [^>"] and a special of "[^"]*". You could, of course, also unroll special independently.


The added set of capturing parentheses means an added element returned for each match during the split, yielding two items per match plus the normal items due to the split. Applying this to
  Please <A HREF="test">press me</A> today
returns (with descriptive comments added):

( 'Please·',                   # before first match
  '<A·HREF="test">', '"test"', # from first match
  'press·me',                  # between matches
  '</A>', '',                  # from second match
  '·today'                     # after last match
)

The extra elements clutter the list. Using (?:...) for the added parentheses, however, returns the regex to split usefulness, with the results being:

( 'Please·',          # before first match
  '<A·HREF="test">',  # from first match
  'press·me',         # between matches
  '</A>',             # from second match
  '·today'            # after last match
)
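
To see the difference between the two styles of parentheses, a sketch such as the following can be used. It only counts the returned elements; the array names are assumptions.

```perl
use strict;
use warnings;

my $html = 'Please <A HREF="test">press me</A> today';

# Outer parentheses capture the whole tag; inner ones only group.
my @parts = split(/(<[^>"]*(?:"[^"]*"[^>"]*)*>)/, $html);

print scalar(@parts), "\n";      # 5
print join('|', @parts), "\n";   # Please |<A HREF="test">|press me|</A>| today

# With capturing inner parentheses, each match contributes two elements.
my @cluttered = split(/(<[^>"]*("[^"]*"[^>"]*)*>)/, $html);

print scalar(@cluttered), "\n";  # 7
```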

Perl Efficiency Issues

For the most part, efficiency with Perl regular expressions is achieved in the same way as with any tool that uses a Traditional NFA: use the techniques discussed in Chapter 5 -- the internal optimizations, the unrolling methods, the ``Think'' section -- they all apply to Perl.

There are, of course, Perl-specific efficiency issues, such as the use of non-capturing parentheses unless you specifically need capturing ones. There are some much larger issues as well, and even the issue of capturing vs. non-capturing is larger than the micro-optimization explained in Chapter 5 (=>152). In this section, we'll look at this (=>276), as well as the following topics:

``There's More Than One Way to Do It''

There are often many ways to go about solving any particular problem, so there's no substitute for really knowing all that Perl has to offer when balancing efficiency and readability. Let's look at the simple problem of padding an IP address like 18.181.0.24 such that each of the four parts becomes exactly three digits: 018.181.000.024. One simple and readable solution is:
$ip = sprintf "%03d.%03d.%03d.%03d", split(/\./, $ip);

This is a fine solution, but there are certainly other ways to do the job. In the same style as the article from The Perl Journal that I mentioned in the footnote on page 229, let's examine various ways of achieving the same goal. This example's goal is simple and not very ``interesting'' in and of itself, yet it represents a common text-handling task. Its simplicity will let us concentrate on the differing approaches to using Perl. Here are a few other solutions:

  1. $ip =~ s/(\d+)/sprintf("%03d", $1)/eg;

  2. $ip =~ s/\b(\d{1,2}\b)/sprintf("%03d", $1)/eg;

  3. $ip = sprintf("%03d.%03d.%03d.%03d", $ip =~ m/(\d+)/g);

  4. $ip =~ s/\b(\d\d?\b)/'0' x (3-length($1)) . $1/eg;

  5. $ip = sprintf("%03d.%03d.%03d.%03d",
    $ip =~ m/^(\d+)\.(\d+)\.(\d+)\.(\d+)$/);

  6. $ip =~ s/\b(\d(\d?)\b)/$2 eq '' ? "00$1" : "0$1"/eg;

  7. $ip =~ s/\b(\d\b)/00$1/g;
    $ip =~ s/\b(\d\d\b)/0$1/g;

Like the original solution, each produces the same results when given a correct IP address, but each fails in a different way if given something else. If there is any chance that the data will be malformed, more care than any of these solutions provides is needed. That aside, the practical differences lie in efficiency and readability. As for readability, about the only thing that's easy to see about most of these is that they are cryptic at best.
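
A quick way to confirm that several of these really do agree on well-formed input is to wrap them in subroutines and compare the results. This is only a sketch: the hash %solution and its keys are assumptions, and it covers the original plus solutions 1, 4, and 7.

```perl
use strict;
use warnings;

# Each entry takes an address and returns the padded form.
my %solution = (
    original => sub { my $ip = shift;
                      sprintf "%03d.%03d.%03d.%03d", split(/\./, $ip) },
    one      => sub { my $ip = shift;
                      $ip =~ s/(\d+)/sprintf("%03d", $1)/eg;
                      $ip },
    four     => sub { my $ip = shift;
                      $ip =~ s/\b(\d\d?\b)/'0' x (3-length($1)) . $1/eg;
                      $ip },
    seven    => sub { my $ip = shift;
                      $ip =~ s/\b(\d\b)/00$1/g;
                      $ip =~ s/\b(\d\d\b)/0$1/g;
                      $ip },
);

for my $name (sort keys %solution) {
    printf "%-8s %s\n", $name, $solution{$name}->('18.181.0.24');
}
# every line shows 018.181.000.024
```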

So, what about efficiency? I benchmarked these solutions on my system with Perl version 5.003, and have listed them in order from least to most efficient. The original solution belongs somewhere between positions four and five, the best taking only 80 percent of its time, the worst about 160 percent.39 But if efficiency is really important, faster methods are still available:

substr($ip,  0, 0) = '0' if substr($ip,  1, 1) eq '.';
substr($ip,  0, 0) = '0' if substr($ip,  2, 1) eq '.';
substr($ip,  4, 0) = '0' if substr($ip,  5, 1) eq '.';
substr($ip,  4, 0) = '0' if substr($ip,  6, 1) eq '.';
substr($ip,  8, 0) = '0' if substr($ip,  9, 1) eq '.';
substr($ip,  8, 0) = '0' if substr($ip, 10, 1) eq '.';
substr($ip, 12, 0) = '0' while length($ip) < 15;

39  With Perl4, for reasons I don't exactly know, the original solution is actually the fastest of those listed. I wasn't able, however, to benchmark the solutions using /e due to a related Perl4 memory leak that rendered the results meaningless.

This takes only half the time of the original, but at a fairly expensive toll in understandability. Which solution you choose, if any, is up to you. There are probably other ways still. Remember, ``There's more than one way to do it.''
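
If you would like to time such alternatives on your own system, Perl's standard Benchmark module is one option. The sketch below compares just two of the approaches; the subroutine names and the iteration count are assumptions, and the relative timings will depend on your Perl version and machine.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

sub pad_sprintf { my $ip = shift;
                  sprintf "%03d.%03d.%03d.%03d", split(/\./, $ip) }
sub pad_subst   { my $ip = shift;
                  $ip =~ s/(\d+)/sprintf("%03d", $1)/eg;
                  $ip }

# Run each candidate 200,000 times and print a comparison table.
cmpthese(200_000, {
    sprintf_split => sub { pad_sprintf('18.181.0.24') },
    subst_e       => sub { pad_subst('18.181.0.24') },
});
```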

Regex Compilation, the /o Modifier, and Efficiency

In ``Doublequotish Processing and Variable Interpolation,'' we saw how Perl parses a script in phases. All the processing represented by Figure 7-1 (=>223) can be rather substantial. As an optimization, Perl realizes that if there is no variable interpolation, it would be useless to bother processing all the phases each time the regex is used -- if there is no interpolation, the regex can't possibly change from use to use. In such cases, the internal form is saved the first time the regex is compiled, then used directly for all subsequent matches via the same operator. This saves a lot of reprocessing work.

On the other hand, if the regex changes each time, it certainly makes sense for Perl to reprocess it for us. Of course, it takes longer to redo the processing, but it's a very convenient feature that adds remarkable flexibility to the language, allowing a regex to vary with each use. Still, as useful as it may be, the extra work is sometimes needless. Consider a situation where a variable that doesn't change from use to use is used to provide a regex:

$today = (qw<Sun Mon Tue Wed Thu Fri Sat>)[(localtime)[6]];
#  $today now holds the day ("Mon", "Tue", etc., as appropriate)
$regex = "^$today:";
while (<LOGFILE>) {
    if (m/$regex/) {
 :

The variable $regex is set just once, before the loop. The match operator that uses it, however, is inside a loop, so it is applied over and over again, once per line of <LOGFILE>. We can look at this script and know for sure that the regex doesn't change during the course of the loop, but Perl doesn't know that. It knows that the regex operand involves interpolation, so it must re-evaluate that operand each time it is encountered.

This doesn't mean the regex must be fully recompiled each time. As an intermediate optimization, Perl uses the compiled form still available from the previous use (of the same match operand) if the re-evaluation produces the same final regex. This saves a full recompilation, but in cases where the regex never does change, the processing of Figure 7-1's Phase B and C, and the check to see if the result is the same as before, are all wasted effort.

This is where the /o modifier comes in. It instructs Perl to process and compile a regex operand the first time it is used, as normal, but to then blindly use the same internal form for all subsequent tests by the same operator. The /o ``locks in'' a regex the first time a match operator is used. Subsequent uses apply the same regex even if variables making up the operand were to change. Perl won't even bother looking. Normally, you use /o as a measure of efficiency when you don't intend to change the regex, but you must realize that even if the variables do change, by design or by accident, Perl won't reprocess or recompile if /o is used.

Now, let's consider the following situation:

while (...)
{
   :
  $regex = &GetReply('Item to find');

  foreach $item (@items) {
      if ($item =~ m/$regex/o) { # /o used for efficiency, but has a gotcha!
        :
      }
  }
   :
}

The first time through the inner foreach loop (of the first time through the outer while loop), the regular expression is processed and compiled, the result of which is then used for the actual match attempt. Because the /o modifier is used, the match operator in question uses the same compiled form for all its subsequent attempts. Later, in the second iteration of the outer loop, a new $regex is read from the user with the intention of using it for a new search. It won't work -- the /o modifier means to compile a regex operator's regex just once, and since it had already been done, the original regex continues to be used -- the new value of $regex is completely ignored.
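
The gotcha is easy to reproduce in miniature. In this sketch, fixed strings stand in for the replies that would come from the user, and @seen is a hypothetical array used only to record what matched.

```perl
use strict;
use warnings;

my @seen;
for my $regex ('a', 'b') {            # stand-ins for successive user replies
    for my $item ('a', 'b') {
        # /o locks in the regex the first time this operator runs.
        push @seen, "$regex:$item" if $item =~ m/$regex/o;
    }
}

print "@seen\n";   # a:a b:a
```

The second pass intends to search for b, yet it is the item a that matches, because the operator is still using the regex compiled during the first pass.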

The easiest way to solve this problem is to remove the /o modifier. This allows the program to work, but it is not necessarily the best solution. Even though the intermediate optimization stops the full recompile (except when the regex really has changed, the first time through each inner loop), the pre-compile processing and the check to see if it's the same as the previous regex must still be done each and every time. The resulting inefficiency is a major drawback that we'd like to avoid if at all possible.

Using the default regex to avoid work

If you can somehow ensure a successful sample match to install a regex as the default, you can reuse that match with the empty-regex construct m// (=>248):
while (...)
{
    :
    $regex = &GetReply('Item to find');

    # install regex (must be successful to install as default)
    if ($sample_text !~ m/$regex/) {
        die "internal error: sample text didn't match!";
    }

    foreach $item (@items) {
       if ($item =~ m//)  # use default regex
       {
         :
       }
    }
    :
}

Unfortunately, it's usually quite difficult to find something appropriate for the sample string if you don't know the regex beforehand. Remember, a successful match is required to install the regex as the default. Additionally (in modern versions of Perl), that match must not be within a dynamic scope that has already exited.
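
Here is a self-contained sketch of the default-regex technique, for a case where a suitable sample string happens to be known. The regex, sample text, and item list are all assumptions for illustration.

```perl
use strict;
use warnings;

my $regex       = 'b+';
my $sample_text = 'abbbc';          # known to match the regex above
my @items       = qw(xbx yyy bob);
my @hits;

# Install the regex as the default; the match must succeed.
if ($sample_text !~ m/$regex/) {
    die "internal error: sample text didn't match!";
}

foreach my $item (@items) {
    push @hits, $item if $item =~ m//;   # empty regex: reuse the default
}

print "@hits\n";   # xbx bob
```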

The /o modifier and eval

There is a solution that overcomes all of these problems. Although complex, the efficiency gains often justify the means. Consider:

while (...)
{
    :
    $regex = &GetReply('Item to find');
    eval 'foreach $item (@items) {
                 if ($item =~ m/$regex/o) {
 :
                 }
         }';

    # if $@ is non-empty, the eval had an error.
    if ($@) { 
             ...report error from eval had there been one ...
    }
    :
}

Notice that the entire foreach loop is within a singlequoted string, which itself is the argument to eval. Each time the string is evaluated, it is taken as a new Perl snippet, so Perl parses it from scratch, then executes it. It is executed ``in place,'' so it has access to all the program's variables just as if it had been part of the regular code. The whole idea of using eval in this way is to delay the parsing until we know what each regex is to be.

What is interesting for us is that, because the snippet is parsed afresh with each eval, any regex operands are parsed from scratch, starting with Phase A of Figure 7-1 (=>223). As a consequence, the regular expression will be compiled when first encountered in the snippet (during the first time through the foreach loop), but not recompiled further due to the /o modifier. Once the eval has finished, that incarnation of the snippet is gone forever. The next time through the outer while loop, the string handed to eval is the same, but because the eval interprets it afresh, the snippet is considered new all over again. Thus, the regular expression is again new, and so it is compiled (with the new value of $regex) the first time it is encountered within the new eval.

Of course, it takes extra effort for eval to compile the snippet with each iteration of the outer loop. Does the /o savings justify the extra time? If the array of @items is short, probably not. If long, probably so. A few benchmarks (addressed later) can often help you decide.
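
A runnable miniature of the eval approach might look like the following. Fixed strings again stand in for the user's replies, and the names @items and @found are assumptions.

```perl
use strict;
use warnings;

my @items = qw(apple banana cherry);
my @found;

for my $regex ('an', 'rr') {          # stand-ins for successive user replies
    # The singlequoted snippet is parsed afresh on each pass, so /o locks
    # in the current $regex only for the duration of this eval.
    eval 'foreach my $item (@items) {
              push @found, $item if $item =~ m/$regex/o;
          }';
    die $@ if $@;
}

print "@found\n";   # banana cherry
```

Had the same loop been written without eval, the second pass would still have searched for an.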

This example takes advantage of the fact that, when we build a program snippet in a string to feed to eval, Perl doesn't consider it to be Perl code until eval is actually executed. You can, however, have eval work with a normally compiled block of code, instead.

Evaluating a string vs. evaluating a block

eval is special in that its argument can be a general expression (such as the singlequoted string just used) or a {...} block of code. When using the block method, such as with
eval {foreach $item (@items) {
             if ($item =~ m/$regex/o) {
 :
             }
     }};
the snippet is checked and compiled only once, at program load time. The intent is ostensibly to be more efficient than the one earlier (in not recompiling with each use), but here it defeats the whole point of using eval in the first place. We rely on the snippet being recompiled with each use, so we must use the non-block version.

Included among the variety of reasons to use eval are the recompile-effects of the non-block style and the ability to trap errors. Run-time errors can be trapped with either the string or the block style, while only the string style can trap compile-time errors as well. (We saw an example of this with $* on page 235.) Trapping run-time errors (such as to test if a feature is supported in your version of Perl), including those raised by die and the like, is about the only reason I can think of to use the eval {...} block version.
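
The difference in what each style can trap can be sketched as follows. The error message and the deliberately broken snippet are assumptions for illustration.

```perl
use strict;
use warnings;

# Block style: compiled along with the rest of the program,
# so only a run-time error can occur (and be trapped) here.
eval { die "run-time failure\n" };
my $runtime_error = $@;

# String style: compiled when the eval executes, so even a
# syntax error in the snippet is trapped rather than fatal.
eval 'my $x = ;';
my $compile_error = $@;

print "block trapped: $runtime_error";   # block trapped: run-time failure
print "string trapped a compile-time error\n" if $compile_error;
```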

A third reason to use eval is to execute code that you build on the fly. The following snippet shows a common trick:

sub Build_MatchMany_Function
{
   my @R = @_;        # Arguments are regexes
   my $program = '';  # We'll build up a snippet in this variable
   foreach $regex (@R) {
       $program .= "return 1 if m/$regex/;"; # Create a check for each regex
   }
   my $sub = eval "sub { $program; return 0 }"; # create anonymous function
   die $@ if $@;
   $sub; # return function to user
}

Before explaining the details, let me show an example of how it's used. Given an array of regular expressions, @regexes2check, you might use


# Create a function to check a bunch of regexes
$CheckFunc = Build_MatchMany_Function(@regexes2check);

while (<>) {
    # Call the function to check the current $_
    if (&$CheckFunc) {
        ...have a line which matches one of the regexes...
    }
}

Given a list of regular expressions (or, more specifically, a list of strings intended to be taken as regular expressions), Build_MatchMany_Function builds and returns a function that, when called, indicates whether any of the regexes match the contents of $_.

The reason to use something like this is efficiency. If you knew what the regexes were when writing the script, all this would be unnecessary. Not knowing, you might be able to get away with

$regex = join('|', @regexes2check); # Build monster regex

while (<>) {
    if (m/$regex/o) {
        ...have a line which matches one of the regexes...
    }
}
but the alternation makes it inefficient. (Also, it breaks if any but the first regex have backreferences.) You could also just loop through the regexes, applying them as
while (<>) {
    foreach $regex (@regexes2check) {
        if (m/$regex/) {
            ...have a line which matches one of the regexes...
            last;
        }
    }
}

This, too, is inefficient because each regex must be reprocessed and recompiled each time. Extremely inefficient. So, spending time to build an efficient match approach in the beginning can, in the long run, save a lot.

If the strings passed to Build_MatchMany_Function are this, that, and other, the snippet that it builds and evaluates is effectively:

sub {
  return 1 if m/this/;
  return 1 if m/that/;
  return 1 if m/other/;
  return 0
}

Each time this anonymous function is called, it checks the $_ at the time for the three regexes, returning true the moment one is found.

It's a nice idea, but there are problems with how it's commonly implemented (including the one on the previous page). Pass Build_MatchMany_Function a string which contains a $ or @ that can be interpreted, within the eval, by variable interpolation, and you'll get a big surprise. A partial solution is to use a singlequote delimiter:

$program .= "return 1 if m'$regex';"; # Create a check for each regex

But there's a bigger problem. What if one of the regexes contains a singlequote (or one of whatever the regex delimiter is)? Wanting the regex don't adds
  return 1 if m'don't';
to the snippet, which results in a syntax error when evaluated. You can use \xff or some other unlikely character as the delimiter, but why take a chance? Here's my solution to take care of these problems:

sub Build_MatchMany_Function
{
    my @R = @_;
    my $expr = join '||', map { "m/\$R[$_]/o" } (0..$#R);
    my $sub = eval "sub { $expr }"; # create anonymous function
    die $@ if $@;
    $sub; # return function to user
}
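
As a quick check that this version copes with the troublesome characters mentioned above, the following sketch repeats the function and feeds it regexes containing a singlequote, a slash, and an at-sign. The test strings and the names $check and @results are assumptions.

```perl
use strict;
use warnings;

# Same function as in the text above.
sub Build_MatchMany_Function
{
    my @R = @_;
    my $expr = join '||', map { "m/\$R[$_]/o" } (0..$#R);
    my $sub = eval "sub { $expr }"; # create anonymous function
    die $@ if $@;
    $sub; # return function to user
}

my $check = Build_MatchMany_Function("don't", 'a/b', 'x@y');

# The generated function tests whatever $_ holds when it is called.
my @results = map { $check->() ? 1 : 0 }
              ("I don't know", 'path a/b/c', 'nothing here');

print "@results\n";   # 1 1 0
```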

I'll leave the analysis as an exercise. However, one question: What happens if this function uses local instead of my for the @R array? ¤ Turn the page to check your answer.

local vs. my

¤ Answer to the question on page 273.

Before answering the question, first a short summary of binding: Whenever Perl compiles a snippet, whether during program load or eval, references to variables in the snippet are ``linked'' (bound) to the code. The values of the variables are not accessed -- during program load time, the variables don't have values yet. Values are accessed when the code is actually executed.

my @R creates a new variable, private, distinct, and unrelated to all other variables in the program. When the snippet to create the anonymous subroutine (the a-sub) is evaluated, its code is linked to our private @R. References to ``@R'' within the snippet refer to this private variable. They don't access @R yet, since that happens only when the a-sub is executed.

When Build_MatchMany_Function exits, the private @R would normally disappear, but since the a-sub still links to it, @R and its strings are kept around, although they become inaccessible to all but the a-sub (references to ``@R'' elsewhere in the code refer to the unrelated global variable of the same name). When the a-sub is later executed, that private @R is referenced and our strings are used as desired.

local @R, on the other hand, simply saves a copy of the global variable @R before we overwrite it with the @_ array. It's likely that there was no previous value, but if some other part of the program happens to use it (the global variable @R, that is) for whatever reason, we've saved a copy so we don't interfere, just in case. When the snippet is evaluated, the ``@R'' links to the same global variable @R. (This linking is unrelated to our use of local -- since there is no private my version of @R, references to ``@R'' are to the global variable, and any use or non-use of local to copy data is irrelevant.)

When Build_MatchMany_Function exits, the @R copy is restored. The global variable @R is the same variable as it was during the eval, and is still the same variable that the a-sub links to, but the content of the variable is now different. (We'd copied new values into @R, but never referenced them! A completely wasted effort.) The a-sub expects the global variable @R to hold strings to be used as regular expressions, but we've already lost the values we'd copied into it. When the subroutine is first used, it sees what @R happens to have at the time -- whatever it is, it's not our regexes, so the whole approach breaks down.
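The difference can be seen in a small, self-contained sketch. It is a simplification, not the Build_MatchMany_Function code itself: the builder names are hypothetical, the closures are written directly rather than created with eval, and they merely join the strings instead of using them as regexes.

```perl
use strict;
use warnings;

our @R = ('global-value');   # stands in for the unrelated global @R

# Hypothetical builder using my: the closure is linked to the private @R,
# which is kept alive for as long as the closure exists.
sub build_with_my {
    my @R = @_;
    return sub { return join ',', @R };
}

# Hypothetical builder using local: the closure is linked to the global @R,
# whose earlier contents are restored when the builder returns.
sub build_with_local {
    local @R = @_;
    return sub { return join ',', @R };
}

my $my_sub    = build_with_my('^foo', 'bar$');
my $local_sub = build_with_local('^foo', 'bar$');

print "my:    ", $my_sub->(),    "\n";   # sees the strings we passed in
print "local: ", $local_sub->(), "\n";   # sees whatever the global holds now
```

The second closure never sees the strings handed to its builder, which is the breakdown described above.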

Mmm. If before we leave Build_MatchMany_Function, we actually use the a-sub (and make sure that the sample text cannot match any of the regexes), the /o would lock in our regexes while @R still holds them, getting around the problem. (If the sample text matches a regex, only the regexes to that point are locked in.) This might actually be an appealing solution -- we need @R only until all the regexes are locked in. After that, the strings used to create them are unneeded, so keeping them around in the separate (but undeletable until the a-sub is deleted) my @R wastes memory.

Unsociable $& and Friends

$`, $&, and $' refer to the text leading the match, the text matched, and the text that trails the match, respectively (=>217). Even if the target string is later changed, these variables must still refer to the original text, as advertised. Case in point: the target string is changed immediately during a substitution, but we still need to have $& refer to the original (and now-replaced) text. Furthermore, even if we change the target string ourselves, $1, $&, and friends must all continue to refer to the original text (at least until the next successful match, or until the block ends). So, how does Perl conjure up the original text despite possible changes?

It makes a copy. All the variables described above actually refer to this internal-use-only copy, rather than to the original string. Having a copy means, obviously, that there are two copies of the string in memory at once. If the target string is huge, so is the duplication. But then, since you need the copy to support these variables, there is really no other choice, right?
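A minimal sketch of the behavior that forces the copy is shown here. The sample text and variable names are my own; the point is only that $& and $1 still report the original text after the target has been overwritten.

```perl
use strict;
use warnings;

my $text = 'Water boils at 212F';
my ($whole, $number);

if ($text =~ m/(\d+)F/) {
    $text = 'something else entirely';   # clobber the target string
    ($whole, $number) = ($&, $1);        # these still see the original text
}

print "matched '$whole', captured '$number'\n";
```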

Internal optimizations

Not exactly. If you don't intend to use these special variables, the copy is obviously unnecessary, and omitting it can yield a huge savings. The problem is that Perl doesn't realize that you have no intention to use the variables. Still, Perl sometimes realizes that it doesn't need to make the copy. If you can get your code to trigger such cases, it will run more efficiently. Eliminating the copy not only saves time, but as a surprise mystery bonus, the substitution operator is often more efficient when no copy is done. (This is taken up later in this section.)

Warning: The situations and tricks I describe exploit internal workings of Perl. It's nice if they make your programs faster, but they're not part of the Perl specification40 and may be changed in future releases. (I am writing as of version 5.003.) If these optimizations suddenly disappear, the only effect will be on efficiency -- programs will still produce the same results, so you don't need to worry that much.

40  Well, they wouldn't be a part if there were a Perl specification.

Three situations trigger the copy for a successful match or substitution:
 ·  the use of $`, $&, or $' anywhere in the entire script
 ·  the use of capturing parentheses in the regex
 ·  the use of the /i modifier with a non-/g match operator

Also, you might need additional internal copies to support:
 ·  the use of the /i modifier (with any match or substitute)
 ·  the use of many, but not all, substitution operators

I'll go over the first three here, and the other two in the following section.

$`, $&, or $' require the copy

Perl must make a copy to support any use of $`, $&, and $'. In practice, these variables are not used after most matches, so it would be nice if the copy were done only for those matches that needed it. But because of the dynamically scoped nature of these variables, their use may well be some distance away from the actual match. Theoretically, it may be possible for Perl to do exhaustive analysis to determine that all uses of these variables can't possibly refer to a particular match (and thus omit the copy for that match), but in practice Perl does not do this. Therefore, it must normally do the copy for every successful match of all regexes during the entire run of the program.

However, it does notice whether there are no references whatsoever to $`, $&, or $' in the entire program (including all libraries referenced by the script!). Since the variables never appear in the program, Perl can be quite sure that a copy merely to support them can safely be omitted. Thus, if you can be sure that your code and any libraries it might reference never use $`, $&, or $', you are not penalized by the copy except when explicitly required by the two other cases.

Capturing parentheses require the copy

If you use capturing parentheses in the regex, Perl assumes that you intend to use the captured text, so it does the copy after the match. (This means that the eventual use or non-use of $1 has no bearing whatsoever -- if capturing parentheses are used, the copy is made even if its results are never used.) In Perl4 there were no grouping-only parentheses, so even if you didn't intend to capture text, you did anyway as a side effect and were penalized accordingly. Now, with (?:...), you should never find that you are capturing text that you don't intend to use.41 But when you do intend to use $1, $2, etc., Perl will have made the copy for you.

41  Well, capturing parentheses are also used for backreferences, so it's possible that capturing parentheses might be used when $1 and the like are not. This seems uncommon in practice.
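As a small illustration (the header line and variable names are assumptions of mine), grouping-only parentheses let you keep the alternation without capturing it, so parentheses that capture appear only where the text is actually wanted:

```perl
use strict;
use warnings;

my $line = 'From: tubby';

# Grouping only: no capturing parentheses at all in this regex.
my $is_header = ($line =~ m/^(?:Subject|From):/) ? 1 : 0;

# Grouping for the alternation, capturing only the part we intend to use.
my $value;
if ($line =~ m/^(?:Subject|From): (.*)/) {
    $value = $1;    # the first (and only) capturing group
}

print "header=$is_header value=$value\n";
```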

m/.../i requires the copy

A non-/g match operator with /i causes a copy. Why? Frankly, I don't know. From looking at the code, the copy seems entirely superfluous to me, but I'm certainly no expert in Perl internals. Anyway, there's another, more important efficiency hit to be concerned about with /i. I'll pick up this subject again in a moment, but first I'd like to show some benchmarks that illustrate the effects of the $&-support copy.

An example benchmarked

I ran a simple benchmark that checked m/c/ against each of the 50,000 or so lines of C that make up the Perl source distribution. The check merely noted whether there was a `c' on a line -- the benchmark didn't actually do anything with the information since the goal was to determine the effect of the behind-the-scenes copying. I ran the test two different ways: once where I made sure not to trigger any of the conditions mentioned above, and once where I made sure to do so. The only difference, therefore, was in the extra copy overhead.

The run with the extra copying consistently took over 35 percent longer than the one without. This represents an ``average worst case,'' so to speak. The more real work a program does, the less of an effect, percentage-wise, the copying has. The benchmark didn't do any real work, so the effect is highlighted.

On the other hand, in true worst-case scenarios, the extra copy might truly be an overwhelming portion of the work. I ran the same test on the same data, but this time as one huge line incorporating the more than a megabyte of data rather than the 50,000 or so reasonably sized lines. Thus, the relative performance of a single match can be checked. The match without the copy returned almost immediately, since it was sure to find a `c' somewhere near the start of the string. Once it did, it was done. The test with the copy is the same except, well, it had to make a copy of the megabyte-plus-sized string first. Relatively speaking, it took over 700 times longer! Knowing the ramifications, therefore, of certain constructs allows you to tweak your code for better efficiency.

Conclusions and recommendations about $& and friends

It would be nice if Perl knew the programmer's intentions and made the copy only as necessary. But remember, the copies are not ``bad'' -- Perl's handling of these bookkeeping drudgeries behind the scenes is why we use it and not, say, C or assembly language. Indeed, Perl was first developed in part to free users from the mechanics of bit-fiddling so they could concentrate on creating solutions to problems.

A solution in Perl can be approached in many ways, and I've said numerous times that if you write in Perl as you write in another language (such as C), your Perl will be lacking and almost certainly inefficient. For the most part, crafting programs The Perl Way should go a long way toward putting you on the right track, but still, as with any discipline, special care can produce better results. So yes, while the copies aren't ``wrong,'' we still want to avoid unnecessary copying whenever possible. Towards that end, there are steps we can take.

Foremost, of course, is to never use $`, $&, or $' anywhere in your code. This also means never using English.pm, or any library module that uses it or otherwise references these variables. Table 7-10 shows a list of standard libraries (in Perl version 5.003) that reference one of the naughty variables, or use another library that does. You'll notice that most are tainted only because they use Carp.pm. If you look into that file, you'll find only one naughty variable:

$eval =~ s/[\\\']/\\$&/g;

Changing this to

$eval =~ s/([\\\'])/\\$1/g;
makes most of the standard libraries sociable in this respect. Why hasn't this been done to the standard distribution? I have no idea. Hopefully, it will be changed in a future release.


Table 7-10: Standard Libraries That Are Naughty (That Reference $& and Friends)
  AutoLoader   Fcntl   Pod::Text
  AutoSplit   File::Basename   POSIX
  Benchmark   File::Copy   Safe
  Carp   File::Find   SDBM_File
  DB_File   File::Path   SelectSaver
  diagnostics   FileCache   SelfLoader
  DirHandle   FileHandle   Shell
  dotsh.pl   GDBM_File   Socket
  dumpvar.pl   Getopt::Long   Sys::Hostname
  DynaLoader   IPC::Open2   Syslog
  English   IPC::Open3   Term::Cap
  ExtUtils::Install   lib   Test::Harness
  ExtUtils::Liblist   Math::BigFloat   Text::ParseWords
  ExtUtils::MakeMaker   MM_VMS   Text::Wrap
  ExtUtils::Manifest   newgetopt.pl   Tie::Hash
  ExtUtils::Mkbootstrap   ODBM_File   Tie::Scalar
  ExtUtils::Mksymlists   open2.pl   Tie::SubstrHash
  ExtUtils::MM_Unix   open3.pl   Time::Local
  ExtUtils::testlib   perl5db.pl   vars
Naughty due to the use of:  C: Carp B: File::Basename E: English L: Getopt::Long


If you can be sure these variables never appear, you'll know you will do the copy only when you explicitly request it with capturing parentheses, or via the rogue m/.../i. Some expressions in current code might need to be rewritten. On a case-by-case basis, $` can often be mimicked by (.*?) at the head of the regex, $& by (...) around the regex, and $' by (?=(.*)) at the end of the regex.
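Here is one way the three rewrites might be combined in a single regex. This is a sketch under my own assumptions: the sample string is made up, and I add the /s modifier so that dot also matches the newline in the text surrounding the match.

```perl
use strict;
use warnings;

my $target = "Dear <FIRST>,\nhow are you?";
my ($before, $matched, $after);

# (.*?) at the head stands in for $`, the parentheses around the
# regex proper stand in for $&, and the trailing (?=(.*)) for $'.
if ($target =~ m/(.*?)(<FIRST>)(?=(.*))/s) {
    ($before, $matched, $after) = ($1, $2, $3);
}

print "before=[$before] matched=[$matched] after=[$after]\n";
```

The cost is that the regex now has capturing parentheses, so it still incurs the copy, but only for this one regex rather than for every match in the program.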

If your needs allow, there are other, non-regex methods that might be attempted in place of some regexes. You can use index(...) to find a fixed string, for example. In the benchmarks I described earlier, it was almost 20 percent faster than m/.../, even without the copy overhead.
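For instance, a fixed-string test might be written either way, as in this sketch (the sample lines are made up, and no timing claim is attached to it):

```perl
use strict;
use warnings;

my @lines = ("int main(void)\n", "{\n", "    return 0;\n", "}\n");

# Count the lines containing the fixed string, first with a regex ...
my $count_regex = grep { m/return/ } @lines;

# ... then with index, which returns -1 when the string is absent.
my $count_index = grep { index($_, 'return') >= 0 } @lines;

print "regex found $count_regex, index found $count_index\n";
```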

How to check whether your code is tainted by $&

Especially with the use of libraries, it's not always easy to notice whether your program ever references $&, $`, or $'. I went to the trouble of modifying my version of Perl to issue a warning the first time one was encountered. (If you'd like to do the same, search for the three appearances of sawampersand in Perl's gv.c, and add an appropriate call to warn.)

An easier approach is to test for the performance penalty, although it doesn't tell you where the offending variable is. Here's a subroutine that I've come up with:

sub CheckNaughtiness
{
  local($_) = 'x' x 10000; # some non-small amount of data

  # calculate the overhead of a do-nothing loop
  local($start) = (times)[0];
  for ($i = 0; $i < 5000; $i++)             {       }
  local($overhead) = (times)[0] - $start;

  # now calculate the time for the same number of simple matches
  $start = (times)[0];
  for ($i = 0; $i < 5000; $i++)             { m/^/; }
  local($delta) = (times)[0] - $start;

  # a differential of 10 is just a heuristic
  printf "It seems your code is %s (overhead=%.2f, delta=%.2f)\n",
    ($delta > $overhead*10) ? "naughty":"clean", $overhead, $delta;
}

This is not a function you would keep in production code, but one you might insert temporarily and call once at the beginning of the program (perhaps immediately following it with an exit, then removing it altogether once you have your answer). Once you know your program is $&-clean, there is still a chance that a rogue eval could introduce it during the run of the program, so it's also a good idea to test at the end of execution, just to be sure.

The Efficiency Penalty of the /i Modifier

If you ask Perl to match in a case-insensitive manner, common sense tells you that you are asking for more work to be done. You might be surprised, though, to find out just how much extra work that really is.

Before a match or substitution operator applies an /i-governed regex, Perl first makes a temporary copy of the entire target string. This copy is in addition to any copy in support of $& and friends. The latter is done only after a successful match, while the one to support a case-insensitive match is done before the attempt. After the copy is made, the engine then makes a second pass over the entire string, converting any uppercase characters to lowercase. The result might happen to be the same as the original, but in any case, all letters are lowercase.

This goes hand in hand with a bit of extra work done during the compilation of the regex to an internal form. At that time, uppercase letters in the regex are converted to lowercase as well.

The result of these two steps is a string and a regex that then matches normally -- nothing special or extra needs to be done within the actual matching portion of the regex engine. It all appears to be a very tidy arrangement, but this has got to be one of the most gratuitous inefficiencies in all of Perl.

Methods to implement case-insensitive matching

There are (at least) two schools of thought on how to implement case-insensitive matching. We've just seen one, which I call string oriented (in addition to ``gratuitously inefficient''). The other, which I consider to be far superior, is what I would call regex oriented. It works with the original mixed-case target string, having the engine itself make allowances for case-insensitive matching as the need arises.

Many subexpressions (and full regular expressions, for that matter) do not require special handling: the CSV program at the start of the chapter (=>205), the regex to add commas to numbers (=>229), and even the huge, 4,724-byte regex we construct in ``Matching an Email Address'' (=>294) are all free of the need for special case-insensitive handling. A case-insensitive match with such expressions should not have any efficiency penalty at all.

Even a character class with letters shouldn't entail an efficiency penalty. At compile time, the appropriate other-case version of any letter can easily be included. (A character class's efficiency is not related to the number of characters in the class; =>115.) So, the only real extra work would be when letters are included in literal text, and with backreferences. Although they must be dealt with, they can certainly be handled more efficiently than making a copy of the entire target string.

By the way, I forgot to mention that when the /g modifier is used, the copy is done with each match. At least the copy is only from the start of the match to the end of the string -- with a m/.../ig on a long string, the copies are successively shorter as the matches near the end.

A few /i benchmarks

I did a few benchmarks, similar to the ones on page 276. As before, the test data is a 52,011-line, 1,192,395-byte file made up of Perl's main C source.

As an unfair and cruel first test, I loaded the entire file into a single string, and benchmarked 1 while m/./g and 1 while m/./gi. Dot certainly doesn't care one way or the other about capitalization, so it's not reasonable to penalize this match for case-insensitive handling. On my machine, the first snippet benchmarked at a shade under 12 seconds. Simply adding the /i modifier (which, you'll note, is meaningless in this case) slowed the program by four orders of magnitude, to over a day and a half!42 I calculate that the needless copying caused Perl to shuffle around more than 647,585 megabytes inside my CPU. This is particularly unfortunate, since it's so trivial for the compilation part of the engine to tell the matching part that case-insensitiveness is irrelevant for ., the regex at hand.

42  I didn't actually run the benchmark that long. Based on other test cases, I calculated that it would take about 36.4 hours. Feel free to try it yourself, though.

This unrealistic benchmark is definitely a worst-case scenario. Searching a huge string for something that matches less often than . is more realistic, so I benchmarked m/\bwhile\b/gi and m/\b[wW][hH][iI][lL][eE]\b/g on the same string. Here, I try to mimic the regex-oriented approach myself. It's incredibly naïve for a regex-oriented implementation to actually turn literal text into character classes,43 so we can consider the /i-equivalent to be a worst-case situation in this respect. In fact, manually turning while into [wW][hH][iI][lL][eE] also kills Perl's fixed string check (=>155), and renders study (=>287) useless for the regex. With all this against it, we should expect it to be very slow indeed. But it's still over 50 times faster than the /i version!

43  Although this is exactly what the original implementation of grep did!

Perhaps this test is still unfair -- the /i-induced copy made at the start of the match, and after each of the 412 matches of \bwhile\b in my test data, is large. (Remember, the single string is over a megabyte long.) Let's try testing m/^int/i and m/^[iI][nN][tT]/ on each of the 50,000 lines of the test file. In this case, /i has each line copied before the match attempt, but since they're so short, the copy is not so crushing a penalty as before: the /i version is now just 77 percent slower. Actually, this includes the extra copies inexplicably made for each of the 148 matches -- remember, a non-/g m/.../i induces the $&-support copy.

Final words about the /i penalty

As the later benchmarks illustrate, the /i-related penalty is not as heinous as the first benchmark leads you to believe. Still, it's a concern you should be very aware of, and I hope that future versions of Perl eliminate the most outlandish of these inefficiencies.

Foremost: don't use /i unless you really have to. Blindly adding it to a regex that doesn't require it invites many wasted CPU cycles. In particular, when working with long strings, it can be a huge benefit to rewrite a regex to mimic the regex-oriented approach to case insensitivity, as I did with the last two benchmarks.
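If you do this often, the rewrite can be automated. The helper below is a hypothetical sketch of mine, not something from Perl's libraries: it turns each letter of a literal word into a two-character class and escapes everything else. Whether it pays off on your version of Perl is something to benchmark.

```perl
use strict;
use warnings;

# Hypothetical helper: 'while' becomes '[wW][hH][iI][lL][eE]'.
sub any_case {
    my ($word) = @_;
    return join '', map {
        my $lc = lc $_;
        my $uc = uc $_;
        $lc ne $uc ? "[$lc$uc]" : quotemeta $_;   # only letters get a class
    } split //, $word;
}

my $pattern = any_case('while');
my $text    = 'While (1) { WHILE away the time; awhile }';

# Count the matches without resorting to /i.
my $count = () = $text =~ m/\b$pattern\b/g;

print "pattern $pattern matched $count times\n";
```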

Substitution Efficiency Concerns

As I mentioned, I am not an expert in Perl's internal workings. When it comes to how Perl's substitute operator actually moves strings and substrings around internally during the course of a substitute, well, I'm pretty much completely lost. The code and logic are not for the faint of heart.

Still, I have managed to understand a bit about how it works,44 and have developed a few rules of thumb that I'd like to share with you. Let me warn you, up front, that there's no simple one-sentence summary to all this. Perl often takes internal optimizations in bits and pieces where they can be found, and numerous special cases and opportunities surrounding the substitution operator provide fertile ground for optimizations. It turns out that the $&-support copy disables all of the substitution-related optimizations, so that's all the more reason to banish $& and friends from your code.

44  I modified my copy of version 5.003 to spit out volumes of pretty color-coded messages as various things happen internally. This way, I have been able to understand the overall picture without having to understand the fine details.

Let me start by stepping back to look at the worst-case scenario for the substitution operator's efficiency.

The normal ``slow-mode'' of the substitution operator

In the worst case, the substitution operator simply builds the new copy of the target string, off to the side, then swaps it for the original. As an example, the temperature conversion one-liner from the first page of this chapter
s[(\d+(\.\d*)?)F\b]{sprintf "%.0fC", ($1-32) * 5/9}eg
matches at two locations, `212F' and `32F', in:
Water boils at 212F, freezes at 32F.
When the first match is found, Perl creates an empty temporary string and then copies everything before the match, `Water·boils·at·', to it. The substitution text is then computed (`100C' in this case) and added to the end of the temporary string. (By the way, it's at this point that the $&-support copy would be made were it required.)

At the next match (because /g is used), the text between the two matches is added to the temporary string, followed by the newly computed substitution text, `0C'. Finally, after it can't find any more matches, the remainder of the string (just the final period in this case) is copied to close out the temporary string. This leaves us with:
  Water boils at 100C, freezes at 0C.
The original target string, $_, is then discarded and replaced by the temporary string. (I'd think the original target could be used to support $1, $&, and friends, but it does not -- a separate copy is made for that, if required.)
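The steps just described can be imitated by hand at the Perl level. This is only an analogy for what happens internally, not Perl's actual code, and it relies on the @- and @+ match-offset arrays, which appeared in versions of Perl later than the 5.003 discussed here.

```perl
use strict;
use warnings;

my $target    = 'Water boils at 212F, freezes at 32F.';
my $temp      = '';   # the new string, built off to the side
my $copied_to = 0;    # how much of the original has been dealt with

while ($target =~ m/(\d+(\.\d*)?)F\b/g) {
    my ($match_start, $match_end) = ($-[0], $+[0]);

    # copy the text between the previous match (or string start) and this one
    $temp .= substr($target, $copied_to, $match_start - $copied_to);

    # compute the substitution text and append it
    $temp .= sprintf "%.0fC", ($1 - 32) * 5/9;

    $copied_to = $match_end;
}

$temp .= substr($target, $copied_to);   # the remainder after the last match
$target = $temp;                        # swap the new string for the original

print "$target\n";
```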

At first, this method of building up the result might seem reasonable because for the general case, it is reasonable. But imagine something simple like s/\s+$// to remove trailing whitespace. Do you really need to copy the whole (potentially huge) string, just to lop off its end? In theory, you don't. In practice, Perl doesn't either. Well, at least not always.

$& and friends disable all substitute-operator optimizations

Perl is smart enough to optimize s/\s+$// to simply adjusting the length of the target string. This means no extra copies -- very fast. For reasons that escape me, however, this optimization (and all the substitution optimizations I mention in a moment) is disabled when the $&-support copy is done. Why? I don't know, but the practical effect is yet another way that $& is detrimental to your code's efficiency.

The $&-support copy is also done when there are capturing parentheses in the regex, although in that case, you'll likely be enjoying the fruits of that copy (since $1 and the like are supported by it). Capturing parentheses also disable the substitution optimizations, but at least only for the regexes they're used in and not for all regexes as a single stray $& does.

Replacement text larger than the match not optimized

s/\s+$// is among many examples that fit a pattern: when the replacement text is shorter or the same length as the text being replaced, it can be inserted right into the string. There's no need to make a full copy. Figure 7-2 shows part of the example from page 45, applying the substitution s/<FIRST>/Tom/ to the string `Dear·<FIRST>,[NL]'. The new text is copied directly over what is being replaced, and the text that follows the match is moved down to fill the gap.



Figure 7-2: Applying s/<FIRST>/Tom/ to `Dear·<FIRST>,[NL]'

In the case of s/\s+$//, there's no replacement text to be filled in and no match-following text to be moved down -- once the match is found, the size of the string is adjusted to lop off the match, and that's that. Very zippy. The same kinds of optimizations also apply to matches at the beginning of the string.

When the replacement text is exactly the same size as the matched text, you'd think that as a further optimization, the ``move to fill the gap'' could be omitted, since there is no gap when the sizes are an exact match. For some reason, the needless ``move'' is still done. At least the algorithm, as it stands, never copies more than half the string (since it is smart enough to decide whether it should copy the part before the match, or the part after).

A substitution with /g is a bit better still. It doesn't do the gap-filling move until it knows just how much to move (by delaying the move until it knows where the next match is). Also, it seems in this case that the worthless move to fill a non-existent gap is kindly omitted.

Only fixed-string replacement text substitutions are optimized

These optimizations kick in only when Perl knows the size of the replacement string before the overall match attempt begins. This means that any replacement string with $1 and friends disables the optimizations.45 Other variable interpolation, however, does not. With the previous example, the original substitution was:
$given = 'Tom';
$letter =~ s/<FIRST>/$given/g;

45  It is redundant to say that the use of $1 in the replacement string disables the optimizations. Recall that the use of capturing parentheses in the regex causes the $&-support copy, and that copy also disables the substitution optimizations. It's silly to use $1 without capturing parentheses, as you're guaranteed its value will be undefined (=>217).

The replacement operand has variables interpolated before any matching begins, so the size of the result is known.

Any substitution using the /e modifier, of course, doesn't know the size of the substitution text until after a match, and the substitution operand is evaluated, so there are no substitution optimizations with /e either.
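To summarize the three cases side by side, here is a sketch using the same letter example. The comments restate the rules of thumb above for version 5.003 and are not a claim about any other version; the capturing and /e variants are illustrations of my own.

```perl
use strict;
use warnings;

my $given  = 'Tom';
my $letter = "Dear <FIRST>,\n";

# Replacement interpolated before matching begins, so its size is known:
# a candidate for the optimizations.
(my $fixed = $letter) =~ s/<FIRST>/$given/g;

# Replacement depends on $1 (and the regex captures): not optimized.
(my $captured = $letter) =~ s/<(\w+)>/[$1]/g;

# Replacement computed by /e after each match: not optimized.
(my $evaluated = $letter) =~ s/<FIRST>/uc $given/eg;

print $fixed, $captured, $evaluated;
```

All three produce correct results, of course; the difference is only in how much string shuffling goes on behind the scenes.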

Final comments on all these internal optimizations

When it comes down to it, with all these attempts at optimizations, the infamous YMMV (your mileage may vary) applies. There are a bazillion more details and special cases than I've presented here, and it's certain that some internal workings will change in future versions. If it is important enough to want to optimize, it is, perhaps, important enough to benchmark.

Benchmarking

If you really care about efficiency, it may be best to try benchmarking. Perl5 provides the Benchmark module, but it is tainted by the $& penalty. This is quite unfortunate, for the penalty might itself silently render the benchmark results invalid. I prefer to keep things simple by just wrapping the code to be tested in something like:
$start = (times)[0];
  :
$delta = (times)[0] - $start;
printf "took %.1f seconds\n", $delta;

An important consideration about benchmarking is that due to clock granularity (1/60 or 1/100 of a second on many systems), it's best to have code that runs for at least a few seconds. If the code executes too quickly, do it over and over again in a loop. Also, try to remove unrelated processing from the timed portion. For example, rather than

$start = (times)[0]; # Go! Start the clock.
  $count = 0;
  while (<>) {
     $count++ while m/\b(?:char\b|return\b|void\b)/g;
  }
  print "found $count items.\n";
$delta = (times)[0] - $start; # Done. Stop the clock.
printf "the benchmark took %.1f seconds.\n", $delta;
it is better to do:
$count = 0;          # (no need to have this timed, so bring above the clock start)
@lines = <>;         # Do all file I/O here, so the slow disk is not an issue when timed
$start = (times)[0]; # Okay, I/O now done, so now start the clock.
  foreach (@lines) {
     $count++ while m/\b(?:char\b|return\b|void\b)/g;
  }
$delta = (times)[0] - $start;  # Done. Stop the clock.
print "found $count items.\n"; # (no need to have this timed)
printf "the benchmark took %.1f seconds.\n", $delta;

The biggest change is that the file I/O has been moved out from the timed portion. Of course, if you don't have enough free memory and start swapping to disk, the advantage is gone, so make sure that doesn't happen. You can simulate more data by using a smaller amount of real data and processing it several times:

for ($i = 0; $i < 10; $i++) {
   foreach (@lines) {
        $count++ while m/\b(?:char\b|return\b|void\b)/g;
   }
}

It might take some time to get used to benchmarking in a reasonable way, but the results can be quite enlightening and downright rewarding.
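The pieces above can be folded into a small reusable routine. The name time_code and the made-up sample lines are assumptions of mine; the timing itself is the same (times)[0] bracketing shown earlier, with the repeat count standing in for the outer loop.

```perl
use strict;
use warnings;

# Hypothetical helper: run $code $repeat times, return user CPU seconds used.
sub time_code {
    my ($code, $repeat) = @_;
    my $start = (times)[0];          # start the clock
    $code->() for 1 .. $repeat;
    return (times)[0] - $start;      # stop the clock
}

# Prepare the data before timing, so setup is not part of the measurement.
my @lines = ("char *p;\n", "return 0;\n", "void f(void);\n") x 1000;
my $count = 0;

my $delta = time_code(sub {
    foreach (@lines) {
        $count++ while m/\b(?:char\b|return\b|void\b)/g;
    }
}, 10);

printf "found %d items, the benchmark took %.2f seconds.\n", $count, $delta;
```

With so little data the reported time may well be smaller than the clock granularity, so raise the repeat count until the run lasts a few seconds.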

Regex Debugging Information

Perl carries out a phenomenal number of optimizations to try to arrive at a regex match result quickly; some of the less esoteric ones are listed in Chapter 5's ``Internal Optimizations'' (=>154). If your perl has been compiled with debugging information (by using -DDEBUGGING during its build), the -D debugging command-line option is available. The use of -Dr (-D512 with Perl4) tells you a bit about how Perl compiles your regular expressions and gives you a blow-by-blow account of each application.

Much of what -Dr provides is beyond the scope of this book, but you can readily understand some of its information. Let's look at a simple example (I'm using Perl version 5.003):

 [1] jfriedl@tubby> perl -cwDr -e '/^Subject: (.*)/'
 [2] rarest char j at 3
 [3] first 14 next 83 offset 4
 [4]  1:BRANCH(47)
 [5]  5:BOL(9)
 [6]  9:EXACTLY(23) <Subject: >
 [7] 23:OPEN1(29)
    :
 [8] 47:END(0)
 [9] start `Subject: ' anchored minlen 9 

At [1], I invoke perl at my shell prompt, using the command-line arguments -c (which means check script, don't actually execute it), -w (issue warnings about things Perl thinks are dubious -- always used as a matter of principle), -Dr (regex debugging), and -e (the next argument is the Perl snippet itself). This combination is convenient for checking regexes right from the command line. The regex here is ^Subject:·(.*) which we've seen several times in this book.

Lines [4] through [8] represent Perl's compiled form of the regex. For the most part, we won't be concerned much about it here. However, in even a casual look, line [6] sticks out as understandable.

Literal text cognizance

Many of Perl's optimizations are contingent upon it deducing from the regex some fixed literal text which must appear in any possible match. (I'll call this ``literal text cognizance.'') In the example, that text is `Subject:·', but many expressions either have no required literal text, or have it beyond Perl's ability to deduce. (This is one area where Emacs' optimization far outshines Perl's; =>197.) Some examples where Perl can deduce nothing include -?([0-9]+(\.[0-9]*)?|\.[0-9]+), ^\s*, ^(-?\d+)(\d{3}), and even int|void|while.

Examining int|void|while, you see that `i' is required in any match. Some NFA engines can deduce exactly that (any DFA knows it implicitly), but Perl's engine is unfortunately not one of them. In the debugging output, int, void, and while appear on lines similar to [6] above, but those are local (subexpression) requirements. For literal text cognizance, Perl needs global regex-wide confidence, and as a general rule, it can't deduce fixed text from anything that's part of alternation.

Many expressions, such as <CODE>(.*?)</CODE>, have more than one clump of literal text. In these cases, Perl (somewhat magically) selects one or two of the clumps and makes them available to the rest of the optimization subroutines. The selected clump(s) are shown on a line similar to [9].

Main optimizations reported by -Dr

Line [9] can report a number of different things. Some items you might see include:

start `clump'
Indicates that a match must begin with clump, one of those found by literal text cognizance. This allows Perl to do optimizations such as the fixed string check and first character discrimination discussed in Chapter 5.

must have "clump" back num
Indicates a required clump of text like the start item above, but the clump is not at the start of the regex. If num is not -1, Perl knows any match must begin that many characters earlier. For example, with [Tt]ubby..., the report is `must have "ubby" back 1', meaning that if a fixed string check reveals ubby starting at such and such a location, the whole regex should be applied starting at the previous position.

Conversely, with something like .*tubby, it's not helpful to know exactly where tubby might be in a string, since a match including it could start at any previous position, so num is -1.

stclass `:kind'
Indicates that Perl realizes the match must start with some particular kind of character. With \s+, kind is SPACE, while with \d+ it is DIGIT. With a character class like the [Tt]ubby example, kind is reported as ANYOF.

plus
Indicates that the stclass or a single-character start item is governed by +, so not only can the first character discrimination find the start of potential matches, but it can also quickly zip past a leading \s+ and the like before letting the full (but slower) regex engine attempt the complete match.

anchored
Indicates that the regex begins with a caret anchor. This allows the ``String/Line Anchor'' optimization (=>158).

implicit
Indicates that Perl has added an implicit caret to the start of the regex because the regex begins with .* (=>158).

Other optimizations arising from literal text cognizance

One other optimization arising from literal text cognizance relates to study (which is examined momentarily). Perl somewhat arbitrarily selects one character it considers ``rare'' from the selected clump(s). Before a match, if a string has been study'd, Perl knows immediately if that character exists anywhere in the string. If it doesn't exist, no match is possible and the regex engine does not need to get involved at all. It's a quick way to prune some impossible matches. The character selected is reported at [2].

For the trivia-minded, Perl's idea of the rarest character is \000, followed by \001, \013, \177, and \200. Some of the rarest printable characters are ~, Q, Z, ?, and @. The least-rare characters are e, space, and t. (The manpage says that this was derived by examining a combination of C programs and English text.)

The Study Function

As opposed to optimizing the regex itself, study(...) optimizes access to certain information about a string. A regex, or multiple regexes, can then benefit from the cached knowledge when applied to the string. What it does is simple, but understanding when it's a benefit or not can be quite difficult. It has no effect whatsoever46 on any values or results of a program -- the only effects are that Perl uses more memory, and that overall execution time might increase, stay the same, or (here's the goal) decrease.

46  Or, at least it shouldn't in theory. However, as of Perl version 5.003, there is a bug in which the use of study can cause successful matches to fail. This is discussed further at the end of this section.

When you study a string, Perl takes some time and memory to build a list of places in the string each character is found. On most systems, the memory required is four times the size of the string (but is reused with subsequent calls of study). study's benefit can be realized with each subsequent regex match against the string, but only until the string is modified. Any modification of the string renders the study list invalid, as does studying a different string.

The regex engine itself never looks at the study list; only the transmission references it. The transmission looks at the start and must have debugging information mentioned on page 286 to pick what it considers a rare character (discussed above). It picks a rare (yet required) character because it's not likely to be found in the string, and a quick check of the study list that turns up nothing means a match can be discounted immediately without having to rescan the entire string.

If the rare character is found in the string, and if that character must occur at a known position in any possible match (such as with ..this, but not .?this), the transmission can use the study list to start matching from near the location. This saves time by bypassing perhaps large portions of the string.47

47  There's a bug in the current implementation which disables this optimization when the regex begins with literal text. This is unfortunate because such expressions have generally been thought to benefit most from study.
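
To make the idea concrete, here is a rough sketch in plain Perl of what a study list buys the transmission. It is an illustration only: Perl's real list lives inside the interpreter, and the names build_study_list and candidate_starts are invented for this example. It assumes a regex such as ..this, where the rare character h must sit exactly three characters past the start of any match.

```perl
use strict;
use warnings;

# Hypothetical model of a study list: for each character, the offsets
# at which it occurs in the string.
sub build_study_list {
    my ($string) = @_;
    my %positions;
    my $offset = 0;
    for my $char (split //, $string) {
        push @{ $positions{$char} }, $offset++;
    }
    return \%positions;
}

# Offsets where a match could begin, given that $rare must appear
# exactly $back characters after the start of the match.
sub candidate_starts {
    my ($list, $rare, $back) = @_;
    my @starts;
    for my $pos (@{ $list->{$rare} || [] }) {
        push @starts, $pos - $back if $pos >= $back;
    }
    return @starts;
}

my $list   = build_study_list("find this and that");
my @starts = candidate_starts($list, 'h', 3);    # for ..this
print "worth trying at: @starts\n";
```

Only offsets 3 and 12 are worth handing to the engine here. Had h been absent from the string, the list would be empty and the match could be discounted at once.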

When not to use study

When study can help

study is best used when you have a large string you intend to match many times before the string is modified. A good example is a filter I used in preparing this book. I write in a home-grown markup that the filter converts to SGML (which is then converted to troff, which is then converted to PostScript). Within the filter, an entire chapter eventually ends up within one huge string (this chapter is about 650 kilobytes). Before exiting, I apply a bevy of checks to guard against mistaken markup leaking through. These checks don't modify the string, and they often look for fixed strings, so they're what study thrives on.

Study in the real world

It seems that study has been hexed from the start. First, the programming populace never seemed to understand it well. Then, a bug in Perl versions 5.000 and 5.001 rendered study completely useless. In recent versions, that's been fixed, but now there's a study bug that can cause successful matches in $_ to fail (even matches that have nothing to do with the string that was study'd). I discovered this bug while investigating why my markup filter wasn't working, quite coincidentally, just as I was writing this section on study. It was a bit eerie, to say the least.

You can get around this bug with an explicit undef, or other modification, of the study'd string (when you're done with it, of course). The automatic assignment to $_ in while (<>) is not sufficient.

When study can work, it often doesn't live up to its full potential, either due to simple bugs or an implementation that hasn't matured as fast as the rest of Perl. At this juncture, I recommend against the use of study unless you have a very specific situation you know benefits. If you do use it, and the target string is in $_, be sure to undefine it when you are done.
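
For what it's worth, the calling pattern recommended above might look like the following sketch. The string contents and the list of checks are invented stand-ins for the markup filter described earlier. Note also that recent versions of Perl document study as doing nothing at all, so take this as showing the shape of the usage rather than a promise of speed.

```perl
use strict;
use warnings;

# Stand-in for a large chapter held in one string (contents invented).
my $chapter = "Some text with a stray \@\@TODO marker in it.\n" x 1000;

# Hypothetical read-only checks, mostly looking for fixed strings.
my @checks = ('\@\@TODO', '<blink>', '\bFIXME\b');

study $chapter;                                   # index the string once
my @leaks = grep { $chapter =~ /$_/ } @checks;    # many matches, no modification
undef $chapter;                                   # done with it: drop the study'd string

print "markup leaking through: @leaks\n" if @leaks;
```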

Putting It All Together

This chapter has gone into great detail about Perl's regular expression flavor and operators. At this point, you may be asking yourself what it all means -- it may take a fair amount of use to ``internalize'' the information.

Let's look again at the initial CSV problem. Here's my Perl5 solution, which, as you'll note, is fairly different from the original on page 205:

@fields = ();
push(@fields, $+) while $text =~ m{
    "([^"\\]*(?:\\.[^"\\]*)*)",?    # standard quoted string, with possible comma
  | ([^,]+),?                       # anything else, with possible comma
  | ,                               # lone comma
}gx;

# add a final empty field if there's a trailing comma
push(@fields, undef) if substr($text,-1,1) eq ',';

Like the first version, it uses a scalar-context m/.../g with a while loop to iterate over the string. We want to stay in synch, so we make sure that at least one of the alternatives matches at any location a match could be started. We allow three types of fields, which is reflected in the three alternatives of the main match.

Because Perl5 allows you to choose exactly which parentheses are capturing and which aren't, we can ensure that after any match, $+ holds the desired text of the field. For empty fields where the third alternative matches and no capturing parentheses are used, $+ is guaranteed to be undefined, which is exactly what we want. (Remember undef is different from an empty string -- returning these different values for empty and "" fields retains the most information.)

The final push covers cases in which the string ends with a comma, signifying a trailing empty field. You'll note that I don't use m/,$/ as I did earlier. I did so earlier because I was using it as an example to show regular expressions, but there's really no need to use a regex when a simpler, faster method exists.
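
Here is the same code wrapped in a subroutine so it can be tried on a sample line. The name parse_csv_line and the sample data are my own additions for this sketch; the regex and the trailing-comma check are as shown above.

```perl
use strict;
use warnings;

# parse_csv_line is a name made up for this sketch.
sub parse_csv_line {
    my ($text) = @_;
    my @fields = ();
    push(@fields, $+) while $text =~ m{
        "([^"\\]*(?:\\.[^"\\]*)*)",?    # standard quoted string, with possible comma
      | ([^,]+),?                       # anything else, with possible comma
      | ,                               # lone comma
    }gx;

    # add a final empty field if there's a trailing comma
    push(@fields, undef) if substr($text, -1, 1) eq ',';
    return @fields;
}

my @fields = parse_csv_line('"Hello, World",42,,last');
for my $i (0 .. $#fields) {
    printf "%d: %s\n", $i, defined $fields[$i] ? "[$fields[$i]]" : 'undef';
}
```

The sample yields four fields, the third of which is undefined because only the lone-comma alternative matched there.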

Along with the CSV question, many other common tasks come up time and again in the Perl newsgroups, so I'd like to finish out this chapter by looking at a few of them.

Stripping Leading and Trailing Whitespace

By far the best all-around solution is the simple and obvious:
s/^\s+//;
s/\s+$//;

For some reason, it seems to be The Thing to try to find a way to do it all in one shot, so I'll offer a few methods. I don't recommend them, but it's educational to understand why they work, and why they're not desirable.

s/\s*(.*?)\s*$/$1/
Commonly given as a great example of the non-greediness that is new in Perl5, but it's not a great example because it's so much slower (by about three times in my tests) than most of the other solutions. The reason is that with each character, before allowing dot to match, the *? must try to see whether what follows can match. That's a lot of backtracking, particularly since it's the kind that goes in and out of the parentheses (=>151).

s/^\s*(.*\S)?\s*$/$1/
Much more straightforward. The leading ^\s* takes care of leading whitespace before the parentheses start capturing. Then the .* matches to the end of the line, with the \S causing backtracking past trailing whitespace to the final non-whitespace. If there's nothing but whitespace in the first place, the (.*\S)? fails (which is fine), and the final \s* zips to the end.

$_ = $1 if m/^\s*(.*\S)?/
More or less the same as the previous method, this time using a match and assignment instead of a substitution. With my tests, it's about 10 percent faster.

s/^\s*|\s*$//g
A commonly thought-up solution that, while not incorrect, has top-level alternation that removes many of the optimizations that might otherwise be possible. The /g modifier allows each alternative to match, but it seems a waste to use /g when we know we intend at most two matches, and each with a different subexpression. Fairly slow.

Their speed often depends on the data being checked. For example, in rare cases when the strings are very, very long, with relatively little whitespace at either end, s/^\s+//; s/\s+$// can take twice the time of $_ = $1 if m/^\s*(.*\S)?/. Still, in my programs, I use s/^\s+//; s/\s+$// because it's almost always fastest, and certainly the easiest to understand.
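
If you'd like to convince yourself that the methods really are interchangeable before timing them, a small check along these lines may help. The sub names and sample strings are invented, and the samples are deliberately single-line, since dot does not match a newline here.

```perl
use strict;
use warnings;

# Both sub names are invented for this comparison.
sub trim_two_step {
    my ($s) = @_;
    $s =~ s/^\s+//;
    $s =~ s/\s+$//;
    return $s;
}

sub trim_one_shot {
    my ($s) = @_;
    no warnings 'uninitialized';    # $1 is undefined when there is nothing but whitespace
    $s =~ s/^\s*(.*\S)?\s*$/$1/;
    return $s;
}

my @samples    = ("  hello world \t\n", "x", "   ", "");
my @mismatches = grep { trim_two_step($_) ne trim_one_shot($_) } @samples;
print "both methods agree on all samples\n" unless @mismatches;
```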

Adding Commas to a Number

People often ask how to print numbers with commas, as in 12,345,678. The FAQ currently gives
1 while s/^(-?\d+)(\d{3})/$1,$2/;
which repeatedly scans to the last (non-comma'd) digit with \d+, backtracks three digits so \d{3} can match, and finally inserts a comma via the replacement text `$1,$2'. Because it works primarily ``from the right'' instead of the normal left, it is useless to apply with /g. Thus, the multiple passes needed to add multiple commas are achieved using a while loop.

You can enhance this solution by using a common optimization from Chapter 5 (=>156), replacing \d{3} with \d\d\d. Why bother making the regex engine count the occurrences when you can just as easily say exactly what you want? This one change saved a whopping three percent in my tests. (A penny saved...)

Another enhancement is to remove the start-of-string anchor. This allows you to comma-ify a number (or numbers) within a larger string. As a byproduct, you can then safely remove the -?, since it exists only to tie the first digit to the anchor. This change could be dangerous if you don't know the target data, since 3.14159265 becomes 3.14,159,265. In any case, if you know the number is the string by itself, the anchored version is better.

A completely different, but almost-the-same approach I've come up with uses a single /g-governed substitution:

s<
   (\d{1,3})       # before a comma: one to three digits
   (?=             # followed by, but not part of what's matched...
      (?:\d\d\d)+  #    some number of triplets...
      (?!\d)       #    ...not followed by another digit
   )               #    (in other words, which ends the number)><$1,>gx;
Because of the comments and formatting, it might look more complex than the FAQ solution, but in reality it's not so bad, and is a full third faster. However, because it's not anchored to the start of the string, it faces the same problem with 3.14159265. To take care of that, and to bring it in line with the FAQ solution for all strings, change the (\d{1,3}) to \G((?:^-)?\d{1,3}). The \G anchors the overall match to the start of the string, and anchors each subsequent /g-induced match to the previous one. The (?:^-)? allows a leading minus sign at the start of the string, just as the FAQ solution does. With these changes, it slows down a tad, but my tests show it's still over 30 percent faster than the FAQ solution.
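
The two approaches can be put side by side as in the sketch below. The sub names are invented, and the second sub uses the \G variant just described.

```perl
use strict;
use warnings;

# Sub names are invented for this sketch.
sub commify_faq {
    my ($n) = @_;
    1 while $n =~ s/^(-?\d+)(\d{3})/$1,$2/;
    return $n;
}

sub commify_lookahead {
    my ($n) = @_;
    $n =~ s<
       \G((?:^-)?\d{1,3})  # optional leading minus, then one to three digits
       (?=                 # followed by, but not part of what's matched...
          (?:\d\d\d)+      #    some number of triplets...
          (?!\d)           #    ...not followed by another digit
       )
    ><$1,>gx;
    return $n;
}

for my $n ('12345678', '-1234567', '3.14159265') {
    printf "%-12s %-12s %s\n", $n, commify_faq($n), commify_lookahead($n);
}
```

Both versions turn 12345678 into 12,345,678 and -1234567 into -1,234,567, and both leave 3.14159265 alone.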

Removing C Comments

It's challenging to see how crisply you can strip C comments from text. In Chapter 5, we spent a fair amount of time coming up with the general comment-matching /\*[^*]*\*+([^/*][^*]*\*+)*/, and a Tcl program to remove comments. Let's express it in Perl.

Chapter 5 dealt with generic NFA engines, so our comment-matching regex works fine in Perl. For extra efficiency, I'd use non-capturing parentheses, but that's about the only direct change I'd make. It's not unreasonable to use the FAQ's simpler /\*.*?\*/ -- Chapter 5's solution leads the engine to a match more efficiently, but /\*.*?\*/ is fine for applications that aren't time critical. It's certainly easier to understand at first glance, so I'll use it to simplify the first draft of our comment-stripping regex.


Here it is:

s{
    # First, we'll list things we want to match, but not throw away
    (
        " (?:\\.|[^"\\])* "  # doublequoted string.
      |                      # -or-
        ' (?:\\.|[^'\\])* '  # singlequoted constant
    )
 |  # OR...
    # ...we'll match a comment. Since it's not in the $1 parentheses above,
    # the comments will disappear when we use $1 as the replacement text.
    /\*  .*?  \*/            # Traditional C comments.
    |                        # -or-
    //[^\n]*                 # C++ //-style comments
}{$1}gsx;

After applying the changes discussed during the Tcl treatment, and combining the two comment regexes into one top-level alternative (which is easy since we're writing the regex directly and not building up from separate $COMMENT and $COMMENT1 components), our Perl version becomes:

s{
    # First, we'll list things we want to match, but not throw away
    (
       [^"'/]+                                 # other stuff
      |                                        # -or-
       (?:"[^"\\]*(?:\\.[^"\\]*)*" [^"'/]*)+   # doublequoted string.
      |                                        # -or-
       (?:'[^'\\]*(?:\\.[^'\\]*)*' [^"'/]*)+   # singlequoted constant
    )
 |  # OR...
    # ...we'll match a comment. Since it's not in the $1 parentheses above,
    # the comments will disappear when we use $1 as the replacement text.

    / (?:                             # (all comments start with a slash)
        \*[^*]*\*+(?:[^/*][^*]*\*+)*/ # Traditional C comments.
        |                             # -or-
        /[^\n]*                       # C++ //-style comments
      )
}{$1}gsx;
With the same tests as Chapter 5's Tcl version, times hover around 1.45 seconds (compare to Tcl's 2.3 seconds, the first Perl's version at around 12 seconds, and Tcl's first version at around 36 seconds).

To make a full program out of this, just insert it into:

undef $/;              # Slurp-whole-file mode
$_ = join('', <>); # The join(...) can handle multiple files.
 ... insert the substitute command from above ...
print;

Yup, that's the whole program.
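
To try the substitution without a file on hand, it can be wrapped in a subroutine and fed a small sample, as in this sketch. It uses the simpler first-draft regex; the sub name and the sample C text are invented. Because $1 is undefined whenever a comment matches, the sketch turns off that one warning locally.

```perl
use strict;
use warnings;

# Hypothetical helper name; the regex is the first-draft version.
sub strip_c_comments {
    my ($code) = @_;
    no warnings 'uninitialized';     # the capture is undefined when a comment matched
    $code =~ s{
        (                            # things to keep:
            " (?:\\.|[^"\\])* "      #   doublequoted string
          |
            ' (?:\\.|[^'\\])* '      #   singlequoted constant
        )
      |                              # things to drop:
        /\* .*? \*/                  #   traditional C comment
      |
        //[^\n]*                     #   C++ //-style comment
    }{$1}gsx;
    return $code;
}

my $src = <<'END_C';
int x = 1; /* set x */
char *s = "/* not a comment */"; // trailing
END_C

my $out = strip_c_comments($src);
print $out;
```

The real comments disappear, while the comment-like text inside the doublequoted string survives, since the string alternative gets the first chance to match it.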

Matching an Email Address

I'd like to finish with a lengthy example that brings to bear many of the regex techniques seen in these last few chapters, as well as some extremely valuable lessons about building up a complex regular expression using variables. Verifying correct syntax of an Internet email address is a common need, but unfortunately, because of the standard's complexity,48 it is quite difficult to do simply. In fact, it is impossible with a regular expression because address comments may be nested. (Yes, email addresses can have comments: comments are anything between parentheses.) If you're willing to compromise, such as allowing only one level of nesting in comments (suitable for any address I've ever seen), you can take a stab at it. Let's try.

48  Internet RFC 822. Available at: ftp://ftp.rfc-editor.org/in-notes/rfc822.txt

Still, it's not for the faint at heart. In fact, the regex we'll come up with is 4,724 bytes long! At first thought, you might think something as simple as \w+\@[.\w]+ could work, but it is much more complex. Something like

Jeffy <"That Tall Guy"@ora.com (this address no longer active)>
is perfectly valid as far as the specification is concerned.49 So, what constitutes a lexically valid address? Table 7-11 on page 295 lists a lexical specification for an Internet email address in a hybrid BNF/regex notation that should be mostly self-explanatory. In addition, comments (item 22) and whitespace (spaces and tabs) are allowed between most items. Our task, which we choose to accept, is to convert it to a regex as best we can. It will require every ounce of technique we can muster, but it is possible.50

49  It is certainly not valid in the sense that mail sent there will bounce for want of an active username, but that's an entirely different issue.
50  The program we develop in this section is available on my home page -- see Appendix A.

Table 7-11: Somewhat Formal Description of an Internet Email Address
Item  Name            Description
 1    mailbox         addr-spec | phrase route-addr
 2    addr-spec       local-part @ domain
 3    phrase          ( word )+
 4    route-addr      < ( route )? addr-spec >
 5    local-part      word ( . word )*
 6    domain          sub-domain ( . sub-domain )*
 7    word            atom | quoted-string
 8    route           @ domain ( , @ domain )* :
 9    sub-domain      domain-ref | domain-literal
10    atom            ( any char except specials, space and ctls )+
11    quoted-string   " ( qtext | quoted-pair )* "
12    domain-ref      atom
13    domain-literal  [ ( dtext | quoted-pair )* ]
14    char            any ASCII character (000-177 octal)
15    ctl             any ASCII control (000-037 octal)
16    space           ASCII space (040 octal)
17    CR              ASCII carriage return (015 octal)
18    specials        any of the characters: ()<>@,;:\".[]
19    qtext           any char except ", \ and CR
20    dtext           any char except [, ], \ and CR
21    quoted-pair     \ char
22    comment         ( ( ctext | quoted-pair | comment )* )
23    ctext           any char except (, ), \ and CR

Levels of interpretation

When building a regex using variables, you must take extra care to understand the quoting, interpolating, and escaping that goes on. With ^\w+\@[.\w]+$ as an example, you might naïvely render that as
$username = "\w+";
$hostname = "\w+(\.\w+)+";
$email    = "^$username\@$hostname$";
   :
... m/$email/o ...
but it's not so easy. While evaluating the doublequoted strings in assigning to the variables, the backslashes are interpolated and discarded: the final $email sees `^w+@w+(.w+)+$' with Perl4, and can't even compile with Perl5 because of the trailing dollar sign. Either the escapes need to be escaped so they'll be preserved through to the regex, or a singlequoted string must be used. Singlequoted strings are not applicable in all situations, such as in the third line where we really do need the variable interpolation provided by a doublequoted string:
$username = '\w+';
$hostname = '\w+(\.\w+)+';
$email    = "^$username\@$hostname\$";

Let's start building the real regex by looking at item 16 in Table 7-11. The simple $space = "·" isn't good because if we use the /x modifier when we apply the regex (something we plan to do), spaces outside of character classes, such as this one, will disappear. We can also represent a space in the regex with \040 (octal 40 is the ASCII code for the space character), so we might be tempted to assign "\040" to $space. This would be a silent mistake because, when the doublequoted string is evaluated, \040 is turned into a space. This is what the regex will see, so we're right back where we started. We want the regex to see \040 and turn it into a space itself, so again, we must use "\\040" or '\040'.

Getting a match for a literal backslash into the regex is particularly hairy because it's also the regex escape metacharacter. The regex requires \\ to match a single literal backslash. To assign it to, say, $esc, we'd like to use '\\', but because \\ is special even within singlequoted strings,51 we need $esc = '\\\\' just to have the final regex match a single backslash. This backslashitis is why I make $esc once and then use it wherever I need a literal backslash in the regex. We'll use it a few times as we construct our address regex. Here are the preparatory variables I'll use this way:

# Some things for avoiding backslashitis later on.
$esc        = '\\\\';               $Period      = '\.';
$space      = '\040';               $tab         = '\t';
$OpenBR     = '\[';                 $CloseBR     = '\]';
$OpenParen  = '\(';                 $CloseParen  = '\)';
$NonASCII   = '\x80-\xff';          $ctrl        = '\000-\037';
$CRlist     = '\n\015';  # note: this should really be only \015.

51  Within Perl singlequoted strings, \\ and the escaped closing delimiter (usually \') are special. Other escapes are passed through untouched, which is why \040 results in \040.
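
The levels of interpretation are easy to check for yourself. The following sketch (the variable names other than $esc are my own) compares what each kind of quoting leaves in the string, and then what an /x-governed regex makes of it.

```perl
use strict;
use warnings;

my $dq_space = "\040";     # doublequoted: Perl turns \040 into a real space
my $sq_space = '\040';     # singlequoted: the four characters \ 0 4 0 survive
my $esc      = '\\\\';     # singlequoted: two backslashes reach the regex

my @lengths = (length $dq_space, length $sq_space, length $esc);
print "lengths: @lengths\n";                                   # lengths: 1 4 2

# Under /x the raw space vanishes, while \040 is still seen by the regex.
my $raw_matches = ("a b" =~ m/a${dq_space}b/x) ? 1 : 0;        # 0
my $oct_matches = ("a b" =~ m/a${sq_space}b/x) ? 1 : 0;        # 1
my $esc_matches = ('a\b' =~ m/a${esc}b/x)      ? 1 : 0;        # 1
print "raw=$raw_matches octal=$oct_matches backslash=$esc_matches\n";
```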

The $CRlist requires special mention. The specification indicates only the ASCII carriage return (octal 015). From a practical point of view, this regex is likely to be applied to text that has already been converted to the system-native newline format where \n represents the carriage return. This may or may not be the same as an ASCII carriage return. (It usually is, for example, on MacOS, but not with Unix; =>72.) So I (perhaps arbitrarily) decided to consider both.

Filling in the basic types

Working mostly from Table 7-11, bottom up, here are a few character classes we'll be using, representing items 19, 20, 21, and a start on 10:
# Items 19, 20, 21
$qtext = qq/[^$esc$NonASCII$CRlist"]/;               # for within "..."
$dtext = qq/[^$esc$NonASCII$CRlist$OpenBR$CloseBR]/; # for within [...]
$quoted_pair = qq< $esc [^$NonASCII] >; # an escaped character

# Item 10: atom
$atom_char = qq/[^($space)<>\@,;:".$esc$OpenBR$CloseBR$ctrl$NonASCII]/;
$atom = qq< $atom_char+    # some number of atom characters...
            (?!$atom_char) # ..not followed by something that could be part of an atom
>;
That last item, $atom, might need some explanation. By itself, $atom need be only $atom_char+, but look ahead to Table 7-11's item 3, phrase. The combination yields ($atom_char+)+, a lovely example of one of those neverending-match patterns (=>144). Building a regex in a variable is prone to this kind of hidden danger because you can't normally see everything at once. This visualization problem is why I used $NonASCII = '\x80-\xff' above. I could have used "\x80-\xff", but I wanted to be able to print the partial regex at any time during testing. In the latter case, the regex holds the raw bytes -- fine for the regex engine, but not for our display if we print the regex while debugging.

Getting back to ($atom_char+)+, to help delimit the inner-loop single atom, I can't use \b because Perl's idea of a word is completely different from an email address atom. For example, `--genki--' is a valid atom that doesn't match \b$atom_char+\b. Thus, to ensure that backtracking doesn't try to claim an atom that ends in the middle of what it should match, I use (?!...) to make sure that $atom_char can't match just after the atom's end. (This is a situation where I'd really like the possessive quantifiers that I pined for in the footnote on page 111.)
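
A quick test of $atom against the `--genki--' example might look like this sketch. The preparatory variables are repeated so that it stands alone; the three result variables and the second sample string are invented for the illustration.

```perl
use strict;
use warnings;

# Preparatory variables, repeated here so the sketch is self-contained.
my $esc      = '\\\\';
my $space    = '\040';
my $OpenBR   = '\[';
my $CloseBR  = '\]';
my $ctrl     = '\000-\037';
my $NonASCII = '\x80-\xff';

my $atom_char = qq/[^($space)<>\@,;:".$esc$OpenBR$CloseBR$ctrl$NonASCII]/;
my $atom = qq< $atom_char+       # some number of atom characters...
               (?!$atom_char)    # ..not followed by something that could be part of an atom
>;

my $whole_atom  = ('--genki--' =~ m/^ $atom $/x)             ? 1 : 0;   # 1
my $word_bounds = ('--genki--' =~ m/^ \b $atom_char+ \b $/x) ? 1 : 0;   # 0
my $two_atoms   = ('two atoms' =~ m/^ $atom $/x)             ? 1 : 0;   # 0
print "atom=$whole_atom with-\\b=$word_bounds two-atoms=$two_atoms\n";
```

The \b version fails because there is no word boundary between the start of the string and a hyphen, while the lookahead version accepts the whole atom and still refuses to treat two space-separated atoms as one.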

Even though these are doublequoted strings and not regular expressions, I use free spacing and comments (except within the character classes) because these strings will eventually be used with an /x-governed regex. But I do take particular care to ensure that each comment ends with a newline, as I don't want to run into the overzealous comment problem (=>223).

Address comments

Comments with this specification are difficult to match because they allow nested parentheses, something impossible to do with a single regular expression. You can write a regular expression to handle up to a certain number of nested constructs, but not to an arbitrary level. For this example, I've chosen to implement $comment to allow for one level of internal nesting:
# Items 22 and 23, comment.
# Impossible to do properly with a regex, I make do by allowing at most one level of nesting.
$ctext   = qq< [^$esc$NonASCII$CRlist()] >;
$Cnested = qq< $OpenParen (?: $ctext | $quoted_pair )* $CloseParen >;
$comment = qq< $OpenParen
                     (?: $ctext | $quoted_pair | $Cnested )*
               $CloseParen >;

$sep = qq< (?: [$space$tab] | $comment )+ >;  # required separator
$X   = qq< (?: [$space$tab] | $comment )* >;  # optional separator


You'll not find comment, item 22, elsewhere in the table. What the table doesn't show is that the specification allows comments, spaces, and tabs to appear freely between most tokens. Thus, we create $X for optional spaces and comments, $sep for required ones.
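
The one-level limit of $comment is easy to see with a few test strings, as in the following sketch. The variables are repeated from above so that it runs on its own, and the sample comments are invented.

```perl
use strict;
use warnings;

my $esc        = '\\\\';
my $NonASCII   = '\x80-\xff';
my $CRlist     = '\n\015';
my $OpenParen  = '\(';
my $CloseParen = '\)';

my $quoted_pair = qq< $esc [^$NonASCII] >;
my $ctext   = qq< [^$esc$NonASCII$CRlist()] >;
my $Cnested = qq< $OpenParen (?: $ctext | $quoted_pair )* $CloseParen >;
my $comment = qq< $OpenParen
                     (?: $ctext | $quoted_pair | $Cnested )*
                  $CloseParen >;

my @tests = (
    '(this address no longer active)',   # no nesting
    '(outer (inner) outer)',             # one level of nesting
    '(a (b (c)) d)',                     # two levels: more than $comment allows
);
my @results = map { m/^ $comment $/x ? 1 : 0 } @tests;
print "@results\n";                      # 1 1 0
```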

The straightforward bulk of the task

Most items in Table 7-11 are relatively straightforward to implement. One trick is to make sure to use $X where required, but for efficiency's sake, no more often than necessary. The method I use is to provide $X only between elements within a single subexpression. Most of the remaining items are shown below.


# Item 11: doublequoted string, with escaped items allowed
$quoted_str = qq<
        " (?:                # opening quote...
              $qtext         #   Anything except backslash and quote
              |              #    or
              $quoted_pair   #   Escaped something (something != CR)
          )* "               # closing quote
>;

# Item 7: word is an atom or quoted string
$word = qq< (?: $atom | $quoted_str ) >;

# Item 12: domain-ref is just an atom
$domain_ref = $atom;

# Item 13: domain-literal is like a quoted string, but [...] instead of "..."
$domain_lit = qq< $OpenBR                        # [
                  (?: $dtext | $quoted_pair )*   #    stuff
                  $CloseBR                       #           ]
>;

# Item 9: sub-domain is a domain-ref or domain-literal
$sub_domain = qq< (?: $domain_ref | $domain_lit ) >;

# Item 6: domain is a list of subdomains separated by dots.
$domain = qq< $sub_domain          # initial subdomain
              (?:                  #
                 $X $Period        # if led by a period...
                 $X $sub_domain    #   ...further okay
              )*
>;

# Item 8: a route. A bunch of "@ $domain" separated by commas, followed by a colon
$route = qq< \@ $X $domain
             (?: $X , $X \@ $X $domain )*  # further okay, if led by comma
             :                             # closing colon
>;

# Item 5: local-part is a bunch of $word separated by periods
$local_part = qq< $word # initial word
        (?: $X $Period $X $word )* # further okay, if led by a period
>;

# Item 2: addr-spec is local@domain
$addr_spec = qq< $local_part $X \@ $X $domain >;

# Item 4: route-addr is <route? addr-spec>
$route_addr = qq[ < $X                 # leading <
                    (?: $route $X )?   #      optional route
                    $addr_spec         #      address spec
                    $X >               # trailing >
];

Item 3 -- phrase

phrase poses some difficulty. According to Table 7-11, it is one or more word, but we can't use (?:$word)+ because we need to allow $sep between items. We can't use (?:$word|$sep)+, as that doesn't require a $word, but merely allows one. So, we might be tempted to try $word(?:$word|$sep)*, and this is where we really need to keep our wits about us. Recall how we constructed $sep. The non-comment part is effectively [$space$tab]+, and wrapping this in the new (...)* smacks of a neverending match (=>166). The $atom within $word would also be suspect except for the (?!...) we took care to tack on to checkpoint the match. We could try the same with $sep, but I've a better idea.

Four things are allowed in a phrase: quoted strings, atoms, spaces, and comments. Atoms are just sequences of $atom_char -- if these sequences are broken by spaces, it means only that there are multiple atoms in the sequence. We don't need to identify individual atoms, but only the extent of the entire sequence, so we can just use something like:

$word (?: [$atom_char$space$tab] | $quoted_str | $comment )+

We can't actually use that character class because $atom_char is already a class itself, so we need to construct a new one from scratch, mimicking $atom_char, but removing the space and tab (removing from the list of a negated class includes them in what the class can match):

# Item 3: phrase
$phrase_ctrl = '\000-\010\012-\037'; # like ctrl, but without tab

# Like atom-char, but without listing space, and uses phrase_ctrl.
# Since the class is negated, this matches the same as atom-char plus space and tab
$phrase_char = qq/[^()<>\@,;:".$esc$OpenBR$CloseBR$NonASCII$phrase_ctrl]/;
$phrase = qq< $word                  # one word, optionally followed by....
              (?:
                  $phrase_char |     # atom and space parts, or...
                  $comment     |     # comments, or...
                  $quoted_str        # quoted strings
              )*
>;
Unlike all the other constructs so far, this one matches trailing whitespace and comments. That's not bad, but for efficiency's sake, we remember we don't need to insert $X after any use of $phrase.

Wrapping up with mailbox

Finally, we need to address item 1, the simple:

# Item #1: mailbox is an addr_spec or a phrase/route_addr
$mailbox = qq< $X                         # optional leading comment
               (?:
                   $addr_spec             # address
                   |                      # or
                   $phrase $route_addr    # name and address
               ) $X                       # optional trailing comment
>;

Whew, done!


Well, we can now use this like:

die "invalid address [$addr]\n" if $addr !~ m/^$mailbox$/xo;

(With a regex like this, I strongly suggest not forgetting the /o modifier.)52

52  From the ``don't try this at home, kids'' department: During initial testing, I was stumped to find that the optimized version (presented momentarily) was consistently slower than the normal version. I was really dumbfounded until I realized that I'd forgotten /o! This caused the entire huge regex operand to be reprocessed for each match (=>268). The optimized expression turned out to be considerably longer, so the extra processing time completely overshadowed any regex efficiency benefits. Using /o not only revealed that the optimized version was faster, but caused the whole test to finish an order of magnitude quicker.

It might be interesting to look at the final regex, the contents of $mailbox. After removing comments and spaces and breaking it into lines for printing, here are the first few out of 60 or so lines:

(?:[\040\t]|\((?:[^\\\x80-\xff\n\015()]|\\[^\x80-\xff]|\((?:[^\\\x80-\xff\n\015(
)]|\\[^\x80-\xff])*\))*\))*(?:(?:[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+(?![^
(\040)<>@,;:".\\\[\]\000-\037\x80-\xff])|"(?:[^\\\x80-\xff\n\015"]|\\[^\x80-\xff
])*")(?:(?:[\040\t]|\((?:[^\\\x80-\xff\n\015()]|\\[^\x80-\xff]|\((?:[^\\\x80-\xf
f\n\015()]|\\[^\x80-\xff])*\))*\))*\.(?:[\040\t]|\((?:[^\\\x80-\xff\n\015()]|\\[
^\x80-\xff]|\((?:[^\\\x80-\xff\n\015()]|\\[^\x80-\xff])*\))*\))*(?:[^(\040)<>@,;
:".\\\[\]\000-\037\x80-\xff]+(?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff])|"(?:[

Wow. At first you might think that such a gigantic regex could not possibly be efficient, but the size of a regex has little to do with its efficiency. More at stake is how much backtracking it has to do. Are there places with lots of alternation? Neverending-match patterns and the like? Like an efficient set of ushers at the local 20-screen theater complex, a huge regex can still guide the engine to a fast match, or to a fast failure, as the case may be.

Recognizing shortcomings

To actually use a regex like this, you need to know its limitations. For example, it recognizes only Internet email addresses, not local addresses. While logged in to my machine, jfriedl by itself is a perfectly valid email address, but is not an Internet email address. (This is not a problem with the regex, but with its use.) Also, an address might be lexically valid but might not actually point anywhere, as with the earlier That Tall Guy example. A step toward eliminating some of these is to require a domain to end in a two- or three-character subdomain (such as .com or .jp). This could be as simple as appending $esc . $atom_char {2,3} to $domain, or more strictly with something like:
$esc . (?: com | edu | gov | ... | ca | de | jp | u[sk] ... )
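
The looser of these two checks is easy to try in isolation. The following is only a sketch under stated assumptions: $atom_char here is a simplified stand-in (ASCII letters, digits, and hyphen) rather than the definition built earlier in the chapter, and ends_in_short_subdomain is a hypothetical helper name invented for the illustration.

```perl
use strict;
use warnings;

# ASSUMPTION: a simplified stand-in for the chapter's $atom_char.
my $atom_char = '[A-Za-z0-9-]';

# Hypothetical helper: true if the domain ends in a period followed by
# two or three atom characters (such as .com or .jp).
sub ends_in_short_subdomain {
    my ($domain) = @_;
    return $domain =~ m/ \. (?:$atom_char){2,3} \z /x ? 1 : 0;
}

for my $d ("ora.com", "example.jp", "localhost", "example.info") {
    my $verdict = ends_in_short_subdomain($d) ? "accepted" : "rejected";
    print "$d: $verdict\n";
}
```

Note that a check like this rejects any longer final subdomain as well, which is the price of the simpler test.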

When it comes down to it, there is absolutely no way to ensure a particular address actually reaches someone. Period. Sending a test message is a good indicator if someone happens to reply. Including a Return-Receipt-To header in the message is also useful, as it has the remote system generate a short response to the effect that your message has arrived at the target mailbox.

Optimizations -- unrolling loops

As we built our regex, I hope that you recognized ample room for optimizations. Remembering the lessons from Chapter 5, our friend the quoted string is easily unrolled to
$quoted_str = qq< "                                # opening quote
                    $qtext *                       #   leading normal
                    (?: $quoted_pair $qtext * )*   #   ( special normal* )*
                  "                                # closing quote
>;
while $phrase might become:
$phrase = qq< $word                                # leading word
              $phrase_char *                       # "normal" atoms and/or spaces
              (?:
                 (?: $comment | $quoted_str )      # "special" comment or quoted string
                 $phrase_char *                    #  more "normal"
              )*
>;
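
To see the unrolled quoted string at work outside the full address regex, here is a minimal sketch. The definitions of $qtext and $quoted_pair are simplified stand-ins that I am assuming for the illustration (they are not the chapter's definitions), and is_quoted_string is a hypothetical helper.

```perl
use strict;
use warnings;

# ASSUMPTION: simplified stand-ins for the chapter's $qtext and $quoted_pair.
my $qtext       = '[^\\\\"\n]';  # "normal": not a backslash, quote, or newline
my $quoted_pair = '\\\\.';       # "special": a backslash plus the escaped character

# normal* ( special normal* )* between the quotes, as in the unrolled form.
my $quoted_str = qq< " $qtext* (?: $quoted_pair $qtext* )* " >;

# Hypothetical helper: true if the whole string is exactly one quoted string.
sub is_quoted_string {
    my ($text) = @_;
    return $text =~ m/\A $quoted_str \z/x ? 1 : 0;
}

for my $candidate ('"plain"', '"with \\" escaped"', '"unterminated', '"stray " quote"') {
    my $verdict = is_quoted_string($candidate) ? "matches" : "fails";
    print "$candidate $verdict\n";
}
```

Because normal and special can never match at the same position (one excludes the backslash, the other requires it), each character of the string can be consumed in only one way.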

Items such as $Cnested, $comment, $phrase, $domain_lit, and $X can be optimized similarly, but be careful -- some can be tricky. For example, consider $sep from the section on comments. It requires at least one match, but using the normal unrolling-the-loop technique creates a regex that doesn't require a match.

Talking in terms of the general unrolling-the-loop pattern (=>164), if you wish to require special, you can change the outer (...)* to (...)+, but that's not what $sep needs. It needs to require something, but that something can be either special or normal.

It's easy to create an unrolled expression that requires one or the other in particular, but to require either we need to take a dual-pronged approach:

$sep = qq< (?:
              [$space$tab]+                     # for when space is first
              (?: $comment [$space$tab]* )*
            |
              (?: $comment [$space$tab]* )+     # for when comment is first
            )
>;
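
The dual-pronged form can be checked on its own with a small sketch. I am assuming simplified stand-ins here: $space and $tab are plain escapes, and $comment does not allow nesting as the chapter's version does. The helper is_sep is hypothetical.

```perl
use strict;
use warnings;

# ASSUMPTION: simplified stand-ins; this $comment does not nest.
my $space   = '\040';
my $tab     = '\t';
my $comment = '\([^()]*\)';

my $sep = qq< (?:
                [$space$tab]+
                (?: $comment [$space$tab]* )*
              |
                (?: $comment [$space$tab]* )+
              )
>;

# Hypothetical helper: true if the whole string is one separator.
sub is_sep {
    my ($text) = @_;
    return $text =~ m/\A $sep \z/x ? 1 : 0;
}

for my $candidate (" ", "(note)", " (a)\t(b) ", "", "x") {
    my $verdict = is_sep($candidate) ? "separator" : "not a separator";
    print "[$candidate] $verdict\n";
}
```

The empty string fails both alternatives, which is the "requires at least one match" property that the plain unrolled form loses.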

This contains two modified versions of the normal* (special normal*)* pattern, where the class to match spaces is normal, and $comment is special. The first requires spaces, then allows comments and spaces. The second requires a comment, then allows spaces. For this last alternative, you might be tempted to consider $comment to be normal and come up with:
  $comment (?: [$space$tab]+ $comment )*

This might look reasonable at first, but that plus is the quintessence of a neverending match. Removing the plus fixes this, but the resulting regex loops on each space -- not a pillar of efficiency.

As it turns out, though, none of this is needed, since $sep isn't used in the final regex; it appeared only in the early attempt of $phrase. I kept it alive this long because this is a common variation on the unrolling-the-loop pattern, and the discussion of its touchy optimization needs is valuable.

Optimizations -- flowing spaces

Another kind of optimization centers on the use of $X. Examine how the $route part of our regex matches `@·gateway·:'. You'll find times where an optional part fails, but only after one or more internal $X match. Recall our definitions for $domain and $route:
# Item 6: domain is a list of subdomains separated by dots.
$domain = qq< $sub_domain                 # initial subdomain
              (?:                         #
                 $X $Period               # if led by a period...
                 $X $sub_domain           #   ...further okay
              )*
>;

# Item 8: a route. A bunch of "@ $domain" separated by commas, followed by a colon
$route = qq< \@ $X $domain
             (?: $X , $X \@ $X $domain )* # further okay, if led by comma
             :                            # closing colon
>;
After the $route matches the initial `@' of `@·gateway·:' and the first $sub_domain of $domain matches `gateway', the regex checks for a period and another $sub_domain (allowing $X at each juncture). In making the first attempt at this $X $Period $X $sub_domain subexpression, the initial $X matches the space after `gateway', but the subexpression fails trying to match a period. This causes backtracking out of the enclosing parentheses, and the instance of $domain finishes.

Back in $route, after the first $domain is finished, it then tries to match another if separated by a comma. Inside the $X , $X \@... subexpression, the initial $X matches the same space that had been matched (and unmatched) earlier. It, too, fails just after.

It seems wasteful to spend the time matching $X when the subexpression ends up failing. Since $X can match almost anywhere, it's more efficient to have it match only when we know the associated subexpression can no longer fail.


Consider the following, whose only changes are the placement of $X:

$domain = qq<
     $sub_domain $X
     (?:
        $Period $X $sub_domain $X
     )*
>;

$route = qq<
    \@ $X $domain
    (?: , $X \@ $X $domain )*
    : $X
>;
Here, we've changed the guideline from ``use $X only between elements within a subexpression'' to ``ensure a subexpression consumes any trailing $X.'' This kind of change has a ripple effect on where $X appears in many of the expressions.
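
A quick way to convince yourself that moving $X changes only where the spaces are consumed, not what is matched, is to run both layouts over the same text. This sketch makes several assumptions of my own: $X is reduced to optional spaces and tabs (no comments), $sub_domain is a simple class, the old-layout $route is given a $X before its colon so that each version can be anchored by itself, and matches_both is a hypothetical helper. It does not measure speed.

```perl
use strict;
use warnings;

# ASSUMPTION: simplified stand-ins for the chapter's building blocks.
my $X          = '[\040\t]*';
my $Period     = '\.';
my $sub_domain = '[A-Za-z0-9-]+';

# Old guideline: $X only between elements within a subexpression.
my $domain_old = qq< $sub_domain (?: $X $Period $X $sub_domain )* >;
my $route_old  = qq< \@ $X $domain_old (?: $X , $X \@ $X $domain_old )* $X : >;

# New guideline: each subexpression consumes any trailing $X.
my $domain_new = qq< $sub_domain $X (?: $Period $X $sub_domain $X )* >;
my $route_new  = qq< \@ $X $domain_new (?: , $X \@ $X $domain_new )* : $X >;

# Hypothetical helper: returns (old result, new result) as 1 or 0.
sub matches_both {
    my ($text) = @_;
    my $old = $text =~ m/\A $route_old \z/x ? 1 : 0;
    my $new = $text =~ m/\A $route_new \z/x ? 1 : 0;
    return ($old, $new);
}

for my $text ('@ gateway :', '@a.b , @ c:', '@gateway') {
    my ($old, $new) = matches_both($text);
    print "[$text] old=$old new=$new\n";
}
```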

After applying all these changes, the resulting expression is almost 50 percent longer (these lengths are after comments and free spacing are removed), but executed 9-19 percent faster with my benchmarks (the 9 percent being for tests that primarily failed, 19 percent for tests that primarily matched). Again, using /o is very important. The final version of this regex is in Appendix B.

Building an expression through variables -- summary

This has been a long example, but it illustrates some important points. Building a complex regex using variables is a valuable technique, but must be used with skill and caution. Some points to keep in mind include:

Final Comments

I'm sure it's obvious that I'm quite enamored with Perl's regular expressions, and as I noted at the start of the chapter, it's with good reason. Larry Wall, Perl's creator, apparently let himself be ruled by common sense and the Mother of Invention. Yes, the implementation has its warts, but I still allow myself to enjoy the delicious richness of Perl's regex language.

However, I'm not a blind fanatic -- Perl lacks some features that I wish it had. The most glaring omission is one offered by other implementations, such as Tcl, Python, and GNU Emacs: the index into the string where the match (and $1, $2, etc.) begins and ends. You can get a copy of text matched by a set of parentheses using the aforementioned variables, but in general, it's impossible to know exactly where in the string that text was taken from. A simple example that shows the painfulness of this feature's omission is in writing a regex tutor. You'd like to show the original string and say ``The first set of parentheses matched right here, the second set matched here, and so on,'' but this is currently impossible with Perl.

Another feature I've found an occasional need for is an array ($1, $2, $3, ...) similar to Emacs' match-data (=>196). I can construct something similar myself using:

$parens[0] = $&;
$parens[1] = $1;
$parens[2] = $2;
$parens[3] = $3;
$parens[4] = $4; 
  :
but it would be nicer if this functionality were built-in.
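
A somewhat less tedious way to build the same array is to lean on the list-context match, which returns the text of $1, $2, and so on. This is only a sketch of one possible workaround; the sample text and regex are made up for the illustration, and it assumes the regex has at least one set of capturing parentheses.

```perl
use strict;
use warnings;

my $text = "Jan 12, 1997";   # sample data, invented for this illustration
my @parens;

# In list context, a successful match returns the text of $1, $2, ...
if (my @groups = $text =~ m/(\w+) (\d+), (\d+)/) {
    @parens = ($&, @groups);   # slot 0 holds the whole match, like $parens[0] = $&
}

print join("|", @parens), "\n";
```

The use of $& here carries the same efficiency cost discussed earlier in the chapter.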

Then there are those possessive quantifiers that I mentioned in the footnote on page 111. They could make many expressions much more efficient.

There are a lot of esoteric features I can (and do) dream of. One that I once went so far as to implement locally was a special notation whereby the regex would reference an associative array during the match, using \1 and such as an index. It made it possible to extend something like (['"]).*?\1 to include <...>, ... and the like.

Another feature I'd love to see is named subexpressions, similar to Python's symbolic group names feature. These would be capturing parentheses that (somehow) associated a variable with them, filling the variable upon a successful match. You could then pick apart a phone number like (inventing some fictitious (?<var>...) notation on the fly):

  (?<$area>\d\d\d)-(?<$exchange>\d\d\d)-(?<$num>\d\d\d\d)
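
Short of such a notation, a list-context match assigned to a list of variables gives much the same effect, at the cost of keeping the variable list in step with the parentheses by hand. A minimal sketch, with a made-up phone number:

```perl
use strict;
use warnings;

my $phone = "800-555-1212";   # sample data, invented for this illustration

# Each set of parentheses fills the variable in the same position of the list.
my ($area, $exchange, $num) = $phone =~ m/\A(\d\d\d)-(\d\d\d)-(\d\d\d\d)\z/;

print "area=$area exchange=$exchange num=$num\n";
```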

Well, I'd better stop before I get carried away. The sum of it all is that I definitely do not think Perl is the ideal regex-wielding language.

But it is very close.


Notes for Perl4

For the most part, regex use flows seamlessly from Perl4 to Perl5. Perhaps the largest backward-compatibility issue is that @ now interpolates within a regex (and a doublequoted string, for that matter). Still, you should be aware of a number of subtle (and not-so-subtle) differences when working with Perl4:

Perl4 Note #1

Page 217   The special variables $&, $1, and so on are not read-only in Perl4 as they are in Perl5. Although it would be useful, modifying them does not magically modify the original string they were copied from. For the most part, they're just normal variables that are dynamically scoped and that get new values with each successful match.

Perl4 Note #2

Page 217   Actually, with Perl4, $` sometimes does refer to the text from the start of the match (as opposed to the start of the string). A bug that's been fixed in newer versions caused $` to be reset each time the regex was compiled. If the regex operand involved variable interpolation, and was part of a scalar-context m/.../g such as the iterator of a while loop, this recompilation (which causes $` to be reset) is done during each iteration.

Perl4 Note #3

Page 218   In Perl4, $+ magically becomes a copy of $& when there are no parentheses in the regex.

Perl4 Note #4

Page 220   Perl4 interpolates $MonthName[...] as an array reference only if @MonthName is known to exist. Perl5 does it regardless.

Perl4 Note #5

Page 222   The escape of an escaped closing delimiter is not removed in Perl4 as it is in Perl5. It matters when the delimiter is a metacharacter. The (rather farfetched) substitution s*2\*2*4* would not work as expected in Perl5.

Perl4 Note #6

Page 247   Perl4 allows you to use whitespace as a match-operand delimiter. Although using newline, for example, was sometimes convenient, for the most part I'd leave this maintenance nightmare for an Obfuscated Perl contest.

Perl4 Note #7

Page 247   Remember, in Perl4, all escapes are passed through. (See note #.)

Perl4 Note #8

Page 247   Perl4 does not support the four m{...} special-case delimiters for the match operator. It does, however, support them for the substitution operator.

Perl4 Note #9

Page 247   Perl4 supports the special ?-match, but only in the ?...? form. The m?...? form is not special.

Perl4 Note #10

Page 247   With Perl4, a call to any reset in the program resets all ?-delimited matches. Perl5's reset affects only those in the current package.

Perl4 Note #11

Page 248   When the Perl4 match operator is given an empty regex operand, it reuses the most recent successfully applied regular expression without regard to scope. In Perl5, the most recently successful regex within the current dynamic scope is reused.

An example should make this clear. Consider:

"5" =~ m/5/;     # install 5 as the default regex
{ # start a new scope...
   "4" =~ m/4/;  # install 4 as the default regex
} # ... end the new scope.
"45" =~ m//;     # use default regex to match 4 or 5, depending on which regex is used
print "this is Perl $&\n";

Perl4 prints `this is Perl 4', while Perl5 prints `this is Perl 5'.

Perl4 Note #12

Page 252   In any version, the list form of m/.../g returns the list of texts matched within parentheses. Perl4, however, does not set $1 and friends in this case. Perl5 does both.

Perl4 Note #13

Page 252   List elements for non-matching parentheses of m/.../g are undefined in Perl5, but are simply empty strings in Perl4. Both are considered a Boolean false, but are otherwise quite different.
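
The Perl5 side of this difference can be seen with a small sketch (the regexes are made up for the illustration, and the Perl4 behavior is not exercised here). A set of parentheses that did not take part in the match yields an undefined element, while one that matched nothingness yields a defined, empty string.

```perl
use strict;
use warnings;

# (a) does not participate in the match; (b) does.
my @parts = "b" =~ m/(a)|(b)/;
my @state = map { defined $_ ? "'$_'" : "undef" } @parts;
print "@state\n";

# (a*) participates but matches an empty string.
my ($empty) = "x" =~ m/(a*)x/;
print defined $empty ? "defined, length " . length($empty) . "\n" : "undef\n";
```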

Perl4 Note #14

Page 253   In Perl5, modifying the target of a scalar-context m/.../g resets the target's pos. In Perl4, the /g position is associated with each regex operand. This means that modifying what you intended to use as the target data has no effect on the /g position (this could be either a feature or a bug depending on how you look at it). In Perl5, however, the /g position is associated with each target string, so it is reset when modified.
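
A minimal sketch of the Perl5 behavior, using made-up data: after a scalar-context m/.../g the target has a pos, and assigning to the target (even the very same text) throws that position away.

```perl
use strict;
use warnings;

my $target = "aXbXc";

my $found = $target =~ m/X/g;          # scalar context: matches the first X
my $pos_after_match = pos($target);    # just past that X

$target = "aXbXc";                     # modify the target (same text, new value)
my $pos_after_assign = pos($target);   # the /g position is gone

print "after match: $pos_after_match\n";
print "after assignment: ", (defined $pos_after_assign ? $pos_after_assign : "undef"), "\n";
```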

Perl4 Note #15

Page 255   Although Perl4's match operator does not allow balanced delimiters such as m[...], Perl4's substitution does. Like Perl5, if the regex operand has balanced delimiters, the replacement operand has its own set. Unlike Perl5, however, whitespace is not allowed between the two (because whitespace is valid as a delimiter in Perl4).

Perl4 Note #16

Page 255   Oddly, in Perl4, s'...'...' provides a singlequotish context to the regular expression operand as you would expect, but not to the replacement operand -- it gets the normal doublequoted-string processing.

Perl4 Note #17

Page 257   In Perl4, the replacement operand is indeed subject to singlequotish interpolation, so instances of \' and \\ have their leading backslash removed before eval ever gets a hold of it. With Perl5, the eval gets everything as is.

Perl4 Note #18

Page 258   Perl5 returns an empty string when no substitutions are done. Perl4 returns the number zero (both of which are false when considered as a Boolean value).
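
For the Perl5 half of this note, a short sketch with made-up data shows the two kinds of return value; the Perl4 result is not exercised here.

```perl
use strict;
use warnings;

my $text = "no digits here";
my $none = ($text =~ s/\d+/N/g);    # nothing to replace

my $other = "a1b22";
my $two   = ($other =~ s/\d+/N/g);  # two replacements

print "no match returns [$none]\n";
print "two matches return [$two], leaving $other\n";
```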

Perl4 Note #19

Page 260   The default chunk-limit that Perl provides in
($filename, $size, $date) = split(...)
does impact the value that @_ gets if the ?...? form of the match operand is used. This is not an issue with Perl5 since it does not support the forced split to @_.

Perl4 Note #20

Page 263   Perl4's split supports a special match operand: if the list-context match operand uses ?...? (but not m?...?), split fills @_ as it does in a scalar-context. Despite current documentation to the contrary, this feature does not exist in Perl5.

Perl4 Note #21

Page 264   In Perl4, if an expression (such as a function call) results in a scalar value equal to a single space, that value is taken as the special-case single space. With Perl5, only a literal single-space string is taken as the special case.

Perl4 Note #22

Page 264   With Perl4, the default operand is m/\s+/, not '·'. The difference affects how leading whitespace is treated.
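
The difference in how leading whitespace is treated shows up when the two operands are given explicitly, as in this Perl5 sketch with a made-up line of text:

```perl
use strict;
use warnings;

my $line = "  alpha beta  ";

my @special = split ' ', $line;     # special case: leading whitespace is skipped
my @regex   = split /\s+/, $line;   # ordinary regex: leading whitespace yields an empty first element

print scalar(@special), " elements with ' '\n";
print scalar(@regex), " elements with /\\s+/\n";
```

In both cases the trailing empty element is dropped, as split does by default.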

Perl4 Note #23

Page 275   With Perl4, it seems that the existence of eval anywhere in the program also triggers the copy for each successful match. Bummer.