Perl has been featured prominently in this book, and with good reason. It is popular, extremely rich with regular expressions, freely and readily obtainable, easily approachable by the beginner, and available for a wide variety of platforms, including Amiga, DOS, MacOS, NT, OS/2, Windows, VMS, and virtually all flavors of Unix.
Some of Perl's programming constructs superficially resemble those of C or other traditional programming languages, but the resemblance stops there. The way you wield Perl to solve a problem -- The Perl Way -- is very different from traditional languages. The overall layout of a Perl program often uses the traditional structured and object-oriented concepts, but data processing relies very heavily on regular expressions. In fact, I believe it is safe to say that regular expressions play a key role in virtually all Perl programs. This includes everything from huge 100,000-line systems, right down to simple one-liners, like
% perl -pi -e 's{([-+]?\d+(\.\d*)?)F\b}{sprintf "%.0fC", ($1-32) * 5/9}eg' *.txt
which goes through *.txt files and replaces Fahrenheit values with
Celsius ones (reminiscent of the first example from Chapter 2).
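The same substitution can be tried on a single string before turning it loose on files. This is only a minimal sketch: the subroutine name fahrenheit_to_celsius and the sample sentence are assumptions made for illustration, not part of the original one-liner.

```perl
use strict;
use warnings;

# Hypothetical helper: apply the one-liner's substitution to one string.
sub fahrenheit_to_celsius {
    my ($line) = @_;
    # /e evaluates the replacement as Perl code; /g repeats for every match.
    $line =~ s{([-+]?\d+(\.\d*)?)F\b}{sprintf "%.0fC", ($1 - 32) * 5/9}eg;
    return $line;
}

print fahrenheit_to_celsius("Water boils at 212F and freezes at 32F.\n");
```

The -p and -i options of the one-liner supply the loop over lines and the in-place file update that this sketch leaves out.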
In This Chapter
In this chapter we'll look at everything regex about Perl -- the details of its regex flavor and the operators that put them to use. This chapter presents the regex-relevant details from the ground up, but I assume that you have at least a basic familiarity with Perl. (If you've read Chapter 2, you're probably already familiar enough to at least start using this chapter.) I'll often use, in passing, concepts that have not yet been examined in detail, and won't dwell much on non-regex aspects of the language. It might be a good idea to keep the Perl manpage handy, or perhaps O'Reilly's Perl 5 Desktop Reference (see Appendix A).
Perhaps more important than your current knowledge of Perl is your desire to understand more. This chapter is not light reading by any measure. Because it's not my aim to teach Perl from scratch, I am afforded a luxury that general books about Perl do not have: I don't have to omit important details in favor of weaving one coherent story that progresses unbroken through the whole chapter. What remains coherent throughout is the drive for a total understanding. Some of the issues are complex, and the details thick; don't be worried if you can't take it all in at once. I recommend first reading through to get the overall picture, and returning in the future to use as a reference as needed.
Ideally, it would be nice if I could cleanly separate the discussion of the regex flavor from the discussion on how to apply them, but with Perl the two are inextricably intertwined. To help guide your way, here's a quick rundown of how this chapter is organized:
Yet, hacker does not live by metacharacters alone. Regular expressions are worthless without a means to apply them, and Perl does not let you down here either. In this respect, Perl certainly lives up to its motto ``There's more than one way to do it.''
| Construct | Meaning |
|---|---|
| . (dot) | any byte except newline (any byte at all with the /s modifier) |
| \| | alternation |
| (...) | normal grouping and capturing |
| (?:...) | pure grouping only* |
| (?=...) | positive lookahead* |
| (?!...) | negative lookahead* |
| * + ? {n} {min,} {min,max} | greedy quantifiers |
| *? +? ?? {n}? {min,}? {min,max}? | non-greedy quantifiers* |
| (?#...) | comment* |
| #... | (with /x mod) comment until newline or end of regex |
| (?mods) | inlined modifiers*; mods from among i, x, m, and s |
| \b* \B | anchors: word/non-word anchors |
| ^ $ | anchors: start/end of string (or start and end of logical line) |
| \A \Z | anchors: start/end of string* |
| \G | anchor: end of previous match* |
| \1, \2, etc. | text previously matched by associated set of capturing parentheses |
| [...] [^...] | normal and inverted character classes |
| \b' \t \n \r \f \a \e \num \xnum \cchar | character shorthands (also valid within a character class) |
| \w \W \s \S \d \D | class shorthands (also valid within a character class) |
| \l \u \L \U \Q* \E | on-the-fly text modification |
Perhaps you never considered
``... =~ m/.../''
to be an operator, but just as addition's + is an operator which
takes two operands and returns a sum, the match is an operator that takes
two operands, a regex operand and a target-string operand, and returns a
value. As discussed in Chapter 5's ``Functions vs. Integrated
Features vs. Objects'' (=>159), the main
difference between a function and an operator is that operators can treat
their operands in magical ways that a function normally
can't.
And believe me, there's lots of magic surrounding Perl's regex operators.
But remember what I said in Chapter 1: There's nothing magic about magic
if you understand what's happening. This chapter is your guide.
| 4 | The waters are muddied in Perl, where functions and procedures can also respond to their context, and can even modify their arguments. I try to draw a sharp distinction with the regex features because I wish to highlight characteristics which can easily lead to misunderstandings. |
There are some not-so-subtle differences between a regular expression and a regular-expression operand. You provide a raw regex operand in your script, Perl cooks it a bit, then gives the result to the regex search engine. The preprocessing (cooking) is similar to, but not exactly the same as, what's done for doublequoted strings. For a superficial understanding, these distinctions are not important -- which is why I'll explain and highlight the differences at every opportunity!
Don't let the shortness of Table 7-2's ``regex-related operators'' section fool you. Each of those three operators really packs a punch with a variety of options and special-case uses.
| Regex-Related Operators |
|---|
| * m/regex/mods |
| * s/regex/subst/mods |
| split(...) |
*operates on $_ unless related via =~ or !~
| Modifier | Modifies how... |
|---|---|
| /x /o | regex is interpreted |
| /s* /m* /i | engine considers target text |
| /g /e | other |
| Related Variables | Meaning |
|---|---|
| $_ | default search target |
| $* | obsolete multi-line mode |
| After-Match Variables | Meaning |
|---|---|
| $1, $2, etc. | captured text |
| $+ | highest filled $1, $2, ... |
| $` $& $' | text before, of, and after match (best to avoid -- see ``Perl Efficiency Issues'') |
| Related Functions |
|---|
| pos study quotemeta lc lcfirst uc ucfirst |
m/regex/, for example, offers a wide
variety of different functionality depending upon where, how, and with
which modifiers it is used. The flexibility is amazing.
| 6 | That they're innumerable doesn't stop this chapter from trying! |
In the Spring 1996 issue of The Perl Journal, Larry Wall wrote:
One of the ideas I keep stressing in the design of Perl is that
things that ARE different should LOOK different.
| 7 | See http://tpj.com/ or staff@tpj.com |
This is a good point, but with the regular expression operators, differences unfortunately aren't always readily apparent. Even skilled hackers can get hung up on the myriad of options and special cases. If you consider yourself an expert, don't tell me you've never wasted way too much time trying to understand why
if (m/.../g) {
:
In the same article, Larry also wrote:
In trying to make programming predictable, computer scientists have
mostly succeeded in making it boring.
This is also true, but it struck me as rather funny since I'd written ``there is a certain art-like appreciation to boring, consistent, dependable interfaces'' for this section's introduction only a week before reading Larry's article! My idea of ``art'' usually involves more engineering than paint, so what do I know? In any case, I highly recommend Larry's entertaining and thought provoking article for some extremely insightful comments on Perl, languages, and yes, art.
Consider a line of text, held in $text, from a CSV (Comma Separated
Values) file, as might be output by dBASE, Excel, and so on. That is, a
file with lines such as:
"earth",1,,"moon",9.374
This line represents five fields. It's reasonable to want this information
as an array, say @fields, such that $fields[0] is `earth', $fields[1] is `1', $fields[2] is undefined, and so forth.
This means not only splitting the data into fields, but removing the quotes
from quoted fields. Your first instinct might be to use split, along
the lines of:
@fields = split(/,/, $text);
This finds places in $text where , matches, and fills @fields with the snippets that those matches delimit (as opposed to
the snippets that the matches match).
Unfortunately, while split is quite useful, it is not the proper
hammer for this nail. It's inadequate because, for example, matching only
the comma as the delimiter leaves those doublequotes we want to remove.
Using "?,"? helps to solve this, but there are still other problems.
For example, a quoted field is certainly allowed to have commas
within it, and we don't want to treat those as field delimiters, but
there's no way to tell split to leave them alone.
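To make the shortcomings concrete, here is a small sketch of both split attempts. The sample line, with a comma inside its first quoted field, is an assumption made up for illustration.

```perl
use strict;
use warnings;

# Assumed sample: the first field contains a comma inside its quotes.
my $text = '"earth, third planet",1,,"moon",9.374';

# Splitting on bare commas keeps the quotes and cuts the quoted field in two.
my @naive = split(/,/, $text);

# Allowing optional quotes around the comma strips some quotes,
# but the comma inside the quoted field still acts as a delimiter.
my @looser = split(/"?,"?/, $text);

print scalar(@naive),  " pieces: ", join(' | ', @naive),  "\n";
print scalar(@looser), " pieces: ", join(' | ', @looser), "\n";
```

Both attempts yield six pieces for a line that holds five fields, and the second still leaves the opening quote on the first piece.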
Perl's full toolbox offers many solutions; here's one I've come up with:
@fields = ();  # initialize @fields to be empty
while ($text =~ m/"([^"\\]*(\\.[^"\\]*)*)",?|([^,]+),?|,/g) {
    push(@fields, defined($1) ? $1 : $3);  # add the just-matched field
}
push(@fields, undef) if $text =~ m/,$/;  # account for an empty last field
# Can now access the data via @fields ...
Even experienced Perl hackers might need more than a second glance to fully grasp this snippet, so let's go over it a bit.
"([^"\\]*(\\.[^"\\]*)*)",?|([^,]+),?|,
Frankly, looking at the expression is not meaningful until you also
consider how it is used. In this case, it is applied via the match
operator, using the /g modifier, as the conditional of a while.
This is discussed in detail in the meat of the chapter, but the crux of it
is that the match operator behaves differently depending how and where it
is used. In this case, the body of the while loop is executed once
each time the regex matches in $text. Within that body are available
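any $&, $1, $2, etc., set by each respective match.
The expression has three alternatives:
"([^"\\]*(\\.[^"\\]*)*)",?
A doublequoted-string regex with ,? appended. The outer parentheses add no meaning to
the regular expression itself, so they are apparently used only to
capture text to $1. This alternative obviously deals with
doublequoted fields of the CSV data.
([^,]+),?
A run of non-comma characters with ,? appended; its parentheses capture the text to $3. This alternative deals with unquoted fields.
,
A lone comma, which stands for an empty field.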
These are easy enough to understand as individual components, except
perhaps the significance of the ,?, which I'll get to in a bit.
However, they do not tell us much individually -- we need to look at
how they are combined and how they work with the rest of the program.
Because we use while and m/.../g to apply the regex
repeatedly, we want the expression to match once for each field of the CSV
line. First, let's consider how it matches just the first time it's applied
to any given line, as if the /g modifier were not used.
The three alternatives represent the three types of fields: quoted, unquoted, and empty. You'll notice that there's nothing in the second alternative to stop it from matching a quoted field. There's no need to disallow it from matching what the first alternative can match since Perl's Traditional NFA's non-greedy alternation guarantees the first alternative will match whatever it should, never leaving a quoted field for the second alternative to match against our wishes.
You'll also notice that by whichever alternative the first field is
matched, the expression always matches through to the field-separating
comma. This has the benefit of leaving the current position of the /g modifier right at the start of the next field. Thus, when the while plus m/.../g combination iterates and the regular
expression is applied repeatedly, we are always sure to begin the match at
the start of a field. This ``keeping in synch'' concept can be quite
important in many
situations where the /g modifier is used. In fact, it is exactly this
reason why I made sure the first two alternatives ended with ,?. (The
question mark is needed because the final field on the line, of course,
will not have a trailing comma.) We'll definitely be seeing this
keeping-in-synch concept again.
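One way to watch the keeping-in-synch behavior is to record pos($text) after each match. This is a sketch only; the array name @positions is an assumption for illustration, and the sample line is the one used earlier in this example.

```perl
use strict;
use warnings;

my $text = '"earth",1,,"moon",9.374';
my @positions;    # hypothetical name: where each match left off

while ($text =~ m/"([^"\\]*(\\.[^"\\]*)*)",?|([^,]+),?|,/g) {
    # pos($text) is the offset at which the next /g match attempt will begin.
    push(@positions, pos($text));
}

print "match ends: @positions\n";
```

Each recorded offset falls just past a field-separating comma (or at the end of the line), so every new attempt starts at the beginning of a field.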
Now that we can match each field, how do we fill @fields with the
data from each match? Let's look at $1 and such. If the field is
quoted, the first alternative matches, capturing the text within the quotes
to $1. However, if the field is unquoted, the first alternative fails
and the second one matches, leaving $1 undefined and capturing the
field's text to $3. Finally, on an empty field, the third alternative
is the one that matches, leaving both $1 and $3 undefined. This
all crystallizes to:
push(@fields, defined($1) ? $1 : $3);
The conditional expression says ``Use $1 if it is defined, or $3
otherwise.'' If neither is defined, the result is $3's undef,
which is just what we want from an empty field. Thus, in all cases, the
value that gets added to @fields is exactly what we desire, and the
combination of the while loop, the /g modifier, and
keeping-in-synch allows us to process all the fields.
Well, almost all the fields. If the last field is empty (as demonstrated by
a line-ending comma), our program accounts for it not with the main regex,
but after the fact with a separate line to add an undef to the list.
In these cases, you might think that we can just let the main regular
expression match the nothingness at the end of the line. This would work
for lines that do end with an empty field, but would tack on a phantom
empty field to lines that don't, since all lines have nothingness at
the end, so to speak.
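Putting the pieces together, the snippet can be wrapped in a subroutine and run against a few lines. This is a minimal sketch; the name parse_csv_line and the "(undef)" display text are assumptions for illustration only.

```perl
use strict;
use warnings;

# Hypothetical wrapper around the loop discussed above.
sub parse_csv_line {
    my ($text) = @_;
    my @fields = ();
    while ($text =~ m/"([^"\\]*(\\.[^"\\]*)*)",?|([^,]+),?|,/g) {
        # $1 is set for quoted fields, $3 for unquoted; both undef when empty.
        push(@fields, defined($1) ? $1 : $3);
    }
    # A line-ending comma means one more (empty) field.
    push(@fields, undef) if $text =~ m/,$/;
    return @fields;
}

my @fields = parse_csv_line('"earth",1,,"moon",9.374');
print join('|', map { defined($_) ? $_ : '(undef)' } @fields), "\n";
```

A line such as 'a,b,' comes back as three fields, the last one undefined, thanks to the separate end-of-line check.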
Your first instinct might be to reach for split, but upon
closer examination, that avenue turns out to be a dead end.
The solution uses $3 to access
the parentheses of the second alternative. A change in the first
alternative that adds or subtracts parentheses must be reflected by
changing all related occurrences of $3 -- occurrences
which could be elsewhere in the code some distance away from the
actual regular expression.
The standard library module Text::ParseWords provides the
quotewords routine, so:
use Text::ParseWords;
:
@fields = quotewords(',', 0, $text);
| 8 | Not to dilute my point, but I should point out that at the time of
this writing, there are bugs in the quotewords routine.
Among them, it strips the final field if it is the number zero,
does not recognize trailing empty lines, and does not allow escaped
items (except escaped quotes) within quoted fields. Also, it
invokes an efficiency penalty on the whole script
(=>277). I have contacted the
author, and these concerns will hopefully be fixed in a future
release. |
Of course, being able to solve a problem by hand is a good skill to have, but when efficiency isn't at an absolute premium, the readability and maintainability of using a standard library function is most appealing. Perl's standard library is extensive, so getting to know it puts a huge variety of both high- and low-level functionality at your fingertips.
Version 5, Perl5 for short, was officially released in October 1994 and represented a major upgrade. Much of the language was redesigned, and many regular expression features were added or modified. One problem Larry Wall faced when creating the new features was the desire for backward compatibility. There was little room for Perl's regular expression language to grow, so he had to fit in the new notations where he could. The results are not always pretty, with many new constructs looking ungainly to novice and expert alike. Ugly yes, but as we will see, extremely powerful.
push(@fields, $+) while $text =~ m{
"([^\"\\]*(?:\\.[^\"\\]*)*)",? # Standard quoted string (with possible comma)
| ([^,]+),? # or up to next comma (with possible comma)
| , # or just a comma.
}gx;
| 9 | Tom Christiansen suggested that I use ``dead flea-bitten camel carcasses'' instead of ``Perl4'', to highlight that anything before version 5 is dead and has been abandoned by most. I was tempted. |
Those comments are actually part of the regular expression... much more readable, don't you think? As we'll see, various other features of Perl5 make this solution more appealing -- we'll re-examine this example again in ``Putting It All Together'' (=>290) once we've gone over enough details to make sense of it all.
I'd like to concentrate on Perl5, but ignoring Perl4 ignores a lingering reality. I'll mention important Perl4 regex-related differences using markings that look like this: . This indicates that the first Perl4 note, related to whatever comment was just made, can be found on page 305. Since Perl4 is so old, I won't feel the need to recap everything in Perl4's manpage -- for the most part, the Perl4 notes are short and to the point, targeting those unfortunate enough to have to maintain code for both versions, such as a site's Perlmaster. Those just starting out with Perl should definitely not be concerned with an old version like Perl4.
Discussions in the newsgroup comp.lang.perl.misc resulted in a number of important changes to the
language and to its regular expressions. For example, one day I was
responding to a post with an extraordinarily long regular expression, so I
``pretty-printed'' it, breaking it across lines and otherwise adding
whitespace to make it more readable. Larry Wall saw the post, thought that
one really should be able to express regexes that way, and right there and
then added Perl5's /x modifier, which causes most whitespace to be
ignored in the associated regular expression.
Around the same time, Larry added the (?#...) construct to allow
comments to be embedded within a regex. A few months later, though, after
discussions in the newsgroup, a raw `#' was also made to start a
comment if the /x modifier were used. This appeared in version
5.002.
There were other bug fixes and changes as well -- if you are using an
early release, you might run into incompatibilities as you follow along in
this book. I recommend version 5.002 or later.
| 10 | Actually, it appeared in some earlier versions, but did not work reliably. |
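Both commenting styles can be seen side by side in a short sketch. The date string and variable names here are assumptions chosen for illustration.

```perl
use strict;
use warnings;

my $text = 'released 1994-10-17';   # assumed sample text

# With /x, most whitespace is ignored and a raw # starts a comment.
my ($year, $month, $day) = $text =~ m/
    (\d{4})   # year
    -
    (\d\d)    # month
    -
    (\d\d)    # day
/x;

# Without /x, (?#...) still allows a comment inside the regex.
my ($y2, $m2) = $text =~ m/(\d{4})(?#year)-(\d\d)(?#month)/;

print "$year $month $day\n";
```

Note that under /x a comment must not contain the regex delimiter, since the delimiter still ends the regex operand.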
As this second printing goes to press, version 5.004 is about to hit beta,
with a final release targeted for Spring 1997. Regex-related changes in the
works include enhanced locale support, modified pos support, and a
variety of updates and optimizations. (For example, Table 7-10 will likely become almost empty.) Once
5.004 is officially released, this book's home page (see Appendix A) will
note the changes.
while loop -- the main match operator in the CSV example
would have behaved quite differently had it been used where Perl was
expecting a list.
$1 and other match-induced
side effects.
| 11 | ``List context'' used to be called ``array context''. The change
recognizes that the underlying data is a list that applies equally well
to an @array, a %hash, and
(an, explicit, list). |
Consider the two assignments:
$s = expression one;
@a = expression two;
Because $s is a simple scalar variable (holds a single value, not a
list), it expects a simple scalar value, so the first expression, whatever
it may be, finds itself in a scalar context. Similarly, because @a is
an array variable and expects a list of values, the second expression finds
itself in a list context. Even though the two expressions might be exactly
the same, they might (depending on the case) return completely different
values, and cause completely different side effects while they're at it.
Sometimes, the type of an expression doesn't exactly match the type of value expected of it, so Perl does one of two things to make the square peg fit into a round hole: 1) it allows the expression to respond to its context, returning a type appropriate to the expectation, or 2) it contorts the value to make it fit.
A simple example of an expression responding to its context is a file-input operator such as <MYDATA>. In a list context, it returns a list of all
(remaining) lines from the file. In a scalar context, it simply
returns the next line.
Many Perl constructs respond to their context, and the regex operators are
no different. The match operator m/.../, for example, sometimes
returns a simple true/false value, and sometimes a list of certain match
results. All the details are found later in this chapter.
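A short sketch shows the same expressions answering differently by context. The sample strings and variable names are assumptions, and the in-memory filehandle stands in for a real file such as MYDATA.

```perl
use strict;
use warnings;

my $text = 'width=80 height=24';    # assumed sample text

my $found       = ($text =~ m/(\w+)=(\d+)/);    # scalar context: true or false
my ($key, $val) = ($text =~ m/(\w+)=(\d+)/);    # list context: the captured texts
my @all         = ($text =~ m/(\w+)=(\d+)/g);   # list context with /g: all captures

# The file-input operator responds to context as well.
my $data = "one\ntwo\nthree\n";
open(my $fh, '<', \$data) or die "cannot open in-memory file: $!";
my $first = <$fh>;    # scalar context: the next line
my @rest  = <$fh>;    # list context: all remaining lines
close($fh);

print "found=$found key=$key val=$val all=@all\n";
print "first line: $first", "lines left: ", scalar(@rest), "\n";
```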
Contorting a scalar to fit a list context is simple: a list holding the single value is made, so @a = 42 is
the same as @a = (42). On the other hand, there's no general rule
for converting a list to a scalar. If a literal list is given, such
as with
$var = ($this, &is, 0xA, 'list');
the last element, 'list', is returned for $var.
If an array is given, as with $var = @array, the length of the
array is returned.
Some words used to describe how other languages deal with this issue are cast, promote, coerce, and convert, but I feel they are a bit too consistent (boring?) to describe Perl's attitude in this respect.
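These contortions are easy to check directly. In this sketch the literal list uses plain strings instead of the variables and subroutine call shown above, and all names are assumptions for illustration.

```perl
use strict;
use warnings;
no warnings 'void';   # the literal list below would otherwise draw "useless use" warnings

my @a = 42;                                # a scalar in list context: same as (42)
my @array  = ('x', 'y', 'z');
my $length = @array;                       # an array in scalar context: its length
my $last   = ('this', 'is', 0xA, 'list');  # a literal list in scalar context: last element

print scalar(@a), " element; length $length; last is $last\n";
```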
Perl has two kinds of variables: global and private. Private variables are declared with my(...). Global variables are not declared, but just pop into
existence when you use them. Global variables are visible from anywhere
within the program, while private variables are visible, lexically, only to
the end of their enclosing block. That is, the only Perl code that can
access the private variable is the code that falls between the my
declaration and the end of the block of code that encloses the my.
| 12 | Perl allows the names of global variables to be partitioned into groups called packages, but the variables are still global. |
The first use listed is less important these days because Perl now offers
truly local variables via the my directive. Using my creates
a new variable completely distinct from and utterly unrelated to any other
variable anywhere else in the program. Only the code that lies between the
my and the end of the enclosing block can have direct access to the variable.
The extremely ill-named function local creates a new dynamic scope.
Let me say up front that
the call to local does not create a new variable.
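Given a global variable, local does three things:
1) it saves an internal copy of the variable's current value;
2) it puts a new value in the variable (either undef, or a value
assigned to the local); and
3) it arranges for the saved value to be restored when execution leaves the block enclosing the local.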
This means that ``local'' refers only to how long any changes to the variable will last. The global variable whose value you've copied is still visible from anywhere within the program -- if you make a subroutine call after creating a new dynamic scope, that subroutine (wherever it might be located in the script) will see any changes you've made. This is just like any normal global variable. The difference here is that when execution of the enclosing block finally ends, the previous value is automatically restored.
An automatic save and restore of a global variable's value -- that's
pretty much all there is to local. For all the misunderstanding that has
accompanied local, it's no more complex than the snippet on the right
of Table 7-3 illustrates.
| Normal Perl | Equivalent Meaning |
|---|---|
| { | { |
| local($SomeVar); #save copy | my $TempCopy = $SomeVar; |
| $SomeVar = 'My Value'; | $SomeVar = undef; |
| : | $SomeVar = 'My Value'; |
| : | : |
| : | $SomeVar = $TempCopy; |
| } #automatically restore $SomeVar | } |
(As a matter of convenience, you can assign a value to local($SomeVar), which is exactly the same as assigning to $SomeVar in place of the undef assignment. Also, the
parentheses can be omitted to force a scalar context.)
References to $SomeVar while within the block, or within a subroutine
called from the block, or within a signal handler invoked while within the
block -- any reference from the time the local is called to
the time the block is exited -- refers to `My Value'. If the
code in the block (or anyone else, for that matter) modifies $SomeVar, everyone (including the code in the block) sees the
modification, but it is lost when the block exits and the original copy is
automatically restored.
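The save-and-restore behavior can be observed with a small sketch. The subroutine peek and the array @seen are hypothetical names used only to record what the global holds at each point.

```perl
use strict;
use warnings;

our $SomeVar = 'original';          # a global (package) variable
sub peek { return $SomeVar }        # hypothetical helper: reads the global

my @seen;
push(@seen, peek());                # before the block
{
    local $SomeVar = 'My Value';    # save the old value, install a new one
    push(@seen, peek());            # the subroutine sees the new value
    $SomeVar = 'changed in block';
    push(@seen, peek());            # ...and any later modification
}
push(@seen, peek());                # the saved value is back after the block

print join(', ', @seen), "\n";
```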
As a practical example, consider having to call a function in a poorly
written library that generates a lot of Use of uninitialized value warnings. You use
Perl's -w option, as all good Perl programmers should, but the
library author apparently didn't. You are exceedingly annoyed by the
warnings, but if you can't change the library, what can you do short of
stop using -w altogether? Well, you could set a local value
of $^W, the in-code warning flag (the variable name ^W can be either the two
characters, caret and `W', or an actual control-W character):
{
local $^W = 0; # ensure warnings are off.
&unruly_function(...);
}
# exiting the block restores the original value of $^W
The call to local saves an internal copy of the previous value of
the global variable $^W, whatever it might have been. Then that same
$^W receives the new value of zero that we immediately scribble in.
When unruly_function is executing, Perl checks
$^W and sees the zero we wrote, so doesn't issue warnings.
When the function returns, our value of zero is still in effect.
So far, everything appears to work just as if you didn't use local.
However, when the block is exited right after the subroutine returns, the
saved value of $^W is restored. Your change of the value was local,
in time, to the lifetime of the block. You'd get the same effect by
making and restoring a copy yourself, as in Table 7-3,
but local conveniently takes care of it
for you.
For completeness, let's consider what happens if I use
my instead of local. Using my creates a new variable with an
initially undefined value. It is visible only within the lexical block it
is declared in (that is, visible only by the code written between the
my and the end of the enclosing block). It does not change, modify,
or in any other way refer to or affect other variables, including any
global variable of the same name that might exist. The newly created
variable is not visible elsewhere in the program, including from within unruly_function. In our example snippet, the new $^W is
immediately set to zero but is never again used or referenced, so it's
pretty much a waste of effort. (While executing unruly_function and
deciding whether to issue warnings, Perl checks the unrelated global
variable $^W.)
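Since my cannot be used with $^W itself, the contrast can be sketched with an ordinary global. Everything here is an assumption for illustration: $Verbose stands in for $^W, and this unruly_function merely reports what it sees.

```perl
use strict;
use warnings;

our $Verbose = 1;    # hypothetical global standing in for $^W

# Stand-in for the library function: it consults the global flag.
sub unruly_function { return $Verbose ? 'noisy' : 'quiet' }

sub with_local {
    local $Verbose = 0;          # temporarily changes the global itself
    return unruly_function();
}

sub with_my {
    my $Verbose = 0;             # a new, unrelated private variable
    return unruly_function();    # still consults the untouched global
}

print "local: ", with_local(), "; my: ", with_my(), "\n";
```

Only the local version changes what the called function sees, and in both cases the global is back to its original value afterward.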
| 13 | Perl doesn't allow the use of my with this special variable name,
so the comparison is only academic. |
A useful analogy for local is that it provides a clear transparency
over a variable on which you scribble your own changes. You (and anyone
else that happens to look, such as subroutines and interrupt handlers) will
see the new values. They shadow the previous value until the point in time
that the block is finally exited. At that point, the transparency is
automatically removed, in effect, removing any changes that might have been
made since the local.
This analogy is actually much closer to reality than the original ``an
internal copy is made'' description. Using local doesn't have Perl
actually make a copy, but instead puts your new value earlier in the list
of those checked whenever a variable's value is accessed (that is, it
shadows the original). Exiting a block removes any shadowing values added
since the block started. Values are added manually, with local, but
some variables have their values automatically dynamically scoped. Before
getting into that important regex-related concern, I'd like to present an
extended example illustrating manual dynamic scoping.
The main function in the example below is ProcessFile. When given a filename, it opens the file
and processes commands line by line. In this simple example, there are only
three types of commands, processed at [6], [7], and [8]. Of interest here are the global variables $filename, $command, $., and %HaveRead, as well as the global
filehandle FILE. When ProcessFile is called, all but %HaveRead have their values dynamically scoped by the local at [3].
Dynamic Scope Example
# Process ``this'' command
sub DoThis                                                   # [1]
{
    print "$filename line $.: processing $command";
    :
}
# Process ``that'' command
sub DoThat                                                   # [2]
{
    print "$filename line $.: processing $command";
    :
}
# Given a filename, open file and process commands
sub ProcessFile
{
    local($filename) = @_;                                   # [3]
    local(*FILE, $command, $.);
    open(FILE, $filename) || die qq/can't open "$filename": $!\n/;
    $HaveRead{$filename} = 1;                                # [4]
    while ($command = <FILE>)
    {
        if ($command =~ m/^#include "(.*)"$/) {              # [5]
            if (defined $HaveRead{$1}) {
                warn qq/$filename $.: ignoring repeat include of "$1"\n/;
            } else {
                ProcessFile($1);                             # [6]
            }
        } elsif ($command =~ m/^do-this/) {
            DoThis;                                          # [7]
        } elsif ($command =~ m/^do-that/) {
            DoThat;                                          # [8]
        } else {
            warn "$filename $.: unknown command: $command";
        }
    }
    close(FILE);
}                                                            # [9]
When a do-this command is found (at [7]), the DoThis
function is called to process it. You can see at [1] that the
function refers to the global variables $filename, $., and $command. The DoThis function doesn't know (nor care), but the
values of these variables that it sees were written in ProcessFile.
The #include command's processing begins with the filename being
plucked from the line at [5]. After making sure the file hasn't
been processed already, we call ProcessFile recursively, at [6]. With the new call, the global variables $filename, $command, and $., as well as the filehandle FILE, are
again overlaid with a transparency that is soon updated to reflect the
status and commands of the second file. When commands of the new file are
processed within ProcessFile and the two subroutines, $filename and friends are visible, just as before.
Nothing at this point appears to be different from straight global variables.
The benefits of dynamic scoping are apparent when the second file has been
processed and the related call of ProcessFile exits. When execution
falls off the block at [9], the related local transparencies
laid down at [3] are removed, restoring the original file's values
of $filename and such. This includes the filehandle FILE now
referring to the first file, and no longer to the second.
Finally, let's look at %HaveRead, used to keep track of files
we've seen ([4] and [5]). It is specifically not
dynamically scoped because we really do need it to be global across the
entire time the script runs. Otherwise, included files would be forgotten
each time ProcessFile exits.
A successful match sets variables such as $& (refers to the text matched) and $1
(refers to the text matched by the first parenthesized subexpression).
These variables have their value automatically dynamically scoped upon
entry to every block.
To see the benefit of this, realize that each call to a subroutine involves starting a new block. For these variables, that means a new dynamic scope is created. Because the values before the block are restored when the block exits (that is, when the subroutine returns), the subroutine can't change the values that the caller sees.
As an example, consider:
if (m/(...)/)
{
&do_some_other_stuff();
print "the matched text was $1.\n";
}
Because the value of $1 is dynamically scoped automatically upon
entering each block, this code snippet neither cares, nor needs to care,
whether the function do_some_other_stuff changes the value of
$1 or not. Any changes to $1 by the function are contained
within the block that the function defines, or perhaps within a sub-block
of the function. Therefore, they can't affect the value this snippet sees
with the print after the function returns.
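A runnable variation of this snippet makes the point visible. It is a sketch under assumptions: the target strings and the body of do_some_other_stuff are invented here so that the function performs a match of its own.

```perl
use strict;
use warnings;

# Invented body: the function runs its own match and so sets its own $1.
sub do_some_other_stuff {
    my $ok = ('zzz' =~ m/(z+)/);
    return $1;                      # the function's own $1
}

my $report = '';
if ('abc123' =~ m/(\d+)/) {
    my $inner = do_some_other_stuff();
    # Back in the caller, $1 still refers to the caller's match.
    $report = "inner saw $inner, outer still sees $1";
}

print "$report\n";
```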
The automatic dynamic scoping can be helpful even when not so apparent:
if ($result =~ m/ERROR=(.*)/) {
warn "Hey, tell $Config{perladmin} about $1!\n";
}
(The standard library module Config defines an associative array %Config, of which the member $Config{perladmin} holds
the email address of the local Perlmaster.) This code could be very
surprising if $1 were not automatically dynamically scoped. You see, %Config
is actually a tied variable, which means that any reference to it involves a
behind-the-scenes subroutine call. Config's subroutine to fetch the
appropriate value when $Config{...} is used invokes a regex match.
It lies between your match and your use of $1, so it not being
dynamically scoped would trash the $1 you were about to use. As it
is, any changes in the $Config{...} subroutine are safely hidden by
dynamic scoping.
Careless use of local can create a maintenance nightmare. As I
mentioned, the my(...) declaration creates a private variable with
lexical scope. A private variable's lexical scope is the opposite of
a global variable's global scope, but it has little to do with dynamic
scoping (except that you can't local the value of a my
variable). Remember, local is an action, while my is an
action and a declaration.
$&
A copy of the text successfully matched by the regex. This variable (along with $` and $' below) is best
avoided. (See ``Unsociable $& and Friends''
on page 273.)
$& is never undefined after a successful match.
$`
A copy of the target text in front of (to the left of) the match's start. When used with the /g
modifier, you sometimes wish $` to be the text from the start
of the match attempt, not the whole string. Unfortunately, it
doesn't work that way. If you need to mimic such behavior,
you can try using \G([\x00-\xff]*?) at the front of the regex
and then refer to $1.
$` is never undefined after a successful match.
$'
A copy of the target text after (to the right of) the successfully matched text. After a successful match, the string "$`$&$'" is always a copy of the original target
text.
$' is never undefined after a successful match.
| 15 | Actually, if the original target is undefined, but the match
successful (unlikely, but possible), "$`$&$'" would be
an empty string, not undefined. This is the only situation
where the two differ. |
$1, $2, $3, etc.
The text matched by the 1st, 2nd, 3rd, etc., set of capturing parentheses. (Note that $0 is not included
here -- it is a copy of
the script name and not related to regexes.) These are guaranteed
to be undefined if they refer to a set of parentheses that doesn't
exist in the regex, or to a set that wasn't actually involved in
the match.
These variables are available after a match, including in the
replacement operand of s/.../.../. But it makes no sense to
use them within the regex itself. (That's what \1 and friends
are for.) See ``Using $1 Within a Regex?'' on page 219.
The difference between (\w+) and (\w)+ can
be seen in how these variables are set. Both regexes match exactly
the same text, but they differ in what is matched within the
parentheses.
Matching against the string tubby, the first results in
$1 having tubby, while the latter in it having
y: the plus is outside the parentheses, so each
iteration causes them to start capturing anew.
Also, realize the difference between (x)? and (x?). With
the former, the parentheses and what they enclose are optional, so
$1 would be either x or undefined. But with
(x?), the parentheses enclose a match -- what is
optional are the contents. If the overall regex matches, the
contents matches something, although that something might be the
nothingness x? allows. Thus, with (x?) the possible
values of $1 are x and an empty string.
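A short sketch makes both distinctions concrete. The variable names are only for illustration.

```perl
use strict;
use warnings;

# (\w+) captures the whole run; (\w)+ keeps only the last iteration.
my ($whole) = "tubby" =~ m/(\w+)/;     # 'tubby'
my ($last)  = "tubby" =~ m/(\w)+/;     # 'y'

# (x)? can leave $1 undefined; (x?) always sets it, possibly to ''.
"ab" =~ m/a(x)?b/;
my $group_optional = defined($1) ? "[$1]" : "undef";   # 'undef'

"ab" =~ m/a(x?)b/;
my $body_optional  = defined($1) ? "[$1]" : "undef";   # '[]'
```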
Perl4 and Perl5 treat unusual cases involving parentheses and
iteration via star and friends slightly differently. It shouldn't
matter to most of you, but I should at least mention it. Basically,
the difference has to do with what $2 will receive when
something like (main(OPT)?)+ matches only main and not
OPT during the last successful iteration of the plus. With
Perl5, because (OPT) did not match during the last successful
match of its enclosing subexpression, $2 becomes (rightly, I
think) undefined. In such a case, Perl4 leaves $2 as what it
had been set to the last time (OPT) actually matched. Thus,
with Perl4, $2 is OPT if it had matched any time
during the overall match.
$+
A copy of the highest-numbered $1, $2, etc. explicitly
set during the match. If there are no capturing parentheses in the
regex (or none used during the match), it becomes undefined. However, Perl does not issue a warning when an undefined $+
is used.
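As a small illustration of why $+ is handy, consider alternatives that each have their own capturing parentheses. The input strings below are made up for the example.

```perl
use strict;
use warnings;

my @found;
for my $line ('Version: 3.1', 'Revision: 42') {
    # Only one alternative (and so only one of $1 and $2) takes part
    # in any given match; $+ picks up whichever one was set.
    if ($line =~ m/Version: (.*)|Revision: (.*)/) {
        push @found, $+;
    }
}
```

Without $+, the code would have to test $1 and $2 separately.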
When a regex is applied repeatedly with the /g modifier, each
iteration sets these variables afresh. This is why, for instance, you can
use $1 within the replacement operand of s/.../.../g and have
it represent a new slice of text each time. (Unlike the regex operand, the
replacement operand is re-evaluated for each
iteration; =>255.)
Using $1 Within a Regex?
\1 is not available as a backreference outside of a
regex. (Use the variable $1 instead.) \1 is much more than a simple
notational convenience -- the variable $1 refers to a string of
static text matched during some previously completed successful match.
On the other hand, \1 is a true regex metacharacter to match text
similar to that matched within the first parenthesized subexpression
at the time that the regex-directed NFA reaches the \1. What \1 matches might change over the course of an attempt as the NFA
tracks and backtracks in search of a match.
A related question is whether $1 is available within a regex operand.
The answer is ``Yes, but not the way you might think.'' A $1
appearing in a regex operand is treated exactly like any other variable:
its value is interpolated (the subject of the next section) before the
match or substitution operation even begins. Thus, as far as the regex is
concerned, the value of $1 has nothing to do with the current match,
but remains left over from some previous match somewhere else.
In particular, with something like s/.../.../g, the regex operand is
evaluated, compiled once (also discussed in the next section), and then
used by all iterations via /g. This is exactly opposite of the
replacement operand, which is re-evaluated after each match. Thus, a
$1 within the replacement operand makes complete sense, but in a
regex operand it makes virtually none.
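The three cases can be seen side by side in the following sketch (the sample strings are invented for illustration).

```perl
use strict;
use warnings;

# \1 is resolved by the regex engine during the current match attempt.
my ($doubled) = "this is is a test" =~ m/\b(\w+) \1\b/;    # 'is'

# $1 in the regex operand is interpolated before the match begins,
# so it holds whatever the previous successful match left behind.
"cat" =~ m/(\w+)/;                                          # $1 is now 'cat'
my $stale = ("concatenate" =~ m/^con$1/) ? 1 : 0;           # regex is ^concat

# $1 in the replacement operand is re-evaluated for each iteration.
(my $swapped = "a=1 b=2") =~ s/(\w)=(\d)/$2=$1/g;           # '1=a 2=b'
```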
In the assignment
$month = "January";
$month gets the same value each time the statement is
executed because "January" never changes. However,
Perl can interpolate variables within doublequoted strings (that is,
have the variable's value inserted in place of its name). For example, in
$message = "Report for $month:";
what $message gets depends on the value of $month,
and in fact potentially changes each time the program does this assignment.
The doublequoted string "Report for $month:" is exactly the same
as:
'Report for ' . $month . ':'
(In a general expression, a lone period is Perl's string-concatenation operator; concatenation is implicit within a doublequoted string.)
The doublequotes are really operators that enclose operands. A string such as
"the month is $MonthName[&GetMonthNum]!"
is the same as the expression
'the month is ' . $MonthName[&GetMonthNum] . '!'
including the call to GetMonthNum each time the string is
evaluated. Yes, you really can call functions from within doublequoted
strings -- because doublequotes are operators! To create a true
constant, Perl provides singlequoted strings, used here in the
code snippet equivalences.
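To see that the function really is called from within the string, here is a minimal sketch. The contents of @MonthName and the body of GetMonthNum are assumptions made up for the example; the counter exists only to show how often the call happens.

```perl
use strict;
use warnings;

my @MonthName = qw(Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec);
my $calls = 0;

# Hypothetical stand-in: counts its calls and always answers 2 (March).
sub GetMonthNum { $calls++; return 2 }

my $interpolated = "the month is $MonthName[&GetMonthNum]!";
my $concatenated = 'the month is ' . $MonthName[&GetMonthNum] . '!';
```

Both strings come out the same, and the counter shows one call per evaluation.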
One of Perl's unique features is that a doublequoted string doesn't have to
actually be delimited by doublequotes. The qq/.../ notation provides
the same functionality as "...", so qq/Report for $month:/
is a doublequoted string.
You can also choose your own delimiters. The following example uses qq{...} to delimit the doublequoted string:
warn qq{"$ARGV" line $.: $ErrorMessage\n};
Singlequoted strings use q/.../, rather than qq/.../. Regular
expressions use m/.../ and s/.../.../ for match and
substitution, respectively. The ability to pick your own delimiters for
regular expressions, however, is not unique to Perl: ed and its
descendants have supported it for over 25 years.
$field = "From";
:
if ($headerline =~ m/^$field:/) {
:
}
The $field within the regex operand is taken as a variable reference and is replaced by the
variable's value, resulting in ^From: being the regular expression
actually used. ^$field: is the regex operand; ^From: is the
actual regex after cooking. This seems similar to what happens with
doublequoted strings, but there are a few differences (details follow
shortly), so it is often said that a regex operand receives ``doublequotish'' processing.
One might view the mathematical expression ($F - 32) * 5/9 as
Divide( Multiply( Subtract($F, 32), 5), 9 )
to show how the evaluation logically progresses.
It might be instructive to see $headerline =~ m/^$field:/
presented in a similar way:
RegexMatch( $headerline, DoubleQuotishProcessing(``^$field:'') )
Thus, you might consider ^$field: to be a match operand only
indirectly, as it must pass through doublequotish processing first. Let's
look at a more involved example:
$single = qq{'([^'\\\\]*(?:\\\\.[^'\\\\]*)*)'}; # to match a singlequoted string
$double = qq{"([^"\\\\]*(?:\\\\.[^"\\\\]*)*)"}; # to match a doublequoted string
$string = "(?:$single|$double)"; # to match either kind of string
:
while (<CONFIG>) {
if (m/^name=$string$/o) {
$config{name} = $+;
} else {
:
This method of building up the variable $string and then using it in
the regular expression is much more readable than writing the whole
regex directly:
if (m/^name=(?:'[^'\\]*(?:\\.[^'\\]*)*'|"[^"\\]*(?:\\.[^"\\]*)*")$/o) {
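For experimenting, the build-up approach can be run on its own. In this sketch the sample lines and the @names array are invented, and a plain list stands in for the CONFIG filehandle.

```perl
use strict;
use warnings;

my $single = qq{'([^'\\\\]*(?:\\\\.[^'\\\\]*)*)'};   # singlequoted string
my $double = qq{"([^"\\\\]*(?:\\\\.[^"\\\\]*)*)"};   # doublequoted string
my $string = "(?:$single|$double)";                  # either kind

my @names;
for my $line (q{name='Fred'}, q{name="Wilma \"W\" F."}, q{name=bare}) {
    if ($line =~ m/^name=$string$/o) {
        push @names, $+;        # whichever of $1 or $2 was set
    }
}
```

The unquoted third line is rejected, and $+ saves having to know which alternative matched.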
Several important issues surround this method of building
regular expressions within strings, such as all those extra backslashes,
the /o modifier, and the use of (?:...) non-capturing
parentheses. See ``Matching an Email Address'' (=>294) for a heady
example. For the moment, let's concentrate on just how a regex operand
finds its way
to the regex search engine. To get to the bottom of this, let's follow
along as Perl parses the program snippet:
$header =~ m/^\Q$dir\E # base directory
\/ # separating slash
(.*) # grab the rest of the filename
/xgm;
For the example, we'll assume that $dir contains `~/.bin'.
The \Q...\E that wraps the reference to $dir is a feature of
doublequotish string processing that is particularly convenient for regex
operands. It puts a backslash in front of most symbol characters. When the
result is used as a regex, it matches the literal text the \Q...\E
encloses, even if that literal text contains what would otherwise be
considered regex metacharacters (with our example of `~/.bin', the
first three characters are escaped, although only the dot requires it).
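A quick sketch shows both what \Q...\E produces and why it matters. The test paths are made up for the example.

```perl
use strict;
use warnings;

my $dir = '~/.bin';

# What \Q...\E leaves behind after doublequotish processing.
my $escaped = "\Q$dir\E";                                 # \~\/\.bin

# Without \Q, the dot in $dir is a metacharacter and matches the 'x'.
my $loose  = ('~/xbin/perl' =~ m{^$dir/})     ? 1 : 0;    # 1
my $strict = ('~/xbin/perl' =~ m{^\Q$dir\E/}) ? 1 : 0;    # 0

# With \Q, only the literal directory matches.
my ($rest) = '~/.bin/perl5' =~ m{^\Q$dir\E/(.*)};         # 'perl5'
```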
Also pertinent to this example are the free whitespace and comments within
the regex. Starting with Perl version 5.002, a regex operand subject to the
/x modifier can use whitespace liberally, and may have raw comments.
Raw comments start with # and continue to the next newline (or to
the end of the regex). This is but one way that doublequotish
processing of regex operands differs from real doublequoted strings: the
latter has no such /x modifier.
Figure 7-1
on page 223
illustrates the path from unparsed script, to regex operand, to real regex,
and finally to the use in a search. Not all the phases are necessarily done
at the same time. The lexical analysis (examining the script and deciding
what's a statement, what's a string, what's a regex operand, and so on) is
done just once when the script is first loaded (or in the case of
eval with a string operand, each time the
eval is re-evaluated). That's the first phase of Figure 7-1. The other phases can be done at different
times, and perhaps even multiple times. Let's look at the details.
Phase A. Perl first recognizes m, the match operator, so it knows to scan a
regex operand. At point 1 in the figure, it recognizes the slash as the
operand delimiter, then searches for the closing delimiter, finding it at
point 4. For this first phase, strings and such get the same treatment,
so there's no special regex-related processing here. One transformation
that takes place in this phase is that the backslash of an escaped
closing delimiter is removed, as at point 3.

Phase B. The operand next receives doublequotish processing: variables are interpolated, and \Q...\E
and such are processed. (The full list of these constructs is given in
Table 7-8 on page 245.)
With our example, the value of $dir is interpolated under the
influence of \Q, so `\~\/\.bin' is actually inserted
into the operand.
Although similar, differences between regex operands and doublequoted
strings become apparent in this phase. Phase B realizes it is working with
a regex operand, so processes a few things differently. For example, \b and \3 in a doublequoted string
always represent a backspace and an octal escape, respectively. But in a
regex, they could also be the word-boundary metacharacter or a
backreference, depending on where in the regex they're
located -- Phase B therefore leaves them undisturbed for the regex
engine to later interpret as it sees fit. Another difference involves what
is and isn't considered a variable reference. Something like
$| will always be a variable reference within a string, but it is left
for the regex engine to interpret as the metacharacters $ and |.
Similarly, a string always interprets $var[2-7]
as a reference to element -5 of the array @var (which means
the fifth element from the end), but as a regex operand, it is interpreted
as a reference to $var followed by a character class. You can
use the ${...} notation for variable interpolation to force an array
reference if you wish:
${var[2-7]}.
Due to variable interpolation, the result of this phase can depend on the value of variables (which can change as the program runs). In such a case, Phase B doesn't take place until the match code is reached during runtime. ``Perl Efficiency Issues'' (=>265) expands on this important point.
As discussed in the match and substitution operator sections later in this chapter, using a singlequote as the regex-operand delimiter invokes singlequotish processing. In such a case, this Phase B is skipped.
Phase C. This phase applies only to regex operands used with /x.
Whitespace (except within a character class) and
comments are removed. Because this happens after Phase B's variable
interpolation, whitespace and comments brought in from a variable end up
being removed as well. This is certainly convenient, but there's one trap
to watch out for. The # comments continue to the next newline or the
end of the operand -- this is different from ``until the end of
an interpolated string.'' Consider if we'd included a comment at the end
of page 221's $single:
$single = qq{'(...regex here...)' # for singlequoted strings};
This value makes its way to $string and then to the regex operand.
After Phase B, the operand is as shown below. The comment was intended to be only
`# for singlequoted strings', but the real comment runs from the # to the end of the operand:
^name=(?:'(...)'·#·for·singlequoted·strings|"(...)")$
Surprise! The comment intended only for $single ends up wiping out what
follows because we forgot to end the comment with a newline.
If, instead, we use:
$single = qq{'(...regex here...)' # for singlequoted strings\n};
the \n is interpreted by the doublequoted
string, providing an actual newline character to the regex when it's
used. If you use a singlequoted q{...} instead, the regex
receives the raw \n, which matches a newline, but is not a
newline. Therefore, it doesn't end the comment and is removed with the rest
of the comment.
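The trap is easy to reproduce. The sketch below uses simplified stand-ins for $single and $double (no backslash handling) so that the effect of the comment is easier to see; the eval is there because the runaway comment is expected to leave the regex with unbalanced parentheses.

```perl
use strict;
use warnings;

my $double = qq{"([^"]*)"};        # simplified doublequoted string

# No trailing newline: under /x the comment runs to the end of the operand.
my $single_bad = qq{'([^']*)' # for singlequoted strings};
my $bad        = "(?:$single_bad|$double)";
my $bad_ok     = eval { (q{name="x"} =~ m/^name=$bad$/x) ? 1 : 0 };

# With the newline, the comment ends where it was meant to.
my $single_good = qq{'([^']*)' # for singlequoted strings\n};
my $good        = "(?:$single_good|$double)";
my $good_value  = (q{name="x"} =~ m/^name=$good$/x) ? $+ : undef;
```

Only the version with the newline leaves the second alternative in the regex.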
| 16 | Figure 7-1 is a model describing the complex
multi-level parsing of Perl, but those digging around in Perl internals
will find that the processing actually done by Perl is slightly
different. For example, what I call Phase C is not a separate step, but
is actually part of both Phases B and D.
I feel it is more clear to present it as a separate step, so have done so. (I spent a considerable amount of time coming up with the model that Figure 7-1 illustrates -- I hope you'll find it helpful.) However, in the unlikely event of variable references within a comment, my model and reality could conceivably differ.
These situations are farfetched and rare, so they will almost certainly never matter to you, but I felt I should at least mention them. |
Phase D. In the final phase the regex is compiled. This is covered in the section on the /o modifier and efficiency
(=>268). Also see the related discussion in
Chapter 5's ``Compile Caching'' (=>158).
| 17 | Larry Wall's preferred perlance uses minimal matching and maximal matching. |
| Number of matches | Traditional greedy (maximal matching) | Lazy, non-greedy (minimal matching) |
|---|---|---|
| Any number (zero, one, or more) | * | *? |
| One or more | + | +? |
| Optional (zero or one) | ? | ?? |
| Specified limits (at least min; no more than max) | {min,max} | {min,max}? |
| Lower limit (at least min) | {min,} | {min,}? |
| Exactly num | {num} | {num}? |
The non-greedy versions are examples of the ``ungainly looking''
additions to the regex flavor that appeared with Perl5. Traditionally,
something like *? makes no sense in a regex. In fact, in Perl4 it
is a syntax error. So, Larry was free to give it new meaning. There was
some thought that the non-greedy versions might be **, ++, and
such, which has certain appeal, but the problematic notation for
{min,max} led Larry to choose the appended question mark.
This also leaves ** and the like for future expansion.
It's not common that you have a free choice between using greedy and non-greedy quantifiers, since they have such different meanings, but if you do, the choice is highly dependent upon the situation. Thinking about the backtracking that either must do with the data you expect should lead you to a choice. For an example, I recommend my article in the Autumn 1996 (Volume 1, Issue 3) issue of The Perl Journal in which I take one simple problem and investigate a variety of solutions, including those with greedy and non-greedy quantifiers.
It is tempting to use a non-greedy construct as an easily typed replacement for a negated character class, such as
<(.+?)> instead of <([^>]+)>. Sometimes this type
of replacement works, although it is less efficient -- the implied
loop of the star or plus must keep stopping to see whether the rest of the
regex can match. Particularly in this example, it involves temporarily
leaving the parentheses which, as Chapter 5 points out, has its own
performance penalty (=>150). Even though the
non-greedy constructs are easier to type and perhaps easier to read, make
no mistake: what they match can be very different.
First of all, of course, is that if the /s modifier
(=>234) is not used, the dot of .+? doesn't
match a newline, while the negated class in [^>]+ does.
A bigger problem that doesn't always manifest itself clearly can be shown
with a simple text markup system that uses <...> to indicate
emphasis, as with:
Fred was very, <very> angry. <Angry!> I tell you.
To match only the emphasized items that end with an exclamation point, one regex is <([^>]*!)>.
Using <(.*?!)>, even with the /s modifier, is very
different. The former matches `...Angry!...' while the
latter matches `...very> angry. <Angry!...'.
The point to remember is that the negated class in [^>]*>
never matches `>', while the non-greedy construct in .*?> does if that is what it takes to achieve a
match. If nothing after the lazy construct could force the regex engine to
backtrack, it's not an issue. However, as the exclamation point of this
example illustrates, the desire for a match allows the lazy construct past
a point you really don't want it to (and that a negated character class can't) exceed.
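Running both regexes against the sample text shows the difference directly. This is only a sketch; the variable names are for illustration.

```perl
use strict;
use warnings;

my $text = 'Fred was very, <very> angry. <Angry!> I tell you.';

# The negated class can never cross a '>'.
my ($with_class) = $text =~ m/<([^>]*!)>/;     # 'Angry!'

# The lazy dot crosses '>' when that is what it takes to reach '!>'.
my ($with_lazy)  = $text =~ m/<(.*?!)>/s;      # 'very> angry. <Angry!'
```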
The non-greedy constructs are without a doubt the most powerful Perl5
additions to the regex flavor, but you must use them with care. A
non-greedy .*? is almost never a reasonable substitute for
[^...]* -- one might be proper for a particular situation, but
due to their vastly different meaning, the other is likely incorrect.
Parentheses traditionally serve two purposes: grouping, and capturing matched text to $1, $2, and friends.
Within the same regular expression, the
metacharacters \1, \2, and so on, are used instead. As pointed
out on page 219, the difference is much more
than just notational. Except within a character class where a
backreference makes no sense, \1 through \9 are always
backreferences.
Additional backreferences (\10, \11, ...)
become available as the number of capturing parentheses warrants
(=>243).
Uniquely, Perl provides two kinds of parentheses: the traditional (...) for both grouping and capturing, and new with Perl5, the
admittedly unsightly (?:...) for grouping only. With (?:...),
the ``opening parenthesis'' is really the three-character sequence
`(?:', while the ``closing parenthesis'' is the usual `)'.
Like the notation for the non-greedy quantifiers, the sequence `(?' was
previously a syntax error. Starting with version 5, it is used for a number
of regex language extensions, of which (?:...) is but
one. We'll meet the others soon.
(?:return-to|reply-to):· and
re(?:turn-to:·|ply-to:·),
for example, are logically
equivalent, but the latter is faster, with both matches and failures. (If
you're not convinced, work through an attempt to match against
forward-·and·reply-to·fields...) Perl's regex engine does not
currently take advantage of this, but it could in the future.
From a user's point of view, the benefit that probably matters most is easier maintenance. Recall how, in the CSV example,
the second alternative's parentheses captured to $3 because the first alternative used two sets of parentheses
(=>204; =>207).
Any time the first alternative changes the number of parentheses it uses, all subsequent references
must be changed accordingly.
It's a real maintenance nightmare that non-capturing parentheses can greatly
alleviate.
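The renumbering can be seen with the CSV regex itself. In this sketch the field value is invented, and the only change between the two matches is turning the inner grouping parentheses into (?:...).

```perl
use strict;
use warnings;

my $field = 'plain,';

# Inner parentheses capture, so the normal-field text lands in $3.
$field =~ m/"([^"\\]*(\\.[^"\\]*)*)",?|([^,]+),?|,/;
my $with_capturing = $3;                               # 'plain'

# Inner parentheses only group, so the same text now lands in $2.
$field =~ m/"([^"\\]*(?:\\.[^"\\]*)*)",?|([^,]+),?|,/;
my $with_grouping = $2;                                # 'plain'
```

With (?:...), later changes to the grouping inside the first alternative no longer shift the numbering of what follows.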
| 20 | Even better still would be named subexpressions, such as Python provides through its symbolic group names, but Perl doesn't offer this... yet. |
For related reasons, the number of capturing parentheses becomes important
when m/.../ is used in a list context, or at any time with
split. We will see specific examples in each of the relevant
sections later
(=>252, 264).
Perl5 also adds the (?=...) and (?!...) lookahead
constructs. Like normal non-capturing parentheses, the positive lookahead
(?=subexpression) is true if the subexpression matches. The
twist is that the subexpression does not actually ``consume'' any of
the target string -- lookahead matches a position in the string
similar to the way a word boundary does. This means that it doesn't add to
what $& or any enclosing capturing parentheses contain. It's a way of
peeking ahead without taking responsibility for it.
Similarly, the negative lookahead (?!subexpression) is true
when the subexpression does not match. Superficially, negative
lookahead is the logical counterpart to a negated character class, but
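there are two major differences. First, a negated character class must match a character to succeed, and it consumes that character; negative lookahead succeeds precisely when its subexpression cannot match, and it consumes nothing. Second, a character class matches a single character, while lookahead can test an arbitrarily complex subexpression.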
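A few examples of lookahead:
Bill(?=·The·Cat|·Clinton)
Matches Bill, but only if followed by `·The·Cat' or `·Clinton'.
\d+(?!\.)
Matches a number if not followed by a period.
\d+(?=[^.])
Matches a number if followed by something other than a period. This is not the same as the previous item: consider a number at the very end of a string. Which of these two will match the string `OH·44272', and where?
Think carefully, then check your answer.
Answer to the question on page 228
Both \d+(?!\.) and \d+(?=[^.]) can match `OH·44272'. The first matches 44272, while the second matches only 4427.
Remember, greediness always defers in favor of an overall match. Since \d+(?=[^.]) requires a non-period after the matched number, the \d+ gives up its final digit so that the lookahead has something to match.
We don't know what these might be used for, but they should probably
be written as \d+(?![\d.]) and \d+(?=[^.\d]).
^(?![A-Z]*$)[a-zA-Z]*$
Matches if the item contains only letters, but not if they are all uppercase.
^(?=.*?this)(?=.*?that)
A way of checking whether both this and that can match on the line. (A more logical
solution that is mostly comparable is the dual-regex /this/
&& /that/.)
Other illustrative examples can be found at ``Keeping the Match in Synch with Expectations'' (=>237). For your entertainment, here's a particularly heady example copied from ``Adding Commas to a Number'' (=>292):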
s<
(\d{1,3}) # before a comma: one to three digits
(?= # followed by, but not part of what's matched...
(?:\d\d\d)+ # some number of triplets...
(?!\d) # ...not followed by another digit
) # (in other words, which ends the number)><$1,>gx;
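The lookahead examples in this section can be tried out with a few lines of Perl. This is a sketch with made-up inputs; the comma-adding regex is the one above, written on one line without /x.

```perl
use strict;
use warnings;

# Adding commas: each inserted comma is justified by the lookahead alone.
(my $commas = "1234567") =~ s<(\d{1,3})(?=(?:\d\d\d)+(?!\d))><$1,>g;

# A number at the very end of the string: the two forms differ.
my ($not_period)     = 'OH 44272' =~ m/(\d+(?!\.))/;     # '44272'
my ($then_nonperiod) = 'OH 44272' =~ m/(\d+(?=[^.]))/;   # '4427'

# Two lookaheads at the same position act as a logical AND.
my $both = ('that and this' =~ m/^(?=.*?this)(?=.*?that)/) ? 1 : 0;   # 1
my $one  = ('only this'     =~ m/^(?=.*?this)(?=.*?that)/) ? 1 : 0;   # 0
```

The second number comes up one digit short because the positive lookahead needs a real character to match, and the only one available is the final digit.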
Lookahead parentheses do not capture text, so they do not count as a set of
parentheses for numbering purposes. However, they may contain raw
parentheses to capture the phantomly matched text. Although I don't
recommend this often, it can perhaps be useful. For example,
(.*?)(?=<(strong|em)\s*>),
matches everything up to but not including a <strong> or <em> HTML tag on the line. The consumed text is put
into $1 (and $& as well, of course), while the `strong' or
`em' that allows the match to stop is placed into $2. If
you don't need to know which tag allows the match to stop successfully, the ...(strong|em)... is better written as ...(?:strong|em)... to eliminate the needless
capturing. As another example, I use an appended (?=(.*)) on
page =>277 as part of mimicking $'. (Due to the $&-penalty, we want to avoid $' if at all
possible; =>273.)
Using capturing parentheses within a negative lookahead construct makes absolutely no sense, since the negative lookahead construct matches only when its enclosed subexpression does not.
The perlre manpage rightly cautions that lookahead is very
different from lookbehind. Lookahead ensures that the condition
(matching or not matching the given subexpression) is true starting at the
current location and looking, as normal, toward the right. Lookbehind, were
it supported, would somehow look back toward the left.
For example, (?!000)\d\d\d means ``so long as they're not
000, match three digits,'' which is a reasonable thing to want to do.
However, it is important to realize that it specifically does not mean
``match three digits that are not preceded by 000.'' This would
be lookbehind, which is not supported in Perl or any other regex flavor
that I know of. Well, actually, I suppose that a leading anchor (either
type: string or word) can be considered a limited form of lookbehind.
You often use lookahead at the end of an expression to disallow it from
matching when followed (or not followed) by certain things. Although its
use at the beginning of an expression might well indicate a mistaken
attempt at lookbehind, leading lookahead can sometimes be used to make a
general expression more specific. The 000 is one example, and
the use of (?!0+\.0+\.0+\.0+\b) to
disallow null IP addresses in Chapter 4 (=>125) is
another. And as we saw in Chapter 5's ``A Global View of
Backtracking'' (=>149), lookahead at the
beginning of an expression can be an effective way to speed up a match.
Otherwise, be careful with leading negative lookahead. The expression \w+ happily matches the first word in the string, but prepending
(?!cat) is not enough to ensure ``the first word not beginning
with cat.'' (?!cat)\w+ can't match at the start of cattle, but it can still match the attle within cattle. To get the
desired effect, you need additional precautions, such as \b(?!cat)\w+.
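A brief sketch of leading negative lookahead in action; the sample strings are invented for illustration.

```perl
use strict;
use warnings;

# "Three digits, so long as they're not 000."
my @ok = grep { m/^(?!000)\d\d\d$/ } qw(000 007 123);       # ('007', '123')

# Without an anchor, the engine just bumps along one character.
my ($loose)    = 'cattle ranch' =~ m/((?!cat)\w+)/;         # 'attle'

# A word boundary forces the test to happen at the start of a word.
my ($anchored) = 'cattle ranch' =~ m/(\b(?!cat)\w+)/;       # 'ranch'
```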
Within a regex, text enclosed in a (?#...) construct is taken as a comment and ignored. Its
content is not entirely free-form, as any copy of the regex-operand
delimiter must still be escaped.
| 22 | (?#...) comments are removed very early in the parsing, effectively
between Phase A and Phase B of
page 223's Figure 7-1. A bit of trivia for you: As far as I can
tell, the closing parenthesis of a (?#...) comment is the only item
in Perl's regular expression language that cannot be escaped. The first
closing parenthesis after the `(?#' ends the comment, period. |
(?#...) appeared in version 5.000, but as of 5.002, the /x
modifier enables the unadorned # comment
that runs to the next newline (or the end of the regex),
pretty much just like in regular code.
(We saw an example of this earlier in the commafication
snippet on page 229.)
The /x modifier also causes
most whitespace to be ignored, so you can write the
$text =~ m/"([^"\\]*(\\.[^"\\]*)*)",?|([^,]+),?|,/g
of the CSV example as the more easily understood:
$text =~ m/                    # A field is one of three types of things:
                               # 1) DOUBLEQUOTED STRING
    "([^"\\]*(\\.[^"\\]*)*)"   #  - Standard doublequoted string (nab to $1).
    ,?                         #  - Eat any trailing comma.
  |                            #  -OR-
                               # 2) NORMAL FIELD
    ([^,]+)                    #  - Capture (to $3) up to next comma.
    ,?                         #  - (and including comma if there is one)
  |                            #  -OR-
                               # 3) EMPTY FIELD
    ,                          #  just match the comma.
/gx;
The underlying regular expression is exactly the same.
As with (?#...) comments, the regex's closing delimiter
still counts, so these comments are not entirely free-form either. As with
most other regex metacharacters, the # and whitespace metacharacters
that become active with /x are not available within a character class,
so there can be no comments or ignored whitespace within classes. Like
other regex metacharacters, you can escape # and whitespace within an /x regex:
m{
^ # Start of line.
(?: # Followed by one of:
From # `From'
|Subject # `Subject'
|Date # `Date'
) #
: # All followed by a colon...
\ * # .. and any number of spaces. (note escaped space)
(.*) # Capture rest of line (except newline) to $1.
}x;
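Here is the same idea as a runnable sketch, applied to a made-up header line, plus a second match showing an escaped #.

```perl
use strict;
use warnings;

my $header = "Subject:   free-form regexes\n";

my ($rest) = $header =~ m{
    ^                          # Start of line.
    (?: From | Subject | Date )
    :                          # All followed by a colon...
    \ *                        # ...and any number of spaces (escaped space).
    (.*)                       # Capture the rest of the line, minus newline.
}x;

# An escaped '#' is a literal character, not the start of a comment.
my ($num) = 'issue #42' =~ m/ \# (\d+) /x;
```

Note that the unescaped spaces around the alternatives are ignored, while the escaped space before the star is a real one.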
Other (?...) constructs
The (?modifiers) notation uses the `(?' notation
mentioned in this section, but for a different purpose. Traditionally,
case-insensitive matching is invoked with the /i modifier. You can
accomplish the same thing by putting (?i) anywhere in the regex
(usually at the beginning). You can specify /i (case insensitive), /m (multi-line mode), /s (single-line mode), and /x (free
formatting) using this mechanism. They can
be combined; (?si) is the same as using both /i and /s.
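A small sketch of the embedded-modifier notation, using made-up strings. The modifiers are placed at the very beginning of each regex here, which is the usual position.

```perl
use strict;
use warnings;

my $plain    = ('SUBJECT: hi' =~ m/^subject:/)     ? 1 : 0;   # 0
my $embedded = ('SUBJECT: hi' =~ m/(?i)^subject:/) ? 1 : 0;   # 1
my $modifier = ('SUBJECT: hi' =~ m/^subject:/i)    ? 1 : 0;   # 1

# Combined: case-insensitive, and dot matches newline.
my $combined = ("A\nb" =~ m/(?si)^a.B/) ? 1 : 0;              # 1
```

One practical use is a regex held in a variable, which can then carry its own modifiers along with it.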
Within a typical while (<>) loop (with the input record
separator $/ at its default), you know that the text being checked
is exactly one logical line, so the distinction between ``start of
logical line'' and ``start of the string'' is irrelevant.
| 23 | A dollar sign can also indicate variable interpolation, but whether it represents that or the end-of-line metacharacter is rarely ambiguous. Details are in ``Doublequotish Processing and Variable Interpolation'' (=>222). |
However, if a string contains embedded newlines, it's reasonable to
consider the one string to be a collection of multiple logical lines. There
are many ways to create strings with embedded newlines (such as
"this\nway") -- when applying a regex to such a string,
however it might have acquired its data, should ^Subject: find `Subject:' at the start of any of the logical lines, or only at the
start of the entire multi-line string?
Perl allows you to do either. Actually, there are four distinct modes, summarized in Table 7-5.
| mode | ^ and $ anchors consider target text as | dot |
|---|---|---|
| default mode | a single string, without regard to newlines | doesn't match newline |
| single-line mode | a single string, without regard to newlines | matches all characters |
| multi-line mode | multiple logical lines separated by newlines | (unchanged from default) |
| clean multi-line | multiple logical lines separated by newlines | matches all characters |
When there are no embedded newlines in the target string, all modes are equal. You'll notice that Table 7-5 doesn't mention...
whether \n, an appropriate octal escape, or even
[^x] can match a newline (they always can).
| 24 | The line anchor optimization does become a bit less effective in the two multi-line modes because the transmission must still search for embedded newlines at which to reapply the regex (=>158). |
In fact, meticulous study of Table 7-5 reveals that these modes are concerned only with how the three metacharacters, caret, dollar, and dot, consider newlines.
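The default mode
In the default mode, dot does not match a newline, and dollar can match before a text-ending newline, which allows .*$ to consume
everything up to, but not including, the newline. My term for this default
mode is the default mode. You can quote me.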
Use the /m modifier to invoke a multi-line mode match. This allows
caret to match at the beginning of any logical line (at the start of the
string, as well as after any embedded newline), and dollar to match at the
end of any logical line (that is, before any newline, as well as at the end
of the string). The use of /m does not affect what dot does or
doesn't match, so in the typical case where /m is used alone, dot
retains its default behavior of not matching a newline. (As we'll soon see, /m can be combined with /s to create a clean multi-line
mode where dot matches anything at all.)
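A minimal sketch of what /m does and does not change, using an invented two-line string:

```perl
use strict;
use warnings;

my $text = "From: a\nSubject: b\n";

# Default mode: caret matches only at the start of the string.
my $default = ($text =~ m/^Subject:/)  ? 1 : 0;      # 0

# Multi-line mode: caret also matches after an embedded newline.
my $multi   = ($text =~ m/^Subject:/m) ? 1 : 0;      # 1
my @fields  = $text =~ m/^(\w+):/mg;                 # ('From', 'Subject')

# Dot is unaffected by /m, so .* still stops at the newline.
my ($line)  = $text =~ m/^(.*)$/m;                   # 'From: a'
```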
Let me say again:
The /m modifier influences only how ^ and $ treat
newlines.
The /m modifier affects only regular-expression matching, and in
particular, only the caret and dollar metacharacters. It has nothing
whatsoever to do with anything else. Perhaps ``multi-line mode'' is
better named ``line-anchors-notice-embedded-newlines mode.'' The /m and multi-line mode are debatably the most misunderstood simple
features of Perl, so please allow me to get on the soapbox for a moment to
make it clear that the /m modifier has nothing to do
with...
Whether \n matches a newline. \n always matches a
newline without regard to multi-line mode, or the lack thereof.
Whether dot matches a newline. The /s modifier
(discussed on page 234)
influences dot, but the /m modifier does not. The /m
modifier influences only whether the anchors match at a
newline. Conceptually, this is not entirely unrelated to whether
dot matches a newline, so I often bring the description of
dot's default behavior under the umbrella term ``multi-line
mode,'' but the relationship is circumstantial.
How the target text is read, as with $/, the input-record
separator variable. If $/ is set to the empty string, it
puts <> and the like into paragraph mode in
which it will return, as a single string, all the lines until (and
including) the next blank line. If you set $/ to
undef, Perl goes into file slurp mode where the entire
(rest of the) file is returned, all in a single
string.
| 25 | A non-regex warning: It's a common mistake to think that
undefining $/ causes the next <> read to
return ``the rest of the input'' (or if used at the start of
the script, ``all the input''). ``File slurp'' mode is
still a file slurp mode. If there are multiple files for
<> to read, you still need one call per file. If
you really want to get all the input, use
join('',<>). |
It's convenient to use multi-line mode in conjunction with a
special setting of $/, but neither has any relationship to
the other.
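To make that independence concrete, here is a minimal sketch, assuming a reasonably recent Perl 5. The sample text and the in-memory filehandle are stand-ins chosen for illustration. The $/ setting decides how much text each read returns, while /m decides only where caret may match inside whatever text was returned.

```perl
use strict;
use warnings;

# Hypothetical sample input: a two-line header, a blank line, then a body.
my $data = "From: alice\nTo: bob\n\nfirst body line\nsecond body line\n";

# Paragraph mode: with $/ set to the empty string, each read returns
# everything up to and including the next blank line.
open(my $fh, '<', \$data) or die "cannot open in-memory file: $!";
my @paragraphs;
{
    local $/ = '';
    @paragraphs = <$fh>;
}
close $fh;

# File slurp mode: with $/ undefined, one read returns the rest of the file.
open($fh, '<', \$data) or die "cannot open in-memory file: $!";
my $whole;
{
    local $/ = undef;
    $whole = <$fh>;
}
close $fh;

# /m is a separate matter: it only lets ^ match after an embedded newline.
my @fields_m    = $paragraphs[0] =~ /^(\w+):/mg;   # header names on both lines
my @fields_no_m = $paragraphs[0] =~ /^(\w+):/g;    # only the first line qualifies

print scalar(@paragraphs), " paragraphs; fields with /m: @fields_m\n";
```

Changing how the paragraph was read does nothing to the anchors, and adding /m does nothing to how the paragraph was read.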
The /m modifier was added in Perl5 -- Perl4 used the now-obsolete
special variable $* to indicate multi-line mode for all matches.
When $* is true, caret and dollar behave as if /m were
specified. This is less powerful than explicitly indicating that you want
multi-line mode on a per-use basis, so modern programs should not use
$*. However, modern programmers should worry if some old or
unruly library does, so the complement of /m is the /s modifier.
The /s modifier forces caret and dollar to not consider newlines as
special, even if $* somehow gets turned on. It also affects dot: with /s, dot matches any character. The rationale is that if you go to
the trouble to use the /s modifier to indicate that you are not
interested in logical lines, a newline should not get special treatment
with dot, either.
Using the /m and /s modifiers together creates what I
call a ``clean'' multi-line mode. It's the same as normal multi-line
mode except dot can match any character (the influence added by /s).
I feel that removing the special case of dot not matching newline
makes for cleaner, more simple behavior, hence the name.
Perl provides \A and \Z to
match the beginning and end of the string. They are never concerned with
embedded newlines. They are exactly the same as the default and /s
versions of caret and dollar, but they can be used even with the /m
modifier, or when $* is true.
$ and \Z are always
allowed to match before a text-ending newline. In Perl4, a regex
cannot require the absolute end of the string. In Perl5, you can use ...(?!\n)$ as needed. On the other hand,
if you want to force a trailing newline, simply use ...\n$ in any
version of Perl.
Perl5 scripts should generally not use $*, but sometimes the same code
needs to support Perl4 as well. Perl5 issues a warning if it sees $*
when warnings are turned on (as they generally should be). The warnings can
be quite annoying, but rather than turning them off for
the entire script, I recommend:
{ local($^W) = 0; eval '$* = 1' }
This turns off warnings while $* is modified, yet when done leaves
warnings on if they were on -- I explained this technique in detail in
``Dynamic Scope'' (=>213).
/m vs. (?m), /s vs. /m
Certain modifiers can also be specified using the (?mod) notation
within the regex itself, such as using (?m) as a substitute for /m. There are no fancy rules regarding how the /m modifier might
conflict with (?m), or where (?m) can appear in the regex. Simply
using either /m or (?m) (anywhere at all in the regex) enables
multi-line mode for the entire match.
Although you may be tempted to want something like (?m)...(?s)...(?m)... to change the mode mid-stream, the line mode is
an all-or-nothing characteristic for the entire match. It makes no
difference how the mode is specified.
Combining both /s and /m has /m taking precedence with
respect to caret and dollar. Still, the use or non-use of /m has no
bearing on whether a dot matches a newline or not -- only the explicit
use of /s changes dot's default behavior. Thus, combining both modes
creates the clean multi-line mode.
All these modes and permutations might seem confusing, but Table 7-6 should keep things straight. Basically, they can
be summarized with ``/m means multi-line, /s means dot
matches newline.''
| Mode | Specified With | ^ | $ | \A, \Z | Dot Matches Newline |
|---|---|---|---|---|---|
| default | neither /s nor /m; $* false | string | string | string | no |
| single-line | /s ($* irrelevant) | string | string | string | yes |
| multi-line | /m ($* irrelevant) | line | line | string | no (default) |
| clean multi-line | both /m and /s ($* irrelevant) | line | line | string | yes |
| obsolete multi-line | neither /s nor /m; $* true | line | line | string | no (default) |

string -- cannot anchor to an embedded newline.
line -- can anchor to an embedded newline.
All other constructs are unaffected. \n always matches a newline.
A character class can be used to match or exclude the newline characters at
any time. An inverted character class such as [^x] always matches a
newline (unless
\n is included, of course). Keep this in mind if you want to change
something like .* to the seemingly more restrictive [^...]*.
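The following sketch exercises each mode in turn on a small sample string of my own invention. It assumes a reasonably recent Perl 5, so the obsolete $* row is left out, and it records each result as 1 or 0.

```perl
use strict;
use warnings;

my $text = "alpha one\nbeta two\ngamma three\n";

# Count how many places ^ can anchor a word.
my $starts_default = () = $text =~ /^\w+/g;    # ^ matches only at the start of the string
my $starts_m       = () = $text =~ /^\w+/mg;   # ^ also matches after each embedded newline

# Dot refuses the newline unless /s is given; /m makes no difference to dot.
my $dot_default = ($text =~ /one.beta/)  ? 1 : 0;
my $dot_m       = ($text =~ /one.beta/m) ? 1 : 0;
my $dot_s       = ($text =~ /one.beta/s) ? 1 : 0;

# "Clean" multi-line mode: line anchors from /m plus a newline-matching dot from /s.
my $clean  = ($text =~ /^beta.*three$/ms) ? 1 : 0;
my $only_m = ($text =~ /^beta.*three$/m)  ? 1 : 0;   # dot stops at the newline
my $only_s = ($text =~ /^beta.*three$/s)  ? 1 : 0;   # ^ cannot anchor mid-string

# A negated class matches a newline regardless of any mode.
my $negated = ($text =~ /one[^x]beta/) ? 1 : 0;

print "starts: $starts_default vs $starts_m; clean multi-line: $clean\n";
```

Only the combination of /m and /s lets a single expression both anchor to an interior line and run across line boundaries.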
Perl5 provides the \G anchor, which is related to \A, but is geared
for use with /g. It matches at the point where the previous match left
off. For the first attempt of a /g match, or when /g is not used,
it is the same as \A.
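As an example using \G, suppose the data is a series of five-digit ZIP codes run together, and that you need to retrieve all those that begin with 44. Here is
a sample line of data (the target codes are 44182 and 44272):
03824531449411615213441829503544272752010217443235
As a starting point, we can use @zips = m/\d\d\d\d\d/g; to create a list with one ZIP code per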
element (assuming, of course, that the data is in the default search
variable $_). The regular expression matches one ZIP code
each time /g iteratively applies it. A point whose importance will
soon become apparent: the regex never fails until the entire list has been
parsed -- there are absolutely no bump-and-retries by the
transmission. (I'm assuming we'll have only proper data, an assumption
that is sometimes valid in the real world -- but usually not.)
So, it should be apparent that changing \d\d\d\d\d to 44\d\d\d in
an attempt to find only ZIP codes starting with 44 is
silly -- once a match attempt fails, the transmission bumps along one
character, thus putting the match for the 44 out of synch with
the start of each ZIP code. Using 44\d\d\d incorrectly finds
...5314494116... as the first match.
You could, of course, put a caret or \A at the head of the regex, but
they allow a target ZIP code to match only if it's the first in the
string. We need to keep the regex engine in synch manually by writing our
regex to pass over undesired ZIP codes as needed. The key here is that it
must pass over full ZIP codes, not single characters as with the
automatic bump-along.
Here are three ways to pass over undesired ZIP codes. Adding any one of them to the head of the regex achieves the desired effect:
- (?:[^4]\d\d\d\d|\d[^4]\d\d\d)*... actively skips ZIP codes that start with something other than 44. (Well, it's probably better to use [0-35-9] instead of [^4], but as I said earlier, I am assuming properly formatted data.) By the way, we can't use (?:[^4][^4]\d\d\d)*, as it does not pass over undesired ZIP codes like 43210.
- (?:(?!44)\d\d\d\d\d)*... actively skips ZIP codes that do not start with 44. This English description sounds virtually identical to the one above, but when rendered into a regular expression looks quite different. Compare the two descriptions and related expressions. In this case, a desired ZIP code (beginning with 44) causes (?!44) to fail, thus causing the skipping to stop.
- (?:\d\d\d\d\d)*?... skips ZIP codes only when needed. Because the star is non-greedy, (?:\d\d\d\d\d) is not even attempted until whatever follows has failed (and is repeatedly attempted until whatever follows finally does match, thus effectively skipping only what is absolutely needed).
Combining this last method with (44\d\d\d) gives us
@zips = m/(?:\d\d\d\d\d)*?(44\d\d\d)/g;
This picks out the desired `44xxx' codes, actively skipping
undesired ones that intervene. (When used in a list context, m/.../g
returns a list of the text matched by subexpressions within capturing
parentheses from each match; =>253.)
This regex can work with /g because we know each match always leaves
the ``current match position'' at the start of the next ZIP code,
thereby priming the next match (via /g) to start at the beginning of a
ZIP code as the regex expects. You might remember that we used this same
keeping-in-synch technique with the CSV example (=>206).
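As a quick check of the skipping approaches, this sketch runs all three on a shortened copy of the sample data (the first seven ZIP codes, an assumption made so that the data ends on a target code). The naive 44\d\d\d is included for comparison.

```perl
use strict;
use warnings;

# Shortened sample: 03824 53144 94116 15213 44182 95035 44272, run together.
my $data = '03824531449411615213441829503544272';

# Each regex passes over whole undesired ZIP codes before capturing a 44xxx code.
my @brute     = $data =~ m/(?:[^4]\d\d\d\d|\d[^4]\d\d\d)*(44\d\d\d)/g;
my @lookahead = $data =~ m/(?:(?!44)\d\d\d\d\d)*(44\d\d\d)/g;
my @lazy      = $data =~ m/(?:\d\d\d\d\d)*?(44\d\d\d)/g;

# The naive version falls out of synch as soon as the bump-along kicks in.
my @naive = $data =~ m/44\d\d\d/g;

print "brute: @brute\nlookahead: @lookahead\nlazy: @lazy\nnaive: @naive\n";
```

All three skipping versions return 44182 and 44272, while the naive version starts with the bogus 44941 that straddles two real ZIP codes.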
Hopefully, these techniques are enlightening, but we still haven't seen the \G we're supposed to be looking at. Continuing the analysis of the
problem, we find a use for \G quickly.
Let's look at our sample data again:
03824531449411615213441829503544272|7|5|2|010217443235
After the match of 44272, no more
target codes are able to be matched, so the subsequent attempt fails. Does
the whole match attempt end? Of course not. The transmission bumps along to
apply the regex at the next character, putting us out of synch with the
real ZIP codes. After the fourth such bump-along, the regex skips 10217 as it matches the ``ZIP code'' 44323.
Our regex works smoothly so long as it's applied at the start of a ZIP code,
but the transmission's bump-along defeats it. This is where \G
comes in:
@zips = m/\G(?:\d\d\d\d\d)*?(44\d\d\d)/g;
\G matches the point where the previous /g match ended (or the
start of the string during the first attempt). Because we crafted the
regex to explicitly end on a ZIP code boundary, we're assured that any
subsequent match beginning with \G will start on that same ZIP code
boundary. If the subsequent match fails, we're done for good because a
bump-along match is impossible -- \G requires that we
start from where we had last left off. In other words, this use of \G
effectively disables the bump along.
In fact, the transmission is optimized to actually disable the bump-along
in common situations. If a match must start with \G, a bump-along can
never yield a match, so it is done away with altogether. The optimizer can
be tricked, so you need to be careful. For example, this optimization isn't
activated with something like \Gthis|\Gthat, even though it is
effectively the same as \G(?:this|that) (which it
does optimize).
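The difference \G makes can be seen by running both versions on the full sample line. This is only a sketch, assuming a reasonably recent Perl 5; the variable names are mine.

```perl
use strict;
use warnings;

# The sample line from the text: ten ZIP codes run together.
my $data = '03824531449411615213441829503544272752010217443235';

# Without \G, the bump-along after the last real target finds a bogus code.
my @without_G = $data =~ m/(?:\d\d\d\d\d)*?(44\d\d\d)/g;

# With \G, each attempt must begin where the previous match ended,
# so a failure on a ZIP code boundary ends the search for good.
my @with_G = $data =~ m/\G(?:\d\d\d\d\d)*?(44\d\d\d)/g;

print "without \\G: @without_G\n";   # 44182 44272 44323
print "with \\G:    @with_G\n";      # 44182 44272
```

The extra 44323 in the first list is the out-of-synch match described above, built from the tail of 02174 and the head of 43235.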
\G in perspective
\G is not used often, but when needed, it is indispensable. As
enlightening as I hope this example has been, I can actually see a way to
solve it without \G. In the interest of study, I'd like to mention
that after successfully matching a 44xxx ZIP code, we can
use either of the first two ``skip undesired ZIP codes''
subexpressions to bypass any trailing undesired ZIP codes as well (the
third bypasses only when forced, so would not be appropriate here):
@zips = m/(?:\d\d\d\d\d)*?(44\d\d\d)(?:(?!44)\d\d\d\d\d)*/g;
After the last desired ZIP code has been matched, the added subexpression
consumes the rest of the string, if any, and the m/.../g is finished.
These methods work, but frankly, it is often prudent to take some of the work out of the regular expression using other regular expressions or other language features. The following two examples are easier to understand and maintain:
@zips = grep {defined} m/(44\d\d\d)|\d\d\d\d\d/g;
@zips = grep {m/^44/} m/\d\d\d\d\d/g;
In Perl4, you can't do it all in one regex because it doesn't have many of the constructs we've used, so you need to use a different approach regardless.
Even if you don't use \G, the way Perl remembers the ``end of the
previous match'' is a concern.
In Perl4, it is
associated with a particular regex operator, but in Perl5 it is associated
with the matched data (the target string) itself. In fact, this position
can be accessed using the pos(...) function. This means that one
regex can actually pick up where a different one left off, in effect
allowing multiple regular expressions to do a tag-team match. As a simple
example, consider using
@nums = $data =~ m/\d+/g;
to put into @nums a list of all numbers in
the data. Now, let's suppose that if the special value <xx>
appears on the line, you want only numbers after it. An easy way to do it
is:
$data =~ m/<xx>/g; # prime the /g start. pos($data) now points to just after the <xx>.
@nums = $data =~ m/\d+/g;
The match of <xx> is in a scalar context, so /g doesn't
perform multiple matches (=>253). Rather, it sets
the pos of $data, the ``end of the last match'' position
where the next /g-governed match of the same data will start. I call
this technique priming the /g. Once done, m/\d+/g picks up
the match at the primed point. If <xx> can't match in the first
place, the subsequent m/\d+/g starts at the beginning of the string,
as usual.
Two important points allow this example to work. First, the first match is in
a scalar context. In a list context, the <xx> is applied
repeatedly until it fails, and failure of a /g-governed match resets
pos to the start of the string. Secondly, the first match must use /g. Matches that don't use /g never access pos.
And here is something interesting: Because you can assign to pos,
you can prime the /g start manually:
pos($data) = $i if $i = index($data,"<xx>"), $i > 0;
@nums = $data =~ m/\d+/g;
If index finds <xx> in the string, it sets the start
of the next /g-governed match of $data to begin there. This is
slightly different from the previous example -- here we prime it to
start at the <xx>, not after it as we did before. It turns
out, though, that it doesn't matter in this case.
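Both ways of priming the /g can be tried out with a short sketch. The sample string is made up, and the behavior shown is that of a reasonably recent Perl 5, where pos belongs to the target string.

```perl
use strict;
use warnings;

my $data = '12 abc 34 <xx> 56 def 78';

# A list-context /g match runs until it fails, which resets pos($data).
my @all = $data =~ m/\d+/g;

# Prime with a scalar-context /g match: pos($data) now sits just after <xx>.
my $found  = $data =~ m/<xx>/g;
my $primed = pos($data);
my @after  = $data =~ m/\d+/g;    # picks up from the primed position

# Prime manually by assigning to pos: here we start at the <xx>, not after it.
my $i = index($data, '<xx>');
pos($data) = $i if $i > 0;
my @manual = $data =~ m/\d+/g;

print "all: @all\nafter <xx>: @after\nmanual: @manual\n";
```

Because <xx> contains no digits, starting at it or just after it gives the same list, which is the point made above.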
This example is simple, but you can imagine how these techniques could be quite useful if used carefully in limited cases. You can also imagine them creating a maintenance nightmare if used carelessly.
Perl provides the word-boundary and non-word-boundary anchors \b and \B. (Note: in character classes, and in doublequoted strings for that
matter, \b is a shorthand for a backspace.) A word boundary is any
position where the character on one side
matches \w and the character on the other matches \W (with the
ends of the string being considered \W for the purposes of this
definition). Note that unlike most tools, Perl includes the underscore in
\w. It can also include additional characters if a locale is defined
in the user environment
(=>65, 242).
| 26 | The two styles are not mutually exclusive. GNU Emacs, for example, provides both. |
There is one particular danger with \b that I sometimes
run into.
As part of a Web search interface, given an $item to find,
I was using
m/\b\Q$item\E\b/ to do the search. I wrapped the item in \b...\b because I knew I wanted to find only whole-word matches of
whatever $item was. Given an item such as `3.75', the search
regex would become \b3\.75\b, finding things like `price is
3.75 plus tax' as expected.
| 27 | Most recently, actually, just before feeling the need to write this section! |
However, if the $item were `$3.75', the regex would become \b\$3\.75\b, which requires a word boundary before the dollar sign.
(The only way a word boundary can fall before a \W character like
a dollar sign is if a word ends there.) I'd prepended \b with the
thought that it would force the match to start where $item started
its own word. But when $item cannot start a word at all
(`$' isn't matched by \w, so it cannot possibly start one),
the \b is an impediment. The regex doesn't even match `... is $3.75 plus ...'.
We don't even want to
begin the match if the character before can match
\w, but this is lookbehind, something Perl doesn't have. One way to
address this issue is to add \b only when the $item starts or
ends with something matching \w:
$regex = "\Q$item\E";                          # make $item ``safe''
$regex = '\b' . $regex if $regex =~ m/^\w/;    # if can start word, ensure it does
$regex = $regex . '\b' if $regex =~ m/\w$/;    # if can end word, ensure it does
This ensures that \b doesn't go where it will cause problems, but it
still doesn't address the situations where it does (that is, where the $item begins or ends with a \W character). For example, an $item of -998 still matches in `800-998-9938'. If we don't mind matching more text than
we really want (something that's not an option when embedded within a
larger subexpression, and often not an option when applied using the /g modifier), we can use the simple, but effective,
(?:\W|^)\Q$item\E(?!\w).
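Here is a sketch that walks through the cases just described. The helper name bounded_regex is hypothetical, and quotemeta is used as the function form of \Q...\E.

```perl
use strict;
use warnings;

my $text = 'the price is $3.75 plus tax';
my $item = '$3.75';

# The naive wrapping fails: no word boundary can precede the dollar sign here.
my $naive_hit = ($text =~ m/\b\Q$item\E\b/) ? 1 : 0;

# Hypothetical helper: add \b only on a side where the item has a word character.
sub bounded_regex {
    my ($raw) = @_;
    my $regex = quotemeta($raw);
    $regex = '\b' . $regex if $regex =~ m/^\w/;
    $regex = $regex . '\b' if $regex =~ m/\w$/;
    return $regex;
}

my $price_regex = bounded_regex($item);            # \$3\.75\b
my $careful_hit = ($text =~ m/$price_regex/) ? 1 : 0;

# The remaining hole: an item starting with a non-word character is unchecked in front.
my $phone     = '800-998-9938';
my $dash_item = '-998';
my $dash_regex = bounded_regex($dash_item);        # \-998\b
my $dash_hit   = ($phone =~ m/$dash_regex/) ? 1 : 0;

# The simple alternative, which may consume one extra leading character.
my $strict_phone = ($phone =~ m/(?:\W|^)\Q$dash_item\E(?!\w)/) ? 1 : 0;
my $strict_price = ($text  =~ m/(?:\W|^)\Q$item\E(?!\w)/)      ? 1 : 0;

print "naive=$naive_hit careful=$careful_hit dash=$dash_hit ",
      "strict_phone=$strict_phone strict_price=$strict_price\n";
```

The last form rejects the phone-number fragment yet still finds the price, at the cost of including the preceding space in the matched text.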
Although Perl lacks separate start-of-word and end-of-word anchors, \b's
use in the regex almost always disambiguates it. I've never seen a situation
where a specific start-only or end-only anchor was actually needed, but if
you ever do bump into one, you can use \b(?=\w) and \b(?!\w) to
mimic them. For example,
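s/\b(?!\w).*\b(?=\w)//
removes everything between the end of the first word and the start of the last word. In a tool with separate \<
and \> word boundaries, the same command would be s/\>.*\<//.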
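A two-line trial of that substitution, on sample strings of my own choosing:

```perl
use strict;
use warnings;

# \b(?!\w) can match only at the end of a word; \b(?=\w) only at the start of one.
(my $trimmed = 'alpha beta gamma') =~ s/\b(?!\w).*\b(?=\w)//;

# With a single word there is no later start-of-word, so nothing is removed.
(my $single = 'alpha') =~ s/\b(?!\w).*\b(?=\w)//;

print "$trimmed\n$single\n";
```

The first string comes out as alphagamma and the second is left alone.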
| 28 | The phantom \v (vertical tab) has been omitted. The manpage and
other documentation listed it for years, but it was never actually added
to the language! I doubt it will be missed now that it has finally been
removed from the documentation. |
| Construct | Meaning |
|---|---|
| Byte Notations | |
| \num | character specified in octal |
| \xnum | character specified in hexadecimal |
| \cchar | control character |
| Shorthand for Common Classes | |
| \d | digit [0-9] |
| \s | whitespace, usually [·\f\n\r\t] |
| \w | word character, usually [a-zA-Z0-9_] |
| \D, \S, \W | complement of \d, \s, \w |
| Machine-dependent Control Characters | |
| \a | alarm (bell) |
| \f | formfeed |
| \e | escape |
| \n | newline |
| \r | carriage return |
| \t | tab |
| \b | backspace (only within a class) |
The \n and other shorthands probably seem familiar -- these
machine dependent (=>72)
notations for common control characters are also available in doublequoted
strings. I feel it is important to maintain the mental distinction that
these regular expression metacharacters themselves are not available within
strings, but that strings just happen to have their own metacharacters
which parallel the regex ones in this area
(=>41).
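If you compare m/(?:\r\n)+$/ with
$regex = "(\r\n)+"; m/$regex$/;
you'll find that they match the same text, but they are not processed the same way. Use something like "\b[+\055/*]\d+\b" in the same situation and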
you'll find a number of surprises. When you assign to a string, it's a
string -- your intention to use it as a regex is irrelevant to how
the string processes it. The two \b in this example are intended
to be word boundaries, but to a doublequoted string they're shorthands for
a backspace. The regex will never see \b, but instead raw backspaces
(which are unspecial to a regex and simply match backspaces). Surprise!
On a different front, both a regex and a doublequoted string convert \055 to a dash (055 is the octal ASCII code for a dash), but if
the regex does the conversion, it doesn't see the result as a
metacharacter. The string doesn't either, but the resulting [+-/*]
that the regex eventually receives has a dash that will be interpreted as
part of a class range. Surprise!
Finally, \d is not a known metasequence to a doublequoted string,
so it simply removes the backslash. The regex sees d. Surprise!
I want to emphasize all this because you must know what are and aren't metacharacters (and whose metacharacters they are, and when, and in what order they're processed) when building a string that you later intend to use as a regex. It can certainly be confusing at first. An extended example is presented in ``Matching an Email Address'' (=>294).
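The three surprises can be observed directly. In this sketch the test strings are invented, and the warning that Perl raises for the unknown string escape \d is silenced on purpose so the result can be inspected.

```perl
use strict;
use warnings;

# Doublequoted: the string processes the escapes before any regex sees them.
my $from_double;
{
    no warnings 'misc';    # silence "Unrecognized escape \d passed through"
    $from_double = "\b[+\055/*]\d+\b";
}

# Singlequoted: the backslash sequences survive for the regex engine to interpret.
my $from_single = '\b[+\055/*]\d+\b';

# What the doublequoted version really holds: backspaces, a +-/ range, and a plain d.
my $same_as = ($from_double eq "\x08[+-/*]d+\x08") ? 1 : 0;

my $single_star = ('3*42' =~ m/$from_single/) ? 1 : 0;   # operator then digits: matches
my $single_dot  = ('3.42' =~ m/$from_single/) ? 1 : 0;   # dot is not in the class
my $double_star = ('3*42' =~ m/$from_double/) ? 1 : 0;   # no backspaces, no d: fails
my $double_odd  = ("\x08.ddd\x08" =~ m/$from_double/) ? 1 : 0;   # dot falls in the +-/ range

print "same_as=$same_as single: $single_star/$single_dot double: $double_star/$double_odd\n";
```

The singlequoted text behaves as the regex was meant to, while the doublequoted text matches only an odd sequence of backspaces, a character between + and /, and literal d's.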
If compiled with appropriate libraries, though, Perl uses the ``is
this a letter?'' routines (isalpha, isupper,
and so on), as well as the mappings between uppercase and lowercase.
This affects /i and the items mentioned in Table 7-8 (=>245). It also allows \w, \W, \s, \S (but not \d), and word
boundaries to respond to the locale. (Perl's regex flavor, however,
explicitly does not support [:digit:] and the other POSIX bracket
expression character classes listed on
page 80.)
Two related modules are the POSIX module and
Jarkko Hietaniemi's I18N::Collate module. (I18n is the
common abbreviation for internationalization -- why is an exercise
for your free time.) Although they don't provide regular-expression support,
you might find them useful if locales are a concern. The POSIX
module is huge, but the documentation is relatively brief -- for
additional documentation, try the corresponding C library manpages, or
perhaps Donald Lewine's POSIX Programmer's Guide (published by
O'Reilly & Associates).
Two- or three-digit octal values may be given like \33 and \177, and one-
or two-digit hexadecimal values like \xA and \xFF.
Perl strings also allow one-digit octal escapes, but Perl regexes generally
don't because something like \1 is usually taken as a backreference.
In fact, multiple-digit backreferences are possible if there are enough
capturing parentheses. Thus, \12 is a backreference if the expression
has at least 12 sets of capturing parentheses, an octal escape (for decimal
10) otherwise. Upon reading a
draft of this chapter, Wayne Berke offered a suggestion that I
wholeheartedly agree with: never use a two-digit octal escape such as \12, but rather the full three-digit \012. Why? Perl will never
interpret \012 as a backreference, while \12 is in danger of
suddenly becoming a backreference if the number of capturing parentheses warrants.
There are two special cases: A backreference within a character class
makes no sense so single-digit octal escapes are just peachy within
character classes (which is why I wrote generally don't in the previous
paragraph). Secondly, \0 is an octal escape everywhere, since it makes
no sense as a backreference.
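The way \12 changes meaning with the number of capturing parentheses can be seen in a short sketch. The twelve-group expression is built mechanically just for the demonstration.

```perl
use strict;
use warnings;

# \1 is a backreference: it matches whatever the first group matched.
my $doubled = ('aa' =~ /(a)\1/) ? 1 : 0;

# With no capturing parentheses, \12 and \012 both mean octal 12, a linefeed.
my $nl_three_digit = ("\n" =~ /\012/) ? 1 : 0;
my $nl_two_digit   = ("\n" =~ /\12/)  ? 1 : 0;

# With twelve groups in front of it, the same \12 becomes a backreference.
my $twelve = ('(a)' x 12) . '\12';
my $as_backref  = (('a' x 13)          =~ /^$twelve\z/) ? 1 : 0;
my $as_linefeed = ((('a' x 12) . "\n") =~ /^$twelve\z/) ? 1 : 0;

print "backref=$as_backref linefeed=$as_linefeed\n";
```

Writing \012 would have kept the linefeed meaning in both situations, which is the reason for the three-digit advice.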
The exact value of \n and friends is not defined by Perl, but is
system-dependent (=>72).
Perl's character classes fully support backslash escapes. For example, [\-\],] is a single
class to match a dash, right bracket, and comma. (It might take a bit for
[\-\],] to sink in -- parse it carefully and it
should make sense.) Many other regex flavors do not support
backslashes within classes, which is too bad because being able to escape
class metacharacters is not only logical, but beneficial. Furthermore, it's
great to be allowed to escape items even when not strictly necessary, as it
can enhance readability.
| 29 | I generally use GNU Emacs when writing Perl code, and use
cperl-mode to provide automatic indenting and smart
colorization. I often escape quotes within a regex because
otherwise they confuse cperl-mode, which doesn't understand
that quotes within a regex don't start strings. |
As I've cautioned throughout this book, metacharacters recognized within a
character class are distinct from those recognized outside. Perl is no
exception to this rule, although many metacharacters have the same meaning
in both situations. In fact, the dual-personality of \b aside,
everything in Table 7-7 is also supported
within character classes, and I find this extremely convenient.
Many normal regex metacharacters, however, are either unspecial or utterly
different within character classes. Things like star, plus, parentheses,
dot, alternation, anchors and the like are all meaningless within a
character class. We've seen that \b, \3, and ^ have
special meanings within a class, but they are unrelated to their meaning
outside of a class. Both - and ] are unspecial outside of a
class, but special inside (usually).
For example, a class to match any printable, non-whitespace ASCII character is [!-~]. (An exclamation point is the first such
character in ASCII, a tilde the last.) It might be less cryptic to spell
it out exactly: [\x21-\x7e]. Someone who already knew what you were
attempting would understand either method, but someone happening upon [!-~] for the first time would be confused, to say the least.
Using [\x21-\x7e] at least offers a clue that it is a character-encoding range.
Lacking real POSIX locale support, octal and hexadecimal escapes are
quite useful for working with non-ASCII text. For instance, when working
with the Latin-1 (ISO-8859-1) encoding popular on the Web, you
need to consider that u might also appear as ù, ú, û, or ü (using the character
encodings \xf9 through \xfc). Thus, to match any of these
u's, you can use [u\xf9-\xfc]. The uppercase versions are encoded
from \xd9 through \xdc,
so a case-insensitive match is [uU\xf9-\xfc\xd9-\xdc].
(Using the /i modifier applies only to the ASCII u, so it
is just as easy to include U directly and save ourselves the woe of
the /i penalty; =>278.)
Sorting is a practical use of this technique. Normally, to sort @items, you simply use sort @items, but this sorts based on
raw byte values and puts u (with ASCII value \x75) far away
from û and the like. If we make a copy of each item to use as
a sort key (conveniently associating them with an associative array), we can
modify the key so that it will work directly with sort and yield
the results we want. We can then map back to the unmodified key while
keeping the sorted order.
Here's a simple implementation:
foreach $item (@Items) {
$key = lc $item; # Copy $item, forcing ASCII to lowercase.
$key =~ s/[\xd9-\xdc\xf9-\xfc]/u/g; # All types of accented u become plain u
# ... same treatment for other accented letters ...
$pair{$item} = $key; # Remember the item->sortkey relation
}
# Sort the items based upon their key.
@SortedItems = sort { $pair{$a} cmp $pair{$b} } @Items;
(lc is a convenient Perl5 feature, but the example is
easily rewritten in Perl4.)
In reality, this is only the beginning of a solution since each language
has its own particular sorting requirements, but it's a step in the right
direction.
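To see the effect, here is a self-contained run of the same idea on a made-up list. It assumes the items are Latin-1 byte strings and that no locale or Unicode string features are switched on, so lc touches only ASCII letters.

```perl
use strict;
use warnings;

# Hypothetical items; \xfc is u-umlaut in Latin-1.
my @Items = ("zebra", "\xfcber", "under", "apple");

my %pair;
foreach my $item (@Items) {
    my $key = lc $item;                     # copy, forcing ASCII to lowercase
    $key =~ s/[\xd9-\xdc\xf9-\xfc]/u/g;     # all types of accented u become plain u
    $pair{$item} = $key;                    # remember the item->sortkey relation
}

# Sort by key, but return the unmodified items.
my @SortedItems = sort { $pair{$a} cmp $pair{$b} } @Items;

# For comparison, a raw byte-value sort pushes the accented item to the end.
my @RawSorted = sort @Items;

print "key for accented item: $pair{\"\xfcber\"}\n";
```

With the keys in place the accented item sorts between apple and under, whereas the raw sort leaves it after zebra.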
Another step is to use the I18N::Collate module mentioned earlier.
\Q and Friends: True Lies
Perl documentation commonly lists \L, \E, \u, and the other items listed here in Table 7-8 alongside the regular-expression metacharacters. Yet, it might surprise you that they are not
really regular-expression metacharacters. The regex engine understands
that `*' means ``any number'' and that `[' begins a character
class, but it knows nothing about `\E'. So why have I included them here?
| In-string Construct | Meaning | Built-in Function |
|---|---|---|
| \L, \U | lower, raise case of text until \E | lc(...), uc(...) |
| \l, \u | lower, raise case of next character | lcfirst(...), ucfirst(...) |
| \Q | add an escape before all non-alphanumerics until \E | quotemeta(...) |
| Special Combinations | | |
| \u\L | Raise case of first character; lower rest until \E or end of text | |
| \l\U | Lower case of first character; raise rest until \E or end of text | |
For most practical purposes, they appear to be normal regex metacharacters. When used in a regex operand, the doublequotish processing of Figure 7-1's Phase B handles them, so they normally never reach the regex engine (=>222). But because of cases where it does matter, I call these Table 7-8 items second-class metacharacters.
This is because they are recognized only during Phase B of Figure 7-1, not later after the interpolation has taken place. If you don't understand this, you would be confused when the following didn't work:
$ForceCase = $WantUpper ? '\U' : '\L';
if (m/$ForceCase$RestOfRegex/) {
:
Because the \U or \L of $ForceCase are in the
interpolated text (which is not processed further until Phase C), nothing
recognizes them as special. Well, the regex engine recognizes the
backslash, as always, so \U is treated as the general case of an
unknown escape: the escape is simply ignored. If $RestOfRegex
contains Path and $WantUpper is true, the search would be for
the literal text UPath, not PATH as had been desired.
Another effect of the ``one-level'' rule is that something like m/([a-z])...\U\1 doesn't work. Ostensibly, the goal is to match a
lowercase letter, eventually followed by an uppercase version of the same
letter. But \U works only on the text in the regular expression
itself, and \1 represents text matched by (some other part of the)
expression, and that's not known until the match attempt is carried out. (I
suppose a match attempt can be considered a ``Phase E'' of Figure 7-1.)
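One way around the one-level rule is to use the built-in functions from Table 7-8 before the text reaches the regex operand. The following is only a sketch of that idea, reusing the variable names from the example above; the target string is invented.

```perl
use strict;
use warnings;

my $RestOfRegex = 'Path';
my $WantUpper   = 1;

# In a singlequoted string, \U is just two ordinary characters.
my $ForceCase   = $WantUpper ? '\U' : '\L';
my $literal_len = length $ForceCase;

# Written directly in a doublequoted context, \U is acted upon at once.
my $inline = "\U$RestOfRegex";

# The function forms can be chosen at run time, before interpolation.
# Caution: like \U itself, uc() would also turn an escape such as \d into \D.
my $forced = $WantUpper ? uc($RestOfRegex) : lc($RestOfRegex);
my $hit    = ('/USR/LOCAL/PATH' =~ m/$forced/) ? 1 : 0;

print "length=$literal_len inline=$inline forced=$forced hit=$hit\n";
```

Here the case conversion has already happened by the time the regex operand is interpolated, so nothing depends on the engine understanding \U.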
In looking at everything m/.../ offers, I
take a divide-and-conquer approach.
The Perl regular-expression match is an operator that takes two operands (a target string operand and a regex operand) and returns a value, although exactly what kind of value depends on context. Also, there are optional modifiers which change how the match is done. (Actually, I suppose these can be considered operands as well.)
For example, to apply the expression ^/(?:[^/]+/)+Perl$ using the
match operator with standard delimiters,
m/^\/(?:[^\/]+\/)+Perl$/ is required. As
presented in Phase A of Figure 7-1, a closing
delimiter that appears within the regular expression must be escaped to
hide the character's delimiter status. (These are the \/ sequences in the
example.) Characters escaped for this reason are passed through to the
regex as if no escape were present. Rather than suffer from backslashitis, it's more readable to use a
different delimiter -- two examples of the same regex are
m!^/(?:[^/]+/)+Perl$! and
m,^/(?:[^/]+/)+Perl$,. Other common
delimiters are m|...|, m#...#, and m%...%.
There are several special-case delimiters:
m(...), m{...}, m[...],
and m<...> have different opening and closing delimiters,
and may be nested. Because parentheses and square brackets are so prevalent
in regular expressions, m(...) and m[...] are probably
not so appealing, but the other two can be.
In particular, with the
/x modifier, something such as the following becomes possible:
m{
regex # comments
here # here
}x;
When these special-case delimiters are nested, matching pairs can
appear unescaped. For example, m(^/((?:[^/]+/)+)Perl$) is
valid, although visually confusing.
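The delimiter choices above all describe the same regex, as this sketch shows. The sample paths are assumptions for illustration.

```perl
use strict;
use warnings;

my $path = '/usr/local/lib/Perl';

# Standard delimiters: every slash in the regex needs an escape.
my $slashes = ($path =~ m/^\/(?:[^\/]+\/)+Perl$/) ? 1 : 0;

# A different delimiter avoids the backslashitis.
my $bang = ($path =~ m!^/(?:[^/]+/)+Perl$!) ? 1 : 0;

# Paired delimiters with /x leave room for whitespace and comments.
my $braces = ($path =~ m{
                  ^/               # leading slash
                  (?: [^/]+ / )+   # one or more directory components
                  Perl $           # the final component
              }x) ? 1 : 0;

# All three are the same regex, so all reject a path with no directory part.
my $bare = ('/Perl' =~ m!^/(?:[^/]+/)+Perl$!) ? 1 : 0;

print "slashes=$slashes bang=$bang braces=$braces bare=$bare\n";
```

The comments inside the m{...}x form must not contain an unbalanced closing brace, since that would end the operand early.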
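With a singlequote as the delimiter (m'...'), variable interpolation is turned off. This could be handy if an expression has many $ and @
whose escaping becomes too untidy.
With question marks as the delimiter (m?...?), a special match-only-once operation is done. With this ?-delimited
match, it returns success only with the first successful match.
Subsequent uses fail, even if they could otherwise match, until
reset in the same package is called.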
With a list-context /g match, ``returning only once''
still means ``all the matches'' as normal.
This can be useful if you want a particular match to be successful only once during a loop. For example, when processing an email header, you could use:
$subject = $1 if m?^Subject: (.*)?;
Once you've matched the subject once, there's no reason to bother
checking for it on every line that follows. Still, realize that if,
for whatever reason, the header has more than one subject line,
m?...? matches only the first, while m/.../ matches
them all -- once the header has been processed, m?...?
will have left $subject with the data from the first subject
line, while m/.../ will have left it with data from the last.
The substitution operator s/.../.../ has other special delimiters
that we'll talk about later (=>255); the
above are the special cases for the match operator.
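A small sketch of both special delimiters follows. The header lines are made up, and the match-once form is written with its explicit m.

```perl
use strict;
use warnings;

# Hypothetical header with two subject lines.
my @header = ('From: alice', 'Subject: first', 'Subject: second');

my ($once, $every);
foreach my $line (@header) {
    $once  = $1 if $line =~ m?^Subject: (.*)?;   # succeeds only the first time
    $every = $1 if $line =~ m/^Subject: (.*)/;   # succeeds on each subject line
}

# Singlequote delimiters: no interpolation, so @home is plain text.
my $literal_at = ('user@home' =~ m'r@home') ? 1 : 0;

print "once=$once every=$every literal_at=$literal_at\n";
```

After the loop, the match-once version holds the first subject and the ordinary version holds the last.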
As yet another special case, if the delimiter is either the commonly used
slash, or the special match-only-once question mark, the m itself
becomes optional. It's common to use /.../ for matches.
Finally, the generic pattern
target =~ expression (with no delimiters and no
m) is supported. The expression is evaluated as a generic Perl
expression, taken as a string, and finally fed to the regex engine.
This allows you to use something like $text =~ &GetRegex()
instead of the longer:
my $temp_regex = &GetRegex();
...
$text =~ m/$temp_regex/ ...
Similarly, using $text =~ "...string..." could be useful
if you wanted real doublequote processing, rather than the
doublequoteish processing discussed earlier. But frankly, I would leave
this to an Obfuscated Perl Contest.
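For completeness, a minimal sketch of the delimiter-less form. GetRegex here is a stand-in function that simply returns regex text.

```perl
use strict;
use warnings;

# Hypothetical function returning the text of a regex.
sub GetRegex { return '^\d+$' }

# The right-hand side is evaluated, taken as a string, and used as the regex.
my $all_digits = ('42'  =~ GetRegex()) ? 1 : 0;
my $not_digits = ('4x2' =~ GetRegex()) ? 1 : 0;

print "all_digits=$all_digits not_digits=$not_digits\n";
```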
If no regex is given, such as with m// (or with m/$regex/
where the variable $regex is empty or undefined), Perl reuses the
regular expression most recently used successfully within the enclosing
dynamic scope. In this case, any match modifiers (discussed in the next section) are
completely ignored. This includes even the /g and /i modifiers.
The modifiers used with the default expression remain in effect.
The default regex is never recompiled (even if the original had been built
via variable interpolation without the /o modifier).
This can be used to your advantage for creating efficient tests.
There's an example in ``The /o Modifier'' (=>270).
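The match operator supports a number of modifiers, which influence how the regex operand is interpreted (/o =>268; /x =>223), how the engine regards the target text (/i =>42; /m, /s =>233), and how the regex is applied (/g =>253).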
You can group several modifier letters together and place them in any order
after the closing delimiter,
whatever it might be. For example, m/<code>/i
applies the regular expression <code> with the /i
modifier, resulting in a case-insensitive match. Do keep in mind that the
slash is not part of the modifier -- you could write this example as
m|<code>|i or perhaps
m{<code>}i or even
m<<code>>i.
| 32 | Because match-operator modifiers can appear in any order, a large
portion of a programmer's time is spent adjusting the order to achieve
maximal cuteness. For example, learn/by/osmosis is valid code
(assuming you have a function called learn). The
osmosis are the
modifiers -- repetition of match-operator modifiers (but not
the substitution-operator's /e) is allowed, but meaningless. |
As discussed earlier (=>231), the modifiers /x, /i, /m, and /s can also appear within a regular
expression itself using the (?...) construct. Allowing these options
to be indicated in the regular expression directly is extremely convenient
when one operator is applying different expressions at different times
(usually due to variable interpolation). When you use /i,
for example, every application of an expression via the match operator
in question is case-insensitive. By allowing each regular expression to
choose its own options, you get more general-purpose code.
A great example is a search engine on a Web page that offers a full Perl
regular-expression lookup. Most search engines offer very simple search
specifications that leave power users frustrated, so offering a full regex
option is appealing. In such a case, the user could use (?i) and the
like, as needed, without the CGI having to provide special options to
activate these modifiers.
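As a sketch of that idea, the search routine below applies whatever expression it is handed and has no case-sensitivity option of its own. The routine name and the sample lines are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical search routine: no options, the regex text says it all.
sub search_lines {
    my ($user_regex, @lines) = @_;
    return grep { m/$user_regex/ } @lines;
}

my @lines = ('Learning Perl', 'perl one-liners', 'Python notes');

my @case_sensitive   = search_lines('perl', @lines);
my @case_insensitive = search_lines('(?i)perl', @lines);

print scalar(@case_sensitive), " vs ", scalar(@case_insensitive), " matching lines\n";
```

The caller gets a case-insensitive search simply by starting the expression with (?i).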
m/.../g with a regex that can match nothingness
Subsequent match attempts via the /g modifier normally start where
the previous match ended, but what if there is a way for the regex to
match the null string? As a simple example, consider the admittedly silly m/^/g. It matches at the start of the string, but doesn't actually
consume any characters, so the first match ends at the beginning of the
string. If the next attempt starts there as well, it will match there as
well. Repeat forever and you start to see a problem.Perl version 5.000 is broken in this respect, and indeed repeats until you run out of electrons. Perl4 and later versions of Perl5 work, although differently. They both begin the match where the previous one left off unless the previous match was of no text, in which case a special bump-along happens and the match is re-applied one character further down. Thus, each match after the first is guaranteed to progress down the string at least one character, and the infinite loop is avoided.
Except when used with the substitution operator, Perl5 takes the additional step of disallowing any match that ends at the same position as the previous match -- in such a case the automatic one-character heave-ho is done (if not already at the end of the string). This difference from Perl4 can be important -- Table 7-9 shows a few simple examples. (This table is not light reading by any means -- it might take a while to absorb.) Things are quite different with the substitution operator, but that's saved for ``The Substitution Operator'' (=>255).
Table 7-9
regex:   \d*    count   \d*      count   x|\d*               count   \d*|x                count
Perl4    123|   2       123|x|   3       |a123|wx|y|z456|    8       |a123|w|x|y|z456|    8
Perl5    123    1       123x|    2       |a123wxy|z456       5       |a123w|x|y|z456      6
(Each match shown via either an underline, or as | for a zero-width match)
The target string is normally specified with =~, as with $line =~ m/.../. Remember
that =~ is not an assignment operator, nor is it a comparison
operator. It is merely an odd way of providing the match operator with one
of its operands. (The notation was adapted from awk.)
Since the whole
``expr =~ m/.../'' is an
expression itself, you can use it wherever an expression is allowed. Some
examples (each separated by a wavy line):
$text =~ m/.../;                  # just do it, presumably, for the side effects
. . . . . . . . . . . .
if ($text =~ m/.../) {
    ## do code if match successful
    :
. . . . . . . . . . . .
$result = ( $text =~ m/.../ );    # set $result to result of match against $text
$result =   $text =~ m/.../  ;    # same thing; =~ has higher precedence than =
. . . . . . . . . . . .
$result = $text;                  # copy $text to $result...
$result =~ m/.../ ;               # ...and perform match on $result
( $result = $text ) =~ m/.../ ;   # Same thing in one expression
If the target operand is the variable $_, you can omit
the ``$_ =~'' altogether.
In other words, the default target operand is $_.
Something like $line =~ m/regex/ means
``Apply regex to the text in $line, ignoring the
return value but doing the side effects.'' If you forget the
`~', the resulting $line = m/regex/ becomes
``Apply regex to the text in $_, returning a true or
false value that is then assigned to $line.'' In other words,
the following are the same:
$line = m/regex/
$line = ($_ =~ m/regex/)
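A short sketch of this pitfall, using made-up sample strings, shows what the missing `~' costs. The variable names here are illustrative only.

```perl
use strict;
use warnings;

$_       = "apple pie";
my $line = "cherry tart";

# Intended: test $line against the regex.
my $found = ($line =~ m/cherry/) ? "yes" : "no";

# Typo: '=' instead of '=~'. The match is applied to $_ ("apple pie"),
# fails, and the false value (an empty string) overwrites $line.
$line = m/cherry/;

print "found: $found; \$line is now '$line'\n";
```

The original text of $line is gone after the second statement, which is why this slip can be hard to track down.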
You can also use !~ instead of =~ to logically negate the
return value. (Return values and side effects are discussed soon.)
$var !~ m/.../ is effectively the same as
not ($var =~ m/.../). All the normal side effects, such
as
the setting of $1 and the like, still happen. It is merely a
convenience in an ``If this doesn't match'' situation. Although you can
use !~ in a list context, it doesn't make much sense.
Most side effects of a successful match were discussed earlier ($&, $1, $+,
and so on; =>217), so here I'll look at
the remaining side effects a match attempt can have.
Two involve ``invisible status.'' First, if the
match is specified using m?...?, a successful match dooms future
matches to fail, at least until the next reset
(=>247). Of course, if you use m?...? explicitly, this particular side effect is probably the main
effect desired. Second, the regex in question becomes the default regex
until the dynamic scope ends, or another regex matches
(=>248).
Finally, for matches with the /g modifier, the pos of the
target string is updated to reflect the index into the string of the
match's end. (A failed attempt always resets pos.) The next /g-governed match attempt of the same string starts at that position,
unless:
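the string is modified. (Modifying the string resets pos.)
pos is assigned to. (The match will start at that position.)
the previous match did not match at least one character.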
As discussed earlier (=>249), in order to
avoid an infinite loop, a successful match that doesn't actually match
characters causes the next match to begin one character further into the
string.
During the begun-further attempt, pos properly reflects the end of
the match because it's at the start of the subsequent attempt that the
anti-loop movement is done. During such a next match, \G still refers
to the true end of the previous match, so cannot be successful. (This is
the only situation where \G doesn't mean ``the start of the
attempt.'')
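The bookkeeping described above can be watched directly. This is a minimal sketch with a made-up target string; it shows pos after a successful /g match, the effect of assigning to pos, and the reset after a failed attempt.

```perl
use strict;
use warnings;

my $text = "aa bb cc";

$text =~ m/(\w+)/g or die "no first match";   # first match: "aa"
my $after_first = pos($text);                 # index just past "aa"

pos($text) = 5;                               # next /g attempt starts at index 5
$text =~ m/(\w+)/g or die "no later match";
my $word = $1;                                # the next word found from there

1 while $text =~ m/\w+/g;                     # keep matching until an attempt fails
my $after_fail = defined pos($text) ? pos($text) : 'undef';

print "pos after first match: $after_first\n";
print "word found from index 5: $word\n";
print "pos after a failed match: $after_fail\n";
```

Note that the three match operators share one pos, because pos belongs to the target string rather than to any particular operator.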
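The value returned depends on the context of the match (scalar or list) and on whether it is used with the /g modifier.
Scalar context without the /g modifier is the ``normal
situation.'' If a match is found, a Boolean true is returned: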
if ($target =~ m/.../) {
# processing if match found
:
} else {
# processing if no match found
:
}
On failure, it returns an empty string (which is considered a Boolean false).
List context without /g is the normal way to pluck information from
a string. The return value is a list with an element for
each set of capturing parentheses in the regex. A simple example is
processing a date of the form 69/8/31, using:
($year, $month, $day) = $date =~ m{^ (\d+) / (\d+) / (\d+) $}x;
The three matched numbers are then available in the three variables
(and $1 and such as well). There is one element in the return-value list for each set of capturing
parentheses, or an empty list upon failure. Of course, it is possible for a
set or sets to have not been part of a match, as is certainly guaranteed
with one in m/(this)|(that)/. List elements for such sets
exist, but are undefined. If there are no sets of capturing parentheses to begin with, a successful
list-context non-/g match returns the list (1).
Expanding a bit on the date example, using a match expression as the
conditional of an if (...) can be useful. Because of the assignment
to ($year, ...), the match operator finds itself in a list context
and returns the values for the variables. But since that whole assignment
expression is used in the scalar context of the if's conditional, it
is then contorted into the count of items in the list. Conveniently, this
is interpreted as a Boolean false if there were no matches, true if there were.
if ( ($year, $month, $day) = $date =~ m{^ (\d+) / (\d+) / (\d+) $}x ) {
# Process for when we have a match: $year and such have new values
} else {
# Process for when no match: $year and such have been newly cleared to undefined
}
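Here is a runnable sketch of that idiom. The wrapper `describe` is a hypothetical name introduced only so the success and failure paths can be exercised side by side.

```perl
use strict;
use warnings;

# Hypothetical wrapper: report the parts of a date, or note that nothing matched.
sub describe {
    my ($date) = @_;
    # The match is in list context (it feeds a list assignment), while the
    # assignment as a whole is in the scalar context of the if's conditional.
    if ( my ($year, $month, $day) = $date =~ m{^ (\d+) / (\d+) / (\d+) $}x ) {
        return "year=$year month=$month day=$day";
    }
    return "no match";
}

print describe("69/8/31"), "\n";     # year=69 month=8 day=31
print describe("yesterday"), "\n";   # no match
```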
List context with /g returns captured text not for just one match, as with the non-/g list-context, but for all matches in the
string. For example, consider having the entire text of a Unix mailbox
alias file in a single string, where logical lines look like:
alias jeff jfriedl@ora.com
alias perlbug perl5-porters@perl.org
alias prez president@whitehouse
You can use something like m/^alias\s+(\S+)\s+(.+)/ to pluck the
alias and full address from a single logical line. It returns a list of two
elements, such as ('jeff', 'jfriedl@ora.com') for the
first line. Now consider working with all the lines in one string. You can
do all the matches all at once by using /g (and /m,
to allow caret to match at the beginning of each
logical line), returning a list such as:
( 'jeff', 'jfriedl@ora.com', 'perlbug',
'perl5-porters@perl.org', 'prez', 'president@whitehouse' )
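Because this list is a series of key/value pairs, it can be assigned directly to an associative array. After
%alias = $text =~ m/^alias\s+(\S+)\s+(.+)/mg;
you can access the full address of `jeff' with $alias{'jeff'}.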
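Put together as a complete script, the alias example might look like the following sketch. The sample text is the three lines shown earlier, held in a single-quoted here-document so that the @ signs are not interpolated.

```perl
use strict;
use warnings;

my $text = <<'END';
alias jeff jfriedl@ora.com
alias perlbug perl5-porters@perl.org
alias prez president@whitehouse
END

# List-context m/.../g returns (name, address, name, address, ...),
# which is the shape a hash assignment expects.
my %alias = $text =~ m/^alias\s+(\S+)\s+(.+)/mg;

foreach my $name (sort keys %alias) {
    print "$name => $alias{$name}\n";
}
```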
A scalar-context m/.../g is a special construct quite different from
the other three situations. Like a normal m/.../, it does only one
match, but like a list-context m/.../g, it pays attention to where
previous matches occurred. Each time a scalar-context m/.../g is
reached, such as in a loop, it finds the ``next'' match. Once it fails,
the next check starts again from the beginning of the string.
This is quite convenient as the conditional of a while loop. Consider:
while ($ConfigData =~ m/^(\w+)=(.*)/mg) {
my($key, $value) = ($1, $2);
:
}
All matches are eventually found, but the body of the while loop
is executed between the matches (well, after each match). Once an
attempt fails, the result is false and the while loop finishes.
Also, upon failure, the /g state (given by pos) is reset.
Finally, be careful not to modify the target data within the loop unless
you really know what you're doing: it resets pos.
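A self-contained version of this loop, with made-up configuration data, might look like the sketch below. It also checks that pos has been reset once the loop has ended on a failed attempt.

```perl
use strict;
use warnings;

my $ConfigData = "host=example.org\nport=8080\nmode=test\n";   # sample data

my %config;
while ($ConfigData =~ m/^(\w+)=(.*)/mg) {
    my ($key, $value) = ($1, $2);
    $config{$key} = $value;          # the body runs once after each match
}

# The loop ended on a failed match, so the /g position has been reset.
my $pos_state = defined pos($ConfigData) ? "set" : "reset";

print join(", ", map { "$_=$config{$_}" } sort keys %config), "\n";
print "pos is $pos_state\n";
```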
pos(...) -- Set explicitly, or implicitly by the /g
modifier, it indicates where in the string the next /g-governed
match should begin. Also, see \G in ``Multi-Match Anchor''
(=>236).
$* -- A holdover from Perl4, can influence the caret and
dollar anchors. See ``String Anchors'' (=>232).
study -- Has no effect on what is matched or returned, but
if the target string has been study'd, the match might be
faster (or slower). See ``The Study Function'' (=>287).
m?...? and reset -- affects the invisible
``has/hasn't matched'' status of m?...? operators
(=>247).
When m/.../g is used with the while, if, and foreach control
constructs, you really need to keep your wits about you.
What do you expect the following to print?
while ("Larry Curly Moe" =~ m/\w+/g) {
print "WHILE stooge is $&.\n";
}
print "\n";
if ("Larry Curly Moe" =~ m/\w+/g) {
print "IF stooge is $&.\n";
}
print "\n";
foreach ("Larry Curly Moe" =~ m/\w+/g) {
print "FOREACH stooge is $&.\n";
}
It's a bit tricky. ¤ Turn the page to check your answer.
¤ Answer to the question on page 254.
The results differ depending on your version of Perl:
The substitution operator, s/regex/replacement/,
extends the idea of matching text to match-and-replace. The regex operand
is the same as with the match operator, but the replacement operand
used to replace matched text adds a new, useful twist. Most concerns of the
substitution operator are shared with the match operator and are covered in
that section (starting on page 246).
New concerns include:
The /e modifier
Using /g with a regex that can match nothingness
With the normal s/.../.../, the replacement operand immediately
follows the regex operand, using a total of three instances of the delimiter
rather than the two of m/.../. If the regex uses balanced
delimiters (such as <...>), the replacement operand then has its own
independent pair of delimiters (yielding four delimiters). In such cases,
the two sets may be separated by whitespace, and if so, by comments as
well.
Balanced delimiters are commonly used with /x or /e:
$test =~ s{
...some big regex here, with lots of comments and such...
} {
...a perl code snippet to be evaluated to produce the replacement text...
}ex
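As a concrete sketch of that layout, the Fahrenheit-to-Celsius conversion from the start of the chapter can be written with balanced delimiters, /x comments in the regex, and a commented code snippet as the replacement. The sample text is made up. The regex comments deliberately avoid $ and @, since variables are interpolated into the regex before comments are discarded.

```perl
use strict;
use warnings;

my $text = "Highs of 212F and lows of 32F";

$text =~ s{
    ( [-+]? \d+ (?: \. \d* )? )   # optional sign, digits, optional fraction
    F \b                          # the Fahrenheit marker
} {
    # the replacement is Perl code, evaluated after each match
    sprintf "%.0fC", ($1 - 32) * 5/9
}gex;

print "$text\n";   # Highs of 100C and lows of 0C
```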
Perl normally provides true doublequoted processing of the replacement
operand, although there are a few special-case delimiters. The processing
happens after the match (with /g, after each match), so $1 and the
like are available to refer to the proper match slice.
Special delimiters of the replacement operand are:
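The match operator's once-only delimiter ?...? is not
special with the substitution operator.
Backquotes as the delimiter were special in Perl4, but are not in Perl5. Consider two ways to replace text with a program's --version strings: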
| Perl4 : | s`version of (\w+)`$1 --version 2>&1`g; |
| Both Perl4 and Perl5 : | s/version of (\w+)/`$1 --version 2>&1`/e; |
The marked portion is the replacement operand. In the first version,
it is executed as a system command due to the special delimiter. In
the second version, the backquotes are not special until the whole
replacement is evaluated due to the /e modifier. The
/e modifier, discussed in detail in just a moment, indicates
that the replacement operand is a mini Perl snippet that should be
executed, and whose final value should be used as the replacement
text.
Remember that this replacement-operand processing is all quite distinct from regex-operand processing, which usually gets doublequotish processing and has its own set of special-case delimiters.
The substitution operator also supports the /e modifier. When
used, the replacement operand is evaluated as if with eval {...}
(including the load-time syntax check), the result of which is substituted
for the matched text. The replacement operand does not undergo any
processing before the eval (except to determine its lexical extent, as
outlined in Figure 7-1's Phase A), not even
singlequotish processing. The actual evaluation, however, is redone upon each match.
As an example, you can encode special characters of a World Wide Web URL using
% followed by their two-digit hexadecimal representation. To encode
all non-alphanumerics this way, you can use
$url =~ s/([^a-zA-Z0-9])/sprintf('%%%02x', ord($1))/ge;
and to decode back to the original, you can use
$url =~ s/%([0-9a-f][0-9a-f])/pack("C",hex($1))/ige;
In short, pack("C", value) converts from a numeric value
to the character with that value, while sprintf('%%%02x',
ord(character)) does the opposite; see your favorite Perl
documentation for more information. (Also, see the
footnote on page 66 for more on this example.)
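A quick round trip makes the pairing of the two substitutions clear. This sketch uses a made-up sample string and keeps a copy of the encoded form for inspection.

```perl
use strict;
use warnings;

my $original = "a b&c=d";
my $url      = $original;

# Encode: every non-alphanumeric becomes % plus two hex digits.
$url =~ s/([^a-zA-Z0-9])/sprintf('%%%02x', ord($1))/ge;
my $encoded = $url;

# Decode: turn each %xx back into the character with that value.
$url =~ s/%([0-9a-f][0-9a-f])/pack("C", hex($1))/ige;

print "$encoded\n";   # a%20b%26c%3dd
print "$url\n";       # a b&c=d
```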
When using the /e modifier, you should understand exactly
who interprets what -- and when. It's not too confusing,
but it does take some effort to keep things straight. For
example, even with something as simple as s/.../`echo $$`/e, the
question arises whether it's Perl or the shell that interprets the $$.
To Perl and many shells, $$ is the process ID (of Perl or the shell,
as the case may be). You must consider several levels of interpretation.
First, the replacement-operand has no pre-eval processing in Perl5,
but in Perl4 has singlequotish processing.
When the result is evaluated, the backquotes provide
doublequoted-string processing. (This is when Perl interpolates
$$ -- it may be escaped to prevent this interpolation.)
Finally, the result is sent to the shell, which then runs the echo
command. (If the $$ had been escaped, it would be passed to the shell
unescaped, resulting in the shell's interpolation of $$.)
To add to the fray, if using the /g modifier as well, should `echo $$` be evaluated just once (with the result being used for
all replacements), or should it be done after each match? When $1 and
such appear in the replacement operand, it obviously must be evaluated on a
per-match basis so that the $1 properly reflects its after-match
status. Other situations are less clear. With this echo example,
Perl version 5.000 does only one evaluation,
while other versions before and after evaluate on a per-match basis.
The replacement operand is evaluated multiple times
if /e is specified more than once. (It is the only modifier for which
repetition matters.) This is what Larry Wall calls an accidental
feature,
and was ``discovered'' in early 1991. During the ensuing
comp.lang.perl discussion, Randal Schwartz offered one of his
patented JAPH signatures:
$Old_MacDonald = q#print #; $had_a_farm = (q-q:Just another Perl hacker,:-);
s/^/q[Sing it, boys and girls...],$Old_MacDonald.$had_a_farm/eieio;
| 35 | My thanks to Hans Mulder for providing the historical background for this section, and for Randal for being Just Another Perl Hacker with a sense of humor. |
The eval due to the first /e sees
q[Sing it, boys and girls...],$Old_MacDonald.$had_a_farm
whose evaluation results in
print q:Just another Perl hacker,:
which then prints Randal's ``Just another Perl hacker'' signature when
evaluated due to the second /e.
Actually, this kind of construct is sometimes useful. Consider wanting
to interpolate variables into a string manually (such as if the string is read
from a configuration file). A simple approach uses
$data =~ s/(\$[a-zA-Z_]\w*)/$1/eeg;. Applying this to `option=$var', the regex matches just the `$var' portion. The
first eval simply sees the snippet $1 as provided in the
replacement-operand, which in this case expands to $var. Due to the
second /e, this result is evaluated again, resulting in whatever value
the variable $var has at the time. This then replaces the matched
`$var', in effect interpolating the variable.
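The two rounds of evaluation can be tried with a small sketch. The variable name and value are made up, and `our` is used (an assumption of this sketch, not something the text requires) so that the inner evaluation can see the variable under `use strict`.

```perl
use strict;
use warnings;

our $var = "blue";                 # package variable, visible to the inner eval
my $data = 'option=$var';          # single quotes: nothing interpolated yet

# First /e evaluates the snippet $1, giving the text '$var';
# second /e evaluates that text as Perl, giving the variable's value.
$data =~ s/(\$[a-zA-Z_]\w*)/$1/eeg;

print "$data\n";   # option=blue
```

Because the second evaluation runs whatever text was matched as Perl code, this approach is best kept to input you trust.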
I actually use something like this with my personal Web pages -- most of them are written in a pseudo Perl/HTML code that gets run through a CGI when pulled by a remote client. It allows me to calculate things on the fly, such as to remind readers how few shopping days are left until my birthday.
| 36 | If you, too, would like to see how many days are left until my birthday,
just load
http://omrongw.wg.omron.co.jp/cgi-bin/j-e/jfriedl.html
or perhaps one of its mirrors (see Appendix A). |
The match operator's return value depends on context and on /g. The substitution operator,
however, has none of these complexities -- it returns the same type of
information regardless of either concern.
The return value is either the number of substitutions performed or,
if none were done, an empty string. When interpreted as a Boolean (such as for the conditional of an
if), the return value conveniently interprets as true if any
substitutions were done, false if not.
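A brief sketch with made-up text shows both kinds of return value:

```perl
use strict;
use warnings;

my $text = "foo boo";

my $count = ($text =~ s/o/0/g);    # number of substitutions performed
my $none  = ($text =~ s/z/Z/g);    # nothing to replace: an empty string

printf "%d substitutions; text is now '%s'\n", $count, $text;
print "return value when nothing matched: ",
      ($none eq '' ? "an empty string" : "'$none'"), "\n";
```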
Table 7-9 applies to s/.../.../g matches as
well, but the entries marked ``Perl5'' are for the match operator only.
The entries marked ``Perl4'' apply to Perl4's match operator, and all
versions' substitution operator.
The split operator (often called a function in
casual conversation) is commonly used as the converse of a list-context m/.../g (=>253). The latter returns text
matched by the regex, while a split with the same regex returns text
separated by matches. The normal match $text =~ m/:/g applied
against a $text of
`IO.SYS:225558:95-10-03:-a-sh:optional',
returns the four-element list
(':', ':', ':', ':')
In contrast,
split(/:/, $text)
returns the five-element list:
('IO.SYS', '225558', '95-10-03', '-a-sh', 'optional')
In both cases, : matches four times. With split,
those four matches partition a copy of the target into five chunks which
are returned as a list of five strings.
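The converse relationship is easy to check side by side, using the same sample string:

```perl
use strict;
use warnings;

my $text = 'IO.SYS:225558:95-10-03:-a-sh:optional';

my @matched   = $text =~ m/:/g;       # the text matched by the regex
my @separated = split(/:/, $text);    # the text separated by those matches

print scalar(@matched),   " matches: @matched\n";
print scalar(@separated), " chunks: @separated\n";
```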
In its most simple form with simple data like this, split is as easy
to understand as it is useful. However, when the use of split or the
data are complicated, understanding is less clear-cut. First, I'll quickly
cover some of the basics.
split is an operator that looks like a function, and takes up to
three operands:
split(match operand, target string, chunk-limit operand)
(The parentheses are optional with Perl5.) Default values, discussed below, are provided for operands left off the end.
The match operand is normally a simple regular-expression match such as /:/ or m/\s*<P>\s*/i. Conventionally, /.../ rather than m/.../ is used, although it doesn't really
matter. The /g modifier is not needed (and is ignored) because split itself provides the iteration for matching in multiple places. There is a default match operand if one is not provided, but it is one of the complex special cases discussed later.
The target string operand is the text examined by split. The content of $_ is the default if no
target string is provided.
The chunk-limit operand limits the number of chunks that split partitions the string into. For example,
with our sample data, split(/:/, $text, 3) returns:
( 'IO.SYS', '225558', '95-10-03:-a-sh:optional' )
This shows that split stopped after /:/ matched twice,
resulting in the requested three-chunk partition. It could have matched
additional times, but that's irrelevant here because of the chunk-limit.
The limit is an upper bound, so no more than that many elements will ever
be returned, but note that it doesn't guarantee that many
elements -- no extra are produced to ``fill the count'' if the
data can't be partitioned enough to begin with.
split(/:/, $text, 1234)
still returns only a five-element list. Still, there is an important
difference between split(/:/, $text) and
split(/:/, $text, 1234) which does not manifest
itself with this example -- keep this in mind for when the details are
discussed later.
Remember that the chunk-limit operand is not a match-limit operand. Had it been for the example above, the three matches would have partitioned to
('IO.SYS', '225558', '95-10-03', '-a-sh:optional')
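A short sketch with the same sample data confirms both points: a limit of 3 leaves the remainder unsplit in the last chunk, while an overly generous limit produces no extra elements.

```perl
use strict;
use warnings;

my $text = 'IO.SYS:225558:95-10-03:-a-sh:optional';

my @three = split(/:/, $text, 3);      # stop once there are three chunks
my @many  = split(/:/, $text, 1234);   # upper bound only; no padding

print scalar(@three), " chunks, the last being '$three[-1]'\n";
print scalar(@many),  " chunks with a limit of 1234\n";
```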
One comment on efficiency: Let's say you intended to fetch only the first few fields, such as with
($filename, $size, $date) = split(/:/, $text)
With a limit of 3, the rest of the string would not be split off, so the $date chunk would
contain it. So, you'd want to use a limit of 4 so that Perl doesn't
waste time finding further partitions. Indeed, you can use a chunk-limit of
4, but if you don't provide one, Perl provides an appropriate
default so that you get the performance enhancement without changing the
results you actually see.
Because split is an operator and not a function, it can interpret its
operands in magical ways not restricted by normal function-calling
conventions. That's why, for example, split can recognize whether
the first operand is a match operator, as opposed to some general
expression that gets evaluated independently before the ``function'' is called.
Although useful, split is not straightforward to
master. Some important points to consider are:
split's match operand differs from
the normal m/.../ match operator. In addition, there are
several special-case match operands.
A scalar-context split, now deprecated, stuffs the list
into @_ instead of returning it.
split behaves differently when its regex has capturing
parentheses.
The basic premise of split is that it returns the text between
matches. If the match operator matches twice in a row, the nothingness
between the matches is returned. Applying m/:/ to the sample string
:IO.SYS:225558:::95-10-03:-a-sh:
finds seven matches. With split, a match always
results in a split between two items, including even that first match
at the start of the string separating '' (an empty string, i.e.,
nothingness) from 'IO.SY...'. Similarly, the fourth match separates
two empty strings. All in all, the seven matches partition the target into
the eight strings:
('','IO.SYS','225558','','','95-10-03','-a-sh','')
| 37 | There is one special case where this is not true. It is detailed a bit
later during the discussion about advanced split's match operand. |
However, this is not what split(/:/, $text) returns. Surprised?
If you don't want to limit the number of chunks returned, but instead only
want to leave trailing empty items intact, simply choose a very large
limit. Also, a negative chunk-limit is taken as an arbitrarily large limit:
split(/:/, $text, -1) returns all elements, including any
trailing empty ones.
At the other extreme, if you want to remove all empty
items, you could put grep {length} before the split. The
grep lets pass only list elements with non-zero lengths (in other
words, elements that aren't empty).
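The three variations can be compared on the sample string. In this sketch the default call drops the trailing empty item (as noted for trailing empty items generally), a negative chunk-limit keeps it, and grep {length} discards every empty item.

```perl
use strict;
use warnings;

my $text = ':IO.SYS:225558:::95-10-03:-a-sh:';

my @default  = split(/:/, $text);                   # trailing empty item removed
my @keep_all = split(/:/, $text, -1);               # all eight items
my @nonempty = grep { length } split(/:/, $text);   # no empty items at all

print scalar(@default),  " items by default\n";
print scalar(@keep_all), " items with a limit of -1\n";
print scalar(@nonempty), " items after grep {length}: @nonempty\n";
```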
Much of the complexity of the split operator lies in the many
personalities of the match operand. There are four distinct styles of
the match operand:
'·' (a single space)
As with the split examples we've seen so far, the most common use
of split uses a match operator for the match operand. However, there
are a number of important differences between a match operand and a real
match operator:
You should not use =~ with
split, as strange things can happen. split provides the
match operand with the target string operand behind the
scenes.
We've seen the complexities of a regex that can match nothingness with m/x*/ (=>249) and
s/x*/.../ (=>259). Perhaps
because split provides the repetition instead of the match
operator, things are much simpler here.
The one exception is that a match of nothingness at the start of the string does not produce a leading empty element. Contrast this to a match of something at the start of the string: it does produce a leading empty element. (A match of nothingness at the end of the string does leave a trailing one, though, but such trailing empty items are removed unless a large enough, or negative, chunk-limit is used.)
For example, m/\W*/ matches
`|T|h|i|s, "T|h|a|t", O|t|h|e|r!'
at the places marked (either underlined, or with | for a match of nothingness). Because of this exception, the
string-leading match is ignored, leaving 13 matches, resulting in the
14 element list:
('T', 'h', 'i', 's', 'T', 'h', 'a', 't', 'O', 't', 'h', 'e', 'r', '')
Of course, if no chunk-limit is specified, the final empty element is removed.
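This example can be reproduced with a short sketch. A negative chunk-limit keeps the final empty element, and the default drops it.

```perl
use strict;
use warnings;

my $text = 'This, "That", Other!';

my @all     = split(/\W*/, $text, -1);   # keep the trailing empty element
my @default = split(/\W*/, $text);       # trailing empty element removed

print scalar(@all),     " elements with a limit of -1\n";
print scalar(@default), " elements by default: @default\n";
```

Neither list begins with an empty element, since the match of nothingness at the start of the string is the exception described above.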
An empty regex with split does not mean ``Use the current
default regex,'' but to split the target string at each character.
For example, you could use
$text = join "\b_", split(//, $text, -1);
to put a backspace-underscore sequence after each character in
$text. (However, $text =~ s/(.)/$1\b_/g might be better
for a variety of reasons, one of which is that it's probably easier to
understand at first glance).
A use of split doesn't affect the default regex
for later match and substitution operators. Also, the variables
$&, $', $1, and so on are not available
from a split. A split is completely isolated from the
rest of the program with respect to side-effects.
The /g modifier is meaningless (but harmless) when used with
split.
The special once-only delimiter ?...? is not special with split.
A match operand of '·' (a single space) is like /\s+/ except
that leading whitespace is skipped. (This is meant to simulate the default
input-record-separator splitting that awk does with its input, although it
can certainly be quite useful for general use.)
For instance, the call
split('·', "···this···is·a·····test")
returns the four-element list
('this', 'is', 'a', 'test').
As a contrast to '·', consider using m/\s+/ directly. This
bypasses the leading-whitespace removal and returns
('', 'this', 'is', 'a', 'test')
Finally, both of these are quite different from using m/·/,
which matches each individual space and returns:
('','','','this','','','is','a','','','','','test')
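The three variations can be compared directly. In this sketch ordinary spaces stand in for the · marks used above.

```perl
use strict;
use warnings;

my $text = "   this   is a     test";

my @awk_style  = split(' ',    $text);   # special case: leading whitespace skipped
my @whitespace = split(/\s+/,  $text);   # leading empty element kept
my @each_space = split(/ /,    $text);   # every single space is a separator

print scalar(@awk_style),  " elements with ' '\n";
print scalar(@whitespace), " elements with /\\s+/\n";
print scalar(@each_space), " elements with / /\n";
```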
split(/\s+/, ...) is the same as
split('\s+', ...) except the former's regex is compiled only once,
the latter's each time the split is executed.
The default match operand, used when none is given (which is different from giving // or ''), is identical to using '·'. Thus, a raw split without any operands is the same
as split('·', $_, 0).
Perl4 supported a scalar-context split, which returned the number of
chunks instead of the list itself, causing the variable
@_ to receive the list of chunks as a side effect. Although Perl5
currently supports it, this feature has been deprecated and will likely
disappear in the future. Its use generates a warning when warnings are
enabled, as they generally should be.
Capturing parentheses in the regex change the behavior of split. In such a case, the returned array has additional,
independent elements interjected for the item(s) captured by the
parentheses. This means that text normally elided entirely by split is now included in the returned list.
In Perl4, this was more of a pain than a useful feature because it meant
you could never (easily) use a regex that happened to require parentheses
merely for grouping. If grouping is the only intent, the littering of extra
elements in the return list is definitely not a feature. Now that you
can select the style of parentheses -- capturing or not -- it's
a great feature. For example, as part of HTML processing,
split(/(<[^>]*>)/) turns
...·and·<B>very·<FONT·color=red>very</FONT>·much</B>·effort...
into
( '...·and·', '<B>', 'very·', '<FONT·color=red>', 'very', '</FONT>', '·much', '</B>', '·effort...' )
However, if you extend the regex to allow "[^"]*" doublequoted strings
(probably necessary
for real HTML work), you run into problems as the full regex becomes
something along the lines of:
(<[^>"]*("[^"]*"[^>"]*)*>)
| 38 | You might recognize this as being a partially unrolled version of
<("[^"]*"|[^>"])*>, with a
normal of [^>"] and a special of "[^"]*". You
could, of course, also unroll special independently. |
The added set of capturing parentheses means an added element returned for
each match during the split, yielding two items per match plus the normal
items due to the split. Applying this to
Please <A HREF="test">press me</A> today
returns (with descriptive comments added):
( 'Please·',                     before first match
  '<A·HREF="test">', '"test"',   from first match
  'press me',                    between matches
  '</A>', '',                    from second match
  '·today'                       after last match
)
The extra elements clutter the list. Using (?:...) for the added
parentheses, however, returns the regex to split usefulness, with
the results being:
( 'Please·',           before first match
  '<A·HREF="test">',   from first match
  'press me',          between matches
  '</A>',              from second match
  '·today'             after last match
)
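A small sketch lets you compare a split that discards the tags, one that keeps them with a single capturing set, and one that adds doublequoted-string handling with (?:...) so no extra elements appear. The sample line is the one used above, written with ordinary spaces.

```perl
use strict;
use warnings;

my $html = 'Please <A HREF="test">press me</A> today';

my @text_only = split(/<[^>]*>/,   $html);   # tags elided entirely
my @with_tags = split(/(<[^>]*>)/, $html);   # tags returned as extra elements

# Doublequoted strings allowed inside tags; the inner group is non-capturing.
my @quoted_ok = split(/(<[^>"]*(?:"[^"]*"[^>"]*)*>)/, $html);

print scalar(@text_only), " elements without capturing parentheses\n";
print scalar(@with_tags), " elements with them\n";
print "tag elements: $with_tags[1] and $with_tags[3]\n";
```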
There are, of course, Perl-specific efficiency issues, such as the use of non-capturing parentheses unless you specifically need capturing ones. There are some much larger issues as well, and even the issue of capturing vs. non-capturing is larger than the micro-optimization explained in Chapter 5 (=>152). In this section, we'll look at this (=>276), as well as the following topics:
The /o modifier, which I
haven't
discussed much yet, gives you some control over when the costly
re-compilation takes place.
$& Penalty -- The three match side effect variables,
$`, $&, and $', can be convenient, but there's a hidden
efficiency gotcha waiting in store for any script that uses them, even
once, anywhere. Heck, you don't even have to use them -- the
entire script is penalized if one of these variables even appears in
the script.
/i Penalty -- You pay another penalty when you
use the /i modifier. Particularly with very long target strings, it
can pay to rewrite the regex to avoid using /i.
Study -- Since ages past, Perl has provided the
study(...) function. Using it supposedly makes regexes faster, but
no one really understands it. We'll see whether we can figure it out.
-Dr Option -- Perl's regex-debug flag can tell you
about some of the optimizations the regex engine and transmission do, or
don't do, with your regexes. We'll look at how to do this and see what
secrets Perl gives up.
Consider padding an IP address like 18.181.0.24 such that each of the four
parts becomes exactly three digits: 018.181.000.024. One simple and
readable solution is:
$ip = sprintf "%03d.%03d.%03d.%03d", split(/\./, $ip);
This is a fine solution, but there are certainly other ways to do the job. In the same style as the The Perl Journal article I mentioned in the footnote on page 229, let's examine various ways of achieving the same goal. This example's goal is simple and not very ``interesting'' in and of itself, yet it represents a common text-handling task. Its simplicity will let us concentrate on the differing approaches to using Perl. Here are a few other solutions:
1. $ip =~ s/(\d+)/sprintf("%03d", $1)/eg;
2. $ip =~ s/\b(\d{1,2}\b)/sprintf("%03d", $1)/eg;
3. $ip = sprintf("%03d.%03d.%03d.%03d", $ip =~ m/(\d+)/g);
4. $ip =~ s/\b(\d\d?\b)/'0' x (3-length($1)) . $1/eg;
5. $ip = sprintf("%03d.%03d.%03d.%03d",
       $ip =~ m/^(\d+)\.(\d+)\.(\d+)\.(\d+)$/);
6. $ip =~ s/\b(\d(\d?)\b)/$2 eq '' ? "00$1" : "0$1"/eg;
7. $ip =~ s/\b(\d\b)/00$1/g;
   $ip =~ s/\b(\d\d\b)/0$1/g;
Like the original solution, each produces the same results when given a correct IP address, but fail in different ways if given something else. If there is any chance that the data will be malformed, more care than any of these solutions provide is needed. That aside, the practical differences lie in efficiency and readability. As for readability, about the only thing that's easy to see about most of these is that they are cryptic at best.
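To see that the approaches agree on well-formed input, a few of them can be run against the same address. This sketch (the hash and its key names are made up for the comparison) checks the original split solution and three of the alternatives.

```perl
use strict;
use warnings;

my $ip = '18.181.0.24';
my %result;

# The original solution.
$result{by_split} = sprintf "%03d.%03d.%03d.%03d", split(/\./, $ip);

# Pad every run of digits via /e.
($result{by_subst} = $ip) =~ s/(\d+)/sprintf("%03d", $1)/eg;

# List-context m/.../g feeding sprintf.
$result{by_match} = sprintf("%03d.%03d.%03d.%03d", $ip =~ m/(\d+)/g);

# Two plain substitutions: pad one-digit parts, then two-digit parts.
($result{by_two_steps} = $ip) =~ s/\b(\d\b)/00$1/g;
$result{by_two_steps} =~ s/\b(\d\d\b)/0$1/g;

print "$_: $result{$_}\n" for sort keys %result;
```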
So, what about efficiency? I benchmarked these solutions on my system with Perl version 5.003, and have listed them in order from least to most efficient. The original solution belongs somewhere between positions four and five, the best taking only 80 percent of its time, the worst about 160 percent. But if efficiency is really important, faster methods are still available:
substr($ip,  0, 0) = '0' if substr($ip,  1, 1) eq '.';
substr($ip,  0, 0) = '0' if substr($ip,  2, 1) eq '.';
substr($ip,  4, 0) = '0' if substr($ip,  5, 1) eq '.';
substr($ip,  4, 0) = '0' if substr($ip,  6, 1) eq '.';
substr($ip,  8, 0) = '0' if substr($ip,  9, 1) eq '.';
substr($ip,  8, 0) = '0' if substr($ip, 10, 1) eq '.';
substr($ip, 12, 0) = '0' while length($ip) < 15;
| 39 | With Perl4, for reasons I don't exactly know, the original solution is
actually the fastest of those listed. I wasn't able, however, to
benchmark the solutions using /e due to a related Perl4 memory
leak that rendered the results meaningless. |
This takes only half the time as the original, but at a fairly expensive toll in understandability. Which solution you choose, if any, is up to you. There are probably other ways still. Remember, ``There's more than one way to do it.''
On the other hand, if the regex changes each time, it certainly makes sense for Perl to reprocess it for us. Of course, it takes longer to redo the processing, but it's a very convenient feature that adds remarkable flexibility to the language, allowing a regex to vary with each use. Still, as useful as it may be, the extra work is sometimes needless. Consider a situation where a variable that doesn't change from use to use is used to provide a regex:
$today = (qw<Sun Mon Tue Wed Thu Fri Sat>)[(localtime)[6]];
# $today now holds the day ("Mon", "Tue", etc., as appropriate)
$regex = "^$today:";
while (<LOGFILE>) {
if (m/$regex/) {
:
The variable $regex is set just once, before the loop. The match
operator that uses it, however, is inside a loop, so it is applied
over and over again, once per line of <LOGFILE>. We can
look at this script and know for sure that the regex doesn't change during
the course of the loop, but Perl doesn't know that. It knows that the
regex operand involves interpolation, so it must re-evaluate that operand
each time it is encountered.
This doesn't mean the regex must be fully recompiled each time. As an intermediate optimization, Perl uses the compiled form still available from the previous use (of the same match operand) if the re-evaluation produces the same final regex. This saves a full recompilation, but in cases where the regex never does change, the processing of Figure 7-1's Phase B and C, and the check to see if the result is the same as before, are all wasted effort.
This is where the /o modifier comes in. It instructs Perl to process
and compile a regex operand the first time it is used, as normal, but to
then blindly use the same internal form for all subsequent tests by the
same operator. The /o ``locks in'' a regex the first time a match
operator is used. Subsequent uses apply the same regex even if variables
making up the operand were to change. Perl won't even bother looking.
Normally, you use /o as a measure of efficiency when you don't intend
to change the regex, but you must realize that even if the variables do
change, by design or by accident, Perl won't reprocess or recompile if /o is used.
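The lock-in effect is easy to see in a small experiment. The following is a minimal sketch; the subroutine names match_locked and match_fresh are hypothetical, chosen only for this illustration.

```perl
use strict;
use warnings;

# Match with /o: the regex is compiled on the first call, then reused blindly.
sub match_locked {
    my ($text, $regex) = @_;
    return $text =~ m/$regex/o ? 1 : 0;
}

# Match without /o: the operand is re-evaluated on every call.
sub match_fresh {
    my ($text, $regex) = @_;
    return $text =~ m/$regex/ ? 1 : 0;
}

my @locked = (match_locked('Mon: log entry', '^Mon:'),
              match_locked('Mon: log entry', '^Tue:'));   # new regex is ignored
my @fresh  = (match_fresh('Mon: log entry', '^Mon:'),
              match_fresh('Mon: log entry', '^Tue:'));    # new regex is used
print "locked: @locked, fresh: @fresh\n";
```

The second call to match_locked still reports a match, because the operator continues to apply the regex it saw first.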
Now, let's consider the following situation:
while (...)
{
    :
    $regex = &GetReply('Item to find');
    foreach $item (@items) {
        if ($item =~ m/$regex/o) { # /o used for efficiency, but has a gotcha!
            :
        }
    }
    :
}
The first time through the inner foreach loop (of the first time
through the outer while loop), the regular expression is processed
and compiled, the result of which is then used for the actual match
attempt. Because the /o modifier is used, the match operator in
question uses the same compiled form for all its subsequent attempts.
Later, in the second iteration of the outer loop, a new $regex is
read from the user with the intention of using it for a new search. It
won't work -- the /o modifier means to compile a regex operator's
regex just once, and since it had already been done, the original regex
continues to be used -- the new value of $regex is completely ignored.
The easiest way to solve this problem is to remove the /o modifier.
This allows the program to work, but it is not necessarily the best
solution. Even though the intermediate optimization stops the full
recompile (except when the regex really has changed, the first time through
each inner loop), the pre-compile processing and the check to see if it's
the same as the previous regex must still be done each and every time. The
resulting inefficiency is a major drawback that we'd like to avoid if at
all possible.
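For readers on later versions of Perl (5.005 and after, so not the 5.003 discussed here), the qr// operator offers a different route to the same goal: it compiles a regex into a value that can be stored and reused. The sketch below is an aside, not one of the techniques this section describes, and the name find_items is hypothetical.

```perl
use strict;
use warnings;

# Return the items that match $pattern, compiling the pattern only once.
sub find_items {
    my ($pattern, @items) = @_;
    my $compiled = qr/$pattern/;              # compile once, outside the loop
    return grep { $_ =~ $compiled } @items;   # reuse the compiled form
}

my @items = qw(apple banana cherry);
print join(',', find_items('an', @items)), "\n";
print join(',', find_items('e',  @items)), "\n";
```

Because a new compiled regex is made on each call, a changed pattern is picked up the next time through, which is exactly what /o prevents.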
One approach is to install the regex as the default regex, and then use the empty-regex form of m// (=>248):
while (...)
{
    :
    $regex = &GetReply('Item to find');
    # install regex (must be successful to install as default)
    if ($sample_text !~ m/$regex/) {
        die "internal error: sample text didn't match!";
    }
    foreach $item (@items) {
        if ($item =~ m//) # use default regex
        {
            :
        }
    }
    :
}
Unfortunately, it's usually quite difficult to find something appropriate for the sample string if you don't know the regex beforehand. Remember, a successful match is required to install the regex as the default. Additionally (in modern versions of Perl), that match must not be within a dynamic scope that has already exited.
A more general solution is to put the loop, /o and all, inside a string handed to eval:
while (...)
{
    :
    $regex = &GetReply('Item to find');
    eval 'foreach $item (@items) {
              if ($item =~ m/$regex/o) {
                  :
              }
          }';
    # if $@ is defined, the eval had an error.
    if ($@) {
        ...report error from eval had there been one ...
    }
    :
}
Notice that the entire foreach loop is within a singlequoted
string, which itself is the argument to eval. Each time the string
is evaluated, it is taken as a new Perl snippet, so Perl parses it
from scratch, then executes it. It is executed ``in place,'' so it has
access to all the program's variables just as if it had been part of the
regular code. The whole idea of using eval in this way is to
delay the parsing until we know what each regex is to be.
What is interesting for us is that, because the snippet is parsed afresh
with each eval, any regex operands are parsed from scratch, starting
with Phase A of Figure 7-1
(=>223). As a consequence, the regular expression
will be compiled when first encountered in the snippet (during the first
time through the foreach loop), but not recompiled further due to
the /o modifier. Once the eval has finished, that incarnation
of the snippet is gone forever. The next time through the outer while loop, the string handed to eval is the same, but
because the eval interprets it afresh, the snippet is considered
new all over again. Thus, the regular expression is again new, and so it is
compiled (with the new value of $regex) the first time it is
encountered within the new eval.
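The effect can be checked with a short, self-contained sketch. The subroutine name count_matches and the sample data are assumptions made for the illustration; the technique is the string eval with /o described above.

```perl
use strict;
use warnings;

# Count matching items. The string is recompiled on every call, so /o locks
# in the regex only for the lifetime of that one eval.
sub count_matches {
    my ($regex, @items) = @_;
    my $count = 0;
    my $ok = eval q{
        foreach my $item (@items) {
            $count++ if $item =~ m/$regex/o;
        }
        1;    # signal that the snippet ran to completion
    };
    die "eval failed: $@" unless $ok;
    return $count;
}

my @items = qw(apple banana cherry);
print count_matches('a',  @items), "\n";   # first regex
print count_matches('an', @items), "\n";   # a new regex is honored
```

Had the loop been ordinary code with /o, the second call would have silently reused the first regex.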
Of course, it takes extra effort for eval to compile the snippet
with each iteration of the outer loop. Does the /o savings justify the
extra time? If the array of @items is short, probably not. If long,
probably so. A few benchmarks (addressed later) can often help you decide.
This example takes advantage of the fact that, when we build a program
snippet in a string to feed to eval, Perl doesn't consider it to be
Perl code until eval is actually executed. You can, however, have
eval work with a normally compiled block of code, instead.
eval is special in that its argument can be a general expression
(such as the singlequoted string just used) or a {...} block of
code. When using the block method, such as with
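eval {
    foreach $item (@items) {
        if ($item =~ m/$regex/o) {
            :
        }
    }
};
the block is compiled only once, along with the rest of the program, which defeats our reason for using eval in the first place. We rely on the snippet being recompiled
with each use, so we must use the non-block version.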
Included among the variety of reasons to use eval are the
recompile-effects of the non-block style and the ability to trap errors.
Run-time errors can be trapped with either the string or the block style,
while only the string style can trap compile-time errors as well. (We saw
an example of this with $* on
page 235.) Trapping run-time errors, whether raised by Perl itself
(such as when testing if a feature is supported in your version of Perl) or
by die, is about the only
reason I can think of to use the eval {...} block version.
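The difference between the two styles can be sketched as follows. The subroutine names are hypothetical, and the malformed snippet is deliberately broken to provoke a compile-time error.

```perl
use strict;
use warnings;

# Block style: traps a run-time error raised while the block executes.
sub trap_runtime_error {
    my $ok = eval { die "boom\n"; 1 };
    return $ok ? '' : $@;
}

# String style: also traps a compile-time error in the snippet itself.
sub trap_compile_error {
    my $ok = eval 'my $x = ; 1';    # deliberately malformed Perl
    return $ok ? '' : $@;
}

print "run-time:     ", trap_runtime_error();
print "compile-time: ", trap_compile_error();
```

Had the malformed code been written in an eval {...} block, the whole program would have failed to compile, and there would have been nothing left to do the trapping.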
A third reason to use eval is to execute code that you
build on the fly. The following snippet shows a common trick:
sub Build_MatchMany_Function
{
    my @R = @_;        # Arguments are regexes
    my $program = '';  # We'll build up a snippet in this variable
    foreach $regex (@R) {
        $program .= "return 1 if m/$regex/;"; # Create a check for each regex
    }
    my $sub = eval "sub { $program; return 0 }"; # create anonymous function
    die $@ if $@;
    $sub; # return function to user
}
Before explaining the details, let me show an example of how it's used.
Given an array of regular expressions, @regexes2check, you might use
# Create a function to check a bunch of regexes
$CheckFunc = Build_MatchMany_Function(@regexes2check);
while (<>) {
    # Call the function to check the current $_
    if (&$CheckFunc) {
        ...have a line which matches one of the regexes...
    }
}
Given a list of regular expressions (or, more specifically, a list of
strings intended to be taken as regular expressions), Build_MatchMany_Function builds and returns a function that, when
called, indicates whether any of the regexes match the contents of
$_.
The reason to use something like this is efficiency. If you knew what the regexes were when writing the script, all this would be unnecessary. Not knowing, you might be able to get away with
$regex = join('|', @regexes2check); # Build monster regex
while (<>) {
    if (m/$regex/o) {
        ...have a line which matches one of the regexes...
    }
}
but all that alternation is likely to be inefficient. You might also try testing each regex in turn:
while (<>) {
    foreach $regex (@regexes2check) {
        if (m/$regex/) {
            ...have a line which matches one of the regexes...
            last;
        }
    }
}
This, too, is inefficient because each regex must be reprocessed and recompiled each time. Extremely inefficient. So, spending time to build an efficient match approach in the beginning can, in the long run, save a lot.
If the strings passed to Build_MatchMany_Function are this,
that, and other, the snippet that it builds and
evaluates is effectively:
sub {
    return 1 if m/this/;
    return 1 if m/that/;
    return 1 if m/other/;
    return 0
}
Each time this anonymous function is called, it checks the $_ at the
time for the three regexes, returning true the moment one is found.
It's a nice idea, but there are problems with how it's commonly
implemented (including the one
on the previous page).
Pass Build_MatchMany_Function
a string which contains a $ or @ that can
be interpreted, within the eval, by variable interpolation, and
you'll get a big surprise. A partial solution is to use a singlequote delimiter:
$program .= "return 1 if m'$regex';"; # Create a check for each regex
But there's a bigger problem. What if one of the regexes contains a
singlequote (or one of whatever the regex delimiter is)? Wanting the regex don't adds
return 1 if m'don't';
to the snippet, which results in a syntax error when evaluated. You
can use \xff or some other unlikely character as the delimiter, but
why take a chance? Here's my solution to take care of these problems:
sub Build_MatchMany_Function
{
    my @R = @_;
    my $expr = join '||', map { "m/\$R[$_]/o" } (0..$#R);
    my $sub = eval "sub { $expr }"; # create anonymous function
    die $@ if $@;
    $sub; # return function to user
}
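A quick way to see that this version copes with troublesome regexes is to feed it a few. The sketch below repeats the subroutine so that it runs on its own; the sample regexes and test lines are assumptions chosen to include a singlequote, a dollar sign, and an at-sign.

```perl
use strict;
use warnings;

sub Build_MatchMany_Function {
    my @R = @_;
    # Each check names an element of @R by index, so the text of a regex
    # never appears in the generated code.
    my $expr = join '||', map { "m/\$R[$_]/o" } (0..$#R);
    my $sub = eval "sub { $expr }";   # create anonymous function
    die $@ if $@;
    return $sub;
}

my $check = Build_MatchMany_Function("don't", '\$\d+', 'a@b');
my @hits;
foreach (q{I don't know}, 'costs $25', 'mail a@b now', 'nothing here') {
    push @hits, ($check->() ? 1 : 0);   # the function tests the current $_
}
print "@hits\n";
```

The generated code is just `m/$R[0]/o||m/$R[1]/o||m/$R[2]/o`, whatever the regexes themselves contain.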
I'll leave the analysis as an exercise. However, one question: What happens
if this function uses local instead of my for the @R
array?
Turn the page to check your answer.
Answer to the question on page 273.
Before answering the question, first a short summary of binding:
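Whenever Perl compiles a snippet, whether during program load or during an eval, references to variables in the snippet are bound to the variables visible at that point.
When @R is declared with my, the anonymous function is bound to that private array, which stays alive for as long as the function refers to it, so the regexes are still there when the function is finally called.
When @R is declared with local, the function is bound to the global @R, whose temporary value is discarded the moment Build_MatchMany_Function returns, so a later call finds only whatever the global @R happens to hold at that time.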
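The difference between my and local here can be demonstrated without any regexes at all. In this sketch (the names build_with_my and build_with_local are hypothetical), each builder returns a function that reports how many elements it can still see in @R after the builder has returned.

```perl
use strict;
use warnings;

our @R;    # the global array that the local version temporarily overrides

sub build_with_my {
    my @R = @_;                        # private array, kept alive by the closure
    return eval 'sub { scalar @R }';
}

sub build_with_local {
    local @R = @_;                     # global array, restored on return
    return eval 'sub { scalar @R }';
}

my $count_my    = build_with_my('this', 'that')->();
my $count_local = build_with_local('this', 'that')->();
print "my: $count_my, local: $count_local\n";
```

With my, the returned function still sees both elements. With local, the values are gone by the time the function is called, so regexes built from them would be empty.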
$& and Friends
$`, $&, and $' refer to the text leading the match, the
text matched, and the text that trails the match, respectively
(=>217). Even if the target string is
later changed, these variables must still refer to the original text, as
advertised. Case in point: the target string is changed immediately during
a substitution, but we still need to have $& refer to the original
(and now-replaced) text. Furthermore, even if we change the target string
ourselves, $1, $&, and friends must all continue to refer to
the original text (at least until the next successful match, or until the
block ends). So, how does Perl conjure up the original text despite
possible changes?
It makes a copy. All the variables described above actually refer to this internal-use-only copy, rather than to the original string. Having a copy means, obviously, that there are two copies of the string in memory at once. If the target string is huge, so is the duplication. But then, since you need the copy to support these variables, there is really no other choice, right?
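That these variables survive a change to the target is easy to confirm. The sketch below uses a hypothetical subroutine, match_then_clobber. It necessarily mentions the very variables this section goes on to warn about, so it is for experimentation only.

```perl
use strict;
use warnings;

# Match, overwrite the target, then report what $`, $&, and $' hold.
sub match_then_clobber {
    my ($target) = @_;
    return unless $target =~ m/\d+F/;
    $target = 'something else entirely';   # the original text is gone...
    return ($`, $&, $');                   # ...but these still refer to it
}

my @parts = match_then_clobber('boils at 212F, freezes at 32F');
print join('|', @parts), "\n";
```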
Warning: The situations and tricks I describe exploit internal workings of Perl. It's nice if they make your programs faster, but they're not part of the Perl specification and may be changed in future releases. (I am writing as of version 5.003.) If these optimizations suddenly disappear, the only effect will be on efficiency -- programs will still produce the same results, so you don't need to worry that much.
| 40 | Well, they wouldn't be a part if there were a Perl specification. |
Three situations trigger the copy for a successful match or
substitution:
· the use of $`, $&, or $' anywhere in the entire script
· the use of capturing parentheses in the regex
· the use of the /i modifier with a non-/g match operator
Also, you might need additional internal copies to support:
· the use of the /i modifier (with any match or substitute)
· the use of many, but not all, substitution operators
I'll go over the first three here, and the other two in the following section.
$`, $&, or $' require the copy
Perl must make the copy to support any use of $`, $&, and $'. In practice, these variables are not used after most matches, so
it would be nice if the copy were done only for those matches that needed
it. But because of the dynamically scoped nature of these variables, their
use may well be some distance away from the actual match. Theoretically, it
may be possible for Perl to do exhaustive analysis to determine that all
uses of these variables can't possibly refer to a particular match (and
thus omit the copy for that match), but in practice Perl does not do this.
Therefore, it must normally do the copy for every successful match of all
regexes during the entire run of the program.
However, it does notice whether there are no references whatsoever to $`, $&, or $' in the entire program (including all
libraries referenced by the script!). Since the variables never appear in
the program, Perl can be quite sure that a copy merely to support them can
safely be omitted. Thus, if you can be sure that your code and any
libraries it might reference never use $`, $&, or $',
you are not penalized by the copy except when explicitly required by the
two other cases.
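The copy is also needed to support $1, $2, and the like whenever a regex has capturing parentheses. (Whether you actually use $1 has no bearing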
whatsoever -- if capturing parentheses are used, the copy is made even
if its results are never used.) In Perl4 there were no grouping-only
parentheses, so even if you didn't intend to capture text, you did anyway
as a side effect and were penalized accordingly. Now, with (?:...),
you should never find that you are capturing text that you don't intend to
use. But when you do intend to use $1, $2, etc.,
Perl will have made the copy for you.
| 41 | Well, capturing parentheses are also used for backreferences, so it's
possible that capturing parentheses might be used when $1 and the
like are not. This seems uncommon in practice. |
m/.../i requires the copy
A non-/g match operator with /i causes a copy. Why? Frankly, I
don't know. From looking at the code, the copy seems entirely superfluous
to me, but I'm certainly no expert in Perl internals. Anyway, there's
another, more important efficiency hit to be concerned about with /i.
I'll pick up this subject again in a moment, but first I'd like to show
some benchmarks that illustrate the effects of the $&-support copy.
For a simple benchmark, I checked m/c/ against each of the
50,000 or so lines of C that make up the Perl source distribution. The
check merely noted whether there was a `c' on a line -- the
benchmark didn't actually do anything with the information since the goal
was to determine the effect of the behind-the-scenes copying. I ran the
test two different ways: once where I made sure not to trigger any of the
conditions mentioned above, and once where I made sure to do so. The only
difference, therefore, was in the extra copy overhead.
The run with the extra copying consistently took over 35 percent longer than the one without. This represents an ``average worst case,'' so to speak. The more real work a program does, the less of an effect, percentage-wise, the copying has. The benchmark didn't do any real work, so the effect is highlighted.
On the other hand, in true worst-case scenarios, the extra copy might truly
be an overwhelming portion of the work. I ran the same test on the same
data, but this time as one huge line incorporating more than a
megabyte of data rather than the 50,000 or so reasonably sized lines. Thus,
the relative performance of a single match can be checked. The match
without the copy returned almost immediately, since it was sure to find a
`c' somewhere near the start of the string. Once it did, it was done.
The test with the copy is the same except, well, it had to make a copy of
the megabyte-plus-sized string first. Relatively
speaking, it took over 700 times longer! Knowing the ramifications,
therefore, of certain constructs allows you to tweak your code for better efficiency.
$& and friends
A solution in Perl can be approached in many ways, and I've said numerous times that if you write in Perl as you write in another language (such as C), your Perl will be lacking and almost certainly inefficient. For the most part, crafting programs The Perl Way should go a long way toward putting you on the right track, but still, as with any discipline, special care can produce better results. So yes, while the copies aren't ``wrong,'' we still want to avoid unnecessary copying whenever possible. Towards that end, there are steps we can take.
Foremost, of course, is to never use $`, $&, or $'
anywhere in your code. This also means to never use English.pm,
nor any library module that uses it or in any other way references these
variables. Table 7-10 shows a list of standard
libraries (in Perl version 5.003) which reference one of the
naughty variables, or use another library that does. You'll notice that
most are tainted only because they use Carp.pm. If you look into
that file, you'll find only one naughty variable:
$eval =~ s/[\\\']/\\$&/g;
Changing this to
$eval =~ s/([\\\'])/\\$1/g;
would rid Carp.pm, and with it most of these libraries, of the naughty variable.
Table 7-10: Standard libraries that reference $& and friends
| AutoLoader            | Fcntl          | Pod::Text        |
| AutoSplit             | File::Basename | POSIX            |
| Benchmark             | File::Copy     | Safe             |
| Carp                  | File::Find     | SDBM_File        |
| DB_File               | File::Path     | SelectSaver      |
| diagnostics           | FileCache      | SelfLoader       |
| DirHandle             | FileHandle     | Shell            |
| dotsh.pl              | GDBM_File      | Socket           |
| dumpvar.pl            | Getopt::Long   | Sys::Hostname    |
| DynaLoader            | IPC::Open2     | Syslog           |
| English               | IPC::Open3     | Term::Cap        |
| ExtUtils::Install     | lib            | Test::Harness    |
| ExtUtils::Liblist     | Math::BigFloat | Text::ParseWords |
| ExtUtils::MakeMaker   | MM_VMS         | Text::Wrap       |
| ExtUtils::Manifest    | newgetopt.pl   | Tie::Hash        |
| ExtUtils::Mkbootstrap | ODBM_File      | Tie::Scalar      |
| ExtUtils::Mksymlists  | open2.pl       | Tie::SubstrHash  |
| ExtUtils::MM_Unix     | open3.pl       | Time::Local      |
| ExtUtils::testlib     | perl5db.pl     | vars             |
Naughty due to the use of: C: Carp B: File::Basename E: English L: Getopt::Long
If you can be sure these variables never appear, you'll know you will do
the copy only when you explicitly request it with capturing parentheses, or
via the rogue m/.../i. Some expressions in current code might need
to be rewritten. On a case-by-case basis, $` can often be mimicked by (.*?) at the head of the regex, $& by (...) around the
regex, and $' by (?=(.*)) at the end of the regex.
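As a sketch of how the three stand-ins fit together, the following subroutine (split_around_match is a hypothetical name) applies all of them to one regex at once. The leading ^ and the /s modifier are additions made for this example: the first stops the engine retrying at later positions, the second lets dot match newlines.

```perl
use strict;
use warnings;

# Split a string around the first match of \d+F without $`, $&, or $'.
sub split_around_match {
    my ($target) = @_;
    if ($target =~ m/^(.*?)(\d+F)(?=(.*))/s) {
        return ($1, $2, $3);    # stand-ins for $`, $&, and $'
    }
    return;
}

print join('|', split_around_match('boils at 212F, freezes at 32F')), "\n";
```

The price is three sets of capturing parentheses, so the copy is still made for this one regex, but no longer for every other regex in the program.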
If your needs allow, there are other, non-regex methods that might be
attempted in place of some regexes. You can use index(...) to find a
fixed string, for example. In the benchmarks I described earlier, it was
almost 20 percent faster than m/.../, even without the copy overhead.
Especially with the use of libraries, it's not always easy to notice
whether your program ever references $`, $&, or $'. An easier approach is to test for the performance penalty, although it doesn't tell you where the offending variable is. Here's a subroutine that I've come up with:
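A check along these lines might look like the following sketch. It should be read as a reconstruction under assumptions rather than as the definitive listing: the name CheckNaughtiness, the loop count, the size of the test string, and the factor of 10 are all illustrative. The idea is to time a trivial match against a large string and compare it with the cost of an empty loop; if the copy is being made, the match loop is disproportionately slow. On later Perl releases that handle the copy differently, the check may simply always report clean.

```perl
use strict;
use warnings;

# Heuristic check for the $&-support copy.
# Returns (verdict, overhead seconds, match seconds).
sub CheckNaughtiness {
    my $loops = shift || 5000;
    local $_ = 'x' x 10000;    # a non-small target, so a copy would be noticed
    my $i;

    # Time a do-nothing loop to measure the loop overhead.
    my $start = (times)[0];
    for ($i = 0; $i < $loops; $i++) { }
    my $overhead = (times)[0] - $start;

    # Time the same number of trivial, always-successful matches.
    $start = (times)[0];
    for ($i = 0; $i < $loops; $i++) { m/^/ }
    my $delta = (times)[0] - $start;

    # The factor of 10 is only a rule of thumb.
    my $verdict = ($delta > $overhead * 10) ? 'naughty' : 'clean';
    printf "It seems your code is %s (overhead=%.2f, delta=%.2f)\n",
        $verdict, $overhead, $delta;
    return ($verdict, $overhead, $delta);
}
```

Because the timings are subject to clock granularity, a larger loop count gives a more trustworthy verdict.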
This is not a function you would keep in production code, but one you might
insert temporarily and call once at the beginning of the program (perhaps
immediately following it with an exit).
Before a match or substitution operator applies an /i-governed regex,
Perl first makes a temporary copy of the entire target string. This
copy is in addition to any copy in support of $&
and friends. The latter is done only after a successful match, while the
one to support a case-insensitive match is done before the attempt. After
the copy is made, the engine then makes a second pass over the entire
string, converting any uppercase characters to lowercase. The result might
happen to be the same as the original, but in any case, all letters are lowercase.
This goes hand in hand with a bit of extra work done during the compilation of the regex to an internal form. At that time, uppercase letters in the regex are converted to lowercase as well.
The result of these two steps is a string and a regex that then matches normally -- nothing special or extra needs to be done within the actual matching portion of the regex engine. It all appears to be a very tidy arrangement, but this has got to be one of the most gratuitous inefficiencies in all of Perl.
Many subexpressions (and full regular expressions, for that matter) do not require special handling: the CSV program at the start of the chapter (=>205), the regex to add commas to numbers (=>229), and even the huge, 4,724-byte regex we construct in ``Matching an Email Address'' (=>294) are all free of the need for special case-insensitive handling. A case-insensitive match with such expressions should not have any efficiency penalty at all.
Even a character class with letters shouldn't entail an efficiency penalty. At compile time, the appropriate other-case version of any letter can easily be included. (A character class's efficiency is not related to the number of characters in the class; =>115.) So, the only real extra work would be when letters are included in literal text, and with backreferences. Although they must be dealt with, they can certainly be handled more efficiently than making a copy of the entire target string.
By the way, I forgot to mention that when the /g modifier is used, the
copy is done with each match. At least the copy is only from the start
of the match to the end of the string -- with a m/.../ig on a
long string, the copies are successively shorter as the matches near the
end.
As an unfair and cruel first test, I loaded the entire file into a single
string, and benchmarked 1 while m/./g and
1 while m/./gi. Dot certainly doesn't care one
way or the other about capitalization, so it's not reasonable to
penalize this match for case-insensitive handling. On my machine, the first
snippet benchmarked at a shade under 12 seconds. Simply adding the /i
modifier (which, you'll note, is meaningless in this case) slowed the
program by four orders of magnitude, to over a day and a
half!
I calculate that the needless copying caused Perl to shuffle around more
than 647,585 megabytes inside my CPU. This is particularly
unfortunate, since it's so trivial for the compilation part of the engine
to tell the matching part that case-insensitiveness is irrelevant for ., the regex at hand.
| 42 | I didn't actually run the benchmark that long. Based on other test cases, I calculated that it would take about 36.4 hours. Feel free to try it yourself, though. |
This unrealistic benchmark is definitely a worst-case scenario. Searching a
huge string for something that matches less often than . is more
realistic, so I benchmarked m/\bwhile\b/gi and
m/\b[wW][hH][iI][lL][eE]\b/g on the same string. Here, I try to
mimic the regex-oriented approach myself. It's incredibly naïve
for a regex-oriented implementation to actually turn literal text into
character classes, so we can consider the /i-equivalent to be a
worst-case situation in this respect. In fact, manually turning while
into [wW][hH][iI][lL][eE] also kills Perl's fixed string check
(=>155), and renders study
(=>287) useless for the regex. With all this
against it, we should expect it to be very slow indeed. But it's still over
50 times faster than the /i version!
| 43 | Although this is exactly what the original implementation of grep did! |
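Before trading /i for hand-built character classes, it is worth confirming that the two forms really find the same things. A minimal sketch, with hypothetical subroutine names and a made-up test string:

```perl
use strict;
use warnings;

# Count occurrences using the /i modifier.
sub count_with_i {
    my ($text) = @_;
    my $count = 0;
    $count++ while $text =~ m/\bwhile\b/gi;
    return $count;
}

# Count the same occurrences with hand-built character classes instead.
sub count_with_classes {
    my ($text) = @_;
    my $count = 0;
    $count++ while $text =~ m/\b[wW][hH][iI][lL][eE]\b/g;
    return $count;
}

my $text = 'while (x) { WHILE; awhile; While }';
printf "%d %d\n", count_with_i($text), count_with_classes($text);
```

The two counts should always agree; only the time taken to arrive at them differs.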
Perhaps this test is still unfair -- the /i-induced copy made at
the start of the match, and after each of the 412 matches of \bwhile\b
in my test data, is large. (Remember, the single string is over a megabyte
long.) Let's try testing m/^int/i and m/^[iI][nN][tT]/ on
each of the 50,000 lines of the test file. In this case, /i has each
line copied before the match attempt, but since they're so short, the copy
is not so crushing a penalty as before: the /i version is now just 77
percent slower. Actually, this includes the extra copies inexplicably made
for each of the 148 matches -- remember, a non-/g m/.../i
induces the $&-support copy.
As the later benchmarks show, the /i-related penalty is not as
heinous as the first benchmark leads you to believe. Still, it's a concern
you should be very aware of, and I hope that future versions of Perl
eliminate the most outlandish of these inefficiencies.
Foremost: don't use /i unless you really have to. Blindly
adding it to a regex that doesn't require it invites many wasted CPU
cycles. In particular, when working with long strings, it can be a huge
benefit to rewrite a regex to mimic the regex-oriented approach to case
insensitivity, as I did with the last two benchmarks.
Still, I have managed to understand a bit about how the substitution operator works internally, and have developed a few rules of thumb that I'd like to
share with you. Let me warn you, up front, that there's no simple
one-sentence summary to all this. Perl often takes internal optimizations
in bits and pieces where they can be found, and numerous special cases and
opportunities surrounding the substitution operator provide fertile
ground for optimizations. It turns out that the $&-support copy
disables all of the substitution-related optimizations, so that's all the
more reason to banish $& and friends from your code.
| 44 | I modified my copy of version 5.003 to spit out volumes of pretty color-coded messages as various things happen internally. This way, I have been able to understand the overall picture without having to understand the fine details. |
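Let me start by stepping back to look at the worst-case scenario for substitution efficiency. Consider applying
s[(\d+(\.\d*)?)F\b]{sprintf "%.0fC", ($1-32) * 5/9}eg
to the string:
Water boils at 212F, freezes at 32F.
When the first match is found, Perl creates a temporary string and copies everything before the match, `Water·boils·at·', to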
it. The substitution text is then computed (`100C' in this case) and
added to the end of the temporary string. (By the way, it's at this point
that the $&-support copy would be made were it required.)
At the next match (because /g is used), the text between the two
matches is added to the temporary string, followed by the newly computed
substitution text, `0C'. Finally, after it can't find any more matches,
the remainder of the string (just the final period in this case)
is copied to close out the temporary string. This leaves us with:
Water boils at 100C, freezes at 0C.
The original target string, $_, is then discarded and replaced by the
temporary string. (I'd think the original target could be used to support
$1, $&, and friends, but it is not -- a separate copy
is made for that, if required.)
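The build-a-temporary-string procedure just described can be imitated in ordinary Perl, which makes the steps easier to follow. This is a sketch of the idea only, not of Perl's internal code; the subroutine names are hypothetical, and the @- and @+ arrays it relies on are assumed to be available (they arrived in versions of Perl later than 5.003).

```perl
use strict;
use warnings;

# What the substitution does, written the ordinary way.
sub convert_with_s {
    my ($target) = @_;
    $target =~ s[(\d+(\.\d*)?)F\b]{sprintf "%.0fC", ($1-32) * 5/9}eg;
    return $target;
}

# The same result built by hand: copy unmatched text and each computed
# replacement into a temporary string, then hand back the temporary.
sub convert_by_hand {
    my ($target) = @_;
    my $temp     = '';
    my $last_end = 0;    # where the previous match ended
    while ($target =~ m/(\d+(\.\d*)?)F\b/g) {
        $temp .= substr($target, $last_end, $-[0] - $last_end);  # text before match
        $temp .= sprintf "%.0fC", ($1 - 32) * 5/9;               # replacement text
        $last_end = $+[0];
    }
    $temp .= substr($target, $last_end);    # remainder after the last match
    return $temp;
}

my $line = 'Water boils at 212F, freezes at 32F.';
print convert_with_s($line), "\n";
print convert_by_hand($line), "\n";
```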
At first, this method of building up the result might seem
reasonable because for the general case, it is reasonable. But
imagine something simple like s/\s+$// to remove trailing
whitespace. Do you really need to copy the whole (potentially huge) string,
just to lop off its end? In theory, you don't. In practice, Perl doesn't
either. Well, at least not always.
$& and friends disable all substitute-operator optimizations
Perl is smart enough to reduce s/\s+$// to simply adjusting the
length of the target string. This means no extra copies -- very fast.
For reasons that escape me, however, this optimization (along with all the
substitution optimizations I mention in a moment) is disabled when the $&-support copy is done. Why? I don't know, but the practical effect
is yet another way that $& is detrimental to your code's efficiency.
The $&-support copy is also done when there are capturing parentheses
in the regex, although in that case, you'll likely be
enjoying the fruits of that copy (since $1 and the like are supported
by it). Capturing parentheses also disable the substitution optimizations,
but at least only for the regexes they're used in and not for all regexes
as a single stray $& does.
s/\s+$// is among many examples that fit a pattern:
when the replacement text is shorter or the same length as the text being
replaced, it can be inserted right into the string. There's no need to make
a full copy. Figure 7-2 shows part of the example
from page 45, applying the
substitution s/<FIRST>/Tom/ to the string
`Dear·<FIRST>,[NL]'. The new text is copied directly over
what is being replaced, and the text that follows the match is moved down
to fill the gap.

Figure 7-2: Applying s/<FIRST>/Tom/ to `Dear·<FIRST>,[NL]'
In the case of s/\s+$//, there's no replacement text to be filled
in and no match-following text to be moved down -- once the match is
found, the size of the string is adjusted to lop off the match, and that's
that. Very zippy. The same kinds of optimizations also apply to matches
at the beginning of the string.
When the replacement text is exactly the same size as the matched text, you'd think that as a further optimization, the ``move to fill the gap'' could be omitted, since there is no gap when the sizes are an exact match. For some reason, the needless ``move'' is still done. At least the algorithm, as it stands, never copies more than half the string (since it is smart enough to decide whether it should copy the part before the match, or the part after).
A substitution with /g is a bit better still. It doesn't do the
gap-filling move until it knows just how much to move (by delaying the move
until it knows where the next match is). Also, it seems in this case that
the worthless move to fill a non-existent gap is kindly omitted.
The use of $1 and friends in the replacement text disables the
optimizations.
Other variable interpolation, however, does not. With the previous
example, the original substitution was:
$given = 'Tom'; $letter =~ s/<FIRST>/$given/g;
| 45 | It is redundant to say that the use of $1 in the replacement string
disables the optimizations. Recall that the use of capturing parentheses
in the regex causes the $&-support copy, and that copy also
disables the substitution optimizations. It's silly to use $1
without capturing parentheses, as you're guaranteed its value will be
undefined (=>217). |
The replacement operand has variables interpolated before any matching begins, so the size of the result is known.
Any substitution using the /e modifier, of course, doesn't know the
size of the substitution text until after a match, and the
substitution operand is evaluated, so there are no substitution
optimizations with /e either.
Perl5 provides the Benchmark module for timing code, but it
is tainted by the $& penalty. This is quite
unfortunate, for the penalty might itself silently render the benchmark
results invalid. I prefer to keep things simple by just wrapping the code
to be tested in something like:
$start = (times)[0];
   :
$delta = (times)[0] - $start;
printf "took %.1f seconds\n", $delta;
An important consideration about benchmarking is that due to clock granularity (1/60 or 1/100 of a second on many systems), it's best to have code that runs for at least a few seconds. If the code executes too quickly, do it over and over again in a loop. Also, try to remove unrelated processing from the timed portion. For example, rather than
$start = (times)[0]; # Go! Start the clock.
$count = 0;
while (<>) {
    $count++ while m/\b(?:char\b|return\b|void\b)/g;
}
print "found $count items.\n";
$delta = (times)[0] - $start; # Done. Stop the clock.
printf "the benchmark took %.1f seconds.\n", $delta;
try:
$count = 0; # (no need to have this timed, so bring above the clock start)
@lines = <>; # Do all file I/O here, so the slow disk is not an issue when timed
$start = (times)[0]; # Okay, I/O now done, so now start the clock.
foreach (@lines) {
    $count++ while m/\b(?:char\b|return\b|void\b)/g;
}
$delta = (times)[0] - $start; # Done. Stop the clock.
print "found $count items.\n"; # (no need to have this timed)
printf "the benchmark took %.1f seconds.\n", $delta;
The biggest change is that the file I/O has been moved out from the timed portion. Of course, if you don't have enough free memory and start swapping to disk, the advantage is gone, so make sure that doesn't happen. You can simulate more data by using a smaller amount of real data and processing it several times:
for ($i = 0; $i < 10; $i++) {
    foreach (@lines) {
        $count++ while m/\b(?:char\b|return\b|void\b)/g;
    }
}
It might take some time to get used to benchmarking in a reasonable way, but the results can be quite enlightening and downright rewarding.
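The advice above can be gathered into a small reusable helper. This is a sketch under assumptions: the name time_it and the in-memory sample lines are made up for illustration, standing in for real data read before the clock starts.

```perl
use strict;
use warnings;

# Run $code $repeat times and return the user CPU seconds consumed.
sub time_it {
    my ($code, $repeat) = @_;
    my $start = (times)[0];        # start the clock
    $code->() for 1 .. $repeat;
    return (times)[0] - $start;    # stop the clock
}

# Stand-in data, built in memory so that no I/O is timed.
my @lines = ("static char *p;\n", "return void_ptr;\n", "void f(void);\n") x 1000;

my $count = 0;
my $delta = time_it(sub {
    foreach (@lines) {
        $count++ while m/\b(?:char\b|return\b|void\b)/g;
    }
}, 10);
printf "found %d items in %.2f seconds.\n", $count, $delta;
```

If the reported time is close to the clock granularity, raise the repeat count until the run lasts at least a few seconds.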
If your perl was compiled with debugging support (by using -DDEBUGGING during its build), the -D debugging command-line
option is available. The use of -Dr (-D512 with
Perl4) tells you a bit about how Perl compiles your regular expressions
and gives you a blow-by-blow account of each application.
Much of what -Dr provides is beyond the scope of this book, but
you can readily understand some of its information. Let's look at a simple
example (I'm using Perl version 5.003):
[1] jfriedl@tubby> perl -cwDr -e '/^Subject: (.*)/'
[2] rarest char j at 3
[3] first 14 next 83 offset 4
[4]  1:BRANCH(47)
[5]  5:BOL(9)
[6]  9:EXACTLY(23) <Subject: >
[7] 23:OPEN1(29)
        :
[8] 47:END(0)
[9] start `Subject: ' anchored minlen 9
At [1], I invoke perl at my shell prompt, using the command-line
arguments -c (which means check script, don't actually execute it),
-w (issue warnings about things Perl thinks are
dubious -- always used as a matter of principle), -Dr (regex
debugging), and -e (the next argument is the Perl snippet
itself).
This combination is convenient for checking regexes right from the command
line. The regex here is ^Subject:·(.*) which we've seen several
times in this book.
Lines [4] through [8] represent Perl's compiled form of
the regex. For the most part, we won't be concerned much about it here.
However, in even a casual look, line [6] sticks out as understandable.
Subject:·', but many expressions either have no required literal
text, or have it beyond Perl's ability to deduce. (This is one area where
Emacs' optimization far outshines
Perl's; =>197.) Some examples where
Perl can deduce nothing include -?([0-9]+(\.[0-9]*)?|\.[0-9]+), ^\s*, ^(-?\d+)(\d{3}),
and even int|void|while.
Examining int|void|while, you see that `i' is
required in any match. Some NFA engines can deduce exactly that (any DFA
knows it implicitly), but Perl's engine is unfortunately not one of them.
In the debugging output, int, void, and while appear
on lines similar to [6] above, but those are local (subexpression)
requirements. For literal text cognizance, Perl needs global regex-wide
confidence, and as a general rule, it can't deduce fixed text from anything
that's part of alternation.
Many expressions, such as <CODE>(.*?)</CODE>, have more than one clump
of literal text. In these cases, Perl (somewhat magically) selects one or
two of the clumps and makes them available to the rest of the optimization
subroutines. The selected clump(s) are shown on a line similar to [9].
Line [9] can report a number of different things.
Some items you might see include:
start `clump'
    Indicates that any match must begin with clump.
must have "clump" back num
    Indicates required text like the start item above,
    but the clump is not at the start of
    the regex. If num is not -1, Perl knows any match must
    begin that many characters earlier. For example, with
    [Tt]ubby..., the report is
    `must have "ubby" back 1', meaning that if a fixed
    string check reveals ubby starting at such and such a location,
    the whole regex should be applied starting at the previous position.
    Conversely, with something like .*tubby, it's not helpful to know
    exactly where tubby might be in a string, since a match
    including it could start at any previous position, so num is
    -1.
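stclass `:kind'
    Indicates that Perl knows what kind of character must start any match.
    With \s+, kind is SPACE, while with
    \d+ it is DIGIT. With a character class like the
    [Tt]ubby example, kind is reported as ANYOF.
plus
    Indicates that the stclass or a single-character start
    item is governed by +, so not only can the first character
    discrimination find the start of potential matches, but it can also
    quickly zip past a leading \s+ and the like before letting the
    full (but slower) regex engine attempt the complete match.
anchored
    Indicates that the regex begins with a caret anchor.
implicit
    Indicates that Perl has added an implicit anchor because the regex begins with .*
    (=>158).
One further optimization relates to study (which is examined momentarily). Perl somewhat arbitrarily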
selects one character it considers ``rare'' from the selected
clump(s). Before a match, if a string has been study'd, Perl
knows immediately if that character exists anywhere in the string. If it
doesn't exist, no match is possible and the regex engine does not need to
get involved at all. It's a quick way to prune some impossible matches. The
character selected is reported at [2].
For the trivia-minded, Perl's idea of the rarest character is \000,
followed by \001, \013, \177, and \200. Some of
the rarest printable characters are ~, Q, Z, ?,
and @. The least-rare characters are e, space, and t.
(The manpage says that this was derived by examining a combination of C
programs and English text.)
study(...) optimizes
access to certain information about a string. A regex, or multiple
regexes, can then benefit from the cached knowledge when applied to the
string. What it does is simple, but understanding when it's a benefit or
not can be quite difficult. It has no effect whatsoever on any values or results of a program -- the only
effects are that Perl uses more memory, and that overall execution time
might increase, stay the same, or (here's the goal) decrease.
| 46 | Or, at least it shouldn't in theory. However, as of Perl version 5.003, there is a bug in which the use of study can cause successful matches to fail. This is discussed further at the end of this section. |
When you study a string, Perl takes some time and memory to build a
list of places in the string each character is found. On most systems, the
memory required is four times the size of the string (but is reused with
subsequent calls of study). study's benefit can be realized
with each subsequent regex match against the string, but only until the
string is modified. Any modification of the string renders the study
list invalid, as does studying a different string.
The regex engine itself never looks at the study list; only the
transmission references it. The transmission looks
at the start and must have debugging information mentioned on
page 286 to pick what it
considers a rare character (discussed above).
It picks a rare (yet required) character because it's not likely to be
found in the string, and a quick check of the study list that turns
up nothing means a match can be discounted immediately without having to
rescan the entire string.
If the rare character is found in the string, and if that character
must occur at a known position in any possible match (such as with ..this, but not .?this), the transmission
can use the study list to start matching from near the location.
This saves time by bypassing perhaps large portions of the
string.
| 47 | There's a bug in the current implementation which disables this
optimization when the regex begins with literal text. This is
unfortunate because such expressions have generally been thought to
benefit most from study. |
Don't use study when the target string is short. In such cases,
the normal fixed-string cognizance optimization should suffice.
Don't use study when you plan only a few matches against the target
string (or, at least, few before it is modified, or before you
study a different string). An overall speedup is more likely if
the time spent to study a string is amortized over many matches.
With just a few matches, the time spent scanning the string (to build the
study list) can overshadow any savings.
Note that with the current implementation, m/.../g is considered
one match: the study list is consulted only with the first
attempt. Actually, in the case of a scalar context m/.../g, it
is consulted with each match, but it reports the same location each
time -- for all but the first match, that location is before the
beginning of the match, so the check is just a waste of time.
Don't use study when Perl has no literal text cognizance
(=>286) for the regular expressions
that you intend to benefit from the study. Without a known
character that must appear in any match, study is useless.
study is best used when you have a large string you intend to match
many times before the string is modified. A good example is a filter I used
in preparing this book. I write in a home-grown markup that the filter
converts to SGML (which is then converted to troff, which is then
converted to PostScript). Within the filter, an entire chapter eventually
ends up within one huge string (this chapter is about 650 kilobytes).
Before exiting, I apply a bevy of checks to guard against mistaken markup
leaking through. These checks don't modify the string, and they often look
for fixed strings, so they're what study thrives on.
study has been hexed from the start. First, the
programming populace never seemed to understand it well. Then, a bug in
Perl versions 5.000 and 5.001 rendered study completely useless.
In recent versions, that's been fixed, but now there's a study bug
that can cause successful matches in $_ to fail (even matches that
have nothing to do with the string that was study'd). I discovered
this bug while investigating why my markup filter wasn't working, quite
coincidentally, just as I was writing this section on study. It was
a bit eerie, to say the least.
You can get around this bug with an explicit undef, or other
modification, of the study'd string (when you're done with it, of
course). The automatic assignment to $_ in while (<>)
is not sufficient.
When study can work, it often doesn't live up to its full potential,
either due to simple bugs or an implementation that hasn't matured as fast
as the rest of Perl. At this juncture, I recommend against the use of study unless you have a very specific situation you know benefits.
If you do use it, and the target string is in $_, be sure to undefine it when you are done.
Let's look again at the initial CSV problem. Here's my Perl5 solution, which, as you'll note, is fairly different from the original on page 205:
@fields = ();
push(@fields, $+) while $text =~ m{
"([^"\\]*(?:\\.[^"\\]*)*)",? # standard quoted string, with possible comma
| ([^,]+),? # anything else, with possible comma
| , # lone comma
}gx;
# add a final empty field if there's a trailing comma
push(@fields, undef) if substr($text,-1,1) eq ',';
Like the first version, it uses a scalar-context m/.../g with a while loop to iterate over the string. We want to
stay in synch, so we make sure that at least one of the alternatives
matches at any location a match could be started. We allow three types of
fields, which is reflected in the three alternatives of the main match.
Because Perl5 allows you to choose exactly which parentheses are capturing
and which aren't, we can ensure that after any match, $+ holds
the desired text of the field. For empty fields where the third alternative
matches and no capturing parentheses are used, $+ is guaranteed to be undefined, which is exactly what we want. (Remember undef is
different from an empty string -- returning these different values for
empty and "" fields retains the most information.)
The final push covers cases in which the string ends with a comma,
signifying a trailing empty field. You'll note that I don't use m/,$/ as I did earlier. I did so earlier because I was using it as
an example to show regular expressions, but there's really no need to use a
regex when a simpler, faster method exists.
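To try the solution on sample data, it can be wrapped in a subroutine. The following is a minimal sketch; the name split_csv and the sample line are my own illustrative choices, not part of the original solution.

```perl
use strict;
use warnings;

# Hypothetical wrapper: split one line of CSV text into fields.
# Empty fields come back as undef, as described above.
sub split_csv {
    my ($text) = @_;
    my @fields = ();
    push(@fields, $+) while $text =~ m{
          "([^"\\]*(?:\\.[^"\\]*)*)",?   # standard quoted string, with possible comma
        | ([^,]+),?                      # anything else, with possible comma
        | ,                              # lone comma
    }gx;
    # add a final empty field if there's a trailing comma
    push(@fields, undef) if substr($text, -1, 1) eq ',';
    return @fields;
}

my @fields = split_csv('a,"b,c",,d,');
print join('|', map { defined $_ ? "[$_]" : '(undef)' } @fields), "\n";
```

Printing undefined fields as (undef) makes the distinction between an empty field and a "" field visible while testing.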
Along with the CSV question, many other common tasks come up time and again in the Perl newsgroups, so I'd like to finish out this chapter by looking at a few of them.
To trim leading and trailing whitespace, the simple and obvious solution is:
s/^\s+//; s/\s+$//;
For some reason, it seems to be The Thing to try to find a way to do it all
in one shot, so I'll offer a few methods. I don't recommend them, but it's
educational to understand why they work, and why they're not desirable.
s/\s*(.*?)\s*$/$1/
    Before allowing each character to match, the *? must try to see whether what follows can
    match. That's a lot of backtracking, particularly since it's the
    kind that goes in and out of the parentheses
    (=>151).
s/^\s*(.*\S)?\s*$/$1/
    The leading ^\s* takes care of
    leading whitespace before the parentheses start capturing. Then the
    .* matches to the end of the line, with the \S causing
    backtracking past trailing whitespace to the final non-whitespace.
    If there's nothing but whitespace in the first place, the
    (.*\S)? fails (which is fine), and the final \s* zips to
    the end.
$_ = $1 if m/^\s*(.*\S)?/
    The same idea as the previous method, using a match and an assignment instead of a substitution.
s/^\s*|\s*$//g
    The /g modifier allows each alternative
    to match, but it seems a waste to use /g when we know we
    intend at most two matches, and each with a different
    subexpression. Fairly slow.
Depending on the data, s/^\s+//; s/\s+$// can take twice the
time of $_ = $1 if m/^\s*(.*\S)?/. Still, in my programs, I use
s/^\s+//; s/\s+$// because it's almost always fastest, and
certainly the easiest to understand.
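To confirm that the variants agree before timing them, each can be wrapped in a small subroutine. This is only a sketch: the subroutine names and the sample string are my own, and it assumes a single-line string that contains some non-whitespace.

```perl
use strict;
use warnings;

# Each hypothetical helper applies one of the trimming methods to a copy.
sub trim_two_step { my ($s) = @_; $s =~ s/^\s+//; $s =~ s/\s+$//;   return $s }
sub trim_lazy     { my ($s) = @_; $s =~ s/\s*(.*?)\s*$/$1/;         return $s }
sub trim_greedy   { my ($s) = @_; $s =~ s/^\s*(.*\S)?\s*$/$1/;      return $s }
sub trim_match    { my ($s) = @_; $s = $1 if $s =~ m/^\s*(.*\S)?/;  return $s }
sub trim_altern   { my ($s) = @_; $s =~ s/^\s*|\s*$//g;             return $s }

my $sample = "  \t two words \t ";
for my $trim (\&trim_two_step, \&trim_lazy, \&trim_greedy,
              \&trim_match, \&trim_altern) {
    print '[', $trim->($sample), "]\n";
}
```

Once the results match, the same subroutines can be dropped into a timing loop to compare speeds on your own data.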
People often ask how to print a number with commas, as in 12,345,678.
The FAQ currently gives
1 while s/^(-?\d+)(\d{3})/$1,$2/;
which repeatedly scans the leading run of digits with \d+,
backtracks three digits so \d{3} can match, and finally inserts a
comma via the replacement text `$1,$2'. Because it works primarily
``from the right'' instead of the normal left, it is
useless to apply with /g. Thus, multiple passes to add multiple commas
are achieved using a while loop.
You can enhance this solution by using a common optimization from
Chapter 5 (=>156), replacing \d{3}
with \d\d\d. Why bother making the regex engine count the occurrences
when you can just as easily say exactly what you want? This one change
saved a whopping three percent in my tests. (A penny saved...)
Another enhancement is to remove the start-of-string anchor. This allows
you to comma-ify a number (or numbers) within a larger string. As a
byproduct, you can then safely remove the -?, since it exists only to
tie the first digit to the anchor. This change could be dangerous if you
don't know the target data, since 3.14159265 becomes 3.14,159,265. In any case, if you
know the number is the string by itself, the anchored version is better.
A completely different, but almost-the-same approach I've come up with
uses a single /g-governed substitution:
s<
(\d{1,3}) # before a comma: one to three digits
(?= # followed by, but not part of what's matched...
(?:\d\d\d)+ # some number of triplets...
(?!\d) # ...not followed by another digit
) # (in other words, which ends the number)
><$1,>gx;
Since it isn't anchored to the start of the string, it shares the problem with 3.14159265. To
take care of that, and to bring it in line with the FAQ solution for
all strings, change the (\d{1,3}) to
\G((?:^-)?\d{1,3}). The \G anchors
the overall match to the start of the string, and anchors each subsequent /g-induced match to the previous one. The (?:^-)? allows a
leading minus sign at the start of the string, just as the FAQ solution
does. With these changes, it slows down a tad, but my tests show it's still
over 30 percent faster than the FAQ solution.
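The two approaches can be compared side by side. The sketch below wraps the FAQ loop and the \G-anchored substitution in subroutines; the names commify_faq and commify_single_pass are mine, chosen only for illustration.

```perl
use strict;
use warnings;

# The FAQ approach: one comma per pass, working from the right.
sub commify_faq {
    my ($text) = @_;
    1 while $text =~ s/^(-?\d+)(\d{3})/$1,$2/;
    return $text;
}

# The single /g-governed substitution, with the \G change applied.
sub commify_single_pass {
    my ($text) = @_;
    $text =~ s<
        \G((?:^-)?\d{1,3})   # optional leading minus, then one to three digits
        (?=                  # followed by, but not part of what's matched...
            (?:\d\d\d)+      #   some number of triplets...
            (?!\d)           #   ...not followed by another digit
        )
    ><$1,>gx;
    return $text;
}

for my $n ('12345678', '-1234567', '123', '3.14159265') {
    printf "%-12s %-14s %s\n", $n, commify_faq($n), commify_single_pass($n);
}
```

Both leave 3.14159265 untouched, since neither can match at the start of that string.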
Earlier, we developed the comment-matching regex /\*[^*]*\*+([^/*][^*]*\*+)*/, and a Tcl
program to remove comments. Let's express it in Perl.
Chapter 5 dealt with generic NFA engines, so our comment-matching regex
works fine in Perl. For extra efficiency, I'd use non-capturing
parentheses, but that's about the only direct change I'd make. It's not
unreasonable to use the FAQ's simpler /\*.*?\*/ -- Chapter 5's solution leads the engine to a match
more efficiently, but /\*.*?\*/ is fine for applications that aren't
time critical. It's certainly easier to understand at first glance, so I'll
use it to simplify the first draft of our comment-stripping regex.
Here it is:
s{
# First, we'll list things we want to match, but not throw away
(
" (?:\\.|[^"\\])* " # doublequoted string.
| # -or-
' (?:\\.|[^'\\])* ' # singlequoted constant
)
| # OR...
# ...we'll match a comment. Since it's not in the $1 parentheses above,
# the comments will disappear when we use $1 as the replacement text.
/\* .*? \*/ # Traditional C comments.
| # -or-
//[^\n]* # C++ //-style comments
}{$1}gsx;
After applying the changes discussed during the Tcl treatment, and
combining the two comment regexes into one top-level alternative (which is
easy since we're writing the regex directly and not building up from
separate $COMMENT and $COMMENT1 components), our Perl version becomes:
s{
# First, we'll list things we want to match, but not throw away
(
[^"'/]+ # other stuff
| # -or-
(?:"[^"\\]*(?:\\.[^"\\]*)*" [^"'/]*)+ # doublequoted string.
| # -or-
(?:'[^'\\]*(?:\\.[^'\\]*)*' [^"'/]*)+ # singlequoted constant
)
| # OR...
# ...we'll match a comment. Since it's not in the $1 parentheses above,
# the comments will disappear when we use $1 as the replacement text.
/ (?: # (all comments start with a slash)
\*[^*]*\*+(?:[^/*][^*]*\*+)*/ # Traditional C comments.
| # -or-
/[^\n]* # C++ //-style comments
)
}{$1}gsx;
To make a full program out of this, just insert it into:
undef $/;            # Slurp-whole-file mode
$_ = join('', <>);   # The join(...) can handle multiple files.
... insert the substitute command from above ...
print;
Yup, that's the whole program.
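For experimenting without files, the first-draft substitution can be applied to a string. This is a sketch with two changes of my own: the subroutine name strip_comments is hypothetical, and the replacement uses /e with a defined check (instead of a bare $1) only so that it runs cleanly under warnings when a comment, rather than a string, was matched.

```perl
use strict;
use warnings;

# Hypothetical helper: remove C and C++ comments, leaving strings intact.
sub strip_comments {
    my ($code) = @_;
    $code =~ s{
        (
            " (?:\\.|[^"\\])* "    # doublequoted string
          |
            ' (?:\\.|[^'\\])* '    # singlequoted constant
        )
      |
        /\* .*? \*/                # traditional C comment
      |
        //[^\n]*                   # C++ //-style comment
    }{ defined $1 ? $1 : '' }gsxe;
    return $code;
}

my $src = qq{int x = 1; /* one */ char *s = "/* kept */"; // tail\n};
print strip_comments($src);
```

The comment-like text inside the doublequoted string survives because the string alternative gets the first shot at that position.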
| 48 | Internet RFC 822.
Available at: ftp://ftp.rfc-editor.org/in-notes/rfc822.txt |
Still, it's not for the faint at heart. In fact, the regex we'll come up
with is 4,724 bytes long! At first thought, you might think something
as simple as \w+\@[.\w]+ could work, but it is much more complex.
Something like
Jeffy <"That Tall Guy"@ora.com (this address no longer active)>
is perfectly valid as far as the specification is concerned.
| 49 | It is certainly not valid in the sense that mail sent there will bounce for want of an active username, but that's an entirely different issue. |
| 50 | The program we develop in this section is available on my home page -- see Appendix A. |
| Item | Name | Description |
|---|---|---|
| 1 | mailbox | addr-spec \| phrase route-addr |
| 2 | addr-spec | local-part @ domain |
| 3 | phrase | ( word )+ |
| 4 | route-addr | < ( route )? addr-spec > |
| 5 | local-part | word ( . word )* |
| 6 | domain | sub-domain ( . sub-domain )* |
| 7 | word | atom \| quoted-string |
| 8 | route | @ domain ( , @ domain )* : |
| 9 | sub-domain | domain-ref \| domain-literal |
| 10 | atom | ( any char except specials, space and ctls )+ |
| 11 | quoted-string | " ( qtext \| quoted-pair )* " |
| 12 | domain-ref | atom |
| 13 | domain-literal | [ ( dtext \| quoted-pair )* ] |
| 14 | char | any ASCII character (000-177 octal) |
| 15 | ctl | any ASCII control (000-037 octal) |
| 16 | space | ASCII space (040 octal) |
| 17 | CR | ASCII carriage return (015 octal) |
| 18 | specials | any of the characters: ()<>@,;:\".[] |
| 19 | qtext | any char except ", \ and CR |
| 20 | dtext | any char except [, ], \ and CR |
| 21 | quoted-pair | \ char |
| 22 | comment | ( ( ctext \| quoted-pair \| comment )* ) |
| 23 | ctext | any char except (, ), \ and CR |
Taking ^\w+\@[.\w]+$ as an example, you might naïvely render that as
$username = "\w+";
$hostname = "\w+(\.\w+)+";
$email = "^$username\@$hostname$";
   :
... m/$email/o ...
But doublequoted-string processing strips the backslashes, so the regex in
$email sees `^w+@w+(.w+)+$' with Perl4, and can't even
compile with Perl5 because of the trailing dollar sign. Either the escapes
need to be escaped so they'll be preserved through to the regex, or a
singlequoted string must be used. Singlequoted strings are not applicable
in all situations, such as in the third line where we really do need the
variable interpolation provided by a doublequoted string:
$username = '\w+';
$hostname = '\w+(\.\w+)+';
$email = "^$username\@$hostname\$";
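Printing the assembled string is a quick way to see what the regex engine will actually receive. The sketch below assumes a modern Perl with strict and warnings; the $naive variable is my own addition, included only to show the backslash being lost.

```perl
use strict;
use warnings;

# In a doublequoted string the backslash of \w is dropped
# (the warning about the unrecognized escape is silenced here on purpose).
my $naive = do { no warnings 'misc'; "\w+" };

my $username = '\w+';
my $hostname = '\w+(\.\w+)+';
my $email    = "^$username\@$hostname\$";

print "naive: $naive\n";
print "email: $email\n";
print "match\n" if 'jfriedl@ora.com' =~ m/$email/;
```

Seeing the string on the screen makes a lost backslash obvious long before a match mysteriously fails.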
Let's start building the real regex by looking at item 16 in Table 7-11. The simple $space = "·" isn't
good because if we use the /x modifier when we apply the
regex (something we plan to do), spaces outside of character classes, such
as this one, will disappear. We can also represent a space in the regex
with \040 (octal 40 is the ASCII code for the space
character), so we might be tempted to assign "\040" to $space.
This would be a silent mistake because, when the doublequoted string is
evaluated, \040 is turned into a space. This is what the regex will
see, so we're right back where we started. We want the regex to see \040 and turn it into a space itself, so again, we must use "\\040" or '\040'.
Getting a match for a literal backslash into the regex is particularly
hairy because it's also the regex escape metacharacter. The regex
requires \\ to match a single literal backslash. To assign it to, say,
$esc, we'd like to use '\\', but because \\ is special
even within singlequoted strings,
we need $esc = '\\\\' just to have the final regex match a single
backslash. This backslashitis is why I make $esc once and then use it
wherever I need a literal backslash in the regex. We'll use it a few times
as we construct our address regex. Here are the preparatory variables I'll
use this way:
# Some things for avoiding backslashitis later on.
$esc        = '\\\\';
$Period     = '\.';
$space      = '\040';
$tab        = '\t';
$OpenBR     = '\[';
$CloseBR    = '\]';
$OpenParen  = '\(';
$CloseParen = '\)';
$NonASCII   = '\x80-\xff';
$ctrl       = '\000-\037';
$CRlist     = '\n\015';   # note: this should really be only \015.
| 51 | Within Perl singlequoted strings, \\ and the escaped closing
delimiter (usually \') are special. Other escapes are passed
through untouched, which is why \040 results in \040. |
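Checking string lengths is one way to keep the levels of interpretation straight. In this sketch, the variable names $space_sq and $space_dq are mine; they hold the singlequoted and doublequoted versions of the space for comparison.

```perl
use strict;
use warnings;

my $esc      = '\\\\';   # four characters in the source, two in the string
my $space_sq = '\040';   # four characters: the regex engine does the conversion
my $space_dq = "\040";   # one character: already a real space

printf "esc=%d space_sq=%d space_dq=%d\n",
    length($esc), length($space_sq), length($space_dq);

# Under /x, only the singlequoted version still matches a space.
print "singlequoted matches a space\n" if ' ' =~ m/\A $space_sq \z/x;
print "doublequoted does not\n"    unless ' ' =~ m/\A $space_dq \z/x;
```

The doublequoted version fails silently, which is exactly why the mistake is easy to miss.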
The $CRlist requires special mention. The specification indicates
only the ASCII carriage return (octal 015). From a practical point
of view, this regex is likely to be applied to text that has already been
converted to the system-native newline format where \n represents
the carriage return. This may or may not be the same
as an ASCII carriage return. (It usually is, for example, on MacOS, but
not with Unix; =>72.) So I
(perhaps arbitrarily) decided to consider both.
# Items 19, 20, 21
$qtext = qq/[^$esc$NonASCII$CRlist"]/;                 # for within "..."
$dtext = qq/[^$esc$NonASCII$CRlist$OpenBR$CloseBR]/;   # for within [...]
$quoted_pair = qq< $esc [^$NonASCII] >;                # an escaped character

# Item 10: atom
$atom_char = qq/[^($space)<>\@,;:".$esc$OpenBR$CloseBR$ctrl$NonASCII]/;
$atom = qq<
    $atom_char+       # some number of atom characters...
    (?!$atom_char)    # ..not followed by something that could be part of an atom
>;
The last item, $atom, might need some explanation. By itself, $atom need be only $atom_char+, but look ahead to Table 7-11's item 3, phrase. The combination yields ($atom_char+)+, a lovely
example of one of those neverending-match patterns
(=>144). Building a regex in a variable is prone to
this kind of hidden danger because you can't normally see everything at
once. This visualization problem is why I used
$NonASCII = '\x80-\xff' above. I could have used
"\x80-\xff", but I wanted to be able to print the partial regex at
any time during testing. In the latter case, the regex holds the raw
bytes -- fine for the regex engine, but not for our display if we
print the regex while debugging.
Getting back to ($atom_char+)+, to help delimit the inner-loop
single atom, I can't use \b because Perl's idea of a word is
completely different
from an email address atom. For example, `--genki--' is a valid
atom that doesn't match \b$atom_char+\b. Thus, to ensure that
backtracking doesn't try to claim an atom that ends in the middle of
what it should match, I use (?!...) to make sure that $atom_char can't match just after the atom's end. (This is a
situation where I'd really like the possessive quantifiers that I pined
for in the footnote on page 111.)
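The difference between \b and the lookahead can be checked directly. This is a small sketch that repeats only the definitions $atom needs; the test strings are my own.

```perl
use strict;
use warnings;

my $esc      = '\\\\';
my $space    = '\040';
my $OpenBR   = '\[';
my $CloseBR  = '\]';
my $NonASCII = '\x80-\xff';
my $ctrl     = '\000-\037';

my $atom_char = qq/[^($space)<>\@,;:".$esc$OpenBR$CloseBR$ctrl$NonASCII]/;
my $atom = qq<
    $atom_char+       # some number of atom characters...
    (?!$atom_char)    # ...not followed by something that could be part of an atom
>;

for my $try ('--genki--', 'jfriedl', 'two words') {
    my $as_atom  = $try =~ m/\A $atom \z/x               ? 'yes' : 'no';
    my $with_wb  = $try =~ m/\A \b $atom_char+ \b \z/x   ? 'yes' : 'no';
    print "$try: atom=$as_atom word-boundary=$with_wb\n";
}
```

The word-boundary version rejects --genki-- because neither end of it sits next to a word character.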
Even though these are doublequoted strings and not regular
expressions, I use free spacing and comments (except within the character
classes) because these strings will eventually be used with an
/x-governed regex. But I do take particular care to ensure that each
comment ends with a newline, as I don't want to run into the overzealous
comment problem (=>223).
I've written $comment to allow for one level of
internal nesting:
# Items 22 and 23, comment.
# Impossible to do properly with a regex, I make do by allowing at most one level of nesting.
$ctext   = qq< [^$esc$NonASCII$CRlist()] >;
$Cnested = qq< $OpenParen (?: $ctext | $quoted_pair )* $CloseParen >;
$comment = qq< $OpenParen
                 (?: $ctext | $quoted_pair | $Cnested )*
               $CloseParen >;
$sep = qq< (?: [$space$tab] | $comment )+ >;   # required separator
$X   = qq< (?: [$space$tab] | $comment )* >;   # optional separator
You'll not find comment, item 22, elsewhere in the table. What the
table doesn't show is that the specification allows comments, spaces, and
tabs to appear freely between most tokens. Thus, we create $X for
optional spaces and comments, $sep for required ones.
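The one-level nesting limit is easy to see by trying a few comments. The following sketch repeats just the definitions $comment needs; the test strings are my own.

```perl
use strict;
use warnings;

my $esc        = '\\\\';
my $OpenParen  = '\(';
my $CloseParen = '\)';
my $NonASCII   = '\x80-\xff';
my $CRlist     = '\n\015';

my $quoted_pair = qq< $esc [^$NonASCII] >;
my $ctext   = qq< [^$esc$NonASCII$CRlist()] >;
my $Cnested = qq< $OpenParen (?: $ctext | $quoted_pair )* $CloseParen >;
my $comment = qq< $OpenParen
                    (?: $ctext | $quoted_pair | $Cnested )*
                  $CloseParen >;

for my $try ('(this address no longer active)',
             '(one (nested) level)',
             '(two ((deep)) levels)') {
    printf "%-35s %s\n", $try,
        ($try =~ m/\A $comment \z/x ? 'matches' : 'does not match');
}
```

Supporting a deeper level would mean adding another variable in the same style as $Cnested.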
It's important to use $X
where required, but for efficiency's sake, no more often than necessary.
The method I use is to provide $X only between elements within a
single subexpression. Most of the remaining items are shown here:
# Item 11: doublequoted string, with escaped items allowed
$quoted_str = qq<
    " (?:                 # opening quote...
         $qtext           #   Anything except backslash and quote
       |                  #    or
         $quoted_pair     #   Escaped something (something != CR)
      )* "                # closing quote
>;

# Item 7: word is an atom or quoted string
$word = qq< (?: $atom | $quoted_str ) >;

# Item 12: domain-ref is just an atom
$domain_ref = $atom;

# Item 13 domain-literal is like a quoted string, but [...] instead of "..."
$domain_lit = qq< $OpenBR                        # [
                  (?: $dtext | $quoted_pair )*   #    stuff
                  $CloseBR                       #           ]
>;

# Item 9: sub-domain is a domain-ref or domain-literal
$sub_domain = qq< (?: $domain_ref | $domain_lit ) >;

# Item 6: domain is a list of subdomains separated by dots.
$domain = qq< $sub_domain          # initial subdomain
              (?:                  #
                 $X $Period        # if led by a period...
                 $X $sub_domain    #   ...further okay
              )*
>;

# Item 8: a route. A bunch of "@ $domain" separated by commas, followed by a colon
$route = qq< \@ $X $domain
             (?: $X , $X \@ $X $domain )*   # further okay, if led by comma
             :                              # closing colon
>;

# Item 5: local-part is a bunch of $word separated by periods
$local_part = qq< $word                         # initial word
                  (?: $X $Period $X $word )*    # further okay, if led by a period
>;

# Item 2: addr-spec is local@domain
$addr_spec = qq< $local_part $X \@ $X $domain >;

# Item 4: route-addr is <route? addr-spec>
$route_addr = qq[ < $X                 # leading <
                    (?: $route $X )?   #   optional route
                    $addr_spec         #   address spec
                  $X >                 # trailing >
];
phrase poses some difficulty. According to Table 7-11, it is one or more word, but we can't
use (?:$word)+ because we need to allow $sep between
items. We can't use (?:$word|$sep)+, as that doesn't
require a $word, but merely allows one. So, we might be
tempted to try $word(?:$word|$sep)*, and this is where we
really need to keep our wits about us. Recall how we constructed
$sep. The non-comment part is effectively [$space$tab]+,
and wrapping this in the new
(...)* smacks of a neverending match
(=>166). The $atom within $word would also be suspect except for the (?!...) we took care
to tack on to checkpoint the match. We could try the same with $sep,
but I've a better idea.
Four things are allowed in a phrase: quoted strings, atoms, spaces,
and comments. Atoms are just sequences of $atom_char -- if
these sequences are broken by spaces, it means only that there are multiple
atoms in the sequence. We don't need to identify individual atoms, but only
the extent of the entire sequence, so we can just use something like:
$word (?: [$atom_char$space$tab] | $quoted_string | $comment )+
We can't actually use that character class because $atom_char is
already a class itself, so we need to construct a new one from scratch,
mimicking $atom_char, but removing the space and tab
(removing from the list of a negated class includes them in what
the class can match):
# Item 3: phrase
$phrase_ctrl = '\000-\010\012-\037';   # like ctrl, but without tab

# Like atom-char, but without listing space, and uses phrase_ctrl.
# Since the class is negated, this matches the same as atom-char plus space and tab
$phrase_char = qq/[^()<>\@,;:".$esc$OpenBR$CloseBR$NonASCII$phrase_ctrl]/;
$phrase = qq< $word                   # one word, optionally followed by....
              (?:
                  $phrase_char |      # atom and space parts, or...
                  $comment     |      # comments, or...
                  $quoted_str         # quoted strings
              )*
>;
Because $phrase can itself match trailing spaces and comments, there's no need to insert
$X after any use of $phrase.
That leaves the top-level item, mailbox:
# Item #1: mailbox is an addr_spec or a phrase/route_addr
$mailbox = qq< $X                         # optional leading comment
               (?: $addr_spec             # address
                 |                        #  or
                   $phrase $route_addr    # name and address
               ) $X                       # optional trailing comment
>;
Well, we can now use this like:
die "invalid address [$addr]\n" if $addr !~ m/^$mailbox$/xo;
(With a regex like this, I strongly suggest not forgetting
the /o modifier.)
| 52 | From the ``don't try this at home, kids'' department: During
initial testing, I was stumped to find that the optimized version
(presented momentarily) was consistently slower than the normal
version. I was really dumbfounded until I realized that I'd
forgotten /o! This caused the entire huge regex
operand to be reprocessed for each match
(=>268). The optimized expression turned out
to be considerably longer, so the extra processing time completely
overshadowed any regex efficiency benefits. Using /o not only
revealed that the optimized version was faster, but caused the
whole test to finish an order of magnitude quicker. |
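For reference, here is the whole construction gathered into one runnable sketch, with the comments and $sep omitted for brevity. The subroutine name is_valid_mailbox and the test addresses other than the That Tall Guy example are my own; the definitions are the unoptimized ones developed above.

```perl
use strict;
use warnings;

my $esc = '\\\\';    my $Period = '\.';    my $space = '\040';   my $tab = '\t';
my $OpenBR = '\[';   my $CloseBR = '\]';   my $OpenParen = '\('; my $CloseParen = '\)';
my $NonASCII = '\x80-\xff';   my $ctrl = '\000-\037';   my $CRlist = '\n\015';

my $qtext       = qq/[^$esc$NonASCII$CRlist"]/;
my $dtext       = qq/[^$esc$NonASCII$CRlist$OpenBR$CloseBR]/;
my $quoted_pair = qq< $esc [^$NonASCII] >;
my $atom_char   = qq/[^($space)<>\@,;:".$esc$OpenBR$CloseBR$ctrl$NonASCII]/;
my $atom        = qq< $atom_char+ (?!$atom_char) >;
my $ctext       = qq< [^$esc$NonASCII$CRlist()] >;
my $Cnested     = qq< $OpenParen (?: $ctext | $quoted_pair )* $CloseParen >;
my $comment     = qq< $OpenParen (?: $ctext | $quoted_pair | $Cnested )* $CloseParen >;
my $X           = qq< (?: [$space$tab] | $comment )* >;
my $quoted_str  = qq< " (?: $qtext | $quoted_pair )* " >;
my $word        = qq< (?: $atom | $quoted_str ) >;
my $domain_ref  = $atom;
my $domain_lit  = qq< $OpenBR (?: $dtext | $quoted_pair )* $CloseBR >;
my $sub_domain  = qq< (?: $domain_ref | $domain_lit ) >;
my $domain      = qq< $sub_domain (?: $X $Period $X $sub_domain )* >;
my $route       = qq< \@ $X $domain (?: $X , $X \@ $X $domain )* : >;
my $local_part  = qq< $word (?: $X $Period $X $word )* >;
my $addr_spec   = qq< $local_part $X \@ $X $domain >;
my $route_addr  = qq[ < $X (?: $route $X )? $addr_spec $X > ];
my $phrase_ctrl = '\000-\010\012-\037';
my $phrase_char = qq/[^()<>\@,;:".$esc$OpenBR$CloseBR$NonASCII$phrase_ctrl]/;
my $phrase      = qq< $word (?: $phrase_char | $comment | $quoted_str )* >;
my $mailbox     = qq< $X (?: $addr_spec | $phrase $route_addr ) $X >;

# Hypothetical helper: true if the address is lexically valid.
sub is_valid_mailbox {
    my ($addr) = @_;
    return $addr =~ m/^$mailbox$/xo ? 1 : 0;
}

for my $addr ('jfriedl@ora.com',
              'Jeffy <"That Tall Guy"@ora.com (this address no longer active)>',
              'jfriedl') {
    printf "%-7s %s\n", (is_valid_mailbox($addr) ? 'valid' : 'invalid'), $addr;
}
```

Printing $mailbox at this point is a convenient way to inspect the assembled regex while debugging.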
It might be interesting to look at the final regex, the contents of $mailbox. After removing comments and spaces and breaking it into
lines for printing, here are the first few out of 60 or so lines:
(?:[\040\t]|\((?:[^\\\x80-\xff\n\015()]|\\[^\x80-\xff]|\((?:[^\\\x80-\xff\n\015(
)]|\\[^\x80-\xff])*\))*\))*(?:(?:[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+(?![^
(\040)<>@,;:".\\\[\]\000-\037\x80-\xff])|"(?:[^\\\x80-\xff\n\015"]|\\[^\x80-\xff
])*")(?:(?:[\040\t]|\((?:[^\\\x80-\xff\n\015()]|\\[^\x80-\xff]|\((?:[^\\\x80-\xf
f\n\015()]|\\[^\x80-\xff])*\))*\))*\.(?:[\040\t]|\((?:[^\\\x80-\xff\n\015()]|\\[
^\x80-\xff]|\((?:[^\\\x80-\xff\n\015()]|\\[^\x80-\xff])*\))*\))*(?:[^(\040)<>@,;
:".\\\[\]\000-\037\x80-\xff]+(?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff])|"(?:[
Wow. At first you might think that such a gigantic regex could not possibly be efficient, but the size of a regex has little to do with its efficiency. More at stake is how much backtracking it has to do. Are there places with lots of alternation? Neverending-match patterns and the like? Like an efficient set of ushers at the local 20-screen theater complex, a huge regex can still guide the engine to a fast match, or to a fast failure, as the case may be.
jfriedl by itself is a perfectly
valid email address, but is not an Internet email address. (This is not a
problem with the regex, but with its use.) Also, an address might be
lexically valid but might not actually point anywhere, as with the earlier That Tall Guy example. A step toward eliminating some of these is
to require a domain to end in a two- or three-character subdomain (such as
.com or .jp). This could be as simple as appending
$Period $atom_char {2,3} to $domain, or more strictly with
something like:
$Period (?: com | edu | gov | ... | ca | de | jp | u[sk] ... )
When it comes down to it, there is absolutely no way to ensure a
particular address actually reaches someone. Period. Sending a test message
is a good indicator if someone happens to reply. Including a Return-Receipt-To header in the message is also useful, as it has
the remote system generate a short response to the effect that your message
has arrived to the target mailbox.
As an example of unrolling the loop, $quoted_str might become:
$quoted_str = qq< "                                # opening quote
                    $qtext *                       #   leading normal
                    (?: $quoted_pair $qtext * )*   #   ( special normal* )*
                  "                                # closing quote
>;
Similarly, $phrase might become:
$phrase = qq< $word                             # leading word
              $phrase_char *                    # "normal" atoms and/or spaces
              (?:
                 (?: $comment | $quoted_str )   # "special" comment or quoted string
                 $phrase_char *                 #  more "normal"
              )*
>;
Items such as $Cnested, $comment, $phrase, $domain_lit, and $X can be optimized similarly, but be
careful -- some can be tricky. For example, consider $sep from
the section on comments. It requires at least one match, but using the
normal unrolling-the-loop technique creates a regex that doesn't require a match.
Talking in terms of the general unrolling-the-loop pattern
(=>164), if you wish to require
special, you can change the outer (...)* to (...)+, but
that's not what $sep needs. It needs to require something, but that
something can be either special or normal.
It's easy to create an unrolled expression that requires one or the other in particular, but to require either we need to take a dual-pronged approach:
$sep = qq< (?:
[$space$tab]+ # for when space is first
(?: $comment [$space$tab]* )*
|
(?: $comment [$space$tab]* )+ # for when comment is first
)
>;
This contains two modified versions of the normal* (special normal*)* pattern,
where the class to match spaces is normal, and $comment is
special. The first requires spaces, then allows comments and spaces.
The second requires a comment, then allows spaces. For this last
alternative, you might be tempted to consider $comment to be
normal and come up with:
$comment (?: [$space$tab]+ $comment )*
As it turns out, though, none of this is needed, since $sep isn't
used in the final regex; it appeared only in the early attempt of $phrase. I kept it alive this long because this is a common variation
on the unrolling-the-loop pattern, and the discussion of its touchy
optimization needs is valuable.
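A quick check that the dual-pronged $sep accepts the same strings as the simple version can catch slips in this kind of rewrite. In the sketch below, the names $sep_simple and $sep_unrolled and the test strings are my own; the supporting definitions are repeated so that it runs by itself.

```perl
use strict;
use warnings;

my $esc = '\\\\';   my $space = '\040';   my $tab = '\t';
my $OpenParen = '\(';   my $CloseParen = '\)';
my $NonASCII = '\x80-\xff';   my $CRlist = '\n\015';

my $quoted_pair = qq< $esc [^$NonASCII] >;
my $ctext   = qq< [^$esc$NonASCII$CRlist()] >;
my $Cnested = qq< $OpenParen (?: $ctext | $quoted_pair )* $CloseParen >;
my $comment = qq< $OpenParen (?: $ctext | $quoted_pair | $Cnested )* $CloseParen >;

my $sep_simple = qq< (?: [$space$tab] | $comment )+ >;
my $sep_unrolled = qq< (?:
        [$space$tab]+                     # for when space is first
        (?: $comment [$space$tab]* )*
      |
        (?: $comment [$space$tab]* )+     # for when comment is first
    )
>;

for my $try (' ', "\t(c) ", '(c)(d)', '(c) (d)  ', '', 'x') {
    my $simple   = $try =~ m/\A $sep_simple \z/x   ? 1 : 0;
    my $unrolled = $try =~ m/\A $sep_unrolled \z/x ? 1 : 0;
    print "[$try] simple=$simple unrolled=$unrolled\n";
}
```

The empty string is the important case: both versions must reject it, since a separator is required.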
Another kind of optimization centers on the use of $X. Examine how
the $route part of our regex matches `@·gateway·:'. You'll
find times where an optional part fails, but only after one or more
internal $X match. Recall our definitions for $domain and $route:
# Item 6: domain is a list of subdomains separated by dots.
$domain = qq< $sub_domain          # initial subdomain
              (?:                  #
                 $X $Period        # if led by a period...
                 $X $sub_domain    #   ...further okay
              )*
>;

# Item 8: a route. A bunch of "@ $domain" separated by commas, followed by a colon
$route = qq< \@ $X $domain
             (?: $X , $X \@ $X $domain )*   # further okay, if led by comma
             :                              # closing colon
>;
Once $route matches the initial `@' of `@·gateway·:' and the first $sub_domain of $domain matches `gateway', the regex
checks for a period and another $sub_domain (allowing $X at
each juncture). In making the first attempt at this
$X $Period $X $sub_domain subexpression, the initial $X
matches the space after `gateway' in `@·gateway·:', but the
subexpression fails trying to match a period. This causes backtracking out
of the enclosing parentheses, and the instance of $domain finishes.
Back in $route, after the first $domain is finished, it then
tries to match another if separated by a comma. Inside the
$X , $X \@... subexpression, the initial $X matches the
same space that had been matched (and unmatched) earlier. It, too, fails
just after.
It seems wasteful to spend the time matching $X when the
subexpression ends up failing. Since $X can match almost anywhere,
it's more efficient to have it match only when we know the associated
subexpression can no longer fail.
Consider the following, whose only changes are the placement of $X:
$domain = qq<
$sub_domain $X
(?:
$Period $X $sub_domain $X
)*
>;
$route = qq<
\@ $X $domain
(?: , $X \@ $X $domain )*
: $X
>;
Here, the guiding principle changes from ``use $X only between
elements within a subexpression'' to ``ensure a subexpression consumes
any trailing $X.'' This kind of change has a ripple effect on where
$X appears in many of the expressions.
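A quick check that the rearranged placement still tolerates whitespace at every juncture can be sketched as follows. The definitions of $X, $Period, and $sub_domain here are simplified stand-ins chosen for the sketch, and is_route is a hypothetical helper.

```perl
use strict;
use warnings;

# Stand-ins (assumptions): the real $X also allows comments.
my $X          = '[\040\t]*';
my $Period     = '\.';
my $sub_domain = '[A-Za-z0-9-]+';

# Each subexpression consumes its own trailing $X.
my $domain = qq<
    $sub_domain $X
    (?:
        $Period $X $sub_domain $X
    )*
>;
my $route = qq<
    \@ $X $domain
    (?: , $X \@ $X $domain )*
    : $X
>;

# Hypothetical helper: returns 1 if $text is exactly one route, else 0.
sub is_route {
    my ($text) = @_;
    return $text =~ m/\A $route \z/x ? 1 : 0;
}

for my $text ('@gateway:', '@ gateway :', '@ a . b , @ c : ', '@gate way:') {
    printf "%-20s %s\n", "'$text'", is_route($text) ? 'route' : 'not a route';
}
```

With this arrangement, $X is attempted only after the element it trails has already matched, so a failing period or comma test is reached without first consuming and then giving back whitespace.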
After applying all these changes, the resulting expression is almost 50
percent longer (these lengths are after comments and free spacing are
removed), but
executed 9-19 percent faster with my benchmarks (the 9 percent being for
tests that primarily failed, 19 percent for tests that primarily matched).
Again, using /o is very important. The final version of this regex is
in Appendix B.
$word = qq< $atom | $quoted_str >;
$local_part = qq< $word (?: $X $Period $X $word) * >;
(?:...) when possible. If you don't easily
recognize the mistake in the snippet above, consider what $line
becomes in:
$field = "Subject|From|Date";
$line = "^$field: (.*)";
^Subject|From|Date:·(.*) is very different from, and certainly not
as useful as, ^(Subject|From|Date):·(.*).
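The difference is easy to see by running both forms against a few lines. This is a small sketch: the sample lines are invented, try_header is a hypothetical helper, and (?:...) is used for the grouping so that the value stays in $1.

```perl
use strict;
use warnings;

my $field = "Subject|From|Date";
my $loose = "^$field: (.*)";        # alternation splits the whole regex in three
my $tight = "^(?:$field): (.*)";    # alternation confined to the field names

# Hypothetical helper: describes what a regex does with a line.
sub try_header {
    my ($regex, $line) = @_;
    return 'no match' unless $line =~ m/$regex/;
    return defined $1 ? "value '$1'" : 'match, but $1 is undefined';
}

for my $line ('Subject: hi', 'Mail From nobody') {
    print "$line\n";
    print "  loose: ", try_header($loose, $line), "\n";
    print "  tight: ", try_header($tight, $line), "\n";
}
```

The loose form matches `Subject: hi' through its first alternative, ^Subject, and so never fills $1. It also matches any line that merely contains `From'.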
/x does not affect character classes,
so free spacing and comments can't be used within them. #-style comments continue until newline or the end of the regex.
Consider:
$domain_ref = qq< $atom # just a simple atom >;
$sub_domain = qq< (?: $domain_ref | $domain_lit ) >;
Just because the variable $domain_ref ends, it doesn't mean that
the comment that's inserted into the regex from it ends. The comment
continues until the end of the regex or until a newline, so here the
comment extends past the end of $domain_ref to consume the
alternation and the rest of $sub_domain, and anything that follows
$sub_domain wherever it is used, until the end of the regex or a
newline. This can be averted by adding the newline manually
(=>223):
$domain_ref = qq< $atom # just a simple atom\n >;
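The runaway comment can be observed directly. In this sketch, $atom and $domain_lit are simplified stand-ins and compiles is a hypothetical helper. When the comment swallows the rest of $sub_domain, the closing parenthesis disappears with it and the regex no longer compiles.

```perl
use strict;
use warnings;

# Stand-ins (assumptions for this sketch only).
my $atom       = '[a-z]+';
my $domain_lit = '\[[^\]]*\]';

my $ref_unterminated = qq< $atom # just a simple atom >;
my $ref_terminated   = qq< $atom # just a simple atom\n >;

# Hypothetical helper: builds $sub_domain from the given $domain_ref and
# reports whether the result compiles under /x.
sub compiles {
    my ($domain_ref) = @_;
    my $sub_domain = qq< (?: $domain_ref | $domain_lit ) >;
    my $re = eval { qr/$sub_domain/x };   # a bad regex dies; eval traps it
    return defined $re ? 1 : 0;
}

print "without newline: ", (compiles($ref_unterminated) ? "ok" : "broken"), "\n";
print "with newline:    ", (compiles($ref_terminated)   ? "ok" : "broken"), "\n";
```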
print a regex during debugging,
consider using something like '\0xff' instead of "\0xff".
$quoted_pair = qq< $esc[^$NonASCII] >;
$quoted_pair = qq< $esc [^$NonASCII] >;
$quoted_pair = qq< ${esc}[^$NonASCII] >;
The first is very different from the other two. It is interpreted as an
attempt to index an element into the array @esc, which is
certainly not what is wanted here (=>222).
However, I'm not a blind fanatic -- Perl does not offer features that
I wish it did. The most glaring omission is a feature offered by other
implementations, such as Tcl, Python, and GNU Emacs: the index
into the string where the match (and $1, $2, etc.) begins and
ends. You can get a copy of text matched by a set of parentheses using the
aforementioned variables, but in general, it's impossible to know exactly
where in the string that text was taken from. A simple example that shows
the painfulness of this feature's omission is in writing a regex tutor.
You'd like to show the original string and say ``The first set of
parentheses matched right here, the second set matched here, and so on,''
but this is currently impossible with Perl.
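Perl releases from 5.6 onward, which postdate this discussion, expose these offsets through the built-in arrays @- (match starts) and @+ (match ends). A minimal sketch, with an invented sample string and regex:

```perl
use strict;
use warnings;

my $text = 'Total: 42 items';
my @spans;    # one [group number, start offset, end offset] per group

if ($text =~ m/(\d+) \s+ (\w+)/x) {
    # Index 0 describes the whole match; 1, 2, ... describe $1, $2, ...
    for my $n (0 .. $#+) {
        push @spans, [ $n, $-[$n], $+[$n] ];
    }
}

printf "group %d: offsets %d..%d\n", @$_ for @spans;
```

The end offset is the position just past the matched text, so substr($text, $-[$n], $+[$n] - $-[$n]) recovers what group $n matched.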
Another feature I've found an occasional need for is an array ($1,
$2, $3, ...) similar to Emacs'
match-data (=>196). I can construct
something similar myself using:
$parens[0] = $&;
$parens[1] = $1;
$parens[2] = $2;
$parens[3] = $3;
$parens[4] = $4;
    :
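For a single match, much the same array can be assembled in one step, because a list-context match returns the text of each set of parentheses. This is only a sketch with an invented sample string; it assumes the regex has at least one set of capturing parentheses (without any, a successful list-context match returns the list (1) instead).

```perl
use strict;
use warnings;

my $text = 'Jan 31 1997';
my @parens;

# The list-context match returns ($1, $2, $3); $& is prepended as element 0.
if (my @groups = $text =~ m/(\w+) (\d+) (\d+)/) {
    @parens = ($&, @groups);
}

print join('|', @parens), "\n";
```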
Then there are those possessive quantifiers that I mentioned in the footnote on page 111. They could make many expressions much more efficient.
There are a lot of esoteric features I can (and do) dream of. One that I
once went so far as to implement locally was a special notation whereby the
regex would reference an associative array during the match, using
\1 and such as an index. It made it possible to extend something like (['"]).*?\1 to include <...>, ... and the like.
Another feature I'd love to see is named subexpressions, similar to
Python's symbolic group names feature. These would be capturing
parentheses that (somehow) associated a variable with them, filling the
variable upon a successful match. You could then pick apart a phone number
like (inventing some fictitious (?<var>...) notation on the fly):
(?<$area>\d\d\d)-(?<$exchange>\d\d\d)-(?<$num>\d\d\d\d)
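Something close to this wish arrived in Perl 5.10, well after this was written: (?<name>...) names a capturing group, and the matched text is available through the %+ hash rather than through a variable named in the regex. A minimal sketch with an invented phone number:

```perl
use strict;
use warnings;

my %phone;
if ('800-555-1212' =~ m/(?<area>\d\d\d)-(?<exchange>\d\d\d)-(?<num>\d\d\d\d)/) {
    %phone = %+;    # copy the named captures before another match replaces them
}

print "$_ => $phone{$_}\n" for sort keys %phone;
```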
Well, I'd better stop before I get carried away. The sum of it all is that I definitely do not think Perl is the ideal regex-wielding language.
But it is very close.
@ now interpolates
within a regex (and a doublequoted string, for that matter). Still, you
should be aware of a number of subtle (and not-so-subtle) differences
when working with Perl4:
$&, $1, and so on are not read-only
in Perl4 as they are in Perl5. Although it would be useful, modifying
them does not magically modify the original string they were copied
from. For the most part, they're just normal variables that are
dynamically scoped and that get new values with each successful match.
$` sometimes does
refer to the text from the start of the match (as opposed to
the start of the string). A bug that's been fixed in
newer versions caused $` to be reset each time the regex
was compiled. If the regex operand involved variable
interpolation, and was part of a scalar-context m/.../g
such as the iterator of a while loop, this recompilation
(which causes $` to be reset) is done during each
iteration.
$+ magically becomes a copy of $& when there
are no parentheses in the regex.
Perl4 interpolates $MonthName[...] as an array reference only
if @MonthName is known to exist. Perl5 does it regardless.
s*2\*2*4* would not work as expected in
Perl5.
Perl4 does not support the m{...} special-case
delimiters for the match operator. It does, however, support
them for the substitution operator.
Perl4 supports the special ?-match, but only in the
?...? form. The m?...? form is not special.
In Perl4, a reset anywhere in the program resets all
?-delimited matches. Perl5's reset affects only
those in the current package.
An example should make this clear. Consider:
"5" =~ m/5/;                # install 5 as the default regex
{                           # start a new scope...
    "4" =~ m/4/;            # install 4 as the default regex
}                           # ... end the new scope.
"45" =~ m//;                # use default regex to match 4 or 5, depending on which regex is used
print "this is Perl $&\n";
Perl4 prints `this is Perl 4', while
Perl5 prints `this is Perl 5'.
A list-context m/.../g returns the list of texts
matched within parentheses. Perl4, however, does not set $1 and
friends in this case. Perl5 does both.
m/.../g are
undefined in Perl5, but are simply empty strings in Perl4. Both are
considered a Boolean false, but are otherwise quite different.
In Perl5, modifying the target of a scalar-context m/.../g
resets the target's pos. In Perl4, the /g position is
associated with each regex operand. This means that modifying what
you intended to use as the target data has no effect on the /g
position (this could be either a feature or a bug depending on how you
look at it). In Perl5, however, the /g position is associated with
each target string, so it is reset when modified.
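The Perl5 behavior can be seen with pos. This is a small sketch using an invented string; the assignment deliberately stores identical text to show that it is the act of modifying the target, not a change in its content, that resets the position.

```perl
use strict;
use warnings;

my $data = 'aaa bbb ccc';

my $matched = $data =~ m/\w+/g;       # scalar-context /g match consumes 'aaa'
my $pos_after_match = pos($data);     # position just past 'aaa'

$data = 'aaa bbb ccc';                # assign to the target (same text)
my $pos_after_assign = pos($data);    # the /g position is gone

my $again = $data =~ m/(\w+)/g;       # so this match starts from the beginning
my $first_again = $1;

print "after match:  $pos_after_match\n";
print "after assign: ", (defined $pos_after_assign ? $pos_after_assign : 'undef'), "\n";
print "next match:   $first_again\n";
```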
Although Perl4's match operator does not support special-case delimiters such as m[...], Perl4's substitution does. Like Perl5, if the regex
operand has balanced delimiters, the replacement operand has its own
set. Unlike Perl5, however, whitespace is not allowed between the two
(because whitespace is valid as a delimiter in Perl4).
s'...'...' provides a singlequotish
context to the regular expression operand as you would expect, but
not to the replacement operand -- it gets the normal
doublequoted-string processing.
\' and
\\ have their leading backslash removed before eval
ever gets a hold of it. With Perl5, the eval gets everything
as is.
($filename, $size, $date) = split(...)
@_ gets if the
?...? form of the match operand is used. This is not an issue
with Perl5 since it does not support the forced split to @_.
Perl4's split supports a special match operand: if the
list-context match operand uses ?...? (but not
m?...?), split fills @_ as it does in a
scalar context. Despite current documentation to the contrary,
this feature does not exist in Perl5.
m/\s+/, not '·'.
The difference affects how leading whitespace is treated.
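The treatment of leading whitespace under the two operands can be compared directly. A minimal sketch with an invented input line:

```perl
use strict;
use warnings;

my $line = "  alpha beta";

my @special = split ' ',   $line;    # special case: leading whitespace is skipped
my @regex   = split /\s+/, $line;    # plain regex: leading whitespace yields an
                                     # empty first field

print scalar(@special), " fields: [", join('][', @special), "]\n";
print scalar(@regex),   " fields: [", join('][', @regex),   "]\n";
```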
eval anywhere in the
program, also triggers the copy for each successful match. Bummer.