
Regular expressions provide a very powerful method of defining a pattern, but they are a bit awkward to understand and to use properly. So let us examine some more examples in detail.

We start with a simple yet non-trivial example: finding *floating-point numbers* in a line of text. Do not worry: we will keep the problem simpler than it is in its full generality. We only consider numbers like **1.0** and not **1.00e+01**.

How do we *design* our regular expression for this problem? By examining typical examples of the strings we want to match:

A pattern is beginning to emerge:

- A number can start with a sign (- or +) or with a digit. This can be captured with the expression
**[-+]?**, which matches a single "-", a single "+" or nothing.
- A number can have zero or more digits in front of a single period (.) and it can have zero or more digits following the period. Perhaps:
**[0-9]*\.[0-9]*** will do ...
- A number may not contain a period at all. So, revise the previous expression to:
**[0-9]*\.?[0-9]***

The total expression is: [-+]?[0-9]*\.?[0-9]*
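As a quick sanity check, this first-draft expression can be tried on a few strings. The following is only a minimal sketch: the sample strings are chosen here for illustration and are not taken from the text.

```tcl
# First-draft pattern from the text: optional sign, digits, optional period, digits.
set pattern {[-+]?[0-9]*\.?[0-9]*}

# Hypothetical sample strings, picked only for illustration.
foreach sample {1.0 .02 +0. -0.0120 42} {
    # regexp returns 1 on a match and stores the matched text in "match".
    set found [regexp -- $pattern $sample match]
    puts [format "%-8s found=%d match=<%s>" $sample $found $match]
}
```

Each of these strings is matched in full, which is encouraging but says nothing yet about strings that should be rejected.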

At this point we can do three things:

- Try the expression with a bunch of examples like the ones above and see if the proper ones match and the others do not.
- Try to make it look nicer before we start testing it. For instance, the class of characters "[0-9]" is so common that it has a shortcut, "\d". So we could settle for [-+]?\d*\.?\d* instead. Or we could decide that we want to capture the digits before and after the period for special processing: [-+]?([0-9]*)\.?([0-9]*)
- Or, and this may be a good strategy in general, we can carefully examine the pattern before we actually start using it.
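The second option can be sketched as follows. The sample string and the variable names are assumptions made for this illustration. Note that the "*" sits inside the first pair of parentheses, so that the whole run of digits is captured and not just the last digit.

```tcl
# Hypothetical sample string.
set sample "-12.345"

# The same pattern written with the \d shorthand for [0-9].
regexp -- {[-+]?\d*\.?\d*} $sample shorthand_match

# Capturing variant: the first group holds the digits before the period,
# the second group the digits after it.
regexp -- {[-+]?([0-9]*)\.?([0-9]*)} $sample whole before after

puts "shorthand=$shorthand_match whole=$whole before=$before after=$after"
```

For this sample the two patterns match the same text, and the captured parts are "12" and "345".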

You see, there is a problem with the above pattern: all the parts are optional, that is, each part can match a null string - no sign, no digits before the period, no period, no digits after the period. In other words: *Our pattern can match an empty string!*

Our questionable numbers, like "+000", will be perfectly acceptable and we (grudgingly) agree. But more surprisingly, the strings "--1" and "A1B2" will be accepted too! Why? Because the pattern is not tied to the beginning or the end of the string: it only has to match somewhere. As every part is optional, it already succeeds at the very first position, matching just "-" in "--1" and the empty string in "A1B2".
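A small experiment makes the problem visible. This is a sketch only; it prints the text that the pattern actually matched, so one can see where the match was found.

```tcl
# The permissive first-draft pattern.
set pattern {[-+]?[0-9]*\.?[0-9]*}

# An empty string, two strings that are not numbers, and a questionable number.
foreach sample [list "" "--1" "A1B2" "+000"] {
    set found [regexp -- $pattern $sample match]
    puts [format "sample=<%s> found=%d match=<%s>" $sample $found $match]
}
```

All four strings are reported as found. The regular expression engine takes the leftmost possible match, so for "--1" the matched text is a lone "-" and for "A1B2" it is the empty string at the start of the line.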

We need to reconsider our pattern - it is too simple, too permissive:

- The character before a minus or a plus, if there is any, cannot be a digit, a period or another minus or plus. Let us make it a space or a tab or the beginning of the string:
**^|[ \t]**
This may look a bit strange, but what it says is: either the beginning of the string (^ outside the square brackets) or (the vertical bar) a space or tab (remember: the string "\t" represents the tab character).
- Any sequence of digits before the period (if there is one) is allowed:
**[0-9]+\.?**
- There may be zero digits in front of the period, but then there must be at least one digit behind it:
**\.[0-9]+**
- And of course digits in front and behind the period:
**[0-9]+\.[0-9]+**
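- The character after the string (if any) cannot be a "+", "-" or "." as that would get us into the unacceptable number-like strings:
**$|[^-+.]** (The dollar sign signifies the end of the string. The "-" comes first inside the brackets so that it is taken literally; written as "+-." it would denote the range of characters from "+" to ".", which also includes the comma.)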
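To see what the two boundary checks do on their own, they can be wrapped around a bare run of digits. This is a sketch under assumptions: the sample strings are made up, and the trailing class is spelled [^-+.] with the hyphen first, so that it excludes exactly "-", "+" and ".".

```tcl
# A run of digits between the leading and trailing checks from the list.
# Assumption: hyphen first in the trailing class, so "-" is a literal.
set pattern {(^|[ \t])[0-9]+($|[^-+.])}

# Hypothetical samples: the first three and the last should match,
# "a7" fails the leading check, "7-" and "7." fail the trailing check.
foreach sample [list "7" " 7" "a 7 b" "a7" "7-" "7." "7,"] {
    puts [format "<%s> matched=%d" $sample [regexp -- $pattern $sample]]
}
```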

Before trying to write down the complete regular expression, let us see what different forms we have:

- No period:
**[-+]?[0-9]+**
- A period without digits before it:
**[-+]?\.[0-9]+**
- Digits before a period, and possibly digits after it:
**[-+]?[0-9]+\.[0-9]***
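The three forms can be checked one by one before they are combined. In the sketch below each form is anchored with ^ and $ so that it has to match the whole string; the names in the list and the sample numbers are assumptions made for this illustration.

```tcl
# The three forms from the list, each anchored to the whole string.
set forms {
    no_period       {^[-+]?[0-9]+$}
    leading_period  {^[-+]?\.[0-9]+$}
    digits_period   {^[-+]?[0-9]+\.[0-9]*$}
}

# Report which form (if any) recognises each hypothetical sample.
foreach sample {42 -.5 +3. 3.14} {
    set hits {}
    foreach {name pattern} $forms {
        if {[regexp -- $pattern $sample]} { lappend hits $name }
    }
    puts "$sample -> $hits"
}
```

Every sample is recognised by exactly one form, which suggests the alternatives do not overlap for these cases.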

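Now the synthesis: **(^|[ \t])([-+]?([0-9]+|\.[0-9]+|[0-9]+\.[0-9]*))($|[^-+.])**

Or: **(^|[ \t])([-+]?(\d+|\.\d+|\d+\.\d*))($|[^-+.])**

(In the last pair of brackets the "-" is placed first, so that it is a literal character and not a range operator.)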
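The combined expression can now be put to the test. The procedure name find_number and the sample strings below are hypothetical, and the trailing class is written as [^-+.] (hyphen first, so it is a literal). Treat this as a sketch, not a complete number parser.

```tcl
# Return the first number found in a line, or an empty string if there is none.
proc find_number {line} {
    set pattern {(^|[ \t])([-+]?(\d+|\.\d+|\d+\.\d*))($|[^-+.])}
    if {[regexp -- $pattern $line whole char_before number]} {
        return $number
    }
    return ""
}

# Hypothetical samples: the first four contain a number, the others do not.
foreach sample [list "1.0" ".02" "+0." "x = -0.0120 m" "--1" "A1B2" "0..2" "++1" "+."] {
    set number [find_number $sample]
    if {$number ne ""} {
        puts [format "%-14s accepted, number=<%s>" $sample $number]
    } else {
        puts [format "%-14s rejected" $sample]
    }
}
```

The strings "--1" and "A1B2", which slipped through the first-draft pattern, are rejected here.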

The parentheses are needed to distinguish the alternatives introduced by the vertical bar and to capture the substring we want to have. Each set of parentheses also defines a substring and this can be put into a separate variable:

    regexp {.....} $line whole char_before number nosign char_after

    # Or simply only the recognised number (x's as placeholders), the
    # last can be left out.
    regexp {.....} $line x x number

*Tip:* To identify these substrings: just count the opening parentheses from left to right.
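To illustrate the counting, here is a sketch with the full expression written out and a made-up line of text. The line content is an assumption, and the trailing class is spelled [^-+.] with the hyphen first.

```tcl
set pattern {(^|[ \t])([-+]?([0-9]+|\.[0-9]+|[0-9]+\.[0-9]*))($|[^-+.])}
set line "offset = -12.5 mm"

# Counting the opening parentheses from left to right:
#   1st: character before   2nd: number with sign
#   3rd: number without sign   4th: character after
if {[regexp -- $pattern $line whole char_before number nosign char_after]} {
    puts "whole=<$whole> before=<$char_before> number=<$number>"
    puts "nosign=<$nosign> after=<$char_after>"
}
```

For this line the variable number holds "-12.5" and nosign holds "12.5", while whole also includes the space on either side.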