Limiting Matches: Regular Expression Negation

One of the most powerful ways to find a specific piece of text is a search that supports regular expressions, as is typically available in any advanced text editor. A simple but effective technique within such a search is Regular Expression Negation. For the purposes of this article, consider Regular Expressions in the context of searching for text strings in a text editor.

It is beyond the scope of this article to explain what Regular Expressions are.^†
This article simply means to serve as a quick tip for the reader to consider what might be the best approach to locating a desired portion of text using Regular Expressions.

When composing a Regular Expression matching pattern, the developer often focuses on those Regular Expression classes and Meta-Characters designed to locate specific text. It makes sense to approach the problem from this perspective. However, when writing a Regular Expression Search pattern, it is sometimes beneficial to consider what is NOT part of the desired text.

Instead of attempting to construct a matching expression designed to select the desired text directly, begin by constructing a pattern designed to match nothing at all, or at least nothing related to your desired text. An artist might think of this as the negative space: anything which is NOT the subject itself. I composed one such Regular Expression Matching Pattern in an effort to remove all HTML tags from a text file, so that the result would read as a plain text file, devoid of HTML tags.

After some effort spent constructing a Matching Pattern to target a specific set of textual objects, it became obvious to me that the most effective Matching Pattern might be one which focused on what I did NOT wish to find. Fortunately, the Regular Expression engine provides for precisely this approach to locating text, through negation.

When composing a regular expression, it is common to use square brackets (e.g. [a-b0-9 ,]+ ) to match any of the characters contained within the brackets, in any order (see Regular Expression tutorials for more information on brackets and character classes). Square Brackets can also be used for negation: to list the characters which must NOT appear at that position in the match. The only difference between an ordinary Square Brackets pattern and a negated one is a caret character placed directly after the opening Square Bracket (e.g. [^do not match this]+ ). Note that a negated class excludes each listed character individually, not the phrase as a whole.
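The difference between an ordinary and a negated bracket expression can be tried outside a text editor as well. The following is a minimal sketch using Python's standard `re` module; the sample strings and variable names are illustrative assumptions, not part of the original search.

```python
import re

text = "cat, dog; bird"  # hypothetical sample text

# Ordinary class: runs of lowercase letters.
words = re.findall(r"[a-z]+", text)

# Negated class: runs of anything that is NOT a lowercase letter.
separators = re.findall(r"[^a-z]+", text)

# A negated class excludes single characters, not a word:
# [^cat] rejects 'c', 'a' and 't' wherever they occur.
leftover = re.findall(r"[^cat]+", "tack on")

print(words)       # ['cat', 'dog', 'bird']
print(separators)  # [', ', '; ']
print(leftover)    # ['k on']
```

The two classes split the same text into complementary pieces, which is the "negative space" idea in miniature.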

While constructing a Match / Replace expression pair to remove HTML tags from the target text, I knew my pattern would begin with (<) and end with (>), so as to target typical HTML elements such as <html> or <div class="htmlClass">, along with the virtually limitless permutations in which class, ID, href and other HTML attributes might appear in the markup.

Instead of trying to compose a matching pattern to identify every character which could appear in any given HTML element, I decided to apply Negation to the matching pattern instead. Note the second atom (the middle parenthesized group) in the pattern I ultimately composed for matching HTML elements, which performed exactly as I desired, as follows:
(<)([^>]+)(>)

Examine the Matching Expression. The second atom tells the Regular Expression engine to match any character which is NOT a closing tag bracket (i.e. the greater-than symbol, > ), and the + quantifier repeats that match one or more times. In typical markup, nothing inside an HTML tag is itself a closing bracket, so the match runs from the opening < up to the first > and stops there. Using this approach, the Regular Expression Matching Pattern successfully located every HTML tag and performed precisely as I desired.
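To see the pattern at work, here is a small sketch in Python's `re` module. The pattern is the one given above; the sample markup and the comparison pattern `<.+>` are assumptions added for illustration, not something the original search used.

```python
import re

# The pattern from the article: "<", one or more non-">" characters, ">".
TAG_PATTERN = re.compile(r"(<)([^>]+)(>)")

# Hypothetical sample markup.
html = '<div class="htmlClass"><p>Hello, <a href="#top">world</a>!</p></div>'

# Whole tags, and the contents of the second atom for each tag.
tags = [m.group(0) for m in TAG_PATTERN.finditer(html)]
inner = [m.group(2) for m in TAG_PATTERN.finditer(html)]

# For contrast: a greedy "match anything" pattern swallows the whole line,
# because "." also matches ">" and the quantifier runs to the last ">".
greedy = re.findall(r"<.+>", html)

print(tags)
print(inner)
print(greedy == [html])  # True
```

Each tag is matched on its own because the negated class cannot cross a closing bracket, whereas the greedy alternative treats everything between the first < and the last > as a single match.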

For successful removal of HTML Tags, I simply left the Replacement Expression portion of the Search / Replace dialogue blank in the text editor, so the desired text would be found and replaced with nothing.
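The same find-and-replace-with-nothing step can be sketched in Python, assuming the pattern above and the standard `re.sub` function. The helper name `strip_tags` and the sample page are hypothetical.

```python
import re

def strip_tags(markup: str) -> str:
    """Replace every match of the tag pattern with an empty string."""
    return re.sub(r"(<)([^>]+)(>)", "", markup)

# Hypothetical sample page.
page = "<html><body><h1>Title</h1><p>Some <em>plain</em> text.</p></body></html>"

print(strip_tags(page))  # TitleSome plain text.
```

This mirrors leaving the replacement field blank in the editor. One limitation worth knowing: if an attribute value contains a literal > character, the match ends at that character and part of the tag is left behind, so the pattern suits ordinary markup better than arbitrary HTML.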

Thank you for your continued reading at WordPressCenter.net
~ @ajaxStardust , the Author

^†This author recommends the resources at www.Regular-Expressions.Info for a basic description of Regular Expressions, accompanying tutorials, and regexp-specific software to aid in learning by example. Such software tests user-composed regular expressions in real time, immediately identifying whether an expression is valid and highlighting successful matches. Among other advanced features, it allows existing text files (e.g. TXT, HTML, PHP, JavaScript, etc.) to be loaded as sample text, which helps in understanding how regular expressions are useful in real-world examples.
