Repetition counters @ and {n,m}
Parentheses () and group backreferences \n
Special characters in regular expressions
Regular expressions can be used to find text that matches a given pattern and optionally replace those matches with new text.
A regular expression can contain ordinary text and wildcards. A wildcard is a character that can be used to represent one or many characters.
To search for patterns through regular expressions, make sure that the "Use wildcards" option is enabled in the Find/Replace panel of the Control Board:
or in the Find/Replace dialog:
When this option is on, the search patterns that you enter in the "Find what" or "Replace with" boxes are treated as regular expressions. Otherwise, all wildcards are treated as ordinary characters with no special meaning, and Atlantis performs an ordinary search.
Here are the wildcards that you can use to compose regular expressions in Atlantis:
"?" is the most basic wildcard. It matches any single character. For example, the "c?t" expression matches "cat" and "cut".
"*" (the asterisk) matches any string of characters (including a blank string containing no characters at all). For example, "be*st" matches "best", "beast", "behest", and "bee from the nest". In this example, the asterisk returns all characters that lie between "be" and "st".
"<" matches the beginning of a word. For example, "<comp" matches "comp" in "international competition", but doesn't match "comp" in "recompile".
">" matches the end of a word. For example, "ter>" matches "ter" in "character spacing", but doesn't match "ter" in "terrifying".
"<<" matches the beginning of a paragraph. For example, "<<1." matches "1." at the beginning of a paragraph.
"<<" can be used in conjunction with "<". For example, "<<<*>" matches any word at the beginning of a paragraph.
Square brackets "[ ]" can be used to specify a set of characters. A character set always matches only a single character in the document text. For example, "[aeo]" is a set of three characters "a", "e", and "o". "[aeo]" matches any single one of these three characters. For example, "b[aeo]t" matches "bat", "bet", and "bot".
You can specify entire ranges of characters within character sets. For example "[a-z]" matches any letter from "a" to "z".
Note that the ranges must be in ascending order. You can specify "[a-z]", but not "[z-a]".
You can have both individual characters and character ranges within the same character set. For example, "[b-df-hj-np-tv-xz]" matches any English consonant. It contains 5 character ranges: from "b" to "d", from "f" to "h", from "j" to "n", from "p" to "t", and from "v" to "x". It also includes 1 individual character "z".
You can compose character sets matching any characters except the characters included in the character sets. This is done by putting "!" after the opening bracket "[" of the character set. For example, "[!a-e]" matches any character except the characters included in the specified range, i.e. the "a", "b", "c", "d", and "e" characters.
If you want to include a hyphen "-" or an exclamation mark "!" as literal characters in a character set, you need to "escape" them, i.e. neutralize their special meaning. This is done by putting a backslash "\" right in front of them. For example, "[\!\-]" matches any exclamation mark or hyphen.
To include the backslash character itself as an ordinary character in a character set, you need to escape it too. For example, "[1\\2]" matches either the digit "1", or the backslash character "\" (represented by "\\"), or the digit "2".
"@" and "{n,m}" can be used to specify how many occurrences of the previous character or expression should be found.
"@" means "one or more occurrences of the previous character". For example, "bo@t" matches "bot" and "boot".
The "{n,m}" counter can be used in three different ways:
Note that "n" in the "{n,m}" counter can be 0 . This means that zero occurrence of the preceding character or expression is allowed. For example, "<[0-9]{4}-{0,1}[0-9]{4}-{0,1}[0-9]{4}-{0,1}[0-9]{4}>" matches 16-digit credit card numbers, whether they contain dashes between each four digits, or not:
1234-5678-9012-3456
1234567890123456
In the above regular expression, "-{0,1}" means "zero or one occurrence of a dash character".
Note about repetition counters:
While both "@" and "{n,m}" allow to match multiple occurrences of the previous character or expression, there is one important difference between these two repetition counters.
With "@", Atlantis performs "lazy matching". "Lazy matching" means that Atlantis will try to find as few occurrences of the previous character or expression as possible.
With the "{n,m}" counter, Atlantis works in the opposite way and performs "greedy matching". "Greedy matching" means that Atlantis will try to find as many occurrences of the previous character or expression as possible.
Let's take an example. At first sight, "be@" and "be{1,}" might seem identical regular expressions. Indeed, both "be@" and "be{1,}" mean the same: "find 'b' followed by at least one occurrence of 'e'". But the difference is that "e@" is completely happy with a single occurrence of "e" even when there are more occurrences in the current document word, while "e{1,}" tries to match as many occurrences of "e" in the current document word as possible. This is why "be@" always matches only the first two letters of "bee" (i.e. "be"), while "be{1,}" matches the entire word "bee".
However, there is one situation when the "{n,m}" counter becomes "lazy": it is when such a counter follows the "?" wildcard (meaning "any character"). Accordingly, the following two regular expressions will match identical texts:
?@
?{1,}
Repetition counters can be applied to
Parentheses (or round brackets) can be used to divide the regular expression into shorter expressions (or groups). Each group is assigned an ordinary number starting with the leftmost group. For example, the following regular expression contains two groups:
(<[A-Z].) (<*>)
The group #1 — "<[A-Z]." — is enclosed within the first pair of round brackets. The group #2 — "<*>" — is enclosed within the second pair of round brackets.
You can have up to 9 groups within a regular expression.
Groups can be used in three ways:
J. Johnson
B. Miller
C. Bower
Each line in the above text is the initial letter of a first name followed by a surname.
Johnson J.
Miller B.
Bower C.
This can be done with the following "Replace with" expression:
Johnson
Miller
Bower
Note that a group backreference can be used recursively within a "Replace with" expression. For example, "(<<[!^p]{1,})" as a "Find what" expression catches email addresses from the following list:
support@company.com
rd@company.com
sales@company.com
"<<" means that the expression catches text at a paragraph start. "[!^p]{1,}" catches as much text in the paragraph as it can, up to, and not including the paragraph end mark.
<a href="mailto:support@company.com">support@company.com</a>
<a href="mailto:rd@company.com">rd@company.com</a>
<a href="mailto:sales@company.com">sales@company.com</a>
More precise Find & Replace operations can be performed if an "environment" is specified in the search pattern. By "environment", we mean any elements surrounding the text that you are searching for, —preceding and/or following it.
To specify such a restrictive environment within a regular expression, you will use the "|" wildcard (i.e. the "vertical bar"). Any portion of the regular expression that precedes this "|" wildcard will automatically be matched against any text preceding the found item. Any portion of the regular expression that follows a second instance of this "|" wildcard will automatically be matched against any text following the found item. In other words, you can use one or two "|" wildcards to specify what should precede and/or follow the text that you want to find.
Let's take an example. Let's suppose that you have the following list of international phone numbers:
+1-646-222-3333
+49-89-636-48018
+1-541-754-3010
+1-562-756-2233
+49-0711-680-0
+54-11-5530-3250
+44-0844-800-2400
+1-800-233-4175
They all include a country code (1, 44, 49, 54) preceded by "+".
Let's suppose that you need to find all the German phone numbers from the above list but not retrieve the country code "+49" as part of the found items. In other words, you want to find all the German phone numbers in the list, but in their local format (without the country code). So you need to both make sure that the found numbers begin with the "+49" code, and that Atlantis doesn't include that code in the found items. For this, you need to tell Atlantis that it should search for phone numbers beginning with "+49", but not include "+49" in the search results.
Let's take an example.
The following expression:
+49-[!^p]{1,}
will find any German phone number, including the country code:
+1-646-222-3333
+49-89-636-48018
+1-541-754-3010
+1-562-756-2233
+49-0711-680-0
+54-11-5530-3250
+44-0844-800-2400
+1-800-233-4175
The "+49-" part of the above expression catches the country code plus the following hyphen. "[!^p]{1,}" catches the rest of the phone number (i.e. all the characters up to the paragraph end mark).
Now let's put the "|" wildcard after the "+49-" part of the above regular expression:
+49-|[!^p]{1,}
A search using this new regular expression will catch the following text:
+1-646-222-3333
+49-89-636-48018
+1-541-754-3010
+1-562-756-2233
+49-0711-680-0
+54-11-5530-3250
+44-0844-800-2400
+1-800-233-4175
You can similarly specify a "following environment" in a regular expression. Let's suppose that you want to find any US toll-free phone number (the ones preceded with the "+1" country code) from the following list:
+1-646-222-3333
+49-89-636-48018
+1-541-754-3010 - toll-free
+1-562-756-2233
+49-0711-680-0
+54-11-5530-3250
+44-0844-800-2400
+1-800-233-4175 - toll-free
This expression:
+1-|[!^p]@| - toll-free
will catch the following text from the above list of phone numbers:
+1-646-222-3333
+49-89-636-48018
+1-541-754-3010 - toll-free
+1-562-756-2233
+49-0711-680-0
+54-11-5530-3250
+44-0844-800-2400
+1-800-233-4175 - toll-free
So this expression finds any phone number preceded with "+1-" and followed by " - toll-free", but neither "+1-" nor " - toll-free" are reported as found text.
If you need to specify only a "following environment", you still have to include two instances of the "|" wildcard. But the first instance of "|" should then be placed at the very beginning of the expression. For example, the following expression:
|[!^p]@| - toll-free
will find all the "toll-free" phone numbers (including the country code) from the following list, but it will leave out the part that is outside the boundaries delimited by the "|" wildcard, i.e. the part that matches the pattern or text placed after the second delimiter:
+1-646-222-3333
+49-89-636-48018
+1-541-754-3010 - toll-free
+1-562-756-2233
+49-0711-680-0 - toll-free
+54-11-5530-3250
+44-0844-800-2400 - toll-free
+1-800-233-4175 - toll-free
If you want a wildcard to be treated as an ordinary character in a regular expression, you must tell Atlantis that you mean to use it as a literal character, and not as a wildcard with special meaning. This is done by prefixing the wildcard character with a backslash. For example, "\?" will not match "any character" but an ordinary textual question mark. "\{" will search for an opening curly bracket. "\(*\)" finds any text enclosed between round brackets. "\\" will search for an ordinary textual backslash. Etc.
A number of special characters can be used within ordinary search patterns (when the "Use wildcards" option is unchecked in the "Find/Replace" window). Each of these special characters is preceded with "^". For example, you can type "^p" in the "Find what" box to search for paragraph end marks. "^g" will match any picture, "^s" any nonbreaking space. Etc.
All these "^..." special characters available for ordinary searches can also be used in wildcard searches with same meaning. For example, "<^$@e>" matches words ending with "e"; "<^#{2,3}>" matches numbers containing 2 or 3 digits ("78", "112", etc). Special characters can also be used within character sets in regular expressions. For example, "[!^p]" will match any character except a paragraph end mark; "<[^$^#]@>" will match any word containing letters and/or digits.
The following regular expression will find groups of two neighboring identical paragraphs:
(<<[!^p]{0,}^p)\1
It would find the following pairs of paragraphs:
Bob
Jack
Jack
John
Charlie
Peter
Peter
Chuck
Walkthrough:
"<<[!^p]{0,}^p" matches any entire paragraph. Obviously, any paragraph starts with "<<" (meaning "paragraph start"), and is terminated with "^p" (meaning "paragraph end mark"). "[!^p]{0,}" matches all the paragraph contents that lie between the paragraph start and the paragraph end mark. The "{0,} repetition counter means that a paragraph can be blank (i.e. contain nothing but a paragraph end mark). "<<[!^p]{0,}^p" is enclosed within round brackets so that it acts as a group, and it can be referred to through the "\n" wildcard. So the terminating item "\1" of the above regular expression again matches the text already matched by the expression enclosed within round brackets.
To find three neighboring identical paragraphs, you could use this regular expression:
(<<[!^p]{0,}^p)\1\1
The following expression will find any number (2 or more) of neighboring identical paragraphs:
(<<[!^p]{0,}^p)\1{1,}
For example, it would find the following groups of paragraphs:
Bob
Jack
Jack
Jack
Jack
John
Charlie
Peter
Peter
Chuck
Peter
Peter
Peter
Chuck
Now let's suppose that you want to remove the duplicate paragraphs. Simply combine the above "Find what" pattern with the following "Replace with" pattern:
\1
This would replace each found group of paragraphs with the first found paragraph of each such group. In other words, it would remove the redundant paragraphs and keep only the first occurrence of them.
Let's suppose that we have the following list of international phone numbers:
+1-646-222-3333
+49-89-636-48018
+1-541-754-3010
+1-562-756-2233
+49-0711-680-0
+54-11-5530-3250
+44-0844-800-2400
+1-800-233-4175
Each phone number in the above list is preceded with "+" and a country code.
Let's suppose that we need to put the country codes in round brackets and get rid of the "+" signs in order to convert each phone number to the following format:
(1)646-222-3333
This can be done with the following regular expressions:
Find what:
+(^#@)-([!^p]{1,})
Replace with:
(\1)\2
Walkthrough:
"+(^#@)" in the "Find what" box catches any country code, including the associated "+" sign. "([!^p]{1,})" catches the phone number, but leaves out the hyphen following the country code. The "+" sign before the country code and the hyphen after the country code are kept outside the round brackets used in the "Find what" expression. In this way, we make sure that the "+" sign and the hyphen after the country code will not be part of the back-reference provided by the grouping of the other elements within round brackets. In this way, when Atlantis uses the text matched by the capturing groups, the "+" sign and the hyphen after the country code will not be included in it. On the other hand, the country code number — "^#@" — and the phone number in the local format — "[!^p]{1,}" — are put within round brackets. This is because these elements will be used to reformat the phone numbers, and need to be associated with group numbers for back-reference in the "Replace with" box.
The "Replace with" expression — "(\1)\2" — means that any country code number and the associated phone number matching the "Find what" expressions "^#@" and "[!^p]{1,}" respectively, will be rearranged so that they follow each other immediately, the country code being placed in first position, and put within round brackets.
So if we use the above "Find what" and "Replace with" patterns on the above original list of phone numbers, we will get this:
(1)646-222-3333
(49)89-636-48018
(1)541-754-3010
(1)562-756-2233
(49)0711-680-0
(54)11-5530-3250
(44)0844-800-2400
(1)800-233-4175
The following paragraph contains a hyperlink (a clickable email address):
Technical support: support@company.com
The hyperlink target address (in our case, the email address) explicitly displays within the paragraph text.
In HTML, this paragraph can be represented through the following code:
<p>Technical support: <a href="mailto:support@company.com">support@company.com</a></p>
But there is an alternative way to compose a hyperlink. Below is a hyperlink to the same email address:
But it doesn't display the associated email address within the paragraph text. "Technical support" is displayed instead.
Below is the corresponding HTML code:
<p><a href="mailto:support@company.com">Technical" support</a></p>
Let's suppose that we have the following document text:
Technical support: support@company.com
Research department: rd@company.com
Sales department: sales@company.com
and we need to convert them into the following HTML code:
<p><a href="mailto:support@company.com">Technical" support</a></p>
<p><a href="mailto:rd@company.com">Research" department</a></p>
<p><a href="mailto:sales@company.com">Sales" department</a></p>
The above HTML code would display like this in a Web browser:
This conversion can be done with the following regular expressions:
Find what:
(<<[!^p]@): ([!^p]{1,})
Replace with:
<p><a href="mailto:\2">\1</a></p>
Walkthrough:
"(<<[!^p]@)" in the "Find what" pattern catches the description of the email address ("Technical support", "Research department", etc). "([!^p]{1,})" catches the email address itself ("support@company.com", "rd@company.com", etc).
The "Replace with" expression contains the HTML code with two group backreferences. "\2" gets replaced with the email address, and "\1" — with its textual description.
See also...