This is a web version, kept for reference, of a .docx file originally produced for my Mashcat 2012 session, How Big Is My Book. It has been resurrected to form the Manual for Meret, a regular expressions tutorial based on MarcEdit examples.
Literals
Characters as you type them. E.g. i will look for a letter “i”. ii will look for two letter “i”s in a row. Eldorado will look for the exact string “Eldorado”, and 1234s will look for “1234s”.
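As a quick illustration of literals, the sketch below uses JavaScript's test method, which returns true when the pattern is found anywhere in the string. The sample strings are made up for this example. Note that literals are case-sensitive unless told otherwise.

```javascript
// Literal patterns match exactly the characters typed, in the same order and case.
const literalChecks = [
  /i/.test("skiing"),                  // true: there is an "i"
  /ii/.test("skiing"),                 // true: two "i"s in a row
  /ii/.test("iris"),                   // false: the two "i"s are not adjacent
  /Eldorado/.test("Road to Eldorado"), // true: exact string present
  /Eldorado/.test("eldorado"),         // false: the case differs
  /1234s/.test("ref 1234s"),           // true
];
console.log(literalChecks);
```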
Types of Character
There are a number of ways of looking for specific types of character:
. looks for any single character. It could be a letter, number, punctuation or anything.
[] looks for any one of the characters in square brackets, so m[ae]rc will match “marc” and “merc”. You can also specify ranges, e.g. [a-z] will find any lowercase letter from “a” to “z”, so [a-d]ad will match “aad”, “bad”, “cad”, and “dad”. Putting a ^ after the [ will look for any character that isn’t in the square brackets: u[^ks]marc will not match “ukmarc” or “usmarc” but will match “unmarc”.
\d a digit, same as [0-9]. Like all the following, counts as one character although written as two.
\D not a digit, same as [^0-9].
\w alphanumeric, including underscore, same as [A-Za-z0-9_]
\W non-alphanumeric, same as [^A-Za-z0-9_]
\s whitespace characters, e.g. spaces, tabs
\S non-whitespace characters
\b word boundary: the beginning or end of words (i.e. strings of alphanumeric characters), or the beginning or end of strings.
\ is also used before a special character so you can search on it. E.g. searching on . will look for any character and will match “.”, “d”, or “5”. To look for a full stop, put \ in front: \.
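The character types above can be tried out in JavaScript with a minimal sketch like the following. The strings “m5rc”, “mirc”, “MARC 245” and “marc21” are invented test values, not taken from the original examples.

```javascript
// Each test() call checks whether the pattern matches somewhere in the string.
const classResults = {
  dotAny: /m.rc/.test("m5rc"),              // true: . matches any single character
  setMatch: /m[ae]rc/.test("merc"),         // true: [ae] matches "a" or "e"
  setMiss: /m[ae]rc/.test("mirc"),          // false: "i" is not in the set
  negated: /u[^ks]marc/.test("unmarc"),     // true: "n" is neither "k" nor "s"
  negatedMiss: /u[^ks]marc/.test("ukmarc"), // false: "k" is excluded
  digits: /\d\d\d/.test("MARC 245"),        // true: three digits in a row
  escapedDot: /\./.test("marc21"),          // false: no literal full stop present
};
console.log(classResults);
```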
Starts and Ends
^ matches the start of any string. So, in “marc must die”, ^marc will match “marc” but ^must will match nothing.
$ matches the end of any string. So, in “marc must die”, die$ will match “die” but must$ will match nothing.
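The same four cases can be checked in JavaScript. This is only a sketch using the test method; the variable name phrase is chosen for the example.

```javascript
const phrase = "marc must die";

// ^ anchors the pattern to the start of the string, $ to the end.
const anchorResults = [
  /^marc/.test(phrase), // true: the string starts with "marc"
  /^must/.test(phrase), // false: "must" is not at the start
  /die$/.test(phrase),  // true: the string ends with "die"
  /must$/.test(phrase), // false: "must" is not at the end
];
console.log(anchorResults);
```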
Numbers of Characters
* matches the preceding element zero or more times, e.g. catalogu*ing will match “cataloging”, “cataloguing”, as well as “cataloguuing” and “cataloguuuuuuuuuuing”.
? matches the preceding element zero or one times, e.g. catalogu?ing will match “cataloging” and “cataloguing” but not “cataloguuing”. See also the second use of ? below.
+ matches the preceding element one or more times, e.g. catalogu+ing will match “cataloguing”, “cataloguuing”, and “cataloguuuuuuuuuuing”, but not “cataloging”.
{n} matches the preceding element exactly n times, e.g. catalogu{10}ing will match “cataloguuuuuuuuuuing” but not “cataloging”, “cataloguing”, or “cataloguuing”.
{m,n} matches the preceding element at least m times and no more than n times, e.g. catalogu{1,2}ing will match “cataloguing” and “cataloguuing” but not “cataloging”.
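To compare the quantifiers side by side, here is a small sketch that filters a list of spellings by each pattern. The helper name matching is hypothetical, and the ^ and $ anchors are added so that each pattern has to cover the whole word.

```javascript
const spellings = ["cataloging", "cataloguing", "cataloguuing"];

// Return the spellings that the pattern matches from start to end.
function matching(pattern) {
  return spellings.filter(function (word) { return pattern.test(word); });
}

const star = matching(/^catalogu*ing$/);      // zero or more "u": all three
const optional = matching(/^catalogu?ing$/);  // zero or one "u": first two
const plus = matching(/^catalogu+ing$/);      // one or more "u": last two
const exact = matching(/^catalogu{2}ing$/);   // exactly two "u": last one
const range = matching(/^catalogu{1,2}ing$/); // one or two "u": last two
console.log(star, optional, plus, exact, range);
```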
? also has a special meaning when it follows *, + or another quantifier: it makes the match as short as possible instead of as long as possible. E.g. looking for catalog.*ing in “cataloguing is ace. I love cataloguing” will greedily find the whole of “cataloguing is ace. I love cataloguing”, because .* could match either “u” or “uing is ace. I love catalogu” and takes the longer by default. Amending the regular expression to catalog.*?ing will stop at the first “ing” and find only “cataloguing”.
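The difference between the greedy and the restricted form can be seen in a minimal JavaScript sketch. Without the g flag, match returns the first match at index 0 of the result.

```javascript
const sentence = "cataloguing is ace. I love cataloguing";

// Greedy: .* takes as much as it can, so the match runs to the last "ing".
const greedy = sentence.match(/catalog.*ing/)[0];

// Lazy: .*? takes as little as it can, so the match stops at the first "ing".
const lazy = sentence.match(/catalog.*?ing/)[0];

console.log(greedy); // "cataloguing is ace. I love cataloguing"
console.log(lazy);   // "cataloguing"
```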
Grouping
() groups characters together. This has a variety of uses. The group can be used as a single element, e.g. (meta)* looks for the string “meta” zero or more times. It can also be used for capturing smaller parts of the expression for later use, e.g. catalog(.*) will match anything starting “catalog” but will also store what comes afterwards as $1.
| [pipe] allows alternatives either side of it, e.g. marc|rdf will match “marc” or “rdf”. Smaller alternatives can be matched with brackets, e.g. (uk|us)marc will match “ukmarc” or “usmarc” (and if there is a match will store “uk” or “us” as $1).
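A short sketch of grouping and alternatives in JavaScript follows. When match is used without the g flag, the whole match is at index 0 of the result and the text stored as $1 is at index 1. The strings “metametadata” and “data” are invented test values.

```javascript
// Alternatives inside a group: the captured part is available at index 1.
const whole = "usmarc".match(/(uk|us)marc/);
const matchedText = whole[0]; // "usmarc"
const captured = whole[1];    // "us"

// Capturing what follows "catalog".
const tail = "cataloguing".match(/catalog(.*)/)[1]; // "uing"

// A group treated as a single element and repeated with *.
const repeated = /^(meta)*data$/.test("metametadata"); // true: "meta" twice
const zeroTimes = /^(meta)*data$/.test("data");        // true: "meta" zero times

console.log(matchedText, captured, tail, repeated, zeroTimes);
```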
Regular Expressions in JavaScript
To get matches, use string.match(//). The regular expression goes between the forward slashes. Put a g after the second slash to search for all matches, rather than just the first one. Put an i after the second slash to do a case-insensitive search. string.match returns an array of matches, or null if it finds nothing.
var hits = "team".match(/i/g);
hits is null as there is no “i” in “team”.
var text = "Fox in socks in box on Knox";
var hits = text.match(/\w*ox\b/g);
hits is an array of three elements, each a word ending in “ox”: ["Fox", "box", "Knox"].
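The effect of the g and i flags can be compared with a sketch like this one. The variable names are chosen for the example, and the uppercase OX in the last pattern is there only to show case-insensitivity.

```javascript
const rhyme = "Fox in socks in box on Knox";

// Without g, only the first match is returned (at index 0 of the result).
const firstHit = rhyme.match(/\w*ox\b/)[0];

// With g, every match is returned.
const allHits = rhyme.match(/\w*ox\b/g);

// Uppercase "OX" finds nothing unless the i flag is added.
const caseSensitive = rhyme.match(/\w*OX\b/g);
const caseInsensitive = rhyme.match(/\w*OX\b/gi);

console.log(firstHit, allHits, caseSensitive, caseInsensitive);
```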
To search and replace within a string, use string.replace(//, ""). The regular expression goes between the forward slashes. The g and i flags work in the same way as for match. The replacement string goes after the comma. You can insert subexpressions captured with round brackets by using $1 for the first, $2 for the second, and so on (see Grouping above and the example below). string.replace returns the string with replacements made:
var text = "I love MARC. I think MARC is the future.";
text = text.replace(/MARC/g, "linked data");
text is now “I love linked data. I think linked data is the future.”
var text = "UKMARC is better than USMARC";
text = text.replace(/(.*?MARC) is better than (.*?MARC)/gi, "$2 is better than $1");
text is now “USMARC is better than UKMARC”. Run the replacement again, and history is reset.
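As a further sketch of replace, the snippet below shows what happens when the g flag is left off, and confirms that running the swap twice gets back to the starting string. The variable names are chosen for the example.

```javascript
const claim = "I love MARC. I think MARC is the future.";

// Without g, only the first occurrence is replaced.
const firstOnly = claim.replace(/MARC/, "linked data");

// With g and i, every occurrence is replaced regardless of case.
const everywhere = claim.replace(/marc/gi, "linked data");

// Swapping the two captured groups, then swapping them back again.
const swapPattern = /(.*?MARC) is better than (.*?MARC)/;
const swapped = "UKMARC is better than USMARC".replace(swapPattern, "$2 is better than $1");
const swappedBack = swapped.replace(swapPattern, "$2 is better than $1");

console.log(firstOnly, everywhere, swapped, swappedBack);
```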
Examples
ISBN (from Thingology blog): ([0-9]{9}[0-9X]|(978|979)[0-9]{10})
UK Postcode (from Wikipedia): (GIR 0AA|[A-PR-UWYZ]([0-9][0-9A-HJKPS-UW]?|[A-HK-Y][0-9][0-9ABEHMNPRV-Y]?) [0-9][ABD-HJLNP-UW-Z]{2})
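The two patterns above can be tried in JavaScript with a sketch like the one below. The ^ and $ anchors are an addition for this example, so that the whole string has to fit the pattern; without them, a 13-digit ISBN would be matched on its first ten digits only. The sample values are illustrative strings chosen for their shape. Both patterns only check the shape of the string: they do not verify an ISBN check digit or confirm that a postcode exists.

```javascript
// Patterns copied from the examples above; the ^ and $ anchors are an addition.
const isbnShape = /^([0-9]{9}[0-9X]|(978|979)[0-9]{10})$/;
const postcodeShape = /^(GIR 0AA|[A-PR-UWYZ]([0-9][0-9A-HJKPS-UW]?|[A-HK-Y][0-9][0-9ABEHMNPRV-Y]?) [0-9][ABD-HJLNP-UW-Z]{2})$/;

// Ten digits, nine digits plus X, thirteen digits starting 978, and a string that is too short.
const isbnChecks = ["0306406152", "080442957X", "9780306406157", "12345"]
  .map(function (s) { return isbnShape.test(s); });

// Three strings of the right shape, and one starting with a letter the pattern excludes.
const postcodeChecks = ["SW1A 1AA", "M1 1AE", "GIR 0AA", "QQ1 1AA"]
  .map(function (s) { return postcodeShape.test(s); });

console.log(isbnChecks);     // [true, true, true, false]
console.log(postcodeChecks); // [true, true, true, false]
```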