JavaScript regular expression checking for a number. Example: removing capital letters

In JavaScript, a regular expression (RegExp) is an object that is used to match sequences of characters in strings.

Creating the first regular expression

There are two ways to create a regular expression: with a regular expression literal or with the RegExp constructor. Both of the following represent the same pattern: the character "c", followed by "a", followed by "t".

// regular expression literal is enclosed in slashes (/)
var option1 = /cat/;
// Regular expression constructor
var option2 = new RegExp("cat");

As a general rule, if the regular expression is constant, meaning it won't change, it's better to use a regular expression literal. If it will change or depends on other variables, it is better to use the constructor.
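As a rough illustration of the second case, here is a minimal sketch in which the pattern depends on a variable. The helper name containsWord is hypothetical, and the sketch assumes the word contains no characters that are special in a pattern.

```javascript
// Hypothetical helper (not from the article): the word is only known at run time,
// so the pattern is built with the constructor instead of a literal.
function containsWord(text, word) {
  // Assumes `word` contains no characters with special meaning in a pattern.
  const pattern = new RegExp(word);
  return pattern.test(text);
}

console.log(containsWord("the cat says meow", "cat")); // true
console.log(containsWord("the cat says meow", "dog")); // false
```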

RegExp.prototype.test() method

Remember when I said that regular expressions are objects? This means they have a number of methods. The simplest one is test, which returns a boolean value:

true: the string contains a match for the pattern.

false: no match was found.

console.log(/cat/.test("the cat says meow"));
// true
console.log(/cat/.test("the dog says bark"));
// false

Regular Expression Basics Cheat Sheet

The secret of regular expressions is to remember common characters and groups. I highly recommend spending a few hours on the chart below and then coming back and studying further.

Symbols

  • . – (dot) matches any single character with the exception of line breaks;
  • *  –  matches the previous expression, which is repeated 0 or more times;
  • +  –  matches a previous expression that is repeated 1 or more times;
  • ? – the previous expression is optional ( matches 0 or 1 time);
  • ^ – corresponds to the beginning of the line;
  • $ – matches the end of the line.

Character groups

  • \d – matches any single digit.
  • \w – matches any word character (letter, digit or underscore).
  • [XYZ]  – a character set. Matches any single character from the set given in the square brackets. You can also specify ranges of characters, for example, [A-Z].
  • [XYZ]+  – matches one or more characters from the set.
  • [^A-Z]  – within a character set, "^" is used as a negation sign. In this example, the pattern matches anything that is not an uppercase letter.

Flags:

JavaScript regular expressions have several optional flags. They can be used separately or together, and are placed after the closing slash. For example: /[A-Z]/g. Here I will show only two flags.

g– global search.

i  – case-insensitive search.
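To see what the two flags change, here is a small sketch using the string method match. The sample text is made up for illustration.

```javascript
const text = "Cat, cat, CAT";

// No flags: only the first case-sensitive match is reported.
const first = text.match(/cat/);
console.log(first[0], first.index); // cat 5

// g: every case-sensitive match.
const allExact = text.match(/cat/g);
console.log(allExact); // [ 'cat' ]

// g and i together: every match, ignoring case.
const allAnyCase = text.match(/cat/gi);
console.log(allAnyCase); // [ 'Cat', 'cat', 'CAT' ]
```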

Additional constructs

(x)  –  capturing parentheses. This expression matches x and remembers that match so you can use it later.

(?:x)  – non-capturing parentheses. The expression matches x but does not remember the match.

x(?=y)  – lookahead. Matches x only if it is followed by y.
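The difference between these three constructs is easiest to see in the array returned by exec. The following is only a sketch; the input "12-34" is chosen for illustration.

```javascript
// Capturing parentheses: each group is remembered.
const captured = /(\d+)-(\d+)/.exec("12-34");
console.log(captured[0], captured[1], captured[2]); // 12-34 12 34

// Non-capturing parentheses: the first group is matched but not remembered.
const nonCaptured = /(?:\d+)-(\d+)/.exec("12-34");
console.log(nonCaptured[0], nonCaptured[1]); // 12-34 34

// Lookahead: the "-" must follow, but it is not part of the match.
const lookahead = /\d+(?=-)/.exec("12-34");
console.log(lookahead[0]); // 12
```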

Let's test the material we've studied

First, let's test all of the above. Let's say we want to check whether a string contains any digits. To do this, you can use the "\d" construct.

console.log(/\d/.test("12-34"));
// true

The above code returns true if there is at least one digit in the string. What if you need to check that a string follows a certain format? You can use several "\d" characters to define the format:

console.log(/\d\d-\d\d/.test("12-34"));
// true
console.log(/\d\d-\d\d/.test("1234"));
// false

If you don't care how many digits come before and after the "-" sign, you can use the "+" symbol to say that the "\d" pattern occurs one or more times:

console.log(/\d+-\d+/.test("12-34"));
// true
console.log(/\d+-\d+/.test("1-234"));
// true
console.log(/\d+-\d+/.test("-34"));
// false

For simplicity, you can use parentheses to group expressions. Let's say we have a cat meowing and we want to match a drawn-out "meow":

console.log(/me+(ow)+w/.test("meeeeowowoww"));
// true

Now let's figure it out.

m => matches the single letter "m";

e+ => matches the letter "e" one or more times;

(ow)+ => matches the letters "ow" one or more times;

w => matches the letter "w";

"m" + "eeee" + "owowow" + "w".

When operators like "+" are used immediately after parentheses, they affect the entire contents of the parentheses.

The "?" operator indicates that the previous character is optional. As you'll see below, both test cases return true because the "s" characters are marked as optional.

console.log(/cats? says?/i.test("the Cat says meow"));
// true
console.log(/cats? says?/i.test("the Cats say meow"));
// true

If you want to find a slash character, you need to escape it with a backslash. The same is true for other characters that have a special meaning, such as the question mark. Here's an example of how to search for them:

var slashSearch = /\//;
var questionSearch = /\?/;

  • \d is the same as [0-9]: each matches a single digit.
  • \w is the same as [A-Za-z0-9_]: both expressions match any single alphanumeric character or underscore.

Example: adding spaces to camelCase strings

In this example, we're really tired of camelCase and we need a way to add spaces between the words. Here's what we want:

removeCc("camelCase") // => should return "camel Case"

There is a simple solution using a regular expression. First, we need to find all the capital letters. This can be done with a character set and the global flag:

/[A-Z]/g

This matches the character "C" in "camelCase".

Now, how to add a space before "C"?

We need to use capturing parentheses! They allow you to find a match and remember it for later use. Wrap the character set in capturing parentheses to remember the capital letter you find:

/([A-Z])/

You can access the captured value later like this:

$1

Above we use $1 to access the captured value. By the way, if we had two sets of capturing parentheses, we would use $1 and $2 to refer to the captured values, and similarly for more capturing parentheses.
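A short sketch of two capturing groups in one replacement. The input string is invented for illustration; $1 refers to the first pair of parentheses and $2 to the second.

```javascript
// Two capturing groups: the year and the month.
const swapped = "2003-01".replace(/(\d+)-(\d+)/, "$2/$1");
console.log(swapped); // 01/2003
```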

If you need to use parentheses but don't need to capture the value, you can use non-capturing parentheses: (?:x). In this case, a match for x is found, but it is not remembered.

Let's return to the current task. How do we use the captured value? With the replace method! We pass "$1" as the second argument. It is important to use quotation marks here.

function removeCc(str) {
  return str.replace(/([A-Z])/g, "$1");
}


Let's look at the code again. We grab the uppercase letter and then replace it with the same letter. Now, inside the quotes, insert a space before $1. As a result, we get a space before each capital letter.

function removeCc(str) {
  return str.replace(/([A-Z])/g, " $1");
}
removeCc("camelCase") // "camel Case"
removeCc("helloWorldItIsMe") // "hello World It Is Me"

Example: removing capital letters

Now we have a string with a bunch of unnecessary capital letters. Have you figured out how to remove them? First, we need to select all the capital letters, again using a character set with the global flag:

/[A-Z]/g

We'll use the replace method again, but how do we make the character lowercase this time?

function lowerCase(str) {
  return str.replace(/[A-Z]/g, ???);
}


Hint: In the replace() method, you can specify a function as the second parameter.

We will use an arrow function, so the match does not need to be captured in parentheses: the function receives the matched text as its argument. When a function is passed to replace, it is called after a match is found, and its return value is used as the replacement string. Even better, if the search is global and multiple matches are found, the function is called once for each match.

function lowerCase(str) {
  return str.replace(/[A-Z]/g, (u) => u.toLowerCase());
}
lowerCase("camel Case") // "camel case"
lowerCase("hello World It Is Me") // "hello world it is me"

Example: convert the first letter to capital

capitalize("camel case") // => should return "Camel case"

Let's use the function in the replace() method again. However, this time we only need to look for the first character in the string. Recall that the symbol “^” is used for this.

Let's dwell on the "^" symbol for a second. Remember the example given earlier:

console.log(/cat/.test("the cat says meow"));
// true

If we add a "^" character to the pattern, test no longer returns true, because the word "cat" is not at the beginning of the string.
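The example appears to stop before showing the code, so here is one possible sketch of both steps. It assumes the first character is a lowercase ASCII letter; the character set [a-z] is a choice made for this illustration.

```javascript
// With "^", the pattern only matches at the very start of the string.
console.log(/^cat/.test("the cat says meow")); // false
console.log(/^cat/.test("cat says meow"));     // true

// One possible capitalize: match a lowercase letter at the start
// and let a function produce the replacement.
function capitalize(str) {
  return str.replace(/^[a-z]/, (first) => first.toUpperCase());
}

console.log(capitalize("camel case")); // Camel case
```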

Some people, when faced with a problem, think: “Oh, I’ll use regular expressions.” Now they have two problems.
Jamie Zawinski

Yuan-Ma said, “It takes a lot of force to cut wood across the grain. It takes a lot of code to program across the structure of the problem.”
Master Yuan-Ma, “Book of Programming”

Programming tools and techniques survive and spread in a chaotic evolutionary manner. Sometimes it is not the beautiful and brilliant that survive, but simply those that work well enough in their field - for example, if they are integrated into another successful technology.

In this chapter, we will discuss one such tool: regular expressions. They are a way to describe patterns in string data. They form a small, separate language that is part of JavaScript and many other languages and tools.

Regular expressions are both very strange and extremely useful. Their syntax is cryptic, and the programming interface JavaScript provides for them is clunky. But they are a powerful tool for inspecting and processing strings. Once you understand them, you will become a more effective programmer.

Creating a regular expression

A regular expression is a type of object. It can be created by calling the RegExp constructor, or by writing the desired pattern surrounded by forward slashes.

var re1 = new RegExp("abc");
var re2 = /abc/;

Both of these regular expressions represent the same pattern: the character “a” followed by the character “b” followed by the character “c”.

If you use the RegExp constructor, the pattern is written as a normal string, so the usual rules for backslashes apply.

The second notation, where the pattern appears between slashes, treats backslashes somewhat differently. First, since the pattern ends at a forward slash, we need to put a backslash before any forward slash that we want to be part of the pattern. In addition, backslashes that are not part of special character codes like \n are preserved (rather than ignored, as they are in strings) and change the meaning of the pattern. Some characters, such as the question mark and the plus sign, have a special meaning in regular expressions, and if you need to match the character itself, it must also be preceded by a backslash.

var eighteenPlus = /eighteen\+/;

To know which characters need a backslash in front of them, you would need to know the list of all characters with a special meaning in regular expressions. For now that is not realistic, so when in doubt, just put a backslash before any character that is not a letter, a digit or whitespace.

Checking for matches

Regular expressions have several methods. The simplest one is test. If you pass it a string, it returns a Boolean value indicating whether the string contains a match for the pattern.

console.log(/abc/.test("abcde"));
// → true
console.log(/abc/.test("abxde"));
// → false

A regular expression consisting only of non-special characters simply represents that sequence of characters. If abc occurs anywhere in the string we're testing (not just at the beginning), test will return true.

Looking for a set of characters

You could also find out whether a string contains abc using indexOf. Regular patterns allow you to go further and create more complex patterns.

Let's say we need to find any digit. Putting a set of characters in square brackets in a regular expression makes that part of the expression match any of the characters between the brackets.

Both of the following expressions match all strings that contain a digit.

console.log(/[0123456789]/.test("in 1992"));
// → true
console.log(/[0-9]/.test("in 1992"));
// → true

In square brackets, a dash between two characters specifies a range of characters, where the ordering is determined by the characters' Unicode numbers. The characters from 0 to 9 sit right next to each other in this ordering (codes 48 to 57), so [0-9] covers all of them and matches any digit.
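The code numbers mentioned above can be checked directly with charCodeAt. The second half of this sketch uses the range [a-f], chosen here purely as an illustration of how a range includes or excludes a character.

```javascript
// The digits occupy consecutive code numbers, which is why [0-9] works as a range.
const zeroCode = "0".charCodeAt(0);
const nineCode = "9".charCodeAt(0);
console.log(zeroCode, nineCode); // 48 57

// The same idea for letters: "e" falls inside a-f, "g" does not.
console.log(/[a-f]/.test("e")); // true
console.log(/[a-f]/.test("g")); // false
```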

Several character groups have their own built-in abbreviations.

\d  Any digit character
\w  An alphanumeric character
\s  Any whitespace character (space, tab, newline, etc.)
\D  A character that is not a digit
\W  A non-alphanumeric character
\S  A non-whitespace character
.   Any character except a newline

Thus, you can match a date and time format like 30-01-2003 15:20 with the following expression:

var dateTime = /\d\d-\d\d-\d\d\d\d \d\d:\d\d/;
console.log(dateTime.test("30-01-2003 15:20"));
// → true
console.log(dateTime.test("30-Jan-2003 15:20"));
// → false

Looks terrible, doesn't it? There are too many backslashes, which makes the pattern difficult to understand. We'll improve it slightly later.

Backslash codes can also be used inside square brackets. For example, [\d.] means any digit or a period. Notice that the period inside the square brackets loses its special meaning and becomes simply a period. The same applies to other special characters, such as +.

You can invert a set of characters - that is, say that you need to find any character except those that are in the set - by placing a ^ sign immediately after the opening square bracket.

var notBinary = /[^01]/;
console.log(notBinary.test("1100100010100110"));
// → false
console.log(notBinary.test("1100100010200110"));
// → true

Repeating parts of the template

We know how to find a single digit. What if we need to find a whole number, a sequence of one or more digits?

If you put a + sign after something in a regular expression, it means that this element may be repeated one or more times. /\d+/ means one or more digits.

console.log(/'\d+'/.test("'123'"));
// → true
console.log(/'\d+'/.test("''"));
// → false
console.log(/'\d*'/.test("'123'"));
// → true
console.log(/'\d*'/.test("''"));
// → true

The asterisk * has almost the same meaning, but it also allows the pattern to occur zero times. Something followed by an asterisk never prevents the pattern from matching: if no suitable text is found, it simply matches zero times.

A question mark makes part of the pattern optional, meaning it may occur zero times or once. In the following example, the character u may appear, but the pattern also matches when it does not.

var neighbor = /neighbou?r/;
console.log(neighbor.test("neighbour"));
// → true
console.log(neighbor.test("neighbor"));
// → true

To specify the exact number of times a pattern must occur, use braces. {4} after an element means that it must occur exactly 4 times. You can also specify a range: {2,4} means that the element must occur at least 2 and at most 4 times.

Here is another version of the date and time pattern, which allows one- or two-digit days, months and hours. It is also a little more readable.

var dateTime = /\d{1,2}-\d{1,2}-\d{4} \d{1,2}:\d{2}/;
console.log(dateTime.test("30-1-2003 8:45"));
// → true

You can also specify open-ended ranges by omitting the number after the comma: {5,} means five or more times. The lower bound cannot be omitted in JavaScript, so write {0,5} for zero to five times.
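A few quick checks of the brace forms. The ^ and $ markers are added here (an assumption of this sketch) so that the whole string has to fit the count, which makes the limits visible.

```javascript
const exactlyFour = /^\d{4}$/;
console.log(exactlyFour.test("2003")); // true
console.log(exactlyFour.test("203"));  // false

const twoToFour = /^\d{2,4}$/;
console.log(twoToFour.test("20"));    // true
console.log(twoToFour.test("20030")); // false

const fiveOrMore = /^\d{5,}$/;
console.log(fiveOrMore.test("200300")); // true

const zeroToFive = /^\d{0,5}$/;
console.log(zeroToFive.test(""));       // true
console.log(zeroToFive.test("123456")); // false
```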

Grouping Subexpressions

To use an operator like * or + on more than one element at a time, you can use parentheses. The part of a regular expression enclosed in parentheses counts as a single element as far as the operators are concerned.

var cartoonCrying = /boo+(hoo+)+/i;
console.log(cartoonCrying.test("Boohoooohoohooo"));
// → true

The first and second + characters apply only to the second o in boo and hoo. The third + applies to the whole group (hoo+), matching one or more such sequences.

The letter i at the end of the expression makes the regular expression case-insensitive, so that B matches b.

Matches and Groups

The test method is the simplest way to match a regular expression. It only tells you whether a match was found. Regular expressions also have an exec method, which returns null if nothing was found and otherwise returns an object with information about the match.

var match = /\d+/.exec("one two 100");
console.log(match);
// → ["100"]
console.log(match.index);
// → 8

The object returned by exec has an index property, which tells us where in the string the successful match begins. Other than that, the object looks like (and in fact is) an array of strings, whose first element is the string that was matched. In our example, this is the sequence of digits we were looking for.

Strings have a match method that works in much the same way.

console.log("one two 100".match(/\d+/));
// → ["100"]

When a regular expression contains subexpressions grouped with parentheses, the text that matched those groups also appears in the array. The first element is always the whole match. The next element is the part matched by the first group (the one whose opening parenthesis comes first in the expression), then the second group, and so on.

var quotedText = /'([^']*)'/;
console.log(quotedText.exec("she said 'hello'"));
// → ["'hello'", "hello"]

When a group is not matched at all (for example, when it is followed by a question mark), its position in the array holds undefined. If a group is matched several times, only the last match ends up in the array.

console.log(/bad(ly)?/.exec("bad"));
// → ["bad", undefined]
console.log(/(\d)+/.exec("123"));
// → ["123", "3"]

Groups are useful for retrieving parts of strings. If we don't just want to check whether a string has a date, but extract it and create an object representing the date, we can enclose the sequences of numbers in parentheses and select the date from the result of exec.

But first, a little digression in which we will learn the preferred way to store date and time in JavaScript.

Date type

JavaScript has a standard object type for dates, or rather, for moments in time. It is called Date. If you simply create a date object with new, you get the current date and time.

console.log(new Date());
// → Sun Nov 09 2014 00:07:57 GMT+0300 (CET)

You can also create an object for a specific time.

console.log(new Date(2015, 9, 21));
// → Wed Oct 21 2015 00:00:00 GMT+0300 (CET)
console.log(new Date(2009, 11, 9, 12, 59, 59, 999));
// → Wed Dec 09 2009 12:59:59 GMT+0300 (CET)

JavaScript uses a convention where month numbers start with a zero and day numbers start with a one. This is stupid and ridiculous. Watch out.

The last four arguments (hours, minutes, seconds and milliseconds) are optional and are set to zero if missing.

Timestamps are stored as the number of milliseconds that have passed since the beginning of 1970. Times before 1970 are represented by negative numbers (this follows the Unix time convention, which was created around that time). The getTime method of a date object returns this number. As you can imagine, it is big.

console.log(new Date(2013, 11, 19).getTime());
// → 1387407600000
console.log(new Date(1387407600000));
// → Thu Dec 19 2013 00:00:00 GMT+0100 (CET)

If you give the Date constructor one argument, it is treated as this number of milliseconds. You can get the current millisecond value by creating a Date object and calling the getTime method, or by calling the Date.now function.
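A small sketch of the round trip between a date object and its millisecond count, and of the two ways to get the current timestamp. The concrete numbers depend on the clock and the local time zone, so none are shown.

```javascript
// Two ways to get the current time as a millisecond count.
const viaObject = new Date().getTime();
const viaNow = Date.now();
console.log(viaNow - viaObject); // a very small, non-negative number

// Round trip: date -> milliseconds -> date.
const stamp = new Date(2013, 11, 19).getTime();
const restored = new Date(stamp);
console.log(restored.getFullYear(), restored.getMonth(), restored.getDate());
// 2013 11 19 (the month is zero-based, so 11 is December)
```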

Date objects have the methods getFullYear, getMonth, getDate, getHours, getMinutes, and getSeconds to extract their components. There is also a getYear method, which returns the year minus 1900 (such as 93 or 114) and is mostly useless.

By putting parentheses around the relevant parts of the pattern, we can create a date object directly from a string.

function findDate(string) {
  var dateTime = /(\d{1,2})-(\d{1,2})-(\d{4})/;
  var match = dateTime.exec(string);
  // match[1] is the day, match[2] the month, match[3] the year
  return new Date(Number(match[3]), Number(match[2]) - 1, Number(match[1]));
}
console.log(findDate("30-1-2003"));
// → Thu Jan 30 2003 00:00:00 GMT+0100 (CET)

Word and line boundaries

Unfortunately, findDate will just as happily extract the meaningless date 00-1-3000 from the string "100-1-30000". The match can occur anywhere in the string, so in this case it will simply start from the second character and end on the second to last one.

If we need to force the match to span the entire string, we use the ^ and $ markers. ^ matches the start of the string, and $ matches the end. So /^\d+$/ matches a string consisting entirely of one or more digits, /^!/ matches any string that starts with an exclamation mark, and /x^/ does not match any string (there cannot be an x before the start of the string).
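The three patterns just mentioned can be tried directly. The function findWholeDate below is a hypothetical variant of findDate, sketched here to show how the markers reject the "100-1-30000" input; returning null for a non-match is an assumption of this sketch.

```javascript
console.log(/^\d+$/.test("2003"));      // true
console.log(/^\d+$/.test("30-1-2003")); // false
console.log(/^!/.test("!important"));   // true
console.log(/x^/.test("x"));            // false

// Hypothetical stricter variant of findDate: the date must be the whole string.
function findWholeDate(string) {
  const match = /^(\d{1,2})-(\d{1,2})-(\d{4})$/.exec(string);
  if (match === null) return null;
  return new Date(Number(match[3]), Number(match[2]) - 1, Number(match[1]));
}

console.log(findWholeDate("100-1-30000"));             // null
console.log(findWholeDate("30-1-2003").getFullYear()); // 2003
```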

If, on the other hand, we just want to make sure that the date starts and ends on a word boundary, we use the \b mark. A word boundary can be the beginning or end of a line, or any place in a line where there is an alphanumeric character \w on one side and a non-alphanumeric character on the other.

console.log(/cat/.test("concatenate"));
// → true
console.log(/\bcat\b/.test("concatenate"));
// → false

Note that the boundary label is not a symbol. It's simply a constraint, meaning that a match only occurs if a certain condition is met.

Templates with choice

Let's say you need to find out whether the text contains not just a number, but a number followed by pig, cow, or chicken in the singular or plural.

It would be possible to write three regular expressions and check them one by one, but there is a better way. The pipe character (|) denotes a choice between the pattern to its left and the pattern to its right. So we can write the following:

var animalCount = /\b\d+ (pig|cow|chicken)s?\b/;
console.log(animalCount.test("15 pigs"));
// → true
console.log(animalCount.test("15 pigchickens"));
// → false

Parentheses delimit the portion of the pattern to which | is applied, and many such operators can be placed one after the other to indicate a choice from more than two options.

The mechanics of matching

Regular expressions can be thought of as flowcharts. The following diagram describes the previous livestock example.

An expression matches a string if a path can be found from the left side of the diagram to the right. We keep track of the current position in the string, and each time we pass through a rectangle, we check that the part of the string immediately after our position matches the contents of that rectangle.

So matching our regular expression against the string "the 3 pigs" goes through the flowchart like this:

- at position 4 there is a word boundary, so we pass the first rectangle
- still at position 4, we find a digit and pass the second rectangle
- at position 5, one path loops back to before the second rectangle, and the other moves on to the rectangle with a space. We have a space, not a digit, so we take the second path.
- now we are at position 6, the start of "pigs", and at the three-way fork. There is no "cow" or "chicken" here, but there is "pig", so we take that path.
- at position 9, after the three-way fork, one path skips the "s" and goes straight to the final word boundary rectangle, and the other goes through "s". We have an "s", so we go there.
- at position 10 we are at the end of the string, and only the word boundary can match. The end of the string counts as a boundary, so we pass the last rectangle. We have successfully matched the pattern.

Basically, the way regular expressions work is that the algorithm starts at the beginning of the string and tries to find a match there. In our case, there is a word boundary, so it passes the first rectangle - but there is no number there, so it stumbles on the second rectangle. Then it moves to the second character in the string, and tries to find a match there... And so on until it finds a match or gets to the end of the string, in which case no match is found.
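The "try the first position, then the next, and so on" behavior can be imitated by hand. The sketch below is only an illustration of the idea, not how an engine is actually implemented: the hypothetical function firstMatchStart uses the sticky flag (y), which makes a pattern match only at the position stored in lastIndex.

```javascript
// Hypothetical illustration: try to match at position 0, then 1, then 2, ...
function firstMatchStart(source, text) {
  // The y flag makes the pattern match only at exactly lastIndex.
  const sticky = new RegExp(source, "y");
  for (let start = 0; start <= text.length; start++) {
    sticky.lastIndex = start;
    if (sticky.test(text)) return start;
  }
  return -1; // reached the end without finding a match
}

const animalSource = "\\b\\d+ (pig|cow|chicken)s?\\b";
console.log(firstMatchStart(animalSource, "the 3 pigs"));  // 4
console.log(firstMatchStart(animalSource, "the 3 goats")); // -1
```

The result is the same position that the built-in search method reports for a non-sticky version of the pattern.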

Backtracking

The regular expression /\b([01]+b|\d+|[\da-f]+h)\b/ matches either a binary number followed by a b, a decimal number without a suffix, or a hexadecimal number (the digits 0 to 9 or the letters a to f) followed by an h. Corresponding diagram:

When searching for a match, the algorithm may enter the top path (binary number) even though the string does not contain such a number. For the string "103", for example, it becomes clear only at the 3 that the algorithm is on the wrong path. The string does match the expression, just not along the path it is currently on.

So the algorithm backtracks. At a fork, it remembers the current position (in our case, the start of the string, just past the word boundary) so that it can go back and try another path if the chosen one does not work out. For the string "103", after encountering the 3, it goes back and tries the path for decimal numbers. This one works, so a match is found.

The algorithm stops as soon as it finds a complete match. This means that even if several paths could match, only the first one (in the order in which the options appear in the regular expression) is used.
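The outcome of this process can be observed from the outside, even though the individual backtracking steps cannot. In this sketch the pattern is the one from the text (with its character sets written out), and the last example, using a made-up pattern /(a|ab)/, shows that the first option that leads to a match wins.

```javascript
const numberPattern = /\b([01]+b|\d+|[\da-f]+h)\b/;

// "103" fails on the binary path and is then matched as a decimal number.
console.log(numberPattern.exec("103")[0]);  // 103
console.log(numberPattern.exec("101b")[0]); // 101b
console.log(numberPattern.exec("7fh")[0]);  // 7fh

// Options are tried in order: "a" matches first, so "ab" is never tried.
console.log(/(a|ab)/.exec("ab")[0]); // a
```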

Backtracking also happens for repetition operators such as + and *. If you match /^.*x/ against the string "abcxe", the .* part first tries to consume the entire string. The algorithm then realizes that it also needs an "x". Since there is no "x" past the end of the string, the star operator tries to match one character less. But there is no x after abcx either, so it backtracks again, matching the star operator to just abc. Now it finds an x where it needs it and reports a successful match from position 0 to 4.

It is possible to write regular expressions that cause a lot of backtracking. The problem occurs when a pattern can match a piece of input in many different ways. For example, if we get confused while writing a regular expression for binary numbers, we might accidentally write something like /([01]+)+b/.

If the algorithm looks for this pattern in a long string of zeros and ones with no "b" at the end, it first goes through the inner loop until it runs out of digits. Then it notices that there is no "b", so it backtracks one position, goes through the outer loop once, gives up again, and tries to backtrack out of the inner loop once more. It continues to try every possible route through the two loops. This means that the amount of work doubles with each additional character. Even for a few dozen characters, the search takes practically forever.
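A cautious way to observe this is to time the faulty pattern on short inputs only. The following is a sketch: timeFailure is a hypothetical helper, the lengths are kept deliberately small, and the measured times depend on the machine and engine, so none are shown.

```javascript
const slowPattern = /([01]+)+b/;

// Hypothetical helper: time a failing match on a string of zeros and ones.
function timeFailure(length) {
  const input = "01".repeat(length / 2); // no "b" anywhere, so matching must fail
  const started = Date.now();
  const matched = slowPattern.test(input);
  return { length, matched, ms: Date.now() - started };
}

// Keep the lengths small: each extra character roughly doubles the work.
for (const length of [10, 14, 18]) {
  console.log(timeFailure(length));
}
```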

replace method

Strings have a replace method that can replace part of a string with another string.

console.log("papa".replace("p", "m"));
// → mapa

The first argument can also be a regular expression, in which case the first match of the regular expression is replaced. When the g (global) option is added to the regular expression, all matches in the string are replaced, not just the first.

console.log("Borobudur".replace(/[ou]/, "a"));
// → Barobudur
console.log("Borobudur".replace(/[ou]/g, "a"));
// → Barabadar

It would have been sensible to choose between replacing one match and replacing all matches through a separate argument or a separate method. Historically, though, the choice was made through a flag on the regular expression itself; a replaceAll string method was only added to the language later.

The real power of using regular expressions with replace comes from the fact that we can refer to matched groups in the replacement string. For example, say we have a long string containing people's names, one name per line, in the format "Last Name, First Name". If we need to swap the names and remove the comma to get "First Name Last Name", we write the following:

console.log(
  "Hopper, Grace\nMcCarthy, John\nRitchie, Dennis"
    .replace(/([\w ]+), ([\w ]+)/g, "$2 $1"));
// → Grace Hopper
//   John McCarthy
//   Dennis Ritchie

$1 and $2 in the replacement string refer to the parenthesized groups in the pattern. $1 is replaced by the text that matched the first group, $2 by the second, and so on, up to $9. The whole match can be referred to with $&.
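Since $& has no example above, here is a minimal sketch. The input string and the angle brackets are chosen only for illustration.

```javascript
// $& stands for the whole match, so every word is wrapped in angle brackets.
const wrapped = "one two".replace(/\w+/g, "<$&>");
console.log(wrapped); // <one> <two>
```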

You can also pass a function as the second argument. For each replacement, a function will be called whose arguments will be the found groups (and the entire matching part of the line), and its result will be inserted into a new line.

Simple example:

var s = "the cia and fbi";
console.log(s.replace(/\b(fbi|cia)\b/g, function(str) {
  return str.toUpperCase();
}));
// → the CIA and FBI

Here's a more interesting one:

var stock = "1 lemon, 2 cabbages, and 101 eggs";
function minusOne(match, amount, unit) {
  amount = Number(amount) - 1;
  if (amount == 1) // only one left, remove the "s" at the end
    unit = unit.slice(0, unit.length - 1);
  else if (amount == 0)
    amount = "no";
  return amount + " " + unit;
}
console.log(stock.replace(/(\d+) (\w+)/g, minusOne));
// → no lemon, 1 cabbage, and 100 eggs

The code takes a string, finds all occurrences of numbers followed by a word, and returns a string with each number reduced by one.

The group (\d+) goes into the amount argument, and (\w+) goes into the unit argument. The function converts amount to a number - and this always works, because our pattern is \d+. And then makes changes to the word, in case there is only 1 item left.

Greed

It's easy to use replace to write a function that removes all comments from JavaScript code. Here's the first try:

function stripComments(code) {
  return code.replace(/\/\/.*|\/\*[^]*\*\//g, "");
}
console.log(stripComments("1 + /* 2 */3"));
// → 1 + 3
console.log(stripComments("x = 10;// ten!"));
// → x = 10;
console.log(stripComments("1 /* a */+/* b */ 1"));
// → 1  1

The part before the "or" operator matches two slashes followed by any number of non-newline characters. The part for multi-line comments is more involved. We use [^], that is, any character that is not in the empty set of characters, as a way to match any character. We cannot use a period here, because block comments can continue on a new line, and the period does not match newline characters.

But the output of the last line is incorrect. Why?

The [^]* part first tries to match as many characters as it can. If that causes the next part of the pattern to fail, it moves back one character and tries again. In the example, the algorithm first tries to match the entire rest of the string and then moves back from there. After going back four characters, it finds an occurrence of */ and matches it. This is not what we wanted. We wanted to match a single comment, not to go all the way to the end of the code and find the end of the last comment.

Because of this behavior, we say that the repetition operators (+, *, ?, and {}) are greedy, meaning they match as much as they can and backtrack from there. If you put a question mark after them (+?, *?, ??, {}?), they become non-greedy and start by matching as little as possible, matching more only when the rest of the pattern does not fit the smaller match.
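The difference is easy to see on a short string. This sketch uses a made-up snippet of markup and the patterns /<.+>/ and /<.+?>/, which are illustrations rather than part of the comment-stripping example.

```javascript
const markup = "<b>bold</b> and <i>italic</i>";

// Greedy: .+ runs to the end and backs up only as far as the last ">".
const greedy = markup.match(/<.+>/)[0];
console.log(greedy); // <b>bold</b> and <i>italic</i>

// Non-greedy: .+? stops at the first ">" it can.
const lazy = markup.match(/<.+?>/)[0];
console.log(lazy); // <b>
```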

And that's what we need. By forcing the asterisk to find matches in the minimum possible number of characters in a line, we consume only one block of comments, and no more.

function stripComments(code) {
  return code.replace(/\/\/.*|\/\*[^]*?\*\//g, "");
}
console.log(stripComments("1 /* a */+/* b */ 1"));
// → 1 + 1

Many errors occur when using greedy operators instead of non-greedy ones. When using the repeat operator, always consider the non-greedy operator first.

Dynamically creating RegExp objects

In some cases, the exact pattern is unknown at the time the code is written. For example, you will need to look for the user's name in the text, and enclose it in underscores. Since you will only know the name after running the program, you cannot use slash notation.

But you can construct the string and use the RegExp constructor. Here's an example:

Var name = "harry"; var text = "And Harry has a scar on his forehead."; var regexp = new RegExp("\\b(" + name + ")\\b", "gi"); console.log(text.replace(regexp, "_$1_")); // → And _Harry_ has a scar on his forehead.

When creating word boundaries, we have to use double slashes because we write them in a normal line, and not in a regular sequence with forward slashes. The second argument to RegExp contains options for regular expressions - in our case “gi”, i.e. global and case-insensitive.

But what if the name is “dea+hlrd” (if our user is a kulhatzker)? As a result, we will get a meaningless regular expression that will not find matches in the string.

We can add backslashes before any character we don't like. We can't add backslashes before letters because \b or \n are special characters. But you can add slashes before any non-alphanumeric characters without any problems.

var name = "dea+hlrd";
var text = "This dea+hlrd is annoying everyone.";
var escaped = name.replace(/[^\w\s]/g, "\\$&");
var regexp = new RegExp("\\b(" + escaped + ")\\b", "gi");
console.log(text.replace(regexp, "_$1_"));
// → This _dea+hlrd_ is annoying everyone.
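The escaping step can be wrapped in a small reusable function. This is only a sketch: escapeForRegExp and highlight are hypothetical names, not built-in functions, and the \b markers are left out here on the assumption that a name may begin or end with a non-word character, where \b would not match as expected.

```javascript
// Hypothetical helper: put a backslash before every character that is
// neither a word character nor whitespace, as in the example above.
function escapeForRegExp(text) {
  return text.replace(/[^\w\s]/g, "\\$&");
}

// Hypothetical helper: wrap every occurrence of name in underscores.
function highlight(text, name) {
  var regexp = new RegExp("(" + escapeForRegExp(name) + ")", "gi");
  return text.replace(regexp, "_$1_");
}

console.log(escapeForRegExp("dea+hl[]rd"));
// → dea\+hl\[\]rd
console.log(highlight("Is 1+1 really 2?", "1+1"));
// → Is _1+1_ really 2?
```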

search method

The indexOf method on strings cannot be called with a regular expression. But there is another method, search, which does expect a regular expression. Like indexOf, it returns the index of the first match, or -1 when no match is found.

console.log("  word".search(/\S/));
// → 2
console.log("    ".search(/\S/));
// → -1

Unfortunately, there is no way to tell the method to look for a match starting at a specific offset (as you can do with indexOf). That would be helpful.
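One possible workaround is to search in a slice of the string and shift the result back. The helper below is a hypothetical sketch, not part of the language; note that anchors such as ^ and \b are then evaluated relative to the slice, not the full string.

```javascript
// Hypothetical helper: search for a pattern starting at a given offset.
function searchFrom(string, pattern, offset) {
  var found = string.slice(offset).search(pattern);
  // Translate the index back into the coordinates of the full string.
  return found === -1 ? -1 : found + offset;
}

console.log(searchFrom("a1b2c3", /\d/, 2)); // → 3
console.log(searchFrom("a1b2c3", /\d/, 6)); // → -1
```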

lastIndex property

The exec method also does not provide a convenient way to start searching from a given position in the string. But it does provide an inconvenient way.

Regular expression objects have properties. One of them is source, which contains the string that the expression was created from. Another is lastIndex, which controls, under some limited circumstances, where the next match will start.

These conditions include that the global option g must be present, and that the search must be done using the exec method. A more reasonable solution would be to simply allow an extra argument to be passed to exec, but reasonableness is not a fundamental feature of the JavaScript regex interface.

var pattern = /y/g;
pattern.lastIndex = 3;
var match = pattern.exec("xyzzy");
console.log(match.index);
// → 4
console.log(pattern.lastIndex);
// → 5

If the search was successful, the exec call updates the lastIndex property to point to the position after the found occurrence. If there was no success, lastIndex is set to zero - just like the lastIndex of the newly created object.

When you use a global regular expression for multiple exec calls, these automatic lastIndex updates can cause problems. Your regular expression might accidentally start searching at a position left over from a previous call.

var digit = /\d/g;
console.log(digit.exec("here it is: 1"));
// → ["1"]
console.log(digit.exec("and now: 1"));
// → null

Another interesting effect of the g option is that it changes how the match method works. When called with this option, instead of returning an array similar to the result of exec, it finds all occurrences of the pattern in the string and returns an array of the found substrings.

console.log("Banana".match(/an/g));
// → ["an", "an"]

So be careful with global regular expressions. The cases where they are necessary (calls to replace, and places where you explicitly use lastIndex) are probably the only cases where you should use them.
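The same lastIndex bookkeeping also applies to the test method, which makes the problem easy to reproduce. A short sketch with made-up variable names:

```javascript
var hasDigit = /\d/g;

var first = hasDigit.test("a1");  // true; lastIndex is now 2
var second = hasDigit.test("a1"); // false; the search started at 2, and lastIndex is reset to 0
console.log(first, second);
// → true false

// Resetting lastIndex by hand (or leaving out the g flag) avoids the surprise.
hasDigit.lastIndex = 0;
var third = hasDigit.test("a1");
console.log(third);
// → true
```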

Looping over matches

A common task is to scan through all occurrences of a pattern in a string in a way that gives us access to the match object in the loop body. We can do this using lastIndex and exec.

var input = "A string with 3 numbers in it... 42 and 88.";
var number = /\b(\d+)\b/g;
var match;
while (match = number.exec(input))
  console.log("Found", match[1], "at", match.index);
// → Found 3 at 14
//   Found 42 at 33
//   Found 88 at 40

This takes advantage of the fact that the value of an assignment expression is the assigned value. By using match = number.exec(input) as the condition of the while loop, we perform the match at the start of each iteration, store its result in a variable, and stop looping when no more matches are found.
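The loop can be packaged into a function that collects the matches. This is a sketch under a few assumptions: findAll is a hypothetical name, it works on a copy of the pattern so the caller's lastIndex is left alone, and it steps past zero-length matches so the loop always terminates.

```javascript
// Hypothetical helper: collect every match together with its position.
function findAll(pattern, input) {
  // Work on a copy that has the g flag, so the caller's lastIndex is not touched.
  var flags = pattern.flags.indexOf("g") === -1 ? pattern.flags + "g" : pattern.flags;
  var copy = new RegExp(pattern.source, flags);
  var results = [], match;
  while ((match = copy.exec(input)) !== null) {
    results.push({text: match[0], index: match.index});
    // A zero-length match does not advance lastIndex, so advance it by hand.
    if (match[0] === "") copy.lastIndex++;
  }
  return results;
}

var found = findAll(/\d+/, "3 apples, 42 pears");
console.log(found.length); // → 2
console.log(found[1].text, found[1].index); // → 42 10
```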

Parsing INI files

To conclude the chapter, let's look at a problem that calls for regular expressions. Imagine that we are writing a program that automatically collects information about our enemies from the Internet. (We won't write the entire program, just the part that reads the settings file. Sorry.) The file looks like this:

searchengine=http://www.google.com/search?q=$1
spitefulness=9.7

; a semicolon is placed before comments
; each section refers to a different enemy
[larry]
fullname=Larry Doe
type=kindergarten bully
website=http://www.geocities.com/CapeCanaveral/11451

[gargamel]
fullname=Gargamel
type=evil wizard
outputdir=/home/marijn/enemies/gargamel

The exact file format (which is quite widely used, and is usually called INI) is as follows:

- empty lines and lines starting with a semicolon are ignored
- lines enclosed in square brackets begin a new section
- lines containing an alphanumeric identifier followed by = add a setting in this section

Everything else is incorrect data.

Our task is to convert such a string into an array of objects, each with a name property and an array of settings. One object is needed for each section, and one more for global settings at the top of the file.

Since the file needs to be parsed line by line, it's a good idea to start by splitting the file into lines. To do this, we used string.split("\n") in Chapter 6. Some operating systems, however, separate lines not with a single \n character but with two characters: a carriage return followed by a newline (\r\n). Since the split method also accepts a regular expression as its argument, we can split on the expression /\r?\n/, which allows both \n and \r\n between lines.
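As a quick check of that splitting expression, here is a sketch on a made-up string that mixes both kinds of line endings:

```javascript
// Hypothetical input: the first break is \r\n, the second a bare \n.
var lines = "a=1\r\nb=2\nc=3".split(/\r?\n/);
console.log(lines);
// → ["a=1", "b=2", "c=3"]
```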

function parseINI(string) {
  // Let's start with an object containing the top-level settings
  var currentSection = {name: null, fields: []};
  var categories = [currentSection];

  string.split(/\r?\n/).forEach(function(line) {
    var match;
    if (/^\s*(;.*)?$/.test(line)) {
      return;
    } else if (match = line.match(/^\[(.*)\]$/)) {
      currentSection = {name: match[1], fields: []};
      categories.push(currentSection);
    } else if (match = line.match(/^(\w+)=(.*)$/)) {
      currentSection.fields.push({name: match[1], value: match[2]});
    } else {
      throw new Error("The line '" + line + "' contains invalid data.");
    }
  });

  return categories;
}

The code goes over every line, updating the current section object (currentSection) as it goes. First, it checks whether the line can be ignored, using the regular expression /^\s*(;.*)?$/. Do you see how it works? The part between the parentheses matches comments, and the ? after it makes sure the expression also matches lines consisting only of whitespace.

If the line is not a comment, the code checks to see if it starts a new section. If yes, it creates a new object for the current section, to which subsequent settings are added.

The last meaningful possibility is that the string is a normal setting, in which case it is added to the current object.

If none of the options work, the function throws an error.

Notice how the frequent use of ^ and $ ensures that the expression matches the entire string rather than just part of it. If you don't use them, the code will generally work, but will sometimes produce strange results and the error will be difficult to track down.
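A small sketch of that difference, using a made-up input line that is not a valid setting:

```javascript
var anchored = /^(\w+)=(.*)$/;
var unanchored = /(\w+)=(.*)/;
// Hypothetical malformed line that happens to contain something setting-like.
var line = "!! not a setting, but x=1";

console.log(anchored.test(line));      // → false
console.log(unanchored.test(line));    // → true
console.log(unanchored.exec(line)[1]); // → x
```

Without the anchors, the malformed line would silently be accepted as a setting named x.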

The if (match = string.match(...)) construct is similar to the trick of using assignment as a condition in a while loop. Often you don't know that the match call will succeed, so you can only access the result object inside an if block that checks for it. In order not to break the beautiful chain of if checks, we assign the search result to a variable and immediately use this assignment as a check.

International symbols

Because of the language's initially simplistic implementation, and the fact that this approach was later set in stone as standard behavior, JavaScript regular expressions are rather dumb about characters that do not appear in English. For example, a "word character", as far as JavaScript regular expressions are concerned, is only one of the 26 letters of the Latin alphabet (uppercase or lowercase), a decimal digit, or, for some reason, the underscore. Characters like é or β, which are clearly letters, do not match \w (and do match \W, the non-word-character class).

By a strange historical accident, \s (whitespace) does not have this problem: it matches all characters that Unicode considers whitespace, including things like the non-breaking space (and, in engines based on older Unicode versions, the Mongolian vowel separator).
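These claims are easy to check. The sketch below uses escape sequences for the test characters (\u00e9 is é, \u00a0 is the non-breaking space) so that the file encoding does not matter:

```javascript
var eAcute = "\u00e9"; // é
var nbsp = "\u00a0";   // non-breaking space

console.log(/\w/.test(eAcute)); // → false
console.log(/\W/.test(eAcute)); // → true
console.log(/\s/.test(nbsp));   // → true
```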

Some regular expression implementations in other languages have special syntax for matching specific categories of Unicode characters, such as "all uppercase letters", "all punctuation", or "control characters". JavaScript lacked such categories for a long time; since ECMAScript 2018, they are available as \p{...} property escapes in expressions that use the u flag.

Summary

Regular expressions are objects that represent patterns in strings. They use their own syntax to express these patterns.

/abc/      A sequence of characters
/[abc]/    Any character from a set of characters
/[^abc]/   Any character not in a set of characters
/[0-9]/    Any character in a range of characters
/x+/       One or more occurrences of the pattern x
/x+?/      One or more occurrences, non-greedy
/x*/       Zero or more occurrences
/x?/       Zero or one occurrence
/x{2,4}/   Between two and four occurrences
/(abc)/    A group
/a|b|c/    Any one of several patterns
/\d/       Any digit character
/\w/       An alphanumeric character ("word character")
/\s/       Any whitespace character
/./        Any character except newlines
/\b/       A word boundary
/^/        Start of input
/$/        End of input

A regular expression has a test method to check whether a given string matches it. It also has an exec method that, when a match is found, returns an array containing the whole match followed by all matched groups. That array has an index property indicating where the match started.

Strings have a match method to match them against a regular expression, and a search method that returns only the starting position of the match. Their replace method can replace matches of a pattern with a replacement string. Alternatively, you can pass a function to replace, which will be used to build up a replacement string based on the matched text and the matched groups.
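A brief sketch of the function form of replace. The sample sentence is made up; the callback receives the matched text as its first argument and returns the replacement.

```javascript
var sentence = "the cia and fbi";
var result = sentence.replace(/\b(fbi|cia)\b/g, function(match) {
  // Whatever the function returns is inserted in place of the match.
  return match.toUpperCase();
});
console.log(result);
// → the CIA and FBI
```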

Regular expressions can have options, which are written after the closing slash. The i option makes the match case-insensitive, and the g option makes the expression global, which, among other things, causes the replace method to replace all occurrences instead of just the first.

The RegExp constructor can be used to create regular expressions from strings.

Regular expressions are a sharp tool with an awkward handle. They simplify some tasks tremendously but can quickly become unmanageable when applied to complex problems. Part of knowing how to use them is resisting the urge to shoehorn into them tasks that they are not suited for.

Exercises

Inevitably, while working on these exercises, you will run into cases you do not understand, and you may sometimes despair at the unpredictable behavior of some regular expressions. It sometimes helps to study an expression in an online tool such as debuggex.com, where you can see a visualization of it and compare that with the effect you intended.

Regexp golf

"Code golf" is a game in which you try to express a given program in as few characters as possible. Regexp golf is the practice of writing the smallest possible regular expression that matches a given pattern, and only that pattern.

For each of the following items, write a regular expression to test whether any of the given substrings occur in a string. The regular expression should match only strings containing one of the substrings described. Do not worry about word boundaries unless they are explicitly mentioned. When your expression works, see whether you can make it any smaller.

- car and cat
- pop and prop
- ferret, ferry, and ferrari
- any word ending in ious
- a whitespace character followed by a period, comma, colon, or semicolon
- a word longer than six letters
- a word without the letter e

// Enter your regular expressions
verify(/.../, ["my car", "bad cats"], ["camper", "high art"]);
verify(/.../, ["pop culture", "mad props"], ["plop"]);
verify(/.../, ["ferret", "ferry", "ferrari"], ["ferrum", "transfer A"]);
verify(/.../, ["how delicious", "spacious room"], ["ruinous", "consciousness"]);
verify(/.../, ["bad punctuation ."], ["escape the dot"]);
verify(/.../, ["hottenottententen"], ["no", "hotten totten tenten"]);
verify(/.../, ["red platypus", "wobbling nest"], ["earth bed", "learning ape"]);

function verify(regexp, yes, no) {
  // Ignore unfinished exercises
  if (regexp.source == "...") return;
  yes.forEach(function(s) {
    if (!regexp.test(s))
      console.log("Not found '" + s + "'");
  });
  no.forEach(function(s) {
    if (regexp.test(s))
      console.log("Unexpected occurrence '" + s + "'");
  });
}

Quotes in text
Let's say you have written a story and used single quotation marks throughout to mark pieces of dialogue. Now you want to replace all the dialogue quotes with double quotes, while keeping the single quotes used in contractions like aren't.

Come up with a pattern that distinguishes between these two uses of quotes, and write a call to the replace method that does the replacement.

Numbers again
A sequence of digits can be matched by the simple regular expression /\d+/.

Write an expression that matches only JavaScript-style numbers. It must support an optional minus or plus sign in front of the number, the decimal dot, and exponent notation (5e-3 or 1E10), again with an optional sign in front of the exponent. Also note that it is not necessary for there to be digits both in front of and after the dot, but the number cannot be a dot alone. That is, .5 and 5. are valid JavaScript numbers, but a lone dot is not.

// Enter the regular expression here.
var number = /^...$/;

// Tests:
["1", "-1", "+15", "1.55", ".5", "5.", "1.3e2", "1E-4", "1e+12"].forEach(function(s) {
  if (!number.test(s))
    console.log("Did not find '" + s + "'");
});
["1a", "+-1", "1.2.3", "1+1", "1e4.5", ".5.", "1f5", "."].forEach(function(s) {
  if (number.test(s))
    console.log("Incorrectly accepted '" + s + "'");
});

Regex or regular expressions are intimidating for beginners, but essential for any programmer. Let's understand regular expressions using 5 simple examples with JavaScript.

There is a saying: if you have a problem and you decide to solve it with regular expressions, you now have two problems. Regular expressions found in code sometimes inspire fear and loathing in people who are not familiar with them.

In fact, any regular expression is just a pattern that can do the work of an entire function in a single line. To build one, however, you have to follow a set of strict rules, and a beginner can easily get confused by them and make mistakes.

Matching characters

The most basic regular expressions are those that match single characters. Here are their rules:

1. A period (.) matches any character. If you need to match a literal period, you must escape it with the "\" character (\.).

2. A question mark (?) indicates that the previous character is optional. To search for the question mark itself in a string, it must also be escaped with "\" (\?).

var text = "Lorem ipsum dolor sit amet, consectetur adipiscing elit lest. Donec convallis dignissim ligula, et rutrum est elat vistibulum eu.";

// Both "elit" and "elat" match. The dot means any character will do.
var regex = /el.t/g;
console.log(text.match(regex));

// Both "est" and "lest" match. The question mark makes the "l" optional.
var regex2 = /l?est/g;
console.log(text.match(regex2));

Match multiple characters

A set is one or more characters enclosed in square brackets, for example [abc]. Such an expression matches a single character from that set: in this example, only a, b, or c. Conversely, you can match any character except those in the set by using the ^ symbol: [^abc] matches any character that is not a, b, or c. You can also specify a range of letters or digits, for example [a-z] or [0-9].

There are built-in character sets that make regular expressions easier to write. They are called shorthand classes. For example, you can write \d instead of [0-9], and \D for anything that is not a digit. There are also shorthands for word characters (letters, digits, and underscores), \w and \W, and for whitespace, \s and \S.

// Only "cat" and "can" match, not "car".
var text = "cat car can";
console.log(text.match(/ca[tn]/g));

// Matches everything except cat and can (note the ^ symbol)
console.log(text.match(/ca[^tn]/g));

// Another example, where only digits match
text = "I would like 8 cups of coffee, please.";
console.log("How many cups: " + text.match(/[0-9]/g));

// An easier way, using the shorthand \d
console.log("How many cups: " + text.match(/\d/g));

// Matches everything except digits
console.log(text.match(/\D/g));

Matching words

In most cases, you need to match whole words rather than individual characters. This is done with the quantifiers + and *, which repeat a character or set of characters (one or more times and zero or more times, respectively).

Adding {x} specifies an exact number of repetitions, and {x,y} specifies a range (x and y are numbers).

Additionally, there is a special pattern \b that matches boundaries at the ends of words.

var text = "Hello people of 1974. I come from the future. In 2014 we have laser guns, hover boards and live on the moon!";

// Finds the years. \d+ matches one or more digits.
var yearRegex = /\d+/g;
console.log("Years: ", text.match(yearRegex));

// Finds all sentences. Here a sentence is any run of characters that ends with a period or an exclamation point.
var sentenceRegex = /.+?(\.|!)/g;
console.log("Sentences: ", text.match(sentenceRegex));

// Finds all words starting with "h". Both uppercase and lowercase are fine, so we use the i modifier.
// \b marks the word boundary.
var hWords = /\bh\w+/ig;
console.log("H Words: ", text.match(hWords));

// Finds all words from 4 to 6 characters long
var findWords = /\b\w{4,6}\b/g;
console.log("Words between 4 and 6 chars: ", text.match(findWords));

// Finds words of 5 characters or more
console.log("Words 5 chars or longer: ", text.match(/\b\w{5,}\b/g));

// Finds words exactly 6 characters long
console.log("Words exactly 6 chars long: ", text.match(/\b\w{6}\b/g));

Whole string validation

In JavaScript, such expressions can be used to validate user input from text fields. To validate a whole string, use an ordinary regular expression anchored to the beginning and end of the text with ^ (beginning of string) and $ (end of string). These anchors ensure that the pattern spans the entire text rather than matching just part of it.

Additionally, in this case, we use the test() method of the regex object, which returns true or false when testing whether the regular expression matches the string.

// We have an array of strings; let's find the links.
var strings = [
  "https://site/",
  "this is not a URL",
  "https://google.com/",
  "123461",
  "https://site/?s=google",
  "http://not a valid url",
  "abc http://invalid.url/"
];

// The hyphen is placed last in the set so that it is a literal hyphen rather than part of a range.
var regex = /^https?:\/\/[\w\/?.&=-]+$/;
var urls = [];

for (var i = 0; i < strings.length; i++) {
  if (regex.test(strings[i])) {
    // Valid link
    urls.push(strings[i]);
  }
}

console.log("Valid URLs: ", urls);

Search and replace

Another common task, which is facilitated by the use of regular expressions, is searching and replacing text.
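A minimal sketch of such a replacement, assuming a made-up list of names in "Last, First" format. The groups captured by the parentheses are referred to as $1 and $2 in the replacement string.

```javascript
// Hypothetical input: one "Last, First" name per line.
var names = "Hopper, Grace\nMcCarthy, John";

// Swap the two captured words; the g flag replaces every occurrence.
var swapped = names.replace(/(\w+), (\w+)/g, "$2 $1");
console.log(swapped);
// → Grace Hopper
//   John McCarthy
```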

Regular Expressions

A regular expression is an object that describes a pattern of characters. The RegExp class in JavaScript represents regular expressions, and both String and RegExp objects define methods that use regular expressions to perform pattern matching and search-and-replace operations on text. The regular expression grammar in JavaScript is a fairly complete subset of the regular expression syntax used in Perl 5, so if you have experience with Perl, you will find it easy to describe patterns in JavaScript programs.

Perl regular expression features that are not supported in ECMAScript 5 include the s (single-line mode) and x (extended syntax) flags; the escape sequences \a, \e, \l, \u, \L, \U, \E, \Q, \A, \Z, \z, and \G; and other extended constructs that begin with (?. (The s flag was added to the language later, in ECMAScript 2018.)

Defining Regular Expressions

In JavaScript, regular expressions are represented by RegExp objects. RegExp objects can be created with the RegExp() constructor, but more often they are created using a special literal syntax. Just as string literals are written as characters within quotation marks, regular expression literals are written as characters within a pair of slash (/) characters. So your JavaScript code might contain lines like this:

var pattern = /s$/;

This line creates a new RegExp object and assigns it to the variable pattern. This RegExp object matches any string that ends with the letter "s". The same regular expression can be defined with the RegExp() constructor:

var pattern = new RegExp("s$");

A regular expression pattern specification consists of a sequence of characters. Most characters, including all alphanumeric ones, simply describe characters to be matched literally. Thus, the regular expression /java/ matches any string that contains the substring "java".

Other characters in regular expressions are not matched literally but have a special meaning. For example, the regular expression /s$/ contains two characters. The first, s, matches itself literally. The second, $, is a special metacharacter that matches the end of a string. So this regular expression matches any string that ends with the character s.

The following sections describe the various characters and metacharacters used in regular expressions in JavaScript.

Literal characters

As noted earlier, all alphabetic characters and numbers in regular expressions match themselves. Regular expression syntax in JavaScript also supports the ability to specify certain non-alphabetic characters using escape sequences starting with a backslash (\) character. For example, the sequence \n matches the newline character. These symbols are listed in the table below:

Some punctuation marks have special meanings in regular expressions:

^ $ . * + ? = ! : | \ / ( ) [ ] { } -

The meaning of these symbols is explained in the following sections. Some of them have special meaning only in certain regular expression contexts, while in other contexts they are interpreted literally. However, in general, to literally include any of these characters in a regular expression, you must precede it with a backslash character. Other characters, such as quotes and @, have no special meaning and simply match themselves in regular expressions.

If you can't remember exactly which punctuation characters need to be escaped with a \, you can safely put a backslash in front of any punctuation character. Keep in mind, however, that many letters and digits take on a special meaning when preceded by a backslash, so any letters or digits that you want to match literally should not be escaped. To include the backslash character itself in a regular expression, you must of course precede it with another backslash. For example, the following regular expression matches any string that contains a backslash: /\\/.
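The effect of escaping can be checked directly with test. A short sketch with made-up sample strings (note that inside a JavaScript string literal, a single backslash is itself written as \\):

```javascript
var hasBackslash = /\\/.test("C:\\temp"); // the string contains one backslash
var anyChar = /a.c/.test("abc");           // an unescaped dot matches any character
var literalDot = /a\.c/.test("abc");       // an escaped dot matches only "."
var realDot = /a\.c/.test("a.c");

console.log(hasBackslash, anyChar, literalDot, realDot);
// → true true false true
```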

Character classes

Individual literal characters can be combined into character classes by placing them within square brackets. A character class matches any one character that is contained within it. Thus, the regular expression /[abc]/ matches any one of the letters a, b, or c.

Negated character classes can also be defined; these match any character except those contained within the brackets. A negated character class is specified by placing a ^ as the first character inside the left bracket. The regular expression /[^abc]/ matches any one character other than a, b, or c. Character classes can use a hyphen to indicate a range of characters. Any lowercase Latin letter is matched by /[a-z]/, and any letter or digit from the Latin alphabet by /[a-zA-Z0-9]/.

Certain character classes are used particularly often, so the regular expression syntax in JavaScript includes special characters and escape sequences to represent them. Thus, \s matches the space character, the tab character, and any other Unicode whitespace character, and \S matches any character that is not Unicode whitespace.

The table below lists these special characters and summarizes the character class syntax. (Note that some of the character class escape sequences match only ASCII characters and have not been extended to work with Unicode characters. You can explicitly define your own Unicode character classes; for example, /[\u0400-\u04FF]/ matches any one Cyrillic character.)

JavaScript Regular Expression Character Classes
Symbol    Matches
[...]     Any one character between the brackets
[^...]    Any one character not between the brackets
.         Any character other than a newline or another Unicode line terminator
\w        Any ASCII word character. Equivalent to [a-zA-Z0-9_]
\W        Any character that is not an ASCII word character. Equivalent to [^a-zA-Z0-9_]
\s        Any Unicode whitespace character
\S        Any character that is not Unicode whitespace. Note that \w and \S are not the same thing
\d        Any ASCII digit. Equivalent to [0-9]
\D        Any character other than an ASCII digit. Equivalent to [^0-9]
[\b]      A literal backspace character

Note that the special character class escapes can themselves be used within square brackets. \s matches any whitespace character and \d matches any digit, so /[\s\d]/ matches any one whitespace character or digit.
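The following sketch exercises a few of these classes on made-up strings. The Cyrillic sample is written with \u escapes (it spells "тест") so that the file encoding does not matter.

```javascript
// A class that combines two escapes: whitespace or digit.
var spaceOrDigit = "a1 b2".match(/[\s\d]/g);
console.log(spaceOrDigit);
// → ["1", " ", "2"]

// An explicit Unicode range for Cyrillic characters.
var cyrillic = "\u0442\u0435\u0441\u0442 test".match(/[\u0400-\u04FF]+/)[0];
console.log(cyrillic.length);
// → 4

// \w covers ASCII only, so it finds nothing in a purely Cyrillic word.
var asciiOnly = "\u0442\u0435\u0441\u0442".match(/\w/);
console.log(asciiOnly);
// → null
```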

Repetition

Given the knowledge of regular expression syntax gained so far, we can describe a two-digit number as /\d\d/ or a four-digit number as /\d\d\d\d/, but we cannot, for example, describe a number consisting of any number of digits, or a string of three letters followed by an optional digit. These more complex patterns use regular expression syntax, which specifies how many times a given regular expression element can be repeated.

Repeat symbols always follow the pattern to which they are applied. Some types of repetitions are used quite often, and special symbols are available to indicate these cases. For example, + matches one or more instances of the previous pattern. The following table provides a summary of the repetition syntax:

The following lines show several examples:

var pattern = /\d{2,4}/;   // Matches between two and four digits
pattern = /\w{3}\d?/;      // Matches exactly three word characters and an optional digit
pattern = /\s+java\s+/;    // Matches the word "java" with one or more spaces before and after it
pattern = /[^(]*/;         // Matches zero or more characters other than an opening parenthesis

Be careful when using the repetition characters * and ?. Since they can match zero instances of whatever precedes them, they are allowed to match nothing at all. For example, the regular expression /a*/ actually matches the string "bbbb", because the string contains zero occurrences of the letter a.
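A short sketch of these repetition forms in action, including the empty match just described. The sample strings are made up.

```javascript
// /a*/ succeeds on "bbbb" by matching the empty string at position 0.
var empty = /a*/.exec("bbbb");
console.log(empty[0] === "", empty.index);
// → true 0

// {2,4} is greedy: it takes as many digits as allowed, up to four.
var digits = /\d{2,4}/.exec("123456")[0];
console.log(digits);
// → 1234

// Three word characters followed by an optional digit.
var word = /\w{3}\d?/.exec("abc7")[0];
console.log(word);
// → abc7
```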

The repetition characters listed in the table match as many times as possible while still allowing the following parts of the regular expression to match. We say that this repetition is greedy. It is also possible to specify that repetition should be done in a non-greedy way. Simply follow the repetition character (or characters) with a question mark: ??, +?, *?, or even {1,5}?.

For example, the regular expression /a+/ matches one or more instances of the letter a. Applied to the string "aaa", it matches all three letters. On the other hand, the expression /a+?/ matches one or more instances of the letter a and selects the least possible number of characters. Applied to the same string, this pattern matches only the first letter a.

Non-greedy repetition does not always give the expected result. Consider the pattern /a+b/, which matches one or more a's followed by the letter b. When applied to the string "aaab", it matches the entire string.

Now let's check the "non-greedy" version of /a+?b/. One might think that it would match a b preceded by only one a. If applied to the same string, "aaab" would be expected to match the single character a and last character b. However, this pattern actually matches the entire string, just like the greedy version. The fact is that a regular expression pattern search is performed by finding the first position in the string, starting from which a match becomes possible. Since a match is possible starting from the first character of the string, shorter matches starting from subsequent characters are not even considered.
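The four cases discussed above can be verified directly. A small sketch:

```javascript
var greedyA = "aaa".match(/a+/)[0];      // greedy: takes all three letters
var lazyA = "aaa".match(/a+?/)[0];       // non-greedy: takes only the first
var greedyAB = "aaab".match(/a+b/)[0];   // the whole string
var lazyAB = "aaab".match(/a+?b/)[0];    // still the whole string: matching starts at position 0

console.log(greedyA, lazyA, greedyAB, lazyAB);
// → aaa a aaab aaab
```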

Alternation, Grouping, and References

The regular expression grammar includes special characters for specifying alternatives, grouping subexpressions, and referring to previous subexpressions. The | character separates alternatives. For example, /ab|cd|ef/ matches the string "ab", the string "cd", or the string "ef", and the pattern /\d{3}|[a-z]{4}/ matches either three digits or four lowercase letters.

Note that alternatives are processed from left to right until a match is found. If a match is found with the left alternative, the right one is ignored, even if a “better” match can be achieved. Therefore, when the pattern /a|ab/ is applied to the string "ab", it will only match the first character.
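A two-line sketch shows how the order of the alternatives changes the result:

```javascript
var shortFirst = "ab".match(/a|ab/)[0]; // the left alternative wins
var longFirst = "ab".match(/ab|a/)[0];  // reordering lets the longer one match

console.log(shortFirst, longFirst);
// → a ab
```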

Parentheses have several purposes in regular expressions. One is to group separate items into a single subexpression, so that the items are treated as a single unit by |, *, +, ?, and so on. For example, the pattern /java(script)?/ matches the word "java" followed by the optional word "script", and /(ab|cd)+|ef/ matches either the string "ef" or one or more repetitions of either of the strings "ab" or "cd".

Another use of parentheses in regular expressions is to define subpatterns within a pattern. When a regular expression match is found in the target string, the portion of the target string that matches any specific subpattern enclosed in parentheses can be extracted.

Suppose you want to find one or more lowercase letters followed by one or more digits. To do this, you can use the pattern /[a-z]+\d+/. But suppose we only really care about the digits at the end of each match. If we put that part of the pattern in parentheses (/[a-z]+(\d+)/), we can extract the digits from any matches we find. How this is done will be described below.
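As a preview, here is a minimal sketch of extracting a parenthesized subpattern with exec, using the pattern /[a-z]+(\d+)/ (lowercase letters followed by a captured run of digits) on a made-up string. Element 0 of the result holds the whole match, and element 1 holds the text of the first group.

```javascript
var match = /[a-z]+(\d+)/.exec("order abc123 shipped");

console.log(match[0]); // → abc123
console.log(match[1]); // → 123
```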

A related use of parenthesized subexpressions is to refer back to a subexpression from later in the same regular expression. This is done by following a \ character with a digit or digits. The digits refer to the position of the parenthesized subexpression within the regular expression. For example, \1 refers to the first subexpression, and \3 refers to the third. Note that subexpressions can be nested within each other, so it is the position of the left parenthesis that is counted. In the following regular expression, for example, a reference to the nested subexpression ([Ss]cript) would be written \2:

/([Jj]ava([Ss]cript)?)\sis\s(fun\w*)/

A reference to a previous subexpression refers not to the pattern of that subexpression but to the text that matched it. References can therefore be used to enforce a constraint that separate parts of a string contain exactly the same characters. For example, the following regular expression matches zero or more characters inside single or double quotes. However, it does not require the opening and closing quotes to match (that is, both single or both double):

/['"][^'"]*['"]/

We can require the quotation marks to match by using a reference:

/(['"])[^'"]*\1/

Here \1 matches whatever the first subexpression matched. In this example, the reference imposes a constraint that requires the closing quotation mark to match the opening one. This regular expression does not allow single quotes inside double quotes, or vice versa.
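
A small sketch of the difference, assuming the two patterns are /['"][^'"]*['"]/ (no reference) and /(['"])[^'"]*\1/ (with a reference); the names loose and strict are illustrative:

```javascript
// Without a backreference, mismatched quotes are accepted.
var loose = /['"][^'"]*['"]/;
// With \1, the closing quote must be the same character as the opening one.
var strict = /(['"])[^'"]*\1/;

var mixed = "'mixed\"";   // opens with ' and closes with "
var same = "\"same\"";    // opens and closes with "

console.log(loose.test(mixed));  // true
console.log(strict.test(mixed)); // false
console.log(strict.test(same));  // true
```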

It is also possible to group elements in a regular expression without creating a numbered reference to those elements. Instead of simply grouping the elements between ( and ), begin the group with (?: and end it with ). Consider, for example, the following pattern:

/([Jj]ava(?:[Ss]cript)?)\sis\s(fun\w*)/

Here the subexpression (?:[Ss]cript) is used only for grouping, so that the repetition character ? can be applied to the group. These modified parentheses do not create a reference, so in this regular expression \2 refers to the text matched by (fun\w*).
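
The effect on group numbering can be seen by running both versions of the pattern against the same text. This sketch assumes the character classes [Jj] and [Ss] in the patterns and an illustrative input string:

```javascript
var text = "JavaScript is fun";

// Capturing version: group 2 is the nested ([Ss]cript), group 3 is (fun\w*).
var capturing = /([Jj]ava([Ss]cript)?)\sis\s(fun\w*)/.exec(text);
console.log(capturing[2]); // "Script"
console.log(capturing[3]); // "fun"

// With (?:...), the nested group is not numbered, so (fun\w*) becomes group 2.
var nonCapturing = /([Jj]ava(?:[Ss]cript)?)\sis\s(fun\w*)/.exec(text);
console.log(nonCapturing[2]);     // "fun"
console.log(nonCapturing.length); // 3 (whole match plus two groups)
```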

The following table lists the alternation, grouping, and reference operators in regular expressions:

Regular expression symbols for alternation, grouping, and references
Symbol Meaning
| Alternation. Matches either the subexpression on the left or the subexpression on the right.
(...) Grouping. Groups elements into a single unit that can be used with *, +, ?, |, and so on. Also remembers the characters that match this group for use in later references.
(?:...) Grouping only. Groups elements into a single unit, but does not remember the characters that match this group.
\number Matches the same characters that were matched by group number number. Groups are subexpressions inside (possibly nested) parentheses. Group numbers are assigned by counting left parentheses from left to right. Groups formed with (?: are not numbered.

Specifying a Match Position

As described earlier, many elements of a regular expression match a single character in a string. For example, \s matches a single whitespace character. Other regular expression elements match the positions between characters rather than the characters themselves. For example, \b matches a word boundary: the boundary between a \w (ASCII word character) and a \W (non-word character), or the boundary between an ASCII word character and the beginning or end of a string.

Elements such as \b do not specify any characters that must be present in the matched string, but they do specify valid positions for a match. These elements are sometimes called regular expression anchors because they anchor the pattern to a specific position in the string. The most commonly used anchors are ^ and $, which anchor a pattern to the beginning and end of the string, respectively.

For example, the word "JavaScript" on a line by itself can be found using the regular expression /^JavaScript$/. To find the separate word "Java" (rather than a prefix, as in "JavaScript"), you could try the pattern /\sJava\s/, which requires whitespace before and after the word.

But such a solution raises two problems. First, it will only find the word "Java" if it is surrounded by spaces on both sides, and will not be able to find it at the beginning or end of the line. Secondly, when this pattern does match, the string it returns will contain leading and trailing spaces, which is not exactly what we want. So instead of using a pattern that matches whitespace characters \s, we'll use a pattern (or anchor) that matches word boundaries \b. The result will be the following expression: /\bJava\b/.

The anchor \B matches a position that is not a word boundary. Thus the pattern /\B[Ss]cript/ matches "JavaScript" and "postscript" but does not match "script" or "Scripting".
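
The following sketch compares the three approaches. It assumes the \B pattern is /\B[Ss]cript/, and the sample sentence is illustrative:

```javascript
var sentence = "Java and JavaScript";

// \s requires real whitespace on both sides, so "Java" at the start is missed.
var withSpaces = /\sJava\s/.test(sentence);
console.log(withSpaces); // false

// \b matches positions, so the match is just "Java", with no spaces included.
var word = /\bJava\b/.exec(sentence);
console.log(word[0], word.index); // "Java" 0

// \B requires a non-boundary position before "Script" or "script".
console.log(/\B[Ss]cript/.test("JavaScript")); // true
console.log(/\B[Ss]cript/.test("Scripting"));  // false
```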

Arbitrary regular expressions can also serve as anchor conditions. If you place an expression between (?= and ), it becomes a lookahead assertion: it requires that the following characters match the specified pattern, but does not include them in the match.

For example, to match the name of a common programming language only when it is followed by a colon, you can use the expression /[Jj]ava([Ss]cript)?(?=\:)/. This pattern matches the word "JavaScript" in the string "JavaScript: The Definitive Guide", but it does not match the word "Java" in the string "Java in a Nutshell" because it is not followed by a colon.

If you instead begin the assertion with (?!, it becomes a negative lookahead assertion, which requires that the following characters do not match the specified pattern. For example, the pattern /Java(?!Script)([A-Z]\w*)/ matches the substring "Java" followed by a capital letter and any number of additional ASCII word characters, provided that "Java" is not followed by "Script". It matches "JavaBeans" but not "Javanese", and it matches "JavaScrip" but not "JavaScript" or "JavaScripter".
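
Both kinds of lookahead can be tried out as follows. This is a sketch that assumes the patterns /[Jj]ava([Ss]cript)?(?=:)/ and /Java(?!Script)([A-Z]\w*)/; the variable names are illustrative:

```javascript
// Positive lookahead: the colon must follow but is not part of the match.
var positive = /[Jj]ava([Ss]cript)?(?=:)/;
var title = positive.exec("JavaScript: The Definitive Guide");
console.log(title[0]);                            // "JavaScript"
console.log(positive.test("Java in a Nutshell")); // false

// Negative lookahead: "Java" must not be followed by "Script".
var negative = /Java(?!Script)([A-Z]\w*)/;
console.log(negative.test("JavaBeans"));  // true
console.log(negative.test("Javanese"));   // false
console.log(negative.test("JavaScrip"));  // true
console.log(negative.test("JavaScript")); // false
```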

The table below provides a list of regular expression anchor characters:

Regular expression anchor characters
Symbol Meaning
^ Matches the beginning of the string or, in multiline mode, the beginning of a line.
$ Matches the end of the string or, in multiline mode, the end of a line.
\b Matches a word boundary, that is, the position between a \w character and a \W character, or between a \w character and the beginning or end of the string. (Note, however, that [\b] matches the backspace character.)
\B Matches a position that is not a word boundary.
(?=p) Positive lookahead assertion. Requires the following characters to match the pattern p, but does not include those characters in the match.
(?!p) Negative lookahead assertion. Requires that the following characters do not match the pattern p.

Flags

One last element of regular expression grammar remains. Regular expression flags specify high-level pattern-matching rules. Unlike the rest of the grammar, flags are specified not between the slash characters but after the second slash. This section covers the three classic flags i, g, and m; later editions of ECMAScript add others, such as s, u, and y.

The i flag specifies that pattern matching should be case-insensitive, and the g flag specifies that the search should be global, that is, all matches in the string should be found. The m flag performs pattern matching in multiline mode. If the string being searched contains newlines, then in this mode the anchors ^ and $ match the beginning and end of each line in addition to the beginning and end of the entire string. For example, the pattern /java$/im matches both "java" and "Java\nis fun".

These flags can be combined in any combination. For example, to search for the first occurrence of the word "java" (or "Java", "JAVA", and so on) regardless of case, you can use the regular expression /\bjava\b/i. To find all occurrences of the word in a string, add the g flag: /\bjava\b/gi.
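
The effect of each flag can be observed with a few short calls. This is a minimal sketch with illustrative sample strings:

```javascript
var twoLines = "Java\nis fun";

// Without m, $ matches only at the very end of the string.
var withoutM = /java$/i.test(twoLines);
// With m, $ also matches just before the newline.
var withM = /java$/im.test(twoLines);
console.log(withoutM, withM); // false true

// g finds every occurrence; i ignores case.
var all = "Java, JAVA and java".match(/\bjava\b/gi);
console.log(all); // ["Java", "JAVA", "java"]
```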

Methods of the String class for searching by pattern

Up to this point, we have discussed the grammar of regular expressions, but we haven't looked at how those regular expressions can actually be used in JavaScript code. In this section we discuss the methods of the String object that use regular expressions for pattern matching and for search-and-replace operations. We then continue the discussion of pattern matching by looking at the RegExp object and its methods and properties.

Strings support four methods that use regular expressions. The simplest is search(). It takes a regular expression as an argument and returns either the position of the first character of the matched substring, or -1 if no match is found. For example, the following call returns 4:

var result = "JavaScript".search(/script/i); // 4

If the argument to the search() method is not a regular expression, it is first converted by passing it to the RegExp constructor. The search() method does not support global search and ignores the g flag in its argument.

The replace() method performs a search-and-replace operation. It takes a regular expression as its first argument and a replacement string as its second. It searches the string on which it is called for matches of the specified pattern.

If the regular expression contains the g flag, the replace() method replaces all matches found with the replacement string. Otherwise, it only replaces the first match found. If the replace() method's first argument is a string rather than a regular expression, then the method performs a literal search for the string rather than converting it to a regular expression using the RegExp() constructor as the search() method does.

As an example, we can use the replace() method to capitalize the word "JavaScript" consistently throughout a string of text:

// Whatever the case of the letters, replace the word with the correct capitalization
var result = "javascript".replace(/JavaScript/ig, "JavaScript");
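
The three ways of calling replace() described above behave differently on the same input. A minimal sketch with an illustrative string:

```javascript
var s = "a.b.c";

var firstOnly = s.replace(/\./, "-");  // no g flag: first match only
var every = s.replace(/\./g, "-");     // g flag: all matches
var literal = s.replace(".", "-");     // string argument: literal search, first occurrence

console.log(firstOnly); // "a-b.c"
console.log(every);     // "a-b-c"
console.log(literal);   // "a-b.c"
```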

The replace() method is more powerful than this example would suggest. Let me remind you that the subexpressions in parentheses within a regular expression are numbered from left to right, and that the regular expression remembers the text corresponding to each of the subexpressions. If the replacement string contains a $ sign followed by a number, the replace() method replaces those two characters with the text that matches the specified subexpression. This is a very useful feature. We can use it, for example, to replace straight quotes in a string with typographical quotes, which are simulated by ASCII characters:

// A quote is a quotation mark, followed by any number of characters
// other than quotation marks (which we remember), followed by another
// quotation mark
var quote = /"([^"]*)"/g;
// Replace the straight quotation marks with typographic ones, leaving
// the quoted text (stored in $1) unchanged
var text = '"JavaScript" is an interpreted programming language.';
var result = text.replace(quote, "``$1''");
// ``JavaScript'' is an interpreted programming language.

An important thing to note is that the second argument to replace() can be a function that dynamically calculates the replacement string.

Method match() is the most general of the String class methods that use regular expressions. It takes a regular expression as its only argument (or converts its argument to a regular expression by passing it to the RegExp() constructor) and returns an array containing the search results. If the g flag is set in the regular expression, the method returns an array of all matches present in the string. For example:

// Returns ["1", "2", "3"]
var result = "1 plus 2 equals 3".match(/\d+/g);

If the regular expression does not have the g flag, match() does not perform a global search; it simply looks for the first match. However, match() returns an array even in that case. The first element of the array is the matched substring, and the remaining elements are the substrings that matched the parenthesized subexpressions of the regular expression. Thus, if match() returns an array arr, then arr[0] contains the complete match, arr[1] contains the substring that matched the first subexpression, and so on. To draw a parallel with the replace() method, the contents of $n are stored in arr[n].

For example, take a look at the following code that parses a URL:

var url = /(\w+):\/\/([\w.]+)\/(\S*)/;
var text = "Visit our website http://www..php";
var result = text.match(url);
if (result != null) {
    var fullurl = result[0];  // The complete matched URL
    var protocol = result[1]; // The protocol, matched by (\w+)
    var host = result[2];     // The host name, matched by ([\w.]+)
}
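
To see concrete values for each element, here is a sketch of the same idea with a hypothetical URL (www.example.com and its path are placeholders chosen for illustration, not taken from the original text):

```javascript
var url = /(\w+):\/\/([\w.]+)\/(\S*)/;
// Hypothetical URL, used only for illustration.
var text = "Visit our website at http://www.example.com/docs/index.html";
var result = text.match(url);

if (result != null) {
    console.log(result[0]); // "http://www.example.com/docs/index.html"
    console.log(result[1]); // "http"
    console.log(result[2]); // "www.example.com"
    console.log(result[3]); // "docs/index.html"
}
```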

Note that for a regular expression without the global flag g, match() returns the same value as the regular expression's exec() method: the returned array has the index and input properties, as described in the discussion of exec() below.

The last of the String object methods that uses regular expressions is split(). This method splits the string on which it is called into an array of substrings, using the argument as a delimiter. For example:

"123,456,789".split(","); // Return ["123","456","789"]

The split() method can also take a regular expression as an argument. This makes the method more powerful. For example, you can specify a delimiter that allows an arbitrary number of whitespace characters on both sides:

"1, 2, 3 , 4 , 5".split(/\s*,\s*/); // Return ["1","2","3","4","5"]

RegExp object

As mentioned, regular expressions are represented as RegExp objects. In addition to the RegExp() constructor, RegExp objects support three methods and several properties.

The RegExp() constructor takes one or two string arguments and creates a new RegExp object. The first argument to the constructor is a string containing the body of the regular expression, i.e. text that must appear between slash characters in a regular expression literal. Note that string literals and regular expressions use the \ character to represent escape sequences, so when passing the regular expression as a string literal to the RegExp() constructor, you must replace each \ character with a pair of \\ characters.

The second argument to RegExp() may be missing. If specified, it defines the regular expression flags. It must be one of the characters g, i, m, or a combination of these characters. For example:

// Finds all five-digit numbers in a string. Note
// the use of \\ in this example
var zipcode = new RegExp("\\d{5}", "g");

The RegExp() constructor is useful when the regular expression is generated dynamically and therefore cannot be represented using regular expression literal syntax. For example, to find a string entered by the user, you need to create a regular expression at runtime using RegExp().
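
A sketch of building a pattern from user input follows. The helper escapeRegExp is hypothetical (it is not a built-in function and is not part of the original text); it is added here on the assumption that the input should be matched literally, so characters such as + are escaped first:

```javascript
// Hypothetical helper: escapes characters that have special meaning in a
// regular expression so that the input is matched literally.
function escapeRegExp(input) {
    return input.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

var userInput = "1+1"; // stands in for text typed by a user
var pattern = new RegExp(escapeRegExp(userInput), "g");
var found = "1+1=2, not 11".match(pattern);

console.log(pattern.source); // 1\+1
console.log(found);          // ["1+1"]
```

Without the escaping step, the + would act as a repetition character and the pattern would also match "11".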

RegExp Properties

This section describes five properties of RegExp objects. The source property is a read-only string containing the text of the regular expression. The global property is a read-only boolean value that specifies whether the regular expression has the g flag. The ignoreCase property is a read-only boolean value that specifies whether the regular expression has the i flag. The multiline property is a read-only boolean value that specifies whether the regular expression has the m flag. The last property, lastIndex, is a read/write integer. For patterns with the g flag, it stores the position in the string at which the next search should begin. As described below, it is used by the exec() and test() methods.
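
These properties can be inspected directly. A minimal sketch with an illustrative pattern:

```javascript
var re = /java/gi;

console.log(re.source);     // "java"
console.log(re.global);     // true
console.log(re.ignoreCase); // true
console.log(re.multiline);  // false
console.log(re.lastIndex);  // 0

// After a successful exec() on a global pattern, lastIndex points
// just past the match ("Java" occupies positions 0 to 3).
re.exec("Java and java");
var afterFirst = re.lastIndex;
console.log(afterFirst);    // 4
```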

RegExp Methods

RegExp objects define two methods that perform pattern matching; they behave similarly to the String class methods described above. The main method of the RegExp class used for pattern matching is exec(). It is similar to the previously mentioned match() method of the String class, except that it is a RegExp class method that takes a string as an argument, rather than a String class method that takes a RegExp argument.

The exec() method executes the regular expression against the specified string, that is, it searches the string for a match. If no match is found, the method returns null. If a match is found, it returns an array just like the one returned by match() for a non-global search. Element 0 of the array contains the string that matched the regular expression, and the subsequent elements contain the substrings that matched the parenthesized subexpressions. In addition, the index property contains the character position at which the match begins, and the input property refers to the string that was searched.

Unlike match(), exec() returns the same kind of array whether or not the regular expression has the g flag. Recall that match() returns an array of all matches when passed a global regular expression. exec(), by contrast, always returns a single match and provides complete information about it. When exec() is called on a regular expression that has the g flag, it sets the lastIndex property of the regular expression object to the position of the character immediately following the matched substring.

When exec() is called a second time on the same regular expression, it begins its search at the character position given by lastIndex. If exec() does not find a match, it resets lastIndex to 0. (You can also set lastIndex to zero at any time, and you should do so whenever you abandon a search before finding the last match in one string and begin searching another string with the same RegExp object.) This special behavior allows you to call exec() repeatedly to loop through all the matches of a regular expression in a string. For example:

var pattern = /Java/g;
var text = "JavaScript is more fun than Java!";
var result;
while ((result = pattern.exec(text)) != null) {
    console.log("Found '" + result[0] + "'" +
                " at position " + result.index +
                "; next search will start at " + pattern.lastIndex);
}

The other RegExp method is test(), which is much simpler than exec(). It takes a string and returns true if the string contains a match for the regular expression:

var pattern = /java/i;
pattern.test("JavaScript"); // Returns true

Calling test() is equivalent to calling exec() and returning true if exec() returns something other than null. For this reason, test() behaves the same way as exec() when called on a global regular expression: it begins searching the specified string at the position given by lastIndex, and if it finds a match, it sets lastIndex to the position of the character immediately following the match. You can therefore loop through a string with test() just as you can with exec().
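
The way test() advances lastIndex on a global pattern can be traced step by step. A minimal sketch with an illustrative string; the results are collected in an array so they can be inspected together:

```javascript
var pattern = /java/gi;
var text = "Java, java";
var steps = [];

// Each call resumes at lastIndex; a failed call resets lastIndex to 0.
for (var i = 0; i < 3; i++) {
    var found = pattern.test(text);
    steps.push([found, pattern.lastIndex]);
}

console.log(steps); // [[true, 4], [true, 10], [false, 0]]
```

This is also why reusing one global RegExp object for unrelated test() calls can give surprising results unless lastIndex is reset.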

In JavaScript, regular expressions are represented by RegExp objects. RegExp objects can be created using the RegExp() constructor, but more often they are created using a special literal syntax. Just as string literals are specified as characters enclosed in quotation marks, regular expression literals are specified as characters enclosed in a pair of slash characters / .

/pattern/flags
new RegExp("pattern"[, flags])

Here pattern is the regular expression to search for (replacement is covered later), and flags is a string made up of any combination of the characters g (global search), i (case-insensitive), and m (multiline search). The first form is used often, the second only occasionally. For example, /ab+c/i and new RegExp("ab+c", "i") are equivalent.

Search options

When creating a regular expression, we can specify additional search options (flags).

Characters in JavaScript Regular Expressions

Symbol Meaning
Alphanumeric characters Match themselves
\0 NUL character (\u0000)
\t Tab (\u0009)
\n Line feed (\u000A)
\v Vertical tab (\u000B)
\f Form feed (\u000C)
\r Carriage return (\u000D)
\xnn The Latin character specified by the hexadecimal number nn; for example, \x0A is the same as \n
\uxxxx The Unicode character specified by the hexadecimal number xxxx; for example, \u0009 is the same as \t
\cX The control character ^X; for example, the sequence \cJ is equivalent to the newline character \n
\ Before an ordinary character, makes it special. For example, /s/ simply looks for the character "s", whereas /\s/ denotes a whitespace character. Conversely, before a special character such as *, \ makes it an ordinary asterisk. For example, /a*/ searches for 0 or more consecutive "a" characters; to find the literal text "a*", put \ in front of the special character: /a\*/.
^ Matches the beginning of the input. If the multiline flag ("m") is set, it also matches at the start of each new line. For example, /^A/ will not find the "A" in "an A", but will find the first "A" in "An A."
$ Matches the end of the input. If the multiline flag is set, it also matches at the end of each line. For example, /t$/ will not find "t" in "eater", but it will find it in "eat".
* Indicates repetition 0 or more times. For example, /bo*/ will find "boooo" in "A ghost booooed" and "b" in "A bird warbled", but will find nothing in "A goat grunted".
+ Indicates repetition 1 or more times. Equivalent to {1,}. For example, /a+/ will match the "a" in "candy" and all the "a" characters in "caaaaaaandy".
? Indicates that the element may or may not be present. For example, /e?le?/ will match "el" in "angel" and "le" in "angle." If used immediately after one of the quantifiers *, +, ?, or {}, it specifies a "non-greedy" search (repeating the minimum number of times possible, up to the next pattern element), as opposed to the default "greedy" mode, in which the number of repetitions is maximal even if the next pattern element would also match. In addition, ? is used in lookahead assertions and non-capturing groups, described in this table under x(?=y), x(?!y), and (?:x).
. (Dot) matches any character other than a line terminator: \n, \r, \u2028, or \u2029 (use [\s\S] to match any character, including line terminators). For example, /.n/ will match "an" and "on" in "nay, an apple is on the tree", but not "nay".
(x) Finds x and remembers the match. These are called capturing parentheses. For example, /(foo)/ will find and remember "foo" in "foo bar." The matched substring is stored in the result array and in the legacy static properties of the RegExp constructor: $1, ..., $9. In addition, the parentheses combine their contents into a single pattern element. For example, (abc)* means abc repeated 0 or more times.
(?:x) Finds x but does not remember the match. These are called non-capturing parentheses. The matched substring is not stored in the result array or in the RegExp properties. Like all parentheses, they combine their contents into a single subpattern.
x(?=y) Finds x only if x is followed by y. For example, /Jack(?=Sprat)/ will only match "Jack" if it is followed by "Sprat". /Jack(?=Sprat|Frost)/ will only match "Jack" if it is followed by "Sprat" or "Frost". However, neither "Sprat" nor "Frost" will appear in the search result.
x(?!y) Finds x only if x is not followed by y. For example, /\d+(?!\.)/ will only match a number if it is not followed by a decimal point. /\d+(?!\.)/.exec("3.141") will find 141, but not 3.141.
x|y Finds x or y. For example, /green|red/ will match "green" in "green apple" and "red" in "red apple."
{n} Where n is a positive integer. Finds exactly n repetitions of the preceding element. For example, /a{2}/ will not find the "a" in "candy," but will find both a's in "caandy," and the first two a's in "caaandy."
{n,} Where n is a positive integer. Finds n or more repetitions of the preceding element. For example, /a{2,}/ will not find "a" in "candy", but will find all the a's in "caandy" and in "caaaaaaandy."
{n,m} Where n and m are positive integers. Finds from n to m repetitions of the preceding element.
[xyz] Character set. Finds any one of the listed characters. You can specify a range with a hyphen. For example, [abcd] is the same as [a-d]. Matches "b" in "brisket" and "a" and "c" in "ache".
[^xyz] Any character other than those specified in the set. You can also specify a range. For example, [^abc] is the same as [^a-c]. Finds "r" in "brisket" and "h" in "chop."
[\b] Finds the backspace character. (Not to be confused with \b.)
\b Finds a (Latin) word boundary, such as the position between a letter and a space. (Not to be confused with [\b].) For example, /\bn\w/ will match "no" in "noonday"; /\wy\b/ will find "ly" in "possibly yesterday."
\B Finds a position that is not a word boundary. For example, /\w\Bn/ will match "on" in "noonday", and /y\B\w/ will match "ye" in "possibly yesterday."
\cX Where X is a letter from A to Z. Matches a control character in a string. For example, /\cM/ matches the Ctrl-M character.
\d Finds a digit. Equivalent to [0-9]. For example, /\d/ or /[0-9]/ will match the "2" in "B2 is the suite number."
\D Finds a non-digit character. Equivalent to [^0-9]. For example, /\D/ or /[^0-9]/ will match the "B" in "B2 is the suite number."
\s Finds any whitespace character, including space, tab, newline, and other Unicode whitespace characters. For example, /\s\w*/ will match " bar" in "foo bar."
\S Finds any character except whitespace. For example, /\S\w*/ will match "foo" in "foo bar."
\v Vertical tab character.
\w Finds any (Latin) word character, including letters, digits, and underscores. Equivalent to [A-Za-z0-9_]. For example, /\w/ will match "a" in "apple," "5" in "$5.28," and "3" in "3D."
\W Finds any non-word (Latin) character. Equivalent to [^A-Za-z0-9_]. For example, /\W/ and /[^A-Za-z0-9_]/ will equally match "%" in "50%."
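
The difference between greedy and non-greedy repetition, and the bounded quantifier, can be illustrated as follows. This is a minimal sketch; the sample strings are illustrative:

```javascript
var html = "<b>bold</b> and <i>italic</i>";

// Greedy: .* takes as much as possible, up to the last ">".
var greedy = html.match(/<.*>/)[0];
// Non-greedy: .*? stops at the first ">" that lets the pattern succeed.
var lazy = html.match(/<.*?>/)[0];

console.log(greedy); // "<b>bold</b> and <i>italic</i>"
console.log(lazy);   // "<b>"

// {n,m} bounds the repetition count: at least 2 and at most 3 a's.
var bounded = "caaaandy".match(/a{2,3}/)[0];
console.log(bounded); // "aaa"
```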

Working with Regular Expressions in Javascript

Working with regular expressions in JavaScript is done through methods of the String and RegExp classes.

regexp.exec(str) - searches the string for a match of the pattern. Returns an array if there is a match, or null if nothing is found. With the g modifier, each call returns the next match after the previous one; this is implemented by storing the offset of the last search in the regexp's lastIndex property.

str.match(regexp) - finds the part of a string that matches the pattern. If the g modifier is specified, match() returns an array of all matches, or null (rather than an empty array) if there are none. Without the g modifier, it works like exec().

regexp.test(str) - checks whether the string matches the pattern. Returns true if there is a match and false otherwise.

str.split(regexp) - splits the string on which it is called into an array of substrings, using the argument as a delimiter.

str.replace(regexp, mix) - returns a modified string in accordance with the pattern. The first parameter can also be a string rather than a regular expression. Without the g modifier, the method replaces only the first occurrence; with the g modifier, the replacement is global, that is, all occurrences in the string are replaced. mix is the replacement and can be a string, a replacement template, or a function.

Special characters in the replacement string
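
The replacement string may contain the standard JavaScript special sequences $n (text of the n-th group), $& (the whole match), $` (the text before the match), $' (the text after the match), and $$ (a literal dollar sign). The following sketch shows each one; the sample strings and variable names are illustrative:

```javascript
var s = "John Smith";

var swapped = s.replace(/(\w+)\s(\w+)/, "$2, $1"); // groups by number
var wrapped = s.replace(/Smith/, "[$&]");          // the whole match
var before = s.replace(/Smith/, "$`");             // text before the match
var after = "a-b".replace(/-/, "$'");              // text after the match
var dollar = "price: 5".replace(/\d+/, "$$$&");    // literal $ followed by the match

console.log(swapped); // "Smith, John"
console.log(wrapped); // "John [Smith]"
console.log(before);  // "John John "
console.log(after);   // "abb"
console.log(dollar);  // "price: $5"
```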

Replacement via function

If you pass a function as the second parameter, it is called for each match. The function can dynamically generate and return the replacement string. The first parameter of the function is the matched substring. If the first argument to replace() is a RegExp object, the next n parameters contain the matches of the n parenthesized groups. The last two parameters are the position in the string at which the match occurred and the string itself.
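
The parameter order described above can be sketched as follows. The pattern, the input string, and the parameter names (match, first, rest, offset, whole) are illustrative choices:

```javascript
var offsets = [];

// Callback parameters: the whole match, one argument per capturing group,
// then the offset of the match and the original string.
var result = "hello world".replace(/\b([a-z])(\w*)/g,
    function (match, first, rest, offset, whole) {
        offsets.push(offset);
        return first.toUpperCase() + rest;
    });

console.log(result);  // "Hello World"
console.log(offsets); // [0, 6]
```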