Chess in Regular Expressions

Chess in Regular Expressions

This content was threaded together by the @mrflowbot so that reading can flow more smoothly. If you'd like to try it, send a DM to @mrflowbot.

You can read the original thread on Twitter here:
https://twitter.com/8bitclassroom/status/1295194073016365056

Wanna learn some byte-sized #regex?

Let's break down a chess move and learn some patterns along the way! THREAD

https://twitter.com/8bitclassroom/status/1294610517911199749

First, @gillnj had recently asked me how to get started with regex.

My take on regex is you only need to find the regular expression you need at a given time. I prefer to keep a log of the patterns I use over time instead of learning it all.

Nick Gill asked about how to learn regex.

Background

Chess moves are logged in portable game notation (PGN).

This means a chess game can save all of its moves as a single/multiple line of text in a database instead of dedicating individual rows to each move. Massive implications.

https://en.wikipedia.org/wiki/Portable_Game_Notation

But the compactness of a chess game also means to read it or reopen it, you need to take apart the PGN. Here's some examples of chess moves:

  • e4
  • xd5
  • Nf3
  • Rhd4
  • a2xa3=Q#

It's complete gibberish and even more so when it's all in one paragraph when saved.

The Standard Move

Let's break down the parts of an individual chess move before putting it in the context of the overall chess game.

The move e4 is the most common opening for white.

In regex, this can be identified by:

[a-h][1-8]

This means, "Search for a pattern matching a character from the choices [a,b,c,d,e,f,g,h] beside a digit between 1-8."

A pawn that captures another piece is noted with an "x" and a square, e.g. xd5.

But since not all moves are captures, we want x to be optional. Include a question mark, ?, to make the preceding match optional.

x?[a-h][1-8]

Pawns are not the only pieces in the game. Each of the other pieces has an abbreviation:

  • K for king
  • Q for queen
  • R for rook
  • B for bishop
  • N for knight (to disambiguate it from K for king)

We can include these choices in square brackets:

[KQRBN]?x?[a-h][1-8]

This means, "Search for one of the characters from K, Q, R, B, or N, but optionally. Match x for a capture, but optionally. Require the rank (1-8) and file (a-h) of a square."

What happens when two rooks are on the same rank and are both capable of moving to the same square. How do you distinguish which rook made a move?

Chess includes a character beside the piece's abbreviation to disambiguate it:

Rhd4 means "move the rook from file h to d4."

We already know how to match for a square from the first regex pattern. We just include it beside the abbreviation:

[KQRBN]?[a-h]?[1-8]?x?[a-h][1-8]

Relative to the last pattern, this means, "Match an optional rank and file for a square between the piece abbreviation and x.

Regex Summary: Part 1

Let's recap some patterns:

Lesson 1:
✅ [ ] requires a match of characters in the brackets and can include a range of letters or digits.

I remember it because [ ] is used to denote an array in JSON.

Lesson 2:
✅ A character like "x" outside of brackets is matched case-sensitive.

Lesson 3:
✅ Including a ? following criteria makes the preceding content optional.

And some implied lessons:

Lesson 4:
✅ Patterns can be combined and matched together by placing them one after another.

Lesson 5:
✅ Each pattern can be marked as optional--not just one.

There may be side effects to a given move:

  • ⬜ A pawn can be promoted if it reaches the other side of the board
  • ⬜ The opposing king can be placed in check or checkmate
  • ⬜ The move may include a score indicating the winner or a draw.

Pawn Promotion

Pawn promotion is indicated with an equal sign '=' and the abbreviation for the piece that the pawn is promoted to, typically queen, but it can also be a rook, bishop or knight.

[KQRBN]?[a-h]?[1-8]?x?[a-h][1-8](\=[QRBN])?

Here's the change in isolation:

(\=[QRBN])?

Wait. What's up with the slash and the parentheses?

The equals sign, =, already means something in regex. If you want to match the character for '=', you need to disambiguate it with \.

Parentheses, ( ), indicate that everything enclosed is matched together.

Check and Checkmate

Check is indicated with a + following a move, and checkmate with a #. Both are optional for each move.

Using what we learned earlier--if we want to search from an array of choices, use square brackets:

[+#]?

In context:

[KQRBN]?[a-h]?[1-8]?x?[a-h][1-8](\=[QRBN])?[+#]?

Scoring

Chess is scored in 3 ways:

  • 1-0 White wins
  • 0-1 Black wins
  • 1/2-1/2 Draw game

To show options, use a pipe symbol: |.
I remember it because || is the operator for 'or' in Power Apps.

(1-0|0-1|1\/2-1\/2)

There's that slash again! The \ slash again is used to disambiguate / which already has a meaning in regex.

The score should have a space between the move and the score. It's easiest to see a space with \s which matches a whitespace character.

\s(1-0|0-1|1\/2-1\/2)

Where you place parentheses makes a difference for the group. We want the entire score to be optional, including the space so we wrap the whole thing.

(\s(1-0|0-1|1\/2-1\/2))?

So far we've matched the standard move involving one piece moving to another square and its consequences:

[KQRBN]?[a-h]?[1-8]?x?[a-h][1-8](\=[QRBN])?[+#]?(\s(1-0|0-1|1\/2-1\/2))?

But there's another kind of move: castling.

Castling

The funny and maybe frustrating thing about castling is that it's written differently by different standards. It could appear as:

  • O-O-O
  • o-o-o
  • 0-0-0

And it could vary in length depending on if the king is castling kingside or queenside:

  • O-O
  • o-o
  • 0-0

Your first instinct may be to use | and () to encapsulate all the different possibilities. It would work, but that's a long expression.

Think about what's the same here. The O looking character is repeated. And there's options for what kind of O.

Options use an array:

[Oo0]

The pipe can certainly match either 2 or 3 instances of the Os:

[Oo0]-[Oo0]-[Oo0]|[Oo0]-[Oo0]

It works, but it's still very long. Think about what's the minimum match here--two of those instances:

[Oo0]-[Oo0]

Now we can make the third optional:

[Oo0]-[Oo0](-[Oo0])?

Can we vary the repeats of something even more efficiently?

It is possible to check for repeats of a pattern using {min,max}.

So let's look at castling again:

[Oo0](-[Oo0]){1,2}

This matches one instance of [Oo0] and 1-2 instances of -[Oo0].

Let's summarize so far. There are two kinds of moves:

(
  [Oo0](-[Oo0]){1,2}
  | 
  [KQRBN]?[a-h]?[1-8]?x?[a-h][1-8](\=[QRBN])?[+#]?(\s(1-0|0-1|1\/2-1\/2))?
)

This means, "Match a move in the castling pattern OR the standard move pattern." Line breaks added for emphasis.

Note that when using the pipe, each option does not need parentheses around the entire thing--it's understood.

Regex Summary: Part 2

Lesson 6:
✅ A pattern can vary its length for matching by using {min,max}.

Lesson 7:
✅ Options do not need to be wrapped in parentheses.

Chess moves in context

Players take turns moving in chess and pairs of moves are offset by a move number:

1. e4 e5

Here's what we need to figure out next:
✅ How do you match the digit and period (.)?
✅ How do you match two instances of a move which is already a long expression?

Move number

Any digit can be matched by using \d. But that causes individual digits to be matched.

To match multiple digits at once, we can use Lesson 6 to vary the length of the match:

\d{1,}

This matches at least one digit with no maximum specified so it doesn't have a limit.

It can be problematic if you match for an unspecified number of digits--how do you stop that?

We can ensure it matches with the content beside it:

\d{1,}\.

This matches multiple digits followed by a period. Again a \ disambiguates the period which has a meaning in regex.

Using the same logic of varying length, we can use {} to repeat the existing part of the expression we have for a move in chess!

In this expression, we match a chess move followed by an optional space at least one time and up to two times. Line breaks added for emphasis.

(
(
[Oo0](-[Oo0]){1,2}
|
[KQRBN]?[a-h]?[1-8]?x?[a-h][1-8]?[+#]?(\s(1-0|0-1|1\/2-1\/2))?
)
\s?
){1,2}

And together with the move number, period, and optional space, we can match the top using the bottom:

1. e4 e5

\d{1,}\.\s?
(([Oo0](-[Oo0]){1,2}|[KQRBN]?[a-h]?[1-8]?x?[a-h][1-8](\=[QRBN])?[+#]?(\s(1-0|0-1|1\/2-1\/2))?)\s?){1,2}

The regex so far will result in a table in which each row contains a string with the move number and a pair of move text. Ex:

  • Row 1: 1. e4 e5
  • Row 2: 2. Nf3 Nf6

I match moves as a pair because of ... CHESS COMMENTS.

Comments

Each move in chess may be accompanied by comments of unspecified length. And unfortunately, chess comments might contain the rank and file of a square or entire move text.

Here's an example of comments:

1. e4 {King's Pawn opening} e5 {King's Gambit}

Comments are actually written after a move and check/checkmate, but ahead of any score.

17. a1xa3=Q# {Pawn from a1 captures piece on a3, is promoted to Queen, and causes checkmate} 1-0

Let's match the comments in isolation.

  • \{ matches the left brace
  • \} matches the right brace
  • .+? can match any number of characters, not including line breaks or terminators.

\s?\{.+?\} can match an optional space and a comment enclosed by the curly braces.

.+? is very inclusive in matching, but the right curly brace stops it from going too far.

But what happens if a given comment has a line break or terminator?

I solve that by using Trim() and Substitute on Char(10) and Char(13) before the match. Saves a lot of headache.

I'll cover an edge case next, but here's the expression so far, emphasizing where the optional comment fits in:

\d{1,}\.\s?
(
([Oo0]-[Oo0](-[Oo0])?|[KQRBN]?[a-h]?[1-8]?x?[a-h][1-8](\=[QRBN])?[+#]?)
(\s?\{.+?\})?
(\s(1-0|0-1|1\/2-1\/2))?
\s?
){1,2}

Edge cases

There is one edge case--a rare one. There might be an ellipsis after the move number when a comment is so long it is better to repeat the move number.

1. e4 {Long comment...} 1... e5

Move number 1 is repeated, and where white's move would be is an ellipsis.

\d{1,}
\.{1,3}
\s?

Next steps

Here's the final form of the regex. Note that for it to work, all spaces must be removed. This makes is very hard to read and understand for a blog post, so I've kept the spacing intact for readability (not that regex is too readable to begin with).

\d{1,}\.{1,3}\s?
(
([Oo0]-[Oo0](-[Oo0])?|[KQRBN]?[a-h]?[1-8]?x?[a-h][1-8](\=[QRBN])?[+#]?)
(\s?\{.+?\})?
(\s(1-0|0-1|1\/2-1\/2))?
\s?
){1,2}

Once you have each pair of moves for each move number, you can match individual moves in each as a follow up action. It becomes much more manageable when your match is not competing against the text from an entire game of chess.

You can recycle the move and comment portions of the expression. The goal is to ignore comments and get the moves that are outside of comments. It's possible to do in regex, but it's much easier to do the filtering in Power Apps or wherever you are parsing the moves.

Lesson 8:
✅ It's not necessary to do everything in regex. Pair up regex with your other skills to perform the rest of the heavy lifting.

For more examples of regex in the context of Chess, please find my open source app from my repo:

https://github.com/mr-dang-msft/Power-Platform-Games

My Chess app is littered with regex to construct and deconstruct portable game notation (PGN) and Forsyth-Edwards Notation (FEN) to save and load chess matches.