Post · bonfire.cafe

Yesterday, I crafted a RegEx that finds double words (aka “the the”). This is something that’s pretty useless if you write or edit in a relatively recent word processor — and maybe I’d better start doing an edit pass in one of those — but I share it here in case it’s useful to you.

(\b[A-z'‘’]+\b) \1

Don’t do a replace-all with that until you’re confident it’s not picking up false positives … and probably not even then.

Not every RegEx implementation will allow you to reference a capture group inside the search term, so deliberately put a “the the” somewhere in the text to confirm it’s working correctly for you.
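One low-risk way to run that check is to list the matches instead of replacing anything. What follows is a minimal sketch in Python, whose built-in re module does accept \1 inside the search pattern. The function name find_doubles and the sample sentence are made up for illustration, and the pattern is the one from this post typed with a straight apostrophe plus both curly ones.

```python
import re

# The pattern from this post: a captured word, one space, then the same word again.
DOUBLE_WORD = re.compile(r"(\b[A-z'‘’]+\b) \1")

def find_doubles(text):
    """Return (position, matched text) pairs so each hit can be reviewed by hand."""
    return [(m.start(), m.group(0)) for m in DOUBLE_WORD.finditer(text)]

# A planted "the the" (and a doubled contraction) to prove the search works.
sample = "I planted the the test phrase, and I don't don't trust replace-all."
for position, hit in find_doubles(sample):
    print(position, repr(hit))
```

If the planted phrase does not show up in the output, the tool in use probably handles back-references differently.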

If I get a moment, I’ll explain what it’s doing so you can make adjustments if you need to.

#AmWriting #AmEditing #WritingCommunity #Writing #RegEx

( starts a capture group and ) ends it. The usual purpose of a capture group is to keep or rearrange part of a string. In this case, I'm using it so I can reference it later in the same search string.

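\b marks a word boundary: the spot between a word character and a non-word character, or the edge of the text. I use it here so that the captured text has to begin and end at the edges of a word, rather than somewhere in the middle of a longer one.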
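To see what the boundaries buy you, here is a small Python sketch comparing the pattern with and without \b. The letters-only character set and the sample phrase are simplifications chosen for the demonstration, not part of the original pattern.

```python
import re

with_boundaries = re.compile(r"(\b[A-z]+\b) \1")
without_boundaries = re.compile(r"([A-z]+) \1")

text = "bathe the baby"

# With \b, the capture cannot start in the middle of "bathe", so nothing matches.
print(with_boundaries.search(text))

# Without \b, the tail of "bathe" plus the next word looks like a doubled "the".
print(without_boundaries.search(text).group(0))
```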

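[ marks the start of a set of allowed characters and ] marks the end of it. The set can simply list the characters that are allowed, and it can also include ranges. In this case, it has A-z as a range. That covers every English letter in both cases, but because it runs straight through the character table from capital A to lowercase z, it also picks up the six symbols that sit between Z and a: [ \ ] ^ _ and `. If that bothers you, A-Za-z is the stricter way to write it. If you have case sensitivity off, a-z would do the job on its own.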
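For anyone curious about exactly which characters A-z accepts, this short Python sketch walks through the ASCII table and collects them. The variable names are just for illustration.

```python
import re

# Every ASCII character that the set [A-z] accepts.
accepted = [chr(code) for code in range(128) if re.fullmatch(r"[A-z]", chr(code))]

# Anything accepted that is not actually a letter.
extras = [ch for ch in accepted if not ch.isalpha()]
print(extras)

# The stricter spelling accepts letters only.
strict = [chr(code) for code in range(128) if re.fullmatch(r"[A-Za-z]", chr(code))]
print(len(accepted), len(strict))
```

In ordinary prose those extra symbols rarely sit next to a space twice in a row, so the looser range is unlikely to cause trouble, but it is worth knowing about before adapting the pattern to other text.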

After A-z, I included '‘’. That's each of the single quote options I could think of. This is to allow the RegEx to find doubled contractions. For example: don't don't. If you have others, you could add them to the characters I already included. If you need a dash / hyphen, it's safest to include that character with an escape (example: \-). I explain more about escapes in the paragraph after the next.
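As a sketch of that hyphen adjustment, the Python snippet below adds \- to the character set and compares the result with the unmodified pattern. The hyphenated variant and the sample sentence are assumptions made for illustration.

```python
import re

original = re.compile(r"(\b[A-z'‘’]+\b) \1")
# Hypothetical variant: same set plus an escaped hyphen.
with_hyphen = re.compile(r"(\b[A-z'‘’\-]+\b) \1")

text = "a well-known well-known fact"

# The original set stops at the hyphen, so the doubled compound goes unnoticed.
print([m.group(0) for m in original.finditer(text)])

# With the hyphen allowed, the whole compound is captured and found twice.
print([m.group(0) for m in with_hyphen.finditer(text)])
```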

+ says to repeat the previous character or section one or more times, as many times as it can. This is what allows it to grab whole words instead of, say, just sections where one word has the same first character as the last character of the word before it.
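Here is a small Python sketch of that difference, assuming a letters-only set for brevity. The first pattern drops the + (and the boundaries), so it can only compare one character on each side of the space.

```python
import re

text = "the end of the the line"

# No +: a single captured character, a space, then the same character.
single = re.compile(r"([A-z]) \1")
print([m.group(0) for m in single.finditer(text)])

# With + and boundaries: a whole captured word, a space, then the same word.
whole = re.compile(r"(\b[A-z]+\b) \1")
print([m.group(0) for m in whole.finditer(text)])
```

The single-character version flags the "e e" straddling "the end" and misses the real doubled word entirely.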

\ before a character either escapes it or designates a special function. If you wanted to find + marks in a string, you would need to 'escape' them so that RegEx wouldn't treat them as repeating the character or group that came before. To 'escape' a +, you would write a backslash before it, like this: \+.
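A quick Python sketch of the escaped and unescaped + side by side. The sample string is made up, and re.escape is a Python standard-library helper that adds the backslashes for you; other tools may not have an equivalent.

```python
import re

text = "1+1 and 11"

# Unescaped: + means "one or more of the previous character".
repeated = re.findall(r"1+", text)
print(repeated)

# Escaped: \+ means a literal plus sign.
literal = re.findall(r"1\+", text)
print(literal)

# re.escape builds the escaped form of a string automatically.
escaped = re.escape("1+1")
print(escaped)
```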

The special functions that are available after a \ will vary based on the implementation.

Backslash followed by a number references a capture group. Capture groups are numbered on a first-come, first-served basis, counting opening parentheses from left to right. Since I only have one capture group in this RegEx, \1 here refers to what was found by the first capture group.

This is the part that allows the RegEx to identify that the word it's looking at now is identical to the previous word.
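The Python sketch below shows the back-reference at work, along with one false positive worth knowing about: nothing in the pattern says the repeated text has to be a whole word, so "the theme" counts as a hit. The stricter variant with a trailing \b is an assumption added here for comparison, not part of the original pattern, and the sample sentence is invented.

```python
import re

original = re.compile(r"(\b[A-z'‘’]+\b) \1")
# Hypothetical variant: a trailing \b so the repeat must also end at a word edge.
stricter = re.compile(r"(\b[A-z'‘’]+\b) \1\b")

text = "Is the theme the the problem?"

# group(1) is what the capture group held; start() is where the match began.
original_hits = [(m.start(), m.group(1)) for m in original.finditer(text)]
stricter_hits = [(m.start(), m.group(1)) for m in stricter.finditer(text)]

print(original_hits)  # includes the hit inside "the theme"
print(stricter_hits)  # only the genuine doubled word
```

Cases like this are a good reason to review each match by hand rather than trusting a replace-all.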