Regex Golf is a programming game where the goal is to write a regex which matches all the items in one list without matching any item in a second list, with the catch that the regex has to be as short as possible (otherwise it would only be necessary to exactly match each item in the first list, like /^(item|item|item|...)$/). The game is a descendant of code golf, a much older programming game where the goal is to write the shortest possible program that performs a given task.
Regex golf appears to have been first introduced by the website regex.alf.nu sometime in 2013, but went unnamed until January 6, 2014, when Randall Munroe published xkcd #1313: "Regex Golf".
With carefully chosen lists, regex golf might also serve as an example of nerd sniping, another concept introduced by Munroe in a much earlier xkcd strip, on December 12, 2007.
In the original xkcd strip, two examples of regex golf are given: the first, M | [TN]|B, is said to match the subtitles of all Star Wars films without matching any of the subtitles of Star Trek films, and the second, bu|[rn]t|[coy]e|[mtg]a|j|iso|n[hl]|[ae]d|lev|sh|[lnd]i|[po]o|ls, is said to match the last names of elected United States presidents but not their opponents; both regexes are analyzed by Explainxkcd.com and an analysis won't be repeated here.
The website regex.alf.nu takes the form of a series of programming challenges in which the reader is presented with two lists of words or phrases and an input box to type the regex in. Because this is an in-browser game, the ECMA flavor of regexes is used for solving levels. The left list is the one to be matched, and the right list is the one to be excluded; a score is given to the right of the input box.
Each level has a maximum possible score assuming perfect matches and a regex with no characters. The score starts at 0, and each correct match adds ten points, each incorrect match subtracts ten points, and each character in the regex subtracts one point.
Each level also has a hard mode, in which both lists are extended with additional items that are automatically generated (thus preventing brute-force solutions) to incorrectly match the player's regex if it is incorrect. Incorrect solutions cause 100 points to be subtracted from the player's score.
Each of the following sections covers one level. A section starts with the explanation(s) given on the level, followed by a brief further explanation that gives the maximum and minimum possible score for the level (both of which assume a regex with zero characters; while a regex can't increase the score past the maximum, it can decrease it past the minimum). After this is a table of the word lists, and then a table of solutions with explanations and scores. The word lists and solutions are hidden in collapsed tables to allow the page to be read without spoiling any levels. Generally, only correct solutions of a minimum length are given, unless a longer or incorrect solution is interesting enough to merit inclusion.
Warmup — Type a regex in the box. You get ten points per match (or lose ten, if you match something you shouldn't); each character costs one point.
As the first level, this one has the simplest solution. Each list contains 21 items, for a maximum possible score of 210 points (and a minimum possible score of -210).
Word lists
afoot
catfoot
dogfoot
fanfoot
foody
foolery
foolish
fooster
footage
foothot
footle
footpad
footway
hotfoot
jawfoot
mafoo
nonfood
padfoot
prefool
sfoot
unfool
Atlas
Aymoro
Iberic
Mahran
Ormazd
Silipan
altared
chandoo
crenel
crooked
fardo
folksy
forest
hebamic
idgah
manlike
marly
palazzi
sixfold
tarrock
unfold
Solutions
This one is very simple; all the words in the left list and none of the words in the right list have the string "foo" in them, so a regex of foo correctly solves this level, giving the player 207 points in both normal and hard mode.
Anchors — You are deducted one point per character you use, and ten if you match something you shouldn't.
This is the first level to require regex features beyond simple text matching, and the first to have different minimal solutions for normal and hard mode. There are 21 items in each list, again giving a maximum score of 210 and a minimum score of -210.
Word lists
Mick
Rick
allocochick
backtrick
bestick
candlestick
counterprick
heartsick
lampwick
lick
lungsick
potstick
quick
rampick
rebrick
relick
seasick
slick
tick
unsick
upstick
Kickapoo
Nickneven
Rickettsiales
billsticker
borickite
chickell
fickleness
finickily
kilbrickenite
lickpenny
mispickel
quickfoot
quickhatch
ricksha
rollicking
slapsticky
snickdrawing
sunstricken
tricklingly
unlicked
unnickeled
Solutions
The common pattern to the left list is ending in the string "ick".
Normal mode
The regex k$ solves this level, getting the player 208 points.
Hard mode
The regex ick$ must be used in hard mode, since words are generated for the right list that end in "ck". This gets the player 206 points.
Ranges — The test vectors were generated by grepping /usr/dict/words. Can you tell?
This level requires more advanced features of regexes to solve. Each list again has 21 items, resulting in maximum and minimum possible scores of 210 and -210, respectively.
Word lists
abac
accede
adead
babe
bead
bebed
bedad
bedded
bedead
bedeaf
caba
caffa
dace
dade
daff
dead
deed
deface
faded
faff
feed
beam
buoy
canjac
chymia
corah
cupula
griece
hafter
idic
lucy
martyr
matron
messrs
mucose
relose
sonly
tegua
threap
towned
widish
yite
Solutions
The words in the left list only use letters from a to f, while the words in the right list all use letters from g onward as well. The regex ^[a-f]*$ or ^[a-f]+$ will solve this level, getting the player 202 points in both normal and hard mode.
In normal mode, the regex[a-f]{4} also works, for 202 points.
Backrefs — This doesn't really work as a tutorial. Not really clear what you're supposed to do here.
This is the first level that requires a solution which isn't a valid regular expression per the computer science definition. Each list has 21 items, for a maximum and minimum possible score of 210 and -210, respectively.
Word lists
allochirally
anticovenanting
barbary
calelectrical
entablement
ethanethiol
froufrou
furfuryl
galagala
heavyheaded
linguatuline
mathematic
monoammonium
perpera
photophonic
purpuraceous
salpingonasal
testes
trisectrix
undergrounder
untaunted
anticker
corundum
crabcatcher
damnably
foxtailed
galvanotactic
gummage
gurniad
hypergoddess
kashga
nonimitative
parsonage
pouchlike
presumptuously
pylar
rachioparalysis
scherzando
swayed
unbridledness
unupbraidingly
wellside
Solutions
The common pattern in the words in the left list is a string of three characters which is repeated twice. The regex (...).*\1 solves this level, for a score of 201 points in both modes.
Beware some reported answers suggesting regexes such as (.{3}).*\1 or (\w\w\w).*\1, these are invariably longer than the above solution (the first regex, for example, only scores 200 points).
Abba — Let's pretend this one is not a rehash of the last one.
This level requires a solution similar to the solution for "Backrefs", true to its description. It's the first level to have different numbers of items in the lists, with 21 items in the left list and 22 items in the right, for a maximum possible score of 210 points, but a minimum possible of -220.
Word lists
acritan
aesthophysiology
amphimictical
baruria
calomorphic
disarmature
effusive
fluted
fusoid
goblinize
nihilistic
noisefully
picrorhiza
postarytenoid
revolutionize
suprasphanoidal
suspenseful
tapachula
transmit
unversatile
vibetoite
abba
anallagmatic
bassarisk
chorioallantois
coccomyces
commotive
engrammatic
glossoscopia
hexacoralla
hippogriffin
inflammableness
otto
overattached
saffarid
sarraceniaceae
scillipicrin
tlapallan
trillion
unclassably
unfitting
unsmelled
warrandice
Solutions
Normal mode
In normal mode, the regex^(?!(.)+\1)|ef solves this level for 196 points. An older solution, ^(?!.*(.)\1)|ef, scored 195 points.
Hard mode
The common pattern here is a palindrome - in this case, two letters in the pattern a, b, b, a. While it's not possible to match palindromes of arbitrary length in most modern flavors of regexes, including ECMA, matching palindromes of known length is possible, though the regex to do so gets longer and more unwieldy with each additional character. It isn't a problem here because there are only four characters to be matched, but this level introduces another twist: words with a palindrome must not be matched. Many players arrive at a regex such as (.)(.)\2\1or(([\w])([\w])\3\2), which correctly includes one list while excluding the other, but it includes and excludes the wrong lists; since regex doesn't provide any explicit means of inverting a match, this is often where players stop on this problem. Match inversion is possible, however, using negative lookahead, and the regex ^(?!.*(.)(.)\2\1) does the trick, netting 193 points.
A man, a plan — You're allowed to cheat a little. Even in hard mode, words will be no longer than 13 characters.
This is similar to the last level, but a somewhat simpler solution can be used. The left list has 19 items, making the maximum possible score 190, while the right list has 21 items, for a minimum possible score of -210.
Word lists
civic
deedeed
degged
hallah
kakkak
kook
level
murdrum
noon
redder
repaper
retter
reviver
rotator
sexes
sooloos
tebbet
tenet
terret
arrogatingly
camshach
cinnabar
defendress
derivedly
gourmet
hamleteer
hydroaviation
lophine
nonalcohol
outslink
pretest
psalterium
psorosperm
scrummage
sporous
springer
sunburn
teleoptile
unstuttering
womanways
Solutions
This level has different optimal solutions for normal and hard mode, though the hard mode solution is strictly an extension of the normal mode solution. All the words to be matched are palindromes.
Normal mode
The regex ^(.)(.).*?\2\1$ is frequently given, but can actually be shortened by noting that, with the exception of "spurious", none of the words in the right list start and end with the same letter. Taking this into account gives^(.)[^p].*\1$, for a score of 177 points.
Hard mode
While matching arbitrary palindromes isn't possible in most flavors of regex, the words to be matched are limited to at most 13 characters long, meaning only six capture groups are necessary; the regex ^(.?)(.?)(.?)(.?)(.?)(.?).?\6\5\4\3\2\1$ does the trick for a score of 150 points.
Prime — The length is not part of the string. I should probably have chosen a different color.
The items to be matched are all sequences of a prime number of "x"s; hard mode only adds a single-x item to the list of items not to be matched. There are 20 items in each list, but each item is worth 15 points instead of the normal ten, for a maximum possible score of 300 and a minimum possible score of -300.
This is the first level to truncate words shown to the player due to length; levels are only truncated in this way when the truncated characters are obvious from the non-truncated portion (as with these items, which are obviously just sequences of "x"s).
Word lists
xx
xxx
xxxxx
xxxxxxx
xxxxxxxxxxx
xxxxxxxxxxxxx
xxxxxxxxxxxxxxxxx
xxxxxxxxxxxxxxxxxxx
xxxxxxxxxxxxxxxxxxxxxxx
xxxxxxxxxxxxxxxxxxxxxxxxxxxxx
xxxxxxxxxxxxxx... [31 chars]
xxxxxxxxxxxxxx... [37 chars]
xxxxxxxxxxxxxx... [41 chars]
xxxxxxxxxxxxxx... [43 chars]
xxxxxxxxxxxxxx... [47 chars]
xxxxxxxxxxxxxx... [53 chars]
xxxxxxxxxxxxxx... [59 chars]
xxxxxxxxxxxxxx... [61 chars]
xxxxxxxxxxxxxx... [67 chars]
xxxxxxxxxxxxxx... [71 chars]
xxxx
xxxxxx
xxxxxxxx
xxxxxxxxx
xxxxxxxxxx
xxxxxxxxxxxx
xxxxxxxxxxxxxx
xxxxxxxxxxxxxxx
xxxxxxxxxxxxxxxx
xxxxxxxxxxxxxxxxxx
xxxxxxxxxxxxxxxxxxxx
xxxxxxxxxxxxxxxxxxxxx
xxxxxxxxxxxxxxxxxxxxxx
xxxxxxxxxxxxxxxxxxxxxxxx
xxxxxxxxxxxxxxxxxxxxxxxxx
xxxxxxxxxxxxxxxxxxxxxxxxxx
xxxxxxxxxxxxxxxxxxxxxxxxxxx
xxxxxxxxxxxxxxxxxxxxxxxxxxxx
xxxxxxxxxxxxxx... [30 chars]
xxxxxxxxxxxxxx... [32 chars]
Solutions
Normal mode
The regex ^(?!(..+)\1+$) correctly matches this level, for a score of 286 points. This regex works by capturing a group of two or more "x"s and repeating the group one or more times, and then inverting the match.
Hard mode
The inclusion of an item which is just a single "x" means the regex used for the normal mode won't work for hard mode; it must instead be extended to ^(?!(..+)\1+$).., which gives a score of 284.
Four — You can get an extra point by ignoring the name of this level.
This level again involves repeating matches, but is much more straightforward than previous levels. Each list has 21 items, for maximum and minimum possible scores of 210 and -210, respectively.
Word lists
Makaraka
Wasagara
degenerescence
desilicification
elevener
hipponosological
homoeomorphy
homologous
ileocolotomy
intervisibility
jararaca
locomotory
micropoikilitic
odontonosology
parhomologous
pogonotomy
promonopolist
protohomo
pseudoprimitivism
tocororo
unintelligibility
Ludgate
Mitsukurinidae
Ternstroemiaceae
arrhythmical
bleater
energetics
inthrow
mecopterous
multum
naphthalene
nullibicity
observancy
overpunishment
overregularly
overwilily
participator
predisable
reyield
rubeola
traitorlike
unregainable
Solutions
The key pattern in this level is a single vowel which is repeated every other letter, four times. This leads immediately to the regex ([aeiou]).\1.\1.\1, but this is not optimal: while vowels are the repeated letter of note, there's no need to make sure the captured character is a vowel. So the previous regex can be shortened to (.).\1.\1.\1, which can then be optimized to (.)(.\1){3}, for 199 points.
In normal mode, the regex(.)...\1.\1 also works, for 199 points.
This level is one that must be solved by ignoring the level's name and the obvious property of the items to be matched in order to get a good score, though hard mode prevents this strategy thanks to its open-ended word generation. Each list has 21 items, for a maximum and minimum possible score of 210 and -210, respectively.
Word lists
access
accloy
adeem
aflow
aglow
beefin
befist
billot
bossy
certy
chintz
chips
chort
cloop
coost
demos
fitty
flory
flossy
ghost
mopsy
analyse
balanism
baronet
biddable
griefless
harebrain
jestword
laicize
marvelry
oriole
pickietar
preferee
primness
pulghere
rebirth
scupper
serigraph
sororize
theowman
unfrayed
wagonman
Solutions
Normal mode
Normal mode can be solved by noting that all the words in the left list are five or six letters long, whereas the only word in the right list which is this short is "oriole"; no word in the left list starts with "o" or has an "e" as the sixth letter, leading to the regexes ^[^o]?.{5}$ and ^.{5}[^e]?$, both for 199 points.
Hard mode
Because extra words are generated rather than being pulled from a pregenerated list, the normal mode regex doesn't work for hard mode. Instead, the regex must check that the letters are actually in alphabetical order, resulting in ^a*b*c*d*e*f*g*h*i*j*k*l*m*n*o*p*q*r*s*t*u*v*w*x*y*z*$, which gives 156 points.
Triples — Multiples of 7 are left as an exercise for the reader.
Each list has 21 items, but each item in this level is worth 30 points, resulting in a maximum possible score of 630 and a minimum possible score of -630.
Word lists
000000000
000000003
000000006
000000009
000000012
000000015
066990060
140091876
173655750
312440187
321769005
368542278
390259104
402223947
443512431
714541758
747289572
819148602
878531775
905586303
953734824
000000005
000000008
000000010
000000011
000000014
018990130
112057285
159747125
176950268
259108903
333162608
388401457
477848777
478621693
531683939
704168662
759282218
769340942
851936815
973816159
979204403
Solutions
Normal mode
Normal mode is solvable without having to make sure the string is actually a multiple of three. The regex00($|3|6|9|12|15)|4.2|.1.+4|.17|55 or 00($|[369]|1[25])|4.2|.1.+4|.17|55or00($|3|6|9|15)|[04].2|.1.+4|.17|55 or 00($|[369]|15)|[04].2|.1.+4|.17|55or[02-5][123][257]|[07][0269]+3?$|55 is sufficient, giving 596 points.
Hard mode
Solving hard mode requires actually using a regex tailored to finding multiples of 3. This was well beyond me, so after googling I came across this site which goes through the steps of building a finite-state machine and reducing the resulting Kleene algebra before substituting character groups to give a final regex. Jumping to the end shows the regex ^([0369]|[258][0369]*[147]|[147]([0369]|[147][0369]*[258])*[258]|[258][0369]*[258]([0369]|[147][0369]*[258])*[258]|[147]([0369]|[147][0369]*[258])*[147][0369]*[147]|[258][0369]*[258]([0369]|[147][0369]*[258])*[147][0369]*[147])*$, but looking down further, in the credits section, provides the much shorter regex ^([0369]|[258][0369]*[147]|([147]|[258][0369]*[258])([0369]|[147][0369]*[258])*([258]|[147][0369]*[147]))*$or^([0369]|[147][0369]*[258]|([258]|[147][0369]*[147])([0369]|[258][0369]*[147])*([147]|[258][0369]*[258]))*$. However, a very carefully hand-optimized regex(?=((.*?[147]){3})*((.*?[147]|){2}))(?=((.*?[258]){3})*((.*?[258]|){2}))^.*$(\3\7|(?!\3|\7)\4\8|(?!\4|\8)) scores one point better, giving 524 points.
Glob — No test case will have more than three asterisks (and max total length of, let's say 100 why not)
Each list has 21 items, though each item is worth 15 points in this level, for a maximum possible score of 320 and a minimum possible score of -320.
Word lists
*err* matches superreform
*falle*ess matches unfallenness
*il*log* matches unphilological
*plen*tud* matches overplenitude
*taiodi* matches pentaiodide
*viceberry matches serviceberry
bowdl* matches bowdlerism
bron*hopleur*sy matches bronchopleurisy
chromatophobia matches chromatophobia
cockneyla* matches cockneyland
colorlessly matches colorlessly
cretefaction matches cretefaction
downrightly matches downrightly
leather* matches leatherbark
mitogenet* matches mitogenetic
palindrom* matches palindromic
parallelepiped matches parallelepiped
primigenial matches primigenial
puppe* matches puppetlike
resurrender matches resurrender
wreathwi* matches wreathwise
*anapaestical* matches anapaestically
*chegonio matches archegoniophore
*dissoluti* matches dissolutional
*domestica matches domesticality
*expedition matches expedition
*hormog matches hormogonium
*stipular* matches infrastipular
*strabis matches strabismal
cathartica matches cathartically
di matches gerundively
hacean matches zoanthacean
headmist matches headmistress
herwi matches trencherwise
iemphraxia matches cardiemphraxia
kmak matches packmaking
mbable* matches unclimbable
nspi*tor matches inspirator
ocumidi matches pseudocumidine
raretinal* matches intraretinal
tte matches whitterick
uefoliate matches quinquefoliate
Solutions
Normal mode
There are at least two solutions which take advantage of particular letter pairs or placements to solve this level with 397 points: ai|c$|^p|[bcnrw][bnopr]andc$|^p|ta|[bcnrw][bnopr].
Hard mode
The regex^(.*)(\*?)(.*)(\*?)(.*)(\*?)(.*) .* \1((?!\2).+|\2)\3((?!\4).+|\4)\5((?!\6).+|\6)\7$ solves this correctly for 336 points.
Balance — This one is also impossible, but there's a finite number of test cases (except in hard mode, where no input is more than seven levels deep)
The left list has 33 items, and the right list has 32, with each item worth 15 points, for a maximum possible score of 495 points and a minimum possible score of -480 points.
Word lists
<<<<<>><<>>><<... [62 chars]
<<<<<>><>><<><... [110 chars]
<<<<<>><>><>><... [102 chars]
<<<<<>><>>>><<... [88 chars]
<<<<<>>><<<>><... [58 chars]
<<<<<>>><<><>>... [152 chars]
<<<<<>>><<>><<... [42 chars]
<<<<<>>><>><<<>>>><<>>
<<<<<>>>><<<<>... [102 chars]
<<<<<>>>><<<><... [30 chars]
<<<<<>>>><><<<... [66 chars]
<<<<<>>>><><<<... [124 chars]
<<<<<>>>><>><<>>
<<<<><<>>><<<>... [34 chars]
<<<<>><<<>>>><... [92 chars]
<<<<>>><<<<>><>><<<>>>>>
<<<<>>><<<><<>>><><<>>>><<>>
<<<<>>><<><<<>... [84 chars]
<<<<>>>><<<><<... [52 chars]
<<<><<<>>>><<<... [50 chars]
<<<><<><>>>>
<<<><>><<<>>>>
<<<>><<<><<>>>... [44 chars]
<<<>><><<<><>>... [48 chars]
<<<>>><<><<<<>>>><<><<<>>>>>
<<><<<<>><>>>>... [60 chars]
<<>>
<<>><<<<<>>>>>... [54 chars]
<<>><<<<>><<<>... [74 chars]
<>
<><>
<
<<<<<<>>><<><>>>>>><<>
<<<<<>>><>>><<<>>>><>>
<<<<<>>>>>>
<<<<>><<<<<><<>><><<<<
<<<>><<<<><><><><
<<<>>>><><<<><>
<<><<<<><<><<>>><<
<<><<<>>>>><<
<<>>><<<>>
<><<<>><<>>><<>
<><<>>><<<><>><<<>>><<>>>><
<><<>>><><<<>
<><>><>>><><<<... [36 chars]
<>><><<<><>
<>>>>>><<<>><<>><><
<>>>>>>><<<
>
><
><<<>><><<<><<
><<<>>>><><<<<><>>><<><><<
><<><<<<><<<<>>>><
><><><<<>>>>>
><><>>><>><>
><><>>>><>>>>>>><>>><>>
><>><<<<<>>
><>><><><<>><<>>><<
><>>><>>>>><<><<<><>><>><<<
>><<<><<<<<<><>><<
>><>>><<<><>>><><<>><<><><<
>>>><>><>>>><>>><>><><
>>>>><<<>>>
Solutions
Normal mode
The regex.{37}|^(<(..(?!<.>$))*>)*$ solves this level, giving a score of 454 points.
Hard mode
Hard mode can be solved with the regex ^(<(<(<(<(<(<(<>)*>)*>)*>)*>)*>)*>)*$, earning 443 points.
Interestingly, the left list has only a single item, while the right list has 20 items, and each item is worth 270 points, for maximum and minimum possible scores of 270 and -5400 points, respectively.
This level is much the same as "Powers", except the number of "x"s to be matched is now a power of 3 instead of 2. Unusually, the left list has more items than the right: the left list has 11 items, for a maximum possible score of 110, while the right list has only 10 items, for a minimum possible score of -100.
Interestingly, in normal mode, there is a nonperfect solution that scores higher than the best perfect solution.
Word lists
x
xxx
xxxxxxxxx
xxxxxxxxxxxxxxxxxxxxxxxxxxx
xxxxxxxxxxxxxx... [81 chars]
xxxxxxxxxxxxxx... [243 chars]
xxxxxxxxxxxxxx... [729 chars]
xxxxxxxxxxxxxx... [2187 chars]
xxxxxxxxxxxxxx... [6561 chars]
xxxxxxxxxxxxxx... [19683 chars]
xxxxxxxxxxxxxx... [59049 chars]
[empty string]
xx
xxxxxxx
xxxxxxxx
xxxxxxxxxxxxxxxxxxxxxxxxxx
xxxxxxxxxxxxxx... [163 chars]
xxxxxxxxxxxxxx... [1337 chars]
xxxxxxxxxxxxxx... [2601 chars]
xxxxxxxxxxxxxx... [6557 chars]
xxxxxxxxxxxxxx... [16384 chars]
Solutions
Normal mode
The regex^(?!((...)+.|..|)\1*$) solves this level perfectly, with a final score of 88 points. The nonperfect regex^(.|(...)+)$ also scores 88 points, correctly matching the left list, and not the right list except for a single item (the one with 2601 "X"s).
Hard mode
Because of the automatic 100-point deduction for incorrect solutions, the normal mode trick doesn't work in hard mode, and instead theregex^(?!((...)+..?|..|)\1*$) is required to solve this level, giving 86 points.