Performance Solving "WonderWord" Puzzle

KP Lee

3.67/5 (4 votes)

Mar 6, 2015

CPOL

15 min read

31384

153

Cross-matching an array of characters to a set of words using NlogN logic

Download source - 318.5 KB

Introduction

Design Efficiency involved in a puzzle called "WonderWord" and comparison with "similar" puzzle solving technique.

Background

I am interested in efficient processing techniques.

One very common routine written about is the bubble sort. This is extremely simple to write, but is an N^2 process. The quicksort process is an NlogN process and is much more complex to read and understand, but has highly superior performance to N^2. With N^2, you quadruple the processing time when you double the size.

While working on the logic for solving this efficiently, I read http://www.codeproject.com/Articles/861524/Solving-Word-Search-Puzzles-in-Linear-Time. I was not impressed with the above article's “efficiency”. The author of the article had logic that started with a fair but flawed premise in my opinion and didn't efficiently implement it to boot. (I DON'T see Linear Time, I see squared time.) You add the characters' values together to get a keyword value. The sum of APPLE, AELPP, AOPME, and ZUAAA character values are exactly equal, so if you get a sum match your probability is higher but you still have to verify the individual characters match in the same relative position to the others. It's a bad assumption, but I admit the probability of a match is higher when they do match. The total sum was recalculated for each position and direction, instead of subtracting one letter from one direction and adding one letter on the other end. You have to go through this for every word. As you increase the size of the array, the number of words to search would have to go up as well. I didn't go through the whole logic but it looked to me that it went through the array in 4 directions for one of the keywords and then did it again for the next word. This “Linear Time” process looks exactly like an N^2 process to me.

The article did point out a more efficient way I could look at the keywords than I had thought of. (I started looking in 8 directions when you could look in 4 directions if you added a reversed word look-up.)

If it doesn't find a Wonderword, hopefully, it's because you messed up the setup, not the author's puzzle, nor my code's mess-up. (I had plenty of those at the start of this design, haven't found any new ones to fix for a while now. Each time I add a new puzzle, seems 10-90 will be setup perfectly immediately.)

The size of the array being searched is one part of the problem and I don't see how to solve that. All the keywords also need to be checked at every starting location and half the eight directions a word can be defined. and I think what I came up with is a pretty efficient way to do so for this problem. I am at a loss on how (or if) it could be better tweaked.

I took on a puzzle called Wonderword published in a local newspaper by author David Ouellet. (I since found out there is a website: http://www.wonderword.com/ that is authored by him.) This is basically a 15X15 array of characters, followed by a list of keywords that you need to find in the array. The keywords are entered horizontally, vertically, or diagonally in forwards or backwards order. The puzzle wants you to circle the keywords' letters and the remaining unmarked letters spells out the “Wonderword”. (Can be a phrase instead of a word.) I put several of the published puzzles in the code to get a "N" sized processing comparison.

While working on the logic for solving this efficiently, I read http://www.codeproject.com/Articles/861524/Solving-Word-Search-Puzzles-in-Linear-Time. I was not impressed with the above article's “efficiency”. The author of the article had logic that started with a fair but flawed premise in my opinion and didn't efficiently implement it to boot. You add the characters' values together to get a keyword value. The sum of APPLE, AELPP, AOPME, and ZUAAA character values are exactly equal along with probably thousands of other combinations that add to 370, so if you get a sum match your probability is higher but you still have to verify the individual characters match in the same relative position to the others. It's a bad assumption, but I admit the probability of a match is higher when they do match. The total sum was recalculated for each position and direction, instead of subtracting one letter from one direction and adding one letter on the other end. As far as I can tell, you have to go through this for every word. (I am not a C++ guru) As you increase the size of the array, the number of words to search would have to go up as well. I didn't go through the whole logic but it looked to me that it went through the array in 4 directions for one of the keywords and then did it again for the next word. This “Linear Time” process looks exactly like an N^2 process to me. (It's linear as long as you don't change the array size or the number of words.) N is linear, NLogN approaches linear. NLogN, going from 2 to 4 should take 4 times as long, going from 1 to 2 million should take 2.1 times as long. Going from 1 to 2 thousand should take 2.2 times as long.

It seems I need to explain a little more about what is going on. Here is small portion of the output generated for a 15X330 array puzzle. (22 15X15 puzzles concatinated together):

Loop counts level 0: 4950, 1: 19800, 2: 59580 Array size: 4950, Search words: 855 words found: 1155 Characters in words: 5339 Search Keys: 2472

15*330 == 4950 == the number of characters in the search array .(15 is the width of the string, 330 is the number of strings in the array.) 19800/4 == 4950 Level 0 should and does exactly match the Array size in characters, Level 1 is 4 times the number of first character matches. This is such a big puzzle, all 26 characters are defined as a 1 letter key, so it will always go to the second character. It is 4 times, because it counts the 4 direction loop, whether a search in that direction is made or not. (Code verifies the shortest word found would fit in that direction.) Level 2 is a count of all successful and failed searches.

1155 is higher than 855 because several puzzles used the same keywords. With 22 concatinated puzzles, that will happen. I did happen to copy the same 15X15 puzzle twice so every word in that puzzle was found at least twice.

The Search Key dictionary will contain keys of the search words in forwards and backwards orders. Generally the keys will be double the keys as characters in the word. IE "Key" would have K, KE, KEY, Y,YE,YEK as its 6 keys. The list if it doesn't exist it is created and mostly the key's keyword object is added (Yesterday would be defined once for Y)

So KeyVals would maybe have 10,678 keys. It again would go through the 855 keywords and copy the same set of keys until only 1 key is listed in both directions. (Including the first 1 item key found.) It's kind of interesting that the total keys has almost always come out close to 3 times the number of search words.

Here is a short example (Note that this time I listed the keys, which the code provides at this point):

Loop counts level 0: 225, 1: 848, 2: 1750 Array size: 225, Search words: 44 words found: 44 Characters in words: 268 Search Keys: 128
A, AD, AI, B, BE, BL, BLA, BLAN, BLAS, BLO, BR, C, CH, CL, CO, COF, COL, D, E, EE, EM, ET, EX, EZ, EZE, EZO, F, FE, FEB, FEE, FI, FL, FR, FRI, FRO, FROS, FROZ, G, GN, GR, GU, H, HA, HE, HEA, HEAT, HEAV, HO, HS, HT, I, IC, IN, J, K, L, LA, LE, LEE, LEP, LL, LLA, LLE, LLI, M, ME, MI, MR, N, NI, NO, NOS, NOV, O, R, RA, RE, REB, RED, RET, RO, S, SA, SD, SE, SEA, SEI, SER, SET, SH, SL, SLE, SLU, SN, SNE, SNO, ST, STA, STS, T, TE, TEA, TEE, TEK, TS, TSA, TSO, TSU, W, WA, WI, WO, WOL, WOLB, WOLE, Y, YD, YE, YR, YRA, YRAU, YRAUN, YRAUR, YRI, YRO, YV, YZ, Z

Letters not defined:PQUVX.

Note that 214 characters that reached level 1 so 11 were eliminated with the first character being one of the 5 characters not defined. Also, all the above information is provided to see what the code is doing. Actually a fair percentage of the code is used to get performance information on the technique instead of solving the problem.

Note that if the first character looked at is B, if the second character isn't E, L, or R that direction is toast. If you do reach BR, you never look at the dictionary again. With YRAU, you still have one more dictionary search to make, but with a 4 letter match what is the probability it WON'T be N or R or won't spell out the full search word?

Also in this puzzle FROZ and EZO uniquely identify Froze in this puzzle, so you would have to search 3 or 4 keyed words to find the unique 5 letter search word. Only D, J, K, O, or Z would uniquely identify the search word, otherwise it would use the dictionary lookup more than once. Once it is not using the dictionary, it is an N process. Overall, it remains an NlogN process.

Both loading the dictionary and searching the array are NlogN processes, but for the dictionary, N and logN is the number of characters in the search words times 2 (1 to load the dictionary, and 2 to reduce the size of the dictionary loading a temp dictionary. Or you could look at it as loading a temp dictionary and replacing it with the reduced size dictionary) With the array, N is the number of characters in the array times 4, times the number of letters in the word and logN is the dictionary lookup

Loop counts level 0: 4950, 1: 19800, 2: 59580 Array size: 4950, Search words: 855 words found: 1155 Characters in words: 5339 Search Keys: 2472 The cost of creating the original keys would be about 10,669*14. To reduce the keys listed would take about 2,472*(14+12) operations. (log base 2 of 10,578 is 14 and of 2,472 is 12) to check the keys would take about 4950*12*4*2=475,200 the size of the array, the logN reading, 4 directions and average of 2 reads. (Another way to look at it: 59580*12=714,960, the reported lookup loops which is high because it includes loops not using the dictionary lookup times the LogN dictionary lookup cost.) To find the words in C++ version giving it the best possible check 855*9*324*4*12= 119,672,640 operations. (About 250 times slower than the C# version, in theory, no way I can test it.) That's the number words, subtracting 6 from both dimensions of the array, four directions and the position and addition calculations for average of 6 characters.)

The size of the array being searched is one part of the problem and that can't disappear. All the keywords also need to be checked at basically every starting location and half the eight directions a word can be defined. I think what I came up with is a pretty efficient way to do so for this problem. I am at a loss on how (or if) it could be better tweaked. Because I put the search words in a string key and a keyword list dictionary, to look each string (leading to a keyword) up is a LogN process. (To look up one key in a million keys takes 20 loops), to define the keys in a Dictionary is NlogN, to match the word against every position is NlogN too.

Here is an example of the output I use to show where the keywords are defined:

I wrote a WonderWord routine, I can't do that without proving it can find the Wonder Word in a published puzzle:

Letter and location: C x=1 y=5 R x=3 y=7 I x=10 y=9 S x=6 y=12 P x=13 y=12
Wonder word is: CRISP

Wow, found eight words in the first line (None found in another):
MAEBIYRAUNAJELF Meteorologists x=0, xd=1, yd=1 Mittens x=0, xd=0, yd=1 Advisory x=1, xd=1, yd=1 Breeze x=3, xd=1, yd=1 Indoor x=4, xd=1, yd=1 January x=11, xd=-1, yd=0 Airy x=7, xd=1, yd=1 Flurries x=14, xd=0, yd=1
IEDERNVBILOTLEL Below x=7, xd=1, yd=1 Lazy x=9, xd=1, yd=1
TBTVFEDAERAEEEU Blanket x=1, xd=1, yd=1
TSLEIFEOELYZWFR Feel x=13, xd=0, yd=-1 Yell x=10, xd=1, yd=-1 February x=13, xd=0, yd=1 Windy x=12, xd=-1, yd=1
ETOAOSOZOHOIYER Heavy x=9, xd=-1, yd=-1 Zero x=7, xd=1, yd=-1
NCSRNROCERNWTBI Coffee x=7, xd=-1, yd=-1 Sleet x=2, xd=0, yd=1
SGLAFKORFDMAFRE Frost x=4, xd=-1, yd=-1 Shivering x=0, xd=0, yd=1 Gust x=1, xd=0, yd=1 Froze x=8, xd=1, yd=-1 Fires x=12, xd=0, yd=1 Frigid x=12, xd=-1, yd=1
HUERLHELYRERIUS Rain x=11, xd=0, yd=1 Snowfall x=14, xd=0, yd=1
ISEECBETOHIARAN Blast x=5, xd=-1, yd=-1 Heater x=9, xd=1, yd=-1
VTTTMPRTTGIIERO Teas x=8, xd=1, yd=1
EAOEUESMIEINSYW
RHVADERDSEASONF Hats x=1, xd=0, yd=-1 HotChocolate x=1, xd=1, yd=-1 Season x=8, xd=1, yd=0
IORLCASTKIDSTPA Kids x=8, xd=1, yd=0
NGOIWOLBXHSULSL November x=0, xd=1, yd=-1 Graupel x=1, xd=1, yd=-1 IceStorm x=3, xd=1, yd=-1 Warmth x=4, xd=1, yd=-1 Blow x=7, xd=-1, yd=0 Slush x=13, xd=-1, yd=0
GCLIMATESECHILL Extreme x=9, xd=-1, yd=-1 Colder x=1, xd=1, yd=-1 Climates x=1, xd=1, yd=0 Chill x=10, xd=1, yd=0

y isn't provided because the word is printed on the y line. x is the index where the first character can be found, with the directions, 1 is right(xd) or down(yd), -1 is left or up, and 0 is vertical or horizontal. Note I solve in 4 directions, but list in 8 always starting with the first character. Also the author lists HotChocolate as Hot Chocolate, you need to remove special characters like blank and dash when loading the words into code.

Using the Code

This is a set of test console code I used to see if performance matched expectations. Mainly no, but sometimes extremely close matches. I found that DateTime had major variances. I dislike using Stopwatch because I have never found it to be accurate. Looping 4 billion times should take about 4 seconds to run if it runs perfectly on my machine, DateTime says about 4.6 seconds, Stopwatch was just over 3 seconds, a manual stopwatch was about 5 seconds (expectedly slow reaction time with no advance warning when printing a line is going to happen). Including it seemed to calm or reduce the variances DateTime was giving me. It looks like JIT sticks its nose into performance because the first call seems to be consistently longer processing than subsequent tests. Maybe something else is going on?

By the way, I put in the Linear programming article's 30X30 puzzle in this code too. The puzzle is in array2 and the search words are in words2. The article's WonderWord is of course nonsense but the method of blanking out the characters used by the other article isn't helpful when all but 5 of 225 characters would be blank.

The following is a map of the code's main parts

Type	Name	Usage
`struct`	`Keyword`	Tool to handle search words passed to WonderWord
`int`	`Keyword.Length`	Total length of the string defined here
`string`	`Keyword.OrigWord`	Case sensitive version of word passed in “words” array
`string`	`Keyword.ReverseWord`	Upper case version of `OrigWord` in reverse order. “The” would be “EHT”
`string`	`Keyword.Upperword`	Upper case version of `OrigWord`
`Function`	`Keyword.KeyVal`	pass `int` for the length, returns substring of `Upperword`
`Function`	`Keyword.RevKeyVal`	pass `int` for the length, returns substring of `ReverseWord`
`Function`	`AppendWords`	Tool to increase the number(size) of puzzles and keywords to solve
`Function`	`Main`	Console app to define and run various hard-coded puzzles
`Function`	`WonderWord`	Name of the main routine designed to solve puzzle
`string[]`	`array`	`WonderWord` field to search for words – order vital, fixed width
`string[]`	`words`	`WonderWord` field of words to search for – order immaterial, variable width
`Dictionary<string, bool>`	`UniqueKeywrd`	Uniquely identifies keywords passed to `WonderWord` in reverse/forward order
`Dictionary<string, List<Keyword>>`	`KeyVals`	Initially stores all unique words in forward and backward order from first to last character. Basically twice the number of characters in the word are keys.
`Dictionary<string, List<Keyword>>`	`TkeyVals`	Temp object, copies keys in `KeyVals` to the first one with one item in list
`Dictionary<string, List<int>>`	`KeyLoks`	`Keyword.Upperword` is the key, multiples of four for list. Location and direction
`Various`	“`Undefined`”	Lot of working objects, `StringBuilder`, `int`, `string`, You'll need to just figure it out. Lots of comments in the code.

Points of Interest

I am not that conversant in C++, if my criticisms of the other article were inaccurate would you mind explaining why?

A dictionary loads in NlogN and reads in logN. I discounted the loading and setting up of the dictionary and just figured that finding the keyword key was the logN portion being repeatedly run in the array.

FYI, I thought I'd fully tested it and found a bug today. I used the arrayblk/wordblk to load today's puzzle and for the first time screwed up and left empty (blank) words. When loading the words, verify the length is > 0 before adding the keyword.

History

March 5^th, 2015
- Published
March 6^th, 2015
- Revised new edition because tip didn't have source code
- Reloaded zip file whose link was broken and still contains the source code
- Asked for editor help on publishing the source code since I supplied the source code in the first go-round.
March 6^th, 2015
- Described bug in supplied code because I supplied empty words to search for the first time today
March 11
- Added more detail of how the code works and statistics for both processes