Click here to Skip to main content
15,881,852 members
Articles / Programming Languages / C#
Tip/Trick

Performance Solving "WonderWord" Puzzle

Rate me:
Please Sign up or sign in to vote.
3.67/5 (4 votes)
12 Mar 2015CPOL15 min read 29.9K   132   5   15
Cross-matching an array of characters to a set of words using NlogN logic

Introduction

Design Efficiency involved in a puzzle called "WonderWord" and comparison with "similar" puzzle solving technique.

Background

I am interested in efficient processing techniques.

One very common routine written about is the bubble sort. This is extremely simple to write, but is an N^2 process. The quicksort process is an NlogN process and is much more complex to read and understand, but has highly superior performance to N^2. With N^2, you quadruple the processing time when you double the size.

While working on the logic for solving this efficiently, I read http://www.codeproject.com/Articles/861524/Solving-Word-Search-Puzzles-in-Linear-Time. I was not impressed with the above article's “efficiency”. The author of the article had logic that started with a fair but flawed premise in my opinion and didn't efficiently implement it to boot. (I DON'T see Linear Time, I see squared time.) You add the characters' values together to get a keyword value. The sum of APPLE, AELPP, AOPME, and ZUAAA character values are exactly equal, so if you get a sum match your probability is higher but you still have to verify the individual characters match in the same relative position to the others. It's a bad assumption, but I admit the probability of a match is higher when they do match. The total sum was recalculated for each position and direction, instead of subtracting one letter from one direction and adding one letter on the other end. You have to go through this for every word. As you increase the size of the array, the number of words to search would have to go up as well. I didn't go through the whole logic but it looked to me that it went through the array in 4 directions for one of the keywords and then did it again for the next word. This “Linear Time” process looks exactly like an N^2 process to me.

The article did point out a more efficient way I could look at the keywords than I had thought of. (I started looking in 8 directions when you could look in 4 directions if you added a reversed word look-up.)

If it doesn't find a Wonderword, hopefully, it's because you messed up the setup, not the author's puzzle, nor my code's mess-up. (I had plenty of those at the start of this design, haven't found any new ones to fix for a while now. Each time I add a new puzzle, seems 10-90 will be setup perfectly immediately.)

The size of the array being searched is one part of the problem and I don't see how to solve that. All the keywords also need to be checked at every starting location and half the eight directions a word can be defined. and I think what I came up with is a pretty efficient way to do so for this problem. I am at a loss on how (or if) it could be better tweaked.

I took on a puzzle called Wonderword published in a local newspaper by author David Ouellet. (I since found out there is a website: http://www.wonderword.com/ that is authored by him.) This is basically a 15X15 array of characters, followed by a list of keywords that you need to find in the array. The keywords are entered horizontally, vertically, or diagonally in forwards or backwards order. The puzzle wants you to circle the keywords' letters and the remaining unmarked letters spells out the “Wonderword”. (Can be a phrase instead of a word.) I put several of the published puzzles in the code to get a "N" sized processing comparison.

While working on the logic for solving this efficiently, I read http://www.codeproject.com/Articles/861524/Solving-Word-Search-Puzzles-in-Linear-Time. I was not impressed with the above article's “efficiency”. The author of the article had logic that started with a fair but flawed premise in my opinion and didn't efficiently implement it to boot. You add the characters' values together to get a keyword value. The sum of APPLE, AELPP, AOPME, and ZUAAA character values are exactly equal along with probably thousands of other combinations that add to 370, so if you get a sum match your probability is higher but you still have to verify the individual characters match in the same relative position to the others. It's a bad assumption, but I admit the probability of a match is higher when they do match. The total sum was recalculated for each position and direction, instead of subtracting one letter from one direction and adding one letter on the other end. As far as I can tell, you have to go through this for every word. (I am not a C++ guru) As you increase the size of the array, the number of words to search would have to go up as well. I didn't go through the whole logic but it looked to me that it went through the array in 4 directions for one of the keywords and then did it again for the next word. This “Linear Time” process looks exactly like an N^2 process to me. (It's linear as long as you don't change the array size or the number of words.) N is linear, NLogN approaches linear. NLogN, going from 2 to 4 should take 4 times as long, going from 1 to 2 million should take 2.1 times as long. Going from 1 to 2 thousand should take 2.2 times as long.

The article did point out a more efficient way I could look at the keywords than I had thought of. (I started looking in 8 directions when you could look in 4 directions if you added a reversed word look-up.)

It seems I need to explain a little more about what is going on. Here is small portion of the output generated for a 15X330 array puzzle. (22 15X15 puzzles concatinated together):

Loop counts level 0: 4950, 1: 19800, 2: 59580 Array size: 4950, Search words: 855 words found: 1155 Characters in words: 5339 Search Keys: 2472

15*330 == 4950 == the number of characters in the search array .(15 is the width of the string, 330 is the number of strings in the array.) 19800/4 == 4950 Level 0 should and does exactly match the Array size in characters, Level 1 is 4 times the number of first character matches. This is such a big puzzle, all 26 characters are defined as a 1 letter key, so it will always go to the second character. It is 4 times, because it counts the 4 direction loop, whether a search in that direction is made or not. (Code verifies the shortest word found would fit in that direction.) Level 2 is a count of all successful and failed searches.

1155 is higher than 855 because several puzzles used the same keywords. With 22 concatinated puzzles, that will happen. I did happen to copy the same 15X15 puzzle twice so every word in that puzzle was found at least twice.

The Search Key dictionary will contain keys of the search words in forwards and backwards orders. Generally the keys will be double the keys as characters in the word. IE "Key" would have K, KE, KEY, Y,YE,YEK as its 6 keys. The list if it doesn't exist it is created and mostly the key's keyword object is added (Yesterday would be defined once for Y)

So KeyVals would maybe have 10,678 keys. It again would go through the 855 keywords and copy the same set of keys until only 1 key is listed in both directions. (Including the first 1 item key found.) It's kind of interesting that the total keys has almost always come out close to 3 times the number of search words.

Here is a short example (Note that this time I listed the keys, which the code provides at this point):

Loop counts level 0: 225, 1: 848, 2: 1750 Array size: 225, Search words: 44 words found: 44 Characters in words: 268 Search Keys: 128
A, AD, AI, B, BE, BL, BLA, BLAN, BLAS, BLO, BR, C, CH, CL, CO, COF, COL, D, E, EE, EM, ET, EX, EZ, EZE, EZO, F, FE, FEB, FEE, FI, FL, FR, FRI, FRO, FROS, FROZ, G, GN, GR, GU, H, HA, HE, HEA, HEAT, HEAV, HO, HS, HT, I, IC, IN, J, K, L, LA, LE, LEE, LEP, LL, LLA, LLE, LLI, M, ME, MI, MR, N, NI, NO, NOS, NOV, O, R, RA, RE, REB, RED, RET, RO, S, SA, SD, SE, SEA, SEI, SER, SET, SH, SL, SLE, SLU, SN, SNE, SNO, ST, STA, STS, T, TE, TEA, TEE, TEK, TS, TSA, TSO, TSU, W, WA, WI, WO, WOL, WOLB, WOLE, Y, YD, YE, YR, YRA, YRAU, YRAUN, YRAUR, YRI, YRO, YV, YZ, Z

Letters not defined:PQUVX.

Note that 214 characters that reached level 1 so 11 were eliminated with the first character being one of the 5 characters not defined. Also, all the above information is provided to see what the code is doing. Actually a fair percentage of the code is used to get performance information on the technique instead of solving the problem.

Note that if the first character looked at is B, if the second character isn't E, L, or R that direction is toast. If you do reach BR, you never look at the dictionary again. With YRAU, you still have one more dictionary search to make, but with a 4 letter match what is the probability it WON'T be N or R or won't spell out the full search word?

Also in this puzzle FROZ and EZO uniquely identify Froze in this puzzle, so you would have to search 3 or 4 keyed words to find the unique 5 letter search word. Only D, J, K, O, or Z would uniquely identify the search word, otherwise it would use the dictionary lookup more than once. Once it is not using the dictionary, it is an N process. Overall, it remains an NlogN process.

Both loading the dictionary and searching the array are NlogN processes, but for the dictionary, N and logN is the number of characters in the search words times 2 (1 to load the dictionary, and 2 to reduce the size of the dictionary loading a temp dictionary. Or you could look at it as loading a temp dictionary and replacing it with the reduced size dictionary) With the array, N is the number of characters in the array times 4, times the number of letters in the word and logN is the dictionary lookup

Loop counts level 0: 4950, 1: 19800, 2: 59580 Array size: 4950, Search words: 855 words found: 1155 Characters in words: 5339 Search Keys: 2472 The cost of creating the original keys would be about 10,669*14. To reduce the keys listed would take about 2,472*(14+12) operations. (log base 2 of 10,578 is 14 and of 2,472 is 12) to check the keys would take about 4950*12*4*2=475,200 the size of the array, the logN reading, 4 directions and average of 2 reads. (Another way to look at it: 59580*12=714,960, the reported lookup loops which is high because it includes loops not using the dictionary lookup times the LogN dictionary lookup cost.)  To find the words in C++ version giving it the best possible check 855*9*324*4*12= 119,672,640 operations. (About 250 times slower than the C# version, in theory, no way I can test it.) That's the number words, subtracting 6 from both dimensions of the array, four directions and the position and addition calculations for average of 6 characters.)

The size of the array being searched is one part of the problem and that can't disappear. All the keywords also need to be checked at basically every starting location and half the eight directions a word can be defined. I think what I came up with is a pretty efficient way to do so for this problem. I am at a loss on how (or if) it could be better tweaked. Because I put the search words in a string key and a keyword list dictionary, to look each string (leading to a keyword) up is a LogN process. (To look up one key in a million keys takes 20 loops), to define the keys in a Dictionary is NlogN, to match the word against every position is NlogN too.

Here is an example of the output I use to show where the keywords are defined:

I wrote a WonderWord routine, I can't do that without proving it can find the Wonder Word in a published puzzle:

Letter and location: C x=1 y=5 R x=3 y=7 I x=10 y=9 S x=6 y=12 P x=13 y=12
Wonder word is: CRISP

Wow, found eight words in the first line (None found in another):
MAEBIYRAUNAJELF Meteorologists x=0, xd=1, yd=1 Mittens x=0, xd=0, yd=1 Advisory x=1, xd=1, yd=1 Breeze x=3, xd=1, yd=1 Indoor x=4, xd=1, yd=1 January x=11, xd=-1, yd=0 Airy x=7, xd=1, yd=1 Flurries x=14, xd=0, yd=1
IEDERNVBILOTLEL Below x=7, xd=1, yd=1 Lazy x=9, xd=1, yd=1
TBTVFEDAERAEEEU Blanket x=1, xd=1, yd=1
TSLEIFEOELYZWFR Feel x=13, xd=0, yd=-1 Yell x=10, xd=1, yd=-1 February x=13, xd=0, yd=1 Windy x=12, xd=-1, yd=1
ETOAOSOZOHOIYER Heavy x=9, xd=-1, yd=-1 Zero x=7, xd=1, yd=-1
NCSRNROCERNWTBI Coffee x=7, xd=-1, yd=-1 Sleet x=2, xd=0, yd=1
SGLAFKORFDMAFRE Frost x=4, xd=-1, yd=-1 Shivering x=0, xd=0, yd=1 Gust x=1, xd=0, yd=1 Froze x=8, xd=1, yd=-1 Fires x=12, xd=0, yd=1 Frigid x=12, xd=-1, yd=1
HUERLHELYRERIUS Rain x=11, xd=0, yd=1 Snowfall x=14, xd=0, yd=1
ISEECBETOHIARAN Blast x=5, xd=-1, yd=-1 Heater x=9, xd=1, yd=-1
VTTTMPRTTGIIERO Teas x=8, xd=1, yd=1
EAOEUESMIEINSYW
RHVADERDSEASONF Hats x=1, xd=0, yd=-1 HotChocolate x=1, xd=1, yd=-1 Season x=8, xd=1, yd=0
IORLCASTKIDSTPA Kids x=8, xd=1, yd=0
NGOIWOLBXHSULSL November x=0, xd=1, yd=-1 Graupel x=1, xd=1, yd=-1 IceStorm x=3, xd=1, yd=-1 Warmth x=4, xd=1, yd=-1 Blow x=7, xd=-1, yd=0 Slush x=13, xd=-1, yd=0
GCLIMATESECHILL Extreme x=9, xd=-1, yd=-1 Colder x=1, xd=1, yd=-1 Climates x=1, xd=1, yd=0 Chill x=10, xd=1, yd=0

y isn't provided because the word is printed on the y line. x is the index where the first character can be found, with the directions, 1 is right(xd) or down(yd), -1 is left or up, and 0 is vertical or horizontal. Note I solve in 4 directions, but list in 8 always starting with the first character. Also the author lists HotChocolate as Hot Chocolate, you need to remove special characters like blank and dash when loading the words into code.

Using the Code

This is a set of test console code I used to see if performance matched expectations. Mainly no, but sometimes extremely close matches. I found that DateTime had major variances. I dislike using Stopwatch because I have never found it to be accurate. Looping 4 billion times should take about 4 seconds to run if it runs perfectly on my machine, DateTime says about 4.6 seconds, Stopwatch was just over 3 seconds, a manual stopwatch was about 5 seconds (expectedly slow reaction time with no advance warning when printing a line is going to happen). Including it seemed to calm or reduce the variances DateTime was giving me. It looks like JIT sticks its nose into performance because the first call seems to be consistently longer processing than subsequent tests. Maybe something else is going on?

By the way, I put in the Linear programming article's 30X30 puzzle in this code too. The puzzle is in array2 and the search words are in words2. The article's WonderWord is of course nonsense but the method of blanking out the characters used by the other article isn't helpful when all but 5 of 225 characters would be blank.

The following is a map of the code's main parts

Type Name Usage
struct Keyword Tool to handle search words passed to WonderWord
int Keyword.Length Total length of the string defined here
string Keyword.OrigWord Case sensitive version of word passed in “words” array
string Keyword.ReverseWord Upper case version of OrigWord in reverse order. “The” would be “EHT”
string Keyword.Upperword Upper case version of OrigWord
Function Keyword.KeyVal pass int for the length, returns substring of Upperword
Function Keyword.RevKeyVal pass int for the length, returns substring of ReverseWord
Function AppendWords Tool to increase the number(size) of puzzles and keywords to solve
Function Main Console app to define and run various hard-coded puzzles
Function WonderWord Name of the main routine designed to solve puzzle
string[] array WonderWord field to search for words – order vital, fixed width
string[] words WonderWord field of words to search for – order immaterial, variable width
Dictionary<string, bool> UniqueKeywrd Uniquely identifies keywords passed to WonderWord in reverse/forward order
Dictionary<string, List<Keyword>> KeyVals Initially stores all unique words in forward and backward order from first to last character.  Basically twice the number of characters in the word are keys.
Dictionary<string, List<Keyword>> TkeyVals Temp object, copies keys in KeyVals to the first one with one item in list
Dictionary<string, List<int>> KeyLoks Keyword.Upperword is the key, multiples of four for list. Location and direction
Various Undefined Lot of working objects, StringBuilder, int, string, You'll need to just figure it out. Lots of comments in the code.

Points of Interest

I am not that conversant in C++, if my criticisms of the other article were inaccurate would you mind explaining why?

A dictionary loads in NlogN and reads in logN. I discounted the loading and setting up of the dictionary and just figured that finding the keyword key was the logN portion being repeatedly run in the array.

FYI, I thought I'd fully tested it and found a bug today. I used the arrayblk/wordblk to load today's puzzle and for the first time screwed up and left empty (blank) words. When loading the words, verify the length is > 0 before adding the keyword.

History

  • March 5th, 2015
    • Published
  • March 6th, 2015
    • Revised new edition because tip didn't have source code
    • Reloaded zip file whose link was broken and still contains the source code
    • Asked for editor help on publishing the source code since I supplied the source code in the first go-round.
  • March 6th, 2015
    • Described bug in supplied code because I supplied empty words to search for the first time today 
  • March 11
    • Added more detail of how the code works and statistics for both processes

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


Written By
Software Developer Comcast Cable
United States United States
This member has not yet provided a Biography. Assume it's interesting and varied, and probably something to do with programming.

Comments and Discussions

 
GeneralMy vote of 3 Pin
VadimAlk8-Mar-15 14:56
VadimAlk8-Mar-15 14:56 
GeneralRe: My vote of 3 Pin
KP Lee9-Mar-15 16:53
KP Lee9-Mar-15 16:53 
Thanks for your comment.

You nailed me on "they are wrong", but I didn't intend "they", just "he", I also gave credit to him when he gave an idea that helped me improve my design. Nor did I intend to sound like the one voice in the wilderness shouting out the truth in tongues so you wouldn't understand me.

Figuring out what to include in the article, vs letting you read the code is tough too.

The point of the code was to write a puzzle solving routine and run tests against the code, both to verify the code works as intended and to see how it performs, if it performs as expected, and how fast it does it.

The puzzle supplies an array of characters in which you need to find the supplied words. Sorry, thought that the names were obvious. The combination of those names identifies the parts of the puzzle, and the numbers identifies the version. When I started, I hadn't designed a naming convention so 1 was as good as any. Later I thought about dating the puzzles so 304 means David supplied that puzzle on March 4th. I don't intend to carry on copying puzzles for years.

This was not a well-planned out code design, where who's affected and how were taken into account, nor was a team of people working on the naming convention used. The lifetime of the code wasn't taken into consideration either. I admit it is a sloppy, very poor design for long-term support by a variety of members in the team.

As far as algorithm. I did describe the dictionary's algorithm as NlogN for loading and logN for retrieving data. In this code there is a cost in loading the dictionary but as a "gut" feel the definition of the dictionary's cost is insignificant compared to retrieving it in the array multiple times.

The "real" cost is going through the array, searching

I hope I did get across that in general all the versions from front to back and back to front are stored, so in general the keys will be twice the number of characters supplied in the search words and then is cut down to just the "unique" definitions needed.

Sorry, I never mentioned in the zip file, the "txt" files are generally results of running the code and most are "older" ,so the data shouldn't be the same and the format may not be either because I stored them as I was developing the code.

By the way, array2 is a copy of the array provided by the referenced author and isn't a "wonderword" puzzle.

So array1 has a 15X15 array of characters:
SASLOBMYSSTIURF
TPPRIESTSGROWLL
UPSGGBITENOTODA
NLEFNROVIDSVODM
NEEWINEAEUEYDTE
OSDPTRHSGATRSHN
IDSRSMEUIIAAHGV
TEEDARAPLHETKIE
ARTSOGOICFYRSNR
VESSRCTROMANCET
IUEMUROMOPARTYU
TQVNECELTICEKAM
LNRFMVILLAGERSN
UOADOOFPROTECTU
CCHTREESILVANUS
That you need to find 43 words like
Stir and Apples:
SASLOBMYSSTIURF Stir x=8, xd=-1, yd=1 Apples x=1, xd=0, yd=1 Symbols x=8, xd=-1, yd=0 Fruits x=14, xd=-1, yd=0 Fruit x=14, xd=-1, yd=0 Flamen x=14, xd=0, yd=1
Stir is the ninth character (x=8) on the above line, it moves horizontally left (xd=-1) and down (yd=1) and Apples is the second character and reads vertically both Fruits and Fruit starts on the last character and reads Left (Symbols does too.) They both exist because this is output from a concatenated puzzle that has maybe 800 search words and another one wanted to find Fruit but Flamen was definitely intended to be part of this puzzle (I think check word1) because it is an unusual word.
Note in the output I provide stats too:

I didn't find Stir on this line, I found it on the forth line down in reverse order: ...RI, RIC, RIG, RIN, RIS, RISE, RISK, RIT...
I had to search for R, RI, and RIT before I got a key with one word in it.
RIT is a reverse key lookup, but I bet a buck RIS is a forward key look-up with two words in it and I have to go through the entire keyed list to find those two words. RIT is the last dictionary value it found and then just had one more character in the keyword struct.
In this case, earlier it says
"Loop counts level 0: 4950, 1: 19800, 2: 59580 Array size: 4950, Search words: 855 words found: 1155 Characters in words: 5339 Search Keys: 2472"
level 0 is the number of characters stored in the array: 4950 divide that by 225=22 concatinated puzzles and probably isn't the number of puzzles your zip text files has because this grows as I add more as time goes on.
I got "...RI, RIC..." because as you go on, it lists the 2,472 keywords it kept and used.
19,800/4=4,950 No surprise because all 26 starting characters are keyed, this counts the starting characters that are keyed and advance to the 4 direction loop whether it goes forward or not. 59,580 is a combination of key lookup and continued searches. (S in Stir is counted even when it isn't keyed if it is found.) So, false starts and successful searches are counted. A false start has at least the second character to match, but won't count when the next character in that direction isn't found.
So many duplicate words found has to mean that I appended the same puzzle(s) multiple times.
Also, I did warn you about how easy (not) it is to read because I mentioned bubble and quick sort methods and how the second is much more efficient, but much harder to read.
This code has much more comments than I am used to seeing, did they help you out at all?
GeneralRe: My vote of 3 Pin
KP Lee11-Mar-15 18:44
KP Lee11-Mar-15 18:44 
GeneralRe: My vote of 3 Pin
VadimAlk15-Mar-15 16:44
VadimAlk15-Mar-15 16:44 
GeneralRe: My vote of 3 Pin
KP Lee15-Mar-15 23:10
KP Lee15-Mar-15 23:10 
GeneralRe: My vote of 3 Pin
VadimAlk17-Mar-15 17:26
VadimAlk17-Mar-15 17:26 
GeneralRe: My vote of 3 Pin
KP Lee18-Mar-15 16:50
KP Lee18-Mar-15 16:50 
GeneralRe: My vote of 3 Pin
KP Lee20-Mar-15 14:56
KP Lee20-Mar-15 14:56 
GeneralRe: My vote of 3 Pin
KP Lee21-Mar-15 18:45
KP Lee21-Mar-15 18:45 
GeneralRe: My vote of 3 Pin
KP Lee23-Mar-15 11:41
KP Lee23-Mar-15 11:41 
GeneralRe: My vote of 3 Pin
VadimAlk26-Mar-15 13:08
VadimAlk26-Mar-15 13:08 
GeneralRe: My vote of 3 Pin
KP Lee26-Mar-15 18:02
KP Lee26-Mar-15 18:02 
GeneralRe: My vote of 3 Pin
KP Lee26-Mar-15 19:25
KP Lee26-Mar-15 19:25 
GeneralRe: My vote of 3 Pin
KP Lee2-Apr-15 17:03
KP Lee2-Apr-15 17:03 
GeneralRe: My vote of 3 Pin
KP Lee3-Apr-15 18:40
KP Lee3-Apr-15 18:40 

General General    News News    Suggestion Suggestion    Question Question    Bug Bug    Answer Answer    Joke Joke    Praise Praise    Rant Rant    Admin Admin   

Use Ctrl+Left/Right to switch messages, Ctrl+Up/Down to switch threads, Ctrl+Shift+Left/Right to switch pages.