Click here to Skip to main content
15,881,516 members
Articles / Programming Languages / C#
Tip/Trick

Performance Solving "WonderWord" Puzzle

Rate me:
Please Sign up or sign in to vote.
3.67/5 (4 votes)
12 Mar 2015CPOL15 min read 29.8K   132   5   15
Cross-matching an array of characters to a set of words using NlogN logic

Introduction

Design Efficiency involved in a puzzle called "WonderWord" and comparison with "similar" puzzle solving technique.

Background

I am interested in efficient processing techniques.

One very common routine written about is the bubble sort. This is extremely simple to write, but is an N^2 process. The quicksort process is an NlogN process and is much more complex to read and understand, but has highly superior performance to N^2. With N^2, you quadruple the processing time when you double the size.

While working on the logic for solving this efficiently, I read http://www.codeproject.com/Articles/861524/Solving-Word-Search-Puzzles-in-Linear-Time. I was not impressed with the above article's “efficiency”. The author of the article had logic that started with a fair but flawed premise in my opinion and didn't efficiently implement it to boot. (I DON'T see Linear Time, I see squared time.) You add the characters' values together to get a keyword value. The sum of APPLE, AELPP, AOPME, and ZUAAA character values are exactly equal, so if you get a sum match your probability is higher but you still have to verify the individual characters match in the same relative position to the others. It's a bad assumption, but I admit the probability of a match is higher when they do match. The total sum was recalculated for each position and direction, instead of subtracting one letter from one direction and adding one letter on the other end. You have to go through this for every word. As you increase the size of the array, the number of words to search would have to go up as well. I didn't go through the whole logic but it looked to me that it went through the array in 4 directions for one of the keywords and then did it again for the next word. This “Linear Time” process looks exactly like an N^2 process to me.

The article did point out a more efficient way I could look at the keywords than I had thought of. (I started looking in 8 directions when you could look in 4 directions if you added a reversed word look-up.)

If it doesn't find a Wonderword, hopefully, it's because you messed up the setup, not the author's puzzle, nor my code's mess-up. (I had plenty of those at the start of this design, haven't found any new ones to fix for a while now. Each time I add a new puzzle, seems 10-90 will be setup perfectly immediately.)

The size of the array being searched is one part of the problem and I don't see how to solve that. All the keywords also need to be checked at every starting location and half the eight directions a word can be defined. and I think what I came up with is a pretty efficient way to do so for this problem. I am at a loss on how (or if) it could be better tweaked.

I took on a puzzle called Wonderword published in a local newspaper by author David Ouellet. (I since found out there is a website: http://www.wonderword.com/ that is authored by him.) This is basically a 15X15 array of characters, followed by a list of keywords that you need to find in the array. The keywords are entered horizontally, vertically, or diagonally in forwards or backwards order. The puzzle wants you to circle the keywords' letters and the remaining unmarked letters spells out the “Wonderword”. (Can be a phrase instead of a word.) I put several of the published puzzles in the code to get a "N" sized processing comparison.

While working on the logic for solving this efficiently, I read http://www.codeproject.com/Articles/861524/Solving-Word-Search-Puzzles-in-Linear-Time. I was not impressed with the above article's “efficiency”. The author of the article had logic that started with a fair but flawed premise in my opinion and didn't efficiently implement it to boot. You add the characters' values together to get a keyword value. The sum of APPLE, AELPP, AOPME, and ZUAAA character values are exactly equal along with probably thousands of other combinations that add to 370, so if you get a sum match your probability is higher but you still have to verify the individual characters match in the same relative position to the others. It's a bad assumption, but I admit the probability of a match is higher when they do match. The total sum was recalculated for each position and direction, instead of subtracting one letter from one direction and adding one letter on the other end. As far as I can tell, you have to go through this for every word. (I am not a C++ guru) As you increase the size of the array, the number of words to search would have to go up as well. I didn't go through the whole logic but it looked to me that it went through the array in 4 directions for one of the keywords and then did it again for the next word. This “Linear Time” process looks exactly like an N^2 process to me. (It's linear as long as you don't change the array size or the number of words.) N is linear, NLogN approaches linear. NLogN, going from 2 to 4 should take 4 times as long, going from 1 to 2 million should take 2.1 times as long. Going from 1 to 2 thousand should take 2.2 times as long.

The article did point out a more efficient way I could look at the keywords than I had thought of. (I started looking in 8 directions when you could look in 4 directions if you added a reversed word look-up.)

It seems I need to explain a little more about what is going on. Here is small portion of the output generated for a 15X330 array puzzle. (22 15X15 puzzles concatinated together):

Loop counts level 0: 4950, 1: 19800, 2: 59580 Array size: 4950, Search words: 855 words found: 1155 Characters in words: 5339 Search Keys: 2472

15*330 == 4950 == the number of characters in the search array .(15 is the width of the string, 330 is the number of strings in the array.) 19800/4 == 4950 Level 0 should and does exactly match the Array size in characters, Level 1 is 4 times the number of first character matches. This is such a big puzzle, all 26 characters are defined as a 1 letter key, so it will always go to the second character. It is 4 times, because it counts the 4 direction loop, whether a search in that direction is made or not. (Code verifies the shortest word found would fit in that direction.) Level 2 is a count of all successful and failed searches.

1155 is higher than 855 because several puzzles used the same keywords. With 22 concatinated puzzles, that will happen. I did happen to copy the same 15X15 puzzle twice so every word in that puzzle was found at least twice.

The Search Key dictionary will contain keys of the search words in forwards and backwards orders. Generally the keys will be double the keys as characters in the word. IE "Key" would have K, KE, KEY, Y,YE,YEK as its 6 keys. The list if it doesn't exist it is created and mostly the key's keyword object is added (Yesterday would be defined once for Y)

So KeyVals would maybe have 10,678 keys. It again would go through the 855 keywords and copy the same set of keys until only 1 key is listed in both directions. (Including the first 1 item key found.) It's kind of interesting that the total keys has almost always come out close to 3 times the number of search words.

Here is a short example (Note that this time I listed the keys, which the code provides at this point):

Loop counts level 0: 225, 1: 848, 2: 1750 Array size: 225, Search words: 44 words found: 44 Characters in words: 268 Search Keys: 128
A, AD, AI, B, BE, BL, BLA, BLAN, BLAS, BLO, BR, C, CH, CL, CO, COF, COL, D, E, EE, EM, ET, EX, EZ, EZE, EZO, F, FE, FEB, FEE, FI, FL, FR, FRI, FRO, FROS, FROZ, G, GN, GR, GU, H, HA, HE, HEA, HEAT, HEAV, HO, HS, HT, I, IC, IN, J, K, L, LA, LE, LEE, LEP, LL, LLA, LLE, LLI, M, ME, MI, MR, N, NI, NO, NOS, NOV, O, R, RA, RE, REB, RED, RET, RO, S, SA, SD, SE, SEA, SEI, SER, SET, SH, SL, SLE, SLU, SN, SNE, SNO, ST, STA, STS, T, TE, TEA, TEE, TEK, TS, TSA, TSO, TSU, W, WA, WI, WO, WOL, WOLB, WOLE, Y, YD, YE, YR, YRA, YRAU, YRAUN, YRAUR, YRI, YRO, YV, YZ, Z

Letters not defined:PQUVX.

Note that 214 characters that reached level 1 so 11 were eliminated with the first character being one of the 5 characters not defined. Also, all the above information is provided to see what the code is doing. Actually a fair percentage of the code is used to get performance information on the technique instead of solving the problem.

Note that if the first character looked at is B, if the second character isn't E, L, or R that direction is toast. If you do reach BR, you never look at the dictionary again. With YRAU, you still have one more dictionary search to make, but with a 4 letter match what is the probability it WON'T be N or R or won't spell out the full search word?

Also in this puzzle FROZ and EZO uniquely identify Froze in this puzzle, so you would have to search 3 or 4 keyed words to find the unique 5 letter search word. Only D, J, K, O, or Z would uniquely identify the search word, otherwise it would use the dictionary lookup more than once. Once it is not using the dictionary, it is an N process. Overall, it remains an NlogN process.

Both loading the dictionary and searching the array are NlogN processes, but for the dictionary, N and logN is the number of characters in the search words times 2 (1 to load the dictionary, and 2 to reduce the size of the dictionary loading a temp dictionary. Or you could look at it as loading a temp dictionary and replacing it with the reduced size dictionary) With the array, N is the number of characters in the array times 4, times the number of letters in the word and logN is the dictionary lookup

Loop counts level 0: 4950, 1: 19800, 2: 59580 Array size: 4950, Search words: 855 words found: 1155 Characters in words: 5339 Search Keys: 2472 The cost of creating the original keys would be about 10,669*14. To reduce the keys listed would take about 2,472*(14+12) operations. (log base 2 of 10,578 is 14 and of 2,472 is 12) to check the keys would take about 4950*12*4*2=475,200 the size of the array, the logN reading, 4 directions and average of 2 reads. (Another way to look at it: 59580*12=714,960, the reported lookup loops which is high because it includes loops not using the dictionary lookup times the LogN dictionary lookup cost.)  To find the words in C++ version giving it the best possible check 855*9*324*4*12= 119,672,640 operations. (About 250 times slower than the C# version, in theory, no way I can test it.) That's the number words, subtracting 6 from both dimensions of the array, four directions and the position and addition calculations for average of 6 characters.)

The size of the array being searched is one part of the problem and that can't disappear. All the keywords also need to be checked at basically every starting location and half the eight directions a word can be defined. I think what I came up with is a pretty efficient way to do so for this problem. I am at a loss on how (or if) it could be better tweaked. Because I put the search words in a string key and a keyword list dictionary, to look each string (leading to a keyword) up is a LogN process. (To look up one key in a million keys takes 20 loops), to define the keys in a Dictionary is NlogN, to match the word against every position is NlogN too.

Here is an example of the output I use to show where the keywords are defined:

I wrote a WonderWord routine, I can't do that without proving it can find the Wonder Word in a published puzzle:

Letter and location: C x=1 y=5 R x=3 y=7 I x=10 y=9 S x=6 y=12 P x=13 y=12
Wonder word is: CRISP

Wow, found eight words in the first line (None found in another):
MAEBIYRAUNAJELF Meteorologists x=0, xd=1, yd=1 Mittens x=0, xd=0, yd=1 Advisory x=1, xd=1, yd=1 Breeze x=3, xd=1, yd=1 Indoor x=4, xd=1, yd=1 January x=11, xd=-1, yd=0 Airy x=7, xd=1, yd=1 Flurries x=14, xd=0, yd=1
IEDERNVBILOTLEL Below x=7, xd=1, yd=1 Lazy x=9, xd=1, yd=1
TBTVFEDAERAEEEU Blanket x=1, xd=1, yd=1
TSLEIFEOELYZWFR Feel x=13, xd=0, yd=-1 Yell x=10, xd=1, yd=-1 February x=13, xd=0, yd=1 Windy x=12, xd=-1, yd=1
ETOAOSOZOHOIYER Heavy x=9, xd=-1, yd=-1 Zero x=7, xd=1, yd=-1
NCSRNROCERNWTBI Coffee x=7, xd=-1, yd=-1 Sleet x=2, xd=0, yd=1
SGLAFKORFDMAFRE Frost x=4, xd=-1, yd=-1 Shivering x=0, xd=0, yd=1 Gust x=1, xd=0, yd=1 Froze x=8, xd=1, yd=-1 Fires x=12, xd=0, yd=1 Frigid x=12, xd=-1, yd=1
HUERLHELYRERIUS Rain x=11, xd=0, yd=1 Snowfall x=14, xd=0, yd=1
ISEECBETOHIARAN Blast x=5, xd=-1, yd=-1 Heater x=9, xd=1, yd=-1
VTTTMPRTTGIIERO Teas x=8, xd=1, yd=1
EAOEUESMIEINSYW
RHVADERDSEASONF Hats x=1, xd=0, yd=-1 HotChocolate x=1, xd=1, yd=-1 Season x=8, xd=1, yd=0
IORLCASTKIDSTPA Kids x=8, xd=1, yd=0
NGOIWOLBXHSULSL November x=0, xd=1, yd=-1 Graupel x=1, xd=1, yd=-1 IceStorm x=3, xd=1, yd=-1 Warmth x=4, xd=1, yd=-1 Blow x=7, xd=-1, yd=0 Slush x=13, xd=-1, yd=0
GCLIMATESECHILL Extreme x=9, xd=-1, yd=-1 Colder x=1, xd=1, yd=-1 Climates x=1, xd=1, yd=0 Chill x=10, xd=1, yd=0

y isn't provided because the word is printed on the y line. x is the index where the first character can be found, with the directions, 1 is right(xd) or down(yd), -1 is left or up, and 0 is vertical or horizontal. Note I solve in 4 directions, but list in 8 always starting with the first character. Also the author lists HotChocolate as Hot Chocolate, you need to remove special characters like blank and dash when loading the words into code.

Using the Code

This is a set of test console code I used to see if performance matched expectations. Mainly no, but sometimes extremely close matches. I found that DateTime had major variances. I dislike using Stopwatch because I have never found it to be accurate. Looping 4 billion times should take about 4 seconds to run if it runs perfectly on my machine, DateTime says about 4.6 seconds, Stopwatch was just over 3 seconds, a manual stopwatch was about 5 seconds (expectedly slow reaction time with no advance warning when printing a line is going to happen). Including it seemed to calm or reduce the variances DateTime was giving me. It looks like JIT sticks its nose into performance because the first call seems to be consistently longer processing than subsequent tests. Maybe something else is going on?

By the way, I put in the Linear programming article's 30X30 puzzle in this code too. The puzzle is in array2 and the search words are in words2. The article's WonderWord is of course nonsense but the method of blanking out the characters used by the other article isn't helpful when all but 5 of 225 characters would be blank.

The following is a map of the code's main parts

Type Name Usage
struct Keyword Tool to handle search words passed to WonderWord
int Keyword.Length Total length of the string defined here
string Keyword.OrigWord Case sensitive version of word passed in “words” array
string Keyword.ReverseWord Upper case version of OrigWord in reverse order. “The” would be “EHT”
string Keyword.Upperword Upper case version of OrigWord
Function Keyword.KeyVal pass int for the length, returns substring of Upperword
Function Keyword.RevKeyVal pass int for the length, returns substring of ReverseWord
Function AppendWords Tool to increase the number(size) of puzzles and keywords to solve
Function Main Console app to define and run various hard-coded puzzles
Function WonderWord Name of the main routine designed to solve puzzle
string[] array WonderWord field to search for words – order vital, fixed width
string[] words WonderWord field of words to search for – order immaterial, variable width
Dictionary<string, bool> UniqueKeywrd Uniquely identifies keywords passed to WonderWord in reverse/forward order
Dictionary<string, List<Keyword>> KeyVals Initially stores all unique words in forward and backward order from first to last character.  Basically twice the number of characters in the word are keys.
Dictionary<string, List<Keyword>> TkeyVals Temp object, copies keys in KeyVals to the first one with one item in list
Dictionary<string, List<int>> KeyLoks Keyword.Upperword is the key, multiples of four for list. Location and direction
Various Undefined Lot of working objects, StringBuilder, int, string, You'll need to just figure it out. Lots of comments in the code.

Points of Interest

I am not that conversant in C++, if my criticisms of the other article were inaccurate would you mind explaining why?

A dictionary loads in NlogN and reads in logN. I discounted the loading and setting up of the dictionary and just figured that finding the keyword key was the logN portion being repeatedly run in the array.

FYI, I thought I'd fully tested it and found a bug today. I used the arrayblk/wordblk to load today's puzzle and for the first time screwed up and left empty (blank) words. When loading the words, verify the length is > 0 before adding the keyword.

History

  • March 5th, 2015
    • Published
  • March 6th, 2015
    • Revised new edition because tip didn't have source code
    • Reloaded zip file whose link was broken and still contains the source code
    • Asked for editor help on publishing the source code since I supplied the source code in the first go-round.
  • March 6th, 2015
    • Described bug in supplied code because I supplied empty words to search for the first time today 
  • March 11
    • Added more detail of how the code works and statistics for both processes

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


Written By
Software Developer Comcast Cable
United States United States
This member has not yet provided a Biography. Assume it's interesting and varied, and probably something to do with programming.

Comments and Discussions

 
GeneralMy vote of 3 Pin
VadimAlk8-Mar-15 14:56
VadimAlk8-Mar-15 14:56 
GeneralRe: My vote of 3 Pin
KP Lee9-Mar-15 16:53
KP Lee9-Mar-15 16:53 
GeneralRe: My vote of 3 Pin
KP Lee11-Mar-15 18:44
KP Lee11-Mar-15 18:44 
GeneralRe: My vote of 3 Pin
VadimAlk15-Mar-15 16:44
VadimAlk15-Mar-15 16:44 
GeneralRe: My vote of 3 Pin
KP Lee15-Mar-15 23:10
KP Lee15-Mar-15 23:10 
GeneralRe: My vote of 3 Pin
VadimAlk17-Mar-15 17:26
VadimAlk17-Mar-15 17:26 
GeneralRe: My vote of 3 Pin
KP Lee18-Mar-15 16:50
KP Lee18-Mar-15 16:50 
I have had absolutely NO formal training in Big O notation. What I know about it is what I've read about it on the web. and there is NEVER Laugh | :laugh: conflicting information available there.

Wickipedia wrote:
The sort has a known time complexity of O(n^2), and after the subroutine runs the algorithm must take an additional 55n^3+2n+10 time before it terminates. Thus the overall time complexity of the algorithm can be expressed as

T(n)=55n^3+O(n^2).

Here the terms 2n+10 are subsumed within the faster-growing O(n^2). Again, this usage disregards some of the formal meaning of the "=" symbol, but it does allow one to use the big O notation as a kind of convenient placeholder.

Note that I altered the quote for the special character 2 that didn't copy so well here. I fully understand subsuming 2n+10, but I don't understand why "+O(n^2)" didn't get subsumed too, because if n is 1000, who cares about a million when you are comparing it to 55 billion? That's 55.001 billion, or more accurately 55.00100201 B. (And then, it IS only an estimate. Laugh | :laugh: ) The whole article is hard for me to read since it's been 40 decades since I really used algebraic equations. Even then, = was NOT order dependent so switching the order would not change the meaning, though the example was perfectly clear on how and why it could be order dependent. I certainly didn't know enough to know why the equations couldn't be swapped.

Why anyone would pick a bubble sort routine when the quick sort has a known complexity of NlogN is beyond me. (Going by memory 150K took 1.6 minutes vs. 4 milliseconds. Sorting 150M with the quick sort took 49+ seconds.) 55 is also irrelevant. 55 what? N^3 is really the only relevant value in the cost. If N=100 costs 10 seconds, N=1000 costs 10,000 seconds (About 13 minutes short of 3 hours. You need to divide 10 and 10,000 by 55 to find the cost if N^3 wasn't done 55 times, but the ratio of the cost is still cubed.)

Also subsuming processes that aren't relevant to the overall cost makes sense to me. In this case, N1 is the number of characters in the array and N2 is 3 times the number of words to search. The process of my routine is then N1LogN2. I don't care what dimensions you pass in the array, N1 is the number of characters I have to process. In estimations, edge cases are ignored, so when I bypass looking in a direction because the shortest word would go beyond the boundary, that isn't counted in any estimate because it is, overall, irrelevant to the cost.
GeneralRe: My vote of 3 Pin
KP Lee20-Mar-15 14:56
KP Lee20-Mar-15 14:56 
GeneralRe: My vote of 3 Pin
KP Lee21-Mar-15 18:45
KP Lee21-Mar-15 18:45 
GeneralRe: My vote of 3 Pin
KP Lee23-Mar-15 11:41
KP Lee23-Mar-15 11:41 
GeneralRe: My vote of 3 Pin
VadimAlk26-Mar-15 13:08
VadimAlk26-Mar-15 13:08 
GeneralRe: My vote of 3 Pin
KP Lee26-Mar-15 18:02
KP Lee26-Mar-15 18:02 
GeneralRe: My vote of 3 Pin
KP Lee26-Mar-15 19:25
KP Lee26-Mar-15 19:25 
GeneralRe: My vote of 3 Pin
KP Lee2-Apr-15 17:03
KP Lee2-Apr-15 17:03 
GeneralRe: My vote of 3 Pin
KP Lee3-Apr-15 18:40
KP Lee3-Apr-15 18:40 

General General    News News    Suggestion Suggestion    Question Question    Bug Bug    Answer Answer    Joke Joke    Praise Praise    Rant Rant    Admin Admin   

Use Ctrl+Left/Right to switch messages, Ctrl+Up/Down to switch threads, Ctrl+Shift+Left/Right to switch pages.