Now remember, this is an interview question, so stuff like databases, etc. is irrelevant... they were testing my data structure knowledge.
Say you are working for the DMV and need to write a function that checks available license plates. Valid #'s are 0000000 to ZZZZZZZ (no variable-length plates). That means there are 36^7 ≈ 78 billion possible #'s. Storing that in a packed bit array would require about 9GB, so that's out.
I thought about storing a linked list of ranges, but if the range struct is ~20 to 30 bytes, then once you get above around ~200M ranges the packed bit array starts to look better.
Oh yeah, you can't assume they're handing out #'s sequentially, so it's possible you'd end up with every other # taken: 39B ranges * 30 bytes ≈ 1.2TB, or something like that haha.
Point was, they didn't like the linked-list answer much because it didn't scale to the worst case, or even much beyond ~200M ranges.
I had heard of interval trees before, but didn't know whether they would solve the scaling issue.
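For reference, a minimal sketch of what that packed bit array looks like in C# (assuming uppercase 7-char plates and a digits-first alphabet; the class and method names are made up). Note it needs gcAllowVeryLargeObjects and roughly 10GB of RAM no matter how few plates actually exist:

using System;

class PlateBitArray
{
    const string Alphabet = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ";
    const long Total = 78364164096L;                  // 36^7 possible plates

    // One bit per plate, packed into 64-bit words: ~9.1GiB up front.
    readonly ulong[] bits = new ulong[(Total + 63) / 64];

    static long ToIndex(string plate)
    {
        long index = 0;
        foreach (char c in plate)                     // base-36 conversion
            index = index * 36 + Alphabet.IndexOf(c);
        return index;
    }

    public bool Exists(string plate)
    {
        long i = ToIndex(plate);
        return (bits[i >> 6] & (1UL << (int)(i & 63))) != 0;
    }

    public void Take(string plate)
    {
        long i = ToIndex(plate);
        bits[i >> 6] |= 1UL << (int)(i & 63);
    }
}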
repost!
Luc Pattyn [My Articles] Nil Volentibus Arduum
Fed up by FireFox memory leaks I switched to Opera and now CP doesn't perform its paste magic, so links will not be offered. Sorry.
I thought about doing something like that, i.e. breaking it up into 36 lists. So you would eliminate the first char and have 36 lists of 36^6 entries each. Turns out it would use the same amount of memory or more, but you wouldn't have to have the entire list loaded at once.
I don't think so. For starters, I'm not storing all these symbols over and over. Second, it is the last symbol that gets collected in a bitmask, assuming they are handed out pretty much sequentially. Third, I'm not using linked lists or real pointers; instead I suggested using arrays and indexes.
And yes, whatever the approach, it should take advantage of the set being sparsely filled (I don't think there are 78B license plates around).
Luc Pattyn [My Articles] Nil Volentibus Arduum
Yeah, they said don't assume license plates are handed out sequentially... otherwise you would just do a return lastLicensePlate++ type of thing.
I was thinking you were going for something like:
list[0] = list (of 6 char words) for plates starting with 0
list[1] = list (of 6 char words) for plates starting with 1
etc., and have 36 of those. But 36 lists of 36^6 words is the same as one list of 36^7 words.
I'm not getting your idea, though... maybe you can explain a bit further?
Yeah, this wasn't a "real world scenario"... they were just testing whether I can deal with huge data sets.
SledgeHammer01 wrote: I'm not getting your idea
which one? I offered 3 ideas.
Here is the gist of the third and simplest one:
private static Dictionary<string, long> plates;
private static string alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789";
public static bool Exists(string plate) {
    string lead = plate.Substring(0, 6);           // first six symbols form the key
    long bitmask;
    if (!plates.TryGetValue(lead, out bitmask)) return false;
    int index = alphabet.IndexOf(plate[6]);        // seventh symbol selects a bit
    return (bitmask & (1L << index)) != 0;
}
You might notice it requires just one 6-char string and one long to hold information about 36 consecutive plates.
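The registration side under the same scheme could be as small as this (a sketch of mine; the name Take is not from the post above):

public static void Take(string plate) {
    string lead = plate.Substring(0, 6);
    int index = alphabet.IndexOf(plate[6]);
    long bitmask;
    plates.TryGetValue(lead, out bitmask);    // an absent key behaves as 0L
    plates[lead] = bitmask | (1L << index);   // set the bit for the 7th symbol
}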
Luc Pattyn [My Articles] Nil Volentibus Arduum
So if we have 78B #'s, that'd be 78B / 36 ≈ 2B dictionary entries? If each entry had no other data except the 6-byte string and the 8-byte long, that's 14 bytes per entry ≈ 30GB, before any dictionary overhead. Unless my math is wrong, I mean (which is entirely possible). Vast improvement over ~1.2TB, but you still fail the interview because my original packed bit array was only 9GB. I'm thinking there's probably a very compact solution out there using some obscure data structure. That's what I'm trying to find. So far, the leads are kind of pointing towards a DAWG.
If the set is fully populated, then it simply deserves one bit per plate, obviously; and then it doesn't need any intelligence. It is when the set is sparse that some intelligence can improve the situation. You don't want 78B license plates, do you? (that would be about 10 per human being) So don't judge the quality, efficiency, or any other characteristic of a sparse solution by feeding it the numbers of a full set.
Luc Pattyn [My Articles] Nil Volentibus Arduum
Oh right... I missed a minor part of your solution... you are only storing the prefixes in use. It's just that places like Google, when they interview you, want solutions that work at all scales.
Again, nothing to do with license plates, just how you handle vast amounts of data (think Google, or something on that scale).
With your solution, you are using the first 6 chars as the "prefix"/key and then "compressing" the last char... (36 plates into one entry).
So wouldn't you need the FULLY populated dictionary if, say, every 36th plate was taken?
So your storage requirement is ~30GB anywhere between 2B plates and 78B plates. 2B plates in 30GB vs. 78B in 9GB with the bit array.
Sorry, hope I'm not getting you angry or anything; it's just that this is how the interview went, so it's a real-world thing. They kept poking holes in everything haha...
But unless I misunderstood your algorithm, it does seem like you need the full 30GB as soon as you hit 2B plates (assuming 1 plate per range, of course)... if you had them in tightly packed groups, then your requirements wouldn't be as great.
I even told the interviewer, once I started getting annoyed at his hole poking: "Well, at that point I would probably change the license plate # selector algorithm to hand out #'s from a more tightly packed range." Lol.
Again, I offered 3 ideas. Your current concerns get handled by the first one. I gave a detailed implementation for the third and simplest approach; I am not going to do the others as well.
Luc Pattyn [My Articles] Nil Volentibus Arduum
If the data is going to be the worst case for whatever data structure you pick, I pick the packed bit array. Information theory gets in the way otherwise: any data structure that can be smaller than the packed bit array can only do so by exploiting some sort of regularity in the data. If there isn't any, then there are 2^(36^7) possible states, which require 36^7 bits to store, and that's the end of it.
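As a quick sanity check on the numbers in this thread (using BigInteger just to be explicit about the sizes):

using System;
using System.Numerics;

class BackOfEnvelope
{
    static void Main()
    {
        BigInteger plates = BigInteger.Pow(36, 7);      // 78,364,164,096 possible plates
        BigInteger bytes = (plates + 7) / 8;            // one bit per plate
        Console.WriteLine(bytes);                       // 9795520512 bytes
        Console.WriteLine((double)bytes / (1L << 30));  // ~9.1 GiB: the "9GB" figure
    }
}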
right.
Luc Pattyn [My Articles] Nil Volentibus Arduum
Yeah, I dunno what structure he was expecting. My guess is he was just trying to get me to punch him in the face or something. Who knows? The packed bit array is best in the worst case, obviously, but it's not good in the best case or even a remotely "average" case. One license plate should not take up 9GB of memory. I think he wanted something along the lines of a DAWG. If Office can store the entire English dictionary in 4MB, then surely this problem can be solved in less space.
Was googling how spell checkers and dictionaries store words. Seems like most use a DAWG:
http://en.wikipedia.org/wiki/Directed_acyclic_word_graph
I guess if I stored each license plate in the DAWG and did a "spell check" on it, it would work, although I'm not sure how well that would scale.
It's said a DAWG is the most space-efficient, so maybe that's the answer.
EDIT: saw a sample on the net where the guy said a dictionary was 17MB and the DAWG version was only 4MB. He didn't say how many words, etc.
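Not the DAWG construction itself, but a minimal trie over the 36-symbol alphabet shows the starting point (a sketch; names are made up). A DAWG is what you get after merging nodes with identical subtrees, which is where the 17MB-to-4MB kind of shrinkage comes from:

class PlateTrie
{
    const string Alphabet = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ";

    class Node
    {
        public Node[] Children = new Node[36];   // one slot per symbol
    }

    readonly Node root = new Node();

    public void Add(string plate)
    {
        Node n = root;
        foreach (char c in plate)
        {
            int i = Alphabet.IndexOf(c);
            if (n.Children[i] == null) n.Children[i] = new Node();
            n = n.Children[i];
        }
    }

    public bool Contains(string plate)
    {
        Node n = root;
        foreach (char c in plate)
        {
            Node next = n.Children[Alphabet.IndexOf(c)];
            if (next == null) return false;
            n = next;
        }
        return true;   // plates are all exactly 7 symbols, so no end-of-word marker needed
    }
}

Shared prefixes come for free in the trie; the suffix sharing is the DAWG's contribution. Whether that holds up on essentially random plate numbers is exactly the scaling question.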
Thanks for the suggestion.
Does it work in such a way that Chris understands when and how text gets pasted in forum messages, resulting in CP article links being turned into article titles, other links being linkified, and pasted code being formatted with PRE tags?
Luc Pattyn [My Articles] Nil Volentibus Arduum
Fed up by FireFox memory leaks I switched to Opera and now CP doesn't perform its paste magic, so links will not be offered. Sorry.
My previous reply was written using Dragon, and CP auto-magically linked the URLs and created the tags.
Thanks. I'll give it a spin.
Luc Pattyn [My Articles] Nil Volentibus Arduum
SledgeHammer01 wrote: I had one large company disqualify me because they assumed I couldn't write socket code because I didn't memorize the 7-layer OSI model
lol.
I had one interviewer who got visibly upset with me after I said I was capable of writing database code but I couldn't explain the 5 rules of normalization.
Might note that I still can't. But I do know that 2 of them are absolutely worthless for practical programming.
jschell wrote: I had one interviewer who got visibly upset with me after I said I was capable of writing database code but I couldn't explain the 5 rules of normalization.
SEVEN!
There are seven levels of normalization as defined, with 6NF being the top. None of the companies that I worked for went beyond BCNF. Now, would that moron be able to explain why he'd need to normalize to the fifth level, or was it merely random?
Bastard Programmer from Hell