|
How about a ZDD of 36 * 7 variables? If there are any patterns (and there will be) it should do fine. As a bonus you can give it more interesting queries than just "is this one available?" or even "give me the lexicographically smallest available number" - such as, "give me all the available numbers that have 'R0FL' in them"
By the way, why is a packed bit array out again? 9GB isn't that much at all.
|
|
|
|
|
Never even heard of a ZDD haha. From googling it, it looks like a DAWG type thing?
|
|
|
|
|
Well, not exactly, but I can see how they could be called related.
|
|
|
|
|
Lol, well, I tried looking at a paper on it and it went way over my head, but it was a technical paper.
|
|
|
|
|
|
As has been said, a database is likely the best general solution. Otherwise, a word search tree (similar to the DAWG you mentioned) might work fairly well.
One of the problems with the situation described is that few real-world cases require each client to maintain its own copy of such a large collection of data. Usually there will be connectivity to a shared server of some sort. However, there are situations in which a client must continue working while disconnected. But in such situations, the client probably shouldn't be allowed to add data to the collection.
Anecdote time: I was once involved in such a situation. What had been done was to store four bits per item in a binary file (this was on DOS by the way), so for instance the first byte of the file held the status for item 0 in the low nybble and the status of item 1 in the high nybble, etc. Access to a particular item's status was a simple calculation, disk seek, and read one byte. I don't remember what the highest numbered item was at the time -- let's say a million -- so that would be about a half MB of disk to store the list. Once a day a whole new file was downloaded to each connected client and throughout the day updates would be sent. If a client was disconnected it would start getting updates when it reconnected and a whole file within a day -- this was fine for the purpose.
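For anyone who wants to see the nybble layout spelled out, here is a minimal sketch in Python (the original was on DOS, so this is only an illustration). The function names, the file name and the status values are made up for the example; the only things taken from the description above are "even item in the low nybble, odd item in the high nybble" and "seek, then read one byte".

```python
import os
import tempfile

def create_status_file(path, item_count):
    """Create a zero-filled file holding one 4-bit status per item."""
    with open(path, "wb") as f:
        f.write(bytes((item_count + 1) // 2))

def read_status(path, item):
    """Seek to the item's byte and return its 4-bit status."""
    with open(path, "rb") as f:
        f.seek(item // 2)
        byte = f.read(1)[0]
    # Even items live in the low nybble, odd items in the high nybble.
    return (byte >> 4) if item % 2 else (byte & 0x0F)

def write_status(path, item, status):
    """Update one item's nybble without disturbing its neighbour."""
    if not 0 <= status <= 0x0F:
        raise ValueError("status must fit in four bits")
    with open(path, "r+b") as f:
        f.seek(item // 2)
        byte = f.read(1)[0]
        if item % 2:
            byte = (byte & 0x0F) | (status << 4)
        else:
            byte = (byte & 0xF0) | status
        f.seek(item // 2)
        f.write(bytes([byte]))

# Small demo: a million items fit in half a million bytes.
with tempfile.TemporaryDirectory() as folder:
    demo_path = os.path.join(folder, "status.bin")  # hypothetical file name
    create_status_file(demo_path, 1_000_000)
    write_status(demo_path, 0, 5)
    write_status(demo_path, 1, 9)
    print(os.path.getsize(demo_path), read_status(demo_path, 0), read_status(demo_path, 1))
```

Every lookup touches a single byte, so the cost does not depend on how many items the file holds.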
|
|
|
|
|
Why would you store ALL license plates? Why not just the ones that are available eg?
That should reduce the size of your object significantly.
In addition, the question is just ridiculous without use of a database.
V.
|
|
|
|
|
How would you only store the available ones?
For what is now probably the 17th time, this had nothing to do with databases or license plates; they were asking a DATA STRUCTURE question, i.e. how would you deal with large amounts of data. You do not always have the option of using a database.
|
|
|
|
|
SledgeHammer01 wrote: You do not always have the option of using a database.
No, you pretty much always do nowadays.
|
|
|
|
|
Oh my lord...
Ok, fine... let me rephrase the question since people don't seem to understand what "hypothetical" and "think outside the box" mean.
Almost the same exact problem, but completely rephrased. Instead of 0 - 9 & A - Z as your alphabet, you now only need to deal with A - Z. Instead of "words" fixed at 7 characters, you can now have "words" of any length. Given an input, I need you to find out if that "word" is "taken". Oh yeah, and this is for a mobile device where you have limited resources, so the customer will get mad if he is only able to install our application and no other because it fills up his mobile device.
Sounds almost like a spell check / dictionary type problem, no? Kind of like the original problem if you were able to "think outside the box".
Would you store a complete list of English words in a database to do a spell check? Heck no. That would be a horrible solution and a complete waste of space.
But you say "that's a completely different problem!!!! you bastard!!!"... not really... imagine if instead of spell checking English words, you "spell checked" the license plates.
Now obviously, when applied to the license plate problem, there are other issues, like the "spell check" solution AND/OR a database solution would be a horrible idea if you had a lot of items. Clearly for that problem, the packed bit array is the most scalable, compact & fastest runtime. Ok, but do mobile devices have 9GB of resources available? Probably not.
Anyways, it wasn't a real world design issue, it was a problem solving exercise.
But clearly, the license plate issue is pretty similar to a spell check problem.
|
|
|
|
|
It's not that similar because the reason you wouldn't store every word* in a dictionary is because words have associations that you can exploit to organise the data. In particular, words tend to share their beginnings and therefore you can create some form of tree-like structure which can chop off large sections of the search space, and cut down on duplication.
As stated in the problem, the plates are entirely uniformly distributed in the search space and that means you can't do that.
(*: to be honest you probably would, a word list is going to be measured in megabytes and even on a phone these days that's not even close to a problem)
|
|
|
|
|
Whaaaaatttt??? I'm confused. Aren't there 1.6M license plates that start with XYZ? And 1.6M license plates that start with 123 and so on?
You are probably right though, that a flat list of English words if we assumed 1M words at 10chars avg would only be 10MB.
There are 78B possible license plates though.
|
|
|
|
|
Yes, that's my point. Potential plates (at least in this system) are uniformly distributed, so you can't use the clustering to your advantage. For words, there are (say) 100,000 starting with 'exa' but 0 starting with 'ejf'. So your top level of tree, if you make it 3 characters (max 26³ entries), can cut out whole sections of the theoretical word space. That is because of the clustering of words within the search space which is predictable and non-random.
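To make the pruning argument concrete, here is a minimal prefix tree sketch in Python. The class and function names and the tiny word list are assumptions for illustration only. It shows the two effects described above: shared beginnings are stored once, and a prefix that no word uses (like 'ejf') is rejected after a couple of steps.

```python
class Trie:
    """A bare-bones prefix tree: one node per distinct prefix."""

    def __init__(self):
        self.children = {}
        self.is_word = False

    def insert(self, word):
        node = self
        for ch in word:
            node = node.children.setdefault(ch, Trie())
        node.is_word = True

    def _walk(self, prefix):
        """Follow `prefix` from the root; return None as soon as a branch is missing."""
        node = self
        for ch in prefix:
            node = node.children.get(ch)
            if node is None:
                return None
        return node

    def contains(self, word):
        node = self._walk(word)
        return node is not None and node.is_word

    def has_prefix(self, prefix):
        return self._walk(prefix) is not None


def node_count(trie):
    """Count nodes, including the root."""
    return 1 + sum(node_count(child) for child in trie.children.values())


words = ["example", "exam", "exact", "exalt"]  # toy word list
trie = Trie()
for w in words:
    trie.insert(w)

print(trie.contains("exam"), trie.has_prefix("exa"), trie.has_prefix("ejf"))
print(sum(len(w) for w in words), "characters stored in", node_count(trie), "nodes")
```

With uniformly distributed plates nearly every branch would exist, so the tree would neither share much nor prune much, which is the point being made about the license plate case.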
|
|
|
|
|
I think it was you who misunderstood me.
The whole hypothetical situation of storing a buttload of data in memory is fine, but it's not because "no database is available".
SledgeHammer01 wrote: Would you store a complete list of English words in a database
Yes, I have, for checking Scrabble words. In fact I used Access to do it.
For an actual spell-checker I would use a spell check tree, but I still need to persist the list between runs -- ergo, a database, which can then be loaded (perhaps on demand) the next time I run it. (The tree could be more of a cache.)
On a mobile, I expect a Web Service would be more appropriate.
SledgeHammer01 wrote: the packed bit array is the most scalable, compact & fastest runtime.
And it's even better if you can leave it on disk rather than hold it in memory.
|
|
|
|
|
PIEBALDconsult wrote: it's not because "no database is available".
No, it's because that's what you were asked in the interview. If you told your interviewer that you refused to answer the question aside from using a database and you would refuse to implement any solution that did not use one, I'm guessing you probably wouldn't get the job.
Anyways, while it's not likely the majority of the time NOW, there are plenty of scenarios where a database might not be available: mobile, embedded, military, etc. Now sure, I could add a requirement to the Air Force Smart Bomb design where I need to install an Access database on every bomb, but seeing as I'm just going to blow it up, I'm guessing they want to make it as cheap as possible.
Now, before you go off all half-cocked, foaming at the mouth about why a smart bomb would have a spell checker, it wouldn't. It was a hypothetical example of a situation where you MIGHT not have one.
I have an iPhone now, but I used to have a Moto Razr. I'm guessing the iPhone has a database and the Moto Razr didn't. Hmm... weird... the Moto Razr had EXTREMELY limited resources, but it still happened to have a full spell checker in it.
EDIT: the Moto Razr had 5MB of available space. That's for the OS, phone book, pics, videos, games, email, etc. It STILL managed to squeeze a spell checker in there... bet there wasn't room for an Access database or a database of any variety. Bet they had to use some really efficient data structures.
|
|
|
|
|
V. wrote: Why not just the ones that are available eg?
That doesn't alter the essential question (stupid question it may be) -- creating an in-memory structure to hold a whole lot of data.
|
|
|
|
|
PIEBALDconsult wrote: stupid question it may be) -- creating an in-memory structure to hold a whole
lot of data.
Yeah, true. Cuz you NEVER need to do that. Maybe not that AMOUNT of data, but...
|
|
|
|
|
Such a vague scenario is pretty much impossible to answer, because how you'd go about it depends on where the data has patterns. For example, if IDs are allocated in blocks, use a range based structure. If it will always be very sparse, store only the 'hits'. If the search space is likely to be over 50% full, you might want to use an inverse (i.e. store the empties).
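As one illustration of the "IDs allocated in blocks" case, here is a minimal range-based structure in Python. The class name `RangeSet` and the sample numbers are assumptions for the example. It keeps sorted, non-overlapping inclusive ranges, so a block of any size costs two integers, and a lookup is a binary search.

```python
import bisect

class RangeSet:
    """Stores used IDs as sorted, non-overlapping inclusive ranges."""

    def __init__(self):
        self.starts = []
        self.ends = []

    def add_range(self, start, end):
        """Add [start, end], merging with any overlapping or adjacent ranges."""
        # First existing range that touches or overlaps the new one on the left.
        i = bisect.bisect_left(self.ends, start - 1)
        # One past the last existing range that touches or overlaps on the right.
        j = bisect.bisect_right(self.starts, end + 1)
        if i < j:
            start = min(start, self.starts[i])
            end = max(end, self.ends[j - 1])
        self.starts[i:j] = [start]
        self.ends[i:j] = [end]

    def __contains__(self, value):
        i = bisect.bisect_right(self.starts, value) - 1
        return i >= 0 and value <= self.ends[i]


used = RangeSet()
used.add_range(100, 199)   # a block of 100 IDs costs two integers
used.add_range(300, 399)
print(150 in used, 250 in used)
used.add_range(200, 299)   # fills the gap, so all three blocks merge into one
print(used.starts, used.ends)
```

The same class works for the inverse idea too: if the space is more than half full, store the free ranges instead and flip the answer.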
License plates are a clearly unrealistic scenario because the number you need to store will only be of the order of 100 million at most, and therefore you can just use a conventional database (even at 1 KB per record you're still only talking 100 GB, which is within the capacity of a normal computer, never mind an institutional server).
9GB is not big enough to be a problem, as long as you don't hook it all into memory, so I'd just dump the whole bit array to a file, in sequence, and use seek to look in the right place (and poke a bit back when a plate is allocated), if a database was disallowed and there was no aggregation or querying requirement.
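A minimal sketch of that seek-based bit array in Python, assuming 7-character plates over 0-9 and A-Z (which is what the 78B figure implies). The function names are made up for the example, and a temporary file stands in for the real bit file. Bytes past the current end of the file are treated as "all free", so the file only grows as far as the highest plate marked.

```python
import tempfile

ALPHABET = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"
PLATE_LEN = 7  # assumption: 36**7 is roughly 78 billion plates

def plate_to_index(plate):
    """Read the plate as a base-36 number; its value is its bit position."""
    if len(plate) != PLATE_LEN:
        raise ValueError("plate must be %d characters" % PLATE_LEN)
    index = 0
    for ch in plate:
        index = index * 36 + ALPHABET.index(ch)  # ValueError on a bad character
    return index

def is_taken(f, plate):
    """Seek to the plate's byte and test its bit."""
    byte_pos, bit = divmod(plate_to_index(plate), 8)
    f.seek(byte_pos)
    data = f.read(1)
    return bool(data) and bool(data[0] & (1 << bit))

def mark_taken(f, plate):
    """Read one byte, set the plate's bit, and poke the byte back."""
    byte_pos, bit = divmod(plate_to_index(plate), 8)
    f.seek(byte_pos)
    data = f.read(1)
    current = data[0] if data else 0
    f.seek(byte_pos)
    f.write(bytes([current | (1 << bit)]))

with tempfile.TemporaryFile() as bits:  # stand-in for the real bit file
    print(is_taken(bits, "0000ABC"))
    mark_taken(bits, "0000ABC")
    print(is_taken(bits, "0000ABC"), is_taken(bits, "0000ABD"))
```

Each check or allocation is one seek plus one byte of I/O, regardless of how many plates are in use.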
Edit: obviously, if I was writing this system for real, I'd just use a database, heh.
|
|
|
|
|
I'm confused: you say the specific question asked was specifically about checking available license plates, but, then, you say the question was about data structures, and databases are "irrelevant:" did the interviewer(s) tell you the question was about data structures, and exclude "databases," specifically, or not ?
Or are these conclusions that you inferred from the interviewer(s)' question?
imho any perceived ambiguity in the interviewer(s)' question "opens the door" for you to follow-through with clarifying questions that attempt to frame the question asked in terms of real-world scenarios: like, as mentioned in a comment above, that the largest possible number of license plates for any one real state (in the US) would be far less than 78 billion ... as in California's 37 million, of which we can safely assume that many do not have a driver's license.
My answer would have been to say straight out: that storing the total possible set of all possible variations of license plates was absurd, no matter what data structures would be used, and that the database should store only the plates already allocated, and use a function to generate new ones at random, or new ones that meet some predefined criteria like having the letters "IAMOK" somewhere in the sequence (similar to the custom tag services offered by many states for which a fee is charged).
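A rough sketch of that idea in Python: keep only the allocated plates and generate candidates until one is unused. The name `new_plate`, the `must_contain` parameter and the 7-character length are assumptions for illustration, and a plain in-memory set stands in for whatever store would hold the allocated plates.

```python
import random

ALPHABET = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"

def new_plate(allocated, must_contain="", length=7, rng=random, max_tries=10_000):
    """Return an unused plate, optionally containing a requested substring.

    `allocated` holds only the plates already issued; the new plate is
    added to it before returning.
    """
    if len(must_contain) > length:
        raise ValueError("requested text is longer than a plate")
    for _ in range(max_tries):
        filler = [rng.choice(ALPHABET) for _ in range(length - len(must_contain))]
        pos = rng.randrange(len(filler) + 1)  # where to splice in the requested text
        plate = "".join(filler[:pos]) + must_contain + "".join(filler[pos:])
        if plate not in allocated:
            allocated.add(plate)
            return plate
    raise RuntimeError("no free plate found; the space may be nearly full")

issued = set()
print(new_plate(issued))
print(new_plate(issued, must_contain="IAMOK"))
```

Random retries work well while the space is sparse, which is the realistic case; they degrade as the space fills up.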
If the interviewer(s) then altered the question, so it became, in general: how would you store very large amounts of data: I would have replied that to answer that question requires a consideration of the internal organization of the data, and its usage, and storage (centralized vs. distributed), and there is no "one-size-fits-all" answer, and that a good programmer becomes intimately familiar with the internal organization of the data, its real-world "incarnation," and how the data is used in the real world, and then selects data structures based on several criteria, including efficiency of storage, compatibility with existing hardware and network constraints, the speed of access that end-users will demand/require, etc.
And then, I'd probably walk over to the food-stamp office, to collect my monthly allocation, and think about the next job interview.
best, Bill
"Our life is a faint tracing on the surface of mystery, like the idle, curved tunnels of leaf miners on the surface of a leaf. We must somehow take a wider view, look at the whole landscape, really see it, and describe what's going on here. Then we can at least wail the right question into the swaddling band of darkness, or, if it comes to that, choir the proper praise." Annie Dillard
|
|
|
|
|
BillWoodruff wrote: I'm confused: you say the specific question asked was specifically about
checking available license plates, but, then, you say the question was about
data structures, and databases are "irrelevant:" did the interviewer(s) tell you
the question was about data structures, and exclude "databases," specifically,
or not ?
The question was to determine how you think when large amounts of data are involved. Do you think you would not have to solve such problems if you were working at, say, Google? Or Amazon, etc.?
Your response is actually pretty similar to the majority I have gotten. "This problem is stupid", "there aren't that many license plates in the entire world", etc. rather than thinking of the problem in generic terms. For example... oh hey, maybe this problem is a variation on the traveling salesman problem, or the knapsack problem, and I can apply that solution.
That was the point of the exercise, but everybody seems to be focusing on the license plate aspect of the question.
modified 3-Feb-12 12:31pm.
|
|
|
|
|
SledgeHammer01 wrote: The question was to determine how you think when large amounts of data are involved.
SledgeHammer01 wrote: everybody seems to be focusing on the license plate aspect of the question
I think the very simple reason for the focus on the "license plate" issue is: that's what you told us the actual question was.
Now, if you had told us the question actually asked was:
"We want you to tell us the strategies you would employ for storing very large amounts of data, as Google, and Amazon, do: you might use as an example, to make this question more specific: the issue of a license-plate management system where all permutations of the sets ... blah and blah ... are ... blah ... blah ... blah ... and an immediate task is the allocation of an unused license number."
Then, I think you would have received a very different set of replies.
To that (hypothetical) question, I would have replied: "what is the correlation of adding an unused license number of a car from a set of all permutations, to adding a new user, book review, to Amazon, or to adding a new link-listing to Google: are you asking about the issue of generating unique identification indexes in master databases (GUIDS, or some hash) ?"
Note my assumption (which you have every right to be skeptical about) that: in technical interviews, where you know the people interviewing you are "smart," and informed about what they are asking you about: that they respect someone who responds to vague, overly-broad, questions, by pro-actively demanding they be clarified: in the process of which you reveal your awareness of the issues and factors that surround real-world implementations.
Also, I believe a "challenging interviewee" ... who shows intellectual balance in the act of challenging ... and demonstrates emotional equanimity in the art of doing so ... is impressive ... or at least would impress persons at companies I would wish to be employed by.
But, I would make these distinctions: HR "screening" interviews are inappropriate places to be challenging: they ask "mush," and should be given back whatever "mush" they want to hear. "Rubber-stamp" interviews with managers, when you've been offered the job based on the real technical interviews, are also scenarios where you just want to "go with the flow," and appear as a "good citizen" who looks forward to "learning the job."
"Float like a butterfly, sting like a bee." Muhammad Ali
Where'd I put those food stamps?
best, Bill
|
|
|
|
|
BillWoodruff wrote: I think the very simple reason for the focus on the "license plate" issue is:
that's what you told us the actual question was.
That's my whole point. The question given was very specific. A senior / principal / lead software engineer who makes $120k to $150k / yr should not have to be told that he is allowed to generalize a problem or how to generalize it. I'm sorry if that comes out bad, it's not intended as an attack on you or anybody else. It's just how they expect you to think. Think of the problem in generic terms and then, hey, maybe you can do something specific for this type of data that wouldn't work for other types of data, even in the same problem type.
BillWoodruff wrote: To that (hypothetical) question, I would have replied: "what is the correlation
of adding an unused license number of a car from a set of all permutations, to
adding a new user, book review, to Amazon, or to adding a new link-listing to
Google: are you asking about the issue of generating unique identification
indexes in master databases (GUIDS, or some hash) ?"
Again, you are focusing on specifics. None of that matters to the hypothetical question. It's just "given this data and the vast amount of it, how would you store / work with it?". That's just a question to see if you come back and say "I can't, it's too hard", or "I can do it, but my algorithm requires 10TB of hard drive space and 5 days of processing per request". Maybe Amazon and Google don't issue license plates today, but they are planning to in the future? It doesn't really matter, that wasn't the point of the question.
One responder came back and said he would just store a flat list of used license plates. Ok. Awesome. Let's say 50B of the 78B plates were taken. That's 50B x 6 bytes = 300 billion bytes, or about 279GB. Unless you maintain the list in sorted order, your search time to check a plate # could be O(n). So that's 279GB and worst case O(n) for only 64% of the data. Through discussion, we have already found a solution that is O(1) for ALL operations and needs 9GB of storage worst case for 100% of the data, and actually, you would only need to read/write a SINGLE BYTE out of the 9GB file, so your memory requirements are essentially zero. What's the better solution?
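For anyone who wants to check those figures, here is the arithmetic as a few lines of Python. It assumes 7-character plates over a 36-symbol alphabet (which is where 78B comes from), 6 bytes per stored plate as in the flat-list estimate, and binary gigabytes (2**30 bytes).

```python
GIB = 2 ** 30

total_plates = 36 ** 7                      # 78,364,164,096 possible plates
bit_array_bytes = (total_plates + 7) // 8   # one bit per plate: 9,795,520,512 bytes

taken = 50_000_000_000                      # the "50B taken" scenario
flat_list_bytes = taken * 6                 # 300,000,000,000 bytes

print(f"possible plates : {total_plates:,}")
print(f"bit array       : {bit_array_bytes / GIB:.1f} GiB")
print(f"flat list       : {flat_list_bytes / GIB:.0f} GiB")
print(f"share taken     : {taken / total_plates:.0%}")
```

So the bit array covers the whole space in roughly one thirtieth of what the flat list needs for about two thirds of it.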
BillWoodruff wrote: Note my assumption (which you have every right to be skeptical about) that: in
technical interviews, where you know the people interviewing you are "smart,"
and informed about what they are asking you about: that they respect someone who
responds to vague, overly-broad, questions, by pro-actively demanding they be
clarified: in the process of which you reveal your awareness of the issues and
factors that surround real-world implementations.
You are certainly allowed to ask questions. HOWEVER... are you a sr. / principal / lead software engineer in real life? I am, and I can tell you that if my boss came to me and asked me to implement feature X and I started asking him to clarify / solve the problem / lead me to water / etc. whatever you want to call it... he'd probably get annoyed and say "I'm paying you to solve the technical problems, that's your job, it's not mine". That is actually true. As senior / lead software engineers, it's our job to take a problem and make it work.
However, since you are skeptical about how Google interviews and probably think I made this whole scenario up... I *DID* really have an interview there and one of the questions was "we have millions of servers in production running millions of queries & processes per second... let's say your piece was not working properly (memory leak was the exact scenario), how would you go about debugging it?"
Me: If I did not instantly know what the issue was, I would first try to reproduce it on my own 1 or 2 PC test environment.
Him: Let's say it was not an issue that showed up on 1 or 2 PCs and you needed say 10,000 PCs to see the issue?
Me: I would put logging information in the suspect areas
Him: What would you log?
Me: memory allocations and deletions
Him: Ok, but remember, we are talking about 10,000 machines doing millions of allocations and deletions, do you really think you could make sense of a log file with billions of entries?
Me: good point (chuckle)
Me: Well, I've always been a big fan of what Microsoft did with MFC. They overloaded the new and delete operators and kept track of allocated memory in a linked list or whatever and when the process would exit dump out the list of leaked memory and its contents. That way it would solve the billion entry log file issue and you would only get a log file with the leaks and nothing else and your code wouldn't be cluttered with logging writes since it all happened automatically.
Him: Cool, that would work in theory, but remember, these are production servers and you are not allowed to deploy test code to production servers since a programming error could bring down the entire site.
Me < at this point, I started to get annoyed and headed down your road >... hmm... well, I'm assuming you have a smaller scale test lab or a procedure in place for testing scenarios like that?
Him: Yeah, we have a test lab with 1000 machines, but your leak or issue only shows up with 10,000+ machines
Me: And you don't have a procedure in place for testing large scale issues?
Him: Yeah, we do. Assume we don't and that you are responsible for coming up with one.
etc. and it went on and on like that.
|
|
|
|
|
Hi, thanks for taking the time to respond in such depth: I find your answers fascinating.
I think the "dis-connect" in our conversation is that I am assuming you are being interviewed by your technical peers, other programmers, for a position on their team; from what you have told me, now, I believe you are describing a situation where you are being interviewed by project managers as a technical lead, and you wouldn't have gotten to that interview without a very impressive resume. And I wonder if you would have gotten to that interview without first being interviewed by the programmers you would "lead."
But, the question you describe, does not seem to me the type of questions I'd expect project managers to ask. But, my experience is based on a reality twenty or more years ago, so it should be discounted heavily.
I was asked in a phone interview at Microsoft maybe ten years ago, for a technical documentation lead position in a certain area of .NET, by the manager of the group: "tell me about object-oriented programming?"
When I responded: "that's such a vast area we need to narrow the focus down to have any meaningful discussion ... I mean are we talking history ... Xerox PARC, Smalltalk, Eiffel, Yourdon, Gang of Four ... or are we talking OOP as implemented in .NET with an emulation of multiple inheritance via Interfaces, or are we talking theory of n-Tier development as found in modern "business applications"?"
His response was to act as if my response was totally stupid. But I sensed he was bored anyway, and was only doing the interview as a favor to his boss, whom a friend of mine at MS had prevailed upon to grant me the interview, so I think it was a "rigged game."
Are we speaking of two very different scenarios?
best, Bill
|
|
|
|
|
BillWoodruff wrote: I think the "dis-connect" in our conversation is that I am assuming you are
being interviewed by your technical peers, other programmers, for a position on
their team; from what you have told me, now, I believe you are describing a
situation where you are being interviewed by project managers as a technical
lead, and you wouldn't have gotten to that interview without a very impressive
resume. And I wonder if you would have gotten to that interview without first
being interviewed by the programmers you would "lead."
The Google interview process is a 15-30 min HR prescreen, then a 1hr technical phone screen w/ a random Sr. Engineer @ Google who is a "certified interviewer" (whatever that means), and then they fly you in for an all-day interview barrage, all with "certified interviewer" Sr. Engineers. All the interviews are technical and ask you data structure / algorithm type questions only. None of this "give me the difference between delete and delete[]" garbage, but questions like the ones I have given you. Afterwards, the 8 people who have interviewed you submit a thumbs up or thumbs down. I don't know what % of thumbs up you need to get "voted in", but I did not get it.
I have had similar experiences at Amazon, etc.
That is why I asked CP if they had good data structures for dealing with large data sets.
Obviously in a real world solution, you would store license plate #, make, model, owner, etc. in a database, but it's more a "how do you think?" type question. You don't get disqualified because you don't come up with the perfect solution; often in these interviews they will lead you to what they want to hear, as I have shown you with my Google interview. It's a process designed to weed out guys who are closed-minded / can't or won't think outside the box.
The reason why is, think about it... lots of people know C++ and/or C#. I know both languages, I'm assuming you are an expert at both as well. Meaning you know the syntax by heart, a lot of the APIs, pointers, classes, etc.
There is a big difference between the guy who only knows how to code vs. the guy who maybe isn't as strong as you in the coding, but is awesome at DESIGNING an algorithm and then handing it off to you or me to implement.
Google, Amazon, etc. want the type of guys who can design & invent.
|
|
|
|
|
Once again, your patient reply is appreciated!
Actually, I don't know C++, and I consider myself only at "journeyman" status with C#, but I was once a guru-level PostScript programmer who ended up (where else), at Adobe.
It's funny: at Adobe the programmers who could really produce, on the application side (PhotoShop, Illustrator, Premiere, etc.) were often hard-core programmers who were self-taught, and the CS graduates who were "ace" at algorithms often turned out to be "dead wood."
But, then, there were "geniuses" like Mark Hamburg who transformed PhotoShop into its "post-Knoll" incarnation: I was once talking to Mark, and the topic of what he majored in at college came up: he told me he majored in math because he had already read, and understood, all of Knuth's books in high school, and felt CS had little to offer him. No boundaries in that man's mind! He was later awarded the Gordon Moore prize by his Silicon Valley peers for his remarkable accomplishments.
The four-person team that created the Acrobat prototype from "zilch" (I was one) were all new hires from a company Adobe acquired, and none of us had a formal CS background. But, please, don't blame me for what Acrobat became. Our "skunkworks" project, under the direct supervision of John Warnock, was under constant attack by other groups within Adobe who considered us, I guess, as "barbarians at the gates."
But, a key difference I saw in the programmers who were so productive on the application side was that they intimately understood the quirks and features of the hardware/OS combination (Mac or Windows) they worked on. On Windows: they understood COM; on the Mac they understood things like custom window-defs, and its intricate, arcane, convoluted system of interacting system ROM calls.
And, the fresh CS graduates often had no clue about the inner architecture and hardware of personal computers. Why should they: back in those days they did their homework on terminals wired up to mainframes, usually DEC.
There are people who have strengths in both algorithms and coding, as well as great depth in specific platforms: I have great respect for them!
What I don't understand is how Google ever got Jon Skeet to jump ship from C#! Skeet, to me, is the guru's guru in .NET and C#.
cheers, Bill
|
|
|
|
|