|
Looks like a fun way to improve your coding!
Good luck with that, and with the improvement!
Any chance I can give you some homework?
In AvalonEdit, their text model uses a "Rope", which looks like an IList but has fast insert and delete, i.e. O(log n) cost instead of O(n).
In fact their code is probably quite nice... but after looking at it for only 2 minutes I was confused; it's all mixed up with the text model, if I remember right...
Maybe you could be interested in making a standalone Rope class?!
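To make the idea concrete, here is a minimal sketch of the usual rope structure (an illustration only, not AvalonEdit's actual Rope class): leaves hold short chunks of text, and each concatenation node caches the length of its left subtree so indexing walks a single root-to-leaf path.

```csharp
// Minimal rope-like structure (illustrative only, not AvalonEdit's Rope):
// leaves hold short text chunks; each concat node caches the length of its
// left subtree so indexing walks one root-to-leaf path.
using System;

abstract class RopeNode
{
    public abstract int Length { get; }
    public abstract char CharAt(int index);
}

sealed class LeafNode : RopeNode
{
    readonly string text;
    public LeafNode(string text) { this.text = text; }
    public override int Length => text.Length;
    public override char CharAt(int index) => text[index];
}

sealed class ConcatNode : RopeNode
{
    readonly RopeNode left, right;
    readonly int leftLength;   // the "weight" of this node
    public ConcatNode(RopeNode left, RopeNode right)
    {
        this.left = left; this.right = right; leftLength = left.Length;
    }
    public override int Length => leftLength + right.Length;
    public override char CharAt(int index) =>
        index < leftLength ? left.CharAt(index) : right.CharAt(index - leftLength);
}

static class RopeDemo
{
    static void Main()
    {
        // Concatenation just allocates a new parent node; no text is copied.
        RopeNode rope = new ConcatNode(new LeafNode("Hello, "), new LeafNode("world"));
        Console.WriteLine(rope.Length);    // 12
        Console.WriteLine(rope.CharAt(7)); // w
    }
}
```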
|
|
|
|
|
sounds like maybe a segmented linked list, where each link can contain multiple elements. That way, instead of reallocating an array, it just has to add a new link. Traversing is still about as fast.
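Roughly, that segmented-list idea (often called an unrolled linked list) looks like this; the chunk size here is arbitrary:

```csharp
// Rough sketch of a segmented ("unrolled") linked list: each node holds a
// small fixed-size array, so growing never copies the whole collection; a new
// node is linked in when the tail chunk fills up.
using System;
using System.Collections.Generic;

class SegmentedList<T>
{
    const int ChunkSize = 16;

    class Node
    {
        public readonly T[] Items = new T[ChunkSize];
        public int Count;
        public Node Next;
    }

    Node head, tail;
    public int Count { get; private set; }

    public void Add(T item)
    {
        if (tail == null || tail.Count == ChunkSize)
        {
            var node = new Node();
            if (head == null) head = node; else tail.Next = node;
            tail = node;
        }
        tail.Items[tail.Count++] = item;
        Count++;
    }

    // Traversal is still a straight walk over mostly-contiguous chunks.
    public IEnumerable<T> Enumerate()
    {
        for (var node = head; node != null; node = node.Next)
            for (int i = 0; i < node.Count; i++)
                yield return node.Items[i];
    }
}
```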
When I was growin' up, I was the smartest kid I knew. Maybe that was just because I didn't know that many kids. All I know is now I feel the opposite.
|
|
|
|
|
You can check for yourself if there is any voodoo in there...
Reference Source[^]
"The only place where Success comes before Work is in the dictionary." Vidal Sassoon, 1928 - 2012
|
|
|
|
|
like i said, i've seen the ref source
When I was growin' up, I was the smartest kid I knew. Maybe that was just because I didn't know that many kids. All I know is now I feel the opposite.
|
|
|
|
|
You talked about Reflection... It is not Reflection; it is the source used to compile .NET...
"The only place where Success comes before Work is in the dictionary." Vidal Sassoon, 1928 - 2012
|
|
|
|
|
No, I didn't talk about reflection. I mentioned Reflector, which is a tool used to decompile .NET assemblies. Despite the name, it has nothing to do with reflection in .NET
I know exactly what the reference source is. I said I looked at it, so of course I'd know what it is.
I've seen Dictionary.cs. I've seen SortedDictionary.cs
I've also seen the decompiled contents of mscorlib and system.dll
so i know what this code looks like. BTW, the decompiled versions are more optimized than the reference source.
When I was growin' up, I was the smartest kid I knew. Maybe that was just because I didn't know that many kids. All I know is now I feel the opposite.
|
|
|
|
|
honey the codewitch wrote: the decompiled versions are more optimized than the reference source
Which is a plus for the compiler.
I see you went through the code - so why talk of voodoo?
"The only place where Success comes before Work is in the dictionary." Vidal Sassoon, 1928 - 2012
|
|
|
|
|
because their code is performing better than expected for a red/black tree (which is what they're using in SortedDictionary) compared to my SortedBTreeDictionary
although mine destroys Microsoft's for searching after about 30,000 items.
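Roughly, the kind of harness used for that comparison looks like the sketch below. SortedBTreeDictionary stands in for the custom class and is only assumed to implement IDictionary<int, int>; its line is commented out so the sketch compiles on its own with just the BCL.

```csharp
// Rough lookup-timing harness across collection sizes.
using System;
using System.Collections.Generic;
using System.Diagnostics;

static class SearchBenchmark
{
    static double TimeLookups(IDictionary<int, int> dict, int n)
    {
        for (int i = 0; i < n; i++) dict[i] = i;            // fill
        var sw = Stopwatch.StartNew();
        long checksum = 0;
        for (int i = 0; i < n; i++) checksum += dict[i];    // n keyed lookups
        sw.Stop();
        GC.KeepAlive(checksum);                             // make sure the loop result is used
        return sw.Elapsed.TotalMilliseconds;
    }

    static void Main()
    {
        foreach (int n in new[] { 10_000, 30_000, 100_000, 1_000_000 })
        {
            double rb = TimeLookups(new SortedDictionary<int, int>(), n);
            // double bt = TimeLookups(new SortedBTreeDictionary<int, int>(), n); // custom class, assumed API
            Console.WriteLine($"{n,9} items: SortedDictionary {rb:F1} ms");
        }
    }
}
```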
When I was growin' up, I was the smartest kid I knew. Maybe that was just because I didn't know that many kids. All I know is now I feel the opposite.
|
|
|
|
|
honey the codewitch wrote: although mine destroys Microsoft's for searching after about 30,000 items.
This is exactly the reason for using B-trees in a database.
They're not very sensitive to size, which is something you regularly get in a database, while more than 30,000 items is something you very seldom need to (or should) handle in memory.
|
|
|
|
|
Jörgen Andersson wrote: While more than 30000 items is something you very seldom need to (or should) handle in memory.
That's getting less true. What's funny is that by my tests, .NET is just fine with 3 million entries spread across 3 different dictionary classes.
The heap isn't as big as you'd expect, and the performance is really good both for the base Dictionary class (which is basically unsorted, and uses a hash lookup) and for my class, while being unsurprising for the other class.
Times are changing. An in-memory DB is totally doable even in C#, for smaller DBs.
I was thinking of backing JSON with something like this, or implementing a full B+ tree with backing storage for it.
When I was growin' up, I was the smartest kid I knew. Maybe that was just because I didn't know that many kids. All I know is now I feel the opposite.
|
|
|
|
|
That will be quite interesting.
|
|
|
|
|
I had 10 million entries in each of these collections last night. The B-tree rocked.
When I was growin' up, I was the smartest kid I knew. Maybe that was just because I didn't know that many kids. All I know is now I feel the opposite.
|
|
|
|
|
Looking forward to an article.
|
|
|
|
|
I plan to, but i want to make a little suite of sorted dictionaries for that.
When I was growin' up, I was the smartest kid I knew. Maybe that was just because I didn't know that many kids. All I know is now I feel the opposite.
|
|
|
|
|
|
I read through this thread and really cannot understand what the "problem" is, or where the voodoo comes in. There is very little that surprises me. Or do I misunderstand what you are talking about?
Say you've got 4 KiB blocks and 32-bit key values; then an index block holds up to 500 downward pointers (the remaining 96 bytes are for horizontal pointers and other sorts of management info). The root plus one level provide pointers to a quarter million blocks. If your data records are small enough to fit 40 to the block (i.e. less than 100 bytes), this might be enough for 10 million records: search the root index (use binary search if you prefer!), follow the associated pointer to the next index level, search that index (again with binary search if you like), and there you've got the pointer to the block where your data are located. The data block may have a small index with key/offset pairs for each record; this is so small that a sequential search is probably as fast as a binary one.
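A toy version of that lookup path, just to make the shape concrete (sizes and types are illustrative, not the 4 KiB layout above):

```csharp
// A small binary search in each index block picks the child to descend into,
// until a leaf block is reached and searched.
using System;

class IndexBlock
{
    public int[] Keys;             // sorted separator keys (index) or record keys (leaf)
    public IndexBlock[] Children;  // null at the leaf level; length == Keys.Length + 1

    public bool Contains(int key)
    {
        if (Children == null)                       // leaf block: search the records
            return Array.BinarySearch(Keys, key) >= 0;

        int pos = Array.BinarySearch(Keys, key);    // index block: pick a child
        pos = pos >= 0 ? pos + 1 : ~pos;            // keys equal to a separator go right
        return Children[pos].Contains(key);
    }
}
```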
If blocks are at minimum filling, you may, for ten million 100-byte records, end up with two index levels below the root. If keys are not integers but require more space, you may not be able to pack 330-500 key/pointer pairs at the index levels; this may also cause a second level to be established. But still: three small binary searches to get to the right block isn't much work. If this is a disk structure, the root will definitely remain in cache throughout, and unless you are extremely tight on RAM, all or much of the first level below will as well. Often you use sorted structures because the records are frequently processed in sorted order. So while the second index level is not necessarily all in RAM, the relevant parts of it may be, in those cases where you do more or less sequential processing. (If the sequential ordering means nothing to your use of the data, then you should use a hash method, not a B-tree!)
You quote a single figure for both insert and removal. When you make your timings, you should make a large number of calls, quoting minimum, maximum, and average (or even more statistics-oriented figures, if that's within your field of knowledge). If you enter a million records, most of them will be super-fast; there is still room in the block for accommodating another one (a key, at the index level, or a record, at the leaf level). Every now and then, a block fills up and must be split (more often at the leaf level than at the index level, if the data itself is kept in the block, not just pointers). How often this happens depends on the key size (for index blocks), the record size (for the data blocks), and the block size.
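A sketch of that kind of reporting, timing each insert individually and quoting min / max / average (SortedDictionary stands in for whatever container is being measured; single-call Stopwatch readings are noisy, so treat the figures as rough):

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;

static class InsertTimings
{
    static void Main()
    {
        var dict = new SortedDictionary<int, int>();
        var rng = new Random(42);
        var samples = new List<double>(1_000_000);
        var sw = new Stopwatch();

        for (int i = 0; i < 1_000_000; i++)
        {
            int key = rng.Next();
            sw.Restart();
            dict[key] = i;                                        // the operation under test
            sw.Stop();
            samples.Add(sw.Elapsed.TotalMilliseconds * 1000.0);   // microseconds
        }

        Console.WriteLine($"min {samples.Min():F3} us, max {samples.Max():F3} us, avg {samples.Average():F3} us");
    }
}
```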
On rare occasions, the entire tree is full and a new level must be added. That could be quite a lot more time consuming; if the tree is to be balanced according to B-tree rules all the time, this might require some shuffling of stuff around.
Remember that B-tree rules don't absolutely require you to immediately float a block split up to the index level above. A search may have to skip to the next "horizontal" block until it finds a key higher than the candidate key. Some implementations are rather lax, leaving split blocks for quite some time before updating the index. This might reduce the average add time, at the expense of (asynchronous) "cleanup" routines and somewhat higher average search times.
On average, B-trees are quite fast for searching. But one thing that holds for all sorts of index structures: if you've got ten million new records to insert into the tree, the right way to do it is NOT one by one, in unsorted order. Any B-tree implementation should provide a "batch insert", sorting those ten million records by an n log n method before inserting them into the tree, and doing this in sequential key order, filling up leaf blocks from one end, one block at a time, without worrying about the higher index levels, linking in new blocks through horizontal pointers as needed. Once the bottom layer is done (without concern for the higher-level indexes), you go up one level to insert pointers to all the new blocks, completing that level before ascending further, level by level, up to the root.
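A compact sketch of that bulk-load idea (deliberately tiny fanout, integer keys, and no horizontal pointers, purely to show the shape of the algorithm):

```csharp
// Sort once, pack leaves left to right, then build each index level from the
// level below it, bottom-up, instead of inserting keys one at a time.
using System;
using System.Collections.Generic;
using System.Linq;

class Block
{
    public int[] Keys;        // leaf: data keys; index: first key of each child
    public Block[] Children;  // null at the leaf level
}

static class BulkLoad
{
    const int Fanout = 4;

    public static Block Build(IEnumerable<int> keys)
    {
        // 1. Sort once (n log n) instead of doing n unsorted one-by-one inserts.
        int[] sorted = keys.Distinct().OrderBy(k => k).ToArray();
        if (sorted.Length == 0) return new Block { Keys = Array.Empty<int>() };

        // 2. Fill leaf blocks from one end, one block at a time.
        var level = new List<Block>();
        for (int i = 0; i < sorted.Length; i += Fanout)
            level.Add(new Block { Keys = sorted.Skip(i).Take(Fanout).ToArray() });

        // 3. Ascend level by level, pointing each parent at the blocks below it.
        while (level.Count > 1)
        {
            var parents = new List<Block>();
            for (int i = 0; i < level.Count; i += Fanout)
            {
                var children = level.Skip(i).Take(Fanout).ToArray();
                parents.Add(new Block
                {
                    Keys = children.Select(c => c.Keys[0]).ToArray(),
                    Children = children
                });
            }
            level = parents;
        }
        return level[0];   // the root
    }
}
```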
|
|
|
|
|
First of all, it's a collection class, not a DBMS. Yes, B-trees can be used for DBMSes, but if I was going to go through all the trouble of optimizing my algorithm, doing batch inserts and what have you, then I'd be doing it on a B+ tree instead.
At least if my goal was to make a dbms.
My concern in my OP was not about the performance of my class(es)
It was about the performance of Microsoft's classes.
Namely, a SortedDictionary (red-black tree) beating out a B-tree after 10,000 entries.
But then i think i already explained that. Maybe i didn't. i don't remember and don't much care.
When I was growin' up, I was the smartest kid I knew. Maybe that was just because I didn't know that many kids. All I know is now I feel the opposite.
|
|
|
|
|
Apparently I implemented a B+ tree for no reason because it gives you zero benefit unless you're accessing slow I/O like a disk, and it's tied to your nodes, which means implementing some kind of file driver and a file format for storing nodes at the very least.
So basically, a database.
I mean, it's cool code, but it's useless by itself. On the shelf it goes.
*headdesk*
At least my regular B Tree works. That's kinda cool.
When I was growin' up, I was the smartest kid I knew. Maybe that was just because I didn't know that many kids. All I know is now I feel the opposite.
|
|
|
|
|
Make it 2D, and you've got the fabled QuadTrees!
Although most (in-memory) implementations (including mine) seem to be a simple (2D) binary tree instead of a B+ tree.
Could be useful for making a graphics application (and doing in-memory hit testing), like the one I've had in limbo for the last 2 years!
|
|
|
|
|
interesting. right now i don't know much about graphics as far as the maths go.
but i can do some basic trig and compute the distance between points, that kind of thing.
i've never much needed to do polygon hit testing though i do know that btrees or b+trees are used in them. I'm not sure how.
When I was growin' up, I was the smartest kid I knew. Maybe that was just because I didn't know that many kids. All I know is now I feel the opposite.
|
|
|
|
|
Simple: when you do a hit test, you don't want to hit test every polygon in your application, since that's going to be a computationally expensive operation.
So all your polygons are in a QuadTree, along with their bounding rectangles.
And the QuadTree has a very fast hit test implementation that returns the list of polygons worth checking in more detail!
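A small sketch of that scheme: each shape goes into the quadtree keyed by its bounding rectangle, and a point query returns only the candidates whose bounds contain the point, so the expensive exact test runs on a handful of shapes. RectangleF/PointF come from System.Drawing, and the split threshold is arbitrary; this isn't any particular library's implementation.

```csharp
using System;
using System.Collections.Generic;
using System.Drawing;

class QuadTree<T>
{
    const int SplitThreshold = 8;
    readonly RectangleF bounds;
    readonly List<(RectangleF Box, T Item)> items = new List<(RectangleF, T)>();
    QuadTree<T>[] children;                 // null until this node splits

    public QuadTree(RectangleF bounds) { this.bounds = bounds; }

    public void Insert(RectangleF box, T item)
    {
        if (children != null)
            foreach (var child in children)
                if (child.bounds.Contains(box)) { child.Insert(box, item); return; }

        items.Add((box, item));             // kept here if no child fully contains it
        if (children == null && items.Count > SplitThreshold) Split();
    }

    void Split()
    {
        float w = bounds.Width / 2, h = bounds.Height / 2;
        children = new[]
        {
            new QuadTree<T>(new RectangleF(bounds.Left,     bounds.Top,     w, h)),
            new QuadTree<T>(new RectangleF(bounds.Left + w, bounds.Top,     w, h)),
            new QuadTree<T>(new RectangleF(bounds.Left,     bounds.Top + h, w, h)),
            new QuadTree<T>(new RectangleF(bounds.Left + w, bounds.Top + h, w, h)),
        };
        var pending = items.ToArray();
        items.Clear();
        foreach (var (box, item) in pending) Insert(box, item);   // push items down where possible
    }

    // Returns every stored item whose bounding box contains the point; the caller
    // then runs the exact polygon test only on those candidates.
    public IEnumerable<T> Query(PointF p)
    {
        foreach (var (box, item) in items)
            if (box.Contains(p)) yield return item;
        if (children != null)
            foreach (var child in children)
                if (child.bounds.Contains(p))
                    foreach (var hit in child.Query(p)) yield return hit;
    }
}
```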
|
|
|
|
|
I get the gist at least.
When I was growin' up, I was the smartest kid I knew. Maybe that was just because I didn't know that many kids. All I know is now I feel the opposite.
|
|
|
|
|
|
As for hit testing, there are some simple, well-known implementations that work well for polygons (the winding number one being my favourite).
My grief is that I work with shapes, i.e. unlike polygons, which have straight lines between vertices as boundaries, shapes have Bezier curves, which are much harder.
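For reference, a sketch of the winding-number test for straight-edged polygons (Bezier-bounded shapes would need to be flattened to line segments first): a nonzero winding number means the point is inside, and it copes with concave and self-intersecting polygons by design.

```csharp
using System;

static class WindingNumber
{
    // pts: polygon vertices in order; the edge from the last vertex back to the first is implied.
    public static bool Contains((double X, double Y)[] pts, double px, double py)
    {
        int winding = 0;
        for (int i = 0; i < pts.Length; i++)
        {
            var a = pts[i];
            var b = pts[(i + 1) % pts.Length];
            if (a.Y <= py)
            {
                if (b.Y > py && Cross(a, b, px, py) > 0) winding++;   // upward crossing, point left of edge
            }
            else
            {
                if (b.Y <= py && Cross(a, b, px, py) < 0) winding--;  // downward crossing, point right of edge
            }
        }
        return winding != 0;
    }

    // Cross product telling which side of the directed edge a->b the point lies on.
    static double Cross((double X, double Y) a, (double X, double Y) b, double px, double py) =>
        (b.X - a.X) * (py - a.Y) - (px - a.X) * (b.Y - a.Y);
}
```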
|
|
|
|
|
also i remember it being more complicated for polys that have concave bits to them.
I could be wrong. It has been years since i read the black book.
When I was growin' up, I was the smartest kid I knew. Maybe that was just because I didn't know that many kids. All I know is now I feel the opposite.
|
|
|
|
|