Can somebody suggest a container:

1) that can search large data in the least amount of time, using the least amount of memory, and
2) that can also search data by its content?

So far I am storing my data in a string array with 1,000,000,000 elements.


Thanks in advance
Posted
Updated 29-Oct-14 2:18am
Comments
Nathan Minier 29-Oct-14 8:08am    
Searching has more to do with your algorithm and only somewhat to do with your container.

Are you searching alphabetically, or by some form of regex? What are the parameters, and how do you currently implement searching and sorting?
agent_kruger 29-Oct-14 8:19am    
No sir, I am just searching for some text, and I need a container that takes the least memory for searching and storing data (I forgot to add that in the question).
Nathan Minier 29-Oct-14 8:50am    
So that sounds like alphabetical indexing. Frankly, in terms of memory you can't beat a flat array. You could make it 2D; assuming English, it would be something like 'string[Letters][Words] AlphaIndex', which would give you alphabetical indexing, like a card catalog. If you want the easy route (which will be slower and less memory efficient) you can go with 'Dictionary<char, string> AlphaIndex', which has the benefit of implicit scalability.

I do think the search algorithm is your main concern, though, since most containers exist for convenience rather than for optimization. With that in mind, you have a three-step process to follow:
1. Sort your array fully at initialization.
2. Perform a recursive binary search on the array.
3. Return all results that match the input parameter.
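A rough sketch of those three steps (in Java rather than C#, purely for illustration; the library's iterative binary search stands in for a hand-rolled recursive one, and the sample data is made up):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SortedSearch {
    // Return every element equal to the key in an already-sorted array.
    static List<String> findAll(String[] sorted, String key) {
        List<String> hits = new ArrayList<>();
        int i = Arrays.binarySearch(sorted, key);
        if (i < 0) return hits;                              // key not present
        int lo = i;
        while (lo > 0 && sorted[lo - 1].equals(key)) lo--;   // walk back to the first duplicate
        for (int j = lo; j < sorted.length && sorted[j].equals(key); j++) {
            hits.add(sorted[j]);                             // collect the whole run of matches
        }
        return hits;
    }

    public static void main(String[] args) {
        String[] data = { "pear", "apple", "fig", "apple", "kiwi" };
        Arrays.sort(data);                                   // step 1: sort once at initialization
        System.out.println(findAll(data, "apple").size());   // prints 2
    }
}
```

Once the array is sorted, each lookup costs O(log n) comparisons plus one comparison per duplicate returned.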
Raul Iloc 31-Oct-14 2:56am    
As I also said in my solution below, one approach (which I also use in my projects) is a stored procedure (SP): the search runs in the database, and the results are fetched from the database page by page, so your control logic only needs to invoke the SP with the right parameters (similar to what I did in my article's source code) and then display the results for the current page.

This isn't a question we can answer: it depends far too much on your data, how you have it organised, and what you are trying to find. And remember that we can't see your screen, access your HDD, or read your mind!

If you are looking for the whole string, then that's relatively simple: you could use a hash table of some form, or a sorted list and binary chop to find the one you need.
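For the whole-string case, a minimal sketch of the hash-table route (Java here for illustration; a .NET HashSet&lt;string&gt; behaves the same way, and the sample strings are made up):

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class WholeStringLookup {
    // Build the hash table once (O(n)); each whole-string lookup is then O(1) on average.
    static Set<String> buildIndex(String[] data) {
        return new HashSet<>(Arrays.asList(data));
    }

    public static void main(String[] args) {
        Set<String> index = buildIndex(new String[] { "alpha", "beta", "gamma" });
        System.out.println(index.contains("beta"));   // true
        System.out.println(index.contains("delta"));  // false
    }
}
```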

If you are looking for a substring or substrings, then that's a whole new kettle of fish, and you probably want to start indexing words, or similar, and using those indexes to cut down on the number of comparisons you need to do.
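One possible shape for such a word index, sketched in Java (the record strings and the tokenizing regex are illustrative assumptions, not anything from the question):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class InvertedIndex {
    // Map each word to the ids of the records that contain it.
    static Map<String, List<Integer>> build(String[] records) {
        Map<String, List<Integer>> index = new HashMap<>();
        for (int id = 0; id < records.length; id++) {
            for (String word : records[id].toLowerCase().split("\\W+")) {
                if (word.isEmpty()) continue;
                index.computeIfAbsent(word, k -> new ArrayList<>()).add(id);
            }
        }
        return index;
    }

    public static void main(String[] args) {
        String[] records = { "the quick fox", "a lazy dog", "the lazy fox" };
        Map<String, List<Integer>> index = build(records);
        System.out.println(index.get("fox"));   // [0, 2]
        System.out.println(index.get("lazy"));  // [1, 2]
    }
}
```

A word-level search then only compares against records the index points at, instead of scanning everything.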

But frankly, there are as many efficient ways to do this as there are different possibilities for exactly what you want to do!
So sit back, think about your data and your task in more detail, and try to work out exactly what you are trying to do - from that you might at least get some ideas to let you ask a more specific question. And it needs to be a lot more specific before we can help!
Comments
BillWoodruff 29-Oct-14 9:06am    
+5 I think that covers every nanometer of it. Nice to see the size the OP is dealing with has shrunk down from 10 terabytes !
OriginalGriff 29-Oct-14 9:36am    
Ten Terabytes? :OMG:
That's probably about the same as the total HDD space in this house!

And he's holding it in a *string array* ? My word.
That's either a fine load of RAM in his PC, or the paging alone is going to kill him...
OriginalGriff 29-Oct-14 9:37am    
No, scratch that, I just counted.
I have approx 14TB of HDDs, so it would fit. Just. :laugh:
Robert Welliever 31-Oct-14 1:58am    
Wow that's a lot of porn you must have.
agent_kruger 31-Oct-14 2:39am    
:laugh:
From your question, it seems that you have 1,000,000,000 strings, which is really a lot. It does not make much sense to load that much data into memory at once.

Thus you should work with a part of the data at a time, so only a fraction of it is in memory.

If you want to run many searches over that huge amount of data, then you need to build some indexing. Otherwise, every search will take a long time.

Or maybe it is possible to use Windows indexing. I have never looked into that side of things; just an idea.

By the way, it is hard to recommend the best solution when we don't know how it will be used (single search, frequent searches, fixed data or changing data...) and by whom. It also depends on whether you do exact search, whole-word search, or wildcard search, and whether you need to handle upper/lower case, accents and the like.

Finally, as a developer, you typically have to try some approaches to see how they compare and whether they are adequate. By the way, do you have a problem with existing code, or have you even tried it yet?

Also, you might consider buying something designed for this, depending on the intended use.
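As a minimal sketch of working with only a fraction of the data at a time (Java for illustration; the file path is hypothetical and the "search" is a simple substring test):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

public class StreamingSearch {
    // Scan a large file one line at a time; only the current line is held in memory.
    static long countMatches(String path, String needle) throws IOException {
        long hits = 0;
        try (BufferedReader reader = Files.newBufferedReader(Paths.get(path))) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (line.contains(needle)) hits++;
            }
        }
        return hits;
    }
}
```

Memory use stays flat regardless of file size, at the cost of re-reading the file for every search, which is exactly why an index pays off when searches are frequent.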
Comments
agent_kruger 31-Oct-14 2:43am    
Sir, it is already in a container which is indexed, and I will perform full-word search; I also have to process all the data together, so I cannot fetch fractions of the data again and again.
Philippe Mori 31-Oct-14 8:51am    
If you perform full-word search, you might be able to create an index of all words with 3 letters or more. In the index, you might truncate words to a maximum length of, say, 10 letters. The index will then give you the position of each word, and from that you can load and search the portion of data around it.

Whether the data is on disk or in memory, you can process it all if you need to. Nevertheless, if you create an index and typically search for multiple words, you might save a lot of processing time and memory.

If you have 1,000,000,000 strings, then your program would need a few GB of memory if, for example, you have an array that contains each word...

Show us what you are using and an example of what you are doing, and we might have a far better idea of how it could be optimized.
agent_kruger 31-Oct-14 9:05am    
Sir, actually I have already truncated strings to 8 letters, and my major problem for now is that I need to loop through every element of the string array, that is 1,000,000,000 of them, and on every loop search whether a string exists in the string array and, if it does, process it. So, an overview of the situation: 1 million string array elements, each with a max. 8-letter size, and a loop running 1 million times to search for a dynamic string in the string array. The 1 million searches are what is taking my performance down, so I need a container that performs a quick search over a heavy string array.
Philippe Mori 31-Oct-14 9:31am    
Is it 1 million (1,000,000) or 1000 million (1,000,000,000)? This makes a huge difference: for the former you don't have to worry about memory usage that much, while for the latter it might be a major issue.

A dictionary might be a good solution. The dictionary would contain an entry for each distinct word and then hold the information related to that word. If words are not unique, then a list might be used as the value associated with each key. It all depends on what you do with matching words.

I'm not sure if you have 2 arrays, say of size m and n, or a single one. In either case, if the size is big, you have an algorithm that is O(m * n) or O(n²), which is very bad for large data. Using a dictionary, hash table or similar container would reduce the order on at least one side, so the final order might be something like O(m * log(n)) once the index is built, or O(n + m) if it must be rebuilt each time.

It should make a huge difference in speed.
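To illustrate the complexity difference, here is a sketch in Java (the sample arrays are made up): instead of an O(m * n) nested loop, the data array is loaded into a hash set once (O(n)), and each of the m queries becomes an average O(1) lookup:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class JoinDemo {
    // Indexed approach: build the set once, then answer each query in O(1) on average,
    // for O(n + m) total instead of the O(m * n) of a nested scan.
    static List<String> matches(String[] data, String[] queries) {
        Set<String> index = new HashSet<>(Arrays.asList(data));
        List<String> found = new ArrayList<>();
        for (String q : queries) {
            if (index.contains(q)) found.add(q);
        }
        return found;
    }

    public static void main(String[] args) {
        String[] data = { "ant", "bee", "cat", "dog" };
        String[] queries = { "bee", "eel", "dog" };
        System.out.println(matches(data, queries)); // [bee, dog]
    }
}
```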
agent_kruger 31-Oct-14 16:07pm    
1000 million; sorry, I pasted one extra set of zeros.
When working with large numbers of strings that need to be indexed, by far the fastest and most efficient method I've used is a trie. I must have tried half a dozen structures when creating an error-tolerant autocomplete, but by far the "best" was that particular tree type. I later found out that Google also implements its autocomplete with one, so I guess I'm in good company.
Comments
agent_kruger 31-Oct-14 2:41am    
Sir, tree type? Can you explain in detail?
Robert Welliever 31-Oct-14 3:18am    
A Trie is a type of tree data structure that is great for string indexing. For example, you can fit every two-letter word in the English language using just 26 * 2 nodes and 2 levels, every three-letter word with 26 * 3 nodes and 3 levels, ..., and every n-letter word with 26 * n nodes and n levels. Anyway, I really think Wikipedia can explain it better than me: http://en.wikipedia.org/wiki/Trie.
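A minimal trie sketch, in Java for illustration (using a HashMap per node rather than a fixed 26-slot array, so it is not limited to lowercase English; the sample words are made up):

```java
import java.util.HashMap;
import java.util.Map;

public class Trie {
    static class Node {
        Map<Character, Node> children = new HashMap<>();
        boolean isWord;   // marks the end of a stored word
    }

    private final Node root = new Node();

    void insert(String word) {
        Node node = root;
        for (char c : word.toCharArray()) {
            node = node.children.computeIfAbsent(c, k -> new Node());
        }
        node.isWord = true;
    }

    boolean contains(String word) {        // exact full-word lookup
        Node node = find(word);
        return node != null && node.isWord;
    }

    boolean hasPrefix(String prefix) {     // the operation autocomplete builds on
        return find(prefix) != null;
    }

    private Node find(String s) {
        Node node = root;
        for (char c : s.toCharArray()) {
            node = node.children.get(c);
            if (node == null) return null;
        }
        return node;
    }

    public static void main(String[] args) {
        Trie t = new Trie();
        t.insert("car");
        t.insert("card");
        System.out.println(t.contains("car"));   // true
        System.out.println(t.contains("ca"));    // false
        System.out.println(t.hasPrefix("ca"));   // true
    }
}
```

Lookup cost depends only on the length of the query string, not on how many words are stored, which is why tries suit autocomplete over huge word sets.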
1. There is no such control.

2. You could use, for example, a grid control as the user interface, but in order to manage a large amount of data you must implement the pagination logic and the searching logic yourself.

3. I have an article on this subject; it is for an ASP.NET application, but you can get the idea, and especially reuse the stored procedure code used for implementing pagination in your application: Advanced ASPX GridView Pagination and Data Entities[^]
Comments
agent_kruger 29-Oct-14 9:01am    
Sir, does "pagination logic" mean dividing my data into multiple pages? If so, I can't use that logic, as I have to process all the data together.
Raul Iloc 29-Oct-14 9:50am    
1. As I explained in my article, pagination logic means loading and managing only a set of items (say 50) per page.
2. Also, my SP uses all the available data, so if a filter or search parameter is supplied, I search across all the data; the final results are then sent to the user using pagination. So the short answer is that you can reuse my SP for your job!
agent_kruger 30-Oct-14 3:21am    
Sir, I need to process all the data together.
Raul Iloc 30-Oct-14 3:29am    
1. As I said before, you could use my SP customized for your needs: the SP searches all the data in the database ("processing all data"), but the most important part is the pagination (the last part of the SP), where the SP returns only the requested set of results, page by page.
2. Just try running the source code from my article and you will see that the entire set of data is used.
BillWoodruff 29-Oct-14 9:02am    
If you think the question is not okay, why post a link to an article you wrote that appears to have no relevance to the not-okay question? Yes, the question is unclear in its present form, but it does seem to involve searching something efficiently/rapidly.

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)