Hello people.
I could not phrase the question properly the first time.
My question is: what is the best way to check a 16-digit number against a list of 100 million+ numbers, stored in a text file, for duplicates? The numbers are generated by another program which I cannot control.
One conventional way is to load the numbers into an array and compare each one with the target number in a loop, but that is very time-consuming and memory-hungry. Is there a better way to do this?
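For reference, a minimal sketch of that conventional scan, streaming the file line by line instead of building an array; the file name and the one-number-per-line format are just placeholders:

// A minimal sketch of the conventional scan described above.
// "numbers.txt" and the one-number-per-line format are assumptions.
using System;
using System.IO;

class NaiveScan
{
    static bool Contains(string path, string target)
    {
        // File.ReadLines is lazy, so only one line is held in memory at a time.
        foreach (string line in File.ReadLines(path))
        {
            if (line.Trim() == target)
                return true;
        }
        return false;
    }

    static void Main()
    {
        Console.WriteLine(Contains("numbers.txt", "1234567890123456"));
    }
}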
Thanks in advance.
Comments
Dr.Walt Fair, PE 2-Dec-10 0:22am    
How often do the numbers need to be checked? Are the numbers in the text file in sorted order?

The text file size just to store the numbers would be over 1.6 GB if saved in ASCII. Is that correct?
Pallab_GT 3-Dec-10 9:21am    
Hi. Thanks for your reply.
The numbers are not sorted in any order. They need to be checked 2-3 times a month.
Yes, you are right about the file size.
Manfred Rudolf Bihy 3-Dec-10 9:44am    
If they're being checked 2-3 times a month I'd go for the merge sort approach. It may not be the fastest method, but it will work with files and record counts even larger than those you mentioned.
fjdiewornncalwe 3-Dec-10 11:20am    
Please be careful with the answer you have selected as the correct one. The one you have selected below is completely wrong for what you are trying to do. WTH!!!

Hi Pallab,

Using a file-based merge sort would be the way to go for you.
Please read this blog entry:

http://splinter.com.au/blog/?p=142

This merge sort works by breaking one big file into smaller chunks. These are sorted in memory with a conventional sort and written to disk. Then, in a final operation, the chunks are merged into one big sorted file. This way you can keep the memory footprint quite small even for very big files: the largest piece of information you hold in memory at any time is one of the chunks the original was broken into.
(Pretty neat, isn't it? This algorithm is from way back in the 60s/70s, when main memory was a very costly thing.)
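Just to illustrate the idea, a rough sketch might look something like this; the file names, the chunk size and the one-number-per-line format are assumptions on my part, not something the original poster specified:

// Split the input into chunks that fit in memory, sort each chunk, write it to a
// temp file, then k-way merge the sorted chunks into one sorted output file.
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

class ExternalMergeSort
{
    const int ChunkSize = 1000000;      // how many numbers to hold in memory at once

    static List<string> SplitAndSortChunks(string input)
    {
        var chunkFiles = new List<string>();
        var buffer = new List<ulong>(ChunkSize);

        foreach (string line in File.ReadLines(input))
        {
            buffer.Add(ulong.Parse(line));          // a 16-digit number fits in a ulong
            if (buffer.Count == ChunkSize)
            {
                chunkFiles.Add(WriteSortedChunk(buffer));
                buffer.Clear();
            }
        }
        if (buffer.Count > 0)
            chunkFiles.Add(WriteSortedChunk(buffer));
        return chunkFiles;
    }

    static string WriteSortedChunk(List<ulong> buffer)
    {
        buffer.Sort();
        string path = Path.GetTempFileName();
        File.WriteAllLines(path, buffer.Select(n => n.ToString()));
        return path;
    }

    static void MergeChunks(List<string> chunkFiles, string output)
    {
        // K-way merge: repeatedly emit the smallest current value across all chunk readers.
        var readers = chunkFiles.Select(f => File.ReadLines(f).GetEnumerator()).ToList();
        var current = new ulong?[readers.Count];
        for (int i = 0; i < readers.Count; i++)
            current[i] = readers[i].MoveNext() ? ulong.Parse(readers[i].Current) : (ulong?)null;

        using (var writer = new StreamWriter(output))
        {
            while (true)
            {
                int min = -1;
                for (int i = 0; i < current.Length; i++)
                    if (current[i].HasValue && (min < 0 || current[i].Value < current[min].Value))
                        min = i;
                if (min < 0) break;                 // every chunk is exhausted

                writer.WriteLine(current[min].Value);
                current[min] = readers[min].MoveNext()
                    ? ulong.Parse(readers[min].Current) : (ulong?)null;
            }
        }
    }

    static void Main()
    {
        MergeChunks(SplitAndSortChunks("numbers.txt"), "numbers.sorted.txt");
    }
}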

Modification:

Forgot this in my original answer. After the file-based merge sort is done you'll have a sorted file, possibly with some duplicates. So you open this file and scan through it line by line, always remembering the last unique entry. If the next line read contains the same number as the last one, it is not written to the final file. Lather, rinse, repeat...
End of Modification
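A minimal sketch of that de-duplication pass, assuming the merged file has one number per line (the file names are placeholders):

// Walk the sorted file once, remember the last value written, and skip repeats.
using System.IO;

class RemoveDuplicatesFromSortedFile
{
    static void Main()
    {
        string last = null;
        using (var writer = new StreamWriter("numbers.unique.txt"))
        {
            foreach (string line in File.ReadLines("numbers.sorted.txt"))
            {
                if (line == last) continue;   // same as the previous line, i.e. a duplicate
                writer.WriteLine(line);
                last = line;
            }
        }
    }
}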


Cheers,


Manfred
 
Comments
fjdiewornncalwe 3-Dec-10 10:18am    
Nice... I like this...
You could load them in groups of 10,000 numbers, compare, and if the target is not found, load the next 10,000. The only things that should cause the loop to exit are finding the number or reaching the end of the file.

You didn't specify the exact format of the file, so I'm assuming that each number is on its own line.
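A rough sketch of that batched scan, under that one-number-per-line assumption (the file name and batch size are placeholders):

using System;
using System.Collections.Generic;
using System.IO;

class BatchedScan
{
    static bool Contains(string path, string target, int batchSize)
    {
        var batch = new List<string>(batchSize);
        foreach (string line in File.ReadLines(path))
        {
            batch.Add(line.Trim());
            if (batch.Count == batchSize)
            {
                if (batch.Contains(target)) return true;   // found it: exit early
                batch.Clear();                             // not in this batch: load the next one
            }
        }
        return batch.Contains(target);                     // check the last, partial batch
    }

    static void Main()
    {
        Console.WriteLine(Contains("numbers.txt", "1234567890123456", 10000));
    }
}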

Another thing you could do is find the largest number in the big file and keep it in a separate file. Then you could first check whether the new number is larger than the one in your single-number file; if it is, you're gold, and you can add it to the big file. If not, you still have to run through the big file to see whether it's already in there.

BTW, sorting the contents of the file probably won't help, because you can't assume that the other application will be adding numbers in any kind of order, so you'll end up re-sorting the file before every search. Not very efficient.


 
Comments
fjdiewornncalwe 3-Dec-10 10:19am    
Logically, I agree with you, but performance would be dreadful not because of your algorithm, but because of the volume of data being dealt with.
For starters, I'd want to get the guy who thought it was a good idea to store 100+ million numbers in a text file into a corner and kick him a few times.

After I got out of jail, I would probably transfer the numbers into a simple database so that I could take advantage of the database's search algorithms.
 
Comments
Pallab_GT 3-Dec-10 9:14am    
Thanks for your reply Marcus. I do agree with you on kicking the concerned person.
Using database is not an option.
Thanks again.
fjdiewornncalwe 3-Dec-10 10:17am    
Can you do anything with the flat file, or do you have to use it as is? I.e., could you pre-process the file into some sort of sub-file sorted-range thingy which would make searches less intensive, or is the flat file constantly changing? Look at Manfred's post above. That's what I'm talking about here.
Well, my logic to accomplish this would be:

1. Either use LINQ or T-SQL to find the distinct ones.
2. Loop through the millions of records you have and check:
  1. If a record is in the distinct set, leave it as it is.
  2. If it isn't, keep the very first occurrence and delete the remaining ones.

This was my solution at the time I needed to find and delete duplicates from 50,000 records.
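For what it's worth, a minimal LINQ sketch of step 1 (finding the distinct ones) could look like this. Note that Distinct has to keep every distinct value in memory, so it was fine for my 50,000 records but may not scale to the 100-million-line file in the question; the file names are placeholders:

using System.IO;
using System.Linq;

class FindDistinct
{
    static void Main()
    {
        var distinct = File.ReadLines("numbers.txt")
                           .Select(line => line.Trim())
                           .Distinct();          // keeps the first occurrence of each number

        File.WriteAllLines("numbers.distinct.txt", distinct);
    }
}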

Try it.

Please vote and accept the answer if it helped.
 
Comments
Pallab_GT 3-Dec-10 9:15am    
Thanks for your reply. But using a database is not an option.
Hiren solanki 3-Dec-10 9:18am    
Then probably load all the records into a DataTable and apply LINQ; if that's too much for one DataTable, build partial DataTables and proceed the same way.
Manfred Rudolf Bihy 3-Dec-10 9:42am    
I seriously doubt that doing this with LINQ but without a DBMS would work with 100 million+ 16 digit numbers.
Use a hashing algorithm for this. Calculate a hash for each file you have; MD5, for example, gives you a 16-byte value. Whenever two hashes match, the files are almost certainly duplicates. Google for hashing algorithms and you will find lots of options.
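A small sketch of that idea using MD5; the file paths are placeholders, and note this spots duplicate files rather than duplicate numbers inside one file:

using System;
using System.IO;
using System.Security.Cryptography;

class FileHashCompare
{
    static string HashOf(string path)
    {
        using (var md5 = MD5.Create())
        using (var stream = File.OpenRead(path))
        {
            // 16-byte MD5 digest, formatted as a hex string for easy comparison
            return BitConverter.ToString(md5.ComputeHash(stream));
        }
    }

    static void Main()
    {
        bool same = HashOf("first.txt") == HashOf("second.txt");
        Console.WriteLine(same ? "Files are (almost certainly) duplicates" : "Files differ");
    }
}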
 
Comments
fjdiewornncalwe 3-Dec-10 8:46am    
By the sound of the question, all of the numbers are in one massive, huge, crazy-a** text file. If each one were in a separate file, then your method would be very effective, but I'm not sure you'd save many cycles hashing tiny files compared to just opening and checking their contents.
I think you could use some logic to divide the numbers by a prefix, such as date+time; then the set you have to check for duplicates becomes much smaller.
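One way to read this suggestion: split the big file into buckets keyed by the first few digits, so a duplicate check only has to scan one small bucket. The directory, file names and 3-digit prefix below are my own assumptions:

using System.Collections.Generic;
using System.IO;

class SplitByPrefix
{
    static void Main()
    {
        Directory.CreateDirectory("buckets");
        var writers = new Dictionary<string, StreamWriter>();

        foreach (string line in File.ReadLines("numbers.txt"))
        {
            string number = line.Trim();
            string prefix = number.Substring(0, 3);           // bucket by the first 3 digits
            if (!writers.TryGetValue(prefix, out StreamWriter w))
                writers[prefix] = w = new StreamWriter(Path.Combine("buckets", prefix + ".txt"));
            w.WriteLine(number);
        }

        foreach (var w in writers.Values)
            w.Dispose();                                      // flush and close every bucket file
    }
}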
 