|
You can speed up the process considerably: if you have mostly big files, you can use a two-step hash comparison. In the first step you take only, say, 1-5 KB from the beginning of each file to build your hash (I'd recommend SHA2). You only perform a hash calculation over the complete file if the prefix hashes of two files match.
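Something like this minimal C# sketch of the two-step idea (the 4 KB prefix size and the helper names are illustrative, not a fixed recipe):

using System;
using System.IO;
using System.Security.Cryptography;

static class TwoStepHash
{
    // Step 1: hash only the first few KB - a cheap filter for large files.
    static string PrefixHash(string path, int prefixBytes = 4096)
    {
        using var sha = SHA256.Create();    // SHA-256 is part of the SHA-2 family
        using var fs = File.OpenRead(path);
        var buffer = new byte[prefixBytes];
        int read = fs.Read(buffer, 0, buffer.Length);
        return Convert.ToHexString(sha.ComputeHash(buffer, 0, read));
    }

    // Step 2: hash the complete file - only done when the prefix hashes match.
    static string FullHash(string path)
    {
        using var sha = SHA256.Create();
        using var fs = File.OpenRead(path);
        return Convert.ToHexString(sha.ComputeHash(fs));
    }

    static bool SameContent(string a, string b) =>
        PrefixHash(a) == PrefixHash(b) && FullHash(a) == FullHash(b);
}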
Best,
Manfred
"I had the right to remain silent, but I didn't have the ability!"
Ron White, Comedian
|
Thanks for your advice.
However, I wonder why SHA2 would be the better choice when it is slower than MD5 or MD4?
|
It was developed to be more robust, i.e. the probability of hash collisions is reduced.
Just look it up; there are plenty of good explanations on the internet of what the actual improvements over its predecessors are.
Cheers!
"I had the right to remain silent, but I didn't have the ability!"
Ron White, Comedian
|
I found a similar solution when I had to recover 4.5 TB of files, mostly between 0.5 and 15 GB, where around 1-2% of them had been terminated early during a copy process but were left with the correct file size, padded with zeroes.
Adding to your suggestion: I scanned all big files by sampling at large fixed intervals, plus some random jumps, all the way up to their ends.
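In case it is useful to anyone, a rough C# sketch of that probing idea (the stride, probe count, and sample size are made-up numbers, not the ones from my actual tool):

using System;
using System.IO;

static class SparseProbe
{
    // Returns true if every sampled block past 'fromOffset' is all zeroes;
    // probes at a fixed stride plus a few random offsets
    // (assumes fromOffset < file length).
    static bool LooksZeroPadded(string path, long fromOffset,
                                long stride = 64L * 1024 * 1024,
                                int randomProbes = 8, int sampleSize = 4096)
    {
        using var fs = File.OpenRead(path);
        var rng = new Random();
        var buf = new byte[sampleSize];

        bool BlockIsZero(long offset)
        {
            fs.Seek(offset, SeekOrigin.Begin);
            int read = fs.Read(buf, 0, buf.Length);
            for (int i = 0; i < read; i++)
                if (buf[i] != 0) return false;
            return true;
        }

        for (long off = fromOffset; off < fs.Length; off += stride)
            if (!BlockIsZero(off)) return false;

        for (int i = 0; i < randomProbes; i++)
            if (!BlockIsZero(fromOffset + (long)(rng.NextDouble() * (fs.Length - fromOffset))))
                return false;

        return true;
    }
}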
|
It is almost certain that the problem is not your hash algorithm but I/O. All (or most) hash algorithms are fast enough that they are not to blame...
If you have a large number of files, you have to rethink your approach:
1. Use the file system's FileInfo - name, size, creation, last modified and so on.
2. If you can't, you may consider hashing only the first block (4K) of every file and going further only for those found to be the same - see the sketch after this list.
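A rough C# sketch of that idea, assuming .NET's DirectoryInfo/FileInfo (grouping here uses size only; the key can be extended with the other metadata):

using System;
using System.IO;
using System.Linq;
using System.Security.Cryptography;

static class DuplicateFilter
{
    // Hash only the first 4K block of a file.
    static string FirstBlockHash(string path, int blockSize = 4096)
    {
        using var sha = SHA256.Create();
        using var fs = File.OpenRead(path);
        var buf = new byte[blockSize];
        int read = fs.Read(buf, 0, buf.Length);
        return Convert.ToHexString(sha.ComputeHash(buf, 0, read));
    }

    static void Main(string[] args)
    {
        // Step 1: group by cheap FileInfo metadata (here just the size).
        var bySize = new DirectoryInfo(args[0])
            .EnumerateFiles("*", SearchOption.AllDirectories)
            .GroupBy(f => f.Length)
            .Where(g => g.Count() > 1);

        // Step 2: within each size group, hash only the first block.
        foreach (var sizeGroup in bySize)
            foreach (var dup in sizeGroup.GroupBy(f => FirstBlockHash(f.FullName))
                                         .Where(g => g.Count() > 1))
                Console.WriteLine(string.Join(" | ", dup.Select(f => f.FullName)));
    }
}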
I'm not questioning your powers of observation; I'm merely remarking upon the paradox of asking a masked man who he is (V).
|
Thank you, V. ^_^
Your advice is good. In the beginning I did not have a clear understanding of this question.
Now I agree that I/O is the key point, and I hope I can find a way to resolve my problem.
Thanks!
|
Yes, I/O is a key issue...
If it is in your power, you may move the storage to something more capable, like SCSI or flash; that should speed up I/O...
I'm not questioning your powers of observation; I'm merely remarking upon the paradox of asking a masked man who he is (V).
|
I wrote a little app to do this for my own use, and I used multiple tests.
For every file to compare, find:
1. Size - if they aren't the same size, they aren't the same file.
2. A copy of the first 100 bytes - if the first 100 bytes don't match, they aren't the same file. This is easy and fast to compare, requires no computation, and doesn't take much to store.
That stage is very fast. It generally takes far longer just to get the list of files than it does to run all of those tests.
After that, calculate the hash (I used SHA1) of the first few KB of each remaining file, and compare the hashes.
After those two stages you will have eliminated most of the non-duplicates. Then you can do a full hash of the remaining files to find any definite duplicates - see the sketch below.
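Roughly this, as a C# sketch (the 100-byte and 4 KB thresholds follow the stages above; the helper names are mine):

using System;
using System.IO;
using System.Linq;
using System.Security.Cryptography;

static class StagedCompare
{
    static byte[] ReadHead(string path, int count)
    {
        using var fs = File.OpenRead(path);
        var buf = new byte[(int)Math.Min(count, fs.Length)];
        fs.Read(buf, 0, buf.Length);
        return buf;
    }

    static string HashPrefix(string path, int bytes)
    {
        using var sha = SHA1.Create();
        return Convert.ToHexString(sha.ComputeHash(ReadHead(path, bytes)));
    }

    static string HashFull(string path)
    {
        using var sha = SHA1.Create();
        using var fs = File.OpenRead(path);
        return Convert.ToHexString(sha.ComputeHash(fs));
    }

    static bool AreDuplicates(string a, string b)
    {
        if (new FileInfo(a).Length != new FileInfo(b).Length) return false;  // stage 1: size
        if (!ReadHead(a, 100).SequenceEqual(ReadHead(b, 100))) return false; // stage 2: first 100 bytes
        if (HashPrefix(a, 4096) != HashPrefix(b, 4096)) return false;        // stage 3: hash of first few KB
        return HashFull(a) == HashFull(b);                                   // stage 4: full hash
    }
}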
|
You might give that code a polish and publish it here as a tip or article...
I'm not questioning your powers of observation; I'm merely remarking upon the paradox of asking a masked man who he is (V).
|
I agree, one can use multiple algorithms based on suitability:
1. Check the size; if the sizes do not match, the files are not the same.
2. Compare 2% of the characters from the beginning and then from the end; if those do not match, the files are not the same.
3. Finally, one option is to simply compare the remaining parts of both files, or to use a CRC, checksum, or MD/SHA hash. If the MD or SHA hashes are not already calculated and stored, they will be a bit costly. (A sketch of steps 1 and 2 follows below.)
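A small C# sketch of steps 1 and 2 (the 1 MB cap on the 2% chunk is my own made-up safeguard for huge files):

using System;
using System.IO;
using System.Linq;

static class HeadTailCompare
{
    // Reads 'count' bytes from the start or from the end of a file.
    static byte[] ReadChunk(string path, int count, bool fromEnd)
    {
        using var fs = File.OpenRead(path);
        count = (int)Math.Min(count, fs.Length);
        if (fromEnd) fs.Seek(-count, SeekOrigin.End);
        var buf = new byte[count];
        fs.Read(buf, 0, count);
        return buf;
    }

    // Step 1: size check; step 2: compare ~2% from each end.
    static bool MightBeSame(string a, string b)
    {
        long size = new FileInfo(a).Length;
        if (size != new FileInfo(b).Length) return false;
        int chunk = (int)Math.Max(1, Math.Min(size / 50, 1 << 20)); // ~2%, capped at 1 MB
        return ReadChunk(a, chunk, false).SequenceEqual(ReadChunk(b, chunk, false))
            && ReadChunk(a, chunk, true).SequenceEqual(ReadChunk(b, chunk, true));
    }
}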
Manoj
Never Gives up
|
Store each file's information as its byte size and a simple CRC code.
Loop over the files and initially check the byte sizes; if they are equal, perform the CRC or hash check.
Or do the hash check only when the byte sizes are equal, storing the hash for future use - see the sketch below.
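A minimal C# sketch of that idea (the CRC-32 is hand-rolled to keep it self-contained, and the per-path cache is illustrative):

using System;
using System.Collections.Generic;
using System.IO;

static class CrcCache
{
    // Cache the CRC per path so each file is hashed at most once.
    static readonly Dictionary<string, uint> crcByPath = new();

    static uint Crc32(string path)
    {
        if (crcByPath.TryGetValue(path, out uint cached)) return cached;
        uint crc = 0xFFFFFFFF;
        foreach (byte b in File.ReadAllBytes(path))
        {
            crc ^= b;
            for (int i = 0; i < 8; i++)
                crc = (crc & 1) != 0 ? (crc >> 1) ^ 0xEDB88320 : crc >> 1;
        }
        return crcByPath[path] = ~crc;
    }

    // Size first; CRC only when the sizes are equal.
    static bool ProbablySame(string a, string b) =>
        new FileInfo(a).Length == new FileInfo(b).Length && Crc32(a) == Crc32(b);
}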
|
Hi,
I have a 2D color map (without a color-bar depth key) that depicts depth in various colors. Ideally I would like to get an estimate of all relative depths as a function of the RGB values of a color.
What I do know is the colors of the highest, middle, and lowest points. So I have something like:
Fn(R,G,B) = Depth
Fn(255,255,0) = 1000
Fn(125,125,0) = 500
Fn(22, 125, 20) = 0
I realise that there won't be an exact solution and some sort of linear assumption would have to be made, but can anyone suggest a good way to approach this? (Assuming it is doable!)
I guess the images I'm talking about would look similar to this one, but with various non-rainbow color changes and no depth key:
Similar map link
Many thanks for any assistance.
|
A simple, naive approach is to collect as many (R, G, B, Depth) tuples as you can get, then do multiple linear regression with Depth as the dependent variable.
If the linear assumption doesn't give satisfactory results, you can try multiple polynomial regression, or split the data set into two partitions and use regression separately on each. A sketch of the linear fit is below.
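A self-contained C# sketch of the linear fit, solving the normal equations with Gaussian elimination (the sample tuples in Main are made up purely so the system is well-posed; real pixel/depth pairs would go there):

using System;

static class DepthFit
{
    // Fits depth = a*R + b*G + c*B + d by least squares (normal equations).
    static double[] FitLinear(double[][] rgb, double[] depth)
    {
        const int p = 4; // three colour channels plus an intercept
        var ata = new double[p, p];
        var atb = new double[p];
        for (int k = 0; k < rgb.Length; k++)
        {
            var row = new[] { rgb[k][0], rgb[k][1], rgb[k][2], 1.0 };
            for (int i = 0; i < p; i++)
            {
                atb[i] += row[i] * depth[k];
                for (int j = 0; j < p; j++) ata[i, j] += row[i] * row[j];
            }
        }
        return Solve(ata, atb);
    }

    // Gaussian elimination with partial pivoting.
    static double[] Solve(double[,] a, double[] b)
    {
        int n = b.Length;
        for (int col = 0; col < n; col++)
        {
            int pivot = col;
            for (int r = col + 1; r < n; r++)
                if (Math.Abs(a[r, col]) > Math.Abs(a[pivot, col])) pivot = r;
            for (int c = 0; c < n; c++)
                (a[col, c], a[pivot, c]) = (a[pivot, c], a[col, c]);
            (b[col], b[pivot]) = (b[pivot], b[col]);
            for (int r = col + 1; r < n; r++)
            {
                double f = a[r, col] / a[col, col];
                for (int c = col; c < n; c++) a[r, c] -= f * a[col, c];
                b[r] -= f * b[col];
            }
        }
        var x = new double[n];
        for (int r = n - 1; r >= 0; r--)
        {
            x[r] = b[r];
            for (int c = r + 1; c < n; c++) x[r] -= a[r, c] * x[c];
            x[r] /= a[r, r];
        }
        return x;
    }

    static void Main()
    {
        // Made-up (R, G, B) -> depth samples; replace with real data.
        var rgb = new[]
        {
            new double[] { 10, 20, 30 },   new double[] { 50, 40, 10 },
            new double[] { 200, 100, 50 }, new double[] { 30, 60, 90 },
            new double[] { 120, 80, 200 }, new double[] { 5, 5, 5 }
        };
        var depth = new double[] { 60, 220, 660, 160, 290, 30 };
        var c = FitLinear(rgb, depth);
        Console.WriteLine($"depth ~ {c[0]:F3}*R + {c[1]:F3}*G + {c[2]:F3}*B + {c[3]:F3}");
    }
}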
|
Many thanks for the help.
I'm pretty sure that linear regression won't work, because I have an image with large green values at both high and low depths.
If I understand it correctly, polynomial regression on R, G, or B alone may sometimes work, but I think the depth can often be an unknown function of all three colours.
I wonder if it makes sense to combine the best polynomial-regression fit for each one somehow?
|
"...I have an image that has large Green values at both high and low depth."
I suspected this, which is why I suggested dividing the space into separate partitions and solving them independently. Statisticians call this "elaboration".
|
Is it possible that converting from RGB to an HLS color representation might give values more amenable to regression? Just a thought.
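A quick C# sketch, assuming System.Drawing's Color struct is available (its GetHue/GetSaturation/GetBrightness accessors return the HSL components):

using System;
using System.Drawing;

static class HslFeatures
{
    // Converts an RGB pixel to (H, L, S) features for regression.
    static (float H, float L, float S) ToHls(int r, int g, int b)
    {
        var c = Color.FromArgb(r, g, b);
        return (c.GetHue(), c.GetBrightness(), c.GetSaturation());
    }

    static void Main()
    {
        // The three reference colours from the question above.
        Console.WriteLine(ToHls(255, 255, 0)); // depth 1000
        Console.WriteLine(ToHls(125, 125, 0)); // depth 500
        Console.WriteLine(ToHls(22, 125, 20)); // depth 0
    }
}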
|
1. How do I make a sandbox that isolates and blocks malicious JavaScript in the URL address bar, with permissions such as: never change the page title, never change the images, Flash, or any other multimedia in the page from outside, and never execute code from outside?
2. After detecting and blocking it through the sandbox, how do I extract the piece of code that causes the problem from a big malicious script? E.g. if eval() is responsible for the problem, I want the program to catch it for the third step.
3. If, for example, eval() is the bad function causing the trouble, then how do I REPLACE it - map it - to a safe function? Let's say a mapping only for eval(), execScript, setTimeout, setInterval, innerHTML, outerHTML, document.cookie, location.replace, location.assign, script.src, href, src, and iframes.
I read many articles but I didn't find a safe mapping for the unsafe functions.
What are the safe mappings for those unsafe functions or properties?
I read this; it was helpful but not complete:
https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/eval
Note:
Though I hope to implement this algorithm in C# (I posted this question there), any help solving this question in any language is welcome.
|
I am working on a robotic arm, and I intend to use an algorithm to generate the shortest route to a given point in 3D space. It's a kind of optimization problem - shortest route and so on. I am to use C# and a genetic algorithm. Please help out.
|
You need to define classes for Route (which will contain an array of genes) and Population (which will contain an array of Routes). Then write an evaluation function for class Route that gives the "value" of the Route's gene list; the length of the route seems appropriate.
Initialize the genes of each Route in the Population to small random values. Then:
Evaluate each Route. Delete the Routes with the highest scores (e.g. keep the 10% with the shortest routes). Refill the population with combinations of genes from pairs of Routes in the shortest 10%, with random "mutations" (altering one or more genes).
Repeat until a short enough route evolves. A sketch of this loop follows below.
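A bare-bones C# sketch of that loop (the waypoint encoding, population size, elite fraction, and mutation step are illustrative choices, not requirements):

using System;
using System.Linq;

// Genes encode (x, y, z) waypoints; the score is the total path length
// from a fixed start point, through the waypoints, to the target.
class Route
{
    public double[] Genes;
    public double Length;
}

static class GaSketch
{
    static readonly Random rng = new Random();
    const int GeneCount = 9;   // e.g. 3 waypoints * 3 coordinates
    const int PopSize = 100;
    static readonly double[] start = { 0, 0, 0 }, target = { 1, 1, 1 };

    static double Evaluate(double[] genes)
    {
        double len = 0;
        double[] prev = start;
        for (int w = 0; w <= GeneCount / 3; w++)
        {
            double[] next = w < GeneCount / 3
                ? new[] { genes[w * 3], genes[w * 3 + 1], genes[w * 3 + 2] }
                : target;
            len += Math.Sqrt(Math.Pow(next[0] - prev[0], 2)
                           + Math.Pow(next[1] - prev[1], 2)
                           + Math.Pow(next[2] - prev[2], 2));
            prev = next;
        }
        return len;
    }

    static void Main()
    {
        // Initialize every Route's genes to small random values.
        var pop = Enumerable.Range(0, PopSize).Select(_ => new Route
        {
            Genes = Enumerable.Range(0, GeneCount)
                              .Select(_ => rng.NextDouble()).ToArray()
        }).ToList();

        for (int gen = 0; gen < 500; gen++)
        {
            foreach (var r in pop) r.Length = Evaluate(r.Genes);
            // Keep the shortest 10%, discard the rest.
            var next = pop.OrderBy(r => r.Length).Take(PopSize / 10).ToList();
            // Refill with gene combinations from elite pairs, plus a mutation.
            while (next.Count < PopSize)
            {
                var a = next[rng.Next(PopSize / 10)].Genes;
                var b = next[rng.Next(PopSize / 10)].Genes;
                var child = a.Zip(b, (x, y) => rng.Next(2) == 0 ? x : y).ToArray();
                child[rng.Next(GeneCount)] += (rng.NextDouble() - 0.5) * 0.1;
                next.Add(new Route { Genes = child });
            }
            pop = next;
        }
        Console.WriteLine($"Best length: {pop.Min(r => Evaluate(r.Genes)):F4}");
    }
}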
|
Hi,
I need to write pseudocode for an InvertRelation function for 2L-level binary decision diagrams.
InvertRelation takes as input a 2L-level quasi-reduced BDD rooted at r encoding a relation R : B^L → 2^{B^L} and returns the 2L-level quasi-reduced BDD rooted at s encoding the relation R^{−1} : B^L → 2^{B^L}, that is, j ∈ R(i) iff i ∈ R^{−1}(j).
The input BDD r uses the variable order x′1, x1, ..., x′L, xL, and the result BDD s uses the variable order x1, x′1, ..., xL, x′L. Thus r is at level L and its children are at level L′, while s is at level L′ and its children are at level L.
Any help is appreciated.
compengr
|
Funny, I was just looking for that.
I couldn't find anything, though.
Yes, this post is useless...
|
If you want some help, try explaining what you need without cryptic terms you haven't defined.
|