A Novel and Efficient Approach For Near Duplicate Page Detection in Web Crawling

Question

Can anyone help me with this?
I want to do this project for my academic.
Can someone give me an idea of how to do this?
Abstract:
The drastic development of the World Wide Web in recent times has made the concept of Web Crawling receive remarkable significance. The voluminous amounts of web documents swarming the web have posed huge challenges to web search engines, making their results less relevant to users. The presence of duplicate and near duplicate web documents in abundance has created additional overheads for search engines, critically affecting their performance and quality. The detection of duplicate and near duplicate web pages has long been recognized in the web crawling research community. It is an important requirement for search engines to provide users with relevant results for their queries on the first page, without duplicate and redundant results. In this paper, we have presented a novel and efficient approach for the detection of near duplicate web pages in web crawling. Detection of near duplicate web pages is carried out ahead of storing the crawled web pages into repositories. At first, the keywords are extracted from the crawled pages and the similarity score between two pages is calculated based on the extracted keywords. The documents having similarity scores greater than a threshold value are considered near duplicates. The detection has resulted in reduced memory for repositories and improved search engine quality.
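The last steps of the abstract (extract keywords, score two pages, compare against a threshold) could be sketched roughly as below. This is only an illustration, not the paper's method: the abstract does not say how keywords are chosen or how the score is computed, so the stopword list, the use of Jaccard similarity, the `top_n` cutoff, the 0.8 threshold and all function names are assumptions.

```python
import re
from collections import Counter

# Assumption: a tiny illustrative stopword list; a real crawler would use a fuller one.
STOPWORDS = {"a", "an", "and", "are", "as", "at", "for", "in",
             "is", "it", "of", "on", "the", "to", "with"}


def extract_keywords(text, top_n=20):
    """Return the set of the top_n most frequent non-stopword terms in text."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    counts = Counter(w for w in words if w not in STOPWORDS)
    return {word for word, _ in counts.most_common(top_n)}


def similarity_score(keywords_a, keywords_b):
    """Jaccard similarity of two keyword sets (an assumed measure), in [0, 1]."""
    if not keywords_a and not keywords_b:
        return 1.0  # two pages with no keywords are treated as identical
    return len(keywords_a & keywords_b) / len(keywords_a | keywords_b)


def is_near_duplicate(text_a, text_b, threshold=0.8):
    """True when the similarity score is greater than the (assumed) threshold."""
    score = similarity_score(extract_keywords(text_a), extract_keywords(text_b))
    return score > threshold
```

A crawler would call `is_near_duplicate` on a newly fetched page against already stored pages before writing it to the repository. Comparing against every stored page gets slow quickly, so a real project would need some form of indexing or fingerprinting on top of this.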

Posted 7-Mar-11 6:48am

Kishore Jangid

Updated 7-Mar-11 9:12am

Toli Cuturicu

v2

Comments

Sergey Alexandrovich Kryukov 7-Mar-11 13:13pm

Who is the author of this text? What did you do so far?
--SA

Sandeep Mewara 7-Mar-11 13:42pm

What kind of help are you expecting here?

Posting the problem statement without clarifying or showing your code simply means you want code, is that so? If not, elaborate a little on what you are looking for - it would help others to answer you.

Kishore Jangid 7-Mar-11 15:38pm

I am asking is this project worth doing

Smithers-Jones 7-Mar-11 16:08pm

First of all:
"I want to do this project for my academic." != "I am asking is this project worth doing."

What do you expect people to answer? Short answer: no. Long answer: yes. How would anybody else know whether it's worth doing? You have to decide for yourself, depending on your skills, interests...

1 solution

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

OriginalGriff · Answer 1 · 2011-03-07T08:22:00

Solution 1

If you are waiting for permission, feel free to get started. I don't mind, even though I think it is a bit dull, and about as useful in the real world as a chocolate fire-guard.

If you are waiting for volunteers to write the code for you, then that would be cheating. And you wouldn't do that, would you?

Posted 7-Mar-11 8:22am

OriginalGriff

Comments

Kishore Jangid 7-Mar-11 15:38pm

I am asking: is this title worth doing?

Gonzoox 7-Mar-11 16:35pm

Of course it is worth doing, if you have the time. You will have to create a very powerful algorithm capable of detecting the differences between pages, and then match all the information you have (which you'll need to keep in a huge database) against the user's query, based on the relevance of the search and the information presented in the page.
Google, Bing, Yahoo and others have something like this, and their algorithms are far more advanced. For a school project, a web crawler with a simple match can get you the grades you need, but it will still require a lot of time. For something more advanced you will need time and resources, especially if you want to compete against those monsters called Google or Bing.

Kishore Jangid 7-Mar-11 18:34pm

How about using the code at http://searcharoo.net, below or version 3?
They have a good algorithm to use, but I am confused by the various versions and files they have for a single technique.
Initially they used SearcharooCrawler, then SearcharooSpider_alpha, and then SearcharooSpider.aspx.
But I didn't understand whether they are really eliminating duplicate pages or not, and it didn't work for certain websites.
They have used the browser's cache, but I am going to use SQL Server 2005.
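For the database side, one simple way to refuse duplicate pages at storage time is a unique constraint on a content fingerprint. The sketch below is only an assumption of how that might look: it uses Python's built-in `sqlite3` as a stand-in for SQL Server 2005, the table and function names are made up for illustration, and it has nothing to do with how Searcharoo works. It catches exact duplicates only (after lowercasing and whitespace normalization), not near duplicates.

```python
import hashlib
import sqlite3


def open_repository(path=":memory:"):
    """Open the page store; sqlite3 here is a stand-in for SQL Server 2005."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pages ("
        " url TEXT PRIMARY KEY,"
        " content_hash TEXT NOT NULL UNIQUE)"  # UNIQUE rejects repeated content
    )
    return conn


def content_fingerprint(text):
    """SHA-1 of the lowercased, whitespace-normalized page text."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


def store_if_new(conn, url, text):
    """Insert the page and return True, or return False if it is a duplicate."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO pages (url, content_hash) VALUES (?, ?)",
                (url, content_fingerprint(text)),
            )
        return True
    except sqlite3.IntegrityError:
        # Either the URL or the content hash is already stored.
        return False
```

The same idea carries over to SQL Server with a unique index on the hash column. Near duplicates would still need a similarity check before the insert.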