C#

 
Question: How to stop VS to load user controls into toolbox?
Morven Huang, 29-Apr-09 16:56
Answer: Re: How to stop VS to load user controls into toolbox?
Luc Pattyn, 29-Apr-09 17:01
General: Re: How to stop VS to load user controls into toolbox?
Morven Huang, 29-Apr-09 17:08
General: Re: How to stop VS to load user controls into toolbox?
Luc Pattyn, 29-Apr-09 17:10
Question: Removing duplicates from large text files (Performance needed)
Ehsan Baghaki, 29-Apr-09 14:12
Answer: Re: Removing duplicates from large text files (Performance needed)
Luc Pattyn, 29-Apr-09 15:04
General: Re: Removing duplicates from large text files (Performance needed) [modified]
Ehsan Baghaki, 29-Apr-09 22:28
General: Re: Removing duplicates from large text files (Performance needed)
Luc Pattyn, 30-Apr-09 1:14
OK,

1.
Generating a huge file and then searching it for duplicates is a waste of time. If the code that generates the list can be modified, modifying it will almost always be the better approach.
It will take less code, less effort to develop, and fewer CPU cycles to execute.

2.
One way to reduce the postprocessing effort (though not the amount of code) is to generate not one but many files, i.e. to apply a coarse sort while writing. Say you calculate an 8-bit hash value for each URL and write the URL to one of 256 files accordingly. Identical URLs always get the same hash, so duplicates can only occur within a single file. That is easy, and it reduces your postprocessing to 256 much smaller jobs (the average file size is now 2MB, so you can simply do a ReadAllLines on each).
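
A rough sketch of that idea, not a finished tool. The names UrlBuckets, Hash8, Split and MergeUnique are made up for illustration, and the FNV-1a hash is just one stable choice (string.GetHashCode is not guaranteed to give the same value on every runtime, so I would not rely on it for file names). It assumes one URL per line and that the output file lives outside the bucket folder.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

static class UrlBuckets
{
    // Hypothetical helper: FNV-1a over the characters, folded down to 8 bits.
    public static byte Hash8(string s)
    {
        unchecked
        {
            uint h = 2166136261;
            foreach (char c in s)
            {
                h ^= c;
                h *= 16777619;
            }
            return (byte)(h ^ (h >> 8) ^ (h >> 16) ^ (h >> 24));
        }
    }

    // Pass 1: stream the big file once and append each line to one of 256 bucket files.
    public static void Split(string inputPath, string bucketDir)
    {
        Directory.CreateDirectory(bucketDir);
        var writers = new StreamWriter[256];
        try
        {
            for (int i = 0; i < 256; i++)
                writers[i] = new StreamWriter(Path.Combine(bucketDir, i.ToString("x2") + ".txt"));

            foreach (string line in File.ReadLines(inputPath))
                writers[Hash8(line)].WriteLine(line);
        }
        finally
        {
            foreach (StreamWriter w in writers)
                if (w != null) w.Dispose();
        }
    }

    // Pass 2: each bucket is small enough for ReadAllLines; a HashSet drops the repeats.
    // Returns the number of unique lines written. Output order follows the buckets,
    // not the original file.
    public static int MergeUnique(string bucketDir, string outputPath)
    {
        int written = 0;
        using (var output = new StreamWriter(outputPath))
        {
            foreach (string bucket in Directory.GetFiles(bucketDir, "*.txt"))
            {
                var seen = new HashSet<string>(StringComparer.Ordinal);
                foreach (string url in File.ReadAllLines(bucket))
                {
                    if (seen.Add(url))
                    {
                        output.WriteLine(url);
                        written++;
                    }
                }
            }
        }
        return written;
    }
}
```

Calling UrlBuckets.Split(bigFile, bucketFolder) followed by UrlBuckets.MergeUnique(bucketFolder, resultFile) would do the whole job in two sequential passes. Only one bucket's worth of strings is in memory at any time.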

3.
The ultimate solution is to avoid duplicates right away.

I have written web crawlers before; obviously they need a way to avoid visiting the same URL twice, so they contain some logic for that.
Furthermore, you typically want to crawl breadth-first, i.e. read and process an entire page before moving on to another one. That is more efficient than depth-first, where you must either keep in memory or refetch the content of every page you pass through while digging deeper.

The natural solution to both issues is to organize the crawler around a single list holding every URL you have encountered (and should visit). Every anchor, image source, or style sheet you encounter gets added if it is not already in the list; when a page is done, move on to the next entry in the list (use an index). When the index reaches the end of the list you are done, and the list holds all the unique URLs.
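
In code, the list-plus-index loop could look roughly like this. It is only a sketch: Crawler, Crawl and the getLinks delegate are hypothetical names, and getLinks stands in for whatever fetches a page and extracts its URLs. The HashSet next to the list is my own addition, so that the "already in the list?" check does not scan the whole list every time.

```csharp
using System;
using System.Collections.Generic;

static class Crawler
{
    // getLinks is a hypothetical callback: given a URL, it returns every URL
    // found on that page (anchors, image sources, style sheets, ...).
    public static List<string> Crawl(string startUrl, Func<string, IEnumerable<string>> getLinks)
    {
        // The list is both the work queue and the final result.
        var urls = new List<string> { startUrl };
        // Same content as the list, kept only for fast membership tests.
        var known = new HashSet<string>(StringComparer.Ordinal) { startUrl };

        // The index walks the list while new entries are appended at the end,
        // which gives breadth-first order without recursion.
        for (int index = 0; index < urls.Count; index++)
        {
            foreach (string link in getLinks(urls[index]))
            {
                if (known.Add(link))   // Add returns false for a URL seen before
                    urls.Add(link);
            }
        }

        // Index reached the end: every entry was visited once, and all are unique.
        return urls;
    }
}
```

Because a URL is checked before it is ever stored, there is no duplicate-removal pass left to do at the end.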


:)

Luc Pattyn

Avoiding unwanted divs (as in "articles needing approval") with the help of this FireFox add-in


General: Re: Removing duplicates from large text files (Performance needed)
Ehsan Baghaki, 30-Apr-09 2:58
Answer: Re: Removing duplicates from large text files (Performance needed)
Doug Goulden, 29-Apr-09 15:17
Question: Message Removed
N_tro_P, 29-Apr-09 11:39
Answer: Re: How to Load data into user settings
DaveyM69, 29-Apr-09 12:07
Answer: Re: How to Load data into user settings
Henry Minute, 29-Apr-09 13:10
Question: Um... it works when I F5 but not when I CTRL - F5...?
Edmundisme, 29-Apr-09 11:27
Answer: Re: Um... it works when I F5 but not when I CTRL - F5...?
Eddy Vluggen, 29-Apr-09 12:45
General: Re: Um... it works when I F5 but not when I CTRL - F5...?
Edmundisme, 29-Apr-09 12:47
Question: C# - File Permission
malharone, 29-Apr-09 10:44
Answer: Re: C# - File Permission
Jimmanuel, 29-Apr-09 11:19
Answer: Re: C# - File Permission
Luc Pattyn, 29-Apr-09 11:33
General: Re: C# - File Permission
malharone, 29-Apr-09 13:17
Question: Help with Web Service Upload
charles Frank, 29-Apr-09 10:16
Question: How can I use cookies stored by Internet Explorer 7(in Vista) in my WinForms application?
rebulanyum, 29-Apr-09 10:08
Answer: Re: How can I use cookies stored by Internet Explorer 7(in Vista) in my WinForms application?
Henry Minute, 29-Apr-09 10:59
Answer: Re: How can I use cookies stored by Internet Explorer 7(in Vista) in my WinForms application? [modified]
rebulanyum, 30-Apr-09 4:36
Question: Why pictures aren't shown in listview
Aljaz111, 29-Apr-09 9:06
