C#

 
General: Re: Implementing IDisposable in Sealed class
Blumen, 18-Aug-08 18:45
Question: How to know whether a string contains a url?
Waleed Eissa, 16-Aug-08 23:54
Answer: Re: How to know whether a string contains a url?
Manas Bhardwaj, 17-Aug-08 0:04
General: Re: How to know whether a string contains a url?
Waleed Eissa, 17-Aug-08 1:37
Answer: Re: How to know whether a string contains a url?
Christian Graus, 17-Aug-08 0:52
General: Re: How to know whether a string contains a url?
Waleed Eissa, 17-Aug-08 1:42
General: Re: How to know whether a string contains a url?
Christian Graus, 17-Aug-08 2:06
General: Re: How to know whether a string contains a url? [modified]
Waleed Eissa, 17-Aug-08 3:34
OK, now I get your point. Actually, I don't care whether the urls are valid or not. As I mentioned before, this is just for spam filtering, so checking validity isn't important.

Let me explain from the beginning (hopefully you have the time to read all this :) ).

On my website, users will be adding a lot of posts in a short time, and I want the site to stay as fast and responsive as possible when they do. So basically I'm looking for a spam filter that runs on my own machine (as opposed to spam filters that call a web service on another website, like Akismet, which can be fine for blogs and sites that don't receive many posts). Unfortunately I haven't been able to find such a thing so far, which is why I'm trying to write it myself, and it seems more complicated than I thought.

Well, I thought of two approaches that I can use to detect spam:

- Using naive Bayesian filtering (there's an article here on CodeProject that talks about that, see http://www.codeproject.com/KB/recipes/BayesianCS.aspx)

- Using some rules that usually apply to spam, which is what I'm trying to do. Naive Bayesian filtering is actually very effective in most cases; I'm going with rules basically because of something specific to my app. Read on:

Due to the nature of my website, users wouldn't normally post any text that contains links (and I don't convert links that start with http:// into anchor tags). So it's reasonable to assume that posts containing links are most likely spam.

Spammers spam a site for two reasons. The first is to get a higher page rank for some website, or more accurately for some web page. That doesn't work in my case, since I don't turn links into anchor tags, and even if I did I could use rel="nofollow" as most people do, but the point is that the spam contains a url. The second is to advertise something, and in that case they have to leave a url, an email address or a phone number (if you can't reach the advertiser then the ad is useless, right?).

You're probably thinking that if I don't turn links into anchor tags they won't spam my site. I can assure you they are dumb enough to do it anyway: I have seen many other websites that don't turn links into anchors and are still heavily spammed. (Maybe it's not because they are dumb; it's rumored that Google detects any link that starts with http:// when crawling your site even if it's not in an anchor tag. I have no idea whether this is true.)

Anyway, what I'm trying to do is find out whether the post contains any of these (url, email address or phone number). Finding an email address or phone number is fairly easy with regex, and finding a url is fairly easy if it starts with http://. But there are two problems with this approach: first, looking at some spam ads, I noticed that some spammers don't start their urls with http://, and second, if spammers learn that I only check for http://, they will post all their urls without it.
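
The easy part of that check can be sketched roughly as follows. This is only an illustration in Python, not a finished filter: the three patterns and the name `find_contacts` are assumptions made for the example, and the patterns are deliberately loose because the goal is spam scoring, not validation.

```python
import re

# Assumed, deliberately loose patterns: we only want a spam signal,
# not a guarantee that the match is a valid address or number.
URL_RE = re.compile(r"https?://\S+", re.IGNORECASE)
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+")
# At least 7 digits, optionally led by "+", separated by spaces, dots,
# dashes or parentheses.
PHONE_RE = re.compile(r"\+?\d(?:[\s().-]*\d){6,}")


def find_contacts(text):
    """Return the kinds of contact details ("url", "email", "phone") found in text."""
    found = set()
    if URL_RE.search(text):
        found.add("url")
    if EMAIL_RE.search(text):
        found.add("email")
    if PHONE_RE.search(text):
        found.add("phone")
    return found
```

A post with no matches returns an empty set, so the result can feed straight into whatever scoring rule comes next. Note that this only catches urls that carry the http:// or https:// prefix.
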
Now the real complication is finding urls that don't start with http://, because basically anything with a dot inside can be a url. If a user doesn't leave a space between the full stop that ends a sentence and the next sentence, it will be detected as a url (along with lots of other text that has a dot between two strings yet isn't a url). So I thought I could use the TLDs (generic and country codes), but then the regex would be way too long! That will most probably hurt performance, and even if you decide not to use regex (say, a loop that checks for every TLD) that will most probably hurt performance too. To make things even worse, some completely valid text, like asp.net for example, will be detected as a url (and it's even a valid url in case you do a post :) ).
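
One way to avoid both the giant regex and the loop over every TLD, sketched here in Python under stated assumptions: match any dotted token with a short pattern, then look up only its last label in a hash set, which costs about the same no matter how long the TLD list is. `SAMPLE_TLDS` is a tiny made-up sample (a real list of generic and country-code TLDs is far longer), and `find_bare_domains` is a hypothetical name.

```python
import re

# Tiny illustrative sample only; the real TLD list is much longer.
SAMPLE_TLDS = frozenset({"com", "net", "org", "info", "biz", "uk", "de", "ru", "cn"})

# Candidate token: two or more labels of letters, digits or hyphens joined by dots.
CANDIDATE_RE = re.compile(r"\b[a-z0-9-]+(?:\.[a-z0-9-]+)+\b", re.IGNORECASE)


def find_bare_domains(text, tlds=SAMPLE_TLDS):
    """Return dotted tokens whose last label is in the TLD set."""
    hits = []
    for match in CANDIDATE_RE.finditer(text):
        token = match.group(0)
        last_label = token.rsplit(".", 1)[1].lower()
        if last_label in tlds:  # set lookup, not a scan over all TLDs
            hits.append(token)
    return hits
```

With this sample set, a missing space after a full stop ("finished.Then") is ignored because "then" is not a TLD, but "asp.net" is still reported, so the false-positive problem described above remains.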

This is getting more complicated than needed. I think I'll either drop urls that only have two parts (like asp.net) or use naive Bayesian filtering.

And BTW, the reason I wanted the spam filter to return a percentage is that some posts are guaranteed to be spam: the spam filter keeps a database of spam ads, and when it receives a post it hashes it and compares it to the spam ads that have the same hash (SpamAssassin does this, I believe). In that case the spam filter should return 100%. If the post is not guaranteed to be spam, it returns a percentage less than 100%, depending on how likely the post is to be spam. In my app, posts that score 100% will not be saved to the database, while posts that score less than 100% will be saved but won't be visible to anyone except the users who posted them until a moderator has checked them manually.
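
A minimal sketch of that flow, in Python and under several assumptions: the choice of SHA-256, the case and whitespace normalization, the in-memory dict standing in for the spam database, and all function names are made up for illustration. `heuristic_score` is a placeholder for whatever rule-based or Bayesian scorer ends up being used.

```python
import hashlib


def normalize(text):
    """Collapse case and whitespace so trivial variations hash the same (assumed rule)."""
    return " ".join(text.lower().split())


def fingerprint(text):
    """Hash of the normalized post; SHA-256 is an assumption, any stable hash would do."""
    return hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()


def remember_spam(known_spam, text):
    """Store a confirmed spam ad under its hash."""
    known_spam.setdefault(fingerprint(text), set()).add(normalize(text))


def spam_percentage(text, known_spam, heuristic_score):
    """Return 100 for a known spam ad, otherwise a heuristic score capped at 99."""
    # Look up by hash first, then compare against the ads sharing that hash.
    if normalize(text) in known_spam.get(fingerprint(text), set()):
        return 100
    return min(99, max(0, int(heuristic_score(text))))


def route_post(percentage):
    """100% is never saved; anything lower is saved but hidden until moderated."""
    return "discard" if percentage >= 100 else "save_hidden_until_moderated"
```

Capping the heuristic at 99 keeps 100% reserved for exact matches against the stored ads, so the discard path can never be triggered by a guess.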

Sorry for making this too long, just wanted to explain why I'm doing this ...

Have a great day ...


modified on Sunday, August 17, 2008 10:45 AM

General: Re: How to know whether a string contains a url?
Manas Bhardwaj, 18-Aug-08 6:15
General: Re: How to know whether a string contains a url?
Waleed Eissa, 18-Aug-08 13:49
Answer: Re: How to know whether a string contains a url?
Paul Conrad, 17-Aug-08 8:12
General: Re: How to know whether a string contains a url?
Waleed Eissa, 17-Aug-08 17:31
Question: New Extension
hadad, 16-Aug-08 22:15
Answer: Re: New Extension
Wendelius, 16-Aug-08 23:12
General: Re: New Extension
hadad, 16-Aug-08 23:19
General: Re: New Extension
Wendelius, 16-Aug-08 23:30
Answer: Re: New Extension
Giorgi Dalakishvili, 17-Aug-08 0:26
Question: Trim String Array
Arcdigital, 16-Aug-08 15:04
Answer: Re: Trim String Array
Dr. Emmett Brown, 16-Aug-08 15:15
Answer: Re: Trim String Array
lisan_al_ghaib, 17-Aug-08 0:06
Answer: Re: Trim String Array
PIEBALDconsult, 17-Aug-08 4:07
Question: Communicating between windows
Jason Coggins, 16-Aug-08 13:08
Answer: Re: Communicating between windows
Ken Mazaika, 16-Aug-08 13:19
Answer: Re: Communicating between windows
lisan_al_ghaib, 16-Aug-08 13:33
Answer: Re: Communicating between windows
Christian Graus, 16-Aug-08 14:05
