I'm writing a new CP article scraper app, and imagine my surprise to discover that the date posted isn't retrieved using any of the normal methods (HtmlAgilityPack, WebClient, or HttpWebRequest). It seems that you're using JavaScript to display that info on a given article's page.
It sure would be nice if you guys finally got around to providing a web service that allows us to retrieve an article instead of having to scrape the site (and a web service that allowed us to retrieve the reputation points as well).
.45 ACP - because shooting twice is just silly ----- "Why don't you tie a kerosene-soaked rag around your ankles so the ants won't climb up and eat your candy ass..." - Dale Earnhardt, 1997 ----- "The staggering layers of obscenity in your statement make it a work of art on so many levels." - J. Jystad, 2001
Dates are output using plain ol' HTML.
And yes, it'd be lovely if we could get through our current task list faster so we can get this done but we are definitely up against a hard limit of only 24 hrs in the day.
We're trying.
cheers,
Chris Maunder
The Code Project | Co-founder
Microsoft C++ MVP
Chris Maunder wrote: Dates are output using plain ol' HTML.
If I view source in a browser, it shows up as expected. If I use any of the three methods I listed in my OP, none of the data on the right side of the screen is included in the response. It's truly bizarre.
Chris Maunder wrote: We're trying.
I know, I just thought I'd squeak the wheel a little.
Chris, there is more to it than that. I can provide you with a lot of details now.
Executive Summary: UserAgent is very relevant; dates can be off by 1 day.
Details:
1. I already noticed CP Vanity sometimes shows dates that are off by 1 day when compared with what your article summary page shows. Example:
http://www.codeproject.com/script/Articles/MemberArticles.aspx?amid=648011 shows the CP Vanity article with "Last Updated: 6 Apr 2010", which is correct; the article itself also says "Updated: 6 Apr 2010". So far so good.
CP Vanity itself gets the same page (script/Articles/MemberArticles.aspx?amid=648011) showing "5 Apr 2010" and that is what my app displays.
I noticed this weeks ago, never took the time to investigate thoroughly.
FWIW: CP Vanity does not set a UserAgent.
2. John is having bigger trouble: he wants to load an article itself and says he misses a lot of content.
3. So I now downloaded CP Vanity using an HttpWebRequest (without UserAgent), and it results in a file of 85KB. When I look at the same page with Firefox, View Source, and save that as text, it is 165KB and contains a lot more information, including menus and the full header block containing "Updated: 6 Apr 2010", which is completely absent in the HttpWebRequest result.
BTW: the UserAgent I used is:
req.UserAgent="Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.0.19) Gecko/2010031422 Firefox/3.0.17";
4. The CP Vanity article shows (in Vista/Firefox):
Posted: 23 Mar 2010
Updated: 6 Apr 2010
The same URL downloaded with HttpWebRequest with my own UserAgent (the one my website gets from my system) contains:
<tr><td>Posted:</td><td><b>22 Mar 2010</b></td></tr>
<tr><td>Updated:</td><td><b>5 Apr 2010</b></td></tr>
so both dates are off by 1 day.
Conclusions
1. The UserAgent is very relevant; when it is absent, a very slimmed-down page is returned. That explains John's main problem.
2. With a realistic UserAgent, dates can still be off by 1 day. I don't know why or how. Maybe you have an idea about that.
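For anyone who wants to reproduce conclusion 1, here is a minimal sketch of a request that identifies itself as a browser. The class and method names (`ScraperRequests`, `CreateBrowserLikeRequest`, `DownloadPage`) are made up for illustration; the UserAgent string is the one quoted above. It only shows where the header gets set, and it does nothing about the date difference.

```csharp
using System;
using System.IO;
using System.Net;

static class ScraperRequests
{
    // The UserAgent string quoted in the post above; any realistic browser string is assumed to work.
    public const string BrowserUserAgent =
        "Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.0.19) Gecko/2010031422 Firefox/3.0.17";

    // Builds (but does not send) a GET request that carries a browser-like UserAgent.
    public static HttpWebRequest CreateBrowserLikeRequest(string url)
    {
        var request = (HttpWebRequest)WebRequest.Create(url);
        request.Method = "GET";
        request.UserAgent = BrowserUserAgent;
        return request;
    }

    // Sends the request and returns the response body as text.
    public static string DownloadPage(string url)
    {
        HttpWebRequest request = CreateBrowserLikeRequest(url);
        using (var response = (HttpWebResponse)request.GetResponse())
        using (var reader = new StreamReader(response.GetResponseStream()))
        {
            return reader.ReadToEnd();
        }
    }
}
```

Comparing the length of `DownloadPage(url)` with and without the `request.UserAgent` line should show the same kind of size difference described in point 3.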
Just my 2c worth... I suspect the date anomalies are related to the site's more general date/time issues. It consistently believes that I am one timezone east of where I actually am. Most times served seem to be "adjusted" for the user's tz (daylight savings effects unexplored), but not all. For example, the message to which I am replying has as a footer:
Luc Pattyn wrote: modified on Tuesday, May 4, 2010 10:28 PM
This matches up with the "posted 1 hr 38 ago" on the forum page if we assume Luc is in tz GMT-4, and it's his time in the footer. Or is it just that the server is in that tz?
I understand what a PITA all this tz stuff is - I maintain a website which logically lives "here" (New South Wales, Australian Eastern Standard/Summer Time), but the server is somewhere on the E coast of the USA. [Oh, and the local support is in Queensland, which is the same timezone as me except they don't do daylight saving.] It took a few iterations before people stopped complaining about ridiculous times, future events scheduled to happen "yesterday", etc.
Chris, I offer this as constructive, not a whinge. (Don't know what smiley is relevant)
Yes, I am willing to believe time zones are sometimes a problem. But here we use a single computer, a single IP address, and a single internet provider, with two different ways to look at the same page: one is a real browser, the other a code snippet using HttpWebRequest, fetching the exact same page. So I expect to get exactly the same data in both cases.
I already discovered that UserAgent plays a role; I can accept that, and I now use the value that my browser also uses, but still the dates are different. So far it remains a mystery.
BTW: I think I'm at GMT+2 (GMT+1, plus DST). Confirmed here.
OK, so the 'modified' timestamp I mentioned is server time. My point is that you and I can be in 'today' while the server and the Americans are legitimately in 'yesterday'. I suspect that the "adjust timestamp to user's tz" code on the CP server is in some but not all paths through the conceptual
switch (user agent)
{
case IE: ...
...
} block(s), and if my experience is anything to go by, there's a lot of them scattered all over the shop...
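To make the 'today' versus 'yesterday' point concrete, here is a small sketch. Everything in it is an assumption for illustration: the publication time (01:30 at GMT+2), the server offset (GMT-4), and the names `DayBoundaryDemo` and `LocalDateAt`. It says nothing about what the CP server actually does; it only shows that one instant can land on two calendar dates.

```csharp
using System;

static class DayBoundaryDemo
{
    // Returns the calendar date that a given instant falls on when viewed at a UTC offset.
    public static DateTime LocalDateAt(DateTimeOffset instant, TimeSpan utcOffset)
    {
        return instant.ToOffset(utcOffset).Date;
    }

    public static void Main()
    {
        // Hypothetical: an update published at 01:30 on 6 Apr 2010 by an author at GMT+2.
        var published = new DateTimeOffset(2010, 4, 6, 1, 30, 0, TimeSpan.FromHours(2));

        // Same instant, two viewpoints.
        DateTime authorDate = LocalDateAt(published, TimeSpan.FromHours(2));   // 6 Apr 2010
        DateTime serverDate = LocalDateAt(published, TimeSpan.FromHours(-4));  // 5 Apr 2010

        Console.WriteLine("Author sees: {0:d MMM yyyy}", authorDate);
        Console.WriteLine("Server sees: {0:d MMM yyyy}", serverDate);
    }
}
```

If one code path formats the date in the viewer's offset and another formats it in the server's offset, a 1-day difference like the one Luc reports would appear for anything published in that window.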
Yep - I specified the user agent to the request, and I got the entire page - btw, on all of the articles I've tried, the dates are correct.
John Simmons / outlaw programmer wrote: on all of the articles I've tried, the dates are correct.
Maybe, just maybe, that is because you publish before (your) midnight, whereas I may often publish between my midnight and Chris's midnight. But that should not matter: the page still differs depending on whether it is viewed in a browser or fetched with HttpWebRequest.
You should try scraping CP Vanity and see what I mean.
Perhaps you can also specify your timezone in the header of the request. That may fix your issue.
You may be right, though I've never seen it. Do you have any specific information on how to do that?
Thanks for the info. The Date property is pretty new; it has existed only since .NET 4.0.
Unfortunately, it does not convey the timezone, and setting it to DateTime.Now did not help at all. (The server could derive the client's timezone by comparing the request datetime with the server datetime, but in doing so it would assume the request arrived "immediately", which does not have to be true.)
From what I've read on the web there really isn't an official way to convey the client's timezone through HTTP. The only approach discussed is sending a page containing the necessary JavaScript to fix datetimes... And it is my guess that is what CP does, although I haven't found it yet amidst loads of code.
Thanks anyway.
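A quick sketch of why the Date header cannot carry the client's timezone: HTTP dates are always expressed in GMT, so two clients in different zones sending at the same instant produce the same header value. The names `HttpDateDemo` and `ToHttpDate` and the sample times are made up for illustration.

```csharp
using System;
using System.Globalization;

static class HttpDateDemo
{
    // Formats an instant the way an HTTP Date header carries it (RFC 1123 pattern, always GMT).
    public static string ToHttpDate(DateTimeOffset instant)
    {
        return instant.UtcDateTime.ToString("r", CultureInfo.InvariantCulture);
    }

    public static void Main()
    {
        // Hypothetical: the same instant as seen by a client at GMT+2 and a client at GMT+10.
        var europe = new DateTimeOffset(2010, 5, 4, 22, 28, 0, TimeSpan.FromHours(2));
        var australia = new DateTimeOffset(2010, 5, 5, 6, 28, 0, TimeSpan.FromHours(10));

        Console.WriteLine(ToHttpDate(europe));     // Tue, 04 May 2010 20:28:00 GMT
        Console.WriteLine(ToHttpDate(australia));  // Tue, 04 May 2010 20:28:00 GMT
    }
}
```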
Yes, true, the Date property is new in .NET 4.0.
However, you can set the Date header in the Headers collection.
It looks like it is specified in GMT, so maybe you should use DateTime.UtcNow instead of DateTime.Now.
Anyway, happy coding!
You are persistent, I must admit.
I now tried with request.Date = DateTime.UtcNow; no change.
Luc suggested the use of a user agent, and that fixed my problem... Weird that I'm just now discovering it, though...
Apologies to Chris, et al, for suggesting that they're doing something wonky on the page.
John Simmons / outlaw programmer wrote: Apologies
John, you're getting a real softie here. They do modify HTML and withhold information in mysterious ways. We are in the dark as to what information would be required in the UserAgent in order to obtain all available information correctly or even at all.
Luc will laugh, but hit Ctrl+F5 to force a CSS update.
cheers,
Chris Maunder
you're absolutely right.
Good now.