Updated 14-Dec-23 18:46pm
Comments
Andre Oosthuizen 18-Nov-23 13:04pm    
Your question is not clear enough. It seems that you want to scrape a certain site, but need to scrape each page as it is selected?

Please give us more information and the site that you want to scrape; that might help clear things up a bit.
Yount_0701 18-Nov-23 18:53pm    
Take this website as an example: https://movie.douban.com/review/best/.
It's a list of movie comments with page links at the bottom. When you click a different page link, new comments come up. I want to scrape these comments, and I don't want to click each page link by hand; I hope JS code can do this instead.
Richard Deeming 20-Nov-23 4:16am    
As mentioned in solution 1, "scraping" data from someone else's website without their permission is unethical at best.

And if you have permission from the site owners, then there's almost certainly an easier way for them to give you access to the data.
Yount_0701 20-Nov-23 9:32am    
I'm not arguing with you. It's a tech question first, and I'm not a criminal. When a new href loads, the console context switches, so the previous page's console output is no longer accessible and cannot be gathered. That is the fact I faced, and I thought this was a tech community. If my description is not wrong, you have to admit that it's a question first. No offense, but you guys have a special talent for teaching lessons: lessons that walk around my question without ever stepping into it.

1 solution

Firstly, this is not ethical without the site owner's permission; it might add strain to their server for your benefit!

To click an element on a site, you need a reference to it (for example via its id or a CSS selector) and then call click() on it; in this case, the 'a' elements that load the pages. The 'a' elements are contained in a div with the class 'paginator' (which is what the '.paginator' selector in the code matches); select each element inside and click it -

JavaScript
<script>
    window.addEventListener('load', function() {
        //Get all the 'a' elements inside the element with class 'paginator'...
        var links = document.querySelectorAll('.paginator a');

        //Iterate through each link and trigger a click event for each...
        //Note: each click starts a page navigation; once the new page loads,
        //this script is no longer running.
        links.forEach(function(link) {
            link.click();
        });
    });
</script>
 
Comments
Yount_0701 19-Nov-23 2:55am    
Andre Oosthuizen 19-Nov-23 3:46am    
The above was only to point you in the right direction. See my updated sample, which should be easier to understand.
Yount_0701 19-Nov-23 4:59am    
First of all, I have to thank you for your attention and for the code pointing me in a direction.
Perhaps my description was not clear. As I mentioned in the question, what I'm doing is: first click 'next page', then let the new page load, scrape the items' info, then click 'next page' again, and so on. My code registers a handler for the window load event, but when the page link is clicked, the console (the F12 DevTools interface) switches into a new context. I cannot get any output for the new page's load event, which I query from the new page's DOM document and print with console.log() in the console context. What I hope is that the console prints the new page's item content. That's one problem. Second, I'm not sure whether a listener on the window's load event is capable of doing my job, or is the correct way to do it. Sincerely, thank you.
Andre Oosthuizen 20-Nov-23 11:58am    
Only a pleasure. The supplied code will point you in the right direction. Unfortunately, I am not about to start a new project and end it with numerous back-and-forth messages; sorry, I just don't have the time for it.

If you test my code and then start building on it, you will be OK within a few days, and you will then also understand your own code much better.
Yount_0701 21-Nov-23 5:15am    
I'm not sure you get me, but I'm sure I don't get you or your direction. (Frankly, as far as I can see, your code is not that much different from the code in my question. I hope you don't take this badly; it depends on my understanding, and if I'm wrong, the problem is my understanding.)
I found a way to deal with it; not perfect, but it works. As for your solution, I'm not going to buy it.
I use ajax/XHR to request the href of each page link and parse the response. I'm lucky that the response has no DOM content generated dynamically by JS, and that the href has had no cross-origin problem so far. Simple, and it works.
The problem is still there. You guys walked around my question, so I changed my approach and walked around my problem.
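
The ajax/XHR workaround described in the comment above might look roughly like the sketch below. It is not the poster's actual code, which was never shown. Because the pages are fetched and parsed instead of navigated to, the console context stays on the original page. The names scrapeAllPages, browserFetchHtml and browserParsePage are hypothetical, and the '.review-item' and '.paginator .next a' selectors are unverified placeholders for whatever the target markup really uses; it also assumes the pages are same-origin and contain the data in the static HTML, and that you have permission to fetch them.

```javascript
// Walk a paginated listing by fetching each page instead of clicking links.
// fetchHtml(url)       -> Promise resolving to the page's HTML text
// parsePage(html, url) -> { items: [...], nextUrl: string | null }
// Both helpers are injected so the loop does not depend on the environment.
async function scrapeAllPages(startUrl, fetchHtml, parsePage, maxPages = 50) {
    const items = [];
    const seen = new Set(); // guards against pagination loops
    let url = startUrl;

    while (url && !seen.has(url) && seen.size < maxPages) {
        seen.add(url);
        const html = await fetchHtml(url);
        const page = parsePage(html, url);
        items.push(...page.items);
        url = page.nextUrl;
    }
    return items;
}

// Browser helper: fetch a same-origin page as text.
function browserFetchHtml(url) {
    return fetch(url).then(function (res) {
        if (!res.ok) {
            throw new Error('HTTP ' + res.status + ' for ' + url);
        }
        return res.text();
    });
}

// Browser helper: parse the HTML without loading it into the current page.
// ASSUMPTION: both selectors are placeholders, not checked against any site.
function browserParsePage(html, url) {
    const doc = new DOMParser().parseFromString(html, 'text/html');
    const items = Array.from(doc.querySelectorAll('.review-item'), function (el) {
        return el.textContent.trim();
    });
    const next = doc.querySelector('.paginator .next a');
    // Resolve a relative href against the URL the HTML was fetched from.
    const nextUrl = next ? new URL(next.getAttribute('href'), url).href : null;
    return { items: items, nextUrl: nextUrl };
}
```

Pasted into the DevTools console on the first listing page, a call such as `scrapeAllPages(location.href, browserFetchHtml, browserParsePage).then(console.log)` would print the collected items once the last page has been parsed. The `maxPages` cap and the `seen` set are only there to keep a mistaken selector from producing an endless crawl.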

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


