mysql - How to automate web scraping using PHP?
Let me explain the situation. I have a list of 10 million page URLs, and I want those pages scraped and saved in a database as raw HTML.

At the moment I'm using cURL to scrape the pages: every time I access index.php, it scrapes one page URL and saves it in the database. Obviously it is not possible for me to access index.php 10 million times myself in a browser.

I could use a do-while loop, but I think that is going to take a hell of a lot of time to complete the task, and there will be memory issues too. So can someone point me in the right direction to make this task painless?

I have a Linux VPS server with 1 GB of RAM and WHM/cPanel.

PS: I have considered cron, but you have to define a schedule in cron. If I run the script every minute, I can only complete 1440 URLs in 24 hours. Can anyone give me an idea how to complete at least 100,000 URLs in one day using cron?
What you need is a high-speed fetching tool such as wget or curl to do the heavy lifting for you. PHP can create the lists of work for these programs to process, and wget in particular has an option for fetching and saving the contents of URLs listed in a file.
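As a sketch of that idea (the file names and batch size here are my assumptions, not from the answer), the master list can be split into smaller list files that wget's -i option reads:

```shell
#!/bin/sh
# Split a master URL list into batch files of 1000 URLs each,
# named /tmp/batch_aa, /tmp/batch_ab, ... (the demo input has 3 URLs).
printf 'http://example.com/1\nhttp://example.com/2\nhttp://example.com/3\n' > /tmp/master.urls
split -l 1000 /tmp/master.urls /tmp/batch_

# Each batch file could then be fetched in one command, e.g.:
#   wget -i /tmp/batch_aa -P /var/spool/pages/
ls /tmp/batch_*
```

With 10 million URLs and 1000 per batch, that produces 10,000 small list files that can be worked through a few at a time.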
A PHP script run from a browser must finish in a reasonable amount of time or it will time out, so it should not be used for background processing like this.
You can use crontab to check for new work and launch a new wget process. There is no reason to fetch one URL at a time: put as many as you like in the list files.
For example, a cron job could kick off a script like this:

#!/bin/sh
for list in /tmp/*.urls
do
    wget -i "$list" -b
done
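A minimal crontab entry to launch such a script could look like this (the five-minute schedule, script path, and log path are assumptions for illustration):

```
# m   h  dom mon dow  command
*/5   *   *   *   *   /usr/local/bin/fetch-batches.sh >> /var/log/fetch.log 2>&1
```

Running every five minutes rather than every minute gives each batch of wget processes time to finish before the next round starts.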
Of course, there are a lot of options to wget that can be tweaked.
If your PHP app is properly secured, it could write out shell scripts for cron to run in the background at a later time. That way you can specify the exact destination of each file.
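For instance, here is a sketch of generating such a script from a URL list in shell (a PHP app could write the same file with file_put_contents; every path and file name below is hypothetical):

```shell
#!/bin/sh
# Sketch: turn a URL list into a script with one wget call per URL,
# so each page gets an explicit output file via -O. Paths are made up.
printf 'http://example.com/a\nhttp://example.com/b\n' > /tmp/queue.urls  # demo input

mkdir -p /tmp/jobs
n=0
{
    echo '#!/bin/sh'
    while read -r url
    do
        n=$((n + 1))
        # One wget per URL, with a predictable output path the database
        # row for that URL can record.
        printf 'wget -O /var/spool/pages/page_%06d.html %s\n' "$n" "$url"
    done < /tmp/queue.urls
} > /tmp/jobs/fetch.sh
chmod +x /tmp/jobs/fetch.sh

cat /tmp/jobs/fetch.sh
```

Because each output path is known up front, the importer that loads the raw HTML into the database knows exactly which file belongs to which URL.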