mysql - How to automate web scraping using php?


Let me explain the situation.

I have a list of 10 million page URLs. I want the pages scraped and saved in the database as raw HTML.

As of now I'm using curl to scrape the pages. Every time I access index.php, it scrapes one page URL and saves it to the database.

Now, I think it is not possible for me to access index.php 10 million times myself using a browser.

I could use a do-while loop, but I think that is going to take a hell of a long time to complete the task, and there would be memory issues too.

So can anyone point me in the right direction to make this task painless?

I own a Linux VPS server with 1 GB RAM and WHM/cPanel.

PS: I have considered cron, but I have to define a time in cron. If I run the script every minute using cron, I can complete only 1440 URLs in 24 hours. Can anyone give me an idea how to complete at least 100,000 URLs in 1 day using cron?

What you need is a high-speed fetching tool like wget or curl to do the heavy lifting for you. PHP can create the lists of work for these programs to process, and wget in particular has an option for fetching and saving the contents of every URL in a list given in a file.
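Creating those work lists does not even need PHP; for instance, a master list can be broken into wget-sized chunks with `split`. A minimal sketch, where the file names and the chunk size of 1000 are invented for illustration:

```shell
#!/bin/sh
# Demo input: in practice this file would hold the 10 million URLs.
seq 1 2500 | sed 's|^|http://example.com/page|' > /tmp/all_urls.txt

# Split the list into files of 1000 URLs each:
# /tmp/batch_aa, /tmp/batch_ab, /tmp/batch_ac, ...
split -l 1000 /tmp/all_urls.txt /tmp/batch_

# Give each chunk a .urls suffix so the cron script can glob for them.
for f in /tmp/batch_a?; do
    mv "$f" "$f.urls"
done
```

Each resulting `.urls` file is then one unit of work for `wget -i`.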

A PHP script run from a browser must finish in a reasonable amount of time or it will time out, so it is not suited to background processes like this.

You can use crontab to check for new work and launch a new wget process. There is no reason to fetch one URL at a time; fetch as many as are in the listed files.
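The crontab entry for that could look like the following sketch, where the launcher path, the 5-minute schedule, and the log file are all placeholders chosen for illustration:

# run the launcher every 5 minutes; wget -b backgrounds each fetch,
# so the cron job itself returns immediately
*/5 * * * * /usr/local/bin/fetch_urls.sh >> /var/log/fetch_urls.log 2>&1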

For example, a cron job could kick off a script like this:

#!/bin/sh
for list in /tmp/*.urls
do
    wget -i "$list" -b
done

Of course, there are a lot of wget options that can be tweaked.
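For instance, a fetch for one batch might look like this sketch; the destination directory, retry count, timeout, and file names here are illustrative, not prescribed:

# fetch every URL in the batch, saving under /var/scraped, retrying
# twice (-t 2), timing out slow servers after 30s (-T 30), logging to
# a per-batch file (-o), and running in the background (-b)
wget -i /tmp/batch_aa.urls -P /var/scraped -t 2 -T 30 -o /var/log/batch_aa.log -b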

If your PHP app is properly secured, it could write out shell scripts for cron to run in the background at a later time. That way you can specify the exact destination of each file.
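A generated per-batch script might look like the one below; here a here-document stands in for the PHP code that would write it, and the batch name and directories are invented for illustration:

```shell
#!/bin/sh
# Write out a per-batch script: fetch one batch of URLs into its own
# destination directory, then rename the list so it is not picked up again.
cat > /tmp/fetch_batch_aa.sh <<'EOF'
#!/bin/sh
mkdir -p /var/scraped/batch_aa
wget -i /tmp/batch_aa.urls -P /var/scraped/batch_aa -b
mv /tmp/batch_aa.urls /tmp/batch_aa.urls.done
EOF
chmod +x /tmp/fetch_batch_aa.sh
```

Cron then only has to find and run the executable scripts left in the drop directory.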

