PHPCreeper (爬山虎) plugin for webman: makes crawling work simpler.
composer require blogdaren/webman-phpcreeper
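Just write your business logic inside the onXXXX callback methods.
The process-number column currently displays abnormally; I will update this post once that is resolved, but it has no effect on the crawling itself. The simulated requirement is to fetch the weather forecast for the next 7 days.
1. Create the spider directory: app/spider
2. Create the producer handler class file app/spider/Myproducer.php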
<?php
/**
 * @script   Myproducer.php
 * @brief    Producer handler
 * @author   blogdaren<blogdaren@163.com>
 * @version  1.0.0
 * @modify   2022-04-01
 */
namespace app\spider;
use Workerman\Timer;
class Myproducer extends \Webman\PHPCreeper\Producer
{
    /**
     * @brief    Demo: fetch the weather forecast for the next 7 days
     *
     * @return   mixed
     */
    public function makeTask()
    {
        // Create one task
        $task = array(
            'url' => 'http://www.weather.com.cn/weather/101010100.shtml',
            'rule' => array(
                'time' => ['div#7d ul.t.clearfix h1', 'text'],
                'wea'  => ['div#7d ul.t.clearfix p.wea', 'text'],
                'tem'  => ['div#7d ul.t.clearfix p.tem', 'text'],
                'wind' => ['div#7d ul.t.clearfix p.win i', 'text'],
            ),
            'context' => array(
                'cache_enabled'    => true,
                'cache_directory'  => '/tmp/DownloadCache4PHPCreeper/download/',
                'allow_url_repeat' => true,
            ),
        );
        $this->newTaskMan()->createTask($task);
    }
    /**
     * @brief    onProducerStart
     *
     * @param    object $producer
     *
     * @return   mixed
     */
    public function onProducerStart($producer)
    {
        //$this->makeTask();
        // Create the task every 2 seconds (persistent timer)
        Timer::add(2, [$this, "makeTask"], [], true);
    }
    /**
     * @brief    onProducerStop
     *
     * @param    object $producer
     *
     * @return   mixed
     */
    public function onProducerStop($producer)
    {
    }
    /**
     * @brief    onProducerReload
     *
     * @param    object $producer
     *
     * @return   mixed
     */
    public function onProducerReload($producer)
    {
    }
}
3. Create the downloader handler class file app/spider/Mydownloader.php
<?php
/**
 * @script   Mydownloader.php
 * @brief    Downloader handler
 * @author   blogdaren<blogdaren@163.com>
 * @version  1.0.0
 * @modify   2022-04-01
 */
namespace app\spider;
class Mydownloader extends \Webman\PHPCreeper\Downloader
{
    /**
     * @brief    onDownloaderStart
     *
     * @param    object $downloader
     *
     * @return   mixed
     */
    public function onDownloaderStart($downloader)
    {
        // Address of the parser process (see the 'listen' entry in the process config)
        $downloader->setClientSocketAddress([
            'ws://127.0.0.1:8888',
        ]);
    }
    /**
     * @brief    onDownloaderStop
     *
     * @param    object $downloader
     *
     * @return   mixed
     */
    public function onDownloaderStop($downloader)
    {
    }
    /**
     * @brief    onDownloaderReload
     *
     * @param    object $downloader
     *
     * @return   mixed
     */
    public function onDownloaderReload($downloader)
    {
    }
    /**
     * @brief    onDownloaderMessage
     *
     * @param    object $downloader
     * @param    string $parser_reply
     *
     * @return   mixed
     */
    public function onDownloaderMessage($downloader, $parser_reply)
    {
        //pprint($parser_reply, __METHOD__);
    }
    /**
     * @brief    onBeforeDownload
     *
     * @param    object $downloader
     * @param    array  $task
     *
     * @return   mixed
     */
    public function onBeforeDownload($downloader, $task)
    {
        //$downloader->httpClient->setConnectTimeout(3);
        //$downloader->httpClient->setTransferTimeout(10);
        //$downloader->httpClient->setHeaders(array());
        //$downloader->httpClient->setProxy('http://180.153.144.138:8800');
    }
    /**
     * @brief    onStartDownload
     *
     * @param    object $downloader
     * @param    array  $task
     *
     * @return   mixed
     */
    public function onStartDownload($downloader, $task)
    {
    }
    /**
     * @brief    onAfterDownload
     *
     * @param    object $downloader
     * @param    array  $download_data
     * @param    array  $task
     *
     * @return   mixed
     */
    public function onAfterDownload($downloader, $download_data, $task)
    {
        //pprint($downloader->getDbo('test'), __METHOD__);
    }
}
4. Create the parser handler class file app/spider/Myparser.php
<?php
/**
 * @script   Myparser.php
 * @brief    Parser handler
 * @author   blogdaren<blogdaren@163.com>
 * @version  1.0.0
 * @modify   2022-04-01
 */
namespace app\spider;
class Myparser extends \Webman\PHPCreeper\Parser
{
    /**
     * @brief    onParserStart
     *
     * @param    object $parser
     *
     * @return   mixed
     */
    public function onParserStart($parser)
    {
    }
    /**
     * @brief    onParserStop
     *
     * @param    object $parser
     *
     * @return   mixed
     */
    public function onParserStop($parser)
    {
    }
    /**
     * @brief    onParserReload
     *
     * @param    object $parser
     *
     * @return   mixed
     */
    public function onParserReload($parser)
    {
    }
    /**
     * @brief    onParserMessage
     *
     * @param    object $parser
     * @param    object $connection
     * @param    string $download_data
     *
     * @return   mixed
     */
    public function onParserMessage($parser, $connection, $download_data)
    {
        /*
         *$rule = array(
         *    'hotline' => ['div.qxfw-body > p:eq(1)', 'text'],
         *);
         *$data = $parser->extractor->setHtml($download_data)->setRule($rule)->extract();
         *pprint($data, __METHOD__);
         */
    }
    /**
     * @brief    onParserFindUrl
     *
     * @param    object $parser
     * @param    string $url
     *
     * @return   mixed
     */
    public function onParserFindUrl($parser, $url)
    {
        return $url;
    }
    /**
     * @brief    onParserExtractField
     *
     * @param    object $parser
     * @param    string $download_data
     * @param    array  $fields
     *
     * @return   mixed
     */
    public function onParserExtractField($parser, $download_data, $fields)
    {
        // Print the extracted fields when the rule matched anything
        !empty($fields) && pprint($fields, __METHOD__);
    }
}
5. Edit the plugin's process configuration file to set the corresponding handlers
<?php
use app\spider\Myproducer;
use app\spider\Mydownloader;
use app\spider\Myparser;
return [
    'myproducer' => [
        'handler'     => Myproducer::class,
        'listen'      => '',
        'count'       => 1,
        'constructor' => ['config' =>
            include('spider/global.php')
        ],
    ],
    'mydownloader' => [
        'handler'     => Mydownloader::class,
        'listen'      => '',
        'count'       => 1,
        'constructor' => ['config' =>
            include('spider/global.php')
        ],
    ],
    'myparser' => [
        'handler'     => Myparser::class,
        'listen'      => 'websocket://0.0.0.0:8888',
        'count'       => 1,
        'constructor' => ['config' =>
            include('spider/global.php')
        ],
    ],
];
It will not install:
composer require blogdaren/webman-phpcreeper
Using version ^1.0 for blogdaren/webman-phpcreeper
./composer.json has been updated
Running composer update blogdaren/webman-phpcreeper
Loading composer repositories with package information
Updating dependencies
Your requirements could not be resolved to an installable set of packages.
Problem 1
Use the option --with-all-dependencies (-W) to allow upgrades, downgrades and removals for packages currently locked to specific versions.
You can also try re-running composer require with an explicit version constraint, e.g. "composer require blogdaren/webman-phpcreeper:*" to figure out if any version is installable, or "composer require blogdaren/webman-phpcreeper:^2.1" if you know which you need.
Installation failed, reverting ./composer.json and ./composer.lock to their original content.
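How do I deploy this in a distributed way? Or is distributed deployment not possible with this?
[1] Both distributed and separated deployment are supported, following exactly the same model as Workerman's distributed and separated deployment.
[2] This post is old; for newer changes, see the PHPCreeper project and the plugin's official manual, or ask in the PHPCreeper technical group.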
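2022-12-28 16:37:22.323509 | ERROR | Producer | plugin.blogdaren.webman-phpcreeper.myproducer | process 01 | Produce task: invalid task URL configuration detected, please confirm that the task URL has been set correctly.......
Everything I installed is the latest version, and I copied the weather-scraping code above as is.
[1] For the invalid-URL problem, first try upgrading both the PHPCreeper plugin and the PHPCreeper engine to the latest version.
[2] The short API can now be used: $this->createTask($task);
[3] This post is old; for newer changes, see the PHPCreeper project and the plugin's official manual, or ask in the PHPCreeper technical group.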
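A minimal sketch of how the short API from reply [2] might be used, assuming it is called from inside the producer handler. The helpers `buildWeatherTask()` and `isUsableTaskUrl()` are hypothetical names introduced here, not part of PHPCreeper, and the URL check is only a local sanity check; it is not claimed to match how the engine itself validates URLs.

```php
<?php
/**
 * Hypothetical helper (not part of PHPCreeper): build the weather task
 * from the original post as a plain array.
 */
function buildWeatherTask(string $url): array
{
    return [
        'url'  => $url,
        'rule' => [
            'time' => ['div#7d ul.t.clearfix h1', 'text'],
            'wea'  => ['div#7d ul.t.clearfix p.wea', 'text'],
            'tem'  => ['div#7d ul.t.clearfix p.tem', 'text'],
            'wind' => ['div#7d ul.t.clearfix p.win i', 'text'],
        ],
    ];
}

/**
 * Hypothetical local sanity check: true only for absolute http(s) URLs.
 */
function isUsableTaskUrl($url): bool
{
    if (!is_string($url) || filter_var($url, FILTER_VALIDATE_URL) === false) {
        return false;
    }
    $scheme = strtolower((string) parse_url($url, PHP_URL_SCHEME));
    return in_array($scheme, ['http', 'https'], true);
}

$task = buildWeatherTask('http://www.weather.com.cn/weather/101010100.shtml');

// Inside Myproducer::makeTask(), the short API from the reply would then be:
//     if (isUsableTaskUrl($task['url'])) {
//         $this->createTask($task);
//     }
```

Checking the URL before handing the task over makes it easier to tell a genuinely malformed task apart from a version mismatch between the plugin and the engine.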
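After a page has been crawled and new URLs are extracted from it, how do I put them into the queue so that crawling continues until every URL is done?
[1] In essence, every container can reach the API exposed by the Task object, so just use: $parser->createTask($task);
[2] This post is old; for newer changes, see the PHPCreeper project and the plugin's official manual, or ask in the PHPCreeper technical group.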
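One possible way to prepare such follow-up tasks is sketched here. `buildFollowUpTask()` is a hypothetical helper, not a PHPCreeper API; it keeps only URLs on one allowed host and skips URLs already seen in the current process. The in-memory `$seen` array is an assumption made for illustration and does not replace whatever de-duplication the engine performs.

```php
<?php
/**
 * Hypothetical helper (not part of PHPCreeper): turn a discovered URL into
 * a follow-up task array, or return null when the URL should be skipped.
 *
 * @param string $url         URL found on a crawled page
 * @param string $allowedHost only URLs on this host are followed
 * @param array  $rule        extraction rule to attach to the new task
 * @param array  $seen        per-process set of URLs already queued
 */
function buildFollowUpTask(string $url, string $allowedHost, array $rule, array &$seen): ?array
{
    $host = parse_url($url, PHP_URL_HOST);
    if (!is_string($host) || strcasecmp($host, $allowedHost) !== 0) {
        return null; // relative, malformed, or off-site URL
    }
    if (isset($seen[$url])) {
        return null; // already queued by this process
    }
    $seen[$url] = true;

    return ['url' => $url, 'rule' => $rule];
}

// In a parser callback that receives $parser, the call from the reply
// would then look like this:
//     $task = buildFollowUpTask($url, 'www.weather.com.cn', $rule, $this->seen);
//     if ($task !== null) {
//         $parser->createTask($task);
//     }
```

Restricting follow-ups to a single host keeps "crawl until every URL is done" bounded; without some filter of this kind the queue can grow to cover every site the pages link to.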
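With two sites to crawl, I configured things following the multi-task mode in the PHPCreeper documentation, but I keep getting this message:
Invalid task URL configuration detected, please confirm that the task URL has been set correctly.......
$task = array(
    // task 1
    array(
    ),
    // task 2
    array(
    ),
);
So under webman, if there are several crawl tasks, do they have to be configured one task per directory, as in PHPCreeper itself? If so, how should multiple Producer, Downloader, and Parser entries be configured in the plugin's process.php?
Also, can the crawl frequency only be controlled through Timer::add()? I only need to crawl during specified times; can something like a crontab scheduled job be used instead?
[1] For the invalid-URL problem, first try upgrading both the PHPCreeper plugin and the PHPCreeper engine to the latest version.
[2] Generally that is not necessary; in addition, there is a dedicated API for multiple tasks: $producer->createMultiTask($task);
[3] PHPCreeper's timers come straight from Workerman, so both the Timer usage and the Crontab usage are supported.
[4] This post is old; for newer changes, see the PHPCreeper project and the plugin's official manual, or ask in the PHPCreeper technical group.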
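The two points above can be sketched together, under stated assumptions. `buildWeatherTasks()` and `inCrawlWindow()` are hypothetical helpers, not PHPCreeper APIs. The list-of-task-arrays shape follows the snippet quoted in the question, and whether a given engine version accepts exactly that shape should be checked against the official manual. The second city code is only an illustrative value.

```php
<?php
/**
 * Hypothetical helper (not part of PHPCreeper): build one task array per
 * city code, all sharing the extraction rule from the original post.
 */
function buildWeatherTasks(array $cityCodes): array
{
    $rule = [
        'time' => ['div#7d ul.t.clearfix h1', 'text'],
        'wea'  => ['div#7d ul.t.clearfix p.wea', 'text'],
        'tem'  => ['div#7d ul.t.clearfix p.tem', 'text'],
        'wind' => ['div#7d ul.t.clearfix p.win i', 'text'],
    ];

    $tasks = [];
    foreach ($cityCodes as $code) {
        $tasks[] = [
            'url'  => "http://www.weather.com.cn/weather/{$code}.shtml",
            'rule' => $rule,
        ];
    }
    return $tasks;
}

/**
 * Hypothetical guard: true when $now falls inside [startHour, endHour),
 * using the hour of day in the timezone of $now.
 */
function inCrawlWindow(DateTimeInterface $now, int $startHour, int $endHour): bool
{
    $hour = (int) $now->format('G');
    return $hour >= $startHour && $hour < $endHour;
}

// 101010100 comes from the original post; 101020100 is an illustrative value.
$tasks = buildWeatherTasks(['101010100', '101020100']);

// Inside Myproducer::onProducerStart($producer), with the API named in the reply:
//     Timer::add(60, function () use ($producer, $tasks) {
//         if (inCrawlWindow(new DateTimeImmutable(), 8, 18)) {
//             $producer->createMultiTask($tasks);
//         }
//     });
```

Guarding a plain timer with a time-window check is the simplest way to limit crawling to certain hours; a Crontab rule, which the reply says is also supported, would express the same schedule declaratively.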
Very nice.