免责申明:本框架是为了帮助开发人员简化开发流程、提高开发效率,请勿使用此框架做任何违法国家法律的事情,使用者所做任何事情也与本框架的作者无关。
DotnetSpider, a .NET Standard web crawling library. It is a lightweight, efficient, and fast high-level web crawling & scraping framework.
If you want to get the latest beta packages, you should add the myget feed:
<add key="myget.org" value="https://www.myget.org/F/zlzforever/api/v3/index.json" protocolVersion="3" />
Visual Studio 2017 (15.3 or later) or Jetbrains Rider
Docker
MySql
docker run --name mysql -d -p 3306:3306 --restart always -e MYSQL_ROOT_PASSWORD=1qazZAQ! mysql:5.7
Redis (option)
docker run --name redis -d -p 6379:6379 --restart always redis
SqlServer
docker run --name sqlserver -d -p 1433:1433 --restart always -e 'ACCEPT_EULA=Y' -e 'SA_PASSWORD=1qazZAQ!' mcr.microsoft.com/mssql/server:2017-latest
PostgreSQL (option)
docker run --name postgres -d -p 5432:5432 --restart always -e POSTGRES_PASSWORD=1qazZAQ! postgres
MongoDb (option)
docker run --name mongo -d -p 27017:27017 --restart always mongo
RabbitMQ
docker run -d --restart always --name rabbimq -p 4369:4369 -p 5671-5672:5671-5672 -p 25672:25672 -p 15671-15672:15671-15672 \
-e RABBITMQ_DEFAULT_USER=user -e RABBITMQ_DEFAULT_PASS=password \
rabbitmq:3-management
Docker remote api for mac
docker run -d --restart always --name socat -v /var/run/docker.sock:/var/run/docker.sock -p 2376:2375 bobrik/socat TCP4-LISTEN:2375,fork,reuseaddr UNIX-CONNECT:/var/run/docker.sock
HBase
docker run -d --restart always --name hbase -p 20550:8080 -p 8085:8085 -p 9090:9090 -p 9095:9095 -p 16010:16010 dajobe/hbase
https://github.com/dotnetcore/DotnetSpider/wiki
Please see the Project DotnetSpider.Sample in the solution.
ADDITIONAL USAGE: Configurable Entity Spider
[DisplayName("博客园爬虫")] public class EntitySpider( IOptions<SpiderOptions> options, DependenceServices services, ILogger<Spider> logger) : Spider(options, services, logger) { public static async Task RunAsync() { var builder = Builder.CreateDefaultBuilder<EntitySpider>(options => { options.Speed = 1; }); builder.UseSerilog(); builder.IgnoreServerCertificateError(); await builder.Build().RunAsync(); } protected override async Task InitializeAsync(CancellationToken stoppingToken = default) { AddDataFlow<DataParser<CnblogsEntry>>(); AddDataFlow(GetDefaultStorage); await AddRequestsAsync( new Request( "https://news.cnblogs.com/n/page/1", new Dictionary<string, object> { { "网站", "博客园" } })); } [Schema("cnblogs", "news")] [EntitySelector(Expression = ".//div[@class='news_block']", Type = SelectorType.XPath)] [GlobalValueSelector(Expression = ".//a[@class='current']", Name = "类别", Type = SelectorType.XPath)] [GlobalValueSelector(Expression = "//title", Name = "Title", Type = SelectorType.XPath)] [FollowRequestSelector(Expressions = ["//div[@class='pager']"])] public class CnblogsEntry : EntityBase<CnblogsEntry> { protected override void Configure() { HasIndex(x => x.Title); HasIndex(x => new { x.WebSite, x.Guid }, true); } public int Id { get; set; } [Required] [StringLength(200)] [ValueSelector(Expression = "类别", Type = SelectorType.Environment)] public string Category { get; set; } [Required] [StringLength(200)] [ValueSelector(Expression = "网站", Type = SelectorType.Environment)] public string WebSite { get; set; } [StringLength(200)] [ValueSelector(Expression = "Title", Type = SelectorType.Environment)] [ReplaceFormatter(NewValue = "", OldValue = " - 博客园")] public string Title { get; set; } [StringLength(40)] [ValueSelector(Expression = "GUID", Type = SelectorType.Environment)] public string Guid { get; set; } [ValueSelector(Expression = ".//h2[@class='news_entry']/a")] public string News { get; set; } [ValueSelector(Expression = ".//h2[@class='news_entry']/a/@href")] public string Url { get; set; } [ValueSelector(Expression = ".//div[@class='entry_summary']")] [TrimFormatter] public string PlainText { get; set; } [ValueSelector(Expression = "DATETIME", Type = SelectorType.Environment)] public DateTime CreationTime { get; set; } } }
Coming soon
when you use redis scheduler, please update your redis config:timeout 0
tcp-keepalive 60
Package License Bert.RateLimiters Apache 2.0 MessagePack MIT Newtonsoft.Json MIT Dapper Apache 2.0 HtmlAgilityPack MIT ZCJ.HashedWheelTimer MIT murmurhash Apache 2.0 Serilog.AspNetCore Apache 2.0 Serilog.Sinks.Console Apache 2.0 Serilog.Sinks.RollingFile Apache 2.0 Serilog.Sinks.PeriodicBatching Apache 2.0 MongoDB.Driver Apache 2.0 MySqlConnector MIT AutoMapper.Extensions.Microsoft.DependencyInjection MIT Docker.DotNet MIT BuildBundlerMinifier Apache 2.0 Pomelo.EntityFrameworkCore.MySql MIT Quartz.AspNetCore Apache 2.0 Quartz.AspNetCore.MySqlConnector Apache 2.0 Npgsql PostgreSQL License RabbitMQ.Client Apache 2.0 Polly BSD 3-C
QQ Group: 477731655 Email: zlzforever@163.com
RetroSearch is an open source project built by @garambo | Open a GitHub Issue
Search and Browse the WWW like it's 1997 | Search results from DuckDuckGo
HTML:
3.2
| Encoding:
UTF-8
| Version:
0.7.4