SiteOne Crawler is a powerful and easy-to-use website analyzer, cloner, and converter designed for developers seeking security and performance insights, SEO specialists identifying optimization opportunities, and website owners needing reliable backups and offline versions.
Discover the SiteOne Crawler advantage:
GIF animation of the crawler in action (also available as a ▶️ video):
In short, the main benefits can be summarized in these points:
- generates `sitemap.xml` and `sitemap.txt` files with a list of all pages on your website

The following features are described in greater detail:
- respects the `robots.txt` file and will not crawl pages that are not allowed
- in the `--url` parameter you can also specify a `sitemap.xml` file (or sitemap index), which will be processed as a list of URLs. Note: gzip pre-compressed sitemaps `*.xml.gz` are not supported.
- works with local development URLs as well (e.g. `http://localhost:3000/`)
- is easy to extend: you can add your own analyzers by implementing the `Crawler\Analyzer` interface
- extracts the `Title`, `Keywords` and `Description` of pages
- lets you disable the downloading of selected asset types (`--disable-*` directives); for some types of websites the best result is achieved with the `--disable-javascript` option
- lets you specify with `--allowed-domain-for-external-files` (short `-adf`) from which external domains it is possible to **download** assets (JS, CSS, fonts, images, documents), including the `*` option for all domains
- lets you specify with `--allowed-domain-for-crawling` (short `-adc`) which other domains should be included in the crawling if there are any links pointing to them. You can enable e.g. `mysite.*` to export all language mutations that have a different TLD, or `*.mysite.tld` to export all subdomains.
- try `--disable-styles` and `--disable-fonts` and see how well you handle accessibility and semantics
- `--single-page` exports only the one page to which the URL is given (and its assets), but does not follow other pages
- `--single-foreign-page` exports only one page from another domain (if allowed by `--allowed-domain-for-crawling`), but does not follow other pages
- `--replace-content` replaces content in HTML/JS/CSS using `foo -> bar` or a regexp in PREG format, e.g. `/card[0-9]/i -> card`. Can be specified multiple times.
- `--replace-query-string` replaces characters of the query string used in the filename
- `--max-depth` sets the maximum crawling depth (for pages, not assets). `1` means `/about` or `/about/`, `2` means `/about/contacts`, etc.
- solves the handling of URLs ending with a `/`, which doesn't work with `file://` mode

An example command combining several of these export options is shown right after this list.
💡 Tip: you can push the exported markdown folder to your GitHub repository, where it will be automatically rendered as browsable documentation. You can look at examples of websites converted to markdown.

It also generates `sitemap.xml` and `sitemap.txt` for your website.

Don't hesitate and try it. You will love it as we do! ❤️
For active contributors

You can download ready-to-use releases from GitHub releases for all major platforms (Linux, Windows, macOS, arm64).
Unpack the downloaded archive and you will find the `crawler` (or `crawler.bat` on Windows) executable. Run the crawler with `./crawler --url=https://my.domain.tld`.
Note for Windows users: use the Cygwin-based release `*-win-x64.zip` only if you can't use WSL (Ubuntu/Debian), which is the recommended option. If you really have to use the Cygwin version, set `--workers=1` for higher stability.
Note for macOS users: if macOS refuses to start the crawler from your Downloads folder, move the entire crawler folder via the terminal to another location, for example to your home folder `~`.
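For example, assuming the archive was unpacked into `~/Downloads/siteone-crawler` (adjust the folder name to your download):

```bash
# move the unpacked folder out of Downloads, then run the crawler from there
mv ~/Downloads/siteone-crawler ~/
cd ~/siteone-crawler
./crawler --url=https://my.domain.tld
```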
Installation is easiest on most Linux (x64) distributions.
```bash
git clone https://github.com/janreges/siteone-crawler.git
cd siteone-crawler

# run crawler with basic options
./crawler --url=https://my.domain.tld
```
If using Windows, the best choice is to use Ubuntu or Debian in WSL.

Otherwise, you can download swoole-cli-v4.8.13-cygwin-x64.zip from the Swoole releases and use the precompiled Cygwin-based `bin/swoole-cli.exe`.
A really functional and tested Windows command looks like this (modify the paths to your `swoole-cli.exe` and `src\crawler.php`):
```bash
c:\Tools\swoole-cli-v4.8.13-cygwin-x64\bin\swoole-cli.exe C:\Tools\siteone-crawler\src\crawler.php --url=https://www.siteone.io/
```
NOTICE: Cygwin does not support STDERR with rewritable lines in the console. Therefore, the output is not as beautiful as on Linux or macOS.
If using macOS with an arm64 Apple Silicon (M1/M2) CPU, download the arm64 version swoole-cli-v4.8.13-macos-arm64.tar.xz, unpack it and use its precompiled `swoole-cli`.
If using macOS with an Intel CPU (x64), download the x64 version swoole-cli-v4.8.13-macos-x64.tar.xz, unpack it and use its precompiled `swoole-cli`.
If using arm64 Linux, you can download swoole-cli-v4.8.13-linux-arm64.tar.xz and use its precompiled `swoole-cli`.
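In all of these cases the invocation follows the same pattern as the Windows example above: run the downloaded `swoole-cli` binary with the path to `src/crawler.php` and the crawler arguments. A sketch, where both paths are placeholders for wherever you unpacked the archive and cloned the repository:

```bash
# placeholder paths - adjust to your swoole-cli location and repository checkout
~/tools/swoole-cli-v4.8.13-macos-arm64/swoole-cli ~/projects/siteone-crawler/src/crawler.php --url=https://www.siteone.io/
```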
To run the crawler, execute the `crawler` executable file from the command line and provide the required arguments:
```bash
./crawler --url=https://mydomain.tld/ --device=mobile
```
```bash
./crawler --url=https://mydomain.tld/ \
  --output=text \
  --workers=2 \
  --max-reqs-per-sec=10 \
  --memory-limit=1024M \
  --resolve='mydomain.tld:443:127.0.0.1' \
  --timeout=5 \
  --proxy=proxy.mydomain.tld:8080 \
  --http-auth=myuser:secretPassword123 \
  --user-agent="My User-Agent String" \
  --extra-columns="DOM,X-Cache(10),Title(40),Keywords(50),Description(50>),Heading1=xpath://h1/text()(20>),ProductPrice=regexp:/Price:\s*\$?(\d+(?:\.\d{2})?)/i#1(10)" \
  --accept-encoding="gzip, deflate" \
  --url-column-size=100 \
  --max-queue-length=3000 \
  --max-visited-urls=10000 \
  --max-url-length=5000 \
  --max-non200-responses-per-basename=10 \
  --include-regex="/^.*\/technologies.*/" \
  --include-regex="/^.*\/fashion.*/" \
  --ignore-regex="/^.*\/downloads\/.*\.pdf$/i" \
  --analyzer-filter-regex="/^.*$/i" \
  --remove-query-params \
  --add-random-query-params \
  --transform-url="live-site.com -> local-site.local" \
  --transform-url="/cdn\.live-site\.com/ -> local-site.local/cdn" \
  --show-scheme-and-host \
  --do-not-truncate-url \
  --output-html-report=tmp/myreport.html \
  --html-report-options="summary,seo-opengraph,visited-urls,security,redirects" \
  --output-json-file=/dir/report.json \
  --output-text-file=/dir/report.txt \
  --add-timestamp-to-output-file \
  --add-host-to-output-file \
  --offline-export-dir=tmp/mydomain.tld \
  --replace-content='/<foo[^>]+>/ -> <bar>' \
  --ignore-store-file-error \
  --sitemap-xml-file=/dir/sitemap.xml \
  --sitemap-txt-file=/dir/sitemap.txt \
  --sitemap-base-priority=0.5 \
  --sitemap-priority-increase=0.1 \
  --markdown-export-dir=tmp/mydomain.tld.md \
  --markdown-export-single-file=tmp/mydomain.tld.combined.md \
  --markdown-move-content-before-h1-to-end \
  --markdown-disable-images \
  --markdown-disable-files \
  --markdown-remove-links-and-images-from-single-file \
  --markdown-exclude-selector='.exclude-me' \
  --markdown-replace-content='/<foo[^>]+>/ -> <bar>' \
  --markdown-replace-query-string='/[a-z]+=[^&]*(&|$)/i -> $1__$2' \
  --mail-to=your.name@my-mail.tld \
  --mail-to=your.friend.name@my-mail.tld \
  --mail-from=crawler@my-mail.tld \
  --mail-from-name="SiteOne Crawler" \
  --mail-subject-template="Crawler Report for %domain% (%date%)" \
  --mail-smtp-host=smtp.my-mail.tld \
  --mail-smtp-port=25 \
  --mail-smtp-user=smtp.user \
  --mail-smtp-pass=secretPassword123
```
For a clearer list, I recommend going to the documentation: π https://crawler.siteone.io/configuration/command-line-options/
**Basic settings**

| Parameter | Description |
|-----------|-------------|
| `--url=<url>` | Required. HTTP or HTTPS URL address of the website or sitemap XML to be crawled. Use quotation marks `''` if the URL contains query parameters. |
| `--single-page` | Load only the one page to which the URL is given (and its assets), but do not follow other pages. |
| `--max-depth=<int>` | Maximum crawling depth (for pages, not assets). Default is `0` (no limit). `1` means `/about` or `/about/`, `2` means `/about/contacts`, etc. |
| `--device=<val>` | Device type for choosing a predefined User-Agent. Ignored when `--user-agent` is defined. Supported values: `desktop`, `mobile`, `tablet`. Default is `desktop`. |
| `--user-agent=<val>` | Custom User-Agent header. Use quotation marks. If specified, it takes precedence over `--device`. If you add `!` at the end, the siteone-crawler/version signature will not be appended. |
| `--timeout=<int>` | Request timeout in seconds. Default is `3`. |
| `--proxy=<host:port>` | HTTP proxy to use in `host:port` format. Host can be a hostname, IPv4 or IPv6 address. |
| `--http-auth=<user:pass>` | Basic HTTP authentication in `username:password` format. |
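As a quick illustration of the connection-related options above, a crawl through a proxy with HTTP Basic authentication and a longer timeout might look like this sketch (host names and credentials are placeholders):

```bash
# placeholder domain, proxy host and credentials - adjust to your environment
./crawler --url=https://my.domain.tld/ \
  --proxy=proxy.internal.tld:8080 \
  --http-auth=myuser:secretPassword123 \
  --timeout=10 \
  --device=mobile
```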
**Output settings**

| Parameter | Description |
|-----------|-------------|
| `--output=<val>` | Output type. Supported values: `text`, `json`. Default is `text`. |
| `--extra-columns=<values>` | Comma-delimited list of extra columns added to the output table. You can specify HTTP headers (e.g. `X-Cache`), predefined values (`Title`, `Keywords`, `Description`, `DOM`), or custom extraction rules in the format `Custom_column_name=method:pattern#group(length)`, where `method` is `xpath` or `regexp`, `pattern` is the extraction pattern, the optional `#group` specifies the regexp capture group, and `(length)` sets the maximum output length (append `>` to disable truncation). Examples: `Heading1=xpath://h1/text()(20>)` to extract the text of the first H1 element, or `ProductPrice=regexp:/Price:\s*\$?(\d+(?:\.\d{2})?)/i#1(10)`. |
| `--url-column-size=<num>` | Basic URL column width. By default, it is calculated from the size of your terminal window. |
| `--rows-limit=<num>` | Max. number of rows to display in tables with analysis results (protection against a very long and slow report). Default is `200`. |
| `--timezone=<val>` | Timezone for datetimes in HTML reports and timestamps in output folders/files, e.g. `Europe/Prague`. Default is `UTC`. Available values can be found in the Timezones Documentation. |
| `--do-not-truncate-url` | In the text output, long URLs are truncated by default to `--url-column-size` so the table does not break. This option disables the truncation. |
| `--show-scheme-and-host` | In the text output, show the scheme and host also for origin-domain URLs. |
| `--hide-progress-bar` | Hide the progress bar visible in text and JSON output for a more compact view. |
| `--no-color` | Disable colored output. |
| `--force-color` | Force colored output regardless of support detection. |
| `--show-inline-criticals` | Show criticals from the analyzer directly in the URL table. |
| `--show-inline-warnings` | Show warnings from the analyzer directly in the URL table. |
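For example, the custom-column syntax described above can be combined with predefined columns and HTTP headers in a single run. A sketch (the XPath and regexp patterns are only illustrations taken from the examples above):

```bash
# placeholder domain - the extraction patterns match the examples documented above
./crawler --url=https://my.domain.tld/ \
  --extra-columns="DOM,X-Cache(10),Title(40),Heading1=xpath://h1/text()(20>),ProductPrice=regexp:/Price:\s*\$?(\d+(?:\.\d{2})?)/i#1(10)" \
  --url-column-size=80 \
  --do-not-truncate-url
```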
**Resource filtering**

| Parameter | Description |
|-----------|-------------|
| `--disable-all-assets` | Disables crawling of all assets and files and only crawls pages in href attributes. A shortcut for activating all other `--disable-*` flags. |
| `--disable-javascript` | Disables JavaScript downloading and removes all JavaScript code from HTML, including `onclick` and other `on*` handlers. |
| `--disable-styles` | Disables CSS file downloading and at the same time removes all style definitions in `<style>` tags or inline via style attributes. |
| `--disable-fonts` | Disables font downloading and also removes all font/font-face definitions from CSS. |
| `--disable-images` | Disables downloading of all images and replaces found images in HTML with a placeholder image only. |
| `--disable-files` | Disables downloading of any files (typically downloadable documents) to which various links point. |
| `--remove-all-anchor-listeners` | Removes any event listeners from all links on the page. Useful on some types of sites built with modern JavaScript frameworks. |

**Advanced crawler settings**

| Parameter | Description |
|-----------|-------------|
| `--workers=<int>` | Maximum number of concurrent workers (threads). Default is `3`. |
| `--max-reqs-per-sec=<val>` | Max requests/s for the whole crawler. Be careful not to cause a DoS attack. Default value is `10`. |
| `--memory-limit=<size>` | Memory limit in units `M` (Megabytes) or `G` (Gigabytes). Default is `2048M`. |
| `--resolve=<host:port:ip>` | Custom DNS resolution in `domain:port:ip` format. Same as `curl --resolve`. Example: `--resolve='mydomain.tld:443:127.0.0.1'`. |
| `--allowed-domain-for-external-files=<domain>` | Primarily, the crawler crawls only URLs within the domain of the initial URL. This option allows downloading of external files (JS, CSS, fonts, images, documents) from the listed domains. Can be specified multiple times. Use `*` for all domains. |
| `--allowed-domain-for-crawling=<domain>` | This option will allow you to crawl all content from other listed domains, typically in the case of language mutations or subdomains. Can be specified multiple times. Wildcards are supported, including e.g. `*.siteone.*`. |
| `--single-foreign-page` | If crawling of other domains is allowed (using `--allowed-domain-for-crawling`), only the single linked page (and its assets) is crawled on the foreign domain, and no further pages are followed there. |
| `--include-regex=<regex>` | Regular expression compatible with PHP `preg_match()` for URLs that should be included. Can be specified multiple times. Example: `--include-regex='/^\/public\//'`. |
| `--ignore-regex=<regex>` | Regular expression compatible with PHP `preg_match()` for URLs that should be ignored. Can be specified multiple times. Example: `--ignore-regex='/^.*\/downloads\/.*\.pdf$/i'`. |
| `--regex-filtering-only-for-pages` | Set if you want filtering by the `*-regex` rules to apply only to page URLs, while static assets (JS, CSS, images, fonts, documents) are downloaded regardless of the filtering. Useful e.g. with `--include-regex='/\/sub-pages\//'` when you still need all assets for proper rendering. |
| `--analyzer-filter-regex=<regex>` | Regular expression compatible with PHP `preg_match()` applied to Analyzer class names to select which analyzers run. Example: `/(content\|accessibility)/i`, or `/^(?:(?!best\|access).)*$/i` for all analyzers except `BestPracticesAnalyzer` and `AccessibilityAnalyzer`. |
| `--accept-encoding=<val>` | Custom `Accept-Encoding` request header. Default is `gzip, deflate, br`. |
| `--remove-query-params` | Remove query parameters from found URLs. Useful on websites where a lot of links differ only in query parameters. |
| `--add-random-query-params` | Adds several random query parameters to each URL. |
| `--transform-url=<from->to>` | Transform URLs before crawling. Use `from -> to` format for simple replacement or `/regex/ -> replacement` for pattern matching. Can be specified multiple times. Example: `--transform-url="live-site.com -> local-site.local"`. |
| `--ignore-robots-txt` | Should robots.txt content be ignored? Useful for crawling an otherwise internal/private/unindexed site. |
| `--http-cache-dir=<dir>` | Cache dir for HTTP responses. You can disable the cache with `--http-cache-dir='off'`. Default is `tmp/http-client-cache`. |
| `--http-cache-compression` | Enable compression for HTTP cache storage. Saves disk space, but uses more CPU. |
| `--max-queue-length=<num>` | The maximum length of the waiting URL queue. Increase in case of large websites, but expect higher memory usage. Default is `9000`. |
| `--max-visited-urls=<num>` | The maximum number of visited URLs. Increase in case of large websites, but expect higher memory usage. Default is `10000`. |
| `--max-skipped-urls=<num>` | The maximum number of skipped URLs. Increase in case of large websites, but expect higher memory usage. Default is `10000`. |
| `--max-url-length=<num>` | The maximum supported URL length in chars. Increase in case of very long URLs, but expect higher memory usage. Default is `2083`. |
| `--max-non200-responses-per-basename=<num>` | Protection against looping with dynamic non-200 URLs. If a basename (the last part of the URL path) returns more non-200 responses than this limit, further URLs with the same basename are skipped. Default is `5`. |
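A sketch showing how the filtering and URL-transformation options above can be combined, e.g. to crawl only one section of a site while pointing requests at a local copy (the domains and patterns are placeholders reused from the examples above):

```bash
# placeholder domains and regex patterns - adjust to your site structure
./crawler --url=https://live-site.com/ \
  --include-regex="/^.*\/technologies.*/" \
  --ignore-regex="/^.*\/downloads\/.*\.pdf$/i" \
  --regex-filtering-only-for-pages \
  --transform-url="live-site.com -> local-site.local" \
  --max-reqs-per-sec=5
```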
**File export settings**

| Parameter | Description |
|-----------|-------------|
| `--output-html-report=<file>` | Save the HTML report into this file. Set to empty `''` to disable the HTML report. Default is `tmp/%domain%.report.%datetime%.html`. |
| `--html-report-options=<sections>` | Comma-separated list of sections to include in the HTML report. Available sections: `summary`, `seo-opengraph`, `image-gallery`, `video-gallery`, `visited-urls`, `dns-ssl`, `crawler-stats`, `crawler-info`, `headers`, `content-types`, `skipped-urls`, `caching`, `best-practices`, `accessibility`, `security`, `redirects`, `404-pages`, `slowest-urls`, `fastest-urls`, `source-domains`. |
| `--output-json-file=<file>` | File path for JSON output. Set to empty `''` to disable the JSON file. Default is `tmp/%domain%.output.%datetime%.json`. |
| `--output-text-file=<file>` | File path for TXT output. Set to empty `''` to disable the TXT file. Default is `tmp/%domain%.output.%datetime%.txt`. |

**Mailer options**

| Parameter | Description |
|-----------|-------------|
| `--mail-to=<email>` | Recipients of HTML e-mail reports. Optional, but required for mailer activation. Can be specified multiple times. |
| `--mail-from=<email>` | E-mail sender address. Default value is `siteone-crawler@your-hostname.com`. |
| `--mail-from-name=<val>` | E-mail sender name. Default value is `SiteOne Crawler`. |
| `--mail-subject-template=<val>` | E-mail subject template. You can use the dynamic variables `%domain%`, `%date%` and `%datetime%`. Default is `Crawler Report for %domain% (%date%)`. |
| `--mail-smtp-host=<host>` | SMTP host for sending e-mails. Default is `localhost`. |
| `--mail-smtp-port=<port>` | SMTP port for sending e-mails. Default is `25`. |
| `--mail-smtp-user=<user>` | SMTP user, if your SMTP server requires authentication. |
| `--mail-smtp-pass=<pass>` | SMTP password, if your SMTP server requires authentication. |
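As an illustration of the report and mailer options above, the following sketch generates a trimmed-down HTML report and mails it after the crawl finishes (the addresses and the SMTP host are placeholders):

```bash
# placeholder domain, e-mail addresses and SMTP host - adjust to your environment
./crawler --url=https://my.domain.tld/ \
  --output-html-report=tmp/myreport.html \
  --html-report-options="summary,seo-opengraph,visited-urls,security,redirects" \
  --mail-to=your.name@my-mail.tld \
  --mail-from=crawler@my-mail.tld \
  --mail-smtp-host=smtp.my-mail.tld \
  --mail-smtp-port=25
```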
**Upload options**

| Parameter | Description |
|-----------|-------------|
| `--upload` | Enable upload of the HTML report to `--upload-to`. |
| `--upload-to=<url>` | URL of the endpoint where to send the HTML report. Default value is `https://crawler.siteone.io/up`. |
| `--upload-retention=<val>` | How long should the HTML report be kept in the online version? Default is `30d`. |
| `--upload-password=<val>` | Optional password, which must be entered (the user will be 'crawler') to view the online HTML report. |
| `--upload-timeout=<int>` | Upload timeout in seconds. Default value is `3600`. |
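A minimal sketch of uploading the HTML report to the online viewer with a password and a custom retention (assuming shorter values such as `7d` are accepted in the same day-suffixed format as the `30d` default):

```bash
# placeholder domain and password; retention format assumed to match the 30d default
./crawler --url=https://my.domain.tld/ \
  --upload \
  --upload-retention=7d \
  --upload-password=secretPassword123
```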
**Offline exporter options**

| Parameter | Description |
|-----------|-------------|
| `--offline-export-dir=<dir>` | Path to the directory where to save the offline version of the website. If the target directory does not exist, the crawler will try to create it. |
| `--offline-export-store-only-url-regex=<regex>` | For debugging: when filled, it activates debug mode and stores only URLs which match one of these regexes. Can be specified multiple times. |
| `--offline-export-remove-unwanted-code=<1/0>` | Remove unwanted code for offline mode? Typically JS of the analytics, social networks, etc. Default is `1`. |
| `--offline-export-no-auto-redirect-html` | Disables the automatic creation of redirect HTML files for each subfolder that contains an `index.html`. This solves situations for URLs where sometimes the URL ends with a slash and sometimes it does not. |
| `--replace-content=<val>` | Replace content in HTML/JS/CSS with `foo -> bar` or a regexp in PREG format, e.g. `/card[0-9]/i -> card`. Can be specified multiple times. |
| `--replace-query-string=<val>` | Instead of replacing the whole query string in the filename with a short hash, just replace some characters. Use `foo -> bar` or a regexp in PREG format. |
| `--ignore-store-file-error` | Enable this option to ignore any file-storing errors. The export process will continue. |

**Markdown exporter options**

| Parameter | Description |
|-----------|-------------|
| `--markdown-export-dir=<dir>` | Path to the directory where to save the markdown version of the website. |
| `--markdown-export-single-file=<file>` | Path to a file where to save all markdown files combined into one document. Requires `--markdown-export-dir` to be set. Ideal for AI tools that need to process the entire website content in one go. |
| `--markdown-move-content-before-h1-to-end` | Move all content before the main H1 heading (typically the header with the menu) to the end of the markdown. |
| `--markdown-disable-images` | Do not export and show images in markdown files. |
| `--markdown-disable-files` | Do not export and link files other than HTML/CSS/JS/fonts/images, e.g. PDF, ZIP, etc. |
| `--markdown-remove-links-and-images-from-single-file` | Remove links and images from the combined single markdown file. Useful for AI tools that don't need these elements. Requires `--markdown-export-single-file` to be set. |
| `--markdown-exclude-selector=<val>` | Exclude some page content (DOM elements) from the markdown export, defined by CSS selectors like `header`, `.header`, `#header`, etc. |
| `--markdown-replace-content=<val>` | Replace text content with `foo -> bar` or a regexp in PREG format, e.g. `/card[0-9]/i -> card`. |
| `--markdown-replace-query-string=<val>` | Instead of replacing the whole query string in the filename with a short hash, just replace some characters. |
| `--markdown-export-store-only-url-regex=<regex>` | For debugging: when filled, it activates debug mode and stores only URLs which match one of these regexes. Can be specified multiple times. |
| `--markdown-ignore-store-file-error` | Ignores any file-storing errors. The export process will continue. |
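For instance, the markdown exporter options above can be combined to produce a single combined markdown file suitable for feeding into an AI tool. A sketch (the paths and the CSS selector are placeholders):

```bash
# placeholder domain, output paths and CSS selector - adjust to your site
./crawler --url=https://my.domain.tld/ \
  --markdown-export-dir=tmp/mydomain.tld.md \
  --markdown-export-single-file=tmp/mydomain.tld.combined.md \
  --markdown-remove-links-and-images-from-single-file \
  --markdown-exclude-selector='.cookie-banner' \
  --markdown-move-content-before-h1-to-end
```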
**Sitemap options**

| Parameter | Description |
|-----------|-------------|
| `--sitemap-xml-file=<file>` | File path where the generated XML sitemap will be saved. The `.xml` extension is automatically added if not specified. |
| `--sitemap-txt-file=<file>` | File path where the generated TXT sitemap will be saved. The `.txt` extension is automatically added if not specified. |
| `--sitemap-base-priority=<num>` | Base priority for the XML sitemap. Default value is `0.5`. |
| `--sitemap-priority-increase=<num>` | Priority increase value based on the slash count in the URL. Default value is `0.1`. |
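A short sketch of generating both sitemap formats with the priority settings above (the output paths are placeholders):

```bash
# placeholder domain and output paths
./crawler --url=https://my.domain.tld/ \
  --sitemap-xml-file=tmp/sitemap.xml \
  --sitemap-txt-file=tmp/sitemap.txt \
  --sitemap-base-priority=0.5 \
  --sitemap-priority-increase=0.1
```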
**Expert settings**

| Parameter | Description |
|-----------|-------------|
| `--debug` | Activate debug mode. |
| `--debug-log-file=<file>` | Log file where to save debug messages. When `--debug` is not set and `--debug-log-file` is set, debug output is written only to this file. |
| `--debug-url-regex=<regex>` | Regex for URL(s) to debug. When a crawled URL is matched, parsing, URL replacing and other actions are logged for it. Can be specified multiple times. |
| `--result-storage=<val>` | Result storage type for content and headers. Values: `memory` or `file`. Use `file` for large websites. Default value is `memory`. |
| `--result-storage-dir=<dir>` | Directory for `--result-storage=file`. Default value is `tmp/result-storage`. |
| `--result-storage-compression` | Enable compression for result storage. Saves disk space, but uses more CPU. |
| `--http-cache-dir=<dir>` | Cache dir for HTTP responses. You can disable the cache with `--http-cache-dir='off'`. Default is `tmp/http-client-cache`. |
| `--http-cache-compression` | Enable compression for HTTP cache storage. Saves disk space, but uses more CPU. |
| `--websocket-server=<host:port>` | Start the crawler with a websocket server on the given host:port, e.g. `0.0.0.0:8000`. The crawler sends progress messages such as `{"type":"urlResult","url":"...","statusCode":200,"size":4528,"execTime":0.823}`. |
| `--console-width=<int>` | Enforce a fixed console width, disabling automatic detection. |
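A sketch of crawling a large website with file-based result storage and the HTTP cache, while broadcasting progress over the websocket server described above (the port and directories are placeholders):

```bash
# placeholder domain, directories and websocket port
./crawler --url=https://my.domain.tld/ \
  --result-storage=file \
  --result-storage-dir=tmp/result-storage \
  --result-storage-compression \
  --http-cache-dir=tmp/http-client-cache \
  --websocket-server=0.0.0.0:8000
```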
**Fastest URL analyzer**

| Parameter | Description |
|-----------|-------------|
| `--fastest-urls-top-limit=<int>` | Number of URLs in the TOP fastest list. Default is `20`. |
| `--fastest-urls-max-time=<val>` | Maximum response time for a URL to be considered fast. Default is `1`. |

**SEO and OpenGraph analyzer**

| Parameter | Description |
|-----------|-------------|
| `--max-heading-level=<int>` | Max heading level from 1 to 6 for the analysis. Default is `3`. |

**Slowest URL analyzer**

| Parameter | Description |
|-----------|-------------|
| `--slowest-urls-top-limit=<int>` | Number of URLs in the TOP slowest list. Default is `20`. |
| `--slowest-urls-min-time=<val>` | Minimum response time threshold for slow URLs. Default is `0.01`. |
| `--slowest-urls-max-time=<val>` | Maximum response time for a URL to be considered very slow. Default is `3`. |
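A sketch of tuning the analyzer thresholds above, e.g. to get longer TOP lists and a stricter definition of a slow URL (the threshold values are illustrative only):

```bash
# placeholder domain and illustrative threshold values
./crawler --url=https://my.domain.tld/ \
  --fastest-urls-top-limit=30 \
  --slowest-urls-top-limit=30 \
  --slowest-urls-min-time=0.5 \
  --max-heading-level=4
```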
To understand the richness of the data provided by the crawler, you can examine real output examples generated from crawling crawler.siteone.io:

- docs/OUTPUT-crawler.siteone.io.txt
- docs/OUTPUT-crawler.siteone.io.json
These examples showcase the various tables and metrics generated, demonstrating the tool's capabilities in analyzing website structure, performance, SEO, security, and more.
If you have any suggestions or feature requests, please open an issue on GitHub. We'd love to hear from you!
Contributions with improvements, bug fixes, and new features are welcome. Please open a pull request :-)
Motivation to create this tool

If you are interested in the author's motivation for creating this tool, read it on the project website.
Please use responsibly and ensure that you have the necessary permissions when crawling websites. Some sites may have rules against automated access detailed in their robots.txt.
The author is not responsible for any consequences caused by inappropriate use or deliberate misuse of this tool.
This work is released under an open-source license; see the LICENSE file in the project repository for details.