Elastic Open Crawler is a lightweight, open code web crawler designed for discovering, extracting, and indexing web content directly into Elasticsearch.
This CLI-driven tool streamlines web content ingestion into Elasticsearch, enabling easy searchability through on-demand or scheduled crawls defined by configuration files.
Note
This repository contains code and documentation for the Elastic Open Web Crawler. Docker images are available for the crawler at the Elastic Docker registry.
Important
The Open Crawler is currently in beta. Beta features are subject to change and are not covered by the support SLA of generally available (GA) features. Elastic plans to promote this feature to GA in a future release.
| Elasticsearch | Open Crawler | Operating System |
|---------------|--------------|------------------|
| 8.x | v0.2.0 and above | Linux, OSX |
| 9.x | v0.2.1 and above | Linux, OSX |
Get from zero to crawling your website into Elasticsearch in just a few steps.
Step 1: Run a test crawl

First, let's test that the crawler works on your system by crawling a simple website and printing the results to your terminal. We'll create a basic config file and run the crawler against https://example.com.
Run the following in your terminal:
```bash
cat > crawl-config.yml << EOF
output_sink: console
domains:
  - url: https://example.com
EOF

docker run \
  -v "$(pwd)":/config \
  -it docker.elastic.co/integrations/crawler:latest jruby \
  bin/crawler crawl /config/crawl-config.yml
```
The `-v "$(pwd)":/config` flag maps your current directory to the container's `/config` directory, making your config file available to the crawler.
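If your config file lives somewhere other than the current directory, the same command works with an explicit path; the directory below is a placeholder.

```bash
# Same invocation, mounting a specific directory instead of $(pwd).
# /path/to/your/configs is a placeholder – substitute your own directory.
docker run \
  -v /path/to/your/configs:/config \
  -it docker.elastic.co/integrations/crawler:latest jruby \
  bin/crawler crawl /config/crawl-config.yml
```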
✅ Success check: You should see HTML content from example.com printed to your console, ending with `[primary] Finished a crawl. Result: success;`.
Before proceeding with Step 2, make sure you have a running Elasticsearch instance. See prerequisites.

Step 2: Gather your connection details

For this step you'll need:

- Your Elasticsearch endpoint URL
- An Elasticsearch API key
For step-by-step guidance on finding endpoint URLs and creating API keys in the UI, see connection details.
If you'd prefer to create an API key in the Dev Tools Console, run the following command:
```console
POST /_security/api_key
{
  "name": "crawler-key",
  "role_descriptors": {
    "crawler-role": {
      "cluster": ["monitor"],
      "indices": [
        {
          "names": ["web-crawl-*"],
          "privileges": ["write", "create_index", "monitor"]
        }
      ]
    }
  }
}
```
Save the `encoded` value from the response - this is your API key.
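For reference, the response has roughly this shape (values elided); the `encoded` field is the Base64 encoding of `id:api_key`, which is the format the `Authorization: ApiKey` header expects:

```json
{
  "id": "...",
  "name": "crawler-key",
  "api_key": "...",
  "encoded": "..."
}
```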
Step 3: Set environment variables

Tip

If you prefer not to use environment variables (or are on a system where they don't work as expected), you can skip this step and manually edit the configuration file in Step 4.

Set your connection details and target website as environment variables, replacing the placeholder values with your own.
```bash
export ES_HOST="https://your-deployment.es.region.aws.elastic.cloud"
export ES_PORT="443"
export ES_API_KEY="your_encoded_api_key_here"
export TARGET_WEBSITE="https://your-website.com"
```
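To verify the connection details before crawling, you can call the cluster root endpoint; with a valid host, port, and API key it returns the cluster name and version.

```bash
# Optional sanity check: a valid endpoint and API key return cluster info.
curl -s "${ES_HOST}:${ES_PORT}/" \
  -H "Authorization: ApiKey ${ES_API_KEY}"
```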
Note
Connection settings differ based on where Elasticsearch is running (e.g., cloud hosted, serverless, or localhost).
- `ES_HOST`: Your Elasticsearch endpoint URL. For example:
  - `https://your-deployment.es.region.aws.elastic.cloud` for an Elastic Cloud deployment
  - `http://host.docker.internal` if Elasticsearch is running locally but not in the same Docker network
  - `http://elasticsearch` if Elasticsearch is running in a Docker container on the same network
- `ES_PORT`: Elasticsearch port, typically `443` for Elastic Cloud or `9200` for a local instance
- `ES_API_KEY`: API key from Step 2
- `TARGET_WEBSITE`: Website to crawl. The URL must not include a path or trailing slash (`/`) or you'll hit an error: `ArgumentError: Domain "https://www.example.com/" cannot have a path`. See the example after this list.
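To make the path restriction concrete:

```bash
# Valid: scheme and host only
export TARGET_WEBSITE="https://www.example.com"

# Invalid: the trailing slash counts as a path and raises
# ArgumentError: Domain "https://www.example.com/" cannot have a path
# export TARGET_WEBSITE="https://www.example.com/"
```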
Step 4: Create the configuration file

Create your crawler config file by running the following command. This will use the environment variables you set in Step 3 to populate the configuration file automatically.
```bash
cat > crawl-config.yml << EOF
output_sink: elasticsearch
output_index: web-crawl-test
elasticsearch:
  host: $ES_HOST
  port: $ES_PORT
  api_key: $ES_API_KEY
  pipeline_enabled: false
domains:
  - url: $TARGET_WEBSITE
EOF
```
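Before moving on, you can print the generated file to confirm the variables were substituted correctly.

```bash
# host, port, and api_key should show real values, not empty fields.
cat crawl-config.yml
```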
If you skipped Step 3 or the environment variables aren't working on your computer, create the config file and replace the placeholders manually.
Manual configuration

```bash
cat > crawl-config.yml << 'EOF'
output_sink: elasticsearch
output_index: web-crawl-test
elasticsearch:
  host: https://your-deployment.es.region.aws.elastic.cloud # Your ES_HOST
  port: 443 # Your ES_PORT (443 for cloud, 9200 for localhost)
  api_key: your_encoded_api_key_here # Your ES_API_KEY from Step 2
  pipeline_enabled: false
domains:
  - url: https://your-website.com # Your target website
EOF
```

Step 5: Crawl and ingest into Elasticsearch
Now you can ingest your target website content into Elasticsearch:
```bash
docker run \
  -v "$(pwd)":/config \
  -it docker.elastic.co/integrations/crawler:latest jruby \
  bin/crawler crawl /config/crawl-config.yml
```
✅ Success check: You should see messages like:
```
Connected to ES at https://your-endpoint - version: 8.x.x
Index [web-crawl-test] was found!
Elasticsearch sink initialized
```
Now that the crawl is complete, you can view the indexed data in Elasticsearch:
Use the API

The fastest way is to use `curl` from the command line. This reuses the environment variables you set earlier.

```bash
curl -X GET "${ES_HOST}:${ES_PORT}/web-crawl-test/_search" \
  -H "Authorization: ApiKey ${ES_API_KEY}" \
  -H "Content-Type: application/json"
```
Alternatively, run the following API call in the Dev Tools Console:
```console
GET /web-crawl-test/_search
```
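If you only want to confirm how many documents were ingested, the `_count` API returns a simple document count:

```console
GET /web-crawl-test/_count
```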
Use Kibana/Serverless UI

You can also browse the ingested documents in the Kibana or Serverless UI by opening the `web-crawl-test` index.

To customize future crawls, learn about the `crawler.yml` and `elasticsearch.yml` files.

You can build and run the Open Crawler locally using the provided setup instructions. Detailed setup steps, including environment requirements, are in the Developer Guide.
Want to contribute? We welcome bug reports, code contributions, and documentation improvements. Read the Contributing Guide for contribution types, PR guidelines, and coding standards.
Learn how to get help, report issues, and find community resources in the Support Guide.