The climate-news-db has two goals:

1. `make crawl` pulls `urls.jsonl` from S3 and crawls articles into `articles/{newspaper}.jsonl` and into the database (a sketch of this step follows this list).
2. `make regen-db` takes URLs from `articles/{newspaper}.jsonl` and saves them into the database. This is useful when you want to re-create the database without scraping articles (see the sketch after the diagram below).
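A minimal sketch of the crawl step, assuming a `boto3`/`requests` stack; the bucket name, field names, and the newspaper-from-domain mapping are illustrative assumptions, not the project's actual code:

```python
"""Sketch of `make crawl`: pull urls.jsonl from S3, fetch each page, and
append one JSON object per article to articles/{newspaper}.jsonl.
Bucket name, fields, and newspaper detection are illustrative assumptions."""
import json
from pathlib import Path
from urllib.parse import urlparse

import boto3
import requests


def crawl(bucket: str = "climate-news-db", key: str = "urls.jsonl") -> None:
    # Pull the raw URL log from S3 (bucket/key are assumptions)
    s3 = boto3.client("s3")
    s3.download_file(bucket, key, "urls.jsonl")

    Path("articles").mkdir(exist_ok=True)
    with open("urls.jsonl") as fp:
        for line in fp:
            record = json.loads(line)
            url = record["url"]
            # Crude stand-in for the project's real newspaper mapping
            newspaper = urlparse(url).netloc
            resp = requests.get(url, timeout=30)
            article = {"url": url, "body": resp.text}
            # Append-only, one JSON object per line
            with open(f"articles/{newspaper}.jsonl", "a") as out:
                out.write(json.dumps(article) + "\n")


if __name__ == "__main__":
    crawl()
```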
## Interactive Search for Getting URLs

Requires Go + Gum:

```shell
$ ./scripts/search-cli.sh
```
```mermaid
graph LR
  1(urls.jsonl) -->|make crawl| 2(articles.jsonl)
  2(articles.jsonl) -->|make crawl, make regen-db| 3(database)
```
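The `make regen-db` edge of the diagram rebuilds the database straight from the per-newspaper JSONL files, with no network access. A minimal sketch, assuming a SQLite target; the schema and field names are illustrative assumptions rather than the project's actual code:

```python
"""Sketch of `make regen-db`: rebuild the database from
articles/{newspaper}.jsonl without re-scraping. The SQLite schema and
field names here are illustrative assumptions."""
import json
import sqlite3
from pathlib import Path


def regen_db(articles_dir: str = "articles", db_path: str = "climate.db") -> None:
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS articles "
        "(url TEXT PRIMARY KEY, newspaper TEXT, body TEXT)"
    )
    for path in Path(articles_dir).glob("*.jsonl"):
        newspaper = path.stem  # articles/{newspaper}.jsonl
        with path.open() as fp:
            for line in fp:
                article = json.loads(line)
                conn.execute(
                    "INSERT OR REPLACE INTO articles VALUES (?, ?, ?)",
                    (article["url"], newspaper, article.get("body", "")),
                )
    conn.commit()
    conn.close()


if __name__ == "__main__":
    regen_db()
```

`INSERT OR REPLACE` keeps the rebuild idempotent, so the database can be regenerated repeatedly from the same files.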
{"url": "https://www.chinadaily.com.cn/a/202302/21/WS63f4aea4a31057c47ebb004e.html", "search_time_utc": "2023-03-20T00:05:02.998560"} {"url": "https://www.chinadaily.com.cn/a/202301/19/WS63c8a4a8a31057c47ebaa8e4.html", "search_time_utc": "2023-03-20T00:05:02.998560"}
Append-only storage of raw newspaper URLs, created by a daily Google search for each newspaper with the keywords *climate change* and *climate crisis*. This file contains many duplicates.
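Because the file is append-only, consumers have to deduplicate on the `url` field themselves. A minimal sketch that keeps the first (and, since the log is chronological, earliest) record per URL:

```python
"""Sketch of deduplicating the append-only urls.jsonl on the url field."""
import json


def unique_urls(path: str = "urls.jsonl") -> list[dict]:
    seen: dict[str, dict] = {}
    with open(path) as fp:
        for line in fp:
            record = json.loads(line)
            # First occurrence wins, keeping the earliest search_time_utc
            seen.setdefault(record["url"], record)
    return list(seen.values())


if __name__ == "__main__":
    for record in unique_urls():
        print(record["url"])
```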
Deployed as a Fly.io app:
Deployed with AWS CDK: