This library combines HTTP, Gumbo and Cascadia to provide a simpler way to scrape data.
Based on tidyverse/rvest.
    using TidierVest

    starwars = read_html("https://rvest.tidyverse.org/articles/starwars.html")
    titles = html_elements(starwars, ["section", "h2"]) |> html_text3
    titles
    # 7-element Vector{String}:
    #  "The Phantom Menace"
    #  "Attack of the Clones"
    #  "Revenge of the Sith"
    #  ⋮
    #  "Return of the Jedi"
    #  "The Force Awakens"

    html = read_html("https://en.wikipedia.org/w/index.php?title=The_Lego_Movie&oldid=998422565")
    table = html_elements(html, ".tracklist") |> html_table
    table
    # 28×4 DataFrame
    #  Row │ No.     Title                              Performer(s)                       Length
    #      │ String  String                             String                             String
    # ─────┼─────────────────────────────────────────────────────────────────────────────────────
    #    1 │ 1.      "Everything Is Awesome"            Tegan and Sara featuring The Lon…  2:43
    #    2 │ 2.      "Prologue"                                                            2:28
    #    3 │ 3.      "Emmett's Morning"                                                    2:00
    #    4 │ 4.      "Emmett Falls in Love"                                                1:11
    #    5 │ 5.      "Escape"                                                              3:26
    #   ⋮  │ ⋮       ⋮                                  ⋮                                  ⋮
    #   25 │ 25.     "Everything Is Awesome"            Jo Li (Joshua Bartholomew and Li…  1:26
    #   26 │ 26.     "Everything Is Awesome (unplugge…  Shawn Patterson and Sammy Allen    1:24
    #   27 │ 27.     "Untitled Self Portrait"           Will Arnett                        1:08
    #   28 │ 28.     "Everything Is Awesome (instrume…                                     2:41
    #                                                                           19 rows omitted
Read a URL
Parse a string into an HTML document type
Get the elements you want from an HTML document
Get the text. You can also use html_text2 or html_text3 for cleaner text.
Get the content of an attribute. If no string is provided, it tries to pick an attribute for you.
Create a DataFrame from an HTML Table node
Return the children of an HTML element
Create an HTML document from inline HTML
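The select, text and attribute steps described above can be tried offline with the two parsing libraries this package combines. The following is a minimal sketch, assuming Gumbo and Cascadia are installed. It calls those libraries directly rather than TidierVest's own functions, so it should not be read as how TidierVest is implemented. The inline HTML, the ids and the hrefs are made up for illustration.

```julia
using Gumbo      # HTML parser: parsehtml, getattr
using Cascadia   # CSS selectors: Selector, nodeText

# Hypothetical inline HTML, so the example needs no network access.
page = """
<html><body>
<section><h2 id="ep1">The Phantom Menace</h2><a href="/ep1">details</a></section>
<section><h2 id="ep2">Attack of the Clones</h2><a href="/ep2">details</a></section>
</body></html>
"""

# Parse the string into a Gumbo HTMLDocument.
doc = parsehtml(page)

# Select every <h2> nested inside a <section>, in document order.
headings = eachmatch(Selector("section h2"), doc.root)

# Extract the text content of each selected element.
titles = [nodeText(h) for h in headings]

# Read a single attribute from each element.
ids = [getattr(h, "id") for h in headings]
links = [getattr(a, "href") for a in eachmatch(Selector("a"), doc.root)]

println(titles)
println(ids)
println(links)
```

Working on an inline string like this is a convenient way to check a selector before pointing the same logic at a live page.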