org-protocol is awesome, but browsers do a pretty poor job of turning a page’s HTML content into plain-text. However, Pandoc supports converting from HTML to org-mode, so we can use it to turn HTML into Org-mode content! It can even turn HTML tables into Org tables!
Here’s an example of what you get in Emacs from capturing this page:
curl, to download URLs (if you use it in that mode).
Put org-protocol-capture-html.el in your load-path and add to your init file:
(require 'org-protocol-capture-html)
You need a suitable org-capture template. I recommend this one. Whatever you choose, the default selection key is w, so if you want to use a different key, you’ll need to modify the script and the bookmarklets.
("w" "Web site" entry (file "") "* %a :website:\n\n%U %?\n\n%:initial")
Now you need to make a bookmarklet in your browser(s) of choice. You can select text in the page when you capture and it will be copied into the template, or you can just capture the page title and URL. A selection-grabbing function is used to capture the selection.
Note: The w in the URL in these bookmarklets chooses the corresponding capture template. You can leave it out if you want to be prompted for the template, or change it to another letter for a different template key.
This bookmarklet captures what is currently selected in the browser. Or if nothing is selected, it just captures the page’s URL and title.
javascript:location.href = 'org-protocol://capture-html?template=w&url=' + encodeURIComponent(location.href) + '&title=' + encodeURIComponent(document.title || "[untitled page]") + '&body=' + encodeURIComponent(function () {var html = ""; if (typeof document.getSelection != "undefined") {var sel = document.getSelection(); if (sel.rangeCount) {var container = document.createElement("div"); for (var i = 0, len = sel.rangeCount; i < len; ++i) {container.appendChild(sel.getRangeAt(i).cloneContents());} html = container.innerHTML;}} else if (typeof document.selection != "undefined") {if (document.selection.type == "Text") {html = document.selection.createRange().htmlText;}} var relToAbs = function (href) {var a = document.createElement("a"); a.href = href; var abs = a.protocol + "//" + a.host + a.pathname + a.search + a.hash; a.remove(); return abs;}; var elementTypes = [['a', 'href'], ['img', 'src']]; var div = document.createElement('div'); div.innerHTML = html; elementTypes.map(function(elementType) {var elements = div.getElementsByTagName(elementType[0]); for (var i = 0; i < elements.length; i++) {elements[i].setAttribute(elementType[1], relToAbs(elements[i].getAttribute(elementType[1])));}}); return div.innerHTML;}());
This one uses eww’s built-in readability-scoring function in Emacs 25.1 and up to capture the article or main content of the page.
javascript:location.href = 'org-protocol://capture-eww-readable?template=w&url=' + encodeURIComponent(location.href) + '&title=' + encodeURIComponent(document.title || "[untitled page]");
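Both bookmarklets assemble the same kind of URL: a handler name followed by URL-encoded query parameters. The following is a minimal sketch of that assembly, not code from this package; the helper name buildCaptureUrl and its argument shape are hypothetical and exist only for illustration.

```javascript
// Sketch only: buildCaptureUrl is a hypothetical helper, not part of this package.
// It mirrors how the bookmarklets concatenate encoded parameters.
function buildCaptureUrl(handler, params) {
  var query = Object.keys(params)
    .filter(function (key) { return params[key] !== undefined; })
    .map(function (key) { return key + '=' + encodeURIComponent(params[key]); })
    .join('&');
  return 'org-protocol://' + handler + '?' + query;
}

// With a template key, as in the bookmarklets above.
var withTemplate = buildCaptureUrl('capture-html', {
  template: 'w',
  url: 'https://example.com/a page',
  title: '' || '[untitled page]', // same empty-title fallback the bookmarklets use
  body: '<b>hi</b>'
});

// Without a template key, so Emacs prompts for the template.
var withoutTemplate = buildCaptureUrl('capture-eww-readable', {
  url: 'https://example.com/',
  title: 'Example'
});

console.log(withTemplate);
console.log(withoutTemplate);
```

Dropping the template entry from the parameters corresponds to leaving the w out of the URL by hand.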
Note: When you click on one of these bookmarklets for the first time, Firefox will ask what program to use to handle the org-protocol protocol. You can simply choose the default program that appears (org-protocol).
If you use Pentadactyl, you can use the Firefox bookmarklets above, or you can put these commands in your .pentadactylrc:
map -modes=n,v ch -javascript content.location.href = 'org-protocol://capture-html?template=w&url=' + encodeURIComponent(content.location.href) + '&title=' + encodeURIComponent(content.document.title || "[untitled page]") + '&body=' + encodeURIComponent(function () {var html = ""; if (typeof content.document.getSelection != "undefined") {var sel = content.document.getSelection(); if (sel.rangeCount) {var container = document.createElement("div"); for (var i = 0, len = sel.rangeCount; i < len; ++i) {container.appendChild(sel.getRangeAt(i).cloneContents());} html = container.innerHTML;}} else if (typeof document.selection != "undefined") {if (document.selection.type == "Text") {html = document.selection.createRange().htmlText;}} var relToAbs = function (href) {var a = content.document.createElement("a"); a.href = href; var abs = a.protocol + "//" + a.host + a.pathname + a.search + a.hash; a.remove(); return abs;}; var elementTypes = [['a', 'href'], ['img', 'src']]; var div = content.document.createElement('div'); div.innerHTML = html; elementTypes.map(function(elementType) {var elements = div.getElementsByTagName(elementType[0]); for (var i = 0; i < elements.length; i++) {elements[i].setAttribute(elementType[1], relToAbs(elements[i].getAttribute(elementType[1])));}}); return div.innerHTML;}())
map -modes=n,v ce -javascript location.href='org-protocol://capture-eww-readable?template=w&url='+encodeURIComponent(content.location.href)+'&title='+encodeURIComponent(content.document.title || "[untitled page]")
Note: The JavaScript objects are slightly different for running as Pentadactyl commands since it has its own chrome.
These bookmarklets work in Chrome:
javascript:location.href = 'org-protocol:///capture-html?template=w&url=' + encodeURIComponent(location.href) + '&title=' + encodeURIComponent(document.title || "[untitled page]") + '&body=' + encodeURIComponent(function () {var html = ""; if (typeof window.getSelection != "undefined") {var sel = window.getSelection(); if (sel.rangeCount) {var container = document.createElement("div"); for (var i = 0, len = sel.rangeCount; i < len; ++i) {container.appendChild(sel.getRangeAt(i).cloneContents());} html = container.innerHTML;}} else if (typeof document.selection != "undefined") {if (document.selection.type == "Text") {html = document.selection.createRange().htmlText;}} var relToAbs = function (href) {var a = document.createElement("a"); a.href = href; var abs = a.protocol + "//" + a.host + a.pathname + a.search + a.hash; a.remove(); return abs;}; var elementTypes = [['a', 'href'], ['img', 'src']]; var div = document.createElement('div'); div.innerHTML = html; elementTypes.map(function(elementType) {var elements = div.getElementsByTagName(elementType[0]); for (var i = 0; i < elements.length; i++) {elements[i].setAttribute(elementType[1], relToAbs(elements[i].getAttribute(elementType[1])));}}); return div.innerHTML;}());
javascript:location.href = 'org-protocol:///capture-eww-readable?template=w&url=' + encodeURIComponent(location.href) + '&title=' + encodeURIComponent(document.title || "[untitled page]");
Note: The first sets of slashes are tripled compared to the Firefox bookmarklets. When testing with Chrome, I found that xdg-open was collapsing the double-slashes into single-slashes, which breaks org-protocol. I’m not sure why that doesn’t seem to be necessary for Firefox. If you have any trouble with this, you might try removing the extra slashes.
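If you keep a single copy of a bookmarklet URL and want both variants, the slash difference is small enough to apply mechanically. This is a rough sketch under that assumption; the function name tripleSlashes is made up for illustration and is not part of this package.

```javascript
// Sketch only: tripleSlashes is a hypothetical helper.
// It turns the double-slash form into the triple-slash form used for Chrome
// and leaves URLs that already have three slashes untouched.
function tripleSlashes(url) {
  return url.replace(/^org-protocol:\/\/(?!\/)/, 'org-protocol:///');
}

console.log(tripleSlashes('org-protocol://capture-html?template=w'));
console.log(tripleSlashes('org-protocol:///capture-html?template=w'));
```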
The shell script is handy for piping any HTML (or plain-text) content to Org through the shell, or downloading and capturing any URL directly (without a browser), but it’s not required. It requires getopt, part of the util-linux package, which should be standard on most Linux distros. On OS X you may need to install getopt or util-linux from MacPorts or Homebrew, etc.
You can use it like this:
org-protocol-capture-html.sh [OPTIONS] [HTML]
cat html | org-protocol-capture-html.sh [OPTIONS]

Send HTML to Emacs through org-protocol, passing it through Pandoc to
convert HTML to Org-mode. HTML may be passed as an argument or
through STDIN. If only URL is given, it will be downloaded and its
contents used.

Options:
    -h, --heading HEADING     Heading
    -r, --readability         Capture web page article with eww-readable
    -t, --template TEMPLATE   org-capture template key (default: w)
    -u, --url URL             URL

    --debug  Print debug info
    --help   I need somebody!
After installing the bookmarklets, you can select some text on a web page with your mouse, open the bookmarklet with the browser, and Emacs should pop up an Org capture buffer. You can also do it without selecting text first, if you just want to capture a link to the page.
You can also pass data through the shell script, for example:
dmesg | grep -i sata | org-protocol-capture-html.sh --heading "dmesg SATA messages" --template i
org-protocol-capture-html.sh --readability --url "https://lwn.net/Articles/615220/"
org-protocol-capture-html.sh -h "TODO Feed the cat!" -t i "He gets grouchy if I forget!"
org-protocol-capture-html.sh. (#31. Thanks to Sam Pillsworth.)
dom.
eww-readable support.
org-protocol links to the new-style ones used in Org 9. Note: This requires updating existing bookmarklets to use the new-style links. See the examples in the usage instructions. Users who are unable to upgrade to Org 9 should use the previous version of this package.
python-readability support and just use eww-readable. eww-readable seems to work so well that it seems unnecessary to bother with external tools. Of course, this does require Emacs 25.1, so users on Emacs 24 may wish to use the previous version.
org-protocol-capture-eww-readable. For Emacs 25.1 and up, this uses eww’s built-in readability-style function instead of calling external Python scripts.
org-protocol-capture-html-demote-times variable, which controls how many times headings in captured pages are demoted. This is handy if you use a sub-heading in your capture template, so you can make all the headings in captured pages lower than the lowest-level heading in your capture template.
sleep-for instead of sit-for to work around any potential issues with whatever “input” may interrupt sit-for. Hopefully this puts issue #12 to rest for good. Thanks to @jguenther for his help fixing and reporting bugs.
org-protocol would have nothing where the title should go, and this would cause the capture to fail. Now the bookmarklets will use [untitled page] instead of an empty string. (No Elisp code changed, only the examples in the readme.)
--no-wrap deprecation), thanks to @jguenther.
cl and use cl-incf instead of incf.
>= 1.16, which deprecates --no-wrap in favor of --wrap=none.
org-protocol-capture-html.sh -u http://example.com and it will download and capture the page.
org-capture template to the readme. This will make it much easier for new users.
Create the file ~/.local/share/applications/org-protocol.desktop containing:
[Desktop Entry]
Name=org-protocol
Exec=emacsclient %u
Type=Application
Terminal=false
Categories=System;
MimeType=x-scheme-handler/org-protocol;
Note: Each line’s key must be capitalized exactly as displayed, or it will be an invalid .desktop file.
Then update ~/.local/share/applications/mimeinfo.cache by running:
kbuildsycoca4
update-desktop-database ~/.local/share/applications/
Add to your Emacs init file:
(server-start)
(require 'org-protocol)
You’ll probably want to add a capture template something like this:
("w" "Web site" entry (file+olp "~/org/inbox.org" "Web") "* %c :website:\n%U %?%:initial")
Note: Using %:initial instead of %i seems to handle multi-line content better.
This will result in a capture like this:
* [[http://orgmode.org/worg/org-contrib/org-protocol.html][org-protocol.el – Intercept calls from emacsclient to trigger custom actions]] :website:
[2015-09-29 Tue 11:09]

About org-protocol.el

org-protocol.el is based on code and ideas from org-annotation-helper.el and org-browser-url.el.
On some versions of Firefox, it may be necessary to add this setting. You may skip this step and come back to it if you get an error saying that Firefox doesn’t know how to handle org-protocol links.
Open about:config and create a new boolean value named network.protocol-handler.expose.org-protocol and set it to true.
Note: If you do skip this step, and you do encounter the error, Firefox may replace all open tabs in the window with the error message, making it difficult or impossible to recover those tabs. It’s best to use a new window with a throwaway tab to test this setup until you know it’s working.
Selection-grabbing function
This function gets the HTML from the browser’s selection. It’s from this answer on StackOverflow.
function () {
    var html = "";
    if (typeof content.document.getSelection != "undefined") {
        var sel = content.document.getSelection();
        if (sel.rangeCount) {
            var container = document.createElement("div");
            for (var i = 0, len = sel.rangeCount; i < len; ++i) {
                container.appendChild(sel.getRangeAt(i).cloneContents());
            }
            html = container.innerHTML;
        }
    } else if (typeof document.selection != "undefined") {
        if (document.selection.type == "Text") {
            html = document.selection.createRange().htmlText;
        }
    }

    var relToAbs = function (href) {
        var a = content.document.createElement("a");
        a.href = href;
        var abs = a.protocol + "//" + a.host + a.pathname + a.search + a.hash;
        a.remove();
        return abs;
    };
    var elementTypes = [
        ['a', 'href'],
        ['img', 'src']
    ];

    var div = content.document.createElement('div');
    div.innerHTML = html;

    elementTypes.map(function(elementType) {
        var elements = div.getElementsByTagName(elementType[0]);
        for (var i = 0; i < elements.length; i++) {
            elements[i].setAttribute(elementType[1], relToAbs(elements[i].getAttribute(elementType[1])));
        }
    });
    return div.innerHTML;
}
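The relToAbs step in the function above makes relative links and image sources absolute by letting the browser resolve them through a temporary anchor element. Outside a browser, roughly the same resolution can be sketched with the standard URL constructor. This is an illustration only, not the package's code: the explicit base argument is an assumption added here because there is no current page to resolve against.

```javascript
// Sketch only: a non-DOM stand-in for relToAbs.
// The second parameter (base) is an addition for illustration; in the
// bookmarklet the browser supplies the current page as the base implicitly.
function relToAbs(href, base) {
  return new URL(href, base).href;
}

var base = 'https://example.com/docs/page.html';
var resolved = ['../img/logo.png', '#intro', '/about?x=1', 'https://other.example/x']
  .map(function (href) { return relToAbs(href, base); });

console.log(resolved);
```

Already-absolute links pass through unchanged, which is why the function can be applied to every a and img element without checking first.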
Here’s a one-line version of the selection-grabbing function, better for pasting into bookmarklets and such:
function () {var html = ""; if (typeof content.document.getSelection != "undefined") {var sel = content.document.getSelection(); if (sel.rangeCount) {var container = document.createElement("div"); for (var i = 0, len = sel.rangeCount; i < len; ++i) {container.appendChild(sel.getRangeAt(i).cloneContents());} html = container.innerHTML;}} else if (typeof document.selection != "undefined") {if (document.selection.type == "Text") {html = document.selection.createRange().htmlText;}} var relToAbs = function (href) {var a = content.document.createElement("a"); a.href = href; var abs = a.protocol + "//" + a.host + a.pathname + a.search + a.hash; a.remove(); return abs;}; var elementTypes = [['a', 'href'], ['img', 'src']]; var div = content.document.createElement('div'); div.innerHTML = html; elementTypes.map(function(elementType) {var elements = div.getElementsByTagName(elementType[0]); for (var i = 0; i < elements.length; i++) {elements[i].setAttribute(elementType[1], relToAbs(elements[i].getAttribute(elementType[1])));}}); return div.innerHTML;}
Add link to Mac OS X article
This article would be helpful for Mac users in setting up org-protocol.
Pentadactyl has the :write command, which can write a page’s HTML to a file, or to a command, like :write !org-protocol-capture-html.sh. This should make it easy to implement file-based capturing, which would pass HTML through a temp file rather than as an argument, and this would work around the argument-length limit that we occasionally run into.
All that should be necessary is to:
capture-file that receives a path to a file instead of a URL to a page.
org-protocol-capture-html.sh to capture with files.
STDIN, write it to a tempfile, and pass the tempfile’s path to Emacs. The tempfile should go in the directory and have the prefix so that Emacs knows it’s safe to delete that file.
:write !org-protocol-capture-html --tempfile.
:com! search-selection,ss -bang -nargs=? -complete search \ -js commands.execute((bang ? open : tabopen ) \ + args + + buffer.currentWord)
However, I don’t see how this would allow writing different content to STDIN, only arguments. So this might not be possible without modifying Pentadactyl and/or using a separate Firefox extension. Here is the source for the :write command, and here for the underlying JS function. And you can see here how it uses temp files to pass STDIN to commands.
If you try to capture too long a chunk of HTML, it will fail with “argument list too long” errors from emacsclient. Working around this will require capturing via STDIN instead of arguments. Since org-protocol is based on using URLs, this will probably require using a shell script and a new Emacs function, and perhaps another MIME protocol-handler. Even then, it might still run into problems, because the data is passed to the shell script as an argument in the protocol-handler. Working around that would probably require a non-protocol-handler-based method using a browser extension to send the HTML directly via STDIN. This might be possible with Pentadactyl instead of making an entirely new browser extension. Also, maybe the Org-mode Capture Firefox extension could be extended (…) to do this.
However, most of the time, this is not a problem.
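To get a feel for when a capture might hit the limit, you can estimate the size of the URL before sending it. The sketch below is an illustration under stated assumptions: the function names are hypothetical, and ASSUMED_ARG_LIMIT (128 KiB, the usual per-argument cap on Linux with 4 KiB pages) is an assumption, since the real limit depends on the operating system and on how the protocol handler invokes emacsclient.

```javascript
// Sketch only: names and the limit below are assumptions for illustration.
// 131072 bytes is the usual single-argument cap on Linux with 4 KiB pages;
// other systems, and the total argument budget, differ.
var ASSUMED_ARG_LIMIT = 131072;

function captureUrlBytes(html) {
  var url = 'org-protocol://capture-html?template=w&body=' + encodeURIComponent(html);
  // encodeURIComponent output is plain ASCII, so characters equal bytes here.
  return url.length;
}

function fitsInOneArgument(html, limit) {
  return captureUrlBytes(html) < (limit || ASSUMED_ARG_LIMIT);
}

// Plain letters pass through unencoded, but each "<" becomes "%3C",
// so markup-heavy selections grow to about three times their size.
console.log(fitsInOneArgument('a'.repeat(50000)));
console.log(fitsInOneArgument('<'.repeat(50000)));
```

The encoding overhead is the main point: a selection well under the limit as raw HTML can still exceed it once every angle bracket, quote, and space has been percent-encoded.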
This would be nice.