Showing content from https://github.com/syntax-tree/hast-util-to-nlcst below:

syntax-tree/hast-util-to-nlcst: utility to transform hast to nlcst

hast utility to transform to nlcst.

This package is a utility that takes a hast (HTML) syntax tree as input and turns it into nlcst (natural language).

This project is useful when you want to deal with ASTs and inspect the natural language inside HTML. Unfortunately, there is no way yet to apply changes to the nlcst back into hast.

The mdast utility mdast-util-to-nlcst does the same but uses a markdown tree as input.

The rehype plugin rehype-retext wraps this utility to do the same at a higher-level (easier) abstraction.

This package is ESM only. In Node.js (version 16+), install with npm:

npm install hast-util-to-nlcst

In Deno with esm.sh:

import {toNlcst} from 'https://esm.sh/hast-util-to-nlcst@4'

In browsers with esm.sh:

<script type="module">
  import {toNlcst} from 'https://esm.sh/hast-util-to-nlcst@4?bundle'
</script>

Say our document example.html contains:

<article>
  Implicit.
  <h1>Explicit: <strong>foo</strong>s-ball</h1>
  <pre><code class="language-foo">bar()</code></pre>
</article>

…and our module example.js looks as follows:

import {fromHtml} from 'hast-util-from-html'
import {toNlcst} from 'hast-util-to-nlcst'
import {ParseEnglish} from 'parse-english'
import {read} from 'to-vfile'
import {inspect} from 'unist-util-inspect'

const file = await read('example.html')
const tree = fromHtml(file)

console.log(inspect(toNlcst(tree, file, ParseEnglish)))

…now running node example.js yields (each node is followed by its position as line:column and offset ranges):

RootNode[2] (1:1-6:1, 0-134)
├─0 ParagraphNode[3] (1:10-3:3, 9-24)
│   ├─0 WhiteSpaceNode "\n  " (1:10-2:3, 9-12)
│   ├─1 SentenceNode[2] (2:3-2:12, 12-21)
│   │   ├─0 WordNode[1] (2:3-2:11, 12-20)
│   │   │   └─0 TextNode "Implicit" (2:3-2:11, 12-20)
│   │   └─1 PunctuationNode "." (2:11-2:12, 20-21)
│   └─2 WhiteSpaceNode "\n  " (2:12-3:3, 21-24)
└─1 ParagraphNode[1] (3:7-3:43, 28-64)
    └─0 SentenceNode[4] (3:7-3:43, 28-64)
        ├─0 WordNode[1] (3:7-3:15, 28-36)
        │   └─0 TextNode "Explicit" (3:7-3:15, 28-36)
        ├─1 PunctuationNode ":" (3:15-3:16, 36-37)
        ├─2 WhiteSpaceNode " " (3:16-3:17, 37-38)
        └─3 WordNode[4] (3:25-3:43, 46-64)
            ├─0 TextNode "foo" (3:25-3:28, 46-49)
            ├─1 TextNode "s" (3:37-3:38, 58-59)
            ├─2 PunctuationNode "-" (3:38-3:39, 59-60)
            └─3 TextNode "ball" (3:39-3:43, 60-64)
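To make the shape of this output easier to read, here is a minimal sketch that flattens an nlcst node back into text. The `sentence` object is a hand-written copy of the last SentenceNode above with positions left out, and `toText` and `words` are hypothetical helpers written for this illustration, not exports of this package.

```javascript
// Hand-written copy of the last SentenceNode from the output above.
// Positional info is omitted to keep the sketch short.
const sentence = {
  type: 'SentenceNode',
  children: [
    {type: 'WordNode', children: [{type: 'TextNode', value: 'Explicit'}]},
    {type: 'PunctuationNode', value: ':'},
    {type: 'WhiteSpaceNode', value: ' '},
    {
      type: 'WordNode',
      children: [
        {type: 'TextNode', value: 'foo'},
        {type: 'TextNode', value: 's'},
        {type: 'PunctuationNode', value: '-'},
        {type: 'TextNode', value: 'ball'}
      ]
    }
  ]
}

// Hypothetical helper: concatenate the values of all leaf nodes.
function toText(node) {
  if (typeof node.value === 'string') return node.value
  return (node.children || []).map(toText).join('')
}

// Hypothetical helper: list the text of every WordNode in a node.
function words(node) {
  if (node.type === 'WordNode') return [toText(node)]
  return (node.children || []).flatMap(words)
}

console.log(toText(sentence)) // prints "Explicit: foos-ball"
console.log(words(sentence)) // prints [ 'Explicit', 'foos-ball' ]
```

Note that "foo" and "s-ball" end up in one WordNode even though the HTML wraps only "foo" in a strong element.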

This package exports the identifier toNlcst. There is no default export.

toNlcst(tree, file, Parser)

Turn a hast tree into an nlcst tree.

👉 Note: tree must have positional info and file must be a VFile corresponding to tree.

tree (HastNode) — hast tree to transform
file (VFile) — virtual file
Parser (ParserConstructor or ParserInstance) — parser to use.

Returns the nlcst tree (NlcstNode).
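One way to see why the file has to correspond to the tree: the positions in the nlcst output are offsets into the original HTML source. The sketch below rebuilds the example.html text from earlier in this document and slices it with offsets copied from the example output. `sliceByPosition` is a hypothetical helper for illustration, and this is an assumption about why the note exists rather than a description of the package internals.

```javascript
// The contents of example.html as shown earlier, with a trailing newline.
const source = [
  '<article>',
  '  Implicit.',
  '  <h1>Explicit: <strong>foo</strong>s-ball</h1>',
  '  <pre><code class="language-foo">bar()</code></pre>',
  '</article>',
  ''
].join('\n')

// Positions copied from the example output above.
const implicitWord = {
  start: {line: 2, column: 3, offset: 12},
  end: {line: 2, column: 11, offset: 20}
}
const lastWord = {
  start: {line: 3, column: 25, offset: 46},
  end: {line: 3, column: 43, offset: 64}
}

// Hypothetical helper: return the source text a position points at.
function sliceByPosition(value, position) {
  return value.slice(position.start.offset, position.end.offset)
}

console.log(source.length) // prints 134
console.log(sliceByPosition(source, implicitWord)) // prints "Implicit"
console.log(sliceByPosition(source, lastWord)) // prints "foo</strong>s-ball"
```

The last slice still contains the closing strong tag, which shows that nlcst positions refer to the HTML file and not to the extracted text. A file with different contents would yield unrelated slices.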

The algorithm supports implicit and explicit paragraphs, such as:

<article>
  An implicit paragraph.
  <h1>An explicit paragraph.</h1>
</article>

Overlapping paragraphs are also supported (see the tests or the HTML spec for more info).
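As a rough illustration of the implicit versus explicit distinction, the following sketch groups the children of a hand-written hast element into paragraphs. It is a simplification under stated assumptions: the `explicit` set is a hypothetical, incomplete list of elements, overlapping paragraphs are not handled, and `paragraphs` is not how this package is implemented.

```javascript
// Hypothetical, incomplete set of elements treated as explicit paragraphs.
const explicit = new Set(['p', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6'])

// Concatenate all text inside a hast node.
function textOf(node) {
  if (node.type === 'text') return node.value
  return (node.children || []).map(textOf).join('')
}

// Group direct children into paragraphs: explicit elements stand alone,
// and runs of anything else between them form implicit paragraphs.
function paragraphs(parent) {
  const result = []
  let run = ''

  const flush = () => {
    const value = run.trim()
    if (value) result.push({kind: 'implicit', value})
    run = ''
  }

  for (const child of parent.children) {
    if (child.type === 'element' && explicit.has(child.tagName)) {
      flush()
      result.push({kind: 'explicit', value: textOf(child).trim()})
    } else {
      run += textOf(child)
    }
  }

  flush()
  return result
}

// Hand-written hast tree for the HTML example above.
const article = {
  type: 'element',
  tagName: 'article',
  properties: {},
  children: [
    {type: 'text', value: '\n  An implicit paragraph.\n  '},
    {
      type: 'element',
      tagName: 'h1',
      properties: {},
      children: [{type: 'text', value: 'An explicit paragraph.'}]
    },
    {type: 'text', value: '\n'}
  ]
}

console.log(paragraphs(article))
```

For this input the sketch reports one implicit paragraph ("An implicit paragraph.") followed by one explicit paragraph ("An explicit paragraph.").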

Some elements are ignored and their content will not be present in nlcst: <script>, <style>, <svg>, <math>, <del>.

To ignore other elements, add a data-nlcst attribute with a value of ignore:

<p>This is <span data-nlcst="ignore">hidden</span>.</p>
<p data-nlcst="ignore">Completely hidden.</p>
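The effect of these rules can be sketched on a hand-written hast tree. In hast, the data-nlcst attribute is exposed as the dataNlcst property. `isIgnored` and `visibleText` are hypothetical helpers for illustration, not part of this package, and the sketch only shows which text is dropped.

```javascript
// Elements listed above as ignored by default.
const ignoredTags = new Set(['script', 'style', 'svg', 'math', 'del'])

// Hypothetical helper: is this element ignored by tag name or attribute?
function isIgnored(node) {
  if (node.type !== 'element') return false
  const properties = node.properties || {}
  return ignoredTags.has(node.tagName) || properties.dataNlcst === 'ignore'
}

// Hypothetical helper: collect text, skipping ignored elements entirely.
function visibleText(node) {
  if (isIgnored(node)) return ''
  if (node.type === 'text') return node.value
  return (node.children || []).map(visibleText).join('')
}

// <p>This is <span data-nlcst="ignore">hidden</span>.</p>
const partlyHidden = {
  type: 'element',
  tagName: 'p',
  properties: {},
  children: [
    {type: 'text', value: 'This is '},
    {
      type: 'element',
      tagName: 'span',
      properties: {dataNlcst: 'ignore'},
      children: [{type: 'text', value: 'hidden'}]
    },
    {type: 'text', value: '.'}
  ]
}

// <p data-nlcst="ignore">Completely hidden.</p>
const fullyHidden = {
  type: 'element',
  tagName: 'p',
  properties: {dataNlcst: 'ignore'},
  children: [{type: 'text', value: 'Completely hidden.'}]
}

console.log(JSON.stringify(visibleText(partlyHidden))) // prints "This is ."
console.log(JSON.stringify(visibleText(fullyHidden))) // prints ""
```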

<code> elements are mapped to Source nodes in nlcst.

To mark other elements as source, add a data-nlcst attribute with a value of source:

<p>This is <span data-nlcst="source">marked as source</span>.</p>
<p data-nlcst="source">Completely marked.</p>
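A minimal sketch of how content could be divided into natural language and source, again on a hand-written hast tree. `isSource` and `segments` are hypothetical helpers, and the `kind` labels are invented for this illustration. The real utility hands the natural-language parts to the parser, which produces words, punctuation, and so on.

```javascript
// Hypothetical helper: is this element source by tag name or attribute?
function isSource(node) {
  if (node.type !== 'element') return false
  const properties = node.properties || {}
  return node.tagName === 'code' || properties.dataNlcst === 'source'
}

// Concatenate all text inside a hast node.
function textOf(node) {
  if (node.type === 'text') return node.value
  return (node.children || []).map(textOf).join('')
}

// Hypothetical helper: split content into text and source segments.
function segments(node, out = []) {
  if (node.type === 'text') {
    out.push({kind: 'text', value: node.value})
  } else if (isSource(node)) {
    out.push({kind: 'source', value: textOf(node)})
  } else {
    for (const child of node.children || []) segments(child, out)
  }
  return out
}

// <p>This is <span data-nlcst="source">marked as source</span>.</p>
const paragraph = {
  type: 'element',
  tagName: 'p',
  properties: {},
  children: [
    {type: 'text', value: 'This is '},
    {
      type: 'element',
      tagName: 'span',
      properties: {dataNlcst: 'source'},
      children: [{type: 'text', value: 'marked as source'}]
    },
    {type: 'text', value: '.'}
  ]
}

console.log(segments(paragraph))
```

For this paragraph the sketch yields three segments: text "This is ", source "marked as source", and text ".".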

ParserConstructor: a constructor that creates a new parser (TypeScript type).

type ParserConstructor = new () => ParserInstance

ParserInstance: an nlcst parser (TypeScript type).

For example, parse-dutch, parse-english, or parse-latin.

type ParserInstance = {
  parse(value?: string | null | undefined): NlcstRoot
  tokenize(value?: string | null | undefined): Array<NlcstSentenceContent>
  tokenizeParagraph(value?: string | null | undefined): NlcstParagraph
  tokenizeParagraphPlugins: Array<(node: NlcstParagraph) => undefined | void>
  tokenizeSentencePlugins: Array<(node: NlcstSentence) => undefined | void>
}
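Since the Parser parameter accepts either a constructor or an instance, a caller can pass a class such as ParseEnglish or an object created from it. The sketch below shows one way such an argument could be normalized. `FakeParser` is a hypothetical stand-in with the same member names as ParserInstance, and `resolveParser` is an illustration only, not the code this package uses.

```javascript
// Hypothetical stand-in for a real parser such as parse-english.
// It has the members listed in ParserInstance but does no real tokenizing.
class FakeParser {
  constructor() {
    this.tokenizeParagraphPlugins = []
    this.tokenizeSentencePlugins = []
  }

  // Treat the whole value as a single word.
  tokenize(value) {
    if (!value) return []
    return [{type: 'WordNode', children: [{type: 'TextNode', value}]}]
  }

  tokenizeParagraph(value) {
    const sentence = {type: 'SentenceNode', children: this.tokenize(value)}
    return {type: 'ParagraphNode', children: [sentence]}
  }

  parse(value) {
    return {type: 'RootNode', children: [this.tokenizeParagraph(value)]}
  }
}

// Hypothetical helper: accept a constructor or an existing instance.
function resolveParser(parser) {
  return typeof parser === 'function' ? new parser() : parser
}

const fromConstructor = resolveParser(FakeParser)
const existing = new FakeParser()
const fromInstance = resolveParser(existing)

console.log(fromConstructor instanceof FakeParser) // prints true
console.log(fromInstance === existing) // prints true
```

Passing an instance is useful when the parser has been configured beforehand, for example by adding entries to tokenizeSentencePlugins.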

This package is fully typed with TypeScript. It exports the additional types ParserConstructor and ParserInstance.

Projects maintained by the unified collective are compatible with maintained versions of Node.js.

When we cut a new major release, we drop support for unmaintained versions of Node. This means we try to keep the current release line, hast-util-to-nlcst@^4, compatible with Node.js 16.

hast-util-to-nlcst does not change the original syntax tree, so there are no openings for cross-site scripting (XSS) attacks.

mdast-util-to-nlcst — transform mdast to nlcst
hast-util-to-mdast — transform hast to mdast
hast-util-to-xast — transform hast to xast

See contributing.md in syntax-tree/.github for ways to get started. See support.md for ways to get help.

This project has a code of conduct. By interacting with this repository, organization, or community you agree to abide by its terms.
