WhitespaceNormalizer provides methods that help normalize the spacing of text content inside Elements by removing various Unicode spacing and directional markings.
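Constant Summary
Non-breaking space (NBSP), referenced below as NON_BREAKING_SPACE
"\u00a0"
Unicode line separator, referenced below as LINE_SEPERATOR
"\u2028"
Unicode paragraph separator, referenced below as PARAGRAPH_SEPERATOR
"\u2029"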
All spaces except for NBSP
"[[:space:]&&[^#{NON_BREAKING_SPACE}]]".freeze
Whitespace we want to substitute with plain spaces
" \n\f\t\v#{LINE_SEPERATOR}#{PARAGRAPH_SEPERATOR}".freeze
Any whitespace at the front of text
/\A#{BREAKING_SPACES}+/
Any whitespace at the end of text
/#{BREAKING_SPACES}+\z/
"Invisible" space character
"\u200b"
Signifies text is read left to right
"\u200e"
Signifies text is read right to left
"\u200f"
Characters we want to truncate from text
[ZERO_WIDTH_SPACE, LEFT_TO_RIGHT_MARK, RIGHT_TO_LEFT_MARK].join
Matches multiple empty lines
/[\ \n]*\n[\ \n]*/
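To make the constants above concrete, the following minimal sketch rebuilds a few of them locally and applies each one to a sample string. The module name `WhitespaceConstantsDemo`, its method names, and the constant names `LEADING`, `TRAILING`, `REMOVED`, and `EMPTY_LINES` are hypothetical stand-ins chosen for this example; only the values are taken from the list above.

```ruby
# Hypothetical demo module: the values are copied from the constant list,
# the names are illustrative where the list does not show them.
module WhitespaceConstantsDemo
  NON_BREAKING_SPACE = "\u00a0"
  # POSIX [[:space:]] intersected (&&) with "anything but NBSP".
  BREAKING_SPACES    = "[[:space:]&&[^#{NON_BREAKING_SPACE}]]"
  LEADING            = /\A#{BREAKING_SPACES}+/
  TRAILING           = /#{BREAKING_SPACES}+\z/
  # Zero-width space, left-to-right mark, right-to-left mark.
  REMOVED            = ["\u200b", "\u200e", "\u200f"].join
  EMPTY_LINES        = /[\ \n]*\n[\ \n]*/

  module_function

  # Strip leading and trailing whitespace, but leave NBSP in place.
  def trim(text)
    text.sub(LEADING, '').sub(TRAILING, '')
  end

  # Drop zero-width spaces and directional marks.
  def strip_invisible(text)
    text.delete(REMOVED)
  end

  # Collapse any run of spaces and newlines that contains a newline
  # into a single newline.
  def collapse_lines(text)
    text.gsub(EMPTY_LINES, "\n")
  end
end

puts WhitespaceConstantsDemo.trim(" \t\u00a0hello\u00a0\n ").inspect
puts WhitespaceConstantsDemo.strip_invisible("a\u200bb\u200ec\u200f").inspect
puts WhitespaceConstantsDemo.collapse_lines("a \n \n  b").inspect
```

Because the character class excludes NBSP, the trimming patterns stop at the first non-breaking space instead of consuming it, which is why the first call keeps the NBSP on both sides of `hello`.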
Instance Method Summary
#normalize_spacing(text) ⇒ String
Normalizes the spacing of a node's text to be similar to what matchers might expect.
#normalize_visible_spacing(text) ⇒ String
Variant on WhitespaceNormalizer#normalize_spacing that targets the whitespace of visible elements only.
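As an illustration of how the two methods could combine the constants, here is a hedged sketch. The module name `SpacingSketch`, the constant names `SQUISHED`, `LEADING`, `TRAILING`, `REMOVED`, and `EMPTY_LINES`, and the order of the chained calls are assumptions made for this example; this page does not show the method bodies, so treat it as one plausible arrangement and not as the library's implementation.

```ruby
# Hypothetical sketch; the chaining order is an assumption, not taken from this page.
module SpacingSketch
  NON_BREAKING_SPACE  = "\u00a0"
  LINE_SEPERATOR      = "\u2028" # spelling follows the identifier in the constant list
  PARAGRAPH_SEPERATOR = "\u2029"
  BREAKING_SPACES     = "[[:space:]&&[^#{NON_BREAKING_SPACE}]]"
  SQUISHED            = " \n\f\t\v#{LINE_SEPERATOR}#{PARAGRAPH_SEPERATOR}"
  LEADING             = /\A#{BREAKING_SPACES}+/
  TRAILING            = /#{BREAKING_SPACES}+\z/
  REMOVED             = ["\u200b", "\u200e", "\u200f"].join
  EMPTY_LINES         = /[\ \n]*\n[\ \n]*/

  module_function

  # Remove invisible marks, turn every listed whitespace character into a
  # plain space, squeeze runs of spaces, trim both ends, then map NBSP to a space.
  def normalize_spacing(text)
    text
      .delete(REMOVED)
      .tr(SQUISHED, ' ')
      .squeeze(' ')
      .sub(LEADING, '')
      .sub(TRAILING, '')
      .tr(NON_BREAKING_SPACE, ' ')
  end

  # Keep line breaks: squeeze spaces, collapse blank lines into one newline,
  # trim both ends, then map NBSP to a space.
  def normalize_visible_spacing(text)
    text
      .squeeze(' ')
      .gsub(EMPTY_LINES, "\n")
      .sub(LEADING, '')
      .sub(TRAILING, '')
      .tr(NON_BREAKING_SPACE, ' ')
  end
end

puts SpacingSketch.normalize_spacing("  foo\n\tbar\u200b \u00a0baz  ").inspect
puts SpacingSketch.normalize_visible_spacing("  foo  \n \n  bar  ").inspect
```

In this arrangement the NBSP is converted only after squeezing and trimming, so a non-breaking space next to an ordinary space survives as a second space in the result. That matches the constant list's distinction between breaking spaces and NBSP.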