Genivia/RE-flex: Yet another high-performance C++ regex library and lexical analyzer generator like Flex: extends Flex++ with Unicode support, indent/dedent anchors, lazy quantifiers, functions for lex and syntax error reporting and more. Seamlessly integrates with Bison and other parsers.

A high-performance C++ regex library and lexical analyzer generator with Unicode support.

Two example use cases:

A RE/flex-generated tokenizer is used by the Tiger Compiler.
The RE/flex C++ regex engines are used by ugrep.

The RE/flex lexical analyzer generator extends Flex++ with Unicode support, indent/dedent anchors, POSIX regex lazy quantifiers, word boundaries, functions for lex and syntax error reporting, lexer rule execution performance profiling, and other new features.

Only RE/flex supports backtrack-free regex lazy matching in linear time, using an advanced DFA transformation algorithm (invented by Dr. Robert van Engelen).
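
To make the term concrete, here is a minimal sketch of what a lazy quantifier changes in a match. It uses std::regex (ECMAScript syntax) as a stand-in rather than the RE/flex engine, and the helper name first_match is hypothetical. Because std::regex is a backtracking matcher, this only illustrates the matching semantics, not the linear-time DFA technique.

```cpp
#include <regex>
#include <string>

// Return the first substring of `input` matched by `pattern`,
// or an empty string when nothing matches.
// Hypothetical helper for illustration; uses std::regex, not RE/flex.
std::string first_match(const std::string& pattern, const std::string& input)
{
  std::smatch m;
  if (std::regex_search(input, m, std::regex(pattern)))
    return m.str();
  return std::string();
}
```

With the input <b>bold</b>, the greedy pattern <.*> matches the whole input, whereas the lazy pattern <.*?> stops at the first closing bracket and matches only <b>.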

RE/flex is faster than Flex and much faster than regex libraries such as Boost.Regex, C++11 std::regex, PCRE2 and RE2. For example, tokenizing a 2 KB representative C source code file into 244 tokens takes only 8.7 microseconds:

| Command / Function                                      | Software                             | Time (μs) |
|---------------------------------------------------------|--------------------------------------|-----------|
| reflex --fast --noindent                                | RE/flex 3.4.1                        | 8.7       |
| reflex --fast                                           | RE/flex 3.4.1                        | 8.9       |
| flex -+ --full                                          | Flex 2.5.35                          | 9.8       |
| boost::spirit::lex::lexertl::actor_lexer::iterator_type | Boost.Spirit.Lex 1.82.0              | 10.7      |
| reflex --full                                           | RE/flex 3.4.1                        | 20.6      |
| pcre2_jit_match()                                       | PCRE2 (jit) 10.42                    | 60.8      |
| hs_compile_multi(), hs_scan()                           | Hyperscan 5.4.2                      | 129       |
| reflex -m=boost-perl                                    | Boost.Regex 1.82.0                   | 205       |
| RE2::Consume()                                          | RE2 (pre-compiled) 2023-09-01        | 218       |
| reflex -m=boost                                         | Boost.Regex POSIX 1.82.0             | 392       |
| pcre2_match()                                           | PCRE2 10.42                          | 500       |
| RE2::Consume()                                          | RE2 POSIX (pre-compiled) 2023-09-01  | 534       |
| flex -+                                                 | Flex 2.5.35                          | 3759      |
| pcre2_dfa_match()                                       | PCRE2 POSIX (dfa) 10.42              | 4029      |
| regcomp(), regexec()                                    | GNU C POSIX.2 regex                  | 4932      |
| std::cregex_iterator()                                  | C++11 std::regex                     | 6490      |

Note: performance is reported as elapsed time in microseconds (lower is better), measured over 1000 to 10000 benchmark runs on Mac OS X 12.6.9 with clang 12.0.0 -O2, 2.9 GHz Intel Core i7, 16 GB 2133 MHz LPDDR3. Hyperscan is disqualified as a scanner because its "All matches reported" semantics result in 1915 matches for this test, and because of its event handler requirements.

The performance table is indicative of the impact on performance when using PCRE2 and Boost.Regex with RE/flex. PCRE2 and Boost.Regex are optional libraries integrated with RE/flex for Perl matching because of their efficiency. By default, RE/flex uses DFA-based extended regular expression matching in linear time, the fastest method (as shown in the table).
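
As a rough illustration of why DFA-based scanning runs in linear time, the sketch below hand-codes a tiny DFA tokenizer: every input character triggers exactly one state transition and nothing is ever rescanned. The states, the token labels and the function name are assumptions made for this example; the code that RE/flex generates is different.

```cpp
#include <cctype>
#include <string>
#include <vector>

// States of a tiny hand-written DFA (illustrative only, not RE/flex output).
enum class State { Start, Ident, Number };

// Tokenize identifiers ([A-Za-z_][A-Za-z0-9_]*) and integers ([0-9]+),
// skipping every other character. Each character is examined once,
// so the running time is linear in the input length.
std::vector<std::string> tokenize(const std::string& input)
{
  std::vector<std::string> tokens;
  State state = State::Start;
  std::string lexeme;
  // Emit the pending token, if any, and return to the start state.
  auto flush = [&]() {
    if (state == State::Ident)
      tokens.push_back("id:" + lexeme);
    else if (state == State::Number)
      tokens.push_back("num:" + lexeme);
    lexeme.clear();
    state = State::Start;
  };
  for (char ch : input)
  {
    const unsigned char c = static_cast<unsigned char>(ch);
    const bool alpha = std::isalpha(c) != 0 || c == '_';
    const bool digit = std::isdigit(c) != 0;
    if (state == State::Ident && (alpha || digit))
      lexeme += ch;                 // stay in Ident
    else if (state == State::Number && digit)
      lexeme += ch;                 // stay in Number
    else
    {
      flush();                      // the longest match ended before this character
      if (alpha)
      {
        state = State::Ident;
        lexeme += ch;
      }
      else if (digit)
      {
        state = State::Number;
        lexeme += ch;
      }
    }
  }
  flush();                          // emit the token pending at end of input
  return tokens;
}
```

For the input "x1 42 y" this sketch produces the tokens id:x1, num:42 and id:y.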

The RE/flex matcher tracks line numbers, column numbers, and indentations, whereas Lex and Flex do not (option noyylineno) and neither do the other regex matchers in the table (except PCRE2 and Boost.Regex when used with RE/flex). Tracking this information incurs some overhead. RE/flex also automatically decodes UTF-8/16/32 input and accepts std::istream, strings, and wide strings as input.

Includes many examples, such as a mini C compiler to Java bytecode, a tokenizer for C/C++ source code, a tokenizer for Python source code, a tokenizer for Java source code, Lua, JSON, XML, YAML, and more.
Compatible with Flex and Bison to eliminate a learning curve, making a transition from Flex++ to RE/flex frustration-free.
Auto-generates code that integrates seamlessly with Bison Reentrant, Bison-Bridge, Bison-Locations, Bison 3.0 C++ interface %skeleton "lalr1.cc" and Bison Complete Symbols.
Generates code and includes methods for lexical and syntax error reporting and recovery.
The generated scanner source code is structured and easy to understand.
Fully supports Unicode and Unicode properties \p{C}, including Unicode identifier matching for C++11, Java, C#, and Python source code.
Auto-detects UTF-8/16/32 input to match Unicode patterns.
Supports file encodings ISO-8859-1 through ISO-8859-15, CP 1250 through 1258, CP 437, CP 850, CP 858, KOI8, MACROMAN, EBCDIC, and custom code pages.
Generates scanners for lexical analysis on files, C++ streams, (wide) strings, and memory such as mmap files.
Indent/nodent/dedent anchors to match indentation levels to tokenize.
Lazy quantifiers for POSIX regex matching, i.e. no hacks are needed to work around greedy repetitions.
Word boundary anchors.
Freespace mode option to improve readability of lexer specifications.
%class and %init to customize the generated Lexer classes.
%include to modularize lexer specifications.
Multiple lexer classes can be combined and used in one application, e.g. by multiple threads in a thread-safe manner.
Configurable Lexer class generation to customize the interface for various parsers, including Yacc and Bison.
Generates Graphviz files to visualize FSMs with the Graphviz dot tool.
Includes an extensible hierarchy of pattern matcher engines, with a choice of regex engines, including the RE/flex regex engine, PCRE2, and Boost.Regex.
The RE/flex regex library makes C++11 std::regex, PCRE2, and Boost.Regex much easier to use for pattern matching on (wide) strings, files, and streams.
IEEE POSIX P1003.2 standard compliant like Lex and Flex (but generates C++).
Extensive documentation in the online User Guide.
Lots of other improvements over Flex++, such as yypush_buffer_state saves the scanner state (line, column, and indentation positions), not just the input buffer; no input buffer length limit (Flex has a 16KB limit); line() returns the current line (e.g. for error reporting).
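
The indent and dedent anchors listed above address a classic tokenization problem. A minimal sketch of the underlying idea, a stack of indentation widths as used by Python-style tokenizers, is shown below. It is plain C++ without RE/flex, and the function name and the event labels are assumptions made for illustration; they are not the semantics of the RE/flex anchors themselves.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// For each non-blank line, report how its indentation relates to the open levels:
// "indent" when it opens a deeper level, one "dedent" per level it closes,
// and "nodent" when the level is unchanged.
// Illustrative sketch: spaces only, no tab handling, no check for inconsistent dedents.
std::vector<std::string> indent_events(const std::vector<std::string>& lines)
{
  std::vector<std::string> events;
  std::vector<std::size_t> stack{0};   // indentation widths of the open levels
  for (const std::string& line : lines)
  {
    std::size_t width = 0;
    while (width < line.size() && line[width] == ' ')
      ++width;
    if (width == line.size())
      continue;                        // skip blank lines
    if (width > stack.back())
    {
      stack.push_back(width);
      events.push_back("indent");
    }
    else if (width == stack.back())
    {
      events.push_back("nodent");
    }
    else
    {
      while (width < stack.back())     // close every deeper level
      {
        stack.pop_back();
        events.push_back("dedent");
      }
    }
  }
  return events;
}
```

The bottom stack entry is always 0, so the dedent loop terminates for any input.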

Note: PCRE2 and Boost.Regex are not dependencies, they can be used as optional regex engines in addition to the RE/flex regex engine.

Use reflex/bin/reflex.exe from the command line or add a Custom Build Step in MSVC++ as follows:

1. select the project name in Solution Explorer then Property Pages from the Project menu (see also custom-build steps in Visual Studio);
2. add an extra path to the reflex/include folder in the Include Directories under VC++ Directories, which should look like $(VC_IncludePath);$(WindowsSDK_IncludePath);C:\Users\YourUserName\Documents\reflex\include (this assumes the reflex source package is in your Documents folder);
3. enter "C:\Users\YourUserName\Documents\reflex\bin\win32\reflex.exe" --header-file "C:\Users\YourUserName\Documents\mylexer.l" in the Command Line property under Custom Build Step (this assumes mylexer.l is in your Documents folder);
4. enter lex.yy.h lex.yy.cpp in the Outputs property;
5. specify Execute Before as PreBuildEvent.

If you are using specific reflex options such as --flex then add these in step 3.

Before compiling your program with MSVC++, drag the folders reflex/lib and reflex/unicode to the Source Files in the Solution Explorer panel of your project. Next, run reflex.exe simply by compiling your project (which may fail, but that is OK for now as long as we executed the custom build step to run reflex.exe). Drag the generated lex.yy.h (if present) and lex.yy.cpp files to the Source Files. Now you are all set!

In addition, the reflex/vs directory contains batch scripts to build projects with MS Visual Studio C++.

On macOS systems you can use Homebrew to install RE/flex with brew install re-flex, or MacPorts with sudo port install re-flex.

On NetBSD systems you can use the standard NetBSD package installer (pkgsrc): http://cdn.netbsd.org/pub/pkgsrc/current/pkgsrc/devel/RE-flex/README.html

First clone the code:

$ git clone https://github.com/Genivia/RE-flex

Then simply do a quick clean build, assuming your environment is pretty much standard:

$ ./clean.sh
$ ./build.sh

This compiles the reflex tool and installs it locally in reflex/bin. For local use of RE/flex in your project, you can add this location to your $PATH variable to enable the new reflex command:

$ export PATH=$PATH:/your_path_to_reflex/reflex/bin

Note that the libreflex.a and libreflex.so libraries are saved locally in reflex/lib. Link against the library when you use the RE/flex regex engine in your code, such as:

$ c++ <options and .o/.cpp files> -L/your_path_to_reflex/reflex/lib -lreflex

or you could statically link libreflex.a with:

$ c++ <options and .o/.cpp files> /your_path_to_reflex/reflex/lib/libreflex.a

Also note that the RE/flex header files that you need to include in your project are located in include/reflex.

To install the man page, the header files in /usr/local/include/reflex, the library in /usr/local/lib and the reflex command in /usr/local/bin:

The configure script accepts configuration and installation options. To view these options, run:

$ ./configure --help

Run configure and make:

$ ./configure && make

To build the examples also:

$ ./configure --enable-examples && make

After this successfully completes, you can optionally run make install to install the reflex command and the libreflex library:

$ sudo make install

Unfortunately, cloning from Git does not preserve timestamps which means that you may run into "WARNING: 'aclocal-1.15' is missing on your system." To work around this problem, run:

$ autoreconf -fi
$ ./configure && make

The above builds the library with SSE/AVX optimizations applied. To disable AVX optimizations:

$ ./configure --disable-avx && make

To disable both SSE2 and AVX optimizations:

$ ./configure --disable-sse2 && make

Optional libraries to install

To use PCRE2 as a regex engine with the RE/flex library and scanner generator, install PCRE2 and link your code with -lpcre2-8.
To use Boost.Regex as a regex engine with the RE/flex library and scanner generator, install Boost and link your code with -lboost_regex or -lboost_regex-mt.
To visualize the FSM graphs generated with reflex option --graphs-file, install Graphviz dot.

Improved Vim syntax highlighting

Copy the lex.vim file to ~/.vim/syntax/ to enjoy improved syntax highlighting for both Flex and RE/flex.

There are two ways you can use this project:

as a scanner generator for C++, similar to Flex;
as a flexible regex library API for C++.

For the first use case, use the reflex tool on the command line on a lexer specification:

$ reflex --flex --bison --graphs-file lexspec.l

This generates a scanner for Bison from the lexer specification lexspec.l and saves the finite state machine (FSM) as a Graphviz .gv file that can be visualized with the Graphviz dot tool:

$ dot -Tpdf reflex.INITIAL.gv > reflex.INITIAL.pdf
$ open reflex.INITIAL.pdf

Several examples are included to get you started. See the manual for more details.

For the second use case, use the RE/flex matcher API classes to start pattern search, matching, splitting and scanning on strings, wide strings, files, and streams.

You can select matchers that are based on different regex engines:

RE/flex regex: #include <reflex/matcher.h> and use reflex::Matcher;
RE/flex fuzzy regex for approximate matching: #include <reflex/fuzzymatcher.h> and use reflex::FuzzyMatcher;
PCRE2: #include <reflex/pcre2matcher.h> and use reflex::PCRE2Matcher or reflex::PCRE2UTFMatcher;
Boost.Regex: #include <reflex/boostmatcher.h> and use reflex::BoostMatcher or reflex::BoostPosixMatcher;
C++11 std::regex: #include <reflex/stdmatcher.h> and use reflex::StdMatcher or reflex::StdPosixMatcher.

Each matcher may differ in regex syntax features (see the full documentation), but they all share the same methods and iterators, such as:

matches() returns nonzero if the whole input from start to end matches the specified pattern;
find() searches the input and returns nonzero if a match was found, can be repeated;
scan() scans the input and returns nonzero if the input at the current position matches, can be repeated;
split() returns nonzero for a split of the input at the next match, can be repeated;
find.begin()...find.end() a filter iterator, iterates with find();
scan.begin()...scan.end() a tokenizer iterator, iterates with scan();
split.begin()...split.end() a splitter iterator, iterates with split().

The input matched and searched may be a string, a wide string, a file, or a stream. Searching is incremental, meaning that the input is not buffered as a whole in memory, but rather buffered in parts in a sliding window of a few KB. The window size may grow to fit a pattern match. UTF-16/32 file input with a UTF BOM is automatically normalized and matched as UTF-8.
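
The incremental approach can be pictured with a small sketch that processes a stream in fixed-size chunks while carrying state across chunk boundaries, so the whole input never has to reside in memory. This is plain C++ and deliberately simpler than the sliding window described above, which can also grow to hold a long match; the function name and the chunk-size parameter are assumptions made for illustration.

```cpp
#include <cctype>
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Count words (maximal runs of alphanumeric characters) in a stream,
// reading at most `chunk_size` bytes at a time.
// Illustrative sketch only; RE/flex's internal buffering is more elaborate.
std::size_t count_words(std::istream& in, std::size_t chunk_size)
{
  if (chunk_size == 0)
    return 0;                          // nothing can be read with an empty buffer
  std::vector<char> chunk(chunk_size);
  std::size_t words = 0;
  bool in_word = false;                // state carried across chunk boundaries
  while (in)
  {
    in.read(chunk.data(), static_cast<std::streamsize>(chunk.size()));
    const std::size_t got = static_cast<std::size_t>(in.gcount());
    for (std::size_t i = 0; i < got; ++i)
    {
      const bool word_char = std::isalnum(static_cast<unsigned char>(chunk[i])) != 0;
      if (word_char && !in_word)
        ++words;                       // a new word starts here
      in_word = word_char;
    }
  }
  return words;
}

// Convenience overload for in-memory text.
std::size_t count_words(const std::string& text, std::size_t chunk_size)
{
  std::istringstream in(text);
  return count_words(in, chunk_size);
}
```

Because in_word survives from one chunk to the next, a word that straddles a chunk boundary is counted once, whatever the chunk size.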

For example, using Boost.Regex (alternatively use PCRE2 reflex::PCRE2Matcher or reflex::PCRE2UTFMatcher to match Unicode UTF-8 input):

#include <reflex/boostmatcher.h> // reflex::BoostMatcher, reflex::Input, boost::regex
// use a BoostMatcher to check if the birthdate string is a valid date
if (reflex::BoostMatcher("\\d{4}-\\d{2}-\\d{2}", birthdate).matches() != 0)
  std::cout << "Valid date!" << std::endl;

With a group capture to fetch the year:

#include <reflex/boostmatcher.h> // reflex::BoostMatcher, reflex::Input, boost::regex
// use a BoostMatcher to check if the birthdate string is a valid date
reflex::BoostMatcher matcher("(\\d{4})-\\d{2}-\\d{2}", birthdate);
if (matcher.matches() != 0)
  std::cout << std::string(matcher[1].first, matcher[1].second) << " was a good year!" << std::endl;

A pattern match made by any of the regex engines to match input with matches(), or search with find(), or tokenize with scan(), or split with split() includes detailed information that can be retrieved with the following methods:

accept() returns group capture index (or zero if not captured/matched)
text() returns const char* to 0-terminated match (ends in \0)
strview() returns std::string_view text match (preserves \0s) (C++17)
str() returns std::string text match (preserves \0s)
wstr() returns std::wstring wide text match (converted from UTF-8)
chr() returns first 8-bit char of the text match (str()[0] as int)
wchr() returns first wide char of the text match (wstr()[0] as int)
pair() returns std::pair<size_t,std::string>(accept(),str())
wpair() returns std::pair<size_t,std::wstring>(accept(),wstr())
size() returns the length of the text match in bytes
wsize() returns the length of the match in number of wide characters
lines() returns the number of lines in the text match (>=1)
columns() returns the number of columns of the text match (>=0)
begin() returns const char* to non-0-terminated text match begin
end() returns const char* to non-0-terminated text match end
rest() returns const char* to 0-terminated rest of input
span() returns const char* to 0-terminated match enlarged to span the line
line() returns std::string line with the matched text as a substring
wline() returns std::wstring line with the matched text as a substring
more() tells the matcher to append the next match (when using scan())
less(n) cuts text() to n bytes and repositions the matcher
lineno() returns line number of the match, starting at line 1
columno() returns column number of the match in characters, starting at 0
lineno_end() returns ending line number of the match, starting at line 1
columno_end() returns ending column number of the match, starting at 0
bol() returns const char* to begin of matching line (not 0-terminated)
border() returns the byte offset from the start of the line of the match
first() returns input position of the first character of the match
last() returns input position + 1 of the last character of the match
at_bol() true if matcher reached the begin of a new line \n
at_bob() true if matcher is at the begin of input and no input consumed
at_end() true if matcher is at the end of input
[0] operator returns std::pair<const char*,size_t>(begin(),size())
[n] operator returns n'th capture std::pair<const char*,size_t>
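
To illustrate the conventions used by lineno() and columno() above (lines counted from 1, columns from 0), here is a minimal sketch that derives both from a byte offset in a buffer. It is plain C++ and the function name is hypothetical. Unlike columno(), it counts columns in bytes, which equals the character count only for single-byte text. RE/flex tracks this information itself while matching, so no such helper is needed when using the matcher API.

```cpp
#include <cstddef>
#include <string>
#include <utility>

// Return {line, column} of byte `offset` in `text`:
// line numbers start at 1 and column numbers start at 0.
// Columns are counted in bytes (an assumption that only equals characters
// for single-byte text); offsets past the end are clamped to the end.
std::pair<std::size_t, std::size_t> line_and_column(const std::string& text, std::size_t offset)
{
  if (offset > text.size())
    offset = text.size();
  std::size_t line = 1;
  std::size_t line_start = 0;          // offset of the first byte of the current line
  for (std::size_t i = 0; i < offset; ++i)
  {
    if (text[i] == '\n')
    {
      ++line;
      line_start = i + 1;
    }
  }
  return std::make_pair(line, offset - line_start);
}
```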

Note: POSIX matchers do not generally support group capturing, e.g. BoostPosixMatcher and StdPosixMatcher do not. RE/flex is an efficient backtrack-free DFA-based POSIX engine that supports a limited form of capturing, limited to outermost grouping, such as (abc)|(def) which has two groups. This may be extended in a future release to full capturing.
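
The role of a group index for a pattern such as (abc)|(def) can be sketched with std::regex as a stand-in. This is not the RE/flex API, and the function name is hypothetical; it merely mirrors the convention that index 0 means no group took part in the match.

```cpp
#include <cstddef>
#include <regex>
#include <string>

// Return the 1-based index of the first capture group that took part in a
// full match of `input`, or 0 when the input does not match at all.
// Uses std::regex for illustration, not the RE/flex matcher.
std::size_t matched_group(const std::string& input)
{
  static const std::regex pattern("(abc)|(def)");
  std::smatch m;
  if (!std::regex_match(input, m, pattern))
    return 0;
  for (std::size_t i = 1; i < m.size(); ++i)
    if (m[i].matched)
      return i;
  return 0;
}
```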

To search a string for words \w+ to display with the column number of each word found:

#include <reflex/boostmatcher.h> // reflex::BoostMatcher, reflex::Input, boost::regex
// use a BoostMatcher to search for words in a sentence
reflex::BoostMatcher matcher("\\w+", "How now brown cow.");
while (matcher.find() != 0)
  std::cout << "Found " << matcher.text() << " at column " << matcher.columno() << std::endl;

The split method is roughly the inverse of the find method and returns text located between matches. For example using non-word matching \W+:

#include <reflex/boostmatcher.h> // reflex::BoostMatcher, reflex::Input, boost::regex
// use a BoostMatcher to split a sentence at non-word characters
reflex::BoostMatcher matcher("\\W+", "How now brown cow.");
while (matcher.split())
  std::cout << "Found " << matcher.text() << std::endl;
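
For comparison, the same find/split duality exists in the C++ standard library, where std::sregex_token_iterator with submatch index -1 yields the text between matches. The sketch below uses only std::regex; the helper name is hypothetical, and details such as the handling of empty leading or trailing fields may differ from the RE/flex split() method.

```cpp
#include <regex>
#include <string>
#include <vector>

// Split `input` at every match of `separator` and return the pieces in between.
// Standard-library stand-in for illustration; not the RE/flex split() method.
std::vector<std::string> split_at(const std::string& input, const std::string& separator)
{
  const std::regex re(separator);
  // Submatch index -1 selects the text between matches instead of the matches.
  std::sregex_token_iterator first(input.begin(), input.end(), re, -1);
  std::sregex_token_iterator last;
  return std::vector<std::string>(first, last);
}
```

Adjacent separators produce an empty piece, so splitting "a,b,,c" at "," gives four pieces, the third of which is empty.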

To pattern match the content of a file, where the file may use UTF-8, 16, or 32 encodings that are automatically converted when a UTF BOM is present:

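#include <reflex/boostmatcher.h> // reflex::BoostMatcher, reflex::Input, boost::regex
#include <cstdio>                // FILE, fopen, fclose
#include <cstdlib>               // exit, EXIT_FAILURE
#include <iostream>              // std::cout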
// use a BoostMatcher to search and display words from a FILE
FILE *fd = fopen("somefile.txt", "r");
if (fd == NULL)
  exit(EXIT_FAILURE);
reflex::BoostMatcher matcher("\\w+", fd);
while (matcher.find())
  std::cout << "Found " << matcher.text() << std::endl;
fclose(fd);

Same again, but this time with a C++ input stream:

#include <reflex/boostmatcher.h> // reflex::BoostMatcher, reflex::Input, boost::regex
#include <fstream>               // std::ifstream
#include <iostream>              // std::cout
// use a BoostMatcher to search and display words from a stream
std::ifstream file("somefile.txt", std::ifstream::in);
reflex::BoostMatcher matcher("\\w+", file);
while (matcher.find())
  std::cout << "Found " << matcher.text() << std::endl;
file.close();
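
The UTF BOM mentioned for file input above is a short byte signature at the start of the data. A minimal sketch of how such a signature can be recognized is given below. The enum and function names are hypothetical and this is not RE/flex's own detection code; RE/flex performs the detection automatically.

```cpp
#include <cstddef>
#include <string>

enum class Bom { None, Utf8, Utf16BE, Utf16LE, Utf32BE, Utf32LE };

// Identify a Unicode byte order mark at the start of `bytes`.
// UTF-32 LE is tested before UTF-16 LE because its BOM (FF FE 00 00)
// begins with the UTF-16 LE BOM (FF FE). As a consequence, UTF-16 LE text
// whose first character after the BOM is U+0000 is reported as UTF-32 LE.
Bom detect_bom(const std::string& bytes)
{
  // True when `bytes` begins with the `n` bytes of `sig` (which may contain NULs).
  auto starts_with = [&bytes](const char* sig, std::size_t n) {
    return bytes.compare(0, n, sig, n) == 0;
  };
  if (starts_with("\x00\x00\xFE\xFF", 4)) return Bom::Utf32BE;
  if (starts_with("\xFF\xFE\x00\x00", 4)) return Bom::Utf32LE;
  if (starts_with("\xEF\xBB\xBF", 3))     return Bom::Utf8;
  if (starts_with("\xFE\xFF", 2))         return Bom::Utf16BE;
  if (starts_with("\xFF\xFE", 2))         return Bom::Utf16LE;
  return Bom::None;
}
```

Input without any of these signatures is reported as Bom::None, in which case a caller would fall back to a default encoding of its choosing.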

Stuffing the search results into a container using RE/flex iterators:

#include <reflex/boostmatcher.h> // reflex::BoostMatcher, reflex::Input, boost::regex
#include <vector>         // std::vector
// use a BoostMatcher to convert words of a sentence into a string vector
reflex::BoostMatcher matcher("\\w+", "How now brown cow.");
std::vector<std::string> words(matcher.find.begin(), matcher.find.end());

Use C++11 range-based loops with RE/flex iterators:

#include <reflex/pcre2matcher.h> // reflex::PCRE2UTFMatcher, reflex::Input, pcre2
// use a PCRE2UTFMatcher to search for words in a sentence
reflex::PCRE2UTFMatcher matcher("\\w+", "How now brown cow.");
for (auto& match : matcher.find)
  std::cout << "Found " << match.text() << std::endl;

Note that we cannot generally simplify this loop to the following, because the temporary matcher object is destroyed (some compilers handle this in C++23):

for (auto& match : reflex::PCRE2UTFMatcher("\\w+", "How now brown cow.").find)
  std::cout << "Found " << match.text() << std::endl;

RE/flex also allows you to convert expressive regex syntax forms such as \p Unicode classes, character class set operations such as [a-z--[aeiou]], escapes such as \X, and (?x) mode modifiers, to a regex string that the underlying regex library understands and will be able to use:

std::string reflex::Matcher::convert(const std::string& regex, reflex::convert_flag_type flags)
std::string reflex::PCRE2Matcher::convert(const std::string& regex, reflex::convert_flag_type flags)
std::string reflex::PCRE2UTFMatcher::convert(const std::string& regex, reflex::convert_flag_type flags)
std::string reflex::BoostMatcher::convert(const std::string& regex, reflex::convert_flag_type flags)
std::string reflex::StdMatcher::convert(const std::string& regex, reflex::convert_flag_type flags)

For example:

#include <reflex/matcher.h> // reflex::Matcher, reflex::Input, reflex::Pattern
// use a Matcher to check if sentence is in Greek:
static const reflex::Pattern pattern(reflex::Matcher::convert("[\\p{Greek}\\p{Zs}\\pP]+", reflex::convert_flag::unicode));
if (reflex::Matcher(pattern, sentence).matches() != 0)
  std::cout << "This is Greek" << std::endl;

We use convert with optional flag reflex::convert_flag::unicode to make . (dot), \w, \s and so on match Unicode and to convert \p Unicode character classes.

Conversion is fast (it runs in linear time in the size of the regex), but it is not free. Making converted regex patterns static, as shown above, incurs the conversion cost only once, no matter how many times the pattern is used for matching.
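
The same compile-once idea applies to any regex library. As a standard-library analogy (not RE/flex code; the function name is hypothetical), a function-local static regex is constructed the first time the function runs and is reused by every later call:

```cpp
#include <regex>
#include <string>

// Check whether `text` has the form YYYY-MM-DD (digits only, no range checks).
// The regex is a function-local static, so it is compiled on the first call
// and reused afterwards instead of being rebuilt for each match.
bool looks_like_date(const std::string& text)
{
  static const std::regex pattern("\\d{4}-\\d{2}-\\d{2}");
  return std::regex_match(text, pattern);
}
```

Since C++11 the initialization of a function-local static is thread-safe, so this pattern can be shared by concurrent callers without extra locking.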

Please see CONTRIBUTING.

Where do I find the documentation?