Showing content from http://simongog.github.io/assets/data/sdsl-slides/tutorial below:

About the project

Goal: Provide an easy-to-use, highly-efficient, configurable, and extensible library of succinct data structures for researchers and practitioners.

It is/was a challenge to meet all these goals. Here is the current state:

C++ is used (great for resource-constrained programming).
Templates are used to make it configurable.
STL concepts are used to make it easy-to-use.
Space and time efficient construction using the semi-external approach.
Construction is configurable between in-memory and semi-external.
All indexes support byte and integer sequences.
Implements highlights of 40 research articles.

Development by Timo Beller, Matthias Petri, me, and others.

Great tool for prototyping
Provides a myriad of structures: from bitvectors to compressed suffix trees
C++ is the choice for resource-constrained programming
Application domains: Bioinformatics, Information Retrieval,...

Installation

Get the source code

```
git clone https://github.com/simongog/sdsl-lite.git
```

Install the include and library files into SDSL_INSTALL_PATH
(default value is $HOME)
```
cd sdsl-lite
./install.sh SDSL_INSTALL_PATH
```
Let's compile* the tutorial programs
```
cd tutorial
make
```

* use clang >=3.1 or gcc >=4.7

Integer Vectors - bitcompressed & mutable

```
#include <iostream>
#include <sdsl/vectors.hpp>

using namespace std;
using namespace sdsl;

int main(){
    int_vector<> v = {3,2,1,0,2,1,3,4,1,1,1,3,2,3};
    v[1]=0;
    util::bit_compress(v);
    cout << v << endl;
    cout << size_in_bytes(v) << endl;
}
```

Output:

3 0 1 0 2 1 3 4 1 1 1 3 2 3
17

An int_vector<0> stores the length ($8$ bytes) and the width ($1$ byte) of its elements. The data itself is stored in $64$-bit words.
You can access elements through operator[].
Use the assignment operator to set values in v.
The width of the elements can be determined by calling v.width(). Initially it is $64$ bits.
util::bit_compress determines the maximal value $x$ in v and sets the width to bits::hi(x)+1, where bits::hi(x) is the position of the most significant set bit of $x$.
Size is $17$ bytes: $8$ bytes for the length, $1$ byte for the width, and $8$ bytes for the data. The maximum is $4$, so the width is $3$, and $14 \times 3 = 42$ bits fit in one $64$-bit word.
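
The size arithmetic above can be reproduced with a few lines of standard C++. This is only a sketch of the calculation described on the slide, not SDSL code: the helper names `hi`, `compressed_width`, and `packed_size_in_bytes` are made up for illustration, and the size model (8 bytes length, 1 byte width, whole 64-bit words of data) is taken from the text.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Position of the most significant set bit of x (0 for x == 0),
// mirroring what the text describes for bits::hi.
inline unsigned hi(std::uint64_t x) {
    unsigned pos = 0;
    while (x >>= 1) ++pos;
    return pos;
}

// Width a bit-compression step would pick: hi(max element) + 1.
inline unsigned compressed_width(const std::vector<std::uint64_t>& v) {
    std::uint64_t max_value = 0;
    for (std::uint64_t x : v)
        if (x > max_value) max_value = x;
    return hi(max_value) + 1;
}

// Size model from the slide: 8 bytes for the length, 1 byte for the
// width, and the payload rounded up to whole 64-bit words.
inline std::size_t packed_size_in_bytes(std::size_t n, unsigned width) {
    std::size_t bits  = n * width;
    std::size_t words = (bits + 63) / 64;
    return 8 + 1 + 8 * words;
}
```

For the 14 values printed above, `compressed_width` yields 3 and `packed_size_in_bytes(14, 3)` yields 17, matching the program output.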

Integer Vectors - compressed & immutable

enc_vector<> = apply self-delimiting code on deltas

```
int_vector<> v(10*(1<<20));
for (size_t i=0; i<10; ++i)
    for (size_t j=0; j<1U<<20; ++j)
        v[i*(1<<20)+j] = j;
cout << size_in_mega_bytes(v) << endl;
util::bit_compress(v);
cout << size_in_mega_bytes(v) << endl;
enc_vector<> ev(v);
cout << size_in_mega_bytes(ev) << endl;
```

Output:

80
25
1.70902

enc_vector<..> is a compressed random access vector. It calculates the $\Delta$s of adjacent elements; the $\Delta$s are compressed using a self-delimiting code. Absolute samples and pointers into the bit stream are stored to provide fast random access.
You can access elements through operator[].
enc_vector<..> can be configured by three parameters
- t_coder: self-delimiting code (default: coder::elias_delta, alternative: coder::fibonacci)
- t_dens: Sample density of absolute values and pointers (default: $128$)
- t_width: Width of the int_vector used to store the samples and pointers (default: 0)
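
The interplay of deltas and absolute samples can be sketched in plain C++. This is an illustration under stated assumptions, not SDSL's implementation: the type `SampledDeltaVector` is hypothetical, the deltas are kept as ordinary 64-bit values rather than in a self-delimiting bit stream (so it shows the access pattern, not the space saving), and decreasing neighbours are handled by unsigned wrap-around, which the slides do not discuss.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical illustration of "absolute samples + deltas".
struct SampledDeltaVector {
    std::size_t dens;                    // sample density (role of t_dens)
    std::vector<std::uint64_t> samples;  // absolute value at every dens-th index
    std::vector<std::uint64_t> deltas;   // v[i] - v[i-1] (mod 2^64) elsewhere

    SampledDeltaVector(const std::vector<std::uint64_t>& v, std::size_t density)
        : dens(density) {
        for (std::size_t i = 0; i < v.size(); ++i) {
            if (i % dens == 0) samples.push_back(v[i]);
            else deltas.push_back(v[i] - v[i - 1]);  // unsigned wrap-around
        }
    }

    // Random access: start at the block's sample, then add at most
    // dens-1 deltas.
    std::uint64_t operator[](std::size_t i) const {
        std::size_t block = i / dens;
        std::size_t first = block * (dens - 1);  // first delta of this block
        std::uint64_t value = samples[block];
        for (std::size_t k = 0; k < i % dens; ++k)
            value += deltas[first + k];
        return value;
    }
};
```

A smaller density means more samples (more space) but fewer deltas to add per access; a larger density trades the other way.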

Integer Vectors - compressed & immutable

vlc_vector<> = apply self-delimiting code on elements

 
```
// initialize vector with 10 mega zeros
int_vector<> v(10*(1<<20), 0);
v[0] = 1ULL<<63;
util::bit_compress(v);
cout << size_in_mega_bytes(v) << endl;
vlc_vector<> vv(v);
cout << size_in_mega_bytes(vv) << endl;
```

Output:

80
1.48442

vlc_vector<..> is another compressed random access vector. Self-delimiting codes are used to compress each element. Sample pointers into the bit stream are stored to provide fast random access.
You can access elements through operator[].
vlc_vector<..> can be configured by three parameters
- t_coder: self-delimiting code (default: coder::elias_delta, alternative: coder::fibonacci)
- t_dens: Sample density of pointers (default: $128$)
- t_width: Width of the int_vector used to store the pointers (default: 0)
In this example bit_compress does not result in compression, since the max element in v occupies $64$ bits.
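
To see why a vector of mostly small values shrinks so much, it helps to compute the length of an Elias-delta codeword. The sketch below uses the textbook definition of the code; the function names are made up for illustration. Elias-delta is only defined for positive integers, so a vector containing zeros has to be shifted before coding. Assuming the coded value is $x+1$ (a guess about the library, though it is consistent with the roughly one bit per element seen in the output), every zero costs a single bit.

```cpp
#include <cstdint>

// floor(log2(x)) for x >= 1.
inline unsigned floor_log2(std::uint64_t x) {
    unsigned r = 0;
    while (x >>= 1) ++r;
    return r;
}

// Length in bits of the Elias-delta codeword of x (x >= 1):
// with N = floor(log2 x) and L = floor(log2(N + 1)),
// the codeword has L zeros, L + 1 bits for N + 1, and N bits of x.
inline unsigned elias_delta_length(std::uint64_t x) {
    unsigned n = floor_log2(x);
    unsigned l = floor_log2(static_cast<std::uint64_t>(n) + 1);
    return n + 2 * l + 1;
}
```

For example, `elias_delta_length(1)` is 1 and `elias_delta_length(10)` is 8, so small values are cheap while a single large outlier costs only a few bits more than its binary length.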

Bitvectors - uncompressed & mutable

 
```
#include <iostream>
#include <sdsl/bit_vectors.hpp>
using namespace std;
using namespace sdsl;

int main(){
    bit_vector b = {1,1,0,1,0,0,1};
    cout << b << endl;
    b = bit_vector(80*(1<<20), 0);
    for (size_t i=0; i < b.size(); i+=100)
        b[i] = 1;
    cout << size_in_mega_bytes(b) << endl;
}
```

Output:

1101001
10

bit_vector is a specialization of int_vector<..>. It's an uncompressed, mutable bitvector.
In the example, a bit_vector b is constructed from an initializer list; b can be written to a stream. Then we assign an 80-megabit vector, initialized with zeros, to b, set every 100th element to one, and output the size of the structure.

Bitvectors - compressed & immutable

Use rrr_vector<63> or sd_vector<> to represent compressible bitvectors

```
bit_vector b = bit_vector(80*(1<<20), 0);
for (size_t i=0; i < b.size(); i+=100)
    b[i] = 1;
cout << size_in_mega_bytes(b) << endl;
rrr_vector<63> rrrb(b);
cout << size_in_mega_bytes(rrrb) << endl;
sd_vector<> sdb(b);
cout << size_in_mega_bytes(sdb) << endl;
```

Output:

10
1.77071
0.987351

There are two compressed bitvectors (rrr_vector<..> and sd_vector<..>) which can be constructed by passing a bit_vector object. Compressed bitvectors are immutable after construction.
In the example, the $80$-megabit vector is compressed to $1.77071$ megabytes by using rrr_vector<63> and to $0.987351$ megabytes by using sd_vector<>.
rrr_vector<..> has three parameters:
- t_bs: block size in $[5..255]$ (default: 63, larger=more compression)
- t_rac: Random access iterator in which the popcounts of the blocks are stored (default: int_vector<>).
- t_k: Store cumulative sums of popcounts every t_k elements (default: 32).
sd_vector<..> is explained on the next slide.
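
Returning to rrr_vector's block parameters: the slides do not spell out the encoding, so the following is a sketch of the textbook RRR idea, assumed here rather than taken from the slides. Each block of t_bs bits is represented by its popcount $k$ plus an offset that identifies the block among all $\binom{t\_bs}{k}$ blocks with that popcount. The helper names are hypothetical.

```cpp
#include <cstdint>
#include <vector>

// Pascal's triangle up to row n; every entry for n <= 63 fits in 64 bits.
inline std::vector<std::vector<std::uint64_t>> binomials(unsigned n) {
    std::vector<std::vector<std::uint64_t>> c(
        n + 1, std::vector<std::uint64_t>(n + 1, 0));
    for (unsigned i = 0; i <= n; ++i) {
        c[i][0] = 1;
        for (unsigned j = 1; j <= i; ++j)
            c[i][j] = c[i - 1][j - 1] + c[i - 1][j];
    }
    return c;
}

// Smallest number of bits that can distinguish `count` alternatives.
inline unsigned bits_to_index(std::uint64_t count) {
    unsigned b = 0;
    while (b < 64 && (std::uint64_t(1) << b) < count) ++b;
    return b;
}

// Offset bits needed for a block of block_size bits (<= 63 in this
// sketch) that contains `ones` set bits.
inline unsigned offset_bits(unsigned block_size, unsigned ones) {
    return bits_to_index(binomials(block_size)[block_size][ones]);
}
```

With a block size of 63, an all-zero block needs 0 offset bits and a block with a single one needs 6, which gives an intuition for why the sparse vector in the example compresses well.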

Intermezzo: Inspecting structures

write_structure<JSON_FORMAT> can be used with every SDSL object and writes a space-breakdown into a stream

```
write_structure<JSON_FORMAT>(sdb, cout);
```

We use d3js.org to visualize the output:

sd_vector<..> has three parameters:
- t_hi_bit_vector: Type of bitvector used for the unary decoded differences of the high part of the positions of the 1s (default: bit_vector).
- t_select_1: Select support on ones (default: bit_vector::select_1_type).
- t_select_0: Select support on zeros (default: bit_vector::select_0_type).
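
The "high part" mentioned in the parameter list refers to splitting each 1-position into high and low bits. The sketch below shows the textbook Elias-Fano form of that split using only the standard library. The names `HighLowSplit`, `split_positions`, and `nth_one` are hypothetical, the low width $\lfloor \log_2(n/m) \rfloor$ is the textbook choice (the width SDSL actually picks may differ), and the linear scan stands in for the select support that a real implementation would use.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

struct HighLowSplit {
    unsigned low_width = 0;
    std::vector<std::uint64_t> low;  // low_width lowest bits of each position
    std::vector<bool> high;          // high parts, unary coded
};

// ones: sorted positions of the 1s; universe: length n of the bitvector.
inline HighLowSplit split_positions(const std::vector<std::uint64_t>& ones,
                                    std::uint64_t universe) {
    HighLowSplit s;
    std::uint64_t m = ones.size();
    if (m == 0) return s;
    std::uint64_t ratio = universe / m;
    while (ratio > 1) { ratio >>= 1; ++s.low_width; }  // floor(log2(n / m))
    std::uint64_t mask = (std::uint64_t(1) << s.low_width) - 1;
    s.high.assign((universe >> s.low_width) + m + 1, false);
    for (std::size_t i = 0; i < ones.size(); ++i) {
        s.low.push_back(ones[i] & mask);
        // The i-th one is written at index (high part + i).
        s.high[(ones[i] >> s.low_width) + i] = true;
    }
    return s;
}

// Position of the i-th one (0-based), recovered by scanning `high`.
inline std::uint64_t nth_one(const HighLowSplit& s, std::size_t i) {
    std::size_t seen = 0;
    for (std::size_t p = 0; p < s.high.size(); ++p) {
        if (!s.high[p]) continue;
        if (seen == i)
            return (static_cast<std::uint64_t>(p - i) << s.low_width) | s.low[i];
        ++seen;
    }
    return ~std::uint64_t(0);  // i is out of range
}
```

Finding the i-th set bit of `high` is exactly a select query, which is why the class takes select support types as template parameters.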

Intermezzo: Store and load structures

Any SDSL object o can be stored to or loaded from a file named file with the functions

store_to_file(o, file) or load_from_file(o, file)

Example:

```
bit_vector b(10000000, 0);        // bit_vector is not a template
b[b.size()/2] = 1;
sd_vector<> sdb(b);
store_to_file(sdb, "sdb.sdsl");
sdb = sd_vector<>();
cout << sdb.size() << endl; // 0
load_from_file(sdb, "sdb.sdsl");
cout << sdb.size() << endl; // 10000000
```

Initialize a bit_vector with 10000000 zeros.
Set the bit at position b.size()/2.
Initialize sdb with b
Store sdb to file "sdb.sdsl".
Assign empty vector to sdb.
Output sdb.size(); it's zero.
Load sdb from file "sdb.sdsl".
Output sdb.size(); it's 10000000 again.

Support structures: rank_support (1)

A support object holds a pointer to the supported object and adds functionality. E.g. rank_support_* structures add the rank operation to bitvectors.

```
bit_vector b = bit_vector(8000, 0);
for (size_t i=0; i < b.size(); i+=100)
    b[i] = 1;
rank_support_v<1> b_rank(&b); // <- pointer to b
for (size_t i=0; i<=b.size(); i+= b.size()/4)
    cout << "(" << i << ", " << b_rank(i) << ") ";
```

Output:

(0, 0) (2000, 20) (4000, 40) (6000, 60) (8000, 80)

Calling b_rank(i), i.e. operator() of rank_support_v<1>, returns the number of ones in the prefix $[0..i-1]$ of b.
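
The "support object" pattern can be imitated with the standard library to make the semantics concrete. This is a deliberately simple sketch, not the layout rank_support_v uses: the class `WordRank` is hypothetical, it stores one cumulative count per 64-bit word, and it keeps only a pointer to the caller's data, so the data must outlive the support and must not change after construction.

```cpp
#include <bitset>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical rank support over a bitvector stored as 64-bit words
// (bit i lives in word i / 64 at bit position i % 64).
class WordRank {
public:
    explicit WordRank(const std::vector<std::uint64_t>* words)
        : words_(words), cumulative_(words->size() + 1, 0) {
        for (std::size_t w = 0; w < words_->size(); ++w)
            cumulative_[w + 1] =
                cumulative_[w] + std::bitset<64>((*words_)[w]).count();
    }

    // Number of ones in the prefix [0..i-1]; valid for 0 <= i <= 64 * words.
    std::size_t operator()(std::size_t i) const {
        std::size_t w = i / 64, r = i % 64;
        if (r == 0) return cumulative_[w];  // also covers i == total length
        std::uint64_t mask = (std::uint64_t(1) << r) - 1;
        return cumulative_[w] + std::bitset<64>((*words_)[w] & mask).count();
    }

private:
    const std::vector<std::uint64_t>* words_;  // supported data, not owned
    std::vector<std::size_t> cumulative_;      // ones before each word
};
```

With a one at every 100th position of an 8000-bit vector, `WordRank` reports 20 ones before position 2000 and 80 before position 8000, the same pairs as in the output above.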

Support structures: rank_support (2)

Each bitvector class provides default types for rank supports.

```
sd_vector<> sdb(b);
sd_vector<>::rank_1_type sdb_rank(&sdb);
for (size_t i=0; i<=b.size(); i+= b.size()/4)
    cout << "(" << i << ", " << sdb_rank(i) << ") ";
cout << endl;
rrr_vector<> rrrb(b);
rrr_vector<>::rank_1_type rrrb_rank(&rrrb);
for (size_t i=0; i<=b.size(); i+= b.size()/4)
    cout << "(" << i << ", " << rrrb_rank(i) << ") ";
```

Output:

(0, 0) (2000, 20) (4000, 40) (6000, 60) (8000, 80)
(0, 0) (2000, 20) (4000, 40) (6000, 60) (8000, 80)

Support structures: rank_support (3)

It is possible to generate rank structures for bit-patterns up to length 2 for bit_vector.

```
bit_vector b = {0,1,0,1};
rank_support_v<1> b_r1(&b);     // <- bitpattern `1` of len 1
rank_support_v<0> b_r0(&b);     // <- bitpattern `0` of len 1
rank_support_v<10,2> b_r10(&b); // <- bitpattern `10` of len 2
rank_support_v<01,2> b_r01(&b); // <- bitpattern `01` of len 2
for (size_t i=0; i<=b.size(); ++i)
    cout << i << ": "<< b_r1(i) << " " << b_r0(i)
         << " " << b_r10(i) << " " << b_r01(i) << endl;
```

Output:

Support structures: select_support (1)

Select structures can be used analogously.

```
bit_vector b = {0,1,0,1,1,1,0,0,0,1,1};
size_t zeros = rank_support_v<0>(&b)(b.size());
bit_vector::select_0_type b_sel(&b);

for (size_t i=1; i <= zeros; ++i)
   cout << b_sel(i) << " ";
```

Output:

0 2 6 7 8
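
The meaning of select can be pinned down with a naive reference function. This is a linear-scan sketch for checking expectations, not how a select support works internally; `select_bit` is a made-up name, and returning `b.size()` for a missing occurrence is a convention chosen for this sketch only.

```cpp
#include <cstddef>
#include <vector>

// Position of the i-th occurrence (1-based) of `bit` in b,
// or b.size() if b contains fewer than i such bits.
inline std::size_t select_bit(const std::vector<bool>& b, bool bit,
                              std::size_t i) {
    std::size_t seen = 0;
    for (std::size_t p = 0; p < b.size(); ++p)
        if (b[p] == bit && ++seen == i) return p;
    return b.size();
}
```

On the vector from the example, `select_bit(b, false, i)` for i = 1..5 gives 0, 2, 6, 7, 8, the same positions as the output above.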

Support structures: select_support (2)

Select on a bit_vector can also be done for bit patterns of length 2.

```
bit_vector b = {0,1,0,1,1,1,0,0,0,1,1};
size_t cnt10 = rank_support_v<10,2>(&b)(b.size());
select_support_mcl<10,2> b_sel10(&b);

for (size_t i=1; i <= cnt10; ++i)
   cout << b_sel10(i) << " ";
```

Output:

2 6
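
The printed positions are consistent with reporting the index of the pattern's last bit, that is, every $p$ with b[p-1] = 1 and b[p] = 0. This reading is inferred from the output above, not from the library documentation. A naive reference function under that assumption (the name `select_10` is hypothetical):

```cpp
#include <cstddef>
#include <vector>

// Index of the last bit of the i-th occurrence (1-based) of the
// pattern "10", i.e. the i-th p with b[p-1] == 1 and b[p] == 0.
// Returns b.size() if there are fewer than i occurrences.
inline std::size_t select_10(const std::vector<bool>& b, std::size_t i) {
    std::size_t seen = 0;
    for (std::size_t p = 1; p < b.size(); ++p)
        if (b[p - 1] && !b[p] && ++seen == i) return p;
    return b.size();
}
```

For the example vector this yields 2 and 6, matching the output.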

Support structures: select_support (3)

Selection also works on compressed bitvectors (sd_vector<> and rrr_vector<>).

```
sd_vector<> sd_b = bit_vector{1,0,1,1,1,0,1,1,0,0,1,0,0,1};
size_t ones = sd_vector<>::rank_1_type(&sd_b)(sd_b.size());
sd_vector<>::select_1_type sdb_sel(&sd_b);

cout << sd_b << endl;

for (size_t i=1; i <= ones; ++i)
    cout << sdb_sel(i) << " ";
```

Output:

10111011001001
0 2 3 4 6 7 10 13