
Git database with bitmap index Kuba Podgórski

source{d}
All the “crazy mental gymnastics with data”:
● src-d/go-mysql-server
● src-d/gitbase
● src-d/engine
github.com/kuba--
My open source projects:
● pkg/xattr
● kuba--/zip
● never-lang/never

Context
Gitbase (git database frontend)
● Database implementation (go-mysql-server) powered by vitess.io.
● Read only (no INSERTS, UPDATES, etc.).
● Query git repositories with go-git.
Pilosa (bitmap index)
● Distributed index implementation.
● With roaring storage format.
● Attributes in BoltDB.

Gitbase
● Frontend for git database.
● Database implementation (go-mysql-server) powered by vitess.io.
● Read only (no INSERTS, UPDATES, etc.).
● Query git repositories with go-git package.

Schema
Main tables
● Repositories (repository_id)
● Remotes (remote_name, ...)
● Refs (ref_name, commit_hash)
● Commits (commit_hash, …)
● Blobs (blob_hash, …)
● Tree_Entries (blob_hash, tree_entry_name, …)
● Files (file_path, blob_hash, …)

Schema
Relation tables
● Commit_Blobs (blob_hash, ...)
● Commit_Trees (commit_hash, tree_hash, ...)
● Commit_Files (commit_hash, file_path, ...)
● Ref_Commits (repository_id, ref_name, ...)

Get all the repositories where a given author contributed on the HEAD reference.
> SELECT refs.repository_id
> FROM refs NATURAL JOIN commits
> WHERE commits.commit_author_name = 'Alan Turing'
>   AND refs.ref_name = 'HEAD'

Extract identifier names from Go files.
> SELECT file_path,
>        uast_extract(
>            uast(blob_content, language(file_path), "//uast:Identifier"),
>            "Name"
>        )
> FROM files
> WHERE language(file_path) = 'Go'

Create an index on specific column(s)...
> CREATE INDEX email_idx
>     ON commits USING pilosa (commit_author_email);
> CREATE INDEX files_commit_path_blob_idx
>     ON commit_files USING pilosa (commit_hash, file_path, blob_hash)
>     WITH (async = true);

...or on one expression.
> CREATE INDEX files_lang_idx
>     ON files USING pilosa (language(file_path, blob_content));

Indexes
● Hash - In-memory hashmap / good for equality
● BTree - The most common / self-balancing
● RTree - Spatial index to group nearby objects
● Bitmaps - Optimized to speed up logical operations

Bitmap index
● More often used in read-only systems.
● Optimized for logical operations.
● The best for fields with only a few possible values.
● Expensive - can take a lot of space.
● One index per column to support all possible queries on a table.
For tables with “n” columns, the total number of distinct indexes to satisfy all possible queries
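
The bullets above can be illustrated with a minimal sketch. It assumes a hypothetical, uncompressed `Bitmap` type (not Pilosa's or roaring's API) and made-up column data: there is one bitmap per distinct value of a column, bit i is set when record i holds that value, and a two-column equality query becomes a word-by-word AND.

```go
package main

import (
	"fmt"
	"math/bits"
)

// Bitmap is a hypothetical, uncompressed bitmap: bit i marks record i.
type Bitmap []uint64

// Set marks record i, growing the bitmap if needed.
func (b *Bitmap) Set(i uint64) {
	word := i / 64
	for uint64(len(*b)) <= word {
		*b = append(*b, 0)
	}
	(*b)[word] |= 1 << (i % 64)
}

// And returns the intersection of two bitmaps, one machine word at a time.
func And(a, b Bitmap) Bitmap {
	n := len(a)
	if len(b) < n {
		n = len(b)
	}
	out := make(Bitmap, n)
	for i := 0; i < n; i++ {
		out[i] = a[i] & b[i]
	}
	return out
}

// Records lists the positions of all set bits in ascending order.
func (b Bitmap) Records() []uint64 {
	var out []uint64
	for w, word := range b {
		for word != 0 {
			tz := bits.TrailingZeros64(word)
			out = append(out, uint64(w*64+tz))
			word &= word - 1 // clear the lowest set bit
		}
	}
	return out
}

// buildIndex creates one bitmap per distinct value of a column.
func buildIndex(column []string) map[string]*Bitmap {
	idx := make(map[string]*Bitmap)
	for i, v := range column {
		if idx[v] == nil {
			idx[v] = &Bitmap{}
		}
		idx[v].Set(uint64(i))
	}
	return idx
}

func main() {
	// Made-up sample data: two low-cardinality columns.
	lang := []string{"Go", "C", "Go", "Go", "C"}
	author := []string{"ann", "ann", "bob", "ann", "bob"}
	langIdx, authorIdx := buildIndex(lang), buildIndex(author)

	// WHERE lang = 'Go' AND author = 'ann'
	hits := And(*langIdx["Go"], *authorIdx["ann"])
	fmt.Println(hits.Records()) // prints [0 3]
}
```

Each column keeps its own index, and any combination of equality filters is answered by combining the per-column bitmaps at query time.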

Roaring bitmaps.
> // Position of a row/column pair.
> func pos(rowID, columnID uint64) uint64 {
>     return (rowID * ShardWidth) + (columnID % ShardWidth)
> }
> // Write to local storage.
> bitmap.Add(pos(rowID, columnID))
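
A small worked example of the position formula may help. The slide does not give the value of `ShardWidth`; the sketch below assumes 2^20, and `shard` is a hypothetical helper added only to show which shard a column lands in.

```go
package main

import "fmt"

// ShardWidth is assumed to be 2^20 here; the slide does not state its value.
const ShardWidth = 1 << 20

// pos maps a (row, column) pair to a bit position inside one shard's bitmap.
func pos(rowID, columnID uint64) uint64 {
	return (rowID * ShardWidth) + (columnID % ShardWidth)
}

// shard returns the shard a column falls into (hypothetical helper).
func shard(columnID uint64) uint64 {
	return columnID / ShardWidth
}

func main() {
	// prints: 5 2097157 2097157 1
	fmt.Println(pos(0, 5), pos(2, 5), pos(2, ShardWidth+5), shard(ShardWidth+5))
}
```

Columns 5 and ShardWidth+5 produce the same position because the column is taken modulo the shard width; they are told apart by the shard they belong to.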

Roaring bitmaps.
> // Write type and value.
> buf[0] = byte(op.typ) // opTypAdd
> LittleEndian.PutUint64(buf[1:9], op.value)
> // Add checksum at the end.
> h := fnv.New32a()
> h.Write(buf[0:9])
> LittleEndian.PutUint32(buf[9:13], h.Sum32())
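
The 13-byte record above (1 byte type, 8 bytes value, 4 bytes FNV-1a checksum) can be round-tripped as in the following sketch. The function names `encodeOp` and `decodeOp` and the numeric value of `opTypAdd` are assumptions for illustration, not taken from the Pilosa source.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
	"hash/fnv"
)

const (
	opTypAdd = 0  // assumed numeric value, for illustration only
	opSize   = 13 // 1 byte type + 8 bytes value + 4 bytes checksum
)

// encodeOp serializes one operation: type, value, then an FNV-1a checksum.
func encodeOp(typ byte, value uint64) []byte {
	buf := make([]byte, opSize)
	buf[0] = typ
	binary.LittleEndian.PutUint64(buf[1:9], value)
	h := fnv.New32a()
	h.Write(buf[0:9])
	binary.LittleEndian.PutUint32(buf[9:13], h.Sum32())
	return buf
}

// decodeOp reverses encodeOp and rejects records whose checksum does not match.
func decodeOp(buf []byte) (typ byte, value uint64, err error) {
	if len(buf) < opSize {
		return 0, 0, errors.New("op record too short")
	}
	h := fnv.New32a()
	h.Write(buf[0:9])
	if h.Sum32() != binary.LittleEndian.Uint32(buf[9:13]) {
		return 0, 0, errors.New("op checksum mismatch")
	}
	return buf[0], binary.LittleEndian.Uint64(buf[1:9]), nil
}

func main() {
	rec := encodeOp(opTypAdd, 2097157)
	typ, value, err := decodeOp(rec)
	fmt.Println(typ, value, err)

	rec[3] ^= 0xff // simulate a corrupted byte on disk
	_, _, err = decodeOp(rec)
	fmt.Println(err)
}
```

The checksum lets a reader detect a torn or corrupted record when replaying appended operations.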

Pilosa
● Bitmap index
● Distributed index implementation (typically server-client)
● With roaring storage format
● Attributes in BoltDB.

Data model
Boolean matrix
● The purpose of the Index is to represent a data namespace. You cannot perform cross-index queries.
● Column ids are sequential, increasing integers and they are common to all Fields within an Index.
● Row ids are sequential, increasing integers namespaced to each Field within an Index.
● Fields are used to segment rows within an index, for example to define different functional groups.
https://www.pilosa.com/docs/latest/data-model/

Gitbase with pilosa index driver

Pilosa index driver
The first approach
● Pilosa as an external service
● One pilosa index per database index (db, table, id)
● One pilosa field per expression
● Mapping in BoltDB (value, row), (column, location)
> container_name: pilosa
> image: pilosa/pilosa:v1.2.0
> ports:
>   - "10101:10101"

Pilosalib
Yet another index driver
● Extract API from the server
● Open/Close files locally without an index Holder
Index
└─ Field
   └─ View
      └─ Fragment
         ├─ openCache
         └─ openStorage

Holder represents a container for indexes.
> type Holder struct {
>     ...
>     // opened channel is closed once Open() completes.
>     opened  lockedChan
>     closing chan struct{}
> }

Open initializes the root data directory for the holder. Close closes all open fragments.
> func (h *Holder) Open() error {
>     h.closing = make(chan struct{})
>     h.opened.Close()
>     return nil
> }
> func (h *Holder) Close() error {
>     close(h.closing)
>     h.opened.ch = make(chan struct{})
>     return nil
> }

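Panic! Open/Close accidentally being called twice.
> func (h *Holder) Open() error {
>     h.closing = make(chan struct{})
>     h.opened.Close() // panic on the second call: channel already closed
>     return nil
> }
> func (h *Holder) Close() error {
>     close(h.closing) // panic on the second call: channel already closed
>     h.opened.ch = make(chan struct{})
>     return nil
> }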
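
One way to make repeated calls harmless is to track the open state under a mutex. The sketch below uses a hypothetical `SafeHolder` type; it illustrates the general pattern only and is not how Pilosa or gitbase addressed the problem.

```go
package main

import (
	"fmt"
	"sync"
)

// SafeHolder is a hypothetical variant of Holder whose Open and Close
// may be called repeatedly without panicking.
type SafeHolder struct {
	mu      sync.Mutex
	isOpen  bool
	closing chan struct{}
}

// Open is a no-op when the holder is already open.
func (h *SafeHolder) Open() error {
	h.mu.Lock()
	defer h.mu.Unlock()
	if h.isOpen {
		return nil
	}
	h.closing = make(chan struct{})
	h.isOpen = true
	return nil
}

// Close closes the closing channel exactly once per successful Open.
func (h *SafeHolder) Close() error {
	h.mu.Lock()
	defer h.mu.Unlock()
	if !h.isOpen {
		return nil
	}
	close(h.closing)
	h.isOpen = false
	return nil
}

func main() {
	h := &SafeHolder{}
	fmt.Println(h.Open(), h.Open())   // the second Open is ignored
	fmt.Println(h.Close(), h.Close()) // the second Close is ignored
}
```

Because every `close` is guarded by the `isOpen` flag, a channel is never closed twice, and a Close before any Open is also safe.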

Pilosalib
One index, many fields
● One pilosa index per (db, table)
● One pilosa field per (id, expression, partition)
● Mapping (in BoltDB) utilizes bucket sequencer to get next ID
● Values encoded by gob package
Bitmaps across the same pilosa index are mergeable

Mergeable DB indexes - Create index.
> // CREATE INDEX id ON (A, B)
> idx := newPilosaIndex(db, table)
> // A, B
> for _, ex := range Expressions() {
>     idx.CreateField(id, ex, p)
> }

Mergeable DB indexes - Save data.
> for colID := offset; ; colID++ {
>     values, location := it.Next()
>     for i, f := range idx.fields {
>         rowID := getRowID(f, values[i])
>         f.Add(rowID, colID)
>     }
>     putLocation(idx, colID, location)
> }

Intersect bitmaps
[0, 0, 1, 1, 0, 1, ...] AND [1, 0, 0, 1, 1, 1, ...]
> // WHERE A = '2' AND B = '4'
> var row *pilosa.Row
> for i, ex := range Expressions() {
>     f := idx.Field(id, ex, p)
>     // rowID(A, '2'): 2, rowID(B, '4'): 4
>     rowID := mapping.rowID(f, values[i])
>     row = row.Intersect(f.Row(rowID))
> }

Get results
Index(A, B) == Index(A) AND Index(B)
> // WHERE A = '2' AND B = '4'
> var row *pilosa.Row
> for i, ex := range Expressions() {
>     ...
> }
> bits := row.Columns() // [3, 5]
> ...
> mapping.getLocation(idx, bits[offset])
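
The save and lookup slides can be tied together in one self-contained sketch. It replaces Pilosa fields with a hypothetical `field` type (a map from rowID to a set of column IDs), uses made-up table data chosen to reproduce the two bitmaps shown above, and uses the values themselves as row IDs instead of going through the BoltDB mapping.

```go
package main

import (
	"fmt"
	"sort"
)

// field is a hypothetical stand-in for a Pilosa field: rowID -> set of column IDs.
type field map[uint64]map[uint64]bool

// add sets the bit at (rowID, colID).
func (f field) add(rowID, colID uint64) {
	if f[rowID] == nil {
		f[rowID] = make(map[uint64]bool)
	}
	f[rowID][colID] = true
}

// intersect keeps the columns present in both rows.
func intersect(a, b map[uint64]bool) map[uint64]bool {
	out := make(map[uint64]bool)
	for col := range a {
		if b[col] {
			out[col] = true
		}
	}
	return out
}

// columns returns the column IDs of a row in ascending order.
func columns(row map[uint64]bool) []uint64 {
	cols := make([]uint64, 0, len(row))
	for col := range row {
		cols = append(cols, col)
	}
	sort.Slice(cols, func(i, j int) bool { return cols[i] < cols[j] })
	return cols
}

// lookup indexes the table and answers WHERE A = av AND B = bv.
func lookup(table [][2]uint64, av, bv uint64) []uint64 {
	a, b := field{}, field{}
	for colID, values := range table {
		// Values double as row IDs here; the real driver maps them via BoltDB.
		a.add(values[0], uint64(colID))
		b.add(values[1], uint64(colID))
	}
	return columns(intersect(a[av], b[bv]))
}

func main() {
	// Table rows (A, B); the slice position plays the role of colID.
	table := [][2]uint64{{1, 4}, {3, 7}, {2, 9}, {2, 4}, {5, 4}, {2, 4}}
	fmt.Println(lookup(table, 2, 4)) // prints [3 5]
}
```

Here A = 2 yields the bitmap [0, 0, 1, 1, 0, 1] and B = 4 yields [1, 0, 0, 1, 1, 1], so the intersection is columns 3 and 5, which would then be translated back to row locations.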

Interfaces

IndexDriver interface.
> type IndexDriver interface {
>     ID() string
>     LoadAll(db, table string) ([]Index, error)
>     Create(db, table, id string, Expressions []Expressions, Config map[string]string) (Index, error)
>     Save(*Context, Index, PartitionIndexKeyValueIter) error
>     Delete(Index, PartitionIter) error
> }

Index interface.
> type Index interface {
>     Has(p Partition, keys ...interface{}) (bool, error)
>     Get(keys ...interface{}) (IndexLookup, error)
>     ...
> }
> type AscendIndex interface {
>     AscendGreaterOrEqual(keys ...interface{}) (IndexLookup, error)
>     AscendLessThan(keys ...interface{}) (IndexLookup, error)
>     AscendRange(ge, lt []interface{}) (IndexLookup, error)
> }

IndexLookup interface.
> type IndexLookup interface {
>     Values(Partition) (IndexValueIter, error)
>     Indexes() []string
> }
> type SetOperations interface {
>     Intersection(...IndexLookup) IndexLookup
>     Union(...IndexLookup) IndexLookup
>     Difference(...IndexLookup) IndexLookup
> }

Mapping

Mapping values to rowID
> func getRowID(field string, value interface{}) (id uint64) {
>     b := CreateBucketIfNotExists(field)
>     var key bytes.Buffer
>     enc := gob.NewEncoder(&key)
>     enc.Encode(value)
>     // key already exists
>     if v := b.Get(key.Bytes()); v != nil {
>         id = LittleEndian.Uint64(v)
>         return id
>     }
>     ...
> }

Mapping values to rowID
> func getRowID(field string, value interface{}) (id uint64) {
>     ...
>     // key doesn't exist
>     id, _ = b.NextSequence()
>     val := make([]byte, 8)
>     LittleEndian.PutUint64(val, id)
>     b.Put(key.Bytes(), val)
>     return id
> }
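
The two halves of getRowID can be exercised without BoltDB using an in-memory stand-in. In the sketch below the `bucket` type and its `nextSequence` method are hypothetical replacements for a BoltDB bucket and its sequence; only the gob encoding of the key follows the slides directly.

```go
package main

import (
	"bytes"
	"encoding/gob"
	"fmt"
)

// bucket is a hypothetical in-memory stand-in for a BoltDB bucket with a sequence.
type bucket struct {
	data map[string]uint64
	seq  uint64
}

// nextSequence mimics an auto-incrementing bucket sequence; the first ID is 1.
func (b *bucket) nextSequence() uint64 {
	b.seq++
	return b.seq
}

// getRowID returns the row ID for a value, allocating the next ID on first sight.
func getRowID(b *bucket, value interface{}) (uint64, error) {
	// Encode the value with gob so that any supported type can serve as a key.
	var key bytes.Buffer
	if err := gob.NewEncoder(&key).Encode(value); err != nil {
		return 0, err
	}
	if id, ok := b.data[key.String()]; ok {
		return id, nil // key already exists
	}
	id := b.nextSequence() // key doesn't exist yet
	b.data[key.String()] = id
	return id, nil
}

func main() {
	b := &bucket{data: make(map[string]uint64)}
	for _, v := range []interface{}{"Go", "Python", "Go", 42} {
		id, err := getRowID(b, v)
		fmt.Println(v, id, err)
	}
}
```

A repeated value such as "Go" gets the same row ID on every call, while each new value takes the next number in the sequence, which keeps row IDs sequential and increasing within a field.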

Thanks
● https://sourced.tech/engine
● https://github.com/RoaringBitmap/roaring
● https://github.com/src-d/gitbase
● https://github.com/pilosa/pilosa
● https://github.com/src-d/go-mysql-server
