ToroDB Stampede: Open-source BI connector for MongoDB Álvaro Hernández Tortosa <aht@8kdata.com> Percona live 2017
Who I am CEO, 8Kdata.com • What we do @8Kdata: ALVARO HERNANDEZ ✓ Creators of ToroDB.com, NoSQL & SQL database ✓ Database R&D, product development ✓ Training and consulting in PostgreSQL ✓ PostgreSQL Support Twitter: @ahachete Linkedin: Founder, President Spanish Postgres User Group http://es.linkedin.com/in/alvarohernandeztortosa/ postgrespana.es ~ 750 members
MongoDB Connector for BI
MongoDB BI Connector •https://docs.mongodb.com/bi- connector/master/ •Developed by MongoDB themselves •Effectively allows to query mongo in a SQL-ish way But...
MongoDB BI Connector •How do they do it? You don’t know it. •Harder to detect bugs •They’re part of bigger solutions which may or may not fit you •You need Enterprise version, you are paying for the whole solution
MongoDB does *not* have consistent reads https://blog.meteor.com/mongodb-queries-dont-always-return-all-matching-documents-654b6594a827#.fplxodagr
ToroDB Stampede
ToroDB: from MongoDB to PostgreSQL
How do we generate the schema? ★ There’s no need to generate it, it’s implicit! { name: “Alice”, address: { street: “Wonderland”, number: 42 }, hobbies: [ “reading”, “talking to rabbits” ] } implicit schema
How do we generate the schema? •ToroDB analyzes every incoming document and separates metadata (schema) from data (tuples) •With that schema info, ToroDB creates 1+ tables per MongoDB collection •ToroDB also creates a RDBMS catalog schema per MongoDB database
How do we generate the schema? SELECT * FROM torodb.people; database: torodb did name_s _id_x collection: people { 1 Alice \x01d4a63dc6944f37a61cf3d6 name: “Alice”, _id: (implicit val), SELECT * FROM torodb.people_address; address: { street: did name_s seq number_d street_s “Wonderland”, number: 42 1 1 42 Wonderland }, hobbies: [ SELECT * FROM torodb.people_hobbies; “reading”, “talking to did rid seq value_s rabbits” 1 1 0 “reading” ] } 1 2 1 “talking to rabbits”
How to react to dynamic changes in the schema? [ { name: “Alice” }, { name: “Harry”, surname: “Potter” } ] ★ What happens if a new document in the collection brings new fields?
How to react to dynamic changes in the schema? did _id_x name_s surname_s 1 \x01d4a63dc6944f37a61cf3d6 Alice 2 \x21f4f563dc6944f37a61cf436 Harry Potter ★ ToroDB automatically performs: ✓ ALTER TABLE ... ADD COLUMN surname_s text
How to react to dynamic changes in the schema? [ { name: “Harry”, surname: “Potter” } { name: “Henry”, surname: 8 }, ] ★ What happens if a new document in the collection brings type conflicts?
How to react to dynamic changes in the schema? did _id_x name_s surname_s surname_i 1 \x01d4a63dc6944f37a61cf3d6 Harry Potter 2 \x21f4f563dc6944f37a61cf436 Henry 8 ★ Type conflicts are resolved by suffixing column names with the type of the data: "s" for text, "i" for integer and so forth
ToroDB Architecture This is your SQL replica! ToroDB ToroDB works as a secondary node on a MongoDB replica set
ToroDB Architecture • ToroDB works as a secondary node on a replica set • This node has two main modes: ✓ First, it imports all MongoDB data into PostgreSQL ✓ Then it stays tuned to changes in MongoDB Oplog and applies them in PostgreSQL
ToroDB Architecture Protocols MongoDB*, Couchbase, Cassandra… Abstract Doc Tables and Rows { Here is where ‘the magic’ happens transforming Docs into Data2Relational Backends PostgreSQL*, MySQL, SQL Server, Oracle… * currently available ToroDB offers the interface of any NoSQL database, but persisting data in tables and rows within a SQL database.
Measuring I/O with iotop required to answer a query Github Archive: top 10 actors (1,4GB dataset) 500 MongoDB storageSize: 536.37 MB MongoDB size: 1410,35 MB 375 Exactly 100% of the storageSize! db.githubarchive.aggregate([ Disk Read (MB) { $group: { _id: '$actor.login', events: { $sum: 1 } 536,36 250 } }, { $sort: { events: -1 }}, { $limit: 10 } ]) 125 SELECT count(*), login FROM actor GROUP BY login 87,93 6,5 ORDER BY 1 DESC LIMIT 10; 0 MongoDB PostgreSQL PostgreSQL + cstore
Building the benchmarks • Data from Github archive • Timing three different environments that contain the same data: ✓ MongoDB ✓ MongoDB with 3 shards ✓ PostgreSQL (obtained by migrating MongoDB environment via ToroDB) •We run 6 queries (PostgreSQL queries being the relational equivalent to MongoDB queries) ✓ Available at https://gist.github.com/ahachete
Github 500GB PostgreSQL MongoDB MongoDB (3 Shards) 880 A 1.851 621 83 B 4.333 1.184 57x 239 C 4.253 1.191 254 D 14.504 3.691 1.650 E 4.732 1.418 402 F 4.126 1.021 0 4.000 8.000 12.000 16.000 Seconds (Less is better)
Github 100GB PostgreSQL MongoDB MongoDB (3 Shards) 34 A 361 89 13 B 750 That’s 267x 194 faster! 38 C 747 163 9 D 2.405 383 240 E 768 193 53 F 664 117 0 750 1.500 2.250 3.000 Seconds (Less is better)
Github 100GB no idx PostgreSQL MongoDB MongoDB (3 Shards) 31 A 715 199 13 B 694 209 Important 37 because 94x C 668 many BI 193 queries 7 don’t use D 658 indexes 158 241 E 793 211 53 F 667 164 0 200 400 600 800 Seconds (Less is better)
What's on my MongoDB database?
Stampede infers and clearly shows you your data
Or use SQL tools to show your schema
Installation Prerequisites ToroDB Stampede MongoDB connector for BI • PostgreSQL • MongoDB Enterprise 3.2+ • Java SE 8 • OpenSSL
Installation MongoDB connector for BI wget https://fastdl.mongodb.org/linux/mongodb-linux-x86_64- ubuntu1604-3.4.3.tgz tar xjf mongodb-linux-x86_64-ubuntu1604-3.4.3.tgz sudo install -m755 bin/mongo* /usr/local/bin/ ToroDB Stampede wget https://www.torodb.com/download/torodb-stampede-1.0.0- beta1.tar.bz2 tar xjf torodb-stampede-1.0.0-beta1.tar.bz2
Configuration MongoDB connector for BI ... It’s something like this schema: • You have to create an schema for - db: database each database (or collection). tables: - table: name_table collection: name_collection pipeline: [] mongodrdl --host example.com columns: -d dbname [-c collname] - Name: _id MongoType: bson.ObjectId -o schema.drdl SqlName: _id SqlType: varchar • You can edit manually… - Name: column_name MongoType: type SqlName: sqlname SqlType: sqltype
Configuration ToroDB Stampede • Requires a new user ( ‘torodb' ) and a new database ( ’torod' ) in PostgreSQL. • Access configuration to PostgreSQL detailed in .toropass : host:port:torod:torodb:<password> • You can also check the configuration used by ToroDB Stampede using the -l parameter: torodb-stampede -l
Start! MongoDB connector for BI mongosqld --schema schema.drdl --mongo-uri <your.mongohost.com> • You need one schema per database • It enables a MySQL connection ToroDB Stampede $TOROHOME/bin/torodb-stampede • And it starts the automagically replication
Problem resolution Insert Documents with new fields MongoDB connector for BI ToroDB Stampede • You will surely have to edit • Creates new columns and manually the mongodrdl tables if it’s necessary schema: ➡ mongodrdl infers the schema by sampling
Problem resolution Type Conflicts MongoDB connector for BI ToroDB Stampede • You should execute mongodrdl • The column names contain again, but surely you don’t have a postfix, indicating the the new type, so: data type ➡ You have to edit the schema ➡ If one path contains and add a new field with two different data types different sql name. then two different columns are created.
Problem resolution Array / Document exploding MongoDB Connector for BI database: stampede collection: restaurants stampede.restaurants { _id name address.street address.number _id: (implicit val), 01d4a63 “Glorious Food” “Wonderland” 42 name: “Glorious Food” dc6944f address: { street:”Wonderland”, number: 42, coord: [ -73.9832324, 40.6563633] stampede.restaurants_address_coord } _id address.coord address.coord_idx } 01d4a63 -73.9832324 0 dc6944f 01d4a63 40.6563633 1 dc6944f
Problem resolution Array / Document exploding ToroDB Stampede stampede.restaurants database: stampede did _id_x name_s address_e collection: restaurants 0 \x01d4a “Glorious Food” f { _id: (implicit val), name: “Glorious Food” address: { stampede.restaurants_address street:”Wonderland”, did rid seq street_s number_d coord_e number: 42, 0 0 “Wonderland” 42 t coord: [ -73.9832324, 40.6563633] } stampede.restaurants_address_coord } did rid pid seq v_d 0 0 0 0 -73,9832324 0 1 0 1 40.6563633
Use Case with… PostgreSQL 100.000 rows M 1.008 ms o n g o 1 D 0 B 0 B . 0 I 0 0 8 r o 2 w . 4 s 1 2 m ( s 8 0 x ! ! ) SELECT * FROM twitter.message INNER JOIN twitter.message_user ON SELECT * FROM `twitter.message` twitter.message.did = LIMIT 100000 twitter.message_user.did LIMIT 100000
Recommend
More recommend