POLARDB for MyRocks
Extending shared storage to MyRocks
Zhang, Yuan
Alibaba Cloud
Apr, 2018

About me
• Yuan Zhang
• Database engineer
• At Alibaba for 5 years
• Focus on MySQL & MyRocks
• Email: zhangyuan.zy@alibaba-inc.com
MORE THAN JUST CLOUD

Agenda
• Background
• Basic Architecture
• Implementation details
• Performance Improvement
• Future plan

Background
Why POLARDB for MyRocks: MyRocks + PolarStore
Benefits from MyRocks
• Great space efficiency, better compression
• Great write efficiency, lower write amplification
• Fast data loading
• Compatible with MySQL
Benefits from shared storage (PolarStore)
• Guaranteed data consistency
• Ability to scale out read nodes immediately, without a full copy of the data

Basic Architecture
Primary
• Accepts read/write workload
Replica
• Accepts read-only workload
• Shares SST/WAL files with the Primary

Let's Begin
Prepare for RocksDB WAL replication
• Based on AliSQL 5.7
• Port MyRocks from Facebook
• Only the RocksDB and MyISAM engines are supported
• Convert system tables to RocksDB

Convert system tables to RocksDB
Prepare for RocksDB WAL replication
• Convert system tables to RocksDB
• Exceptions: mysql.slow_log and mysql.general_log are stored on local disk, so the Primary and each Replica have their own mysql.slow_log and mysql.general_log tables

RocksDB WAL/Manifest replication
Architecture

RocksDB WAL/Manifest replication
Asynchronous replication
WAL replication
• Replay PUT/DELETE/MERGE
Manifest replication
• Replay flush & compaction
WAL and Manifest coordination
• Only apply a VEdit (version edit) once the applied LSN > the VEdit's LSN
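The coordination rule above can be sketched as a small queue on the Replica: manifest records are held back until WAL replay has moved past their LSN. This is only an illustration under stated assumptions; `PendingEdit` and `ManifestCoordinator` are hypothetical names, not types from POLARDB or RocksDB, and the real applier presumably carries much more state.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a manifest record (VEdit) tagged with the WAL
// position (LSN) at which the Primary produced it.
struct PendingEdit {
  uint64_t lsn;
  std::string description;  // e.g. "flush" or "compaction"
};

// Holds manifest edits back until WAL replay has moved past their LSN.
class ManifestCoordinator {
 public:
  // Manifest records are assumed to arrive in non-decreasing LSN order.
  void OnManifestRecord(PendingEdit edit) {
    assert(pending_.empty() || pending_.back().lsn <= edit.lsn);
    pending_.push_back(std::move(edit));
  }

  // Called after the WAL applier has replayed up to applied_lsn.
  // Returns the edits that are now safe to apply, in order.
  std::vector<PendingEdit> OnWalApplied(uint64_t applied_lsn) {
    std::vector<PendingEdit> ready;
    // Rule from the slide: apply only while applied LSN > VEdit LSN.
    while (!pending_.empty() && applied_lsn > pending_.front().lsn) {
      ready.push_back(std::move(pending_.front()));
      pending_.pop_front();
    }
    return ready;
  }

  std::size_t PendingCount() const { return pending_.size(); }

 private:
  std::deque<PendingEdit> pending_;
};
```

With edits queued at LSN 10 and 20, `OnWalApplied(10)` releases nothing, `OnWalApplied(15)` releases the first edit, and the second waits until replay passes LSN 20.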

RocksDB WAL/Manifest replication
Control WAL and SST file deletion on the Primary
WAL deletion: the original WAL deletion logic would remove WALs that a Replica still needs
• Lm: min_log_number on the Primary
• Ln: smallest min_log_number across all Replicas
• new_min_log_number = min(Lm, Ln)
• A WAL can be deleted when its number < new_min_log_number
SST deletion: the original SST deletion logic would leave a Replica unable to find an SST, and it would crash
• min_version_number: the minimal version number a Replica is using
• An SST can be deleted only when it is no longer used by the Primary or by any Replica
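The two deletion rules can be written down as small predicates. The WAL part follows the formula on the slide directly. The SST part is a guess at one way to express "not used by the Primary and all Replicas": it assumes each obsolete SST remembers the first version number that no longer references it, which the slides do not state. All function names here are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// new_min_log_number = min(Lm, Ln): Lm is the Primary's min_log_number and
// Ln is the smallest min_log_number reported by any Replica.
uint64_t NewMinLogNumber(uint64_t lm,
                         const std::vector<uint64_t>& replica_min_logs) {
  uint64_t result = lm;
  for (uint64_t n : replica_min_logs) result = std::min(result, n);
  return result;
}

// A WAL file may be deleted only when its number is below the new minimum.
bool WalDeletable(uint64_t wal_number, uint64_t new_min_log_number) {
  return wal_number < new_min_log_number;
}

// Assumption (not from the slides): an obsolete SST records the first
// version number that no longer references it. It is deletable once the
// oldest version in use on every node is at least that version.
bool SstDeletable(uint64_t first_version_without_sst,
                  uint64_t primary_min_version,
                  const std::vector<uint64_t>& replica_min_versions) {
  if (primary_min_version < first_version_without_sst) return false;
  return std::all_of(
      replica_min_versions.begin(), replica_min_versions.end(),
      [&](uint64_t v) { return v >= first_version_without_sst; });
}

// Usage: a lagging Replica (min_log_number 7) pins WAL 7 and later.
void ExampleWalRetention() {
  const uint64_t keep_from = NewMinLogNumber(12, {9, 7});
  assert(keep_from == 7);
  assert(WalDeletable(6, keep_from));
  assert(!WalDeletable(7, keep_from));
}
```

The point of both predicates is the same: the slowest Replica, not the Primary, decides when a file can go.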

DDL & Cache replication
Architecture

DDL Replication
Remove frm/par files
frm/par files
• Hold table metadata
• If the Primary and Replica share frm/par files, DDL replication must be synchronous
Remove frm/par files
• Store their contents in RocksDB
• The Replica can read multiple versions of a table schema
• DDL replication becomes asynchronous

DDL Replication
Remove frm/par files
DDL replication is asynchronous
• Multiple table schema versions are kept in RocksDB
• Row data also has different versions

DDL Replication
The Primary uses an MDL (metadata lock) to protect DDL operations. The Replica needs the same lock when it replays a DDL.
Primary
• Log MDL lock start and end
Replica
• Replay MDL lock start
  A. lock MDL
• Replay MDL lock end
  A. update table cache in MyRocks
  B. unlock MDL

Cache Replication
ACL, Procedure, Query cache replication
Primary
• Log cache changes (ACL, Procedure) in the RocksDB WAL
Replica
• Replay the change from the WAL and invalidate the corresponding cache

Index Statistics Replication
Persistent
• Partial index statistics are persisted in each SST
• Total index statistics are stored in INDEX_STATISTICS
Memory
• Rdb_key_def::m_stats
Update
• Analyze table
• Flush memtable
• Compact
The Replica listens for PUT operations on INDEX_STATISTICS and reloads the statistics into memory.

New Log Format
Log changes for replication
Log types
• DDL (START, END)
• Cache change (ACL/Proc)
Log format
• PUT/DELETE
Log store location
• __system__ column family

New Log Format
New type in data dictionary

  // Data dictionary types
  enum DATA_DICT_TYPE {
    DDL_ENTRY_INDEX_START_NUMBER = 1,
    INDEX_INFO = 2,
    CF_DEFINITION = 3,
    BINLOG_INFO_INDEX_NUMBER = 4,
    DDL_DROP_INDEX_ONGOING = 5,
    INDEX_STATISTICS = 6,
    MAX_INDEX_ID = 7,
    DDL_CREATE_INDEX_ONGOING = 8,
    POLAR_LOG = 100,  // for polar replication
    END_DICT_INDEX_ID = 255
  };

  enum POLAR_LOG_TYPE {
    TABLE_DDL = 1,
    CACHE_CHANGE = 2,
    ……
    END_POLAR_ROCK_TYPE = 255
  };

New Log Format
New type in data dictionary
DDL_START
• type: PUT
• key: POLAR_LOG+TABLE_DDL+dbname.tablename
• value: NULL
DDL_END
• type: DELETE
• key: POLAR_LOG+TABLE_DDL+dbname.tablename
• value: NULL
CACHE_CHANGE
• type: PUT
• key: POLAR_LOG+CACHE_CHANGE+ACL/Proc
• value: NULL
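A key such as POLAR_LOG+TABLE_DDL+dbname.tablename could be built as below. The slides give only the order of the components, so the byte layout is an assumption: each numeric prefix is written here as 4 big-endian bytes, and the helper names (`AppendBE32`, `PolarLogKey`, `DdlKey`) are hypothetical. The enum values are the ones from the data dictionary slide.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Values taken from the slides; only the members needed here are repeated.
enum DATA_DICT_TYPE : uint32_t { POLAR_LOG = 100 };
enum POLAR_LOG_TYPE : uint32_t { TABLE_DDL = 1, CACHE_CHANGE = 2 };

// Appends v as 4 big-endian bytes. The width is an assumption; the slides
// do not state how the prefixes are encoded.
void AppendBE32(std::string* out, uint32_t v) {
  for (int shift = 24; shift >= 0; shift -= 8) {
    out->push_back(static_cast<char>((v >> shift) & 0xFF));
  }
}

// Key layout: POLAR_LOG + <log type> + <suffix>.
std::string PolarLogKey(POLAR_LOG_TYPE type, const std::string& suffix) {
  std::string key;
  AppendBE32(&key, POLAR_LOG);
  AppendBE32(&key, type);
  key += suffix;
  return key;
}

// Key shared by DDL_START (PUT) and DDL_END (DELETE) for one table.
std::string DdlKey(const std::string& db, const std::string& table) {
  assert(!db.empty() && !table.empty());
  return PolarLogKey(TABLE_DDL, db + "." + table);
}
```

Because DDL_START and DDL_END use the same key, the DELETE removes exactly the record the PUT created, so a surviving record marks a DDL that is still in flight.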

New Log Format
Problems
DDL_START and DDL_END must come as a pair.
Problem 1: Primary crash
• If the Primary crashes after DDL_START, it will resend DDL_START on restart, and the DDL_END for the previous DDL_START is lost.
• The Replica has replayed the first DDL_START and holds the MDL lock; no DDL_END arrives to unlock it.
Solution
• On restart, the Primary scans RocksDB for TABLE_DDL records. If one is found, the Primary resends DDL_END, and the Replica releases the old lock.
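The Primary-side fix can be sketched as a prefix scan at startup. This is a toy model, not the actual implementation: a `std::map` stands in for the __system__ column family, the key prefix is written as readable text rather than the real encoding, and `RecoverPrimary` and `LogRecord` are hypothetical names.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Stand-in for the __system__ column family: ordered key -> value.
using SystemCf = std::map<std::string, std::string>;

// Hypothetical readable prefix standing in for POLAR_LOG + TABLE_DDL.
const std::string kDdlPrefix = "POLAR_LOG/TABLE_DDL/";

struct LogRecord {
  enum Kind { kPut, kDelete };
  Kind kind;
  std::string key;
};

// On Primary restart: every surviving TABLE_DDL record is a DDL_START that
// never got its DDL_END. Remove it and emit the matching DDL_END (a DELETE)
// so that Replicas release the stale MDL lock.
std::vector<LogRecord> RecoverPrimary(SystemCf* cf) {
  assert(cf != nullptr);
  std::vector<LogRecord> resent;
  auto it = cf->lower_bound(kDdlPrefix);
  while (it != cf->end() &&
         it->first.compare(0, kDdlPrefix.size(), kDdlPrefix) == 0) {
    resent.push_back({LogRecord::kDelete, it->first});
    it = cf->erase(it);  // erase returns the next iterator
  }
  return resent;
}
```

Records under other prefixes, such as cache-change entries, are left untouched by the scan.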

New Log Format
Problems
DDL_START and DDL_END must come as a pair.
Problem 2: Replica crash
• If the Replica crashes after replaying DDL_START, it continues with DDL_END after restart.
• The lock taken for DDL_START no longer exists after the restart, so replaying DDL_END tries to release an MDL lock that does not exist.
Solution
• On restart, the Replica scans RocksDB for TABLE_DDL records. If one is found, the Replica replays DDL_START to take the lock again.
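The Replica-side fix mirrors the Primary's: rebuild the in-memory lock state from the TABLE_DDL records that are still present. The sketch below is a toy model under the same assumptions as before (a `std::map` in place of the __system__ column family, a readable key prefix); `ReplicaMdl` is a hypothetical class, and a set of table names stands in for real MDL locks.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>

// Stand-in for the __system__ column family: ordered key -> value.
using SystemCf = std::map<std::string, std::string>;

// Hypothetical readable prefix standing in for POLAR_LOG + TABLE_DDL.
const std::string kDdlPrefix = "POLAR_LOG/TABLE_DDL/";

class ReplicaMdl {
 public:
  // Replay of DDL_START: take the metadata lock for the table.
  void ReplayDdlStart(const std::string& table) {
    assert(!table.empty());
    locked_.insert(table);
  }

  // Replay of DDL_END: the table cache would be refreshed here, then the
  // lock is released. Returns false if no lock was held.
  bool ReplayDdlEnd(const std::string& table) {
    return locked_.erase(table) == 1;
  }

  bool IsLocked(const std::string& table) const {
    return locked_.count(table) > 0;
  }

  // On Replica restart: re-take the lock for every DDL still in flight.
  void RecoverFrom(const SystemCf& cf) {
    for (auto it = cf.lower_bound(kDdlPrefix);
         it != cf.end() &&
         it->first.compare(0, kDdlPrefix.size(), kDdlPrefix) == 0;
         ++it) {
      ReplayDdlStart(it->first.substr(kDdlPrefix.size()));
    }
  }

 private:
  std::set<std::string> locked_;  // tables whose MDL lock is held
};
```

Without `RecoverFrom`, a restarted Replica would hit the case from Problem 2: `ReplayDdlEnd` finds nothing to release and returns false.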

MVCC
MVCC based on RocksDB snapshots
Keep a consistent snapshot in the Replica
• A Replica cannot read a record once the Primary has compacted it away
Control compaction on the Primary
• Compaction on the Primary must take the Replica's snapshots into account
• Only delete a record when sequence >= Sn, where Sn is the latest sequence in the Replica
• The Primary's snapshot list is merged with the Replica's snapshot list
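Why merging the snapshot lists matters can be shown with the usual snapshot rule for compaction: an overwritten version may be dropped only if no live snapshot can still see it. The sketch below illustrates that general rule, not POLARDB's actual compaction code; `MergeSnapshots` and `CanDropOldVersion` are hypothetical helpers, and snapshots are modeled as plain sequence numbers.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <iterator>
#include <vector>

using SequenceNumber = uint64_t;

// Merges two sorted, duplicate-free snapshot lists into one sorted list.
std::vector<SequenceNumber> MergeSnapshots(
    const std::vector<SequenceNumber>& primary,
    const std::vector<SequenceNumber>& replica) {
  assert(std::is_sorted(primary.begin(), primary.end()));
  assert(std::is_sorted(replica.begin(), replica.end()));
  std::vector<SequenceNumber> merged;
  std::set_union(primary.begin(), primary.end(), replica.begin(),
                 replica.end(), std::back_inserter(merged));
  return merged;
}

// A version written at old_seq and overwritten at new_seq may be dropped
// only if no snapshot s satisfies old_seq <= s < new_seq, because such a
// snapshot would still read the old version.
bool CanDropOldVersion(SequenceNumber old_seq, SequenceNumber new_seq,
                       const std::vector<SequenceNumber>& snapshots) {
  auto it = std::lower_bound(snapshots.begin(), snapshots.end(), old_seq);
  return it == snapshots.end() || *it >= new_seq;
}
```

For example, with Primary snapshots {5, 20} alone, a version written at sequence 7 and overwritten at 12 looks droppable. Once a Replica snapshot at 10 is merged in, the same version must be kept, which is the situation the slide describes.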

MVCC
MVCC based on RocksDB snapshots
Keep a consistent snapshot in the Replica

Performance Improvement
Optimize write performance
• Async-commit
• Optimize auto_increment

Performance Improvement
Async-commit
Original pipeline write

Performance Improvement
Async-commit
Async-commit

Performance Improvement
Optimize write performance
Optimize auto_increment
• Every write needs a uniqueness check
• This means a Get before each write
• Get is expensive
In practice, most uniqueness checks on auto_increment columns are unnecessary, especially when every auto_increment value is generated automatically.

Performance Improvement
Optimize write performance
Optimize auto_increment
• max_specify_pk: the largest auto_increment value specified explicitly by a user
• If pk > max_specify_pk, skip the uniqueness check
• If pk <= max_specify_pk, the uniqueness check is needed
max_specify_pk is updated whenever a user specifies an auto_increment value explicitly.
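The max_specify_pk rule can be sketched with a toy table that counts how many Get-style lookups it performs. This is an illustration only: `PkWriter` is a hypothetical class, a `std::set` stands in for the storage engine, and the choice to always check user-specified values is an assumption the slides do not spell out.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <set>

class PkWriter {
 public:
  // Inserts pk and returns false on a detected duplicate. user_specified is
  // true when the statement supplied the auto_increment value explicitly.
  bool Insert(uint64_t pk, bool user_specified) {
    // Assumption: explicitly specified values are always checked. Generated
    // values are checked only when pk <= max_specify_pk (rule from the slide).
    const bool need_check = user_specified || pk <= max_specify_pk_;
    if (need_check) {
      ++gets_;  // models the expensive Get before the write
      if (rows_.count(pk) > 0) return false;
    }
    // Debug-only sanity check: a skipped lookup must never hide a duplicate.
    assert(need_check || rows_.count(pk) == 0);
    rows_.insert(pk);
    if (user_specified) max_specify_pk_ = std::max(max_specify_pk_, pk);
    return true;
  }

  uint64_t gets() const { return gets_; }

 private:
  std::set<uint64_t> rows_;      // stand-in for stored primary keys
  uint64_t max_specify_pk_ = 0;  // largest user-specified value so far
  uint64_t gets_ = 0;            // number of uniqueness lookups performed
};
```

As long as all values are generated, `gets()` stays at zero. After a user inserts an explicit value such as 100, generated values at or below 100 pay for the lookup again, while values above it keep skipping it.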

Future
Features
• Online DDL
• Multi-master
Performance
• Compaction optimization

Q&A