Repositories and content addressable storage




  1. Repositories and content addressable storage
     A data repository needs to (among other things):
     ● Make sure data remains safe and uncorrupted
     ● Make sure data remains available
     ● Keep the previous version if data is changed
     Solutions are available, but:
     ● Links to data break. How can we make sure that a link, once created, never breaks?
       ○ Who keeps track of what is where?
     ● What if two files have different names but the same content (duplication)?
     ● Dealing with unexpected events
     Many solutions use centralized systems:
     ● Single point of failure, single entity in control
     ● What about doing all of the above at scale (big data, etc.)?

  2. Repositories and content addressable storage
     Possible solution: distributed and content-addressed storage
     Distributed = a resource is controlled by many. No single place, person, server, or entity has full control.
     Location addressed = things are found based on a known location
     ● C:\Photos\vacation.jpg
     ● If the file is moved or renamed, the identifier changes even though the content doesn't: C:\Pictures\Vacation\waterslide.jpg
     Content addressed = things are found based on their content
     ● Create a digital fingerprint of vacation.jpg based on its content.
     ● The fingerprint stays the same no matter where the file physically resides.
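The difference between the two addressing schemes can be sketched in a few lines of Python. This is only an illustration: SHA-256 is assumed as the fingerprint function (the slides do not name one), and the byte string standing in for the photo is made up.

```python
import hashlib


def fingerprint(content: bytes) -> str:
    """Return a hex digest of the content (SHA-256 is an assumption)."""
    return hashlib.sha256(content).hexdigest()


# Hypothetical photo bytes; in practice these would be read from disk.
photo = b"pretend these are JPEG bytes"

# Location addressing: the same bytes are known under two different paths.
by_location = {
    r"C:\Photos\vacation.jpg": photo,
    r"C:\Pictures\Vacation\waterslide.jpg": photo,
}

# Content addressing: both copies yield one and the same identifier.
ids = {path: fingerprint(data) for path, data in by_location.items()}
same_id = len(set(ids.values())) == 1

# Changing even one byte of the content changes the identifier.
changed_id = fingerprint(photo + b"!") != fingerprint(photo)
```

Moving or renaming the file leaves `fingerprint(photo)` untouched, which is what makes a content-based link stable.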

  3. Repositories and content addressable storage
     Content addressable storage
     ● The fingerprint (hash) stays the same as long as the content is unchanged, so it can uniquely identify data and de-duplicate it
     Distributed content addressable storage
     ● Decentralizes the table that keeps track of where the raw data associated with each fingerprint physically resides
       ○ Uses many participants, each having equal responsibility
     ● No single point of failure, e.g., no single entity controls the lookup table
     ● Links can stay around forever, as long as the network exists
     ● Can use the resources of participants to keep safe copies of the data and use their bandwidth to speed up transfers
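A toy, single-node version of such a store shows why de-duplication falls out of content addressing. The `ContentStore` class and its method names are hypothetical, and SHA-256 is again an assumed hash function; a distributed system would spread this table over many participants rather than keep it in one dictionary.

```python
import hashlib


class ContentStore:
    """Toy in-memory content-addressable store (hypothetical, single node)."""

    def __init__(self):
        # Maps fingerprint (hex string) -> raw bytes.
        self._blobs = {}

    def put(self, data: bytes) -> str:
        """Store data under its own hash and return that hash."""
        key = hashlib.sha256(data).hexdigest()
        # Identical content maps to the same key, so it is stored only once.
        self._blobs.setdefault(key, data)
        return key

    def get(self, key: str) -> bytes:
        """Fetch data and verify that it still matches its fingerprint."""
        data = self._blobs[key]
        if hashlib.sha256(data).hexdigest() != key:
            raise ValueError("stored data does not match its fingerprint")
        return data

    def __len__(self) -> int:
        return len(self._blobs)


store = ContentStore()
k1 = store.put(b"same bytes")
k2 = store.put(b"same bytes")   # duplicate content, same key
k3 = store.put(b"other bytes")
```

Because the key is derived from the data, `get` can also detect corruption by re-hashing what it reads, which addresses the "safe and uncorrupted" requirement from the first slide.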

  4. IPFS IPFS is a content addressed distributed storage protocol ● A single file system that is spread out on many computers (nodes)

  5. IPFS and repositories
     Generally:
     ● IPFS is a protocol rather than a service
     ● The nodes form a distributed file system based on P2P technology (e.g., DHT for lookups)
     ● Versioning and de-duplication are fundamentally part of it
     ● Files are broken down into blocks
       ○ Each block has a hash.
       ○ Blocks are linked in a tree-like structure.
     ● “IPFS is actually more similar to a single bittorrent swarm exchanging git objects.”
     Some interesting properties:
     ● Can build services on top of it (client, server)
     ● Can access IPFS content via standard HTTP using gateways (see figure) or FUSE
     ● Objects can be “pinned” so they aren’t garbage collected and always stay local
     ● Possibility to create a private IPFS network (via modification of the bootstrap list). Easy, quick
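The idea of hashed blocks linked in a tree can be sketched as follows. This is not IPFS's actual data format: it is a one-level tree with an assumed SHA-256 hash, a deliberately tiny block size, and hypothetical helper names (`add_file`, `read_file`).

```python
import hashlib

BLOCK_SIZE = 4  # tiny on purpose; real systems use much larger blocks


def h(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def add_file(data: bytes, blocks: dict) -> str:
    """Split data into blocks, store each under its hash, return the root hash."""
    child_hashes = []
    for i in range(0, len(data), BLOCK_SIZE):
        block = data[i:i + BLOCK_SIZE]
        key = h(block)
        blocks[key] = block          # identical blocks are stored only once
        child_hashes.append(key)
    # The root node lists its children's hashes, so the root hash
    # depends on every block of the file.
    root = "\n".join(child_hashes).encode()
    root_key = h(root)
    blocks[root_key] = root
    return root_key


def read_file(root_key: str, blocks: dict) -> bytes:
    """Follow the links from the root node and reassemble the file."""
    child_hashes = blocks[root_key].decode().split()
    return b"".join(blocks[c] for c in child_hashes)


blocks = {}
v1 = add_file(b"aaaabbbbcccc", blocks)
v2 = add_file(b"aaaabbbbdddd", blocks)  # a second version, last block changed
```

The two versions get different root hashes, yet they share the unchanged blocks, which is how versioning and de-duplication come from the same mechanism.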

  6. IPFS Gateway

  7. What IPFS isn’t
     A cloud storage service or backup protocol
     ● Can’t upload stuff and disconnect
     ● Files must be kept available by “pinning” them
     ● Unpinned files may be garbage collected after some time
     ● Who will pin files in addition to the owner?
       ○ Other interested parties?
     A blockchain-based system
     ● Blockchain = immutable, publicly available & verifiable record of transactions
     ● IPFS can work with a blockchain
       ○ Incentives for providing node resources
       ○ “Mining” a cryptocurrency for reward
     ● Storing data in a blockchain is inefficient
       ○ Combine the two: store transactions in the blockchain and the data in IPFS
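Why pinning matters can be illustrated with a toy node that drops everything it has not pinned. The `Node` class and its methods are hypothetical and do not mirror the API of any IPFS implementation; when and how a real node runs garbage collection is not specified in the slides.

```python
import hashlib


class Node:
    """Toy storage node with a pin set (hypothetical, not the IPFS API)."""

    def __init__(self):
        self.blocks = {}   # hash -> bytes held locally
        self.pins = set()  # hashes protected from garbage collection

    def add(self, data: bytes, pin: bool = False) -> str:
        key = hashlib.sha256(data).hexdigest()
        self.blocks[key] = data
        if pin:
            self.pins.add(key)
        return key

    def collect_garbage(self) -> int:
        """Delete every unpinned block and return how many were removed."""
        unpinned = [k for k in self.blocks if k not in self.pins]
        for k in unpinned:
            del self.blocks[k]
        return len(unpinned)


node = Node()
kept = node.add(b"dataset v1", pin=True)
dropped = node.add(b"scratch data")   # never pinned
removed = node.collect_garbage()
```

If no node anywhere pins a given hash, the content can disappear from the network, which is why a repository built on IPFS needs someone committed to pinning its data.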


