Repositories and content addressable storage

A data repository needs to (among other things)
● Make sure data remains safe and uncorrupted
● Make sure data remains available
● Keep the previous version whenever data is changed

Solutions are available, but...
● Links to data break: how do we make sure that once a link is created it never breaks?
○ Who keeps track of what is where?
● What if two files have different names but the same content (duplication)?
● Dealing with unexpected events

Many solutions use centralized systems
● Single point of failure, single entity in control
● What about doing all of the above at scale? Big data etc.
Repositories and content addressable storage

Possible solution: distributed and content addressed storage

Distributed = a resource is controlled by many. No single place, person, server, or entity has full control

Location addressed = things can be found based on a known location
● C:\Photos\vacation.jpg
● The identifier changes even though the content doesn’t: C:\Pictures\Vacation\waterslide.jpg

Content addressed = things can be found based on their content
● Create a digital fingerprint of vacation.jpg based on its content.
● The fingerprint stays the same no matter where the file physically resides
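The difference between the two addressing schemes can be sketched in a few lines of Python. This is only an illustration: SHA-256 is an assumed choice of fingerprint function, and the byte string is a hypothetical stand-in for the contents of vacation.jpg.

```python
import hashlib


def fingerprint(content: bytes) -> str:
    """Return a content fingerprint (here: the SHA-256 hex digest)."""
    return hashlib.sha256(content).hexdigest()


# Hypothetical bytes standing in for the real image file.
photo = b"pretend these bytes are a JPEG image"

# Location addressing: the same bytes live under two different identifiers.
files_by_path = {
    r"C:\Photos\vacation.jpg": photo,
    r"C:\Pictures\Vacation\waterslide.jpg": photo,
}

# Content addressing: both copies map to one and the same fingerprint.
fingerprints = {path: fingerprint(data) for path, data in files_by_path.items()}
for path, fp in fingerprints.items():
    print(fp[:16], path)
```

Moving or renaming the file changes the path but not the fingerprint, while changing even one byte of the content produces a different fingerprint.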
Repositories and content addressable storage

Content addressable storage
● The fingerprint (hash) always stays the same = uniquely identify, de-duplicate

Distributed content addressable storage
● Decentralizes the table that keeps track of where the raw data associated with the fingerprints physically resides
○ Uses many participants, each having equal responsibility
● No single point of failure, e.g., no single entity controls the lookup table
● Links can stay around for as long as the network exists.
● Can use the resources of participants to keep safe copies of the data, and their bandwidth to speed up transfers
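A toy version of such a decentralized lookup table might look like the sketch below. It assumes SHA-256 fingerprints and assigns each fingerprint to the participant whose ID is closest by XOR distance, which is one common approach but not necessarily what any particular system does. The participant names are hypothetical, and real networks also replicate entries and cope with participants joining and leaving.

```python
import hashlib


def digest(data: bytes) -> int:
    """Fingerprint of the data as an integer (assumed: SHA-256)."""
    return int.from_bytes(hashlib.sha256(data).digest(), "big")


class Participant:
    """One equal member of the network, holding a slice of the table."""

    def __init__(self, name: str):
        self.name = name
        self.node_id = digest(name.encode())
        self.table = {}  # fingerprint -> raw data


def responsible(participants, key: int) -> Participant:
    """Pick the participant whose ID is closest to the key (XOR distance)."""
    return min(participants, key=lambda p: p.node_id ^ key)


def put(participants, data: bytes) -> int:
    key = digest(data)
    responsible(participants, key).table[key] = data
    return key


def get(participants, key: int) -> bytes:
    data = responsible(participants, key).table[key]
    if digest(data) != key:  # the fingerprint doubles as an integrity check
        raise ValueError("stored data does not match its fingerprint")
    return data


network = [Participant(name) for name in ("node-a", "node-b", "node-c")]
key_1 = put(network, b"same content")
key_2 = put(network, b"same content")  # a duplicate under another "name"
print(key_1 == key_2, sum(len(p.table) for p in network))
```

Because the key is derived from the content, storing the same bytes twice yields the same key and a single stored copy, and any participant can verify what it receives.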
IPFS

IPFS is a content addressed, distributed storage protocol
● A single file system that is spread out over many computers (nodes)
IPFS and repositories

Generally:
IPFS is a protocol rather than a service
The nodes form a distributed file system based on P2P technology (e.g., DHT for lookups)
Versioning and de-duplication are fundamentally part of it
Files are broken down into blocks.
● Each block has a hash.
● Blocks are linked in a tree-like structure.

Some interesting properties:
Can build services on top of it (client, server)
Can access IPFS content via standard HTTP using gateways (see figure) or FUSE.
Objects can be “pinned” so they aren’t garbage collected and always stay local
Possibility to create a private IPFS network (via modification of the bootstrap list)
Easy, quick

“IPFS is actually more similar to a single bittorrent swarm exchanging git objects.”
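The block structure on this slide can be illustrated with a simplified two-level sketch. It is not the actual IPFS data format: the tiny block size, the plain SHA-256 hex keys, and the newline-separated root node are all assumptions made to keep the example short.

```python
import hashlib

BLOCK_SIZE = 4  # unrealistically small, for illustration only


def h(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def add_file(store: dict, content: bytes) -> str:
    """Split content into blocks, store each under its hash, return the root hash."""
    child_hashes = []
    for i in range(0, len(content), BLOCK_SIZE):
        block = content[i:i + BLOCK_SIZE]
        key = h(block)
        store[key] = block  # an identical block is stored only once
        child_hashes.append(key)
    # The root node just lists its children; its own hash names the whole file.
    root = "\n".join(child_hashes).encode()
    root_key = h(root)
    store[root_key] = root
    return root_key


def read_file(store: dict, root_key: str) -> bytes:
    child_hashes = store[root_key].decode().split()
    return b"".join(store[key] for key in child_hashes)


store = {}
v1 = add_file(store, b"aaaabbbbcccc")
v2 = add_file(store, b"aaaabbbbdddd")  # a new version with one changed block
print(len(store))
```

The two versions share their first two blocks, so the store holds four distinct blocks plus two root nodes. This is how versioning and de-duplication fall out of the block structure: each version keeps its own root hash, and unchanged blocks are reused.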
IPFS Gateway
What IPFS isn’t

A cloud storage service or backup protocol
● Can’t upload stuff and disconnect
Files must be kept available by “pinning” them.
● Unpinned files may be garbage collected after some time
● Who will pin files in addition to the owner?
○ Other interested parties?

A blockchain-based system
Blockchain = immutable, publicly available & verifiable record of transactions
Can work with blockchain
● Incentives for providing node resources
● “Mining” a cryptocurrency for reward
● Storing data in a blockchain is inefficient.
○ Combine the two: store transactions in the blockchain, data in IPFS.
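The relationship between pinning and garbage collection can be modelled with a toy node. The class and method names below are hypothetical and do not correspond to any IPFS API. The sketch only shows why content that nobody pins can disappear.

```python
import hashlib


class ToyNode:
    """Hypothetical node that keeps blocks locally and tracks pinned ones."""

    def __init__(self):
        self.blocks = {}   # hash -> data held on this node
        self.pins = set()  # hashes that must survive garbage collection

    def add(self, data: bytes, pin: bool = False) -> str:
        key = hashlib.sha256(data).hexdigest()
        self.blocks[key] = data
        if pin:
            self.pins.add(key)
        return key

    def garbage_collect(self) -> list:
        """Drop every block that is not pinned; return the removed hashes."""
        removed = [key for key in self.blocks if key not in self.pins]
        for key in removed:
            del self.blocks[key]
        return removed


node = ToyNode()
kept = node.add(b"dataset the owner cares about", pin=True)
dropped = node.add(b"content fetched once and never pinned")
removed = node.garbage_collect()
print(kept in node.blocks, dropped in node.blocks)
```

If the owner's node is the only one pinning a file and it goes offline, the file is unavailable, which is why the slide asks who else will pin it.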