Peer-to-peer systems and overlay networks (PowerPoint presentation)

1

Complex Adaptive Systems

C.d.L. Informatica – Università di Bologna

Peer-to-peer systems and overlay networks

Fabio Picconi
Dipartimento di Scienze dell’Informazione

2

Outline

Introduction to P2P systems
Common topologies
Data location
Churn
Newscast algorithm
Security

3

Peer-to-peer vs. client-server

client-server

Server well connected to the “center” of the Internet
Server carries out critical tasks
Clients only talk to the server

peer-to-peer

Only nodes located on the “periphery” of the Internet
Tasks distributed across all nodes
Clients talk to other clients

4

Example – Video sharing

Client-server: YouTube

Advantages

Client can disconnect after upload
Uploader needs little bandwidth
Other users can find the file easily (just use search on server webpage)

Disadvantages

Server may not accept the file, or may remove it later (according to content policy)
Whole system depends on the server (what if it is shut down, like Napster?)
Server storage and bandwidth are expensive!

[Figure: client-server; labels: one uploader, five downloaders]

5

Example – Video sharing

Peer-to-peer: BitTorrent

Advantages

Does not depend on a central server
Bandwidth shared across nodes (downloaders also act as uploaders)
High scalability, low cost

Disadvantages

Seeder must remain on-line to guarantee file availability
Content is more difficult to find (downloaders must find .torrent file)
Freeloaders cheat in order to download without uploading

[Figure: peer-to-peer; labels: one seeder, five downloaders]

6

Comparison: P2P vs. client-server

Client-server

Asymmetric: client and servers carry out different tasks
Global knowledge: servers have a global view of the network
Centralization: communications and management are centralized
Single point of failure: a server failure brings down the system
Limited scalability: servers easily overloaded
Expensive: server storage and bandwidth capacity is not cheap

Peer-to-peer

Symmetric: each node carries out the same tasks
Local knowledge: nodes only know a small set of other nodes
Decentralization: nodes must self-organize in a decentralized way
Robustness: several nodes may fail with little or no impact
High scalability: high aggregate capacity, load distribution
Low-cost: storage and bandwidth are contributed by users

7

Characterizing peer-to-peer systems

The main characteristics of P2P systems are:

decentralization (i.e., no central server)
self-organization (e.g., adding new nodes and removing disconnected ones)
symmetric communications (e.g., peers act as clients and servers)
scalability (thanks to high aggregate capacity and load distribution)
shared ownership (i.e., storage and bandwidth are contributed by peers)
overlay construction and routing (i.e., nodes form a logical network on top of the underlying IP network)

A message from one peer to another is sent through the underlying IP network

8

P2P environment

P2P systems are deployed in a challenging environment:

High latency and low bandwidth between nodes
a high hop count will result in a high end-to-end latency
transferring large files may take a long time
Churn
nodes may disconnect temporarily
new nodes are constantly joining the system, while others leave the overlay permanently
Security
P2P clients run on machines under full control of their users
data sent to other nodes may be erased, corrupted, disclosed, etc.
malicious users may try to bring down the system (e.g., routing attack)
Selfishness
users may run hacked P2P clients in order to avoid contributing resources

9

Problems

Some of the problems that a P2P systems designer must face:

Overlay construction and maintenance
maintain a given overlay topology (e.g., random, two-level, ring, etc.)
Data location
locate a given data object among a large number of nodes
Data dissemination
propagate data in an efficient and robust manner
Per-node state
keep the amount of state per node small
Tolerance to churn
maintain system invariants (e.g., topology, data location, data availability) despite node arrivals and departures

10

Topology

Some common topologies:

Flat unstructured: a node can connect to any other node
only constraint: maximum degree dmax
fast join procedure
usually very tolerant to churn
good for data dissemination, bad for location
Two-level unstructured: nodes connect to a supernode
supernodes form a small overlay
used for indexing and forwarding
large state and high load on supernodes
Flat structured: constraints based on node ids
allows for efficient data location
constraints require long join and leave procedures
less robust in high-churn environments

11

Data location - Flooding

Problem: find the set of nodes S that store a copy of object O

Solutions:

(1) Flooding: send a search message to all nodes [first Gnutella protocol]

A search message contains either keywords or an object id

Advantages:

simplicity
no topology constraints

Disadvantages:

high network overhead (huge traffic generated by each search request)
flooding stopped by TTL (which produces search horizon)
only applicable to small number of nodes
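
The TTL-limited flooding described above can be sketched as a breadth-first traversal. This is a minimal illustration, not the Gnutella protocol itself: the function name `flood_search`, the six-node chain and the set of holders are all hypothetical, and a node is assumed to forward a query to every neighbour (including the one it came from) while dropping queries it has already seen.

```python
from collections import deque

def flood_search(graph, holders, source, ttl):
    """Flood a query from `source`; return (holders reached, messages sent)."""
    seen = {source}
    queue = deque([(source, ttl)])
    found, messages = set(), 0
    while queue:
        node, remaining = queue.popleft()
        if node in holders:
            found.add(node)          # this node stores a copy of the object
        if remaining == 0:
            continue                 # TTL expired: the query is not forwarded
        for neighbour in graph[node]:
            messages += 1            # every forwarded copy costs one message
            if neighbour not in seen:  # duplicates are dropped on arrival
                seen.add(neighbour)
                queue.append((neighbour, remaining - 1))
    return found, messages

# Hypothetical overlay: a chain A - B - C - D - E - F, copies stored on C and F.
graph = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"],
         "D": ["C", "E"], "E": ["D", "F"], "F": ["E"]}
holders = {"C", "F"}

near, msgs_near = flood_search(graph, holders, "A", ttl=2)
far, msgs_far = flood_search(graph, holders, "A", ttl=5)
```

With TTL = 2 only the copy on C is inside the search horizon; raising the TTL to 5 also finds F, at the price of more messages.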

12

Data location - Flooding

(1) Flooding (cont.)

Flooding in a flat unstructured network:

[Figure: a search flooded through a flat unstructured overlay; labels: search, obj, search horizon for TTL = 2]

Objects that lie outside of the horizon are not found

13

Data location - Superpeers

(2) Two-level overlay: use superpeers to index the locations of an object [eMule, Gnutella 2, BitTorrent]

Each node connects to a superpeer and advertises the list of objects it stores
Search requests are sent to the superpeer, which forwards them to other superpeers

Advantages:

highly scalable

Disadvantages:

superpeers must be reliable, powerful and well connected to the Internet (expensive)
superpeers must maintain large state
the system relies on a small number of superpeers
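
The advertise-and-search behaviour of a superpeer can be sketched as an in-memory index. This is an illustrative model only: the class `Superpeer`, its methods and the node names are hypothetical, and a request is assumed to be forwarded one hop to neighbouring superpeers.

```python
from collections import defaultdict

class Superpeer:
    """Hypothetical superpeer: indexes which nodes store which objects."""

    def __init__(self):
        self.index = defaultdict(set)  # object id -> addresses of nodes storing it
        self.neighbours = []           # other superpeers this one forwards to

    def advertise(self, node, object_ids):
        """Called by an ordinary node to publish the objects it stores."""
        for oid in object_ids:
            self.index[oid].add(node)

    def search(self, oid, forward=True):
        """Answer from the local index, then ask neighbouring superpeers once."""
        hits = set(self.index.get(oid, ()))
        if forward:
            for sp in self.neighbours:
                hits |= sp.search(oid, forward=False)  # one-hop forwarding only
        return hits

sp1, sp2 = Superpeer(), Superpeer()
sp1.neighbours.append(sp2)
sp2.neighbours.append(sp1)
sp1.advertise("n1", ["song"])
sp2.advertise("n2", ["song", "video"])

song_locations = sp1.search("song")
video_locations = sp1.search("video")
```

The per-object index held by each superpeer is the "large state" listed among the disadvantages.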

14

Data location - Superpeers

(2) Two-level overlay (cont.)

A two-level overlay is a partially centralized system
In some systems superpeers do not connect to each other (e.g., BitTorrent)

[Figure: two-level overlay; labels: request, response, obj]

15

Data location - KBR

(3) Structured networks: use a routing algorithm that implements Key-Based Routing [Overnet, Kad, BitTorrent trackerless]

Key-Based Routing (also known as Distributed Hash Tables, or DHTs) works as follows:

each node is given a unique node identifier, or nodeid
given a key k, the node whose nodeid is numerically closest to k among all nodes in the network is known as the root of key k
given a routing key k, a KBR algorithm can route a message to the root of k in a small number of hops, usually O(log N)
the location of an object with id objectid is tracked by the root of k = objectid
thus, one can find the location of an object by routing a message to the root of k = objectid and querying the root for the location of the object
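
The definition of the root of a key can be checked with a few lines of code. The sketch below uses the node ids and the key from the deck's Pastry example; the helper `root_of` is hypothetical, it assumes global knowledge of all nodeids (which a real node does not have), and it uses plain linear distance, ignoring any wrap-around of the address space.

```python
def root_of(k, node_ids):
    """Return the nodeid numerically closest to key k (ties go to the lower id)."""
    return min(node_ids, key=lambda n: (abs(n - k), n))

# Node ids from the Pastry example in this deck, address space [0000, FFFF].
nodes = [0x04F2, 0x3A79, 0x5230, 0x620F, 0x85E0, 0x8909,
         0x8954, 0x8957, 0xAC78, 0xC52A, 0xE25A]

root = root_of(0x8955, nodes)
root_hex = format(root, "04X")
```

Node 8954 is at distance 1 from the key 8955, while 8957 is at distance 2, so 8954 is the root. A KBR algorithm reaches the same node hop by hop using only local routing state.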

16

Data location - KBR (cont.)

Key-Based Routing [Pastry]

Source node id: 04F2
k = object id: 8955

Hop #      Hop id   Shared prefix length
(source)   04F2     0
1          85E0     1
2          8909     2
3          8957     3
4          8954     3 (root of k)

[Figure: overlay address space [0000,FFFF] with nodes 04F2, 3A79, 5230, 620F, 85E0, 8821, 8909, 8954, 8957, AC78, C52A, E25A; label: route(k=8955,msg); index entry: obj 8955 stored on nodes 620F, C52A; copies of obj 8955 shown at two nodes]

Object 8955 is tracked by node 8954, which knows of two copies stored at nodes 620F and C52A
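
The shared prefix lengths in the route above can be recomputed directly from the ids. This is a small sketch with a hypothetical helper `shared_prefix_len`; ids are compared as 4-digit hexadecimal strings.

```python
def shared_prefix_len(a, b):
    """Number of leading hex digits that the ids a and b have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

key = "8955"
route = ["04F2", "85E0", "8909", "8957", "8954"]  # source followed by hops 1 to 4

prefix_lengths = [shared_prefix_len(hop, key) for hop in route]

# The last hop keeps the same prefix length but moves numerically closer to k.
dist_8957 = abs(int("8957", 16) - int(key, 16))
dist_8954 = abs(int("8954", 16) - int(key, 16))
```

Each of the first three hops fixes one more digit of the key. The final hop, from 8957 to 8954, does not lengthen the prefix, but it reaches the node numerically closest to the key, which is what makes 8954 the root.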

17

Data location - KBR (cont.)

Routing table for node 4F28 [Pastry]

Routing table (used to find the next hop with a longer shared prefix)

02A3 409A 4F04 19BA 413C 4F1B 4F21 2F34 4288 N/A E129 4E01 F0A4 N/A 4FF5 …

Leaf set (used to find the nodeid closest to a key that is close to the local nodeid)

4F04 4F1B 4F21 4F30 4F55 4FF5

Node id: 4F28

In this example the routing table size is 4 x 15 = 60 entries, for a maximum network size of N = 65536 nodes.

The average route length in this case is 4 hops.
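
The figures quoted for this example follow from the id format: 4 hexadecimal digits, one routing table row per digit position and one column per digit value other than the node's own. The short sketch below redoes the arithmetic; the function name `table_shape` is hypothetical.

```python
import math

def table_shape(base, digits):
    """Rows, columns, entries and id-space size for fixed-length ids."""
    rows = digits        # one row per shared-prefix length 0 .. digits-1
    cols = base - 1      # one entry per digit value other than the node's own
    return rows, cols, rows * cols, base ** digits

rows, cols, entries, max_nodes = table_shape(base=16, digits=4)

# Each routing step can fix one more digit, so at most `digits` prefix hops.
max_prefix_hops = int(math.log2(max_nodes) // math.log2(16))
```

The result, log_16(65536) = 4 digit-fixing hops, is consistent with the 4-hop route length quoted on the slide.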

18

Data location - KBR (cont.)

(3) Structured networks (cont.)

Advantages:

completely decentralized (no need for superpeers)
routing algorithm achieves low hop count for large network sizes

Disadvantages:

each object must be tracked by a different node
objects are tracked by unreliable nodes (i.e., which may disconnect)
keyword-based searches are more difficult to implement than with superpeers (because objects are located by their objectid)
the overlay must be structured according to a given topology in order to achieve a low hop count
routing tables must be updated every time a node joins or leaves the overlay

19

Data location - Loosely structured overlays

(4) Loosely structured networks: use hints on the location of objects [Freenet]

Nodes locate objects by sending search requests containing the object id
Requests are propagated using a technique similar to flooding
Objects with similar identifiers are grouped on the same nodes

[Figure: nodes A to F holding objects AE5J, AF02 and 5B20; labels: request for AE5J, response]

20

Data location - Loosely structured overlays

(4) Loosely structured networks (cont.)

A search response leaves routing hints on the path back to the source
Hints are used when propagating future requests for similar object ids

[Figure: nodes A to F holding objects AE5J, AF02 and 5B20; hint tables shown: "AE5J: D, 5B20: E", "AE5J: F" and "AE5J: B"; label: request for AF02]

21

Data location - Loosely structured overlays

(4) Loosely structured networks (cont.)

Advantages:

no topology constraints, flat architecture
searches are more efficient than with plain flooding

Disadvantages:

does not support keyword-based searches
search requests have a TTL
contrary to structured overlays, loosely structured overlays guarantee neither a low number of hops nor that the object will be found

22

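Data location - Summary

The location schemes described previously can be classified according to:

degree of structure
decentralization

System       Degree of structure   Decentralization
Kad          structured            decentralized
Freenet      loosely structured    decentralized
Gnutella     unstructured          decentralized
eMule        unstructured          partially centralized
Gnutella 2   unstructured          partially centralized
BitTorrent   unstructured          partially centralized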

23

Effects of churn

Churn (node arrivals and departures) can have several effects on a P2P system:

data objects may become unavailable if all replicas disconnect
routing tables may become inconsistent (e.g., entries may point to nodes which have disconnected)
the overlay may become partitioned if several nodes suddenly disconnect

24

Churn tolerance

Node arrivals and departures must not disrupt the normal behavior of the system

system invariants must be maintained
connected overlay (i.e., no partitions), low average path length
data objects accessible from anywhere in the network
two types of churn tolerance:
dynamic repair: ability to react to changes in the overlay to maintain system invariants (e.g., heal partitions)
static resilience: ability to continue operating correctly before adaptation occurs (e.g., route messages through alternate paths)

25

Churn – Preventing partitions

A simple way to prevent partitions is to increase the node degree

Ring partitions can be avoided by keeping a list of successor nodes
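
The effect of a successor list can be illustrated with a toy ring. The sketch below is an assumption-laden model, not a specific protocol: nodes are numbered 0 to n-1 around the ring, each node remembers its next `successors` nodes clockwise, and the ring counts as "open" as soon as some live node has no live entry left in its list. The function name `ring_is_closed` is hypothetical.

```python
def ring_is_closed(n, failed, successors):
    """True if every live node still has at least one live node in its successor list."""
    for node in range(n):
        if node in failed:
            continue
        # The `successors` nodes that follow `node` clockwise on the ring.
        succ_list = [(node + d) % n for d in range(1, successors + 1)]
        if all(s in failed for s in succ_list):
            return False  # this node cannot reach any successor: the ring is open
    return True

# Hypothetical ring of 8 nodes in which the adjacent nodes 1 and 2 fail together.
failed = {1, 2}
with_one = ring_is_closed(8, failed, successors=1)
with_two = ring_is_closed(8, failed, successors=2)
with_three = ring_is_closed(8, failed, successors=3)
```

With one or two successors, node 0 loses every entry in its list and the ring opens. With three successors, node 0 can still skip ahead to node 3.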

26

Churn – Static resilience (Ring topology)

[Figure: percentage of failed paths for various numbers of successor nodes; curves: 1 successor, 16 successors, 48 successors]

More successors reduce the chances of “opening” the ring

27

Churn – Static resilience (various topologies)

[Figure: percentage of failed paths for various overlay topologies]

Some topologies provide more alternate paths between nodes than others

28

Churn – Dynamic repair (Ring topology)

Dynamically update routing information to adapt to overlay changes
Two types of repair algorithms:
reactive: start maintenance procedure immediately after detection
periodic: execute maintenance procedure periodically

A reactive algorithm can bring down the system instead of repairing it:

Each node disconnection triggers a maintenance procedure on a set of nodes
Above a given churn rate, the maintenance traffic congests the network
The network congestion leads to nodes being considered as disconnected
This triggers even more maintenance procedures (positive feedback), eventually bringing down the network

This effect may be avoided using a periodic maintenance algorithm

29

Churn – Summary

Churn is an important issue in P2P overlays:

Data may become unavailable, and routing information outdated
Static resilience
depends on the topology (i.e., the number of alternate paths)
increases with the average node degree
Dynamic repair protocols must be carefully designed
reactive protocols are usually faster
periodic protocols can handle higher churn rates

30

Newscast

Peer-to-peer protocol that creates and maintains an unstructured overlay
Highly resilient to churn
Can be used to propagate information
Extremely simple design based on information gossip:
Each node only knows about a small set of other nodes (its view)
Each node periodically picks a random node from this set
Both nodes exchange their views and update them
The random view exchange makes the algorithm very robust to failures and changes in the overlay

31

Newscast - View exchange

Each node maintains a view v containing c entries, where

entry = {node address, timestamp}

Each node executes the following code every T seconds:
1. select random entry r from local view
2. send local view plus an entry with the local address to node r
3. retrieve view of node r and merge it with local view
4. keep the c entries with the most recent timestamps
The view of a node changes on each round
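
The four steps above can be sketched as follows, using the sample views from the deck's worked example (node A contacts node E at time 15, c = 6). The function names are hypothetical, and two choices are assumptions not stated on the slides: only the freshest entry per address is kept, and a node drops entries that refer to itself. The random choice of step 1 is left out so that the example stays deterministic.

```python
def merge_views(local_view, received_view, self_addr, c):
    """Steps 3 and 4: merge two views and keep the c freshest entries."""
    freshest = {}
    for addr, ts in list(local_view) + list(received_view):
        if addr == self_addr:
            continue  # assumption: a node keeps no entry for itself
        if ts > freshest.get(addr, float("-inf")):
            freshest[addr] = ts  # assumption: keep one entry per address
    # Most recent timestamp first; ties broken by address for determinism.
    ranked = sorted(freshest.items(), key=lambda e: (-e[1], e[0]))
    return ranked[:c]

def exchange(view_a, addr_a, view_b, addr_b, now, c):
    """Step 2: each side sends its view plus a fresh entry for itself."""
    sent_by_a = view_a + [(addr_a, now)]
    sent_by_b = view_b + [(addr_b, now)]
    return (merge_views(view_a, sent_by_b, addr_a, c),
            merge_views(view_b, sent_by_a, addr_b, c))

view_a = [("B", 10), ("D", 5), ("E", 3), ("X", 8), ("S", 12), ("W", 2)]
view_e = [("J", 14), ("C", 9), ("D", 8), ("H", 10), ("L", 14), ("Z", 2)]

new_a, new_e = exchange(view_a, "A", view_e, "E", now=15, c=6)
```

The new view of node A contains the same six entries as the deck's example (E,15 J,14 L,14 S,12 H,10 B,10), with the tie between B and H listed in the other order.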

32

Newscast - View exchange (cont.)

Example view exchange initiated by node A

View of node A (c = 6): B,10 D,5 E,3 X,8 S,12 W,2

[Figure: node A linked to the nodes in its view: B, D, E, X, S, W]

33

Newscast - View exchange (cont.)

1. Select random node

View of node A: B,10 D,5 E,3 X,8 S,12 W,2
Selected node: E

2. Exchange views (plus local entry) with selected node

Sent by node A (view of node A plus local entry): A,15 B,10 D,5 E,3 X,8 S,12 W,2
Sent by node E (view of node E plus local entry): E,15 J,14 C,9 D,8 H,10 L,14 Z,2

34

Newscast - View exchange (cont.)

3. Merge views, and order by timestamp

Merge result: E,15 J,14 L,14 S,12 H,10 B,10 C,9 X,8 D,8 D,5 E,3 W,2 Z,2

4. Keep c most recent entries

New view of node A (c = 6): E,15 J,14 L,14 S,12 H,10 B,10

35

Newscast - Average path length

Newscast overlays have a low average path length (average number of hops between any two nodes), i.e., O(log N)

36

Newscast - Static resilience

The overlay shows high static resilience

37

Newscast - Summary

Simple peer-to-peer algorithm
Nodes use only local information
Periodic peer-wise data exchanges
Emergent properties:
low average path length
resistance to high churn

38

Security

Security in peer-to-peer systems is hard to enforce:

Users have full control on their computers
Modified clients may not follow the standard protocol
Communications may be eavesdropped
Data may be corrupted
Private data stored on remote computers may be disclosed

39

Security - Weak identities

The user may leave the system and rejoin it with a new identity (i.e., user id)
If an attack is detected, the attacker can reenter the system with a new id
An attacker may create a large number of false identities (Sybil attack)

Example of Sybil Attack:

[Figure: node A surrounded by nodes S1 to S6]

Nodes S1 to S6 are actually 6 instances of the P2P client running on the same machine
The attacker can intercept all traffic coming from or going to node A

40

Security - Strong identities

The user cannot change its identity
Solution: use a centralized, trusted Certification Authority (CA)
Each new user must obtain an identity certificate:

certificate = { user id, IP address, user’s public key, signatureCA }

The certificate is digitally signed by the CA, whose public key is known by all users
A certificate cannot be forged (would require the CA’s private key)
To prove his identity, a user signs a message with his private key, and attaches the corresponding certificate signed by the CA

Strong identities prevent Sybil Attacks
If an attacker is caught, it cannot easily rejoin the system
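
The certificate check described above can be sketched with toy textbook RSA. Everything here is for illustration only and is NOT secure: the key sizes are tiny, there is no padding, and the primes are well-known Mersenne primes. All function names, the certificate encoding and the sample user are hypothetical. The sketch needs Python 3.8 or later for the modular inverse via `pow`.

```python
import hashlib

def make_keypair(p, q, e=65537):
    """Toy RSA key pair from two primes (illustration only, not secure)."""
    n = p * q
    d = pow(e, -1, (p - 1) * (q - 1))  # private exponent: modular inverse of e
    return (n, e), (n, d)              # (public key, private key)

def _digest(data, n):
    return int.from_bytes(hashlib.sha256(data).digest(), "big") % n

def sign(data, private_key):
    n, d = private_key
    return pow(_digest(data, n), d, n)

def verify(data, signature, public_key):
    n, e = public_key
    return pow(signature, e, n) == _digest(data, n)

def cert_body(user_id, ip, user_pub):
    """Hypothetical encoding of the fields covered by the CA signature."""
    return f"{user_id}|{ip}|{user_pub[0]}|{user_pub[1]}".encode()

def issue_certificate(user_id, ip, user_pub, ca_private):
    return {"user_id": user_id, "ip": ip, "public_key": user_pub,
            "signature_ca": sign(cert_body(user_id, ip, user_pub), ca_private)}

def check_identity(message, msg_sig, cert, ca_public):
    """Accept only if the CA signed the certificate and the user signed the message."""
    body = cert_body(cert["user_id"], cert["ip"], cert["public_key"])
    return (verify(body, cert["signature_ca"], ca_public)
            and verify(message, msg_sig, cert["public_key"]))

# Mersenne primes used as toy key material (publicly known, so never do this for real).
ca_pub, ca_priv = make_keypair(2**61 - 1, 2**89 - 1)
alice_pub, alice_priv = make_keypair(2**107 - 1, 2**127 - 1)

cert = issue_certificate("alice", "192.0.2.10", alice_pub, ca_priv)
msg = b"hello"
ok = check_identity(msg, sign(msg, alice_priv), cert, ca_pub)

# Changing any certified field invalidates the CA signature.
forged = dict(cert, user_id="mallory")
bad = check_identity(msg, sign(msg, alice_priv), forged, ca_pub)
```

A verifier needs only the CA's public key: the CA signature binds the user id and IP address to the user's public key, and the message signature shows that the sender holds the matching private key.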

41

Security – Weak vs. strong identities

Strong identities require a centralized CA
new nodes must contact the CA before joining the network:
the CA response may be slow (a few days)
if the CA is unavailable, new nodes cannot join the system
the security of the system depends on the CA:
the CA must correctly verify the identity of the requester
the CA’s private key must be kept secret
Many P2P systems use weak identities
IP address already gives some identity information
Some systems ensure anonymity (e.g., Freenet)