HIERARCHICAL CLUSTERING OF MESSAGE FLOWS IN A MULTICAST DATA DISSEMINATION SYSTEM

Yoav Tock, Nir Naaman, Avi Harpaz, Gidon Gershinsky
IBM Haifa Research Laboratory
Mount Carmel, Haifa 31905, Israel
{tock,naaman,harpaz,gidon}@il.ibm.com

ABSTRACT

A large-scale data dissemination application is characterized by a large number of information flows and information consumers. Consumers are interested in different, yet overlapping, subsets of the flows. Multicast is used to deliver subsets of the flows to subsets of the consumers. Since multicast groups are a limited resource, each consumer must filter out a large number of unneeded flows. We alleviate the end-node filtering load by using hierarchical clustering of flows to transport-layer sessions, and clustering of sessions to network-layer multicast groups. This scheme allows for hierarchical filtering of flows at the receivers. We formulate a cost function that models and emphasizes the filtering process, and propose algorithms for the solution of the hierarchical mapping problem. Performance evaluation indicates a significant reduction of end-node filtering cost compared to a non-hierarchic approach.

KEY WORDS

Multicasting algorithms, multicast mapping, data dissemination, receiver interest, hierarchical clustering, optimization algorithms.

1 Introduction

Consider a large-scale data dissemination application that is characterized by a large number of information flows (in the hundreds of thousands), and a large number of information consumers (in the thousands). Each information flow generates messages which must be delivered to interested consumers. Information consumers display interest heterogeneity, that is, consumers are interested in different, yet overlapping, subsets of the information flows. Naturally, an individual information flow may be required by many consumers.

Such a setup is typical of a large financial trading office, for example, where the flows can be stock quotes, commodity prices, etc., and the consumers can be traders, analysts and so on. Each trader or analyst is interested in a different portfolio, thus displaying interest heterogeneity across the data consumers. A simplified model of such a system is shown in Fig. 1. The publisher divides the data feed into a large number of topics (a synonym for information flows), and each consumer subscribes to his topics of interest. We assume that the publish-subscribe part of the system is confined to a multicast-enabled enterprise LAN.

Figure 1. A simplified model of a financial data dissemination system. (Labeled elements: Data Vendor, WAN, Publisher, Enterprise LAN, Subscribers.)

The challenge is to deliver the messages generated by the flows to the interested consumers in an efficient manner. In a sparse yet correlated subscription pattern [1], such as the one we assume, flooding is very inefficient, as the consumers will be burdened heavily with an enormous amount of unwanted incoming traffic. Unicast, on the other hand, is perfect for the consumer, but many messages will travel multiple times on the common parts of the network, wasting network resources and heavily loading the transmitter.

It was suggested by [2] to solve the message distribution problem by assigning a multicast group per flow. However, multicast groups are a limited network resource, because routers must save and maintain state information for every multicast group used. Moreover, certain end-node systems pose limitations on the number of multicast groups one can join. Thus, using a multicast group per flow is impractical for large-scale systems. An alternative is to map the large number of flows to a fixed number of multicast groups, and to assign each receiver a set of multicast groups so as to satisfy its flow subscriptions. The problem is to find this pair of mappings so as to minimize some cost function that quantifies system performance. This has been termed the "channelization" problem and shown to be NP-hard by [3]. Several authors have tried to tackle this problem by clustering flows into multicast groups [4, 5, 1].

A solution to the channelization problem, according to the cost function proposed by [3], aims to strike a balance between the total bandwidth consumed and the amount of unwanted information received by consumers. Thus, in general, even the optimal solution still leaves the consumers with the need to filter the incoming stream of messages. It has been found [6] that in a high-throughput messaging application over a fast enterprise network, it is often the computing power of end-nodes that limits performance. The fact that the number of flows is orders of magnitude larger than the practical number of multicast groups ($\sim 10^5$ vs. $\sim 10^2$), together with the large number of receivers ($\sim 10^3$), aggravates this problem.

Our main goal is to further reduce the filtering load imposed upon the receivers. To that end, we introduce a hierarchical clustering scheme that allows for hierarchical filtering of flows at the receivers. We propose to cluster the flows into transport-layer multicast sessions, and cluster the sessions into network-layer multicast groups. We formulate a cost function that models and emphasizes the hierarchical filtering process, and incorporate it in algorithms for the solution of the hierarchical mapping problem. A statistical model for consumer interest and message rate, based on real-life data from the financial domain, is presented. The statistical model is used to evaluate the performance of the proposed scheme.

Finally, let us remark that relieving the receivers of the filtering burden has also been the goal of content-based messaging. However, central filtering has been deemed slow, and broker-assist solutions (e.g. [7], [2]) introduce delays in the data path, which makes them inapplicable in certain scenarios. On the other hand, multicast is now widely available and is enabled by default in most LANs. We thus see an advantage in utilizing multicast capabilities for high-throughput messaging.

2 Problem Description and Model

Let $F$ denote the set of information flows, $|F| = K$. Flow $F_k$ produces a sequence of messages with rate $\lambda_k$ messages per second, $k \in F$, and $\lambda = [\lambda_1, \cdots, \lambda_K]$.

Let $U$ denote a set of users (consumers), $|U| = N$. Each user $U_n$ contributes a binary "interest vector" of length $K$, where a '1' in the $k$th position denotes his interest in flow $F_k$. The rows of the "interest matrix" $W = (w_{nk})$, $k \in F$, $n \in U$, are the users' interest vectors:

\[
w_{nk} =
\begin{cases}
1 & \text{user } U_n \text{ is interested in flow } F_k \\
0 & \text{otherwise}
\end{cases}
\]

Each flow is mapped into a session (also referred to as a "stream"), which is a globally unique transport-layer entity, for example, a transport session in Pragmatic General Multicast (PGM) [8, 9]. Each session is mapped to a multicast group (see Fig. 2). Let $S$ denote the set of sessions (streams), $|S| = L$; and $G$ denote the set of multicast groups, $|G| = M$.

In the general case, a session might be mapped to more than one multicast group. However, in order to avoid the complications associated with stream duplication, in this work we restrict each session to be mapped to a single multicast group. This restriction is in accordance with the specifications of most reliable multicast protocols (e.g., PGM). The flow-to-session mapping matrix, $X = (x_{kl})$, $k \in F$, $l \in S$, is defined

\[
x_{kl} =
\begin{cases}
1 & \text{flow } F_k \text{ is mapped to session } S_l \\
0 & \text{otherwise}
\end{cases}
\]

Let the total rate of session $S_l$ be $\theta_l$ messages per second, and $\theta = [\theta_1, \cdots, \theta_L]$. That is, $\theta = \lambda \cdot X$, or element-wise, $\theta_l = \sum_{k \in F} \lambda_k x_{kl}$. The session-to-group mapping matrix, $Y = (y_{lm})$, $l \in S$, $m \in G$, is defined

\[
y_{lm} =
\begin{cases}
1 & \text{session } S_l \text{ is mapped to group } G_m \\
0 & \text{otherwise}
\end{cases}
\]

A user interested in an information flow must listen to the appropriate multicast group, pull out the relevant session, and extract the desired information flow from the session. See for example Fig. 2, where user $U_3$, interested in flow $F_3$, might be given the reverse path $/G_m/S_l/F_3$ (note that there is more than one reverse path from $U_3$ to $F_3$).

The subscription matrix, $Z = (z_{nl})$, $n \in U$, $l \in S$, specifies the sessions to which each user must subscribe:

\[
z_{nl} =
\begin{cases}
1 & \text{user } U_n \text{ subscribes to session } S_l \\
0 & \text{otherwise}
\end{cases}
\]

The group listening matrix, $P = (p_{nm})$, $n \in U$, $m \in G$, specifies the multicast groups that each user must join:

\[
p_{nm} =
\begin{cases}
1 & \text{user } U_n \text{ joins group } G_m \\
0 & \text{otherwise}
\end{cases}
\]

Since each session is mapped to a single multicast group, $P = u(Z \cdot Y)$, where $B = u(A)$ is the point-wise step operator: $b_{i,j} = 1$ for $a_{i,j} > 0$, and $0$ otherwise. Entry $(n, m)$ of $Z \cdot Y$ counts the sessions of group $G_m$ to which user $U_n$ subscribes, so the step operator marks exactly the groups the user must join.

A legal set of mappings must comply with several requirements. Specifically:

(i) No false exclusion: all the flows a user is interested in are mapped to one or more sessions to which the user subscribes. That is,

\[
\sum_{l \in S} z_{nl} \cdot x_{kl} - w_{nk} \ge 0, \quad \forall n \in U,\ k \in F.
\]

(ii) No dummy sessions or groups: no empty, unmapped, or un-listened sessions and groups. That is,

\[
\sum_{k \in F} x_{kl} > 0 \quad \forall l \in S, \qquad \text{and} \qquad \sum_{l \in S} y_{lm} > 0 \quad \forall m \in G.
\]

This is a parsimony requirement.

The optimal set of mappings must also minimize a certain cost function. Thus, the problem can be phrased in the following way. Given $F, \lambda, S, G, U, W$, find a set of mappings $X, Y, Z$ that complies with the constraints and minimizes the cost function $C(X, Y, Z)$. In this paper