an introduction reach1to1 - 1 / 25
what we do
big data solutions for capturing, storing, searching and analyzing structured and unstructured data from multiple sources
big data technology benefits
● clusters of low cost commodity servers
● distributed parallel processing models
● capable of handling unlimited growth in data size
● no loss in performance with increasing data size
● no licensing costs - primarily open source
big data technologies
open source technologies developed and used by companies like Google, Facebook, Twitter and LinkedIn
case studies
● patent document repository – a large international chemical manufacturer requires a high performance document repository capable of handling a large volume of patent documents, with advanced search capabilities
● log file analysis – a large telecom provider needs to analyze log files generated from automated customer support calls and call center logs, without manual data collation
● customer activity analysis – a fast growing low-cost airline needs to analyze customer activity to enable promotional fares and increase market share
why reach1to1?
● combined experience of over 20 years in NoSQL database technologies
● expertise in the entire product development life cycle
● handled a range of enterprise applications using NoSQL databases, including:
  ● sales monitoring and analytics
  ● customer order tracking
  ● accounts receivable tracking
  ● customer support tracking
patent document repository - a case study: outline
data requirements
● folders – documents are organized into multiple folders that determine access rights
● families – documents are also grouped into patent families that represent logical collections, based on priority codes assigned to each document
● comments – users review documents and add comments that represent their views on the researched topic
functional requirements
● batch operations on documents
● user operations – document crud and comments
● search across the repository
batch operations
documents are added or replaced in the repository in batches consisting of up to thousands of documents
the critical performance metrics for batch operations are throughput and access delay:
● batch throughput is the rate of processing of documents
● repository access delay is the time from the start of the batch until the documents are available for user operations
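The two metrics above can be sketched as a simple measurement wrapper (a minimal illustration only; `process` stands in for the repository's actual per-document handler, which is not shown in the slides):

```python
import time

def run_batch(documents, process):
    """Measure batch throughput and repository access delay.

    - throughput: documents processed per second
    - access delay: time from the start of the batch until the
      documents become available for user operations
    """
    start = time.perf_counter()
    for doc in documents:
        process(doc)                      # add or replace one document
    access_delay = time.perf_counter() - start
    throughput = len(documents) / access_delay if access_delay else 0.0
    return {"throughput": throughput, "access_delay": access_delay}
```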
user operations
● crud – create / retrieve / update / delete documents based on access rights
● comment – crud operations on comments; comments can be private or public
● search – advanced full text search features; faceted search for drilling down into search results; search results need to contain highlights for matching terms; search based on concordance
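A faceted, highlighted search of this kind maps naturally onto Solr's standard query parameters. The sketch below only builds the query string; the field names (`text`, `folder`, `family`) are hypothetical placeholders, not taken from the slides:

```python
from urllib.parse import urlencode

def build_search_params(query, facet_fields, highlight_fields):
    """Build Solr-style parameters for a faceted search with hit highlighting."""
    params = [
        ("q", query),
        ("hl", "true"),                        # highlight matching terms
        ("hl.fl", ",".join(highlight_fields)), # fields to highlight
        ("facet", "true"),                     # enable faceting for drill-down
    ]
    params += [("facet.field", f) for f in facet_fields]
    return urlencode(params)

qs = build_search_params("polymer synthesis", ["folder", "family"], ["text"])
```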
architecture
client application → client API → repository application server (synchronization), backed by:
● storage & retrieval – object oriented database
● indexing – advanced full text search
● relationships – document families
persistence
HBase is used for persistence as the object oriented database:
● provides random, real-time read/write access
● capable of hosting very large data sets on clusters of servers
● multi-value and hierarchical parameters are mapped to column families and columns
● links between documents and related objects are stored as linked object ids
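One way the multi-value and hierarchical parameter mapping could work is to flatten nested keys into column qualifiers within a column family. This is a sketch under that assumption, not the repository's actual mapping:

```python
def to_hbase_cells(column_family, params, prefix=""):
    """Flatten multi-value and hierarchical parameters into HBase-style cells.

    Each cell is keyed by (column_family, qualifier); nested keys become
    colon-separated qualifiers and list items get an index suffix.
    """
    cells = {}
    for key, value in params.items():
        qualifier = f"{prefix}{key}"
        if isinstance(value, dict):                        # hierarchical parameter
            cells.update(to_hbase_cells(column_family, value, qualifier + ":"))
        elif isinstance(value, list):                      # multi-value parameter
            for i, item in enumerate(value):
                cells[(column_family, f"{qualifier}:{i}")] = item
        else:
            cells[(column_family, qualifier)] = value
    return cells

cells = to_hbase_cells("meta", {"title": "US1234",
                               "codes": ["A", "B"],
                               "filing": {"year": 2010}})
```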
data model
folders (f1, f2, f3) contain documents (d1, …); each document links to its patents (p1, p2, p3) and comments (c1, c2, c3)
indexing
Solr provides powerful advanced full text search, hit highlighting, faceted search and dynamic clustering
● highly scalable, with distributed search and index replication
● documents, comments and patents are indexed in a 1+n+m denormalized index structure
● field collapsing is used to group multiple search results
● pivoted faceting is used to provide accurate facet results despite duplicate entries
indexing model
for 1 folder+document, n patents and m comments, the denormalized index contains 1+n+m entries
e.g. 1 folder + 1 document + 2 patents + 3 comments => 6 index entries:
● 1 entry with folder+document properties
● 1 entry per patent with folder+document+patent properties
● 1 entry per comment with folder+document+comment properties
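The 1+n+m denormalization above can be sketched as a small entry generator (field names are illustrative):

```python
def denormalized_entries(folder, document, patents, comments):
    """Build the 1+n+m denormalized index entries for one document:
    one base entry, one per patent, one per comment."""
    base = {"folder": folder, "document": document}
    entries = [dict(base)]                                  # folder+document properties
    entries += [dict(base, patent=p) for p in patents]      # n patent entries
    entries += [dict(base, comment=c) for c in comments]    # m comment entries
    return entries

# 1 folder+document, 2 patents, 3 comments => 6 index entries
entries = denormalized_entries("f1", "d1", ["p1", "p2"], ["c1", "c2", "c3"])
```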
relationships
Neo4j is used for mapping a graph of documents based on their tags, with graph traversal
● a high performance graph database with transaction support
● documents, tags and families are created as vertices
● edges link document and tag vertices
● a family is a fully connected sub-graph
grouping into families
document vertices connected through shared tag vertices form families; each family is a fully connected sub-graph
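The "fully connected sub-graph" property can be checked directly on an adjacency structure. A toy sketch with a hypothetical in-memory graph (the real data lives in Neo4j):

```python
from itertools import combinations

def is_family(graph, members):
    """A family is a fully connected sub-graph: every pair of member
    document vertices must share an edge."""
    return all(b in graph.get(a, set()) for a, b in combinations(members, 2))

# Hypothetical toy graph: symmetric adjacency sets between document vertices.
graph = {
    "d1": {"d2", "d3"},
    "d2": {"d1", "d3"},
    "d3": {"d1", "d2"},
    "d4": {"d1"},
}
```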
client API
oodebe is a synchronization engine based on node.js that acts as the repository application server
● provides a consistent client api that encapsulates combined synchronous operations across multiple big data components
● includes a scripting engine
● includes advanced sequencing patterns - serial, parallel, waterfall, concurrent queues etc.
● supports multiple concurrent operations, with provision for logical object-level locks
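The sequencing patterns named above (serial, parallel, waterfall) can be sketched generically; this is an illustration of the patterns themselves, not oodebe's node.js API:

```python
from concurrent.futures import ThreadPoolExecutor

def serial(steps):
    """Serial pattern: run steps one after another, collecting results."""
    return [step() for step in steps]

def waterfall(steps, value):
    """Waterfall pattern: each step receives the previous step's result."""
    for step in steps:
        value = step(value)
    return value

def parallel(steps):
    """Parallel pattern: run independent steps concurrently."""
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(step) for step in steps]
        return [f.result() for f in futures]

result = waterfall([lambda x: x + 1, lambda x: x * 2], 3)  # (3 + 1) * 2 = 8
```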
operations
batch operations, user operations (document crud, comments) and search are routed through the client API to synchronization server scripts:
● batch – start batch, batch status
● folders – add/update folder, delete folder
● documents – add/update document, delete document, retrieve document
● comments – add/update comment, delete comment, retrieve comment
● search – search query
the scripts invoke web services on the object oriented database, the full text search index and the graph database
performance benchmarks
● add/update folder – 0.3 secs
● add/update document – 0.25 secs
● retrieve document – 0.25 secs
● add/update comment – 0.3 secs
● retrieve comment – 0.24 secs
● delete comment – 0.3 secs
● delete folder – not implemented
● search query – 1.3 secs
● delete document – not measured
note: timings are averages across a pre-defined set of operations
scalability - data size
● hadoop scales to thousands of commodity computers, using all cores and spindles simultaneously
● proven data size scalability – e.g. Facebook has 21 PB of data in a single hadoop cluster
● solr has built-in replication capabilities that allow it to scale up to very high query volumes without loss of performance – e.g. there are production solr instances with over 200 mn items
● neo4j enterprise version includes high availability clustering and can traverse 1-2 mn hops per second
scalability - data complexity
● hbase column families and columns provide a flexible way to manage sparse data structures
● using object links allows additional objects to be linked to documents
● neo4j can be used to handle more hierarchical data structures that require traversals
● the solr schema can be extended easily to add new fields, though re-indexing is required after a change
● additional index servers can be added to manage new types of queries, synchronized by oodebe synchronization scripts
scalability - processing speed
● node.js allows clusters of worker processes, with facilities to monitor and automatically manage them
● batch throughput can be optimized by using concurrent queues and multiple worker processes
● custom client applications can be developed that manage complex processes faster, invoked through synchronization scripts
● solr batch updates and caching can be used to speed up updates and queries respectively
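The concurrent-queue approach to batch throughput can be sketched as workers draining a shared task queue (threads here for illustration; the slides describe node.js worker processes):

```python
import queue
import threading

def run_batch(doc_ids, process, workers=4):
    """Drain a concurrent queue of documents with multiple workers,
    mirroring the concurrent-queue / worker-pool approach."""
    tasks = queue.Queue()
    for doc_id in doc_ids:
        tasks.put(doc_id)

    results = []
    lock = threading.Lock()

    def worker():
        while True:
            try:
                doc_id = tasks.get_nowait()
            except queue.Empty:
                return                     # queue drained, worker exits
            result = process(doc_id)
            with lock:                     # serialize appends to shared list
                results.append(result)

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```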
thank you
info@reach1to1.com
+91-98201-94408