The Database Systems and Information Management Group at Technische Universität Berlin

1 Introduction

The Database Systems and Information Management Group, known in German by the acronym DIMA, is part of the Department of Software Engineering and Theoretical Computer Science at the TU Berlin. It is led by Prof. Dr. Volker Markl and consists of 3 postdocs, 8 research associates and 19 student assistants.

2 Research Areas

The Database Systems and Information Management Research Group (DIMA), under the direction of Volker Markl, conducts research in the areas of information modeling, business intelligence, query processing, query optimization, the impact of new hardware architectures on information management, and applications. While maintaining a strong focus on system building and on validating research in practical scenarios and use cases, the group aims to explore and provide fundamental and theoretically sound solutions to current major research challenges. The group interacts closely with researchers at prestigious national and international academic institutions and carries out joint research projects with leading IT companies, including Hewlett Packard, IBM, and SAP, as well as innovative small and medium enterprises.

In the following paragraphs, we present our main research projects.

2.1 Stratosphere

Our flagship project is a Collaborative Research Unit funded by the Deutsche Forschungsgemeinschaft (DFG), in which the Technische Universität Berlin, the Humboldt Universität zu Berlin, and the Hasso-Plattner-Institut in Potsdam are jointly researching "Information Management on the Cloud". Stratosphere aims at considerably advancing the state of the art in data processing on parallel, adaptive architectures. Stratosphere (named after the layer of the atmosphere above the clouds) explores the power of massively parallel computing for complex information management applications. Building on the expertise of the participating researchers, we aim to develop a novel, database-inspired approach to analyze, aggregate, and query very large collections of either textual or (semi-)structured data on a virtualized, massively parallel cluster architecture.

Stratosphere conducts research in the areas of massively parallel data processing engines, a programming model for parallel data programming, robust optimization of declarative data flow programs, continuous re-optimization and adaptation of the execution, data cleansing, and text mining. The unit will validate its work through a benchmark of the overall system performance and through demonstrators in the areas of climate research, the biosciences and linked open data. The goal of Stratosphere is to jointly research and build a large-scale data processor based on concepts of robust and adaptive execution. We are researching a programming model that extends a functional map/reduce programming model with additional second-order functions. As the execution platform we use the Nephele system, a massively parallel data flow engine which is also researched and developed in the project. We are examining real-world use cases in the areas of climate research, information extraction and integration of unstructured data in the life sciences, as well as linked open data and social network graph data.

2.2 MIA

The German-language web consists of more than six billion web sites and is second in size only to the English-language web. This vast amount of data could potentially be used for a large number of applications, such as market and trend analysis, opinion and data mining for Business Intelligence, or applications in the domain of language processing technologies. The goal of MIA (A Marketplace for Trusted Information and Analysis) is to create a marketplace-like infrastructure in which this data is stored, refined and made available in such a way that it enables trade in refined and agglomerated data and value-added services. In order to achieve this, we draw upon the results of our substantial research in the areas of Cloud Computing and Information Management.

The marketplace provides the German-language web and its history as a data pool for analysis and value-added services. Its initial version focuses on use cases in the domains of media, market research and consulting. These use cases have special requirements of data privacy and security that will be observed. Gradually, the platform will be expanded to cover additional use cases and services as well as internationalization.

The proposed infrastructure enables new business models with information as a tradable good, which build on algorithmic methods that extract information from semi-structured and unstructured data. By using the platform to collaboratively analyze and refine the data of the German-language web, businesses significantly reduce expenses while at the same time jointly creating the basis for a data economy. This will enable even small and medium-sized businesses to access and compete in this market.

2.3 GoOLAP.info

Today, the Web is one of the world's largest databases. However, due to its textual nature, aggregating and analyzing textual data from the Web analogously to a data warehouse is a hard problem. For instance, users may start from huge amounts of textual data and drill down into tiny sets of specific factual data, may manipulate or share atomic facts, and may repeat this process in an iterative fashion. In the project GoOLAP: The Web as Data Warehouse, we investigate fundamental problems in this process: What are common analysis operations of "end users" on natural language Web text? What is the typical iterative process for generating, verifying and sharing factual information from plain Web text? Can we integrate both the "cloud", a cluster of massively parallel working machines, and the "crowd", the end users of GoOLAP.info, to solve hard problems such as training tens of thousands of fact extractors, verifying billions of atomic facts, or generating analytical reports from the Web?

The current prototype GoOLAP.info already contains factual information from the Web for several million objects. The keyword-based query interface focuses on simple query intentions, such as "display everything about Airbus", or complex aggregation intentions, such as "List and compare mergers, acquisitions, competitors and products of airplane technology vendors".

2.4 ROBUST

Online communities play a central role in vital business functions such as corporate expertise management, marketing, product support and customer relationship management. Communities on the web easily grow to millions of users and thus need a scalable infrastructure capable of handling millions of discussion threads containing billions of posts. The EU integrated project ROBUST (Risk and Opportunity Management of huge-scale BUSiness communiTies) develops methods and models to monitor and understand the behavior and requirements of users and groups in these communities. A massively parallel cloud infrastructure will handle