
DeepDive: A Data Management System for Machine Learning Workloads

Ce Zhang

Recorded 10 March 2016 in Lausanne, Vaud, Switzerland

Event: IC Colloquia - EPFL IC School Colloquia


Many pressing questions in science are macroscopic: they require scientists to consult information expressed in a wide range of resources, many of which are not organized in a structured relational form. Knowledge base construction (KBC) is the process of
populating a knowledge base, i.e., a relational database storing factual information, from unstructured inputs. KBC holds the promise of facilitating a range of macroscopic sciences by making information accessible to scientists. One key challenge in building a high-quality
KBC system is that developers must often deal with data that are both diverse in type and large in size. Further complicating the scenario is that these data need to be manipulated by both relational operations and state-of-the-art machine-learning techniques.

My research focuses on building a data management system for machine learning workloads, with the goal of simplifying the complex process of building KBC systems. The system I am building, DeepDive, aims to let scientists build KBC systems, and machine learning systems in general,
by declaratively specifying domain knowledge without worrying about algorithmic, performance, or scalability issues. DeepDive has been used by people without machine learning expertise in domains ranging from paleobiology to genomics to anti-human trafficking. In this talk, I will describe
the DeepDive framework, its applications, and the underlying techniques we developed to speed up a range of machine learning workloads by up to two orders of magnitude.
