Tastes Great! Less Filling! High Performance and Accurate Training Data Collection for Self-Driving Database Management Systems

Authors: Matthew Butrovich, Wan Shen Lim, Lin Ma, John Rollinson, William Zhang, Yu Xia, Andrew Pavlo
Institutions: Carnegie Mellon University, Army Cyber Institute, Massachusetts Institute of Technology
Published at SIGMOD'22
Paper Link: https://dl.acm.org/doi/10.1145/3514221.3517845

Background

A self-driving DBMS usually contains a behavior modeling module that predicts the cost of a database action under a given workload.

This module has to be trained, so the system needs a way to collect training data for it.

Requirements for the training data collection scheme:

Needs an online method with low overhead.

Needs a method that collects internal features (CPU time, number of concurrent workers, GC information, etc.).
- External features are not accurate.

Needs a method that collects metrics in kernel space.
- Collecting metrics in user space is expensive because of the overhead of system calls and I/O.
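
To make the user-space cost concrete, here is a minimal Python sketch of user-space collection of one internal feature (CPU time) around a single database action. It is only an illustration and not the paper's method: `collect_training_sample` and `fake_scan` are hypothetical names, and `fake_scan` merely stands in for a real DBMS operation. Each sample requires two `os.times()` probes, which on Linux typically go through a system call, so the collection cost is paid on every measured action.

```python
import os
import time


def collect_training_sample(action, *args):
    """Run `action` and return (result, sample) using user-space probes.

    Hypothetical helper for illustration only.
    """
    start_wall = time.perf_counter()
    start_cpu = os.times()  # probe before the action
    result = action(*args)
    end_cpu = os.times()  # probe after the action
    end_wall = time.perf_counter()

    sample = {
        "elapsed_s": end_wall - start_wall,
        "cpu_user_s": end_cpu.user - start_cpu.user,
        "cpu_system_s": end_cpu.system - start_cpu.system,
    }
    return result, sample


def fake_scan(n):
    # Hypothetical stand-in for a database action such as a table scan.
    return sum(range(n))


result, sample = collect_training_sample(fake_scan, 100_000)
print(sample)
```

The probes sit on the action's critical path, so their cost grows with the number of actions measured. This is the overhead that the kernel-space requirement above is meant to avoid.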