Workshop on Filter Data Structures (SPAA 2023)

Filters, such as Bloom [1970], quotient [2009, 2012], cuckoo [2014], and XOR [2020] filters, maintain an approximate representation of a set or a multiset. Filters are widely used in networks, storage systems, databases, machine learning, computational biology, and other areas.

The primary goal of the workshop is to bring together researchers at the forefront of filter data structure research and to make the community (both theory and applications) aware of recent developments. This will help uncover open research questions and increase interaction among researchers to tackle them.

Some topics that will be covered include recent advances in the theory and practice of filters, such as static filters, infinitely resizable filters, maplets, and adaptive filters. Furthermore, the workshop will include talks from experts who use filters to build modern large-scale data analysis applications, such as Filters+Bioinformatics and Filters+ML.

The workshop will take place on Friday, June 16th from 2–6pm. Please see the SPAA site for more details.

Topics
  • Overview of recent advances in filters
  • Application of filters in computational biology
  • Application of filters in machine learning

Organized by: Prashant Pandey and Rob Johnson

Schedule and Speaker List


Title: Welcome to Filters (Michael Bender)

Abstract: In this talk, I’ll present the basic definitions of filters and maplets, including the different sets of operations they support and their evaluation criteria. I’ll cover the state of the art and some important historical developments.
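As a concrete reference point (not taken from the talk), the core filter operations can be sketched with a minimal Bloom filter; the class name, sizes, and sha256-based hashing below are illustrative choices, not a specific implementation:

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter sketch: insert and an approximate membership
    query that may return false positives but never false negatives."""

    def __init__(self, num_bits: int, num_hashes: int):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: str):
        # Derive k bit positions from a salted cryptographic hash
        # (an illustrative stand-in for fast non-cryptographic hashing).
        for i in range(self.num_hashes):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.num_bits

    def insert(self, item: str) -> None:
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def may_contain(self, item: str) -> bool:
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(item))
```

A maplet generalizes this interface so that a positive query also returns a value associated with the key.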

Title: Binary Fuse Filters: Fast and Tiny Immutable Filters (Daniel Lemire)

Abstract: Conventional Bloom filters provide fast approximate set membership while using little memory. Engineers use them to avoid expensive disk and network accesses. We recently introduced binary fuse filters, which are faster at query time and use at least 30% less memory than Bloom filters. The result is an immutable filter, and construction is somewhat slower (e.g., by 50%).

We review some performance issues related to our binary fuse filters, and to probabilistic filters in general: e.g., how does query-time performance scale with the number of random accesses? For network transmission, filters are often compressed: how well do different filters compress?
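To make the construction/query asymmetry concrete, here is a sketch of the xor filter, the immutable predecessor of the binary fuse filter: a query checks a small fingerprint against the xor of three table entries, and construction "peels" keys that are the sole occupant of some slot. The segment sizing, 8-bit fingerprints, and sha256-based hashing are illustrative assumptions; real implementations use much faster hashing and the fuse construction.

```python
import hashlib

def _hash(seed: int, tag: str, key: str) -> int:
    d = hashlib.sha256(f"{seed}:{tag}:{key}".encode()).digest()
    return int.from_bytes(d[:8], "big")

class XorFilter:
    """Illustrative static xor filter over three disjoint segments."""

    def __init__(self, keys):
        self.seg = int(1.23 * len(keys) / 3) + 12  # slots per segment
        self.seed = 0
        while True:
            order = self._peel(keys)
            if order is not None:
                break
            self.seed += 1  # peeling failed (rare); rehash everything
        self.table = [0] * (3 * self.seg)
        # Assign in reverse peel order: when a key is processed, its
        # two other slots already hold their final values.
        for key, slot in reversed(order):
            v = self._fp(key)
            for t in self._slots(key):
                if t != slot:
                    v ^= self.table[t]
            self.table[slot] = v

    def _slots(self, key):
        # One slot per segment, so a key's three slots are distinct.
        return [i * self.seg + _hash(self.seed, str(i), key) % self.seg
                for i in range(3)]

    def _fp(self, key) -> int:
        return _hash(self.seed, "fp", key) & 0xFF  # 8-bit fingerprint

    def _peel(self, keys):
        slot_keys = [[] for _ in range(3 * self.seg)]
        for k in keys:
            for s in self._slots(k):
                slot_keys[s].append(k)
        stack = [s for s in range(3 * self.seg) if len(slot_keys[s]) == 1]
        order = []
        while stack:
            s = stack.pop()
            if len(slot_keys[s]) != 1:
                continue  # slot emptied since it was queued
            k = slot_keys[s][0]
            order.append((k, s))
            for t in self._slots(k):
                slot_keys[t].remove(k)
                if len(slot_keys[t]) == 1:
                    stack.append(t)
        return order if len(order) == len(keys) else None

    def may_contain(self, key) -> bool:
        s0, s1, s2 = self._slots(key)
        return self._fp(key) == (self.table[s0] ^ self.table[s1]
                                 ^ self.table[s2])
```

Every query costs exactly three random accesses, which is the kind of access-count/performance relationship the talk examines.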

Title: InfiniFilter: Expanding Filters to Infinity and Beyond (Niv Dayan)

Abstract: Filter data structures have been used ubiquitously since the 1970s to answer approximate set-membership queries in various areas of computer science, including architecture, networks, operating systems, and databases. Such filters need to be allocated with a given capacity in advance to guarantee a bound on the false positive rate. In many applications, however, the data size is not known in advance, requiring filters to expand dynamically. This talk shows that existing methods for expanding filters exhibit at least one of the following flaws: (1) they entail an expensive scan over the whole data set, (2) they require a lavish memory footprint, (3) their query, delete, and/or insertion performance plummets, (4) their false positive rate skyrockets, and/or (5) they cannot expand indefinitely.

We introduce InfiniFilter, a new method for expanding filters that addresses these shortcomings. InfiniFilter is a hash table that stores a fingerprint for each entry. It doubles in size when it reaches capacity, and it sacrifices one bit from each fingerprint to map it to the expanded hash table. The core novelty is a new and flexible hash slot format that assigns longer fingerprints to newer entries. This keeps the average fingerprint length long and thus the false positive rate stable. At the same time, InfiniFilter provides stable insertion/query/delete performance because it consists of a single unified hash table. We implement InfiniFilter on top of the quotient filter, and we demonstrate theoretically and empirically that it offers superior cost properties compared to existing methods: it scales better in performance, false positive rate, and memory footprint, all at the same time.
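The bit-borrowing step can be sketched as follows. This is a simplified illustration, not InfiniFilter's actual slot format: here every stored fingerprint shrinks uniformly on expansion, whereas InfiniFilter's flexible format keeps newer entries' fingerprints longer. The point of the sketch is that the new address is computable from the stored fingerprint alone, so no key is ever rehashed.

```python
import hashlib

def hash64(key: str) -> int:
    # Stable 64-bit hash; an illustrative stand-in for the filter's hash.
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")

class ExpandableFingerprintTable:
    """Sketch: low bits of the hash address a slot, the next bits form
    the stored fingerprint. Doubling borrows one fingerprint bit as the
    new high address bit."""

    def __init__(self, log_slots: int, fp_bits: int):
        self.log_slots = log_slots
        self.fp_bits = fp_bits
        self.slots = [[] for _ in range(1 << log_slots)]

    def _addr_fp(self, key: str):
        h = hash64(key)
        addr = h & ((1 << self.log_slots) - 1)
        fp = (h >> self.log_slots) & ((1 << self.fp_bits) - 1)
        return addr, fp

    def insert(self, key: str) -> None:
        addr, fp = self._addr_fp(key)
        self.slots[addr].append(fp)

    def may_contain(self, key: str) -> bool:
        addr, fp = self._addr_fp(key)
        return fp in self.slots[addr]

    def expand(self) -> None:
        # Double the table; each stored fingerprint donates its low bit
        # to become the new high address bit, so no key is rehashed.
        old_slots, old_log = self.slots, self.log_slots
        self.log_slots += 1
        self.fp_bits -= 1
        self.slots = [[] for _ in range(1 << self.log_slots)]
        for addr, bucket in enumerate(old_slots):
            for fp in bucket:
                new_addr = addr | ((fp & 1) << old_log)
                self.slots[new_addr].append(fp >> 1)
```

Because `fp_bits` drops by one per doubling, the false positive rate of this naive scheme degrades with each expansion; keeping it stable is exactly the problem the flexible slot format solves.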

Title: SplinterDB and Maplets: Improving the Tradeoffs in Key-Value Store Compaction Policy (Alex Conway)

Abstract: A maplet is a fundamental data structure that extends the notion of a filter to include values. It turns out that many applications that rely on filters are better served by maplets. In this talk, I’ll show how maplets can be used to improve the tradeoffs in LSM-based key-value stores, such as SplinterDB, an enterprise-grade key-value store deployed in VMware products. Maplets allow SplinterDB to compact data lazily, which improves write performance, while compacting maplets aggressively, which improves read performance.
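As a toy illustration of the interface (not SplinterDB's maplet implementation), a maplet can be built by associating a value with each stored fingerprint. A lookup on a stored key always returns its value; a lookup on an absent key usually returns nothing, but with small probability returns some value, the maplet analogue of a false positive.

```python
import hashlib

def _addr_fp(key: str, log_slots: int, fp_bits: int):
    # Low bits address a slot; the next bits form the fingerprint.
    h = int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")
    return h & ((1 << log_slots) - 1), (h >> log_slots) & ((1 << fp_bits) - 1)

class Maplet:
    """Sketch of an approximate key -> value map."""

    def __init__(self, log_slots: int = 10, fp_bits: int = 16):
        self.log_slots, self.fp_bits = log_slots, fp_bits
        self.slots = [dict() for _ in range(1 << log_slots)]

    def insert(self, key: str, value) -> None:
        addr, fp = _addr_fp(key, self.log_slots, self.fp_bits)
        self.slots[addr][fp] = value

    def get(self, key: str):
        addr, fp = _addr_fp(key, self.log_slots, self.fp_bits)
        return self.slots[addr].get(fp)  # None, or a (possibly false) hit
```

In an LSM-tree, the stored value might identify the run or level holding the key, so a read can go directly to the right run instead of probing one filter per run.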

Title: Theory and Practice of Adaptive Filters (Sam McCauley)

Abstract: Classic filter analysis bounds the error probability of a single query. If there is a false positive on one element, then repeated queries to that element will always result in a false positive. This leads to several issues. First, an adversary can repeat false positive queries, leading to arbitrarily poor filter performance; this is a problem from a security standpoint. Second, the single-query analysis leaves performance on the table: many practical datasets have elements that are queried many times, and reducing the false positive rate on these repeated queries can improve performance.

In this talk I will discuss several recent data structures that close this gap, fixing false positives so that subsequent queries to the fixed elements are guaranteed to be answered correctly. We will look at upper and lower bounds from a theoretical standpoint and discuss opportunities to make these theoretical ideas practical to implement.
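One simple adaptivity mechanism (a toy sketch, not any of the talk's specific constructions) lengthens the stored fingerprints that collide with a confirmed false positive, so that repeating the query can never err again. The sketch keeps the original keys around, standing in for the "remote" ground truth an adaptive filter consults.

```python
import hashlib

def hbits(key: str, n: int) -> int:
    # First n bits of a stable 256-bit hash of key (illustrative).
    h = hashlib.sha256(key.encode()).digest()
    return int.from_bytes(h, "big") >> (len(h) * 8 - n)

class AdaptiveFingerprintFilter:
    """Toy adaptive filter storing variable-length fingerprint prefixes.
    fix_false_positive() must only be called for keys the backing store
    has confirmed absent (otherwise extension could never terminate)."""

    def __init__(self, base_bits: int = 8):
        self.base_bits = base_bits
        self.entries = []  # (key, fingerprint length); fp derived on demand

    def insert(self, key: str) -> None:
        self.entries.append((key, self.base_bits))

    def may_contain(self, q: str) -> bool:
        return any(hbits(q, ln) == hbits(k, ln) for k, ln in self.entries)

    def fix_false_positive(self, q: str) -> None:
        # Extend each colliding entry's fingerprint until it no longer
        # matches q; future queries for q are then answered correctly.
        for i, (k, ln) in enumerate(self.entries):
            while hbits(q, ln) == hbits(k, ln):
                ln += 1
            self.entries[i] = (k, ln)
```

Fixing a false positive costs extra fingerprint bits, which is the space/error tradeoff that the upper and lower bounds in the talk quantify.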

Title: Breaking Bloom: The success of the Bloom filter in Bioinformatics, and how to move beyond it (Rob Patro)

Abstract: Despite the substantial and rapidly advancing development of filter data structures, which provide both fundamentally new capabilities and improved practical performance, the Bloom filter remains the most prevalent AMQ for many applications in Bioinformatics. Why is this the case? In this talk, I will highlight some of the most common uses of the AMQ ADT in Bioinformatics and Computational Biology applications. I will describe how the Bloom filter has been used, and how it has, in several cases, been specialized for the type of sequencing data that arises in high-throughput genomics. I will attempt to identify ongoing challenges where filter data structures may prove instrumental, describe the properties that successful approaches to these problems might have, and argue that characteristics beyond those provided by the Bloom filter may be useful, or even essential.

Title: High-Performance Filters for GPUs (Prashant Pandey)

Abstract: Filters approximately store a set of items, trading accuracy for space efficiency, and can address the limited memory on accelerators such as GPUs. However, there is a lack of high-performance, feature-rich GPU filters, as most advances in filter research have focused on CPUs.

In this talk, we explore the design space of filters with the goal of developing massively parallel, high-performance, and feature-rich filters for GPUs. We evaluate various filter designs in terms of performance, usability, and supported features, and identify two designs that offer the right trade-offs among them. We present two new GPU-based filters, the TCF and the GQF, that can be employed in various high-performance data analytics applications. The TCF is a set-membership filter with faster inserts and queries, whereas the GQF additionally supports counting, which comes at some performance cost. Both filters provide point and bulk insertion APIs and are designed to exploit the massive parallelism of the GPU without sacrificing usability or necessary features. In our benchmarks, the TCF and GQF are up to 4.4× and 1.4× faster, respectively, than previous GPU filters, while overcoming fundamental performance and usability constraints of those filters.

Code of Conduct

The workshop follows the ACM Policy on Harassment and Discrimination.