External
YouTube
Great explanatory video on bloom filters.
A recent coding challenge I was doing opened up a world of possibilities to me, as it introduced me to a data structure that I wasn't aware of before. These ideas have given me the thirst to explore some more exogenous data structures. In this post I want to look at two. A Sparse Matrix and a Bloom Filter.
When a function produces a lot of null values certain data structures become a hindrance and certainly bloated. This increasingly becomes a problem when you start scaling to very large matrices which will bring about a massive waste of memory usage.
For example you have a 5x5 matrix:
\[ \begin{pmatrix} 0 & 0 & 0 & 0 & 2 \\ 0 & 0 & 0 & 2 & 0 \\ 0 & 0 & 0 & 0 & 0 \\ 0 & 2 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 \end{pmatrix} \]
Most of the elements are zeroed out and and it doesn't make sense to store the null values multiple times. If you remove the uneeded values you will lose the ordinal structure of the data, so you need another storage solution. The more obvious way would be to store the data in a dictionary and treat keys that are not in the dictionary as null values instead, but there is a more effective method processing the data into a compressed structure.
class SparseMatrix:
def __init__(self, data: list[list[int]]) -> None:
raw_values = defaultdict(list)
self.values = self.format_values(data, raw_values, 0)
self.rows = [0]
for value in list(raw_values.values()):
self.rows.append(len(value) + self.rows[-1])
self.cols = list(chain(*raw_values.values()))
@staticmethod
def format_values(
raw_values: list,
dimensions: defaultdict[int, list],
index: int,
) -> list:
values = []
for i, value in enumerate(raw_values):
if isinstance(value, list):
values += SparseMatrix.format_values(value, dimensions, i)
elif value:
dimensions[index].append(i)
values.append(value)
return values
This code snippet takes the dictionary as an intermediate step and creates an efficient mapping between columns and rows which resemble the original matrix without losing detail.
For example if you take the aforementioned example and use the SparseMatrix
class you get this output from the respective attributes:
Values: [2, 2, 2, 1]
Col index: [4, 3, 1, 1]
Row index: [0, 1, 2, 3, 4]
Where each column index maps to a value and the row index maps to a index in the column index array. Looping through this structure is simple, as you easily can expand the values through a loop.
Of course this implementation is not optimal and should really be written in a mature format such as Numpy array where these kinds of operations are optimized. A lot of python packages already implement this such as the SciPy library and Sparse package.
What if full memory accuracy isn't an issue and all you want is to check if something is possibly contained. On a shallow level you would think this kind of issues doesn't come up, but it does once you start looking at scale and finer details of day to day operations.
The whole idea of a bloom filter works on overlapping hash functions where the default state of each bit in the m sized array is set to 0. When you add an element you hash it multiple times and set the bits at those locations 1. Verifying if an element is in a set is as simple as checking if any bits are 0 when hashing another element you want to check. Of course this is a simplified overview of the structure, so here is a video explains the implementation really well and is where I originally learnt about bloom filters.
Interesting use cases for this are dealing with caching one hit wonders where its a good way of finding if an element should be cached not. The implementation of queries in various databases are also optimised with this. Like Google Bigtable or PostgreSQL as lot of rows might have values zeroed out which make querying for results that much more efficient.
Similarly there are established packages such as Pybloomfiltermap or Rbloom that implement and maintain this structure in more detail.
Comparing the sparse matrix and bloom filter structure show a lot of similarities between them, as both are dealing with inefficient data and compressing it in such a way that it doesn't break the final intended functionality.