## BloomFilter This class provides an implementation of Bloom Filter, a space and time-efficient approach to represent a set that supports insertions, deletions, and membership queries. The Bloom Filter was first introduced in the paper "Space/Time Trade-offs in Hash Coding with Allowable Errors" by Burton H. Bloom. To import the class, use the following: ```python from streamsketchlib.bloom_filter import BloomFilter ``` ### Overview The data structure allows three operations: - insert an element into the set - delete an element from the set - check if an element is in the set If there are at most `n` elements in the set and `delta` is the false positive rate (i.e., the probability that an element is not in the set but is falsely reported as being in the set by the data structure), then the data structure uses roughly `O~(n * log(1/delta))` bits of memory for any reasonable `n` that is at most the number of particles in the observable universe (excluding overheads). The update time is `O(1/eps^2 log (1/delta))`. If each element is an image of size `200kb`, then if we need to store `m = 10^6` images, then the total space is `2 * 10^8 Kb = 200Gb`. So if we allow `1%` false positive rate, then the space that Bloom Filter uses is `10^6 * ln(log(1/0.01))/ln(2)` bits which is approximate `0.2754 Mb`. This is a big saving. Each operation takes roughly `O(log(1/delta))` time. ### Initialization To initialize an instance of this class, we can specify the following parameters: - `delta`: controls the false positive rate. The default value is `0.01`. - `n`: the maximum number of elements to be inserted into the filter. - `seed`: the seed for randomness. The default value is `42`. ```python delta = 0.1 n = 1000 B = BloomFilter(n = n, delta = delta, seed = 50) ``` ### insert To insert an element into the set, the element must be a byte-like object. The simplest approach is to convert an object to a string. For example, ```python delta = 0.0001 n = 100 B = BloomFilter(n = n, delta = delta) for i in range(n): B.insert(str(i)) ``` ### delete To delete an element from the set, use the delete function. However, note that the overall correctness is only guaranteed if the element exists in the set. For example, ```python delta = 0.05 n = 1000 B = BloomFilter(n = 2*n, delta = delta) C = BloomFilter(n = 2*n, delta = delta) false_positives = 0 for i in range(n): B.insert(str(i)) for i in range(int(n/2), int((3/2)*n)): C.insert(str(i)) B.merge(C) for j in range(2*n): if j >= (3/2)*n and B.membership(str(j)) == True: false_positives += 1 print(false_positives) >>> 12 ``` ### membership Check if an element is in the set, use the membership query function. However, there is a probability `delta` of false positives, meaning the function may incorrectly return `True` for an element that is not in the set. For example, ```python delta = 0.0001 n = 10000 false_positives = 0 B = BloomFilter(n = n, delta = delta) for i in range(n): B.insert(str(i)) for j in range(2*n): if j >= n and B.membership(str(j)) == True: false_positives += 1 print(false_positives) >>> 1 ``` ### merge To merge with another Bloom filter with the same seed, use the merge function. The resulting filter will provide the answer to the union of two sets. For example, ```python delta = 0.05 n = 1000 B = BloomFilter(n = 2*n, delta = delta) C = BloomFilter(n = 2*n, delta = delta) false_positives = 0 for i in range(n): B.insert(str(i)) for i in range(int(n/2), int((3/2)*n)): C.insert(str(i)) B.merge(C) for j in range(2*n): if j >= (3/2)*n and B.membership(str(j)) == True: false_positives += 1 print(false_positives) >>> 12 ``` ### + operator Merge two Bloom filters A and B with the same seeds. The resulting filter will provide the answer to the union of two sets. In other words, A = A + B is the same as A.merge(B). ```python delta = 0.05 n = 1000 B = BloomFilter(n = 2*n, delta = delta) C = BloomFilter(n = 2*n, delta = delta) false_positives = 0 for i in range(n): B.insert(str(i)) for i in range(int(n/2), int((3/2)*n)): C.insert(str(i)) B = B + C for j in range(2*n): if j >= (3/2)*n and B.membership(str(j)) == True: false_positives += 1 print(false_positives) >>> 12 ``` ### from_existing Create a new Bloom Filter with similar parameters so that they can be merged later. ```python delta = 0.05 n = 1000 B = BloomFilter(n = 2*n, delta = delta) C = BloomFilter.from_existing(B) false_positives = 0 for i in range(n): B.insert(str(i)) for i in range(int(n/2), int((3/2)*n)): C.insert(str(i)) B.merge(C) for j in range(2*n): if j >= (3/2)*n and B.membership(str(j)) == True: false_positives += 1 print(false_positives) >>> 12 ```