## Distinct Count This class provides a space and time efficient data structure (called a sketch) to estimate the number of distinct element up to a factor `(1 ± epsilon)` with probability at least `1-delta` in a data stream. To import the class, use the following: ```python from streamsketchlib.distinct_count import DistinctCount ``` ### Overview The update time is `O(log 1/eps)`. It supports the following operations: - insert a token to the stream. - return the estimate for number of distinct elements the stream encounters so far up to a factor `1±eps` with probability at least `1-delta`. - combine the sketches of two streams so that we can estimate the number of distinct elements of the combined stream. ### Initialization To initialize an instance of this class, we can specify the following parameters: - `delta`: controls the failure probability. The default value is `0.01`. - `epsilon`: controls the estimate's quality. The default value is `0.01`. - `seed`: the seed for randomness. The default value is `42`. - `algorithm`: -`BJKST_1` (default) : This is an implementation of the first algorithm in the paper "Counting Distinct Elements in a Data Stream" by Ziv Bar-Yossef, T. S. Jayram, Ravi Kumar, D. Sivakumar & Luca Trevisan. The data structure uses roughly `O~(1/eps^2 * log(1/delta))` memory (excluding overheads). This is currently the only implementation. For example, ```python stream = DistinctCount(delta=0.01, epsilon=0.05, seed=42) stream2 = DistinctCount() ``` ### insert Insert a new token into the stream. The token must be byte-like objects. The easiest way to achieve this is to convert a token to string. For example, ```python stream = DistinctCount(delta=0.01, epsilon=0.05, seed=42) stream.insert("apple") stream.insert("orange") stream.insert("apple") ``` ### estimator Return the estimate of the number of distinct elements that have appeared in the stream so far up to a factor (1 ± epsilon) with probability at least 1-delta. For example, ```python stream = DistinctCount(delta=0.01, epsilon=0.05, seed=42) for i in range(100, 200): stream.insert(str(i)) for i in range(150, 250): stream.insert(str(i)) print(stream.estimator()) >>> 150 ``` ### merge Merge with another sketch with the same seed. The resulted sketch will provide answer to the combined stream. For example, ```python stream = DistinctCount(delta=0.01, epsilon=0.05, seed=42) stream2 = DistinctCount(delta=0.01, epsilon=0.05, seed=42) for i in range(100, 200): stream.insert(str(i)) for i in range(150, 250): stream2.insert(str(i)) stream.merge(stream2) print(stream.estimator()) >>> 150 ``` ### + Operator If A is a sketch of some stream and B is a sketch of another stream, then A + B returns the merged sketch that provides the answer to the combined stream. In other words, A = A+B is the same as A.merge(B). For example, ```python stream = DistinctCount(delta=0.01, epsilon=0.05, seed=42) stream2 = DistinctCount(delta=0.01, epsilon=0.05, seed=42) for i in range(100, 200): stream.insert(str(i)) for i in range(150, 250): stream2.insert(str(i)) stream = stream + stream2 print(stream.estimator()) >>> 150 ``` ### from_existing Create a new sketch with similar parameters so that they can be merged later. For example, ```python stream = DistinctCount(delta=0.01, epsilon=0.05, seed=42) stream2 = DistinctCount.from_existing(stream) for i in range(100, 200): stream.insert(str(i)) for i in range(150, 250): stream2.insert(str(i)) stream.merge(stream2) print(stream.estimator()) >>> 150 ```