Insert-Only Model
Paper: Streaming Algorithms for Graph k-matching with optimal or near optimal update time
What is a Streaming Algorithm:
A graph stream $\mathcal{S}$ for an underlying graph $G$ is a sequence of elements, each of the form $(e, op)$, where $op$ is an update to an edge $e$ – either an insertion or deletion (including edge weight if the graph is weighted).
Graph Stream Looks Like
\[S = (e_1, op_1), (e_2, op_2), \dots\]
A streaming algorithm processes this sequence one element at a time using limited resources:
- Space (memory): The algorithm cannot store the full graph – it must maintain a compact summary.
- Update time: The time to process each incoming edge must be small, since data streams can arrive at very high rates. If the update processing rate cannot keep up with the update arrival rate, the whole system may fail.
The performance of a streaming algorithm is measured by both its space complexity and its update time per element.
Two Models
Notation reference: Streaming Model Notation
| Insert-Only | Dynamic |
|---|---|
| Every operation is an edge insertion | Operations can be insertions or deletions |
| Simpler – only grows | Harder – graph can shrink |
What is graph k-matching:
Notation reference: Graph Notation
A matching is a set of edges with no shared vertices.
Consider the graph below with vertices {1, 2, 3, 4}:
The graph:
1 ---- 2
| |
| |
3 ---- 4
A matching is a set of edges where no vertex appears more than once:

Valid matching M1:
1 ==== 2

3      4
Matched: {1,2}, wt(M1) = wt([1,2])

Valid matching M2:
1      2
|
|
3      4
Matched: {1,3}, wt(M2) = wt([1,3])
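The no-shared-vertex condition is straightforward to check in code; a minimal sketch (the function name is illustrative):

```python
def is_matching(edges):
    """A set of edges is a matching iff no vertex appears in two edges."""
    seen = set()
    for u, v in edges:
        if u in seen or v in seen:
            return False
        seen.update((u, v))
    return True

# Matchings in the square graph with edges 1-2, 1-3, 2-4, 3-4:
assert is_matching([(1, 2)])               # M1
assert is_matching([(1, 3)])               # M2
assert is_matching([(1, 2), (3, 4)])       # a maximum matching, size 2
assert not is_matching([(1, 2), (2, 4)])   # vertex 2 is shared
```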
What is optimal or near optimal:
There is a known lower bound of $\Omega(k^2)$ on space and $\Omega(1)$ on update time for streaming algorithms solving Graph $k$-Matching. Algorithms are evaluated against these bounds:
- Optimal: Matches the lower bound exactly (e.g., $O(k^2)$ space, $O(1)$ update time).
- Near-optimal: Optimal up to a polylogarithmic factor (e.g., $O(k^2 \cdot \operatorname{polylog}(n))$ space or $O(\operatorname{polylog}(n))$ update time).
Previous work
Known Lower Bounds
Before any algorithms: the minimum resources any correct randomized streaming algorithm must use.
- Space: $\Omega(k^2)$ – proven even in the simpler insert-only model on unweighted graphs
- Update time: $\Omega(1)$ – trivially, you must at least read each edge
These bounds are the target. Algorithms are judged by how close they get.
2014 – Fafianie & Kratsch (MFCS 2014)
Problem: Graph $k$-Matching, unweighted graphs, insert-only model
Result:
- Space: $O(k^2)$ – matches the lower bound
- Update time: $O(\log k)$ – good, but not yet $O(1)$
- Deterministic (always correct, no randomness)
Method: Kernelization streaming – maintain a small kernel that preserves the matching structure.
Limitation: Unweighted graphs only. No dynamic deletions.
2015 – Chitnis et al. (SODA 2015 + SPAA 2015)
Problem: Graph $k$-Matching, unweighted graphs, dynamic model (insertions and deletions)
Result:
- Space: $O(k^2 \cdot \operatorname{polylog}(n))$
- Update time: $O(k^2 \cdot \operatorname{polylog}(n))$ – slow
Notable: This was the first dynamic streaming algorithm for $k$-matching. Dynamic streams are much harder – deletions can undo structure the algorithm has built. The SPAA 2015 paper refined the SODA 2015 result but kept the same asymptotic bounds.
Key assumption – “Size-$k$ Constraint”: The algorithm assumes the maximum matching size in the stream never exceeds $k$ at any point. This is a strong restriction – it does not apply to arbitrary graph streams.
2016 – Chitnis et al. (SODA 2016)
This paper made several advances, each addressing a different scenario:
Case 1: Dynamic model with Size-$k$ Constraint (unweighted)
- Space: $O(k^2 \cdot \operatorname{polylog}(n))$
- Update time: $O(\operatorname{polylog}(n))$ – major improvement over the 2015 $O(k^2 \cdot \operatorname{polylog}(n))$
- Method: sketch-based approach
Case 2: Dynamic model without Size-$k$ Constraint (unweighted)
- Space: $O(k^2 \cdot \operatorname{polylog}(n))$
- Update time: $O(k^2 \cdot \operatorname{polylog}(n))$ – still slow for general streams
Case 3: Weighted graphs (dynamic model)
- Space and update time multiply by $W$ (number of distinct edge weights) compared to unweighted case
- Fast update time $O(\operatorname{polylog}(n))$ only when Size-$k$ Constraint holds
- Without Size-$k$ Constraint: update time $O(k^2 \cdot \operatorname{polylog}(n))$
Limitation: Fast update time only achievable under the Size-$k$ Constraint – a strong assumption not always satisfied in practice.
Summary Timeline
| Year | Authors | Model | Graph | Space | Update Time | Constraint |
|---|---|---|---|---|---|---|
| 2014 | Fafianie & Kratsch | Insert-only | Unweighted | $O(k^2)$ | $O(\log k)$ | None |
| 2015 | Chitnis et al. | Dynamic | Unweighted | $O(k^2 \cdot \operatorname{polylog}(n))$ | $O(k^2 \cdot \operatorname{polylog}(n))$ | Size-$k$ |
| 2016 | Chitnis et al. | Dynamic | Unweighted | $O(k^2 \cdot \operatorname{polylog}(n))$ | $O(\operatorname{polylog}(n))$ | Size-$k$ |
| 2016 | Chitnis et al. | Dynamic | Unweighted | $O(k^2 \cdot \operatorname{polylog}(n))$ | $O(k^2 \cdot \operatorname{polylog}(n))$ | None |
| 2016 | Chitnis et al. | Dynamic | Weighted | $O(Wk^2 \cdot \operatorname{polylog}(n))$ | $O(k^2 \cdot \operatorname{polylog}(n))$ | None |
Contributions
What This Paper Achieves
| Model | Space | Update Time | Status |
|---|---|---|---|
| Insert-only | $O(k^2)$ | $O(1)$ | Optimal on both |
| Dynamic | $O(W k^2 \cdot \operatorname{polylog}(n))$ | $O(\operatorname{polylog}(n))$ | Near-optimal on both |
Where $W$ = number of distinct edge weights.
Why Update Time Matters
Previous work (Fafianie & Kratsch) achieved $O(k^2)$ space but only $O(\log k)$ update time. The prior dynamic algorithm had $O(k^2 \cdot \operatorname{polylog}(n))$ update time – far too slow for high-rate streams. This paper closes that gap.
Significance: Improving runtime where previous algorithms were too slow for fast data streams.
Problems
- P-Matching: Find any $k$-matching in $G$ or report that no $k$-matching exists.
- P-wMatching: Find a maximum weighted $k$-matching in $G$ or report that no $k$-matching exists.
Note: P stands for parameterized – complexity is measured in terms of $n$ and $k$.
Hash Functions
Notation reference: Hash Function Notation
- General form: $h : U \to [r]^-$
- Meaning: Maps elements to “buckets” numbered $0$ to $r-1$
Perfect Hash Functions (w.r.t. S)
Notation reference: Hash Function Notation
- No collisions within $S$: for all $x, y \in S$ with $x \neq y$, $h(x) \neq h(y)$
Universal Hash Families
Notation reference: Hash Function Notation
For any two distinct elements $x \neq y$:
\[\Pr_{h \leftarrow H}[h(x) = h(y)] \le \frac{1}{r}\]
- $r$ = number of buckets mapped to
- $h$ is chosen u.a.r. from $H$
Constructed Hash Function (Proposition 2.3)
\[h_{a,b,r}(x) = ((a x + b) \bmod p) \bmod r\]
Notation reference: Hash Function Notation
Parameters
- $a$: multiplier, $1 \leq a \leq p-1$
- $b$: offset, $0 \leq b \leq p-1$
- $p$: a prime $\geq \lvert U\rvert$
- $r$: number of buckets
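A minimal sketch of drawing $h_{a,b,r}$ from this family. The prime `P` below is an arbitrary illustrative choice (any prime $\geq \lvert U\rvert$ works), not a value from the paper:

```python
import random

P = 2_147_483_647  # a prime >= |U|; 2^31 - 1, chosen for illustration

def make_hash(r, p=P, rng=random):
    """Draw h_{a,b,r}(x) = ((a*x + b) mod p) mod r u.a.r. from the family."""
    a = rng.randrange(1, p)   # multiplier, 1 <= a <= p-1
    b = rng.randrange(0, p)   # offset,     0 <= b <= p-1
    return lambda x: ((a * x + b) % p) % r

h = make_hash(r=16)
assert all(0 <= h(x) < 16 for x in range(1000))  # buckets 0..r-1
```

Note that only the four integers $a$, $b$, $p$, $r$ need to be stored, which is where the $O(\log n)$-bit space note below comes from.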
Space Note
- Only need to store 4 values ($a$, $b$, $p$, $r$), each in $O(\log n)$ bits, because it takes $O(\log n)$ bits to store an integer of size $n$
- Parameters are chosen based on $n$, so the bound in terms of $n$ is valid
Proposition 2.4 – Probability of Collision w.r.t. S
- We have $r$ elements
- Use $r^2$ buckets
- Pick a random universal hash function
Why (Skip unless asked):
- Distinct pairs in $S$: \(\binom{r}{2} = \frac{r(r-1)}{2}\)
- Since $H$ is universal: \(\Pr[h(x) = h(y)] \le \frac{1}{r^2}\) (mapping into $[r^2]$ buckets)
- Probability of at least one collision: \(\Pr[\text{at least one collision}] \le \binom{r}{2} \cdot \frac{1}{r^2} = \frac{r(r-1)}{2r^2} = \frac{r-1}{2r} < \frac{1}{2}\)
- Therefore: \(\Pr[h \text{ is perfect w.r.t. } S] > \frac{1}{2}\)
Note: We use $r^2$ buckets to reduce collision probability.
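The bound can be sanity-checked empirically. The sketch below draws hash functions from the $h_{a,b,r}$ family of Proposition 2.3 and measures how often one is perfect w.r.t. a fixed $S$ of $20$ elements using $20^2 = 400$ buckets; all parameter values here are illustrative:

```python
import random

def make_hash(r, p=1_000_003, rng=None):
    """h_{a,b,r}(x) = ((a*x + b) mod p) mod r, drawn u.a.r. (Prop. 2.3)."""
    rng = rng or random
    a, b = rng.randrange(1, p), rng.randrange(0, p)
    return lambda x: ((a * x + b) % p) % r

def is_perfect(h, S):
    """h is perfect w.r.t. S iff no two elements of S share a bucket."""
    return len({h(x) for x in S}) == len(S)

rng = random.Random(0)
S = rng.sample(range(1_000_000), 20)      # |S| = r = 20 elements
trials = 2000
hits = sum(is_perfect(make_hash(r=20 * 20, rng=rng), S)
           for _ in range(trials))
assert hits / trials > 0.5                # perfect with probability > 1/2
```

In practice the empirical fraction is noticeably above $1/2$, since the union bound in the proof is loose.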
Applying to this paper
\[f : V \to [4k^2]^-\]
Notation reference: Hash Function Notation
A hash function that maps each vertex from $V$ to a bucket from $0$ to $4k^2 - 1$.
- “$M$ is Nice” if the hash function is perfect for the $2k$ vertices of matching $M$ (i.e., no two vertices of $M$ are in the same bucket $V_i$). This applies to any matching, not just the optimal one.
- Reminder: the $4k^2$ refers back to Proposition 2.4. A $k$-matching has $2k$ vertices to map with $f$, so with $r = 2k$ we need $r^2 = (2k)^2 = 4k^2$ buckets.
Edge Ordering
Notation reference: Edge Ordering and Subgraph Notation
- Edge ordering definition: \(\beta(e) = (wt(e), u, v) \quad \text{where } u < v\)
- This defines a unique ordering for all edges:
  - Highest weight first
  - Break ties using ordered vertices
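Since $\beta$ maps each edge to a tuple, the ordering is just lexicographic tuple comparison; a minimal sketch (edges are represented as `(u, v, wt)` triples, an assumption for illustration):

```python
def beta(edge):
    """beta(e) = (wt(e), u, v) with u < v: weight first, vertex ties second."""
    u, v, wt = edge
    u, v = min(u, v), max(u, v)
    return (wt, u, v)

edges = [(1, 2, 5.0), (3, 1, 5.0), (2, 4, 7.0)]
ordered = sorted(edges, key=beta, reverse=True)   # heaviest beta-value first
assert ordered[0] == (2, 4, 7.0)                  # highest weight wins
assert ordered[1] == (3, 1, 5.0)                  # weight tie broken by vertices
```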
Reduced Compact Subgraph
Notation reference: Edge Ordering and Subgraph Notation
1) Create $C_f(H)$ (Compact Subgraph)
Steps in English
- Look at all edges between bucket pairs $(V_i, V_j)$ in $H$
- Buckets are defined by the hash function
- Keep only the heaviest edge between each bucket pair
- Use maximum $\beta$-value ordering
- Discard all other edges between the same bucket pair
The matching can have at most one edge per bucket pair, so the heaviest edge is the best candidate.
Formal definition:
\[\mathcal{C}_f(H) = \bigcup_{i \neq j} \left\{ \arg\max_{e \in H[V_i, V_j]} \beta(e) \right\}\]
where $H[V_i, V_j]$ denotes the set of edges in $H$ with one endpoint in $V_i$ and one in $V_j$.
Equivalently:
\[\mathcal{C}_f(H) = \left\{ e \in H \;\middle|\; e \text{ has endpoints in distinct buckets } V_i, V_j \text{ and } \beta(e) = \max_{e' \in H[V_i,V_j]} \beta(e') \right\}\]
Size bound
- Theoretical maximum (if $H$ had an edge between every bucket pair): $\binom{4k^2}{2} = O(k^4)$. Intuitively: there are $4k^2$ buckets and at most one edge is kept per bucket pair $\{(V_0,V_1), (V_0,V_2), \dots\}$
- Actual bound in the streaming context: $\lvert C_f(H)\rvert \leq \lvert H\rvert = O(k^2)$, since $C_f(H) \subseteq H$ and $H$ only has $O(k^2)$ edges to begin with
Note:
Space is always $O(k^2)$ – we never reach $O(k^4)$.
Since $C_f(H) \subseteq H$, the compact subgraph can only contain edges already in $H$. It cannot create new edges. Therefore $\lvert C_f(H)\rvert \leq \lvert H\rvert = O(k^2)$ in the streaming context.
The $O(k^4)$ figure is a loose combinatorial upper bound – the number of possible bucket pairs if $H$ had an edge between every pair of the $4k^2$ buckets. Since $H$ itself only has $O(k^2)$ edges, only $O(k^2)$ bucket pairs are ever populated.
Both mechanisms keep $G_s^f$ at $O(k^2)$:
- Batching bounds $H = G_{\hat{s}}^f \cup G_{\hat{s}+1,s}$ to at most $8k^2$ edges, preventing unbounded growth.
- Step 3 of $R_f$ – the global filter – reduces $H$ from $\leq 8k^2$ down to $\leq 4k^2$, maintaining the size guarantee across recursive steps.
2) Local Filter
For each bucket $V_i$:
- Count edges incident to $V_i$ in $C_f(H)$
- An edge is deleted if it is not among the $2k$ heaviest at both of its endpoint buckets $V_i$ and $V_j$
- This gives $\mathcal{R}'_f(H)$
Formal definition:
Let $\Gamma(V_i)$ denote the edges of $\mathcal{C}_f(H)$ incident to bucket $V_i$, ordered by $\beta$-value. Then:
\[\mathcal{R}'_f(H) = \left\{ [u,v] \in \mathcal{C}_f(H) \;\middle|\; u \in V_i,\; v \in V_j,\; i \neq j,\; [u,v] \in \operatorname{top}_{2k}(\Gamma(V_i)) \text{ and } [u,v] \in \operatorname{top}_{2k}(\Gamma(V_j)) \right\}\]where $\operatorname{top}_{2k}(\Gamma(V_i))$ denotes the $2k$ heaviest edges (by $\beta$-value) in $\mathcal{C}_f(H)$ incident to $V_i$.
3) Global Filter
- Keep only the $4k^2$ heaviest edges overall from $\mathcal{R}'_f(H)$
- This gives $R_f(H)$
Formal definition:
\[\mathcal{R}_f(H) = \operatorname{top}_{4k^2}\!\left(\mathcal{R}'_f(H)\right)\]
where $\operatorname{top}_{4k^2}(\mathcal{R}'_f(H))$ denotes the $4k^2$ heaviest edges (by $\beta$-value) in $\mathcal{R}'_f(H)$.
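Putting the three steps together, the reduction can be sketched as follows. The edge representation `(u, v, wt)` and the function names are assumptions for illustration, not the paper's code:

```python
from collections import defaultdict

def beta(e):
    """Unique edge ordering: weight first, ordered vertices as tiebreak."""
    u, v, wt = e
    return (wt, min(u, v), max(u, v))

def reduce_Rf(H, f, k):
    # Step 1 -- compact subgraph C_f(H): keep the max-beta edge per bucket pair.
    best = {}
    for e in H:
        u, v, _ = e
        i, j = sorted((f(u), f(v)))
        if i == j:
            continue  # endpoints share a bucket: the edge cannot be "nice"
        if (i, j) not in best or beta(e) > beta(best[(i, j)]):
            best[(i, j)] = e
    C = list(best.values())

    # Step 2 -- local filter R'_f(H): an edge survives only if it is among
    # the 2k heaviest edges incident to BOTH of its endpoint buckets.
    incident = defaultdict(list)
    for e in C:
        u, v, _ = e
        incident[f(u)].append(e)
        incident[f(v)].append(e)
    top2k = {i: set(sorted(es, key=beta, reverse=True)[: 2 * k])
             for i, es in incident.items()}
    Rp = [e for e in C if e in top2k[f(e[0])] and e in top2k[f(e[1])]]

    # Step 3 -- global filter R_f(H): keep the 4k^2 heaviest edges overall.
    return sorted(Rp, key=beta, reverse=True)[: 4 * k * k]
```

For example, with `f` the identity and `k = 1`, the edge `(3, 4, 1.0)` in `reduce_Rf([(1,2,5.0), (1,3,3.0), (2,3,4.0), (3,4,1.0)], lambda x: x, 1)` is dropped by the local filter, since it is not among the $2k = 2$ heaviest edges at bucket 3.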
Lemma 3.1
Notation reference: Edge Ordering and Subgraph Notation
- A maximum weighted nice $k$-matching in $C_f(H)$ is preserved exactly in $R_f(H)$
- Justifies why working with the tiny $R_f(H)$ is sufficient
Formal definition:
Let $\mathcal{M}_k^f(H)$ denote the set of all nice $k$-matchings in $H$ w.r.t. $f$. Then:
Part 1 – Existence is preserved:
\[\mathcal{M}_k^f(\mathcal{C}_f(H)) \neq \emptyset \iff \mathcal{M}_k^f(\mathcal{R}_f(H)) \neq \emptyset\]
Part 2 – Optimal weight is preserved:
\[\max_{M \,\in\, \mathcal{M}_k^f(\mathcal{C}_f(H))} wt(M) = \max_{M \,\in\, \mathcal{M}_k^f(\mathcal{R}_f(H))} wt(M)\]
Both parts together mean: $\mathcal{R}_f(H)$ is sufficient – no nice $k$-matching information is lost by the reduction.
Lemma 3.2
- $R_f(H)$ can be constructed in $O(\lvert H\rvert + k^2)$ time and space
- This is what makes the algorithm fast enough to interleave
Recursive Design Setup
Notation reference: Insert-Only Algorithm Notation
Stream Subgraphs
- $G_s$ = subgraph of the first $s$ edges in the stream: $\{e_1, e_2, \dots, e_s\}$
- $G_{r,s}$ = subgraph of edges $r$ through $s$: $\{e_r, e_{r+1}, \dots, e_s\}$
- $\hat{s}$ = largest multiple of $4k^2$ that is less than $s$
Recursive Definition of $G_s^f$
Notation reference: Insert-Only Algorithm Notation
Base case:
\[G_0^f = \varnothing\]
Recursive step:
\[G_s^f = R_f\left(G_{\hat{s}}^f \cup G_{\hat{s}+1, s}\right)\]
Meaning
- Take previous summary $G_{\hat{s}}^f$
- Take new batch $G_{\hat{s}+1, s}$
- Union them together
- At most $8k^2$ edges total
- Apply $R_f$ to reduce back down
- Result has $\leq 4k^2$ edges
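The batching logic above can be sketched as a fold over the stream. Here `reduce_Rf` is a placeholder parameter for the three-step reduction; `toy_reduce` is a simplified stand-in (global filter only) included just to make the sketch runnable:

```python
def beta(e):
    u, v, wt = e
    return (wt, min(u, v), max(u, v))

def toy_reduce(H, f, k):
    """Simplified stand-in for R_f: keep only the 4k^2 heaviest edges."""
    return sorted(set(H), key=beta, reverse=True)[: 4 * k * k]

def stream_summary(stream, f, k, reduce_Rf):
    """Maintain G_s^f = R_f(previous summary + current batch), batch by batch."""
    summary, batch = [], []              # G_shat^f and the batch buffer
    for e in stream:
        batch.append(e)
        if len(batch) == 4 * k * k:      # batch full: at most 8k^2 edges in
            summary = reduce_Rf(summary + batch, f, k)
            batch = []                   # ...and at most 4k^2 edges out
    return reduce_Rf(summary + batch, f, k)  # final fold at end of stream

edges = [(i, i + 1, float(i)) for i in range(1, 11)]
G = stream_summary(edges, None, 1, toy_reduce)
assert sorted(e[2] for e in G) == [7.0, 8.0, 9.0, 10.0]  # 4k^2 = 4 heaviest kept
```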
Lemma 3.3 Gives
- $G_s^f$ has at most $4k^2$ edges
Key property:
\[G_s^f = R_f(G_s)\]
Meaning:
- Computing $R_f$ in batches gives the same result as computing $R_f$ on all edges at once
- Guarantees no loss of information
Why This Matters
- The recursive design makes streaming computation possible
- Otherwise we would need to store all edges to compute $R_f(G)$
Insert Only Model Algorithm
Notation reference: Insert-Only Algorithm Notation
Input: $\mathcal{S} = \{ e_1, e_2, e_3, \ldots, e_s, \ldots \}$ ($s$ is a running index)
Output: Maximum weighted $k$-matching in $G$ or report no $k$-matching
Preprocessing
1: Let $\mathcal{H}_\epsilon$ be a set of $\lceil\log(1/\epsilon)\rceil$ hash functions picked u.a.r. from a universal family of hashing functions mapping $V(G) \to [4k^2]^-$.
Why: Back to Proposition 2.4: $\Pr[f_i \text{ is perfect w.r.t. } V(M)] \geq 1/2$
Since we want probability to be even better, we choose not one but $\lceil\log(1/\epsilon)\rceil$ hash functions.
Each $f_i$ is picked independently and u.a.r from the universal family.
$\Pr[\text{all } f_1, f_2, \ldots, f_t \text{ fail}] \leq (1/2)^t = 1/2^t$
Since $t = \lceil\log(1/\epsilon)\rceil$
Thus $\Pr[\text{all fail}] \leq 1 / 2^{\log(1/\epsilon)} = \epsilon$
$\Pr[\text{one succeeds}] \geq 1 - \epsilon$
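The amplification arithmetic in miniature (a sketch; the function name is illustrative):

```python
import math

def num_hashes(eps):
    """t = ceil(log2(1/eps)) independent hash functions suffice."""
    return math.ceil(math.log2(1 / eps))

# Each function is perfect w.r.t. V(M) with probability >= 1/2 (Prop. 2.4),
# so Pr[all t fail] <= (1/2)^t <= eps.
for eps in (0.5, 0.1, 0.01):
    t = num_hashes(eps)
    assert (1 / 2) ** t <= eps

assert num_hashes(0.01) == 7  # (1/2)^7 = 1/128 <= 0.01
```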
Runtime Analysis / Space
- $O(\log(1/\epsilon))$ functions
- $O(\log n)$ bits stored per function
- Still under $O(k^2)$ space
Time: Since $\lceil\log(1/\epsilon)\rceil$ is a constant (for fixed $\epsilon$) $\to$ no change in update time
Theorem 3.5
Proof of: $\Pr[\text{one succeeds}] \geq 1 - \epsilon$
Initialization
2: For each hash function $f$ in $\mathcal{H}_\epsilon$: $G_0^f = \emptyset$
Why: We are initializing the reduced compact subgraph for each hash function to the empty set, since no edges have been seen.
3: $G_{1,s} = \{ e_1, e_2, \ldots, e_s \}$ where either $s = 4k^2$ or $s < 4k^2$ and $e_s$ is the last edge in stream $\mathcal{S}$.
Why: Initializing the first batch of up to $4k^2$ edges.
- Reading first $s \leq 4k^2$ edges takes $O(1)$ update time
- Storing edges takes $O(k^2)$ space
Updating
4: While the edge stream $\mathcal{S}$ is not ended:
Interleave the executions of Steps 4.1 and 4.2 so that the time between reading two consecutive edges in $\mathcal{S}$ into $G_{s+1,s'}$ is bounded by $O(1)$.
Step 4.1
For each hash function $f$ in $\mathcal{H}_\epsilon$: $G_s^f = R_f\left(G_{\hat{s}}^f \cup G_{\hat{s}+1, s}\right)$
Refer to recursive design
Step 4.2
$G_{s+1,s'} = \{ e_{s+1}, e_{s+2}, \ldots, e_{s'} \}$
where either $s' = s + 4k^2$, or $s' < s + 4k^2$ and $e_{s'}$ is the last edge in stream $\mathcal{S}$.
Step 4.3
$s = s'$
Why: We compute the reduced compact subgraph recursively in $O(k^2)$ space. We read the next batch into a buffer, interleaved with Step 4.1. Then we advance the stream position from $s$ to $s'$.
Query
5: For each hash function in $\mathcal{H}_\epsilon$:
Step 5.1
$G_f = R_f( G_{\hat{s}}^f \cup G_{\hat{s}+1, s} )$
Step 5.2
Construct the maximum weighted matching.
Why: Final reduction step is performed creating $G_f$. Since the stream ended, check the buffer for any remaining edges to reduce into $G_f$. Then construct the maximum weighted matching for $G_f$.
Final Step
6: If a $k$-matching is constructed in Step 5: Return largest weight $k$-matching. Else return: “no $k$-matching exists in $G$”.
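The whole pipeline (Steps 1–6) can be sketched in simplified, runnable form. Here `simple_reduce` stands in for the full $R_f$ (it applies only the global filter), the hash functions are passed in as a list, and the brute-force query is acceptable because the reduced subgraph has only $O(k^2)$ edges; all names are illustrative:

```python
from itertools import combinations

def beta(e):
    u, v, wt = e
    return (wt, min(u, v), max(u, v))

def simple_reduce(H, f, k):
    """Stand-in for R_f: global filter only (the real R_f also applies the
    compact-subgraph and local-filter steps)."""
    return sorted(set(H), key=beta, reverse=True)[: 4 * k * k]

def max_weight_k_matching(edges, k):
    """Brute-force query over the tiny reduced subgraph (O(k^2) edges)."""
    best = None
    for combo in combinations(edges, k):
        verts = {x for u, v, _ in combo for x in (u, v)}
        if len(verts) == 2 * k and (
            best is None or sum(e[2] for e in combo) > sum(e[2] for e in best)
        ):
            best = combo
    return best

def k_matching_stream(stream, hashes, k, reduce_Rf):
    summaries = {i: [] for i in range(len(hashes))}  # Step 2: G_0^f = empty
    batch = []
    for e in stream:                                 # Steps 3-4: batched updates
        batch.append(e)
        if len(batch) == 4 * k * k:
            for i, f in enumerate(hashes):
                summaries[i] = reduce_Rf(summaries[i] + batch, f, k)
            batch = []
    best = None                                      # Steps 5-6: reduce + query
    for i, f in enumerate(hashes):
        M = max_weight_k_matching(reduce_Rf(summaries[i] + batch, f, k), k)
        if M and (best is None or sum(e[2] for e in M) > sum(e[2] for e in best)):
            best = M
    return best  # None means "no k-matching exists" (up to the failure prob.)

# Weighted square graph from earlier; identity hash is trivially perfect here.
square = [(1, 2, 5.0), (2, 4, 2.0), (3, 4, 4.0), (1, 3, 1.0)]
M = k_matching_stream(square, [lambda x: x], 2, simple_reduce)
assert M is not None and sum(e[2] for e in M) == 9.0  # edges {1,2} and {3,4}
```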