Algorithms

Reference Lemma 3.2

Lemma 3.2

*Notation reference: Hash Function Notation Edge Ordering and Subgraph Notation*

Lemma 3.2 (Section 3): Let $f$ be a hash function mapping the vertex set of $G$ to $[4k^2]^-$. There is an algorithm that, for any subgraph $H$ of $G$, constructs $\mathcal{R}_f(H)$ in time and space both bounded by $O(\lvert H\rvert + k^2)$.

Proof strategy: Four sequential passes, each linear in $\lvert H\rvert + k^2$.


Step 1 – Scan edges and partition by hash value

  • Iterate over every edge $[u, v]$ in $H$
  • Compute $f(u)$ and $f(v)$
  • If $f(u) = f(v)$, discard edge

Why discard: $f(u) = f(v)$ means both endpoints are in the same bucket $V_i$. By definition of $\mathcal{C}_f(H)$, only cross-bucket edges are kept. An edge entirely within $V_i$ cannot be in any nice $k$-matching, so it is safe to discard immediately.

  • Otherwise place edge into lists:
    • $E_{f(u)}$
    • $E_{f(v)}$

Notes:

  • Scanning edges: $O(\lvert H\rvert)$
  • Initializing the $4k^2$ bucket lists: $O(k^2)$

Step 2 – Build $\mathcal{C}_f(H)$ by Radix Sort

Goal: For each bucket $V_i$ and other bucket $V_j$, find the heaviest edge between them.

  • Each $E_i$ contains edges going to many different buckets mixed together
  • Sort by key $(f(u), f(v))$
  • Since one endpoint bucket is fixed per list, sorting is by the other bucket’s index -> groups edges by which bucket they connect to

Example: Sort these edges by bucket pair:

Edge A: (f(u), f(v)) = (3, 1)
Edge B: (f(u), f(v)) = (1, 2)
Edge C: (f(u), f(v)) = (3, 0)
Edge D: (f(u), f(v)) = (1, 2)
Edge E: (f(u), f(v)) = (2, 1)

Pass 1 – sort by f(v):

C: (3, 0)    <- f(v)=0
A: (3, 1)    <- f(v)=1
E: (2, 1)    <- f(v)=1
B: (1, 2)    <- f(v)=2
D: (1, 2)    <- f(v)=2

Pass 2 – stable sort by f(u):

B: (1, 2)    <- f(u)=1
D: (1, 2)    <- f(u)=1  (B before D preserved from Pass 1)
E: (2, 1)    <- f(u)=2
C: (3, 0)    <- f(u)=3
A: (3, 1)    <- f(u)=3  (C before A preserved from Pass 1)

Radix Sort Cost Note

Radix sort gives $O(\lvert H\rvert + k^2)$ (as a global sort).

Per-list view:

  • $\lvert E_i\rvert$ to scan edges
  • $4k^2$ to initialize the count array

Sum: $O!\left(\sum_i (\lvert E_i\rvert + 4k^2)\right) = O!\left(\sum_i \lvert E_i\rvert\right) + O(4k^2 \cdot 4k^2) = O(\lvert H\rvert) + O(k^4)$

Using one global radix sort instead gives: $O(\lvert H\rvert + k^2)$


After sorting

  • For each pair $(V_i, V_j)$:
    • Keep the heaviest edge, Toss the rest

Result: $\mathcal{C}_f(H)$

Cost: $O(\lvert H\rvert + k^2)$

Step 3 – Build $\mathcal{R}’_f(H)$ from $\mathcal{C}_f(H)$ by trimming $2k$ heaviest per bucket

3.1 – Size bound

$\lvert E_i\rvert \leq 4k^2 - 1$ since $E_i$ contains edges of $\mathcal{C}_f(H)$ incident to $V_i$.


3.2 – Select top edges per bucket

For each $E_i$:

  • Use linear-time selection (median-of-medians), cost $O(\lvert E_i\rvert)$
  • Find threshold $\beta$-value for top $2k$ edges
  • Edges below threshold are not among the top $2k$

Total:

$O!\left(\sum_i \lvert E_i\rvert\right)$ = $O(\lvert\mathcal{C}_f(H)\rvert)$ = $O(\lvert H\rvert)$

Why $O(\lvert H\rvert)$: Each edge appears in exactly 2 lists, so $\sum_i \lvert E_i\rvert = 2\lvert\mathcal{C}_f(H)\rvert$. Since $\mathcal{C}_f(H) \subseteq H$, we have $\lvert\mathcal{C}_f(H)\rvert \leq \lvert H\rvert$, giving $O(\lvert H\rvert)$ rather than the combinatorial worst-case $O(k^4)$. In the streaming context $\lvert H\rvert = O(k^2)$.


3.3 – Filter edges

  • Scan all edges in $\mathcal{C}_f(H)$, where $\lvert\mathcal{C}_f(H)\rvert \leq O(\lvert H\rvert)$
  • In the streaming context $\lvert H\rvert = 8k^2$ since $H \subseteq G$ from the stream $S$
  • Delete edges below threshold

Step 4 – Build $\mathcal{R}_f(H)$

Keep only the top $4k^2$ heaviest edges from $\mathcal{R}’_f(H)$.

4.1 – Selection

  • Apply linear-time selection on entire edge set $\mathcal{R}’_f(H)$
  • Find threshold $\beta$-value for top $4k^2$

Cost: $O(\lvert\mathcal{R}’_f(H)\rvert)$


4.2 – Delete below-threshold edges

  • Remove every edge whose $\beta$-value is less than threshold

Containment Chain

$\mathcal{R}_f(H) \subseteq \mathcal{R}’_f(H) \subseteq \mathcal{C}_f(H) \subseteq H$


Streaming Context Space

$O(k^2)$


Final Runtime

  • Linear-time selection steps: $O(\lvert H\rvert)$
  • Bucket initialization: $O(k^2)$

Total Runtime:

$O(\lvert H\rvert + k^2)$