Detectors & Algorithms

This page explains what each --report section does, which kinds of duplicates each detector is good at finding, and the main implementation ideas and limitations.

Terminology: “clone types” are commonly categorized as Type-1/2/3/4. These types describe levels of similarity rather than a strict binary (duplicate / not-duplicate) standard. Roughly: Type-1 is identical code up to whitespace and comments; Type-2 additionally allows renamed identifiers and changed literals; Type-3 allows small edits, insertions, or deletions; Type-4 covers semantically equivalent but syntactically different code.

0) Scanning & normalization (common prerequisites)

All detectors share the same upfront work:

  1. Collect candidate file paths (respects .gitignore by default and skips common dirs like node_modules/)
  2. Read file contents and skip:
    • files larger than maxFileSize
    • binary files (those containing a \0 byte)
    • files that fail to read at runtime (e.g. NotFound / PermissionDenied), counted in scanStats

See Scan Options for details.
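
A rough per-file filter for step 2 could look like the Rust sketch below. ScanStats, max_file_size, and read_candidate are hypothetical names for illustration; path collection and .gitignore handling are omitted.

```rust
use std::{fs, io::ErrorKind, path::Path};

/// Hypothetical stats bucket; the real tool reports skips via scanStats.
#[derive(Default)]
struct ScanStats { too_large: usize, binary: usize, io_errors: usize }

/// Read one candidate file, or return None if it should be skipped.
fn read_candidate(path: &Path, max_file_size: u64, stats: &mut ScanStats) -> Option<Vec<u8>> {
    match fs::metadata(path) {
        Ok(m) if m.len() > max_file_size => { stats.too_large += 1; return None; }
        Ok(_) => {}
        // Runtime anomalies are counted in the stats, not treated as fatal.
        Err(e) if matches!(e.kind(), ErrorKind::NotFound | ErrorKind::PermissionDenied) => {
            stats.io_errors += 1;
            return None;
        }
        Err(_) => { stats.io_errors += 1; return None; }
    }
    let bytes = match fs::read(path) {
        Ok(b) => b,
        Err(_) => { stats.io_errors += 1; return None; }
    };
    if bytes.contains(&0) {  // a NUL byte marks the file as binary
        stats.binary += 1;
        return None;
    }
    Some(bytes)
}
```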

1) fileDuplicates: duplicate files (whitespace-insensitive)

Goal

Find files that are identical after removing ASCII whitespace.

Core idea

  • remove ASCII whitespace (space/newline/tab, etc.)
  • hash normalized bytes and group
  • byte-compare a sample of files within each group to reduce hash-collision risk (sketched below)
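
A minimal sketch of the hash-and-group step (normalize and group_files are hypothetical names); the within-group byte comparison that guards against collisions is left out:

```rust
use std::collections::{hash_map::DefaultHasher, HashMap};
use std::hash::{Hash, Hasher};

/// Strip all ASCII whitespace so formatting-only differences disappear.
fn normalize(bytes: &[u8]) -> Vec<u8> {
    bytes.iter().copied().filter(|b| !b.is_ascii_whitespace()).collect()
}

/// Bucket file paths by the hash of their normalized contents;
/// any bucket with more than one path is a duplicate-file group.
fn group_files<'a>(files: &[(&'a str, &[u8])]) -> HashMap<u64, Vec<&'a str>> {
    let mut groups: HashMap<u64, Vec<&'a str>> = HashMap::new();
    for &(path, bytes) in files {
        let mut h = DefaultHasher::new();
        normalize(bytes).hash(&mut h);
        groups.entry(h.finish()).or_default().push(path);
    }
    groups
}
```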

Good for / not good for

  • good: copy-pasted files with only formatting/indentation changes (a subset of Type-1)
  • not good: identifier renames or small insertions/deletions (closer to Type-2/3)

2) codeSpanDuplicates: suspected duplicate code spans (character-level)

Corresponds to the CLI flag --code-spans (also included in report mode).

Goal

Quickly find suspected duplicate code snippets and report line ranges for manual review.

Normalization

Character-level normalization:

  • remove newlines
  • drop all symbols + whitespace
  • keep only [A-Za-z0-9_]

This is robust to formatting/punctuation differences, but still sensitive to identifier renames.
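
Assuming the filter really is just the identifier character class, the normalization amounts to:

```rust
/// Collapse source into identifier characters only; newlines, other
/// whitespace, and all punctuation vanish.
fn normalize_chars(src: &str) -> String {
    src.chars()
        .filter(|c| c.is_ascii_alphanumeric() || *c == '_')
        .collect()
}
```

A real implementation additionally has to map each kept character back to its source line, since results are reported as line ranges.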

Matching idea (conceptual)

  • fingerprint + window selection (winnowing) over the normalized character stream
  • extend candidate matches (maximal match)
  • de-duplicate, group, and output occurrences with line ranges
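
A compact sketch of the fingerprint/window step, using a generic FNV-1a hash and the classic winnowing rule (pick the rightmost minimum per window); k and w stand in for the tool's actual parameters:

```rust
fn fnv1a(bytes: &[u8]) -> u64 {
    let mut h: u64 = 0xcbf2_9ce4_8422_2325;
    for &b in bytes {
        h = (h ^ b as u64).wrapping_mul(0x100_0000_01b3);
    }
    h
}

/// Winnowing: hash every k-gram, then in each window of w consecutive
/// hashes keep the rightmost minimum. Matching fingerprints across files
/// become candidate spans, which are then extended to maximal matches.
fn winnow(s: &[u8], k: usize, w: usize) -> Vec<(u64, usize)> {
    let grams: Vec<u64> = s.windows(k).map(fnv1a).collect();
    let mut picks = Vec::new();
    let mut last = usize::MAX;
    for start in 0..grams.len().saturating_sub(w - 1) {
        let mut best = start;
        for i in start..start + w {
            if grams[i] <= grams[best] { best = i; } // <= keeps the rightmost tie
        }
        if best != last {
            picks.push((grams[best], best));
            last = best;
        }
    }
    picks
}
```

Winnowing guarantees that any duplicated run of at least w + k - 1 normalized characters shares at least one selected fingerprint.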

Good for / limitations

  • good: quickly locate large copy/paste spans across files/repos
  • limitations:
    • not an AST/token clone detector; false positives/negatives are possible
    • language-agnostic normalization loses semantic information

3) lineSpanDuplicates: line-normalized duplicate spans

Goal

Detect multi-line duplicated spans while being insensitive to indentation/punctuation.

Normalization (per line)

  • per line, keep only [A-Za-z0-9_]
  • hash each line’s kept characters into a “line token”

Then duplicated windows are detected over the line-token sequence and filtered by requiring the sum of normalized character lengths to be >= minMatchLen, as in the sketch below.
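
A sketch of the per-line tokens and the length filter (LineTok and the function names are illustrative):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// One normalized line: its hash (the “line token”) plus the normalized
/// length that feeds the minMatchLen filter below.
struct LineTok { hash: u64, norm_len: usize }

fn line_tokens(src: &str) -> Vec<LineTok> {
    src.lines().map(|line| {
        let norm: String = line.chars()
            .filter(|c| c.is_ascii_alphanumeric() || *c == '_')
            .collect();
        let mut h = DefaultHasher::new();
        norm.hash(&mut h);
        LineTok { hash: h.finish(), norm_len: norm.len() }
    }).collect()
}

/// A duplicated window of line tokens only counts if it carries enough
/// normalized characters in total.
fn passes_filter(window: &[LineTok], min_match_len: usize) -> bool {
    window.iter().map(|t| t.norm_len).sum::<usize>() >= min_match_len
}
```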

Good for / limitations

  • good: repeated blocks with per-line formatting/punctuation differences
  • limitations: line-based, so reordering/insertion/deletion across lines impacts results more

4) tokenSpanDuplicates: token-level duplicate spans

Goal

Closer to CPD-style detection (in the spirit of PMD’s Copy/Paste Detector): whitespace-insensitive and more robust to some Type-2 changes, like identifier renames.

Tokenization (simplified)

A lightweight tokenizer (heuristic, not a full parser):

  • keywords (if/for/return/let/class/...) → fixed tokens
  • identifiers → IDENT
  • numbers → NUM
  • strings → STR (the start line is recorded for multi-line strings)
  • punctuation kept as-is ({}, (), ;, ...)

Then it applies a similar fingerprint/window strategy to find duplicated token spans.
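
A heuristic tokenizer in this spirit might look like the sketch below. Tok, KEYWORDS, and the handling details are illustrative; string escapes and the start-line bookkeeping for multi-line strings are omitted.

```rust
#[derive(Debug, PartialEq)]
enum Tok { Keyword(&'static str), Ident, Num, Str, Punct(char) }

const KEYWORDS: &[&str] = &["if", "for", "return", "let", "class", "while", "fn"];

fn tokenize(src: &str) -> Vec<Tok> {
    let mut out = Vec::new();
    let mut it = src.chars().peekable();
    while let Some(&c) = it.peek() {
        if c.is_whitespace() {
            it.next();
        } else if c.is_ascii_alphabetic() || c == '_' {
            let mut word = String::new();
            while let Some(&c) = it.peek() {
                if !(c.is_ascii_alphanumeric() || c == '_') { break; }
                word.push(c);
                it.next();
            }
            match KEYWORDS.iter().find(|&&k| k == word) {
                Some(&k) => out.push(Tok::Keyword(k)), // keywords stay distinct
                None => out.push(Tok::Ident),          // all names collapse to IDENT
            }
        } else if c.is_ascii_digit() {
            while it.peek().map_or(false, |c| c.is_ascii_alphanumeric() || *c == '.') {
                it.next();
            }
            out.push(Tok::Num);                        // all numbers collapse to NUM
        } else if c == '"' {
            it.next();                                 // opening quote
            while it.peek().map_or(false, |&c| c != '"') { it.next(); }
            it.next();                                 // closing quote (if any)
            out.push(Tok::Str);                        // string contents collapse to STR
        } else {
            out.push(Tok::Punct(c));                   // punctuation kept as-is
            it.next();
        }
    }
    out
}
```

Because all identifiers collapse to IDENT, two spans that differ only in variable names produce identical token streams, which is what gives the detector its Type-2 robustness.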

Good for / limitations

  • good: structurally similar logic with renamed variables (Type-2-ish)
  • limitations:
    • heuristic tokenizer: multi-language friendly but not “syntax-correct”
    • the brace-based detectors below are weaker for languages without {} blocks (e.g. Python)

5) blockDuplicates: {} block-level duplicates

Goal

Detect fully duplicated brace blocks (at token level).

Core idea

  • tokenize
  • build block nodes using { / } (token/line ranges, nesting, children)
  • hash each block’s token slice and group equal hashes (sketched below)
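
A sketch of the block construction and hashing, with tokens represented as plain strings for brevity (Block and the function names are hypothetical):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// A {}-delimited block as a token range, plus its nesting depth.
struct Block { start: usize, end: usize, depth: usize }

/// Pair up braces over the token stream to recover block extents.
fn build_blocks(tokens: &[String]) -> Vec<Block> {
    let mut stack = Vec::new();
    let mut blocks = Vec::new();
    for (i, t) in tokens.iter().enumerate() {
        match t.as_str() {
            "{" => stack.push(i),
            "}" => if let Some(start) = stack.pop() {
                blocks.push(Block { start, end: i, depth: stack.len() });
            },
            _ => {}
        }
    }
    blocks
}

/// Hash a block’s token slice; equal hashes form duplicate groups.
fn block_hash(tokens: &[String], b: &Block) -> u64 {
    let mut h = DefaultHasher::new();
    tokens[b.start..=b.end].hash(&mut h);
    h.finish()
}
```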

6) astSubtreeDuplicates: {}-structure “AST subtree” approximate duplicates

Goal

More robust at the block-structure level: if a block’s structure + contents match, consider it a duplicate.

Core idea (simplified)

Build a representation for each block:

  • child blocks are replaced by a marker + child hash
  • tokens outside child blocks are kept

This forms a tree based on {} nesting and fingerprints subtrees bottom-up.
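
A minimal sketch of the bottom-up fingerprinting, with a hypothetical Node/Part shape (children are referenced by index and listed before their parents in the traversal order):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hypothetical brace-structure node: tokens interleaved with child blocks.
enum Part {
    Token(String),
    Child(usize), // index into the node arena below
}
struct Node { parts: Vec<Part> }

/// Fingerprint subtrees bottom-up: a child block contributes only a marker
/// plus its own hash, so identical subtrees hash equal wherever they occur.
fn subtree_hashes(nodes: &[Node], order: &[usize]) -> Vec<u64> {
    let mut hashes = vec![0u64; nodes.len()];
    for &id in order { // `order` lists children before their parents
        let mut h = DefaultHasher::new();
        for part in &nodes[id].parts {
            match part {
                Part::Token(t) => t.hash(&mut h),
                Part::Child(c) => ("<block>", hashes[*c]).hash(&mut h),
            }
        }
        hashes[id] = h.finish();
    }
    hashes
}
```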

Limitation

This is not a real language AST; it’s an approximation based on brace structure.

7) similarBlocksMinhash / similarBlocksSimhash: similar block pairs

Goal

Find highly similar (but not identical) block pairs, a heuristic aimed at Type-3 clones.

Input

  • derived from {} blocks
  • only shallow blocks are used (nesting depth is capped to control scale)
  • shingles over each block’s token stream (default 5-grams; see the sketch below)
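
Shingling turns each block’s token stream into overlapping n-grams; a sketch over pre-hashed tokens:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Overlapping k-grams (default k = 5) over a block’s token hashes;
/// each shingle is itself hashed down to a single u64 feature.
fn shingles(token_hashes: &[u64], k: usize) -> Vec<u64> {
    token_hashes.windows(k).map(|w| {
        let mut h = DefaultHasher::new();
        w.hash(&mut h);
        h.finish()
    }).collect()
}
```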

MinHash (similarBlocksMinhash)

  • build MinHash signatures
  • generate candidate pairs via LSH (banding/bucketing)
  • filter by score >= similarityThreshold (and crossRepoOnly when enabled)
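
A sketch of the signature and banding steps (a seeded DefaultHasher stands in for a proper hash family; band and row counts are illustrative):

```rust
use std::collections::{hash_map::DefaultHasher, HashMap};
use std::hash::{Hash, Hasher};

fn seeded(x: u64, seed: u64) -> u64 {
    let mut h = DefaultHasher::new();
    (x, seed).hash(&mut h);
    h.finish()
}

/// MinHash signature: for each hash function, the minimum over all shingles.
fn minhash(shingles: &[u64], num_hashes: u64) -> Vec<u64> {
    (0..num_hashes)
        .map(|seed| shingles.iter().map(|&s| seeded(s, seed)).min().unwrap_or(u64::MAX))
        .collect()
}

/// LSH banding: split each signature into bands; blocks that collide in any
/// band become candidate pairs, later verified against similarityThreshold.
fn lsh_candidates(sigs: &[Vec<u64>], bands: usize) -> Vec<(usize, usize)> {
    let Some(first) = sigs.first() else { return Vec::new() };
    let rows = first.len() / bands;
    let mut pairs = Vec::new();
    for b in 0..bands {
        let mut buckets: HashMap<&[u64], Vec<usize>> = HashMap::new();
        for (id, sig) in sigs.iter().enumerate() {
            buckets.entry(&sig[b * rows..(b + 1) * rows]).or_default().push(id);
        }
        for ids in buckets.values() {
            for i in 0..ids.len() {
                for j in i + 1..ids.len() {
                    pairs.push((ids[i], ids[j]));
                }
            }
        }
    }
    pairs.sort_unstable();
    pairs.dedup(); // the same pair may collide in several bands
    pairs
}
```

More bands with fewer rows each catch lower-similarity pairs at the cost of more false candidates, which is why candidates are re-scored before the similarityThreshold filter.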

SimHash (similarBlocksSimhash)

  • build 64-bit SimHash per block
  • generate candidates via banding/bucketing
  • filter by hamming_distance <= simhashMaxDistance (and crossRepoOnly when enabled)
  • output includes distance
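
A sketch of the 64-bit SimHash construction and the distance check:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// 64-bit SimHash: every feature (e.g. shingle) votes on all 64 bits;
/// the sign of each accumulated vote decides the final bit.
fn simhash(features: &[u64]) -> u64 {
    let mut votes = [0i64; 64];
    for &f in features {
        let mut s = DefaultHasher::new();
        f.hash(&mut s);
        let fh = s.finish();
        for (bit, v) in votes.iter_mut().enumerate() {
            *v += if (fh >> bit) & 1 == 1 { 1 } else { -1 };
        }
    }
    votes.iter().enumerate()
        .fold(0u64, |acc, (bit, &v)| if v > 0 { acc | (1 << bit) } else { acc })
}

/// Similar blocks yield nearby SimHashes; the report keeps pairs with
/// hamming(a, b) <= simhashMaxDistance.
fn hamming(a: u64, b: u64) -> u32 {
    (a ^ b).count_ones()
}
```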

Good for / limitations

  • good: hints for small edits/insertions/deletions in otherwise similar blocks
  • limitations: approximate methods can produce false positives; verify via preview and line ranges

8) How to choose detectors?

A practical order:

  1. start with fileDuplicates (cheap and high-signal)
  2. then codeSpanDuplicates (fast localization for copy/paste spans)
  3. for Type-2/3-ish signals: use --report (token/block/similar pairs)

In CI:

  • use --cross-repo-only to focus on cross-root reuse/copy
  • use --max-report-items to limit output size
  • use --strict + --stats to enforce scan completeness
