# Performance & Scaling
The goal of dup-code-check is to scan repositories quickly in local development and CI, and to report duplicates and similarity results with actionable file/line locations.
## Scanning: I/O and file collection
### File collection strategy
By default it respects `.gitignore` and, when possible, uses Git to speed up file enumeration. Git-based collection is used when:

- `respectGitignore=true`
- `followSymlinks=false`
- the root is a Git repo and `git` is available

If Git-based collection is not available, it falls back to a walker-based traversal.
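The Git-first strategy with a walker fallback can be sketched as follows. `collect_files` is a hypothetical helper written for illustration, not part of dup-code-check's API:

```python
import os
import subprocess

def collect_files(root: str) -> list[str]:
    """Enumerate files under `root`, preferring `git ls-files`."""
    try:
        # Git-based collection: fast, and respects .gitignore for free.
        out = subprocess.run(
            ["git", "-C", root, "ls-files", "-z"],
            capture_output=True, check=True,
        )
        return [p for p in out.stdout.decode().split("\0") if p]
    except (OSError, subprocess.CalledProcessError):
        # Fallback: plain walker traversal (no .gitignore awareness here).
        files = []
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in filenames:
                files.append(os.path.relpath(os.path.join(dirpath, name), root))
        return files
```

Note that the fallback path does not get `.gitignore` filtering for free, which is one reason the walker route benefits more from explicit ignore options.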
### Controlling I/O cost
Use these options to control scan cost:
- `maxFileSize`: skip huge files (default 10 MiB)
- `maxFiles`: file-count budget
- `maxTotalBytes`: total-bytes budget
- `ignoreDirs`: skip dependency/build directories (defaults include common ones)
## Detection: rough complexity intuition
Detectors vary significantly in cost:
- `fileDuplicates`: low cost (linear scan + grouping)
- `codeSpanDuplicates` / `tokenSpanDuplicates`: medium cost (fingerprints/windows/candidate matching)
- `similarBlocks*`: higher cost (candidate generation + similarity computation), but depth is limited and thresholds filter aggressively
If you want the highest signal first, start with file-level duplicates, and enable `--report` only when needed.
## Large repo tips (rules of thumb)
- keep scan roots tight (only the directories you care about)
- add explicit `--ignore-dir` for dependencies/build outputs
- in CI, set budgets (`--max-total-bytes` or `--max-files`) and use `--strict` to surface incomplete scans
- if it’s too slow:
  - first try disabling `--report`
  - then raise thresholds (`--min-token-len` / `--min-match-len`)