Output & Report

dup-code-check supports both text output and JSON output. Text is for humans; JSON is for post-processing and CI integration.

1) Duplicate files (default mode)

Text

You’ll see:

duplicate groups: <N>
for each group:
- hash=<...> normalized_len=<...> files=<...>
- - [repoLabel] path

JSON (`--json`)

JSON output is an array; each element has this shape:

interface DuplicateGroup {
  hash: string;          // 16 hex chars (FNV-1a 64)
  normalizedLen: number; // byte length after ASCII whitespace removal
  files: { repoId: number; repoLabel: string; path: string }[];
}
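
As a sketch of post-processing, the snippet below parses `--json` output and ranks groups by how many normalized bytes are redundant (every copy beyond the first). The helper names `redundantBytes` and `rankGroups` and the sample data are made up for illustration; only the `DuplicateGroup` shape comes from this page.

```typescript
// Shape copied from the interface above.
interface DuplicateGroup {
  hash: string;
  normalizedLen: number;
  files: { repoId: number; repoLabel: string; path: string }[];
}

// Hypothetical helper: normalized bytes held by every copy beyond the first.
function redundantBytes(group: DuplicateGroup): number {
  return group.normalizedLen * Math.max(group.files.length - 1, 0);
}

// Hypothetical helper: parse the JSON text and put the most redundant groups first.
function rankGroups(jsonText: string): DuplicateGroup[] {
  const groups = JSON.parse(jsonText) as DuplicateGroup[];
  return groups.slice().sort((a, b) => redundantBytes(b) - redundantBytes(a));
}

// Made-up sample in the documented shape (not real tool output).
const sampleFileGroups = JSON.stringify([
  {
    hash: "00000000000000aa",
    normalizedLen: 100,
    files: [
      { repoId: 0, repoLabel: "app", path: "src/a.ts" },
      { repoId: 0, repoLabel: "app", path: "src/b.ts" },
    ],
  },
  {
    hash: "00000000000000bb",
    normalizedLen: 60,
    files: [
      { repoId: 0, repoLabel: "app", path: "x.ts" },
      { repoId: 1, repoLabel: "lib", path: "x.ts" },
      { repoId: 1, repoLabel: "lib", path: "y.ts" },
    ],
  },
]);

for (const g of rankGroups(sampleFileGroups)) {
  console.log(`${g.hash} redundant=${redundantBytes(g)} copies=${g.files.length}`);
}
```

Ranking by redundant bytes rather than raw `normalizedLen` is one possible choice; it favors small files that are copied many times over large files that are copied once.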

2) Suspected duplicate code spans (`--code-spans`)

Text

duplicate code span groups: <N>
per group:
- hash=<...> normalized_len=<...> occurrences=<...>
- preview=<...>
- - [repoLabel] path:startLine-endLine

JSON (`--json`)

JSON output is an array; each element has this shape:

interface DuplicateSpanGroup {
  hash: string;
  normalizedLen: number;
  preview: string;
  occurrences: {
    repoId: number;
    repoLabel: string;
    path: string;
    startLine: number;
    endLine: number;
  }[];
}
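
A small sketch of consuming span groups: it renders each occurrence in the same `[repoLabel] path:startLine-endLine` form as the text output and keeps only groups that touch more than one repo. `formatOccurrence`, `crossRepoGroups`, and the sample data are hypothetical names and values, not part of the tool.

```typescript
// Shape copied from the interface above.
interface DuplicateSpanGroup {
  hash: string;
  normalizedLen: number;
  preview: string;
  occurrences: {
    repoId: number;
    repoLabel: string;
    path: string;
    startLine: number;
    endLine: number;
  }[];
}

type Occurrence = DuplicateSpanGroup["occurrences"][number];

// Mirrors the `[repoLabel] path:startLine-endLine` form shown for text output.
function formatOccurrence(o: Occurrence): string {
  return `[${o.repoLabel}] ${o.path}:${o.startLine}-${o.endLine}`;
}

// Hypothetical filter: keep groups whose occurrences come from at least two repos.
function crossRepoGroups(groups: DuplicateSpanGroup[]): DuplicateSpanGroup[] {
  return groups.filter((g) =>
    g.occurrences.some((o) => o.repoId !== g.occurrences[0].repoId)
  );
}

// Made-up sample in the documented shape (not real tool output).
const sampleSpanGroups: DuplicateSpanGroup[] = [
  {
    hash: "0000000000000001",
    normalizedLen: 80,
    preview: "function add(a,b){return a+b}",
    occurrences: [
      { repoId: 0, repoLabel: "app", path: "src/math.ts", startLine: 10, endLine: 14 },
      { repoId: 1, repoLabel: "lib", path: "util/math.ts", startLine: 3, endLine: 7 },
    ],
  },
  {
    hash: "0000000000000002",
    normalizedLen: 50,
    preview: "if(!x){return}",
    occurrences: [
      { repoId: 0, repoLabel: "app", path: "src/a.ts", startLine: 1, endLine: 3 },
      { repoId: 0, repoLabel: "app", path: "src/b.ts", startLine: 8, endLine: 10 },
    ],
  },
];

for (const g of crossRepoGroups(sampleSpanGroups)) {
  console.log(g.hash, g.occurrences.map(formatOccurrence).join(", "));
}
```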

3) Scan stats (`--stats`)

JSON mode

With --json --stats:

default / --code-spans: { groups, scanStats }
--report: { report, scanStats }

scanStats fields include:

candidateFiles, scannedFiles, scannedBytes
gitFastPathFallbacks: non-zero when the scan attempted the Git fast path and had to fall back to the filesystem walker
skippedNotFound, skippedPermissionDenied, skippedTooLarge, skippedBinary, skippedOutsideRoot, skippedRelativizeFailed, skippedWalkErrors
skippedOutsideRoot: paths outside roots or unsafe paths (e.g. symlink targets outside roots, or unsafe paths emitted by the Git fast path)
skippedBudgetMaxFiles: non-zero when the scan stopped early due to the maxFiles budget
skippedBudgetMaxTotalBytes: non-zero when files were skipped because reading them would exceed the maxTotalBytes budget
skippedBudgetMaxNormalizedChars: non-zero when the scan stopped early due to the maxNormalizedChars budget
skippedBudgetMaxTokens: non-zero when the scan stopped early due to the maxTokens budget (report mode)
skippedBucketTruncated: detector guardrail; fingerprint buckets were truncated to cap worst-case cost (results may miss some matches)
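
Because `--stats` changes the top-level JSON shape, a consumer may want to accept both forms. The sketch below handles the default / `--code-spans` case (a bare array, or `{ groups, scanStats }`) and lists the non-zero `skipped*` counters. It assumes every scanStats field is a number; the names `unwrapOutput` and `nonZeroSkips` and the sample values are hypothetical.

```typescript
// Assumption: all scanStats fields are numeric counters.
type ScanStats = { [counter: string]: number };

// Only the fields needed for this sketch; real groups carry more.
interface Group {
  hash: string;
  normalizedLen: number;
}

type Output = Group[] | { groups: Group[]; scanStats: ScanStats };

// Normalize both shapes to one object; scanStats is absent without --stats.
function unwrapOutput(output: Output): { groups: Group[]; scanStats?: ScanStats } {
  return Array.isArray(output) ? { groups: output } : output;
}

// Names of skipped* counters that are greater than zero, sorted alphabetically.
function nonZeroSkips(stats: ScanStats): string[] {
  return Object.keys(stats)
    .filter((name) => name.indexOf("skipped") === 0 && stats[name] > 0)
    .sort();
}

// Made-up sample (not real tool output).
const sampleWithStats: Output = {
  groups: [{ hash: "00000000000000aa", normalizedLen: 100 }],
  scanStats: {
    candidateFiles: 10,
    scannedFiles: 8,
    scannedBytes: 2048,
    gitFastPathFallbacks: 0,
    skippedBinary: 2,
    skippedNotFound: 0,
  },
};

const unwrapped = unwrapOutput(sampleWithStats);
console.log("groups:", unwrapped.groups.length);
if (unwrapped.scanStats) {
  console.log("non-zero skips:", nonZeroSkips(unwrapped.scanStats).join(", "));
}
```

The same approach would apply to `--report`, where the wrapper is `{ report, scanStats }` instead.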

Text mode

In text mode, --stats prints stats to stderr while keeping results on stdout:

dup-code-check --stats . >result.txt 2>stats.txt

4) Strict mode (`--strict`)

--strict is intended for CI and answers “was the scan complete?”:

--strict exits with code 1 on PermissionDenied, outside_root, relativize_failed, traversal errors, bucket truncation, or budget limits (maxFiles / maxTotalBytes / maxNormalizedChars / maxTokens).
It does not fail on NotFound, TooLarge, or Binary.

When --json is enabled and --stats is not, --strict still prints stats to stderr on failure (so you can see why).
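
If you post-process scanStats yourself, a check along the same lines can be sketched as follows. The mapping from the failure conditions above to scanStats counter names is inferred from the field names on this page and is an assumption, as are the names `STRICT_COUNTERS` and `strictFailures`; this is not the tool's own implementation.

```typescript
// Assumption: scanStats counters are numbers; missing ones are treated as 0.
type ScanStats = { [counter: string]: number | undefined };

// Assumed mapping of the --strict failure conditions to scanStats counters.
const STRICT_COUNTERS: string[] = [
  "skippedPermissionDenied",
  "skippedOutsideRoot",
  "skippedRelativizeFailed",
  "skippedWalkErrors",
  "skippedBucketTruncated",
  "skippedBudgetMaxFiles",
  "skippedBudgetMaxTotalBytes",
  "skippedBudgetMaxNormalizedChars",
  "skippedBudgetMaxTokens",
];

// Returns the counters that would make the scan count as incomplete.
function strictFailures(stats: ScanStats): string[] {
  return STRICT_COUNTERS.filter((name) => (stats[name] ?? 0) > 0);
}

// Made-up samples (not real tool output).
const benignStats: ScanStats = { skippedNotFound: 3, skippedTooLarge: 1, skippedBinary: 4 };
const incompleteStats: ScanStats = { skippedBinary: 4, skippedWalkErrors: 2, skippedBudgetMaxFiles: 1 };

console.log("benign:", strictFailures(benignStats));
console.log("incomplete:", strictFailures(incompleteStats));
```

A CI wrapper could treat a non-empty result as a failed run, which matches the intent of `--strict` without relying on its exit code.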

5) Report mode (`--report`)

Text output contains multiple sections (in this order):

file duplicates
code span duplicates
line span duplicates
token span duplicates
block duplicates
AST subtree duplicates
similar blocks (minhash)
similar blocks (simhash)

JSON output:

interface DuplicationReport {
  fileDuplicates: DuplicateGroup[];
  codeSpanDuplicates: DuplicateSpanGroup[];
  lineSpanDuplicates: DuplicateSpanGroup[];
  tokenSpanDuplicates: DuplicateSpanGroup[];
  blockDuplicates: DuplicateSpanGroup[];
  astSubtreeDuplicates: DuplicateSpanGroup[];
  similarBlocksMinhash: SimilarityPair[];
  similarBlocksSimhash: SimilarityPair[];
}
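
A quick way to get an overview of a report is to count the entries per section. The sketch below does that in the same section order as the interface above. Section contents are typed as `unknown[]` because `SimilarityPair` is not defined on this page; `SECTION_ORDER`, `sectionCounts`, and the sample are hypothetical.

```typescript
// Loose view of the report: each section is an array whose elements are not inspected.
type DuplicationReportLike = { [section: string]: unknown[] | undefined };

// Field names taken from the DuplicationReport interface above.
const SECTION_ORDER: string[] = [
  "fileDuplicates",
  "codeSpanDuplicates",
  "lineSpanDuplicates",
  "tokenSpanDuplicates",
  "blockDuplicates",
  "astSubtreeDuplicates",
  "similarBlocksMinhash",
  "similarBlocksSimhash",
];

// Returns [sectionName, entryCount] pairs; a missing section counts as 0.
function sectionCounts(report: DuplicationReportLike): [string, number][] {
  return SECTION_ORDER.map((name): [string, number] => {
    const section = report[name];
    return [name, section ? section.length : 0];
  });
}

// Made-up sample (not real tool output); element contents are placeholders.
const sampleReport: DuplicationReportLike = {
  fileDuplicates: [{}, {}],
  codeSpanDuplicates: [{}],
  lineSpanDuplicates: [],
  tokenSpanDuplicates: [],
  blockDuplicates: [],
  astSubtreeDuplicates: [],
  similarBlocksMinhash: [{}, {}, {}],
  similarBlocksSimhash: [],
};

for (const [name, count] of sectionCounts(sampleReport)) {
  console.log(`${name}: ${count}`);
}
```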

For what each section means and how it is computed, see Detectors & Algorithms.
