Source Loader
Overview
The SourceLoader class traverses source files in a repository, normalizes them
(removes comments, collapses whitespace), and uses Bloom filters to efficiently
detect matches against known patches. It is a critical component of the
PatchTrack pipeline for identifying where patches may have been applied in
source code.
This page explains the workflow, Bloom filter approach, key methods, data structures, and matching logic. For the complete API reference and docstrings, see the mkdocstrings section below.
Purpose
- Load and traverse source files from the filesystem (single files or trees)
- Normalize source code by removing language-specific comments
- Build Bloom filters for fast probabilistic n-gram matching
- Detect patches in source files using hash-based queries
- Record matches and provide detailed results
Key Concepts
Bloom Filters
A Bloom filter is a space-efficient probabilistic data structure used to test whether an element is a member of a set. In SourceLoader:
- A bit vector of size `BLOOMFILTER_SIZE` (2,097,152 bits ≈ 256 KB) stores presence information
- Each n-gram is hashed three ways (FNV-1a, DJB2, SDBM) to set 3 bits
- A query checks if all 3 hash bits are set; if not, the element is definitely not in the set
- Trade-off: False positives possible, false negatives impossible
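The three-hash scheme above can be sketched as follows. The FNV-1a, DJB2, and SDBM hash functions are the standard algorithms those names refer to; the `BloomFilter` class, its `bytearray` backing store, and the sample strings are illustrative assumptions, not the project's actual implementation (which uses `bitarray`).

```python
# Minimal sketch of a three-hash Bloom filter, assuming the documented size.
BLOOMFILTER_SIZE = 2_097_152  # bits

def fnv1a(s: str) -> int:
    h = 0x811C9DC5
    for ch in s.encode():
        h = ((h ^ ch) * 0x01000193) & 0xFFFFFFFF
    return h

def djb2(s: str) -> int:
    h = 5381
    for ch in s.encode():
        h = ((h * 33) + ch) & 0xFFFFFFFF
    return h

def sdbm(s: str) -> int:
    h = 0
    for ch in s.encode():
        h = (ch + (h << 6) + (h << 16) - h) & 0xFFFFFFFF
    return h

class BloomFilter:
    def __init__(self, size: int = BLOOMFILTER_SIZE):
        self.size = size
        self.bits = bytearray(size // 8)  # plain bytes here; the real code uses bitarray

    def _positions(self, item: str):
        return [f(item) % self.size for f in (fnv1a, djb2, sdbm)]

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def query(self, item: str) -> bool:
        # True may be a false positive; False is always definitive.
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))

bf = BloomFilter()
bf.add("staticintfoo")
print(bf.query("staticintfoo"))   # True
print(bf.query("neverinserted"))  # almost certainly False
```

Because a query only reports "possibly present", every positive answer still has to be confirmed downstream; this is why the pipeline treats Bloom hits as candidates rather than final matches.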
Normalization
Source code is normalized by:
- Removing language-specific comments (using `helpers.remove_comments()`)
- Collapsing whitespace (all whitespace is replaced with the empty string)
- Converting to lowercase
- Splitting into tokens
This aggressive normalization ensures consistent matching regardless of formatting or comments.
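A rough sketch of these steps, assuming C-style comments for illustration (the real code delegates comment removal to the language-aware `helpers.remove_comments()`):

```python
import re

def normalize(source: str) -> str:
    """Hypothetical normalization: strip comments, collapse whitespace, lowercase."""
    # Strip C-style comments (illustrative only -- not the actual helper)
    source = re.sub(r"/\*.*?\*/", "", source, flags=re.DOTALL)
    source = re.sub(r"//[^\n]*", "", source)
    # Collapse all whitespace to nothing, then lowercase
    return re.sub(r"\s+", "", source).lower()

print(normalize("int  X = 1; /* init */\n// done"))  # intx=1;
```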
Matching Strategy
- Source code is split into sliding n-grams
- Each source n-gram is hashed and stored in the Bloom filter
- For each patch, all patch hashes are queried against the Bloom filter
- Matches are recorded with patch ID and position information
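The sliding-window idea can be shown with a plain `set` standing in for the Bloom filter; the 4-token window size and the sample token lists are assumptions for illustration.

```python
NGRAM = 4  # assumed window size; the actual value is defined by the implementation

def ngrams(tokens, n=NGRAM):
    """Concatenate every sliding window of n tokens into one n-gram string."""
    return ["".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

source_tokens = ["if", "(", "x", ")", "return", "0", ";"]
patch_tokens = ["(", "x", ")", "return"]

source_set = set(ngrams(source_tokens))  # stands in for the Bloom filter
patch_gram = "".join(patch_tokens)       # one patch n-gram
print(patch_gram in source_set)          # True -> candidate match
```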
Batch Processing
To avoid memory overflow with very large source files:
- The Bloom filter is rebuilt periodically (when `num_ngram_processed` exceeds `BLOOMFILTER_SIZE / MIN_MN_RATIO`)
- Old patch hashes are checked against the current filter before each reset
- This creates "batches" of hashes within the file
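The reset logic can be sketched as a counter-and-threshold loop, using the documented default constants; the loop body and the 200,000-n-gram file size are hypothetical.

```python
BLOOMFILTER_SIZE = 2_097_152
MIN_MN_RATIO = 32
BATCH_LIMIT = BLOOMFILTER_SIZE // MIN_MN_RATIO  # 65,536 n-grams per batch

num_ngram_processed = 0
batches = 0
total_ngrams = 200_000  # hypothetical large file

for _ in range(total_ngrams):
    num_ngram_processed += 1
    if num_ngram_processed > BATCH_LIMIT:
        # In the real code: check pending patch hashes, then reset the filter
        batches += 1
        num_ngram_processed = 0

print(BATCH_LIMIT, batches)  # 65536 3
```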
Workflow
```text
SourceLoader.traverse(source_path, patch, file_ext)
│
├─── Store patch_list and count from patch object
│
├─── For each source file:
│
├─── _process(source_path, file_ext):
│    ├─ Read file
│    ├─ Call _normalize()
│    └─ Call _query_bloomfilter()
│
├─── _query_bloomfilter():
│    ├─ Split source into tokens
│    │
│    ├─ For each patch:
│    │  │
│    │  ├─ Build Bloom filter from source n-grams
│    │  ├─ For each batch:
│    │  │  ├─ _check_bloom_match() [against old hashes]
│    │  │  └─ Reset Bloom filter
│    │  │
│    │  └─ _check_patch_hashes() [final check]
│    │
│    └─ Record matches in _match_dict
│
└─── Return total match count
```
Key Methods
| Method | Returns | Purpose |
|---|---|---|
| `traverse(source_path, patch, file_ext)` | `int` | Load and analyze source files; query against patches. Return total matches. |
| `_process(source_path, magic_ext)` | `None` | Read a source file and normalize it; call `_query_bloomfilter()`. |
| `_normalize(source, file_ext)` | `str` | Remove comments, collapse whitespace, lowercase. Return the normalized string. |
| `_query_bloomfilter(source_norm_lines, magic_ext)` | `None` | Build a Bloom filter from source n-grams and query against all patches. Handles batch processing and resets. |
| `_check_bloom_match(patch_id)` | `None` | Query old patch hashes against the current Bloom filter; record matches. |
| `_check_patch_hashes(patch_id)` | `None` | Check all hashes from the current patch against the Bloom filter; populate `_match_dict` and `_results`. |
| `items()` | `List` | Get the source file list. |
| `length()` | `int` | Get the count of source files processed. |
| `match_items()` | `Dict` | Get the match dictionary mapping patch ID → matches. |
| `results()` | `Dict` | Get the results dictionary with per-hash match status. |
| `source_hashes()` | `List[Tuple]` | Get source n-gram hashes as `[(ngram_str, [hash1, hash2, hash3]), ...]`. |
Data Structures
Match Dictionary
`_match_dict: Dict[int, Dict[int, Dict[int, bool]]]` maps `patch_id → seq_num → hash_value → is_match_bool`, where:
- patch_id: Identifier from patch list
- seq_num: Sequence number of n-gram within patch (increments every 3 hashes)
- hash_value: One of the three hash values (FNV-1a, DJB2, SDBM)
- is_match_bool: Whether this hash was found in Bloom filter
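The nested shape described above can be reconstructed with a small example; the patch ID, hash values, and the "all three hashes must hit" check are hypothetical illustrations of how the structure is read.

```python
from collections import defaultdict

# patch_id -> seq_num -> hash_value -> is_match_bool
match_dict = defaultdict(lambda: defaultdict(dict))

patch_id, seq_num = 7, 0
for hash_value, is_match in [(0x1A2B, True), (0x3C4D, True), (0x5E6F, False)]:
    match_dict[patch_id][seq_num][hash_value] = is_match

# Under Bloom semantics, an n-gram is only a candidate if all three hashes hit
all_three = all(match_dict[7][0].values())
print(all_three)  # False
```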
Results Dictionary
`_results: Dict[int, Dict[str, Any]]` maps hash values to their match results.
Source Hashes List
`_source_hashes: List[Tuple[str, List[int]]]` stores each n-gram alongside its three hash values.
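A toy illustration of this layout; the hash triple below is a placeholder (the real code stores FNV-1a, DJB2, and SDBM values) and the n-gram strings are invented.

```python
def toy_hash_triple(ngram: str):
    # Placeholder triple standing in for FNV-1a, DJB2, SDBM
    h = sum(ngram.encode())
    return [h, h * 31 % 65536, h * 131 % 65536]

# [(ngram_str, [hash1, hash2, hash3]), ...]
source_hashes = [(g, toy_hash_triple(g)) for g in ("intx=1;", "x=1;ret")]
print(source_hashes[0][0], len(source_hashes[0][1]))  # intx=1; 3
```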
Usage Example
Basic workflow
```python
from analyzer.patchLoader import PatchLoader
from analyzer.sourceLoader import SourceLoader

# Load patches first
loader = PatchLoader()
loader.traverse('data/buggy/', patch_type='buggy', file_ext=2)

# Now query source against patches
source_loader = SourceLoader()
match_count = source_loader.traverse(
    source_path='data/src/',
    patch=loader,
    file_ext=2  # Same file type index
)
print(f"Found {match_count} possible matches")

# Inspect results
matches = source_loader.match_items()
for patch_id, match_info in matches.items():
    print(f"Patch {patch_id}: {match_info}")
```
Checking specific hashes
```python
source_loader = SourceLoader()
source_loader.traverse('data/src/', loader, file_ext=2)

results = source_loader.results()
for hash_val, result in results.items():
    if result['Match']:
        print(f"Hash {hash_val} found in source")

# Reverse lookup
source_ngrams = source_loader.source_hashes()
for ngram, hashes in source_ngrams:
    print(f"N-gram: {ngram} -> Hashes: {hashes}")
```
Bloom Filter Tuning
Memory and False Positive Trade-offs
- `BLOOMFILTER_SIZE` (default 2,097,152): a larger size means fewer false positives but more memory used.
- `MIN_MN_RATIO` (default 32): controls the batch boundary. A smaller ratio means more frequent resets and checks, reducing false positives but increasing computation.
Batch Size Calculation
When `num_ngram_processed` exceeds `BLOOMFILTER_SIZE / MIN_MN_RATIO` (65,536 n-grams with the defaults), the current batch is checked and the filter is reset.
For a typical file with 10 KB of normalized tokens, this can mean 100+ batches.
Adjust `MIN_MN_RATIO` or `BLOOMFILTER_SIZE` to tune performance vs. accuracy.
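To see the accuracy side of the trade-off, the standard Bloom filter false-positive estimate, (1 − e^(−kn/m))^k, can be applied to the documented defaults; treating one full batch as the element count n is an assumption for illustration.

```python
import math

m = 2_097_152   # BLOOMFILTER_SIZE in bits
k = 3           # hash functions: FNV-1a, DJB2, SDBM
n = m // 32     # max n-grams per batch with MIN_MN_RATIO = 32

# Classic false-positive estimate for a Bloom filter at capacity n
fp_rate = (1 - math.exp(-k * n / m)) ** k
print(f"{fp_rate:.6f}")  # roughly 0.0007 with these defaults
```

Halving `MIN_MN_RATIO` doubles the per-batch load and raises this estimate, which is the "reducing false positives vs. increasing computation" trade-off described above.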
Best Practices
- Use consistent `file_ext` indices across both `PatchLoader` and `SourceLoader` to ensure language-aware normalization is uniform.
- Verify source files are readable (handle encoding errors gracefully).
- Consider pre-filtering source files by size (skip very large files if memory-constrained).
- Store `match_items()` and `results()` for post-processing and analysis.
- Use `source_hashes()` for debugging mismatches or reverse n-gram lookup.
Notes on Bloom Filter Implementation
The code uses `bitarray.bitarray` from the `bitarray` package for efficient
bit manipulation. The Bloom filter is reset between batches to manage memory
usage on large files. This means the final `_check_patch_hashes()` checks
only the last batch; earlier batches are checked via `_check_bloom_match()`
during the resets.
Integration with PatchLoader and Classifier
- Input: `PatchLoader` provides normalized patches and their hash lists
- Output: Match information is passed to the `classifier` for further analysis
- See: `docs/reference/classifier.md` for how matches are scored and classified as PA/PN/NE
API Reference
Source file loader for patch analysis.
The SourceLoader class traverses source files, normalizes them (removes comments, collapses whitespace), and uses Bloom filters (replaced with a temporary array of hash tables) to efficiently query for matches against patches.
analyzer.sourceLoader.SourceLoader
Loads and analyzes source files using Bloom filter-based patch matching.
This class normalizes source files (removing comments and excess whitespace), builds Bloom filters, and detects matches against known patches.
Source code in analyzer/sourceLoader.py
analyzer.sourceLoader.SourceLoader.__init__()
Initialize the SourceLoader with empty data structures.
analyzer.sourceLoader.SourceLoader.traverse(source_path, patch, file_ext)
Traverse source files and query against patches.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `source_path` | `str` | Path to a file or directory to analyze. | required |
| `patch` | `Any` | Patch object with `items()` and `length()` methods. | required |
| `file_ext` | `int` | File extension type index (from the `FileExt` class). | required |

Returns:

| Type | Description |
|---|---|
| `int` | The number of matches found. |
analyzer.sourceLoader.SourceLoader.items()
analyzer.sourceLoader.SourceLoader.length()
analyzer.sourceLoader.SourceLoader.match_items()
analyzer.sourceLoader.SourceLoader.results()
See Also
- Main Module — How source loading fits in the pipeline
- Patch Loader — How patches are prepared
- Classifier — How matches are classified
- Constant — `BLOOMFILTER_SIZE` and `MIN_MN_RATIO`
- Helpers — Comment removal for different languages