# Classifier Module

## Overview
The Classifier module provides core patch classification functions used by PatchTrack's pipeline. It determines whether a ChatGPT code snippet matches patches in a GitHub pull request by comparing hunks, calculating similarities, and aggregating results into classification decisions.
## Purpose
This module:
- Processes patches and source code with tokenization
- Matches code hunks using hash-based comparison
- Calculates similarity ratios between snippets
- Classifies individual patches as PA, PN, or NE
- Preserves original logic with improved readability
## Classification Algorithm

```text
  ChatGPT Code + PR Patch
            ↓
       Tokenize Both
            ↓
    Create Hash Tables
            ↓
    Find Matching Hunks
            ↓
Calculate Similarity Ratio
            ↓
 Classify Based on Matches
            ↓
       PA / PN / NE
```
## Key Concepts

### Hunks

A "hunk" is a contiguous block of added code in a patch. The classifier matches hunks from the ChatGPT code against the PR patches.
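The hunk-extraction step can be sketched as follows. This is an illustrative stand-in, not PatchTrack's actual `PatchLoader`: a hypothetical `added_hunks` helper that groups contiguous `+` lines of a unified diff into blocks of added code.

```python
# Illustrative sketch, not the actual PatchLoader: group the contiguous
# '+' lines of a unified diff into hunks of added code.
def added_hunks(diff_text):
    hunks, current = [], []
    for line in diff_text.splitlines():
        if line.startswith('+') and not line.startswith('+++'):
            current.append(line[1:])  # drop the leading '+'
        elif current:                 # a non-added line closes the hunk
            hunks.append(current)
            current = []
    if current:
        hunks.append(current)
    return hunks

diff = """\
--- a/app.py
+++ b/app.py
@@ -1,3 +1,5 @@
 import os
+import sys
+import json
 def main():
+    print("hi")
"""
print(added_hunks(diff))
# [['import sys', 'import json'], ['    print("hi")']]
```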
### Hash-Based Matching

Uses SHA-256 hashes to compare:
- Source hashes: Hash values of each line in ChatGPT code
- Patch hunks: Added code blocks in PR patches
- Match: When hashes align between source and patch
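As a rough illustration of the idea (the real loaders also tokenize and track line positions), line-level SHA-256 matching might look like this; the per-line `strip()` normalization is an assumption, not necessarily what the module does:

```python
import hashlib

def line_hashes(code):
    """SHA-256 digest of every non-empty, whitespace-stripped line."""
    return {hashlib.sha256(line.strip().encode()).hexdigest()
            for line in code.splitlines() if line.strip()}

chatgpt_code = "import sys\nprint(sys.argv)\n"
patch_added = "print(sys.argv)\nlog.info('done')\n"

# A match is a hash present in both the source and the patch's added lines.
matches = line_hashes(chatgpt_code) & line_hashes(patch_added)
print(len(matches))  # 1 -- only 'print(sys.argv)' appears in both
```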
### Similarity Ratio
Calculated using n-gram comparison:
- Range: 0.0 (no match) to 1.0 (perfect match)
- Formula: Matching n-grams / Total n-grams
- Usage: Determines confidence in classification
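A minimal sketch of this formula, assuming the denominator counts the source snippet's n-grams (the exact tokenization and normalization in `cal_similarity_ratio` may differ):

```python
def ngrams(tokens, n):
    """All length-n windows over a token list, as a set."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def similarity_ratio(src_tokens, patch_tokens, n=4):
    """Matching n-grams / total n-grams in the source snippet."""
    src = ngrams(src_tokens, n)
    if not src:
        return 0.0
    return len(src & ngrams(patch_tokens, n)) / len(src)

a = "x = load ( path ) ; return x".split()
b = "x = load ( path ) ; save x".split()
print(similarity_ratio(a, b, n=2))  # 0.75 -- 6 of 8 bigrams match
```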
## Classification Decision

### Patch Applied (PA)
✅ Conditions:
- ChatGPT code appears in one or more PR hunks
- High similarity ratio to patch content
- Hashes match between source and patch
### Patch Not Applied (PN)
❌ Conditions:
- ChatGPT code does NOT appear in any PR hunk
- No matching hashes found
- Similarity ratio is low
### Not Existing (NE)
⚠️ Conditions:
- Required file doesn't exist in PR
- ChatGPT code path cannot be processed
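The three outcomes can be summarized as a toy decision rule. The 0.5 threshold and the exact ordering of the checks are illustrative assumptions, not the module's actual logic:

```python
def classify(file_exists, any_hash_match, similarity, threshold=0.5):
    """Toy decision rule mirroring the PA / PN / NE conditions above."""
    if not file_exists:          # required file missing in the PR
        return 'NE'
    if any_hash_match and similarity >= threshold:
        return 'PA'              # code found with high similarity
    return 'PN'                  # no (or weak) evidence the code was applied

print(classify(True, True, 0.9))    # PA
print(classify(True, False, 0.1))   # PN
print(classify(False, False, 0.0))  # NE
```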
## Key Functions

### Core Processing

- `process_patch()`: Load and traverse a patch file together with its source code
- `get_ext()`: Extract the file extension from a filename
### Matching Functions

- `find_hunk_matches()`: Find hash-based matches between hunks
- `find_hunk_matches_w_important_hash()`: Enhanced matching with priority hashes
- `calculate_match_percentage()`: Calculate the proportion of matched items
### Classification Functions

- `classify_hunk()`: Classify a single hunk
- `classify_patch()`: Aggregate hunk classifications to the patch level
- `cal_similarity_ratio()`: Calculate n-gram based similarity
## Usage Example

```python
from analyzer import classifier, common

# Set n-gram size (global setting)
common.ngram_size = 4

# Process a patch and source
patch_loader, source_loader = classifier.process_patch(
    patch_path='data/patches/pr-123/github/patch-1.patch',
    dst_path='data/patches/pr-123/chatgpt/code.py',
    type_patch='patch',
    file_ext='py',  # documented as a string extension
)

# Extract components
added_code = patch_loader.added()
match_items = source_loader.match_items()
source_hashes = source_loader.source_hashes()

# Find matches with important hashes
hunk_matches = classifier.find_hunk_matches_w_important_hash(
    match_items=match_items,
    _type='PA',
    important_hashes=added_code,
    source_hashes=source_hashes,
)

# Calculate similarity
similarity = classifier.cal_similarity_ratio(source_hashes, added_code)
print(f"Similarity Ratio: {similarity:.2%}")

# Classify hunks
hunk_classes = []
for hunk_id in hunk_matches:
    hunk_class = classifier.classify_hunk('', hunk_matches[hunk_id]['class'])
    hunk_classes.append(hunk_class)

# Final patch classification
final_class = classifier.classify_patch(hunk_classes)
print(f"Patch Classification: {final_class}")
```
## Data Structures

### Patch Loader Output

```python
{
    'hunks': [
        {'added': [...], 'removed': [...], 'context': [...]},
        ...
    ],
    'hashes': {hash_value: {'count': n, 'lines': [...]}, ...}
}
```
### Source Loader Output

```python
{
    'tokens': [...],
    'hashes': {hash_value: position, ...},
    'match_items': {hash: {'Match': True/False}, ...}
}
```
### Match Results

## Performance Considerations
- Hash-based matching: O(n) complexity
- N-gram size: trade-off between accuracy and speed
  - Small n (1-2): faster, but matches more loosely
  - Large n (4+): slower, but more accurate
- File size: larger files take longer to process
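To see the n-gram trade-off concretely, here is a small standalone demonstration (it does not use the module): as n grows, fewer n-grams are shared between two similar snippets, so matching becomes stricter.

```python
def ngrams(tokens, n):
    """All length-n windows over a token list, as a set."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

# Two near-identical snippets differing only in a variable name and a literal.
a = "for i in range ( 10 ) : print ( i )".split()
b = "for j in range ( 20 ) : print ( j )".split()

for n in (1, 2, 4):
    shared = len(ngrams(a, n) & ngrams(b, n))
    print(f"n={n}: {shared} shared of {len(ngrams(a, n))} n-grams")
```

With n=1 most tokens coincide; with n=4 only one window survives the two small edits.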
## API Reference

Patch classifier helpers.

This module contains the helper functions used by the patch classification pipeline. The refactor preserves the original logic but improves readability by adding type hints and docstrings and by removing commented-out code.
### `analyzer.classifier.process_patch(patch_path, dst_path, type_patch, file_ext)`

Process a patch and its corresponding source traversal.

This wraps `PatchLoader.traverse` and `SourceLoader.traverse`, preserving the original try/except logging behavior.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `patch_path` | `str` | Path to the patch file. | required |
| `dst_path` | `str` | Path to the destination/source files. | required |
| `type_patch` | `str` | Type of patch (e.g., buggy/fixed). | required |
| `file_ext` | `str` | File extension being processed. | required |

Returns:

| Type | Description |
|---|---|
| `Tuple[Any, Any]` | Tuple of `(patch_loader_instance, source_loader_instance)`. |

Source code in `analyzer/classifier.py`
### `analyzer.classifier.get_ext(filename)`

Return the file extension for `filename`.

If no extension is present this returns an empty string.
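One plausible reading of this behavior, sketched with the standard library (the real implementation may differ, for example in whether the leading dot is kept):

```python
import os

def get_ext(filename):
    """Return the extension without the dot, or '' when absent (a sketch)."""
    ext = os.path.splitext(filename)[1]
    return ext[1:] if ext else ''

print(get_ext('src/app.py'))        # py
print(repr(get_ext('Makefile')))    # ''
print(get_ext('archive.tar.gz'))    # gz
```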
### `analyzer.classifier.calculate_match_percentage(results, hashes)`

Calculate the percentage of matched items in `results`.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `results` | `Dict[Any, Dict[str, Any]]` | Mapping of items to a dict containing a boolean under the `'Match'` key. | required |
| `hashes` | `Dict[Any, Any]` | Mapping used to collect matched or unmatched items (kept for parity). | required |

Returns:

| Type | Description |
|---|---|
| `float` | Percentage (0-100) of matched items. Returns 0 if there are no items. |

Source code in `analyzer/classifier.py`
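A simplified reimplementation of the computation described above; it drops the `hashes` parameter, which the docs note is kept only for parity, so this is a sketch rather than the module's signature:

```python
def calculate_match_percentage(results):
    """Percentage (0-100) of items whose 'Match' flag is True; 0 when empty."""
    if not results:
        return 0.0
    matched = sum(1 for item in results.values() if item.get('Match'))
    return 100.0 * matched / len(results)

results = {
    'hash1': {'Match': True},
    'hash2': {'Match': False},
    'hash3': {'Match': True},
    'hash4': {'Match': True},
}
print(calculate_match_percentage(results))  # 75.0
```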
### `analyzer.classifier.find_hunk_matches(match_items, _type, important_hashes, source_hashes)`

Find matches between hunks using hashed values.

Preserves the original matching logic and return structure.

Source code in `analyzer/classifier.py`
### `analyzer.classifier.classify_hunk(class_patch, class_buggy)`

Classify a single hunk based on patch and buggy classifications.

Source code in `analyzer/classifier.py`
### `analyzer.classifier.classify_patch(hunk_classifications)`

Determine the patch-level classification from hunk classifications.

Source code in `analyzer/classifier.py`
### `analyzer.classifier.find_hunk_matches_w_important_hash(match_items, _type, important_hashes, source_hashes)`

Find hunk matches using the important-hashes feature.

Preserves the original behavior and return structure.

Source code in `analyzer/classifier.py`
### `analyzer.classifier.cal_similarity_ratio(source_hashes, added_lines_hashes)`

Calculate the similarity ratio between source hashes and added-line hashes.

Source code in `analyzer/classifier.py`
## See Also
- Patch Loader - Parses PR patches
- Source Loader - Parses ChatGPT code
- Aggregator - Aggregates classifications
- Common - Configuration (n-gram size, etc.)