Classifier Module

Overview

The Classifier module provides core patch classification functions used by PatchTrack's pipeline. It determines whether a ChatGPT code snippet matches patches in a GitHub pull request by comparing hunks, calculating similarities, and aggregating results into classification decisions.

Purpose

This module:

  • Processes patches and source code with tokenization
  • Matches code hunks using hash-based comparison
  • Calculates similarity ratios between snippets
  • Classifies individual patches as PA (patch applied), PN (patch not applied), or NE (not existing)
  • Preserves original logic with improved readability

Classification Algorithm

ChatGPT Code + PR Patch
   → Tokenize Both
   → Create Hash Tables
   → Find Matching Hunks
   → Calculate Similarity Ratio
   → Classify Based on Matches
   → PA / PN / NE

Key Concepts

Hunks

A "hunk" is a contiguous block of added code in the patch. The classifier matches hunks from ChatGPT code against PR patches.
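As an illustration, grouping consecutive added lines of a unified diff into hunks can be sketched as follows (`extract_added_hunks` is a hypothetical helper, not the module's actual parser):

```python
def extract_added_hunks(diff_text: str) -> list[list[str]]:
    """Group consecutive '+' lines of a unified diff into hunks of added code."""
    hunks, current = [], []
    for line in diff_text.splitlines():
        if line.startswith('+') and not line.startswith('+++'):
            current.append(line[1:])   # strip the leading '+'
        elif current:
            hunks.append(current)      # any non-added line closes the hunk
            current = []
    if current:
        hunks.append(current)
    return hunks

diff = """--- a/f.py
+++ b/f.py
@@ -1,2 +1,3 @@
 x = 1
+y = 2
+z = 3
 print(x)"""
print(extract_added_hunks(diff))  # [['y = 2', 'z = 3']]
```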

Hash-Based Matching

Uses SHA-256 hashes to compare:

  • Source hashes: Hash values of each line in ChatGPT code
  • Patch hunks: Added code blocks in PR patches
  • Match: When hashes align between source and patch
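The idea behind hash-based matching can be sketched as follows (illustrative only; `line_hashes` is a hypothetical helper, and stripping whitespace before hashing is an assumption about normalization):

```python
import hashlib

def line_hashes(code: str) -> set[str]:
    """SHA-256 hash of each non-empty line, with surrounding whitespace stripped."""
    return {hashlib.sha256(line.strip().encode()).hexdigest()
            for line in code.splitlines() if line.strip()}

chatgpt_code = "total = sum(values)\nprint(total)"
pr_hunk = "count = 0\ntotal = sum(values)\nprint(total)"

# Hashes align wherever the same (normalized) line appears in both.
matches = line_hashes(chatgpt_code) & line_hashes(pr_hunk)
print(len(matches))  # 2 – both snippet lines appear in the PR hunk
```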

Similarity Ratio

Calculated using n-gram comparison:

  • Range: 0 (no match) to 100 (perfect match), expressed as a percentage
  • Formula: unique matching n-grams / unique source n-grams × 100
  • Usage: Determines confidence in classification
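A minimal sketch of this ratio on token sequences (hypothetical names; the module's own cal_similarity_ratio works on hashed n-grams rather than raw tokens):

```python
def ngrams(tokens, n):
    """All contiguous n-grams of a token sequence, as a set."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def similarity_ratio(source_tokens, added_tokens, n=4):
    """Percentage of unique source n-grams that also appear in the added code."""
    src = ngrams(source_tokens, n)
    if not src:
        return 0.0
    return len(src & ngrams(added_tokens, n)) / len(src) * 100

# 2 of the source's 5 unique 4-grams also occur in the added tokens.
print(similarity_ratio(list("abcdefgh"), list("abcdexyz")))  # 40.0
```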

Classification Decision

Patch Applied (PA)

✅ Conditions:

  • ChatGPT code appears in one or more PR hunks
  • High similarity ratio to patch content
  • Hashes match between source and patch

Patch Not Applied (PN)

❌ Conditions:

  • ChatGPT code does NOT appear in any PR hunk
  • No matching hashes found
  • Similarity ratio is low

Not Existing (NE)

⚠️ Conditions:

  • Required file doesn't exist in PR
  • ChatGPT code path cannot be processed
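The three outcomes above can be mirrored in a small decision sketch (hypothetical helper; the 50% similarity threshold is an assumption for illustration, not the module's actual cutoff):

```python
import os

def classify_snippet(file_path: str, hash_matches: int, similarity: float) -> str:
    """Map the PA/PN/NE conditions onto simple checks.

    `similarity` is a 0-100 percentage; the 50% threshold is illustrative.
    """
    if not os.path.exists(file_path):
        return 'NE'  # required file does not exist in the PR
    if hash_matches > 0 and similarity >= 50:
        return 'PA'  # snippet appears in a hunk with high similarity
    return 'PN'      # no matching hashes / low similarity
```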

Key Functions

Core Processing

  • process_patch(): Load and traverse patch file with source code
  • get_ext(): Extract file extension from filename

Matching Functions

  • find_hunk_matches(): Find hash-based matches between hunks
  • find_hunk_matches_w_important_hash(): Enhanced matching with priority hashes
  • calculate_match_percentage(): Calculate proportion of matched items

Classification Functions

  • classify_hunk(): Classify a single hunk
  • classify_patch(): Aggregate hunk classifications to patch level
  • cal_similarity_ratio(): Calculate n-gram based similarity

Usage Example

from analyzer import classifier, common

# Set n-gram size (global setting)
common.ngram_size = 4

# Process a patch and source
patch_loader, source_loader = classifier.process_patch(
    patch_path='data/patches/pr-123/github/patch-1.patch',
    dst_path='data/patches/pr-123/chatgpt/code.py',
    type_patch='patch',
    file_ext='py'
)

# Extract components
added_code = patch_loader.added()
match_items = source_loader.match_items()
source_hashes = source_loader.source_hashes()

# Find matches with important hashes
hunk_matches = classifier.find_hunk_matches_w_important_hash(
    match_items=match_items,
    _type='PA',
    important_hashes=added_code,
    source_hashes=source_hashes
)

# Calculate similarity
similarity = classifier.cal_similarity_ratio(source_hashes, added_code)
print(f"Similarity Ratio: {similarity:.2f}%")

# Classify hunks
hunk_classes = []
for hunk_id in hunk_matches:
    hunk_class = classifier.classify_hunk('', hunk_matches[hunk_id]['class'])
    hunk_classes.append(hunk_class)

# Final patch classification
final_class = classifier.classify_patch(hunk_classes)
print(f"Patch Classification: {final_class}")

Data Structures

Patch Loader Output

{
    'hunks': [
        {'added': [...], 'removed': [...], 'context': [...]},
        ...
    ],
    'hashes': {hash_value: {'count': n, 'lines': [...]}, ...}
}

Source Loader Output

{
    'tokens': [...],
    'hashes': {hash_value: position, ...},
    'match_items': {hash: {'Match': True/False}, ...}
}

Match Results

{
    'hunk_0': {
        'class': 'PA',
        'matches': 25,
        'total': 30,
        'percentage': 83.33
    }
}

Performance Considerations

  • Hash-based: O(n) complexity for matching via hash-table lookups
  • N-gram size: Trade-off between precision and recall
  • Small n (1-2): Looser matching; more (possibly spurious) matches
  • Large n (4+): Stricter matching; fewer but more precise matches
  • File size: Larger files take longer to process
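The effect of n-gram size is easy to see on a toy example: shrinking n inflates the shared n-gram count, while a larger n admits only longer exact runs (illustrative token sequences):

```python
def ngrams(tokens, n):
    """All contiguous n-grams of a token sequence, as a set."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

a = "the quick brown fox jumps over the lazy dog".split()
b = "a quick brown fox leaps over the lazy cat".split()

for n in (1, 2, 4):
    shared = ngrams(a, n) & ngrams(b, n)
    print(n, len(shared))
# prints: 1 6 / 2 4 / 4 0 – no 4-token run survives the two substitutions
```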

API Reference

Patch classifier helpers.

This module contains helper functions used by the patch classification pipeline. The refactor preserves original logic but improves readability by adding type hints, docstrings, and removing commented-out code.

analyzer.classifier.process_patch(patch_path, dst_path, type_patch, file_ext)

Process a patch and its corresponding source traversal.

This wraps PatchLoader.traverse and SourceLoader.traverse, preserving the original try/except logging behavior.

Parameters:

  • patch_path (str, required): Path to the patch file.
  • dst_path (str, required): Path to the destination/source files.
  • type_patch (str, required): Type of patch (e.g., buggy/fixed).
  • file_ext (str, required): File extension being processed.

Returns:

  • Tuple[Any, Any]: Tuple of (patch_loader_instance, source_loader_instance).

Source code in analyzer/classifier.py
def process_patch(patch_path: str, dst_path: str, type_patch: str, file_ext: str) -> Tuple[Any, Any]:
    """Process a patch and its corresponding source traversal.

    This wraps `PatchLoader.traverse` and `SourceLoader.traverse`, preserving
    the original try/except logging behavior.

    Args:
        patch_path: Path to the patch file.
        dst_path: Path to the destination/source files.
        type_patch: Type of patch (e.g., buggy/fixed).
        file_ext: File extension being processed.

    Returns:
        Tuple of (patch_loader_instance, source_loader_instance).
    """
    common.ngram_size = constant.NGRAM_SIZE

    patch = patch_loader.PatchLoader()
    try:
        _ = patch.traverse(patch_path, type_patch, file_ext)
    except Exception as e:
        print("Error traversing patch:....", e)

    source = source_loader.SourceLoader()
    try:
        _ = source.traverse(dst_path, patch, file_ext)
    except Exception as e:
        print("Error traversing source (variant)....", e)

    return patch, source

analyzer.classifier.get_ext(filename)

Return the file extension for filename.

If no extension is present this returns an empty string.

Source code in analyzer/classifier.py
def get_ext(filename: str) -> str:
    """Return the file extension for `filename`.

    If no extension is present this returns an empty string.
    """
    parts = filename.rsplit('.', 1)
    return parts[-1] if len(parts) == 2 else ''
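For example (function copied from the listing above):

```python
def get_ext(filename: str) -> str:
    # Copied from the listing above.
    parts = filename.rsplit('.', 1)
    return parts[-1] if len(parts) == 2 else ''

print(get_ext('patch-1.patch'))   # 'patch'
print(get_ext('archive.tar.gz'))  # 'gz' – only the final extension is returned
print(get_ext('Makefile'))        # '' – no extension present
```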

analyzer.classifier.calculate_match_percentage(results, hashes)

Calculate percentage of matched items in results.

Parameters:

  • results (Dict[Any, Dict[str, Any]], required): Mapping of items to a dict containing a boolean under key 'Match'.
  • hashes (Dict[Any, Any], required): Mapping used to collect matched or unmatched items (kept for parity).

Returns:

  • float: Percentage (0-100) of matched items. Returns 0 if there are no items.

Source code in analyzer/classifier.py
def calculate_match_percentage(results: Dict[Any, Dict[str, Any]], hashes: Dict[Any, Any]) -> float:
    """Calculate percentage of matched items in `results`.

    Args:
        results: Mapping of items to a dict containing a boolean under key `'Match'`.
        hashes: Mapping used to collect matched or unmatched items (kept for parity).

    Returns:
        Percentage (0-100) of matched items. Returns 0 if there are no items.
    """
    total = 0
    matched = 0

    for h in results:
        total += 1
        if results[h].get('Match'):
            matched += 1
    return (matched / total) * 100 if total != 0 else 0.0
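For example (function copied from the listing above; the placeholder keys are illustrative):

```python
from typing import Any, Dict

def calculate_match_percentage(results: Dict[Any, Dict[str, Any]], hashes: Dict[Any, Any]) -> float:
    # Copied from the listing above.
    total = 0
    matched = 0
    for h in results:
        total += 1
        if results[h].get('Match'):
            matched += 1
    return (matched / total) * 100 if total != 0 else 0.0

results = {'h1': {'Match': True}, 'h2': {'Match': False},
           'h3': {'Match': True}, 'h4': {'Match': True}}
print(calculate_match_percentage(results, hashes={}))  # 75.0 – 3 of 4 items matched
```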

analyzer.classifier.find_hunk_matches(match_items, _type, important_hashes, source_hashes)

Find matches between hunks using hashed values.

Preserves original matching logic and return structure.

Source code in analyzer/classifier.py
def find_hunk_matches(match_items: Dict[Any, Any], _type: str, important_hashes: List[Any], source_hashes: List[Tuple[Any, Any]]) -> Dict[Any, Any]:
    """Find matches between hunks using hashed values.

    Preserves original matching logic and return structure.
    """
    seq_matches: Dict[Any, Any] = {}

    for patch_nr in match_items:
        seq_matches[patch_nr] = {'sequences': {}, 'class': ''}
        for patch_seq in match_items[patch_nr]:
            seq_matches[patch_nr]['sequences'][patch_seq] = {
                'count': 0,
                'hash_list': list(match_items[patch_nr][patch_seq].keys())
            }

            for k in match_items[patch_nr][patch_seq]:
                if match_items[patch_nr][patch_seq][k]:
                    seq_matches[patch_nr]['sequences'][patch_seq]['count'] += 1

    match_bool = True

    for seq_nr in seq_matches:
        for seq in seq_matches[seq_nr]['sequences']:
            if seq_matches[seq_nr]['sequences'][seq]['count'] < 2:
                match_bool = False
                break

        _class = ''
        if _type == 'MO':
            _class = _type if match_bool else 'MC'
        elif _type == 'PA':
            _class = _type if match_bool else 'MC'

        seq_matches[seq_nr]['class'] = _class

    return seq_matches

analyzer.classifier.classify_hunk(class_patch, class_buggy)

Classify a single hunk based on patch and buggy classifications.

Source code in analyzer/classifier.py
def classify_hunk(class_patch: str, class_buggy: str) -> str:
    """Classify a single hunk based on patch and buggy classifications.
    """
    final_class = ''
    if class_buggy == 'MC' and class_patch == 'PA':
        final_class = 'PA'
    if class_buggy == 'PA' and class_patch == 'MC':
        final_class = 'PA'
    if class_buggy == 'MC' and class_patch == 'MC':
        final_class = 'PN'
    if class_patch == '' and class_buggy != '':
        final_class = class_buggy
    if class_patch != '' and class_buggy == '':
        final_class = class_patch
    if class_patch == '' and class_buggy == '':
        final_class = 'PN'
    return final_class
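The decision table is easiest to read through a few calls (function copied from the listing above):

```python
def classify_hunk(class_patch: str, class_buggy: str) -> str:
    # Copied from the listing above.
    final_class = ''
    if class_buggy == 'MC' and class_patch == 'PA':
        final_class = 'PA'
    if class_buggy == 'PA' and class_patch == 'MC':
        final_class = 'PA'
    if class_buggy == 'MC' and class_patch == 'MC':
        final_class = 'PN'
    if class_patch == '' and class_buggy != '':
        final_class = class_buggy
    if class_patch != '' and class_buggy == '':
        final_class = class_patch
    if class_patch == '' and class_buggy == '':
        final_class = 'PN'
    return final_class

print(classify_hunk('PA', 'MC'))  # 'PA' – match on either side suffices
print(classify_hunk('MC', 'MC'))  # 'PN' – neither side matched
print(classify_hunk('', 'PA'))    # 'PA' – empty patch side falls back to the buggy side
```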

analyzer.classifier.classify_patch(hunk_classifications)

Determine patch-level classification from hunk classifications.

Source code in analyzer/classifier.py
def classify_patch(hunk_classifications: List[str]) -> str:
    """Determine patch-level classification from hunk classifications.
    """
    na_total = 0
    ed_total = 0

    final_class = ''
    for i in range(len(hunk_classifications)):
        if hunk_classifications[i] == 'PA':
            ed_total += 1
        elif hunk_classifications[i] == 'PN':
            na_total += 1

    final_class = 'PN' if ed_total == 0 else 'PA'
    return final_class
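In condensed form the rule is simply "any PA hunk makes the whole patch PA" (a behavior-equivalent sketch, not the listing itself):

```python
def classify_patch(hunk_classifications):
    # Behavior-equivalent to the listing above: one applied hunk suffices.
    return 'PA' if any(c == 'PA' for c in hunk_classifications) else 'PN'

print(classify_patch(['PN', 'PA', 'PN']))  # 'PA'
print(classify_patch(['PN', 'PN']))        # 'PN'
print(classify_patch([]))                  # 'PN' – no hunks means nothing applied
```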

analyzer.classifier.find_hunk_matches_w_important_hash(match_items, _type, important_hashes, source_hashes)

Find hunk matches using important hashes feature.

Preserves original behavior and return structure.

Source code in analyzer/classifier.py
def find_hunk_matches_w_important_hash(match_items: Dict[Any, Any], _type: str, important_hashes: List[Any], source_hashes: List[Tuple[Any, Any]]) -> Dict[Any, Any]:
    """Find hunk matches using important hashes feature.

    Preserves original behavior and return structure.
    """
    seq_matches: Dict[Any, Any] = {}
    test: List[Any] = []
    for lines in important_hashes:
        for line in lines:
            for each in line:
                for ngram, hash_list in source_hashes:
                    if each in ngram:
                        test.append(hash_list)

    important_hash_match = 0
    for patch_nr in match_items:
        match_bool = False
        seq_matches[patch_nr] = {'sequences': {}, 'class': ''}
        for patch_seq in match_items[patch_nr]:
            seq_matches[patch_nr]['sequences'][patch_seq] = {
                'count': 0,
                'hash_list': list(match_items[patch_nr][patch_seq].keys())
            }

            if seq_matches[patch_nr]['sequences'][patch_seq]['hash_list'] in test:
                seq_matches[patch_nr]['sequences'][patch_seq]['important'] = True
                important_hash_match += 1
                match_bool = True
            else:
                seq_matches[patch_nr]['sequences'][patch_seq]['important'] = False

            for k in match_items[patch_nr][patch_seq]:
                if match_items[patch_nr][patch_seq][k]:
                    seq_matches[patch_nr]['sequences'][patch_seq]['count'] += 1

        seq_matches[patch_nr]['class'] = _type if match_bool else 'MC'

    return seq_matches

analyzer.classifier.cal_similarity_ratio(source_hashes, added_lines_hashes)

Calculate similarity ratio between source hashes and added lines hashes.

Source code in analyzer/classifier.py
def cal_similarity_ratio(source_hashes: List[Tuple[Any, Any]], added_lines_hashes: List[List[List[Any]]]) -> float:
    """Calculate similarity ratio between source hashes and added lines hashes.
    """
    count_matches: List[Any] = []

    for lines in added_lines_hashes:
        for line in lines:
            for each in line:
                for ngram, hash_list in source_hashes:
                    if each == ngram:
                        count_matches.append(ngram)

    s_hashes: List[Any] = [ngram for ngram, _ in source_hashes]

    try:
        unique_matches = list(set(count_matches))
        unique_source_hashes = list(set(s_hashes))
        per = (len(unique_matches) / len(unique_source_hashes)) * 100
        return per
    except Exception:
        return 0.0
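A worked example with the expected input shapes (hash values are placeholders; the function body is condensed from the listing above but computes the same result):

```python
from typing import Any, List, Tuple

def cal_similarity_ratio(source_hashes: List[Tuple[Any, Any]],
                         added_lines_hashes: List[List[List[Any]]]) -> float:
    # Condensed from the listing above; same matching and percentage logic.
    count_matches = []
    for lines in added_lines_hashes:
        for line in lines:
            for each in line:
                for ngram, hash_list in source_hashes:
                    if each == ngram:
                        count_matches.append(ngram)
    s_hashes = [ngram for ngram, _ in source_hashes]
    try:
        return (len(set(count_matches)) / len(set(s_hashes))) * 100
    except Exception:
        return 0.0

source_hashes = [('h1', [0]), ('h2', [1]), ('h3', [2]), ('h4', [3])]
added = [[['h1', 'h2', 'zz']]]   # one hunk → one line → its n-gram hashes
print(cal_similarity_ratio(source_hashes, added))  # 50.0 – 2 of 4 unique source hashes matched
```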

See Also