Classifier Module

Overview

The Classifier module provides core patch classification functions used by PatchTrack's pipeline. It determines whether a ChatGPT code snippet matches patches in a GitHub pull request by comparing hunks, calculating similarities, and aggregating results into classification decisions.

Purpose

This module:

  • Processes patches and source code with tokenization
  • Matches code hunks using hash-based comparison
  • Calculates similarity ratios between snippets
  • Classifies individual patches as PA (patch applied), PN (patch not applied), or NE (not existing)
  • Preserves original logic with improved readability

Classification Algorithm

ChatGPT Code + PR Patch
   → Tokenize Both
   → Create Hash Tables
   → Find Matching Hunks
   → Calculate Similarity Ratio
   → Classify Based on Matches
   → PA / PN / NE

Key Concepts

Hunks

A "hunk" is a contiguous block of added code in the patch. The classifier matches hunks from ChatGPT code against PR patches.
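As an illustration, grouping consecutive added lines of a unified diff into hunks can be sketched as follows (`extract_added_hunks` is a hypothetical helper, not the module's actual parser):

```python
def extract_added_hunks(diff_text: str) -> list[list[str]]:
    """Group consecutive '+' lines of a unified diff into hunks of added code."""
    hunks, current = [], []
    for line in diff_text.splitlines():
        if line.startswith('+') and not line.startswith('+++'):
            current.append(line[1:])   # strip the leading '+'
        elif current:
            hunks.append(current)      # any non-added line closes the hunk
            current = []
    if current:
        hunks.append(current)
    return hunks

diff = """--- a/f.py
+++ b/f.py
@@ -1,2 +1,3 @@
 x = 1
+y = 2
+z = 3
 print(x)"""
print(extract_added_hunks(diff))  # [['y = 2', 'z = 3']]
```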

Hash-Based Matching

Uses SHA-256 hashes to compare:

  • Source hashes: Hash values of each line in ChatGPT code
  • Patch hunks: Added code blocks in PR patches
  • Match: When hashes align between source and patch
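The idea behind hash-based matching can be sketched as follows (illustrative only; `line_hashes` is a hypothetical helper, and stripping whitespace before hashing is an assumption about normalization):

```python
import hashlib

def line_hashes(code: str) -> set[str]:
    """SHA-256 hash of each non-empty line, with surrounding whitespace stripped."""
    return {hashlib.sha256(line.strip().encode()).hexdigest()
            for line in code.splitlines() if line.strip()}

chatgpt_code = "total = sum(values)\nprint(total)"
pr_hunk = "count = 0\ntotal = sum(values)\nprint(total)"

# Hashes align wherever the same (normalized) line appears in both.
matches = line_hashes(chatgpt_code) & line_hashes(pr_hunk)
print(len(matches))  # 2 – both snippet lines appear in the PR hunk
```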

Similarity Ratio

Calculated using n-gram comparison:

  • Range: 0 (no match) to 100 (perfect match), expressed as a percentage
  • Formula: unique matching n-grams / unique source n-grams × 100
  • Usage: Determines confidence in classification
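A minimal sketch of this ratio on token sequences (hypothetical names; the module's own cal_similarity_ratio works on hashed n-grams rather than raw tokens):

```python
def ngrams(tokens, n):
    """All contiguous n-grams of a token sequence, as a set."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def similarity_ratio(source_tokens, added_tokens, n=4):
    """Percentage of unique source n-grams that also appear in the added code."""
    src = ngrams(source_tokens, n)
    if not src:
        return 0.0
    return len(src & ngrams(added_tokens, n)) / len(src) * 100

# 2 of the source's 5 unique 4-grams also occur in the added tokens.
print(similarity_ratio(list("abcdefgh"), list("abcdexyz")))  # 40.0
```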

Classification Decision

Patch Applied (PA)

✅ Conditions:

  • ChatGPT code appears in one or more PR hunks
  • High similarity ratio to patch content
  • Hashes match between source and patch

Patch Not Applied (PN)

❌ Conditions:

  • ChatGPT code does NOT appear in any PR hunk
  • No matching hashes found
  • Similarity ratio is low

Not Existing (NE)

⚠️ Conditions:

  • Required file doesn't exist in PR
  • ChatGPT code path cannot be processed
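The three outcomes above can be mirrored in a small decision sketch (hypothetical helper; the 50% similarity threshold is an assumption for illustration, not the module's actual cutoff):

```python
import os

def classify_snippet(file_path: str, hash_matches: int, similarity: float) -> str:
    """Map the PA/PN/NE conditions onto simple checks.

    `similarity` is a 0-100 percentage; the 50% threshold is illustrative.
    """
    if not os.path.exists(file_path):
        return 'NE'  # required file does not exist in the PR
    if hash_matches > 0 and similarity >= 50:
        return 'PA'  # snippet appears in a hunk with high similarity
    return 'PN'      # no matching hashes / low similarity
```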

Key Functions

Core Processing

  • process_patch(): Load and traverse patch file with source code
  • get_ext(): Extract file extension from filename

Matching Functions

  • find_hunk_matches(): Find hash-based matches between hunks
  • find_hunk_matches_w_important_hash(): Enhanced matching with priority hashes
  • calculate_match_percentage(): Calculate proportion of matched items

Classification Functions

  • classify_hunk(): Classify a single hunk
  • classify_patch(): Aggregate hunk classifications to patch level
  • cal_similarity_ratio(): Calculate n-gram based similarity

Usage Example

from analyzer import classifier, common

# Set n-gram size (global setting)
common.ngram_size = 4

# Process a patch and source
patch_loader, source_loader = classifier.process_patch(
    patch_path='data/patches/pr-123/github/patch-1.patch',
    dst_path='data/patches/pr-123/chatgpt/code.py',
    type_patch='patch',
    file_ext='py'
)

# Extract components
added_code = patch_loader.added()
match_items = source_loader.match_items()
source_hashes = source_loader.source_hashes()

# Find matches with important hashes
hunk_matches = classifier.find_hunk_matches_w_important_hash(
    match_items=match_items,
    _type='PA',
    important_hashes=added_code,
    source_hashes=source_hashes
)

# Calculate similarity
similarity = classifier.cal_similarity_ratio(source_hashes, added_code)
print(f"Similarity Ratio: {similarity:.2f}%")

# Classify hunks
hunk_classes = []
for hunk_id in hunk_matches:
    hunk_class = classifier.classify_hunk('', hunk_matches[hunk_id]['class'])
    hunk_classes.append(hunk_class)

# Final patch classification
final_class = classifier.classify_patch(hunk_classes)
print(f"Patch Classification: {final_class}")

Data Structures

Patch Loader Output

{
    'hunks': [
        {'added': [...], 'removed': [...], 'context': [...]},
        ...
    ],
    'hashes': {hash_value: {'count': n, 'lines': [...]}, ...}
}

Source Loader Output

{
    'tokens': [...],
    'hashes': {hash_value: position, ...},
    'match_items': {hash: {'Match': True/False}, ...}
}

Match Results

{
    'hunk_0': {
        'class': 'PA',
        'matches': 25,
        'total': 30,
        'percentage': 83.33
    }
}

Performance Considerations

  • Hash-based: O(n) complexity for matching via hash-table lookups
  • N-gram size: Trade-off between precision and recall
  • Small n (1-2): Looser matching; more (possibly spurious) matches
  • Large n (4+): Stricter matching; fewer but more precise matches
  • File size: Larger files take longer to process
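The effect of n-gram size is easy to see on a toy example: shrinking n inflates the shared n-gram count, while a larger n admits only longer exact runs (illustrative token sequences):

```python
def ngrams(tokens, n):
    """All contiguous n-grams of a token sequence, as a set."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

a = "the quick brown fox jumps over the lazy dog".split()
b = "a quick brown fox leaps over the lazy cat".split()

for n in (1, 2, 4):
    shared = ngrams(a, n) & ngrams(b, n)
    print(n, len(shared))
# prints: 1 6 / 2 4 / 4 0 – no 4-token run survives the two substitutions
```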

API Reference

Patch classifier helpers.

This module contains helper functions used by the patch classification pipeline. The refactor preserves original logic but improves readability by adding type hints, docstrings, and removing commented-out code.

analyzer.classifier.process_patch(patch_path, dst_path, type_patch, file_ext)

Process a patch and its corresponding source traversal.

This wraps PatchLoader.traverse and SourceLoader.traverse, preserving the original try/except logging behavior.

Parameters:

  • patch_path (str, required): Path to the patch file.
  • dst_path (str, required): Path to the destination/source files.
  • type_patch (str, required): Type of patch (e.g., buggy/fixed).
  • file_ext (str, required): File extension being processed.

Returns:

  • Tuple[Any, Any]: Tuple of (patch_loader_instance, source_loader_instance).

Source code in analyzer/classifier.py
def process_patch(patch_path: str, dst_path: str, type_patch: str, file_ext: str) -> Tuple[Any, Any]:
    """Process a patch and its corresponding source traversal.

    This wraps `PatchLoader.traverse` and `SourceLoader.traverse`, preserving
    the original try/except logging behavior.

    Args:
        patch_path: Path to the patch file.
        dst_path: Path to the destination/source files.
        type_patch: Type of patch (e.g., buggy/fixed).
        file_ext: File extension being processed.

    Returns:
        Tuple of (patch_loader_instance, source_loader_instance).
    """
    common.ngram_size = constant.NGRAM_SIZE

    patch = patch_loader.PatchLoader()
    try:
        _ = patch.traverse(patch_path, type_patch, file_ext)
    except Exception as e:
        print("Error traversing patch:....", e)

    source = source_loader.SourceLoader()
    try:
        _ = source.traverse(dst_path, patch, file_ext)
    except Exception as e:
        print("Error traversing source (variant)....", e)

    return patch, source

analyzer.classifier.get_ext(filename)

Return the file extension for filename.

If no extension is present this returns an empty string.

Source code in analyzer/classifier.py
def get_ext(filename: str) -> str:
    """Return the file extension for `filename`.

    If no extension is present this returns an empty string.
    """
    parts = filename.rsplit('.', 1)
    return parts[-1] if len(parts) == 2 else ''
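For example (function copied from the listing above):

```python
def get_ext(filename: str) -> str:
    # Copied from the listing above.
    parts = filename.rsplit('.', 1)
    return parts[-1] if len(parts) == 2 else ''

print(get_ext('patch-1.patch'))   # 'patch'
print(get_ext('archive.tar.gz'))  # 'gz' – only the final extension is returned
print(get_ext('Makefile'))        # '' – no extension present
```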

analyzer.classifier.calculate_match_percentage(results, hashes)

Calculate percentage of matched items in results.

Parameters:

  • results (Dict[Any, Dict[str, Any]], required): Mapping of items to a dict containing a boolean under key 'Match'.
  • hashes (Dict[Any, Any], required): Mapping used to collect matched or unmatched items (kept for parity).

Returns:

  • float: Percentage (0-100) of matched items. Returns 0 if there are no items.

Source code in analyzer/classifier.py
def calculate_match_percentage(results: Dict[Any, Dict[str, Any]], hashes: Dict[Any, Any]) -> float:
    """Calculate percentage of matched items in `results`.

    Args:
        results: Mapping of items to a dict containing a boolean under key `'Match'`.
        hashes: Mapping used to collect matched or unmatched items (kept for parity).

    Returns:
        Percentage (0-100) of matched items. Returns 0 if there are no items.
    """
    total = 0
    matched = 0

    for h in results:
        total += 1
        if results[h].get('Match'):
            matched += 1
    return (matched / total) * 100 if total != 0 else 0.0
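For example (function copied from the listing above; the placeholder keys are illustrative):

```python
from typing import Any, Dict

def calculate_match_percentage(results: Dict[Any, Dict[str, Any]], hashes: Dict[Any, Any]) -> float:
    # Copied from the listing above.
    total = 0
    matched = 0
    for h in results:
        total += 1
        if results[h].get('Match'):
            matched += 1
    return (matched / total) * 100 if total != 0 else 0.0

results = {'h1': {'Match': True}, 'h2': {'Match': False},
           'h3': {'Match': True}, 'h4': {'Match': True}}
print(calculate_match_percentage(results, hashes={}))  # 75.0 – 3 of 4 items matched
```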

analyzer.classifier.find_hunk_matches(match_items, _type, important_hashes, source_hashes)

Find matches between hunks using hashed values.

Preserves original matching logic and return structure.

Source code in analyzer/classifier.py
def find_hunk_matches(match_items: Dict[Any, Any], _type: str, important_hashes: List[Any], source_hashes: List[Tuple[Any, Any]]) -> Dict[Any, Any]:
    """Find matches between hunks using hashed values.

    Preserves original matching logic and return structure.
    """
    seq_matches: Dict[Any, Any] = {}

    for patch_nr in match_items:
        seq_matches[patch_nr] = {'sequences': {}, 'class': ''}
        for patch_seq in match_items[patch_nr]:
            seq_matches[patch_nr]['sequences'][patch_seq] = {
                'count': 0,
                'hash_list': list(match_items[patch_nr][patch_seq].keys())
            }

            for k in match_items[patch_nr][patch_seq]:
                if match_items[patch_nr][patch_seq][k]:
                    seq_matches[patch_nr]['sequences'][patch_seq]['count'] += 1

    match_bool = True

    for seq_nr in seq_matches:
        for seq in seq_matches[seq_nr]['sequences']:
            if seq_matches[seq_nr]['sequences'][seq]['count'] < 2:
                match_bool = False
                break

        _class = ''
        if _type == 'MO':
            _class = _type if match_bool else 'MC'
        elif _type == 'PA':
            _class = _type if match_bool else 'MC'

        seq_matches[seq_nr]['class'] = _class

    return seq_matches

analyzer.classifier.classify_hunk(class_patch, class_buggy)

Classify a single hunk based on patch and buggy classifications.

Source code in analyzer/classifier.py
def classify_hunk(class_patch: str, class_buggy: str) -> str:
    """Classify a single hunk based on patch and buggy classifications.
    """
    final_class = ''
    if class_buggy == 'MC' and class_patch == 'PA':
        final_class = 'PA'
    if class_buggy == 'PA' and class_patch == 'MC':
        final_class = 'PA'
    if class_buggy == 'MC' and class_patch == 'MC':
        final_class = 'PN'
    if class_patch == '' and class_buggy != '':
        final_class = class_buggy
    if class_patch != '' and class_buggy == '':
        final_class = class_patch
    if class_patch == '' and class_buggy == '':
        final_class = 'PN'
    return final_class
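The decision table is easiest to read through a few calls (function copied from the listing above):

```python
def classify_hunk(class_patch: str, class_buggy: str) -> str:
    # Copied from the listing above.
    final_class = ''
    if class_buggy == 'MC' and class_patch == 'PA':
        final_class = 'PA'
    if class_buggy == 'PA' and class_patch == 'MC':
        final_class = 'PA'
    if class_buggy == 'MC' and class_patch == 'MC':
        final_class = 'PN'
    if class_patch == '' and class_buggy != '':
        final_class = class_buggy
    if class_patch != '' and class_buggy == '':
        final_class = class_patch
    if class_patch == '' and class_buggy == '':
        final_class = 'PN'
    return final_class

print(classify_hunk('PA', 'MC'))  # 'PA' – match on either side suffices
print(classify_hunk('MC', 'MC'))  # 'PN' – neither side matched
print(classify_hunk('', 'PA'))    # 'PA' – empty patch side falls back to the buggy side
```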

analyzer.classifier.classify_patch(hunk_classifications)

Determine patch-level classification from hunk classifications.

Source code in analyzer/classifier.py
def classify_patch(hunk_classifications: List[str]) -> str:
    """Determine patch-level classification from hunk classifications.
    """
    na_total = 0
    ed_total = 0

    final_class = ''
    for i in range(len(hunk_classifications)):
        if hunk_classifications[i] == 'PA':
            ed_total += 1
        elif hunk_classifications[i] == 'PN':
            na_total += 1

    final_class = 'PN' if ed_total == 0 else 'PA'
    return final_class
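In condensed form the rule is simply "any PA hunk makes the whole patch PA" (a behavior-equivalent sketch, not the listing itself):

```python
def classify_patch(hunk_classifications):
    # Behavior-equivalent to the listing above: one applied hunk suffices.
    return 'PA' if any(c == 'PA' for c in hunk_classifications) else 'PN'

print(classify_patch(['PN', 'PA', 'PN']))  # 'PA'
print(classify_patch(['PN', 'PN']))        # 'PN'
print(classify_patch([]))                  # 'PN' – no hunks means nothing applied
```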

analyzer.classifier.find_hunk_matches_w_important_hash(match_items, _type, important_hashes, source_hashes)

Find hunk matches using important hashes feature.

Preserves original behavior and return structure.

Source code in analyzer/classifier.py
def find_hunk_matches_w_important_hash(match_items: Dict[Any, Any], _type: str, important_hashes: List[Any], source_hashes: List[Tuple[Any, Any]]) -> Dict[Any, Any]:
    """Find hunk matches using important hashes feature.

    Preserves original behavior and return structure.
    """
    seq_matches: Dict[Any, Any] = {}
    test: List[Any] = []
    for lines in important_hashes:
        for line in lines:
            for each in line:
                for ngram, hash_list in source_hashes:
                    if each in ngram:
                        test.append(hash_list)

    important_hash_match = 0
    for patch_nr in match_items:
        match_bool = False
        seq_matches[patch_nr] = {'sequences': {}, 'class': ''}
        for patch_seq in match_items[patch_nr]:
            seq_matches[patch_nr]['sequences'][patch_seq] = {
                'count': 0,
                'hash_list': list(match_items[patch_nr][patch_seq].keys())
            }

            if seq_matches[patch_nr]['sequences'][patch_seq]['hash_list'] in test:
                seq_matches[patch_nr]['sequences'][patch_seq]['important'] = True
                important_hash_match += 1
                match_bool = True
            else:
                seq_matches[patch_nr]['sequences'][patch_seq]['important'] = False

            for k in match_items[patch_nr][patch_seq]:
                if match_items[patch_nr][patch_seq][k]:
                    seq_matches[patch_nr]['sequences'][patch_seq]['count'] += 1

        seq_matches[patch_nr]['class'] = _type if match_bool else 'MC'

    return seq_matches

analyzer.classifier.cal_similarity_ratio(source_hashes, added_lines_hashes)

Calculate similarity ratio between source hashes and added lines hashes.

Source code in analyzer/classifier.py
def cal_similarity_ratio(source_hashes: List[Tuple[Any, Any]], added_lines_hashes: List[List[List[Any]]]) -> float:
    """Calculate similarity ratio between source hashes and added lines hashes.
    """
    count_matches: List[Any] = []

    for lines in added_lines_hashes:
        for line in lines:
            for each in line:
                for ngram, hash_list in source_hashes:
                    if each == ngram:
                        count_matches.append(ngram)

    s_hashes: List[Any] = [ngram for ngram, _ in source_hashes]

    try:
        unique_matches = list(set(count_matches))
        unique_source_hashes = list(set(s_hashes))
        per = (len(unique_matches) / len(unique_source_hashes)) * 100
        return per
    except Exception:
        return 0.0
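A worked example with the expected input shapes (hash values are placeholders; the function body is condensed from the listing above but computes the same result):

```python
from typing import Any, List, Tuple

def cal_similarity_ratio(source_hashes: List[Tuple[Any, Any]],
                         added_lines_hashes: List[List[List[Any]]]) -> float:
    # Condensed from the listing above; same matching and percentage logic.
    count_matches = []
    for lines in added_lines_hashes:
        for line in lines:
            for each in line:
                for ngram, hash_list in source_hashes:
                    if each == ngram:
                        count_matches.append(ngram)
    s_hashes = [ngram for ngram, _ in source_hashes]
    try:
        return (len(set(count_matches)) / len(set(s_hashes))) * 100
    except Exception:
        return 0.0

source_hashes = [('h1', [0]), ('h2', [1]), ('h3', [2]), ('h4', [3])]
added = [[['h1', 'h2', 'zz']]]   # one hunk → one line → its n-gram hashes
print(cal_similarity_ratio(source_hashes, added))  # 50.0 – 2 of 4 unique source hashes matched
```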

See Also