
SuperC-less tokenizer fails to correctly separate strings with a large number of special characters #281

Open
lolrepeatlol opened this issue Oct 25, 2024 · 0 comments


Summary

  • kmax's built-in, SuperC-less tokenizer can incorrectly mark the end of strings, particularly when a string contains many special characters.
  • This happens because the current tokenizer, get_tokens(), is somewhat simplistic in how it interprets some inputs, and produces tokens that are too granular for analyze_c_tokens to tag correctly.
  • This string-detection issue compounds on itself. In the instance shown below, preprocessor directives end up classified as "c" tokens rather than "preprocessor" tokens because they are believed to be inside a string.
  • Because the #endif directive is never recognized as a preprocessor directive, the conditional at the top of the stack is never popped.
    • This leads to an AssertionError when kmax attempts to find another conditional to match.
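The failure mode can be reproduced in isolation. Below is a minimal standalone sketch (my own illustration, not kmax's actual code) of the per-token quote toggling that analyze_c_tokens performs, showing why it desynchronizes on the tokens from the debug log further down:

```python
# Standalone sketch (not kmax's code) of the quote handling in
# analyze_c_tokens: in_quotes flips at most once per token containing a
# double quote, no matter how many quote characters the token stands for.
def naive_in_quotes(tokens):
    in_quotes = False
    for token in tokens:
        if '"' in token:
            in_quotes = not in_quotes
    return in_quotes

# A string tokenized as  "  body  "  balances out:
assert naive_in_quotes(['"', 'abcdefghijklmnopqrstuvwxyz', '"']) is False

# But the *special string is split so its escaped \" arrives as a separate
# '"' token; three quote-bearing tokens leave the state stuck at True:
assert naive_in_quotes(['"', '.,:;', "'", '\\', '"', '`~!^-+', '"', ';']) is True
```

The second assertion mirrors exactly the token sequence for line 995 in the debug output below.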

Steps to reproduce

Steps followed

To get a repaired configuration file for a commit range, I followed these steps:

  1. Clone the Linux kernel with git clone git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git and enter the directory.
  2. Create a diff from a range of patches using git diff {commit1}..{commit2} > patchset.diff. I initially encountered this issue with the range 52afb15e9d9a021ab6eec923a087ec9f518cb713 to 0253d718a070ba109046299847fe8f3cf7568c3c.
  3. Check out the source code at the latest of the two commits and create a kernel configuration file with a command like make defconfig.
  4. Run klocalizer with klocalizer --repair .config -a x86_64 --include-mutex patchset.diff --verbose.

What I expected to happen

  1. I expected klocalizer to repair the kernel configuration file.

What actually happened

  1. klocalizer runs into an AssertionError:
DEBUG: Doing syntax analysis on "drivers/gpu/drm/nouveau/nvif/object.c" to get constrained line ranges.
DEBUG: Syntax analysis for "drivers/gpu/drm/nouveau/nvif/object.c" found 22 unconstrained lines, 0 lines are remaining for presence condition analysis.
DEBUG: Doing syntax analysis on "drivers/gpu/drm/nouveau/nvkm/engine/disp/r535.c" to get constrained line ranges.
DEBUG: Syntax analysis for "drivers/gpu/drm/nouveau/nvkm/engine/disp/r535.c" found 2 unconstrained lines, 0 lines are remaining for presence condition analysis.
DEBUG: Doing syntax analysis on "drivers/gpu/drm/nouveau/nvkm/subdev/gsp/r535.c" to get constrained line ranges.
DEBUG: Syntax analysis for "drivers/gpu/drm/nouveau/nvkm/subdev/gsp/r535.c" found 405 unconstrained lines, 0 lines are remaining for presence condition analysis.
DEBUG: Doing syntax analysis on "drivers/gpu/drm/omapdrm/omap_dmm_tiler.c" to get constrained line ranges.
Traceback (most recent call last):
  File "/home/alexei/IDEProjects/PyCharmProjects/kmax/venv/bin/klocalizer", line 7, in <module>
    exec(compile(f.read(), __file__, 'exec'))
  File "/home/alexei/IDEProjects/PyCharmProjects/kmax/kmax/klocalizer", line 1740, in <module>
    klocalizerCLI()
  File "/home/alexei/IDEProjects/PyCharmProjects/kmax/kmax/klocalizer", line 815, in klocalizerCLI
    root_cb = SyntaxAnalysis.get_conditional_blocks_of_file(srcfile_fullpath)
  File "/home/alexei/IDEProjects/PyCharmProjects/kmax/kmax/superc.py", line 807, in get_conditional_blocks_of_file
    cb = SyntaxAnalysis.get_conditional_blocks(content, line_count)
  File "/home/alexei/IDEProjects/PyCharmProjects/kmax/kmax/superc.py", line 782, in get_conditional_blocks
    assert len(stack) == 1
AssertionError
(venv) alexei@turing:~/LinuxKernels/kmax_stress_testing/linux_rand500$
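For context, the assertion that fails enforces a push/pop balance over preprocessor conditionals. A simplified, hypothetical sketch of that invariant (not the actual get_conditional_blocks implementation) shows why a misclassified #endif trips it:

```python
# Hypothetical sketch of the stack discipline get_conditional_blocks relies
# on: each #if/#ifdef/#ifndef pushes a conditional block, each #endif pops
# one. If an #endif is tagged as C code (as in this bug), the matching push
# is never popped and the final balance check fails.
def check_balance(directives):
    stack = ["root"]  # sentinel for the whole file
    for d in directives:
        if d in ("if", "ifdef", "ifndef"):
            stack.append(d)
        elif d == "endif":
            stack.pop()
    return len(stack) == 1  # the assert in get_conditional_blocks

assert check_balance(["ifdef", "endif"]) is True
# The #endif at line 1159 never reaches the stack logic, so:
assert check_balance(["ifdef"]) is False
```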

Additional information

While debugging, I added debug print statements to analyze_c_tokens and focused on when in_quotes changes. The instrumented code is shown below:

    def analyze_c_tokens(tokens_w_line_nums):
        """
        TODO: document
        determines what kind of code each token is (c, comment, or preprocessor).
        returns a map between line number and a list of (token mapped to type of code)
        """
        analyzed_tokens = {}

        in_quotes = False
        in_single_line_comment = False
        in_preprocessor = False
        prev_line_num = 0
        in_comment = False
        continued_preprocessor = False

        for token, line_num in tokens_w_line_nums:
            if len(token) < 1:
                continue

            print(f"\nDEBUG processing tkn '{token}', line {line_num}")
            print(f"  in_quotes current state: {in_quotes}")

            if line_num == prev_line_num:
                pass
            else:
                print(f"\nDEBUG state @ line {line_num}:")
                print(f"  prev preprocessor state: {in_preprocessor}")
                print(f"  contd preprocessor: {continued_preprocessor}")

                if not continued_preprocessor:
                    in_preprocessor = False
                    print(f"  resetting preprocessor state to {in_preprocessor}")
                if in_single_line_comment:
                    in_comment = False
                    in_single_line_comment = False
                analyzed_tokens[line_num] = []

            if token[0] == '#':
                print(f"DEBUG # check @ line {line_num}:")
                print(f"  in_comment: {in_comment}")
                print(f"  in_quotes: {in_quotes}")
                print(f"  token[0] == '#': {token[0] == '#'}")

            # preprocessor check
            if (not in_comment) and (not in_quotes) and token[0] == '#':
                print(f"\nDEBUG found # @ line {line_num}:")
                print(f"  prev state: {in_preprocessor}")
                in_preprocessor = True
                print(f"  new state: {in_preprocessor}")

            print(f"  before quote check, in_quotes: {in_quotes}")

            if (not in_preprocessor) and (not in_comment) and ("\"" in token):
                print(f"DEBUG quote found in token '{token}' @ line {line_num}")
                print(f"  current in_quotes: {in_quotes}")
                in_quotes = not in_quotes
                print(f"  new in_quotes: {in_quotes}")

            print(f"  after quote check, in_quotes: {in_quotes}")

            if (not in_quotes) and (not in_comment) and ("//" in token):
                in_single_line_comment = True
                in_comment = True

            if (not in_quotes) and (not in_comment) and ("/*" in token):
                in_comment = True

            # add the token with code type
            current_type = "preprocessor" if in_preprocessor else ("comment" if in_comment else "c")
            print(f"  token: {token}, type: {current_type}")

            if in_comment:
                analyzed_tokens[line_num].append({token: "comment"})
            elif in_preprocessor:
                # handle case where no space between directive and parenthesis
                found_directive = False  # track if directive found
                for directive in ['if', 'ifdef', 'ifndef', 'elif', 'else', 'endif']:
                    if token.startswith(directive + '('):
                        directive_token = directive
                        remaining_token = token[len(directive):]  # capture remaining part by slicing at length of directive
                        analyzed_tokens[line_num].append({directive_token: "preprocessor"})  # add directive
                        if remaining_token:
                            analyzed_tokens[line_num].append({remaining_token: "preprocessor"})  # add remaining token
                            found_directive = True
                        break

                # if no directive found: just add token as whole
                if not found_directive:
                    analyzed_tokens[line_num].append({token: "preprocessor"})
            else:
                analyzed_tokens[line_num].append({token: "c"})

            if in_comment and ("*/" in token):
                in_comment = False

            if token == '\\':
                continued_preprocessor = True
                print(f" continuation found! setting continued_preprocessor: {continued_preprocessor}")
            else:
                continued_preprocessor = False

            print(f"DEBUG loop end. in_quotes: {in_quotes}")
            prev_line_num = line_num
        return analyzed_tokens

For context, I had inserted other debug print statements elsewhere, especially in get_conditional_blocks(), because I initially believed the preprocessor conditional wasn't being closed. That turned out not to be the case, so I investigated related functions like analyze_c_tokens(). The actual problem stems from a string full of special characters in drivers/gpu/drm/omapdrm/omap_dmm_tiler.c:

...
/*
 * debugfs support
 */

#ifdef CONFIG_DEBUG_FS //line 991

static const char *alphabet = "abcdefghijklmnopqrstuvwxyz"
				"ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789";
static const char *special = ".,:;'\"`~!^-+"; //this is the problem

static void fill_map(char **map, int xdiv, int ydiv, struct tcm_area *a,
							char c, bool ovw)
...
	}

error:
	kfree(map);
	kfree(global_map);

	return 0;
}
#endif  //associated endif at line 1159

The debug output is available below. In particular, note how the *alphabet string is tokenized as complete quoted pieces, while *special is split into several fragments, which confuses analyze_c_tokens(). Finally, note how the #endif preprocessor conditional is believed to be in_quotes and is tagged as C code.

DEBUG processing tkn 'ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789', line 994  # the tokenizer properly accounts for a FULL string
  in_quotes current state: True
  before quote check, in_quotes: True
  after quote check, in_quotes: True
  token: ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789, type: c
DEBUG loop end. in_quotes: True

DEBUG processing tkn '"', line 994
  in_quotes current state: True
  before quote check, in_quotes: True
DEBUG quote found in token '"' @ line 994
  current in_quotes: True
  new in_quotes: False
  after quote check, in_quotes: False
  token: ", type: c
DEBUG loop end. in_quotes: False

... [continued] ...

DEBUG processing tkn '=', line 995
  in_quotes current state: False
  before quote check, in_quotes: False
  after quote check, in_quotes: False
  token: =, type: c
DEBUG loop end. in_quotes: False

DEBUG processing tkn '"', line 995
  in_quotes current state: False
  before quote check, in_quotes: False
DEBUG quote found in token '"' @ line 995
  current in_quotes: False
  new in_quotes: True
  after quote check, in_quotes: True
  token: ", type: c
DEBUG loop end. in_quotes: True

DEBUG processing tkn '.,:;', line 995  # one part of the string
  in_quotes current state: True
  before quote check, in_quotes: True
  after quote check, in_quotes: True
  token: .,:;, type: c
DEBUG loop end. in_quotes: True

DEBUG processing tkn ''', line 995
  in_quotes current state: True
  before quote check, in_quotes: True
  after quote check, in_quotes: True
  token: ', type: c
DEBUG loop end. in_quotes: True

DEBUG processing tkn '\', line 995
  in_quotes current state: True
  before quote check, in_quotes: True
  after quote check, in_quotes: True
  token: \, type: c
 continuation found! setting continued_preprocessor: True  # another issue, likely caused by this bug?
DEBUG loop end. in_quotes: True

DEBUG processing tkn '"', line 995
  in_quotes current state: True
  before quote check, in_quotes: True
DEBUG quote found in token '"' @ line 995
  current in_quotes: True
  new in_quotes: False
  after quote check, in_quotes: False
  token: ", type: c
DEBUG loop end. in_quotes: False

DEBUG processing tkn '`~!^-+', line 995  # another part of the string
  in_quotes current state: False
  before quote check, in_quotes: False
  after quote check, in_quotes: False
  token: `~!^-+, type: c
DEBUG loop end. in_quotes: False

DEBUG processing tkn '"', line 995
  in_quotes current state: False
  before quote check, in_quotes: False
DEBUG quote found in token '"' @ line 995
  current in_quotes: False
  new in_quotes: True
  after quote check, in_quotes: True
  token: ", type: c
DEBUG loop end. in_quotes: True

DEBUG processing tkn ';', line 995
  in_quotes current state: True
  before quote check, in_quotes: True
  after quote check, in_quotes: True
  token: ;, type: c
DEBUG loop end. in_quotes: True   # NOTICE: still believed to be "in_quotes"

... [continued] ...

DEBUG processing tkn '#', line 1159
  in_quotes current state: True

DEBUG state @ line 1159:
  prev preprocessor state: False
  contd preprocessor: False
  resetting preprocessor state to False
DEBUG # check @ line 1159:
  in_comment: False
  in_quotes: True
  token[0] == '#': True
  before quote check, in_quotes: True
  after quote check, in_quotes: True
  token: #, type: c
DEBUG loop end. in_quotes: True

DEBUG processing tkn 'endif', line 1159
  in_quotes current state: True
  before quote check, in_quotes: True  # should NOT be "in_quotes"
  after quote check, in_quotes: True
  token: endif, type: c  # should NOT be "C code"
DEBUG loop end. in_quotes: True
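One possible direction for a fix, sketched below under the assumption that the tokenizer keeps emitting the backslash and the following quote as separate tokens: scan each token character by character and carry an escape flag across token boundaries, so an escaped \" never toggles in_quotes. This is an illustration of the idea, not a tested patch:

```python
# Sketch of cross-token quote tracking: the escape flag survives token
# boundaries, so a '\\' token followed by a separate '"' token does not
# toggle in_quotes (the case that breaks on line 995 above).
def track_quotes(tokens):
    in_quotes = False
    escaped = False
    for token in tokens:
        for ch in token:
            if escaped:
                escaped = False  # this character is escaped; never toggles
                continue
            if ch == '\\':
                escaped = True
            elif ch == '"':
                in_quotes = not in_quotes
    return in_quotes

# With the exact token sequence from the debug log, the state now balances:
special_tokens = ['"', '.,:;', "'", '\\', '"', '`~!^-+', '"', ';']
assert track_quotes(special_tokens) is False
```

This glosses over real C lexing details (character literals, comments, backslash-newline continuations, whitespace lost between tokens), so a proper fix would likely want tokens that carry source offsets rather than bare strings.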