JPlag is unable to handle 'ć' character #1612

Closed
uuqjz opened this issue Feb 26, 2024 · 9 comments · Fixed by #1613

Labels
duplicate: This has been discussed somewhere else
enhancement: Issue/PR that involves features, improvements and other changes
minor: Minor issue/feature/contribution/change

Comments

@uuqjz
Contributor

uuqjz commented Feb 26, 2024

When trying to check a Spanish C++ dataset, almost all submissions are discarded for containing the character 'ć'.
I am using 5.1.0 from the dev branch.
The dataset is linked below:
z5z5.zip

@uuqjz uuqjz added the enhancement and minor labels Feb 26, 2024
@TwoOfTwelve
Contributor

This error is not caused by JPlag directly. This is the same issue as #1427. There is already an ANTLR issue related to this: antlr/grammars-v4#3952.

It seems that the character only occurs in comments, so as a workaround you could write a script that deletes all comments and then run JPlag on the result.
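
A minimal sketch of such a comment-stripping script, in Python. This is not part of JPlag, and `strip_cpp_comments` is a hypothetical helper name. It assumes ordinary string and character literals only: raw string literals (`R"(...)"`), C++14 digit separators (`1'000`), and backslash line continuations inside `//` comments are not handled.

```python
def strip_cpp_comments(source: str) -> str:
    """Remove // and /* */ comments, leaving string and char literals intact."""
    out = []
    i, n = 0, len(source)
    while i < n:
        ch = source[i]
        if ch in "\"'":
            # Copy a string or character literal verbatim, honoring backslash escapes.
            j = i + 1
            while j < n and source[j] != ch:
                j += 2 if source[j] == "\\" else 1
            j = min(j + 1, n)
            out.append(source[i:j])
            i = j
        elif source.startswith("//", i):
            # Line comment: skip to the end of the line, keep the newline itself.
            j = source.find("\n", i)
            i = n if j == -1 else j
        elif source.startswith("/*", i):
            # Block comment: keep its newlines so line numbers stay unchanged,
            # or emit one space so neighboring tokens do not merge.
            j = source.find("*/", i + 2)
            end = n if j == -1 else j + 2
            out.append("\n" * source.count("\n", i, end) or " ")
            i = end
        else:
            out.append(ch)
            i += 1
    return "".join(out)
```

Keeping the newlines means that line numbers reported for the stripped files still match the original submissions.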

@tsaglam tsaglam added the duplicate This has been discussed somewhere else label Feb 26, 2024
@tsaglam
Member

tsaglam commented Feb 26, 2024

If you want to keep comments, another workaround might be to remove all non-ASCII characters from the comments via a script.
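
A rough sketch of that second workaround, in Python. As a simplification it drops non-ASCII bytes from the whole file rather than from comments only, so non-ASCII text inside string literals would be altered as well. It works on raw bytes so that the file encoding does not need to be known. The helper names and the suffix list are hypothetical.

```python
from pathlib import Path


def strip_non_ascii(data: bytes) -> bytes:
    """Drop every byte >= 0x80; all ASCII bytes pass through unchanged."""
    return bytes(b for b in data if b < 0x80)


def clean_tree(root: str, suffixes=(".cpp", ".h", ".hpp")) -> int:
    """Rewrite matching files under root in place; return how many changed."""
    changed = 0
    for path in Path(root).rglob("*"):
        if path.is_file() and path.suffix.lower() in suffixes:
            raw = path.read_bytes()
            cleaned = strip_non_ascii(raw)
            if cleaned != raw:
                path.write_bytes(cleaned)
                changed += 1
    return changed


print(strip_non_ascii("Zadaća 5".encode("utf-8")))  # b'Zadaa 5'
```

Since this rewrites files in place, it is safer to run it on a copy of the dataset.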

@uuqjz
Contributor Author

uuqjz commented Feb 26, 2024

After removing every non-ASCII character, I get this error:
line 1:3 token recognition error at: ''
The line in question is
/�B�2017/2018: Zadaa 5, Zadatak 4
It seems like these are start-of-heading control codes, which have to be removed too.
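
To see which bytes are actually involved before deleting anything, a small diagnostic could look like the sketch below. The sample line is made up: it assumes the stray bytes are SOH (0x01), which is only a guess at what the file really contains. Both helper names are hypothetical, and the byte columns reported here are not guaranteed to line up with the character positions in the parser's error message.

```python
ALLOWED_CONTROLS = {0x09, 0x0A, 0x0D}  # tab, line feed, carriage return


def is_clean(b: int) -> bool:
    """True for printable ASCII and the whitespace controls listed above."""
    return 0x20 <= b < 0x7F or b in ALLOWED_CONTROLS


def find_suspect_bytes(data: bytes):
    """Yield (line, column, byte) per suspect byte; line is 1-based, column 0-based."""
    line, col = 1, 0
    for b in data:
        if not is_clean(b):
            yield line, col, b
        if b == 0x0A:
            line, col = line + 1, 0
        else:
            col += 1


def keep_printable_ascii(data: bytes) -> bytes:
    """Drop control codes, DEL, and non-ASCII bytes in one pass."""
    return bytes(b for b in data if is_clean(b))


sample = b"/\x01B\x012017/2018\nint x;\n"  # assumed content, for illustration only
for line, col, b in find_suspect_bytes(sample):
    print(f"line {line}:{col} byte 0x{b:02x}")
```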

@uuqjz
Contributor Author

uuqjz commented Feb 26, 2024

Even after this preprocessing those files cannot be parsed:
failed to parse 'student9307.cpp'Cannot invoke "de.jplag.cpp.grammar.CPP14Parser$DeclaratorContext.pointerDeclarator()" because the return value of "de.jplag.cpp.grammar.CPP14Parser$ParameterDeclarationContext.declarator()" is null

@TwoOfTwelve
Contributor

TwoOfTwelve commented Feb 26, 2024

> After removing every non-ASCII character, I get this error:
> line 1:3 token recognition error at: ''
> The line in question is
> /�B�2017/2018: Zadaa 5, Zadatak 4
> It seems like these are start-of-heading control codes, which have to be removed too.

This is probably caused by an encoding issue. I think your files might be encoded in UTF-16 or something, but there is very little for the heuristic to actually go by. We might want to include an encoding flag in the future.
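
Until something like that exists, one option is to normalize the submissions to UTF-8 before running JPlag. The sketch below is not JPlag's own heuristic. It checks for a byte order mark, then tries strict UTF-8, then falls back to a legacy code page. The `cp1250` fallback is purely an assumption, and UTF-16 without a BOM is not detected at all. The function names are hypothetical.

```python
import codecs

# UTF-32 must be checked before UTF-16: the UTF-32 LE BOM starts with the UTF-16 LE BOM.
BOMS = [
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def guess_encoding(data: bytes, fallback: str = "cp1250") -> str:
    """Return a codec name: BOM match first, then strict UTF-8, then the fallback."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        return fallback  # assumed legacy code page, adjust per dataset


def to_utf8(data: bytes, fallback: str = "cp1250") -> bytes:
    """Decode with the guessed codec and re-encode as UTF-8 without a BOM."""
    text = data.decode(guess_encoding(data, fallback), errors="replace")
    return text.encode("utf-8")
```

The "utf-16" and "utf-32" codecs consume the BOM and pick the byte order from it, so the re-encoded output starts directly with the file content.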

I will look into the second error later, but it also seems to be caused by ANTLR.

@TwoOfTwelve
Contributor

I looked a little more into the first line of the files, and I don't know how it ever came to be. It certainly does not look like valid C++ code.

The null pointer issue should be fixed in #1613.

@uuqjz
Copy link
Contributor Author

uuqjz commented Feb 26, 2024

@TwoOfTwelve
Contributor

With the fix applied and the first line removed, JPlag runs on my machine.

@uuqjz
Contributor Author

uuqjz commented Feb 26, 2024

Awesome, the fix works


3 participants