Feat request: multiple programming languages #1546
Comments
Good point, this relates to cross-language plagiarism detection. While there has been some research in that area, there are (to my knowledge) no usable tools for it. In the future, we may want to introduce this by creating a shared set of token types for concepts that are common to several languages. Language modules could then reuse these token types, which would allow for cross-language support.
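To make the shared token type idea concrete, here is a minimal sketch. Everything in it is an assumption for illustration: `SharedTokenType`, `SharedTokenDemo` and `longestCommonRun` are hypothetical names, not part of JPlag's API, and the longest-common-run comparison is only a simple stand-in for JPlag's real matching (Greedy String Tiling).

```java
import java.util.List;

// Hypothetical shared vocabulary for concepts common to several languages.
// These names are illustrative and do not come from JPlag.
enum SharedTokenType { CLASS_DEF, METHOD_DEF, VAR_DECL, ASSIGN, LOOP, BRANCH, CALL, RETURN }

final class SharedTokenDemo {
    // Length of the longest contiguous run of identical token types that
    // appears in both sequences (a stand-in for a real matching algorithm).
    static int longestCommonRun(List<SharedTokenType> a, List<SharedTokenType> b) {
        int best = 0;
        // run[i][j] = length of the common run ending at a[i-1] and b[j-1]
        int[][] run = new int[a.size() + 1][b.size() + 1];
        for (int i = 1; i <= a.size(); i++) {
            for (int j = 1; j <= b.size(); j++) {
                if (a.get(i - 1) == b.get(j - 1)) {
                    run[i][j] = run[i - 1][j - 1] + 1;
                    best = Math.max(best, run[i][j]);
                }
            }
        }
        return best;
    }
}
```

Because both language modules would emit the same token types, a Java submission and a Python submission become directly comparable: the comparison only ever sees `SharedTokenType` values and never needs to know which language produced them.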
Hello, this has been done in this fork: https://github.com/euberdeveloper/JPlag/tree/feature/multilanguage-plagiarism-detection A pull request will follow.
We have our own ideas for that, but we are happy to look at yours. Keep in mind that these might be major changes that need to take other upcoming changes and API considerations into account, and that must not break existing features (e.g. token sequence normalization or match merging).
I think what I've done is more of a proof of concept.
To speed up the process, I made the single-language front ends first produce their default language-specific tokens, and then wrote a converter that maps those tokens to general ones. I don't recommend this approach: the results are not good, and many of the issues could be fixed by obtaining language-agnostic tokens directly, by parsing the source code from scratch. I will implement this improvement soon.
Another improvement I want to make is to have the set of language-agnostic tokens be dynamic. Each language would implement methods such as "supportsClasses" or "has variable declarations". For example, C would return false for the first and true for the second, Python would return true for the first and false for the second, and Java would return true for both. The language-agnostic tokenizer for each language would then receive the full set of languages in the run as an additional parameter and change its behaviour based on what those languages support. For example, if Java, Python and C are provided, the Java tokenizer will discard class tokens. If only Java and Python are provided, the Java tokenizer will emit class tokens.
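A rough sketch of how this capability check could look, under stated assumptions: `GeneralToken`, `Lang`, `RunAwareFilter` and the field names are hypothetical and only mirror the "supportsClasses" / "has variable declarations" methods mentioned above. A token type is kept only if every language in the run supports the corresponding concept.

```java
import java.util.ArrayList;
import java.util.EnumSet;
import java.util.List;
import java.util.Set;

// Hypothetical language-agnostic token types (illustrative only).
enum GeneralToken { CLASS_DEF, METHOD_DEF, VAR_DECL, ASSIGN }

// Hypothetical capability flags per language, following the example values
// given in the comment above (C: no classes, Python: no declarations).
enum Lang {
    C(false, true),
    PYTHON(true, false),
    JAVA(true, true);

    final boolean supportsClasses;
    final boolean hasVariableDeclarations;

    Lang(boolean supportsClasses, boolean hasVariableDeclarations) {
        this.supportsClasses = supportsClasses;
        this.hasVariableDeclarations = hasVariableDeclarations;
    }
}

final class RunAwareFilter {
    // A token type stays allowed only if all languages in the run support it.
    static Set<GeneralToken> allowedTokens(Set<Lang> languagesInRun) {
        EnumSet<GeneralToken> allowed = EnumSet.allOf(GeneralToken.class);
        for (Lang lang : languagesInRun) {
            if (!lang.supportsClasses) allowed.remove(GeneralToken.CLASS_DEF);
            if (!lang.hasVariableDeclarations) allowed.remove(GeneralToken.VAR_DECL);
        }
        return allowed;
    }

    // Drops every token whose type is not shared by all languages in the run.
    static List<GeneralToken> filter(List<GeneralToken> tokens, Set<Lang> languagesInRun) {
        Set<GeneralToken> allowed = allowedTokens(languagesInRun);
        List<GeneralToken> kept = new ArrayList<>();
        for (GeneralToken token : tokens) {
            if (allowed.contains(token)) kept.add(token);
        }
        return kept;
    }
}
```

With a run of Java, Python and C, a Java token stream would lose both its class and its variable declaration tokens. With only Java and Python, the class tokens are kept and only the variable declarations are dropped. In a real tokenizer the check would more likely happen while emitting tokens rather than as a filter afterwards; the filter form is used here only to keep the sketch short.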
I have some work in progress on this.
Note that we have our own plans here that might conflict with yours. But we are always happy to look at your ideas for inspiration.
As of now, JPlag seems to support multiple programming languages, but only in a homogeneous way.
This means that I can compare two submissions that are both in Java or both in Python, but not one in Java and one in Python.
Such a comparison may seem pointless, but translating a program from one language to another could actually be a form of obfuscation.
Java and Python may not be the perfect example, but for languages such as Java, Kotlin, or Scala, which all run on the JVM, this issue becomes more relevant.