Draft How to Add the Capability to Work with Another Programming Language in JPlag
As of Version 3.0.0, JPlag supports 7 types of input, including Java, C++, C#, Python, R, and natural language style text. If your favorite programming language is still missing, learn here how to create a frontend for it.
- an implementation of the relevant interfaces Language and Token
- a Parser for your programming language
- an Adapter to adapt the proprietary Tokens of the Parser to JPlag Tokens
- an abstraction strategy to decide which set of tokens is relevant to the structure of the file
In the JPlag module, create a new submodule for your language. Set the aggregator JPlag POM as the parent POM for your frontend.
```xml
<parent>
    <groupId>de.jplag</groupId>
    <artifactId>aggregator</artifactId>
    <version>${revision}</version>
</parent>
```
For all Java classes needed (for a hypothetical programming language 'MyLang'), we suggest creating them in the package `de.jplag.mylang`. Any new submodule must be registered in the aggregator POM:
```xml
<modules>
    <module>jplag.frontend.java</module>
    <module>jplag.frontend.csharp-6</module>
    ...
    <module>jplag.frontend.mylang</module>
</modules>
```
Furthermore, all frontends rely on the interfaces supplied by the `frontend-utils` submodule of JPlag.
```xml
<dependencies>
    <dependency>
        <groupId>de.jplag</groupId>
        <artifactId>frontend-utils</artifactId>
    </dependency>
</dependencies>
```
Add your new module as a dependency to the `aggregator` and `JPlag` modules.
```xml
<dependency>
    <groupId>${project.groupId}</groupId>
    <artifactId>mylang</artifactId>
    <version>${project.version}</version>
</dependency>
```
For the JPlag CLI to recognize your language as an option, add it to `LanguageOption`, providing its name and the fully qualified name of the corresponding `Language` class.
```java
public enum LanguageOption {
    JAVA("java", "de.jplag.java.Language"),
    PYTHON_3("python3", "de.jplag.python3.Language"),
    // ...
    MYLANGUAGE("mylang", "de.jplag.mylang.Language");
    // ...
}
```
It will then appear in the CLI's help output automatically.
The `Language` interface represents programming languages that JPlag can work with. It incorporates a set of properties specific to the language, such as the standard file extensions, and a set of configurations that modify the behavior of JPlag when processing submissions.
The abstract class `Token` represents a syntactic[<sup>1)</sup>](#footnote-1) unit of code that appears in a submission, e.g. an assignment, a variable definition, or the beginning or end of a method. A `Token` object stores its position in the source code and, most crucially, its type (an integer). What counts as a `Token` varies greatly between `Language`s, so you need to supply:
- a custom map of token types to their corresponding numbers (`CLASS_BEGIN` = 2, `IMPORT` = 3, ...)[<sup>2)</sup>](#footnote-2),
- a concrete `Token` class.
An easy way to realise the map is to have the `Token` class implement a special `TokenConstants` interface that contains an integer constant for each type of `Token`. This way, the constants are easily accessible in your `Token` class while keeping their definitions separate.
![MyLangToken Class Diagram](https://lh6.googleusercontent.com/_lmH5gaz4gG7yUziUXlp70KYzSGN3O6tniVFManRmJbhlHUOyVF15WRs1xpg4Austqg13dFkQCLaaL6SNSz6=w1872-h1048)
As may be noticeable from the example `Token` types above, the token types themselves need to account for the nested structure of the code, because the `TokenList` data structure cannot save any kind of depth information that a tree could. Structural elements like the bodies of classes, methods and control flow statements must be represented as a `BEGIN` token and an `END` token. For "leaf" elements of which you do not want to save the inner structure (e.g. assignment expressions), a single token suffices.
<sup><a name="footnote-1">1)</a></sup> Read _syntactic_ as opposed to _lexical_ units, which is what the term _token_ usually refers to. As JPlag generally operates on a rather coarse-grained structural abstraction of the code, most lexical tokens like operators or even variable names are not reflected in the `TokenList`, see [4. Abstraction Strategy](#4-abstraction-strategy).
<sup><a name="footnote-2">2)</a></sup> Tokens 0 and 1 are universal for all types of languages, so language-specific tokens start at 2.
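The constants-interface approach and the `BEGIN`/`END` encoding described above can be sketched as follows. This is only an illustration under assumptions: `MyLangTokenConstants`, `MyLangToken` and `MyLangTokenDemo` are hypothetical names, the names of the two universal tokens are assumed, and the stand-in token class does not extend JPlag's real abstract `Token` class (whose constructor and abstract methods are not reproduced here).

```java
import java.util.List;
import java.util.stream.Collectors;

/** Hypothetical token types for 'MyLang'. Values 0 and 1 are reserved for universal tokens. */
interface MyLangTokenConstants {
    int FILE_END = 0;        // universal token (name is an assumption)
    int SEPARATOR_TOKEN = 1; // universal token (name is an assumption)
    int CLASS_BEGIN = 2;
    int IMPORT = 3;
    int CLASS_END = 4;
    int METHOD_BEGIN = 5;
    int METHOD_END = 6;
    int ASSIGNMENT = 7;
}

/** Stand-alone illustration; a real frontend would extend JPlag's abstract Token class instead. */
final class MyLangToken implements MyLangTokenConstants {
    private final int type;
    private final int line;

    MyLangToken(int type, int line) {
        this.type = type;
        this.line = line;
    }

    int getType() { return type; }

    int getLine() { return line; }

    /** Human-readable name of the token type, useful for debugging output. */
    String type2string() {
        switch (type) {
            case FILE_END: return "<EOF>";
            case SEPARATOR_TOKEN: return "--------";
            case CLASS_BEGIN: return "CLASS{";
            case IMPORT: return "IMPORT";
            case CLASS_END: return "}CLASS";
            case METHOD_BEGIN: return "METHOD{";
            case METHOD_END: return "}METHOD";
            case ASSIGNMENT: return "ASSIGN";
            default: return "<UNKNOWN " + type + ">";
        }
    }
}

final class MyLangTokenDemo {
    /** Flat token sequence for a class that contains one method with one assignment. */
    static List<MyLangToken> tokensForSmallClass() {
        return List.of(
            new MyLangToken(MyLangTokenConstants.CLASS_BEGIN, 1),
            new MyLangToken(MyLangTokenConstants.METHOD_BEGIN, 2),
            new MyLangToken(MyLangTokenConstants.ASSIGNMENT, 3),
            new MyLangToken(MyLangTokenConstants.METHOD_END, 4),
            new MyLangToken(MyLangTokenConstants.CLASS_END, 5),
            new MyLangToken(MyLangTokenConstants.FILE_END, 5));
    }

    /** Joins the readable token names with single spaces. */
    static String render(List<MyLangToken> tokens) {
        return tokens.stream().map(MyLangToken::type2string).collect(Collectors.joining(" "));
    }
}
```

Rendering the sample list gives `CLASS{ METHOD{ ASSIGN }METHOD }CLASS <EOF>`: the nesting is visible only through the paired `BEGIN` and `END` tokens, since the list itself is flat.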
### 2. Parser ###
To generate a `TokenList` from submissions, the source code needs to be (lexed and) parsed. Parser implementations in Java for virtually any programming language are available online. We present two approaches that are already used in JPlag frontends for parsing.
Both parsers produce a representation of the Abstract Syntax Tree (AST) of the source code, using their own implementation of `Token`s.
#### A Hard-coded Parser: the JavaCompiler API ####
A [Java Compiler API](https://docs.oracle.com/en/java/javase/17/docs/api/java.compiler/module-summary.html) is included in the standard Java API.
A JavaCompiler object can be obtained using `ToolProvider.getSystemJavaCompiler()`.
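As a self-contained illustration of this API (separate from JPlag's actual `JavacAdapter`), the sketch below parses a source string and emits token names from a `TreeScanner`. The class names `JavacSketch` and `StringSource` and the string token names are hypothetical; it assumes a JDK (not a bare JRE) so that the system compiler is available.

```java
import com.sun.source.tree.AssignmentTree;
import com.sun.source.tree.ClassTree;
import com.sun.source.tree.CompilationUnitTree;
import com.sun.source.tree.MethodTree;
import com.sun.source.util.JavacTask;
import com.sun.source.util.TreeScanner;

import javax.tools.JavaCompiler;
import javax.tools.JavaFileObject;
import javax.tools.SimpleJavaFileObject;
import javax.tools.ToolProvider;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URI;
import java.util.ArrayList;
import java.util.List;

final class JavacSketch {

    /** In-memory source file, so the example needs no files on disk. */
    static final class StringSource extends SimpleJavaFileObject {
        private final String code;

        StringSource(String className, String code) {
            super(URI.create("string:///" + className + ".java"), JavaFileObject.Kind.SOURCE);
            this.code = code;
        }

        @Override
        public CharSequence getCharContent(boolean ignoreEncodingErrors) {
            return code;
        }
    }

    /** Parses the given source and returns hypothetical token names in source order. */
    static List<String> tokenize(String className, String code) {
        JavaCompiler javac = ToolProvider.getSystemJavaCompiler();
        if (javac == null) {
            throw new IllegalStateException("No system Java compiler found; a JDK is required.");
        }
        JavacTask task = (JavacTask) javac.getTask(null, null, null, null, null,
                List.of(new StringSource(className, code)));
        List<String> tokens = new ArrayList<>();
        try {
            for (CompilationUnitTree ast : task.parse()) {
                ast.accept(new TreeScanner<Void, Void>() {
                    @Override
                    public Void visitClass(ClassTree node, Void p) {
                        tokens.add("CLASS_BEGIN");
                        super.visitClass(node, p); // descend into members
                        tokens.add("CLASS_END");
                        return null;
                    }

                    @Override
                    public Void visitMethod(MethodTree node, Void p) {
                        tokens.add("METHOD_BEGIN");
                        super.visitMethod(node, p); // descend into the body
                        tokens.add("METHOD_END");
                        return null;
                    }

                    @Override
                    public Void visitAssignment(AssignmentTree node, Void p) {
                        tokens.add("ASSIGN"); // leaf-like element: a single token
                        return super.visitAssignment(node, p);
                    }
                }, null);
                tokens.add("FILE_END");
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return tokens;
    }
}
```

Because only `parse()` is called, the source is not type-checked or compiled to bytecode; the scanner sees the syntax tree exactly as written.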
In the following code excerpt, the Java Compiler object `javac` is used to create ASTs from the submissions. (Code heavily shortened for clarity - this code will not compile. See `JavacAdapter.java` for the complete code.)
```java
CompilationTask task = javac.getTask(javaFiles);
for (final CompilationUnitTree ast : executeCompilationTask(task)) {
ast.accept(new TokenGeneratingTreeScanner(parser), null);
parser.add(TokenConstants.FILE_END);
}
return processErrors(parser.getErrorConsumer(), listener);
```

ANTLR (ANother Tool for Language Recognition) is a library that generates parsers from grammar definitions (`.g4` files). The generated parsers, in turn, are Java classes that can be used in JPlag frontends to parse submissions at runtime.
For example, `MyLexer.java` and `MyBaseListener.java` are both generated from the grammar.
![Grammar Diagram](https://lh5.googleusercontent.com/crYBriHO-XnKdklGhOns4Gvh82scWZJ394b8mPmr66Y_xyRAWahEyPSte7Itdp7RSS7GJILv9cFDp4wDTL_O=w2500-h1412)
The ANTLR project has collected ANTLR grammars for many languages for free, see [here](https://github.com/antlr/grammars-v4). To use ANTLR grammars in a JPlag frontend, put them into `jplag.frontend.mylang/src/main/antlr4/de/jplag/mylang/grammar`. Grammars that are imported by other grammars need to be put in `jplag.frontend.mylang/src/main/antlr4/imports`.
The ANTLR runtime needs to be included in your frontend POM file as a dependency.
<h5 a><strong><code>${project-dir}/jplag.frontend.mylang/pom.xml</code></strong></h5>

```xml
<dependency>
    <groupId>org.antlr</groupId>
    <artifactId>antlr4-runtime</artifactId>
</dependency>
```
To generate all needed classes from the ANTLR `.g4` files on build, add the `antlr4-maven-plugin` with its `antlr4` goal to the build section:
<h5 a><strong><code>${project-dir}/jplag.frontend.mylang/pom.xml</code></strong></h5>

```xml
<build>
    <plugins>
        <plugin>
            <groupId>org.antlr</groupId>
            <artifactId>antlr4-maven-plugin</artifactId>
            <executions>
                <execution>
                    <goals>
                        <goal>antlr4</goal>
                    </goals>
                </execution>
            </executions>
        </plugin>
    </plugins>
</build>
```
### 3. Parser Adapter ###
Regardless of which sort of `Parser` is used, the resulting AST will use the Token class proprietary to the Parser API. For the analyses performed by JPlag, JPlag's own `Token` class must be used. To linearize the AST and convert the parser's Tokens to JPlag Tokens, traverse the AST with a 'visitor'/'walker' and have a custom 'observer'/'listener' add a Token to the `TokenList` whenever a relevant parser Token is visited. See the `Listener` classes of the C# frontend.
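The walker/listener idea can be sketched without any parser library. Everything in the following example is hypothetical: `Node` stands in for a parser-specific AST node, `Listener` and `walk` mimic what a generated listener and tree walker would provide, and the token names are plain strings rather than JPlag `Token` objects.

```java
import java.util.ArrayList;
import java.util.List;

final class AdapterSketch {

    /** Stand-in for a parser-specific AST node (not a real parser API). */
    static final class Node {
        final String kind;
        final List<Node> children;

        Node(String kind, Node... children) {
            this.kind = kind;
            this.children = List.of(children);
        }
    }

    /** Callback interface, similar in spirit to a generated parse-tree listener. */
    interface Listener {
        void enter(Node node);

        void exit(Node node);
    }

    /** Depth-first walker that fires enter/exit events in source order. */
    static void walk(Node node, Listener listener) {
        listener.enter(node);
        for (Node child : node.children) {
            walk(child, listener);
        }
        listener.exit(node);
    }

    /** Listener that linearizes the tree: only relevant node kinds produce tokens. */
    static final class TokenCollector implements Listener {
        final List<String> tokenList = new ArrayList<>();

        @Override
        public void enter(Node node) {
            switch (node.kind) {
                case "class": tokenList.add("CLASS_BEGIN"); break;
                case "method": tokenList.add("METHOD_BEGIN"); break;
                case "assignment": tokenList.add("ASSIGN"); break;
                default: break; // identifiers, literals etc. are deliberately ignored
            }
        }

        @Override
        public void exit(Node node) {
            switch (node.kind) {
                case "class": tokenList.add("CLASS_END"); break;
                case "method": tokenList.add("METHOD_END"); break;
                default: break; // leaf-like elements have no END token
            }
        }
    }

    /** Walks the tree once and returns the flat token sequence. */
    static List<String> linearize(Node root) {
        TokenCollector collector = new TokenCollector();
        walk(root, collector);
        return collector.tokenList;
    }
}
```

For a tree `class(method(assignment(identifier, literal)))`, `linearize` returns `[CLASS_BEGIN, METHOD_BEGIN, ASSIGN, METHOD_END, CLASS_END]`. With a real parser, the walker and the listener base class would come from the parser API, and only the token-adding logic would be your own.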
### 4. Abstraction Strategy ###
As mentioned before, not all types of syntactic element of a given programming language might be relevant for the analyses performed by JPlag. The level of granularity determines the precision and recall that your frontend will get out of submissions.
As a guideline, some of the design decisions of existing frontends regarding Token selection are outlined below.
#### Java Frontend Tokens ####
The JPlag Java frontend takes into account tokens corresponding to
- nesting syntactic structures: class, method, if-else, try, switch, ...
- control flow: break, continue, return, throw, assert
- method invocations
- writing to memory: assignments, declarations, object constructors, array constructors
The following types of elements are omitted:
- arithmetic operations: addition, negation, or, instanceof
- reading from memory: field access, variable references
- type information
- identifiers
- numerical and String literals
**Note:** Depending on the design of the grammar, the resulting parser's ability to create a 'complete'[<sup>3)</sup>](#footnote-3) abstraction may vary. For the sake of conciseness, some syntactic elements may not have a unique rule defined for them in the grammar, but they may only implicitly be defined as one of many cases of a more general syntactic element, or a sequence of other elements. For example, an interface declaration would be indistinguishable from a class or enum declaration if they share a common rule. Similarly, a method invocation might be represented as an expression followed by an argument list, so that a method invocation would not be a syntactic category of the grammar at all.
If you are using the ANTLR4 framework, you may be able to account for these kinds of disregarded syntactic elements using one of the following strategies:
* If the parent of the token identifies it uniquely (e.g. a `block` inside an `ifStatement` must be an `ifBlock`), you can access it in the `enter` method of the parent element via `ParserRuleContext::getChild(int)` and create a token from it ("top-down").
* If the token is represented not by a rule but by a terminal (e.g. a keyword of the language), you can handle it inside the `visitTerminal` method ("bottom-up").
* If the token comes up in multiple contexts that you need to distinguish, you may consider a stack machine of custom states, as the C# and Go frontends do.
* Sometimes it might be much easier just to modify the grammar a bit in order to 'save' the context in which an element appeared. See the Kotlin frontend README for an example. Be aware of the conditions set by the software license.
<sup><a name="footnote-3">3)</a></sup> i.e. containing all elements of the code that one might deem relevant in this context
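The stack-based strategy from the list above might look roughly like the following sketch. It is not taken from the C# or Go frontends; the class `ContextStackSketch`, its `enter`/`exit` methods and the token names are hypothetical, and a real listener would override the methods generated for the grammar's rules instead.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Distinguishes occurrences of a shared 'block' rule by remembering the enclosing construct. */
final class ContextStackSketch {

    enum Context { IF, WHILE }

    private final Deque<Context> contexts = new ArrayDeque<>();
    private final List<String> tokens = new ArrayList<>();

    void enterIf() { contexts.push(Context.IF); }

    void exitIf() { contexts.pop(); }

    void enterWhile() { contexts.push(Context.WHILE); }

    void exitWhile() { contexts.pop(); }

    /** The grammar has only one 'block' rule; the stack tells us whose body it is. */
    void enterBlock() { tokens.add(currentContextName() + "_BODY_BEGIN"); }

    void exitBlock() { tokens.add(currentContextName() + "_BODY_END"); }

    List<String> tokens() { return tokens; }

    /** Falls back to a generic name when the block has no known enclosing construct. */
    private String currentContextName() {
        Context current = contexts.peek();
        return current == null ? "BLOCK" : current.name();
    }

    /** Simulates the listener events for: while (...) { if (...) { } } */
    static List<String> demo() {
        ContextStackSketch listener = new ContextStackSketch();
        listener.enterWhile();
        listener.enterBlock();
        listener.enterIf();
        listener.enterBlock();
        listener.exitBlock();
        listener.exitIf();
        listener.exitBlock();
        listener.exitWhile();
        return listener.tokens();
    }
}
```

The simulated events produce `[WHILE_BODY_BEGIN, IF_BODY_BEGIN, IF_BODY_END, WHILE_BODY_END]`, even though both bodies come from the same grammar rule.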
[...]
Note: Each `Token` object holds its position in the source file and its length. Neither is relevant to the analysis; both serve only for testing and debugging. The only value used in current analyses to determine and compare the sequence of tokens in the source file is the order of the `TokenList`. It is therefore crucial that the order in which `Token`s are appended to the `TokenList` be consistent.
## Test your frontend
Your main design decisions when creating a new frontend are which tokens to include and how many consecutive matching tokens you consider significant for the analysis. A good way to check how well your set of tokens covers the code is to use the `TokenPrinterTest` like so:
* Create a new test method for your language. In it, call `printSubmissions` in a similar way as for other languages already included.
* Put some code files in your language into the directory `jplag/src/test/resources/PRINTER/`.
* Run the test inside your IDE to print the code annotated with the tokens.
This way, you might discover a lack of coverage for specific syntactic structures, or incorrect annotations.
Here are some more checks that you can implement as a JUnit test:
* Each TokenConstant may come up in source code.
* Each TokenConstant has a valid String representation in `Token::type2string`.
* Each line of well-formatted code is represented in the generated `TokenList` – this is not an absolute requirement depending on the language.
* The sequence of `BEGIN` and `END` tokens is correctly nested (Dyck language).
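The nesting check from the last item can be sketched with a stack. The example assumes that token names follow a `<CONSTRUCT>_BEGIN` / `<CONSTRUCT>_END` naming convention and that `FILE_END` is the only name ending in `_END` that does not close a construct; the class `NestingCheck` is hypothetical and works on plain strings rather than on a real `TokenList`.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

final class NestingCheck {

    private static final String BEGIN = "_BEGIN";
    private static final String END = "_END";

    /** Returns true if every BEGIN token is closed by the matching END token in the right order. */
    static boolean isCorrectlyNested(List<String> tokenNames) {
        Deque<String> open = new ArrayDeque<>();
        for (String name : tokenNames) {
            if (name.equals("FILE_END")) {
                continue; // marks the end of a file, not the end of a nested construct
            }
            if (name.endsWith(BEGIN)) {
                open.push(name.substring(0, name.length() - BEGIN.length()));
            } else if (name.endsWith(END)) {
                String construct = name.substring(0, name.length() - END.length());
                if (open.isEmpty() || !open.pop().equals(construct)) {
                    return false; // END without a matching BEGIN, or wrong order
                }
            }
            // all other tokens are leaves and do not affect nesting
        }
        return open.isEmpty(); // anything left open was never closed
    }
}
```

In a JUnit test, the list of names would be derived from the `TokenList` your frontend produces for a sample file, and the result passed to an assertion.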
Still, these checks offer only limited insight into how suitable your abstraction is for detecting plagiarism.