Welcome to the IRIS Labs Recruitment Task! This year at IRIS Labs, we are focusing on Automating Code Documentation with the help of LLMs and other NLP methodologies!
Maintaining up-to-date and accurate documentation is important for the success and sustainability of projects. However, the manual creation and upkeep of documentation is often time-consuming and prone to errors. This project proposes the development and implementation of an automated documentation generation system to streamline and enhance the documentation process, ensuring consistency, accuracy, and efficiency.
This repository contains the code for a project we undertook with a similar idea. Please go through the codebase thoroughly.
For the recruitment task, we have listed a few issues below that you can tackle. Choose any issue(s) and work on them. Before going through the issues, we suggest reading the README.md first; it will help you understand them better.
| Issue | Level |
|---|---|
| Dendrogram Feature Addition | Easy |
| Documentation Customization | Easy |
| Additional Code Features | Easy |
| Knowledge Graphs Addition | Medium |
| Investigating Clustering | Medium |
| Investigating Embeddings Mechanisms | Hard |
| Investigating Context Specific Mechanisms | Hard |
**Dendrogram Feature Addition**

Based on the clustering algorithm currently used, contributors can add dendrogram visualization. You can look into dendrograms here. This is a very easy issue to fix.
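For illustration, here is a minimal sketch of plotting a dendrogram over hierarchically clustered file embeddings using SciPy and Matplotlib. The `embeddings` array, `file_names` list, and the choice of Ward linkage are placeholder assumptions, not values taken from the codebase.

```python
# Minimal sketch: dendrogram over code-file embeddings.
# `embeddings` and `file_names` below are placeholders for illustration only.
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(8, 768))           # placeholder file embeddings
file_names = [f"file_{i}.py" for i in range(8)]  # placeholder labels

# Ward linkage mirrors the hierarchical intuition behind agglomerative clustering.
linkage_matrix = linkage(embeddings, method="ward")

plt.figure(figsize=(8, 4))
dendrogram(linkage_matrix, labels=file_names)
plt.title("Hierarchical clustering of code files")
plt.tight_layout()
plt.savefig("dendrogram.png")
```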
**Documentation Customization**

Customization should adhere to the needs of the user. Contributors can look into prompt engineering as a starting point and expand from there. One example of customization would be generating documentation in a specific format dictated by the user.
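As one hedged illustration of what such customization could look like, the prompt template below embeds user preferences directly into the LLM prompt. The function name `build_doc_prompt` and the `doc_format`/`style` parameters are hypothetical, not part of the existing codebase.

```python
# Illustrative sketch of user-driven documentation customization via prompt
# engineering. The template and parameters are assumptions, not existing code.
def build_doc_prompt(code: str, doc_format: str = "markdown", style: str = "concise") -> str:
    """Compose an LLM prompt that embeds the user's formatting preferences."""
    return (
        f"Generate {style} documentation for the following code.\n"
        f"Output format: {doc_format}.\n"
        f"Include a one-line summary, parameter descriptions, and usage notes.\n\n"
        f"```\n{code}\n```"
    )

print(build_doc_prompt("def add(a, b):\n    return a + b", doc_format="reStructuredText"))
```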
**Additional Code Features**

As of now, we have added:
- Code Refactoring
- Automatic Test Generation

Various other features related to code analytics can be added (for example, the number of functions, classes, etc., which can easily be obtained during prompting).
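As an illustration, counts like these can also be gathered statically rather than during prompting; the sketch below uses Python's built-in `ast` module as one possible alternative.

```python
# Minimal sketch of static code analytics with Python's ast module,
# offered as an alternative to asking the LLM for these counts.
import ast

def count_definitions(source: str) -> dict:
    """Count function and class definitions in a Python source string."""
    tree = ast.parse(source)
    counts = {"functions": 0, "classes": 0}
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            counts["functions"] += 1
        elif isinstance(node, ast.ClassDef):
            counts["classes"] += 1
    return counts

print(count_definitions("class A:\n    def f(self): pass\n\ndef g(): pass\n"))
# {'functions': 2, 'classes': 1}
```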
**Knowledge Graphs Addition**

Contributors can look into incorporating knowledge graphs based on the clustered or non-clustered files to provide further insight into the structure of the code. Code retrieval could also be built as a subsequent feature on top of this.
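Below is a minimal sketch of what such a knowledge graph could look like using `networkx`. The nodes, edges, and relation labels are purely illustrative assumptions; a real implementation would derive them from the parsed files.

```python
# Hedged sketch of a knowledge graph over code entities with networkx.
# All nodes, edges, and relations here are illustrative placeholders.
import networkx as nx

graph = nx.DiGraph()
graph.add_node("utils.py", kind="file")
graph.add_node("parse_config", kind="function")
graph.add_node("main.py", kind="file")

graph.add_edge("utils.py", "parse_config", relation="defines")
graph.add_edge("main.py", "parse_config", relation="calls")

# Simple retrieval on top of the graph: find everything that calls a function.
callers = [u for u, v, d in graph.edges(data=True)
           if v == "parse_config" and d["relation"] == "calls"]
print(callers)  # ['main.py']
```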
**Investigating Clustering**

Agglomerative Clustering was chosen to take advantage of any hierarchical relationships between the code clusters, but other more efficient or more meaningful clustering algorithms can be looked into.
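For a concrete starting point, the sketch below compares agglomerative clustering with one possible alternative (k-means) on placeholder embeddings, using silhouette score as a rough quality measure. None of the parameters here reflect the codebase's actual settings.

```python
# Hedged sketch: compare two clustering choices on placeholder embeddings.
import numpy as np
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(20, 768))  # one vector per code file (placeholder)

for name, model in [
    ("agglomerative", AgglomerativeClustering(n_clusters=4, linkage="ward")),
    ("k-means", KMeans(n_clusters=4, n_init=10, random_state=0)),
]:
    labels = model.fit_predict(embeddings)
    print(name, silhouette_score(embeddings, labels))
```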
**Investigating Embeddings Mechanisms**

We are currently using CodeBERT, which has its own limitations. Contributors can look into exploring different embedding models, or perhaps even train their own using a multimodal (NL/PL) approach.
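As a reference point, here is a minimal sketch of obtaining a CodeBERT embedding via Hugging Face `transformers`. Mean pooling over the last hidden state is one common choice and is an assumption here, not necessarily what the codebase does.

```python
# Minimal sketch of embedding a code snippet with CodeBERT.
# Mean pooling over the last hidden state is an assumption for illustration.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
model = AutoModel.from_pretrained("microsoft/codebert-base")

code = "def add(a, b):\n    return a + b"
inputs = tokenizer(code, return_tensors="pt", truncation=True, max_length=512)

with torch.no_grad():
    outputs = model(**inputs)

embedding = outputs.last_hidden_state.mean(dim=1).squeeze(0)  # shape: (768,)
print(embedding.shape)
```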
**Investigating Context Specific Mechanisms**

We are currently using a "window" algorithm to maintain context, which is not the most efficient technique. Contributors can look into memory networks, various attention mechanisms, or Reinforcement Learning with Memory.
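To make the idea concrete, here is a hedged sketch of a sliding-window chunking scheme in which each chunk overlaps the previous one to retain nearby context. The window and overlap sizes are illustrative, not the values used in the codebase.

```python
# Hedged sketch of sliding-window chunking with overlap; sizes are illustrative.
def sliding_windows(tokens: list, window_size: int = 512, overlap: int = 64):
    """Yield overlapping token windows so each chunk retains nearby context."""
    step = window_size - overlap
    for start in range(0, max(len(tokens) - overlap, 1), step):
        yield tokens[start:start + window_size]

tokens = list(range(1200))  # stand-in for tokenized code
for window in sliding_windows(tokens):
    print(window[0], window[-1], len(window))
```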
In your forked repositories, create a submission.md file and detail the following, in the same order:
1. Your understanding of the codebase. (Very brief, just covering a general view. This is given in our README.md file, but we want to know your understanding.)
2. Write about all the processes involved from a Machine Learning standpoint using the template given below. We have used several novel techniques, so it would be great if you can identify and explain them. (HINT: One of these techniques is in the embedding algorithm here.) You can fill in the following pointers in your file for this point:
   - Codebase Traversal:
   - Code Embeddings:
   - Handling Large Code Files:
   - Maintaining Context with Agglomerative Clustering:
   - Efficient Documentation Generation:
3. The tasks you have tackled. Describe what you have done, why you chose a specific method, what other methods exist, and why you didn't use them.
4. Write about the limitations of the current application and what core changes you would make. (Core changes meaning changes to the architecture of how this entire process works.)
We are not focused on the number of tasks you can manage. These tasks are just a small guide to what can be fixed. We are more interested in the clarity of your concepts pertaining to Natural Language Processing and Machine Learning in general. Even if you cannot solve any tasks, detailing points 1 and 2 of the Submission Instructions really well will make for a strong submission. Solving fewer tasks well will always be better than solving more tasks haphazardly. The goal is to test how well you can translate ML and NLP theory into application.