feat: add c, cpp, csharp, yaml, md, rst support (#7)
Closes #5
fynnfluegge authored Oct 7, 2023
1 parent 36e1025 commit e0705db
Showing 17 changed files with 653 additions and 554 deletions.
39 changes: 29 additions & 10 deletions README.md
@@ -19,54 +19,68 @@ Built with [langchain](https://github.com/langchain-ai/langchain), [treesitter](

</kbd>


</div>

## ✨ Features

- 🔎 &nbsp;Semantic code search
- 💬 &nbsp;GPT-like chat with your codebase
- 💻 &nbsp;100% local embeddings and LLMs
- sentence-transformers, instructor-embeddings, llama.cpp, Ollama
- 🌐 &nbsp;OpenAI and Azure OpenAI support

> [!NOTE]
> Results are better if the code is well documented. Consider [doc-comments-ai](https://github.com/fynnfluegge/doc-comments.ai) for generating code documentation.

## 🚀 Usage

Start semantic search:

```
codeqai search
```

Start chat dialog:

```
codeqai chat
```

## 📋 Requirements

- Python >= 3.9

## 🔧 Installation

```
pipx install codeqai
```

On first use, you are asked to install faiss-cpu or faiss-gpu. faiss-gpu is recommended if the hardware supports CUDA 7.5+.
If local embeddings and LLMs are used, you will later also be asked to install sentence-transformers, instructor or llama.cpp.

## ⚙️ Configuration

On first use, or by running

```
codeqai configure
```

the configuration process starts, in which the embeddings and LLMs can be chosen.

## 🌐 Remote models

If remote models are preferred over local ones, some environment variables need to be set in advance.

### OpenAI

```bash
export OPENAI_API_KEY="your OpenAI api key"
```

### Azure OpenAI

```bash
export OPENAI_API_TYPE="azure"
export OPENAI_API_BASE="https://<your-endpoint>.openai.azure.com/"
@@ -75,35 +89,40 @@ export OPENAI_API_VERSION = "2023-05-15"
```

## 💡 How it works

The entire git repo is parsed with treesitter to extract all methods along with their documentation, and the results are saved to a local FAISS vector database using either sentence-transformers, instructor-embeddings or OpenAI's text-embedding-ada-002.
The vector database is saved to a file on your system and is loaded again on subsequent runs.
Afterwards, semantic search on the codebase is possible, based on the embeddings model.
To chat with the codebase locally, llama.cpp or Ollama is used with the specified model.
With llama.cpp, the specified model needs to be available on the system in advance.
With Ollama, the Ollama container with the desired model needs to be running locally in advance on port 11434.
OpenAI or Azure OpenAI can also be used for remote chat models.
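
As a rough illustration of the search step, the sketch below ranks stored snippets by cosine similarity to a query. It is a toy under stated assumptions: `embed` is a hypothetical bag-of-words stand-in for a real embeddings model, and a plain list stands in for the FAISS index. None of these names come from codeqai.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Hypothetical stand-in for an embeddings model: lowercase word counts."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def search(query: str, snippets: list[str], top_k: int = 1) -> list[str]:
    """Return the top_k snippets most similar to the query."""
    query_vec = embed(query)
    ranked = sorted(snippets, key=lambda s: cosine(query_vec, embed(s)), reverse=True)
    return ranked[:top_k]


# Plain list standing in for the persisted vector database.
snippets = [
    "load files from the git repository",
    "parse method names with treesitter",
    "save vectors to the local database",
]
best = search("parse method names", snippets)
```

A real embeddings model would also match snippets that share meaning but not exact words, which word counts cannot do.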

## 📚 Supported Languages

- [x] Python
- [x] Typescript
- [x] Javascript
- [x] Java
- [x] Rust
- [x] Kotlin
- [x] Go
- [ ] Lua
- [ ] Scala
- [x] C++
- [x] C
- [x] C#

## FAQ

### Where do I get models for llama.cpp?

Install the `huggingface-cli` and download your desired model from the model hub.
For example

```
huggingface-cli download TheBloke/CodeLlama-13B-Python-GGUF codellama-13b-python.Q5_K_M.gguf
```

will download the `codellama-13b-python.Q5_K_M` model. After the download has finished the absolute path of the model `.gguf` file is printed to the console.

> [!IMPORTANT]
> `llama.cpp` compatible models must be in the `.gguf` format.
5 changes: 5 additions & 0 deletions codeqai/repo.py
@@ -40,6 +40,7 @@ def load_files():
".angular",
"cdk.out",
".aws-sam",
".terraform",
]
WHITELIST_FILES = [
".js",
@@ -71,4 +72,8 @@ def load_files():
".pm",
".lua",
".sql",
".yaml",
".yml",
".rst",
".md",
]
6 changes: 5 additions & 1 deletion codeqai/treesitter/__init__.py
@@ -1,4 +1,8 @@
from codeqai.treesitter.treesitter import Treesitter, TreesitterMethodNode
from codeqai.treesitter.treesitter import (Treesitter,
TreesitterMethodNode)
from codeqai.treesitter.treesitter_c import TreesitterC
from codeqai.treesitter.treesitter_cpp import TreesitterCpp
from codeqai.treesitter.treesitter_cs import TreesitterCsharp
from codeqai.treesitter.treesitter_go import TreesitterGo
from codeqai.treesitter.treesitter_java import TreesitterJava
from codeqai.treesitter.treesitter_js import TreesitterJavascript
74 changes: 41 additions & 33 deletions codeqai/treesitter/treesitter.py
@@ -1,4 +1,4 @@
from abc import ABC, abstractmethod
from abc import ABC

import tree_sitter
from tree_sitter_languages import get_language, get_parser
@@ -21,48 +21,56 @@ def __init__(


class Treesitter(ABC):
def __init__(self, language: Language):
def __init__(
self,
language: Language,
method_declaration_identifier: str,
name_identifier: str,
doc_comment_identifier: str,
):
self.parser = get_parser(language.value)
self.language = get_language(language.value)
self.method_declaration_identifier = method_declaration_identifier
self.method_name_identifier = name_identifier
self.doc_comment_identifier = doc_comment_identifier

@staticmethod
def create_treesitter(language: Language) -> "Treesitter":
return TreesitterRegistry.create_treesitter(language)

@abstractmethod
def parse(self, file_bytes: bytes) -> list[TreesitterMethodNode]:
self.tree = self.parser.parse(file_bytes)
pass

def parse_methods(self, methods: list[tuple[tree_sitter.Node, str]]):
result = []
methods.reverse()
while methods:
if methods and methods[-1][1] == "doc_comment":
doc_comment = methods.pop()
self.process_method(methods, doc_comment, result)
else:
self.process_method(methods, None, result)

methods = self._query_all_methods(self.tree.root_node)
for method in methods:
method_name = self._query_method_name(method["method"])
doc_comment = method["doc_comment"]
result.append(
TreesitterMethodNode(method_name, doc_comment, method["method"])
)
return result

def process_method(self, methods, doc_comment, result):
if methods and methods[-1][1] == "method":
method = methods.pop()
if methods and methods[-1][1] == "method_name":
method_name = methods.pop()
result.append(
TreesitterMethodNode(
method_name[0].text.decode(),
doc_comment[0].text.decode() if doc_comment else None,
method[0],
)
)
def _query_all_methods(
self,
node: tree_sitter.Node,
):
methods = []
if node.type == self.method_declaration_identifier:
doc_comment_node = None
if (
node.prev_named_sibling
and node.prev_named_sibling.type == self.doc_comment_identifier
):
doc_comment_node = node.prev_named_sibling.text.decode()
methods.append({"method": node, "doc_comment": doc_comment_node})
else:
for child in node.children:
methods.extend(self._query_all_methods(child))
return methods

@abstractmethod
def _query_all_methods(self):
"""
This function returns a treesitter query for method names
based on the language
"""
pass
def _query_method_name(self, node: tree_sitter.Node):
if node.type == self.method_declaration_identifier:
for child in node.children:
if child.type == self.method_name_identifier:
return child.text.decode()
return None
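
The recursive traversal introduced in this change can be tried without a grammar installed. The following is a minimal sketch, assuming a hypothetical `FakeNode` class that mimics only the `tree_sitter.Node` attributes used here. The helper names `collect_methods` and `method_name` are illustrative and not part of codeqai.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class FakeNode:
    """Hypothetical stand-in for tree_sitter.Node with only the fields used here."""

    type: str
    text: bytes = b""
    children: list["FakeNode"] = field(default_factory=list)
    prev_named_sibling: Optional["FakeNode"] = None


def collect_methods(node: FakeNode, method_type: str, comment_type: str) -> list[dict]:
    """Depth-first search for method nodes, pairing each with a preceding comment."""
    if node.type == method_type:
        prev = node.prev_named_sibling
        doc = prev.text.decode() if prev and prev.type == comment_type else None
        return [{"method": node, "doc_comment": doc}]
    found = []
    for child in node.children:
        found.extend(collect_methods(child, method_type, comment_type))
    return found


def method_name(node: FakeNode, name_type: str) -> Optional[str]:
    """Return the text of the first direct child of the given identifier type."""
    for child in node.children:
        if child.type == name_type:
            return child.text.decode()
    return None


# A tiny hand-built tree: one comment followed by one function.
comment = FakeNode("comment", b"// Add two numbers")
func = FakeNode(
    "function_declaration",
    children=[FakeNode("identifier", b"add")],
    prev_named_sibling=comment,
)
root = FakeNode("source_file", children=[comment, func])

methods = collect_methods(root, "function_declaration", "comment")
names = [method_name(m["method"], "identifier") for m in methods]
```

As in the diff, the search does not descend into a node once it matches the method type, so methods nested inside other methods are not collected.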
26 changes: 26 additions & 0 deletions codeqai/treesitter/treesitter_c.py
@@ -0,0 +1,26 @@
import tree_sitter

from codeqai.constants import Language
from codeqai.treesitter.treesitter import Treesitter
from codeqai.treesitter.treesitter_registry import TreesitterRegistry


class TreesitterC(Treesitter):
    def __init__(self):
        super().__init__(Language.C, "function_definition", "identifier", "comment")

    def _query_method_name(self, node: tree_sitter.Node):
        if node.type == self.method_declaration_identifier:
            for child in node.children:
                # if method returns pointer, skip pointer declarator
                if child.type == "pointer_declarator":
                    child = child.children[1]
                if child.type == "function_declarator":
                    for child in child.children:
                        if child.type == self.method_name_identifier:
                            return child.text.decode()
        return None


# Register the TreesitterC class in the registry
TreesitterRegistry.register_treesitter(Language.C, TreesitterC)
26 changes: 26 additions & 0 deletions codeqai/treesitter/treesitter_cpp.py
@@ -0,0 +1,26 @@
import tree_sitter

from codeqai.constants import Language
from codeqai.treesitter.treesitter import Treesitter
from codeqai.treesitter.treesitter_registry import TreesitterRegistry


class TreesitterCpp(Treesitter):
    def __init__(self):
        super().__init__(Language.CPP, "function_definition", "identifier", "comment")

    def _query_method_name(self, node: tree_sitter.Node):
        if node.type == self.method_declaration_identifier:
            for child in node.children:
                # if method returns pointer, skip pointer declarator
                if child.type == "pointer_declarator":
                    child = child.children[1]
                if child.type == "function_declarator":
                    for child in child.children:
                        if child.type == self.method_name_identifier:
                            return child.text.decode()
        return None


# Register the TreesitterCpp class in the registry
TreesitterRegistry.register_treesitter(Language.CPP, TreesitterCpp)
62 changes: 62 additions & 0 deletions codeqai/treesitter/treesitter_cs.py
@@ -0,0 +1,62 @@
import tree_sitter

from codeqai.constants import Language
from codeqai.treesitter.treesitter import Treesitter
from codeqai.treesitter.treesitter_registry import TreesitterRegistry


class TreesitterCsharp(Treesitter):
    def __init__(self):
        super().__init__(
            Language.C_SHARP, "method_declaration", "identifier", "comment"
        )

    def _query_method_name(self, node: tree_sitter.Node):
        first_match = None
        if node.type == self.method_declaration_identifier:
            for child in node.children:
                # if the return type is an object type, then the method name
                # is the second match
                if child.type == self.method_name_identifier and not first_match:
                    first_match = child.text.decode()
                elif child.type == self.method_name_identifier and first_match:
                    return child.text.decode()
        return first_match

    def _query_all_methods(self, node: tree_sitter.Node):
        methods = []
        if node.type == self.method_declaration_identifier:
            doc_comment_nodes = []
            if (
                node.prev_named_sibling
                and node.prev_named_sibling.type == self.doc_comment_identifier
            ):
                current_doc_comment_node = node.prev_named_sibling
                # walk backwards over consecutive comment siblings
                while (
                    current_doc_comment_node
                    and current_doc_comment_node.type == self.doc_comment_identifier
                ):
                    doc_comment_nodes.append(current_doc_comment_node.text.decode())
                    if current_doc_comment_node.prev_named_sibling:
                        current_doc_comment_node = (
                            current_doc_comment_node.prev_named_sibling
                        )
                    else:
                        current_doc_comment_node = None

            doc_comment_str = ""
            doc_comment_nodes.reverse()
            for doc_comment_node in doc_comment_nodes:
                doc_comment_str += doc_comment_node + "\n"
            if doc_comment_str.strip() != "":
                methods.append({"method": node, "doc_comment": doc_comment_str.strip()})
            else:
                methods.append({"method": node, "doc_comment": None})
        else:
            for child in node.children:
                methods.extend(self._query_all_methods(child))
        return methods


# Register the TreesitterCsharp class in the registry
TreesitterRegistry.register_treesitter(Language.C_SHARP, TreesitterCsharp)
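
The C# override walks backwards over consecutive comment siblings, presumably because each `///` line is parsed as its own comment node. The sketch below shows that merging step in isolation, assuming a plain list of `(type, text)` tuples in place of tree-sitter nodes. The function name `merge_preceding_comments` is hypothetical.

```python
from typing import Optional


def merge_preceding_comments(
    siblings: list[tuple[str, str]], index: int
) -> Optional[str]:
    """Join the unbroken run of comment entries directly before siblings[index]."""
    lines = []
    i = index - 1
    while i >= 0 and siblings[i][0] == "comment":
        lines.append(siblings[i][1])
        i -= 1
    lines.reverse()  # collected bottom-up, so restore source order
    merged = "\n".join(lines).strip()
    return merged or None


# Siblings in source order: a field, three doc-comment lines, then a method.
siblings = [
    ("field_declaration", "private int total;"),
    ("comment", "/// <summary>"),
    ("comment", "/// Adds a value."),
    ("comment", "/// </summary>"),
    ("method_declaration", "public void Add(int v) { }"),
]
doc = merge_preceding_comments(siblings, 4)
```

The walk stops at the first non-comment sibling, so comments belonging to earlier members are not attached to this method.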
21 changes: 2 additions & 19 deletions codeqai/treesitter/treesitter_go.py
@@ -1,28 +1,11 @@
import tree_sitter

from codeqai.constants import Language
from codeqai.treesitter.treesitter import (Treesitter,
TreesitterMethodNode)
from codeqai.treesitter.treesitter import Treesitter
from codeqai.treesitter.treesitter_registry import TreesitterRegistry


class TreesitterGo(Treesitter):
def __init__(self):
super().__init__(Language.GO)

def parse(self, file_bytes: bytes) -> list[TreesitterMethodNode]:
super().parse(file_bytes)
methods = self._query_all_methods(self.tree.root_node)
return self.parse_methods(methods)

def _query_all_methods(self, node: tree_sitter.Node):
query_code = """
(comment) @doc_comment
(function_declaration
name: (identifier) @method_name) @method
"""
query = self.language.query(query_code)
return query.captures(node)
super().__init__(Language.GO, "function_declaration", "identifier", "comment")


# Register the TreesitterGo class in the registry