- id unique SHA256 hash of the sample
- hexsha unique git hash of (source) file
- repo the owner/repo
- path the full path to the original file
- license repo license
- language the programming language
- parameters list of parameters and its type (type can be None)
- identifier the function or method name
- return_type the type returned by the function
- original_string original version of function/class node
- original_docstring the raw docstring before tokenization or parsing
- docstring the top-level comment or docstring (docstring version without param’s doc, return, exception, etc)
- docstring_tokens tokenized version of
docstring
- code the part of the
original_string
that is code - code_tokens tokenized version of
code
- short_docstring: short, brief summarization (first line of the docstring)
- short_docstring_tokens: tokenized version of short_docstring
- comment
- List of comment (line) inside the function/class
- docstring_params
- params
List of param's docstring (which actually is paramerter of the function). Each item in the list is a dictionary, sample example:
- identifier (str): "a"
- docstring (str): "this is a comment"
- docstring_tokens (List): ['this', 'is', 'a', 'comment']
- default (bool or None): null
- is_optional (bool or None): null
- [Optional field] type (str): 'int'
- outlier_params
The params which don’t list in the function declaration (e.g.
def cal_sum(a, b):
, if a paramc
is describe in docstring, then it is called outlier params). The syntax is similar with params - returns
List of returns. Example:
- type (str): "int"
- docstring (str): "sum of 2 value"
- docstring_tokens (List): ['sum', 'of', '2', 'value']
- raises
List of raise/throw. Example:
- type (str): "ValueError"
- docstring (str): "raise if
ValueError
if a or b is not digit" - docstring_tokens (List): ['raise', 'if', '
', 'ValueError', '
', 'if', 'a', 'or', 'b', 'is', 'not', 'digit']
- others
List of other type of docstring params (e.g
version
,author
, etc). Example:- identifier (str): "author"
- docstring (str): "Dung Manh Nguyen"
- docstring_tokens (List): ['Dung', 'Manh', 'Nguyen']
- params
List of param's docstring (which actually is paramerter of the function). Each item in the list is a dictionary, sample example:
See the example below:
def cal_sum(a: int, b: int) -> int:
"""
This is demo function
Args:
a (int): this is a comment
b (int): this is another comment
c (int): this is a comment, but `c` is not `cal_sum`'s paramerter
Returns:
int: sum of 2 value
Raise:
ValueError: raise if `ValueError` if a or b is not digit
"""
assert str(a).isdigit() == True, ValueError()
assert str(b).isdigit() == True, ValueError()
# return sum of `a` and `b`
return a + b
Extract results:
{
"repo": "",
"path": "",
"language": "Python",
"license": "",
"identifier": "plotpoints",
"parameters": [
{"param":"a",
"type": "int"},
{"param":"b",
"type": "int"}
],
"return_type": "int",
"original_string": "def cal_sum(a: int, b: int) -> int:\n \"\"\"\n This is demo function\n\n Args:\n a (int): this is a comment\n b (int): this is another comment\n c (int): this is a comment, but `c` is not `cal_sum`'s paramerter\n\n Returns:\n int: sum of 2 value\n\n Raise:\n ValueError: raise if `ValueError` if a or b is not digit\n \"\"\"\n assert str(a).isdigit() == True, ValueError()\n assert str(b).isdigit() == True, ValueError()\n # return sum of `a` and `b`\n return a + b",
"code": "def cal_sum(a: int, b: int) -> int:\n assert str(a).isdigit() == True, ValueError()\n assert str(b).isdigit() == True, ValueError()\n return a + b",
"code_tokens": [...],
"original_docstring": "This is demo function\n\n Args:\n a (int): this is a comment\n b (int): this is another comment\n c (int): this is a comment, but `c` is not `cal_sum`'s paramerter\n\n Returns:\n int: sum of 2 value\n\n Raise:\n ValueError: raise if `ValueError` if a or b is not digit",
"docstring": "This is demo function",
"docstring_tokens": [...],
"short_docstring": "This is demo function",
"short_docstring_tokens": [...]
"comment": [
"# return sum of `a` and `b`",
],
"docstring_params": {
"returns": [
{
"docstring": "sum of 2 value",
"docstring_tokens": ["sum", "of", "2", "value"],
"type": "int"
}
],
"raises": [
{
"docstring": "raise if `ValueError` if a or b is not digit",
"docstring_tokens": ["raise", "if", "`", "ValueError", "`", "if", "a", "or", "b", "is", "not", "digit"],
"type": "int"
}
],
"params": [
{
"identifier": "a",
"docstring": "this is another comment",
"type": "int",
"docstring_tokens": ["this", "is", "another", "comment"]
},
{
"identifier": "b",
"docstring": "this is a comment",
"type": "int",
"docstring_tokens": ["this", "is", "a", "comment"]
},
],
"outlier_params": [
{
"identifier": "c",
"docstring": "this is a comment, but `c` is not `cal_sum`'s paramerter",
"type": "int",
"docstring_tokens": ["this", "is", "a", "comment", ",", "but", "`", "c", "`", "'", "s", "parameter"]
}
],
"others": []
}
}
- repo the owner/repo
- path full path to the original file
- language the programming language
- license repo license
- parent_name method/class parent node name
- code the part of
original_string
that is code - code_tokens tokenized version of code
- prev_context the (code) block above the comment
- next_context the (code) block below the comment
- original_comment the original comment before cleaning
- start_point (position of start line, position of start character)
- end_point (position of last line, position of last character)
- comment the cleaned comment
- comment_tokens tokenized version of comment
See the example below:
def fix_init_kwarg(self, sender, args, kwargs, **signal_kwargs):
# Anything passed in as self.name is assumed to come from a serializer and
# will be treated as a json string.
if self.name in kwargs:
value = kwargs.pop(self.name)
# Hack to handle the xml serializer's handling of "null"
if value is None:
value = 'null'
kwargs[self.attname] = value
After extracting, we result:
{
"repo": "ithinksw/philo",
"path": "philo/models/fields/__init__.py",
"language": "Python",
"code": "def fix_init_kwarg(self, sender, args, kwargs, **signal_kwargs):\n\t\t# Anything passed in as self.name is assumed to come from a serializer and\n\t\t# will be treated as a json string.\n\t\tif self.name in kwargs:\n\t\t\tvalue = kwargs.pop(self.name)\n\t\t\t\n\t\t\t# Hack to handle the xml serializer's handling of \"null\"\n\t\t\tif value is None:\n\t\t\t\tvalue = 'null'\n\t\t\t\n\t\t\tkwargs[self.attname] = value",
"prev_context": null,
"next_context": {
"code": "if self.name in kwargs:\n\t\t\tvalue = kwargs.pop(self.name)\n\t\t\t\n\t\t\t# Hack to handle the xml serializer's handling of \"null\"\n\t\t\tif value is None:\n\t\t\t\tvalue = 'null'\n\t\t\t\n\t\t\tkwargs[self.attname] = value",
"start_point": [3, 2],
"end_point": [10, 31]
},
"original_comment": "# Anything passed in as self.name is assumed to come from a serializer and\n# will be treated as a json string.",
"start_point": [1, 2],
"end_point": [2, 2],
"comment": " Anything passed in as self.name is assumed to come from a serializer and \n will be treated as a json string.",
"comment_tokens": [
"Anything",
"passed",
"in",
"as",
"self",
".",
"name",
"is",
"assumed",
"to",
"come",
"from",
"a",
"serializer",
"and",
"will",
"be",
"treated",
"as",
"a",
"json",
"string",
"."
]
}