Draft flow PR #127

Draft · wants to merge 8 commits into master · showing changes from 7 commits
63 changes: 63 additions & 0 deletions veniq/dataset_collection/dataflow/collectEM.py
@@ -0,0 +1,63 @@
from collections import defaultdict
from typing import List, Dict

import d6tcollect
import d6tflow
# from veniq.dataset_collection.augmentation import InvocationType
from joblib import Parallel, delayed

from veniq.ast_framework import AST
from veniq.ast_framework import ASTNodeType, ASTNode
from veniq.dataset_collection.augmentation import collect_info_about_functions_without_params
from veniq.dataset_collection.dataflow.preprocess import TaskAggregatorJavaFiles
from veniq.utils.ast_builder import build_ast

d6tcollect.submit = False


@d6tflow.requires({'csv': TaskAggregatorJavaFiles})
class TaskFindEM(d6tflow.tasks.TaskCache):
dir_to_search = d6tflow.Parameter()
dir_to_save = d6tflow.Parameter()
system_cores_qty = d6tflow.IntParameter()

def _find_EMs(self, row):
Reviewer comment (Member): can you rename `row` to something a bit more informative? What's in the row?

result_dict = {}
try:
ast = AST.build_from_javalang(build_ast(row['original_filename']))
Reviewer comment (Member): long method; this whole routine about extracting method declarations could be put into a separate method.

classes_declaration = [
ast.get_subtree(node)
for node in ast.get_root().types
if node.node_type == ASTNodeType.CLASS_DECLARATION
]
method_declarations: Dict[str, List[ASTNode]] = defaultdict(list)
for class_ast in classes_declaration:
class_declaration = class_ast.get_root()
collect_info_about_functions_without_params(class_declaration, method_declarations)

methods_list = list(class_declaration.methods) + list(class_declaration.constructors)
for method_node in methods_list:
target_node = ast.get_subtree(method_node)
for method_invoked in target_node.get_proxy_nodes(
ASTNodeType.METHOD_INVOCATION):
extracted_m_decl = method_declarations.get(method_invoked.member, [])
Reviewer comment (Member): what is `extracted_m_decl`? What is its type?

if len(extracted_m_decl) == 1:
result_dict[method_invoked.line] = [target_node, method_invoked, extracted_m_decl]
# print({'em_list': result_dict, 'ast': ast})
if result_dict:
print(f'Found extract-method candidates: {result_dict}')
return [{'em_list': result_dict, 'ast': ast}]
else:
return {}
except Exception:
pass

return {}

def run(self):
csv = self.inputLoad()['csv']
rows = [x for _, x in csv.iterrows()]

with Parallel(n_jobs=2, require='sharedmem') as parallel:
Reviewer comment (KatGarmash, Member, Feb 5, 2021): why n_jobs=2?

Author reply: it's just a template; we can fix it in the future.

Reviewer comment (Member): can you use self.system_cores_qty, like you did in the preprocess task?

results = parallel((delayed(self._find_EMs)(a) for a in rows))
self.save({"data": [x for x in results if x]})
Reviewer comment (Member): what is the type of x?

Author reply (lyriccoder, Feb 5, 2021): the result of the _find_EMs function.

Reviewer comment (Member): can you type-annotate the _find_EMs function, then?

64 changes: 64 additions & 0 deletions veniq/dataset_collection/dataflow/main.py
@@ -0,0 +1,64 @@
import os
from argparse import ArgumentParser

import d6tcollect
import d6tflow

from veniq.dataset_collection.dataflow.collectEM import TaskFindEM

d6tcollect.submit = False

if __name__ == '__main__':
system_cores_qty = os.cpu_count() or 1
parser = ArgumentParser()
parser.add_argument(
"-d",
"--dir",
required=True,
help="File path to JAVA source code for methods augmentations"
)
parser.add_argument(
"-o", "--output",
help="Path for file with output results",
default='augmented_data'
)
parser.add_argument(
"--jobs",
"-j",
type=int,
default=system_cores_qty - 1,
help="Number of processes to spawn. "
"By default one less than the number of cores. "
"Be careful raising it above that: the machine may stop responding while creating the dataset.",
)
parser.add_argument(
"-z", "--zip",
action='store_true',
help="To zip input and output files."
)
parser.add_argument(
"-s", "--small_dataset_size",
(lyriccoder marked this conversation as resolved.)
help="Number of files in small dataset",
default=100,
type=int,
)

args = parser.parse_args()
d6tflow.preview(
TaskFindEM(
dir_to_search=args.dir,
dir_to_save=args.output,
system_cores_qty=args.jobs))
d6tflow.run(
TaskFindEM(
dir_to_search=args.dir,
dir_to_save=args.output,
system_cores_qty=args.jobs
))
data = TaskFindEM(
Reviewer comment (Member): task TaskFindEM doesn't look like the final step, or is it? Where do you write the final dataset to disk?

dir_to_search=args.dir,
dir_to_save=args.output,
system_cores_qty=args.jobs
).outputLoad(cached=False)

print(data)
114 changes: 114 additions & 0 deletions veniq/dataset_collection/dataflow/preprocess.py
@@ -0,0 +1,114 @@
import hashlib
import re
from pathlib import Path

import d6tcollect
import d6tflow
import pandas as pd
# from veniq.dataset_collection.augmentation import InvocationType
from pebble import ProcessPool
from tqdm import tqdm

from veniq.utils.encoding_detector import read_text_with_autodetected_encoding

d6tcollect.submit = False


class TaskAggregatorJavaFiles(d6tflow.tasks.TaskCSVPandas):
dir_to_search = d6tflow.Parameter()
dir_to_save = d6tflow.Parameter()
system_cores_qty = d6tflow.IntParameter()

columns = [
'project_id',
'original_filename',
'class_name',
'invocation_line_string',
'invocation_line_number_in_original_file',
'target_method',
'target_method_start_line',
'extract_method',
'extract_method_start_line',
'extract_method_end_line',
'output_filename',
'is_valid_ast',
'insertion_start',
'insertion_end',
'ncss_target',
'ncss_extracted',
'do_nothing',
'ONE_LINE_FUNCTION',
'NO_IGNORED_CASES'
] # + [x for x in InvocationType.list_types()]

def _remove_comments(self, string: str):
pattern = r"(\".*?\"|\'.*?\')|(/\*.*?\*/|//[^\r\n]*$)"
# first group captures quoted strings (double or single)
# second group captures comments (//single-line or /* multi-line */)
regex = re.compile(pattern, re.MULTILINE | re.DOTALL)

def _replacer(match):
# if the 2nd group (capturing comments) is not None,
# it means we have captured a non-quoted (real) comment string.
if match.group(2) is not None:
# so we will return empty to remove the comment
return ""
else: # otherwise, we will return the 1st group
return match.group(1) # captured quoted-string

return regex.sub(_replacer, string)
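The comment-stripping regex in `_remove_comments` can be exercised standalone. This sketch reuses the exact pattern from the diff to show why group 1 (quoted strings) is kept while group 2 (comments) is dropped: comment markers inside string literals survive. The sample Java snippet is invented for the example.

```python
import re

# Group 1 captures quoted strings (kept verbatim); group 2 captures
# //single-line and /* multi-line */ comments (removed). Scanning
# left to right, a "//" inside a string literal is consumed by the
# string alternative first, so it is never treated as a comment.
pattern = r"(\".*?\"|\'.*?\')|(/\*.*?\*/|//[^\r\n]*$)"
regex = re.compile(pattern, re.MULTILINE | re.DOTALL)

def remove_comments(source: str) -> str:
    def replacer(match):
        # A non-None group 2 means a real (non-quoted) comment: drop it.
        if match.group(2) is not None:
            return ""
        return match.group(1)  # quoted string: keep as-is
    return regex.sub(replacer, source)

java = 'String url = "http://x"; // trailing comment\n/* block */ int a = 1;'
cleaned = remove_comments(java)
```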

def _preprocess(self, file):
original_text = read_text_with_autodetected_encoding(str(file))
# remove comments
text_without_comments = self._remove_comments(original_text)
# remove whitespaces
text = "\n".join([ll.rstrip() for ll in text_without_comments.splitlines() if ll.strip()])

return text

def _save_text_to_new_file(self, input_dir: Path, text: str, filename: Path) -> Path:
# need to avoid situation when filenames are the same
hash_path = hashlib.sha256(str(filename.parent).encode('utf-8')).hexdigest()
dst_filename = input_dir / f'{filename.stem}_{hash_path}.java'
if not dst_filename.parent.exists():
dst_filename.parent.mkdir(parents=True)
if not dst_filename.exists():
with open(dst_filename, 'w', encoding='utf-8') as w:
w.write(text)

return dst_filename
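The collision-avoidance trick in `_save_text_to_new_file` can be illustrated in isolation: hashing the parent path into the output name means two files sharing a stem in different directories get distinct names. The paths below are invented for the example.

```python
import hashlib
from pathlib import Path

def dedup_name(filename: Path) -> str:
    # Same scheme as _save_text_to_new_file: hash the parent directory
    # so identically named files from different folders don't collide.
    hash_path = hashlib.sha256(str(filename.parent).encode('utf-8')).hexdigest()
    return f'{filename.stem}_{hash_path}.java'

a = dedup_name(Path('projA/src/Main.java'))
b = dedup_name(Path('projB/src/Main.java'))
```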

def run(self):
test_files = set(Path(self.dir_to_search).glob('**/*Test*.java'))
not_test_files = set(Path(self.dir_to_search).glob('**/*.java'))
files_without_tests = list(not_test_files.difference(test_files))
if not files_without_tests:
raise Exception("No java files were found")
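The test-file exclusion above relies on a set difference over two globs: any `.java` file whose name contains "Test" is dropped. A minimal self-contained illustration (file names invented):

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as d:
    root = Path(d)
    (root / "Foo.java").write_text("class Foo {}")
    (root / "FooTest.java").write_text("class FooTest {}")
    # Same two globs as in run(): test files are the *Test* subset.
    test_files = set(root.glob('**/*Test*.java'))
    all_files = set(root.glob('**/*.java'))
    kept = [p.name for p in all_files - test_files]
```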

full_dataset_folder = Path(self.dir_to_save) / 'full_dataset'
Reviewer comment (Member): long method; the part with directory handling could be extracted, for example.

if not full_dataset_folder.exists():
full_dataset_folder.mkdir(parents=True)
self.output_dir = full_dataset_folder / 'output_files'
if not self.output_dir.exists():
self.output_dir.mkdir(parents=True)
self.input_dir = full_dataset_folder / 'input_files'
if not self.input_dir.exists():
self.input_dir.mkdir(parents=True)
df = pd.DataFrame(columns=['original_filename'])
with ProcessPool(self.system_cores_qty) as executor:
future = executor.map(self._preprocess, files_without_tests, timeout=200, )
result = future.result()
for filename in tqdm(files_without_tests):
try:
text = next(result)
if text:
Reviewer comment (Member): under what condition can `text` be None?

Author reply: when the result is empty or an empty string, e.g. when the file consists only of comments.

Reviewer comment (Member): when can a result be empty, besides the case when the source code is empty?

Author reply: when we pass an empty file, I believe.

Reviewer comment (Member): can you leave a comment about it, maybe?

df = df.append(
{'original_filename': self._save_text_to_new_file(self.input_dir, text,
filename).absolute()},
ignore_index=True
)
except Exception as e:
print(str(e))

self.save(data=df)