Draft flow PR #127
base: master
Conversation
Reviewer: can you please remove all the commented code?
On:

    if not files_without_tests:
        raise Exception("No java files were found")

    full_dataset_folder = Path(self.dir_to_save) / 'full_dataset'
Reviewer: long method. the part with directory handling could be extracted, for example:
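A minimal sketch of such an extraction (the helper name and the mkdir step are assumptions, not code from this PR):

    from pathlib import Path

    def _prepare_full_dataset_folder(self) -> Path:
        # Hypothetical helper: keep the directory bookkeeping out of
        # the long method so it only deals with file processing.
        full_dataset_folder = Path(self.dir_to_save) / 'full_dataset'
        full_dataset_folder.mkdir(parents=True, exist_ok=True)
        return full_dataset_folder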
On:

    for filename in tqdm(files_without_tests):
        try:
            text = next(result)
            if text:
Reviewer: under what condition can a text be None?
Author: when result is empty or text is an empty string, e.g. when the file consists only of comments.
Reviewer: when can a result be empty, besides the case when the source code is empty?
Author: when we pass an empty file, I believe.
Reviewer: can u leave a comment about it, maybe?
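Something like this would record the thread above in the code itself (a sketch; the stub values stand in for the real objects built earlier in the task):

    from tqdm import tqdm

    files_without_tests: list = []  # collected earlier in the task
    result = iter([])               # generator of extracted source texts

    for filename in tqdm(files_without_tests):
        text = next(result, '')
        # `text` comes back empty when the source file is empty or
        # consists only of comments, so such files are skipped here.
        if text:
            pass  # write the sample, as in the original loop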
On:

    csv = self.inputLoad()['csv']
    rows = [x for _, x in csv.iterrows()]

    with Parallel(n_jobs=2, require='sharedmem') as parallel:
Reviewer: why n_jobs=2?
Author: it's just a template, we can fix it in the future.
Reviewer: can u use self.system_cores_qty, like u did in the preprocess task?
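The requested change would look roughly like this (a sketch; the base class and the surrounding run() body are assumptions based on the diff context):

    import d6tflow
    from joblib import Parallel, delayed

    class TaskFindEM(d6tflow.tasks.TaskPickle):  # base class assumed
        system_cores_qty = d6tflow.IntParameter()

        def _find_EMs(self, row):
            ...  # per-row analysis, defined elsewhere in the PR

        def run(self):
            rows = []  # built from the csv input, as in the diff
            # Use the task parameter instead of a hard-coded n_jobs=2.
            with Parallel(n_jobs=self.system_cores_qty,
                          require='sharedmem') as parallel:
                results = parallel(delayed(self._find_EMs)(r) for r in rows)
            self.save({"data": [x for x in results if x]})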
On:

    with Parallel(n_jobs=2, require='sharedmem') as parallel:
        results = parallel((delayed(self._find_EMs)(a) for a in rows))
    self.save({"data": [x for x in results if x]})
Reviewer: what is the type of x?
Author: the result of the _find_EMs function.
Reviewer: can u type-annotate the _find_EMs function then?
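An annotated signature could look like this (a sketch; the exact return shape is an assumption based on the result_dict seen below, and `row` is a pandas Series because the rows come from csv.iterrows()):

    from typing import Any, Dict, Optional

    import pandas as pd

    def _find_EMs(self, row: pd.Series) -> Optional[Dict[str, Any]]:
        # Each row of the csv describes one file; the return value is a
        # dict of extract-method findings, or None when nothing matched.
        ...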
On:

    def _find_EMs(self, row):
        result_dict = {}
        try:
            ast = AST.build_from_javalang(build_ast(row['original_filename']))
Reviewer: long method. this whole routine about extracting method declarations could be put into a separate method.
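One way to cut it (a sketch; the helper name and the METHOD_DECLARATION node type are assumptions, and the AST framework import depends on which package this repo uses, so it is only indicated in a comment):

    from typing import Dict, List

    # from <ast_framework> import AST, ASTNodeType  # project-specific

    def _collect_method_declarations(self, ast) -> Dict[str, List]:
        # Hypothetical extraction: map each method name to its
        # declaration nodes, so the caller can look up invocations.
        declarations: Dict[str, List] = {}
        for node in ast.get_proxy_nodes(ASTNodeType.METHOD_DECLARATION):
            declarations.setdefault(node.name, []).append(node)
        return declarations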
On:

    target_node = ast.get_subtree(method_node)
    for method_invoked in target_node.get_proxy_nodes(
            ASTNodeType.METHOD_INVOCATION):
        extracted_m_decl = method_declarations.get(method_invoked.member, [])
Reviewer: what is extracted_m_decl? what is its type?
On:

    dir_to_save = d6tflow.Parameter()
    system_cores_qty = d6tflow.IntParameter()

    def _find_EMs(self, row):
Reviewer: can u rename row to something a bit more informative? what's in the row?
On:

        dir_to_save=args.output,
        system_cores_qty=args.jobs
    ))
    data = TaskFindEM(
Reviewer: task TaskFindEM doesn't look like the final step. or is it? where do u write the final dataset to disk?
Author: see inline comments.
d6tflow can't run child tasks in parallel, and a task can save its results only once (child tasks can only be run synchronously). So as soon as a task saves something, it is considered complete. We therefore have to collect all results from the child tasks and pass the whole collection to the next task, and do this at every task boundary.

Since we have to pass a large collection anyway, we keep the files and their metadata in memory. We could instead keep only the metadata of all files, then open each file in a separate thread and find the EM, the target method, and its invocation there, but that is very ugly and takes too much time. A sketch of the save-once pattern follows.
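In d6tflow terms, the constraint looks roughly like this (a sketch with made-up task names; the stubs stand in for the real per-file work):

    import d6tflow

    def list_files():          # stand-in for the real file scan
        return []

    def analyze(f):            # stand-in for the per-file analysis
        return f

    class TaskCollect(d6tflow.tasks.TaskPickle):
        def run(self):
            # A task is complete as soon as it saves, so the whole
            # collection has to be gathered and saved in one call.
            self.save([analyze(f) for f in list_files()])

    @d6tflow.requires(TaskCollect)
    class TaskNext(d6tflow.tasks.TaskPickle):
        def run(self):
            # The next task loads the entire collection back at once.
            data = self.inputLoad()
            self.save(data)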
I have to use joblib, since ProcessPool cannot serialize some of the AST objects.
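Concretely, require='sharedmem' makes joblib pick a thread-based backend, so arguments are shared rather than pickled, which is what lets the non-serializable AST objects through (a minimal sketch):

    from joblib import Parallel, delayed

    def find_ems(obj):
        # Stand-in for the real analysis; `obj` may be unpicklable.
        return obj

    # 'sharedmem' constrains joblib to a backend with shared memory
    # (threads), so the AST objects are never serialized.
    results = Parallel(n_jobs=2, require='sharedmem')(
        delayed(find_ems)(o) for o in [object(), object()]
    )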
You can run the program as:

    python3 main.py --dir /mnt/d/temp/dataset/01/ --output output