Draft flow PR #127
base: master
Conversation
Reviewer: can you please remove all the commented code?
On:

    if not files_without_tests:
        raise Exception("No java files were found")

    full_dataset_folder = Path(self.dir_to_save) / 'full_dataset'
Reviewer: long method. the part with directory handling could be extracted, for example:
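A minimal sketch of such an extraction (the helper name and the mkdir step are assumptions, not code from this PR):

    from pathlib import Path

    def _prepare_full_dataset_folder(self) -> Path:
        # Hypothetical helper: keep the directory bookkeeping out of
        # the long method so it only deals with file processing.
        full_dataset_folder = Path(self.dir_to_save) / 'full_dataset'
        full_dataset_folder.mkdir(parents=True, exist_ok=True)
        return full_dataset_folder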
On:

    for filename in tqdm(files_without_tests):
        try:
            text = next(result)
            if text:
Reviewer: under what condition can a text be None?
Author: when result is empty or text is an empty string, e.g. when the file consists only of comments.
Reviewer: when can a result be empty, besides the case when the source code is empty?
Author: when we pass an empty file, I believe.
Reviewer: can u leave a comment about it, maybe?
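Something like this would record the thread above in the code itself (a sketch; the stub values stand in for the real objects built earlier in the task):

    from tqdm import tqdm

    files_without_tests: list = []  # collected earlier in the task
    result = iter([])               # generator of extracted source texts

    for filename in tqdm(files_without_tests):
        text = next(result, '')
        # `text` comes back empty when the source file is empty or
        # consists only of comments, so such files are skipped here.
        if text:
            pass  # write the sample, as in the original loop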
On:

    csv = self.inputLoad()['csv']
    rows = [x for _, x in csv.iterrows()]

    with Parallel(n_jobs=2, require='sharedmem') as parallel:
Reviewer: why n_jobs=2?
Author: it's just a template, we can fix it in the future.
Reviewer: can u use self.system_cores_qty, like u did in the preprocess task?
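The requested change would look roughly like this (a sketch; the base class and the surrounding run() body are assumptions based on the diff context):

    import d6tflow
    from joblib import Parallel, delayed

    class TaskFindEM(d6tflow.tasks.TaskPickle):  # base class assumed
        system_cores_qty = d6tflow.IntParameter()

        def _find_EMs(self, row):
            ...  # per-row analysis, defined elsewhere in the PR

        def run(self):
            rows = []  # built from the csv input, as in the diff
            # Use the task parameter instead of a hard-coded n_jobs=2.
            with Parallel(n_jobs=self.system_cores_qty,
                          require='sharedmem') as parallel:
                results = parallel(delayed(self._find_EMs)(r) for r in rows)
            self.save({"data": [x for x in results if x]})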
On:

    with Parallel(n_jobs=2, require='sharedmem') as parallel:
        results = parallel((delayed(self._find_EMs)(a) for a in rows))
    self.save({"data": [x for x in results if x]})
Reviewer: what is the type of x?
Author: the result of the _find_EMs function.
Reviewer: can u type-annotate the _find_EMs function then?
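An annotated signature could look like this (a sketch; the exact return shape is an assumption based on the result_dict seen below, and `row` is a pandas Series because the rows come from csv.iterrows()):

    from typing import Any, Dict, Optional

    import pandas as pd

    def _find_EMs(self, row: pd.Series) -> Optional[Dict[str, Any]]:
        # Each row of the csv describes one file; the return value is a
        # dict of extract-method findings, or None when nothing matched.
        ...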
On:

    def _find_EMs(self, row):
        result_dict = {}
        try:
            ast = AST.build_from_javalang(build_ast(row['original_filename']))
Reviewer: long method. this whole routine about extracting method declarations could be put into a separate method.
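One way to cut it (a sketch; the helper name and the METHOD_DECLARATION node type are assumptions, and the AST framework import depends on which package this repo uses, so it is only indicated in a comment):

    from typing import Dict, List

    # from <ast_framework> import AST, ASTNodeType  # project-specific

    def _collect_method_declarations(self, ast) -> Dict[str, List]:
        # Hypothetical extraction: map each method name to its
        # declaration nodes, so the caller can look up invocations.
        declarations: Dict[str, List] = {}
        for node in ast.get_proxy_nodes(ASTNodeType.METHOD_DECLARATION):
            declarations.setdefault(node.name, []).append(node)
        return declarations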
On:

    target_node = ast.get_subtree(method_node)
    for method_invoked in target_node.get_proxy_nodes(
            ASTNodeType.METHOD_INVOCATION):
        extracted_m_decl = method_declarations.get(method_invoked.member, [])
Reviewer: what is extracted_m_decl? what is its type?
On:

    dir_to_save = d6tflow.Parameter()
    system_cores_qty = d6tflow.IntParameter()

    def _find_EMs(self, row):
Reviewer: can u rename row to something a bit more informative? what's in the row?
On:

        dir_to_save=args.output,
        system_cores_qty=args.jobs
    ))
    data = TaskFindEM(
Reviewer: task TaskFindEM doesn't look like the final step. or is it? where do u write the final dataset to disk?
Author: see inline comments.
d6tflow can't run child tasks in parallel, and a task can save its results only once (child tasks can only be run synchronously). So as soon as a task saves something, it is considered complete. We therefore have to collect all results from the child tasks and pass the whole collection to the next task, and do this at every task boundary.

Since we have to pass a large collection anyway, we keep the files and their metadata in memory. We could instead keep only the metadata of all files, then open each file in a separate thread and find the EM, the target method, and its invocation there, but that is very ugly and takes too much time. A sketch of the save-once pattern follows.
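In d6tflow terms, the constraint looks roughly like this (a sketch with made-up task names; the stubs stand in for the real per-file work):

    import d6tflow

    def list_files():          # stand-in for the real file scan
        return []

    def analyze(f):            # stand-in for the per-file analysis
        return f

    class TaskCollect(d6tflow.tasks.TaskPickle):
        def run(self):
            # A task is complete as soon as it saves, so the whole
            # collection has to be gathered and saved in one call.
            self.save([analyze(f) for f in list_files()])

    @d6tflow.requires(TaskCollect)
    class TaskNext(d6tflow.tasks.TaskPickle):
        def run(self):
            # The next task loads the entire collection back at once.
            data = self.inputLoad()
            self.save(data)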
I have to use joblib, since ProcessPool cannot serialize some of the AST objects.
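Concretely, require='sharedmem' makes joblib pick a thread-based backend, so arguments are shared rather than pickled, which is what lets the non-serializable AST objects through (a minimal sketch):

    from joblib import Parallel, delayed

    def find_ems(obj):
        # Stand-in for the real analysis; `obj` may be unpicklable.
        return obj

    # 'sharedmem' constrains joblib to a backend with shared memory
    # (threads), so the AST objects are never serialized.
    results = Parallel(n_jobs=2, require='sharedmem')(
        delayed(find_ems)(o) for o in [object(), object()]
    )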
You can run the program as:

    python3 main.py --dir /mnt/d/temp/dataset/01/ --output output