ASC Q4 2020 Meeting

Josh Hursey edited this page Feb 18, 2021 · 19 revisions

PMIx Standard Administrative Steering Committee (ASC) 4Q 2020 Meeting

Quick Links

Agenda (timeline in the chart below)

1Q 2021 (Jan. 1 - March 31) - Virtual
 - 1 day: Thurs., Feb. 11
 - 2 day: Tues., Feb. 16 & Thurs., Feb. 18

2Q 2021 (April 1 - June 30) - Virtual
 - 1 day: Thurs., May 6
 - 2 day: Tues., May 11 & Thurs., May 13

3Q 2021 (July 1 - Sept. 30) - Virtual
 - 1 day: Thurs., July 22
 - 2 day: Tues., July 20 & Thurs., July 22

4Q 2021 (Oct. 1 - Dec. 31) - Face2Face if possible
 - 1 day: Thurs., Oct. 21
 - 2 day: Tues., Oct. 26 & Thurs., Oct. 28

Agenda Timeline

We will try to keep to this timeline as best we can. However, discussion items may take more or less time than anticipated, so the agenda may need to be adjusted during the meeting.

We will start roll call promptly at 11:15 am. After this point, the co-chairs may decide, during the meeting, to adjust the timeline based on the discussion.

All times in US Central. (Last Update: Sept. 29, 2020)

Start End Topic
11:00 am 11:05 Gathering
* Slides
11:05 am 11:15 Decide on 2021 quarterly meetings
11:15 am 11:25 Roll Call, Vote on New ASC Members, Call for New ASC Members
- Roll call
- Call for New ASC Members
11:25 am 11:30 Election of "odd year" Co-Chair and Secretary positions
Voting Link - Co-Chair
Voting Link - Secretary
11:30 am 12:00 Governance PR Reading and First Vote
* Clarification of timeline for quarterly meetings items
- https://github.com/pmix/governance/pull/17
* Introduce Revision Exception Vote
- https://github.com/pmix/governance/pull/18
Voting Link
12:00 pm 12:30 Standard PR Reading
- Old PR: https://github.com/pmix/pmix-standard/pull/235
- New PR: https://github.com/dsolt/pmix-standard/pull/11
- PDF in this PR comment
12:30 pm 1:15 Break
1:15 pm 1:20 PMIx 3.2 Release Update
1:20 pm 1:30 PMIx 4.0 Release Update
1:30 pm 1:30 (Announce Voting Results)
1:30 pm 2:15 Standard PR Reading
* Storage Working Group: Query support
- https://github.com/pmix/pmix-standard/issues/277
- https://github.com/pmix/pmix-standard/pull/280
2:15 pm 3:00 Plenary: Integration of Use Cases
* https://github.com/pmix/pmix-wg-slices/pull/1
3:00 pm 3:30 Working Group Updates
- Client Separation / Implementation Agnostic Document Working Group
- Slicing/Grouping of functionality Working Group
- Dynamic Workflows Working Group
- Storage Working Group
* Open Call for New Working Groups
3:30 pm 5:00 Additional Discussion Items

Attendees

  • Kathryn Mohror (LLNL)
  • Josh Hursey (IBM)
  • Howard Pritchard (LANL)
  • Michael Karo (Altair)
  • Stephen Herbein (LLNL)
  • David Solt (IBM)
  • Aurelien Bouteiller (UTK)
  • Ralph Castain (Intel)
  • Thomas Naughton (ORNL)
  • Shane Snyder (ANL)
  • Ken Raffenetti (ANL)
  • Bengisu Elis (TUM)
  • Artem Polyakov (NVIDIA)
  • Jai Dayal (Intel)

Notes

  • ASC 2021 Quarterly Meetings - 1 day or 2
    • Kathryn: With 2 days, if there is minor feedback on a PR on the first day, it could be integrated and the PR re-read on the second day.
    • Michael Karo: If we do have a face-to-face, it should only be on a single day
      • Josh: we could split the meeting across two days but have the first half day in the evening, and a second half day in the morning. You could fly in the first day and fly out the second day.
      • Kathryn: optimally, you could get away with only a single night in the hotel and only spend two days on travel versus three days and two nights in a hotel if we cram the meeting onto a single day
      • Three verbal "votes" in favor of the motion
    • Josh: what about moving the meeting back an hour to start at 8am PT/10am CT, which would make it easier for European attendees?
      • Kathryn: this would put it at early dinner time for Europe
      • Suggestion is to plan for 10am-1pm CT
  • Roll call for attendance/vote
    • Review voting eligibility
    • Note - Mellanox needs to be updated to Nvidia.
    • (Kathryn has roll / paste in notes)
  • Call for new ASC Members
    • none
  • Election of officers (1 Co-chair and 1 Secretary position - both "odd year")
  • Governance PRs up for a vote:
  • PMIx Standard PRs up for a vote:
    • None
  • PMIx Standard PRs up for a Reading
    • pmix/pmix-standard/pull/235
    • pmix/pmix-standard/pull/235/commits/a1370b884f55e62b32b1e2ff1a11e82a4c40cbdd
    • TODO: link to slides
    • Slide 9: term: namespace
      • Ralph: the namespace value must be unique, so it must be assigned by the RM.
      • Dave: we thought there was a way for the client to provide the namespace via an attribute
      • Ralph: there is an attribute you can pass to Info to get the namespace, but no way to set it.
      • Josh: does the second sentence add anything such that removing it causes a problem?
      • Ralph: tools have always been a tricky case. A tool can omit the namespace and be assigned one by the RM. Or the tool can provide one, and if it doesn’t exist or conflicts, it is an error.
    • Slide 12: term: slot
      • Ralph: The intent of "slot" is the number of processes you are allowed to run on a node/resource. A “slot” equals a “process”, e.g., 10 slots, can run 10 processes.
      • Dave: Problems with matching of slots to other resources, e.g., GPU
      • Ralph: The slot number is only for the number of processes for a given node, regardless of the number of resources (num CPUs, GPUs, etc.).
      • Kathryn: So do we want to rephrase as, "slot" indicates the number of processes that can run on a node?
      • Dave: How is that helpful for Comm_spawn for knowing available resources?
      • Ralph: Still able to run N processes on that node irrespective of the number of resources
      • MichaelK: A slot is sort of an intended use of the node. The oversubscribe case can be a different interpretation of this mapping.
      • Ralph: Slot is indicating the number of processes on a given node, e.g., entries in a hostfile to indicate number of processes for that node.
      • MichaelK: You can ask for a given number of "slots" (oversubscribe) and can exceed what might be the resource matched setting, e.g., exceed number of processes to CPUs.
      • Ralph: No correlation of slot to core count.
      • Stephen: Slot is an upper limit on the number of processes that can be created on a resource. Should there be a new term to avoid confusion about the use of "slot"? The intent here is to get the “cap” for number of processes on a given node.
      • Ralph: The term "slot" has been used for a while.
      • Stephen: if requests are not made in terms of slots, that might create a chance for inconsistencies across RMs.
      • Ralph: Need some way to communicate the "cap". Inside PMIx doc, the term “slot” means cap on processes for a node. In PMIx, it sets “universe size”, which is the total number of (TJN: ?slots/processes?) you can spawn.
      • Ralph: Slot has nothing to do with the resource utilization.
      • Stephen: Maybe could clarify with a common example: a slot maps to the number of cores in many RMs (e.g., Flux, PBS).
      • Ralph: Apprehensive to tie slots to number of cores/resources. Try to avoid drawing relationship between slot and resource.
      • Artem: For me, process is also a type of resource (from OS perspective). A process itself is the resource.
      • Dave: Maybe hold off on this for later discussion/slide.
      • Kathryn: Trying to clarify the disconnect between resource/slot.
      • Ralph: Intent being that a slot is a specific tie to a resource. The number of processes that you can run, irrespective of the resources, that cap is the max number of processes you can execute.
    • Slide 14: "Other"
      • Ralph: cannot add a _ to the strings/attributes. The existing code does not check for _ to preserve backwards compatibility. The behavior changes depending on whether an attribute is standardized or not.
      • Dave: we are only restricting what standardized attributes can be labeled as
      • Ralph: we ascribe behaviors to reserved attributes/constants differently than to non-reserved attributes/constants.
      • Dave: so if someone passes in "PMI2…" then the library will consider it a reserved/standardized keyword
      • Passing in "PMIX" is reserved/standardized in existing library code, working group will take this back for review.
    • Slide 16: "Process"
      • Ralph: Not really important to make that distinction ("OS process")
      • Aurelien: In MPI they are avoiding this now by renaming everything as "MPI process" to avoid ambiguity. It could be a process, thread, etc.
      • Ralph: The intent is to just have a process that is represented by a rank (identifier)
      • Aurelien: Possibly avoid ambiguity w/ "PMIx process" for definitions
      • Ralph: Cautions about taking too much time over definition precision
      • Dave: Generally agree for generalizing, possibly using examples to clarify
    • Josh: Sounds like good feedback to WG, and action will be to take it back for updates and possibly bring for reading at next meeting
  • PMIx 3.2 release update
    • Release timeline - Oct. 2020
    • Waiting on finalization of OpenPMIx implementation
    • Changes:
      • PMIx_Allocation_request func signature, PMIX_ALLOC_ID to PMIX_ALLOC_REQ_ID, add new PMIX_ALLOC_ID, update PMIx_generate_regex/ppn and add PMIX_REGEX datatype
  • PMIx 4.0 release update
    • Release timeline - Oct. 2020
    • Finished revisions to Fabrics interfaces
    • Done now, see updates in branch now
    • Ralph: Possibly some changes to Python bindings area
    • Josh: Should we cut another rc?
    • Josh: How about rolling rc today and put 2 week timeout
    • Ralph: Possibly Python bindings updates
    • Josh: Also acknowledgements edits
    • Thomas: Asked a few to take look at Python bindings
    • Josh/All: Thanks for getting this done.
    • Ralph: Yes, and wanted to get done by Oct. 1
  • PMIx Standard PRs up for a Reading
    • Storage Working Group: Query support
    • pmix/pmix-standard/pull/280
    • TODO: link to slides
    • Reading through PDF draft (Ch.15)
    • Discussion:
      • Stephen: For locality, is it only possible to return a single value?
      • Shane: Not sure how common it would be to split return info, e.g., Lustre and BurstBuffer. Not sure about use cases crossing storage.
      • Kathryn: Maybe cases that mix node and network
      • Stephen: Thinking of something like UnifyFS where using software to aggregate across backing store types. Maybe return both node-local and network. Not sure what you would return with UnifyFS?
      • Ralph: How about using a bitmask and then can "logical or" together
      • Aurelien: Locality question on clarification
      • Shane: Intent to reflect latency, e.g., low-latency (node-local), etc.
      • Stephen: At a given mount point, can others on another node access this location?
      • Shane: Maybe want to have attribute to reflect the scope of the data sharing would be useful, and use bitmask to allow for multiple locality types.
      • Stephen: Use of "optimal" in attribute might be too strong. Maybe use something like “suggested”.
      • Shane: Yes, that seems good. This text description was likely just carried over from the statfs description this was based upon.
      • Stephen: For access types, the attributes, "PMIX_STORAGE_ACCESS_TYPE" - the use here is to pass in qualifiers for a specific spot
      • Shane: Yes, would use the storage access qualifier to restrict the scope for what you are querying on
      • Stephen: Things marked float, is that 32 bit?
      • Shane: Assumed was float but could be double? Assumed 32 bit float. Should double-check to confirm that’s sufficient.
    • Reviewing comments on PR #280
      • Kathryn: Regarding item related to PMIX_STORAGE_TYPE and returning the OS kernel fs type, e.g., "gpfs", “lustre”, etc., how would user-space filesystems show up?
      • Shane: That would likely come to the PMIx server to manage that.
      • Aurelien: This would also apply to FUSE fs’s, b/c they would all return "fuse".
      • Shane: If not a well defined POSIX method, then have to use a server/client ad hoc approach
      • Stephen: Seem to think the change from enum to value-string is fine for current PR (small enough changeset).
      • Shane: Tend to agree that keeps things to smallish size change and can start pulling things out (reduce). So call this as a "string-type" and avoid the enum.
      • Ralph: If use generic string-type, then need some well defined set of strings to know what to look for to have meaning, e.g., "lustre" is value that you know to go looking for. Try to avoid the “magic words” and instead just use some constants that people can use without having to go hunting.
      • Shane: Makes sense, need some constant attributes.
      • Kathryn: What about "unknown" from enum?
      • Shane: Get rid of that constant, but keep the well defined items.
      • Aurelien: Need to have clear set of names and ways to define them, to avoid having ugly or problematic names. Rules for names helpful.
      • Aurelien: Maybe use reverse URI, e.g., "comm.x.y"
      • Shane: What would one of those look like for Lustre?
      • Aurelien: Hmmm, maybe not useful
      • Shane: Maybe best thing is to standardize on kernel filesystems and leave room for others to customize as needed.
      • Shane: Suggestions for integrating these changes for this PR?
      • Josh: Sounds like this should be going under the provisional process (faster). You can do a reading, have a vote w/in same meeting and then a 2 week window, assume pass vote and comment-window. Then goes into the release. And can then have another provisional vote later for more pieces. And this would allow for getting provisional pieces. And then later (e.g., Q1) to re-read to the spec.
      • At the moment that would look like it could be in the v4.1 timeframe. Need to talk w/ release managers.
  • Plenary discussion items
    • Slices Working Group: Integration of Use Cases (30-45 min)
    • Plan: Interested in getting feedback on layout/format, e.g., in spec or separate doc. Also going to give review of LaTeX bits for doing this.
    • https://github.com/pmix/pmix-wg-slices/pull/1
      • Reviewing PDF on PR (Ch.3) "Use-Cases"
    • General feedback was this looks great! Comments suggested it may be best to keep as a single doc, probably as an appendix.
  • Working Group Presentations
    • Client Separation / Implementation Agnostic Document Working Group
      • Many of the changes have been pulled into V4
      • Made some changes to put, get, publish and how those are presented, but many have been pulled in/addressed by Ralph in V4. Will be restarting on those based on V4 changes.
    • Slicing/Grouping of functionality Working Group
      • No updates besides plenary
    • Dynamic Workflows Working Group
    • Storage Working Group
      • No updates besides reading
  • Open call for new Working Groups
    • No comments from community
  • Discussion items
    • Chapter Chairs
      • Kathryn: other standards bodies sometimes have people in charge of a chapter. Any changes to that chapter are aided and reviewed by them.
      • Stephen: could leverage GitHub’s Code Owners
  • Monthly meeting next week is canceled
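
The bitmask idea raised in the storage locality discussion (combining locality types with a "logical or") could look roughly like the sketch below. It is only an illustration: the flag names and the `has_locality` helper are hypothetical and do not come from the PMIx Standard, PR #280, or the PMIx Python bindings.

```python
from enum import IntFlag


class StorageLocality(IntFlag):
    """Hypothetical locality flags; not names from the PMIx Standard."""
    NODE_LOCAL = 0x1  # e.g., a node-local device or burst buffer
    NETWORK = 0x2     # e.g., a parallel file system reached over the network


def has_locality(mask: StorageLocality, flag: StorageLocality) -> bool:
    """Return True if every bit of `flag` is set in `mask`."""
    return (mask & flag) == flag


# A file system that aggregates several backing stores could report
# both locality types at once by OR-ing the flags together.
aggregated = StorageLocality.NODE_LOCAL | StorageLocality.NETWORK

# A purely network-attached file system would report a single flag.
network_only = StorageLocality.NETWORK
```

With this shape, a query for an aggregated file system returns one value, and the caller tests for each locality type separately, e.g., `has_locality(aggregated, StorageLocality.NODE_LOCAL)`.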