Meeting 2019 03 04
Face-to-face meeting co-located with the MPI Forum meeting
Attendees:
Josh Hursey (IBM)
Jeff Squyres (Cisco)
Dan Holmes (EPCC)
Noah Evans (Sandia)
Martin Schulz (TUM/LRZ)
Ken Raffenetti (ANL)
Brice Goglin (Inria)
Julien Jaeger (CEA)
Wesley Bland (Intel)
Julien Adam (Paratools)
Suren Byna (LBNL)
Stephen Herbein (LLNL)
Shane Snyder (ANL)
Geoffroy Vallee (ORNL)
Aurelien Bouteiller (UTK)
Sourav Chakraborty (OSU)
Artem Polyakov (Mellanox)
Kathryn Mohror (LLNL)
Barry Rountree (LLNL)
Tom Scogland (LLNL)
Howard Pritchard
Guillaume Papauré (Atos)
Jithin Jose (Microsoft)
Michael Chuvelev (Intel)
Thomas Naughton
Shinji Sumimoto (Fujitsu)
9:00 AM – 10:00 AM: Introduction of participants (familiarity with PMI family of interfaces, and goals for attending this meeting)
10:00 AM – 10:30 AM: Brief overviews of current system runtime interfaces
- PMI-1 / PMI-2
- PMIx
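For reference (this sketch was not part of the meeting discussion), a minimal PMI-1 client interaction looks roughly like the following; the key names and endpoint strings are invented for illustration and error handling is omitted:

```c
/* Minimal PMI-1 wireup sketch: publish one key/value pair, synchronize,
 * then read a peer's entry.  Functions come from the classic pmi.h
 * (MPICH/Slurm); the key and value contents are made up. */
#include <stdio.h>
#include <pmi.h>

int main(void)
{
    int spawned, rank, size;
    char kvsname[256], key[64], value[256];

    PMI_Init(&spawned);
    PMI_Get_rank(&rank);
    PMI_Get_size(&size);
    PMI_KVS_Get_my_name(kvsname, sizeof(kvsname));

    /* Publish this process's "endpoint" under a rank-specific key. */
    snprintf(key, sizeof(key), "addr-%d", rank);
    snprintf(value, sizeof(value), "endpoint-of-rank-%d", rank);
    PMI_KVS_Put(kvsname, key, value);
    PMI_KVS_Commit(kvsname);

    /* Global synchronization; afterwards every rank can read all keys. */
    PMI_Barrier();

    /* Read the next rank's endpoint. */
    snprintf(key, sizeof(key), "addr-%d", (rank + 1) % size);
    PMI_KVS_Get(kvsname, key, value, sizeof(value));
    printf("rank %d sees neighbor endpoint: %s\n", rank, value);

    PMI_Finalize();
    return 0;
}
```

PMIx follows the same put/commit/fence/get pattern with richer types and attributes; a PMIx sketch appears later under “Synchronization models”.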
10:30 AM: Order lunch online from Jimmy Johns, everyone pays for their own lunch (credit card)
10:30 AM – 10:45 AM: Break
10:45 AM – 12:00 PM: Discussion of use cases for a system runtime interface
- Intel MPI use case
- MPI wireup, backward compatibility
- Don’t standardize the server interface, but have ABI compatibility for the clients to swap out vendor-specific versions of PMI(x) library
- PMIx
- Standard contains structure definitions but should be more abstract
- Values are defined as structures - may lead to inefficiencies and limit optimizations.
- Good generalization of keys - instead of having a variety of APIs, use predefined keys with a single general interface
- Like having different, custom strings for vendor-specific keys (and associated data types). No need to have these in the standard; the standard API is flexible enough to handle this.
- MPI Wireup
- MPI Dynamics
- MPI Spawn
- MPI Sessions
- Flexible MPI_Init/Finalize and grouping of participating processes
- Depends on PMIx Groups functionality
- Programming model interoperability
- Coordination of Open MPI and OpenMP (inter-runtime coordination)
- Placement, mapping, binding of threads/processes
- Is PMI the right place for programming model interoperability? Why can’t we do it at the OS layer?
- The OS does not provide the interfaces we need
- What is the information being exchanged for MPI and OpenMP
- Mapping and binding (early in application run, may not be applicable to other use cases)
- Many times the information needed is runtime level and not OS level. The OS does not know anything about them
- PMI is for communication between libraries and notifications to libraries
- Both libraries can start up in the same job process and know they co-exist. How they collaborate is up to them
- Event notification to enable progress, possibly too slow for some use cases but functional
- Fujitsu’s use case
- User Job Startup, Finalize, Spawn
- User Job Information:
- Topology and shape: ex. 3D torus 4x8x4, 2D mesh 8x32, 1D ring 256x1, Rank Order Information.
- Hardware related: Tofu HW RDMA channels, barrier, etc.
- User Defined Rank maps
- Routing information with fault nodes
- Job Account and Log Information: User/System time, performance (FLOPS, CPI), file usage, etc.
- Resilient library support in OpenMP (inter-library coordination)
- Based on the quality of service (performance, energy)
- Distributed computing
- runtime key/value store
- Wireup (out-of-band)
- Fault tolerance
- Quality of Service approach
- Store resilience requirements into K/V store
- Event notification
- Checkpoint/restart
- Co-locate daemons with processes in the job
- Wireup between services
- Notification of when things fail
- Information about storage devices in the system (multi-level storage model) - need metadata about storage options
- If we don’t have fault tolerance, the interface won’t be used outside of HPC
- Failure detection, failure propagation are needed by fault tolerance models in general
- In PMI, processes can query to find out which processes died; there is no automatic notification like in PMIx (a registration sketch follows below)
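To make the failure-notification point concrete, the sketch below shows the assumed PMIx pattern for registering an event handler; it was not discussed in the meeting, and the status code and handler behavior are illustrative only:

```c
/* Sketch: register a PMIx event handler so the library is told about
 * process failures instead of polling.  PMIX_ERR_PROC_ABORTED is used
 * purely as an example status code. */
#include <stdio.h>
#include <pmix.h>

/* Invoked by the PMIx runtime when a registered event fires. */
static void failure_handler(size_t evhdlr_registration_id,
                            pmix_status_t status,
                            const pmix_proc_t *source,
                            pmix_info_t info[], size_t ninfo,
                            pmix_info_t results[], size_t nresults,
                            pmix_event_notification_cbfunc_fn_t cbfunc,
                            void *cbdata)
{
    (void)evhdlr_registration_id; (void)info; (void)ninfo;
    (void)results; (void)nresults;
    fprintf(stderr, "process %s:%u reported status %d\n",
            source->nspace, source->rank, status);
    /* Tell the runtime we are done handling this event. */
    if (NULL != cbfunc) {
        cbfunc(PMIX_EVENT_ACTION_COMPLETE, NULL, 0, NULL, NULL, cbdata);
    }
}

/* Registration completion callback (registration is non-blocking). */
static void regcb(pmix_status_t status, size_t refid, void *cbdata)
{
    (void)status; (void)refid; (void)cbdata;
}

void register_failure_events(void)
{
    pmix_status_t codes[] = { PMIX_ERR_PROC_ABORTED };
    PMIx_Register_event_handler(codes, 1, NULL, 0,
                                failure_handler, regcb, NULL);
}
```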
- Workflows
- Currently, workflows use the file system to communicate between jobs/tasks
- Need a standard interface for job status and control, notification of job life cycle, cancel jobs
- Communicate network topology or other system information at ‘spawn’ time; binding/ordering requirements; power requirements.
- Dask use case
- Long-lived tasks that can act as an MPI process for a while, then drop their identity as an MPI task, do other work, then re-form an MPI ‘group’ for the next MPI activity.
- Radical Pilot
- Run a job within a job (job steps). Uses PRRTE right now
- Containers
- Singularity use case
- Interface between the host RM and the application inside the container
- CharlieCloud use case
- “Host launch” vs “container launch”
- mpirun outside the container vs inside the container
- Some challenges with PMIx Ref. Library working in both models
- Need to be careful about how the library was compiled and versioning
- PMI1 wire protocol works across container boundary
- Success at both LLNL and Intel using this model with Singularity
- Need to keep thinking about how to manage this boundary
- Wire protocol as part of the standard
- Need versioning in the protocol as part of the specification (an illustrative exchange is sketched below)
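For context, the PMI-1 wire protocol mentioned above is a newline-terminated, text-based exchange over a socket inherited from the launcher, which is why it travels easily across a container boundary. A loosely reconstructed exchange (field names vary by implementation, so treat this as illustrative only) looks roughly like:

```
client: cmd=init pmi_version=1 pmi_subversion=1
server: cmd=response_to_init pmi_version=1 pmi_subversion=1 rc=0
client: cmd=put kvsname=kvs_0 key=addr-3 value=endpoint-of-rank-3
server: cmd=put_result rc=0
client: cmd=barrier_in
server: cmd=barrier_out
client: cmd=get kvsname=kvs_0 key=addr-4
server: cmd=get_result rc=0 value=endpoint-of-rank-4
```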
- Tools support
- Performance tools
- Debuggers
- IO Forwarding
- Notifications - events
- Discovery of resources
- Network topology
- Query nodes for information about capabilities
- Storage options available in the system
- Discovery of peers on the system
- Query the system for RAS events in a generic manner
- A way to send RAS events to the SMS or log
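As a sketch of the generic query mechanism mentioned above (not discussed in this level of detail at the meeting), the assumed PMIx pattern below asks the runtime which namespaces it knows about; other query keys would follow the same shape:

```c
/* Sketch: non-blocking PMIx query for the namespaces (jobs) the runtime
 * knows about.  A real client would wait on a condition variable rather
 * than busy-wait. */
#include <stdio.h>
#include <pmix.h>

/* Delivered asynchronously with the query results. */
static void query_cb(pmix_status_t status,
                     pmix_info_t *info, size_t ninfo,
                     void *cbdata,
                     pmix_release_cbfunc_t release_fn, void *release_cbdata)
{
    if (PMIX_SUCCESS == status && ninfo > 0 &&
        PMIX_STRING == info[0].value.type) {
        printf("known namespaces: %s\n", info[0].value.data.string);
    }
    if (NULL != release_fn) {
        release_fn(release_cbdata);
    }
    *(volatile int *)cbdata = 1;   /* signal completion to the caller */
}

void list_namespaces(void)
{
    pmix_status_t rc;
    pmix_query_t query;
    volatile int done = 0;

    PMIX_QUERY_CONSTRUCT(&query);
    PMIX_ARGV_APPEND(rc, query.keys, PMIX_QUERY_NAMESPACES);

    rc = PMIx_Query_info_nb(&query, 1, query_cb, (void *)&done);
    if (PMIX_SUCCESS == rc) {
        while (!done) { /* spin; see comment above */ }
    }
    PMIX_QUERY_DESTRUCT(&query);
}
```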
- Dynamic interaction with the scheduler
- Asking scheduler what capabilities are available
- Feedback of dynamic allocation
- Support for interactions with schedulers like Kubernetes (k8s) and Mesos
- Monitoring of services
- Event model to inform growing/shrinking of resources
- Set a threshold for load (for example) that provides a callback which you can use to spawn additional services
- Language support
- Beyond a C-only interface
- How to support interfaces in other languages (e.g., Rust, Go)
- Support for cross-job (in scheduler sense) communication through PMI
- Publish + Connect/Accept
- Need to discover what resources are available
- General purpose services running on a system
- Services that come/go over time
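A hedged sketch of the publish/lookup rendezvous referenced above (the service key and contact string are made up; after the lookup, the two jobs could call PMIx_Connect over the combined set of processes):

```c
/* Sketch: cross-job rendezvous via the PMIx data service.  One job
 * publishes a contact string; a process in another job looks it up. */
#include <stdio.h>
#include <pmix.h>

/* "Server" side: advertise a contact string under an agreed-upon key. */
void advertise_service(void)
{
    pmix_info_t info;
    PMIX_INFO_LOAD(&info, "example.service.contact",
                   "tcp://host:1234", PMIX_STRING);
    PMIx_Publish(&info, 1);
    PMIX_INFO_DESTRUCT(&info);
}

/* "Client" side: find the contact string published by the other job. */
void find_service(char *contact, size_t len)
{
    pmix_pdata_t pdata;

    PMIX_PDATA_CONSTRUCT(&pdata);
    snprintf(pdata.key, PMIX_MAX_KEYLEN, "%s", "example.service.contact");

    if (PMIX_SUCCESS == PMIx_Lookup(&pdata, 1, NULL, 0) &&
        PMIX_STRING == pdata.value.type) {
        snprintf(contact, len, "%s", pdata.value.data.string);
    }
    PMIX_PDATA_DESTRUCT(&pdata);
}
```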
- We should not solve the security problem, but use the system-provided services
- May need to define hooks to allow for these services to be plugged in
- Synchronization models
- Flexible fences in PMIx (fences don’t need to be global)
- Conduit to provide key to server
- Allgather/scatter doesn’t need to be so strict; a softer allgather
- Need collectives so that we don’t have to do explicit key gets, reduce network traffic, reduce N-N queries
- Maybe a “range get” would be useful
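A minimal PMIx wireup sketch illustrating the flexible fence discussed above, assuming the OpenPMIx client API (the key and endpoint string are invented): data is put locally, committed, exchanged via a fence that requests data collection, and then retrieved with a get.

```c
/* Sketch: PMIx put/commit/fence/get.  Passing NULL/0 for the proc array
 * requests a fence across the caller's whole namespace; PMIX_COLLECT_DATA
 * selects the allgather-like mode discussed above. */
#include <stdio.h>
#include <string.h>
#include <stdbool.h>
#include <pmix.h>

int main(void)
{
    pmix_proc_t myproc, peer;
    pmix_value_t val, *peerval = NULL;
    pmix_info_t info;
    bool collect = true;

    PMIx_Init(&myproc, NULL, 0);

    /* Post a key into the local cache and commit it to the local server. */
    PMIX_VALUE_LOAD(&val, "endpoint-info-for-this-rank", PMIX_STRING);
    PMIx_Put(PMIX_GLOBAL, "example.endpoint", &val);
    PMIx_Commit();

    /* A stricter or softer exchange is selected via attributes. */
    PMIX_INFO_LOAD(&info, PMIX_COLLECT_DATA, &collect, PMIX_BOOL);
    PMIx_Fence(NULL, 0, &info, 1);

    /* Retrieve the key posted by rank 0 of our namespace. */
    PMIX_PROC_CONSTRUCT(&peer);
    strncpy(peer.nspace, myproc.nspace, PMIX_MAX_NSLEN);
    peer.rank = 0;
    if (PMIX_SUCCESS == PMIx_Get(&peer, "example.endpoint", NULL, 0, &peerval)) {
        printf("rank %u got: %s\n", myproc.rank, peerval->data.string);
        PMIX_VALUE_RELEASE(peerval);
    }

    PMIx_Finalize(NULL, 0);
    return 0;
}
```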
- Accept/Connect Spawn
- Do they belong in MPI or outside of it?
- Connect/Accept/Join/Spawn: current MPI implementations are subpar
- Outside HPC the world is much more dynamic. They might use MPI if we fixed these problems
- Open MPI drops into PMI to implement these features, is that the right ordering?
- Yes, so applications can use the abstractions in MPI or whatever programming model
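For illustration only, a rough sketch of the PMIx layer an MPI_Comm_spawn implementation might drop into; the command, argument, and process count are made up:

```c
/* Sketch: ask the resource manager, via PMIx, to start additional
 * processes.  On success the new job's namespace is returned. */
#include <stdio.h>
#include <string.h>
#include <pmix.h>

void spawn_workers(void)
{
    pmix_status_t rc;
    pmix_app_t app;
    char child_nspace[PMIX_MAX_NSLEN + 1];

    PMIX_APP_CONSTRUCT(&app);
    app.cmd = strdup("./worker");
    PMIX_ARGV_APPEND(rc, app.argv, "./worker");
    app.maxprocs = 4;

    rc = PMIx_Spawn(NULL, 0, &app, 1, child_nspace);
    if (PMIX_SUCCESS == rc) {
        printf("spawned job %s\n", child_nspace);
    }
    PMIX_APP_DESTRUCT(&app);
}
```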
- Fault tolerance
- Need a model that doesn’t kill your job if something fails
- Need the reliability of TCP
- Questions:
- Where do we draw the line between PMI’s responsibility and something else’s responsibility?
- What is feature creep and what is needed?
- It should be really easy to incrementally implement piece by piece (currently PMIx is all or nothing with respect to ingesting the header and structs)
- Should be easy to say “Flux supports version X with this support” instead of function by function
- For procurements, need to be able to specify what functionality and version are required of the resource manager
- Required keys: what is the minimal set? What should be standardized? This needs to be clearly stated; we need to figure out exactly what is needed.
12:00 PM – 1:00 PM: Working lunch, continuing discussions from morning
1:00 PM – 2:00 PM: Discussion of how to move forward
Cite: XKCD https://xkcd.com/927/
- How to move forward so we don’t end up with 15 competing standards?
- Starting from scratch could set us back many years. How do we prevent that?
- How to make it a more authoritative standard?
- Do we start from PMI1, 2, X? Do we form working groups?
- PMIx process: what is it?
- Anybody can put forth proposals. No specific management of that
- No membership requirements, to evolve with speed
- API has changed over time but try not to break it, deprecation process
- Proposals discussed on weekly calls, after two (?) calls and discussion, people say yes/no, put out an errata
- Experimental work in the standard as well as the “regular” API
- If PMIx is a place to start, we would have to bring in more people and define what that means, would change the process and the form
- Would starting from PMIx alienate some people?
- Two sides of the spectrum: MPI and PMIx (slow vs fast)
- Would the PMIx community accept a slowdown from wider community participation?
- Possibly, might give PMIx implementors more focus
- Need to combine stability with forward progress, moving fast is not always a problem
- But a two-week process is probably too fast… Need time for reaching out to community and obtaining pushback. You could be on vacation!
- Need to balance trade off of prototyping implementation quickly and providing stability in the standard
- What is the core set of PMIx that is needed by Open MPI
- 1/6 is the guess, or maybe 1/3
- How do we articulate the scope of PMIx needed for particular functionality?
- Current rules are based on group involved in the process
- Believe they are open to changing the rules if more people join
- Need grouping of functionality and keys (red, yellow, green light)
- Could we identify the core set of functionality and then add in more instead of starting with the whole PMIx standard?
- E.g., standardizing best practices
- Versus starting with the whole shebang. But if we start with the whole thing
- Resource manager support (aside from the reference implementation) is challenging/broken
- Tried to participate in the past but it didn’t work
- How to make this work now?
- The community is probably more open now
- The challenge seems to be that the document is focused on the reference implementation?
- How to refine the discussion?
- Use the PMIx document as a base, chapter 1
- What subset do we start with?
- How many reference implementations do we need for something to be accepted?
- What should the name be?
- PMIx, PMI4?
- Need tight coordination with the reference implementations
- There needs to be heavy overlap between implementors and standards designers so they don’t get out of sync with each other
- What should the next set of meetings be? Telecons?
- The goal is to have a redacted PMIx document that we can start a discussion on?
- What is the core set of functionality we are interested in?
- Meeting once every couple of months is not enough, people don’t have travel budgets
- What about virtual meeting style?
- Face-to-face meetings are where most large, contentious changes happen
- Travel budgets are hard… maybe just a couple meetings per year
- Starting with PMIx in full may deter outside community (non HPC) from joining
2:00 PM: Adjourn, start of MPI Forum
Action items:
- Release a Doodle poll to setup a weekly PMIx standardization meeting
- Target 2 weeks from now for the first meeting
- PMIx Standardization Meeting
- Works in conjunction with PMIx implementation meeting (maybe the same meeting)
- Same community focusing on implementation/exploration with a dedicated companion effort for formalizing ‘solid’ components into a standard document and a pipeline to bring forward new interfaces from the implementation/experimentation side for adoption into the standard.
- The implementations can move fast and experiment
- The standard moves a bit slower behind the implementations to codify syntax and semantics into a document.
- Goals:
- Separate PMIx implementation details from the PMIx standard document
- Consider slicing the standard document into a Client API, Server API, Tool API, and wire protocol (or ABI?) sections.
- Allow for an implementation to only provide, for example, Client API
- Consider slicing the standard document into groupings of “core interfaces” and “interfaces focused on X” - which may overlap.
- Makes it easier to identify the subsets of the standard supported by the RM and/or required by the client.
- Consider a “required” vs “good to have” vs “optional” set of attributes
- Standardization process notes
- Longer process for feedback
- Require more active participation
- Maintain open participation and governance
- Face-to-face meeting once (maybe twice?) a year and weekly teleconf meetings
- Face-to-face to facilitate progress with people in the room focused on one thing
- Weekly teleconf to facilitate communication and continual development
- Standard progress should not be delayed until the face-to-face
- Implementation must be provided in at least one open reference implementation.
- Consider requiring at least sign off (not necessarily a full implementation) from another PMIx implementation before incorporation into the standard.
- Consider spinning off working groups to investigate PMIx extensions as needed
- Need to improve outreach to bring more people into the discussion
- Need to be careful of scaring away groups that might want to participate in such a standardization effort, but may have foundational questions about the PMIx model.
- Start from the PMIx document and refine from there instead of starting from scratch
- For now, stay independent of any ‘other’ standardization body (e.g., MPI Forum) and work with the current expanded community interested in PMI(x)