diff --git a/LICENSE b/LICENSE new file mode 100644 index 0000000..b8e2da0 --- /dev/null +++ b/LICENSE @@ -0,0 +1,11 @@ +Copyright 2024 (c) Edward Z. Yang + +Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: + +1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. + +2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. + +3. Neither the name of the copyright holder nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission. + +THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS “AS IS” AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. diff --git a/README.md b/README.md new file mode 100644 index 0000000..303267a --- /dev/null +++ b/README.md @@ -0,0 +1,13 @@ +# tlparse: Parse structured PT2 logs +`tlparse` parses structured torch trace logs and outputs HTML files analyzing data. 
+ +Quick start: +Run PT2 with the TORCH_TRACE environment variable set: +``` +TORCH_TRACE=/tmp/my_traced_log example.py +``` + +Feed input into tlparse: +``` +tlparse /tmp/my_traced_log -o tl_out/ +``` diff --git a/src/main.rs b/src/main.rs index 4d124b4..c000484 100644 --- a/src/main.rs +++ b/src/main.rs @@ -1,15 +1,11 @@ use anyhow::anyhow; use base16ct; use clap::Parser; -use core::hash::BuildHasherDefault; -use fxhash::{FxHashMap, FxHasher}; -use html_escape::encode_text; -use indexmap::IndexMap; +use fxhash::FxHashMap; use md5::{Digest, Md5}; use std::ffi::{OsStr, OsString}; use regex::Regex; -use std::fmt::{self, Display, Formatter}; use std::fs; use std::fs::File; use std::io::{self, BufRead}; @@ -18,112 +14,18 @@ use std::path::PathBuf; use tinytemplate::TinyTemplate; use indicatif::{MultiProgress, ProgressBar, ProgressStyle}; -use once_cell::sync::Lazy; -use serde::{Deserialize, Serialize}; -use std::sync::Mutex; use std::time::Instant; -pub type FxIndexMap = IndexMap>; - -static INTERN_TABLE: Lazy>> = - Lazy::new(|| Mutex::new(FxHashMap::default())); - -static CSS: &str = r#" -"#; - -static TEMPLATE_DYNAMO_GUARDS: &str = r#" - - -

Guards

-
    -{{ for guard in guards }} -
  • {guard.code}
  • -{{ endfor }} -
- - -"#; - -static TEMPLATE_INDEX: &str = r#" - - - -
-

Stack trie

-

-The stack trie is a way of getting a quick orientation on where all the -compilations in a model take place, esp., if you are compiling a codebase you are unfamiliar with. -It is a tree of stack frames, for all stacks that triggered PT2 compilation. If only a single -stack is in the tree, you will simply see a plain list of frames (most recent call last). With -multiple stacks, at every point where two stacks diverge from having a common prefix, we increase -the indentation of the list and have a separate sub-list per sub-tree. -

-{stack_trie_html | format_unescaped} -
-
-

IR dumps

-

-The IR dumps collected dumped intermediate products from various points of the PT2 -compilation process. The products are organized by compile id, and then sorted in chronological -order. -

-

-A compile id uniquely identifies are particular compilation inside a PT2 -program. It is traditionally written as [x/y], where the frame id x -identifies the particular Python frame which we are compiling, and frame compile -id y identifies how many times we've recompiled this same frame. For example, -[0/0] refers to the very first frame compiled by PT2; [0/1] refers to the -first recompilation of this frame, while [1/0] refers to a different frame, within -distinct code cache, which we are compiling next (perhaps because of a graph break). Although -Dynamo treats distinct frames as completely unrelated, a frame compilation could overlap with another -frame; for example, if you graph break in an inlined function, Dynamo will typically try to compile -the nested frame again on an inner frame. You can identify the hierarchical relationship between -frames by looking at the stack trie above. -

-

-In some situations, the compile id will have an extra signifier [x/y_z], where z is the -attempt for this particular (re)compilation. Certain conditions will cause Dynamo to -restart analysis, when Dynamo discovers that it needs to undo a decision it previously made. The most -common cause of recompilation is a graph break in an inlined function call, which forces to restart -and avoid inlining the function in the first place. -

-

-Here is a high level description of PT2's compilation phases, and the intermediate products each -phase generates: -

-
    -
  1. Optional: If compiled autograd is enabled, and we are processing a backward call, compiled autograd will trace the autograd graph from the autograd engine, and produce an FX graph compiled_autograd_graph that will be Dynamo traced. Otherwise, Dynamo will directly trace user's bytecode.
  2. -
  3. Dynamo symbolically evaluates the Python bytecode of a program, producing dynamo_output_graph
  4. -
  5. Optional: If optimize_ddp is enabled, the DDPOptimizer will split the Dynamo output graph to improve pipelining communications. Each split subgraph is optimize_ddp_split_child_submod, and the high level graph that plumbs the graphs together is optimize_ddp_split_graph. If there are multiple splits, each subsequent build product will be produced multiple times, one for each split.
  6. -
  7. AOTAutograd traces the (possibly split) Dynamo output graph, producing a aot_joint_graph if backwards is enabled. It then partitions the graph into aot_forward_graph and aot_backward_graph. If training is not needed, there may only be an aot_forward_graph.
  8. -
  9. Inductor will apply some post grad FX passes, producing inductor_post_grad_graph
  10. -
  11. Inductor will perform code generation, producing the final inductor_output_code which will be executed at runtime. This output is a valid Python program and can be directly run.
  12. -
-

-Build products below: -

-
    -{{ for compile_directory in directory }} -
  • {compile_directory.0} -
      - {{ for path in compile_directory.1 }} -
    • {path}
    • - {{ endfor }} -
    -
  • -{{ endfor }} -
-
- - -"#; + +use crate::types::*; +use crate::templates::*; +pub mod templates; +pub mod types; #[derive(Parser)] #[command(author, version, about, long_about = None)] #[command(propagate_version = true)] -struct Cli { +pub struct Cli { path: PathBuf, /// Output directory, defaults to `tl_out` #[arg(short, default_value = "tl_out")] @@ -140,199 +42,6 @@ struct Cli { no_browser: bool, } -#[derive(Default)] -struct StackTrieNode { - terminal: Vec, - // Ordered map so that when we print we roughly print in chronological order - children: FxIndexMap, -} - -impl StackTrieNode { - fn insert(&mut self, mut stack: StackSummary, compile_id: String) { - let mut cur = self; - for frame in stack.drain(..) { - cur = cur.children.entry(frame).or_default(); - } - cur.terminal.push(compile_id); - } - - fn fmt_inner(&self, f: &mut Formatter, indent: usize) -> fmt::Result { - for (frame, node) in self.children.iter() { - let star = node.terminal.join(""); - if self.children.len() > 1 { - // If the node has multiple children, increase the indent and print a hyphen - writeln!( - f, - "{:indent$}- {star}{}", - "", - frame, - indent = indent, - star = star - )?; - node.fmt_inner(f, indent + 2)?; - } else { - // If the node has only one child, don't increase the indent and don't print a hyphen - writeln!( - f, - "{:indent$} {star}{}", - "", - frame, - indent = indent, - star = star - )?; - node.fmt_inner(f, indent)?; - } - } - Ok(()) - } -} - -impl Display for StackTrieNode { - fn fmt(&self, f: &mut Formatter) -> fmt::Result { - write!(f, "
")?;
-        self.fmt_inner(f, 0)?;
-        write!(f, "
")?; - Ok(()) - } -} - -#[derive(Eq, PartialEq, Hash, Deserialize, Serialize, Debug, Clone)] -struct CompileId { - frame_id: u32, - frame_compile_id: u32, - attempt: u32, -} - -impl fmt::Display for CompileId { - fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result { - write!(f, "[{}/{}", self.frame_id, self.frame_compile_id)?; - if self.attempt != 0 { - write!(f, "_{}", self.attempt)?; - } - write!(f, "]") - } -} - -#[derive(Default, Debug)] -struct Stats { - ok: u64, - other_rank: u64, - fail_glog: u64, - fail_json: u64, - fail_payload_md5: u64, - fail_dynamo_guards_json: u64, -} - -#[derive(Debug, Hash, Eq, PartialEq, Deserialize, Serialize)] -struct FrameSummary { - filename: u32, - line: i32, - name: String, -} - -fn simplify_filename<'a>(filename: &'a str) -> &'a str { - let parts: Vec<&'a str> = filename.split("#link-tree/").collect(); - if parts.len() > 1 { - return parts[1]; - } - // TODO: generalize this - let parts: Vec<&'a str> = filename - .split("1e322330-seed-nspid4026531836_cgpid26364902-ns-4026531840/") - .collect(); - if parts.len() > 1 { - return parts[1]; - } - filename -} - -impl fmt::Display for FrameSummary { - fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result { - let intern_table = INTERN_TABLE.lock().unwrap(); - let filename = intern_table - .get(&self.filename) - .map_or("(unknown)", |s| s.as_str()); - write!( - f, - "{}:{} in {}", - encode_text(simplify_filename(filename)), - self.line, - encode_text(&self.name) - ) - } -} - -type StackSummary = Vec; - -#[derive(Debug, Deserialize)] -struct OptimizeDdpSplitChildMetadata { - name: String, -} - -#[derive(Debug, Deserialize)] -#[serde(untagged)] -enum SymInt { - Int(i64), - Symbol(String), -} - -#[derive(Debug, Deserialize)] -struct EmptyMetadata {} - -#[derive(Debug, Deserialize)] -struct DynamoOutputGraphMetadata { - _sizes: Option>>, -} - -#[derive(Debug, Deserialize)] -struct DynamoStartMetadata { - stack: Option, -} - -#[derive(Debug, Deserialize)] -struct 
InductorOutputCodeMetadata { - filename: Option, -} - -#[derive(Debug, Deserialize)] -struct Envelope { - rank: Option, - #[serde(flatten)] - compile_id: Option, - #[serde(default)] - has_payload: Option, - // externally tagged union, one field per log type we recognize - dynamo_start: Option, - str: Option<(String, u32)>, - dynamo_output_graph: Option, - optimize_ddp_split_graph: Option, - optimize_ddp_split_child: Option, - compiled_autograd_graph: Option, - dynamo_guards: Option, - aot_forward_graph: Option, - aot_backward_graph: Option, - aot_joint_graph: Option, - inductor_post_grad_graph: Option, - inductor_output_code: Option, -} - -#[derive(Debug, Deserialize, Serialize)] -struct DynamoGuard { - code: String, - stack: Option, - user_stack: Option, -} - -#[derive(Debug, Serialize)] -struct DynamoGuardsContext { - guards: Vec, -} - -#[derive(Debug, Serialize)] -struct IndexContext { - css: &'static str, - directory: Vec<(String, Vec)>, - stack_trie_html: String, -} fn main() -> anyhow::Result<()> { let cli = Cli::parse(); diff --git a/src/templates.rs b/src/templates.rs new file mode 100644 index 0000000..6296ded --- /dev/null +++ b/src/templates.rs @@ -0,0 +1,91 @@ +pub static CSS: &str = r#" +"#; + +pub static TEMPLATE_DYNAMO_GUARDS: &str = r#" + + +

Guards

+
    +{{ for guard in guards }} +
  • {guard.code}
  • +{{ endfor }} +
+ + +"#; + +pub static TEMPLATE_INDEX: &str = r#" + + + +
+

Stack trie

+

+The stack trie is a way of getting a quick orientation on where all the +compilations in a model take place, esp., if you are compiling a codebase you are unfamiliar with. +It is a tree of stack frames, for all stacks that triggered PT2 compilation. If only a single +stack is in the tree, you will simply see a plain list of frames (most recent call last). With +multiple stacks, at every point where two stacks diverge from having a common prefix, we increase +the indentation of the list and have a separate sub-list per sub-tree. +

+{stack_trie_html | format_unescaped} +
+
+

IR dumps

+

+The IR dumps collect intermediate products dumped at various points of the PT2 +compilation process. The products are organized by compile id, and then sorted in chronological +order. +

+

+A compile id uniquely identifies a particular compilation inside a PT2 +program. It is traditionally written as [x/y], where the frame id x +identifies the particular Python frame which we are compiling, and frame compile +id y identifies how many times we've recompiled this same frame. For example, +[0/0] refers to the very first frame compiled by PT2; [0/1] refers to the +first recompilation of this frame, while [1/0] refers to a different frame, within +a distinct code cache, which we are compiling next (perhaps because of a graph break). Although +Dynamo treats distinct frames as completely unrelated, a frame compilation could overlap with another +frame; for example, if you graph break in an inlined function, Dynamo will typically try to compile +the nested frame again on an inner frame. You can identify the hierarchical relationship between +frames by looking at the stack trie above. +

+

+In some situations, the compile id will have an extra signifier [x/y_z], where z is the +attempt for this particular (re)compilation. Certain conditions will cause Dynamo to +restart analysis, when Dynamo discovers that it needs to undo a decision it previously made. The most +common cause of recompilation is a graph break in an inlined function call, which forces Dynamo to restart +and avoid inlining the function in the first place. +

+

+Here is a high level description of PT2's compilation phases, and the intermediate products each +phase generates: +

+
    +
  1. Optional: If compiled autograd is enabled, and we are processing a backward call, compiled autograd will trace the autograd graph from the autograd engine, and produce an FX graph compiled_autograd_graph that will be Dynamo traced. Otherwise, Dynamo will directly trace the user's bytecode.
  2. +
  3. Dynamo symbolically evaluates the Python bytecode of a program, producing dynamo_output_graph
  4. +
  5. Optional: If optimize_ddp is enabled, the DDPOptimizer will split the Dynamo output graph to improve pipelining communications. Each split subgraph is optimize_ddp_split_child_submod, and the high level graph that plumbs the graphs together is optimize_ddp_split_graph. If there are multiple splits, each subsequent build product will be produced multiple times, one for each split.
  6. +
  7. AOTAutograd traces the (possibly split) Dynamo output graph, producing an aot_joint_graph if backwards is enabled. It then partitions the graph into aot_forward_graph and aot_backward_graph. If training is not needed, there may only be an aot_forward_graph.
  8. +
  9. Inductor will apply some post grad FX passes, producing inductor_post_grad_graph
  10. +
  11. Inductor will perform code generation, producing the final inductor_output_code which will be executed at runtime. This output is a valid Python program and can be directly run.
  12. +
+

+Build products below: +

+
    +{{ for compile_directory in directory }} +
  • {compile_directory.0} +
      + {{ for path in compile_directory.1 }} +
    • {path}
    • + {{ endfor }} +
    +
  • +{{ endfor }} +
+
+ + +"#; diff --git a/src/types.rs b/src/types.rs new file mode 100644 index 0000000..eedd218 --- /dev/null +++ b/src/types.rs @@ -0,0 +1,211 @@ +use core::hash::BuildHasherDefault; +use fxhash::{FxHashMap, FxHasher}; +use html_escape::encode_text; +use indexmap::IndexMap; + +use std::fmt::{self, Display,Formatter}; +use std::path::PathBuf; + +use once_cell::sync::Lazy; +use serde::{Deserialize, Serialize}; +use std::sync::Mutex; + +pub type FxIndexMap = IndexMap>; + +pub static INTERN_TABLE: Lazy>> = + Lazy::new(|| Mutex::new(FxHashMap::default())); + + +#[derive(Default)] +pub struct StackTrieNode { + terminal: Vec, + // Ordered map so that when we print we roughly print in chronological order + children: FxIndexMap, +} + +impl StackTrieNode { + pub fn insert(&mut self, mut stack: StackSummary, compile_id: String) { + let mut cur = self; + for frame in stack.drain(..) { + cur = cur.children.entry(frame).or_default(); + } + cur.terminal.push(compile_id); + } + + pub fn fmt_inner(&self, f: &mut Formatter, indent: usize) -> fmt::Result { + for (frame, node) in self.children.iter() { + let star = node.terminal.join(""); + if self.children.len() > 1 { + // If the node has multiple children, increase the indent and print a hyphen + writeln!( + f, + "{:indent$}- {star}{}", + "", + frame, + indent = indent, + star = star + )?; + node.fmt_inner(f, indent + 2)?; + } else { + // If the node has only one child, don't increase the indent and don't print a hyphen + writeln!( + f, + "{:indent$} {star}{}", + "", + frame, + indent = indent, + star = star + )?; + node.fmt_inner(f, indent)?; + } + } + Ok(()) + } +} + +impl Display for StackTrieNode { + fn fmt(&self, f: &mut Formatter) -> fmt::Result { + write!(f, "
")?;
+        self.fmt_inner(f, 0)?;
+        write!(f, "
")?; + Ok(()) + } +} + +#[derive(Eq, PartialEq, Hash, Deserialize, Serialize, Debug, Clone)] +pub struct CompileId { + pub frame_id: u32, + pub frame_compile_id: u32, + pub attempt: u32, +} + +impl fmt::Display for CompileId { + fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result { + write!(f, "[{}/{}", self.frame_id, self.frame_compile_id)?; + if self.attempt != 0 { + write!(f, "_{}", self.attempt)?; + } + write!(f, "]") + } +} + +#[derive(Default, Debug)] +pub struct Stats { + pub ok: u64, + pub other_rank: u64, + pub fail_glog: u64, + pub fail_json: u64, + pub fail_payload_md5: u64, + pub fail_dynamo_guards_json: u64, +} + +#[derive(Debug, Hash, Eq, PartialEq, Deserialize, Serialize)] +pub struct FrameSummary { + filename: u32, + line: i32, + name: String, +} + +fn simplify_filename<'a>(filename: &'a str) -> &'a str { + let parts: Vec<&'a str> = filename.split("#link-tree/").collect(); + if parts.len() > 1 { + return parts[1]; + } + // TODO: generalize this + let parts: Vec<&'a str> = filename + .split("1e322330-seed-nspid4026531836_cgpid26364902-ns-4026531840/") + .collect(); + if parts.len() > 1 { + return parts[1]; + } + filename +} + +impl fmt::Display for FrameSummary { + fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result { + let intern_table = INTERN_TABLE.lock().unwrap(); + let filename = intern_table + .get(&self.filename) + .map_or("(unknown)", |s| s.as_str()); + write!( + f, + "{}:{} in {}", + encode_text(simplify_filename(filename)), + self.line, + encode_text(&self.name) + ) + } +} + +pub type StackSummary = Vec; + +#[derive(Debug, Deserialize)] +pub struct OptimizeDdpSplitChildMetadata { + pub name: String, +} + +#[derive(Debug, Deserialize)] +#[serde(untagged)] +pub enum SymInt { + Int(i64), + Symbol(String), +} + +#[derive(Debug, Deserialize)] +pub struct EmptyMetadata {} + +#[derive(Debug, Deserialize)] +pub struct DynamoOutputGraphMetadata { + _sizes: Option>>, +} + +#[derive(Debug, Deserialize)] +pub struct DynamoStartMetadata { + 
pub stack: Option, +} + +#[derive(Debug, Deserialize)] +pub struct InductorOutputCodeMetadata { + pub filename: Option, +} + +#[derive(Debug, Deserialize)] +pub struct Envelope { + pub rank: Option, + #[serde(flatten)] + pub compile_id: Option, + #[serde(default)] + pub has_payload: Option, + // externally tagged union, one field per log type we recognize + pub dynamo_start: Option, + pub str: Option<(String, u32)>, + pub dynamo_output_graph: Option, + pub optimize_ddp_split_graph: Option, + pub optimize_ddp_split_child: Option, + pub compiled_autograd_graph: Option, + pub dynamo_guards: Option, + pub aot_forward_graph: Option, + pub aot_backward_graph: Option, + pub aot_joint_graph: Option, + pub inductor_post_grad_graph: Option, + pub inductor_output_code: Option, +} + +#[derive(Debug, Deserialize, Serialize)] +pub struct DynamoGuard { + pub code: String, + pub stack: Option, + pub user_stack: Option, +} + +#[derive(Debug, Serialize)] +pub struct DynamoGuardsContext { + pub guards: Vec, +} + +#[derive(Debug, Serialize)] +pub struct IndexContext { + pub css: &'static str, + pub directory: Vec<(String, Vec)>, + pub stack_trie_html: String, +}
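
Note on the compile id format: the index template describes compile ids as [x/y], with an optional [x/y_z] suffix for restarts, and the `Display` impl for `CompileId` moved into `src/types.rs` is what produces that string. The following is a minimal standalone sketch of that formatting, with the serde derives left out so it builds with the standard library alone. The `render` helper is hypothetical and is not part of this diff; it exists only to make the behaviour easy to check.

```rust
use std::fmt;

/// Mirrors the fields of `CompileId` in `src/types.rs`
/// (the `Deserialize`/`Serialize` derives are omitted in this sketch).
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub struct CompileId {
    pub frame_id: u32,
    pub frame_compile_id: u32,
    pub attempt: u32,
}

impl fmt::Display for CompileId {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        // Always print the frame id and the frame compile id.
        write!(f, "[{}/{}", self.frame_id, self.frame_compile_id)?;
        // The attempt suffix is only printed when it is non-zero.
        if self.attempt != 0 {
            write!(f, "_{}", self.attempt)?;
        }
        write!(f, "]")
    }
}

/// Hypothetical helper (not in the diff): build a compile id and format it.
pub fn render(frame_id: u32, frame_compile_id: u32, attempt: u32) -> String {
    CompileId {
        frame_id,
        frame_compile_id,
        attempt,
    }
    .to_string()
}
```

Under these assumptions, `render(0, 1, 0)` yields `[0/1]` (the first recompilation of frame 0), and `render(1, 0, 2)` yields `[1/0_2]` (attempt 2 for the first compilation of frame 1).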