
Refactor bookkeeping #105

Merged: 9 commits into main on Jul 11, 2023

Conversation

@loreloc (Member) commented Jul 9, 2023

  • Refactor bookkeeping to work with folded layers having an arbitrary number of inputs (which I call arity). In this way, every layer is treated uniformly, i.e., we get a nice layer-agnostic loop:
        for layer, (should_pad, in_layer_ids, fold_idx) in zip(self.inner_layers, self.bookkeeping):
            # (fold_1 + ... + fold_n, units, batch_size)
            if len(in_layer_ids) == 1:
                (in_layer_id,) = in_layer_ids
                inputs = layer_outputs[in_layer_id]
            else:
                inputs = torch.cat([layer_outputs[i] for i in in_layer_ids], dim=0)
            if should_pad:
                if isinstance(layer, SumProductLayer):  # type: ignore[misc]
                    pad_value = 0.0
                else:
                    pad_value = -np.inf
                inputs = F.pad(inputs, [0, 0, 0, 0, 1, 0], value=pad_value)
            # inputs: (fold, arity, units, batch_size)
            inputs = inputs[fold_idx]
            # outputs: (fold, units, batch_size)
            outputs = layer(inputs)

Highlighted features:

  • Unified input and output tensor layout: inputs: (fold, arity, units, batch_size), and outputs: (fold, units, batch_size).
  • No padding is applied if the layers to be folded have the same number of inputs.
  • No concatenation is performed when it is not needed, i.e., if all the inputs of a layer can be found in the folds of another layer, a view is sufficient (see the sketch after this list).
  • Removed the legacy mixing layer's parameter mask. Eventually, padding will take care of masking parameters. However, the pad values should be chosen based on (i) the layer and (ii) the computation domain, but for now they are hard-coded for computations in log-space.
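
A minimal sketch of the indexing mechanics described above (the shapes, values, and fold_idx entries below are made up for illustration and are not taken from the code base):

    import torch

    num_units, batch_size = 4, 2
    # Outputs of two previously computed layers, each with 3 folds:
    layer_outputs = [torch.randn(3, num_units, batch_size) for _ in range(2)]

    # Case 1: all inputs come from a single layer's output, so indexing it with
    # a (fold, arity) index tensor is enough and no torch.cat is needed.
    fold_idx = torch.tensor([[0, 1], [1, 2]])   # 2 folds, arity 2
    inputs = layer_outputs[0][fold_idx]         # (fold=2, arity=2, units, batch_size)

    # Case 2: inputs come from several layers, so they are first concatenated
    # along the fold axis and then indexed with positions in the concatenation.
    cat = torch.cat(layer_outputs, dim=0)       # (3 + 3, units, batch_size)
    fold_idx = torch.tensor([[0, 3], [2, 5]])
    inputs = cat[fold_idx]                      # (fold=2, arity=2, units, batch_size)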

Note: I had to permute the einsum indices in the mixing layer (see #103) to make it compliant with the layouts above.

Performance is similar to what we had after merging #103 into main, and it is probably the best we have achieved so far.
Time/Memory: ~0.104-0.110s / 5GiB

...
t/m: 106.83161926269531 5020.56884765625
t/m: 105.59600067138672 5020.56884765625
t/m: 107.38585662841797 5020.56884765625
t/m: 107.3458251953125 5020.56884765625
t/m: 110.02127838134766 5020.56884765625
t/m: 104.13385772705078 5020.56884765625
t/m: 107.49929809570312 5020.56884765625
t/m: 104.31513977050781 5020.56884765625
t/m: 104.40985870361328 5020.56884765625
train LL: 3833.79560546875

Fixes (but does not close) #101.
Closes #96.

@loreloc added the "enhancement" label Jul 9, 2023
@loreloc requested review from lkct and arranger1044 and removed the request for lkct and arranger1044 on July 9, 2023 09:47
@lkct (Member) commented Jul 9, 2023

  1. Please use rebase for a simpler git history: first commit the revert of the revert on top of main, and then rebase what's left onto that revert commit.
  2. What is the motivation for the FCKB layout? It leads to non-contiguous tensors and is not good for the mixing layer, whose performance is not yet benchmarked.

PS: I would split the change of the layer interface to a single input and the refactoring of the bookkeeping into two PRs, because the former is self-contained and should be benchmarked separately.

PPS: If you are to switch to FC..., there should be one standalone commit for that, to enable further benchmarking. (I mean, first implement CF... to be consistent with what exists, and then use a single commit to switch to FC..., so we can benchmark and compare.)


> Time/Memory: ~0.104-0.110s / 5GiB

Better take an average, and there is no need to paste all those lines.
And where is the gain? (Or is there a gain?)

@loreloc (Member, Author) commented Jul 9, 2023

I added an assert outputs.is_contiguous() to the snippet above and the tests pass. Which tensor might not be contiguous exactly? I used the FC... layout instead of the CF... one in order to unify the input shape format across layers.

For the benchmark of the mixing layer, can we upload a non-structured-decomposable quad-tree region file and use that instead?

The performance gain is implicit from my point of view, in the sense that the code is more extensible without impacting performance. Moreover, the two einsum expressions in the CP layer can now be merged into one (see the sketch below). I have not tried to implement it yet, but it can only improve performance.
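
A hypothetical illustration of what merging two einsum calls into one could look like (the names, shapes, and subscripts below are made up and do not correspond to the actual CP-layer code, which also works in log-space):

    import torch

    # Made-up sizes: folds, units, batch.
    f, k, b = 2, 4, 3
    w_left, w_right = torch.randn(f, k, k), torch.randn(f, k, k)
    left, right = torch.rand(f, k, b), torch.rand(f, k, b)

    # Two separate einsums followed by an element-wise product of their results.
    out_two = torch.einsum("fik,fib->fkb", w_left, left) \
            * torch.einsum("fjk,fjb->fkb", w_right, right)

    # A single fused einsum computing the same quantity, since the sums over the
    # i and j indices factorize.
    out_one = torch.einsum("fik,fib,fjk,fjb->fkb", w_left, left, w_right, right)

    assert torch.allclose(out_two, out_one, atol=1e-5)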

@lkct (Member) commented Jul 9, 2023

> Which tensor might not be contiguous exactly?

See below.

> I used the FC... layout instead of the CF... one in order to unify the input shape format across layers.

Yes, we should unify, but I was asking: why not unify to CF...?

> For the benchmark of the mixing layer, can we upload a non-structured-decomposable quad-tree region file and use that instead?

We should. And we should add a benchmark runner to get all results at once (similar to the pyjuice benchmark). We should also go back and see what the baseline is for the non-structured-decomposable case.
For choices, we can have 28x28 PD or QG. Is 28x28 PD too large?

> The performance gain is implicit from my point of view,

I just wonder if 0.104~0.110 and 0.106 are the same (just random fluctuations).

@lkct (Member) left a review comment: non-contiguous tensors

  def _forward_linear(self, x: Tensor) -> Tensor:
-     return torch.einsum("cfk,cfkb->fkb", self.params, x)
+     return torch.einsum("fck,fckb->fkb", self.params, x)

@lkct (Member) commented Jul 9, 2023 on the hunk above: I was wrong about this.

        :return: result of the left operations, in log-space.
        """
        log_left, log_right = inputs[:, 0], inputs[:, 1]

@lkct (Member) commented on the hunk above: These two are non-contiguous inputs for the following operations. However, there seems to be no major performance impact for CP? Yet we do not know whether this will harm other layers, so if there is no performance difference, we should use contiguous tensors.
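
For reference, a minimal sketch (with made-up shapes, not taken from the code base) of the contiguity point being made here:

    import torch

    # FC-style layout: (fold, arity, units, batch). Slicing the second (arity)
    # axis yields a non-contiguous view, because the elements of a fold are no
    # longer adjacent in memory.
    x_fc = torch.randn(8, 2, 16, 32)
    print(x_fc[:, 0].is_contiguous())   # False

    # CF-style layout: (arity, fold, units, batch). Slicing the leading axis
    # keeps the result contiguous.
    x_cf = torch.randn(2, 8, 16, 32)
    print(x_cf[0].is_contiguous())      # True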

@arranger1044 (Member) commented:

> For choices, we can have 28x28 PD or QG. Is 28x28 PD too large?

We can use PD with delta [7, 28], which first splits the image into regions of $7\times 7$ pixels, and then splits these into $1\times 1$ pixel regions. This is a good compromise.

@arranger1044 (Member) commented:

> Please use rebase for a simpler git history: first commit the revert of the revert on top of main, and then rebase what's left onto that revert commit.

> There should be one standalone commit for that, to enable further benchmarking.

At this stage, let's be flexible with commits in PRs, otherwise we might add too much friction to development. When we have many more commits (and developers), we can come up with some good guidelines.

@lkct (Member) commented Jul 9, 2023

> At this stage, let's be flexible with commits in PRs, otherwise we might add too much friction to development.

That's why I'm just using "comment" instead of "request changes". However, I cannot be convinced that we should use FC instead of CF based on what's given.
Also, if I were to ask for a perfect PR, there could be more to change, e.g. I cannot see why the numbers in the tests should be changed, and the way they were changed can make the tests invalid.

What I really want to say is this: the price will only become explicit when we need to debug and benchmark, and we may suffer a lot at that time. Yes, we can say that rigor should give way to dev speed in this PR, and we can even say it in every PR: there is no clear boundary for when we should start to prefer rigor, and we can always put it off.
On the other hand, it's important to form better coding habits. It can be hard to rework an existing PR into better shape, but it takes little effort to keep a PR good from the first commit. I can only add review comments once a PR is opened, which means it can be too late to rework what's already there, and that will lead to another "let's be flexible" decision.
Please DO try to start from better practices instead of being "flexible" when we want to merge.

@arranger1044 (Member) commented:

Let's discuss in the next meeting what a good balance between rigor and dev speed could be.

@loreloc (Member, Author) commented Jul 11, 2023

I have run some benchmarks on this branch.
See #110 for the results on the current main branch.

For PCs without mixing layers it is basically the same.
It seems, though, that it is slightly slower when using mixing layers (~7% for PD and ~3% for QT).
Probably this is the price to pay to have "non-specialized" tensor layouts. As we discussed, in the future it might be worth exploring how to compile layers with the "best" layout.

I believe it is OK performance-wise overall, considering we already gained ~2x recently thanks to #103. Let me know what you think.

PD 28x28 delta=7 and delta=5

Number of parameters: 103151628
Time (ms): 74.650+-1.225
Memory (MiB): 5005.899+-0.000

Number of parameters: 105730068
Time (ms): 132.045+-2.598
Memory (MiB): 5248.470+-0.733

QT 28x28 structured-decomposable and non-structured-decomposable

Number of parameters: 103962625
Time (ms): 108.840+-3.757
Memory (MiB): 5020.569+-0.000

Number of parameters: 105419780
Time (ms): 146.681+-1.638
Memory (MiB): 5043.930+-0.000

RBT 784 depth=4 repetitions=16

Number of parameters: 103120928
Time (ms): 67.594+-0.756
Memory (MiB): 5006.148+-0.000

@lkct (Member) commented Jul 11, 2023

> (~7% for PD and ~3% for QT)

7% is already a lot in my opinion. In #103, ...BK and ...KB only had a 10% difference.
I would say FC... and CF... can still make a difference.

However, the performance requires deeper investigation. We can merge now with the current benchmark results and investigate further later, as communicated in the Zulip DM.

@loreloc (Member, Author) commented Jul 11, 2023

OK then, I think we can merge this. We should investigate the performance of using CF... for both the CP layer and the mixing layer.

Before merging, I will clean up unused attributes.

@loreloc merged commit 16ace43 into main on Jul 11, 2023
@loreloc deleted the refactor-bookkeeping branch on July 11, 2023 11:46