#14826: reimplement l1 data copy #15226

Merged
merged 1 commit into from
Nov 21, 2024

nathan-TT
Contributor

Ticket

#14826

Problem description

Now that the crt reorg has landed (#15094), this PR reimplements the bespoke memcpy we use to copy data from L1 to local memory.

What's changed

  1. Reduce instructions in the loop. The original loop was 21 insns (3.5 per word); the new loop is 10 insns (about 3.3 per word).

  2. Do not use a loop for the residue. Only residues of 0, 1, or 2 words need handling, and a loop would add more overhead than handling them directly.
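
For illustration only, the shape of the two changes above can be sketched in C++. This is not the code from this PR: the function name `copy_words_sketch` is hypothetical, 32-bit words are assumed, and the three-words-per-iteration unroll is inferred from the figures above (10 insns at about 3.3 per word, residue of 0 to 2). The compiler's actual instruction count will differ by target and flags.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical sketch, NOT the implementation from this PR.
// Copies `words` 32-bit words from `src` to `dst`.
inline void copy_words_sketch(std::uint32_t* dst, const std::uint32_t* src, std::size_t words) {
    const std::uint32_t* end = src + words;

    // Main loop: three words per iteration (assumed unroll factor).
    while (end - src >= 3) {
        // Issue all loads before the stores so their latencies can overlap.
        std::uint32_t a = src[0];
        std::uint32_t b = src[1];
        std::uint32_t c = src[2];
        dst[0] = a;
        dst[1] = b;
        dst[2] = c;
        src += 3;
        dst += 3;
    }

    // Residue: at most two words remain, so use straight-line code, not a loop.
    std::size_t residue = static_cast<std::size_t>(end - src);
    if (residue != 0) {
        std::uint32_t a = src[0];
        if (residue == 2) {
            std::uint32_t b = src[1];
            dst[1] = b;
        }
        dst[0] = a;
    }
}
```

Calling `copy_words_sketch(dst, src, 7)` would run the main loop twice (six words) and then copy the one remaining word through the residue branch.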

Checklist

  • [YES] Post commit CI passes
  • [YES] Blackhole Post commit (if applicable)
  • Model regression CI testing passes (if applicable)
  • Device performance regression CI testing passes (if applicable)
  • New/Existing tests provide coverage for changes

@nathan-TT nathan-TT merged commit 185ade6 into main Nov 21, 2024
126 of 128 checks passed
@nathan-TT nathan-TT deleted the nsidwell/perf3 branch November 21, 2024 15:46

2 participants