Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Multiple Perf Improvements #80

Open
wants to merge 5 commits into
base: master
Choose a base branch
from

Conversation

tgrigsby-sc
Copy link
Contributor

@tgrigsby-sc tgrigsby-sc commented May 21, 2021

-- these changes are reasonably well grouped by commit --

  • Use rawr tile size to determine metatile build zoom levels
  • Change instance type to M5 - reduce mem/cpu by 4GB for cheaper instances with more memory than we need
  • Switch to on-demand instance type
  • Big jobs code to spread jobs across (zoom_max, split_zoom) range instead of just one or the other
  • Add TileSpecifier. It allows more complex behavior around specifying mem reqs in a file, but for now it just uses the output of the big jobs code to dictate the order we enqueue high zoom metatile jobs.
  • Change zoom_max to 6 to allow grouping of the less intense high zoom jobs. Reduces variance of job runtimes which allows better utilization of EC2 resources
  • Bug fixes around usage of /usr/bin/time

@tgrigsby-sc tgrigsby-sc requested review from iandees and peitili May 21, 2021 23:01
@tgrigsby-sc tgrigsby-sc force-pushed the travisg/20210520-optimizations-from-rawr-tile-size branch from 613942a to 778c09a Compare May 21, 2021 23:04
@@ -148,34 +148,6 @@ def batch_setup(region_name, run_id, vpc_id, securityGroupIds, computeEnvironmen
boto_iam, boto_ec2, instanceRoleName, instanceRoleName)
print("Using ECS instance profile %s" % instanceProfileArn)

# Create the spot fleet role
# https://docs.aws.amazon.com/batch/latest/userguide/spot_fleet_IAM_role.html
spotIamFleetRoleName = "AmazonEC2SpotFleetRole"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be neat if we could think of a way to keep support for spot instance types for folks where Spot is cheaper than on-demand. On the other hand, it might be enough to point to this commit and say "undo this one".

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we can certainly do it, it's just some plumbing

@@ -205,7 +177,6 @@ def batch_setup(region_name, run_id, vpc_id, securityGroupIds, computeEnvironmen
'cost_resource_group': run_id,
},
bidPercentage=60,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can probably be removed, too.

@@ -235,7 +242,7 @@ def __call__(self, parent):
assert dz >= 0
width = 1 << dz

size = 0
sizes = {}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This might be easier to read. {} always looks like an empty dict to me, not a set.

Suggested change
sizes = {}
sizes = set()

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is actually a dict! I need to track which tile has that size so I can combine them correctly at lower zooms

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, well nevermind then. Maybe dict() then? 😄

count_at_this_zoom = counts_at_zoom[z]
zoom_10_equiv_count = count_at_this_zoom * (4 ** (10 - z))
counts_at_zoom_sum += zoom_10_equiv_count
if counts_at_zoom_sum == 4**10:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this 10 be one of the zoom values that gets passed in? What happens when we want to switch away from zoom 10?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

doh. Yes definitely

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess it really just needs to be a number we're comfortable we're not going to go over. It could be zoom 20 and everything would be fine (except overflow?). At hardcoded zoom 10, if we ended up grouping into zoom 11 jobs then we'd be doing a 4^-1 calc which would then compare an int to a float, and things might get bad

@@ -290,24 +472,65 @@ def _big_jobs(rawr_bucket, prefix, key_format_type, rawr_zoom, group_zoom,

return big_jobs

def viable_container_overrides(mem_mb):
"""
Turns a number into the next highest even multiple that AWS will accept, and the min number of CPUs you need for that amount
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might be worth mentioning that this is a workaround for what seems to be a bug in Batch that prevents arbitrary memory/CPU requests.


# now that we know what we want, pick something AWS actually supports
viable_mem_request, required_min_cpus = viable_container_overrides(adjusted_mem)
print("REMOVEME: [%s] enqueueing %s at %s mem mb and %s cpus" % (time.ctime(), coord_line, viable_mem_request, required_min_cpus))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove REMOVEME? This looks like a useful thing to print maybe.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are A LOT of these. 25k with this configuration

@@ -149,7 +151,7 @@ def missing_tiles_split(self, split_zoom, zoom_max, big_jobs):

self.read_metas_to_file(missing_meta_file, compress=True)

print("Splitting into high and low zoom lists")
print("[%s] Splitting into high and low zoom lists" % (time.ctime()))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: prefix the log with [make_meta_tiles] too

@@ -162,25 +164,29 @@ def missing_tiles_split(self, split_zoom, zoom_max, big_jobs):

with gzip.open(missing_meta_file, 'r') as fh:
for line in fh:
c = deserialize_coord(line)
if c.zoom < split_zoom:
this_coord = deserialize_coord(line)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

might need to rebase master

@@ -190,6 +196,7 @@ def missing_tiles_split(self, split_zoom, zoom_max, big_jobs):
for coord in missing_high:
fh.write(serialize_coord(coord) + "\n")

print("[%s] Done splitting into high and low zoom lists" % (time.ctime()))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: prefix log with [make_meta_tiles]

help='If all the RAWR tiles grouped together are '
'bigger than this, split the job up into individual '
'RAWR tiles.')
parser.add_argument('--allowed-missing-tiles', default=2, type=int,
help='The maximum number of missing metatiles allowed '
'to continue the build process.')
parser.add_argument('--tile-specifier-file',
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

where is this arg used?


return 0

def get_mem_reqs_mb(self, coord_str):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: the name mib is probably more precise to mean 1024

or zoom_max (usually lower, e.g: 7) depending on whether big_jobs
contains a truthy value for the RAWR tile. The big_jobs are looked up
at zoom_max.
High zoom jobs are output between split_zoom (RAWR tile granularity) and zoom_max
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion, give a concrete example. As a new hire in the team, this doc is probably still not easy to follow


reordered_lines = tile_specifier.reorder(coord_lines)

print("[%s] Starting to enqueue %d tile batches" % (time.ctime(), len(reordered_lines)))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: log prefix of the module name

Look up the RAWR tiles in the rawr_bucket under the prefix and with the
given key format, group the RAWR tiles (usually at zoom 10) by the job
group zoom (usually 7) and sum their sizes. Return an ordered list of job coordinates
by descending raw size sum.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: rawr size sum

Provides the ability to sort tiles based on an ordering and specify memory reqs
"""

def __init__(self, default_mem_gb=8, spec_dict={}):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel we probably should have a central place for all the default values such as 8 here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants