
Ticket to track Migration of a Root CA #813

Open
sraustein opened this issue Apr 25, 2016 · 19 comments

@sraustein
Contributor

is it safe to go through the root creation, pubd credentials, ... and patch in the migrated root key and resultant TAL later and then migrate the resources and users?

Trac ticket #807: component rpkid, priority major, owner None; created by randy on 2016-04-25T01:02:57Z, last modified 2016-05-10T18:06:40Z.

@sraustein
Contributor Author

> is it safe to go through the root creation, pubd credentials, ... and patch in the migrated root key and resultant TAL later and then migrate the resources and users?

Define "safe".

So long as you don't publish the TAL you use during testing, nobody but us will ever know about it, so at worst you might have to apt-get purge then reinstall.

Trac comment by sra on 2016-04-25T01:25:31Z

@sraustein
Contributor Author

While I had originally expected to do this sort of transition using Django migrations, this is a big enough jump and likely enough to involve multiple machines that I think there's a simpler solution:

  • On old machine (trunk/ code): read all relevant data (/etc/rpki.conf, MySQL tables, *.{cer,key} files reachable from rpki.conf) into Python memory as some kind of simple object (could be a custom class but probably simpler if it's just a dict/list/int/str thing like one would get from parsing JSON or YAML). Write all of that out as a single file, using Python's native "Pickle" format (Python-specific, but guaranteed to work with any version of Python on any hardware).
  • Copy the pickled data to wherever it needs to go.
  • On new machine (tk705/ code): read the pickle to reproduce the in-Python-memory structure, then drop data into Django ORM objects as needed. Might try to do something clever with suggesting /etc/rpki.conf changes based on comparison of what's set on new machine with what's in the pickle, but probably not.

As a refinement, we might run the pickle through some compression program, both for size and, more importantly, for some kind of internal checksum to detect transfer errors while moving the pickle around. Heck, we could use gpg to wrap it, but let's not get carried away.

It turns out that rendering the contents of /etc/rpki.conf, a collection of MySQL databases, and disk files indicated by /etc/rpki.conf as Python dict() objects is not particularly hard. There's some redundancy (particularly if one uses the optional feature in the MySQLdb API that returns each table row as a dict()), but Pickle format is good at identifying common objects, so it's not particularly wasteful except for a bit of CPU time while generating the pickle.
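
To make this concrete, the capture step boils down to something like the following sketch. This is not the real ca-pickle.py: the credentials, the table walk, and the lack of error handling are all illustrative.

{{{
#!/usr/bin/env python
# Sketch of the capture step, not the real ca-pickle.py.

import cPickle
import subprocess
import MySQLdb, MySQLdb.cursors

world = {"rpki.conf": open("/etc/rpki.conf").read(), "sql": {}}

# DictCursor is the optional MySQLdb feature mentioned above: each
# row comes back as a dict keyed by column name.
db = MySQLdb.connect(db = "irdbd", user = "irdbd", passwd = "fnord",
                     cursorclass = MySQLdb.cursors.DictCursor)
cur = db.cursor()
cur.execute("SHOW TABLES")
for row in cur.fetchall():
    table = row.values()[0]
    cur.execute("SELECT * FROM " + table)
    world["sql"][table] = cur.fetchall()

# Pipe the pickle through xz; "-C sha256" embeds a SHA-256 integrity
# check which decompression verifies automatically.
xz = subprocess.Popen(("xz", "-C", "sha256"),
                      stdin  = subprocess.PIPE,
                      stdout = open("pickled-rpki.xz", "wb"))
cPickle.dump(world, xz.stdin, cPickle.HIGHEST_PROTOCOL)
xz.stdin.close()
xz.wait()
}}}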

While this could be generalized into some kind of back-up-the-entire-CA mechanism, that would be mission creep. At the moment, I'm focused on a specific nasty transition, which includes the raw-MySQL to Django ORM jump, which is enough of a challenge for one script.

Another thing I like about this besides its (relative) simplicity is that one can save the intermediate format. Assuming we can keep the part that generates the pickle simple enough, it should be straightforward to reassure ourselves that it has all the data we intended to save. Given that, we can isolate the more complex problem (unpacking the data into the new database) as a separate task, which we can run repeatedly until we get it right if that's what it takes: so long as the pickle is safe, no data has been lost.

Yes, of course we also tell the user to back up every freaking thing possible in addition to generating the pickle, even though we hope and intend that the pickle contains everything we need.

This scheme does assume that everything in a CA instance will fit in memory. That's not a safe assumption in the general case, but I think it's safe for everything we're likely to care about for this particular transition, given the state of play to date. There are variants on this scheme we could use if this were a problem, but I don't think it is.

Trac comment by sra on 2016-04-27T13:41:41Z

@sraustein
Contributor Author

for transfer check, just sha1 it on both ends

do not care about efficiency. one does not do this daily.

being as tolerant as possible of input issues on the /trunk side may be helpful.
i have spared you a lot of horrifying logs, for example:

{{{
Apr 26 00:08:40 ca0 rpkid[731]: MySQL exception with dirty objects in SQL cache!
Apr 26 00:08:40 ca0 rpkid[731]: Unhandled exception processing up-down request: OperationalError: (2006, 'MySQL server has gone away')
Apr 26 00:08:40 ca0 rpkid[731]: Unhandled exception processing up-down request: OperationalError: (2006, 'MySQL server has gone away')
Apr 26 00:09:07 ca0 rpkid[731]: MySQL exception with dirty objects in SQL cache!
Apr 26 00:09:07 ca0 rpkid[731]: Unhandled exception processing up-down request: OperationalError: (2006, 'MySQL server has gone away')
Apr 26 00:09:07 ca0 rpkid[731]: Unhandled exception processing up-down request: OperationalError: (2006, 'MySQL server has gone away')
}}}

not sure we need to back up the entire CA, as we can back up the
machine. as you can see from above, an audit-and-fix of the CA might
be useful.

i think the largest dataset that would migrate would be jpnic or cnnic.

Trac comment by randy on 2016-04-27T14:15:07Z

@sraustein
Contributor Author

> for transfer check, just sha1 it on both ends

Piping through "xz -C sha256" automates this.

> do not care about efficiency. one does not do this daily.

Right.

> being as tolerant as possible of input issues on the /trunk side may be helpful.

Input side is just data capture, no analysis.

> OperationalError: (2006, 'MySQL server has gone away')

You've been getting that on and off for years, with various causes,
most commonly port upgrade turning mysqld off but not turning it back
on. The other error messages quoted cascade from that.

> not sure we need to back up the entire CA, as we can back up the
> machine. as you can see from above, an audit-and-fix of the CA might
> be useful.

I think we are quibbling about "entire CA". Intent is to capture data
that needs to be in place on the new server to continue operation,
along with some minor config data which we may never need but which is
easiest to capture at the same time (eg, funny settings in rpki.conf).

> i think the largest dataset that would migrate would be jpnic or cnnic.

Seems likely.

Trac comment by sra on 2016-04-27T20:48:49Z

@sraustein
Contributor Author

In changeset 6395:
{{{
#!CommitTicketReference repository="" revision="6395"
First step of transition mechanism from trunk/ to tk705/: script to
encapsulate all (well, we hope) relevant configuration and state from
a trunk/ CA in a form we can easily load on another machine, or on the
same machine after a software upgrade, or ....

Transfer format is an ad hoc Python dictionary, encoded in Python's
native "Pickle" format, compressed by "xz" with SHA-256 integrity
checking enabled. See #807.
}}}

Trac comment by sra on 2016-04-27T22:20:20Z

@sraustein
Contributor Author

except the mysql server is running. and if i restart mysql-server, the
error persists. this is a rabbit hole best left unexplored if possible.
focus on migration.

it is not that i have not tried:
{{{
ca0.rpki.net:/root# mysql -p
Enter password:
Welcome to the MySQL monitor. Commands end with ; or \g.
Your MySQL connection id is 13
Server version: 5.5.49 Source distribution

Copyright (c) 2000, 2016, Oracle and/or its affiliates. All rights reserved.

Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql> Bye
ca0.rpki.net:/root# tail /var/log/messages
Apr 27 23:04:49 ca0 last message repeated 2 times
Apr 27 23:05:45 ca0 sshd[12136]: Connection closed by 198.180.150.1 [preauth]
Apr 27 23:05:45 ca0 sshguard[28474]: 198.180.150.1: should already have been blocked
Apr 27 23:06:17 ca0 rpkid[731]: MySQL exception with dirty objects in SQL cache!
Apr 27 23:06:17 ca0 rpkid[731]: Unhandled exception processing up-down request: OperationalError: (2006, 'MySQL server has gone away')
Apr 27 23:06:49 ca0 rpkid[731]: cron keepalive threshold 2016-04-27T23:06:48Z has expired, breaking lock
Apr 27 23:06:49 ca0 rpkid[731]: MySQL exception with dirty objects in SQL cache!
Apr 27 23:06:49 ca0 rpkid[731]: Unhandled exception processing up-down request: OperationalError: (2006, 'MySQL server has gone away')
Apr 27 23:08:17 ca0 rpkid[731]: MySQL exception with dirty objects in SQL cache!
Apr 27 23:08:17 ca0 rpkid[731]: Unhandled exception processing up-down request: OperationalError: (2006, 'MySQL server has gone away')
}}}

Trac comment by randy on 2016-04-27T23:09:28Z

@sraustein
Contributor Author

I'm considering taking advantage of this pickled migration process to make one schema change which appears to be beyond the capabilities of Django's migration system.

Task::
Fold rpki.irdb.models.Turtle model back into rpki.irdb.models.Parent now that rpki.irdb.models.Rootd is gone.

Users Affected::
Users of current tk705/ branch (me, Randy, Michael) may have to drop databases and rebuild, possibly losing all current data. OK, we could fix that too, but probably not worth the trouble for just the three of us and not yet anything in production.

Details::
Current table structure was complicated to allow a Repository to link to either a Parent or a Rootd. Now that we no longer need (or have) Rootd, we no longer need this complexity. Django ORM migrations throw up their hands and whimper when asked to make a change this fundamental to a SQL primary index column (I've tried, boy howdy how I have tried), but the pickled migration code doesn't care, because it doesn't need to modify SQL tables in place (rough sketch below).

When::
If we're ever going to do this, we should do it now, before anybody else is using this code. Once we have external users, we're stuck with the mess.
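
For concreteness, the shape of the change is roughly the following; the field names are illustrative, not the real rpki.irdb.models.

{{{
from django.db import models

# Before: Repository had to link to Turtle, the common base of Parent
# and the now-departed Rootd.  Multi-table inheritance makes Parent's
# primary key a OneToOneField pointing at Turtle, which is exactly the
# sort of SQL primary index column the migration system balks at.
#
# class Turtle(models.Model):
#     service_uri = models.CharField(max_length = 255)
#
# class Parent(Turtle):
#     parent_handle = models.SlugField(max_length = 255)
#
# class Repository(models.Model):
#     turtle = models.OneToOneField(Turtle, related_name = "repository")

# After: Turtle's fields fold straight into Parent, and Repository
# links to Parent directly.  Trivial to express in the ORM, not so
# trivial as an in-place ALTER of a primary key.

class Parent(models.Model):
    service_uri   = models.CharField(max_length = 255)
    parent_handle = models.SlugField(max_length = 255)

class Repository(models.Model):
    parent = models.OneToOneField(Parent, related_name = "repository")
}}}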

Any objections to making this change, before Randy attempts to move ca0.rpki.net?

Trac comment by sra on 2016-04-29T03:20:30Z

@sraustein
Contributor Author

On Fri, Apr 29, 2016 at 03:20:31AM -0000, Trac Ticket System wrote:

> Any objections to making this change, before Randy attempts to move
> ca0.rpki.net?

Sounds good to me.

Trac comment by melkins on 2016-04-29T16:21:33Z

@sraustein
Contributor Author

sure

Trac comment by randy on 2016-04-29T22:49:22Z

@sraustein
Contributor Author

uh, any progress?

Trac comment by randy on 2016-05-05T07:13:26Z

@sraustein
Contributor Author

One last bug....

Trac comment by sra on 2016-05-05T11:52:16Z

@sraustein
Contributor Author

OK, in theory it's ready for an alpha tester.

This is a two-stage process: the first stage runs on the machine
you're evacuating, the second runs on the destination machine. This
is deliberate, and should allow you to leave the old machine safely
idle in case something goes horribly wrong and you need to revert.

In addition to the usual tools, you need two scripts, ca-pickle.py
and ca-unpickle.py.

You can fetch these using svn if you want to pull the whole source
tree, or just fetch the individual scripts with wget, fetch, ....

On the old machine:

  • Stop the rpki servers (rpkid, irdbd, pubd).

  • Run ca-pickle.py. This takes one mandatory argument, the name of
    the output file. You can call this anything you like, but since
    it's xz-compressed it'd probably be less confusing to call it
    something ending in .xz:

    {{{
    sudo python ca-pickle.py pickled-rpki.xz
    }}}

  • Leave the servers shut off, and scp the file written by
    ca-pickle.py to the new machine.

On the new machine:

  • Make sure you have the latest tk705/ rpki-rp and rpki-ca
    packages installed. Given the recent incompatible change (discussed
    last week) to remove the Turtle model from the irdb, you may need
    to purge and reinstall to clear an upgrade error:

    {{{
    sudo apt-get update
    sudo apt-get purge rpki-ca rpki-rp
    sudo apt-get install rpki-rp rpki-ca
    }}}

  • The upgrade itself needs to take place with the servers disabled,
    and includes a bit of additional voodoo (notes follow):

    {{{
    sudo service rpki-ca stop
    sudo killall -u rpki
    sudo rm -rf /usr/share/rpki/*.{tal,cer} /usr/share/rpki/publication/* /usr/share/rpki/rrdp-publication/* /var/log/rpki/*
    sudo rpki-sql-setup --postgresql-root-username postgres drop
    sudo install -d -o rpki -g rpki /var/run/rpki /var/log/rpki /usr/share/rpki/publication /usr/share/rpki/rrdp-publication
    sudo rpki-sql-setup --postgresql-root-username postgres create
    sudo sudo -u rpki rpki-manage migrate rpkidb --settings rpki.django_settings.rpkid --no-color
    sudo sudo -u rpki rpki-manage migrate pubdb --settings rpki.django_settings.pubd --no-color
    sudo sudo -u rpki rpki-manage migrate irdb --settings rpki.django_settings.irdb --no-color
    sudo sudo -u rpki rpki-manage migrate --settings rpki.django_settings.gui --no-color
    sudo sudo -u rpki python ca-unpickle.py --rootd pickled-rpki.xz
    rpkic update_bpki
    sudo service rpki-ca restart
    sleep 30
    rpkic update_bpki 2>&1
    }}}

  • If nothing horrible has happened yet, wait five or ten minutes for
    things to settle down, then you should be in business on the new
    server.

Notes on the long script above:

  • If you're running as root, you can omit any sudo which isn't
    immediately followed by a -u rpki.
  • The service command should shut down the servers. The killall is
    paranoia in case some cron job happens to be using the database at
    exactly the wrong moment -- PostgreSQL won't let you drop the
    database while any process has it open.
  • The rm, database drop, install and database create are just wiping
    the state already present from the install and whatever testing you
    did, so we can start fresh. The Django migrations are needed to
    rebuild the database schemas after the drop and create cycle.
  • ca-unpickle does the real work (more below). The --rootd flag
    says you want it to attempt to transition the keypair from an old
    rootd-based configuration. Don't specify this unless you need it;
    the rootd code is considerably more complicated (and fragile) than
    the rest of the upgrade.
  • The first rpkic update_bpki is expected to whine about not being
    able to push data into the servers, because you still have the
    servers turned off at this point. This is normal, and is the reason
    why you run it again after a short wait for the servers to start
    up.

As to what's really going on here:

  • The core mechanism is fairly simple: ca-pickle reads
    /etc/rpki.conf, the contents of the old MySQL databases, and
    whatever files it can locate from the names it sees in
    /etc/rpki.conf, loads them all into one big in-memory Python
    object (top level is a dict()), then runs that object through
    Python's cPickle module and the xz compressor to dump the whole
    thing as a portable file which should be readable by Python on any
    supported platform. A sufficiently big installation would hit
    memory problems with this approach, but I doubt that any current
    installation running this code has hit that limit yet.
  • ca-unpickle does two separate things after uncompressing and
    unpickling the data structure created by ca-pickle:
    1. It translates the old data captured from MySQL into Django ORM
      objects on the new machine. This is tedious but straightforward,
      other than a few minor issues like updating machine-local URIs to
      match changed port numbers and so forth. For the most part, this
      is exactly the same thing we would have had to do in a Django
      data migration had we taken that approach, but without the
      requirement that the old and new databases both be reachable at
      the same time (or even be installed on the same machine); a
      rough sketch follows this list.
    2. If --rootd is specified, ca-unpickle also does some rather
      awful stuff to construct a usable rootd-less root configuration
      on the new machine. This is basically pushing on a rope, because
      the one rpkid data structure which absolutely must be preserved
      for this to work (the one that holds the RPKI root private key)
      is normally about six removes from direct control by anything in
      the back end; in order to make this work, we have to duplicate a
      lot of fiddly logic with parallel structures in the rpkidb and
      irdb databases. This is fancy nasty with raisins and cinnamon.
  • The reason you have to let things sit for a few minutes after the
    transition is that, even with all the awfulness described above,
    there's still some internal cleanup that the daemons have to perform
    after they regain control. For example, all of the "resource class"
    values in the RPKI up-down protocol have changed, because the
    trunk/ code was still using the awful hack of using SQL row index
    values as resource class names. Good riddance, but cleaning that up
    requires running a whole bunch of certificates through a
    revoke and reissue cycle. This should all happen automatically, but
    it's not instantaneous.
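
Stripped of its special cases, the translation step in 1. above looks
something like this. Again, a sketch: the table, model, and field
names are illustrative, not the real ca-unpickle.py.

{{{
#!/usr/bin/env python
# Sketch of the unpickle/translate step, not the real ca-unpickle.py.

import cPickle
import os
import subprocess

# Configure Django before touching the ORM.
os.environ["DJANGO_SETTINGS_MODULE"] = "rpki.django_settings.irdb"
import django
django.setup()
import rpki.irdb.models

# xz verifies the embedded SHA-256 check while decompressing, so a
# corrupted transfer fails here rather than half-loading.
xz = subprocess.Popen(("xz", "-dc", "pickled-rpki.xz"),
                      stdout = subprocess.PIPE)
world = cPickle.load(xz.stdout)
xz.wait()

for row in world["sql"]["irdb_parent"]:     # table name illustrative
    rpki.irdb.models.Parent.objects.create(
        parent_handle = row["parent_handle"],
        # Machine-local URIs may need rewriting to match the new
        # machine's hostname and port assignments:
        service_uri   = row["service_uri"].replace("old-host", "new-host"))
}}}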

Trac comment by sra on 2016-05-06T01:04:34Z

@sraustein
Contributor Author

how long is
{{{
ca0.rpki.net:/root# python ca-pickle.py pickled-rpki.xz
}}}

expected to run?

it's been maybe 15 minutes. and mysql-server is running

{{{
ca0.rpki.net:/root# mysql -p
Enter password:
Welcome to the MySQL monitor. Commands end with ; or \g.
Your MySQL connection id is 2940
Server version: 5.5.49 Source distribution

Copyright (c) 2000, 2016, Oracle and/or its affiliates. All rights reserved.

Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql> quit;
Bye
}}}

Trac comment by randy on 2016-05-07T06:28:01Z

@sraustein
Contributor Author

ignore. it finally finished.

Trac comment by randy on 2016-05-07T06:32:33Z

@sraustein
Contributor Author

Out of curiosity, please post size of the xz file.

I don't think I've seen ca-pickle take more than five or ten seconds,
but I was testing it with small data sets on a lightly loaded VM.

Trac comment by sra on 2016-05-07T06:39:04Z

@sraustein
Contributor Author

{{{
ca0.rpki.net:/root# l -h pickled-rpki.xz
-rw------- 1 root wheel 7.1M May 7 06:28 pickled-rpki.xz
}}}

Trac comment by randy on 2016-05-07T06:40:00Z

@sraustein
Contributor Author

Removed confused instructions which led to #815. That part of the instructions was just plain wrong.

Trac comment by sra on 2016-05-09T05:44:11Z

@sraustein
Contributor Author

Added --root-handle argument to ca-unpickle, so you can do:

{{{
python ca-unpickle.py blarg.xz --rootd --root-handle Root
}}}

so that the entity created from the salvaged rootd data will be named
"Root" instead of some randomly generated UUID.

If you already have an entity named "Root", this will fail with a SQL
constraint violation when it discovers that you're creating a second
Tenant with the same handle, but you want it to fail in such a case.
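
The guard here is nothing fancier than a uniqueness constraint on the
tenant handle, along these lines (illustrative, not the real
rpki.rpkidb.models):

{{{
from django.db import models

class Tenant(models.Model):
    # unique = True means a second Tenant named "Root" raises an
    # IntegrityError rather than silently creating a duplicate.
    tenant_handle = models.SlugField(max_length = 255, unique = True)
}}}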

Trac comment by sra on 2016-05-09T17:55:55Z

@sraustein
Contributor Author

Noting something I figured out while writing a report for Sandy:

If we need to take this pickled database hack beyond what will easily fit in memory, one relatively simple way of breaking the problem up into chunks would be to use the Python shelve module with gdbm. So, eg, instead of one great big enormous pickle, we could pickle each SQL table in a separate slot of the shelve database; if necessary, we could break things down even smaller, but one shelf per table is an easy target.
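
In outline, that variant would look something like the sketch below,
assuming anydbm picks gdbm as its back end; credentials and table
handling are illustrative, as before.

{{{
# Sketch of the shelve variant: one pickled SQL table per shelf slot,
# so only one table's rows need be in memory at a time.

import shelve
import MySQLdb, MySQLdb.cursors

db = MySQLdb.connect(db = "irdbd", user = "irdbd", passwd = "fnord",
                     cursorclass = MySQLdb.cursors.DictCursor)
cur = db.cursor()

shelf = shelve.open("pickled-rpki.shelf")
cur.execute("SHOW TABLES")
for row in cur.fetchall():
    table = row.values()[0]
    cur.execute("SELECT * FROM " + table)
    shelf[table] = cur.fetchall()           # each table pickled separately
shelf.close()
}}}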

Transfer format in this case would be a gdbm database, which we could then ship to another machine in portable format using the gdbm_dump and gdbm_load utilities, possibly compressed with xz for the same reasons we compress the current pickle format.

None of this is worth worrying about until and unless we hit a case which needs it, just making note of the technique while I remember it.

Trac comment by sra on 2016-05-10T18:06:40Z
