
Ticket to track Migration of a Root CA #813

Open
sraustein opened this issue Apr 25, 2016 · 19 comments

@sraustein
Contributor

is it safe to go through the root creation, pubd credentials, ... and patch in the migrated root key and resultant TAL later and then migrate the resources and users?

Trac ticket #807: component rpkid, priority major, owner None; created by randy on 2016-04-25T01:02:57Z, last modified 2016-05-10T18:06:40Z.

@sraustein
Contributor Author

> is it safe to go through the root creation, pubd credentials, ... and patch in the migrated root key and resultant TAL later and then migrate the resources and users?

Define "safe".

So long as you don't publish the TAL you use during testing, nobody but us will ever know about it, so at worst you might have to apt-get purge then reinstall.

Trac comment by sra on 2016-04-25T01:25:31Z

@sraustein
Contributor Author

While I had originally expected to do this sort of transition using Django migrations, this is a big enough jump and likely enough to involve multiple machines that I think there's a simpler solution:

  • On old machine (trunk/ code): read all relevant data (/etc/rpki.conf, MySQL tables, *.{cer,key} files reachable from rpki.conf) into Python memory as some kind of simple object (could be a custom class but probably simpler if it's just a dict/list/int/str thing like one would get from parsing JSON or YAML). Write all of that out as a single file, using Python's native "Pickle" format (Python-specific, but guaranteed to work with any version of Python on any hardware).
  • Copy the pickled data to wherever it needs to go.
  • On new machine (tk705/ code): read the pickle to reproduce the in-Python-memory structure, then drop data into Django ORM objects as needed. Might try to do something clever with suggesting /etc/rpki.conf changes based on comparison of what's set on new machine with what's in the pickle, but probably not.

As a refinement, we might run the pickle through some compression program, both for size and, more importantly, for some kind of internal checksum to detect transfer errors while moving the pickle around. Heck, we could use gpg to wrap it, but let's not get carried away.

It turns out that rendering the contents of /etc/rpki.conf, a collection of MySQL databases, and disk files indicated by /etc/rpki.conf as Python dict() objects is not particularly hard. There's some redundancy (particularly if one uses the optional feature in the MySQLdb API that returns each table row as a dict()), but Pickle format is good at identifying common objects, so it's not particularly wasteful except for a bit of CPU time while generating the pickle.
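
To make this concrete, the capture step boils down to something like the following sketch. This is not the real ca-pickle.py: the credentials, the table walk, and the lack of error handling are all illustrative.

{{{
#!/usr/bin/env python
# Sketch of the capture step, not the real ca-pickle.py.

import cPickle
import subprocess
import MySQLdb, MySQLdb.cursors

world = {"rpki.conf": open("/etc/rpki.conf").read(), "sql": {}}

# DictCursor is the optional MySQLdb feature mentioned above: each
# row comes back as a dict keyed by column name.
db = MySQLdb.connect(db = "irdbd", user = "irdbd", passwd = "fnord",
                     cursorclass = MySQLdb.cursors.DictCursor)
cur = db.cursor()
cur.execute("SHOW TABLES")
for row in cur.fetchall():
    table = row.values()[0]
    cur.execute("SELECT * FROM " + table)
    world["sql"][table] = cur.fetchall()

# Pipe the pickle through xz; "-C sha256" embeds a SHA-256 integrity
# check which decompression verifies automatically.
xz = subprocess.Popen(("xz", "-C", "sha256"),
                      stdin  = subprocess.PIPE,
                      stdout = open("pickled-rpki.xz", "wb"))
cPickle.dump(world, xz.stdin, cPickle.HIGHEST_PROTOCOL)
xz.stdin.close()
xz.wait()
}}}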

While this could be generalized into some kind of back-up-the-entire-CA mechanism, that would be mission creep. At the moment, I'm focused on a specific nasty transition, which includes the raw-MySQL to Django ORM jump, which is enough of a challenge for one script.

Another thing I like about this besides its (relative) simplicity is that one can save the intermediate format. Assuming we can keep the part that generates the pickle simple enough, it should be straightforward to reassure ourselves that it has all the data we intended to save. Given that, we can isolate the more complex problem (unpacking the data into the new database) as a separate task, which we can run repeatedly until we get it right if that's what it takes: so long as the pickle is safe, no data has been lost.

Yes, of course we also tell the user to back up every freaking thing possible in addition to generating the pickle, even though we hope and intend that the pickle contains everything we need.

This scheme does assume that everything in a CA instance will fit in memory. That's not a safe assumption in the general case, but I think it's safe for everything we're likely to care about for this particular transition, given the state of play to date. There are variants on this scheme we could use if this were a problem, but I don't think it is.

Trac comment by sra on 2016-04-27T13:41:41Z

@sraustein
Contributor Author

for transfer check, just sha1 it on both ends

do not care about efficiency. one does not do this daily.

being as tolerant as possible of input issues on the /trunk side may be helpful.
i have spared you a lot of horrifying logs, for example:

{{{
Apr 26 00:08:40 ca0 rpkid[731]: MySQL exception with dirty objects in SQL cache!
Apr 26 00:08:40 ca0 rpkid[731]: Unhandled exception processing up-down request: OperationalError: (2006, 'MySQL server has gone away')
Apr 26 00:08:40 ca0 rpkid[731]: Unhandled exception processing up-down request: OperationalError: (2006, 'MySQL server has gone away')
Apr 26 00:09:07 ca0 rpkid[731]: MySQL exception with dirty objects in SQL cache!
Apr 26 00:09:07 ca0 rpkid[731]: Unhandled exception processing up-down request: OperationalError: (2006, 'MySQL server has gone away')
Apr 26 00:09:07 ca0 rpkid[731]: Unhandled exception processing up-down request: OperationalError: (2006, 'MySQL server has gone away')
}}}

not sure we need to back up the entire CA, as we can back up the
machine. as you can see from above, an audit-and-fix of the CA might
be useful.

i think the largest dataset that would migrate would be jpnic or cnnic.

Trac comment by randy on 2016-04-27T14:15:07Z

@sraustein
Contributor Author

> for transfer check, just sha1 it on both ends

Piping through "xz -C sha256" automates this.

> do not care about efficiency. one does not do this daily.

Right.

> being as tolerant as possible of input issues on the /trunk side may be helpful.

Input side is just data capture, no analysis.

> OperationalError: (2006, 'MySQL server has gone away')

You've been getting that on and off for years, with various causes,
most commonly port upgrade turning mysqld off but not turning it back
on. The other error messages quoted cascade from that.

> not sure we need to back up the entire CA, as we can back up the
> machine. as you can see from above, an audit-and-fix of the CA might
> be useful.

I think we are quibbling about "entire CA". Intent is to capture data
that needs to be in place on the new server to continue operation,
along with some minor config data which we may never need but which is
easiest to capture at the same time (eg, funny settings in rpki.conf).

> i think the largest dataset that would migrate would be jpnic or cnnic.

Seems likely.

Trac comment by sra on 2016-04-27T20:48:49Z

@sraustein
Contributor Author

In changeset 6395:
{{{
#!CommitTicketReference repository="" revision="6395"
First step of transition mechanism from trunk/ to tk705/: script to
encapsulate all (well, we hope) relevant configuration and state from
a trunk/ CA in a form we can easily load on another machine, or on the
same machine after a software upgrade, or ....

Transfer format is an ad hoc Python dictionary, encoded in Python's
native "Pickle" format, compressed by "xz" with SHA-256 integrity
checking enabled. See #807.
}}}

Trac comment by sra on 2016-04-27T22:20:20Z

@sraustein
Contributor Author

except the mysql server is running. and if i restart mysql-server, the
error persists. this is a rabbit hole best left unexplored if possible.
focus on migration.

it is not that i have not tried:
{{{
ca0.rpki.net:/root# mysql -p
Enter password:
Welcome to the MySQL monitor. Commands end with ; or \g.
Your MySQL connection id is 13
Server version: 5.5.49 Source distribution

Copyright (c) 2000, 2016, Oracle and/or its affiliates. All rights reserved.

Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql> Bye
ca0.rpki.net:/root# tail /var/log/messages
Apr 27 23:04:49 ca0 last message repeated 2 times
Apr 27 23:05:45 ca0 sshd[12136]: Connection closed by 198.180.150.1 [preauth]
Apr 27 23:05:45 ca0 sshguard[28474]: 198.180.150.1: should already have been blocked
Apr 27 23:06:17 ca0 rpkid[731]: MySQL exception with dirty objects in SQL cache!
Apr 27 23:06:17 ca0 rpkid[731]: Unhandled exception processing up-down request: OperationalError: (2006, 'MySQL server has gone away')
Apr 27 23:06:49 ca0 rpkid[731]: cron keepalive threshold 2016-04-27T23:06:48Z has expired, breaking lock
Apr 27 23:06:49 ca0 rpkid[731]: MySQL exception with dirty objects in SQL cache!
Apr 27 23:06:49 ca0 rpkid[731]: Unhandled exception processing up-down request: OperationalError: (2006, 'MySQL server has gone away')
Apr 27 23:08:17 ca0 rpkid[731]: MySQL exception with dirty objects in SQL cache!
Apr 27 23:08:17 ca0 rpkid[731]: Unhandled exception processing up-down request: OperationalError: (2006, 'MySQL server has gone away')
}}}

Trac comment by randy on 2016-04-27T23:09:28Z

@sraustein
Contributor Author

I'm considering taking advantage of this pickled migration process to make one schema change which appears to be beyond the capabilities of Django's migration system.

Task::
Fold rpki.irdb.models.Turtle model back into rpki.irdb.models.Parent now that rpki.irdb.models.Rootd is gone.

Users Affected::
Users of current tk705/ branch (me, Randy, Michael) may have to drop databases and rebuild, possibly losing all current data. OK, we could fix that too, but probably not worth the trouble for just the three of us and not yet anything in production.

Details::
Current table structure was complicated to allow a Repository to link to either a Parent or a Rootd. Now that we no longer need (or have) Rootd, we no longer need this complexity. Django ORM migrations throw up their hands and whimper when asked to make a change this fundamental to a SQL primary index column (I've tried, boy howdy how I have tried), but the pickled migration code doesn't care, because it doesn't need to modify SQL tables in place (rough sketch below).

When::
If we're ever going to do this, we should do it now, before anybody else is using this code. Once we have external users, we're stuck with the mess.
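
For concreteness, the shape of the change is roughly the following; the field names are illustrative, not the real rpki.irdb.models.

{{{
from django.db import models

# Before: Repository had to link to Turtle, the common base of Parent
# and the now-departed Rootd.  Multi-table inheritance makes Parent's
# primary key a OneToOneField pointing at Turtle, which is exactly the
# sort of SQL primary index column the migration system balks at.
#
# class Turtle(models.Model):
#     service_uri = models.CharField(max_length = 255)
#
# class Parent(Turtle):
#     parent_handle = models.SlugField(max_length = 255)
#
# class Repository(models.Model):
#     turtle = models.OneToOneField(Turtle, related_name = "repository")

# After: Turtle's fields fold straight into Parent, and Repository
# links to Parent directly.  Trivial to express in the ORM, not so
# trivial as an in-place ALTER of a primary key.

class Parent(models.Model):
    service_uri   = models.CharField(max_length = 255)
    parent_handle = models.SlugField(max_length = 255)

class Repository(models.Model):
    parent = models.OneToOneField(Parent, related_name = "repository")
}}}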

Any objections to making this change, before Randy attempts to move ca0.rpki.net?

Trac comment by sra on 2016-04-29T03:20:30Z

@sraustein
Contributor Author

On Fri, Apr 29, 2016 at 03:20:31AM -0000, Trac Ticket System wrote:

> Any objections to making this change, before Randy attempts to move
> ca0.rpki.net?

Sounds good to me.

Trac comment by melkins on 2016-04-29T16:21:33Z

@sraustein
Contributor Author

sure

Trac comment by randy on 2016-04-29T22:49:22Z

@sraustein
Contributor Author

uh, any progress?

Trac comment by randy on 2016-05-05T07:13:26Z

@sraustein
Contributor Author

One last bug....

Trac comment by sra on 2016-05-05T11:52:16Z

@sraustein
Contributor Author

OK, in theory it's ready for an alpha tester.

This is a two-stage process: the first stage runs on the machine
you're evacuating, the second runs on the destination machine. This
is deliberate, and should allow you to leave the old machine safely
idle in case something goes horribly wrong and you need to revert.

In addition to the usual tools, you need two scripts, ca-pickle.py
and ca-unpickle.py.

You can fetch these using svn if you want to pull the whole source
tree, or just fetch the individual scripts with wget, fetch, ....

On the old machine:

  • Stop the rpki servers (rpkid, irdbd, pubd).

  • Run ca-pickle.py. This takes one mandatory argument, the name of
    the output file. You can call this anything you like, but since
    it's xz-compressed it'd probably be less confusing to call it
    something ending in .xz:

    {{{
    sudo python ca-pickle.py pickled-rpki.xz
    }}}

  • Leave the servers shut off, and scp the file written by
    ca-pickle.py to the new machine.

On the new machine:

  • Make sure you have the latest tk705/ rpki-rp and rpki-ca
    packages installed. Given the recent incompatible change (discussed
    last week) to remove the Turtle model from the irdb, you may need
    to purge and reinstall to clear an upgrade error:

    {{{
    sudo apt-get update
    sudo apt-get purge rpki-ca rpki-rp
    sudo apt-get install rpki-rp rpki-ca
    }}}

  • The upgrade itself needs to take place with the servers disabled,
    and includes a bit of additional voodoo (notes follow):

    {{{
    sudo service rpki-ca stop
    sudo killall -u rpki
    sudo rm -rf /usr/share/rpki/*.{tal,cer} /usr/share/rpki/publication/* /usr/share/rpki/rrdp-publication/* /var/log/rpki/*
    sudo rpki-sql-setup --postgresql-root-username postgres drop
    sudo install -d -o rpki -g rpki /var/run/rpki /var/log/rpki /usr/share/rpki/publication /usr/share/rpki/rrdp-publication
    sudo rpki-sql-setup --postgresql-root-username postgres create
    sudo sudo -u rpki rpki-manage migrate rpkidb --settings rpki.django_settings.rpkid --no-color
    sudo sudo -u rpki rpki-manage migrate pubdb --settings rpki.django_settings.pubd --no-color
    sudo sudo -u rpki rpki-manage migrate irdb --settings rpki.django_settings.irdb --no-color
    sudo sudo -u rpki rpki-manage migrate --settings rpki.django_settings.gui --no-color
    sudo sudo -u rpki python ca-unpickle.py --rootd pickled-rpki.xz
    rpkic update_bpki
    sudo service rpki-ca restart
    sleep 30
    rpkic update_bpki 2>&1
    }}}

  • If nothing horrible has happened yet, wait five or ten minutes for
    things to settle down, then you should be in business on the new
    server.

Notes on the long script above:

  • If you're running as root, you can omit any sudo which isn't
    immediately followed by a -u rpki.
  • The service command should shut down the servers. The killall is
    paranoia in case some cron job happens to be using the database at
    exactly the wrong moment -- PostgreSQL won't let you drop the
    database while any process has it open.
  • The rm, database drop, install and database create are just wiping
    the state already present from the install and whatever testing you
    did, so we can start fresh. The Django migrations are needed to
    rebuild the database schemas after the drop and create cycle.
  • ca-unpickle does the real work (more below). The --rootd flag
    says you want it to attempt to transition the keypair from an old
    rootd-based configuration. Don't specify this unless you need it;
    the rootd code is considerably more complicated (and fragile) than
    the rest of the upgrade.
  • The first rpkic update_bpki is expected to whine about not being
    able to push data into the servers, because you still have the
    servers turned off at this point. This is normal, and is the reason
    why you run it again after a short wait for the servers to start
    up.

As to what's really going on here:

  • The core mechanism is fairly simple: ca-pickle reads
    /etc/rpki.conf, the contents of the old MySQL databases, and
    whatever files it can locate from the names it sees in
    /etc/rpki.conf, loads them all into one big in-memory Python
    object (top level is a dict()), then runs that object through
    Python's cPickle module and the xz compressor to dump the whole
    thing as a portable file which should be readable by Python on any
    supported platform. A sufficiently big installation would hit
    memory problems with this approach, but I doubt that any current
    installation running this code has hit that limit yet.
  • ca-unpickle does two separate things after uncompressing and
    unpickling the data structure created by ca-pickle:
    1. It translates the old data captured from MySQL into Django ORM
      objects on the new machine. This is tedious but straightforward,
      other than a few minor issues like updating machine-local URIs to
      match changed port numbers and so forth. For the most part, this
      is exactly the same thing we would have had to do in a Django
      data migration had we taken that approach, but without the
      requirement that the old and new databases both be reachable at
      the same time (or even be installed on the same machine); a
      rough sketch follows this list.
    2. If --rootd is specified, ca-unpickle also does some rather
      awful stuff to construct a usable rootd-less root configuration
      on the new machine. This is basically pushing on a rope, because
      the one rpkid data structure which absolutely must be preserved
      for this to work (the one that holds the RPKI root private key)
      is normally about six removes from direct control by anything in
      the back end; in order to make this work, we have to duplicate a
      lot of fiddly logic with parallel structures in the rpkidb and
      irdb databases. This is fancy nasty with raisins and cinnamon.
  • The reason you have to let things sit for a few minutes after the
    transition is that, even with all the awfulness described above,
    there's still some internal cleanup that the daemons have to perform
    after they regain control. For example, all of the "resource class"
    values in the RPKI up-down protocol have changed, because the
    trunk/ code was still using the awful hack of using SQL row index
    values as resource class names. Good riddance, but cleaning that up
    requires running a whole bunch of certificates through a
    revoke and reissue cycle. This should all happen automatically, but
    it's not instantaneous.
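
Stripped of its special cases, the translation step in 1. above looks
something like this. Again, a sketch: the table, model, and field
names are illustrative, not the real ca-unpickle.py.

{{{
#!/usr/bin/env python
# Sketch of the unpickle/translate step, not the real ca-unpickle.py.

import cPickle
import os
import subprocess

# Configure Django before touching the ORM.
os.environ["DJANGO_SETTINGS_MODULE"] = "rpki.django_settings.irdb"
import django
django.setup()
import rpki.irdb.models

# xz verifies the embedded SHA-256 check while decompressing, so a
# corrupted transfer fails here rather than half-loading.
xz = subprocess.Popen(("xz", "-dc", "pickled-rpki.xz"),
                      stdout = subprocess.PIPE)
world = cPickle.load(xz.stdout)
xz.wait()

for row in world["sql"]["irdb_parent"]:     # table name illustrative
    rpki.irdb.models.Parent.objects.create(
        parent_handle = row["parent_handle"],
        # Machine-local URIs may need rewriting to match the new
        # machine's hostname and port assignments:
        service_uri   = row["service_uri"].replace("old-host", "new-host"))
}}}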

Trac comment by sra on 2016-05-06T01:04:34Z

@sraustein
Contributor Author

how long is
{{{
ca0.rpki.net:/root# python ca-pickle.py pickled-rpki.xz
}}}

expected to run?

it's been maybe 15 minutes. and mysql-server is running

{{{
ca0.rpki.net:/root# mysql -p
Enter password:
Welcome to the MySQL monitor. Commands end with ; or \g.
Your MySQL connection id is 2940
Server version: 5.5.49 Source distribution

Copyright (c) 2000, 2016, Oracle and/or its affiliates. All rights reserved.

Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql> quit;
Bye
}}}

Trac comment by randy on 2016-05-07T06:28:01Z

@sraustein
Contributor Author

ignore. it finally finished.

Trac comment by randy on 2016-05-07T06:32:33Z

@sraustein
Contributor Author

Out of curiosity, please post size of the xz file.

I don't think I've seen ca-pickle take more than five or ten seconds,
but I was testing it with small data sets on a lightly loaded VM.

Trac comment by sra on 2016-05-07T06:39:04Z

@sraustein
Contributor Author

{{{
ca0.rpki.net:/root# l -h pickled-rpki.xz
-rw------- 1 root wheel 7.1M May 7 06:28 pickled-rpki.xz
}}}

Trac comment by randy on 2016-05-07T06:40:00Z

@sraustein
Contributor Author

Removed confused instructions which led to #815. That part of the instructions was just plain wrong.

Trac comment by sra on 2016-05-09T05:44:11Z

@sraustein
Contributor Author

Added --root-handle argument to ca-unpickle, so you can do:

{{{
python ca-unpickle.py blarg.xz --rootd --root-handle Root
}}}

so that the entity created from the salvaged rootd data will be named
"Root" instead of some randomly generated UUID.

If you already have an entity named "Root", this will fail with a SQL
constraint violation when it discovers that you're creating a second
Tenant with the same handle, but you want it to fail in such a case.
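
The guard here is nothing fancier than a uniqueness constraint on the
tenant handle, along these lines (illustrative, not the real
rpki.rpkidb.models):

{{{
from django.db import models

class Tenant(models.Model):
    # unique = True means a second Tenant named "Root" raises an
    # IntegrityError rather than silently creating a duplicate.
    tenant_handle = models.SlugField(max_length = 255, unique = True)
}}}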

Trac comment by sra on 2016-05-09T17:55:55Z

@sraustein
Contributor Author

Noting something I figured out while writing a report for Sandy:

If we need to take this pickled database hack beyond what will easily fit in memory, one relatively simple way of breaking the problem up into chunks would be to use the Python shelve module with gdbm. So, eg, instead of one great big enormous pickle, we could pickle each SQL table in a separate slot of the shelve database; if necessary, we could break things down even smaller, but one shelf per table is an easy target.
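
In outline, that variant would look something like the sketch below,
assuming anydbm picks gdbm as its back end; credentials and table
handling are illustrative, as before.

{{{
# Sketch of the shelve variant: one pickled SQL table per shelf slot,
# so only one table's rows need be in memory at a time.

import shelve
import MySQLdb, MySQLdb.cursors

db = MySQLdb.connect(db = "irdbd", user = "irdbd", passwd = "fnord",
                     cursorclass = MySQLdb.cursors.DictCursor)
cur = db.cursor()

shelf = shelve.open("pickled-rpki.shelf")
cur.execute("SHOW TABLES")
for row in cur.fetchall():
    table = row.values()[0]
    cur.execute("SELECT * FROM " + table)
    shelf[table] = cur.fetchall()           # each table pickled separately
shelf.close()
}}}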

Transfer format in this case would be a gdbm database, which we could then ship to another machine in portable format using the gdbm_dump and gdbm_load utilities, possibly compressed with xz for the same reasons we compress the current pickle format.

None of this is worth worrying about until and unless we hit a case which needs it, just making note of the technique while I remember it.

Trac comment by sra on 2016-05-10T18:06:40Z
