INSTALL.pipeline
** STEP 1: INSTALL EVERYTHING **
To run the pipeline, you will need:
- git 2.6.0+
- a Python 3.3+ installation
- Pip (for Python 3.3+)
- seesaw (automatically installed by Pip)
- rsync
- wpull (automatically installed by Pip)
- youtube-dl
Quick install, for Debian and Debian-esque systems like Ubuntu:
  sudo apt-get update
  sudo apt-get install build-essential python3-dev python3-pip \
    libxml2-dev libxslt-dev zlib1g-dev libssl-dev libsqlite3-dev \
    libffi-dev git tmux fontconfig-config fonts-dejavu-core \
    libfontconfig1 libjpeg-turbo8 libjpeg8 lsof ffmpeg youtube-dl \
    autossh rsync
  pip3 install --upgrade pip
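Before moving on, it can help to confirm the required tools actually
ended up on your PATH. This is just a sanity-check sketch, not part of
the official install steps:

```shell
# Report which of the required tools are installed; returns nonzero
# if any are missing.
check_tools() {
  missing=0
  for tool in "$@"; do
    if command -v "$tool" >/dev/null 2>&1; then
      echo "found:   $tool"
    else
      echo "MISSING: $tool"
      missing=1
    fi
  done
  return "$missing"
}

check_tools git python3 pip3 rsync tmux || \
  echo "Install the missing tools before continuing."
```

Remember that git must be 2.6.0 or later and Python 3.3 or later; check
with git --version and python3 --version if in doubt.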
** STEP 2: CREATE THE ARCHIVEBOT USER **
After you've installed all the software, set up a dedicated account
for ArchiveBot:
  adduser archivebot
(You may also want to add the user to the sudo group.)
Log out of the server and log back in as user archivebot.
Then do:
  ssh-keygen
  [keep hitting Enter]
  cat ~/.ssh/id_rsa.pub
At this point, copy the public key output from your screen (it should start
with "ssh-rsa" followed by a bunch of letters and numbers), and put it in
an e-mail to David Yip (yipdw), letting him know that you're setting up a new
ArchiveBot pipeline, and that this is your new server's public key. Also let
him know a username you'd like for yourself, if you don't already have one.
He will set things up so your new pipeline server can coordinate with the
others, and will be allowed to upload finished WARCs to the Internet Archive.
** STEP 3: FINALLY, IT'S TIME TO INSTALL ARCHIVEBOT CODE **
Okay, back to the server stuff:
  cd ~/
  git clone https://github.com/ArchiveTeam/ArchiveBot
  cd ArchiveBot
  git submodule update --init
  pip3 install --user -r pipeline/requirements.txt
If you get any error messages at this point, you should try to
fix them before continuing, as they may indicate incompatibilities
between what ArchiveBot expects and what your server actually
provides.
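One quick way to confirm the install worked is to check that the
seesaw runner script landed where the pipeline expects it. This is a
sketch; the path assumes pip's default --user scheme on Linux:

```shell
# Returns 0 if the given runner script exists and is executable
check_runner() {
  [ -x "$1" ]
}

if check_runner "$HOME/.local/bin/run-pipeline3"; then
  echo "run-pipeline3 found"
else
  echo "run-pipeline3 missing: re-check the pip3 output above for errors"
fi
```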
** STEP 4: START IT UP **
As user archivebot, in the FIRST tmux session:
  autossh -C -L 127.0.0.1:16379:127.0.0.1:6379 \
    YOUR-USERNAME-GOES-HERE@CONTROL-NODE-GOES-HERE -N
As user archivebot, in the SECOND tmux session:
  cd ~/ArchiveBot/pipeline
  mkdir -p ~/warcs4fos
  export REDIS_URL=redis://127.0.0.1:16379/0
  export FINISHED_WARCS_DIR=$HOME/warcs4fos
If you run the pipeline on a system with a modern OpenSSL version
(e.g. Debian Buster and later), which comes with a more secure default
configuration, additionally set the OPENSSL_CONF environment variable:
  export OPENSSL_CONF=/home/archivebot/ArchiveBot/ops/openssl-less-secure.cnf
Now, think up a name for this new ArchiveBot pipeline. It will
appear on the publicly available pipeline status dashboard. It will
go in the command you enter next:
  ~/.local/bin/run-pipeline3 pipeline.py --disable-web-server \
    --concurrent 2 YOUR-PIPELINE-NAME-GOES-HERE 2>&1 | \
    tee "pipeline-$(date -u +"%Y-%m-%dT%H_%M_%SZ").log"
You can adjust the number of jobs your server can handle in
--concurrent as needed.
If you want your pipeline to only handle !ao/!archiveonly jobs, run it
with the AO_ONLY environment variable set:
  AO_ONLY=1 ~/.local/bin/run-pipeline3 pipeline.py \
    --disable-web-server --concurrent 2 \
    YOUR-PIPELINE-NAME-GOES-HERE

or

  export AO_ONLY=1
  ~/.local/bin/run-pipeline3 pipeline.py --disable-web-server \
    --concurrent 2 YOUR-PIPELINE-NAME-GOES-HERE
If your pipeline has large amounts of disk space (at least 100GB dedicated to
ArchiveBot's processing), set the LARGE environment variable in the same way
as AO_ONLY above. Your pipeline will accept jobs queued with the --large
option.
If you are getting errors about wpull, you may need to create a symbolic
link to it, like this:
  ln -s /usr/bin/wpull /home/archivebot/ArchiveBot/pipeline/wpull
(Adjust the /home/YOUR_USER_HERE/YOUR-DIRECTORY/ path to match your
installation.)
As user archivebot, in the THIRD tmux session:
  export RSYNC_URL=UPLOAD-URL-GOES-HERE
  ~/ArchiveBot/uploader/uploader.py $HOME/warcs4fos
If you start multiple pipelines, you can safely point them to the
same FINISHED_WARCS_DIR and run just one uploader.
Check out the ArchiveBot dashboard to make sure everything is
working like it ought to.
** STEP 5: MISCELLANEOUS **
To gracefully stop the pipeline:
  touch ~/ArchiveBot/pipeline/STOP
To gracefully stop the uploader, hit ctrl-c in its tmux session.
To upgrade, run:
  pip3 install --user --upgrade -r pipeline/requirements.txt
** STEP 5a: TROUBLESHOOTING YOUTUBE-DL INSTALLATION **
youtube-dl is a command-line program for downloading videos from
YouTube, Vimeo, and other websites that feature embedded videos.
It should be installed on your system automatically through
requirements.txt, but just in case that doesn't work, here's how you
can install it yourself:
  sudo apt-get install python3-pip
  pip3 install --upgrade youtube_dl

Or, for older versions of Python:

  sudo apt-get install python-pip
  pip install --upgrade youtube_dl
** STEP 6: Operate the Pipeline **
Some pointers for pipeline operators:
You can find the process ID for a job by running ps aux | grep $jobid.
That job has a job directory, which you can find in the data directory;
you can also get it out of ps. The job directory is wpull's scratch
space, where it puts files it's downloading and where it assembles its
WARC. It will move the WARC into the uploader folder when it reaches
the designated size.
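The steps above can be sketched as follows. JOBID is a placeholder for
the identifier shown on the dashboard, and the data path assumes the
default install location:

```shell
# JOBID is a placeholder; substitute the job identifier from the dashboard.
JOBID="${JOBID:-abc123}"

# List matching processes with their PIDs (-a shows the full command line)
pgrep -af "$JOBID" || echo "no process matching $JOBID"

# Job directories live under the pipeline's data directory
ls -d ~/ArchiveBot/pipeline/data/*"$JOBID"* 2>/dev/null || true
```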
If a job becomes stuck, find the process ID of its wpull instance and
kill it with kill -9. The pipeline will move the WARC to the upload
directory, upload it, and mark the job complete (you may want to note
in #archivebot that you did this). The job may be re-queued.
If you stop the pipeline or it crashes, you should remove the job
directories under pipeline/data, and clean out /tmp. The WARCs in the
pipeline directory are almost certainly incomplete and should not be
uploaded. The jobs cannot currently be resumed, and so the data dir and
/tmp are just consuming space.
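A cleanup sketch along these lines may help; the data path and the
/tmp file pattern are assumptions, so double-check what your wpull
version actually leaves behind before deleting anything:

```shell
# Remove leftover job scratch space after a pipeline stop or crash.
# WARNING: only do this when the pipeline is NOT running.
cleanup_scratch() {
  data_dir="$1"
  # Incomplete WARCs and per-job scratch dirs live here; they cannot
  # be resumed, so they only consume space.
  rm -rf "$data_dir"/*
  # wpull temp files under /tmp (pattern is an assumption; verify first)
  rm -rf /tmp/tmp*wpull* 2>/dev/null || true
}

# Typical invocation (path assumes the default install location):
# cleanup_scratch ~/ArchiveBot/pipeline/data
```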
If the pipeline runs out of disk space, it will be unable to do any
useful work and jobs will lock up or fail. In this case, check that the
uploader is functioning; if it is, use the du command in the data
directory to see what is taking up space. If wpull.log is the culprit,
truncate it to 0 bytes (with the truncate command, not rm) to free up
a little space if needed.
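A sketch of both checks, run inside the job directory. Truncating in
place keeps the file handle valid for the still-running wpull process,
which removing the file would not:

```shell
# Show the largest entries in the current directory
du -sh ./* 2>/dev/null | sort -h | tail -n 5

# Empty an oversized log in place without deleting it
truncate -s 0 wpull.log
```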
If the pipeline runs out of RAM, you will likely have to kill the job
that is consuming all the RAM; wpull instances will pause to avoid the
OOM killer being run. Consider creating a small swap file if your VM
does not have any swap.
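A common recipe for adding a 1 GB swap file on Debian-style systems,
as a sketch (run as root; the size and path are up to you):

```shell
fallocate -l 1G /swapfile
chmod 600 /swapfile
mkswap /swapfile
swapon /swapfile
# To keep the swap file across reboots, add this line to /etc/fstab:
#   /swapfile none swap sw 0 0
```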