INSTALL.pipeline
** STEP 1: INSTALL EVERYTHING **
To run the pipeline, you will need:
- git 2.6.0+
- a Python 3.3+ installation
- Pip (for Python 3.3+)
- seesaw (automatically installed by Pip)
- rsync
- wpull (automatically installed by Pip)
- youtube-dl
Quick install, for Debian and Debian-esque systems like Ubuntu:
  sudo apt-get update
  sudo apt-get install build-essential python3-dev python3-pip \
    libxml2-dev libxslt-dev zlib1g-dev libssl-dev libsqlite3-dev \
    libffi-dev git tmux fontconfig-config fonts-dejavu-core \
    libfontconfig1 libjpeg-turbo8 libjpeg8 lsof ffmpeg youtube-dl \
    autossh rsync
  pip3 install --upgrade pip
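Before moving on, it can help to confirm the required tools actually
ended up on your PATH. This is just a sanity-check sketch, not part of
the official install steps:

```shell
# Report which of the required tools are installed; returns nonzero
# if any are missing.
check_tools() {
  missing=0
  for tool in "$@"; do
    if command -v "$tool" >/dev/null 2>&1; then
      echo "found:   $tool"
    else
      echo "MISSING: $tool"
      missing=1
    fi
  done
  return "$missing"
}

check_tools git python3 pip3 rsync tmux || \
  echo "Install the missing tools before continuing."
```

Remember that git must be 2.6.0 or later and Python 3.3 or later; check
with git --version and python3 --version if in doubt.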
** STEP 2: CREATE THE ARCHIVEBOT USER **
After you've installed all the software, set up a dedicated account
for ArchiveBot:
  adduser archivebot
(You may also want to add the user to the sudo group.)
Log out of the server and log back in as user archivebot.
Then do:
  ssh-keygen
  [keep hitting Enter]
  cat ~/.ssh/id_rsa.pub
At this point, copy the public key output from your screen (it should start
with "ssh-rsa" followed by a bunch of letters and numbers), and put it in
an e-mail to David Yip (yipdw), letting him know that you're setting up a new
ArchiveBot pipeline, and that this is your new server's public key. Also let
him know a username you'd like for yourself, if you don't already have one.
He will set things up so your new pipeline server can coordinate with the
others, and will be allowed to upload finished WARCs to the Internet Archive.
** STEP 3: FINALLY, IT'S TIME TO INSTALL ARCHIVEBOT CODE **
Okay, back to the server stuff:
  cd ~/
  git clone https://github.com/ArchiveTeam/ArchiveBot
  cd ArchiveBot
  git submodule update --init
  pip3 install --user -r pipeline/requirements.txt
If you get any error messages at this point, you should try to
fix them before continuing, as they may indicate incompatibilities
between what ArchiveBot expects and what your server actually
provides.
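One quick way to confirm the install worked is to check that the
seesaw runner script landed where the pipeline expects it. This is a
sketch; the path assumes pip's default --user scheme on Linux:

```shell
# Returns 0 if the given runner script exists and is executable
check_runner() {
  [ -x "$1" ]
}

if check_runner "$HOME/.local/bin/run-pipeline3"; then
  echo "run-pipeline3 found"
else
  echo "run-pipeline3 missing: re-check the pip3 output above for errors"
fi
```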
** STEP 4: START IT UP **
As user archivebot, in the FIRST tmux session:
  autossh -C -L 127.0.0.1:16379:127.0.0.1:6379 \
    YOUR-USERNAME-GOES-HERE@CONTROL-NODE-GOES-HERE -N
As user archivebot, in the SECOND tmux session:
  cd ~/ArchiveBot/pipeline
  mkdir -p ~/warcs4fos
  export REDIS_URL=redis://127.0.0.1:16379/0
  export FINISHED_WARCS_DIR=$HOME/warcs4fos
If you run the pipeline on a system with a modern OpenSSL version
(e.g. Debian Buster and later), which comes with a more secure default
configuration, additionally set the OPENSSL_CONF environment variable:
  export OPENSSL_CONF=/home/archivebot/ArchiveBot/ops/openssl-less-secure.cnf
Now, think up a name for this new ArchiveBot pipeline. It will
appear on the publicly available pipeline status dashboard. It will
go in the command you enter next:
  ~/.local/bin/run-pipeline3 pipeline.py --disable-web-server \
    --concurrent 2 YOUR-PIPELINE-NAME-GOES-HERE 2>&1 | \
    tee "pipeline-$(date -u +"%Y-%m-%dT%H_%M_%SZ").log"
You can adjust the number of jobs your server can handle in
--concurrent as needed.
If you want your pipeline to only handle !ao/!archiveonly jobs, run it
with the AO_ONLY environment variable set:
  AO_ONLY=1 ~/.local/bin/run-pipeline3 pipeline.py \
    --disable-web-server --concurrent 2 \
    YOUR-PIPELINE-NAME-GOES-HERE

or

  export AO_ONLY=1
  ~/.local/bin/run-pipeline3 pipeline.py --disable-web-server \
    --concurrent 2 YOUR-PIPELINE-NAME-GOES-HERE
If your pipeline has large amounts of disk space (at least 100GB dedicated to
ArchiveBot's processing), set the LARGE environment variable in the same way
as AO_ONLY above. Your pipeline will accept jobs queued with the --large
option.
If you are getting errors about wpull, you may need to create a symbolic
link to it, like this:
  ln -s /usr/bin/wpull /home/archivebot/ArchiveBot/pipeline/wpull
(Adjust the /home/YOUR_USER_HERE/YOUR-DIRECTORY/ path to match your
installation.)
As user archivebot, in the THIRD tmux session:
  export RSYNC_URL=UPLOAD-URL-GOES-HERE
  ~/ArchiveBot/uploader/uploader.py $HOME/warcs4fos
If you start multiple pipelines, you can safely point them to the
same FINISHED_WARCS_DIR and run just one uploader.
Check out the ArchiveBot dashboard to make sure everything is
working like it ought to.
** STEP 5: MISCELLANEOUS **
To gracefully stop the pipeline:
  touch ~/ArchiveBot/pipeline/STOP
To gracefully stop the uploader, hit ctrl-c in its tmux session.
To upgrade, run:
  pip3 install --user --upgrade -r pipeline/requirements.txt
** STEP 5a: TROUBLESHOOTING YOUTUBE-DL INSTALLATION **
youtube-dl is a command-line program for downloading videos from
YouTube, Vimeo, and other websites that feature embedded videos.
It should be installed on your system automatically through
requirements.txt, but just in case that doesn't work, here's how you
can install it yourself:
  sudo apt-get install python3-pip
  pip3 install --upgrade youtube_dl

Or, for older versions of Python:

  sudo apt-get install python-pip
  pip install --upgrade youtube_dl
** STEP 6: Operate the Pipeline **
Some pointers for pipeline operators:
You can find the process ID for a job by running ps aux | grep $jobid.
That job has a job directory, which you can find in the data directory;
you can also get it out of ps. The job directory is wpull's scratch
space, where it puts files it's downloading and where it assembles its
WARC. It will move the WARC into the uploader folder when it reaches
the designated size.
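The steps above can be sketched as follows. JOBID is a placeholder for
the identifier shown on the dashboard, and the data path assumes the
default install location:

```shell
# JOBID is a placeholder; substitute the job identifier from the dashboard.
JOBID="${JOBID:-abc123}"

# List matching processes with their PIDs (-a shows the full command line)
pgrep -af "$JOBID" || echo "no process matching $JOBID"

# Job directories live under the pipeline's data directory
ls -d ~/ArchiveBot/pipeline/data/*"$JOBID"* 2>/dev/null || true
```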
If a job becomes stuck, find the process ID of its wpull instance and
kill it with kill -9. The pipeline will move the WARC to the upload
directory, upload it, and mark the job complete (you may want to note
in #archivebot that you did this). The job may be re-queued.
If you stop the pipeline or it crashes, you should remove the job
directories under pipeline/data, and clean out /tmp. The WARCs in the
pipeline directory are almost certainly incomplete and should not be
uploaded. The jobs cannot currently be resumed, and so the data dir and
/tmp are just consuming space.
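A cleanup sketch along these lines may help; the data path and the
/tmp file pattern are assumptions, so double-check what your wpull
version actually leaves behind before deleting anything:

```shell
# Remove leftover job scratch space after a pipeline stop or crash.
# WARNING: only do this when the pipeline is NOT running.
cleanup_scratch() {
  data_dir="$1"
  # Incomplete WARCs and per-job scratch dirs live here; they cannot
  # be resumed, so they only consume space.
  rm -rf "$data_dir"/*
  # wpull temp files under /tmp (pattern is an assumption; verify first)
  rm -rf /tmp/tmp*wpull* 2>/dev/null || true
}

# Typical invocation (path assumes the default install location):
# cleanup_scratch ~/ArchiveBot/pipeline/data
```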
If the pipeline runs out of disk space, it will be unable to do any
useful work and jobs will lock up or fail. In this case, check that the
uploader is functioning; if it is, use the du command in the data
directory to see what is taking up space. If wpull.log is the culprit,
truncate it to 0 bytes (with the truncate command, not rm) to free up
a little space if needed.
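A sketch of both checks, run inside the job directory. Truncating in
place keeps the file handle valid for the still-running wpull process,
which removing the file would not:

```shell
# Show the largest entries in the current directory
du -sh ./* 2>/dev/null | sort -h | tail -n 5

# Empty an oversized log in place without deleting it
truncate -s 0 wpull.log
```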
If the pipeline runs out of RAM, you will likely have to kill the job
that is consuming all the RAM; wpull instances will pause to avoid the
OOM killer being run. Consider creating a small swap file if your VM
does not have any swap.
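A common recipe for adding a 1 GB swap file on Debian-style systems,
as a sketch (run as root; the size and path are up to you):

```shell
fallocate -l 1G /swapfile
chmod 600 /swapfile
mkswap /swapfile
swapon /swapfile
# To keep the swap file across reboots, add this line to /etc/fstab:
#   /swapfile none swap sw 0 0
```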