BBCPARC fails when the file to archive is on a remote host #19

Closed
gsleap opened this issue Dec 6, 2019 · 27 comments

@gsleap
Collaborator

gsleap commented Dec 6, 2019

When issuing a BBCPARC command:
http://ngashost:7777/BBCPARC?filename=user%40hostwithfile%3A%2Fdata%2F1247842824_20190722150006_ch114_000.fits&bnum_streams=12&mime_type=application%2Fx-mwa-fits

The NGAS server returns 400 BAD REQUEST, and the log file shows the following error:
2019-12-06T05:01:03.292 [ 5303] [ R-65] [ INFO] ngamsServer.ngamsServer#handleHttpRequest:1696 Handling HTTP request: client_address=('192.168.120.204', 43008) - method=GET - path=|BBCPARC?filename=user%40hostwithfile%3A%2Fdata%2F1247842824_20190722150006_ch114_000.fits&bnum_streams=12&mime_type=application%2Fx-mwa-fits|
2019-12-06T05:01:03.293 [ 5303] [ R-65] [ INFO] ngamsServer.ngamsCmdHandling#_get_module:74 Received command: BBCPARC
2019-12-06T05:01:03.293 [ 5303] [ R-65] [ INFO] ngamsServer.ngamsArchiveUtils#_dataHandler:1055 Handling archive pull request
2019-12-06T05:01:03.294 [ 5303] [ R-65] [ ERROR] ngamsServer.ngamsServer#reqCallBack:1633 Error while serving request
Traceback (most recent call last):
File "/usr/lib/python3.6/urllib/request.py", line 1474, in open_local_file
stats = os.stat(localfile)
FileNotFoundError: [Errno 2] No such file or directory: '/user@hostwithfile:/data/1247842824_20190722150006_ch114_000.fits'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/home/mwa/ngas_rt/lib/python3.6/site-packages/ngamsServer-11.0-py3.6.egg/ngamsServer/ngamsServer.py", line 1617, in reqCallBack
method, path, headers)
File "/home/mwa/ngas_rt/lib/python3.6/site-packages/ngamsServer-11.0-py3.6.egg/ngamsServer/ngamsServer.py", line 1711, in handleHttpRequest
ngamsCmdHandling.handle_cmd(self, reqPropsObj, httpRef)
File "/home/mwa/ngas_rt/lib/python3.6/site-packages/ngamsServer-11.0-py3.6.egg/ngamsServer/ngamsCmdHandling.py", line 63, in handle_cmd
msg = _get_module(srvObj, reqPropsObj).handleCmd(srvObj, reqPropsObj, httpRef)
File "/home/mwa/ngas_rt/lib/python3.6/site-packages/ngamsServer-11.0-py3.6.egg/ngamsServer/commands/bbcparc.py", line 182, in handleCmd
transfer=bbcp_transfer)
File "/home/mwa/ngas_rt/lib/python3.6/site-packages/ngamsServer-11.0-py3.6.egg/ngamsServer/ngamsArchiveUtils.py", line 1033, in dataHandler
do_replication=do_replication, transfer=transfer)
File "/home/mwa/ngas_rt/lib/python3.6/site-packages/ngamsServer-11.0-py3.6.egg/ngamsServer/ngamsArchiveUtils.py", line 1062, in _dataHandler
handle = urlrequest.urlopen(url)
File "/usr/lib/python3.6/urllib/request.py", line 223, in urlopen
return opener.open(url, data, timeout)
File "/usr/lib/python3.6/urllib/request.py", line 526, in open
response = self._open(req, data)
File "/usr/lib/python3.6/urllib/request.py", line 544, in _open
'_open', req)
File "/usr/lib/python3.6/urllib/request.py", line 504, in _call_chain
result = func(*args)
File "/usr/lib/python3.6/urllib/request.py", line 1452, in file_open
return self.open_local_file(req)
File "/usr/lib/python3.6/urllib/request.py", line 1491, in open_local_file
raise URLError(exp)
urllib.error.URLError: <urlopen error [Errno 2] No such file or directory: '/user@hostwithfile:/data/1247842824_20190722150006_ch114_000.fits'>

NOTE: this is not a showstopper, as I can easily use QARCHIVE, but I thought I would try out bbcparc first.
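
For reference, the request above can be rebuilt from its unencoded parameters. This is a minimal sketch, assuming the host, port and parameter values shown in the report; `bbcparc_url` is a hypothetical helper, not part of any NGAS client API.

```python
from urllib.parse import urlencode


def bbcparc_url(server, filename, mime_type, streams):
    """Build a BBCPARC request URL (hypothetical helper, not an NGAS API)."""
    # urlencode percent-encodes '@', ':' and '/' in the values, which is why
    # the remote file spec shows up as user%40hostwithfile%3A%2Fdata%2F...
    query = urlencode({
        "filename": filename,
        "bnum_streams": streams,
        "mime_type": mime_type,
    })
    return "http://{}/BBCPARC?{}".format(server, query)


url = bbcparc_url(
    "ngashost:7777",
    "user@hostwithfile:/data/1247842824_20190722150006_ch114_000.fits",
    "application/x-mwa-fits",
    12,
)
print(url)
```

The decoded `filename` value is what the server receives, so whatever opens it on the server side has to understand the `user@host:/path` form.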

@rtobar
Contributor

rtobar commented Dec 6, 2019

Calling it a bug as this is actually supposed to work, but it doesn't currently. The only scenario where BBCP-driven archiving works is in localhost<-->localhost deployments, which is what we test with in our unit tests, and hence they have always passed.

rtobar added a commit that referenced this issue Jan 13, 2020
This is the core change needed by #19. So far we had always specified
the source file with a simple file path, but to fetch remote files we
need to specify them in the form [user@]host:/path/to/file. The host
part is calculated with the remote IP address of the HTTP request coming
from the client. This works under the assumption a connection in the
reverse order can be established.

Regarding the last point, bbcp seems to have options to revert the
connection flow, so the source (i.e., the NGAS client) connects to the
sink (i.e., the NGAS server). This *should* work in principle, but in a
simple test using docker containers I had trouble making it work, and
since I haven't invested more time figuring this out I refrained from
adding this connection flow inversion.

Signed-off-by: Rodrigo Tobar <[email protected]>
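
The source form described in the commit message, `[user@]host:/path/to/file` with the host defaulting to the client's address, can be sketched as follows. This is only an illustration under stated assumptions: `RemoteFile` and `resolve_source` are hypothetical names, the regular expression is one possible way to split the parts, and it does not handle IPv6 literals. It is not the code that went into the `bbcp_fixes` branch.

```python
import re
from typing import NamedTuple, Optional


class RemoteFile(NamedTuple):
    """Parts of a [user@]host:/path source (hypothetical type)."""
    user: Optional[str]
    host: str
    path: str

    def spec(self) -> str:
        # Render back into the [user@]host:/path form used on the command line
        prefix = "{}@".format(self.user) if self.user else ""
        return "{}{}:{}".format(prefix, self.host, self.path)


# Optional "user@", optional "host:", then an absolute path
_SPEC = re.compile(
    r"^(?:(?P<user>[^@/:]+)@)?(?:(?P<host>[^@/:]+):)?(?P<path>/.*)$"
)


def resolve_source(filename: str, client_ip: str) -> RemoteFile:
    """Fill in the host from the HTTP client's address when none is given."""
    match = _SPEC.match(filename)
    if match is None:
        raise ValueError(
            "expected [user@][host:]/absolute/path, got {!r}".format(filename)
        )
    host = match.group("host") or client_ip
    return RemoteFile(match.group("user"), host, match.group("path"))


print(resolve_source("/data/a.fits", "192.168.120.204").spec())
print(resolve_source("user@hostwithfile:/data/a.fits", "192.168.120.204").spec())
```

A plain path picks up the client's address, while an explicit `user@host:` prefix is kept as given.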
@rtobar
Contributor

rtobar commented Jan 17, 2020

@gsleap I've finished pushing a first version of a fix for this under the bbcp_fixes branch. Would you be able to test and see if the issue is now gone? I tested locally using a Docker container connecting to my host machine, and I got that working at least, after making sure SSH connections were correctly configured, etc. Our Travis CI tests are also now effectively connecting via SSH to localhost (before they weren't, as file URLs were just file paths, in which case SSH connectivity was bypassed), and existing tests are still working in all platforms.

@gsleap
Collaborator Author

gsleap commented Jan 21, 2020

Hi Rod, thanks for this. I've had a go and definitely have gotten further, however there is an issue with the crc checksum.

When my (target) NGAS server has CRCVariant=1 in the ArchiveHandling section, then when I try to bbcparc, I get:

<!DOCTYPE NgamsStatus SYSTEM "http://mwacache10.mwa128t.org:7777/RETRIEVE?internal=ngamsStatus.dtd">
<NgamsStatus>
  <Status Date="2020-01-21T03:08:35.659" HostId="mwacache10:7777" Message="bbcp returncode: 1. Command line: ['bbcp -f -V -S &quot;ssh -x -a -oBatchMode=yes -oGSSAPIAuthentication=no -oFallBackToRsh=no %4 %I -l %U %H bbcp&quot; -e -E c32c=/dev/stdout -P 2 -z 10.128.13.9:/volume1/test.sub /home/mwa/NGAS/volume2/staging/NGAMS_TMP_FILE___krlsu_30test.sub'], out: b'', err: b'bbcp: Invalid checksum type - c32c\n'" State="ONLINE" Status="FAILURE" SubState="IDLE" Version="11.0/2018-10-26T07:00:00"/>

I do have crc32c installed in my NGAS virtual environment:

(ngas_rt) mwa@mwacache10:~$ pip freeze | grep crc
crc32c==1.7

So, I decided to ignore that and set my CRCVariant to just crc32 (CRCVariant=0) but then I get an error about CRC32z:

<?xml version="1.0" ?>
<!DOCTYPE NgamsStatus SYSTEM "http://mwacache10.mwa128t.org:7777/RETRIEVE?internal=ngamsStatus.dtd">
<NgamsStatus>
  <Status Date="2020-01-21T03:16:29.107" HostId="mwacache10:7777" Message="bbcp returncode: 1. Command line: ['bbcp -f -V -S &quot;ssh -x -a -oBatchMode=yes -oGSSAPIAuthentication=no -oFallBackToRsh=no %4 %I -l %U %H bbcp&quot; -e -E c32z=/dev/stdout -s 12 -P 2 -z 10.128.13.9:/volume1/test.sub /home/mwa/NGAS/volume1/staging/NGAMS_TMP_FILE___u9t349mqtest.sub.fits'], out: b'', err: b'bbcp: Invalid checksum type - c32z\n'" State="ONLINE" Status="FAILURE" SubState="IDLE" Version="11.0/2018-10-26T07:00:00"/>
</NgamsStatus>
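
When testing repeatedly, it can help to pull the `Status` and `Message` attributes out of these responses programmatically. A minimal sketch using only the standard library follows; `parse_status` is a hypothetical helper, and the sample document is a shortened stand-in for the response above (the DTD URL and message text are placeholders).

```python
import xml.etree.ElementTree as ET


def parse_status(xml_text):
    """Return (status, message) from an NgamsStatus document."""
    root = ET.fromstring(xml_text)
    status = root.find("Status")
    if status is None:
        raise ValueError("no <Status> element found")
    return status.get("Status"), status.get("Message")


# Shortened sample; the external DTD is declared but never fetched by the parser
SAMPLE = """<?xml version="1.0" ?>
<!DOCTYPE NgamsStatus SYSTEM "http://example.invalid/ngamsStatus.dtd">
<NgamsStatus>
  <Status HostId="mwacache10:7777" Message="bbcp: Invalid checksum type - c32z" State="ONLINE" Status="FAILURE" SubState="IDLE"/>
</NgamsStatus>"""

status, message = parse_status(SAMPLE)
print(status, "-", message)
```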

And I get the same error about crc32z when using CRCVariant=2 as well.

So, I ran the bbcp command directly on the host and can confirm that c32c and c32z do not work with bbcp. Also I think that the NGAS server is choosing c32z when CRCVariant=0 OR CRCVariant=2 but the docs say 0 should be c32 and 2 should be c32z. Manually, I can get bbcp to work with c32.
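
The mapping I expected can be written down as a small table. This is a sketch of my reading only: the entries for 0 and 2 follow what the docs say, the entry for 1 follows the command line NGAS generated above, and `bbcp_checksum_option` is a hypothetical helper rather than the actual NGAS code.

```python
# CRCVariant -> bbcp checksum name, as read from the docs and the logs above
BBCP_CHECKSUMS = {
    0: "c32",
    1: "c32c",
    2: "c32z",
}


def bbcp_checksum_option(crc_variant, output="/dev/stdout"):
    """Return the -E argument pair for a given CRCVariant (hypothetical)."""
    try:
        name = BBCP_CHECKSUMS[crc_variant]
    except KeyError:
        raise ValueError("unknown CRCVariant: {!r}".format(crc_variant))
    return ["-E", "{}={}".format(name, output)]


print(bbcp_checksum_option(1))
```

With a table like this, CRCVariant=0 and CRCVariant=2 would produce different `-E` values, which is not what I see in the generated command lines.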

@rtobar
Contributor

rtobar commented Jan 21, 2020

Thanks for the tests @gsleap! Would you mind checking which version of bbcp you have installed? I think you'll need at least 17.01, which contains a few fixes I contributed back regarding incorrect checksums reported by bbcp, and adds support for crc32c, so it looks like that's the issue.

Apparently there are some binaries uploaded in https://www.slac.stanford.edu/~abh/bbcp/bin/, but those seem pretty old. The best is that you compile your own binary and put it in your PATH. You can get the code via git clone https://www.slac.stanford.edu/~abh/bbcp/bbcp.git (mind the s in https, the instructions in the original webpage are still outdated).

@gsleap
Collaborator Author

gsleap commented Jan 21, 2020

Hi Rod,

Thanks for the correct link- I had the link to v15 still!

Now that I have that working I'm hitting another issue, but this seems to be a bbcp issue in what it does to check filesystem space free:

<?xml version="1.0" ?>
<!DOCTYPE NgamsStatus SYSTEM "http://mwacache10.mwa128t.org:7777/RETRIEVE?internal=ngamsStatus.dtd">
<NgamsStatus>
  <Status Date="2020-01-21T06:31:19.087" HostId="mwacache10:7777" Message="bbcp returncode: 28. Command line: ['bbcp -f -V -S &quot;ssh -x -a -oBatchMode=yes -oGSSAPIAuthentication=no -oFallBackToRsh=no %4 %I -l %U %H bbcp&quot; -e -E c32z=/dev/stdout -s 12 -P 2 -z [email protected]:/data/20191210/rawdump_1260043216.raw /home/mwa/NGAS/volume2/staging/NGAMS_TMP_FILE___e51wp5vlrawdump_1260043216.raw.fits'], out: b'', err: b'bbcp: Host [::ffff:192.168.120.204] redirect connection to mwax04.mwa128t.org 10.128.9.4\nTarget mwacache10.mwa128t.org using initial recv window of 374400\nSource mwax04.mwa128t.org using initial send window of 87380\nbbcp: Insufficient space to copy all the files from mwax04.100g.mwa128t.org\nTarget mwacache10.mwa128t.org using a final recv window of 374400\nSource mwax04.mwa128t.org using a final send window of 87380\n'" State="ONLINE" Status="FAILURE" SubState="IDLE" Version="11.0/2018-10-26T07:00:00"/>

Basically I have my data volume, which definitely has sufficient space in /volume1; however, in the NGAS config it is specified using the symlink I created in /home/mwa/NGAS/volume1 -> /volume1. I think bbcp is checking the /home filesystem, and it definitely does not have enough free space.
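
A quick way to confirm which filesystem actually backs a symlinked volume is to resolve the link and compare devices. This is a minimal sketch with standard-library calls only, assuming a POSIX host; `backing_filesystem` is a hypothetical helper and says nothing about how bbcp performs its own check.

```python
import os
import shutil


def backing_filesystem(path):
    """Return (resolved_path, free_bytes, same_device) for a volume path.

    same_device is False when the path is a symlink that lives on a
    different filesystem from the directory it points to.
    """
    resolved = os.path.realpath(path)          # follow all symlinks
    usage = shutil.disk_usage(resolved)        # space on the real target
    link_dev = os.lstat(path).st_dev           # device holding the link itself
    target_dev = os.stat(resolved).st_dev      # device holding the target
    return resolved, usage.free, link_dev == target_dev


print(backing_filesystem("/"))
```

Running it on /home/mwa/NGAS/volume1 should report /volume1 as the resolved path, with `same_device` False if the link and the data volume are on different filesystems.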

So instead of changing bbcp's behaviour, I stopped NGAS and edited my config file so it referred to the absolute path - i.e. /volume1. Like so:

<StorageSets>
        <StorageSet DiskLabel="BULK" MainDiskSlotId="/volume1" Mutex="0"
                    StorageSetId="volume1" Synchronize="1"/>
        <StorageSet DiskLabel="BULK" MainDiskSlotId="/volume2" Mutex="0"
                    StorageSetId="volume2" Synchronize="1"/>
        <StorageSet DiskLabel="BULK" MainDiskSlotId="/volume3" Mutex="0"
                    StorageSetId="volume3" Synchronize="1"/>
        <StorageSet DiskLabel="BULK" MainDiskSlotId="/volume4" Mutex="0"
                    StorageSetId="volume4" Synchronize="1"/>
</StorageSets>

Here is an example of a stream I'm using (NOTE: it refers to the StorageSetId of "volume1", which did not change. I just changed the MainDiskSlotId):

<Streams>
                <Stream MimeType="application/x-mwa-fits" PlugIn="ngamsGenDapi" PlugInPars="">
                        <StorageSetRef StorageSetId="volume1"/>
                        <StorageSetRef StorageSetId="volume2"/>
                        <StorageSetRef StorageSetId="volume3"/>
                        <StorageSetRef StorageSetId="volume4"/>
                </Stream>

But now, when I retry the bbcp I get an error:

<?xml version="1.0" ?>
<!DOCTYPE NgamsStatus SYSTEM "http://mwacache10.mwa128t.org:7777/RETRIEVE?internal=ngamsStatus.dtd">
<NgamsStatus>
  <Status Date="2020-01-21T07:10:47.804" HostId="mwacache10:7777" Message="NGAMS_ER_NO_STORAGE_SET:1000:ERROR: No Storage Set matching the Slot ID: volume1. Check NG/AMS Configuration: /home/mwa/NGAS/cfg/ngamsServer.conf." State="ONLINE" Status="FAILURE" SubState="IDLE" Version="11.0/2018-10-26T07:00:00"/>

It's like it is looking for "/volume1" instead of "volume1" - i.e. it's looking up via the MainDiskSlotID instead of the StorageSetId. Or am I missing something?

@rtobar
Contributor

rtobar commented Jan 22, 2020

@gsleap I guess you are using a test environment? If so, would you be comfortable modifying the underlying database if required? In particular I would have a look at the contents of the ngas_disks table, basically making sure that both sources of information (the configuration file and the table on the DB) have the same information. I am a bit unsure as to what happens when one changes the configuration and not the database, or vice-versa, so it would be easier to synchronise the two.

Regarding bbcp, you are probably right, it's just not following symbolic links (which it should, or at least should have an option to). I'll have a look and see how difficult/easy that is to implement, and contact the authors. I've done it in the past with good results, so hopefully it won't be too much of an issue this time.

@gsleap
Collaborator Author

gsleap commented Jan 22, 2020

Yeah totally test environment- in my db, the ngas_disks rows for that NGAS host all have mounted = 0, host_id = null, slot_id = null. The log shows the following on startup:

2020-01-22T01:47:36.908 [ 7093] [MainThread] [  INFO] ngamsLib.ngamsDiskUtils#getNgasDiskInfoFile:562 Found Disk Info File for disk in slot: volume1
2020-01-22T01:47:36.910 [ 7093] [MainThread] [  INFO] ngamsLib.ngamsDiskUtils#getNgasDiskInfoFile:562 Found Disk Info File for disk in slot: volume2
2020-01-22T01:47:36.912 [ 7093] [MainThread] [  INFO] ngamsLib.ngamsDiskUtils#getNgasDiskInfoFile:562 Found Disk Info File for disk in slot: volume3
2020-01-22T01:47:36.914 [ 7093] [MainThread] [  INFO] ngamsLib.ngamsDiskUtils#getNgasDiskInfoFile:562 Found Disk Info File for disk in slot: volume4
2020-01-22T01:47:36.914 [ 7093] [MainThread] [  INFO] ngamsLib.ngamsDiskUtils#checkDisks:204 Archiving System: Checking that all disks defined in the configuration and which are mounted and not completed, can be accessed. Check that installed disks, have entries in the NGAS DB ...
2020-01-22T01:47:36.915 [ 7093] [MainThread] [  INFO] ngamsLib.ngamsDiskUtils#checkDisks:324 Check if each disk has an associated disk if the configuration specifies this ...
2020-01-22T01:47:36.915 [ 7093] [MainThread] [  INFO] ngamsLib.ngamsDiskUtils#checkDisks:390 Archiving System: Check that there is at least one Storage Set available for each Stream defined ...
2020-01-22T01:47:36.915 [ 7093] [MainThread] [  INFO] ngamsLib.ngamsDiskUtils#checkDisks:394 Checking for target disks availability for mime-type: application/x-mwa-fits
2020-01-22T01:47:36.916 [ 7093] [MainThread] [WARNIN] ngamsLib.ngamsDiskUtils#getDiskInfoObjsFromMimeType:820 NGAMS_AL_NO_STO_SETS:3014:ALERT: No Storage Sets found for mime-type: application/x-mwa-fits
2020-01-22T01:47:36.916 [ 7093] [MainThread] [WARNIN] ngamsLib.ngamsDiskUtils#checkStorageSetAvailability:461 Error encountered checking for storage set availability: NGAMS_AL_NO_STO_SETS:3014:ALERT: No Storage Sets found for mime-type: application/x-mwa-fits
2020-01-22T01:47:36.916 [ 7093] [MainThread] [  INFO] ngamsLib.ngamsDiskUtils#checkDisks:394 Checking for target disks availability for mime-type: ngas/nglog
2020-01-22T01:47:36.917 [ 7093] [MainThread] [WARNIN] ngamsLib.ngamsDiskUtils#getDiskInfoObjsFromMimeType:820 NGAMS_AL_NO_STO_SETS:3014:ALERT: No Storage Sets found for mime-type: ngas/nglog
2020-01-22T01:47:36.917 [ 7093] [MainThread] [WARNIN] ngamsLib.ngamsDiskUtils#checkStorageSetAvailability:461 Error encountered checking for storage set availability: NGAMS_AL_NO_STO_SETS:3014:ALERT: No Storage Sets found for mime-type: ngas/nglog
2020-01-22T01:47:36.917 [ 7093] [MainThread] [  INFO] ngamsLib.ngamsDiskUtils#checkDisks:394 Checking for target disks availability for mime-type: application/x-mwa-subfile
2020-01-22T01:47:36.918 [ 7093] [MainThread] [WARNIN] ngamsLib.ngamsDiskUtils#getDiskInfoObjsFromMimeType:820 NGAMS_AL_NO_STO_SETS:3014:ALERT: No Storage Sets found for mime-type: application/x-mwa-subfile
2020-01-22T01:47:36.918 [ 7093] [MainThread] [WARNIN] ngamsLib.ngamsDiskUtils#checkStorageSetAvailability:461 Error encountered checking for storage set availability: NGAMS_AL_NO_STO_SETS:3014:ALERT: No Storage Sets found for mime-type: application/x-mwa-subfile
2020-01-22T01:47:36.918 [ 7093] [MainThread] [WARNIN] ngamsLib.ngamsDiskUtils#checkDisks:405 NGAMS_WA_NO_TARG_DISKS:3005:WARNING: No target disks found for the Stream(s) with mime-type(s):  application/x-mwa-fits ngas/nglog application/x-mwa-subfile.

So it looks to me like NGAS is seeing the StorageSetId in the Streams section and is looking up "MainDiskSlotId" in the StorageSets section, instead of looking for the same StorageSetId.

One other work around would be to pass the "-F" parameter to bbcp to get it to ignore the free space check altogether?

@rtobar
Contributor

rtobar commented Jan 22, 2020

Thanks for the logs, that's really useful. If you can try adding the "-F" flag to the bbcp command locally (under ngas/src/ngamsServer/commands/bbcparc.py:87) that'd be awesome, but in general it'd probably be better to have bbcp do the check -- NGAS does it when not using bbcp.

I'll try to figure out what's going on with the volume configuration, but that's allegedly a different problem. If the -F workaround works then I'd rather move this other problem into a separate issue and track it separately.

@rtobar
Contributor

rtobar commented Jan 22, 2020

Another idea to try to workaround the volume-related error: try setting the Server.VolumeDirectory config item to /, and then the individual MainDiskSlotId values to being relative instead of absolute.

@rtobar
Contributor

rtobar commented Jan 22, 2020

I think that's it regarding volume directories (i.e., "slot IDs"): they apparently only work correctly if they are relative to the Server.VolumeDirectory, and not necessarily well when given an absolute value. The reason being that at startup time the server always scans Server.VolumeDirectory to look for volumes there and builds its internal list of volumes, and it is then assumed that the volumes defined in the StorageSets configuration are part of those found by the server automatically (or so it appears to be).

My recommendation would then be to try the workaround I suggested above. This is obviously a bit brutal, but should do -- otherwise mount your filesystems somewhere "safer" under /mnt/volume<X> or similar, and then point Server.VolumeDirectory to /mnt.
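
To illustrate why an absolute MainDiskSlotId might not match, here is a rough sketch of the behaviour as I understand it: the server lists the directories under Server.VolumeDirectory and treats their names as slot IDs, so a configured value of "/volume1" never equals the discovered "volume1". This is not the actual NGAS code; `discover_slots` and `unmatched_slots` are hypothetical names.

```python
import os


def discover_slots(volume_dir):
    """List directory names under volume_dir (symlinked directories count)."""
    return sorted(
        entry for entry in os.listdir(volume_dir)
        if os.path.isdir(os.path.join(volume_dir, entry))
    )


def unmatched_slots(volume_dir, configured_slot_ids):
    """Return configured slot IDs that were not discovered in volume_dir."""
    found = set(discover_slots(volume_dir))
    return [slot for slot in configured_slot_ids if slot not in found]
```

Under that assumption, `unmatched_slots("/home/mwa/NGAS", ["/volume1"])` would report "/volume1" as unmatched even though /home/mwa/NGAS/volume1 exists, whereas the relative "volume1" would match.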

Another thing: I reproduced the "Insufficient space to copy all the files" error locally, but when running bbcp with the debugging flag (-D) a completely different error came up, which makes me think the underlying problem is completely different. Would you mind trying with that flag turned on too, and attaching the output in that case? Maybe you can try running bbcp manually instead of through NGAS to simplify things.

@rtobar
Contributor

rtobar commented Jan 22, 2020

OK, I found the bug in bbcp, and I think I fixed it. Could you give this a try? https://github.com/ICRAR/bbcp

@gsleap
Collaborator Author

gsleap commented Jan 23, 2020

Hi Rod,

Thanks for all of your help on this and I agree- I think we are running up against issues which are maybe not exactly to do with this initial issue.

I put NGAS back together as standard using symlinks and opted to try the updated ICRAR bbcp- this time I get a new error- possibly related to your new change:

<?xml version="1.0" ?>
<!DOCTYPE NgamsStatus SYSTEM "http://mwacache10.mwa128t.org:7777/RETRIEVE?internal=ngamsStatus.dtd">
<NgamsStatus>
  <Status Date="2020-01-23T02:29:38.312" HostId="mwacache10:7777" Message="bbcp returncode: 2. Command line: ['bbcp -f -V -S &quot;ssh -x -a -oBatchMode=yes -oGSSAPIAuthentication=no -oFallBackToRsh=no %4 %I -l %U %H bbcp&quot; -e -E c32c=/dev/stdout -s 12 -P 2 -z [email protected]:/data/20191210/rawdump_1260043216.raw /home/mwa/NGAS/volume2/staging/NGAMS_TMP_FILE___4qqtjy2jrawdump_1260043216.raw.fits'], out: b'', err: b'bbcp: Host [::ffff:192.168.120.204] redirect connection to mwax04.mwa128t.org 10.128.9.4\nTarget mwacache10.mwa128t.org using initial recv window of 374400\nbbcp: Target directory /home/mwa/NGAS/volume2/staging/NGAMS_TMP_FILE___4qqtjy2jrawdump_1260043216.raw.fits is not in a known file system\nTarget mwacache10.mwa128t.org using a final recv window of 374400\nSource mwax04.mwa128t.org using initial send window of 87380\nSource mwax04.mwa128t.org using a final send window of 87380\n'" State="ONLINE" Status="FAILURE" SubState="IDLE" Version="11.0/2018-10-26T07:00:00"/>

I also note that bbcp, because of the -z parameter, is annoyingly ignoring the IP addresses I've specified (which are 10G interfaces) and is instead doing a DNS lookup which resolves to a 1G interface (from the NGAS response: "redirect connection to mwax04.mwa128t.org 10.128.9.4"). Passing "-n" to bbcp would fix this, but I'm not sure how prescriptive you want NGAS to be when it calls bbcp? Maybe there should be an extra parameter in the BBCPARC command to pass optional flags to bbcp so the client can decide how they want bbcp to handle situations like this?

@rtobar
Contributor

rtobar commented Jan 24, 2020

Argh, yes, the bbcp fix wasn't perfect, it worked for me because of the way I was testing it, but now I get the same error. Just fixed it (hopefully), would you mind pulling the latest bbcp version and trying again?

Yes, passing -n would disable DNS lookups, but it doesn't change the fact that it will still be the server that connects to the client. Would you mind trying with the -z option? From bbcp's point of view this should make the source (i.e., the NGAS client) connect to the target (i.e., the NGAS server), which would help using the correct interface. I'm not sure this will necessarily work out of the box though, and it might be necessary to make the bbcp target file URL (as computed by NGAS) ip-aware, so that the client uses the same interface it used to contact the NGAS server in the first place. Have a try and let me know how it goes.

Regarding extra bbcp options, I had also thought about making these a bit more flexible. Some of the options should be set by the server though, while a few others can be given to clients, so we'll have to see how to mix both. In any case this will probably acquire less priority once all these problems are sorted.

And thanks for your patience!
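
One possible shape for mixing server-set and client-supplied bbcp options is an allow-list. This is a sketch only: the choice of which flags a client may pass (`-n` and `-F` here, both taken from this thread) is purely illustrative, and `merge_bbcp_options` is a hypothetical name, not something NGAS implements.

```python
# Flags a client would be allowed to add (illustrative choice, not a decision)
CLIENT_ALLOWED = frozenset({"-n", "-F"})


def merge_bbcp_options(server_options, client_options):
    """Append allowed client flags to the server's options, rejecting others."""
    rejected = [opt for opt in client_options if opt not in CLIENT_ALLOWED]
    if rejected:
        raise ValueError("options not allowed from clients: {}".format(rejected))
    merged = list(server_options)
    # Skip flags the server already sets so nothing is passed twice
    merged += [opt for opt in client_options if opt not in merged]
    return merged


print(merge_bbcp_options(["-f", "-V", "-e"], ["-n"]))
```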

@rtobar
Contributor

rtobar commented Jan 24, 2020

I actually just pushed the changes I think should be required to have the bbcp transfer use the same network path, as per what I mentioned above. Could you try them out? This is still in the bbcp_fixes branch.

@gsleap
Collaborator Author

gsleap commented Jan 28, 2020 via email

@rtobar
Contributor

rtobar commented Jan 28, 2020

Hi @gsleap! Thanks for the update. Did you try also pulling from the latest bbcp_fixes in the NGAS repo? I put a change there last week that might finally resolve the way the bbcp source is trying to contact the sink (i.e., it should use the same IP that the NGAS client used to contact the NGAS server). The giveaway is the bbcp command line: the final argument shouldn't be just a filename (e.g., /home/mwa/NGAS/volume2/staging/NGAMS_TMP_FILE___am7oimv6rawdump_1260043216.raw.fits) but something like 192.168.120.110:/home/mwa/NGAS/volume2/staging/NGAMS_TMP_FILE___am7oimv6rawdump_1260043216.raw.fits, which I was hoping would make the bbcp source connect through the correct network interface.

@gsleap
Collaborator Author

gsleap commented Jan 28, 2020 via email

@rtobar
Contributor

rtobar commented Jan 28, 2020

Mmmm, would you mind as a final test to try this out on the NGAS server machine?

bbcp -f -V -z -S "ssh -x -a -oBatchMode=yes -oGSSAPIAuthentication=no -oFallBackToRsh=no %4 %I -l %U %H bbcp" -e -E c32c=/dev/stdout -s 12 -P 2 -z [email protected]:/data/20191210/rawdump_1260043216.raw 192.168.120.110:/home/mwa/NGAS/volume2/staging/NGAMS_TMP_FILE___3airu2emrawdump_1260043216.raw.fits

The only difference with the latest bbcp command generated by NGAS is that the target filename doesn't have the :7777 (I didn't think of that). If that works then we would be quite close to the "final" result; otherwise I'll have to do some more experimentation on my side.
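
Dropping the port when building the target could look roughly like this. It is a sketch under the assumption that the server address is available as a `host:port` string; `bbcp_target` is a hypothetical helper and not the code in the branch.

```python
from urllib.parse import urlsplit


def bbcp_target(server_address, staging_path):
    """Build a host:/path target, dropping any :port from the address."""
    # Prefixing "//" makes urlsplit treat the string as a network location
    host = urlsplit("//" + server_address).hostname
    if not host:
        raise ValueError("no host in {!r}".format(server_address))
    return "{}:{}".format(host, staging_path)


print(bbcp_target("192.168.120.110:7777", "/home/mwa/NGAS/volume2/staging/tmp.fits"))
```

Note that `hostname` lowercases the host, which is harmless for IP addresses.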

@gsleap
Collaborator Author

gsleap commented Jan 30, 2020 via email

@rtobar
Contributor

rtobar commented Jan 30, 2020

@gsleap yes, it seems so. Sorry for all this trial and error, and all the effort you've been putting into this, but after reading more of the bbcp code, this seems to be a limitation of bbcp: the hostname on each node is the main piece of information used by the source to establish the data exchange connections between the bbcp instances. However, if I'm reading this correctly, it might be possible (but I'm not fully sure) to bypass this behavior if you are using the -n (no DNS) option, which has a few side effects, this included.

So once again I must ask more from you. Would you mind trying the last command line, but with the -n option?

@gsleap
Collaborator Author

gsleap commented Jan 30, 2020 via email

@rtobar
Contributor

rtobar commented Jan 31, 2020

it looks like although simple on the surface, using bbcp is much trickier than anyone realised!

Indeed, who would have guessed!

OK, another try, this time without -z, but with -n:

bbcp -f -V -n -S "ssh -x -a -oBatchMode=yes -oGSSAPIAuthentication=no -oFallBackToRsh=no %4 %I -l %U %H bbcp" -e -E c32c -s 12 -P 2 [email protected]:/data/20191210/rawdump_1260043216.raw 192.168.120.110:/home/mwa/NGAS/volume2/staging/NGAMS_TMP_FILE___3airu2emrawdump_1260043216.raw.fits

I think the -z shouldn't have been there in the first place, my bad; it causes the bbcp SINK (mwacache10) to connect to the bbcp SOURCE (mwax04), but we definitely want the opposite -- and using the same IP that the NGAS client in mwax04 used to connect to NGAS in mwacache10. Fingers crossed...

@gsleap
Collaborator Author

gsleap commented Feb 3, 2020 via email

@rtobar
Contributor

rtobar commented Feb 3, 2020

Yes, bbcp is still basing its connectivity on name resolutions, and doesn't seem to allow for specific interfaces to be used. Too bad :(.

The next step would be to delve a bit deeper into bbcp and try to add such support. That's a bit outside of the scope of this issue though, which has been satisfactorily resolved I'd say? If you're OK with it, I'll be closing this issue as the original problem is long gone, and will open a new one to investigate the bbcp implementation at some point and see how easy/hard it would be to add the missing functionality.

@gsleap
Collaborator Author

gsleap commented Feb 3, 2020

Agreed, this is a different issue. The one involving NGAS was that it was not allowing transfers outside of the localhost test scenario. This is now fixed. Cheers.

@rtobar
Contributor

rtobar commented Feb 3, 2020

Created #21 for keeping track of the network path issues with bbcp, and closing this one now.

@rtobar rtobar closed this as completed Feb 3, 2020
rtobar added a commit that referenced this issue Feb 3, 2020
@rtobar
Contributor

rtobar commented Feb 3, 2020

Just merged this code into the master branch. Just to note, I did not include the -z flag in the final version of the bbcp command line constructed by NGAS, as that was more a test than anything else.
