-
Notifications
You must be signed in to change notification settings - Fork 1.6k
PVF worker: restrict networking #7334
base: master
Are you sure you want to change the base?
Conversation
Co-authored-by: Andrei Sandu <[email protected]>
Co-authored-by: Andrei Sandu <[email protected]>
blacklisted_rules.insert(libc::SYS_socketpair, vec![]); | ||
blacklisted_rules.insert(libc::SYS_socket, vec![]); | ||
|
||
let filter = SeccompFilter::new( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Blacklisting is always brittle when it comes to security. Is whitelisting feasible? Maybe at least based on categories/wildcards?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not to my knowledge. It's what makes seccomp notoriously hard to use, and why landlock is generally better (or will be when it's fully implemented; it kind of does what you're asking for). But the benefit of this PR over https://github.com/paritytech/polkadot/issues/4718 is that this has a much smaller scope - it's only two syscalls. So, less chance of something going wrong until we have the full sandbox. And I proposed we do something like the following to make it even less brittle:
For example, we can set up a regular job that downloads the syscall table from the latest kernel version and makes sure no new syscalls containing "socket" were added. However, the risk of a new syscall appearing, and libc libraries using it, in the immediate-term is very low, so I think it shouldn't block this development.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for clarifying!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Blacklisting is indeed very problematic IMO. It's hard to keep up with the syscalls and the way they can be misused.
There may be newly discovered CVEs in currently unused syscalls, so limiting the blast radius of potentially harmless syscalls that are not neccessary is useful. Otherwise, an attacker could find a way of triggering an unneeded syscall from a linked library to cause havoc.
Another thing that comes to mind is io_uring. It exposes a way of bypassing system calls for creating sockets and performing network IO. This is another example of why blacklists are not a good idea.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see. I started working on a bigger PR with seccompiler that whitelists only the syscalls needed, but I thought this solution here would be quicker to land. If it's controversial I can just close this PR and focus on the other one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sounds good, glad to hear we're taking the whitelist approach. This one will be tricky as well, but better in the long run.
I'm not against an incremental approach either, any sandboxinig is better than none :)
#[cfg(target_os = "linux")] | ||
if let Err(err) = seccomp::try_restrict_thread() { | ||
return ( | ||
Response::InternalError(InternalValidationError::Seccomp(format!( | ||
"sandboxing the thread failed: {}", | ||
err.to_string() | ||
))), | ||
landlock_status, | ||
) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmmm... security-wise doing this here probably won't be very useful. (:
Since this only restricts a single thread any threads (including the main thread) which were spawned before this one will still be fully unrestricted, and since all of the threads share an address space if the attacker can execute arbitrary code they could just hijack another unsandboxed thread to do what they want. Slightly annoying but won't stop a motivated attacker.
So 1) this needs to be done on the main thread, and 2) it needs to be done before any other threads are spawned (ideally we'd add an assertion assert_eq!(std::fs::read_dir("/proc/self/task").unwrap().count(), 1);
before the sandboxing is enabled.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interesting. We can add exceptions for both landlock and seccomp (birdcage adds exceptions for local sockets). For FS, the main thread only needs the cached artifact directory, so we can add a landlock exception for it. But I didn't do this because I would prefer attackers not being able to read/modify other artifacts. We could spawn a new process for each job and fully restrict it, but that would be more overhead and complication.
I would want to make sure that is actually needed though. How is hijacking other threads done? Why do landlock and seccomp allow restricting per-thread, if it can be worked around?
Also, if this sandboxing can stop simple attacks it's still worth having until we have full sandboxing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would want to make sure that is actually needed though. How is hijacking other threads done?
There are many ways to do it. With no extra sandboxing the simplest one would be to just 1) stop the thread, 2) set its rip
to where you want the thread to go through ptrace
, 3) let it run again. But even if you'd sandbox all syscalls it's still possible to do it simply by just reading and writing memory (e.g. just overwrite the return address of a function call which the other thread is coincidentally executing; it takes a little bit of skill to pull it off, but it's possible)
Why do landlock and seccomp allow restricting per-thread, if it can be worked around?
Because if they'd work on a per-process basis then that could result in race conditions. (:
Any spawned threads inherit the original thread's sandboxing, so what everyone does is that they first set up the sandboxing, and then spawn their threads. (And then optionally apply extra blacklists on all threads so that no further threads can be spawned.)
Also, if this sandboxing can stop simple attacks it's still worth having until we have full sandboxing.
For stopping simple attacks I think wasmtime
's sandbox is enough of a barrier. If someone is skilled enough to find and exploit a remote code execution hole in wasmtime
then I'm pretty sure they won't be stopped by this. (:
Also, if this sandboxing can stop simple attacks it's still worth having until we have full sandboxing.
Well, I mean, it's always good to lock your front door, but if you'll leave the back door unlocked it's not going to stop anyone. (: (But regardless extra defense in depth is always good to have.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Full process isolation sounds like the safest approach to me. The point of threads is sharing an address space, in this case we really don't want that, so we should be using a process.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Full process isolation sounds like the safest approach to me. The point of threads is sharing an address space, in this case we really don't want that, so we should be using a process.
Indeed. The issue here is that this is already running in a separate process (the worker process), but not the whole worker process is sandboxed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe some restructuring is in order, where we have a clear (process) separation between what needs to be sandboxed and what is not. On the flip side the unsandboxed stuff might not need to be its own process?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@koute so you propose to restrict the entire worker process? Wouldn't it be more logical to do that after the worker binary separation, then? Seems like it would be more straightforward, as right now, we'd need to apply sandboxing in the polkadot-cli
root, and later we'd need to move it into some worker-cli
root anyway?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we want this to be secure then we must restrict the entire process. As to whether do this now or later, well, I guess it's up to you. But as-is right now the sandboxing effectively doesn't do anything security-wise, so we shouldn't say it does. (:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe some restructuring is in order, where we have a clear (process) separation between what needs to be sandboxed and what is not. On the flip side the unsandboxed stuff might not need to be its own process?
Yeah, and it might also be more secure to limit processes to running one job instead of multiple jobs in sequence. See e.g. https://github.com/paritytech/polkadot/issues/7232#issuecomment-1549706304. Unfortunately it would be added overhead for each job to spawn and tear down a new process. We would also need to whitelist the artifacts directory from sandboxing.
I would wait on a response to landlock-lsm/rust-landlock#37. Maybe this is something that was already considered during development of landlock as I have not seen any caveats about this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Update on this: @koute was right (of course), I've raised https://github.com/paritytech/polkadot/issues/7497 and started working on it.
blacklisted_rules.insert(libc::SYS_socketpair, vec![]); | ||
blacklisted_rules.insert(libc::SYS_socket, vec![]); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm... there could still be a way around this if there's an existing socket already created; the attacker could then just probably reconfigure/repurpose it. (I've never tried that but I suppose it should work?) Perhaps we could blacklist all socket-related syscalls? setsockopt
, getsockopt
, connect
, accept
, listen
, bind
, sendto
, recvfrom
, sendmsg
, recvmsg
, sendmmsg
, recvmmsg
, and I also think there were some compat_*
variant of some of these.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The worker thread communicates with the host through a unix socket, I think restricting send
/recv
would break that communication?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If it uses those syscalls then yes; I don't remember if it does, or if it just uses normal read
/write
.
(Doing the communication through a shared memory map + futex instead of a socket would probably allow us to completely sandbox that, but it's probably not worth it.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The thread shouldn't have access to the existing sockets unless it can go outside its address space as you mention in the other comment, right? The idea with only blocking two syscalls is to keep the scope low and minimize the chance of breakage, while we work on the main seccomp PR (right now still waiting on binary separation).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How is socket-repurposing done? Should be impossible to use them for networking if we block sockets with an exception for local sockets in the main thread (like birdcage does). Then we don't need to let through all the other syscalls.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you want to be conservative and minimize the change for breakage you can probably leave the send
/recv
/etc., accept
and getsockopt
syscalls unsandboxed. AFAIK blacklisting the rest shouldn't break anything as they should only be used with new sockets.
The thread shouldn't have access to the existing sockets unless it can go outside its address space as you mention in the other comment, right?
Sockets don't reside in the process' memory per-se; they are kernel space objects to which the process holds a handle in the form of a file descriptor (which is just a normal number). The only thing necessary to access a given socket is that you know its socket number, which you can trivially get by e.g. iterating over /proc/self/fd
(since filesystem access is not sandboxed) or you could even just loop through numbers from 0 to N and see if any of them are sockets (since file descriptors are not randomized).
Nevertheless, all threads have full access to all other threads' address space; or in other words, there's only a single address space - the address space of the process, which all of the threads share. That's the nature of how threads work. (:
/// # Action on Caught Syscall | ||
/// | ||
/// TODO: Update with the action and explain why given action was chosen. | ||
#[cfg(target_os = "linux")] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm.... we probably also want to gate this on architecture (target_arch = "x86_64"
).
Technically seccompiler should also support aarch64, but without us auditing aarch64's list of syscalls (which do differ between architectures) and testing that it actually works there's a chance that it'd be broken/incomplete/insecure.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
seccompiler supports aarch64 and I didn't think there'd be any complication with the two socket creation syscalls. I considered gating on these two architectures, but seccompiler already does it so it would already be a compiler error on an unsupported arch.
What do you think about my idea of having a job that downloads the syscall table for these two architectures, from the latest kernel version, and checks for unexpected changes? I think we would need eventually it but it should not be a blocker for this PR as the scope is so small.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
seccompiler supports aarch64 and I didn't think there'd be any complication with the two socket creation syscalls.
Yeah, probably, but we're going to extend this, right? I'd rather we either go all the way and properly support it, or not support it at all.
but seccompiler already does it so it would already be a compiler error on an unsupported arch.
It'd be nice to handle this gracefully on unsupported architectures (essentially treat it as being compiled on an unsupported OS) instead of failing to compile at all.
What do you think about my idea of having a job that downloads the syscall table for these two architectures, from the latest kernel version, and checks for unexpected changes?
You mean extract it from the Linux sources? I'm not entirely sure that it can be done with 100% reliability, but the scripts that are floating around which do that do seem like they can grab all of them from what I can see (at least for amd64). I don't think it's worth having a job like this to run for every PR, but a nightly job which runs once each day and pings us if there are any changes - sure, why not.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do you think about my idea of having a job that downloads the syscall table for these two architectures, from the latest kernel version, and checks for unexpected changes? I think we would need eventually it but it should not be a blocker for this PR as the scope is so small.
This is doable. But what is the purpose?
there's a script in seccompiler for this: https://github.com/rust-vmm/seccompiler/blob/main/tools/generate_syscall_tables.sh, and I'm sure there are others.
Would this be used to manually check the newly added syscalls, to see if there's a newly introduced network syscall?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would this be used to manually check the newly added syscalls, to see if there's a newly introduced network syscall?
Yeah, and my idea was to review new syscalls and determine if they should be blacklisted (I figure 99% of the time no action would be required)
|
||
pub type Result<T> = std::result::Result<T, Error>; | ||
|
||
// TODO: Compile the filter at build-time rather than runtime. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep. The simplest way to do it would probably be to generate the BpfProgram
in build.rs
, cast it to a &[u8]
with slice::from_raw_parts
, write it to file, and then include_bytes!
it here. You could then slice::from_raw_parts
it again into &'a [sock_filter]
and pass it to seccompiler::apply_filter
. (sock_filter
is #[repr(C)]
so this is safe to do)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, that's close to the instructions here: https://github.com/rust-vmm/seccompiler#compiling-filters.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For such a small filter, I don't think it's worth it.
Build-time compilation is especially helpful for large, JSON-encoded filters.
// Restricted thread cannot open sockets. | ||
let handle = thread::spawn(|| { | ||
// TODO:Open a socket, this should succeed before seccomp is applied. | ||
TcpListener::bind("127.0.0.1:7070").unwrap(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TcpListener::bind("127.0.0.1:7070").unwrap(); | |
TcpListener::bind("127.0.0.1:0").unwrap(); |
Using 0
as a port will automatically pick a free one; this way this test won't randomly fail if the port's already used by something.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch, I totally forgot about that.
|
||
// Try to open a socket after seccomp. | ||
assert!(matches!( | ||
TcpListener::bind("127.0.0.1:8080"), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TcpListener::bind("127.0.0.1:8080"), | |
TcpListener::bind("127.0.0.1:0"), |
} | ||
} | ||
|
||
// TODO: Add a check for whether seccomp is supported and warn if not, like we do for landlock. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the kernel does not support seccomp, it should return an error from the syscalls.
calling seccomp(GET_ACTIONS_AVAIL)
should be easy. you could directly call the syscall using libc::syscall
and a bit of unsafe code.
However, I don't think we need this, since the actions we set here (ALLOW/KILL_PROCESS/ERRNO) have been in the kernel for a very long time (ar least kernel 4.9, since we were testing Firecracker on those kernels).
If you'd really like to be safe, I can work on adding this helper function in seccompiler
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🤯 You're the seccompiler dev! Cool!
calling seccomp(GET_ACTIONS_AVAIL) should be easy. you could directly call the syscall using libc::syscall and a bit of unsafe code.
I didn't realize it was that easy. I left this TODO because the seccompiler docs recommend it, but don't provide a way of directly doing so. I left this issue about it. ;) Sounds like this check may be unnecessary, we do require a somewhat recent kernel, but wouldn't hurt either.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🤯 You're the seccompiler dev! Cool!
😄 yep
Yep, I think it's unneccessary for this use case. It may be needed for more advanced use cases that use more exotic return actions, like SECCOMP_RET_USER_NOTIF
(introduced in kernel 5.0), that's why I added the recommendation in the docs.
If you wanna make a contribution to seccompiler, I can help you out with reviews/guidance
Overview
For security we restrict networking by blocking the creation of sockets with
seccomp
. See explanation on the related issue.Still has some TODOs that should be resolved, but it already works.
Related
Target branch: #7303 (builds on top of that PR)
Closes paritytech/polkadot-sdk#619