Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding memory safety related checks #3736

Open
balteravishay opened this issue Dec 14, 2023 · 30 comments
Open

Adding memory safety related checks #3736

balteravishay opened this issue Dec 14, 2023 · 30 comments
Assignees
Labels
kind/enhancement New feature or request kind/new-check New check for scorecard Stale

Comments

@balteravishay
Copy link
Contributor

balteravishay commented Dec 14, 2023

Is your feature request related to a problem? Please describe.
Memory safety comes up quite frequently these days in regards to developing secure and safe software. Yet there are hardly any automation tools that try to validate the "memory safety" of a software project. below are some ideas around that notion that were discussed in the recent OpenSSF Memory Safety SIG meetings.

Describe the solution you'd like
Adding platform/language specific checks for memory safety to scorecard, bundled under "Memory Safety" category.

This can be built using existing OpenSSF knowledge and content such as the C/C++ binary hardening guide from the best practices WG, the memory safety SIG's best practice for memory safe by default languages. It and can include checks such as mentioned in issue #200, or the use of automation tools that detect memory safety hardening of the application. It may look at python ctypes dependencies, or at Java UseBoundChecks or older version's Djava.security.manager=unrestricted.
These are just preliminary ideas to raise the discussion on memory safety related checks in Scorecard.

Describe alternatives you've considered
There are no alternatives at the moment for checking or scoring repositories based on memory safety practices.

@balteravishay balteravishay added the kind/enhancement New feature or request label Dec 14, 2023
@ccpalmer
Copy link

Are there examples of automation capable of making such an assessment?

@pnacht
Copy link
Contributor

pnacht commented Dec 15, 2023

Can't SAST tools detect many common memory safety issues? Not all of them, of course (i.e. global variables used everywhere, concurrency, etc), but a fair share of common "classes" of mistakes.

@ccpalmer
Copy link

The OWASP SAST page says that finding buffer overflows is a strength of SAST tools, but it also says "High numbers of false positives." and "Difficult to ‘prove’ that an identified security issue is an actual vulnerability". Appears to be a case of "it's good at some things, but so bad at others it may prove less useful."

@pnacht
Copy link
Contributor

pnacht commented Dec 15, 2023

Oh yeah, SAST is unfortunately riddled with false positives. But to me, that seems like something that should be fixed in the SAST tools, not necessarily something that Scorecard should implement itself.

@evverx
Copy link
Contributor

evverx commented Dec 19, 2023

This can be built using existing OpenSSF knowledge and content such as the C/C++ binary hardening guide from the best practices WG

While that guide is helpful I don't think it's generally applicable to upstream projects scorecard is supposed to assess because usually upstream projects don't compile their own compilers and all their defaults are overridden downstream with custom CFLAGS, CXXFLAGS and so on anyway. If production binaries aren't hardened it should be reported to vendors (or whoever produces and ships those binaries).

(if upstream projects aren't compatible with certain flags it should be reported upstream of course but it goes far beyond what socrecard can do).

checks such as mentioned in issue #200

As far as I can tell the idea behind that check is to figure out whether projects (and their dependencies) are written in Rust (or any other memory-safe languages) and how much code is unsafe. In other words a hypothetical DNS server written in Rust would get more points than a hypothetical DNS server written in C. That sort of check makes sense upstream but it would have to rely on external tools like cargo-geiger. Then again cargo-geiger alone can't be used to scan projects written in both C and Rust.

the use of automation tools that detect memory safety hardening of the application

I'm not 100% what it means but since binskim was mentioned my guess would be that it isn't applicable to upstream projects either.

Can't SAST tools detect many common memory safety issues?

I think the idea behind the SAST, Fuzzing, CI and Binary-Artifacts checks was to figure out whether projects follow practices that are supposed to catch those kinds of issues (and prevent projects from shipping binaries at all). If for example projects written in memory-unsafe languages are continuously fuzzed it kind of implies that they are built and run under ASan/UBSan/MSan with all sorts of weird things passed to them. It doesn't guarantee that they have no bugs of course (nothing can guarantee that) but it at least shows that they care about corner cases (where all those buffer overflows, use-after-frees and so on with funny names come from).

All in all I think #200 kind of makes sense upstream. The other things probably make sense downstream.

@balteravishay
Copy link
Contributor Author

Thanks for the discussion!

Agree that some of the checks may be included in SAST. for instance the Java samples above would presumamly be caught with SAST tools. The question I guess is should scorecard make special case for memory safety checks that may or may not be included in specific SAST tools. for instance, is a project written in go uses unsafe package or leverages the go data race detector during CI, as suggested here. I'm not sure if a SAST tool that is ran on the code would catch that, or any such "build time" checks, that are mandated by scorecard in some cases.

regarding @evverx's comment about the binary hardening guide. scorecard today does have some checks on how binaries are built or shipped like pinned-dependency, packaging, signed-releases, etc. if we check for packaging or binaries with SC, why not include checking the flags that are used in those?

Lastly, memory safety is not only that we use rust. almost all languages, including rust, may tap into the "unsafe" zone. so the idea behind this thread is to see if we can somehow surface memory safety practices through scorecard.

@evverx
Copy link
Contributor

evverx commented Dec 19, 2023

scorecard today does have some checks on how binaries are built or shipped like pinned-dependency, packaging, signed-releases, etc.

Those checks aren't applicable to a lot of projects either. For example low-level projects responsible for the underlying infrastructure are never built upstream. They release tarballs consumed by distributions, vendors producing various devices and so on. Those upstream projects have no control over how they are packaged and built there.

(It makes sense to release signed tarballs of course but there are no binaries, dependencies or packages there. they are all downstream)

The problem is that one-size-fits-all approach with a lot of different ecosystems doesn't work well.

Lastly, memory safety is not only that we use rust. almost all languages, including rust, may tap into the "unsafe" zone

Agreed but I was talking about #200 specifically and it boils down to whether projects are written in memory-safe languages and how far they tap into the unsafe zone (that's what cargo-geiger does).

@zmanion
Copy link

zmanion commented Dec 19, 2023

I'll offer a very coarse (possiby too coarse) check, "How many LoC are written in memory-unsafe langauges?" Since this is so coarse, maybe treat it as a low risk level?

Data can be pulled from GitHub, via API, or using cloc or linguist (what GitHub uses).

  1. Decide what langauge count as "memory unsafe," e.g., (in cloc terms): "C" "C++" "C/C++ Header" "Assembly" "D" "Cython"
  2. Possibly discount "non-executable languages," e.g., markdown maybe shouldn't count against the total LoC since markdown can't create a vulnerability?
  3. unsafe_lang_use = unsafe_loc / ( total_loc - comments - blank lines - non-exec_lines )

IIUC GitHub/linguist measures file size not LoC. Not claiming one is better than the other, I've recently used cloc to count LoC.

cc @AevaOnline

@evverx
Copy link
Contributor

evverx commented Dec 20, 2023

"How many LoC are written in memory-unsafe languages?"

scorecard should recommend something actionable to maintainers based on that and I'm not sure what it can do without using ecosystem-specific tools (if projects are already written in memory-safe languages). If projects are written in memory-unsafe languages it would probably imply that they should be rewritten and that could be kind of controversial.

@pnacht
Copy link
Contributor

pnacht commented Dec 20, 2023

scorecard should recommend something actionable to maintainers [...] If projects are written in memory-unsafe languages it would probably imply that they should be rewritten and that could be kind of controversial.

Yeah, Scorecard must always be actionable:

- Any criteria in the scorecard must be actionable. It should be possible,
with help, for any project to "check all the boxes".

It's not explicit, but I also take that to imply "and feasible/reasonable". That is, the remediation can't be "rewrite your C project in Rust".

And if that's not acceptable, then the check has to be something that a C project can get a 10/10 in... which basically takes any sort of LoC-per-language metric out of contention.

So, as far as I can tell, we can basically just evaluate the use of tooling... but tooling that detects memory safety issues usually falls under the SAST category (or DAST, but that's less common).

(I considered metrics such as "use of raw pointers instead of smart pointers", but that just sounds like a SAST check)

@alpire
Copy link

alpire commented Feb 20, 2024

If projects are written in memory-unsafe languages it would probably imply that they should be rewritten and that could be kind of controversial.

That would be controversial, but perhaps warranted nonetheless?

There is now a lot of data that memory-unsafe languages lead to bad security outcomes. Memory unsafety commonly accounts for 70%+ of severe vulnerabilities in C and C++ codebases. I think scorecards would more accurately evaluate security risks by having a more opinionated stance against inherently insecure technologies. Right now, it doesn't seem quite right that, say, a C media parser would be scored as "safe" as an equivalent memory-safe parser (all else being equal, e.g. SAST/DAST usage).

If a stronger stance against insecure technologies is not an option, may I suggest:

  • Increasing the requirements on those insecure technologies. For instance, to achieve similar security outcomes to a Rust project with few unsafe blocks, it's likely a C or C++ project would need fuzzing with high coverage. So perhaps fuzzing and its coverage should be weighed more heavily for memory unsafe languages (even though fuzzing is not sufficient to make C or C++ sufficiently safe).
  • Code quality metrics indicating usage of secure-by-design practice. For instance, adherence to C++ Buffer hardening would prevent most spatial safety vulnerabilities.

@evverx
Copy link
Contributor

evverx commented Feb 20, 2024

perhaps warranted nonetheless?

Who is going to do that? Here's what the curl maintainer wrote recently for example: https://daniel.haxx.se/blog/2023/12/13/making-it-harder-to-do-wrong/

Why no rewrite
Because I’m not an expert on rust. Someone else would be a much more suitable person to lead such a rewrite. In fact, we could suspect that the entire curl maintainer team would need to be replaced since we are all old C developers maybe not the most suitable to lead and take care of a twin project written in rust. Dedicated long-term maintainer internet transfer library teams do not grow on trees.

I think scorecards would more accurately evaluate security risks

Can't argue with that. From a consumer point of view it's totally reasonable. The problem is that scorecard supposedly helps maintainers too and "just rewrite your thing in Rust" is unhelpful (and annoying) and that's what I was trying to say. I'm not arguing against memory safe languages here.

@evverx
Copy link
Contributor

evverx commented Feb 22, 2024

fuzzing with high coverage

I don't know if scorecard is planning to go down that route but just in case In its current form the tooling scorecard can use to measure that isn't good enough (the Fuzz-Introspector folks are aware of its limitations) but other than that high coverage in a vacuum isn't what matters in terms of accessing projects because it can't be used to figure out whether code paths triggered on receiving weird packets (or interacting with hostile environments in general) are actually covered and how high that coverage is. Other than that high coverage can annoy folks doing their best but having low coverage just because their projects do more than just parsing things: google/oss-fuzz#10924 (comment)

fuzzing and its coverage should be weighed more heavily

That's reasonable but it would be nice if the tooling actually made it easier to do that and was in line with how software gets developed in 2024: google/oss-fuzz#7190 and google/oss-fuzz#11398. Without that it's going to be unhelpful "guidance" and "best practices" all over again.

@jduck
Copy link

jduck commented Feb 22, 2024

Can't SAST tools detect many common memory safety issues? Not all of them, of course (i.e. global variables used everywhere, concurrency, etc), but a fair share of common "classes" of mistakes.

The tools out there are very noisy, often presenting a majority of false positives. Developers don't have the security expertise to proper interpret the results and thus sometimes mark true positives as false positives. There are a lot of potential issues that they cannot identify at all. It's not to say such tools cannot improve, but so far it has been quite hard to realize the benefits from them.

EDIT: I see this was already stated, oops.

@jduck
Copy link

jduck commented Feb 22, 2024

If I can add a point on legacy languages... We likely all agree that it's possible to write safe and secure code in languages like C and C++. The major difference is that the level of investment needed post-coding is much higher. Also, the guarantees you receive are much lower.

IMHO, it would be useful for the software community at large to be presented with a level-of-investment summary that can help score a project. For example, using a formal verification tool like Coq, CBMC, or Frama-C can provide a high level of confidence that the software has been evaluated properly. Fuzzing, as stated above, is another example of a best practice that gives confidence. Coverage-based testing metrics can also help. Even practices like hiring security specialists (and the specific individual matters too) to do a source code audit help.

As far as I know, no project out there tracks their investments in these spaces formally. I think that needs to change and we probably need some formalization around how projects put this information forward to the public. What is their typical process? What one-offs have they invested in? Where is the data? etc.

My belief/vision for memory safety is that it reduces the burden for these tasks and increases the guarantees that we get before any of these processes begin.

Anyway, just my $0.02

@evverx
Copy link
Contributor

evverx commented Feb 22, 2024

it would be useful for the software community at large to be presented with a level-of-investment summary that can help score a project. For example, using a formal verification tool like Coq, CBMC, or Frama-C can provide a high level of confidence that the software has been evaluated properly. Fuzzing, as stated above, is another example of a best practice that gives confidence. Coverage-based testing metrics can also help. Even practices like hiring security specialists (and the specific individual matters too) to do a source code audit help

I'm sorry but it seems to me the power of scorecard is kind of getting overestimated here. It can't even detect that Coverity Scan is used by projects even though scripts sending data are in their actions on GitHub. I'm curious how all that can help with the automated "memory safety" check proposed here?

@jduck
Copy link

jduck commented Feb 22, 2024

I'm sorry but it seems to me the power of scorecard is kind of getting overestimated here. It can't even detect that Coverity Scan is used by projects even though scripts sending data are in their actions on GitHub. I'm curious how all that can help with the automated "memory safety" check proposed here?

Please forgive me. I'm jumping in to all these OpenSSF initiatives somewhat blindly. My point is, if projects (like curl) don't wish to change languages, there are other ways to provide a high confidence signal that they invest in identifying and eliminating memory safety problems in their code.

In the context of this ticket, what @zmanion said makes a lot of sense. A project's commitment to memory safety and progress along their journey should be clear to potential users.

@evverx
Copy link
Contributor

evverx commented Feb 22, 2024

In the context of this ticket, what @zmanion said makes a lot of sense

In the context of scorecard It does. Non-shallow checks trying to cover even basic things like running test suites under sanitizers/valgrind are going to fall apart when they meet reality.

A project's commitment to memory safety and progress along their journey should be clear to potential users.

What would the role of scorecard be here in terms of helping maintainers apart from measuring and providing "guidance" with "best practices"?

@jduck
Copy link

jduck commented Feb 22, 2024

In the context of scorecard It does. Non-shallow checks trying to cover even basic things like running test suites under sanitizers/valgrind are going to fall apart when they meet reality.

Why is that? It should be fairly easy to define a standard method for what this looks like in a repository and provide evidence that the tests are being ran.

What would the role of scorecard be here in terms of helping maintainers apart from measuring and providing "guidance" with "best practices"?

It would help maintainers show potential users their security maturity. This is an important factor for whether or not they might decide to take a dependency on a particular project.

@jvoisin
Copy link

jvoisin commented Feb 23, 2024

using a formal verification tool like Coq, CBMC, or Frama-C can provide a high level of confidence that the software has been evaluated properly.

I'm afraid that this isn't really actionable for anything but small toy projects, as Frama-C and CBMC don't scale at all. Moreover, Coq is borderline unusable for regular developers, and makes little sense outside of small projects.

@evverx
Copy link
Contributor

evverx commented Feb 23, 2024

Why is that? It should be fairly easy to define a standard method for what this looks like in a repository and provide evidence that the tests are being ran

There is no standard way to do things (and that's why its SAST check can only detect obvious things like CodeQL). It can't always detect whether the CI is used either. Fuzzing is detected only because OSS-Fuzz provides its API. If some external infrastructure is used to get around the limitations of OSS-Fuzz it doesn't exist for scorecard anymore.

It would help maintainers show potential users their security maturity.

I don't know why any actual maintainer would want that. They usually have better things to do.

This is an important factor for whether or not they might decide to take a dependency on a particular project.

Those things are already probably in all sorts of devices.

@jduck
Copy link

jduck commented Feb 23, 2024

There is no standard way to do things (and that's why its SAST check can only detect obvious things like CodeQL). It can't always detect whether the CI is used either. Fuzzing is detected only because OSS-Fuzz provides its API. If some external infrastructure is used to get around the limitations of OSS-Fuzz it doesn't exist for scorecard anymore.

We are not going to be able to capture everything automatically, but that doesn't mean data isn't valuable. Where automation isn't possible, project maintainers can self-attest. IMHO self-attestation that is found to be falsified will lead to a loss of trust in the OSS project.

I don't know why any actual maintainer would want that. They usually have better things to do.

According to https://github.com/ossf/scorecard/blob/main/README.md, this is the entire premise of scorecard. If maintainers do not want to improve, they will show a poor score.

Those things are already probably in all sorts of devices.

Yes, and scorecard can help teams determine priorities with which to find other dependencies to replace poorly performing ones.

@evverx
Copy link
Contributor

evverx commented Feb 23, 2024

We are not going to be able to capture everything automatically, but that doesn't mean data isn't valuable

It's valuable but If it can't be captured automatically reliably it isn't useful in the context of scorecard.

project maintainers can self-attest

I'm sorry but (as a consumer) I don't trust those things.

this is the entire premise of scorecard

https://github.com/ossf/scorecard?tab=readme-ov-file#what-is-scorecard

We created Scorecard to help open source maintainers improve their security best practices and to help open source consumers judge whether their dependencies are safe.

https://github.com/ossf/scorecard/blob/main/checks/write.md#requirements-for-a-check

Any criteria in the scorecard must be actionable. It should be possible,
with help, for any project to "check all the boxes".

I wonder how exactly the "your project isn't written in memory-safe language" box can be checked and what sort of help can be expected?

@balteravishay
Copy link
Contributor Author

balteravishay commented Feb 23, 2024

There is no standard way to do things (and that's why its SAST check can only detect obvious things like CodeQL). It can't always detect whether the CI is used either.

That is great! because the positive inversion of that sometimes it can! Obviously Scorecard checks have limitations, things like dependency pinning would look at known places to run build/dependency retrieval scripts and scan those. But that does not mean we can't strive to make things better, if we can. We can and should start small with what we know, with what "most" project would do. Future contributions would add their more or less common ways to do things.

project maintainers can self-attest

I'm sorry but (as a consumer) I don't trust those things.

I feel this statement is controversial to some of the work that other OpenSSF initiatives are doing, and in fact to the entire way that the open source ecosystem works, specifically for instance Security Insights, but others as well. As @jduck said, self-attestation that is found to be falsified will lead to a loss of trust in the OSS project, such as the recent case of Moq for instance.

I don't know why any actual maintainer would want that. They usually have better things to do.

Because Scorecard is as important to consumers of OSS as it is to maintainers/producers of it. If it looks bad in any of the checks I, an OSS consumer, care about, I might opt for a different project/dependency to work with. In fact, wouldn't you be able to say that same line for every Scorecard check, or at Scorecard itself?

I wonder how exactly the "your project isn't written in memory-safe language" box can be checked and what sort of help can be expected?

At the Memory Safety SIG, we identify that 1. memory-safe-by-default languages can also be, sometimes, memory unsafe and that 2. Regardless of the language, memory safety is a continuum and hence can be iteratively improved and that 3. there is a gap today in the ability to understand "how memory safe is that". We are not a rust group, and our main intention with this issue is to bring forward the notion that memory safety in OSS can be measured, and made better. it won't be bullet proof (unless re-written in rust (haha, I just had to)) and you don't have to re-write it, but you can make it better.

@evverx
Copy link
Contributor

evverx commented Feb 23, 2024

I feel this statement is controversial to some of the work that other OpenSSF initiatives are doing

It is controversial and there is an issue here somewhere where I asked how those things are vetted and maintained. When I have to get past dead links from 2017 and try to figure out why criteria are met without any links to any evidence (and when I generally have the resources to double-check everything) I kind of question the usefulness of this to me personally as a consumer. They are great for other purposes though.

in fact to the entire way that the open source ecosystem works

I'm sorry but when I participate in discussions related to OpenSSF it feels like a parallel reality where open-source projects are called vendors, they somehow have the resources to invest in their journey toward memory-safety, add machine-readable specifications to help consumers to consume their thing and so on. I probably live in some other open source ecosystem.

Because Scorecard is as important to consumers of OSS as it is to maintainers/producers of it

It is and I believe the consumer part is already covered here.

In fact, wouldn't you be able to say that same line for every Scorecard check, or at Scorecard itself?

I wouldn't. The other boxes can actually be checked with help (and the GOSST folks actually help some projects). The "memory safety" check (in its current form) is on another level entirely.

our main intention with this issue is to bring forward the notion that memory safety in OSS can be measured, and made better

To me it seems that the ability of scorecard to measure things is overestimated and given its current limitations I'm not sure how it's even possible to introduce nuanced checks like that. If the memory safety sig can come up with something that can actually work in practice without just looking at languages it would be great.

Once again I'm not against the idea. My only concern was that scorecard could potentially start sending a very controversial message without any actual resources to back that stance up and that's it. To quote https://daniel.haxx.se/blog/2023/12/13/making-it-harder-to-do-wrong/ once again

The rewrite-it-in-rust mantra is mostly repeated by rust fans and people who think this is an easy answer to fixing the share of security problems that are due to C mistakes. Typically, the kind who has no desire or plans to participate in said venture.

@afmarcum afmarcum added the kind/new-check New check for scorecard label Mar 7, 2024
Copy link

This issue has been marked stale because it has been open for 60 days with no activity.

@github-actions github-actions bot added the Stale label May 15, 2024
@evverx
Copy link
Contributor

evverx commented Jul 2, 2024

I saw https://www.cisa.gov/resources-tools/resources/exploring-memory-safety-critical-open-source-projects the other day and it reminded me of this issue somehow.

As far as I understand (based on #3903 (comment)) scorecard is turning into a supply-chain linter with a bunch of probes that can be turned on/off depending on where and why it's run so at this point it seems it's fine to introduce hidden experimental probes for, say, research purposes that can be refined in the process without annoying maintainers too much (hopefully).

@lelia
Copy link
Contributor

lelia commented Jul 25, 2024

This issue was discussed in this week's community meeting, there is agreement that probes would be an appropriate place to address this, but concerns about achieving parity will require multiple subject matter experts across languages / ecosystems. Here are some resources that may be helpful:

@evverx
Copy link
Contributor

evverx commented Aug 24, 2024

concerns about achieving parity

Dunno. The scorecard checks/probes are kind of crude in the sense that for example dangerous-workflows where stuff can be injected aren't always detected and it's still necessary to catch things like stefanbuck/github-issue-parser#24 using other means. Another vector where tokens end up being exposed in workflow artifacts isn't recognized either and the list goes on. Those checks/probes are still useful even though they aren't perfect and what's cool about experimental probes is that they can be refined in the process without exposing them to the outside world too much.

Also I've been watching stuff related to OpenSSF since 2021 as far as I can remember and based on my observations things tend to be discussed and then archived without producing any actual results. Personally I think it would be useful if a couple of crude probes like "language" and "coverage" mentioned here were implemented so that it was possible to look for, say, projects written in C that aren't fuzzed (or fuzzed badly). The fuzzing thing can be improved too: #3475. With those probes it should already be possible to detect projects that needs looking into in terms of helping their maintainers to improve things (and also ask vendors why unfuzzed C projects receiving incoming packets are embedded in their appliances and why they don't do anything about it).

Then again it probably depends on what this check is supposed to accomplish. I hope scores and other misleading measurements leading to #4219 would go away eventually and would be replaced with something that can help with improving things.

Copy link

This issue has been marked stale because it has been open for 60 days with no activity.

@github-actions github-actions bot added the Stale label Nov 10, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/enhancement New feature or request kind/new-check New check for scorecard Stale
Projects
Status: No status
Development

No branches or pull requests