Merge branch 'datahub-project:master' into master

anshbansal authored Aug 26, 2024
2 parents d14f4c8 + beb4306 commit f91e219

Showing 84 changed files with 28,854 additions and 3,429 deletions.
21 changes: 4 additions & 17 deletions README.md
@@ -108,6 +108,10 @@ We welcome contributions from the community. Please refer to our [Contributing G

Join our [Slack workspace](https://datahubproject.io/slack?utm_source=github&utm_medium=readme&utm_campaign=github_readme) for discussions and important announcements. You can also find out more about our upcoming [town hall meetings](docs/townhalls.md) and view past recordings.

+## Security
+
+See [Security Stance](docs/SECURITY_STANCE.md) for information on DataHub's Security.
+
## Adoption

Here are the companies that have officially adopted DataHub. Please feel free to add yours to the list if we missed it.
@@ -175,23 +179,6 @@ Here are the companies that have officially adopted DataHub. Please feel free to

See the full list [here](docs/links.md).

-## Security Notes
-
-### Multi-Component
-
-The DataHub project uses a wide range of code which is responsible for build automation, documentation generation, and
-include both service (i.e. GMS) and client (i.e. ingestion) components. When evaluating security vulnerabilities in
-upstream dependencies, it is important to consider which component and how it is used in the project. For example, an
-upstream javascript library may include a Denial of Service (DoS) vulnerability however when used for generating
-documentation it does not affect the running of DataHub itself and cannot be used to impact DataHub's service. Similarly,
-python dependencies for ingestion are part of the DataHub client and are not exposed as a service.
-
-### Known False Positives
-
-DataHub's ingestion client does not include credentials in the code repository, python package, or Docker images.
-Upstream python dependencies may include files that look like credentials and are often misinterpreted as credentials
-by automated scanners.

## License

[Apache License 2.0](./LICENSE).
6 changes: 3 additions & 3 deletions build.gradle
@@ -34,7 +34,7 @@ buildscript {
// Releases: https://github.com/linkedin/rest.li/blob/master/CHANGELOG.md
ext.pegasusVersion = '29.57.0'
ext.mavenVersion = '3.6.3'
-ext.springVersion = '6.1.5'
+ext.springVersion = '6.1.6'
ext.springBootVersion = '3.2.6'
ext.springKafkaVersion = '3.1.6'
ext.openTelemetryVersion = '1.18.0'
@@ -49,7 +49,7 @@ buildscript {
ext.log4jVersion = '2.23.1'
ext.slf4jVersion = '1.7.36'
ext.logbackClassic = '1.4.14'
-ext.hadoop3Version = '3.3.5'
+ext.hadoop3Version = '3.3.6'
ext.kafkaVersion = '5.5.15'
ext.hazelcastVersion = '5.3.6'
ext.ebeanVersion = '12.16.1'
@@ -134,7 +134,7 @@ project.ext.externalDependency = [
'elasticSearchRest': 'org.opensearch.client:opensearch-rest-high-level-client:' + elasticsearchVersion,
'elasticSearchJava': 'org.opensearch.client:opensearch-java:2.6.0',
'findbugsAnnotations': 'com.google.code.findbugs:annotations:3.0.1',
-'graphqlJava': 'com.graphql-java:graphql-java:21.3',
+'graphqlJava': 'com.graphql-java:graphql-java:21.5',
'graphqlJavaScalars': 'com.graphql-java:graphql-java-extended-scalars:21.0',
'gson': 'com.google.code.gson:gson:2.8.9',
'guice': 'com.google.inject:guice:7.0.0',
3 changes: 3 additions & 0 deletions datahub-upgrade/build.gradle
@@ -49,6 +49,9 @@ dependencies {
implementation('io.airlift:aircompressor:0.27') {
because("CVE-2024-36114")
}
+implementation('dnsjava:dnsjava:3.6.1') {
+  because("CVE-2024-25638")
+}
}


8 changes: 4 additions & 4 deletions docker/datahub-frontend/env/docker.env
@@ -22,9 +22,9 @@ JAVA_OPTS=-Xms512m -Xmx512m -Dhttp.port=9002 -Dconfig.file=datahub-frontend/conf
# Uncomment & populate these configs to enable OIDC SSO in React application.
# Required OIDC configs
# AUTH_OIDC_ENABLED=true
-# AUTH_OIDC_CLIENT_ID=1030786188615-rr9ics9gl8n4acngj9opqbf2mruflqpr.apps.googleusercontent.com
-# AUTH_OIDC_CLIENT_SECRET=acEdaGcnfd7KxvsXRFDD7FNF
-# AUTH_OIDC_DISCOVERY_URI=https://accounts.google.com/.well-known/openid-configuration
+# AUTH_OIDC_CLIENT_ID=<client id>
+# AUTH_OIDC_CLIENT_SECRET=<client secret>
+# AUTH_OIDC_DISCOVERY_URI=https://<idp host>/.well-known/openid-configuration
# AUTH_OIDC_BASE_URL=http://localhost:9001
# Optional OIDC configs
# AUTH_OIDC_USER_NAME_CLAIM=email
@@ -68,4 +68,4 @@ ELASTIC_CLIENT_PORT=9200
# To use simple username/password authentication to Elasticsearch over HTTPS
# set ELASTIC_CLIENT_USE_SSL=true and uncomment:
# ELASTIC_CLIENT_USERNAME=
-# ELASTIC_CLIENT_PASSWORD=
\ No newline at end of file
+# ELASTIC_CLIENT_PASSWORD=
2 changes: 1 addition & 1 deletion docker/kafka-setup/Dockerfile
@@ -1,4 +1,4 @@
-ARG KAFKA_DOCKER_VERSION=7.4.4
+ARG KAFKA_DOCKER_VERSION=7.4.6

# Defining custom repo urls for use in enterprise environments. Re-used between stages below.
ARG ALPINE_REPO_URL=http://dl-cdn.alpinelinux.org/alpine
8 changes: 0 additions & 8 deletions docs-website/docusaurus.config.js
@@ -170,14 +170,6 @@ module.exports = {
value: '<div class="dropdown__link"><b>Archived versions</b></div>',
},
-{
-value: `
-<a class="dropdown__link" href="https://docs-website-qou70o69f-acryldata.vercel.app/docs/features">0.14.0
-<svg width="12" height="12" aria-hidden="true" viewBox="0 0 24 24"><path fill="currentColor" d="M21 13v10h-21v-19h12v2h-10v15h17v-8h2zm3-12h-10.988l4.035 4-6.977 7.07 2.828 2.828 6.977-7.07 4.125 4.172v-11z"></path></svg>
-</a>
-`,
-type: "html",
-},
{
value: `
<a class="dropdown__link" href="https://docs-website-lzxh86531-acryldata.vercel.app/docs/features">0.13.0
<svg width="12" height="12" aria-hidden="true" viewBox="0 0 24 24"><path fill="currentColor" d="M21 13v10h-21v-19h12v2h-10v15h17v-8h2zm3-12h-10.988l4.035 4-6.977 7.07 2.828 2.828 6.977-7.07 4.125 4.172v-11z"></path></svg>
21 changes: 16 additions & 5 deletions docs-website/download_historical_versions.py
@@ -3,6 +3,7 @@
import tarfile
import time
import urllib.request
+import shutil

repo_url = "https://api.github.com/repos/datahub-project/static-assets"

@@ -18,7 +19,7 @@ def download_file(url, destination):


def fetch_urls(
-    repo_url: str, folder_path: str, file_format: str, max_retries=3, retry_delay=5
+    repo_url: str, folder_path: str, file_format: str, active_versions: list, max_retries=3, retry_delay=5
):
    api_url = f"{repo_url}/contents/{folder_path}"
    for attempt in range(max_retries + 1):
@@ -30,7 +31,7 @@ def fetch_urls(
            urls = [
                file["download_url"]
                for file in json.loads(data)
-                if file["name"].endswith(file_format)
+                if file["name"].endswith(file_format) and any(version in file["name"] for version in active_versions)
            ]
            print(urls)
            return urls
@@ -48,12 +49,22 @@ def extract_tar_file(destination_path):
        tar.extractall()
    os.remove(destination_path)

+def get_active_versions():
+    # read versions.json
+    with open("versions.json") as f:
+        versions = json.load(f)
+    return versions
+
+def clear_directory(directory):
+    if os.path.exists(directory):
+        shutil.rmtree(directory)
+    os.makedirs(directory)

def download_versioned_docs(folder_path: str, destination_dir: str, file_format: str):
-    if not os.path.exists(destination_dir):
-        os.makedirs(destination_dir)
+    clear_directory(destination_dir)  # Clear the directory before downloading

-    urls = fetch_urls(repo_url, folder_path, file_format)
+    active_versions = get_active_versions()
+    urls = fetch_urls(repo_url, folder_path, file_format, active_versions)

    for url in urls:
        filename = os.path.basename(url)
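
Taken together, the new helpers switch the script from downloading every historical archive to mirroring only the versions still listed in versions.json. A minimal sketch of the added filter, using hypothetical archive names and versions.json contents (illustrative assumptions; the real script reads the file list from the GitHub contents API):

```python
import json

# Hypothetical versions.json payload: a flat list of active doc versions.
active_versions = json.loads('["0.14.0", "0.13.1"]')

# Hypothetical file names as the GitHub contents API might report them.
names = [
    "docs-0.14.0.tar.gz",
    "docs-0.13.1.tar.gz",
    "docs-0.12.1.tar.gz",  # absent from versions.json, so dropped
]

# The same substring test fetch_urls now applies to each file name.
kept = [
    name
    for name in names
    if name.endswith(".tar.gz")
    and any(version in name for version in active_versions)
]
print(kept)  # ['docs-0.14.0.tar.gz', 'docs-0.13.1.tar.gz']
```

Because the test is a plain substring match, a short entry such as "0.1" would also match "0.14.0"; exact version strings in versions.json keep the filter precise.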
1 change: 1 addition & 0 deletions docs-website/sidebars.js
@@ -948,6 +948,7 @@ module.exports = {
// - "metadata-service/services/README"
// "metadata-ingestion/examples/structured_properties/README"
// "smoke-test/tests/openapi/README"
// "docs/SECURITY_STANCE"
// ],
],
};
80 changes: 80 additions & 0 deletions docs/SECURITY_STANCE.md
@@ -0,0 +1,80 @@
# DataHub's Commitment to Security

## Introduction

The open-source DataHub project takes security seriously. As part of our commitment to maintaining a secure environment
for our users and contributors, we have established a comprehensive security policy. This document outlines the key
aspects of our approach to handling security vulnerabilities and keeping our community informed.

## Our Track Record

We have a proactive approach to security. To date, we've successfully resolved over 2,000 security-related issues
flagged by automated scanners and reported by community members, demonstrating our commitment to maintaining a secure
platform. This is a testament to the collaborative efforts of our community in identifying and helping us address
potential vulnerabilities. It truly takes a village.

## Reporting Security Issues

If you believe you've discovered a security vulnerability in DataHub, we encourage you to report it immediately. We have
a dedicated process for handling security-related issues to ensure they're addressed promptly and discreetly.

For detailed instructions on how to report a security vulnerability, including our PGP key for encrypted communications,
please visit our official security policy page:

[DataHub Security Policy](https://github.com/datahub-project/datahub/security/policy)

We kindly ask that you do not disclose the vulnerability publicly until the committers have had the chance to address it
and make an announcement.

## Our Response Process

Once a security issue is reported, the project follows a structured process to ensure that each report is handled with
the attention and urgency it deserves. This includes:

1. Verifying the reported vulnerability
2. Assessing its potential impact
3. Developing and testing a fix
4. Releasing security patches
5. Coordinating the public disclosure of the vulnerability

All reported vulnerabilities are carefully assessed and triaged internally to ensure appropriate action is taken.

## How We Prioritize (and the Dangers of Blindly Following Automated Scanners)

While we appreciate the value of automated vulnerability detection systems like Dependabot, we want to emphasize the
importance of critical thinking when addressing flagged issues. These systems are excellent at providing signals of
potential vulnerabilities, but they shouldn't be followed blindly.

Here's why:

1. Context matters: A flagged issue might affect only a non-serving component of the stack (such as our docs-website
code or our CI smoke tests), which may not pose a significant risk to the overall system.

2. False positives: Sometimes, these systems may flag vulnerabilities in libraries that are linked but not actively
used. For example, a vulnerability in an email library might be flagged even if the software never sends emails.

3. Exploit feasibility: Some vulnerabilities may be technically present but extremely difficult or impractical to
exploit in real-world scenarios. Automated scanners often don't consider the actual implementation details or
security controls that might mitigate the risk. For example, a reported SQL injection vulnerability might exist in
theory, but if the application uses parameterized queries or has proper input validation in place, the actual risk
could be significantly lower than the scanner suggests (see the sketch below).

We carefully review all automated alerts in the context of our specific implementation to determine the actual risk and
appropriate action.
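
To make the exploit-feasibility example concrete, here is a minimal, self-contained sketch (plain Python with the standard-library sqlite3 module, not DataHub code) of how a parameterized query neutralizes a payload that succeeds against naive string interpolation:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, role TEXT)")
conn.execute("INSERT INTO users VALUES ('alice', 'admin')")

attacker_input = "' OR '1'='1"

# The pattern a scanner flags: input spliced into the SQL text rewrites
# the WHERE clause and matches every row.
unsafe = f"SELECT role FROM users WHERE name = '{attacker_input}'"
print(conn.execute(unsafe).fetchall())  # [('admin',)] -- injection works

# The mitigating control a scanner may not see: with a placeholder, the
# driver treats the payload as a plain value, and nothing matches.
safe = "SELECT role FROM users WHERE name = ?"
print(conn.execute(safe, (attacker_input,)).fetchall())  # []
```

The first query is what such a report describes; the second is the control that can make the finding unexploitable in practice.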

## Keeping the Community Informed

Transparency is key in maintaining trust within our open-source community. To keep everyone informed about
security-related matters:

- We maintain Security Advisories on the DataHub project GitHub repository
- These advisories include summaries of security issues, details on the fixes implemented, and any necessary mitigation
steps for users

## Conclusion

Security is an ongoing process, and we're committed to continuously improving our practices. By working together with
our community of users and contributors, we aim to maintain DataHub as a secure and reliable metadata platform.

We encourage all users to stay updated with our security announcements and to promptly apply any security patches
released. Together, we can ensure a safer environment for everyone in the DataHub community.
7 changes: 3 additions & 4 deletions docs/deploy/confluent-cloud.md
@@ -17,9 +17,8 @@ First, you'll need to create following new topics in the [Confluent Control Cent
7. (Deprecated) **MetadataAuditEvent_v4**: Metadata change log messages
8. (Deprecated) **FailedMetadataChangeEvent_v4**: Failed to process #1 event
9. **MetadataGraphEvent_v4**:
-10. **MetadataGraphEvent_v4**:
-11. **PlatformEvent_v1**
-12. **DataHubUpgradeHistory_v1**: Notifies the end of DataHub Upgrade job so dependents can act accordingly (_eg_, startup).
+10. **PlatformEvent_v1**
+11. **DataHubUpgradeHistory_v1**: Notifies the end of DataHub Upgrade job so dependents can act accordingly (_eg_, startup).
Note this topic requires special configuration: **Infinite retention**. Also, 1 partition is enough for the occasional traffic.

The first five are the most important, and are explained in more depth in [MCP/MCL](../advanced/mcp-mcl.md). The final topics are
@@ -243,4 +242,4 @@ Accepting contributions for a setup script compatible with Confluent Cloud!

The kafka-setup-job container we ship with is only compatible with a distribution of Kafka wherein ZooKeeper
is exposed and available. A version of the job using the [Confluent CLI](https://docs.confluent.io/confluent-cli/current/command-reference/kafka/topic/confluent_kafka_topic_create.html)
-would be very useful for the broader community.
\ No newline at end of file
+would be very useful for the broader community.
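
As a concrete illustration of the special configuration called out for **DataHubUpgradeHistory_v1** above, the topic can be given one partition and infinite retention at creation time. A sketch using the confluent-kafka Python client (the client choice, broker address, and replication factor are assumptions of this example, not part of this commit):

```python
from confluent_kafka.admin import AdminClient, NewTopic

# Assumed broker address; substitute your Confluent Cloud bootstrap servers
# and credentials in a real setup.
admin = AdminClient({"bootstrap.servers": "localhost:9092"})

topic = NewTopic(
    "DataHubUpgradeHistory_v1",
    num_partitions=1,               # occasional traffic: one partition is enough
    replication_factor=3,
    config={"retention.ms": "-1"},  # infinite retention
)

# create_topics is asynchronous; wait on the returned futures to surface errors.
for name, future in admin.create_topics([topic]).items():
    future.result()
    print(f"created {name}")
```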
3 changes: 1 addition & 2 deletions metadata-ingestion/docs/sources/bigquery/bigquery_recipe.yml
@@ -1,8 +1,7 @@
source:
  type: bigquery
  config:
-    # `schema_pattern` for BQ Datasets
-    schema_pattern:
+    dataset_pattern:
      allow:
        - finance_bq_dataset
    table_pattern:
6 changes: 5 additions & 1 deletion metadata-ingestion/docs/sources/dbt/dbt.md
@@ -62,7 +62,11 @@ We support the following operations:
1. add_tag - Requires `tag` property in config.
2. add_term - Requires `term` property in config.
3. add_terms - Accepts an optional `separator` property in config.
-4. add_owner - Requires `owner_type` property in config which can be either user or group. Optionally accepts the `owner_category` config property which can be set to either a [custom ownership type](../../../../docs/ownership/ownership-types.md) urn like `urn:li:ownershipType:architect` or one of `['TECHNICAL_OWNER', 'BUSINESS_OWNER', 'DATA_STEWARD', 'DATAOWNER']` (defaults to `DATAOWNER`).
+4. add_owner - Requires `owner_type` property in config which can be either `user` or `group`. Optionally accepts the `owner_category` config property which can be set to either a [custom ownership type](../../../../docs/ownership/ownership-types.md) urn like `urn:li:ownershipType:architect` or one of `['TECHNICAL_OWNER', 'BUSINESS_OWNER', 'DATA_STEWARD', 'DATAOWNER']` (defaults to `DATAOWNER`).
+
+   - The `owner_type` property will be ignored if the owner is a fully qualified urn.
+   - You can use commas to specify multiple owners - e.g. `business_owner: "jane,john,urn:li:corpGroup:data-team"`.
5. add_doc_link - Requires `link` and `description` properties in config. Upon ingestion run, this will overwrite current links in the institutional knowledge section with this new link. The anchor text is defined here in the meta_mappings as `description`.

Note:
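
For orientation, here is a hedged sketch of a meta mapping that exercises two of the operations listed above. It is written as a Python dict for illustration; in a real ingestion recipe the same structure would appear as YAML under the dbt source config, and the field names (business_owner, has_pii) are invented for this example:

```python
# Illustrative only: field names and values are hypothetical.
meta_mapping = {
    # Operation 4 above: map a dbt meta field to an owner.
    "business_owner": {
        "match": ".*",
        "operation": "add_owner",
        # owner_type is `user` or `group`; it is ignored when the matched
        # value is already a fully qualified urn.
        "config": {"owner_type": "user", "owner_category": "BUSINESS_OWNER"},
    },
    # Operation 1 above: tag models that declare meta.has_pii = true.
    "has_pii": {
        "match": True,
        "operation": "add_tag",
        "config": {"tag": "pii"},
    },
}
```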