
[Bug] /tmp/ File Not Found Error Causing Task Failure for dbt Cosmos Tasks #1075

Closed
1 task done
oliverrmaa opened this issue Jul 2, 2024 · 9 comments
Labels
bug Something isn't working customer request An Astronomer customer requested this execution:local Related to Local execution environment

@oliverrmaa
Contributor

Astronomer Cosmos Version

Other Astronomer Cosmos version (please specify below)

If "Other Astronomer Cosmos version" selected, which one?

1.4.3

dbt-core version

1.7.17

Versions of dbt adapters

dbt-bigquery==1.7.4
dbt-core==1.7.17
dbt-extractor==0.5.1
dbt-semantic-interfaces==0.4.4

LoadMode

DBT_LS

ExecutionMode

LOCAL

InvocationMode

SUBPROCESS

airflow version

apache-airflow==2.9.2+astro.1

Operating System

Debian GNU/Linux 11 (bullseye)

If you think it's a UI issue, what browsers are you seeing the problem on?

No response

Deployment

Astronomer

Deployment details

We have a main production deployment in Astro Cloud. We also do local development via astro dev start. We have continuous deployment set up through CircleCI, which deploys PRs merged to our master branch to our production deployment via astro deploy --dags. For authentication to our data warehouse (Google BigQuery) in production, we use GoogleCloudServiceAccountDictProfileMapping; for local development we use ProfileConfig, where our dbt profiles.yml has a hardcoded path to a service account JSON file located at the same path for each developer.
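
For context, here is a minimal sketch of how this kind of setup is typically wired in Cosmos (the connection id, project, dataset, ids, and paths below are placeholders, not our actual values):

from datetime import datetime
from pathlib import Path

from airflow import DAG
from cosmos import DbtTaskGroup, ProfileConfig, ProjectConfig
from cosmos.profiles import GoogleCloudServiceAccountDictProfileMapping

# Production: derive the BigQuery profile from an Airflow connection.
prod_profile = ProfileConfig(
    profile_name="my_dbt_project",
    target_name="prod",
    profile_mapping=GoogleCloudServiceAccountDictProfileMapping(
        conn_id="google_cloud_default",
        profile_args={"project": "my-gcp-project", "dataset": "analytics"},
    ),
)

# Local development: use a profiles.yml with a hardcoded service-account path.
local_profile = ProfileConfig(
    profile_name="my_dbt_project",
    target_name="dev",
    profiles_yml_filepath=Path("/usr/local/airflow/dbt/profiles.yml"),
)

with DAG("dbt_models", start_date=datetime(2024, 1, 1), schedule=None) as dag:
    DbtTaskGroup(
        group_id="dbt_run",
        project_config=ProjectConfig("/usr/local/airflow/dags/dbt"),
        profile_config=prod_profile,
        operator_args={"install_deps": True},
    )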

What happened?

We are still intermittently seeing FileNotFoundError: [Errno 2] No such file or directory for /tmp files, every few hours or so, across multiple DAGs, ever since the inception of our Astronomer/Cosmos setup. The error appears on Cosmos-created dbt model run tasks. This issue affects our on-call personnel because they have to manually clear and re-run these tasks for the model to run successfully (the re-run usually succeeds). Some model runs must be manually re-run for the task to succeed, while others recover on their own.

Relevant log output

Here are four examples of errors in log output for different missing /tmp/ files: 

(1)
[2024-06-24, 18:32:45 UTC] {subprocess.py:94} INFO - FileNotFoundError: [Errno 2] No such file or directory: '/tmp/tmp4yx6m8en/package-lock.yml'

(2) This is a typical example for one of our models: 
"FileNotFoundError: [Errno 2] No such file or directory: '/tmp/tmp2pcqnmxp/models/frontroom/business_models/provider_pay/datamarts/.provider_performance.sql.JjagaL'", ''); 274)

(3) This is a typical example for one of our models:
2024-06-21, 10:17:35 UTC] {log.py:232} WARNING - [2024-06-21T10:17:35.702+0000] {subprocess.py:94} INFO - (astronomer-cosmos) - FileNotFoundError: [Errno 2] No such file or directory: '/tmp/tmph56xe15q/models/frontroom/business_models/datamarts/honest_heatlh/.honest_health_monthly_subscription_snapshots.sql.KDCDEl'
[2024-06-21, 10:17:35 UTC] {subprocess.py:94} INFO - FileNotFoundError: [Errno 2] No such file or directory: '/tmp/tmph56xe15q/models/frontroom/business_models/datamarts/honest_heatlh/.honest_health_monthly_subscription_snapshots.sql.KDCDEl'

(4) This example is for external models we use from the dbt qualtrics package: 
[Errno 2] No such file or directory: '/tmp/tmpf6h80niz/models/intermediate/shipment/.int_easy_post_tracking.sql.GgjJgL'
06:02:42  Traceback (most recent call last):
  File "/usr/local/airflow/dbt_venv/lib/python3.11/site-packages/dbt/cli/requires.py", line 91, in wrapper
    result, success = func(*args, **kwargs)
                      ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/airflow/dbt_venv/lib/python3.11/site-packages/dbt/cli/requires.py", line 76, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/airflow/dbt_venv/lib/python3.11/site-packages/dbt/cli/requires.py", line 169, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/airflow/dbt_venv/lib/python3.11/site-packages/dbt/cli/requires.py", line 198, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/airflow/dbt_venv/lib/python3.11/site-packages/dbt/cli/requires.py", line 245, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/airflow/dbt_venv/lib/python3.11/site-packages/dbt/cli/requires.py", line 271, in wrapper
    ctx.obj["manifest"] = parse_manifest(
                          ^^^^^^^^^^^^^^^
  File "/usr/local/airflow/dbt_venv/lib/python3.11/site-packages/dbt/parser/manifest.py", line 1790, in parse_manifest
    manifest = ManifestLoader.get_full_manifest(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/airflow/dbt_venv/lib/python3.11/site-packages/dbt/parser/manifest.py", line 318, in get_full_manifest
    manifest = loader.load()
               ^^^^^^^^^^^^^
  File "/usr/local/airflow/dbt_venv/lib/python3.11/site-packages/dbt/parser/manifest.py", line 365, in load
    file_reader.read_files()
  File "/usr/local/airflow/dbt_venv/lib/python3.11/site-packages/dbt/parser/read_files.py", line 203, in read_files
    self.read_files_for_project(project, file_types)
  File "/usr/local/airflow/dbt_venv/lib/python3.11/site-packages/dbt/parser/read_files.py", line 210, in read_files_for_project
    project_files[file_type_info["parser"]] = read_files_for_parser(
                                              ^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/airflow/dbt_venv/lib/python3.11/site-packages/dbt/parser/read_files.py", line 158, in read_files_for_parser
    source_files = get_source_files(
                   ^^^^^^^^^^^^^^^^^
  File "/usr/local/airflow/dbt_venv/lib/python3.11/site-packages/dbt/parser/read_files.py", line 134, in get_source_files
    fp_list = filesystem_search(project, paths, extension, ignore_spec)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/airflow/dbt_venv/lib/python3.11/site-packages/dbt/parser/search.py", line 74, in filesystem_search
    for result in find_matching(root, relative_dirs, ext, ignore_spec):
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/airflow/dbt_venv/lib/python3.11/site-packages/dbt/clients/system.py", line 79, in find_matching
    modification_time = os.path.getmtime(absolute_path)
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<frozen genericpath>", line 55, in getmtime
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/tmpf6h80niz/models/intermediate/shipment/.int_easy_post_tracking.sql.GgjJgL'

How to reproduce

I believe replicating our setup should reproduce it, as the issue happens quite often (once every few hours). We also have 1000+ dbt models, so an example project of that size may be needed for reproducibility. If underlying deployment infrastructure information is required, here it is:

[deployment infrastructure screenshots attached]

Anything else :)?

No response

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Contact Details

[email protected]

@oliverrmaa oliverrmaa added bug Something isn't working triage-needed Items need to be reviewed / assigned to milestone labels Jul 2, 2024
@dosubot dosubot bot added the execution:local Related to Local execution environment label Jul 2, 2024

dosubot bot commented Jul 2, 2024

To resolve the intermittent FileNotFoundError: [Errno 2] No such file or directory for /tmp files that is causing task failures for dbt Cosmos tasks in your Astronomer deployment, you should ensure that the temporary directories and files are properly managed and exist when needed.

Here are some steps to address this issue:

  1. Manage Temporary Directories Properly:
    Ensure that the temporary directories are not cleaned up prematurely. Here is a revised version of the function to ensure the temporary directory is properly managed:

    import os
    import shutil
    import tempfile
    from pathlib import Path

    def failing_test_dbt_project(tmp_path):
        # Keep the temporary directory alive for the duration of the `with`
        # block so it cannot be cleaned up while the dbt project is in use.
        with tempfile.TemporaryDirectory() as tmp_dir:
            tmp_dir_path = Path(tmp_dir) / "mini"
            shutil.copytree(MINI_DBT_PROJ_DIR, tmp_dir_path)
            target_schema = tmp_dir_path / "models/schema.yml"
            if target_schema.exists():
                os.remove(target_schema)
            shutil.copy(MINI_DBT_PROJ_DIR_FAILING_SCHEMA, target_schema)
            yield tmp_dir_path
  2. Set the TMPDIR Environment Variable:
    Ensure that the TMPDIR environment variable is set correctly, especially on MacOS, to avoid the value of the temp directory changing unexpectedly. You can configure this in your cosmos/settings.py:

    import os
    import tempfile
    from pathlib import Path
    
    import airflow
    from airflow.configuration import conf
    
    from cosmos.constants import DEFAULT_COSMOS_CACHE_DIR_NAME, DEFAULT_OPENLINEAGE_NAMESPACE
    
    # In MacOS users may want to set the envvar `TMPDIR` if they do not want the value of the temp directory to change
    DEFAULT_CACHE_DIR = Path(tempfile.gettempdir(), DEFAULT_COSMOS_CACHE_DIR_NAME)
    cache_dir = Path(conf.get("cosmos", "cache_dir", fallback=DEFAULT_CACHE_DIR) or DEFAULT_CACHE_DIR)
    enable_cache = conf.getboolean("cosmos", "enable_cache", fallback=True)
    enable_cache_partial_parse = conf.getboolean("cosmos", "enable_cache_partial_parse", fallback=True)
    enable_cache_dbt_ls = conf.getboolean("cosmos", "enable_cache_dbt_ls", fallback=True)
    propagate_logs = conf.getboolean("cosmos", "propagate_logs", fallback=True)
    dbt_docs_dir = conf.get("cosmos", "dbt_docs_dir", fallback=None)
    dbt_docs_conn_id = conf.get("cosmos", "dbt_docs_conn_id", fallback=None)
    dbt_docs_index_file_name = conf.get("cosmos", "dbt_docs_index_file_name", fallback="index.html")
    enable_cache_profile = conf.getboolean("cosmos", "enable_cache_profile", fallback=True)
    dbt_profile_cache_dir_name = conf.get("cosmos", "profile_cache_dir_name", fallback="profile")
    
    try:
        LINEAGE_NAMESPACE = conf.get("openlineage", "namespace")
    except airflow.exceptions.AirflowConfigException:
        LINEAGE_NAMESPACE = os.getenv("OPENLINEAGE_NAMESPACE", DEFAULT_OPENLINEAGE_NAMESPACE)
  3. Handle Temporary Directory in run_command Function:
    Ensure that the run_command function in cosmos/dbt/graph.py is correctly handling the temporary directory and environment variables:

    def run_command(command: list[str], tmp_dir: Path, env_vars: dict[str, str]) -> str:
        """Run a command in a subprocess, returning the stdout."""
        logger.info("Running command: `%s`", " ".join(command))
        logger.debug("Environment variable keys: %s", env_vars.keys())
        process = Popen(
            command,
            stdout=PIPE,
            stderr=PIPE,
            cwd=tmp_dir,
            universal_newlines=True,
            env=env_vars,
        )
        stdout, stderr = process.communicate()
        returncode = process.returncode
    
        if 'Run "dbt deps" to install package dependencies' in stdout and command[1] == "ls":
            raise CosmosLoadDbtException(
                "Unable to run dbt ls command due to missing dbt_packages. Set RenderConfig.dbt_deps=True."
            )
    
        if returncode or "Error" in stdout.replace("WarnErrorOptions", ""):
            details = stderr or stdout
            raise CosmosLoadDbtException(f"Unable to run {command} due to the error:\n{details}")
    
        return stdout

By ensuring the temporary directory is properly managed, configuring the TMPDIR environment variable, and correctly handling the temporary directory in the run_command function, you can mitigate the FileNotFoundError issue [1][2][3].


@tatiana tatiana added the customer request An Astronomer customer requested this label Jul 2, 2024
@tatiana tatiana self-assigned this Jul 2, 2024
@tatiana tatiana added this to the Cosmos 1.6.0 milestone Jul 2, 2024
@tatiana tatiana modified the milestones: Cosmos 1.6.0, Cosmos 1.5.1 Jul 12, 2024
@tatiana
Collaborator

tatiana commented Jul 12, 2024

Hi, @oliverrmaa. I'm still trying to reproduce this issue, but I have a few hypotheses, and we can try a few things out. We'll soon have an alpha release to validate one of these hypotheses.

The errors you are seeing are raised by dbt-core itself. Cosmos 1.4.3 does not have any specific knowledge of any of the files mentioned, including package-lock.yml.

We create a temporary directory for every dbt command that will be run. This was to overcome two main issues:

  1. Read-only filesystems: many users could not write to the original dbt project folder. Older versions of Cosmos created a full copy of the original directory; more recent versions (from 1.1 through the current 1.5) replaced that with a symbolic link, where applicable.
  2. dbt concurrency: many dbt-core commands were not designed to be run concurrently, as described in the official dbt docs, so running them from independent directories helps avoid issues.

Some follow-up questions:
(i) Have you observed if any of these errors are affected by deployments to Astro?
(ii) How is your disk consumption at the moment?
(iii) Do you have task retries enabled for these DAGs? What is the current amount of retries?
(iv) Would you consider using more, smaller Astro instances with lower concurrency, and checking if that helps mitigate the issue?
(v) Which data warehouse are you using?
(vi) Which version of dbt-core are you using?
(vii) Are you currently running dbt deps as part of the task runs? Are these errors happening when dbt deps or the central dbt command (e.g. dbt run) is executed?
(viii) The files that are mentioned to be missing (e.g. .provider_performance.sql.JjagaL, .honest_health_monthly_subscription_snapshots.sql.KDCDEl), are they part of your original project, are they dependencies or are they being created by dbt-core dynamically (e.g. compiled SQL files, that may use macros)?

There are two low-hanging fruits I can see:

a) Assuming the issue may be with the creation of symbolic links, we can create an alpha version of Cosmos that avoids creating those, runs the commands from the original folder, and uses environment variables for dbt-core JSON artifacts and dbt-core logs.

b) We could check whether this error message ("[Errno 2] No such file or directory") happened within the dbt task run and, if so, re-run the command as part of the same task run in Cosmos itself (a rough sketch is below). However, it does feel that configuring task retries in Airflow may be the more suitable place.
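
To illustrate (b), here is a rough, purely illustrative sketch of what an in-task retry could look like (this is not Cosmos code; the function name, marker string handling, and back-off are assumptions):

import subprocess
import time

TRANSIENT_MARKER = "[Errno 2] No such file or directory"

def run_dbt_with_internal_retry(command: list[str], cwd: str, max_attempts: int = 2) -> str:
    """Run a dbt command, retrying once within the same task if the transient
    /tmp FileNotFoundError signature shows up in the output."""
    output = ""
    for attempt in range(1, max_attempts + 1):
        result = subprocess.run(command, cwd=cwd, capture_output=True, text=True)
        output = result.stdout + result.stderr
        if result.returncode == 0:
            return output
        if TRANSIENT_MARKER in output and attempt < max_attempts:
            time.sleep(5)  # brief back-off before re-running the same command
            continue
        raise RuntimeError(f"dbt command {command} failed:\n{output}")
    return output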

We would implement (a) and (b) if you confirm that reducing concurrency in smaller Airflow/Astro instances still leads to the same error. Another path we can follow to improve concurrent execution of dbt tasks is to not use dbt to run the SQL, but to have Airflow operators run the compiled SQL instead. This is a strategy that companies that had to scale dbt in Airflow (e.g., Monzo) used in their proprietary solutions, quite successfully. The downsides are that some macros may not work as expected and that it requires some effort per data warehouse, but it may pay off, and it can be a good compromise.

@tatiana
Collaborator

tatiana commented Jul 12, 2024

This issue seems to be affecting several users from the open-source community, as well as Astro customers.

@tatiana tatiana removed the triage-needed Items need to be reviewed / assigned to milestone label Jul 12, 2024
@tatiana tatiana modified the milestones: Cosmos 1.5.1, Cosmos 1.6.0 Jul 17, 2024
@tatiana
Collaborator

tatiana commented Jul 17, 2024

Feedback from @oliverrmaa :

First of all, our dbt-bigquery version is 1.7.4 and our dbt-core version is 1.7.17, and no, we do not override any paths in dbt_project.yml.
The rest of the answers:
(i) They appear unaffected; errors can occur even when there are no deployments (e.g. over the weekend).
(ii) How do we check that?
(iii) We rolled out retries for one DAG (set to 2) and it was very promising (I think the retries prevented it from being an issue for over a week). It doesn't tell us the root cause, but it addresses the issue. We may roll it out to all DAGs.
(iv) Potentially open to this.
(v) Google BigQuery.
(vi) Our dbt-core version is 1.7.17.
(vii) Yes we do; we use operator_args={"install_deps": True} in our DbtTaskGroup().
(viii) They are definitely not part of the original project; they are created dynamically. The models do use macros like config() and source(), but so do other dbt models. We also have a custom macro that is on every single dbt model, but it just adds LIMIT 0 if the target is our CI environment.

On (ii), I'd recommend the following steps:

  1. Open the Deployment in the Astro UI.
  2. Click on Overview.
  3. Click Get Analytics.
  4. You can view the following metrics related to ephemeral storage:
    Ephemeral storage usage (Kubernetes Executor/KubernetesPodOperator): This metric shows how your Kubernetes tasks use available ephemeral storage as a percentage of the total ephemeral storage configured. You can adjust the graph's y-axis to better fit your data or zoom in to view details.
    Ephemeral storage usage (Celery workers): This metric shows how your Celery worker uses available ephemeral storage as a percentage of the total ephemeral storage configured. You can also adjust the graph's y-axis to better fit your data or zoom in to view details [1].

After a few hours of analyzing and troubleshooting this issue, I have a few conclusions:

Cause

Either dbt-core or another related library creates temporary files in the source models folder. I read part of the dbt-core source code, and it was unclear where this happens. This was a surprise, since I expected artifacts to be created in the target and logs folders, not in the source folders. This behavior may be associated with Jinja2 caching, a macro, or a dbt adapter. What is clear is that this is not caused by Cosmos (earlier this week I spoke to a user of another library for running dbt in Airflow who was facing the same issue). However, it is very likely to happen in environments like Airflow, where many concurrent dbt commands are run.

Mitigations

There are a few workarounds for the problem:

a) Use Airflow task retries in the DbtDag / DbtTaskGroup where this is happening (see the sketch after this list). As confirmed with the customer, after setting retries to 2 they no longer experienced this issue, since it is unlikely that this concurrency issue will happen many times in a row for the same task. It is not guaranteed, but it minimizes the problem and works with any version of Cosmos.

b) Reduce task concurrency at the Airflow worker nodes to 1 in the node pools running Cosmos tasks, and have very small worker nodes, with a larger autoscaling upper limit.

c) Use KubernetesExecutor (no state would be shared between task executions)

d) Use Cosmos ExecutionMode.KUBERNETES
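
A minimal sketch of (a), using standard Airflow default_args so that every task Cosmos generates gets retries (the dag id, paths, and profile values are placeholders):

from datetime import datetime, timedelta
from pathlib import Path

from cosmos import DbtDag, ProfileConfig, ProjectConfig

dag = DbtDag(
    dag_id="dbt_models_with_retries",
    start_date=datetime(2024, 1, 1),
    schedule=None,
    project_config=ProjectConfig("/usr/local/airflow/dags/dbt"),
    profile_config=ProfileConfig(
        profile_name="my_dbt_project",
        target_name="prod",
        profiles_yml_filepath=Path("/usr/local/airflow/dbt/profiles.yml"),
    ),
    # Standard Airflow behaviour: default_args are applied to every generated
    # dbt task, so a transient /tmp failure is retried automatically.
    default_args={"retries": 2, "retry_delay": timedelta(minutes=1)},
)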

Follow ups

We were not able to reproduce this problem. But it would be great, @oliverrmaa , if we could narrow down "who" is creating these files (e.g. "/usr/local/airflow/dags/dbt_bfp/models/paid_media/audiences/om_20240401_fl_scotus_abortion_showbuy_person/._om_20240401_fl_scotus_abortion_showbuy_person.yml.DGBDnb").

We could make this folder: /usr/local/airflow/dags/dbt/ read-only to the process that is running Airflow (Astro).

By making the original folder read-only, there are two possible outcomes:

  1. Whichever process is trying to create these files will fail, and hopefully the traceback will be clear enough to determine which process it is. This would allow us to confirm whether it is dbt-core or another library and help us understand the possibilities.
  2. The process creating these files in the original folder will be smart enough to handle the read-only source folder, and it will work. In this case, your problem would be solved without adding retries or changing the concurrency.

From an implementation perspective in Cosmos itself, we could adopt one of the following strategies to mitigate the problem:

  1. Stop creating a symbolic link to the /models folder and create a new temp folder for /models, symbolic linking only sql and yml files inside of it
  2. Have Cosmos try to identify this error message in the command output and, if it happens, retry within Cosmos itself

@phanikumv phanikumv mentioned this issue Jul 18, 2024
18 tasks
@tatiana tatiana modified the milestones: Cosmos 1.6.0, Cosmos 1.7.0 Aug 16, 2024
@tatiana tatiana modified the milestones: Cosmos 1.7.0, Triage, Cosmos 1.8.0 Sep 20, 2024
@arodrber0

arodrber0 commented Nov 12, 2024

I'm having these same problems with Airflow, using Cosmos 1.7.1 and Airflow 2.10.3:

FileNotFoundError: [Errno 2] No such file or directory: '/tmp/cosmos-venv7u4h_420/bin/dbt'

I am using VIRTUALENV mode

venv_execution_config = ExecutionConfig(
    execution_mode=ExecutionMode.VIRTUALENV
)

@tatiana
Collaborator

tatiana commented Dec 13, 2024

@oliverrmaa After analysing this issue, I considered a few possible strategies to overcome it. I'll first summarise what seems to be the cause and then give suggestions.

Cause

This is not a "bug" in Cosmos per se, since this happens to users of other solutions that run dbt in Airflow (and potentially anyone trying to run multiple dbt commands on the same host, regardless of the orchestration tool). This issue happens intermittently, and I was not able to reproduce it. dbt, or some plugin/adapter, is creating/deleting these temporary .*.sql.* files in the original dbt models folder. Let's assume concurrent dbt commands are referencing the same models folder. In that case, there is a chance that multiple processes will be creating/deleting these files concurrently, and one process may delete the temporary files that another in-progress process depends on.

Although Cosmos creates a new temporary folder per dbt command run, the new temporary folder contains a symbolic link to the original dbt project models' folder. In other words, multiple dbt processes in the same host will reference the same original models' folder. In the past, Cosmos would copy the entire dbt project folder per process, but this was causing lots of disk usage & delaying running dbt commands when dbt projects were large.

Mitigations

Astro customers reported that by setting the task retries to a value equal to or higher than 1, they stopped experiencing this issue. One of the concerns is that many users may want to set different retry amounts for different dbt/Airflow tasks. Once feature #1339 is merged, users will no longer have this concern, since they will be able to set the retries only for the models where it makes sense.

Another mitigation I thought about is to leverage the dbt .dbtignore file (https://docs.getdbt.com/reference/dbtignore). I honestly do not know what the consequences will be for the actual dbt command execution. @oliverrmaa, I'd love your feedback on this, if you could try it out:

Create/update the root directory of your dbt project to contain a .dbtignore file containing the following:

# ignore all .sql files that start with "." and have "sql" in the middle of the name
.*.sql.*

Possible code changes to Cosmos

I see two possible changes we could make to the Cosmos code-base to handle this problem:

  1. Have Cosmos create a symbolic link for every .sql file in the original models folder, as opposed to the whole models folder (see the sketch after this list)
  2. If the dbt command execution fails and the output contains the .*.sql.* pattern, have Cosmos retry the command once, regardless of the Airflow retries
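
A rough sketch of option (1), purely illustrative (not actual Cosmos code; the helper name and the set of linked file extensions are assumptions):

import os
from pathlib import Path

def link_project_with_isolated_models(project_dir: Path, tmp_dir: Path) -> None:
    """Recreate the dbt project layout inside tmp_dir: symlink everything except
    the models folder, which is rebuilt as real directories with per-file links,
    so dbt's temporary .<name>.sql.<suffix> files stay local to tmp_dir."""
    tmp_dir.mkdir(parents=True, exist_ok=True)
    for entry in project_dir.iterdir():
        if entry.name == "models":
            continue  # handled below with per-file links
        os.symlink(entry, tmp_dir / entry.name)

    for src in (project_dir / "models").rglob("*"):
        dest = tmp_dir / src.relative_to(project_dir)
        if src.is_dir():
            dest.mkdir(parents=True, exist_ok=True)  # real, writable directory
        elif src.suffix in (".sql", ".yml", ".yaml", ".csv", ".md"):
            dest.parent.mkdir(parents=True, exist_ok=True)
            os.symlink(src, dest)  # link only the source files themselves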

Next steps

We should only consider these two changes to Cosmos if .dbtignore does not solve the problem. I'll wait for feedback on this for a few days; after this time, I'll likely close this ticket. Please reopen the issue if the proposed mitigation does not work - and we'll implement (1) or (2).

@tatiana
Collaborator

tatiana commented Dec 16, 2024

@arodrber0, your problem differs from the original one reported in the current ticket. Please create a dedicated ticket with the steps to reproduce.

@tatiana tatiana closed this as completed Dec 16, 2024
@arodrber0

arodrber0 commented Dec 16, 2024

Yes, I managed to solve it by changing the execution configuration in these newer versions:

import os
from pathlib import Path

from cosmos import ExecutionConfig

dbt_executable = Path(f"{os.environ['AIRFLOW_HOME']}/dbt_venv/bin/dbt")

venv_execution_config = ExecutionConfig(
    dbt_executable_path=dbt_executable
)

@tatiana
Collaborator

tatiana commented Jan 27, 2025

This seems to be an attempt to fix the issue at its root, in dbt-core:
dbt-labs/dbt-common#235
