"task exited" error in mesh partioner #11

Open
szwang1990 opened this issue Nov 14, 2023 · 4 comments

@szwang1990

Hi, I've been using fmesh for a while and have run FESOM2 successfully with my own mesh generated by fmesh. But from time to time, when I use the mesh partitioner to decompose the mesh, the code stops suddenly, leaving only an error message in the slurm error output file. I have attached the three mesh files (nod2d.out, elem2d.out, aux3d.out) and the output files.
FESOM2_mesh_generated_by_fmesh.zip

My questions are:

  1. Why does the code stop running?
  2. If it is due to low mesh quality, what should I do to improve the mesh quality?

Thanks in advance.
Shizhu

@Sealki
Collaborator

Sealki commented Nov 14, 2023

Hi Shizhu,

Thank you for using fmesh; I am glad you find it useful (sometimes). However, the problem you describe doesn't seem to be related to fmesh, and I don't know why the partitioning fails. In my experience debugging fmesh, whenever the fmesh output contained a mistake, the partitioner reported the lat/lon position of the problematic node and the reason for the failure. I don't see anything like that in your slurm files.

The slurm-err file does, however, contain multiple "mkdir: cannot create directory" errors. Could that be the problem?

Sergei

@szwang1990
Author

Hi Sergei,

Thanks for the reply. Yes, you are right. I had run the mesh partitioning on multiple CPUs (FESOM1.4 style). Patrick told me that in FESOM2 the partitioning should normally be done on a single CPU. After I ran the code serially, the partitioning seems all right :)

I then ran into another problem. After the partitioning, I ran FESOM2 with my own mesh. When I used pyfesom2 to check the result with
pfplot $MESH_DIR $DATA_DIR temp
I got this spotted picture:
Screenshot from 2023-11-15 14-46-02

  1. Could this be caused by low mesh quality?
  2. Do we have tools to assess the quality of the mesh generated by fmesh?
  3. Is it possible that one narrow strait contains only two nodes, so that both nodes are idle during the run? In FESOM1.4, this situation would be reported as an error.
  4. Is the mesh generated by fmesh already rotated or not? If not, should both rotated_grid and force_rotation in namelist.config be set to true?

Shizhu

@Sealki
Collaborator

Sealki commented Nov 15, 2023

Hi Shizhu,

I am glad you got it running. Regarding your other questions:

  1. I think so. But it's not about quality. You probably need to modify the following argument in pfplot:
    "--influence",
    "-i",
    default=80000,
    type=float,
    help="Radius of influence for interpolation, in meters.",
    If your element sizes exceed 80 km, there is a chance that no model nodes are found within this radius of the points of the default (1 degree) interpolation grid. Alternatively, you could try adjusting this pfplot argument:
    "--res",
    "-r",
    nargs=2,
    type=int,
    default=(360, 170),
    help="Number of points along each axis that will be used for interpolation (for lon and lat).",
    metavar=("N_POINTS_LON", "N_POINTS_LAT"),

  2. I am not aware of such tools.

  3. Yes, fmesh does allow narrow straits (one element wide) to exist. This should not affect how the model works, but you are right that such a strait will be "idling".

  4. I have no idea what you are asking about here. I have never touched the config parameters you mention and don't know exactly what they control.
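
The radius-of-influence effect described in item 1 can be illustrated with a small, self-contained sketch. It is not pyfesom2's interpolation code; it only counts how many points of a regular 1 degree target grid have no mesh node within a given radius. The coarse mesh (nodes every 3 degrees), the box, the function names, and the 300 km radius are all made up for illustration.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius, used for great-circle distances


def great_circle_m(lon1, lat1, lon2, lat2):
    """Haversine distance in meters between two lon/lat points given in degrees."""
    lon1, lat1, lon2, lat2 = map(math.radians, (lon1, lat1, lon2, lat2))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def empty_fraction(nodes, targets, radius_m):
    """Fraction of target points that have no mesh node within radius_m."""
    empty = 0
    for tlon, tlat in targets:
        has_neighbour = any(
            great_circle_m(tlon, tlat, nlon, nlat) <= radius_m
            for nlon, nlat in nodes
        )
        if not has_neighbour:
            empty += 1
    return empty / len(targets)


# Hypothetical coarse mesh: one node every 3 degrees in a 30 x 30 degree box.
nodes = [(lon, lat) for lon in range(0, 31, 3) for lat in range(-15, 16, 3)]

# Target grid with 1 degree spacing over the same box (an assumption that
# mimics the "default (1 degree) grid" mentioned above).
targets = [(lon, lat) for lon in range(0, 31) for lat in range(-15, 16)]

default_radius = empty_fraction(nodes, targets, 80_000)   # 80 km, the quoted default
larger_radius = empty_fraction(nodes, targets, 300_000)   # illustrative larger value

print(f"empty target points, 80 km radius:  {default_radius:.0%}")
print(f"empty target points, 300 km radius: {larger_radius:.0%}")
```

With the 80 km radius, only target points that coincide with a node find a neighbour, because 1 degree of latitude is already about 111 km; gaps of this kind are what would show up as a spotted plot. Raising the radius above the node spacing fills them. On the command line this would presumably mean passing a larger value through the `-i` option quoted above, for example something like `pfplot $MESH_DIR $DATA_DIR temp -i 300000`, with the value chosen to match the actual mesh.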

Sergei

@szwang1990
Author

Hi Sergei,

Your instructions are very helpful. It turns out that the resolution of my test mesh is really low, so the element size exceeds the default radius of influence. Now it works fine for me.

Again, many thanks for your help!
