
Socket Service: Restart Flag Not Reset and Accidentally Cleared by Offloaded Polling Events #81813

Open
illysky opened this issue Nov 23, 2024 · 4 comments
Labels: area: Networking · bug (The issue is a bug, or the PR is fixing a bug)

illysky commented Nov 23, 2024

I’m working with Zephyr v4.0.0 and trying to get to grips with the socket service. The process has been somewhat challenging, as I’m using a mix of offloaded and native sockets, with plans to eventually incorporate native DNS and TLS. There’s a good chance that some of the issues I’m experiencing are due to a misunderstanding on my part, so apologies in advance.

The DNS Resolve module registers its sockets using net_socket_service_register, as does one of my own modules for HTTP. Each time a new socket is registered, the socket service thread is instructed to restart to reload the updated events. This behavior is triggered by the following code in socket_service.c:

/* Tell the thread to re-read the variables */
zvfs_eventfd_write(ctx.events[0].fd, 1);
ret = 0;

And handled as:

if (ret > 0 && ctx.events[0].revents) {
    zvfs_eventfd_read(ctx.events[0].fd, &value);
    NET_DBG("Received restart event.");
    goto restart;
}

However, I can’t see where or how the restart flag (i.e., the pending event on ctx.events[0]) is cleared once the restart is complete. It doesn’t appear that zvfs_eventfd_read(ctx.events[0].fd, &value) clears it, and frankly, I’m not sure what the purpose of that call is anyway.
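For context, here’s a minimal sketch of how I currently understand the eventfd hand-off, based on the zvfs_eventfd API in Zephyr v4.0.0 (illustration only, not the in-tree code):

#include <zephyr/zvfs/eventfd.h>

static void signal_restart(int efd)
{
    /* Writing adds to the eventfd counter, which makes the fd report
     * POLLIN on the next zsock_poll().
     */
    zvfs_eventfd_write(efd, 1);
}

static void handle_restart(int efd)
{
    zvfs_eventfd_t value;

    /* Reading drains the counter so the fd stops polling readable, but
     * it does not touch the revents snapshot that zsock_poll() already
     * filled in.
     */
    zvfs_eventfd_read(efd, &value);
}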

After digging into this further, it seems the restart flag might be cleared indirectly—or even randomly—by modules such as offloaded socket drivers. For example, I’ve implemented a variant of the in-tree SimpleLink Wi-Fi driver to support the NWP-only CC3120. The original driver doesn’t appear to be designed with the socket service in mind, as it modifies file descriptors that don’t belong to it. This creates problems, as the socket service reserves the first element of ctx.events for its internal control.

Specifically, the SimpleLink driver clears all revents fields before checking ownership, which can inadvertently prevent the socket service from restarting. Additionally, when the driver detects it doesn’t own a file descriptor, it returns an error instead of gracefully ignoring it and leaving other interfaces unaffected. Here’s the relevant snippet from simplelink_sockets.c:

for (i = 0; i < nfds; i++) {
    fds[i].revents = 0;  // <--- Clears all revents before determining ownership
    if (fds[i].fd < 0) {
        continue;
    } else {
        obj = zvfs_get_fd_obj(fds[i].fd,
                              (const struct fd_op_vtable *)&simplelink_socket_fd_op_vtable,
                              ENOTSUP);
        if (obj != NULL) {
            /* Offloaded socket found */
            // <--- Should clear here only
            sd = OBJ_TO_SD(obj);
        } else {
            /* Non-offloaded socket, return an error */
            retval = slcb_SetErrno(EINVAL);  // <--- Should ignore, not return, as this exits socket_service thread
            goto exit;
        }
    }
    if (fds[i].events & ZSOCK_POLLIN) {
        SL_SOCKET_FD_SET(sd, &rfds);
    }
    if (fds[i].events & ZSOCK_POLLOUT) {
        SL_SOCKET_FD_SET(sd, &wfds);
    }
    if (sd > max_sd) {
        max_sd = sd;
    }
}
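For reference, my variant restructures the loop roughly like this (a sketch of the behavior described above, not the exact driver code):

for (i = 0; i < nfds; i++) {
    if (fds[i].fd < 0) {
        continue;
    }

    obj = zvfs_get_fd_obj(fds[i].fd,
                          (const struct fd_op_vtable *)&simplelink_socket_fd_op_vtable,
                          ENOTSUP);
    if (obj == NULL) {
        /* Not our socket: leave it (and its revents) untouched so other
         * layers, including the socket service's eventfd, still see
         * their events.
         */
        continue;
    }

    /* Offloaded socket confirmed, so it is now safe to reset revents. */
    fds[i].revents = 0;
    sd = OBJ_TO_SD(obj);

    if (fds[i].events & ZSOCK_POLLIN) {
        SL_SOCKET_FD_SET(sd, &rfds);
    }
    if (fds[i].events & ZSOCK_POLLOUT) {
        SL_SOCKET_FD_SET(sd, &wfds);
    }
    if (sd > max_sd) {
        max_sd = sd;
    }
}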

Since my implementation of the driver no longer touches file descriptors that don’t belong to it, nothing clears the restart flag anymore. As a result, the socket service restarts repeatedly whenever it returns from a blocked zsock_poll, and socket data processing is skipped.

Have I understood this issue correctly? If so, who is responsible for clearing the restart flag in socket_service.c (i.e., the pending event on ctx.events[0])? I assume the service itself should handle this. Additionally, I’m wondering if the restart condition should be checked and addressed after processing the results of zsock_poll.

To address this, my proposed update to socket_service.c is as follows:

while (true) {
    ret = zsock_poll(ctx.events, count + 1, -1);
    if (ret < 0) {
        ret = -errno;
        NET_ERR("poll failed (%d)", ret);
        goto out;
    }

    if (ret == 0) {
        /* should not happen because timeout is -1 */
        break;
    }

    for (i = 1; i < (count + 1); i++) {
        if (ctx.events[i].fd < 0) {
            continue;
        }

        if (ctx.events[i].revents > 0) {
            ret = trigger_work(&ctx.events[i]);
            if (ret < 0) {
                NET_DBG("Triggering work failed (%d)", ret);
            }
        }
    }

    /* Relocated after trigger_work() so pending work gets done before
     * restarting. Note: ret may have been overwritten by trigger_work()
     * in the loop above, so only revents is checked here.
     */
    if (ctx.events[0].revents) {
        /* zvfs_eventfd_read(ctx.events[0].fd, &value); */ /* removed */
        ctx.events[0].revents = 0; /* clear the event as soon as possible */
        NET_DBG("Received restart event.");
        goto restart;
    }
}

Apologies if I’ve misunderstood any part of this—I've only been working on it for a week. Thanks in advance for your insights!

illysky added the bug (The issue is a bug, or the PR is fixing a bug) label Nov 23, 2024
illysky changed the title from “Socket Service restart service flag not reset and accidently reset by offloaded polling events” to “Socket Service: Restart Flag Not Reset and Accidentally Cleared by Offloaded Polling Events” Nov 23, 2024

jukkar (Member) commented Nov 23, 2024

Thanks for the impressive bug analysis. It must have been a difficult debugging session, as the interaction between the different components is complicated.
I am not sure about commenting out the zvfs_eventfd_read(); my understanding is that since we wrote into the eventfd, we should read the data away too. But clearing the revents looks like the right thing to do.

About the simplelink driver, I don’t fully understand the interactions here. Both the driver and the socket service have their own fds arrays, so how could the driver affect the service? Or is this related to the recent change where the poll implementation was moved to zvfs in commit 881dc1f?

cc: @cfriedt and @rlubos

illysky (Author) commented Nov 23, 2024

No problem. Happy to help! Debugging was eye-watering! 😅

Yeah, I think the socket service uses the global fdtable now, but it makes its own copy on thread restart and passes that to zsock_poll. When SimpleLink’s zsock_poll implementation is called, it assumes all the fds belong to it and starts clearing revents. Element 0 in ctx->events is used by the socket service, so that gets cleared. It will also be problematic when I start using sockets from other offloaded interfaces.

I’ve tested my proposal and it’s all working fine now; I put the read call back in as you said. Let me know if you need me to do anything. 🙏
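With the read call restored, the restart handling I’m running locally looks roughly like this (a sketch of my local tree, not a merged patch):

    /* Checked after trigger_work() so pending work is done first. */
    if (ctx.events[0].revents) {
        zvfs_eventfd_read(ctx.events[0].fd, &value); /* drain the counter */
        ctx.events[0].revents = 0;                   /* clear the stale event */
        NET_DBG("Received restart event.");
        goto restart;
    }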

cfriedt (Member) commented Nov 23, 2024

Relocating the code to restart polling looks correct. Work should probably get done before restarting so that events are not lost. I agree with Jukka, though, that we should still read from the eventfd to clear its value.

@illysky - would you be able to make a pull request with your fix?

I suspect that this issue would affect not only offloaded sockets but also centrally managed ones.

With that, you could create a ZTest that demonstrates the problem in isolation with intentionally injected conditions. The test can be run under e.g. QEMU or native_sim.

The test would be expected to fail when run without the fix and to pass with it. That’s important for maintainers reviewing new code submissions, and it also ensures we don’t reintroduce the same problem in the future.
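Something along these lines could be a starting point (a hypothetical skeleton; the suite name and test outline are invented for illustration, not an existing test):

#include <zephyr/ztest.h>
#include <zephyr/net/socket_service.h>

ZTEST_SUITE(socket_service_restart, NULL, NULL, NULL, NULL, NULL);

ZTEST(socket_service_restart, test_restart_event_not_lost)
{
    /* Outline:
     * 1. Register a socket service handler and queue data on its socket.
     * 2. Register a second socket so the service receives a restart
     *    event while that data is still pending.
     * 3. Assert that the pending data is processed, i.e. the restart
     *    did not cause the work trigger to be skipped.
     */
    ztest_test_skip(); /* placeholder until the steps above are implemented */
}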
