-
Notifications
You must be signed in to change notification settings - Fork 258
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
esp_mqtt_client_enqueue() can block the caller in unreliable connection cases (IDFGH-7853) #230
Comments
Just a suggestion- as this is what I plan to do- couldn't you just build without API locks and just ensure you only access the API from a single task? I agree though- it seems that a different mutex could be used for enqueue/dequeue messages when locks are enabled. |
@someburner Good point, but I think this option won't work. Why: There are always two tasks involved: 1) the "user task" which enqueues the messages and 2) the ESP-MQTT task which dequeues the messages. The MQTT-ESP task is spawn automatically – no option to disable it. Therefore, disabling the lock would imply, that the underneath used |
Yeah I don't think STAILQ is thread-safe. I suppose you could assume that if the esp-mqtt task is blocking due to not being connected, then it's probably not accessing those queues, but thats not really safe. I think my proposed solution of a separate lock for enqueue/dequeue would work in theory, but looking at the esp-mqtt codebase, the API_LOCK is on the entire client struct and is required as part of formatting the message before it even goes into the outbox, so that's not really an option without a major overhaul. It would probably just be better if esp-mqtt could release the mutex while it's waiting to reconnect. However the If it'd be possible to use a counting or binary semaphore for the API_LOCK then you could implement your own queue and then call FYI: I gave this a try myself and do not get a reboot, but it does seem like the calling task stalls for 5-10 seconds, and after the stall I get this log:
It only stalls once. After a little bit typedef struct {
char * payload;
int len;
int qos;
} MSG_T;
static QueueHandle_t mqtt_tx_queue;
void mqtt_tx_task(void *arg)
{
MSG_T * m;
while(1) {
if(xQueueReceive(mqtt_tx_queue, &m, portMAX_DELAY) == pdTRUE) {
int msg_id = esp_mqtt_client_publish(mqtt_client, "your/topic", m->payload, m->len, m->qos, 0);
if(m->payload) free(m->payload);
free((void*)(m));
ESP_LOGI(TAG, "sent publish successful, msg_id=%d", msg_id);
}
}
}
bool your_publish_task(MSG_T * m)
{
if(!m) return false;
if(xQueueSend(mqtt_tx_queue, &m, portMAX_DELAY) != pdPASS) {
free((void*)(m));
return false;
}
return true;
}
void main() {
mqtt_tx_queue = xQueueCreate(32, sizeof(MSG_T *));
BaseType_t xReturned = xTaskCreatePinnedToCore(
mqtt_tx_task, "mqtt_tx", MQTT_TX_TASK_STACK_SIZE, NULL, MQTT_TX_TASK_PRIORITY, NULL, MQTT_TX_TASK_CORE);
while(1) {
vTaskDelay( 2000 / portTICK_PERIOD_MS );
MSG_T * m = (MSG_T *)malloc(sizeof(MSG_T));
m->payload = (char *)malloc(5);
memcpy(m->payload, "test", 5);
m->len = 5;
m->qos = 0;
your_publish_task(m);
}
} Alternatively you can set |
@someburner I have made some more thoughts on what could be made better. The idea is to optionally externalise the task for ESP-MQTT. Going this path would allow to have one task that can synchronize all. If the (to be created) Kconfig option is enabled (default disabled for backwards compatibility), the user must take care of calling the processing part of @someburner @david-cermak Thanks a lot |
@michaelgaehwiler - Honestly to me that sounds like a lot more work on your part, but yes there could be a kconfig option to provide your own task method, like how there is the option to provide your own outbox implementation. I was going to say that just using an extra queue is enough for my purposes but actually the issue is more insidious for me too I think. In my case I have qos1 messages that I must send only one at a time and to do so, I have to check the number of qos1 messages in the outbox. To make that safe I obviously need API locks, but in that case would lead to the same stalling in another thread. However I think in that case it would be better for the mqtt_task to signal to my other task when the outbox has no qos1 items. I'm not constrained to use the official SDK and I just copy |
Hi @michaelgaehwiler and @someburner Thanks for reporting this problem and elaborating on a potential solution. Also, thanks for the idea of disabling the API-locks! Maybe this could be used as a workaround to have the internal event loop handle also user events and (in the context of the mqtt thread) call the Then we'd be able to post requests to publish from user threads: char *data;
asprintf(&data, "data %d", i);
esp_mqtt_event_t event = { .event_id = MQTT_USER_EVENT, .client = client, .topic = "/topic/qos0", .data = data, .qos = 0};
esp_err_t ret = esp_mqtt_dispatch_custom_event(client, MQTT_USER_EVENT, &event);
ESP_LOGI(TAG, "[APP] Posted event with ret-code %d", ret); and process it directly in the event handlers: static void mqtt_event_handler(void *handler_args, esp_event_base_t base, int32_t event_id, void *event_data)
{
ESP_LOGD(TAG, "Event dispatched from event loop base=%s, event_id=%d", base, event_id);
esp_mqtt_event_handle_t event = event_data;
esp_mqtt_client_handle_t client = event->client;
int msg_id;
switch ((esp_mqtt_event_id_t)event_id) {
case MQTT_EVENT_CONNECTED:
....
....
case MQTT_USER_EVENT:
ESP_LOGI(TAG, "MQTT_USER_EVENT");
esp_err_t ret = esp_mqtt_client_enqueue(client, event->topic, event->data, 0, event->qos,0,1); // or publish
free(event->data);
ESP_LOGD(TAG, "[APP] Enqueued posted data with ret-code %d", ret);
break; Note that this should work with |
Thank you for the feedback. I have no experience with event loops, but if I understand correctly, How to proceed? Will you take care of finalizing it or what would you expect from me? |
Hi @david-cermak again, I just have thought about how I would integrate it in my code. As I'm using C++ as much as possible and to have a more generic Why: What do you think of that approach? |
Thanks! Just wanted to hear the feedback to see if the proposed user-event would help and fix your issues. About the handler's arguments, it uses the standard type, the same as other events: esp-mqtt/include/mqtt_client.h Lines 182 to 207 in f14eeb9
I think we could add another field esp-mqtt/include/mqtt_client.h Lines 56 to 60 in f14eeb9
(downside is that the data ptr is PS: About C++
Have you seen IDF's C++ wrapper? https://github.com/espressif/esp-idf/blob/master/examples/cxx/experimental/esp_mqtt_cxx/tcp/main/mqtt_tcp_example.cpp |
@david-cermak Just want to throw in my vote for separate outbox locks :) Seems like the main difficulty there is handling msg_id properly. What other pitfalls are there? Perhaps I should make a new issue for this feature? For the time being I like the idea of custom user events that could be used to enqueue within the esp-mqtt task. I still need a way to ensure only 1 at a time QoS 1 publishes (and be able to check that before actually calling any enqueue methods), so I think I will manually hack in task signalling for that. But being able to lock/unlock the outbox specifically would be great in the future. |
@someburner As said above, using separate outbox locks would be my preferred choice, so yes, please, create a new issue for this.
Could you please elaborate? What kind of signalling would you need? Are you trying to use QoS 1 messages for Qos 2 purpose? |
@david-cermak Okay will make separate issue for that.
Sure. Essentially I need to guarantee in-order delivery of QoS 1 messages, similar to setting the Right now what I have done is add some methods like this: int outbox_get_qos1_count(outbox_handle_t outbox)
{
int count = 0;
outbox_item_handle_t item;
STAILQ_FOREACH(item, outbox, next) {
if(item->msg_qos == 1) {
count += 1;
}
}
return count;
}
int esp_mqtt_client_get_outbox_qos1_count(esp_mqtt_client_handle_t client)
{
int outbox_size = 1;
if (client == NULL) {
return 1;
}
MQTT_API_LOCK(client);
if (client->outbox) {
outbox_size = outbox_get_qos1_count(client->outbox);
}
MQTT_API_UNLOCK(client);
return outbox_size;
} Also inside I can then call I understand this is a pretty specific use-case, although I do believe it is the correct approach since neither Qos 1 or 2 make any claims about order of delivery. If |
@david-cermak Sorry for the delay until my response.
Yes, it would be great, if that solution could be provided until #231 gets available.
You're right and yes, reusing the
Thanks for the suggestion! I'll take a look on it. |
@someburner Thanks for explaining! If I understand correctly, you're trying to implement some kind of streaming with mqtt messages. Would some platform which inherently supports steaming help in your use-case? For example, I was thinking (for some time already) about adding support for kafka client API. Do you think using apache kafka would be a solution? I think you already thought about alternatives, but let me mention this anyway: |
@michaelgaehwiler Thanks for the feedback. Yes, I think the user-event would be provided before implementing separate locking. |
We already have an architecture in place with thousands of ESP8266's, and that is basically what we do for that using edits to the taunpm library and a custom qos1 outbox implementation. I don't think we have the resources or want to switch to a different library/platform for ingesting data, although it is something we have thought about. I think we would run into the same issue using kafka. Keeping track of in-flight message IDs does work, but I didn't investigate how the message IDs work in this library. E.g. does message ID survive through reconnect and re-send? I'm guessing it does, so that would probably be an option. I forgot that I just went with the "quick and dirty" way of just tracking outbox size. Perhaps message ID would work better, as I could add my own lock to just check for that specific message ID. However it's still a bit of a circular issue if the |
@david-cermak Thanks for the implementation! I've quickly made a code review of the changes in 97503cc, but from my understanding, the user event will be executed after |
Yes, the event is posted after locking the client, but that should not prevent from calling other client's API. Same as the default example publishes in the event handler: It's because the lock used here is a recursive mutex: esp-mqtt/lib/include/mqtt_client_priv.h Lines 45 to 46 in 5688a84
|
Oh, I missed that – thanks for clarification! |
* Update submodule: git log --oneline ae53d799da294f03ef65c33e88fa33648e638134..fde00340f19b9f5ae81fff02ccfa9926f0e33687 Detailed description of the changes: * Fix the default configuration for event queue - See merge request espressif/esp-mqtt!153 - See commit espressif/esp-mqtt@fb42588 * Adds missing header. - See merge request espressif/esp-mqtt!152 - See commit espressif/esp-mqtt@8a60057 * Moves state change when stopping the client - See merge request espressif/esp-mqtt!150 - Closes espressif/esp-mqtt#239 - See commit espressif/esp-mqtt@3738fcd * Adds error code to MQTT_EVENT_SUBSCRIBED in case of failure - See merge request espressif/esp-mqtt!143 - - Closes espressif/esp-mqtt#233 - See commit espressif/esp-mqtt@9af5c26 * Adds debug information on sending dup messages - See merge request espressif/esp-mqtt!145 - See commit espressif/esp-mqtt@47b3f9b * ci: Fix qemu build - See merge request espressif/esp-mqtt!147 - See commit espressif/esp-mqtt@68e8c4f * ci: Build and Test QEMU on v5.0 - See merge request espressif/esp-mqtt!142 - See commit espressif/esp-mqtt@9db9ee7 * client: Add support for user events - See merge request espressif/esp-mqtt!140 - Closes espressif/esp-mqtt#230 - See commit espressif/esp-mqtt@97503cc * Adds unregister event API - See merge request espressif/esp-mqtt!139 - Closes #9194 - See commit espressif/esp-mqtt@a9a9fe7
Also supporting configurable queue size for the internal event loop. Closes espressif#230
Also supporting configurable queue size for the internal event loop. Closes espressif#230
Hi all,
Finding
esp_mqtt_client_enqueue()
can block for the MQTT network timeout time.Expectations
esp_mqtt_client_enqueue()
never blocks, in no circumstance (or at least blocks only for very short amount of time in the range of milliseconds).esp_mqtt_client_enqueue()
is documentedLong Version
From the documentation (
... could be used as a non blocking version of esp_mqtt_client_publish()
), I have the expectation, thatesp_mqtt_client_enqueue()
never blocks (or at least it blocks only very short in the range of milliseconds). I wanted to useesp_mqtt_client_enqueue()
to avoid creating an additional MQTT task (for saving resources), as ESP-MQTT is creating a task already. But the function can block for the time used for connection timeout and I'm using the watchdog with a way shorter timeout for the caller task.The happy path works great, no problem. I came across the bug when doing some smoke tests, as I must expect an unreliable internet connection and I must also expect an unreliable MQTT server. If the MQTT server is no more available (e.g. unplug the LAN port of the server to simulate an ungraceful shutdown), the network communication hangs for the timeout (which is by default 10 seconds). If I call
esp_mqtt_client_enqueue()
in this time, the function will block the caller until the network connection timeout is over.The same happens if the MQTT server is not available from the start of the ESP32 program (like pointing the client to e.g.
ws://8.8.8.8:80
) and callingesp_mqtt_client_enqueue()
regardless of the connection state of ESP-MQTT, the caller will be blocked until the network connection timeout triggers.I came across this problem, because my device rebooted, because of a trigger of the watchdog. The caller of
esp_mqtt_client_enqueue()
is a task with default watchdog of 5 seconds. It handles also the GUI, which now is from time to time unresponsive. I know, I could create a separate FreeRTOS task as a workaround for this case, but it is ridiculous to have an additional task when ESP-MQTT uses already a dedicated task...Details
I'm sorry to not provide an example that reproduces the problem, as it would need a system of multiple hosts and manually unplugging the LAN connection on the server.
The dependency to network timeouts arise in the use of
MQTT_API_LOCK() / MQTT_API_UNLOCK()
, as the same lock is used also in the functionesp_mqtt_task()
in which the lock can be acquired with afterwards calling blocking functions.From my understanding, the lock should be acquired only very shortly and never while calling blocking functions. Or if that is needed, the lock for
esp_mqtt_task()
should be separated from the lock used byesp_mqtt_client_enqueue()
. Or a buffer is used, which never blocks the writer for putting messages in the queue.Question
Can you confirm this bug? Do you need any additional information?
Thanks in advance
The text was updated successfully, but these errors were encountered: