Opened 4 months ago

Closed 4 months ago

#5609 closed defect (fixed)

Celery connection to RabbitMQ errors at start or after running for a few minutes

Reported by: Fernando Gutierrez
Owned by:
Priority: major
Milestone:
Component: programming
Keywords:
Cc:
Parent Tickets:

Description

When Celery is starting, or after it has run for a few minutes, I get many "connection reset" errors in the log.

After these errors Celery does not recover: it appears to be running but does not process any tasks.

There are two root causes for this problem:

1) The systemd service file provided as an example in the deployment guide does not specify that Celery must be started after RabbitMQ. The OS is free to start them in any order, and when Celery starts before RabbitMQ it cannot connect.

2) In mediagoblin/init/celery/init.py, BROKER_HEARTBEAT is set to 1 second. On slower machines or under heavy load, this setting causes missed heartbeats and an eventual connection reset.
I think this code was intended to make task scheduling more granular, but from what I understand of the Celery docs, BROKER_HEARTBEAT has nothing to do with task scheduling. CELERYBEAT_SCHEDULE only defines periodic tasks and is unrelated to BROKER_HEARTBEAT.
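
To illustrate the second point, here is a minimal sketch of how the two settings can be kept independent. It is not the actual MediaGoblin code or the attached patch: the helper `build_celery_settings`, the task name, and the 5-minute interval are all hypothetical, and only the setting names BROKER_HEARTBEAT and CELERYBEAT_SCHEDULE come from this ticket.

```python
from datetime import timedelta


def build_celery_settings(broker_heartbeat=None):
    """Return a dict of old-style (uppercase) Celery settings.

    Hypothetical helper for illustration only.
    """
    settings = {
        # Periodic task schedule: controls how often the task is queued.
        # This is the only place where task timing is configured.
        "CELERYBEAT_SCHEDULE": {
            "example-periodic-task": {
                "task": "example.tasks.periodic",  # hypothetical task name
                "schedule": timedelta(minutes=5),  # assumed interval
            },
        },
    }
    # Override the broker heartbeat only when explicitly requested;
    # otherwise the key is left out so the library default applies.
    if broker_heartbeat is not None:
        settings["BROKER_HEARTBEAT"] = broker_heartbeat
    return settings


settings = build_celery_settings()
```

With no argument the heartbeat key is simply absent, so a very short interval such as 1 second is never forced on slow or heavily loaded machines, while the periodic task schedule is unaffected.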

Attachments (1)

fix_celery_connections.diff (897 bytes) - added by Fernando Gutierrez 4 months ago.
Proposed fix


Change History (2)

Changed 4 months ago by Fernando Gutierrez

Attachment: fix_celery_connections.diff added

Proposed fix

comment:1 Changed 4 months ago by Ben Sturmfels

Resolution: fixed
Status: new → closed

Thanks very much Fernando, merged!
http://git.savannah.gnu.org/cgit/mediagoblin.git/commit/?id=243354b65e1c2793f12d01d8174e9a168eb01ecd

We really appreciate the work you've done here to investigate this issue.
