Opened 5 years ago
Closed 3 years ago
#5608 closed defect (fixed)
After upgrade to 0.10.0 restarting marks all items as "failed"
Priority: major
Milestone: 0.12.0
Component: programming
Cc: Olivier Mehani
Description
Fergtm reports that after upgrading to 0.10.0, any subsequent restart marks all media items as "failed" processing.
Seems similar to #5604.
Attachments (5)
Change History (22)
by , 5 years ago
Attachment: mediagoblin.ini added
by , 5 years ago
by , 5 years ago
Attachment: processing_panel_screenshot.jpg added
comment:1 by , 5 years ago
comment:2 by , 5 years ago
Looking into the database entry, I see:
fail_error: Unhandled exception: This entry is not eligible for processor with name 'initial'
comment:3 by , 5 years ago
After uploading the image, and after it shows as completed in the media processing panel, the corresponding row in the core_media_entries table is:

id: 45
public_id: http://[redacted]/api/image/220a24e4-cfaf-4ebf-874d-2fad015684d7/
remote: 0
actor: 1
title: [redacted]
slug: [redacted]
description: [redacted]
media_type: mediagoblin.media_types.image
state: processed
license:
file_size: 1.95
location: 18
created: 04:00.2
updated: 04:00.2
fail_error:
fail_metadata:
transcoding_progress: 0
queued_media_file:
queued_task_id: e1d84031-5da6-4b43-bfff-8b2e32aef25c
media_metadata: {}
main_transcoding_progress: 0
and in celery_taskmeta:

id: 58
task_id: e1d84031-5da6-4b43-bfff-8b2e32aef25c
status: SUCCESS
result:
date_done: 04:00.9
traceback:
After a restart, the status changes to "failed" in both tables, with fail_error set to "Unhandled exception: This entry is not eligible for processor with name 'initial'".
comment:4 by , 5 years ago
From looking at the Celery logs, the task "e1d84031-5da6-4b43-bfff-8b2e32aef25c" was submitted again after the server restarted.
This second time it fails with "Unhandled exception: This entry is not eligible for processor with name 'initial'" because the entry was already processed and its state is not "unprocessed" or "failed".
May 9 00:04:00 opochtli celery[585]: 2020-05-09 00:04:00,330 INFO [celery.worker.strategy] Received task: process_media[e1d84031-5da6-4b43-bfff-8b2e32aef25c]
May 9 01:08:52 opochtli systemd[1]: Stopping MediaGoblin Celeryd...
May 9 01:10:28 opochtli systemd[1]: Started MediaGoblin Celeryd.
May 9 01:10:50 opochtli celery[590]:
May 9 01:10:50 opochtli celery[590]: -------------- celery@opochtli v4.2.2 (windowlicker)
May 9 01:10:50 opochtli celery[590]: ---- **** -----
May 9 01:10:50 opochtli celery[590]: --- * *** * -- Linux-5.4.0-29-generic-x86_64-with-glibc2.29 2020-05-09 01:10:50
May 9 01:10:50 opochtli celery[590]: -- * - **** ---
May 9 01:10:50 opochtli celery[590]: - ** ---------- [config]
May 9 01:10:50 opochtli celery[590]: - ** ---------- .> app: default:0x7f620a81a130 (.default.Loader)
May 9 01:10:50 opochtli celery[590]: - ** ---------- .> transport: amqp://guest:**@localhost:5672//
May 9 01:10:50 opochtli celery[590]: - ** ---------- .> results: sqlite:////srv/mediagoblin/media.nil.mx/mediagoblin/celery.db
May 9 01:10:50 opochtli celery[590]: - *** --- * --- .> concurrency: 1 (prefork)
May 9 01:10:50 opochtli celery[590]: -- ******* ---- .> task events: OFF (enable -E to monitor tasks in this worker)
May 9 01:10:50 opochtli celery[590]: --- ***** -----
May 9 01:10:50 opochtli celery[590]: -------------- [queues]
May 9 01:10:50 opochtli celery[590]: .> default exchange=default(direct) key=default
May 9 01:10:50 opochtli celery[590]:
May 9 01:10:50 opochtli celery[590]: [tasks]
May 9 01:10:50 opochtli celery[590]: . mediagoblin.media_types.video.processing.complementary_task
May 9 01:10:50 opochtli celery[590]: . mediagoblin.media_types.video.processing.main_task
May 9 01:10:50 opochtli celery[590]: . mediagoblin.media_types.video.processing.processing_cleanup
May 9 01:10:50 opochtli celery[590]: . mediagoblin.notifications.task.EmailNotificationTask
May 9 01:10:50 opochtli celery[590]: . mediagoblin.processing.task.handle_push_urls
May 9 01:10:50 opochtli celery[590]: . mediagoblin.submit.task.collect_garbage
May 9 01:10:50 opochtli celery[590]: . process_media
May 9 01:10:51 opochtli celery[590]: 2020-05-09 01:10:51,467 ERROR [celery.worker.consumer.consumer] consumer: Cannot connect to amqp://guest:**@127.0.0.1:5672//: [Errno 111] Connection refused.
May 9 01:10:51 opochtli celery[590]: Trying again in 2.00 seconds...
May 9 01:10:53 opochtli celery[590]: 2020-05-09 01:10:53,510 ERROR [celery.worker.consumer.consumer] consumer: Cannot connect to amqp://guest:**@127.0.0.1:5672//: [Errno 111] Connection refused.
May 9 01:10:53 opochtli celery[590]: Trying again in 4.00 seconds...
May 9 01:10:57 opochtli celery[590]: 2020-05-09 01:10:57,535 ERROR [celery.worker.consumer.consumer] consumer: Cannot connect to amqp://guest:**@127.0.0.1:5672//: [Errno 111] Connection refused.
May 9 01:10:57 opochtli celery[590]: Trying again in 6.00 seconds...
May 9 01:11:03 opochtli celery[590]: 2020-05-09 01:11:03,991 INFO [celery.worker.consumer.connection] Connected to amqp://guest:**@127.0.0.1:5672//
May 9 01:11:04 opochtli celery[590]: 2020-05-09 01:11:04,015 INFO [celery.worker.consumer.mingle] mingle: searching for neighbors
May 9 01:11:05 opochtli celery[590]: 2020-05-09 01:11:05,051 INFO [celery.worker.consumer.mingle] mingle: all alone
May 9 01:11:05 opochtli celery[590]: 2020-05-09 01:11:05,072 INFO [celery.worker.strategy] Received task: process_media[e1d84031-5da6-4b43-bfff-8b2e32aef25c]
That is all I can debug. I'm not familiar with Celery, so I don't understand how the app interacts with it or how it knows whether a task needs to be resubmitted.
Does it have something to do with main_transcoding_progress ending up with 0 the first time the task was run?
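The state check that produces this error can be sketched like this (a minimal illustration pieced together from the error message and state names in this thread, not the actual MediaGoblin source):

```python
# Sketch of the eligibility check implied by the error above. The class
# and function names are assumptions, not the exact MediaGoblin code.
class MediaEntry:
    def __init__(self, state):
        self.state = state

def media_is_eligible(entry):
    # The "initial" processor only accepts entries that still need work.
    return entry.state in ('unprocessed', 'failed')

# First run: the queued entry is eligible and gets processed.
entry = MediaEntry('unprocessed')
assert media_is_eligible(entry)
entry.state = 'processed'

# After a restart the same task is redelivered, but the entry is now
# 'processed', so the check fails with "not eligible for processor 'initial'".
assert not media_is_eligible(entry)
```

This matches the observed behaviour: the redelivered task is rejected purely because the first run already flipped the state to "processed".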
comment:5 by , 5 years ago
Hello,
I fixed this problem by commenting out these lines in mediagoblin/processing/__init__.py:

# if entry and not processor.media_is_eligible(entry):
#     raise ProcessorNotEligible(
#         "This entry is not eligible for processor with name '%s'" % key)

Then, after restarting mediagoblin-celeryd.service, media are no longer marked as failed.
comment:6 by , 5 years ago
I think commenting out the eligibility check will just cause media to be reprocessed at every restart.
It fixes the error, but unless the transcoder checks whether the thumbnails already exist, it will waste CPU time processing them again.
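The guard suggested above could look something like this (a hypothetical sketch; maybe_create_thumbnail and the generate callback are invented names, not MediaGoblin API):

```python
import os

def maybe_create_thumbnail(source_path, thumb_path, generate):
    # Hypothetical guard: skip regeneration when the thumbnail already
    # exists, so a resubmitted task does not redo the expensive work.
    if os.path.exists(thumb_path):
        return False  # nothing to do, thumbnail already present
    generate(source_path, thumb_path)
    return True
```

With a check like this, disabling the eligibility check would at worst re-run a cheap existence test rather than a full transcode.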
comment:7 by , 5 years ago
I found why the processing job is re-submitted: the message stays unacknowledged in RabbitMQ. See the two attached screenshots: rabbitmq_status.png and celery_queue_connections.PNG.
I mentioned before that videos survive a restart but images don't, so I ran a test:
1) Connected to the RabbitMQ management UI and made sure that the message queue was empty.
2) Uploaded a new image and looked again at RabbitMQ; there was 1 unacknowledged message.
3) Uploaded a video and again looked at the queue; there were 4 unacknowledged messages (the previous image upload + 3 video transcoding tasks).
4) Waited for the media processing panel to show all processing complete and connected again to the RabbitMQ management UI. There was still 1 unacknowledged message.
comment:8 by , 5 years ago
Images do not use a workflow, just a single task. Videos work correctly, and they use a workflow.
The only difference in how the tasks are scheduled is in mediagoblin/mediagoblin/submit/lib.py:

wf = manager.workflow(entry, feed_url, reprocess_action, reprocess_info)
if wf is None:
    ProcessMedia().apply_async(
        [entry.id, feed_url, reprocess_action, reprocess_info], {},
        task_id=entry.queued_task_id)
else:
    chord(wf[0])(wf[1])
I don't know why apply_async causes the task to stay unacknowledged while tasks run as part of a chord are correctly tracked. Maybe a bug in Celery?
Notice that the Celery documentation warns against leaking the task result. Both apply_async and chord return an AsyncResult instance that is ignored here. The documentation says: you must eventually call get() or forget() on EVERY AsyncResult instance returned after calling a task.
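The contract quoted above can be illustrated with a stand-in class (this is not the real celery.result.AsyncResult, just a sketch of the pattern the documentation asks for):

```python
class FakeAsyncResult:
    """Stand-in for celery.result.AsyncResult, used only to illustrate
    the get()/forget() contract quoted above; not the real Celery class."""
    def __init__(self):
        self.released = False

    def forget(self):
        # Asks the result backend to drop the stored result so nothing
        # is leaked for fire-and-forget tasks.
        self.released = True

def submit_fire_and_forget(apply_async):
    # apply_async() returns an AsyncResult; ignoring it (as the snippet
    # above from submit/lib.py does) leaks it. For tasks whose result is
    # never read, the documentation says to call forget() on it.
    result = apply_async()
    result.forget()
    return result
```

Whether the unacknowledged message here is actually caused by the leaked AsyncResult is unconfirmed; this only shows the pattern the documentation recommends.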
comment:9 by , 5 years ago
Cc: added
comment:10 by , 4 years ago
Hi Fernando,
Can you tell me whether you are running Postgresql or Sqlite3 as the database please? I saw some similar suspicious error messages while doing some testing with Sqlite3, but I haven't run into any issues on the sites I run that use Postgresql. I'm interested to know if that's a coincidence or not.
Regards,
Ben
comment:11 by , 4 years ago
Hi Ben,
I'm using SQLite.
I previously used Postgres but later switched to SQLite. Unfortunately, that was around the same time I upgraded to 0.10.0, so I don't know whether it's a coincidence or something new in 0.10.0.
comment:12 by , 3 years ago
Milestone: → 0.12.0
comment:13 by , 3 years ago
Thanks for your investigation on this issue Fernando. I think this is an important one to fix before the next release.
I've had a brief look at it just now and have been able to replicate it reliably on both PostgreSQL and SQLite databases. The issue only occurs when running a separate Celery instance. I'm currently using Redis as my Celery broker as I don't have RabbitMQ in Guix System.
- Create a fresh database.
- Run bin/gmg dbupdate and bin/gmg adduser --username admin --password a --email admin@example.com.
- Run CELERY_ALWAYS_EAGER=false paster serve paste.ini --reload in one terminal.
- Run CELERY_CONFIG_MODULE=mediagoblin.init.celery.from_celery bin/python -m celery worker --loglevel=INFO in another terminal.
- Log in to the web interface and upload an image.
- Stop the Celery process (shows "restoring 1 unacknowledged message").
- Start the Celery process (shows "Received task: process_media").
- Refresh the web interface, noting 1 failed upload, and that clicking on the existing thumbnail shows a 404.
As Fernando mentioned above, there's no issue with uploading videos, only images.
comment:14 by , 3 years ago
In the spirit of trying the easiest thing first, I tested a few versions of Celery up to 4.4.7 to see if this issue was related to a fixed Celery bug. No change, which suggests the bug is in our code.
comment:15 by , 3 years ago
I have a solution for this issue. Fernando, would you mind testing to confirm this fixes for you too?
As part of the multi-resolution video transcoding work merged into v0.10.0, the following Celery setting was added to mediagoblin/config_spec.ini:

CELERY_ACKS_LATE = boolean(default=True)

The issue is resolved if we change this setting to False, either in config_spec.ini or in mediagoblin.ini.
Before this setting was added to MediaGoblin, Celery tasks were acknowledged by worker processes as soon as they had been taken off the queue. When True, this setting makes workers acknowledge tasks only after they have completed. The unsatisfying part is that tasks should eventually be acknowledged either way, and I don't yet understand why they're not being acknowledged even after apparently successful processing.
In any case, this fixes the issue and I've also tested to make sure there's no negative impact on video processing. This fix probably deserves a minor release in the near future.
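Concretely, the fix amounts to flipping the default in the config spec. A sketch of the one-line edit (the surrounding section of config_spec.ini is not shown in this thread, so only the changed line is given):

```ini
# mediagoblin/config_spec.ini -- acknowledge tasks as soon as a worker
# takes them off the queue, instead of only after completion:
CELERY_ACKS_LATE = boolean(default=False)
```

Individual deployments should be able to get the same effect by overriding the value in their own configuration, as discussed in the following comments.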
comment:16 by , 3 years ago
I just tested this and it fixed the problem! I had to comment out CELERY_ACKS_LATE = boolean(default=True) in config_spec.ini. I first tried adding CELERY_ACKS_LATE = False to mediagoblin.ini, but that didn't work.
I ran ./bin/gmg dbupdate after each edit; I'm not sure whether that was necessary.
comment:17 by , 3 years ago
Resolution: → fixed
Status: new → closed
Wonderful, applied to master branch and marking as fixed. This will be the default in our upcoming 0.12.0 release.
Before restarting the server, I checked that the processing queue was empty.
I could already see the images listed on the page, and after the restart they disappeared.