Opened 4 years ago

Closed 3 years ago

#5608 closed defect (fixed)

After upgrade to 0.10.0 restarting marks all items as "failed"

Reported by: fergtm@… Owned by:
Priority: major Milestone: 0.12.0
Component: programming Keywords:
Cc: Olivier Mehani Parent Tickets:

Description

Fergtm reports that after upgrading to 0.10.0, any subsequent restart marks all media items as "failed" processing.

Seems similar to #5604.

Attachments (5)

mediagoblin.ini (2.3 KB ) - added by Ben Sturmfels 4 years ago.
ls-l.txt (7.0 KB ) - added by Ben Sturmfels 4 years ago.
processing_panel_screenshot.jpg (163.1 KB ) - added by Ben Sturmfels 4 years ago.
rabbitmq_status.png (56.6 KB ) - added by Fernando Gutierrez 4 years ago.
RabbitMQ Management UI Screenshot
celery_queue_connections.PNG (14.2 KB ) - added by Fernando Gutierrez 4 years ago.
Celery RabbitMQ connections


Change History (22)

by Ben Sturmfels, 4 years ago

Attachment: mediagoblin.ini added

by Ben Sturmfels, 4 years ago

Attachment: ls-l.txt added

by Ben Sturmfels, 4 years ago

Attachment: processing_panel_screenshot.jpg added

comment:1 by Fernando Gutierrez, 4 years ago

Before restarting the server I checked that the processing queue was empty:

mediagoblin@opochtli:/srv/mediagoblin/media.nil.mx/mediagoblin$ export MEDIAGOBLIN_CONFIG=/srv/mediagoblin/media.nil.mx/mediagoblin/mediagoblin_local.ini
mediagoblin@opochtli:/srv/mediagoblin/media.nil.mx/mediagoblin$ export CELERY_CONFIG_MODULE=mediagoblin.init.celery.from_celery
mediagoblin@opochtli:/srv/mediagoblin/media.nil.mx/mediagoblin$ ./bin/celery inspect active
-> celery@opochtli: OK
    - empty -

I could already see the images listed on the page, and after the restart they disappeared.

comment:2 by Fernando Gutierrez, 4 years ago

Looking into the database entry I see

fail_error
Unhandled exception: This entry is not eligible for processor with name 'initial'

comment:3 by Fernando Gutierrez, 4 years ago

After uploading the image, it shows as completed in the media processing panel.

The corresponding row in the core_media_entries table is:

id                           45
public_id                    http://[redacted]/api/image/220a24e4-cfaf-4ebf-874d-2fad015684d7/
remote                       0
actor                        1
title                        [redacted]
slug                         [redacted]
description                  [redacted]
media_type                   mediagoblin.media_types.image
state                        processed
license                      
file_size                    1.95
location                     18
created                      04:00.2
updated                      04:00.2
fail_error                   
fail_metadata                
transcoding_progress         0
queued_media_file            
queued_task_id               e1d84031-5da6-4b43-bfff-8b2e32aef25c
media_metadata               {}
main_transcoding_progress    0

and in celery_taskmeta:

id           58
task_id      e1d84031-5da6-4b43-bfff-8b2e32aef25c
status       SUCCESS
result       
date_done    04:00.9
traceback   


After a restart status changes to failed in both tables with error set to "Unhandled exception: This entry is not eligible for processor with name 'initial'"

comment:4 by Fernando Gutierrez, 4 years ago

From looking at the Celery logs, the task "e1d84031-5da6-4b43-bfff-8b2e32aef25c" was submitted again after the server restarted.

This second time it fails with "Unhandled exception: This entry is not eligible for processor with name 'initial'" because the entry was already processed and its state is neither "unprocessed" nor "failed".

May  9 00:04:00 opochtli celery[585]: 2020-05-09 00:04:00,330 INFO    [celery.worker.strategy] Received task: process_media[e1d84031-5da6-4b43-bfff-8b2e32aef25c]
May  9 01:08:52 opochtli systemd[1]: Stopping MediaGoblin Celeryd...
May  9 01:10:28 opochtli systemd[1]: Started MediaGoblin Celeryd.
May  9 01:10:50 opochtli celery[590]:  
May  9 01:10:50 opochtli celery[590]:  -------------- celery@opochtli v4.2.2 (windowlicker)
May  9 01:10:50 opochtli celery[590]: ---- **** -----
May  9 01:10:50 opochtli celery[590]: --- * ***  * -- Linux-5.4.0-29-generic-x86_64-with-glibc2.29 2020-05-09 01:10:50
May  9 01:10:50 opochtli celery[590]: -- * - **** ---
May  9 01:10:50 opochtli celery[590]: - ** ---------- [config]
May  9 01:10:50 opochtli celery[590]: - ** ---------- .> app:         default:0x7f620a81a130 (.default.Loader)
May  9 01:10:50 opochtli celery[590]: - ** ---------- .> transport:   amqp://guest:**@localhost:5672//
May  9 01:10:50 opochtli celery[590]: - ** ---------- .> results:     sqlite:////srv/mediagoblin/media.nil.mx/mediagoblin/celery.db
May  9 01:10:50 opochtli celery[590]: - *** --- * --- .> concurrency: 1 (prefork)
May  9 01:10:50 opochtli celery[590]: -- ******* ---- .> task events: OFF (enable -E to monitor tasks in this worker)
May  9 01:10:50 opochtli celery[590]: --- ***** -----
May  9 01:10:50 opochtli celery[590]:  -------------- [queues]
May  9 01:10:50 opochtli celery[590]:                 .> default          exchange=default(direct) key=default
May  9 01:10:50 opochtli celery[590]:                 
May  9 01:10:50 opochtli celery[590]: [tasks]
May  9 01:10:50 opochtli celery[590]:   . mediagoblin.media_types.video.processing.complementary_task
May  9 01:10:50 opochtli celery[590]:   . mediagoblin.media_types.video.processing.main_task
May  9 01:10:50 opochtli celery[590]:   . mediagoblin.media_types.video.processing.processing_cleanup
May  9 01:10:50 opochtli celery[590]:   . mediagoblin.notifications.task.EmailNotificationTask
May  9 01:10:50 opochtli celery[590]:   . mediagoblin.processing.task.handle_push_urls
May  9 01:10:50 opochtli celery[590]:   . mediagoblin.submit.task.collect_garbage
May  9 01:10:50 opochtli celery[590]:   . process_media
May  9 01:10:51 opochtli celery[590]: 2020-05-09 01:10:51,467 ERROR   [celery.worker.consumer.consumer] consumer: Cannot connect to amqp://guest:**@127.0.0.1:5672//: [Errno 111] Connection refused.
May  9 01:10:51 opochtli celery[590]: Trying again in 2.00 seconds...
May  9 01:10:53 opochtli celery[590]: 2020-05-09 01:10:53,510 ERROR   [celery.worker.consumer.consumer] consumer: Cannot connect to amqp://guest:**@127.0.0.1:5672//: [Errno 111] Connection refused.
May  9 01:10:53 opochtli celery[590]: Trying again in 4.00 seconds...
May  9 01:10:57 opochtli celery[590]: 2020-05-09 01:10:57,535 ERROR   [celery.worker.consumer.consumer] consumer: Cannot connect to amqp://guest:**@127.0.0.1:5672//: [Errno 111] Connection refused.
May  9 01:10:57 opochtli celery[590]: Trying again in 6.00 seconds...
May  9 01:11:03 opochtli celery[590]: 2020-05-09 01:11:03,991 INFO    [celery.worker.consumer.connection] Connected to amqp://guest:**@127.0.0.1:5672//
May  9 01:11:04 opochtli celery[590]: 2020-05-09 01:11:04,015 INFO    [celery.worker.consumer.mingle] mingle: searching for neighbors
May  9 01:11:05 opochtli celery[590]: 2020-05-09 01:11:05,051 INFO    [celery.worker.consumer.mingle] mingle: all alone
May  9 01:11:05 opochtli celery[590]: 2020-05-09 01:11:05,072 INFO    [celery.worker.strategy] Received task: process_media[e1d84031-5da6-4b43-bfff-8b2e32aef25c]

That is all I can debug. I'm not familiar with Celery, so I don't understand how the app interacts with it or how it decides whether a task needs to be resubmitted.

Does it have something to do with main_transcoding_progress ending up with 0 the first time the task was run?

comment:5 by Özcan Oğuz, 4 years ago

Hello,

I worked around this problem by commenting out these lines in mediagoblin/processing/__init__.py:

#        if entry and not processor.media_is_eligible(entry):
#            raise ProcessorNotEligible(
#                "This entry is not eligible for processor with name '%s'" % key)

Then, after restarting mediagoblin-celeryd.service, media items are no longer marked as failed.

comment:6 by Fernando Gutierrez, 4 years ago

I think commenting out the eligibility check will just cause media to be reprocessed on every restart.

It avoids the error, but unless the transcoder checks whether the thumbnails already exist, it will waste CPU time processing them again.

by Fernando Gutierrez, 4 years ago

Attachment: rabbitmq_status.png added

RabbitMQ Management UI Screenshot

by Fernando Gutierrez, 4 years ago

Attachment: celery_queue_connections.PNG added

Celery RabbitMQ connections

comment:7 by Fernando Gutierrez, 4 years ago

I found out why the processing job is re-submitted: the message stays unacknowledged in RabbitMQ. See the two attached screenshots, rabbitmq_status.png and celery_queue_connections.PNG.

I mentioned before that videos survive a restart but images don't, so I ran a test:

1) Connected to RabbitMQ management UI and made sure that the message queue was empty.

2) Uploaded a new image and looked at RabbitMQ again; there was 1 unacknowledged message.

3) Uploaded a video and looked at the queue again; there were 4 unacknowledged messages (the previous image upload plus 3 video transcoding tasks).

4) Waited for the media processing panel to show all processing complete, then connected to the RabbitMQ management UI again. There was still 1 unacknowledged message.
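For anyone reproducing this without the management UI, the same ready/unacknowledged counts can be read from the command line on the broker host (a usage sketch; assumes rabbitmqctl is installed and the queues live in the default vhost):

```shell
# List each queue with its ready and unacknowledged message counts
sudo rabbitmqctl list_queues name messages_ready messages_unacknowledged
```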

comment:8 by Fernando Gutierrez, 4 years ago

Images do not use a workflow, just a single task. Videos, which do use a workflow, work correctly.

The only difference in how the tasks are scheduled is in mediagoblin/mediagoblin/submit/lib.py:

wf = manager.workflow(entry, feed_url, reprocess_action, reprocess_info)
if wf is None:
    ProcessMedia().apply_async(
        [entry.id, feed_url, reprocess_action, reprocess_info], {},
        task_id=entry.queued_task_id)
else:
    chord(wf[0])(wf[1])

I don't know why apply_async causes the task to stay unacknowledged, while tasks run as part of a chord are correctly tracked. Maybe a bug in Celery?

Notice that the Celery documentation warns against leaking task results. Both apply_async and chord return an AsyncResult instance that is ignored here. The documentation says: you must eventually call get() or forget() on EVERY AsyncResult instance returned after calling a task.
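The leak the documentation warns about can be illustrated with a toy model (plain Python, not Celery; the class names only mimic Celery's): a result backend keeps one entry per task until the caller fetches or forgets it, so dropping the returned handle leaves entries behind.

```python
# Toy model of a Celery-style result backend (illustration only, not Celery code).
# Each dispatched task stores its result; the entry persists until the caller
# retrieves it with get() or discards it with forget().

class ResultBackend:
    def __init__(self):
        self.results = {}

    def store(self, task_id, value):
        self.results[task_id] = value

class AsyncResult:
    def __init__(self, backend, task_id):
        self.backend = backend
        self.task_id = task_id

    def get(self):
        return self.backend.results[self.task_id]

    def forget(self):
        self.backend.results.pop(self.task_id, None)

backend = ResultBackend()

# Dispatch three "tasks" but keep only the last handle, as the code above does.
for task_id in ("a", "b", "c"):
    backend.store(task_id, "done")
    handle = AsyncResult(backend, task_id)

handle.forget()               # clean up the last task only
print(len(backend.results))   # → 2: the two ignored results are still held
```

This is only the bookkeeping side of the story; whether the unfetched result also keeps the AMQP message unacknowledged is exactly the open question here.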

comment:9 by Olivier Mehani, 4 years ago

Cc: Olivier Mehani added

comment:10 by Ben Sturmfels, 3 years ago

Hi Fernando,

Can you tell me whether you are running Postgresql or Sqlite3 as the database please? I saw some similar suspicious error messages while doing some testing with Sqlite3, but I haven't run into any issues on the sites I run that use Postgresql. I'm interested to know if that's a coincidence or not.

Regards,
Ben

comment:11 by Fernando Gutierrez, 3 years ago

Hi Ben,

I'm using sqlite.

I previously used Postgres but later switched to SQLite. Unfortunately, that was around the same time I upgraded to 0.10.0, so I don't know whether it's a coincidence or something new in 0.10.0.

comment:12 by Ben Sturmfels, 3 years ago

Milestone: 0.12.0

comment:13 by Ben Sturmfels, 3 years ago

Thanks for your investigation on this issue Fernando. I think this is an important one to fix before the next release.

I've had a brief look at it just now and have been able to replicate it reliably on both PostgreSQL and SQLite databases. The issue only occurs when running a separate Celery instance. I'm currently using Redis as my Celery broker as I don't have RabbitMQ in Guix System.

  1. Create a fresh database.
  2. Run bin/gmg dbupdate and bin/gmg adduser --username admin --password a --email admin@example.com
  3. Run CELERY_ALWAYS_EAGER=false paster serve paste.ini --reload in one terminal.
  4. Run CELERY_CONFIG_MODULE=mediagoblin.init.celery.from_celery bin/python -m celery worker --loglevel=INFO in another terminal.
  5. Log in to the web interface and upload an image.
  6. Stop the Celery process (shows "restoring 1 unacknowledged message")
  7. Start the Celery process (shows "Received task: process_media")
  8. Refresh the web interface, noting 1 failed upload and that clicking on the existing thumbnail gives a 404.

As Fernando mentioned above, there's no issue with uploading videos, only images.

comment:14 by Ben Sturmfels, 3 years ago

In the spirit of trying the easiest thing first, I tested a few versions of Celery up to 4.4.7 to see if this issue was related to a fixed Celery bug. No change, which suggests the bug is in our code.

comment:15 by Ben Sturmfels, 3 years ago

I have a solution for this issue. Fernando, would you mind testing to confirm this fixes for you too?

As part of the multi-resolution video transcoding work merged into v0.10.0, the following Celery setting was added to mediagoblin/config_spec.ini:

CELERY_ACKS_LATE = boolean(default=True)

The issue is resolved if we change this setting to False, either in config_spec.ini or mediagoblin.ini.

Before this setting was added to MediaGoblin, Celery tasks were acknowledged by worker processes as soon as they had been taken off the queue. When True, this setting makes workers acknowledge tasks only after they have been completed. The unsatisfying part is that tasks should eventually be acknowledged either way, and I don't yet understand why they're not, even after apparently successful processing.

In any case, this fixes the issue and I've also tested to make sure there's no negative impact on video processing. This fix probably deserves a minor release in the near future.
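The failure mode can be sketched with a toy model (plain Python, not Celery or RabbitMQ; simplified AMQP-like semantics): the broker redelivers any unacknowledged message when the consumer reconnects, so if the late ack never reaches the broker, a restart replays the already-processed task, which then trips the eligibility check.

```python
# Toy model of broker redelivery around a worker restart (illustration only;
# names and semantics are simplified, not actual RabbitMQ/Celery behaviour).

class Broker:
    def __init__(self):
        self.queue = []          # messages waiting for delivery
        self.unacked = []        # delivered but not yet acknowledged

    def publish(self, msg):
        self.queue.append(msg)

    def deliver(self):
        msg = self.queue.pop(0)
        self.unacked.append(msg)
        return msg

    def ack(self, msg):
        self.unacked.remove(msg)

    def restore_unacked(self):   # what happens when the consumer disconnects
        self.queue = self.unacked + self.queue
        self.unacked = []

def run_worker(broker, state, acks_late):
    msg = broker.deliver()
    if not acks_late:
        broker.ack(msg)          # early ack: acknowledged before processing
    if state[msg] in ("unprocessed", "failed"):
        state[msg] = "processed"
    else:
        state[msg] = "failed"    # "not eligible for processor 'initial'"
    # With acks_late=True we assume the final ack is lost, as observed here.

broker = Broker()
state = {"image-task": "unprocessed"}
broker.publish("image-task")

run_worker(broker, state, acks_late=True)
broker.restore_unacked()         # Celery restart: unacked message redelivered
if broker.queue:
    run_worker(broker, state, acks_late=True)

print(state["image-task"])       # → failed: the replayed task is ineligible
```

With acks_late=False the message is acknowledged before processing, nothing is restored on restart, and the entry stays "processed", matching the observed fix.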

comment:16 by Fernando Gutierrez, 3 years ago

I just tested this and it fixed the problem! I had to comment out CELERY_ACKS_LATE = boolean(default=True) in config_spec.ini; I first tried adding CELERY_ACKS_LATE = False to mediagoblin.ini, but that didn't work.
I ran ./bin/gmg dbupdate after each edit; I'm not sure whether that was necessary.

comment:17 by Ben Sturmfels, 3 years ago

Resolution: fixed
Status: new → closed

Wonderful, applied to master branch and marking as fixed. This will be the default in our upcoming 0.12.0 release.
