Opened 4 weeks ago

Last modified 5 days ago

#5608 new defect

After upgrade to 0.10.0 restarting marks all items as "failed"

Reported by: fergtm@… Owned by:
Priority: major Milestone:
Component: programming Keywords:
Cc: Olivier Mehani Parent Tickets:

Description

Fergtm reports that after upgrading to 0.10.0, any subsequent restart marks all media items as "failed" processing.

Seems similar to #5604.

Subtickets

Attachments (5)

mediagoblin.ini (2.3 KB) - added by Ben Sturmfels 4 weeks ago.
ls-l.txt (7.0 KB) - added by Ben Sturmfels 4 weeks ago.
processing_panel_screenshot.jpg (163.1 KB) - added by Ben Sturmfels 4 weeks ago.
rabbitmq_status.png (56.6 KB) - added by Fernando Gutierrez 6 days ago.
RabbitMQ Management UI Screenshot
celery_queue_connections.PNG (14.2 KB) - added by Fernando Gutierrez 6 days ago.
Celery RabbitMQ connections

Download all attachments as: .zip

Change History (14)

Changed 4 weeks ago by Ben Sturmfels

Attachment: mediagoblin.ini added

Changed 4 weeks ago by Ben Sturmfels

Attachment: ls-l.txt added

Changed 4 weeks ago by Ben Sturmfels

comment:1 Changed 4 weeks ago by Fernando Gutierrez

Before restarting the server I checked that the processing queue was empty:

mediagoblin@opochtli:/srv/mediagoblin/media.nil.mx/mediagoblin$ export MEDIAGOBLIN_CONFIG=/srv/mediagoblin/media.nil.mx/mediagoblin/mediagoblin_local.ini
mediagoblin@opochtli:/srv/mediagoblin/media.nil.mx/mediagoblin$ export CELERY_CONFIG_MODULE=mediagoblin.init.celery.from_celery
mediagoblin@opochtli:/srv/mediagoblin/media.nil.mx/mediagoblin$ ./bin/celery inspect active
-> celery@opochtli: OK
    - empty -

I could already see the images listed in the page and after restart they disapeared.

comment:2 Changed 3 weeks ago by Fernando Gutierrez

Looking into the database entry I see

fail_error
Unhandled exception: This entry is not eligible for processor with name 'initial'

comment:3 Changed 3 weeks ago by Fernando Gutierrez

After uploading the image and after it shows as completed in the media processing panel.

The corresponding row in the core_media_entries table is:

id                           45
public_id                    http://[redacted]/api/image/220a24e4-cfaf-4ebf-874d-2fad015684d7/
remote                       0
actor                        1
title                        [redacted]
slug                         [redacted]
description                  [redacted]
media_type                   mediagoblin.media_types.image
state                        processed
license                      
file_size                    1.95
location                     18
created                      04:00.2
updated                      04:00.2
fail_error                   
fail_metadata                
transcoding_progress         0
queued_media_file            
queued_task_id               e1d84031-5da6-4b43-bfff-8b2e32aef25c
media_metadata               {}
main_transcoding_progress    0

and in celery_taskmeta:

id           58
task_id      e1d84031-5da6-4b43-bfff-8b2e32aef25c
status       SUCCESS
result       
date_done    04:00.9
traceback   


After a restart status changes to failed in both tables with error set to "Unhandled exception: This entry is not eligible for processor with name 'initial'"

comment:4 Changed 3 weeks ago by Fernando Gutierrez

From looking at celery logs the task "e1d84031-5da6-4b43-bfff-8b2e32aef25c" was submitted again after the server restarts.

This second time it fails with ""Unhandled exception: This entry is not eligible for processor with name 'initial'" because it was already processed and the status is not "unprocessed" or "failed".

May  9 00:04:00 opochtli celery[585]: 2020-05-09 00:04:00,330 INFO    [celery.worker.strategy] Received task: process_media[e1d84031-5da6-4b43-bfff-8b2e32aef25c]
May  9 01:08:52 opochtli systemd[1]: Stopping MediaGoblin Celeryd...
May  9 01:10:28 opochtli systemd[1]: Started MediaGoblin Celeryd.
May  9 01:10:50 opochtli celery[590]:  
May  9 01:10:50 opochtli celery[590]:  -------------- celery@opochtli v4.2.2 (windowlicker)
May  9 01:10:50 opochtli celery[590]: ---- **** -----
May  9 01:10:50 opochtli celery[590]: --- * ***  * -- Linux-5.4.0-29-generic-x86_64-with-glibc2.29 2020-05-09 01:10:50
May  9 01:10:50 opochtli celery[590]: -- * - **** ---
May  9 01:10:50 opochtli celery[590]: - ** ---------- [config]
May  9 01:10:50 opochtli celery[590]: - ** ---------- .> app:         default:0x7f620a81a130 (.default.Loader)
May  9 01:10:50 opochtli celery[590]: - ** ---------- .> transport:   amqp://guest:**@localhost:5672//
May  9 01:10:50 opochtli celery[590]: - ** ---------- .> results:     sqlite:////srv/mediagoblin/media.nil.mx/mediagoblin/celery.db
May  9 01:10:50 opochtli celery[590]: - *** --- * --- .> concurrency: 1 (prefork)
May  9 01:10:50 opochtli celery[590]: -- ******* ---- .> task events: OFF (enable -E to monitor tasks in this worker)
May  9 01:10:50 opochtli celery[590]: --- ***** -----
May  9 01:10:50 opochtli celery[590]:  -------------- [queues]
May  9 01:10:50 opochtli celery[590]:                 .> default          exchange=default(direct) key=default
May  9 01:10:50 opochtli celery[590]:                 
May  9 01:10:50 opochtli celery[590]: [tasks]
May  9 01:10:50 opochtli celery[590]:   . mediagoblin.media_types.video.processing.complementary_task
May  9 01:10:50 opochtli celery[590]:   . mediagoblin.media_types.video.processing.main_task
May  9 01:10:50 opochtli celery[590]:   . mediagoblin.media_types.video.processing.processing_cleanup
May  9 01:10:50 opochtli celery[590]:   . mediagoblin.notifications.task.EmailNotificationTask
May  9 01:10:50 opochtli celery[590]:   . mediagoblin.processing.task.handle_push_urls
May  9 01:10:50 opochtli celery[590]:   . mediagoblin.submit.task.collect_garbage
May  9 01:10:50 opochtli celery[590]:   . process_media
May  9 01:10:51 opochtli celery[590]: 2020-05-09 01:10:51,467 ERROR   [celery.worker.consumer.consumer] consumer: Cannot connect to amqp://guest:**@127.0.0.1:5672//: [Errno 111] Connection refused.
May  9 01:10:51 opochtli celery[590]: Trying again in 2.00 seconds...
May  9 01:10:53 opochtli celery[590]: 2020-05-09 01:10:53,510 ERROR   [celery.worker.consumer.consumer] consumer: Cannot connect to amqp://guest:**@127.0.0.1:5672//: [Errno 111] Connection refused.
May  9 01:10:53 opochtli celery[590]: Trying again in 4.00 seconds...
May  9 01:10:57 opochtli celery[590]: 2020-05-09 01:10:57,535 ERROR   [celery.worker.consumer.consumer] consumer: Cannot connect to amqp://guest:**@127.0.0.1:5672//: [Errno 111] Connection refused.
May  9 01:10:57 opochtli celery[590]: Trying again in 6.00 seconds...
May  9 01:11:03 opochtli celery[590]: 2020-05-09 01:11:03,991 INFO    [celery.worker.consumer.connection] Connected to amqp://guest:**@127.0.0.1:5672//
May  9 01:11:04 opochtli celery[590]: 2020-05-09 01:11:04,015 INFO    [celery.worker.consumer.mingle] mingle: searching for neighbors
May  9 01:11:05 opochtli celery[590]: 2020-05-09 01:11:05,051 INFO    [celery.worker.consumer.mingle] mingle: all alone
May  9 01:11:05 opochtli celery[590]: 2020-05-09 01:11:05,072 INFO    [celery.worker.strategy] Received task: process_media[e1d84031-5da6-4b43-bfff-8b2e32aef25c]

That is all I can debug. I not familiar with celery so I don't understand how the app interacts with it and how it knows if a task needs to be resubmitted.

Does it have something to do with main_transcoding_progress ending up with 0 the first time the task was run?

comment:5 Changed 11 days ago by Özcan Oğuz

Hello,

I fixed this problem with commenting these lines in the file mediagoblin/processing/__init__.py

#        if entry and not processor.media_is_eligible(entry):
#            raise ProcessorNotEligible(
#                "This entry is not eligible for processor with name '%s'" % key)

Then after restarting mediagoblin-celeryd.service, medias won't be failed.

comment:6 Changed 11 days ago by Fernando Gutierrez

I think commenting the eligible check will just cause media to be reprocessed at restart.

It fixes the error but unless the transcoder checks if thumbnails already exists it will waste cpu power processing them again.

Changed 6 days ago by Fernando Gutierrez

Attachment: rabbitmq_status.png added

RabbitMQ Management UI Screenshot

Changed 6 days ago by Fernando Gutierrez

Celery RabbitMQ connections

comment:7 Changed 6 days ago by Fernando Gutierrez

I found why the reprocessing job is re-submitted. The message stays unacknowledged in RabbitMQ. See the two attached screenshots: rabbitmq_status.png and celery_queue_connections.PNG​

I mentioned before that videos survive a restart but images don't so I made a test:

1) Connected to RabbitMQ management UI and made sure that the message queue was empty.

2) Uploaded a new image and looked again at RabbitMQ, there was 1 unacknowledged message.

3) Uploaded a video and again looked at the queue, there were 4 unacknowledged (the previous image upload + 3 video trans-coding tasks.)

4) Waited for the media processing panel to show all processing complete and connected again to RabbitMQ management UI. There is still 1 unacknowledged message.

comment:8 Changed 6 days ago by Fernando Gutierrez

Images do not use a workflow just a single task. Videos work correctly and they use a workflow.

The only difference on how the tasks are scheduled is in mediagoblin/mediagoblin/submit/lib.py

wf = manager.workflow(entry, feed_url, reprocess_action, reprocess_info)
if wf is None:
     ProcessMedia().apply_async(
        [entry.id, feed_url, reprocess_action, reprocess_info], {},
        task_id=entry.queued_task_id)
else:
    chord(wf[0])(wf[1])

I don't know why apply_async causes the task to stay unacknowledged but when run as part of a chord they are correctly tracked. Maybe a bug in celery?

Notice that celery documentation warns against leaking the task result. Both apply_async and chord return an AsyncResult instance that is ignored here. The documentations says: you must eventually call get() or forget() on EVERY AsyncResult instance returned after calling a task.

comment:9 Changed 5 days ago by Olivier Mehani

Cc: Olivier Mehani added
Note: See TracTickets for help on using tickets.