Opened 12 years ago

Closed 10 years ago

#647 closed defect (fixed)

Uncontrolled reading of files into memory

Reported by: Jakob Kramer Owned by:
Priority: major Milestone: 0.8.0
Component: programming Keywords:
Cc: Stephen Compall, Jakob Kramer, Matt Molyneaux Parent Tickets:

Description

Hi!

When I was looking at the attachments source code I have seen code like this:

with file:

file.write(another_file.read())

This is dangerous because you read another_file into memory completely and this can lead to MemoryErrors. There is a function shutil.copyfileobj that already does what we need: reading chunk for chunk into memory and writing it to another file object.

Attachments (1)

patch.diff (3.0 KB ) - added by Boris Bobrov 10 years ago.
patch 1

Download all attachments as: .zip

Change History (42)

comment:1 by Jakob Kramer, 12 years ago

Summary: Uncontrolled reading of files in memoryUncontrolled reading of files into memory

comment:2 by Jakob Kramer, 12 years ago

Status: newaccepted

Fixing branch. Should work, though someone should have another look at it.

comment:3 by Jakob Kramer, 12 years ago

Keywords: review added

comment:4 by Christopher Allan Webber, 12 years ago

Resolution: duplicate
Status: acceptedclosed

Gandaro, I will look at your branch also, and thanks for submitting it! However, I also want to note that at the moment I believe that this is a duplicate of issue #419... spaetz has a branch there that I think does mostly the same thing; the problem is that someone needs to test whether the new code works with cloudfiles. When that is done I think spaetz's branch can be merged.

In the meanwhile I'm closing this as a duplicate of #419.

comment:5 by Christopher Allan Webber, 12 years ago

Sorry for not linking to the relevant branch earlier; it's here: https://gitorious.org/~spaetz/mediagoblin/spaetz-mediagoblin/commits/WIP/large_uploads

comment:6 by Christopher Allan Webber, 12 years ago

Milestone: 0.4.0

Okay, ho ho, reopening this one. There look like there are some good features in gandaro's branch here that weren't in #419. Since that's closed out, reopening this.

Problem: it was hard for me to review this. The commits seemed to be doing a lot more than the description of the commit at first glance (rearranging imports, shifting indentation, etc) and it made it harder to figure out what was going on.

I think it looks like there's good stuff in here though, so I'd like to get it merged... but it would be good to have some help! I also suspect this does not merge on top of master at present.

comment:7 by Christopher Allan Webber, 12 years ago

Resolution: duplicate
Status: closedreopened

comment:8 by Jakob Kramer, 12 years ago

This commit really only adds the "with ..." line and changes the 'foo\'bar' to "foo'bar" (although I think there was another occurence of that in the file that I did not change).

In the other commit I changed the if clause to only check wether attachments are not allowed, so we don't have the complete function indented. Then I edited line 143 in the freshly unindented block to use shutil.copyfileobj instead of f1.write(f2.read()). The rest is self-explanatory, I think. :-)

comment:9 by Stephen Compall, 12 years ago

Cc: Stephen Compall added

comment:10 by Jakob Kramer, 11 years ago

Owner: Jakob Kramer removed
Status: reopenedreview

comment:12 by Christopher Allan Webber, 11 years ago

Owner: set to Jakob Kramer
Status: reviewin_progress

So this looks good, in the sense that it's definitely catching a lot of areas where we're clearly doing the wrong thing.

What I'm not sure about is if it works with rackspace cloudfiles. I haven't checked what happens with shutil.copyfileobj() with cloudfiles, but there's no way we can merge this without knowing that it works with that.

#419 did handle that right, so maybe comparing with that is what's needed?

A few places I think actually it's easy enough to not use copyfileobj; eg:

In stl/processing.py for example:

        # copy it up!
        with open(workbench_path, 'rb') as rendered_file:
            public_path = create_pub_filepath(entry, filename)

            with mgg.public_store.get_file(public_path, "wb") as public_file:
                shutil.copyfileobj(rendered_file, public_file)

For this, we could use the public_store's copy_local_to_storage() method, which should work right, and I think we should before this is merged (so I'm passing this back for that reason).

What to do with the files that are coming in off of requests in terms of the submission objects? I'm not sure about those. This is what we get, in some respects, for only-kinda-sorta working with with "file-like" objects. Anyway, if all our storage systems can support shutil.copyfileobj() for sure, and we know that we *expect* that, we could merge this... and we should document that as both known and expected behavior.

comment:13 by Christopher Allan Webber, 11 years ago

Milestone: 0.4.00.4.1

Moving to 0.4.1

comment:14 by Christopher Allan Webber, 11 years ago

Milestone: 0.5.00.6.0

comment:15 by Andreas, 11 years ago

I think I'm running into this bug myself, when uploading large files (2.2gb in this example), I get the following server error:

Error - <type 'exceptions.OverflowError'>: requested number of bytes is more than a Python string can hold
URL: http://media_goblin:6543/submit/
File '/home/mediagoblin/mediagoblin/local/lib/python2.7/site-packages/Paste-1.7.5.1-py2.7.egg/paste/exceptions/errormiddleware.py', line 144 in __call__
  app_iter = self.application(environ, sr_checker)
File '/home/mediagoblin/mediagoblin/local/lib/python2.7/site-packages/Paste-1.7.5.1-py2.7.egg/paste/urlmap.py', line 203 in _
_call__                                                                                                                        return app(environ, start_response)
File '/home/mediagoblin/mediagoblin/mediagoblin/app.py', line 259 in __call__
  return self.call_backend(environ, start_response)
File '/home/mediagoblin/mediagoblin/mediagoblin/app.py', line 236 in call_backend                                              response = controller(request)
File '/home/mediagoblin/mediagoblin/mediagoblin/decorators.py', line 52 in new_controller_func
  return controller(request, *args, **kwargs)
File '/home/mediagoblin/mediagoblin/mediagoblin/submit/views.py', line 107 in submit_start                                     queue_file.write(request.files['file'].stream.read())
OverflowError: requested number of bytes is more than a Python string can hold

                                                                                                                             CGI Variables
-------------
  CONTENT_LENGTH: '2373533566'
  CONTENT_TYPE: 'multipart/form-data; boundary=---------------------------16242764931763880005461866851'                       CSRF_TOKEN: u'303073878618807821'
  HTTP_ACCEPT: 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'
  HTTP_ACCEPT_ENCODING: 'gzip, deflate'
  HTTP_ACCEPT_LANGUAGE: 'en-au,en;q=0.7,en-us;q=0.3'                                                                           HTTP_CONNECTION: 'keep-alive'
  HTTP_COOKIE: 'mediagoblin_csrftoken=303073878618807821; mediagoblin_csrftoken=303073878618807821; MGSession=eyJtZXNzYWdlcyI
6W10sInVzZXJfaWQiOiIxIn0.BSgdAQ.E7v4vBtW60YSELUEzHV_lf68iu8'
  HTTP_HOST: 'media_goblin:6543'                                                                                               HTTP_REFERER: 'http://media_goblin:6543/submit/'
  HTTP_USER_AGENT: 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:22.0) Gecko/20100101 Firefox/22.0'
  PATH_INFO: '/submit/'
  REMOTE_ADDR: '192.168.39.1'                                                                                                  REQUEST_METHOD: 'POST'
  SERVER_NAME: '0.0.0.0'
  SERVER_PORT: '6543'
  SERVER_PROTOCOL: 'HTTP/1.1'                                                                                                

WSGI Variables
--------------                                                                                                                 application: {(None, '/mgoblin_static'): <StaticURLParser '/home/mediagoblin/mediagoblin/mediagoblin/static'>, (None, '/mgo
blin_media'): <StaticURLParser '/home/mediagoblin/mediagoblin/user_dev/media/public'>, (None, ''): <mediagoblin.app.MediaGobl
inApp object at 0x9aef7cc>, (None, '/plugin_static'): <StaticURLParser '/home/mediagoblin/mediagoblin/user_dev/plugin_static'
>, (None, '/theme_static'): <StaticURLParser '/home/mediagoblin/mediagoblin/user_dev/theme_static'>}                           paste.httpserver.thread_pool: <paste.httpserver.ThreadPool object at 0xa30d84c>
  paste.throw_errors: True
  werkzeug.request: <Request 'http://media_goblin:6543/submit/' [POST]>
  wsgi process: 'Multithreaded'

It's on a local installation (5.1 setup according to manual) inside an LXC container, with about 440mb or available RAM. Hope this helps.

comment:16 by Christopher Allan Webber, 11 years ago

andreas45, it's quite likely you're right that this is related. It would be good if someone could help test with openstack's storage and review the state of the ticket in regard to my comments... that would be good. It might be mergable as-is... I'm not sure. Testing with cloudfiles would help.

comment:17 by Andreas, 11 years ago

cwebber, I cloned from the repository you linked to (on my same machine) to see if it affects my problem, but I still get a very similar one:

[mediagoblin.media_types] No media handler found by file extension. Doing it the expensive way...
Error - <type 'exceptions.OverflowError'>: requested number of bytes is more than a Python string can hold
URL: http://media_goblin:6543/submit/
File '/home/mediagoblin/fix_test_mediagoblin/local/lib/python2.7/site-packages/Paste-1.7.5.1-py2.7.egg/paste/exceptions/errormiddleware.py', line 144 in __call__
  app_iter = self.application(environ, sr_checker)
File '/home/mediagoblin/fix_test_mediagoblin/local/lib/python2.7/site-packages/Paste-1.7.5.1-py2.7.egg/paste/urlmap.py', line 203 in __call__
  return app(environ, start_response)
File '/home/mediagoblin/fix_test_mediagoblin/mediagoblin/app.py', line 259 in __call__
  return self.call_backend(environ, start_response)
File '/home/mediagoblin/fix_test_mediagoblin/mediagoblin/app.py', line 236 in call_backend
  response = controller(request)
File '/home/mediagoblin/fix_test_mediagoblin/mediagoblin/decorators.py', line 52 in new_controller_func
  return controller(request, *args, **kwargs)
File '/home/mediagoblin/fix_test_mediagoblin/mediagoblin/submit/views.py', line 64 in submit_start
  request.files['file'])
File '/home/mediagoblin/fix_test_mediagoblin/mediagoblin/media_types/__init__.py', line 66 in sniff_media
  media_file.write(media.stream.read())
OverflowError: requested number of bytes is more than a Python string can hold


CGI Variables
-------------
  CONTENT_LENGTH: '2373533220'
  CONTENT_TYPE: 'multipart/form-data; boundary=---------------------------202606272211906711501952622290'
  CSRF_TOKEN: u'303073878618807821'
  HTTP_ACCEPT: 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'
  HTTP_ACCEPT_ENCODING: 'gzip, deflate'
  HTTP_ACCEPT_LANGUAGE: 'en-au,en;q=0.7,en-us;q=0.3'
  HTTP_CONNECTION: 'keep-alive'
  HTTP_COOKIE: 'mediagoblin_csrftoken=303073878618807821; mediagoblin_csrftoken=303073878618807821; MGSession=eyJtZXNzYWdlcyI6W10sInVzZXJfaWQiOiIxIn0.BTEcBg.1OEEK9CvA_S_OnoTv15ImTxHnp8'
  HTTP_HOST: 'media_goblin:6543'
  HTTP_REFERER: 'http://media_goblin:6543/submit/'
  HTTP_USER_AGENT: 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:22.0) Gecko/20100101 Firefox/22.0'
  PATH_INFO: '/submit/'
  REMOTE_ADDR: '192.168.39.1'
  REQUEST_METHOD: 'POST'
  SERVER_NAME: '0.0.0.0'
  SERVER_PORT: '6543'
  SERVER_PROTOCOL: 'HTTP/1.1'


WSGI Variables
--------------
  application: {(None, '/mgoblin_static'): <StaticURLParser '/home/mediagoblin/fix_test_mediagoblin/mediagoblin/static'>, (None, '/mgoblin_media'): <StaticURLParser '/home/mediagoblin/fix_test_mediagoblin/user_dev/media/public'>, (None, ''): <mediagoblin.app.MediaGoblinApp object at 0x968a9ac>, (None, '/plugin_static'): <StaticURLParser '/home/mediagoblin/fix_test_mediagoblin/user_dev/plugin_static'>, (None, '/theme_static'): <StaticURLParser '/home/mediagoblin/fix_test_mediagoblin/user_dev/theme_static'>}
  paste.httpserver.thread_pool: <paste.httpserver.ThreadPool object at 0x97eea0c>
  paste.throw_errors: True
  werkzeug.request: <Request 'http://media_goblin:6543/submit/' [POST]>
  wsgi process: 'Multithreaded'
------------------------------------------------------------

This one is in a different area though. Is this a place that hasn't been focusing on in your repository, or perhaps I've got the wrong bug?

comment:18 by Christopher Allan Webber, 11 years ago

Sadly you're right! That's a completely different instance of this same problem. I'm not sure if it should be split into another bug or not, but it should definitely be fixed.

comment:19 by Christopher Allan Webber, 11 years ago

Milestone: 0.6.00.7.0

comment:20 by Julien Gouesse, 10 years ago

comment:21 by Christopher Allan Webber, 10 years ago

#834 is indeed a duplicate, but has useful information for resolving this ticket.

comment:22 by Christopher Allan Webber, 10 years ago

Cc: Jakob Kramer added
Owner: Jakob Kramer removed
Status: in_progressaccepted

Gandaro, I don't think you're currently working on this are you? I'm unclaiming this from you, but if you jump back on it, please reclaim!

comment:23 by Christopher Allan Webber, 10 years ago

Note: I'd love for this to get fixed in 0.7.0 but it's a pretty big task and has been around forever. It would be okay to move this to 0.8.0. Let's save this one till towards the end of the cycle.

comment:24 by Christopher Allan Webber, 10 years ago

Milestone: 0.7.00.8.0

I think this should be a 0.8.0 feature. But we should *really* do it for 0.8.0!

Taking this on will take some effort, and we're pretty close to release, so.

comment:25 by Matt Molyneaux, 10 years ago

Cc: Matt Molyneaux added

comment:26 by Christopher Allan Webber, 10 years ago

Owner: set to Christopher Allan Webber
Status: acceptedin_progress

Claiming and assigning to 0.8.0

comment:27 by Christopher Allan Webber, 10 years ago

Milestone: 0.8.00.9.0

I hate to bump this to the next release, but I am... but I am still working on it.

comment:28 by ayleph, 10 years ago

I just made the following two changes on my test instance. I had no problems uploading an image and a video to CloudFiles with the cloudfiles.py storage interface.

diff --git a/mediagoblin/media_types/__init__.py b/mediagoblin/media_types/__init__.py
index 2e39231..ba49e33 100644
--- a/mediagoblin/media_types/__init__.py
+++ b/mediagoblin/media_types/__init__.py
@@ -16,6 +16,7 @@
 
 import os
 import logging
+import shutil
 import tempfile
 
 from mediagoblin.tools.pluginapi import hook_handle
@@ -63,7 +64,7 @@ def sniff_media(media_file, filename):
         # Create a temporary file for sniffers suchs as GStreamer-based
         # Audio video
         tmp_media_file = tempfile.NamedTemporaryFile()
-        tmp_media_file.write(media_file.read())
+        shutil.copyfileobj(media_file, tmp_media_file)
         tmp_media_file.seek(0)
         media_file.seek(0)
 
diff --git a/mediagoblin/submit/lib.py b/mediagoblin/submit/lib.py
index 541447e..8981a9c 100644
--- a/mediagoblin/submit/lib.py
+++ b/mediagoblin/submit/lib.py
@@ -15,6 +15,7 @@
 # along with this program.  If not, see <http://www.gnu.org/licenses/>.
 
 import logging
+import shutil
 import uuid
 from os.path import splitext
 
@@ -157,7 +158,7 @@ def submit_media(mg_app, user, submitted_file, filename,
     queue_file = prepare_queue_task(mg_app, entry, filename)
 
     with queue_file:
-        queue_file.write(submitted_file.read())
+        shutil.copyfileobj(submitted_file, queue_file)
 
     # Get file size and round to 2 decimal places
     file_size = mg_app.queue_store.get_file_size(

comment:29 by Boris Bobrov, 10 years ago

I had no problems uploading an image and a video to CloudFiles with the cloudfiles.py storage interface.

The patch still doesn't fix https://gitorious.org/mediagoblin/mediagoblin/source/41dbb27a4f162b5c2b109d2214c5ccd2946ed535:mediagoblin/storage/cloudfiles.py#L213 or L153 of mediagoblin/edit/views.py.

Anyway, I think that the approach is good. I'd suggest to do it a bit another way, like this:

diff --git a/mediagoblin/submit/lib.py b/mediagoblin/submit/lib.py
index 541447e..85188d9 100644
--- a/mediagoblin/submit/lib.py
+++ b/mediagoblin/submit/lib.py
@@ -157,7 +157,7 @@ def submit_media(mg_app, user, submitted_file, filename,
     queue_file = prepare_queue_task(mg_app, entry, filename)
 
     with queue_file:
-        queue_file.write(submitted_file.read())
+        submitted_file.save(queue_file)
 
     # Get file size and round to 2 decimal places
     file_size = mg_app.queue_store.get_file_size(

comment:30 by Boris Bobrov, 10 years ago

I am also concerned about performance of cloudfiles.py when using copyfileobj. copyfileobj is defined here and in fact does multiple read() and write(), making cloudfiles' write() read data multiple times.

comment:31 by Boris Bobrov, 10 years ago

Keywords: bitesized removed
Milestone: 0.9.00.8.0
Owner: changed from Christopher Allan Webber to Boris Bobrov

comment:32 by ayleph, 10 years ago

Breton's suggestion above also tests out okay on my dev instance with CloudFiles enabled. I can't yet comment on whether either solution helps solve my personal out of memory errors, but neither solution seems to break my environment.

comment:33 by ayleph, 10 years ago

Whoops, found a problem. CLI uploads don't work.

~/mediagoblin $ bin/gmg addmedia admin image.png
** Message: pygobject_register_sinkfunc is deprecated (GstObject)

Traceback (most recent call last):
  File "bin/gmg", line 9, in <module>
    load_entry_point('mediagoblin==0.7.2.dev', 'console_scripts', 'gmg')()
  File "/path/to/mediagoblin/mediagoblin/gmg_commands/__init__.py", line 156, in main_cli
    args.func(args)
  File "/path/to/mediagoblin/mediagoblin/gmg_commands/addmedia.py", line 103, in addmedia
    upload_limit=upload_limit, max_file_size=max_file_size)
  File "/path/to/mediagoblin/mediagoblin/submit/lib.py", line 163, in submit_media
    submitted_file.save(queue_file)
AttributeError: 'file' object has no attribute 'save'

comment:34 by Boris Bobrov, 10 years ago

Wow. Interesting. It seems that submitted_file is not always an instance of werkzeug.datastructures.FileStorage.

OK, I'm still working on it, will take this into account.

comment:35 by ayleph, 10 years ago

In case anyone's curious, CLI upload does work when I use shutil.copyfileobj(submitted_file, queue_file).

comment:36 by ayleph, 10 years ago

I created separate ticket #5053 to address the bin/gmg addmedia error introduced in 0.8.0.

by Boris Bobrov, 10 years ago

Attachment: patch.diff added

patch 1

comment:37 by Boris Bobrov, 10 years ago

I've attached a patch that should fix the issue. Ayleph, could you please test it on cloudfiles?

Also, any testing would be appreciated.

comment:38 by ayleph, 10 years ago

Thanks for the patch, Breton. It seems to work fine with CloudFiles storage. Console uploads via bin/gmg addmedia work as well.

comment:39 by Boris Bobrov, 10 years ago

Owner: Boris Bobrov removed
Status: in_progressreview

Patch was added to branch bug/647 of bretons-mediagoblin repo

comment:40 by Boris Bobrov, 10 years ago

Keywords: review removed

comment:41 by Christopher Allan Webber, 10 years ago

Resolution: fixed
Status: reviewclosed

Since this went through review I'm trusting it, and merging it as-is! The tests pass fine.

Thanks for your hard work, breton & ayleph!

Note: See TracTickets for help on using tickets.