Opened 13 years ago
Closed 11 years ago
#647 closed defect (fixed)
Uncontrolled reading of files into memory
| Reported by: | Jakob Kramer | Owned by: | |
|---|---|---|---|
| Priority: | major | Milestone: | 0.8.0 | 
| Component: | programming | Keywords: | |
| Cc: | Stephen Compall, Jakob Kramer, Matt Molyneaux | Parent Tickets: | 
Description
Hi!
When I was looking at the attachments source code I have seen code like this:
with file:
file.write(another_file.read())
This is dangerous because you read another_file into memory completely and this can lead to MemoryErrors.  There is a function shutil.copyfileobj that already does what we need: reading chunk for chunk into memory and writing it to another file object.
Attachments (1)
Change History (42)
comment:1 by , 13 years ago
| Summary: | Uncontrolled reading of files in memory → Uncontrolled reading of files into memory | 
|---|
comment:2 by , 13 years ago
| Status: | new → accepted | 
|---|
comment:3 by , 13 years ago
| Keywords: | review added | 
|---|
comment:4 by , 13 years ago
| Resolution: | → duplicate | 
|---|---|
| Status: | accepted → closed | 
Gandaro, I will look at your branch also, and thanks for submitting it!  However, I also want to note that at the moment I believe that this is a duplicate of issue #419... spaetz has a branch there that I think does mostly the same thing; the problem is that someone needs to test whether the new code works with cloudfiles.  When that is done I think spaetz's branch can be merged.
In the meanwhile I'm closing this as a duplicate of #419.
comment:5 by , 13 years ago
Sorry for not linking to the relevant branch earlier; it's here: https://gitorious.org/~spaetz/mediagoblin/spaetz-mediagoblin/commits/WIP/large_uploads
comment:6 by , 13 years ago
| Milestone: | → 0.4.0 | 
|---|
Okay, ho ho, reopening this one.  There look like there are some good features in gandaro's branch here that weren't in #419.  Since that's closed out, reopening this.
Problem: it was hard for me to review this.  The commits seemed to be doing a lot more than the description of the commit at first glance (rearranging imports, shifting indentation, etc) and it made it harder to figure out what was going on.
I think it looks like there's good stuff in here though, so I'd like to get it merged... but it would be good to have some help!  I also suspect this does not merge on top of master at present.
comment:7 by , 13 years ago
| Resolution: | duplicate | 
|---|---|
| Status: | closed → reopened | 
comment:8 by , 13 years ago
This commit really only adds the "with ..." line and changes the 'foo\'bar' to "foo'bar" (although I think there was another occurence of that in the file that I did not change).
In the other commit I changed the if clause to only check wether attachments are not allowed, so we don't have the complete function indented.  Then I edited line 143 in the freshly unindented block to use shutil.copyfileobj instead of f1.write(f2.read()).  The rest is self-explanatory, I think. :-)
comment:9 by , 13 years ago
| Cc: | added | 
|---|
comment:10 by , 12 years ago
| Owner: | removed | 
|---|---|
| Status: | reopened → review | 
comment:11 by , 12 years ago
That is a new commit without the stylistic fixes. https://gitorious.org/~gandaro/mediagoblin/gandaros-mediagoblin/commits/647-cleaner
comment:12 by , 12 years ago
| Owner: | set to | 
|---|---|
| Status: | review → in_progress | 
So this looks good, in the sense that it's definitely catching a lot of areas where we're clearly doing the wrong thing.
What I'm not sure about is if it works with rackspace cloudfiles.  I haven't checked what happens with shutil.copyfileobj() with cloudfiles, but there's no way we can merge this without knowing that it works with that.
#419 did handle that right, so maybe comparing with that is what's needed?
A few places I think actually it's easy enough to not use copyfileobj; eg:
In stl/processing.py for example:
        # copy it up!
        with open(workbench_path, 'rb') as rendered_file:
            public_path = create_pub_filepath(entry, filename)
            with mgg.public_store.get_file(public_path, "wb") as public_file:
                shutil.copyfileobj(rendered_file, public_file)
For this, we could use the public_store's copy_local_to_storage() method, which should work right, and I think we should before this is merged (so I'm passing this back for that reason).
What to do with the files that are coming in off of requests in terms of the submission objects?  I'm not sure about those.  This is what we get, in some respects, for only-kinda-sorta working with with "file-like" objects.  Anyway, if all our storage systems can support shutil.copyfileobj() for sure, and we know that we *expect* that, we could merge this... and we should document that as both known and expected behavior.
comment:14 by , 12 years ago
| Milestone: | 0.5.0 → 0.6.0 | 
|---|
comment:15 by , 12 years ago
I think I'm running into this bug myself, when uploading large files (2.2gb in this example), I get the following server error:
Error - <type 'exceptions.OverflowError'>: requested number of bytes is more than a Python string can hold
URL: http://media_goblin:6543/submit/
File '/home/mediagoblin/mediagoblin/local/lib/python2.7/site-packages/Paste-1.7.5.1-py2.7.egg/paste/exceptions/errormiddleware.py', line 144 in __call__
  app_iter = self.application(environ, sr_checker)
File '/home/mediagoblin/mediagoblin/local/lib/python2.7/site-packages/Paste-1.7.5.1-py2.7.egg/paste/urlmap.py', line 203 in _
_call__                                                                                                                        return app(environ, start_response)
File '/home/mediagoblin/mediagoblin/mediagoblin/app.py', line 259 in __call__
  return self.call_backend(environ, start_response)
File '/home/mediagoblin/mediagoblin/mediagoblin/app.py', line 236 in call_backend                                              response = controller(request)
File '/home/mediagoblin/mediagoblin/mediagoblin/decorators.py', line 52 in new_controller_func
  return controller(request, *args, **kwargs)
File '/home/mediagoblin/mediagoblin/mediagoblin/submit/views.py', line 107 in submit_start                                     queue_file.write(request.files['file'].stream.read())
OverflowError: requested number of bytes is more than a Python string can hold
                                                                                                                             CGI Variables
-------------
  CONTENT_LENGTH: '2373533566'
  CONTENT_TYPE: 'multipart/form-data; boundary=---------------------------16242764931763880005461866851'                       CSRF_TOKEN: u'303073878618807821'
  HTTP_ACCEPT: 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'
  HTTP_ACCEPT_ENCODING: 'gzip, deflate'
  HTTP_ACCEPT_LANGUAGE: 'en-au,en;q=0.7,en-us;q=0.3'                                                                           HTTP_CONNECTION: 'keep-alive'
  HTTP_COOKIE: 'mediagoblin_csrftoken=303073878618807821; mediagoblin_csrftoken=303073878618807821; MGSession=eyJtZXNzYWdlcyI
6W10sInVzZXJfaWQiOiIxIn0.BSgdAQ.E7v4vBtW60YSELUEzHV_lf68iu8'
  HTTP_HOST: 'media_goblin:6543'                                                                                               HTTP_REFERER: 'http://media_goblin:6543/submit/'
  HTTP_USER_AGENT: 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:22.0) Gecko/20100101 Firefox/22.0'
  PATH_INFO: '/submit/'
  REMOTE_ADDR: '192.168.39.1'                                                                                                  REQUEST_METHOD: 'POST'
  SERVER_NAME: '0.0.0.0'
  SERVER_PORT: '6543'
  SERVER_PROTOCOL: 'HTTP/1.1'                                                                                                
WSGI Variables
--------------                                                                                                                 application: {(None, '/mgoblin_static'): <StaticURLParser '/home/mediagoblin/mediagoblin/mediagoblin/static'>, (None, '/mgo
blin_media'): <StaticURLParser '/home/mediagoblin/mediagoblin/user_dev/media/public'>, (None, ''): <mediagoblin.app.MediaGobl
inApp object at 0x9aef7cc>, (None, '/plugin_static'): <StaticURLParser '/home/mediagoblin/mediagoblin/user_dev/plugin_static'
>, (None, '/theme_static'): <StaticURLParser '/home/mediagoblin/mediagoblin/user_dev/theme_static'>}                           paste.httpserver.thread_pool: <paste.httpserver.ThreadPool object at 0xa30d84c>
  paste.throw_errors: True
  werkzeug.request: <Request 'http://media_goblin:6543/submit/' [POST]>
  wsgi process: 'Multithreaded'
It's on a local installation (5.1 setup according to manual) inside an LXC container, with about 440mb or available RAM.  Hope this helps.
comment:16 by , 12 years ago
andreas45, it's quite likely you're right that this is related.  It would be good if someone could help test with openstack's storage and review the state of the ticket in regard to my comments... that would be good.  It might be mergable as-is... I'm not sure.  Testing with cloudfiles would help.
comment:17 by , 12 years ago
cwebber, I cloned from the repository you linked to (on my same machine) to see if it affects my problem, but I still get a very similar one:
[mediagoblin.media_types] No media handler found by file extension. Doing it the expensive way...
Error - <type 'exceptions.OverflowError'>: requested number of bytes is more than a Python string can hold
URL: http://media_goblin:6543/submit/
File '/home/mediagoblin/fix_test_mediagoblin/local/lib/python2.7/site-packages/Paste-1.7.5.1-py2.7.egg/paste/exceptions/errormiddleware.py', line 144 in __call__
  app_iter = self.application(environ, sr_checker)
File '/home/mediagoblin/fix_test_mediagoblin/local/lib/python2.7/site-packages/Paste-1.7.5.1-py2.7.egg/paste/urlmap.py', line 203 in __call__
  return app(environ, start_response)
File '/home/mediagoblin/fix_test_mediagoblin/mediagoblin/app.py', line 259 in __call__
  return self.call_backend(environ, start_response)
File '/home/mediagoblin/fix_test_mediagoblin/mediagoblin/app.py', line 236 in call_backend
  response = controller(request)
File '/home/mediagoblin/fix_test_mediagoblin/mediagoblin/decorators.py', line 52 in new_controller_func
  return controller(request, *args, **kwargs)
File '/home/mediagoblin/fix_test_mediagoblin/mediagoblin/submit/views.py', line 64 in submit_start
  request.files['file'])
File '/home/mediagoblin/fix_test_mediagoblin/mediagoblin/media_types/__init__.py', line 66 in sniff_media
  media_file.write(media.stream.read())
OverflowError: requested number of bytes is more than a Python string can hold
CGI Variables
-------------
  CONTENT_LENGTH: '2373533220'
  CONTENT_TYPE: 'multipart/form-data; boundary=---------------------------202606272211906711501952622290'
  CSRF_TOKEN: u'303073878618807821'
  HTTP_ACCEPT: 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'
  HTTP_ACCEPT_ENCODING: 'gzip, deflate'
  HTTP_ACCEPT_LANGUAGE: 'en-au,en;q=0.7,en-us;q=0.3'
  HTTP_CONNECTION: 'keep-alive'
  HTTP_COOKIE: 'mediagoblin_csrftoken=303073878618807821; mediagoblin_csrftoken=303073878618807821; MGSession=eyJtZXNzYWdlcyI6W10sInVzZXJfaWQiOiIxIn0.BTEcBg.1OEEK9CvA_S_OnoTv15ImTxHnp8'
  HTTP_HOST: 'media_goblin:6543'
  HTTP_REFERER: 'http://media_goblin:6543/submit/'
  HTTP_USER_AGENT: 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:22.0) Gecko/20100101 Firefox/22.0'
  PATH_INFO: '/submit/'
  REMOTE_ADDR: '192.168.39.1'
  REQUEST_METHOD: 'POST'
  SERVER_NAME: '0.0.0.0'
  SERVER_PORT: '6543'
  SERVER_PROTOCOL: 'HTTP/1.1'
WSGI Variables
--------------
  application: {(None, '/mgoblin_static'): <StaticURLParser '/home/mediagoblin/fix_test_mediagoblin/mediagoblin/static'>, (None, '/mgoblin_media'): <StaticURLParser '/home/mediagoblin/fix_test_mediagoblin/user_dev/media/public'>, (None, ''): <mediagoblin.app.MediaGoblinApp object at 0x968a9ac>, (None, '/plugin_static'): <StaticURLParser '/home/mediagoblin/fix_test_mediagoblin/user_dev/plugin_static'>, (None, '/theme_static'): <StaticURLParser '/home/mediagoblin/fix_test_mediagoblin/user_dev/theme_static'>}
  paste.httpserver.thread_pool: <paste.httpserver.ThreadPool object at 0x97eea0c>
  paste.throw_errors: True
  werkzeug.request: <Request 'http://media_goblin:6543/submit/' [POST]>
  wsgi process: 'Multithreaded'
------------------------------------------------------------
This one is in a different area though.  Is this a place that hasn't been focusing on in your repository, or perhaps I've got the wrong bug?
comment:18 by , 12 years ago
Sadly you're right!  That's a completely different instance of this same problem.  I'm not sure if it should be split into another bug or not, but it should definitely be fixed.
comment:19 by , 12 years ago
| Milestone: | 0.6.0 → 0.7.0 | 
|---|
comment:21 by , 11 years ago
#834 is indeed a duplicate, but has useful information for resolving this ticket.
comment:22 by , 11 years ago
| Cc: | added | 
|---|---|
| Owner: | removed | 
| Status: | in_progress → accepted | 
Gandaro, I don't think you're currently working on this are you?  I'm unclaiming this from you, but if you jump back on it, please reclaim!
comment:23 by , 11 years ago
Note: I'd love for this to get fixed in 0.7.0 but it's a pretty big task and has been around forever.  It would be okay to move this to 0.8.0.  Let's save this one till towards the end of the cycle.
comment:24 by , 11 years ago
| Milestone: | 0.7.0 → 0.8.0 | 
|---|
I think this should be a 0.8.0 feature.  But we should *really* do it for 0.8.0!
Taking this on will take some effort, and we're pretty close to release, so.
comment:25 by , 11 years ago
| Cc: | added | 
|---|
comment:26 by , 11 years ago
| Owner: | set to | 
|---|---|
| Status: | accepted → in_progress | 
Claiming and assigning to 0.8.0
comment:27 by , 11 years ago
| Milestone: | 0.8.0 → 0.9.0 | 
|---|
I hate to bump this to the next release, but I am... but I am still working on it.
comment:28 by , 11 years ago
I just made the following two changes on my test instance. I had no problems uploading an image and a video to CloudFiles with the cloudfiles.py storage interface.
diff --git a/mediagoblin/media_types/__init__.py b/mediagoblin/media_types/__init__.py
index 2e39231..ba49e33 100644
--- a/mediagoblin/media_types/__init__.py
+++ b/mediagoblin/media_types/__init__.py
@@ -16,6 +16,7 @@
 
 import os
 import logging
+import shutil
 import tempfile
 
 from mediagoblin.tools.pluginapi import hook_handle
@@ -63,7 +64,7 @@ def sniff_media(media_file, filename):
         # Create a temporary file for sniffers suchs as GStreamer-based
         # Audio video
         tmp_media_file = tempfile.NamedTemporaryFile()
-        tmp_media_file.write(media_file.read())
+        shutil.copyfileobj(media_file, tmp_media_file)
         tmp_media_file.seek(0)
         media_file.seek(0)
 
diff --git a/mediagoblin/submit/lib.py b/mediagoblin/submit/lib.py
index 541447e..8981a9c 100644
--- a/mediagoblin/submit/lib.py
+++ b/mediagoblin/submit/lib.py
@@ -15,6 +15,7 @@
 # along with this program.  If not, see <http://www.gnu.org/licenses/>.
 
 import logging
+import shutil
 import uuid
 from os.path import splitext
 
@@ -157,7 +158,7 @@ def submit_media(mg_app, user, submitted_file, filename,
     queue_file = prepare_queue_task(mg_app, entry, filename)
 
     with queue_file:
-        queue_file.write(submitted_file.read())
+        shutil.copyfileobj(submitted_file, queue_file)
 
     # Get file size and round to 2 decimal places
     file_size = mg_app.queue_store.get_file_size(
comment:29 by , 11 years ago
I had no problems uploading an image and a video to CloudFiles with the cloudfiles.py storage interface.
The patch still doesn't fix https://gitorious.org/mediagoblin/mediagoblin/source/41dbb27a4f162b5c2b109d2214c5ccd2946ed535:mediagoblin/storage/cloudfiles.py#L213 or L153 of mediagoblin/edit/views.py.
Anyway, I think that the approach is good. I'd suggest to do it a bit another way, like this:
diff --git a/mediagoblin/submit/lib.py b/mediagoblin/submit/lib.py
index 541447e..85188d9 100644
--- a/mediagoblin/submit/lib.py
+++ b/mediagoblin/submit/lib.py
@@ -157,7 +157,7 @@ def submit_media(mg_app, user, submitted_file, filename,
     queue_file = prepare_queue_task(mg_app, entry, filename)
 
     with queue_file:
-        queue_file.write(submitted_file.read())
+        submitted_file.save(queue_file)
 
     # Get file size and round to 2 decimal places
     file_size = mg_app.queue_store.get_file_size(
comment:30 by , 11 years ago
I am also concerned about performance of cloudfiles.py when using copyfileobj. copyfileobj is defined here and in fact does multiple read() and write(), making cloudfiles' write() read data multiple times.
comment:31 by , 11 years ago
| Keywords: | bitesized removed | 
|---|---|
| Milestone: | 0.9.0 → 0.8.0 | 
| Owner: | changed from to | 
comment:32 by , 11 years ago
Breton's suggestion above also tests out okay on my dev instance with CloudFiles enabled. I can't yet comment on whether either solution helps solve my personal out of memory errors, but neither solution seems to break my environment.
comment:33 by , 11 years ago
Whoops, found a problem. CLI uploads don't work.
~/mediagoblin $ bin/gmg addmedia admin image.png
** Message: pygobject_register_sinkfunc is deprecated (GstObject)
Traceback (most recent call last):
  File "bin/gmg", line 9, in <module>
    load_entry_point('mediagoblin==0.7.2.dev', 'console_scripts', 'gmg')()
  File "/path/to/mediagoblin/mediagoblin/gmg_commands/__init__.py", line 156, in main_cli
    args.func(args)
  File "/path/to/mediagoblin/mediagoblin/gmg_commands/addmedia.py", line 103, in addmedia
    upload_limit=upload_limit, max_file_size=max_file_size)
  File "/path/to/mediagoblin/mediagoblin/submit/lib.py", line 163, in submit_media
    submitted_file.save(queue_file)
AttributeError: 'file' object has no attribute 'save'
comment:34 by , 11 years ago
Wow. Interesting. It seems that submitted_file is not always an instance of werkzeug.datastructures.FileStorage.
OK, I'm still working on it, will take this into account.
comment:35 by , 11 years ago
In case anyone's curious, CLI upload does work when I use shutil.copyfileobj(submitted_file, queue_file).
comment:36 by , 11 years ago
I created separate ticket #5053 to address the bin/gmg addmedia error introduced in 0.8.0.
comment:37 by , 11 years ago
I've attached a patch that should fix the issue. Ayleph, could you please test it on cloudfiles?
Also, any testing would be appreciated.
comment:38 by , 11 years ago
Thanks for the patch, Breton. It seems to work fine with CloudFiles storage. Console uploads via bin/gmg addmedia work as well.
comment:39 by , 11 years ago
| Owner: | removed | 
|---|---|
| Status: | in_progress → review | 
Patch was added to branch bug/647 of bretons-mediagoblin repo
comment:40 by , 11 years ago
| Keywords: | review removed | 
|---|
comment:41 by , 11 years ago
| Resolution: | → fixed | 
|---|---|
| Status: | review → closed | 
Since this went through review I'm trusting it, and merging it as-is!  The tests pass fine.
Thanks for your hard work, breton & ayleph!


Fixing branch. Should work, though someone should have another look at it.