Opened 12 years ago
Closed 10 years ago
#647 closed defect (fixed)
Uncontrolled reading of files into memory
Reported by: | Jakob Kramer | Owned by: | |
---|---|---|---|
Priority: | major | Milestone: | 0.8.0 |
Component: | programming | Keywords: | |
Cc: | Stephen Compall, Jakob Kramer, Matt Molyneaux | Parent Tickets: |
Description
Hi!
When I was looking at the attachments source code I have seen code like this:
with file:
file.write(another_file.read())
This is dangerous because you read another_file
into memory completely and this can lead to MemoryErrors. There is a function shutil.copyfileobj
that already does what we need: reading chunk for chunk into memory and writing it to another file object.
Attachments (1)
Change History (42)
comment:1 by , 12 years ago
Summary: | Uncontrolled reading of files in memory → Uncontrolled reading of files into memory |
---|
comment:2 by , 12 years ago
Status: | new → accepted |
---|
comment:3 by , 12 years ago
Keywords: | review added |
---|
comment:4 by , 12 years ago
Resolution: | → duplicate |
---|---|
Status: | accepted → closed |
Gandaro, I will look at your branch also, and thanks for submitting it! However, I also want to note that at the moment I believe that this is a duplicate of issue #419... spaetz has a branch there that I think does mostly the same thing; the problem is that someone needs to test whether the new code works with cloudfiles. When that is done I think spaetz's branch can be merged.
In the meanwhile I'm closing this as a duplicate of #419.
comment:5 by , 12 years ago
Sorry for not linking to the relevant branch earlier; it's here: https://gitorious.org/~spaetz/mediagoblin/spaetz-mediagoblin/commits/WIP/large_uploads
comment:6 by , 12 years ago
Milestone: | → 0.4.0 |
---|
Okay, ho ho, reopening this one. There look like there are some good features in gandaro's branch here that weren't in #419. Since that's closed out, reopening this.
Problem: it was hard for me to review this. The commits seemed to be doing a lot more than the description of the commit at first glance (rearranging imports, shifting indentation, etc) and it made it harder to figure out what was going on.
I think it looks like there's good stuff in here though, so I'd like to get it merged... but it would be good to have some help! I also suspect this does not merge on top of master at present.
comment:7 by , 12 years ago
Resolution: | duplicate |
---|---|
Status: | closed → reopened |
comment:8 by , 12 years ago
This commit really only adds the "with ..." line and changes the 'foo\'bar'
to "foo'bar"
(although I think there was another occurence of that in the file that I did not change).
In the other commit I changed the if clause to only check wether attachments are not allowed, so we don't have the complete function indented. Then I edited line 143 in the freshly unindented block to use shutil.copyfileobj
instead of f1.write(f2.read())
. The rest is self-explanatory, I think. :-)
comment:9 by , 12 years ago
Cc: | added |
---|
comment:10 by , 11 years ago
Owner: | removed |
---|---|
Status: | reopened → review |
comment:11 by , 11 years ago
That is a new commit without the stylistic fixes. https://gitorious.org/~gandaro/mediagoblin/gandaros-mediagoblin/commits/647-cleaner
comment:12 by , 11 years ago
Owner: | set to |
---|---|
Status: | review → in_progress |
So this looks good, in the sense that it's definitely catching a lot of areas where we're clearly doing the wrong thing.
What I'm not sure about is if it works with rackspace cloudfiles. I haven't checked what happens with shutil.copyfileobj()
with cloudfiles, but there's no way we can merge this without knowing that it works with that.
#419 did handle that right, so maybe comparing with that is what's needed?
A few places I think actually it's easy enough to not use copyfileobj; eg:
In stl/processing.py for example:
# copy it up! with open(workbench_path, 'rb') as rendered_file: public_path = create_pub_filepath(entry, filename) with mgg.public_store.get_file(public_path, "wb") as public_file: shutil.copyfileobj(rendered_file, public_file)
For this, we could use the public_store's copy_local_to_storage() method, which should work right, and I think we should before this is merged (so I'm passing this back for that reason).
What to do with the files that are coming in off of requests in terms of the submission objects? I'm not sure about those. This is what we get, in some respects, for only-kinda-sorta working with with "file-like" objects. Anyway, if all our storage systems can support shutil.copyfileobj() for sure, and we know that we *expect* that, we could merge this... and we should document that as both known and expected behavior.
comment:14 by , 11 years ago
Milestone: | 0.5.0 → 0.6.0 |
---|
comment:15 by , 11 years ago
I think I'm running into this bug myself, when uploading large files (2.2gb in this example), I get the following server error:
Error - <type 'exceptions.OverflowError'>: requested number of bytes is more than a Python string can hold URL: http://media_goblin:6543/submit/ File '/home/mediagoblin/mediagoblin/local/lib/python2.7/site-packages/Paste-1.7.5.1-py2.7.egg/paste/exceptions/errormiddleware.py', line 144 in __call__ app_iter = self.application(environ, sr_checker) File '/home/mediagoblin/mediagoblin/local/lib/python2.7/site-packages/Paste-1.7.5.1-py2.7.egg/paste/urlmap.py', line 203 in _ _call__ return app(environ, start_response) File '/home/mediagoblin/mediagoblin/mediagoblin/app.py', line 259 in __call__ return self.call_backend(environ, start_response) File '/home/mediagoblin/mediagoblin/mediagoblin/app.py', line 236 in call_backend response = controller(request) File '/home/mediagoblin/mediagoblin/mediagoblin/decorators.py', line 52 in new_controller_func return controller(request, *args, **kwargs) File '/home/mediagoblin/mediagoblin/mediagoblin/submit/views.py', line 107 in submit_start queue_file.write(request.files['file'].stream.read()) OverflowError: requested number of bytes is more than a Python string can hold CGI Variables ------------- CONTENT_LENGTH: '2373533566' CONTENT_TYPE: 'multipart/form-data; boundary=---------------------------16242764931763880005461866851' CSRF_TOKEN: u'303073878618807821' HTTP_ACCEPT: 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8' HTTP_ACCEPT_ENCODING: 'gzip, deflate' HTTP_ACCEPT_LANGUAGE: 'en-au,en;q=0.7,en-us;q=0.3' HTTP_CONNECTION: 'keep-alive' HTTP_COOKIE: 'mediagoblin_csrftoken=303073878618807821; mediagoblin_csrftoken=303073878618807821; MGSession=eyJtZXNzYWdlcyI 6W10sInVzZXJfaWQiOiIxIn0.BSgdAQ.E7v4vBtW60YSELUEzHV_lf68iu8' HTTP_HOST: 'media_goblin:6543' HTTP_REFERER: 'http://media_goblin:6543/submit/' HTTP_USER_AGENT: 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:22.0) Gecko/20100101 Firefox/22.0' PATH_INFO: '/submit/' REMOTE_ADDR: '192.168.39.1' REQUEST_METHOD: 'POST' SERVER_NAME: '0.0.0.0' SERVER_PORT: '6543' SERVER_PROTOCOL: 'HTTP/1.1' WSGI Variables -------------- application: {(None, '/mgoblin_static'): <StaticURLParser '/home/mediagoblin/mediagoblin/mediagoblin/static'>, (None, '/mgo blin_media'): <StaticURLParser '/home/mediagoblin/mediagoblin/user_dev/media/public'>, (None, ''): <mediagoblin.app.MediaGobl inApp object at 0x9aef7cc>, (None, '/plugin_static'): <StaticURLParser '/home/mediagoblin/mediagoblin/user_dev/plugin_static' >, (None, '/theme_static'): <StaticURLParser '/home/mediagoblin/mediagoblin/user_dev/theme_static'>} paste.httpserver.thread_pool: <paste.httpserver.ThreadPool object at 0xa30d84c> paste.throw_errors: True werkzeug.request: <Request 'http://media_goblin:6543/submit/' [POST]> wsgi process: 'Multithreaded'
It's on a local installation (5.1 setup according to manual) inside an LXC container, with about 440mb or available RAM. Hope this helps.
comment:16 by , 11 years ago
andreas45, it's quite likely you're right that this is related. It would be good if someone could help test with openstack's storage and review the state of the ticket in regard to my comments... that would be good. It might be mergable as-is... I'm not sure. Testing with cloudfiles would help.
comment:17 by , 11 years ago
cwebber, I cloned from the repository you linked to (on my same machine) to see if it affects my problem, but I still get a very similar one:
[mediagoblin.media_types] No media handler found by file extension. Doing it the expensive way... Error - <type 'exceptions.OverflowError'>: requested number of bytes is more than a Python string can hold URL: http://media_goblin:6543/submit/ File '/home/mediagoblin/fix_test_mediagoblin/local/lib/python2.7/site-packages/Paste-1.7.5.1-py2.7.egg/paste/exceptions/errormiddleware.py', line 144 in __call__ app_iter = self.application(environ, sr_checker) File '/home/mediagoblin/fix_test_mediagoblin/local/lib/python2.7/site-packages/Paste-1.7.5.1-py2.7.egg/paste/urlmap.py', line 203 in __call__ return app(environ, start_response) File '/home/mediagoblin/fix_test_mediagoblin/mediagoblin/app.py', line 259 in __call__ return self.call_backend(environ, start_response) File '/home/mediagoblin/fix_test_mediagoblin/mediagoblin/app.py', line 236 in call_backend response = controller(request) File '/home/mediagoblin/fix_test_mediagoblin/mediagoblin/decorators.py', line 52 in new_controller_func return controller(request, *args, **kwargs) File '/home/mediagoblin/fix_test_mediagoblin/mediagoblin/submit/views.py', line 64 in submit_start request.files['file']) File '/home/mediagoblin/fix_test_mediagoblin/mediagoblin/media_types/__init__.py', line 66 in sniff_media media_file.write(media.stream.read()) OverflowError: requested number of bytes is more than a Python string can hold CGI Variables ------------- CONTENT_LENGTH: '2373533220' CONTENT_TYPE: 'multipart/form-data; boundary=---------------------------202606272211906711501952622290' CSRF_TOKEN: u'303073878618807821' HTTP_ACCEPT: 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8' HTTP_ACCEPT_ENCODING: 'gzip, deflate' HTTP_ACCEPT_LANGUAGE: 'en-au,en;q=0.7,en-us;q=0.3' HTTP_CONNECTION: 'keep-alive' HTTP_COOKIE: 'mediagoblin_csrftoken=303073878618807821; mediagoblin_csrftoken=303073878618807821; MGSession=eyJtZXNzYWdlcyI6W10sInVzZXJfaWQiOiIxIn0.BTEcBg.1OEEK9CvA_S_OnoTv15ImTxHnp8' HTTP_HOST: 'media_goblin:6543' HTTP_REFERER: 'http://media_goblin:6543/submit/' HTTP_USER_AGENT: 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:22.0) Gecko/20100101 Firefox/22.0' PATH_INFO: '/submit/' REMOTE_ADDR: '192.168.39.1' REQUEST_METHOD: 'POST' SERVER_NAME: '0.0.0.0' SERVER_PORT: '6543' SERVER_PROTOCOL: 'HTTP/1.1' WSGI Variables -------------- application: {(None, '/mgoblin_static'): <StaticURLParser '/home/mediagoblin/fix_test_mediagoblin/mediagoblin/static'>, (None, '/mgoblin_media'): <StaticURLParser '/home/mediagoblin/fix_test_mediagoblin/user_dev/media/public'>, (None, ''): <mediagoblin.app.MediaGoblinApp object at 0x968a9ac>, (None, '/plugin_static'): <StaticURLParser '/home/mediagoblin/fix_test_mediagoblin/user_dev/plugin_static'>, (None, '/theme_static'): <StaticURLParser '/home/mediagoblin/fix_test_mediagoblin/user_dev/theme_static'>} paste.httpserver.thread_pool: <paste.httpserver.ThreadPool object at 0x97eea0c> paste.throw_errors: True werkzeug.request: <Request 'http://media_goblin:6543/submit/' [POST]> wsgi process: 'Multithreaded' ------------------------------------------------------------
This one is in a different area though. Is this a place that hasn't been focusing on in your repository, or perhaps I've got the wrong bug?
comment:18 by , 11 years ago
Sadly you're right! That's a completely different instance of this same problem. I'm not sure if it should be split into another bug or not, but it should definitely be fixed.
comment:19 by , 11 years ago
Milestone: | 0.6.0 → 0.7.0 |
---|
comment:21 by , 10 years ago
#834 is indeed a duplicate, but has useful information for resolving this ticket.
comment:22 by , 10 years ago
Cc: | added |
---|---|
Owner: | removed |
Status: | in_progress → accepted |
Gandaro, I don't think you're currently working on this are you? I'm unclaiming this from you, but if you jump back on it, please reclaim!
comment:23 by , 10 years ago
Note: I'd love for this to get fixed in 0.7.0 but it's a pretty big task and has been around forever. It would be okay to move this to 0.8.0. Let's save this one till towards the end of the cycle.
comment:24 by , 10 years ago
Milestone: | 0.7.0 → 0.8.0 |
---|
I think this should be a 0.8.0 feature. But we should *really* do it for 0.8.0!
Taking this on will take some effort, and we're pretty close to release, so.
comment:25 by , 10 years ago
Cc: | added |
---|
comment:26 by , 10 years ago
Owner: | set to |
---|---|
Status: | accepted → in_progress |
Claiming and assigning to 0.8.0
comment:27 by , 10 years ago
Milestone: | 0.8.0 → 0.9.0 |
---|
I hate to bump this to the next release, but I am... but I am still working on it.
comment:28 by , 10 years ago
I just made the following two changes on my test instance. I had no problems uploading an image and a video to CloudFiles with the cloudfiles.py storage interface.
diff --git a/mediagoblin/media_types/__init__.py b/mediagoblin/media_types/__init__.py index 2e39231..ba49e33 100644 --- a/mediagoblin/media_types/__init__.py +++ b/mediagoblin/media_types/__init__.py @@ -16,6 +16,7 @@ import os import logging +import shutil import tempfile from mediagoblin.tools.pluginapi import hook_handle @@ -63,7 +64,7 @@ def sniff_media(media_file, filename): # Create a temporary file for sniffers suchs as GStreamer-based # Audio video tmp_media_file = tempfile.NamedTemporaryFile() - tmp_media_file.write(media_file.read()) + shutil.copyfileobj(media_file, tmp_media_file) tmp_media_file.seek(0) media_file.seek(0) diff --git a/mediagoblin/submit/lib.py b/mediagoblin/submit/lib.py index 541447e..8981a9c 100644 --- a/mediagoblin/submit/lib.py +++ b/mediagoblin/submit/lib.py @@ -15,6 +15,7 @@ # along with this program. If not, see <http://www.gnu.org/licenses/>. import logging +import shutil import uuid from os.path import splitext @@ -157,7 +158,7 @@ def submit_media(mg_app, user, submitted_file, filename, queue_file = prepare_queue_task(mg_app, entry, filename) with queue_file: - queue_file.write(submitted_file.read()) + shutil.copyfileobj(submitted_file, queue_file) # Get file size and round to 2 decimal places file_size = mg_app.queue_store.get_file_size(
comment:29 by , 10 years ago
I had no problems uploading an image and a video to CloudFiles with the cloudfiles.py storage interface.
The patch still doesn't fix https://gitorious.org/mediagoblin/mediagoblin/source/41dbb27a4f162b5c2b109d2214c5ccd2946ed535:mediagoblin/storage/cloudfiles.py#L213 or L153 of mediagoblin/edit/views.py.
Anyway, I think that the approach is good. I'd suggest to do it a bit another way, like this:
diff --git a/mediagoblin/submit/lib.py b/mediagoblin/submit/lib.py index 541447e..85188d9 100644 --- a/mediagoblin/submit/lib.py +++ b/mediagoblin/submit/lib.py @@ -157,7 +157,7 @@ def submit_media(mg_app, user, submitted_file, filename, queue_file = prepare_queue_task(mg_app, entry, filename) with queue_file: - queue_file.write(submitted_file.read()) + submitted_file.save(queue_file) # Get file size and round to 2 decimal places file_size = mg_app.queue_store.get_file_size(
comment:30 by , 10 years ago
I am also concerned about performance of cloudfiles.py when using copyfileobj. copyfileobj is defined here and in fact does multiple read() and write(), making cloudfiles' write() read data multiple times.
comment:31 by , 10 years ago
Keywords: | bitesized removed |
---|---|
Milestone: | 0.9.0 → 0.8.0 |
Owner: | changed from | to
comment:32 by , 10 years ago
Breton's suggestion above also tests out okay on my dev instance with CloudFiles enabled. I can't yet comment on whether either solution helps solve my personal out of memory errors, but neither solution seems to break my environment.
comment:33 by , 10 years ago
Whoops, found a problem. CLI uploads don't work.
~/mediagoblin $ bin/gmg addmedia admin image.png ** Message: pygobject_register_sinkfunc is deprecated (GstObject) Traceback (most recent call last): File "bin/gmg", line 9, in <module> load_entry_point('mediagoblin==0.7.2.dev', 'console_scripts', 'gmg')() File "/path/to/mediagoblin/mediagoblin/gmg_commands/__init__.py", line 156, in main_cli args.func(args) File "/path/to/mediagoblin/mediagoblin/gmg_commands/addmedia.py", line 103, in addmedia upload_limit=upload_limit, max_file_size=max_file_size) File "/path/to/mediagoblin/mediagoblin/submit/lib.py", line 163, in submit_media submitted_file.save(queue_file) AttributeError: 'file' object has no attribute 'save'
comment:34 by , 10 years ago
Wow. Interesting. It seems that submitted_file is not always an instance of werkzeug.datastructures.FileStorage.
OK, I'm still working on it, will take this into account.
comment:35 by , 10 years ago
In case anyone's curious, CLI upload does work when I use shutil.copyfileobj(submitted_file, queue_file)
.
comment:36 by , 10 years ago
I created separate ticket #5053 to address the bin/gmg addmedia
error introduced in 0.8.0.
comment:37 by , 10 years ago
I've attached a patch that should fix the issue. Ayleph, could you please test it on cloudfiles?
Also, any testing would be appreciated.
comment:38 by , 10 years ago
Thanks for the patch, Breton. It seems to work fine with CloudFiles storage. Console uploads via bin/gmg addmedia
work as well.
comment:39 by , 10 years ago
Owner: | removed |
---|---|
Status: | in_progress → review |
Patch was added to branch bug/647 of bretons-mediagoblin repo
comment:40 by , 10 years ago
Keywords: | review removed |
---|
comment:41 by , 10 years ago
Resolution: | → fixed |
---|---|
Status: | review → closed |
Since this went through review I'm trusting it, and merging it as-is! The tests pass fine.
Thanks for your hard work, breton & ayleph!
Fixing branch. Should work, though someone should have another look at it.