Opened 14 years ago
Last modified 13 years ago
#75 closed defect (FIXED)
tagging
Reported by: | Will Kahn-Greene | Owned by: | Caleb Davis |
---|---|---|---|
Priority: | minor | Milestone: | 0.0.4 |
Component: | programming | Keywords: | |
Cc: | Parent Tickets: |
Description
This feature request covers tagging images with a series of words/phrases related to the image. For example: - vacation - Florida - Mark - goofy Chris said this in the 0.0.3 planning meeting: :: Tagging -- In what sense? good question, we should at least make a ticket :) probably tagging like MediaEntry['tags'] = ['photo', 'baby'] is good enough
Change History (36)
comment:2 by , 14 years ago
Thanks Spaetz, insightful points. I think that for 0.1.0 we probably won't have tag auto-completion or anything, so it's not a big deal if we don't have a list of all active tags. But maybe we should make that tags collection from the get-go anyway. Since I'm not a MongoDB expert, I did a bit of googling instead. I found several things: - simple version: we just add tags to a collection as users add new ones and we see they don't exist in our tags collection yet. This shouldn't be too hard or bad, except for removing tags when they don't exist any longer would take a bit more tracking (maybe it would be in SQL also though), but wouldn't be so bad if we indexed on MediaEntry['tags'] (which we should anyway). I guess we don't need to worry about that too much. Anyway, that sounds mostly like it's what Spaetz said anyway.. so yes, creating a collection of tags (where the \_id is the tag name) is a good idea. - Say we wanted to do something like a tag cloud, or auto-suggest tags based on their popularity. How would we do that? - `http://cookbook.mongodb.org/patterns/count\_tags/ <http://cookbook.mongodb.org/patterns/count_tags/>`_ looks like a good example using mapreduce, which we've never used. Apparently it's possible to do incremental mapreduce so when we add new entries we don't have to do the whole thing over again? I don't know, I've never used mapreduce - The comments of that cookbook entry mention a new aggregation framework in mongodb which sounds pretty stellar, but isn't here yet and probably won't be until the 1.9 series. - Here's the bug, but it doesn't contain too much info `https://jira.mongodb.org/browse/SERVER-447 <https://jira.mongodb.org/browse/SERVER-447>`_ - `https://gist.github.com/993733 <https://gist.github.com/993733>`_ is apparently some examples of this. Maybe we should implement the simple version with this bug, it doesn't sound like it would be too much extra work? I'll leave that to Will to decide... even if we don't get that in, I don't think it's the end of the world.
comment:3 by , 14 years ago
I'm about to go away for a week, so I'm pretty sure I'm not going to get to this for 0.0.3, but it's high on my list for July work.
comment:4 by , 14 years ago
Component: | → Programming |
---|---|
Owner: | set to |
comment:4 by , 14 years ago
This should also be designed in such a way that it can snatch the IPTC "keywords" field from the imported images and mark them as tags. Also, when people add new tags to an image, it should be possible to write the new tags out to the image again (so that the image files always have the correct information stored inside).
comment:5 by , 14 years ago
Hey Odin! That sounds like a nice iteration, but I think not necessary for closing this ticket. But I think it's worth opening a second ticket about that. That will become more and more significant as we get toward things like the mobile uploading API.
comment:6 by , 14 years ago
Milestone: | 0.0.3 → 0.0.4 |
---|
comment:7 by , 14 years ago
Feedback requested!! What is remaining to close this ticket? I moved the ball on this a teeny bit by adding tags fields in various places. - [https://gitorious.org/\ :sub:`cfdv/mediagoblin/cfdvs-mediagoblin/commits/f360\_tagging](https://gitorious.org/`\ cfdv/mediagoblin/cfdvs-mediagoblin/commits/f360\_tagging). Tagging is so important. Basically, this is just to get the ball rolling and hopefully get your thoughts on what you want tagging to be. assumptions: - space-delimited tags - no multi-word tags TODO for this ticket: - move the tags so they display in the sidebar? - .. raw:: html <del> create a tags template so we can {% include %} it .. raw:: html </del> (not really needed yet) - ensure tags aren't duplicated in the same media entry - form validation - what should we consider? maximum tag length? case-sensitivity? - database design - shall we store the tags by ID so that it's easier to combine, rename, and delete them? If so, how do we distinguish which tags come from which users? Things to go on future tickets - tag clouds - auto-completes - beautification - public vs. private tags (delicious does a good job here) - how to deal with tag- and viewer-specific privacy scenario - Pinky searches Brain's items by 'wishlist' tag to get ideas for Brain's birthday gift. However, Brain wants a great many things, most of which are connected to top-secret plans to take over the world. These items are tagged as wishlist by Brain but they are not shared with Pinky, therefore Pinky should only pull up those media items tagged with wishlist AND viewable by Pinky. If Brain has no viewable wishlist items for Pinky then the search comes up empty even though Brain's collection is full of wishlist tags.
comment:8 by , 14 years ago
Nice! Looking good. From a quick scan, I have a couple of thoughts: - Is splitting on spaces best, or is splitting on commas best? I'm actually not totally sure, but both firefox and wordpress split on commas, so I'm inclined toward that. - Maybe we should add sanitization to this field that both converts the tag string into a list of tags and removes excess whitespace from the tags. Like so: def convert\_to\_tag\_list(tag\_string): return [ # each remove excess spacing in something like ' tag 3 ' u' '.join(tag.strip().split()) # split on commas for tag in tag\_string.split(u',') # only consider this tag if it has content after a strip if tag.strip()] convert\_to\_tag\_list(u' tag 1, tag 2, tag3') ['tag 1', 'tag 2', 'tag3'] You should be able to install this on the form field's filters list. I've never done this, but it's something like: :: tags = wtforms.TextField( 'Tags', filters=[convert_to_tag_list]) - Now would be a good time to add a couple of tagging indexes (please ping me on IRC about this, though.) - One across all active mediaentries, by tag - One across all active mediaentries for a user, by tag
comment:9 by , 14 years ago
okay, working... - delimiter - you say tah-MAY-dough, I say tah-MAW-dough, let's let the user decide. Suggestions? - indices - Did you see mongodb's query optimizer (`http://www.mongodb.org/display/DOCS/Query+Optimizer <http://www.mongodb.org/display/DOCS/Query+Optimizer>`_)? It makes sense that you would need this because the most efficient query path will fluctuate depending on the data in the database. It's almost like you're supposed to supply multiple index types, and let the optimizer figure out the most efficient query path on the fly. I will make one that makes the most sense to me, and we'll iterate from there.
comment:10 by , 14 years ago
from the mongodb indexing docs: There is a maximum key size in the indexes, currently approximately 800 bytes. Documents with a key to index greater than this size are allowed, but the key is not indexed. This limitation will eventually be removed.
comment:11 by , 14 years ago
hm.... related to the indexing tags `http://groups.google.com/group/mongodb-user/browse\_thread/thread/5acf65c31fa4f77 <http://groups.google.com/group/mongodb-user/browse_thread/thread/5acf65c31fa4f77>`_ So my impression is then that excessively long tags can't be indexed? Maybe we should set a limit on how long a tag name is allowed to be based around that limitation. I'm not sure that tags > 50 characters are helpful anyway. One more thing. What about case sensitivity? We could normalize all our tags to lowercase, but is that a good idea, I wonder?
comment:12 by , 14 years ago
another tiny commit proposing an index for tag searches by uploader - [https://gitorious.org/\ :sub:`cfdv/mediagoblin/cfdvs-mediagoblin/commits/f360\_tagging](https://gitorious.org/`\ cfdv/mediagoblin/cfdvs-mediagoblin/commits/f360\_tagging)
comment:13 by , 14 years ago
woohoo, after some timely help in IRC, I pushed an update incorporating the filter function. I made the tags delimiter a global variable in mediagoblin/util. I like spaces! /me shakes fist ;) :) [https://gitorious.org/\ :sub:`cfdv/mediagoblin/cfdvs-mediagoblin/commits/f360\_tagging](https://gitorious.org/`\ cfdv/mediagoblin/cfdvs-mediagoblin/commits/f360\_tagging) EDIT: .. raw:: html <del> next up, search! .. raw:: html </del> different ticket ;)
comment:14 by , 14 years ago
Status: | New → Feedback |
---|
feedback requested! [https://gitorious.org/\ :sub:`cfdv/mediagoblin/cfdvs-mediagoblin/commits/f360\_tagging](https://gitorious.org/`\ cfdv/mediagoblin/cfdvs-mediagoblin/commits/f360\_tagging) perhaps this makes setting up a tag search bite-sized! 8-> by default in this branch, tags are: - space-delimited - no greater than 50 characters - case-insensitive I went with what I like. These settings are easily changed in mediagoblin/util.py.
comment:15 by , 14 years ago
oh yeah, and there was a discussion in irc about whether or not to index on processing state. It was decided to not since: - very few items should fall into the unprocessed category, - serving only the processed media could be unexpected. The failure scenario went something like, 'hey, I just submitted a movie tagged with "shniz", but it's not showing up when I search for my shniz.'
comment:16 by , 14 years ago
Hm... we almost certainly **should** index on processing state. We should probably index on processing state first, before tags. The reason for this is that if you do a query for tags and you also query on processing, if you don't index on processing yourself, every query done will manually walk over every result returned and check to see what its processing state is.
comment:17 by , 14 years ago
added requested indexing on processing state. I think I get it now. Unprocessed media will appear elsewhere, therefore there is no need to mix them up with active items in the main views!
comment:18 by , 14 years ago
Hey! So I finally have some feedback. Sorry for the wait! So this is looking really good... it's pretty close to mergable. There are a few things: - I think we should split on comma, and some conversation on IRC makes me feel a bit more strongly about this: do tags need to be able to consist of multiple words? schendje: Yes. I think people are going to want that. outside of microblogging, multi-word-tags seem in demand. schendje: I often tag people in, "Odin Horthe Omdal" et. al. And also names of what's happening. Or organizations. or you tag them OdinHortheOmdal ;) Velmont: thanks for saying this btw, I was wondering if any actual users were going to feel this way Velmont: and now I know in advance. ok no multiple words make sense I guess er, weird sentence multiple words *do* make sense, I think so if we seperate by commas, we will be able to do both "this,that,something" *and* "this, that, something" without it causing any trouble? schendje: Yes. :-) I hate software that forces you to use either spaces or no spaces, even though it could take care of that itself Split on ',' as long as it's not inside " or ', and then trim every element in the list. <- that's how it's normally looks implemented :-) Don't worry about the quotation stuff, we can handle that later, but I'm now completely convinced that splitting on commas is the way to go for now. - convert\_to\_tag\_list changes - Pass in tag string instead of request to convert\_to\_tag\_list Don't pass in a request to convert\_to\_tag\_list please... passing in simply the tag\_string is a lot more versatile. - Moving errors from messages->form errors, and a separate function for finding errors So this is kind of a two-parter... First of all, I don't think errors should use the messaging system... most of the time I think that's better for quick errors, particularly quick errors that are displayed after a redirect. It's more clear to users if we have errors related to a field input that failed displayed inline. Which leads to the question, if you aren't sending the errors directly from this function, how do you pass them in? I think the best way is via a few things... first of all, make def convert\_to\_tag\_list() **just** convert to a tag list, not determine if any tags are too long. Then make a separate function that detects if tags are too long. In fact! This could be done as a validator on the form. Something like: in util.py: :: TOO_LONG_TAG_WARNING = \ u'Tags must be shorter than %s characters. Tags that are too long: %s' def tag_length_validator(form, field): """ Make sure tags do not exceed the maximum tag length. """ tags = convert_to_tag_list(field.data) too_long_tags = [ tag for tag in tags if len(tag) > TAGS_MAX_LENGTH] if too_long_tags: raise ValidationError( TOO_LONG_TAG_WARNING % ( TAGS_MAX_LENGTH, ', '.join(too_long_tags))) in submit/forms.py: :: from mediagoblin.util import tag_length_validator class SubmitStartForm(wtforms.Form): [...] tags = wtforms.TextField( 'Tags', [tag_length_validator]) For more information: `http://wtforms.simplecodes.com/docs/0.6.1/validators.html#custom-validators <http://wtforms.simplecodes.com/docs/0.6.1/validators.html#custom-validators>`_ - Should we have some page that allows us to view all active media related to an active tag? Something like: /tag/bunnies/ ... also I've been thinking about something to slugify / normalize tags, but I think it's going to drive you crazy, so I'm going to think about it, and make one more post on the comments here, and then we can talk on IRC :)
comment:19 by , 14 years ago
I will leave it up to Caleb/gullydwarf to add notes about the rest of the things we discussed, but re: quoting tags, this might be helpful :: <paulproteus> gullydwarf + paroneayea: re: quoting in tags: OpenHatch has some code and test cases for that. <paulproteus> Let me dig 'em up quickly <paulproteus> https://gitorious.org/openhatch/oh-mainline/blobs/master/mysite/profile/controllers.py and https://gitorious.org/openhatch/oh-mainline/blobs/master/mysite/profile/views.py probably <paulproteus> must vanish for a few <paulproteus> https://gitorious.org/openhatch/oh-mainline/blobs/master/mysite/profile/controllers.py#line181 <paulproteus> for example<paulproteus> I don't see any tests, sadly. <paulproteus> It's also a bunch of spaghetti, that code. <paulproteus> https://gitorious.org/openhatch/oh-mainline/blobs/master/mysite/search/tests.py#line256 looks better
comment:20 by , 14 years ago
remaining items: - .. raw:: html <del> split on ',' by default .. raw:: html </del> - .. raw:: html <del> add tag\_delimiter option in config\_spec.ini .. raw:: html </del> - .. raw:: html <del> use list-o-dicts instead of list for tags, allowing slugified tags: 'tags': [{'name': 'Gully Gardens', 'slug': 'gully-gardens'}, {'name': 'Castle Adventure Time?!", 'slug': 'castle-adventure-time'}] .. raw:: html </del> - .. raw:: html <del> indices should use tags.slug .. raw:: html </del> - .. raw:: html <del> add support for quoting .. raw:: html </del> EDIT (7/24,IRC): rolled over to next ticket - .. raw:: html <del> separate the tag parsing from length detection, and use in-line form validation to raise errors .. raw:: html </del> - .. raw:: html <del> use tag\_string instead of request as input to tag parsing function .. raw:: html </del> - EDIT: add unit tests /tag/bunnies and /tag/bunnies+fangs and other searching and displaying views can come in future tickets thanks for all the detailed feedback. EDIT: crossing stuff out as I go along
comment:21 by , 14 years ago
Instead of writing code that allows for different tagging markup (this includes the delimiter), I think it's better to expose the tag parser in the API so that a plugin can override it. That way people can use different tagging markups, implement the parser in a plugin, and we don't have to maintain code that doesn't quite go all the way. Also, this really should have unit tests before the functionality gets merged in.
comment:22 by , 14 years ago
Will, I'm not sure what you mean by tagging markup. Excepting the delimiter, what markup is there?
comment:23 by , 14 years ago
Well, so we have delimiters and just like csv/tsv files, you need to have a way to allow the tag to include the delimiter. So most people use double-quotes. Then you need a way to escape double quotes, too. At that point, I vote we can call it a markup. So, we've got two markups now, but what if I want to use xml? What if I want to use some kind of semantic hierarchy? What are all the kinds of ways that people tag images today? I think there's a lot of possibility here. Just like we plan to push comment/description markup to a plugin, I think we should push this off to a plugin, too. Amongst other things, it requires less code, less functionality to test, and it gives us flexibility down the road when someone has a use case we didn't think of.
comment:24 by , 14 years ago
I'm not sure what to do next, in that case. I can: - complete the bullets above, or - work towards designing and building a plugin I would be happy to render assistance for the latter, but a plugin feature is advanced enough to warrant a new ticket. Please advise!
comment:25 by , 14 years ago
I would not worry about making it a plugin. That might be a good idea, but we'll need plugin infrastructure to do that and we don't have it yet. Really, I think beyond comma and space delimiters, that's good enough for almost everyone's needs (at least for now). I'm excited about this and I'd really like to see this in the next release so please proceed!
comment:26 by , 14 years ago
Status: | Feedback → In Progress |
---|
- separate the tag parsing from length detection, and use in-line form validation to raise errors - use tag\_string instead of request as input to tag parsing function [https://gitorious.org/\ :sub:`cfdv/mediagoblin/cfdvs-mediagoblin/commit/909371cdceace162af880c275b9e6e70488e3029](https://gitorious.org/`\ cfdv/mediagoblin/cfdvs-mediagoblin/commit/909371cdceace162af880c275b9e6e70488e3029)
comment:28 by , 14 years ago
more bullets finished ([https://gitorious.org/\ :sub:`cfdv/mediagoblin/cfdvs-mediagoblin/commit/10d7496da2f147a30a70304e8be3a579b15fd093](https://gitorious.org/`\ cfdv/mediagoblin/cfdvs-mediagoblin/commit/10d7496da2f147a30a70304e8be3a579b15fd093)): - split on ',' by default - add tag\_delimiter option in config\_spec.ini Also updated the media submission view to use inline form error messaging. [https://gitorious.org/\ :sub:`cfdv/mediagoblin/cfdvs-mediagoblin/commit/97e4498c10c803c9d239011d4c48efef52673ec3](https://gitorious.org/`\ cfdv/mediagoblin/cfdvs-mediagoblin/commit/97e4498c10c803c9d239011d4c48efef52673ec3)
comment:29 by , 14 years ago
I just want to save some suggestions from irc for possible ways to display the tag list in the template: to show the tags you'd probably do something like: :: <div class="mediaentry_tags"> {% for tag in entry['tags'] %} <div class="tag">{{ tag['name'] }}</div> {% endfor %} </div> or later on when we have tag views, something like: :: <div class="mediaentry_tags"> {% for tag in entry['tags'] %} <div class="tag"> <a href="{{ request.urlgen('something_tags', tag=tag['slug']) }}">{{ tag['name'] }}</a> </div> {% endfor %} </div> You could fall back to a simple text\_string display by passing the list\_of\_dicts -> tag\_string function in the context :: return render_to_response( request, 'mediagoblin/submit/start.html', {'submit_form': submit_form, 'media_tags_as_string': media_tags_as_string}) The template could do it with :: <div class="mediaentry_tags"> {% for tag in entry['tags'] %} <div class="tag">{{ tag['name'] }}</div>{% if not loop.last %},{% endif %} {% endfor %} </div> also it is correct to wrap in :: <ul><li>...</li><li>...</li></ul>
comment:30 by , 14 years ago
- use list-o-dicts instead of list for tags, allowing slugified tags: 'tags': [{'name': 'Gully Gardens', 'slug': 'gully-gardens'}, {'name': 'Castle Adventure Time?!", 'slug': 'castle-adventure-time'}] [https://gitorious.org/\ :sub:`cfdv/mediagoblin/cfdvs-mediagoblin/commit/0712a06dc6d6b05cb78d4b10af12c051e8f765e3](https://gitorious.org/`\ cfdv/mediagoblin/cfdvs-mediagoblin/commit/0712a06dc6d6b05cb78d4b10af12c051e8f765e3)
comment:31 by , 14 years ago
- state\_tags\_created - `http://dpaste.com/580079/ <http://dpaste.com/580079/>`_ EDIT: the dpaste is doing the following - first I do a query to find a picture I tagged with 'imeanit' - and I show the full record - then I perform the find again using explain() to tell me how the database executed the query. It said u'cursor': u'BasicCursor', which meant that the query didn't use the index and did a full table scan - I then fixed the query to include the correct fields in the correct order, and then the cursor is a Btree, / Mentioning these details will help me remember what I did. :) EDIT: the user-specific index is also working - state\_uploader\_tags\_created - `http://dpaste.com/580153/ <http://dpaste.com/580153/>`_ [https://gitorious.org/\ :sub:`cfdv/mediagoblin/cfdvs-mediagoblin/commit/37be7b6da6aa765991fac55c5c2d0cdb37fd24fc](https://gitorious.org/`\ cfdv/mediagoblin/cfdvs-mediagoblin/commit/37be7b6da6aa765991fac55c5c2d0cdb37fd24fc) EDIT: I could use some help identifying unit tests now.
comment:32 by , 13 years ago
Status: | In Progress → Feedback |
---|
tests pass [https://gitorious.org/\ :sub:`cfdv/mediagoblin/cfdvs-mediagoblin/commit/8ff4dec742d0b9f375afd9d1862a560e1be200d1](https://gitorious.org/`\ cfdv/mediagoblin/cfdvs-mediagoblin/commit/8ff4dec742d0b9f375afd9d1862a560e1be200d1) feedback requested
comment:34 by , 13 years ago
The original url for this bug was http://bugs.foocorp.net/issues/360 .
Relations:
#162: blocks, #163: blocks
Note:
See TracTickets
for help on using tickets.