Opened 13 years ago

Last modified 13 years ago

#75 closed defect (FIXED)

tagging

Reported by: Will Kahn-Greene Owned by: Caleb Davis
Priority: minor Milestone: 0.0.4
Component: programming Keywords:
Cc: Parent Tickets:

Description

This feature request covers tagging images with a series of
words/phrases related to the image.

For example:


-  vacation
-  Florida
-  Mark
-  goofy

Chris said this in the 0.0.3 planning meeting:

::

        Tagging -- In what sense?  good question, we should at least make a ticket :) probably tagging
        like MediaEntry['tags'] = ['photo', 'baby'] is good enough



Change History (36)

comment:1 by Sebastian Spaeth, 13 years ago

This might be my SQL object-related database thinking, but wouldn't
it make sense to have a table with tag names, and link to those tag
names from the media page?

I imagine we will be filtering by tags a lot (search for all media
tagged "Florida"), and having to parse all media documents, pull
out tags and compare will not be nice in terms of performance. If
we have a "tag table" we can simply look for all media with tag
including a link to a certain tag (I don't really know mongodb, so
that might be nonsense here).

When "modifying" a media tag, we would 1) create a new tag (if
needed) with the new value and remove the link from the old tag
name, removing the old tag if no link is left. This way I can
change a tag from "Florida" to "Floor-ida" without changing the tag
for all media that have it.

What do the mongo experts say here?



comment:2 by Christopher Allan Webber, 13 years ago

Thanks Spaetz, insightful points. I think that for 0.1.0 we
probably won't have tag auto-completion or anything, so it's not a
big deal if we don't have a list of all active tags. But maybe we
should make that tags collection from the get-go anyway.

Since I'm not a MongoDB expert, I did a bit of googling instead. I
found several things:


-  simple version: we just add tags to a collection as users add
   new ones and we see they don't exist in our tags collection yet.
   This shouldn't be too hard or bad, except for removing tags when
   they don't exist any longer would take a bit more tracking (maybe
   it would be in SQL also though), but wouldn't be so bad if we
   indexed on MediaEntry['tags'] (which we should anyway). I guess we
   don't need to worry about that too much. Anyway, that sounds mostly
   like it's what Spaetz said anyway.. so yes, creating a collection
   of tags (where the \_id is the tag name) is a good idea.
-  Say we wanted to do something like a tag cloud, or auto-suggest
   tags based on their popularity. How would we do that?
   
   -  `http://cookbook.mongodb.org/patterns/count\_tags/ <http://cookbook.mongodb.org/patterns/count_tags/>`_
      looks like a good example using mapreduce, which we've never used.
      Apparently it's possible to do incremental mapreduce so when we add
      new entries we don't have to do the whole thing over again? I don't
      know, I've never used mapreduce
   -  The comments of that cookbook entry mention a new aggregation
      framework in mongodb which sounds pretty stellar, but isn't here
      yet and probably won't be until the 1.9 series.
   -  Here's the bug, but it doesn't contain too much info
      `https://jira.mongodb.org/browse/SERVER-447 <https://jira.mongodb.org/browse/SERVER-447>`_
   -  `https://gist.github.com/993733 <https://gist.github.com/993733>`_
      is apparently some examples of this.


Maybe we should implement the simple version with this bug, it
doesn't sound like it would be too much extra work? I'll leave that
to Will to decide... even if we don't get that in, I don't think
it's the end of the world.



comment:3 by Will Kahn-Greene, 13 years ago

I'm about to go away for a week, so I'm pretty sure I'm not going
to get to this for 0.0.3, but it's high on my list for July work.



comment:4 by Elrond, 13 years ago

Component: Programming
Owner: set to Will Kahn-Greene

comment:4 by Odin Hørthe Omdal, 13 years ago

This should also be designed in such a way that it can snatch the
IPTC "keywords" field from the imported images and mark them as
tags. Also, when people add new tags to an image, it should be
possible to write the new tags out to the image again (so that the
image files always have the correct information stored inside).



comment:5 by Christopher Allan Webber, 13 years ago

Hey Odin! That sounds like a nice iteration, but I think not
necessary for closing this ticket. But I think it's worth opening a
second ticket about that. That will become more and more
significant as we get toward things like the mobile uploading API.



comment:6 by Elrond, 13 years ago

Milestone: 0.0.30.0.4

comment:6 by Caleb Davis, 13 years ago

Owner: set to Caleb Davis
0.0.3 meeting result



comment:7 by Caleb Davis, 13 years ago

Feedback requested!! What is remaining to close this ticket?

I moved the ball on this a teeny bit by adding tags fields in
various places. -
[https://gitorious.org/\ :sub:`cfdv/mediagoblin/cfdvs-mediagoblin/commits/f360\_tagging](https://gitorious.org/`\ cfdv/mediagoblin/cfdvs-mediagoblin/commits/f360\_tagging).
Tagging is so important. Basically, this is just to get the ball
rolling and hopefully get your thoughts on what you want tagging to
be.

assumptions:


-  space-delimited tags
-  no multi-word tags TODO for this ticket:
-  move the tags so they display in the sidebar?
-  
.. raw:: html

      <del>
      
   create a tags template so we can {% include %} it
   
.. raw:: html

      </del> 
      
   (not really needed yet)
-  ensure tags aren't duplicated in the same media entry
-  form validation - what should we consider? maximum tag length?
   case-sensitivity?
-  database design - shall we store the tags by ID so that it's
   easier to combine, rename, and delete them? If so, how do we
   distinguish which tags come from which users? Things to go on
   future tickets
-  tag clouds
-  auto-completes
-  beautification
-  public vs. private tags (delicious does a good job here)
-  how to deal with tag- and viewer-specific privacy

    scenario - Pinky searches Brain's items by 'wishlist' tag to get
    ideas for Brain's birthday gift. However, Brain wants a great many
    things, most of which are connected to top-secret plans to take
    over the world. These items are tagged as wishlist by Brain but
    they are not shared with Pinky, therefore Pinky should only pull up
    those media items tagged with wishlist AND viewable by Pinky. If
    Brain has no viewable wishlist items for Pinky then the search
    comes up empty even though Brain's collection is full of wishlist
    tags.




comment:8 by Christopher Allan Webber, 13 years ago

Nice! Looking good.

From a quick scan, I have a couple of thoughts:


-  Is splitting on spaces best, or is splitting on commas best? I'm
   actually not totally sure, but both firefox and wordpress split on
   commas, so I'm inclined toward that.
-  Maybe we should add sanitization to this field that both
   converts the tag string into a list of tags and removes excess
   whitespace from the tags. Like so:

   def convert\_to\_tag\_list(tag\_string): return [ # each remove
   excess spacing in something like ' tag 3 ' u'
   '.join(tag.strip().split()) # split on commas for tag in
   tag\_string.split(u',') # only consider this tag if it has content
   after a strip if tag.strip()]

               convert\_to\_tag\_list(u' tag 1, tag 2, tag3') ['tag 1', 'tag 2',
               'tag3']





You should be able to install this on the form field's filters
list. I've never done this, but it's something like:

::

        tags = wtforms.TextField(
            'Tags', filters=[convert_to_tag_list])


-  Now would be a good time to add a couple of tagging indexes
   (please ping me on IRC about this, though.)
   
   -  One across all active mediaentries, by tag
   -  One across all active mediaentries for a user, by tag




comment:9 by Caleb Davis, 13 years ago

okay, working...


-  delimiter - you say tah-MAY-dough, I say tah-MAW-dough, let's
   let the user decide. Suggestions?
-  indices - Did you see mongodb's query optimizer
   (`http://www.mongodb.org/display/DOCS/Query+Optimizer <http://www.mongodb.org/display/DOCS/Query+Optimizer>`_)?
   It makes sense that you would need this because the most efficient
   query path will fluctuate depending on the data in the database.
   It's almost like you're supposed to supply multiple index types,
   and let the optimizer figure out the most efficient query path on
   the fly. I will make one that makes the most sense to me, and we'll
   iterate from there.



comment:10 by Caleb Davis, 13 years ago

from the mongodb indexing docs:

    There is a maximum key size in the indexes, currently approximately
    800 bytes. Documents with a key to index greater than this size are
    allowed, but the key is not indexed. This limitation will
    eventually be removed.




comment:11 by Christopher Allan Webber, 13 years ago

hm.... related to the indexing tags
`http://groups.google.com/group/mongodb-user/browse\_thread/thread/5acf65c31fa4f77 <http://groups.google.com/group/mongodb-user/browse_thread/thread/5acf65c31fa4f77>`_

So my impression is then that excessively long tags can't be
indexed? Maybe we should set a limit on how long a tag name is
allowed to be based around that limitation. I'm not sure that tags
> 50 characters are helpful anyway.

One more thing. What about case sensitivity? We could normalize all
our tags to lowercase, but is that a good idea, I wonder?



comment:12 by Caleb Davis, 13 years ago

another tiny commit proposing an index for tag searches by uploader
-
[https://gitorious.org/\ :sub:`cfdv/mediagoblin/cfdvs-mediagoblin/commits/f360\_tagging](https://gitorious.org/`\ cfdv/mediagoblin/cfdvs-mediagoblin/commits/f360\_tagging)



comment:13 by Caleb Davis, 13 years ago

woohoo, after some timely help in IRC, I pushed an update
incorporating the filter function. I made the tags delimiter a
global variable in mediagoblin/util. I like spaces! /me shakes fist
;) :)

[https://gitorious.org/\ :sub:`cfdv/mediagoblin/cfdvs-mediagoblin/commits/f360\_tagging](https://gitorious.org/`\ cfdv/mediagoblin/cfdvs-mediagoblin/commits/f360\_tagging)

EDIT:

.. raw:: html

   <del>
   
next up, search!

.. raw:: html

   </del> 
   
different ticket ;)



comment:14 by Caleb Davis, 13 years ago

Status: NewFeedback
feedback requested!

[https://gitorious.org/\ :sub:`cfdv/mediagoblin/cfdvs-mediagoblin/commits/f360\_tagging](https://gitorious.org/`\ cfdv/mediagoblin/cfdvs-mediagoblin/commits/f360\_tagging)

perhaps this makes setting up a tag search bite-sized! 8->

by default in this branch, tags are:


-  space-delimited
-  no greater than 50 characters
-  case-insensitive

I went with what I like. These settings are easily changed in
mediagoblin/util.py.



comment:15 by Caleb Davis, 13 years ago

oh yeah, and there was a discussion in irc about whether or not to
index on processing state. It was decided to not since:


-  very few items should fall into the unprocessed category,
-  serving only the processed media could be unexpected. The
   failure scenario went something like, 'hey, I just submitted a
   movie tagged with "shniz", but it's not showing up when I search
   for my shniz.'



comment:16 by Christopher Allan Webber, 13 years ago

Hm... we almost certainly **should** index on processing state. We
should probably index on processing state first, before tags.

The reason for this is that if you do a query for tags and you also
query on processing, if you don't index on processing yourself,
every query done will manually walk over every result returned and
check to see what its processing state is.



comment:17 by Caleb Davis, 13 years ago

added requested indexing on processing state.

I think I get it now. Unprocessed media will appear elsewhere,
therefore there is no need to mix them up with active items in the
main views!



comment:18 by Christopher Allan Webber, 13 years ago

Hey! So I finally have some feedback. Sorry for the wait!

So this is looking really good... it's pretty close to mergable.

There are a few things:


-  I think we should split on comma, and some conversation on IRC
   makes me feel a bit more strongly about this:

   do tags need to be able to consist of multiple words? schendje:
   Yes. I think people are going to want that. outside of
   microblogging, multi-word-tags seem in demand. schendje: I often
   tag people in, "Odin Horthe Omdal" et. al. And also names of what's
   happening. Or organizations. or you tag them OdinHortheOmdal ;)
   Velmont: thanks for saying this btw, I was wondering if any actual
   users were going to feel this way Velmont: and now I know in
   advance. ok no multiple words make sense I guess er, weird sentence
   multiple words *do* make sense, I think so if we seperate by
   commas, we will be able to do both "this,that,something" *and*
   "this, that, something" without it causing any trouble? schendje:
   Yes. :-) I hate software that forces you to use either spaces or no
   spaces, even though it could take care of that itself Split on ','
   as long as it's not inside " or ', and then trim every element in
   the list. <- that's how it's normally looks implemented :-)


Don't worry about the quotation stuff, we can handle that later,
but I'm now completely convinced that splitting on commas is the
way to go for now.


-  convert\_to\_tag\_list changes
   
   -  Pass in tag string instead of request to
      convert\_to\_tag\_list


Don't pass in a request to convert\_to\_tag\_list please... passing
in simply the tag\_string is a lot more versatile.


-  Moving errors from messages->form errors, and a separate
   function for finding errors

So this is kind of a two-parter...

First of all, I don't think errors should use the messaging
system... most of the time I think that's better for quick errors,
particularly quick errors that are displayed after a redirect. It's
more clear to users if we have errors related to a field input that
failed displayed inline.

Which leads to the question, if you aren't sending the errors
directly from this function, how do you pass them in? I think the
best way is via a few things... first of all, make def
convert\_to\_tag\_list() **just** convert to a tag list, not
determine if any tags are too long. Then make a separate function
that detects if tags are too long. In fact! This could be done as a
validator on the form. Something like:

in util.py:

::

    TOO_LONG_TAG_WARNING = \
        u'Tags must be shorter than %s characters.  Tags that are too long: %s'
    
    def tag_length_validator(form, field):
        """ 
        Make sure tags do not exceed the maximum tag length.
        """ 
        tags = convert_to_tag_list(field.data)
        too_long_tags = [
            tag for tag in tags
            if len(tag) > TAGS_MAX_LENGTH]
    
        if too_long_tags:
            raise ValidationError(
                TOO_LONG_TAG_WARNING % (
                    TAGS_MAX_LENGTH, ', '.join(too_long_tags)))

in submit/forms.py:

::

    from mediagoblin.util import tag_length_validator
    
    class SubmitStartForm(wtforms.Form):
        [...]
        tags = wtforms.TextField(
            'Tags',
            [tag_length_validator])

For more information:
`http://wtforms.simplecodes.com/docs/0.6.1/validators.html#custom-validators <http://wtforms.simplecodes.com/docs/0.6.1/validators.html#custom-validators>`_


-  Should we have some page that allows us to view all active media
   related to an active tag? Something like: /tag/bunnies/

... also I've been thinking about something to slugify / normalize
tags, but I think it's going to drive you crazy, so I'm going to
think about it, and make one more post on the comments here, and
then we can talk on IRC :)



comment:19 by Christopher Allan Webber, 13 years ago

I will leave it up to Caleb/gullydwarf to add notes about the rest
of the things we discussed, but re: quoting tags, this might be
helpful

::

    <paulproteus> gullydwarf + paroneayea: re: quoting in tags: OpenHatch has some
                  code and test cases for that.
    <paulproteus> Let me dig 'em up quickly
    <paulproteus> https://gitorious.org/openhatch/oh-mainline/blobs/master/mysite/profile/controllers.py
                  and
                  https://gitorious.org/openhatch/oh-mainline/blobs/master/mysite/profile/views.py
                  probably
    <paulproteus> must vanish for a few
    <paulproteus> https://gitorious.org/openhatch/oh-mainline/blobs/master/mysite/profile/controllers.py#line181
    <paulproteus> for example<paulproteus> I don't see any tests, sadly.
    <paulproteus> It's also a bunch of spaghetti, that code.
    <paulproteus> https://gitorious.org/openhatch/oh-mainline/blobs/master/mysite/search/tests.py#line256
                  looks better



comment:20 by Caleb Davis, 13 years ago

remaining items:


-  
.. raw:: html

      <del>
      
   split on ',' by default
   
.. raw:: html

      </del>
      
-  
.. raw:: html

      <del>
      
   add tag\_delimiter option in config\_spec.ini
   
.. raw:: html

      </del>
      
-  
.. raw:: html

      <del>
      
   use list-o-dicts instead of list for tags, allowing slugified
   tags:

   'tags': [{'name': 'Gully Gardens', 'slug': 'gully-gardens'},
   {'name': 'Castle Adventure Time?!", 'slug':
   'castle-adventure-time'}]



.. raw:: html

   </del>
     
   

-  
.. raw:: html

      <del>
      
   indices should use tags.slug
   
.. raw:: html

      </del>
      
-  
.. raw:: html

      <del>
      
   add support for quoting
   
.. raw:: html

      </del> 
      
   EDIT (7/24,IRC): rolled over to next ticket
-  
.. raw:: html

      <del>
      
   separate the tag parsing from length detection, and use in-line
   form validation to raise errors
   
.. raw:: html

      </del>
      
-  
.. raw:: html

      <del>
      
   use tag\_string instead of request as input to tag parsing function
   
.. raw:: html

      </del>
      
-  EDIT: add unit tests

/tag/bunnies and /tag/bunnies+fangs and other searching and
displaying views can come in future tickets

thanks for all the detailed feedback.
EDIT: crossing stuff out as I go along



comment:21 by Will Kahn-Greene, 13 years ago

Instead of writing code that allows for different tagging markup
(this includes the delimiter), I think it's better to expose the
tag parser in the API so that a plugin can override it. That way
people can use different tagging markups, implement the parser in a
plugin, and we don't have to maintain code that doesn't quite go
all the way.

Also, this really should have unit tests before the functionality
gets merged in.



comment:22 by Christopher Allan Webber, 13 years ago

Will, I'm not sure what you mean by tagging markup. Excepting the
delimiter, what markup is there?



comment:23 by Will Kahn-Greene, 13 years ago

Well, so we have delimiters and just like csv/tsv files, you need
to have a way to allow the tag to include the delimiter. So most
people use double-quotes. Then you need a way to escape double
quotes, too. At that point, I vote we can call it a markup.

So, we've got two markups now, but what if I want to use xml? What
if I want to use some kind of semantic hierarchy? What are all the
kinds of ways that people tag images today?

I think there's a lot of possibility here. Just like we plan to
push comment/description markup to a plugin, I think we should push
this off to a plugin, too. Amongst other things, it requires less
code, less functionality to test, and it gives us flexibility down
the road when someone has a use case we didn't think of.



comment:24 by Caleb Davis, 13 years ago

I'm not sure what to do next, in that case. I can:


-  complete the bullets above, or
-  work towards designing and building a plugin

I would be happy to render assistance for the latter, but a plugin
feature is advanced enough to warrant a new ticket.

Please advise!



comment:25 by Christopher Allan Webber, 13 years ago

I would not worry about making it a plugin. That might be a good
idea, but we'll need plugin infrastructure to do that and we don't
have it yet.

Really, I think beyond comma and space delimiters, that's good
enough for almost everyone's needs (at least for now).

I'm excited about this and I'd really like to see this in the next
release so please proceed!



comment:26 by Caleb Davis, 13 years ago

Status: FeedbackIn Progress
-  separate the tag parsing from length detection, and use in-line
   form validation to raise errors
-  use tag\_string instead of request as input to tag parsing
   function

[https://gitorious.org/\ :sub:`cfdv/mediagoblin/cfdvs-mediagoblin/commit/909371cdceace162af880c275b9e6e70488e3029](https://gitorious.org/`\ cfdv/mediagoblin/cfdvs-mediagoblin/commit/909371cdceace162af880c275b9e6e70488e3029)



comment:27 by Christopher Allan Webber, 13 years ago

Yup, that last commit looks good!



comment:28 by Caleb Davis, 13 years ago

more bullets finished
([https://gitorious.org/\ :sub:`cfdv/mediagoblin/cfdvs-mediagoblin/commit/10d7496da2f147a30a70304e8be3a579b15fd093](https://gitorious.org/`\ cfdv/mediagoblin/cfdvs-mediagoblin/commit/10d7496da2f147a30a70304e8be3a579b15fd093)):


-  split on ',' by default
-  add tag\_delimiter option in config\_spec.ini

Also updated the media submission view to use inline form error
messaging.

[https://gitorious.org/\ :sub:`cfdv/mediagoblin/cfdvs-mediagoblin/commit/97e4498c10c803c9d239011d4c48efef52673ec3](https://gitorious.org/`\ cfdv/mediagoblin/cfdvs-mediagoblin/commit/97e4498c10c803c9d239011d4c48efef52673ec3)



comment:29 by Caleb Davis, 13 years ago

I just want to save some suggestions from irc for possible ways to
display the tag list in the template:

to show the tags you'd probably do something like:

::

    <div class="mediaentry_tags">
      {% for tag in entry['tags'] %}
        <div class="tag">{{ tag['name'] }}</div>
      {% endfor %}
    </div>

or later on when we have tag views, something like:

::

    <div class="mediaentry_tags">
      {% for tag in entry['tags'] %}
        <div class="tag">
          <a href="{{ request.urlgen('something_tags',
                                     tag=tag['slug']) }}">{{ tag['name'] }}</a>
        </div>
      {% endfor %}
    </div>

You could fall back to a simple text\_string display by passing the
list\_of\_dicts -> tag\_string function in the context

::

        return render_to_response(
            request,
            'mediagoblin/submit/start.html',
            {'submit_form': submit_form,
             'media_tags_as_string': media_tags_as_string})

The template could do it with

::

    <div class="mediaentry_tags">
      {% for tag in entry['tags'] %}
        <div class="tag">{{ tag['name'] }}</div>{% if not loop.last %},{% endif %}
      {% endfor %}
    </div>

also it is correct to wrap in

::

    <ul><li>...</li><li>...</li></ul>



comment:30 by Caleb Davis, 13 years ago

-  use list-o-dicts instead of list for tags, allowing slugified
   tags:

   'tags': [{'name': 'Gully Gardens', 'slug': 'gully-gardens'},
   {'name': 'Castle Adventure Time?!", 'slug':
   'castle-adventure-time'}]


[https://gitorious.org/\ :sub:`cfdv/mediagoblin/cfdvs-mediagoblin/commit/0712a06dc6d6b05cb78d4b10af12c051e8f765e3](https://gitorious.org/`\ cfdv/mediagoblin/cfdvs-mediagoblin/commit/0712a06dc6d6b05cb78d4b10af12c051e8f765e3)



comment:31 by Caleb Davis, 13 years ago

-  state\_tags\_created -
   `http://dpaste.com/580079/ <http://dpaste.com/580079/>`_ EDIT: the
   dpaste is doing the following
-  first I do a query to find a picture I tagged with 'imeanit'
-  and I show the full record
-  then I perform the find again using explain() to tell me how the
   database executed the query. It said u'cursor': u'BasicCursor',
   which meant that the query didn't use the index and did a full
   table scan
-  I then fixed the query to include the correct fields in the
   correct order, and then the cursor is a Btree, /

Mentioning these details will help me remember what I did. :)

EDIT: the user-specific index is also working


-  state\_uploader\_tags\_created -
   `http://dpaste.com/580153/ <http://dpaste.com/580153/>`_

[https://gitorious.org/\ :sub:`cfdv/mediagoblin/cfdvs-mediagoblin/commit/37be7b6da6aa765991fac55c5c2d0cdb37fd24fc](https://gitorious.org/`\ cfdv/mediagoblin/cfdvs-mediagoblin/commit/37be7b6da6aa765991fac55c5c2d0cdb37fd24fc)

EDIT: I could use some help identifying unit tests now.



comment:32 by Caleb Davis, 13 years ago

Status: In ProgressFeedback
tests pass

[https://gitorious.org/\ :sub:`cfdv/mediagoblin/cfdvs-mediagoblin/commit/8ff4dec742d0b9f375afd9d1862a560e1be200d1](https://gitorious.org/`\ cfdv/mediagoblin/cfdvs-mediagoblin/commit/8ff4dec742d0b9f375afd9d1862a560e1be200d1)

feedback requested



comment:33 by Christopher Allan Webber, 13 years ago

Status: FeedbackClosed
Merged!!! Yussssssssss



comment:34 by Will Kahn-Greene, 12 years ago

The original url for this bug was http://bugs.foocorp.net/issues/360 .
Relations:
#162: blocks, #163: blocks

Note: See TracTickets for help on using tickets.