Updates from March, 2004

  • Spam fine tuning 

    fdietz 4:20 pm on March 19, 2004

    When I woke up this morning, I realized that I’d forgotten something really important for the token database.

    I’m using MD5 sums to detect messages that have already been learned. These should be added to the token database only once, never twice or more. The MD5 sum is also needed for corrections: if a non-spam message was wrongly marked as spam and the user later marks it as non-spam, Columba should use the MD5 sum to correct the stored values.

    You can imagine that this list of MD5 sums gets huge over time. So I’ve added a timestamp to every MD5 sum. This makes cleaning up the token database very easy, because I can just remove old messages.
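
    The bookkeeping described above could look roughly like the following. This is only a sketch in modern Java, not Columba's actual code: the class LearnedMessages and all of its method names are made up for illustration, and it covers only the duplicate check and the timestamp-based cleanup, not the spam.db file format.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical registry of learned messages, keyed by MD5 sum. */
class LearnedMessages {
    // MD5 sum (hex) -> time the message was learned, in milliseconds
    private final Map<String, Long> learnedAt = new HashMap<>();

    /** Returns the MD5 sum of the raw message source as a hex string. */
    static String md5(String rawMessage) {
        try {
            MessageDigest digest = MessageDigest.getInstance("MD5");
            byte[] hash = digest.digest(rawMessage.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : hash) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // Every Java platform is required to provide MD5
            throw new IllegalStateException(e);
        }
    }

    /** Records the message; returns false if it had already been learned. */
    boolean markLearned(String rawMessage, long now) {
        return learnedAt.putIfAbsent(md5(rawMessage), now) == null;
    }

    /** Returns true if a message with the same MD5 sum was learned before. */
    boolean isLearned(String rawMessage) {
        return learnedAt.containsKey(md5(rawMessage));
    }

    /** Drops every entry learned before the cutoff; returns how many were removed. */
    int purgeOlderThan(long cutoff) {
        int before = learnedAt.size();
        learnedAt.values().removeIf(learned -> learned < cutoff);
        return before - learnedAt.size();
    }
}
```

    A caller would only add a message's tokens to the database when markLearned returns true, and would call purgeOlderThan from time to time with a cutoff of its own choosing.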

    Sadly, this means that the spam.db file format has changed a bit.

    Another new feature is handcrafted rules. This is pretty new; I got the idea from the paper “A Bayesian Approach to Filtering Junk E-Mail” by Sahami et al.
    With these rules I’ve basically eliminated the need for something like a training mode in Columba. The idea is pretty simple: the Bayesian classifier lets us add handcrafted rules, which are treated exactly like other tokens. So a rule just contributes one more probability in addition to the word probabilities.
    When starting from scratch, you don’t have any words in your token database, so the spam engine can’t detect spam messages until you have trained it for some time. But the handcrafted rules are already in place, giving you a good start at detecting spam until you have collected enough tokens from message contents.
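
    To make the "just another probability" idea concrete, here is a small sketch. The post doesn't say which formula Columba uses to combine probabilities, so the code assumes a common naive Bayes combination, P = prod(p) / (prod(p) + prod(1 - p)); the class name SpamScore is made up.

```java
import java.util.List;

/** Hypothetical helper; the combination formula is an assumption, not Columba's. */
class SpamScore {
    /**
     * Combines per-feature spam probabilities into one score using
     * P = prod(p) / (prod(p) + prod(1 - p)).
     * Word tokens and handcrafted rules are treated alike: each one
     * contributes a single probability to the list.
     * Assumes every value lies strictly between 0 and 1.
     */
    static double combine(List<Double> probabilities) {
        double spam = 1.0;
        double ham = 1.0;
        for (double p : probabilities) {
            spam *= p;
            ham *= 1.0 - p;
        }
        return spam / (spam + ham);
    }
}
```

    With an untrained database every word sits at a neutral 0.5, so the combined score stays at 0.5. If a single rule contributing 0.9 is added to the list, the score moves to about 0.9, which is the head start the rules are meant to give.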

    One such rule would be, for example, checking whether the Subject consists of capital letters only, or whether the Subject contains a lot of whitespace. You can think of many more rules. SpamAssassin, for example, has a very big list of rules. Adding new rules is very easy: just subclass AbstractRule in org.columba.mail.spam.rules.
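
    The two example rules might be sketched like this. The real AbstractRule API isn't shown in this post, so the sketch defines its own stand-in base class, SketchRule; the rule names and the whitespace threshold are assumptions as well.

```java
/** Hypothetical stand-in; the real AbstractRule in org.columba.mail.spam.rules may look different. */
abstract class SketchRule {
    private final String name;

    SketchRule(String name) {
        this.name = name;
    }

    /** Name under which a firing rule could be stored, like any other token. */
    String getName() {
        return name;
    }

    /** Returns true if the rule fires for the given Subject header. */
    abstract boolean matches(String subject);
}

/** Fires when the Subject has at least one letter and all letters are upper case. */
class SubjectAllCapsRule extends SketchRule {
    SubjectAllCapsRule() {
        super("SUBJECT_ALL_CAPS");
    }

    @Override
    boolean matches(String subject) {
        boolean sawLetter = false;
        for (char c : subject.toCharArray()) {
            if (Character.isLetter(c)) {
                if (!Character.isUpperCase(c)) {
                    return false;
                }
                sawLetter = true;
            }
        }
        return sawLetter;
    }
}

/** Fires when the share of whitespace characters in the Subject exceeds a threshold. */
class SubjectManySpacesRule extends SketchRule {
    private final double threshold; // assumed tuning value, e.g. 0.3

    SubjectManySpacesRule(double threshold) {
        super("SUBJECT_MANY_SPACES");
        this.threshold = threshold;
    }

    @Override
    boolean matches(String subject) {
        if (subject.isEmpty()) {
            return false;
        }
        int spaces = 0;
        for (char c : subject.toCharArray()) {
            if (Character.isWhitespace(c)) {
                spaces++;
            }
        }
        return (double) spaces / subject.length() > threshold;
    }
}
```

    In this sketch, a subject like "BUY NOW!!!" triggers the first rule and "F R E E" triggers the second one with a threshold of 0.3, while "Meeting tomorrow" triggers neither.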

     
  • Silently added more filters 

    fdietz 4:08 pm on March 19, 2004

    I’ve added two more filters. The first one is a “match all” filter, which simply matches every message. This is very handy if you want to apply filters to all messages to clean up your inbox or do other repetitive tasks.
    This will become even more useful when the “automatically apply filter on new messages” option is added on a per-folder basis, so that filters run on newly added messages. This is something many people want: just drop a message into a folder, and the filter executes and processes the message data.
    The other filter checks whether the sender already exists in your address book. This is currently used in the spam filter anyway, to achieve automatic whitelisting.
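
    As a rough illustration, the two filters boil down to something like the following. Columba's real filter plugin interface isn't shown here, so SketchFilter and both class names are hypothetical, and comparing addresses case-insensitively is an assumption.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/** Hypothetical filter contract; Columba's real plugin interface may differ. */
interface SketchFilter {
    boolean matches(String senderAddress);
}

/** Matches every message, whoever sent it. */
class MatchAllFilter implements SketchFilter {
    @Override
    public boolean matches(String senderAddress) {
        return true;
    }
}

/** Matches messages whose sender is listed in the address book. */
class SenderInAddressBookFilter implements SketchFilter {
    private final Set<String> knownAddresses = new HashSet<>();

    SenderInAddressBookFilter(Iterable<String> addressBook) {
        for (String address : addressBook) {
            knownAddresses.add(normalize(address));
        }
    }

    @Override
    public boolean matches(String senderAddress) {
        return knownAddresses.contains(normalize(senderAddress));
    }

    // Assumption: addresses are compared ignoring case and surrounding spaces
    private static String normalize(String address) {
        return address.trim().toLowerCase(Locale.ROOT);
    }
}
```

    For whitelisting, a spam filter built on this would skip the classifier whenever SenderInAddressBookFilter matches.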

     
  • Spam filter integrated in Columba 

    fdietz 7:44 pm on March 16, 2004

    If you check out the latest CVS sources of Columba, you will notice the new integrated spam solution. This work is based on my bachelor dissertation. An introduction can be read at http://frederikdietz.de. The thesis paper can be downloaded as a PDF file.

    The spam filter can be enabled in the account configuration. Note that it takes some time to train this filter. Training is done by marking messages as spam or not spam.
    In the first few days, it’s recommended to just mark the messages as spam. Do not have them moved to the trash automatically, as it’s highly likely that you will get some false positives. Also, use the address book checking option, which prevents messages from people who are already in your address book from being marked as spam.

    It’s recommended to delete:

    [your-config-folder]/mail/main_toolbar.xml

    This will make Columba create a new toolbar for you, with a “mark message as spam” button added.

    In my personal experience, learning around 1000 messages should be enough to make it perform well.

    There’s going to be some more fine-tuning needed, especially to make it perform better in the beginning – you have been warned!

    So, please help me test this beast ;-)

     
  • fdietz 5:37 pm on March 8, 2004

    css Zen Garden: The Beauty in CSS Design

    If you are interested in web design, you should definitely see this page. It’s a very simple page that uses just CSS to totally change its look. Great. I wish more people would start using CSS more creatively.

     
  • IM Project Codename “Altura” 

    fdietz 5:33 pm on March 8, 2004

    I’ve silently started hacking together a prototype for a Jabber-based IM client. This is a pretty interesting technology that I would love to see integrated in Columba in the future, meaning right after the 1.0 release this summer.

    Think about receiving a message from a friend. An icon would immediately show his current online/offline/busy state, and you could chat with him right away.

    I’m still thinking about releasing this client as a stand-alone open-source project in the future, but this depends on user requests.

     
  • Folder and Filter TestSuite complete 

    fdietz 5:29 pm on March 8, 2004

    My exams are keeping me pretty busy until next week. Right now, I’m totally into DB2 (the second lecture on databases, not IBM’s database), focusing on object-oriented and XML databases. Not very exciting, though…

    So, finally some news on Columba development. After starting a bigger refactoring of the folder packages in the mail module, I’ve finished a test suite for all types of folders. You can add any folder type you want, and the test cases run against it and show you problems. This should become pretty handy when adding new folders, including new local mailbox formats or even database-backed folder implementations.

    The filter test suite covers all plugin-based filters that ship with Columba by default. This is more important than you might think, because these are also the default fallbacks, used in case a search engine doesn’t implement all operations. So implementors of search engines can just start hacking on their engine, adding more optimized operations and falling back to the default engine for everything else.
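
    The fallback idea can be sketched as a small wrapper. None of these types are Columba's: SearchOp, SketchSearchEngine, DefaultEngine and FallbackEngine are invented for illustration, and the default engine here only scans a list of subject lines.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical set of search operations. */
enum SearchOp { CONTAINS, BEGINS_WITH }

/** Hypothetical engine contract; Columba's real interfaces may differ. */
interface SketchSearchEngine {
    boolean supports(SearchOp op);

    List<String> search(SearchOp op, String pattern, List<String> subjects);
}

/** Default engine: supports every operation with a plain linear scan. */
class DefaultEngine implements SketchSearchEngine {
    @Override
    public boolean supports(SearchOp op) {
        return true;
    }

    @Override
    public List<String> search(SearchOp op, String pattern, List<String> subjects) {
        List<String> hits = new ArrayList<>();
        for (String subject : subjects) {
            boolean hit = (op == SearchOp.CONTAINS)
                    ? subject.contains(pattern)
                    : subject.startsWith(pattern);
            if (hit) {
                hits.add(subject);
            }
        }
        return hits;
    }
}

/** Uses the optimized engine where it can and the default engine everywhere else. */
class FallbackEngine implements SketchSearchEngine {
    private final SketchSearchEngine optimized;
    private final SketchSearchEngine fallback;

    FallbackEngine(SketchSearchEngine optimized, SketchSearchEngine fallback) {
        this.optimized = optimized;
        this.fallback = fallback;
    }

    @Override
    public boolean supports(SearchOp op) {
        return optimized.supports(op) || fallback.supports(op);
    }

    @Override
    public List<String> search(SearchOp op, String pattern, List<String> subjects) {
        SketchSearchEngine engine = optimized.supports(op) ? optimized : fallback;
        return engine.search(op, pattern, subjects);
    }
}
```

    An engine that only optimizes CONTAINS would be wrapped as new FallbackEngine(optimizedEngine, new DefaultEngine()), so that BEGINS_WITH queries still work through the default path.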

     