Jon Aquino's Mental Garden

Engineering beautiful software jon aquino labs | personal blog

Monday, May 25, 2009

[Programming] Command-line tool for finding frequent substrings

Hiroki Arimura has written a nifty tool called wasa that you can use to find the most frequent substrings in a text file. (It’s implemented using a suffix array.) I modified it to have an -xc option for displaying the counts.

I was pretty excited about this at first because I thought it would solve the problem of summarizing log files by finding frequent substrings. But alas, it outputs too much noise to be useful.

For example, given the following logfile,
May 14 13:26:13 kenya sshd[41244]: Invalid user louis from 85.21.206.18
May 14 13:26:16 kenya sshd[41246]: Invalid user louis from 85.21.206.18
May 15 04:00:58 kenya sshd[50672]: Did not receive identification string from 61.152.157.166
May 15 04:04:04 kenya sshd[50699]: Invalid user test from 61.152.157.166
May 15 04:04:06 kenya sshd[50701]: Invalid user test from 61.152.157.166
May 15 04:04:08 kenya sshd[50705]: Invalid user test from 61.152.157.166

The most frequent substrings are as follows:
prompt> wasa -xc -m 1 c:\junk\foo.log | sort -nr
6 sshd
6 may
6 kenya sshd
6 from
5 user
5 invalid user
4 may 15 04
4 from 61
4 61
4 166
4 157
4 152
4 15 04
4 04
3 user test from 61
3 test from 61
3 invalid user test from 61
2 user louis from 85
2 may 14 13
2 louis from 85
2 invalid user louis from 85
2 from 85
2 85
2 26
2 21
2 206
2 18
2 14 13
2 13
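
For a sense of how counts like these come about, here is a minimal Python sketch of the suffix-array idea. It is not wasa's code: the function name frequent_phrases, the min_count and max_words parameters, and the tokenization (lowercased runs of letters and digits) are all assumptions made for illustration, and the suffix array is built naively by sorting truncated word suffixes. It counts each phrase at most once per line, which seems consistent with the counts above (“04” is reported 4 times although it occurs 8 times in the sample).

```python
import re
from itertools import groupby

# The sample log from this post.
LOG = [
    "May 14 13:26:13 kenya sshd[41244]: Invalid user louis from 85.21.206.18",
    "May 14 13:26:16 kenya sshd[41246]: Invalid user louis from 85.21.206.18",
    "May 15 04:00:58 kenya sshd[50672]: Did not receive identification string from 61.152.157.166",
    "May 15 04:04:04 kenya sshd[50699]: Invalid user test from 61.152.157.166",
    "May 15 04:04:06 kenya sshd[50701]: Invalid user test from 61.152.157.166",
    "May 15 04:04:08 kenya sshd[50705]: Invalid user test from 61.152.157.166",
]


def frequent_phrases(lines, min_count=2, max_words=6):
    """Return {phrase: number of lines containing it} for repeated phrases.

    Assumption: a "phrase" is a run of 1..max_words lowercased alphanumeric
    tokens within one line. This is an illustrative choice, not wasa's rule.
    """
    # Naive suffix array: every word suffix of every line, truncated to
    # max_words tokens and tagged with the line it came from.
    entries = []
    for line_no, line in enumerate(lines):
        tokens = re.findall(r"[a-z0-9]+", line.lower())
        for i in range(len(tokens)):
            entries.append((tuple(tokens[i:i + max_words]), line_no))
    entries.sort()

    # After sorting, all suffixes that share a k-word prefix are adjacent,
    # so each phrase can be counted with a single pass per length k.
    counts = {}
    for k in range(1, max_words + 1):
        prefixes = ((s[:k], line_no) for s, line_no in entries if len(s) >= k)
        for phrase, group in groupby(prefixes, key=lambda e: e[0]):
            n = len({line_no for _, line_no in group})
            if n >= min_count:
                counts[" ".join(phrase)] = n
    return counts


counts = frequent_phrases(LOG)
for phrase, n in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
    print(n, phrase)
```

The result is a superset of what wasa prints (it also reports “kenya” and “invalid” on their own, for instance), so it is at least as noisy.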

It finds interesting substrings (“invalid user” and “invalid user louis from 85”), but it also finds many uninteresting ones, such as fragments of timestamps and IP addresses (“26”, “206”).
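
Part of the noise is plain redundancy: “sshd” and “kenya sshd” both occur 6 times, so the shorter one adds nothing. One possible post-processing step is to drop any phrase that sits inside a longer phrase with the same count. The sketch below does that with a hypothetical helper, drop_subsumed, written only for this illustration; its input is a hand-copied subset of the output above.

```python
# A subset of the wasa output shown in this post, as {phrase: count}.
SAMPLE = {
    "sshd": 6,
    "kenya sshd": 6,
    "user": 5,
    "invalid user": 5,
    "from 61": 4,
    "61": 4,
    "166": 4,
    "user test from 61": 3,
    "test from 61": 3,
    "invalid user test from 61": 3,
}


def drop_subsumed(counts):
    """Keep only phrases not contained in a longer phrase with the same count."""
    kept = {}
    for phrase, n in counts.items():
        # Pad with spaces so matching happens on whole words only.
        padded = f" {phrase} "
        subsumed = any(
            other != phrase and m == n and padded in f" {other} "
            for other, m in counts.items()
        )
        if not subsumed:
            kept[phrase] = n
    return kept


for phrase, n in drop_subsumed(SAMPLE).items():
    print(n, phrase)
```

Fragments such as “166” survive this filter because no longer phrase shares their count, so it trims the list rather than fixing it.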

Ah well. Interesting tool nonetheless. ∎

2 Comments:

  • Unfortunately the download link to
    wasa0.9_with_xc_option.zip
    is no longer valid, and I had no luck with the Wayback Machine. Any chance of downloading this file?

    By Anonymous pappalapub, at 1/25/2013 3:09 p.m.  

  • Thanks, Jonathan, for the updated download link.

    By Anonymous pappalapub, at 1/26/2013 6:34 a.m.  
