Aug 19 2009

Watch what you say, or else!

Whether you have Members, Players, or Anonymous guests, when groups of people come together someone is bound to say something that offends, upsets, or annoys other people. Businesses that host resources for such communities spend a tremendous amount of energy moderating, policing, or otherwise trying to protect people from offensive speech, unwanted advertisements, or generally obnoxious people. Some do this to provide a more positive experience for users, while others do it because they are required to by law. However they are motivated, this is not as simple a task as you might think. Putting enforcement aside, the monitoring challenges alone are quite varied:

  • Blacklisted term/word violations
    • Phonetic variations
    • Equivalence
    • Language variations
    • Synonyms
  • Contextually sensitive remarks
  • Spamming
    • Program generated content (e.g. bots)
    • Phone Numbers
    • URLs
    • Email Addresses
    • Rate vs. Similarity of submission
      • Word Stemming
      • Edit Distance (see the sketch after this list)
      • Longest Common Substring
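
As a taste of what the similarity checks involve, here is a minimal sketch of the classic Levenshtein edit distance mentioned above (standard textbook code, not tied to any particular moderation product):

// Returns the minimum number of single-character edits (insertions,
// deletions, substitutions) needed to turn 'a' into 'b'
private static int EditDistance(string a, string b)
{
    var d = new int[a.Length + 1, b.Length + 1];

    for (int i = 0; i <= a.Length; i++) d[i, 0] = i;
    for (int j = 0; j <= b.Length; j++) d[0, j] = j;

    for (int i = 1; i <= a.Length; i++)
    {
        for (int j = 1; j <= b.Length; j++)
        {
            int cost = (a[i - 1] == b[j - 1]) ? 0 : 1;

            d[i, j] = Math.Min(Math.Min(
                d[i - 1, j] + 1,         // deletion
                d[i, j - 1] + 1),        // insertion
                d[i - 1, j - 1] + cost); // substitution
        }
    }

    return d[a.Length, b.Length];
}

EditDistance("puck", "puuck") returns 1, so a burst of submissions whose pairwise distances stay near zero is a strong spam signal.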

Let's take monitoring for terms as an example.  "Sounds" easy enough. If a user says "puck" and we've decided that it's a bad word, then we just check for the word and take action... Right...? 

Unfortunately, things are just not that simple. Before we get to how tricky users can be, there are a lot of challenges in this simple problem.  Any one word can have synonyms, tense variations (e.g. past, present, future), and/or gender variations. From an international perspective, even if we are only concerned about one language (Ex. US English [en-US]), culture settings could lead to the very same word having completely different representations, causing simple equivalence checks to fail. Consider the following example:

Rendered Word    Unicode Values
puck             u0070 u0075 u0063 u006B
рսск             u0440 u057D u0441 u043A


puck != рսск
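
To see this concretely, here is a minimal standalone sketch (not part of the monitoring code) that compares the two strings and dumps their code points:

string latin = "puck";                          // u0070 u0075 u0063 u006B
string lookAlike = "\u0440\u057D\u0441\u043A";  // renders as рսск

// Both checks fail, even though the strings look identical on screen
Console.WriteLine(latin == lookAlike);                                                  // False
Console.WriteLine(string.Equals(latin, lookAlike, StringComparison.OrdinalIgnoreCase)); // False

// Dumping the code points shows why
foreach (char c in lookAlike)
    Console.Write("u{0:X4} ", (int)c);          // u0440 u057D u0441 u043A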

As you can see, the above two words look the same, although they are very different words when subjected to an equivalence test. There are a lot of factors that could cause this kind of variation, so any thorough monitoring solution should try to address as many of them as it can. But then there is the trickiness of users; they can be clever little buggers. Once they know you are checking for certain things, someone will always try to get around your monitoring. For example, they will try CaSe variations, they will add extra or replacement characters, etc...  Let's face it, as a reader you get the same meaning behind variations such as PucK, Puuck, P_u_c_k, and pÜck. So what do you do?

If you have not already guessed, you need a portfolio of solutions. For the previous example, a SoundEx algorithm could be a reasonable approach, and it addresses many of the term detection issues we have identified so far. Strictly speaking, this is a phonetic algorithm, but its pattern provides a nice framework for normalizing words for comparison, even when skipping the phonetic encoding. So what is SoundEx?

Soundex is a phonetic algorithm for indexing names by sound, as pronounced in English. The goal is for homophones to be encoded to the same representation so that they can be matched despite minor differences in spelling.
- Wikipedia (http://en.wikipedia.org/wiki/Soundex)

Essentially, this pattern normalizes a portion of a word to a signature that can be used to compare against a list of banned signatures. Normalization could include the following:

  • Remove extraneous special characters (Ex. p_u_c_k becomes puck)
  • Remove adjacent repeating characters like consonants and vowels (Ex. puuuccck becomes puck)
  • Map letter variations to a common, singular value (Ex. the letters 'u' and 'Ü' would be seen as the same character)
  • Etc...
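
Taken together, a pre-normalization pass over raw input might look something like the following sketch (the helper is hypothetical, and _characterMap stands in for whatever equivalence table you build):

private static string NormalizeTerm(string input)
{
    var output = new StringBuilder();
    char previous = '\0';

    foreach (char raw in input.ToUpperInvariant())
    {
        // Map letter variations to a common, singular value
        // (_characterMap is a hypothetical Dictionary<char, char>,
        // e.g. 'Ü' -> 'U')
        char c;
        if (!_characterMap.TryGetValue(raw, out c))
            c = raw;

        // Remove extraneous special characters (p_u_c_k becomes PUCK)
        if (!Char.IsLetter(c))
            continue;

        // Remove adjacent repeating characters (PUUUCCCK becomes PUCK)
        if (c == previous)
            continue;

        output.Append(c);
        previous = c;
    }

    return output.ToString();
}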

This normalization process helps remove common variances and increases our chances of catching violations. In code, the signature generation might look something like this:

First we need something to calculate the signature for a given term:

public string GenerateSoundExSignature(string s)
{
    var output = new StringBuilder();

    if (s.Length > 0)
    {
        output.Append(Char.ToUpper(s[0]));

        // Stop at a maximum of 4 characters. This is the significant portion of a word
        for (int i = 1; i < s.Length && output.Length < 4; i++)
        {
            string c = EncodeChar(s[i]);

        // We either append or ignore, determined by the preceding char
            if (IsVowelForm(s[i - 1]))
            {
                // Chars separated by a vowel - OK to encode
                output.Append(c);
            }
            else
            {
                // Ignore duplicated phonetic sounds
                if (output.Length == 1)
                {
                    // We only have the first character, which is never
                    // encoded. However, we need to check whether it is
                    // the same phonetically as the next char
                    if (EncodeChar(output[output.Length - 1]) != c)
                        output.Append(c);
                }
                else
                {
                    if (output[output.Length - 1].ToString() != c)
                        output.Append(c);
                }
            }
        }

        for (int i = output.Length; i < 4; i++)
        {
            output.Append("0");
        }
    }

    return output.ToString();
}
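
The IsVowelForm helper is not shown here; a minimal version might simply test for the basic Latin vowels (classic SoundEx also gives 'h' and 'w' special treatment, which you could fold in here):

// A minimal sketch; a production version would consult the same
// character-equivalence map used by EncodeChar, so that accented
// vowels (e.g. 'ü') are recognized as vowels too
protected virtual bool IsVowelForm(char c)
{
    switch (Char.ToUpperInvariant(c))
    {
        case 'A':
        case 'E':
        case 'I':
        case 'O':
        case 'U':
        case 'Y':
            return true;
        default:
            return false;
    }
}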

Notice the EncodeChar method; this is where we take a single character and normalize it to a single-character form (preferably numeric). This is important because we need to eliminate as much variation as we can. To do that, we could create a dictionary of character forms and use them for comparison later. Here is a quick example to illustrate why this is needed. Depending on the keyboard and culture settings, the user may select any of the following characters to represent the letter T:

private static void LoadEncodingPairs_T()
{
    _encodingPairs.Add('\u0054', 'T'); //LATIN CAPITAL LETTER T Basic Latin
    _encodingPairs.Add('\u0074', 'T'); //LATIN SMALL LETTER T Basic Latin
    _encodingPairs.Add('\u0162', 'T'); //LATIN CAPITAL LETTER T WITH CEDILLA Latin Extended-A
    _encodingPairs.Add('\u0163', 'T'); //LATIN SMALL LETTER T WITH CEDILLA Latin Extended-A
    _encodingPairs.Add('\u0164', 'T'); //LATIN CAPITAL LETTER T WITH CARON Latin Extended-A
    _encodingPairs.Add('\u0165', 'T'); //LATIN SMALL LETTER T WITH CARON Latin Extended-A
    _encodingPairs.Add('\u021A', 'T'); //LATIN CAPITAL LETTER T WITH COMMA BELOW Latin Extended-B
    _encodingPairs.Add('\u021B', 'T'); //LATIN SMALL LETTER T WITH COMMA BELOW Latin Extended-B
    _encodingPairs.Add('\u1E6A', 'T'); //LATIN CAPITAL LETTER T WITH DOT ABOVE Latin Extended Additional
    _encodingPairs.Add('\u1E6B', 'T'); //LATIN SMALL LETTER T WITH DOT ABOVE Latin Extended Additional
    _encodingPairs.Add('\u1E6C', 'T'); //LATIN CAPITAL LETTER T WITH DOT BELOW Latin Extended Additional
    _encodingPairs.Add('\u1E6D', 'T'); //LATIN SMALL LETTER T WITH DOT BELOW Latin Extended Additional
    _encodingPairs.Add('\u1E6E', 'T'); //LATIN CAPITAL LETTER T WITH LINE BELOW Latin Extended Additional
    _encodingPairs.Add('\u1E6F', 'T'); //LATIN SMALL LETTER T WITH LINE BELOW Latin Extended Additional
    _encodingPairs.Add('\u1E70', 'T'); //LATIN CAPITAL LETTER T WITH CIRCUMFLEX BELOW Latin Extended Additional
    _encodingPairs.Add('\u1E71', 'T'); //LATIN SMALL LETTER T WITH CIRCUMFLEX BELOW Latin Extended Additional
    _encodingPairs.Add('\u1E97', 'T'); //LATIN SMALL LETTER T WITH DIAERESIS Latin Extended Additional

    _encodingPairs.Add('\u03A4', 'T'); //Greek Capital Letter Tau
    _encodingPairs.Add('\u03C4', 'T'); //Greek Small Letter Tau

    _encodingPairs.Add('\u0422', 'T'); //Cyrillic Capital Letter Te
    _encodingPairs.Add('\u0442', 'T'); //Cyrillic Small Letter Te
}
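
For reference, the _encodingPairs field these methods rely on is never declared in the snippets above; a declaration consistent with the usage would be:

// Maps each character variant to its canonical value; populated by
// methods like LoadEncodingPairs_T above
private static readonly Dictionary<char, char> _encodingPairs =
    new Dictionary<char, char>();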

The value applied to a mapping is not important, as long as you understand it. However you build your mappings, once they are in place you can encode your characters with something similar to this:

protected virtual string EncodeChar(char c)
{
    // Note: the out parameter is a char, matching the
    // Dictionary<char, char> that LoadEncodingPairs_T populates
    char encodingValue;

    if (_encodingPairs.TryGetValue(c, out encodingValue))
    {
        return encodingValue.ToString();
    }

    return string.Empty;
}

Once all the characters are encoded you will receive a signature like 'X200', which is what you should use for equivalence testing.
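
Putting it together, a check against a banned list might look like this sketch (TermMonitor is a name I am inventing for a class holding the methods above, NormalizeTerm is the earlier hypothetical helper, and the exact signatures depend entirely on your mapping table):

// Hypothetical usage; assumes the methods above live on a TermMonitor class
var monitor = new TermMonitor();

// Pre-compute the signatures of the banned terms once
var bannedSignatures = new HashSet<string>
{
    monitor.GenerateSoundExSignature("puck")
};

foreach (string word in new[] { "puck", "PucK", "puuuccck", "p_u_c_k" })
{
    // Normalize first, then compare signatures rather than raw strings
    string signature = monitor.GenerateSoundExSignature(NormalizeTerm(word));

    if (bannedSignatures.Contains(signature))
        Console.WriteLine("'{0}' matched a banned signature ({1})", word, signature);
}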

Ironically, throughout this article I used the term "puck" for demonstration purposes. I am sure we all know what it is derived from, but saying it here would not really be the nicest thing. With that said, it does demonstrate that no matter what solution you provide there is always a way around it, but "trying" to monitor for this kind of stuff is still worth it and may even be required by law. Let's face it, "puck" is still a lot nicer than F#%&. ;)

As you can see, these are oversimplified examples, but that is really the point. Providing solutions to the problems described previously is not as straightforward as one might think. So make sure you take the time to think this through and go as far as you need to protect your community. Do some googling and you will find lots of resources to solve or mitigate many of these issues. Good luck!
