Thursday, July 15, 2010

Algorithms, performance and getting burnt

After a long time, I am writing something on my blog. So here it is ..

This post is about how I started solving a small but interesting problem with different approaches, and ended up breaking my head over why an algorithm with supposedly O(n) complexity was 4 times slower than the O(n^2) one.

So here's the issue. I have the following data:

Name,Value
Sridhar,1
Ashish,2
Prasanth,3
Sridhar,4
Ashish,5
Sridhar,8

and so on .. I hope you get the idea.

Now what I would like to do is to print the following output:

Sridhar : 1;4;8;...
Ashish : 2;5;...
Prasanth : 3;...

Note that it does not matter what the values are; I am giving this data just as an example. Shown below is the setup used by my implementations (I am demoing it as a test).



private Stopwatch sw;
private List<Ud> tl;

[SetUp]
public void SetUp()
{
    GC.GetTotalMemory(true); // I don't know why I did this!
    tl = new List<Ud>(10000);
    var names = new[] { "Krishna", "Ashish", "Sridhar", "Prasanth" };
    foreach (var name in names)
        for (var i = 0; i < 2500; i++)
            tl.Add(new Ud { Name = name, DrawerId = i.ToString() });
    // OrderBy returns a new sequence; the result has to be captured or the line does nothing
    tl = tl.OrderBy(s => s.DrawerId).ToList();
    sw = Stopwatch.StartNew();
}

[TearDown]
public void TearDown()
{
    sw.Stop();
    Console.WriteLine(sw.ElapsedMilliseconds);
    sw = null;
}

public class Ud
{
    public string Name { get; set; }
    public string DrawerId { get; set; }
}

The above code is self-explanatory: I basically create a lot of Ud objects that make up the data I presented earlier. Shown below is the most straightforward way to solve the problem. It has two nested loops, which makes the complexity O(n^2).



[Test]
public void BasicImplementation()
{
    var nvcs = new List<NameValueCollection>(); // from System.Collections.Specialized
    var list = new List<string>(); // names already processed
    foreach (var item in tl)
    {
        if (list.Contains(item.Name)) continue;

        string val = string.Empty;

        // inner pass over the whole list for every new name
        foreach (var item2 in tl)
        {
            if (item2.Name == item.Name)
                val += item2.DrawerId + ";";
        }

        nvcs.Add(new NameValueCollection { { "Name", item.Name }, { "DrawerId", val } });
        list.Add(item.Name);
    }
    //foreach (var nvc in nvcs)
    //    Console.WriteLine(nvc["Name"] + " : " + nvc["DrawerId"]);
    Assert.AreEqual(4, nvcs.Count);
}

Now I went ahead and added another implementation that gives the same result, but instead uses a dictionary to track the string being built up for each name in the list. Instinctively, it appears that the dictionary method should be way faster than the one mentioned above. Let's look at that code.



[Test]
public void ADictionary()
{
    var vals = new Dictionary<string, string>();
    foreach (var item in tl)
    {
        if (!vals.ContainsKey(item.Name))
            vals[item.Name] = item.DrawerId + ";";
        else
            vals[item.Name] = vals[item.Name] + item.DrawerId + ";";
    }
    Assert.AreEqual(4, vals.Values.Count);
}

When I ran these two tests, I did not notice any performance gain from the above O(n) implementation; in fact it was three times slower. So why was it slower? Look at the setup: it calls GC.GetTotalMemory(true), which forces a full, blocking garbage collection. When the dictionary test ran second, its SetUp had a lot of strings left over from the first test to clean up, and that collection time was accounted against the dictionary test. So why did I put it there in the first place? The answer is "I was not thinking straight". Never ever use GC classes in your code. It is a bad-bad-bad practice.
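To see why that matters, here is a little sketch (not one of the original tests; the test name is made up) that litters the heap the way the first test does and then times the forced collection, which is effectively what SetUp was doing right before the second test ran:

[Test]
public void ForcedCollectionIsNotFree()
{
    // Litter the heap with lots of short-lived strings.
    var junk = new List<string>();
    for (var i = 0; i < 1000000; i++)
        junk.Add(i + ";");
    junk = null;

    // Now time the forced collection that has to clean all that up.
    var watch = Stopwatch.StartNew();
    GC.GetTotalMemory(true); // waits while a full collection runs
    watch.Stop();
    Console.WriteLine("Forced collection took " + watch.ElapsedMilliseconds + " ms");
}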


So I removed the GC call and reran the tests. Yet I still did not see any performance gain. WHY?? I spent a lot of time trying to diagnose what was happening and eventually gave up on manual inspection. I downloaded the trial version of dotTrace Performance 4.0 (which is a freaking awesome tool) and profiled both tests. The culprit was the strings. If you look at the code, we generate a lot of intermediate strings, and their "Concat" operations were so time consuming that they dominated the gain from the O(n) algorithm.


So the lesson here is: "Be watchful of the strings that are generated when your code executes, otherwise you will get burnt". However small a single string concatenation may seem, in cases like the above it piles up and screws up your clever algorithm. All I did was change the tests to use StringBuilder instead of strings.
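For the curious, here is roughly what that change looks like (a sketch, not my exact test code; the test name is made up, and it assumes System.Text is imported):

[Test]
public void ADictionaryWithStringBuilder()
{
    var vals = new Dictionary<string, StringBuilder>();
    foreach (var item in tl)
    {
        StringBuilder sb;
        if (!vals.TryGetValue(item.Name, out sb))
        {
            sb = new StringBuilder();
            vals[item.Name] = sb;
        }
        // Append mutates the builder in place; no intermediate strings are created.
        sb.Append(item.DrawerId).Append(';');
    }
    Assert.AreEqual(4, vals.Values.Count);
}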



  • Do not use GC calls in your code, especially those that force a collection.
  • Use a profiler to accurately capture performance information about specific methods or your whole program. Stopwatch, timers, etc. are not good enough and are a waste of time.
  • Be aware of the impact of string operations. Use StringBuilder wherever possible, and String.Format() in simpler cases.

I will continue in the next post with some code that shows you how to approach the problem I initially started with using LINQ and how simple things would appear.
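As a teaser, the shape of it is something like this sketch (a rough cut, not necessarily the exact code the next post will use):

var lines = tl.GroupBy(u => u.Name)
              .Select(g => g.Key + " : " + string.Join(";", g.Select(u => u.DrawerId).ToArray()));
foreach (var line in lines)
    Console.WriteLine(line);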

2 comments:

Anonymous said...

String concatenation is always expensive hence why the StringBuilder class exists. This is basic programming 101. How did this not set a light bulb off sooner?

Krishna Bhargava Vangapandu said...

i know ... its prog 101 but it is not that often that we use StringBuilder. At least for me, I always start with a string and deal with it later. Still, I thought it was interesting.