C# vs. C#

In the course of doing some reading on Scala and Clojure, I stumbled upon an interesting article by Mogens Heller Grabe entitled C# vs. Clojure vs. Ruby & Scala.  In the article, Mogens provides a C# solution to a word frequency counting exercise, originally demonstrated in Ruby & Scala, and later in other languages in attempt to showcase how each measures up.

The problem takes an archive of newsgroup articles and creates one file containing a list of all unique words with their occurrence count sorted by word and another sorted by occurrence.

Here is Mogens’ C# solution:

	class Program
	{
		static void Main()
		{
			const string dir = @"c:temp20_newsgroups";
			Stopwatch stopwatch = Stopwatch.StartNew();
			var regex = new Regex(@"w+", RegexOptions.Compiled);

			var list = (from filename in Directory.GetFiles(dir, "*.*", SearchOption.AllDirectories)
						from match in regex.Matches(File.ReadAllText(filename).ToLower()).Cast<Match>()
						let word = match.Value
						group word by word
						into aggregate
						select new
								   {
									   Word = aggregate.Key,
									   Count = aggregate.Count(),
									   Text = string.Format("{0}t{1}", aggregate.Key, aggregate.Count())
								   })
				.ToList();

			File.WriteAllLines(@"words-by-count.txt", list.OrderBy(c => c.Count).Select(c => c.Text).ToArray());
			File.WriteAllLines(@"words-by-word.txt", list.OrderBy(c => c.Word).Select(c => c.Text).ToArray());

			Console.WriteLine("Elapsed: {0:0.0} seconds", stopwatch.Elapsed.TotalSeconds);
		}
	}

 

While my lack of familiarity with the languages used in the other examples made it a little more difficult to appreciate their strengths, I felt the C# example provided by Mogens was fairly concise and intuitive by comparison.  Nevertheless, I couldn’t help wondering if I might be able to improve it in some way, so I set out to see what I could come up with.


Here are the results:

		static void Solution2()
		{
			var regex = new Regex(@"W+", RegexOptions.Compiled);
			var d = new Dictionary<string, int>();

			Directory.GetFiles(dir, "*.*", SearchOption.AllDirectories)
				.ForEach(file => regex.Split(File.ReadAllText(file).ToLower())
									 .ForEach(s => d[s] = 1 + (d.ContainsKey(s) ? d[s] : 0)));

			File.WriteAllLines(@"words-by-count2.txt", d.OrderBy(p => p.Value).Select(p => string.Format("{0}t{1}", p.Key, p.Value)));
			File.WriteAllLines(@"words-by-word2.txt", d.OrderBy(p => p.Key).Select(p => string.Format("{0}t{1}", p.Key, p.Value)));
		}

 

The primary differences in this example are the use of a Dictionary to accumulate the frequency count rather than grouping, and the use of the Regex.Split rather than Regex.Match to avoid the need of casting the resulting collection.  Based on my measurements, this approach is approximately 36% faster on average than the first solution and is a bit more concise.


Overall, I don’t think this example has a varied enough problem domain to really compare the strengths of different languages as some have done, but I found it a fun exercise to see how I might improve the original C# version nonetheless.

About Derek Greer

Derek Greer is a consultant, aspiring software craftsman and agile enthusiast currently specializing in C# development on the .Net platform.
This entry was posted in Uncategorized and tagged . Bookmark the permalink. Follow any comments here with the RSS feed for this post.
  • Stimul8d

    Directory.GetFiles() returns a string array so I’m assuming you’ve written an extension method called ForEach for such an occasion. Obviously using an extension method is a little slower than a standard ‘for each’ …you might be losing some cycles there….Just a thought.

  • Erik

    Compiling a regex and not saving it is bad performance wise. Cache it in a static field or skip compiled flag.

  • http://www.lostechies.com/members/derekgreer/default.aspx derekgreer

    Ah yes, I am using an extension method here. I was using ToList().ForEach() in an earlier version and added the extension method to see what performance benefits I would get. I actually planned to remove this for the article because some early tests indicated the benefit was negligible and I wanted to keep the code as compact as possible, but going back and running the final test again without it shows it does make a bit of difference (30% better vs. 36% better on average).

    As far as using foreach directly, this would definitely improve the performance to some degree and I think I’ll go back and test this to satisfy my curiosity, but I think the ForEach() extension makes for more expressive syntax.

  • http://www.lostechies.com/members/derekgreer/default.aspx derekgreer

    @Erik – I’m not sure how the Regex class handles calls to recompile the same regular expression multiple times (i.e. does it cache, or emit new IL every time?), but in this particular example it doesn’t matter. Because the regular expression is being applied many times, its worth the initial performance hit to compile the regex to gain the performance of applying it to the contents of every file. I ran some minimal testing to verify this, and the compiled version runs 7% faster on average.

  • Erik

    If you use the static methods on the Regex class and use the Compiled flag, it will cache it in the background.
    When using new Regex it will not…
    But you are right about that it is reused in the method.

  • http://stillpearling.blogspot.com David Clarke

    Line 8 in Solution2 is either incorrect or my browser is doing something funky, should be:

    .ForEach(s => d[s] = 1 + (d.ContainsKey(s) ? d[s] : 0)));

  • http://stillpearling.blogspot.com David Clarke

    Hmm, the square braces and contents are being stripped out prior to displaying the code which is why that line is garbled. “d” is a dictionary so clearly cannot have an int assigned to it. Would be nice if I could drop the code straight into LINQPad and have it run but I guess at worst it was guilty of making me think.

  • http://www.lostechies.com/members/derekgreer/default.aspx derekgreer

    @Erik – Good to know for future reference. For this particular test though, the method should do the exact same thing every time so that I get accurate measurements.

    @David – You’re right, the code formatting plug-in must have been stripping out the brackets. It should make more sense now :)

  • http://stillpearling.blogspot.com David Clarke

    Yeah thanks Derek, the problem I have now is that my original comment makes me look like a complete imbecile. Suspect it won’t be the last time ;-)

  • Jarin

    Derek, I’ve been playing with new Parallel extensions in .NET 4.0 lately and I converted your example into parallel version:

    Parallel.ForEach(Directory.GetFiles(dir, “*.*”, SearchOption.AllDirectories),
    file => regex.Split(File.ReadAllText(file).ToLower())
    .ForEach(s => { lock (monitor) d[s] = 1 + (d.ContainsKey(s) ? d[s] : 0); }));

    It’s faster than the original version on my quad core, but not by much. But who cares, it’s been great fun anyway…

  • http://www.techwench.com neo

    yeah thanx for this comparison