C# vs. C#

In the course of doing some reading on Scala and Clojure, I stumbled upon an interesting article by Mogens Heller Grabe entitled "C# vs. Clojure vs. Ruby & Scala".  In the article, Mogens provides a C# solution to a word frequency counting exercise, originally demonstrated in Ruby and Scala and later in other languages, in an attempt to showcase how each measures up.

The problem takes an archive of newsgroup articles and creates two files, each listing every unique word with its occurrence count: one sorted by word and the other sorted by count.

Here is Mogens’ C# solution:

class Program
{
	static void Main()
	{
		const string dir = @"c:\temp\20_newsgroups";
		Stopwatch stopwatch = Stopwatch.StartNew();
		var regex = new Regex(@"\w+", RegexOptions.Compiled);

		var list = (from filename in Directory.GetFiles(dir, "*.*", SearchOption.AllDirectories)
					from match in regex.Matches(File.ReadAllText(filename).ToLower()).Cast<Match>()
					let word = match.Value
					group word by word
					into aggregate
					select new
					{
						Word = aggregate.Key,
						Count = aggregate.Count(),
						Text = string.Format("{0}\t{1}", aggregate.Key, aggregate.Count())
					});

		File.WriteAllLines(@"words-by-count.txt", list.OrderBy(c => c.Count).Select(c => c.Text).ToArray());
		File.WriteAllLines(@"words-by-word.txt", list.OrderBy(c => c.Word).Select(c => c.Text).ToArray());

		Console.WriteLine("Elapsed: {0:0.0} seconds", stopwatch.Elapsed.TotalSeconds);
	}
}

While my lack of familiarity with the languages used in the other examples made it a little more difficult to appreciate their strengths, I felt the C# example provided by Mogens was fairly concise and intuitive by comparison.  Nevertheless, I couldn’t help wondering if I might be able to improve it in some way, so I set out to see what I could come up with.

Here are the results:

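static void Solution2()
{
	// dir is assumed to be the same newsgroup directory constant used in the first solution,
	// declared where this method can see it (for example, as a class-level const).
	var regex = new Regex(@"\W+", RegexOptions.Compiled);
	var d = new Dictionary<string, int>();

	// ForEach is assumed to be an IEnumerable<T> extension method; arrays have no instance ForEach.
	Directory.GetFiles(dir, "*.*", SearchOption.AllDirectories)
		.ForEach(file => regex.Split(File.ReadAllText(file).ToLower())
							 .ForEach(s => d[s] = 1 + (d.ContainsKey(s) ? d[s] : 0)));

	File.WriteAllLines(@"words-by-count2.txt", d.OrderBy(p => p.Value).Select(p => string.Format("{0}\t{1}", p.Key, p.Value)));
	File.WriteAllLines(@"words-by-word2.txt", d.OrderBy(p => p.Key).Select(p => string.Format("{0}\t{1}", p.Key, p.Value)));
}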

The primary differences in this example are the use of a Dictionary to accumulate the frequency count rather than grouping, and the use of Regex.Split rather than Regex.Matches, which avoids the need to cast the resulting collection.  Based on my measurements, this approach is approximately 36% faster on average than the first solution and is a bit more concise.
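To make the two counting strategies easier to compare side by side, here is a minimal sketch that runs both against an in-memory string instead of the newsgroup archive. The class name, method names, and sample sentence are made up for illustration, and the check that skips empty strings is an addition of mine: Regex.Split returns an empty entry when the text begins or ends with a separator, which would otherwise be counted as a word. The sketch only shows that both approaches arrive at the same counts; it says nothing about their relative speed.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

static class WordCountSketch
{
    // Grouping approach: one Match object per word, then GroupBy on the matched text.
    public static Dictionary<string, int> CountByGrouping(string text)
    {
        return Regex.Matches(text.ToLower(), @"\w+")
                    .Cast<Match>()
                    .GroupBy(m => m.Value)
                    .ToDictionary(g => g.Key, g => g.Count());
    }

    // Dictionary approach: split on runs of non-word characters and accumulate counts.
    public static Dictionary<string, int> CountByDictionary(string text)
    {
        var counts = new Dictionary<string, int>();
        foreach (var word in Regex.Split(text.ToLower(), @"\W+"))
        {
            // Assumption: empty entries produced by leading or trailing separators are not words.
            if (word.Length == 0) continue;

            int current;
            counts.TryGetValue(word, out current); // current is 0 when the key is missing
            counts[word] = current + 1;
        }
        return counts;
    }

    static void Main()
    {
        const string sample = "The quick brown fox. The lazy dog!";

        // Write word/count pairs sorted by word, tab-separated as in the solutions above.
        foreach (var pair in CountByDictionary(sample).OrderBy(p => p.Key))
            Console.WriteLine("{0}\t{1}", pair.Key, pair.Value);
    }
}
```

Using TryGetValue in place of the ContainsKey check followed by the indexer saves one dictionary lookup per word, though I have not measured whether that makes a noticeable difference here.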

Overall, I don’t think this example has a varied enough problem domain to really compare the strengths of different languages as some have done, but I found it a fun exercise to see how I might improve the original C# version nonetheless.
