I recently added a test-shuffling feature to the Fixie test framework. The basic idea is that you can opt into randomizing the order of execution for tests within a test class. I’ll cover the details of the test shuffling feature in my next post. Today, let’s take a deeper look at shuffling in general. Readily-available advice on shuffling is actually quite wrong.
The Too-Clever One-Liner
Let’s say you want to randomize the order of items in a simple .NET array. A little googling will lead you to solutions like this:
You may have also run into similar shufflers that use Guid.NewGuid() for the same effect. This seems pretty reasonable. We’re “randomingly sorting” the items. Each time the sort algorithm needs to decide the order of two items, it’ll basically flip a coin. go What could wrong possibly?
I’m sure I’ve followed this advice on past projects, and never ran into any trouble. It might actually work, but if so it works only due to a convenient accident of implementation details.
Sorting algorithms operate with a very specific contract in mind. They frequently need to compare two items, and you are expected to provide a consistent answer to the question, “Which comes first, a or b?”. (In the case of OrderBy, the lambda provides a “key selector” for shorthand: a comes before b if a’s ‘key’ comes before b’s ‘key’. The overall contract of consistent ordering is still present.)
Consider a humble Bubble Sort:
Imagine our comparer flips a coin each time it is asked to compare two items in the array. That while loop starts to look pretty scary. Each iteration of the outer loop makes a single pass through the list. If any items get swapped during that pass, we are
stillGoing and will execute another pass through the list. When would the bubble sort even terminate? It’ll terminate when we happen to generate
list.Length consistent answers in a row. A few test runs of this loop against a 10-item array reached thousands of iterations before terminating!
Sure, .NET’s array sorter doesn’t use Bubble Sort, but the clever one-liner may fall into the same basic trap. In practice it seems to work, but just try proving that it will always work. What algorithm does .NET’s
Array.Sort(...) use? Insertion Sort? Heap Sort? Quick Sort? Trick question: it could do any of those depending on the input array. You could inspect the implementation all day long, constructing an ironclad proof that a randomized comparer will in fact work in all possible cases, and still run into trouble when a future implementation changes the underlying algorithm(s).
Randomized sorting is a contradiction in terms.
Thankfully, real shuffling is a solved problem. We don’t have to violate an algorithm’s contract if we use the right algorithm.
Most of the world’s loops have already been written, and they have names. This is the Fisher-Yates Shuffle. It randomizes the order of array items in a single pass.
We’ve adopted the term “contract” for a reason: two parties make an agreement about how they are meant to interact. Violate the contract, and who knows what will happen.