Empirically testing shuffle implementations#

We’ve defined two functions for you in task.py: fisheryates and broken. They’re both purportedly shuffles, but we’d like to prove that they aren’t.

Here’s our plan: run each shuffle many, many times. We’ll compute the distribution of the results and do some analysis on it. Concretely:

Pick a simple list—say, list("ABC"), i.e., ['A', 'B', 'C']. There are 3! = 6 possible permutations.
Run fisheryates and broken 6000 times on list("ABC"). Each permutation should show up about 1000 times, but there will be some variation.
Compute the histogram: how many times does each possible output appear?
Analyze the histogram: what are the mean, median, and standard deviation of the various frequencies.

Our tool produces output like:

fisheryates:
ABC: 1014
ACB: 944
BAC: 1028
BCA: 949
CAB: 999
CBA: 1066

MEAN: 1000.0 MEDIAN: 1006.5 STDDEV: 22.861904265976328

broken:
ABC: 1299
ACB: 1316
BAC: 677
BCA: 676
CAB: 1338
CBA: 694

MEAN: 1000.0 MEDIAN: 996.5 STDDEV: 124.92397688194208

From this output, it’s really clear that the broken function is indeed broken and fisheryates is good. It’s worth noting that the mean and median aren’t informative, but the distributions themselves and the standard deviations are very informative.

So: test these two functions enough to determine that fisheryates is a shuffle and broken is not.

We wrote a testing function test_shuffle that takes in a function and a list, and does the following:

Prints out f.__name__ (which is the name of the function—cool!)
Runs fact(l) * 1000 tests, shuffling a copy of the input list.
Stores the shuffled list in a dictionary, where the key is the list converted to a string (using ''.join()) and the value is the number of times that list has been seen.
Prints out the keys and their counts (sorted by key).
Prints out the statistics.

You don’t need to do all of that, but it’s good practice to write good, thorough tests like this! The more you do it, the easier it gets. To replicate our output, you’ll have to write a number of helper functions: mean, median, and stddev. We also defined fact so we could vary how many tests we compute with the length of l.