Watercooler: Puzzlehunt Testing Practices

Let’s see how this goes…

One thing I really want to do with this blog is get conversations started about different puzzlehunt construction practices. I’ve had lots of experience constructing in certain areas (Mystery Hunt, BAPHL, NPL Con) and none at all in others (Microsoft Puzzle Hunt, DASH, BANG, The Game, etc.). Different formats and different audiences present different challenges, and I’d love to hear about everyone’s experiences from all angles.

As a starter “watercooler” topic, on my last post eudaemon asked about puzzle testing, specifically in reference to online hunts. In the last year, we’ve seen some online events that were, in my opinion, extremely clean (Galactic, REDDOT) and others that were less so to varying degrees (SUMS, Cambridge). Of course, testing is just as important, if not more so, for live puzzlehunts… a posted PDF may be easier to edit on the fly than a puzzle that’s already been printed and handed out.

I’d love for people to chime in with stories and opinions about puzzle testing. To constructors, what’s worked well for events you’ve helped write? What hasn’t? What are good practices in general, for the people who test and/or the people who organize testing? And for solvers who perhaps haven’t written before, how can you tell when a puzzle probably has or hasn’t been tested, and what do you think would help?

Thursday is usually my “work from home” day, so I’ll try to get in the habit of posting these on Thursdays; though we’ll see based on participation whether there’s actually a demand for weekly prompts. Comment away!


10 thoughts on “Watercooler: Puzzlehunt Testing Practices”

  1. I don’t do many large-scale puzzle hunts, but I do write the occasional extravaganza designed to be solved in a couple of hours, so I’ll comment on my testing process there. I like to have three rounds of testing. In round 1, I get one, maybe two, people to do a test with the main purpose of making sure I don’t have any major puzzle-killing errors. I’ll typically make a direct request to the person I think would do a good job with that (Hi, Tahnan!). Then when I think the puzzles are clean, I get a second larger group for doing more general testing for solvability, readability, and general fun-ness. I then have a third group that gets to see the “camera-ready” hunt to make sure I didn’t introduce any problems after revisions from the second group. For the last two groups, I’ll usually solicit volunteers from Facebook or the NPL mailing list to get a wide range of solvers.

    One thing that I think I need to focus on, especially as I start thinking about Mystery Themes 2 for Learned League, is making sure there’s at least one tester whose job it is to look things up to make sure there are no factual mistakes. Maybe that falls outside the purview of this topic, as in a larger event, that would be the job for an editor. But if you’re a one-person constructing group, you’re probably relying more on your testers for that sort of thing.


    • I’ve definitely worked with groups that consider “testing” and “fact checking” two different steps (particularly for Mystery Hunt, where there’s more material to be checked). Interestingly, one team I was on tested before fact checking, and one fact checked before testing. I think in either order, whoever goes first has the less pleasant job.


      • I think it’s actually crucial that fact-checking and testsolving be totally separate steps, especially for puzzles where a successful testsolve might not involve interacting with 100% of the puzzle’s data. The two steps have different goals in mind: a testsolve is verifying that the puzzle is (a) solvable and (b) fun, while a fact-check is trying to avoid the always-embarrassing need to issue a mid-hunt correction. For this reason, fact-checkers should have access to the solution so they can cross-check it against the puzzle.


      • I don’t have the luxury of a large group of test-solvers/fact-checkers for this thankless task. So for the mini-hunt I run, I ask my test-solvers to do a once-over to also help fact-check all the clues/info, including those they skipped over during their initial solving. It’s not ideal in terms of load, but it works, possibly because the puzzles are shorter/easier. There’s also overlap in what gets picked up in the two steps.


  2. Yes, I tend to follow Scott’s process too, although I count myself lucky if I can get that third stage of checking! But the alpha-test is crucial; you need folk who understand that they may be attempting to solve an inadvertently broken puzzle and won’t want to kill you as a result.
    But I also agree that having a fact-checker is a very useful component (I do have access to someone who doesn’t seem to mind that, which is very helpful.)


    • “you need folk who understand that they may be attempting to solve an inadvertently broken puzzle and won’t want to kill you as a result”

      My wife Jackie performed the alpha-test of the 6th Duck Konundrum (from Mystery Hunt 2014)… by herself, and there were many many errors. This was months before our wedding, and miraculously she still married me.


  3. Dan,

    This is a great topic for discussion. I’ve been on both sides of testing at this point, so –

    So far I’ve only written one puzzle set, which was the April 2017 Puzzled Pint. The Puzzled Pint folks have a very extensive test-solving process, as they score the puzzles on “fun” in addition to looking for errors to fix. I went through 4-5 iterations of various puzzles in my set of location+4+meta. In the end, I think my puzzles were improved. However, with so many testsolvers (something like a dozen, I think, for mine?), there is the risk of contradictory advice, so beware of diminishing returns with too large an “n”.

    On the other side, I have been a regular testsolver for Foggy. I’d like to hear his side of things, but my experience on PB4 in particular is that when you have a massive number of puzzles to test, it is DEFINITELY good to have a round of testsolving that is just “Are these solvable?”, followed by a round of factual editing and copy editing. As a testsolver in “round 1” there are just things I will miss, and I don’t think it’s right to do a full hunt with one pass of testing. Also, I personally find it kind of fun as a testsolver to suss out what a broken puzzle is missing…

    Finally, if the hunt ends up fast for “veterans” but still fun for relative newcomers, that’s perfect in my mind. Putting something broken/inelegant in to slow things down is a travesty. I’m OK with people who solved my Puzzled Pint in 20 minutes, as that’s not the target Puzzled Pint audience.


    • Those who are fortunate enough (and/or giant enough) to win the MIT Hunt will discover a testing software system that gets passed on from team to team called Puzzletron. I believe it was created by Metaphysical Plant for the 2011 Hunt, and it’s an absolutely fantastic system for organizing puzzle drafts, editor discussion, testing feedback, and all other elements of the construction process. (In 2014, after writing five hunts without such a tool, I was the one scoffing and saying that it was going to add more bureaucracy than needed; in 2017, I was the one convincing other Setec members that my initial instincts were wrong and it’s the best thing ever, at least if your construction team is large.)

      I bring this up because one element of the incarnation Setec used (which I think also existed in 2014) is that testers rate the puzzles they test on a scale of 1 to 5 in both difficulty and fun. The numbers were good for getting an at-a-glance sense of how a test went, though I’d say we got much more information out of the qualitative responses than the ratings. That said, if editors were iffy about a puzzle, it was a lot easier to convince the author that changes needed to be made if the puzzle was getting all 1’s and 2’s in the fun department.

      On the other end of PP (which I’ve never actually done in a bar, but which I’ve solved a few of at home or at other events), I know I’m not the PP target audience, so I would not be cranky if one took me 20 minutes. 🙂


  4. I’ve done only partial test writing and testing for a few past iterations of the Microsoft Intern Game, so I have a bit of experience with their testing practices. Since the event is meant for interns with variable puzzle-solving experience, it’s probably geared to be only a tiny bit more difficult than a Puzzled Pint, though much more hands-on than a paper-based hunt.

    Staff iterations of testing are normally done by 2-3 people, with at least one of the testers being an experienced puzzle solver. The earlier tests focus more on the puzzle content itself, since the physical aspects of these puzzles may not be fully completed yet. Over the course of a puzzle’s construction, there are normally 2-4 of these tests for an average puzzle.

    About 4 months before the event, there is a small beta that uses a few teams of winter interns, employees, and outside enthusiasts. Most of the puzzles are still in paper form for the beta, so the feedback sought at that point covers solve times and puzzle design, along with some comments on fun factor.

    2 months before the event there is an RC (release control) test, which uses almost fully constructed puzzles on location (as this is a drive-around event) with 2-3 teams of employees and enthusiasts. Since there has already been a lot of puzzle feedback by this point, the feedback is mostly about location choices and physical construction, with some comments on fun factor and puzzle design. This allows the final changes and testing to be done by the staff in time for the event.

    Considering all the work and testing done by these people for a hunt of around 30 puzzles, it’s crazy to think of the time required for Mystery Hunt / PB testing!


  5. A little late joining this discussion, but here’s my two cents.

    I find the testing for Mystery Hunt to be of a generally very high standard (yes, Puzzletron is awesome). When solvers consistently get to enjoy the elegance of clean, unbroken puzzles over a 100-200 puzzle hunt, that says a lot about the commitment and behind-the-scenes effort of the constructing teams. This standard has raised expectations to the point where encountering even one broken puzzle can leave a bad experience. And while some Hunts have unfortunately had more than others, as long as constructing teams/editors keep to the philosophy of doing their due diligence (including testing) to put together a fun hunt for solvers, I can appreciate and live with such imperfections.

    This philosophy and these best practices in puzzle testing are important things that need to be shared and passed down among hunt/puzzle constructors. Without the benefit of guidance, hunt/puzzle construction can seem like a relatively straightforward task to a solver/enthusiast, but the end product might not necessarily be fun or “solvable”. I know of hunts where the culture is that broken puzzles are to be expected and worked around by brute-forcing all possibilities rather than by logical deduction. To answer Dan’s question: if I solve a live puzzle and can immediately tell you one clear thing that is wrong with it (or better yet, how to fix it), then the puzzle probably hasn’t been tested/edited sufficiently, in my view.

