
Low-tech Cucumber replacement

Posted on — Apr 4, 2022

In the previous post, I showed how Gherkin Scenarios provide a great framework for acceptance test definition. I also showed how the dreams of the BDD movement extend to automating Gherkin, using a tool called Cucumber, which enforces traceability of acceptance tests.

In this article, I want to explain my disillusionment with Cucumber, and show how a low-tech alternative emerged that I believe covers most of the value for none of the effort.

Cucumber fatigue

Cucumber is originally a Ruby tool, but I primarily write Python these days. Though tooling exists to hook Ruby's Cucumber into other languages, the most convenient option for me was the Python rewrites: the elderly Behave, the defunct lettuce, and more recently the nobly-backed pytest-bdd, of Pytest workgroup fame.

At one time or another, I ended up testing all of them on personal projects, to experiment with how cool BDD is in practice; I was already a big Gherkin fan. I don't remember much about playing with the tools, as this was over ten years ago now. But I do remember coming out disappointed, feeling like the tooling didn't deliver on the promise of BDD.

I’ll try to break down why I felt this.

Pytest framework is just too good

Having worked with the fantastic pytest, I find the framework natural and convenient: each function is a standalone test, with reusable fixtures providing flexibility for setup and teardown.
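For a concrete (if hypothetical) example of that style, here is a standalone test leaning on a fixture for setup and teardown:

import pytest

@pytest.fixture
def wordlist():
    # Setup: provide the resource the test needs
    words = {"cat", "dog", "horse"}
    yield words
    # Teardown: runs after the test completes
    words.clear()

def test_known_word_is_present(wordlist):
    # One function, one standalone test
    assert "cat" in wordlist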

Compared to that, the Given/When/Then tests in Cucumber force us to split code into three or more functions. Tests are thus harder to read, as we need to jump between functions to see what trigger ran before an assert error. Worse, there's hardly a trace to follow in the execution stack: steps run sequentially, not recursively, so by the time we reach an assert error, the precondition and trigger have already executed and left nothing in the backtrace.
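To illustrate the split, here is a hedged sketch of how such a test might look in pytest-bdd (the feature file path and step wording are hypothetical, as is the check_valid_word validator; target_fixture on When steps assumes a recent pytest-bdd):

from pytest_bdd import scenario, when, then

# Tie the test to a Scenario in a separate feature file
@scenario("features/words.feature", "Reject long words")
def test_reject_long_words():
    pass  # Empty: the step functions below carry the actual logic

@when('guessing "affable"', target_fixture="result")
def guess_affable():
    # check_valid_word is a hypothetical function under test
    return check_valid_word("affable")

@then("the guess is rejected")
def guess_is_rejected(result):
    is_valid, _reason = result
    assert not is_valid  # Fails here, far from the When that triggered it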

Cucumber test code reuse is hard

Code reuse inside tests is hard enough as it is, but Cucumber encourages reusing Given step implementations across tests as the supreme form of BDD. So two Gherkin Features that happen to share the same pre-conditional Given should use the same Given setup function in Cucumber… except I never managed to do it.
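On paper, the promised reuse looks like this: a single step implementation, matched by any Scenario in any Feature that opens with the same Given text (again a pytest-bdd-flavoured sketch, with hypothetical names):

from pytest_bdd import given

# One implementation, shared by every Scenario starting with this Given
@given("a logged-in user", target_fixture="user")
def logged_in_user():
    return {"name": "alice", "session": "token-123"}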

Tests in a typical project have multiple layers, so a Feature's Given at one layer becomes a full Scenario at the layer below: how do we reuse the fixture code then?

Similarly, chains of events (say, a sequence of five message events that must happen in order) are very hard to write Gherkin about, because each event in the chain is both an assertion (Then) and a trigger for the next one (Given). Writing Cucumber code to match this becomes tedious.

So Gherkin step code reuse, the "pinnacle of BDD", was a bust for me. For most of these cases, I ended up writing Pytest-style tests, one test function at a time, ditching Cucumber.

Too different from normal workflow

The Cucumber technique of writing tests is a major break from conventional test practices, so big a change that I don't feel comfortable asking anybody else to adopt it in a team project.

So the tooling didn't impress me, the abstraction level felt wrong, and this big break with well-established practice didn't seem justified enough to get workplaces to adopt it.

Where is the value, again?

As a mildly self-aware person, I taught myself to follow a rule: anytime I spend half an hour deep-diving into a problem, I must take a step back and ask "Why am I doing this?", along with follow-up questions like "Is the problem as stated worth solving?" and "What long-term goal is this short-term hell serving?".

Most of the time, the answer is "actually, I don't need to solve this, not exactly", and I find some way to side-step the issue altogether. Or I convince myself it wasn't worth trying in the first place, and abandon the line of inquiry.

So, what are we really doing here, trying to crowbar this Cucumber tool into the testing workflow?!

Well, as we saw before, gathering requirements was important enough to write nice Gherkin Features about. And now we would like to prove that we indeed have 100% coverage of the Features via acceptance tests. That's a boolean answer: either all Features are covered, or tests are missing. On top of that, we want traceability: knowing which test covers which Feature.

But is Cucumber necessary for this? Really?

Surveying my thoughts, I realized I still enjoyed weaving Scenarios into conventional test functions; I just didn't want to do anything drastic to the test code. So I evolved a technique that can be seen as a natural middle ground.

Low-tech Gherkin inside test comments

Going back to my Pytest code, I found myself still writing new tests with Gherkin Scenarios, but inserting them as code comments. I found the Gherkin helps me reason about the tests, while its comment nature avoids any tooling interference.

# check_valid_word is the function under test, from the project's own code
def test_reject_long_words():
    """Scenario: Reject long words"""
    # When guessing "affable"
    guess = "affable"
    is_valid, reject_reason = check_valid_word(guess)
    # Then the guess is rejected
    assert not is_valid, "Overly long guess should have been rejected"
    # And reason for rejection is "Guess too long"
    assert reject_reason == "Guess too long"
Code Snippet 1: Gherkin in tests, via comments. The seed of an idea.

When I noticed myself doing this, I started doing it more and more intentionally, down to starting new tests with just the Gherkin, to direct the implementation:

def test_reject_overly_short_words():
    """Scenario: Reject short words"""
    # When guessing "baby"
    # Then the guess is rejected
    # And reason for rejection is "Guess too short"
    raise NotImplementedError("WIP!")
Code Snippet 2: New test stub, driven by Gherkin.

The Gherkin is purely comments for the reader, and I love it: no tooling in the way. But soon my tests grew to multiple pages of code, and I wanted a bird's-eye view of my tests' scenarios, to know which test to write next.

Accidental automation

I noticed that the markup syntax I had landed on was pretty regular … like, regular-expression kind of regular. So I experimented with a quick grep command on a given file, extracting all the Gherkin Keywords.

export GHERKIN_KEYWORDS="Given|When|Then|And|But|Scenario|Background|Feature|In order to|As a|I want to|I need to|So that"
grep -E "${GHERKIN_KEYWORDS}" tests/test_something.py
Code Snippet 3: First attempt at extracting Gherkin
    """Scenario: Reject long words"""
    # When guessing "affable"
    # Then the guess is rejected
    # And reason for rejection is "Guess too long"
    """Scenario: Reject short words"""
    # When guessing "baby"
    # Then the guess is rejected
    # And reason for rejection is "Guess too short"
Code Snippet 4: Result of executing it on a test file

It worked, but I was irrationally irritated by the leftover comment markers and whitespace. A few iterations of the script later, I had a shell function called show-gherkin:

# Extracts Gherkin from files, printing filename and line number.
# Usage: show-gherkin tests/*.py
# Specify only one file to show only line numbers
show-gherkin () {
    local GHERKIN_KEYWORDS="Given|When|Then|And|But|Scenario|Background|Feature|In order to|As a|I want to|I need to|So that"
    # grep: -o print only the match, -s silence file errors, -n prefix line numbers;
    # first sed drops indentation and comment markers, second strips docstring quotes
    grep -E -osn "^.*($GHERKIN_KEYWORDS)(.*)" "$@" \
        | sed -E "s/^(.*):.*($GHERKIN_KEYWORDS)/\1: \2/" \
        | sed 's/"""//'
}
Code Snippet 5: Show-gherkin, as it lives now in my dotfiles
13: Scenario: Reject long words
14: When guessing "affable"
17: Then the guess is rejected
19: And reason for rejection is "Guess too long"
24: Scenario: Reject short words
25: When guessing "baby"
28: Then the guess is rejected
30: And reason for rejection is "Guess too short"
Code Snippet 6: Result of executing show-gherkin on a test file. Note the numbers are line numbers where the expression is found!

I keep using this script to this day, and have started encouraging others to adopt the commenting pattern, sharing the little script with my team at work.

Conclusion

I explained how the noble dream of Cucumber tooling seems to fall short, not delivering all the value it promised. I also showed how this dream was rekindled by aiming lower: a simple comment-based markup in code. And again, such joy at finding that small, incremental amounts of automation add value in a low-tech fashion.

I believe this simple solution really captures over 80% of the value of using Gherkin Scenarios in BDD frameworks, for less than 20% of the effort of "proper tools" like Cucumber, and with none of the drawbacks outlined above.

I do believe the markup approach I grew here hasn't reached its full potential, and I will keep experimenting with the format. I think there's an opportunity to build tooling that leverages the comments for traceability, like Cucumber does, as sketched below. If I succeed at building tools that still feel lightweight, you can be sure I'll post about it on this blog.
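Here is a speculative sketch of that idea (not tooling I've built yet, and the features/ and tests/ layout is an assumption): compare Scenario titles declared in .feature files against those found in test comments, and flag any Scenario without a matching test.

import re
from pathlib import Path

# Matches a Scenario title, tolerating a trailing docstring quote
SCENARIO_RE = re.compile(r'Scenario: (.+?)(?:""")?$')

def scenario_titles(paths):
    """Collect Scenario titles found in the given files."""
    titles = set()
    for path in paths:
        for line in Path(path).read_text().splitlines():
            match = SCENARIO_RE.search(line)
            if match:
                titles.add(match.group(1).strip())
    return titles

# Feature scenarios with no matching Scenario comment in the tests
missing = scenario_titles(Path("features").glob("*.feature")) \
    - scenario_titles(Path("tests").glob("test_*.py"))
for title in sorted(missing):
    print(f"No test covers: {title}")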

This is the end of my planned series of articles on Gherkin and BDD. Join me in the next post, where I'll demonstrate everything we just covered through a toy project.