transition 2 agile: Real Agile Testing, In Large Organizations

(Continued from Part 2)

What about TDD and unit testing?

As promised, let’s talk about how Test-Driven Development (TDD) fits into the over testing regime that we have been presenting.

TDD is a powerful technique for achieving high coverage unit tests and code with low coupling. TDD is a complex issue and there is quite a bit of controversy around it, so we will not take sides here. TDD is absolutely not a mandatory or necessary practice of Agile, but many very smart people swear by it, so be your own judge. It is even possible that TDD works for certain personality types and styles of thinking: TDD is an inherently inductive process, whereas top-down test design is a deductive process, and so TDD possibly reflects a way of thinking that is natural for its proponents. (See also this discussion of top-down versus bottom-up design. The book The Cathedral and the Bazaar also discusses the tension between top-down and bottom up design and development.)

The controversy between top-down and bottom-up approaches – and the personalities that are attracted to each – might even have analogy to the well known division in the sciences between those who are theorists and those who are experimentalists at heart: these two camps never seem to see eye-to-eye, but they know that they need each other. Thus, instead of getting into the TDD or no-TDD debate, we will merely explain TDD’s place in an overall testing approach, and a few things to consider when deciding whether to use it or not. Most importantly: do not let proponents of TDD convince you that TDD is a necessary Agile practice, or that teams that do not use TDD are inherently less “advanced”. These assertions are not true: TDD is a powerful design and test strategy, but there are other competing strategies (e.g., object-oriented analysis and design, functional design – both of which are completely compatible with Agile – and many others).

TDD operates at a unit test level (i.e., on individual code units or methods) and does not replace acceptance tests (including paradigms such as acceptance test-driven development (ATDD) and behavior-driven development (BDD)), which operate at a feature (aka story, or scenario) level. Unit testing is what most programmers do when they write their own tests for the methods and components that they write – regardless of whether or not they use TDD. Unit testing is also well suited for testing “failure mode” requirements such as that bad data should not crash the system. Used in conjunction with TDD and a focus on failure modes, failure mode issues can be resolved far sooner which is certainly very Agile.

Acceptance level testing is still critically important. Unit testing cannot replace acceptance tests, because one of the most important reasons for acceptance tests is to check that the developer’s understanding of the requirements is correct. If the developer misunderstands the requirements, and the developer writes the tests, then the tests will reflect that misunderstanding, yet the tests will pass! Separate people need to write a story’s acceptance tests and the story’s implementation. Therefore, TDD is a technique for improving test coverage and improving certain code attributes, and many see it as an approach to design – but it is not a replacement for acceptance tests.

One type of unit level testing that is very important,

regardless of TDD, is interface level testing.

One type of unit level testing that is very important, regardless of TDD, is interface level testing. Complex systems usually have tiers or subsystems, and it is very valuable to have high coverage test suites at these interfaces. In the TDD world, such an interface is nothing special: it is merely a unit test on those components. In a non-TDD world, it is viewed as an interface regression test and one specifically plans for it. For example, a REST based Web service defines a set of “endpoints” that are essentially remote functions, and those functions define a kind of façade interface for access to the server application. There should be a comprehensive test suite at that interface, even if there are user level (e.g., browser based, using Selenium) acceptance tests. The reason is that the REST interface is a reusable interface in its own right, and is used by many developers, so changes to it have a major impact. Leaving it to the user level tests to detect changes makes it difficult to identify the source of an error. In this scenario mocking is often the most advantageous way of unit level testing interfaces.

Another, much more important reason to have high coverage tests on each major interface is that the user level tests might not exercise the full range of functionality of the REST interface – but the REST level tests should, so that future changes to the user level code will not access new parts of the REST interface that have not been tested yet – long after the REST code has been written. The REST interface can also be tested much more efficiently, without having to run them in a browser. In fact, the performance tests will likely be performed using that interface instead of the user level interface.

Detection of change impact, at a component level, is in fact one of the arguments for Unit Testing (and TDD): if a change causes a test to fail, the test is right at the component that is failing. That helps to narrow down the impact of changes. The cost of that, however, is maintaining a large set of tests, which introduce a kind of impedance to change. Be your own judge on the tradeoff.

TDD can also impact the group process: it is generally not feasible in a shared code ownership environment to have some people using TDD and others not. Thus, TDD really needs to be a team-level decision.

It is possible that the preference (or not) for TDD

should be a criteria in assembling teams.

Legacy code maintenance is often a leading challenge when it comes to unit testing. TDD helps greatly to identify the impact when changes are made to an existing code base, but at the cost of maintaining a large body of tests, which can impede refactoring. Another example of a real challenge to utilizing TDD techniques is model-based development (see also this MathWorks summary) – often used today for the design of real time software e.g., in embedded controllers, using tools such as Simulink. These techniques are used because of the extremely high reliability of the generated code. There are ways of applying TDD in this setting (such as writing .m scripts for Simulink tests), but that is not a widespread practice. Acceptance Test Driven Development (ATDD) is potentially a better approach when using model-based development.

Finally, TDD seems to favor certain types of programmers over others. By adopting TDD, you might enable some of your team to be more effective, but you might also hinder others. It is therefore possible that the preference (or not) for TDD should be a criteria in assembling teams. Making an organization-wide decision, however, might be a mistake, unless you intend to exclude deductive thinkers from all of your programming teams.

The jury is still out on these issues, so you will have to use your own judgment: just be sure to allow for the fact that people think and work differently – from you. Do not presume that everyone thinks the way that you do. Do not presume that if you have found TDD to be effective (or not), that everyone else will find the same thing after trying it for long enough.

Some other types of testing

There are still many types of testing that we have not covered! And all are applicable to Agile teams!

Disaster Recovery

In our EFT management portal example (see Part 1), the system needed to be highly secure and reliable, comply with numerous laws, and our development process must satisfy Sarbanes Oxley laws and the information demands of an intrusive oversight group. Most likely, there is also a “continuity” or “disaster recovery” requirement, in which case there will have to be an entire repeatable test strategy for simulating a disaster with failover to another set of systems in another data center or another cloud provider. That is one case where a detailed test plan is needed: for testing disaster recovery. However, one could argue that such a plan could be developed incrementally, and tried in successive pieces, instead of all at once.

Security

Nowadays, security is increasingly being addressed by enumerating “controls” according to a security control framework such as NIST FISMA Security Controls. For government systems, this is mandatory. This used to be executed in a very document-centric way, but increasingly it is becoming more real time, with security specialists working with teams on a frequent basis – e.g., once per iteration – to review controls. Most of the controls pertain to tools and infrastructure, and can be addressed through adding scanning to some of the CI/CD pipeline scripts, to be run a few times per iteration. These scans check that the OSes are hardened and that major applications such as Apache are hardened. In addition, the security folks will want to verify that the third party components in the project binary artifact repository (Nexus, etc.) are “approved” – that is, they have been reviewed by security experts, are up to date, and do not pose a risk. All this can be done using tools without knowing much about how the application actually works.

Unfortunately, we cannot test for

careful secure design: we can only build it in.

However, some controls pertain to application design and data design. These are the hard ones. Again, for Morticia’s website (see Part 1), we don’t need to worry about that. But for the other end of the spectrum, where we know that we are a juicy target for an expert level attack – such as what occurred for Target, Home Depot, and Sony Pictures in the past 12 months – we have no choice but to assume that very smart hackers will make protracted attempts to find mistakes in our system or cause our users to make mistakes that enable the hackers to get in. To protect against that, scanning tools are merely a first step – a baby step. The only things that really work are a combination of,
1. Careful secure design.
2. Active monitoring (intrusion detection).

Unfortunately, we cannot test for careful secure design: we can only build it in. To do that, we as developers need to know secure design patterns – compartmentalization, least privilege, privileged context, and so on. For monitoring, all large organizations have monitoring in place, but they need the development team’s help in identifying what kinds of traffic are normal and what are not normal – especially at points of interconnection to third party or partner systems. Teams should conduct threat modeling, and in the process identify the traffic patterns that are normal and those that might signify an attack. This information should be passed to the network operations team. Attacks cannot be prevented, but they can often be stopped while they are in progress – before damage is done. To do that, the network operations team needs to know what inter-system traffic patterns should be considered suspicious.

Compliance with laws

Compliance with laws is a matter of decomposing the legal requirements and handling them like any other – functional or non-functional, depending on the requirement. However, while it is important for all requirements to be traceable (identifiable through acceptance criteria), it is absolutely crucial for legal compliance requirements. Otherwise, there is no way to “demonstrate” compliance, and no way to prove that sufficient diligence was applied in attempting to comply.

Performance testing

There are many facets to performance testing. If you are doing performance testing at all, four main scenario types are generally universal:
1.    Normal usage profile.
2.    Spike profile.
3.    Break and “soak” test.
4.    Ad-hoc tests.

Normal usage includes low and high load periods: the goal is to simulate the load that is expected to occur over the course of normal usage throughout the year. Thus, a normal usage profile will include the expected peak period loads. One can usually run normal load profile tests for, say, an hour – this is not long duration testing. It is also not up-time testing.

The purpose of spike testing is to see what happens if there is an unusual load transient: does the system slow down gracefully, and recover quickly after the transient is over? Spike testing generally consists of running an average load profile but overlaying a “spike” for a brief duration, and seeing what happens during and after the spike.

Break testing is seeing what happens when the load is progressively increased until the system fails. Does it fail gracefully? This is a failure mode, and will be discussed further below. Soak testing is similar, in that lots of load is generated for a long time, to see if the system starts to degrade in some way.

The last category, “ad-hoc tests”, are tests that are run by the developers in order to examine the traffic between internal system interfaces, and how that traffic changes under load. E.g., traffic between two components might increase but traffic between two others might not – indicating a possible bottleneck between the first two. Performing these tests requires intimate knowledge of the system’s design and intended behavior, and these tests are usually not left in place. However, these tests often result in monitors being designed to permanently monitor the system’s internal operation.

In an Agile setting, performance tests are best run in a separate performance testing environment, on a schedule, e.g., daily. This ensures that the results are available every day as code changes, and that the tests do not disrupt other kinds of testing. Cloud environments are perfect for load testing, which might require multiple load generation machines to generate sufficient load. Performance testing is usually implemented as a Jenkins task that runs the tests on schedule.

Testing for resiliency

Acceptance criteria are usually “happy path”: that is, if the system does what is required, then the test passes. Often a few “user error” paths are thrown in. But what should happen if something goes wrong due to input that is not expected, or due to an internal error – perhaps a transient error – of some kind? If the entire system crashes when the user enters invalid input or the network connection drops, that is probably not acceptable.

Failure modes are extremely important to explicitly test for. For example, suppose Morticia’s website has a requirement,

Given that I am perusing the product catalog,
When I click on a product,
Then the product is added to my shopping cart.

But what happens if I double-click on a product? What happens if I click on a product, but then hit the Back button in the browser? What happens if someone else clicks on that product at the same instant, causing it to be out of stock? You get the idea.

Generally, there are two ways to address this: on a feature/action basis, and on a system/component basis. The feature oriented approach is where outcomes based story design comes into play: thinking through the failure modes when writing the story. For example, for each acceptance scenario, think of as many situations that you can about what might go wrong. Then phrase these as additional acceptance criteria. You can nest the criteria if you like: languages like Gherkin support scenario nesting and parameterized tables to help you decompose acceptance criteria into hierarchical paths of functionality.

Testing for resiliency on a component basis is more technical. The test strategy should include intentional disruptions to the physical systems with systems and applications highly instrumented, to test that persistent data is not corrupted and that failover occurs properly with minimal loss of service and continuing compliance with SLAs. Memory leaks should be watched for by running the system for a long time under load. Artifacts such as exceptions written to logs should be examined and the accumulation of temporary files should be watched for. If things are happening that are not understood, the application is probably not ready for release. Applying Agile values and principles to this, this type of testing should be developed from the outset, and progressively made more and more thorough.

Concurrency testing is a special case of functional testing.

Driving the system to failure is very important for high reliability systems. The intention is to ensure that the system fails gracefully: that it fails gradually – not catastrophically – and that there is no loss or corruption of persistent data and no loss of messages that are promised to be durable. Transactional databases and durable messaging systems are designed for this, but many web applications do not perform their transactions correctly (multiple transactions in one user action) and are vulnerable to inconsistency if a user action only partially completes. Tests should therefore check that as the system fails under load, nothing “crashes”, and each simulated update request that failed does not leave artifacts in the databases or file system, and as the system recovers, requests that completed as the system was failing do not get performed twice.

Concurrency testing is a special case of functional testing, but it is often overlooked. When I (Cliff) was CTO of Digital Focus (acquired by Command Information in 2006) we used to put our apps in our performance lab when the developers thought that the apps were done. (We should have done it sooner.) We generally started to see new kinds of failures at around ten concurrent simulated users, and then a whole new set of failures at around 100 concurrent users. The first group – at around ten users – generally were concurrency errors. The second group had to do with infrastructure: TCP/IP settings and firewalls.

Regarding the first group, these are of the kind in which, say, a husband and wife have a joint bank account, and the husband accesses the account from home and the wife accesses it from her office, and they both update their account info at the same time. What happens? Does the last one win – with the other one oblivious that his or her changes were lost? Do the changes get combined? Is there an error? These conditions need to be tested for, because these things will happen under high volume use, and they will result in customer unhappiness and customer support calls. There need to be test scenarios written, with acceptance criteria, for all these kinds of failure mode scenarios. These should be run on a regular basis, using a simulated light load of tens of users, intentionally inducing these kinds of scenarios. This is not performance testing however: it is functional testing, done with concurrent usage.

Is all this in the Definition of Done?

The definition of done (DoD) is an important Agile construct in which a team defines what it means for a story – any story – to be considered done. Thus, the DoD is inherently a story level construct. That is, it is for acceptance criteria that are written for stories. DoD is not applicable to system-wide acceptance criteria, such as performance criteria, security criteria, general legal compliance criteria that might apply to the implementation of many stories, etc.

It is not practical to treat every type of requirement as part of the DoD. For example, if one has to prove performance criteria for each story, then the team could not mark off stories as complete until the performance tests are run, and each and every story would have to have its performance measured – something that is generally not necessary: typically a representative set of user actions are simulated to create a performance load. Thus, non-functional requirements or system-wide requirements are best not included in the DoD. This is shown in Figure 1 of Part 1, where a story is checked “done” after the story has passed its story level acceptance tests, has not broken any integration tests, and the user has tried the story in a demo environment and agrees that the acceptance criteria have been met. Ideally, this happens during an iteration – not at the end – otherwise, nothing gets marked as “done” until the end of an iteration. Thus, marking a story as “done” is tentative because that decision can be rejected by the Product Owner during the iteration review, even if a user has tried the story and thought that it was done. Remember that the Product Owner represents many stakeholders – not just the users.

Another technique we use with larger sets of teams (think portfolio) – and especially when there are downstream types of testing (e.g., hardware integration testing) – is definition of ready (DoR). The state of “ready” is a precursor to the state of being “done”. This helps to ensure that the DoD – which might include complex forms of testing – can be met by the team. The team first ensures that a story meets the DoR. These are other criteria such as that the story has acceptance criteria (DoD would say the acceptance criteria have been met), certain analysis have be completed, etc. – just enough so that development and testing have a much higher likelihood of being completed within an iteration. This works with teams and programs of all sizes. We do find that for larger programs, the DoR is almost always very useful. Defining a DoR with the Product Owner is also a great way of engaging the Product Owner on non-functional issues to increase their understanding of those issues and ensure the team is being fed high quality stories.

End Of Part 3

Next time in Part 4 we will connect the dots on the organizational issues of all this!

Authors (alphabetically):
Scott Barnes
Cliff Berg

transition 2 agile

Interview with Madhur Kathuria

Interview with Elena Yatzeck

Monday, December 22, 2014

Real Agile Testing, In Large Organizations – Part 3