Monday, December 22, 2014

Real Agile Testing, In Large Organizations – Part 3

(Continued from Part 2)

What about TDD and unit testing?

As promised, let’s talk about how Test-Driven Development (TDD) fits into the overall testing regime that we have been presenting.

TDD is a powerful technique for achieving high coverage unit tests and code with low coupling. TDD is a complex issue and there is quite a bit of controversy around it, so we will not take sides here. TDD is absolutely not a mandatory or necessary practice of Agile, but many very smart people swear by it, so be your own judge. It is even possible that TDD works for certain personality types and styles of thinking: TDD is an inherently inductive process, whereas top-down test design is a deductive process, and so TDD possibly reflects a way of thinking that is natural for its proponents. (See also this discussion of top-down versus bottom-up design. The book The Cathedral and the Bazaar also discusses the tension between top-down and bottom-up design and development.)

The controversy between top-down and bottom-up approaches – and the personalities that are attracted to each – might even be analogous to the well-known division in the sciences between those who are theorists and those who are experimentalists at heart: these two camps never seem to see eye-to-eye, but they know that they need each other. Thus, instead of getting into the TDD or no-TDD debate, we will merely explain TDD’s place in an overall testing approach, and a few things to consider when deciding whether to use it or not. Most importantly: do not let proponents of TDD convince you that TDD is a necessary Agile practice, or that teams that do not use TDD are inherently less “advanced”. These assertions are not true: TDD is a powerful design and test strategy, but there are other competing strategies (e.g., object-oriented analysis and design, functional design – both of which are completely compatible with Agile – and many others).

TDD operates at a unit test level (i.e., on individual code units or methods) and does not replace acceptance tests (including paradigms such as acceptance test-driven development (ATDD) and behavior-driven development (BDD)), which operate at a feature (aka story, or scenario) level. Unit testing is what most programmers do when they write their own tests for the methods and components that they write – regardless of whether or not they use TDD. Unit testing is also well suited for testing “failure mode” requirements such as that bad data should not crash the system. When unit testing is combined with TDD and a focus on failure modes, failure mode issues can be resolved far sooner, which is certainly very Agile.
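To make the “bad data should not crash the system” idea concrete, here is a minimal sketch of failure mode unit tests in Python. The function `parse_transfer_amount`, its rules (positive amounts, at most two decimal places), and the list of bad inputs are all assumptions invented for illustration; they are not taken from the EFT example.

```python
from decimal import Decimal, InvalidOperation


def parse_transfer_amount(raw):
    """Hypothetical unit under test: turn user input into whole cents.

    Bad data must produce a ValueError, never an unhandled crash.
    """
    if raw is None:
        raise ValueError("amount is required")
    try:
        value = Decimal(str(raw).strip())
    except InvalidOperation:
        raise ValueError(f"not a number: {raw!r}") from None
    # Check finiteness first: ordering comparisons on a Decimal NaN raise.
    if not value.is_finite() or value <= 0:
        raise ValueError(f"amount must be a positive number: {raw!r}")
    cents = value * 100
    if cents != cents.to_integral_value():
        raise ValueError(f"more than two decimal places: {raw!r}")
    return int(cents)


# Failure mode inputs that the unit tests cover explicitly (assumed list).
BAD_INPUTS = [None, "", "   ", "abc", "-5", "0", "NaN", "Infinity", "1.234", "12,50"]


def rejects_cleanly(raw):
    """Return True if bad input is rejected with the documented error type."""
    try:
        parse_transfer_amount(raw)
    except ValueError:
        return True
    return False


def test_bad_data_does_not_crash():
    assert all(rejects_cleanly(raw) for raw in BAD_INPUTS)


def test_valid_amount_is_converted_to_cents():
    assert parse_transfer_amount("10.50") == 1050
```

Any exception other than `ValueError` escapes `rejects_cleanly` and fails the test, which is the point: the unit may refuse bad data, but only in the way it has promised to.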

Acceptance level testing is still critically important. Unit testing cannot replace acceptance tests, because one of the most important reasons for acceptance tests is to check that the developer’s understanding of the requirements is correct. If the developer misunderstands the requirements, and the developer writes the tests, then the tests will reflect that misunderstanding, yet the tests will pass! Separate people need to write a story’s acceptance tests and the story’s implementation. Therefore, TDD is a technique for improving test coverage and improving certain code attributes, and many see it as an approach to design – but it is not a replacement for acceptance tests.

One type of unit level testing that is very important, regardless of TDD, is interface level testing. Complex systems usually have tiers or subsystems, and it is very valuable to have high coverage test suites at these interfaces. In the TDD world, such an interface is nothing special: it is merely a unit test on those components. In a non-TDD world, it is viewed as an interface regression test and one specifically plans for it. For example, a REST based Web service defines a set of “endpoints” that are essentially remote functions, and those functions define a kind of façade interface for access to the server application. There should be a comprehensive test suite at that interface, even if there are user level (e.g., browser based, using Selenium) acceptance tests. The reason is that the REST interface is a reusable interface in its own right, and is used by many developers, so changes to it have a major impact. Leaving it to the user level tests to detect changes makes it difficult to identify the source of an error. In this scenario mocking is often the most advantageous way of unit level testing interfaces.

Another, much more important reason to have high coverage tests on each major interface is that the user level tests might not exercise the full range of functionality of the REST interface. The REST level tests should exercise it fully, so that future changes to the user level code do not start relying on parts of the REST interface that have never been tested – long after the REST code has been written. The REST interface can also be tested much more efficiently, because its tests do not have to be run in a browser. In fact, the performance tests will likely be performed using that interface instead of the user level interface.
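An interface level suite of the kind described above might look like the following sketch. The facade function `handle_request`, the `/accounts/{id}` endpoints, and the `accounts` component behind them are hypothetical stand-ins, and the server tier is replaced with `unittest.mock.Mock` so that the tests exercise only the interface.

```python
from unittest.mock import Mock


def handle_request(method, path, accounts):
    """Hypothetical REST facade: routes a request to the tier behind it.

    `accounts` stands in for the server-side component; returns (status, body).
    """
    parts = path.strip("/").split("/")
    if len(parts) != 2 or parts[0] != "accounts":
        return 404, {"error": "unknown endpoint"}
    account_id = parts[1]
    if method == "GET":
        record = accounts.find(account_id)
        if record is None:
            return 404, {"error": "no such account"}
        return 200, record
    if method == "DELETE":
        if accounts.remove(account_id):
            return 204, None
        return 404, {"error": "no such account"}
    return 405, {"error": "method not allowed"}


def test_get_returns_record():
    accounts = Mock()
    accounts.find.return_value = {"id": "42", "owner": "Morticia"}
    assert handle_request("GET", "/accounts/42", accounts) == (
        200,
        {"id": "42", "owner": "Morticia"},
    )
    accounts.find.assert_called_once_with("42")


def test_get_missing_account():
    accounts = Mock()
    accounts.find.return_value = None
    assert handle_request("GET", "/accounts/99", accounts)[0] == 404


def test_delete_is_covered_even_if_no_page_uses_it_yet():
    accounts = Mock()
    accounts.remove.return_value = True
    assert handle_request("DELETE", "/accounts/42", accounts) == (204, None)
    accounts.remove.assert_called_once_with("42")


def test_unsupported_method_and_unknown_path():
    accounts = Mock()
    assert handle_request("PUT", "/accounts/42", accounts)[0] == 405
    assert handle_request("GET", "/nowhere", accounts)[0] == 404
    accounts.find.assert_not_called()
```

The DELETE test is the interesting one: even if no browser page calls that endpoint today, the interface suite still covers it, so a later change to the user level code does not land on untested server behavior.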

Detection of change impact, at a component level, is in fact one of the arguments for Unit Testing (and TDD): if a change causes a test to fail, the test is right at the component that is failing. That helps to narrow down the impact of changes. The cost of that, however, is maintaining a large set of tests, which introduce a kind of impedance to change. Be your own judge on the tradeoff.

TDD can also impact the group process: it is generally not feasible in a shared code ownership environment to have some people using TDD and others not. Thus, TDD really needs to be a team-level decision.

Legacy code maintenance is often a leading challenge when it comes to unit testing. TDD helps greatly to identify the impact when changes are made to an existing code base, but at the cost of maintaining a large body of tests, which can impede refactoring. Another example of a real challenge to utilizing TDD techniques is model-based development (see also this MathWorks summary), which is often used today for the design of real-time software, e.g., in embedded controllers, using tools such as Simulink. These techniques are used because of the extremely high reliability of the generated code. There are ways of applying TDD in this setting (such as writing .m scripts for Simulink tests), but that is not a widespread practice. Acceptance Test Driven Development (ATDD) is potentially a better approach when using model-based development.

Finally, TDD seems to favor certain types of programmers over others. By adopting TDD, you might enable some of your team to be more effective, but you might also hinder others. It is therefore possible that the preference (or not) for TDD should be a criterion in assembling teams. Making an organization-wide decision, however, might be a mistake, unless you intend to exclude deductive thinkers from all of your programming teams.

The jury is still out on these issues, so you will have to use your own judgment: just be sure to allow for the fact that people think and work differently – from you. Do not presume that everyone thinks the way that you do. Do not presume that if you have found TDD to be effective (or not), that everyone else will find the same thing after trying it for long enough.

Some other types of testing

There are still many types of testing that we have not covered! And all are applicable to Agile teams!

Disaster Recovery

In our EFT management portal example (see Part 1), the system needs to be highly secure and reliable and comply with numerous laws, and our development process must satisfy Sarbanes-Oxley requirements and the information demands of an intrusive oversight group. Most likely, there is also a “continuity” or “disaster recovery” requirement, in which case there will have to be an entire repeatable test strategy for simulating a disaster with failover to another set of systems in another data center or another cloud provider. That is one case where a detailed test plan is needed: for testing disaster recovery. However, one could argue that such a plan could be developed incrementally, and tried in successive pieces, instead of all at once.


Security

Nowadays, security is increasingly being addressed by enumerating “controls” according to a security control framework such as NIST FISMA Security Controls. For government systems, this is mandatory. This used to be executed in a very document-centric way, but increasingly it is becoming more real time, with security specialists working with teams on a frequent basis – e.g., once per iteration – to review controls. Most of the controls pertain to tools and infrastructure, and can be addressed through adding scanning to some of the CI/CD pipeline scripts, to be run a few times per iteration. These scans check that the OSes are hardened and that major applications such as Apache are hardened. In addition, the security folks will want to verify that the third party components in the project binary artifact repository (Nexus, etc.) are “approved” – that is, they have been reviewed by security experts, are up to date, and do not pose a risk. All this can be done using tools without knowing much about how the application actually works.

However, some controls pertain to application design and data design. These are the hard ones. Again, for Morticia’s website (see Part 1), we don’t need to worry about that. But for the other end of the spectrum, where we know that we are a juicy target for an expert level attack – such as what occurred for Target, Home Depot, and Sony Pictures in the past 12 months – we have no choice but to assume that very smart hackers will make protracted attempts to find mistakes in our system or cause our users to make mistakes that enable the hackers to get in. To protect against that, scanning tools are merely a first step – a baby step. The only things that really work are a combination of,
1.    Careful secure design.
2.    Active monitoring (intrusion detection).

Unfortunately, we cannot test for careful secure design: we can only build it in. To do that, we as developers need to know secure design patterns – compartmentalization, least privilege, privileged context, and so on. For monitoring, all large organizations have monitoring in place, but they need the development team’s help in identifying what kinds of traffic are normal and what are not normal – especially at points of interconnection to third party or partner systems. Teams should conduct threat modeling, and in the process identify the traffic patterns that are normal and those that might signify an attack. This information should be passed to the network operations team. Attacks cannot be prevented, but they can often be stopped while they are in progress – before damage is done. To do that, the network operations team needs to know what inter-system traffic patterns should be considered suspicious.
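One lightweight way to hand threat modeling results to the network operations team is as data rather than prose. The sketch below assumes a hypothetical table of expected inter-system flows, each with a ceiling on requests per minute; the system names, ports, and numbers are invented placeholders, and real intrusion detection tooling is of course far richer than this.

```python
# Hypothetical output of a threat modeling session: each expected
# inter-system flow, with an assumed ceiling on requests per minute.
EXPECTED_FLOWS = {
    ("web", "api", 443): 5000,
    ("api", "db", 5432): 20000,
    ("api", "partner-gateway", 8443): 200,
}


def classify_flow(source, destination, port, requests_per_minute):
    """Label an observed flow as normal or suspicious, with a reason."""
    ceiling = EXPECTED_FLOWS.get((source, destination, port))
    if ceiling is None:
        return "suspicious: flow not in threat model"
    if requests_per_minute > ceiling:
        return "suspicious: volume above expected ceiling"
    return "normal"
```

For example, `classify_flow("web", "db", 5432, 10)` is flagged because the web tier was never expected to talk to the database directly, regardless of how little traffic is involved.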

Compliance with laws

Compliance with laws is a matter of decomposing the legal requirements and handling them like any other – functional or non-functional, depending on the requirement. However, while it is important for all requirements to be traceable (identifiable through acceptance criteria), it is absolutely crucial for legal compliance requirements. Otherwise, there is no way to “demonstrate” compliance, and no way to prove that sufficient diligence was applied in attempting to comply.
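Traceability can be checked mechanically if each acceptance test declares which requirements it verifies. The following is a minimal sketch under that assumption; the requirement IDs and test names in the usage note are hypothetical.

```python
def untraced_requirements(requirements, acceptance_tests):
    """Return the IDs of requirements that no acceptance test claims to verify.

    requirements:     dict mapping requirement ID -> description
    acceptance_tests: dict mapping test name -> set of requirement IDs
    """
    covered = set()
    for requirement_ids in acceptance_tests.values():
        covered.update(requirement_ids)
    return sorted(set(requirements) - covered)


def dangling_references(requirements, acceptance_tests):
    """Return test names that cite a requirement ID that does not exist."""
    return sorted(
        name
        for name, requirement_ids in acceptance_tests.items()
        if not set(requirement_ids) <= set(requirements)
    )
```

Given requirements `{"LEGAL-1": "...", "LEGAL-2": "..."}` and a single test that cites only `LEGAL-1`, the first function reports `LEGAL-2` as untraced. Running a check like this in the build gives the team an early warning that a compliance requirement has no demonstrable test behind it.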

Performance testing

There are many facets to performance testing. If you are doing performance testing at all, four main scenario types are generally universal:
1.    Normal usage profile.
2.    Spike profile.
3.    Break and “soak” test.
4.    Ad-hoc tests.

Normal usage includes low and high load periods: the goal is to simulate the load that is expected to occur over the course of normal usage throughout the year. Thus, a normal usage profile will include the expected peak period loads. One can usually run normal load profile tests for, say, an hour – this is not long duration testing. It is also not up-time testing.

The purpose of spike testing is to see what happens if there is an unusual load transient: does the system slow down gracefully, and recover quickly after the transient is over? Spike testing generally consists of running an average load profile but overlaying a “spike” for a brief duration, and seeing what happens during and after the spike.

Break testing is seeing what happens when the load is progressively increased until the system fails. Does it fail gracefully? This is a failure mode, and will be discussed further below. Soak testing is similar, in that lots of load is generated for a long time, to see if the system starts to degrade in some way.
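The first three scenario types differ mainly in the shape of the load over time. As a rough sketch, each can be expressed as a list of target concurrent users per minute that a load generator would then follow; the function names and parameters here are assumptions for illustration and are not tied to JMeter or any other tool.

```python
def normal_profile(minutes, base, peak, peak_start, peak_end):
    """Normal usage: a base load with an expected peak period."""
    return [peak if peak_start <= m < peak_end else base for m in range(minutes)]


def spike_profile(minutes, average, spike_load, spike_at, spike_length):
    """Spike: an average load with a brief transient overlaid."""
    spike_end = spike_at + spike_length
    return [spike_load if spike_at <= m < spike_end else average for m in range(minutes)]


def break_profile(start, step, ceiling):
    """Break: increase the load step by step up to a ceiling.

    In a real run the ramp stops at whatever step the system fails.
    """
    return list(range(start, ceiling + 1, step))


def soak_profile(minutes, load):
    """Soak: hold a heavy load constant for a long time."""
    return [load] * minutes
```

For instance, `spike_profile(6, 50, 500, 2, 2)` yields `[50, 50, 500, 500, 50, 50]`: two quiet minutes, a two minute spike, and two minutes in which to observe how quickly the system recovers.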

The last category, “ad-hoc tests”, are tests that are run by the developers in order to examine the traffic between internal system interfaces, and how that traffic changes under load. E.g., traffic between two components might increase but traffic between two others might not – indicating a possible bottleneck between the first two. Performing these tests requires intimate knowledge of the system’s design and intended behavior, and these tests are usually not left in place. However, these tests often result in monitors being designed to permanently monitor the system’s internal operation.

In an Agile setting, performance tests are best run in a separate performance testing environment, on a schedule, e.g., daily. This ensures that the results are available every day as code changes, and that the tests do not disrupt other kinds of testing. Cloud environments are perfect for load testing, which might require multiple load generation machines to generate sufficient load. Performance testing is usually implemented as a Jenkins task that runs the tests on schedule.

Testing for resiliency

Acceptance criteria are usually “happy path”: that is, if the system does what is required, then the test passes. Often a few “user error” paths are thrown in. But what should happen if something goes wrong due to input that is not expected, or due to an internal error – perhaps a transient error – of some kind? If the entire system crashes when the user enters invalid input or the network connection drops, that is probably not acceptable.

Failure modes are extremely important to explicitly test for. For example, suppose Morticia’s website has a requirement,
Given that I am perusing the product catalog,
When I click on a product,
Then the product is added to my shopping cart.

But what happens if I double-click on a product? What happens if I click on a product, but then hit the Back button in the browser? What happens if someone else clicks on that product at the same instant, causing it to be out of stock? You get the idea.

Generally, there are two ways to address this: on a feature/action basis, and on a system/component basis. The feature oriented approach is where outcomes based story design comes into play: thinking through the failure modes when writing the story. For example, for each acceptance scenario, think of as many situations as you can in which something might go wrong. Then phrase these as additional acceptance criteria. You can nest the criteria if you like: languages like Gherkin support scenario nesting and parameterized tables to help you decompose acceptance criteria into hierarchical paths of functionality.
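The same idea of a parameterized table of failure mode criteria can be sketched in plain Python. The `Cart` class and the chosen outcomes (a double-click is ignored, an out-of-stock product is refused) are assumptions made for illustration; the real answers to those questions belong to the Product Owner.

```python
class Cart:
    """Hypothetical shopping cart used only to illustrate the scenarios."""

    def __init__(self, stock):
        self.stock = dict(stock)  # product -> units available
        self.items = set()        # products already in this cart

    def click(self, product):
        """Handle a click on a product and report the outcome."""
        if product in self.items:
            return "already in cart"  # assumed answer to the double-click question
        if self.stock.get(product, 0) <= 0:
            return "out of stock"     # assumed answer to the lost-race question
        self.stock[product] -= 1
        self.items.add(product)
        return "added"


# Each row: (scenario name, initial stock, sequence of clicks, expected outcomes).
SCENARIOS = [
    ("happy path", {"lamp": 3}, ["lamp"], ["added"]),
    ("double-click", {"lamp": 3}, ["lamp", "lamp"], ["added", "already in cart"]),
    ("someone else took the last one", {"lamp": 0}, ["lamp"], ["out of stock"]),
    ("unknown product", {}, ["lamp"], ["out of stock"]),
]


def run_scenarios():
    """Run every row and return the names of the scenarios that failed."""
    failures = []
    for name, stock, clicks, expected in SCENARIOS:
        cart = Cart(stock)
        outcomes = [cart.click(product) for product in clicks]
        if outcomes != expected:
            failures.append(name)
    return failures
```

Adding a newly discovered failure mode then costs one more row in `SCENARIOS`, which keeps the barrier to writing it down low.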

Testing for resiliency on a component basis is more technical. The test strategy should include intentional disruptions to the physical systems with systems and applications highly instrumented, to test that persistent data is not corrupted and that failover occurs properly with minimal loss of service and continuing compliance with SLAs. Memory leaks should be watched for by running the system for a long time under load. Artifacts such as exceptions written to logs should be examined and the accumulation of temporary files should be watched for. If things are happening that are not understood, the application is probably not ready for release. Applying Agile values and principles to this, this type of testing should be developed from the outset, and progressively made more and more thorough.

Driving the system to failure is very important for high reliability systems. The intention is to ensure that the system fails gracefully: that it fails gradually – not catastrophically – and that there is no loss or corruption of persistent data and no loss of messages that are promised to be durable. Transactional databases and durable messaging systems are designed for this, but many web applications do not perform their transactions correctly (multiple transactions in one user action) and are vulnerable to inconsistency if a user action only partially completes. Tests should therefore check that as the system fails under load, nothing “crashes”, and each simulated update request that failed does not leave artifacts in the databases or file system, and as the system recovers, requests that completed as the system was failing do not get performed twice.
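The two properties in that last sentence, no leftover artifacts from a failed update and no request performed twice, can be demonstrated with a small sketch using Python’s built-in `sqlite3` module. The schema, the `transfer` function, and the request ID scheme are assumptions for illustration: all updates for one user action go in a single transaction, and a table of processed request IDs rejects replays.

```python
import sqlite3


def open_db():
    """Create an in-memory database with two accounts (illustrative schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE accounts ("
        "id TEXT PRIMARY KEY, balance INTEGER NOT NULL CHECK (balance >= 0))"
    )
    conn.execute("CREATE TABLE processed (request_id TEXT PRIMARY KEY)")
    conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("a", 100), ("b", 0)])
    conn.commit()
    return conn


def transfer(conn, request_id, source, destination, amount):
    """Apply one user action atomically; return True only if it was applied."""
    try:
        with conn:  # one transaction: commit on success, roll back on error
            # A replayed request ID violates the primary key and aborts.
            conn.execute("INSERT INTO processed VALUES (?)", (request_id,))
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE id = ?",
                (amount, destination),
            )
            # An overdraft violates the CHECK and undoes the credit above.
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE id = ?",
                (amount, source),
            )
    except sqlite3.IntegrityError:
        return False
    return True


def balances(conn):
    """Return the current balances as a dict for easy assertions."""
    return dict(conn.execute("SELECT id, balance FROM accounts"))
```

A test can then submit the same request twice and assert that the balances moved once, or submit an overdraft and assert that neither the credit nor the request ID survived the rollback.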

Concurrency testing is a special case of functional testing, but it is often overlooked. When I (Cliff) was CTO of Digital Focus (acquired by Command Information in 2006) we used to put our apps in our performance lab when the developers thought that the apps were done. (We should have done it sooner.) We generally started to see new kinds of failures at around ten concurrent simulated users, and then a whole new set of failures at around 100 concurrent users. The first group – at around ten users – generally were concurrency errors. The second group had to do with infrastructure: TCP/IP settings and firewalls.

Regarding the first group, these are of the kind in which, say, a husband and wife have a joint bank account, and the husband accesses the account from home and the wife accesses it from her office, and they both update their account info at the same time. What happens? Does the last one win – with the other one oblivious that his or her changes were lost? Do the changes get combined? Is there an error? These conditions need to be tested for, because these things will happen under high volume use, and they will result in customer unhappiness and customer support calls. There need to be test scenarios written, with acceptance criteria, for all these kinds of failure mode scenarios. These should be run on a regular basis, using a simulated light load of tens of users, intentionally inducing these kinds of scenarios. This is not performance testing however: it is functional testing, done with concurrent usage.
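The joint account scenario can be reproduced deterministically, without real threads racing each other, by having both users read before either one writes. The sketch below assumes one possible policy, optimistic locking with a version number, under which the second writer is told that the data changed rather than silently overwriting it. The class and field names are hypothetical, and other policies (last one wins, merging the changes) are equally testable this way.

```python
import threading


class StaleUpdateError(Exception):
    """Raised when an update is based on an out-of-date read."""


class JointAccount:
    """Hypothetical account record guarded by a version number."""

    def __init__(self, info):
        self._info = dict(info)
        self._version = 1
        self._lock = threading.Lock()

    def read(self):
        """Return a copy of the info and the version it corresponds to."""
        with self._lock:
            return dict(self._info), self._version

    def update(self, changes, expected_version):
        """Apply changes only if nobody has updated since the caller's read."""
        with self._lock:
            if expected_version != self._version:
                raise StaleUpdateError("account changed since it was read")
            self._info.update(changes)
            self._version += 1
            return self._version


def simulate_simultaneous_edit():
    """Both spouses read the same version, then both try to save."""
    account = JointAccount({"address": "1 Old Road", "phone": "555-0100"})
    _, husband_version = account.read()
    _, wife_version = account.read()
    account.update({"address": "2 New Street"}, husband_version)
    try:
        account.update({"phone": "555-0199"}, wife_version)
    except StaleUpdateError:
        wife_was_told = True
    else:
        wife_was_told = False
    info, _ = account.read()
    return info, wife_was_told
```

The acceptance criterion under this policy is that the second writer is informed: nobody’s change disappears without an explanation.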

Is all this in the Definition of Done?

The definition of done (DoD) is an important Agile construct in which a team defines what it means for a story – any story – to be considered done. Thus, the DoD is inherently a story level construct. That is, it is for acceptance criteria that are written for stories. DoD is not applicable to system-wide acceptance criteria, such as performance criteria, security criteria, general legal compliance criteria that might apply to the implementation of many stories, etc.

It is not practical to treat every type of requirement as part of the DoD. For example, if one has to prove performance criteria for each story, then the team could not mark off stories as complete until the performance tests are run, and each and every story would have to have its performance measured – something that is generally not necessary: typically a representative set of user actions is simulated to create a performance load. Thus, non-functional requirements or system-wide requirements are best not included in the DoD. This is shown in Figure 1 of Part 1, where a story is checked “done” after the story has passed its story level acceptance tests, has not broken any integration tests, and the user has tried the story in a demo environment and agrees that the acceptance criteria have been met. Ideally, this happens during an iteration – not at the end – otherwise, nothing gets marked as “done” until the end of an iteration. Thus, marking a story as “done” is tentative because that decision can be rejected by the Product Owner during the iteration review, even if a user has tried the story and thought that it was done. Remember that the Product Owner represents many stakeholders – not just the users.

Another technique we use with larger sets of teams (think portfolio) – and especially when there are downstream types of testing (e.g., hardware integration testing) – is definition of ready (DoR). The state of “ready” is a precursor to the state of being “done”. This helps to ensure that the DoD – which might include complex forms of testing – can be met by the team. The team first ensures that a story meets the DoR. The DoR consists of other criteria, such as that the story has acceptance criteria (whereas the DoD would say that the acceptance criteria have been met), that certain analyses have been completed, etc. – just enough so that development and testing have a much higher likelihood of being completed within an iteration. This works with teams and programs of all sizes. We do find that for larger programs, the DoR is almost always very useful. Defining a DoR with the Product Owner is also a great way of engaging the Product Owner on non-functional issues to increase their understanding of those issues and ensure the team is being fed high quality stories.

End Of Part 3

Next time in Part 4 we will connect the dots on the organizational issues of all this!

Authors (alphabetically):
Scott Barnes
Cliff Berg

Monday, December 8, 2014

Real Agile Testing, In Large Organizations – Part 2

(Continued from Part 1)

Last time we saw that there is no single answer to the level of test planning needed for Agile projects – it depends!

We also remembered that the whole point of testing is to achieve an acceptable level of assurance that the system meets the actual business need – in every way that matters to the organization.

This time we will look at a kind of template for the pieces of an Agile test strategy. You can then add and subtract from this template for your own project – and perhaps even dispense with it altogether for a very simple project – but in that case it at least provides food for thought.

What about technical stories?

Many teams use “technical stories” to specify non-functional requirements. This is ok, except that these are not really stories: you never finish them – they are actually cross-functional acceptance criteria. But casting non-functional requirements as acceptance criteria does not work perfectly either: that means that no story is done until all of the non-functional criteria are done, and that is not a practical way to run an iteration.

Thus, while the above approaches can work, it is often better to treat non-functional requirements as just that: requirements. Don’t try to fit that round peg into the story square hole. Instead, create a “theme” for each type of non-functional requirement, e.g., “performance”, “security”, etc., with theme level acceptance criteria – i.e., requirements! Then write stories for the work that needs to be done; but do not skip creating a strategy for how to test the requirements for each of these themes. A strategy (high level plan) is needed too, in order to think through the non-functional testing and other activities in an end-to-end manner. This is a strategy that the team should develop. The strategy is the design for the testing aspects of the release train. Without it, you will find it difficult to discuss testing issues and dependencies that arise during testing, because there will be no conceptual context, and you will also find it difficult to communicate the testing strategies to stakeholders.

There is a side benefit. If you treat the testing pipeline as a system, then you are in a good position to identify ways to monitor the application. For example, exploratory performance testing will reveal bottlenecks, and the application can then be enhanced to monitor those bottlenecks during operation of the system. Monitoring platforms such as Sensu can be used to consolidate the monitors across the many components of the application. Thus, in your testing strategy, you can define exploratory testing activities that provide feedback to the application’s monitoring theme, resulting in stories pertaining to operational monitoring. Identifying this ahead of time – at a large grain level – is important for making sure that this type of work is not a surprise to the Product Owner and that it receives sufficient priority. The key is to treat the testing pipeline as an aspect of the development pipeline, and design it with feedback loops, minimum latency, and sufficient coverage of each requirement category.

The What, Why, Where/When, How/Who, and Coverage

Let’s look at the What, Why, Where, When, How/Who, and Coverage.

The “What” is the category of testing, such as “functional acceptance testing”, “performance testing”, or “exploratory testing”. If you like, these can be grouped together according to the “testing quadrants”.

The “Why” is the aspect of requirements that this type of testing is intended to address, such as “story acceptance criteria”, “system-wide performance requirements”, or “anomaly identification”.

The “Where” is the environment(s) in which the testing will occur. In a CI/CD process, most types of testing will occur in multiple environments (as shown in Figure 1), but not necessarily all – e.g., in the example shown in Figure 1, performance testing is only being done in the “SCALE” environment. Your test strategy should reference a table or information radiator depicting all of the identified test environment types.

The “When” is the event(s) that will trigger the testing, and the frequency if the triggering event is calendar or time based. (Examples are shown in Table 1.)

The “How” is the strategy to be used for those types of tests, such as “Use JBehave/Java, Selenium”, or “Use JMeter in cloud instances”.

The “How” should include “Who” – i.e., who will do what: that is, who will write the tests, who will perform them if there are manual tests, etc.

The Where/When and How/Who columns are especially important if you interface with specialized “enterprise” testing functions of any kind, e.g., Security, Performance Testing, Independent Acceptance Testing, etc.: you want to integrate these groups in an Agile “pipeline” manner so that no one is ever waiting on anyone, and that requires that everyone have a very clear idea of what everyone will be doing and when.

Table 1: Sample lines of a test strategy table.
| What | Why | Where, When | How (Strategy), Who | Coverage (1. How measured; 2. How assessed; 3. Sufficiency) |
|---|---|---|---|---|
| Functional acceptance testing | Story acceptance criteria | LOCAL (workstation or personal cloud): continually. Cloud “CI”: when code is pushed. Cloud TEST: daily. | Use JBehave/Java, Selenium. The acceptance test programmer must not be the story programmer. | 1. Measured with Cobertura. 2. Assessed using Gherkin executable test specs, to ensure that no acceptance criteria are missed. 3. 100% coverage required. |
| Performance testing | System-wide performance requirements | “PERF” (in cloud): nightly. | Use JMeter in cloud instances. Perf test team and architect. | QA will verify coverage of executable test specs. |
| Exploratory | To detect unanticipated anomalies | DEMO: any time. | Manual. Anyone who volunteers, but not the story’s programmer. | Amount of time/effort should be indicated by the story. |

The final column, “Coverage”, is really about thoroughness. It has three parts: (1) how test coverage will be measured, (2) how coverage will be assessed, and (3) what level of coverage is considered to be sufficient. This gets into an important issue for testing: How do you know when you are done testing? How do you know how much testing is enough?

How much testing is enough?

In a traditional waterfall development setup, there is often a separate Quality Assurance (QA) function that independently assesses whether the test plan is adequate. This is usually implemented as a gate, such that QA performs its assessment after the application has been tested. That whole approach is a non-starter for Agile projects – and even more so for continuous delivery (CD) – where working, tested software is produced frequently and can be deployed with little delay. But let’s not throw out the whole concept of QA: that would be throwing the baby out with the bath water. QA can play a vital role for Agile teams: independence.

My mother Morticia knows the people who are building her website: they are our cousins, and she trusts them implicitly. But the EFT Management Portal is another matter. In that case, an external technical auditor has been engaged to provide independent assessment. But what about in-between situations? What about run-of-the-mill mission critical applications developed by most organizations? Should you just “trust the team”?

To answer that question, we need to clear up a common point of confusion. To “trust the team” is not to have blind trust: if there is a lot at stake, then blind trust would be illogical and naïve. The Agile adage that one should “trust the team” does not mean to have blind trust: it means to give the team substantial (but not absolute) leeway to do the work in the way that it finds most effective. That does not relieve the team from explaining their processes, or from the responsibility to convince stakeholders that the processes (especially testing) will result in the required level of assurance. After all, some of those stakeholders are paying the bill – it’s their system.

Self-directing teams are never without leadership and vision. Leaders need to ensure that teams have a clear understanding of the end goal (product) and why the business needs the product (vision). When vision and goals are clear, acceptance criteria and intent become much clearer. By producing what stakeholders have described and by being provided a clear set of goals and a vision for a product, teams typically are able to build significant trust with their stakeholders and the business. This trust continues and the team feels empowered.

When clear goals and vision (leadership) are missing, there tend to be longer testing cycles because the testing starts to focus on ensuring the requirements are correct instead of ensuring the requirements are met.

Another consideration is that teams are under great pressure to create features for the Product Owner. If the Product Owner will not have to maintain the application, then the Product Owner will not be very concerned with how maintainable the system is – that is “IT’s problem”. (When Product Owners fulfill the role because they are the person responsible for an application, they are much better in this role. When Product Owners are not responsible for the product being produced but are only responsible for delivery of a project, they are no longer Product Owners and are back to being Project Managers.) Further, the Product Owner will not have the expertise to ask about things such as “concurrency testing”, for checking that the application works correctly when multiple users try to update the same data. In fact, some software teams do not know too much about that either – so should you simply “trust the team”? Teams cannot always be staffed with all of the skill sets that are needed – resources might be constrained. These reasons are why organizations need independent trustworthy assessment of testing – as a second pair of eyes on whether the testing has been sufficient. It is just common sense.

Have QA work with the team on a continuing basis 
– not as a “gate”.

If we don’t implement QA as a gate, then how should we do it? The Agile way to do it is to have QA work with the team on a continuing basis, examining the team’s test strategies, spot checking actual testing code, and discussing the testing strategies with the various stakeholders to get their thoughts. QA should have a frequent ongoing presence as development proceeds, so that when the application is ready for release, no assessment is needed – it has already been done – and in fact it has been used to help the team to refine its approach to testing. QA effectively becomes an independent test on the testing process itself – a feedback loop on the testing feedback loop.

How does QA decide how much testing is enough? I.e., how does QA decide what level of coverage is sufficient? That is a hard question. For functional testing there is a fairly straightforward answer: the organization should start tracking the rate at which bugs are found in production, and correlating that metric with the test coverage that was measured when the software was built. Over time, the organization will build a history of metrics that can provide guidance about how much coverage is effective in preventing production bugs – with sufficient assurance for that type of application.
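As a rough illustration of the bookkeeping involved, the sketch below correlates build-time coverage with production defect density across releases. The record fields, the use of defects per thousand lines of code, and all of the numbers are our own assumptions for the example, not data from a real project.

```python
from dataclasses import dataclass
from math import sqrt


@dataclass
class ReleaseRecord:
    """One release: coverage measured at build time, defects found later in production."""
    name: str
    coverage_pct: float       # coverage reported when the release was built
    production_defects: int   # defects reported after the release went live
    kloc: float               # size of the release in thousands of lines of code

    @property
    def defect_density(self) -> float:
        """Production defects per thousand lines of code."""
        return self.production_defects / self.kloc


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Made-up history, purely for illustration.
history = [
    ReleaseRecord("1.0", 55.0, 24, 40.0),
    ReleaseRecord("1.1", 62.0, 20, 42.0),
    ReleaseRecord("1.2", 70.0, 13, 45.0),
    ReleaseRecord("1.3", 78.0, 9, 47.0),
    ReleaseRecord("1.4", 85.0, 6, 50.0),
]

r = pearson([rec.coverage_pct for rec in history],
            [rec.defect_density for rec in history])
print(f"coverage vs. production defect density: r = {r:.2f}")
```

A strongly negative value suggests that, for this type of application, higher coverage has gone along with fewer production bugs. A handful of releases is not much evidence, so treat the number as an input to judgment rather than a rule.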

In the previous article we mentioned that story development should include analysts, developers and testers: that one should consider testing strategy as an output of each story’s development, since each story might have unique testing requirements. We have found it very effective when testing or QA teams contribute to that discussion, so that the test plan evolves during the story writing rather than after software has been produced. The testers help write acceptance criteria during the story writing sessions. One of the great advantages of this is that the developer knows exactly how the story will be tested, which helps guide the implementation.

Accumulate operational robustness metrics
over time and use those to inform judgment
about the level of testing that is needed.

This is a little harder to do for other kinds of requirements, e.g., security requirements, performance requirements, maintainability requirements, and so on, but the concept is the same: accumulate operational robustness metrics over time and use those to inform judgment about the level of testing that is needed. We suggest that leveraging architectural themes will help teams keep an eye on key issues such as these.

Back to the table’s “Coverage” column. Consider the example for functional tests shown as row one in the table: we specify Cobertura for #1 (measuring coverage). But Cobertura checks code paths traversed: it does not check that you have actually coded everything that needs to be tested. Thus, #2 should be something like, “Express story level acceptance criteria directly in JBehave Gherkin”. That ensures that nothing gets left out. In other words, we will be using “executable test specs”, or “behavioral specifications”. Finally, as to what coverage is sufficient, we might specify that since we want a really, really robust application, we need 100% code coverage.
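For teams that keep the strategy table as data and generate the wiki page from it, one row could be modelled roughly as follows. This is only a sketch: the class and field names are ours, the column set is the “What, Why, Where, When, How/Who, Coverage” layout described in Part 1, and every cell value other than the three coverage entries is a hypothetical placeholder.

```python
from dataclasses import dataclass


@dataclass
class CoverageSpec:
    measured_by: str        # 1: how coverage is measured
    completeness_via: str   # 2: how we make sure nothing is left out
    sufficient_level: str   # 3: what level of coverage counts as enough


@dataclass
class StrategyRow:
    what: str
    why: str
    where: str
    when: str
    how_who: str
    coverage: CoverageSpec

    def as_wiki_row(self) -> str:
        """Render the row as one line of a pipe-delimited wiki table."""
        c = self.coverage
        cov = f"1. {c.measured_by}; 2. {c.completeness_via}; 3. {c.sufficient_level}"
        cells = [self.what, self.why, self.where, self.when, self.how_who, cov]
        return "| " + " | ".join(cells) + " |"


functional = StrategyRow(
    what="Functional tests",
    why="Stories behave as their acceptance criteria say",  # hypothetical
    where="CI build environment",                           # hypothetical
    when="Every build",                                     # hypothetical
    how_who="Development team",                             # hypothetical
    coverage=CoverageSpec(
        measured_by="Cobertura",
        completeness_via="Story-level acceptance criteria expressed directly in JBehave Gherkin",
        sufficient_level="100% code coverage",
    ),
)
print(functional.as_wiki_row())
```

Splitting the Coverage cell into three named parts keeps the three questions (how measured, how complete, how much) from being blurred into a single percentage.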

The real intent behind coverage is that the more important parts of the application are well covered. We typically do not see 100% coverage over entire application code bases, but that is a nice stretch goal. The most important part of coverage though is that coverage does not stagnate at anything less than 70% and steadily grows over time.
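A build could check that rule of thumb with something like the sketch below. The 70% floor comes from the paragraph above; the three-build window and the function name are assumptions we made for the example.

```python
def coverage_concerns(history, floor=70.0, window=3):
    """Return concerns about a coverage trend.

    history: coverage percentages per build, oldest first.
    floor:   level below which stagnation is considered a problem.
    window:  number of recent builds used to judge stagnation (an assumption).
    """
    if not history:
        raise ValueError("need at least one coverage measurement")
    concerns = []
    latest = history[-1]
    recent = history[-window:]
    # Stagnation: still below the floor and no higher than `window` builds ago.
    if latest < floor and len(recent) == window and recent[-1] <= recent[0]:
        concerns.append(f"coverage has stagnated below {floor:.0f}%")
    # Regression: lower than the immediately preceding build.
    if len(history) >= 2 and latest < history[-2]:
        concerns.append("coverage dropped since the previous build")
    return concerns


print(coverage_concerns([60.0, 64.0, 69.0, 73.0]))  # growing and above the floor
print(coverage_concerns([62.0, 62.0, 61.0]))        # flat, then falling, below the floor
```

Whether a concern should fail the build or merely start a conversation is a team decision; the sketch only reports.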

To test the response time, we can write “executable” specs.

Coverage is more difficult to specify for non-functional types of testing. For example, how would you do it for performance tests? The requirement is most likely expressed in SLA form, such as, “The system shall be up 99.99% of the time”, and “The response time will not be more than 0.1 second 99% of the time”.

To test the response time, we can write “executable” specs of the form, “Given that the system has been operating for one hour under a normal load profile (to be defined), when we continue that load for one hour, then the response time is less than 0.1 second for 99% of requests.” Of course, not all performance testing tools provide a language like Gherkin, but one can still express the criteria in executable “given, when, then” form and then write matching scripts in the syntax required by the load testing tool.
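The “then” part of such a spec comes down to a simple check over the recorded latencies. Here is a minimal sketch, assuming the load testing tool can export per-request response times in seconds; the function names and the sample data are ours.

```python
def fraction_within(latencies_s, threshold_s):
    """Fraction of requests whose latency is strictly below the threshold."""
    if not latencies_s:
        raise ValueError("no latency samples recorded")
    return sum(1 for t in latencies_s if t < threshold_s) / len(latencies_s)


def check_response_time(latencies_s, threshold_s=0.1, required_fraction=0.99):
    """Then-step: response time is below the threshold for the required share of requests."""
    return fraction_within(latencies_s, threshold_s) >= required_fraction


# Given / When: in a real run these samples would come from the load testing
# tool after the warm-up hour. Here they are hard-coded stand-ins.
samples = [0.04] * 995 + [0.25] * 5      # 99.5% of requests are fast
too_slow = [0.04] * 980 + [0.25] * 20    # only 98% of requests are fast

print(check_response_time(samples))
print(check_response_time(too_slow))
```

Keeping the threshold and the required fraction as parameters lets the same check serve several SLA clauses.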

Testing the up-time requirement is much harder: the only way to do it is to run the system for a long time, and to design in hot recovery mechanisms and lots of redundancy and elasticity. Defining coverage for these kinds of requirements is subjective and is basically a matter of checking that each requirement has a reasonable test.
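To see why, it helps to turn the percentage into a downtime budget. The sketch below does the arithmetic; the function name is ours, and a 365-day year is assumed.

```python
def allowed_downtime_minutes(availability_pct, period_days=365):
    """Downtime budget, in minutes, implied by an availability target over a period."""
    total_minutes = period_days * 24 * 60
    return total_minutes * (1 - availability_pct / 100)


per_year = allowed_downtime_minutes(99.99)
per_month = allowed_downtime_minutes(99.99, period_days=30)
print(f"99.99% allows about {per_year:.1f} minutes of downtime per year")
print(f"99.99% allows about {per_month:.1f} minutes of downtime per 30 days")
```

At 99.99%, the budget is under an hour per year, so a test run of a few days cannot demonstrate that the requirement is met. It can only show that recovery mechanisms work when failures are induced.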

The coverage requirement for Exploratory testing is interesting: in the example of Table 1, we list it as “Amount of time/effort should be indicated by the story”. In other words, for exploratory testing, decide this when the story is written: decide at that time how thorough the exploratory testing should be for that story. This gets back to writing stories that focus on outcomes, as we discussed in Part 1.

The test strategy wiki page is for recording decisions on how testing is actually being done. It is a living, evolving thing.

Most likely some narrative will be needed to explain the table entries, which need to be concise to fit in a cell. If the test strategy is maintained on a wiki (strongly encouraged), it is good Agile practice to use it as the scratchpad for thinking about testing and for recording decisions on how testing is actually being done. It is not a document that one creates and then forgets about: it is a living, evolving thing.

(Note: We consider SharePoint to be a wiki if (a) everyone on a team can edit the pages, (b) everyone on the team can create new pages, and (c) full change history is maintained; but if you use SharePoint, don’t upload documents: create the content right in SharePoint, as pages.)

The test strategy should inform the decisions on
what environments are needed.

The test strategy should inform the decisions on what environments are needed: this is an integrated decision that should be driven by the testing strategy. Since it takes time to provision environments, or to write scripts that provision cloud environments, this means that the testing strategy is something that is needed very early – enough to allow for the lead time of getting the environments set up. That is why testing strategy should be addressed during release planning, aka “sprint 0”. Ideally, a team continues to use a very similar testing process from one release to the next, or one project to the next, so that the environment architecture stays pretty much the same, and that way you always know what types of environments you will need.

What QA really does is inform us about the current state
of the system under test.

We believe that the term “QA” is a misnomer. We prefer the term “Quality Informers”. Because the vast majority of people who make up QA teams are not allowed to touch source code, are not allowed to hire and fire, and are not allowed to impact or alter budgets, they clearly have no enforceable means of quality assurance. What they do really well, though, is inform us about the current state of the system under test. This is an important point when you consider the previous paragraphs where we talked about informing and feedback.

Not everything can be automated

Automation is central to Agile, and it is essential for continuous delivery. But not everything can be automated. For example, these things cannot usually be automated:
1.    Exploratory testing.
2.    Focus group testing and usability testing.
3.    Security penetration testing. (Basic automation is possible.)
4.    Up-time testing.
5.    Testing on every single target mobile platform.

To deal with these in the context of continuous integration (CI) and continuous delivery, the CI/CD process needs to focus on the tests that need to be run repeatedly, versus tests that can be run less often, with sufficiently high confidence that things will not change too much in between. By automating tests that are repeatable, we free up more time for the type of testing that cannot be automated. For example, if usability testing is done once a month, that might be sufficient unless the entire UX paradigm changes. Security penetration (“pen”) testing should be done on a regular basis, but real (manual) penetration testing is an expensive process and so there is a cost/benefit tradeoff to consider – in many cases (depending on the level of risk), automated pen testing is sufficient, with perhaps expert manual pen testing done on a less frequent basis. Up-time can really only be tested in production, unless you are testing a new airplane’s system software before the first shipment and have the luxury of being able to run the software non-stop for a very long time.

Today’s mobile devices present a large problem for platform compatibility testing. There are so many different versions of Android and of Apple’s iOS out there, and many versions of Android have significant differences. Fortunately, there are online services that will run mobile app tests on many different devices in a “cloud”. Even Apple’s OS X operating system can now be run in a cloud.

End Of Part 2

At this point, test-driven development (TDD) proponents are jumping up and down, feeling that their entire world has been sidestepped by this article, so in part 3 of this article we will start off with that. Also, while the Who column of the test strategy table provides a means for coordinating testing-related activities performed by multiple parties, we have not talked about who should do what. For example, some testing activities – such as performance testing and security testing and analysis – might require special skills; and if there are multiple teams, perhaps each working on a separate sub-component or sub-system (possibly some external teams), then how should integration testing be approached? We will examine these issues in a later part of this series.


Authors (alphabetically):
Scott Barnes
Cliff Berg

Tuesday, December 2, 2014

Real Agile Testing, In Large Organizations – Part 1

This article launches a new section on the challenges that organizations face when adopting Agile approaches to testing! We will start out with a four-part article series, and then add more articles over time.

Agile turns traditional (waterfall oriented) testing on its head. Organizations that transition to Agile methods typically find themselves very confused – What do we do with traditional testing functions? How do we integrate non-functional types of testing? Do you just leave it all up to the team? Does Agile somehow magically take care of everything? How does Agile handle all of our compliance rules and enterprise risks?

How does Agile handle all of our compliance rules
and enterprise risks?

There is no single answer to any of these questions.

The reason is that the approach to testing depends on (among others):
•    The nature of the requirements.
•    The types and degree of risks perceived by the organization.
•    Who the stakeholders are.
•    The types of environments that can be obtained for testing.
•    The skills of the development teams.

For example, consider Morticia’s website:
We are on a small team building a website for my grandmother, Morticia, to sell used odds and ends. Morticia expects traffic of about five purchases per day – mostly from our many cousins around the world. My grandmother fulfills all of her orders herself from items in her attic and basement, has Lurch and Thing package them, and employs my uncle Pugsley and his kids (who are Uber drivers) to take the packages to the post office every day. It is very simple, and so we write the user stories for the website, with the acceptance criteria expressed in Gherkin, code the tests in Ruby, and we are done! We are not worried too much about security, because all of the payment on the site will be handled by PayPal, which is a pretty self-contained service, and not a lot is at stake anyway since most of the items are priced under $10.

Now let’s consider another (fictitious) example, EFT Management Portal:
We are on a team in a six team program to build a website that enables banks to manage their electronic funds transfer (EFT) interconnections. There are lots of bank partners that will connect to this system via their own back end systems. We also have to comply with Sarbanes Oxley regulations, as well as all applicable State laws for each bank’s EFT endpoints. There is a great risk of fraud – both internal and external – given the nature of the application. The technology stacks include several open source, commercial, and proprietary stacks as well as mainframe systems, and there are some non-IP protocols (e.g., SWIFT) over private leased networks. As a result, we have some experts on the teams who know about these things. Finally, while the work can be done incrementally, and a one year roadmap of incremental releases has been defined, each release has to be rock solid – there can be no screw-ups. There is simply too much at stake. The first release will therefore not be live, serving more as a production quality demonstrator.

The test plan for the EFT management portal is quite a bit more involved than the one for my grandmother’s website. Certainly, “trusting the team” will not fly with management. In fact, management has hired an independent risk management company to verify that testing will be sufficient, and that all risks are being managed. This was insisted on by the boards of two of the primary banking partners. This risk management team wants to meet with the program manager and team leads next week to discuss the risk team’s information needs, which will certainly include knowing what the test plan is.

I think that the testing approach for my grandmother’s website will be quite a bit different from the one for the EFT management portal – don’t you?

What is the whole point of testing?

There are alternatives to testing. For example, the IBM “clean room” method substitutes careful review of source code, and its effectiveness is reported to be comparable to other forms of testing. From a Software Engineering Institute report,
Improvements of 10-20X and more over baseline performance have been achieved. Some Cleanroom-developed systems have experienced no errors whatsoever in field use. For example, IBM developed an embedded, real-time, bus architecture, multiple-processor device controller product that experienced no errors in two years use at over 300 customer locations.

The goal is therefore not to test, but to achieve a sufficient level of assurance that the code does what the customer needs it to do. I said “need” instead of “want” because an important aspect of Agile methods is to help the customer to discover what they actually need, through early exposure to working software, which constitutes a form of exploratory testing. There has also been a great deal of recent work in attempts to apply formal methods to Agile. This is especially valuable in high assurance systems such as aircraft, medical devices, and systems with very large capital costs such as the microcode that executes in mass produced hardware. Formal methods are also of great interest for mass produced consumer devices - especially security critical devices such as routers, for which people do not often update the firmware. As the “Internet Of Things” (IOT) becomes reality, formal methods are being considered as a way to increase the reliability of the myriad devices that will be operating throughout our environment – things will be quite a mess if all of those devices are buggy, require constant updates, and all present security risks.

The goal is not to test, but to achieve a
sufficient level of assurance.

The goal is therefore assurance. Testing is not a goal by itself. The question is, what are the Agile practices that help us to build software with sufficient assurance that it meets the need, including the various needs related to dependability? If you go to your bank’s website and transfer money from one account to another, do you expect to have a very high confidence that the money will not simply disappear from your accounts? If you make a purchase on the Internet, do you expect to have high confidence that the purchase will actually be fulfilled after your credit card has been charged? Thus, assurance is important even for the common activities that everyone does every day. Reliability affects a company’s reputation. When you code, do you think in terms of the required reliability of the end product?

Planning is important in Agile!

First of all, we need to dispel the notion that creating a plan is not Agile. As Eisenhower said, “Plans are worthless, but planning is everything.” Implicit in this is that creating plans is important – that is what planning does (creates a plan), but plans always change – that’s why plans are “worthless” – because they end up needing to be changed. Of course, saying that they are “worthless” is hyperbole: it would be more accurate to say that they are only a starting point. Plans are simply the recorded outcomes of a planning meeting – the decisions that were made, and any models that were used. The planning meeting is what is most important, and the plan – the recorded outcomes – is important too, because people forget.

An Agile test plan is not externally driven –
it is internally driven, by the team.

One of the main Agile push-backs on test planning is that a traditional (waterfall) test plan is something created ahead of time by someone who is not on the development team. That is a non-starter for Agile projects, because Agile teams need to decide how they do their work, and testing is a central element of software development. Thus, an Agile test plan is not externally driven – it is internally driven, by the team. It is their test plan. It is the output of their discussions on how they plan to test the application.

That said, other stakeholders should be present for those discussions – stakeholders such as Security, Architecture, or whatever external parties exist who have an interest in the quality of the delivered product. Further, those external parties have every right to form their own opinions on the efficacy of the test plan – that is, whether the tests will actually prove that the requirements of the stakeholders are covered.

The question is not whether to plan, but what an Agile test plan and Agile test planning process really looks like. Brett Maytom, manager of the LinkedIn Scrum Practitioners group, has said in a discussion in that group,
“The bigger questions are, when to prepare it, how to represent it, how far in advance is the plan, who creates it, in what format is the test plan.”

Shuchi Singla, an Agile QA practitioner, commented in that same discussion,
“…we do need test plans, but unlike traditional formats they should be quick reference points. Rather than having 50 page bulky document a crisp sticky or may be white board information like RACI, platforms would suffice…we cannot let test plans just go. They have their importance and should stay though in KISS layout.”

There is a problem though: the very phrase “test plan” conjures up memories of waterfall test plans: large documents with all of the details, right down to the test scripts to be used – and with little input from the developers, and not able to evolve during development. We don’t want to do that – not anything close. As Shuchi Singla said, we want our test plan to be lightweight: we want our test plans to be test strategies. So from now on, we will refer to it as a test strategy. The strategy’s written form should also be highly maintainable – such as a set of wiki pages – so that it can be kept up to date as the team’s decisions about testing evolve over time. An Agile test strategy is always being refined and adjusted as the team learns more and more about the application and the things that actually need to be tested.

The very phrase “test plan” conjures up
memories of waterfall test plans, so from now on,
we will refer to it as a test strategy.

In collaborating with a colleague Huett Landry, one of us (Cliff) found it to be very effective to have a test strategy consisting primarily of a table, with columns of essentially “What, Why, Where, When, How/Who, Coverage”. In a recent LinkedIn discussion, Eugene Joseph, an Engineering Program Manager at Lockheed Martin, said,
“We had a Sys Admin team that was constantly being pulled in different directions by agile teams. We stood up a Kanban board a prioritized task that we put on the board. Someone will have to prioritize the task across your agile teams. One key thing we did was made sure that tasks entered had a clear specification of what, where, when, who, and how support was provided and a clear definition of done. This helped reduce communication delay since the client needing support CLEARLY specified what the needed.”

Thus, if you have existing silos, establishing a way of interacting with them – perhaps documented in a table that lists who will do what, when, and how – is especially important for enabling the development team to operate in an Agile manner.

If you have existing silos, establishing a way of
interacting with them is especially important.

A diagram is also very useful for identifying the up-stream and down-stream dependencies of a development and testing release train. As W. Edwards Deming said, “You can see from a flow diagram who depends on you and whom you can depend on. You can now take joy in your work.” ☺ This enables you to see at a glance the types of testing that occur at each stage of the release train, the environments that each will need (and any contention that might result), and the latency that might occur for each testing activity. After all, continuous delivery is a pipeline, and to treat it as such, it helps to view it as such. Figure 1 shows such a diagram. Note that the various environments are identified, along with the activities or events pertinent to each environment. The horizontal dashed lines are not a data flow – they represent a logical sequence. Many loops are implied in that sequence.

Figure 1: A delivery pipeline (aka “release train”).

Stories that focus on outcomes – not just functionality

One of us (Scott), who has done a lot of Agile work in the embedded software space, has had great success using an evolutionary approach for exploratory testing. Awareness of the desired outcome – from a functional as well as a reliability perspective – provides the tester with a clarity of focus, so that he or she can push the application or device harder and find more issues. Scott has found this approach to be very effective as a general paradigm: creating a focus on desirable outcomes, from the perspective of both functional and non-functional requirements.

Develop the testing strategy as an
output of story development.

To implement this approach, story development should include analysts, developers and testers: develop the testing strategy as an output of story development. Some of the strategies end up as rows in the test strategy table, but others might be specific to a story. Being able to read specifically how something is going to be tested at the story level (because tests are part of the story detail) helps implementers of the story focus on the outcome that the user needs. The story should mention the types of non-functional testing that will occur. That way, the developer has the non-functional requirements at the top of their mind when they implement the story.

End Of Part 1

In the next article we will talk about the various kinds of testing and how to make sure that they occur – in an Agile manner!

Authors (alphabetically):
Scott Barnes
Cliff Berg