[cffi-devel] how to treat expected failures in tests
Jeffrey Cunningham
jeffrey at jkcunningham.com
Wed Jan 11 16:30:48 UTC 2012
On Wed, 11 Jan 2012 07:00:53 -0800, Robert Goldman <rpgoldman at sift.info>
wrote:
> On 1/11/12 Jan 11 -1:16 AM, Daniel Herring wrote:
>> On Wed, 11 Jan 2012, Daniel Herring wrote:
>>> On Tue, 10 Jan 2012, Jeff Cunningham wrote:
>>>> How about OK, FAIL, UNEXPECTEDOK, and EXPECTEDFAIL?
>>>
>>> FWIW, here's one established set of terms:
>>> PASS, FAIL, UNRESOLVED, UNTESTED, UNSUPPORTED
>>> (XPASS and XFAIL are not in POSIX; change test polarity if desired)
>>> http://www.gnu.org/software/dejagnu/manual/x47.html#posix
>>
>
> I guess I'd be inclined to say "too bad for POSIX" and add XPASS and
> XFAIL....
>
> The reason that I'd be willing to flout (or "extend and extinguish" ;->)
> the standard is that there is no obvious advantage to POSIX compliance
> in this case that would compensate for the loss in information.
>
> cheers,
> r
I agree.
I really have no idea what common practice is in standard unit-testing
protocols; it isn't my background (which is mathematics). The only reason
I suggested the additions is that they carry useful information, some of
which is lost if you don't have all four cases. In my consulting practice
I have used all four, and have seen others use them in one form or another
in most test settings.
There are many good descriptions of binary hypothesis testing; here is
one (the two models in this setting would be something like H_1 = 'test
passes' and H_0 = 'test fails'):
"In binary hypothesis testing, assuming at least one of the two models
does indeed correspond to reality, there are four possible scenarios:
Case 1: H_0 is true, and we declare H_0 to be true
Case 2: H_0 is true, but we declare H_1 to be true
Case 3: H_1 is true, and we declare H_1 to be true
Case 4: H_1 is true, but we declare H_0 to be true
In cases 2 and 4, errors occur. The names given to these errors depend on
the area of application. In statistics, they are called type I and type II
errors respectively, while in signal processing they are known as a false
alarm or a miss."
(from http://cnx.org/content/m11531/latest/)
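To make the correspondence with test outcomes concrete, here is a minimal
sketch in Common Lisp. The function and status names are hypothetical --
this is not CFFI's actual test harness -- and I am reading the
"declaration" as the tester's recorded expectation:

  (defun classify-result (expected actual)
    "Map a test's EXPECTED and ACTUAL results (each :PASS or :FAIL)
  onto the four cases above.  Hypothetical names only."
    (cond ((and (eq actual :fail) (eq expected :fail)) :xfail)   ; Case 1
          ((and (eq actual :fail) (eq expected :pass)) :fail)    ; Case 2 (type I / false alarm)
          ((and (eq actual :pass) (eq expected :pass)) :pass)    ; Case 3
          ((and (eq actual :pass) (eq expected :fail)) :xpass))) ; Case 4 (type II / miss)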
One might argue that Bayesian testing procedures are not appropriate for
software verification tests, but I think that would be short-sighted. It
is virtually impossible to design tests that cover every possible
data/usage scenario for any but the simplest pieces of code. So what in
fact happens is that the test designer picks the tests he thinks are most
important. That's where statistics, in the broader sense, come in.
Testing several hundred out of the hundreds of thousands or millions of
possible permutations of test parameters (six parameters with ten
settings each already give a million combinations) always implies that
statistical assumptions are being made. Being limited to two of the four
test results makes it impossible to evaluate those results with any
degree of rigor.
I am indifferent as to the terminology applied to cases 2 and 4, so long
as they are available. If they are not, it throws unnecessary uncertainty
over the entire corpus of test results. And having them available doesn't
force those who don't see their necessity to use them. They can choose to
simply ignore them and limit their information to the two conditional
cases:
{Case 1 | not Case 2}
{Case 3 | not Case 4}
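As a sketch of that collapse, continuing the hypothetical status names
from the earlier snippet, dropping the expectation information is exactly
what discards the distinction:

  (defun collapse-outcome (outcome)
    "Reduce the four outcomes to plain pass/fail, as one would
  when XPASS and XFAIL are unavailable or ignored."
    (case outcome
      ((:pass :xpass) :pass)    ; the test passed, expected or not
      ((:fail :xfail) :fail)))  ; the test failed, expected or not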
Regards,
Jeff