2006-04-11 00:00:00

Ken Kizer, CEO of Medsphere and former top VA official, pitches open source medical software to Congress

Kenneth Kizer spoke to the Ways & Means Committee in the U.S. House of Representatives last week. He pitched the virtues of electronic medical records but focused on open source, suggesting that Congress make open source software the default choice for software purchased with federal funds. Here’s what he said specifically:

Open source software is less well developed in health care than for some other enterprises, but open source software solutions for health care are now rapidly evolving.

In this vein, I urge the Committee to consider making open source software the first consideration in selecting any new software purchased with federal funds. This should be the case across the federal government, for health care and non-health care federal procurement alike. This requirement should apply to software purchases made by all federal agencies and purchases made by state and local governments and private parties using federal funds (including research funds).

Even in the absence of federal funding per se, I believe that the federal government’s policy should be to support and utilize open source software as the preferred option whenever possible because of its many advantages over proprietary software.

I recommend Congress do as some states and other countries have done and legislate that open source software must be first considered when federal funds are used to purchase new software. If there is no appropriate open source solution available, then one could turn to proprietary options.

I am confident that the federal government would save billions of dollars in licensing fees alone over the next 10 years by preferentially pursuing open source solutions. The government would likely also realize substantial savings through collaborative public-private projects and increased software functionality while harnessing a robust stream of innovation in the future.

He recommended the following strategy to the committee (taken from the Medsphere press release sent out last week):

  • The use of open source software for any new healthcare and non-healthcare software purchased with federal funds in order to take advantage of the lower costs as well as the easier integration of open source systems with third-party and legacy applications.
  • A public-private partnership under which 5% of the funds annually appropriated to the VA for VistA research and development would be redirected for five years to support VistA enhancements that will enable both government and private health care providers to leverage the government’s existing investment in the VistA system.
  • A requirement that healthcare providers adopt an electronic health record by 2015 or another date established by lawmakers in order to continue participating in the Medicare program. To encourage more rapid adoption, Kizer will propose a two-phase process under which early adopters would receive higher Medicare payment rates for each year until the final deadline.
Filed under: — @ 2006-04-11 00:00:00
2006-04-11 00:00:00

Health IT systems and medical devices should learn about CAP

The Common Alerting Protocol (CAP) and Emergency Data Exchange Language (EDXL) are two standards promoted by organizations that need to transfer information to each other during emergencies (natural disasters, terrorist incidents, etc.). There are lots of vendors supporting CAP, but I haven’t seen much use within health IT or medical devices, so I thought I’d blog about what CAP and EDXL are so that our CIO and CTO readers can help their own teams understand why it might be important to learn about these standards.

Here is CAP’s purpose:

  • Simple and standardized format for exchanging alerts and warnings over various types of networks
  • Compatible with legacy and emerging “transport” methods, such as NOAA National Weather Radio specification and Web Services
  • Flexible geographic targeting
  • Phased and delayed effective times and expirations
  • Message update and cancellation features
  • Facility for including inline digital images and audio

As you can see, the most important purpose of CAP is to allow the exchange of alerts and warnings. Given that hospitals and ambulances are first responders, we would assume there is some strategy in hospital IT departments for reading and writing CAP messages. At first glance, it seems like there is not much reason for health IT or medical software to know about or share CAP messages. However, if the software we write gets smarter about diseases and clinical decision making, we could use something like CAP to help inform the CDC and FDA about hazards, infections, and so on. Medical device manufacturers could use CAP to report adverse events to their own internal IT systems, and CAP gateways could help route important messages outside the hospital.
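
To make that more concrete, here is a minimal sketch, entirely my own illustration rather than anything from the CAP specification or a particular product, of a SQL Server staging table a hospital IT department might use to queue outbound alerts for a CAP gateway. The table and column names are hypothetical; the columns simply mirror core CAP elements such as identifier, sender, sent, status, msgType, scope, event, urgency, severity, and certainty.

-- Illustrative only: a staging table for outbound CAP alerts.
-- An integration engine or CAP gateway would poll this table and emit the XML.
CREATE TABLE [dbo].[CapAlertQueue](
    [CapAlert_Key]  [int] IDENTITY(1,1) NOT NULL PRIMARY KEY,
    [Identifier]    [varchar](128)  NOT NULL,   -- unique CAP message id
    [Sender]        [varchar](128)  NOT NULL,   -- e.g., the hospital system's id
    [Sent]          [datetime]      NOT NULL,
    [Status]        [varchar](16)   NOT NULL,   -- Actual, Exercise, Test, ...
    [MsgType]       [varchar](16)   NOT NULL,   -- Alert, Update, Cancel, ...
    [Scope]         [varchar](16)   NOT NULL,   -- Public, Restricted, Private
    [Event]         [varchar](256)  NOT NULL,   -- short description of the hazard
    [Urgency]       [varchar](16)   NOT NULL,
    [Severity]      [varchar](16)   NOT NULL,
    [Certainty]     [varchar](16)   NOT NULL,
    [Expires]       [datetime]      NULL,
    [Description]   [varchar](2048) NOT NULL DEFAULT (''),
    [SentToGateway] [bit]           NOT NULL DEFAULT (0)
)
GO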

There are lots of implications of CAP and EDXL. If you have started using these standards or feel like they are not applicable to health IT systems, share your thoughts with your colleagues by dropping a comment here.

Filed under: — @ 2006-04-11 00:00:00
2006-04-09 00:00:00

Healthcare Blogs: The Authoritative Guide

Over the past several months, I’ve had the pleasure of participating in a special project that Fard Johnmar of Envision Solutions has been involved in. He first contacted me in early January about an interview I participated in with John Cass of Backbone Media. In that interview I spoke about the fear pharmaceutical companies have of the blogosphere and why they should be embracing this medium rather than avoiding it. Fard told me that he was working on a report on healthcare blogging and I agreed to be interviewed for it.

After that interview, Fard posted my article, Pharma: Have No Fear of the Blogosphere, on his blog and kept me updated about his progress on the report. Finally, a couple of weeks ago I had the pleasure of reading the report (The Emerging Healthcare Blogosphere: What Is It & Why Does It Matter?) and I was quite impressed. The report is well-referenced and features commentary from a number of bloggers (including yours truly). In addition, it addresses a number of questions people have about healthcare blogs, including:

-Who is writing them?
-Are blogs credible?
-How can healthcare organizations use this technology?
-What are the pros and cons of blogs for healthcare non-profit organizations?

I feel that this report is important for a host of reasons, including:

-It provides those new to the healthcare blogosphere with a “roadmap” that will help them track the healthcare blogosphere and learn about its evolution

-It addresses tough issues relating to healthcare organization blogs, including the regulatory environment, patient privacy and ROI (return on investment)

-It sketches out a framework for understanding why blogs may impact healthcare — especially the healthcare provider-patient relationship

I endorsed the report in the press release Fard sent out last week because I feel it is an important step forward in our understanding of how blogs may influence healthcare in the United States and around the world. With this report, we now have a framework for understanding and debating the influence (or lack thereof) of healthcare blogs.

Consider purchasing this reasonably-priced report (it’s about $37). If you need to convince others of a particular viewpoint, use it to help you form your arguments for why healthcare blogs are important. Help your colleagues, employers, and staff make informed decisions about whether blogs can help them further their communications, marketing, branding or stakeholder relations efforts.

You can learn more about this important report by clicking here.

Filed under: — @ 2006-04-09 00:00:00
2006-04-08 00:00:00

The Connecting for Health Common Framework

The Markle Foundation recently unveiled their Common Framework on the Connecting For Health website. Here’s how they describe the project:

The Common Framework provides an essential set of technical and policy resources for private and secure health information sharing among existing and developing health information networks. All of the Connecting for Health Common Framework resources are available here at no cost.

The Common Framework includes 16 technical and policy components developed by experts in information technology, health privacy law, and policy, and tested since mid-2005 by Connecting for Health prototype teams in Indianapolis, Boston, and Mendocino County, California. The Common Framework provides an initial set of critical technical and policy components that make it possible for various health information networks to share information while protecting privacy and allowing for local autonomy and innovation.

The Common Framework includes technical documents and specifications, testing interfaces, and code, as well as a companion set of privacy and security policies and model contractual language to help organizations interested in information exchange to move quickly towards the necessary legal agreements for private and secure health information sharing.

I’ll take a more detailed look soon and report back what I think. If others have already delved deep into it and can share their thoughts, please drop me a comment here or a note via e-mail.

Filed under: — @ 2006-04-08 00:00:00
2006-04-05 00:00:00

Guest Article: Beware of NULLs in healthcare databases

Many readers have been asking for more “practical” advice on their database models, so I’ve asked a fellow healthcare data architect to lend a hand. Tom Maloney is a Senior Data Architect for Stockamp and Associates with over 25 years of experience designing and working with databases across most industries. Tom has done a lot of freelance contracting through his own company, where he lives and breathes data modeling. In this guest article Tom discusses the pitfalls of allowing NULLs in healthcare databases, and his suggestions and arguments have a lot of merit. As usual, recommendations like the ones Tom is making do not apply in all circumstances, but they are worth reviewing.

On our patient claim system the hospital wanted to know the average payment amount received between two dates. As we all know, the average is the sum of the values divided by the number of rows, so we constructed a simple query to calculate it.

SELECT SUM(Amount)/COUNT(*)
FROM [dbo].[Payment]
WHERE EffectiveDate BETWEEN '20060330' AND '20060410'

After the answer was returned, we remembered that SQL provides a function that calculates and returns the average, so we decided to use it. When we ran this query

SELECT AVG(Amount)
FROM [dbo].[Payment]
WHERE EffectiveDate BETWEEN '20060330' AND '20060410'

we had a completely different answer. Which one was correct, or could they both be wrong? What caused the difference? The culprit turned out to be NULL values in some of the columns.

To prove this we performed a little experiment: we created the following table, called Payment, and populated it with ten rows, allowing some of the columns in a few rows to contain NULL values (note: the syntax is Transact-SQL from Microsoft SQL Server).

Table Creation:

CREATE TABLE [dbo].[Payment](
    [Payment_Key] [int] NOT NULL,
    [Amount] [money] NULL,
    [EffectiveDate] [datetime] NULL,
    [Note] [varchar](1024) NOT NULL DEFAULT (''),
    [PaymentRefNumber] [varchar](16) NOT NULL,
    PRIMARY KEY CLUSTERED ([Payment_Key] ASC) WITH (IGNORE_DUP_KEY = OFF)
)
GO

Populate the Table:

INSERT INTO [dbo].[Payment] VALUES (1, $125.00, NULL, 'Payment 1', '1')          -- NULL in date
INSERT INTO [dbo].[Payment] VALUES (2, $125.00, '20060403', 'Payment 2', '2')
INSERT INTO [dbo].[Payment] VALUES (3, NULL, '20060404', 'Payment 3', '3')       -- NULL in Amount
INSERT INTO [dbo].[Payment] VALUES (4, $125.00, '20060404', 'Payment 4', '4')
INSERT INTO [dbo].[Payment] VALUES (5, $125.00, '20060405', 'Payment 5', '5')
INSERT INTO [dbo].[Payment] VALUES (6, $125.00, '20060405', 'Payment 6', '6')
INSERT INTO [dbo].[Payment] VALUES (7, NULL, NULL, 'Payment 7', '7')             -- NULL Amount and date
INSERT INTO [dbo].[Payment] VALUES (8, $125.00, '20060406', 'Payment 8', '8')
INSERT INTO [dbo].[Payment] VALUES (9, $125.00, NULL, 'Payment 9', '9')          -- NULL date
INSERT INTO [dbo].[Payment] VALUES (10, $125.00, '20060407', 'Payment 10', '10')
GO

If you add all of the values in the Amount column you get $1000.00, and there are ten rows, so the average should be $100.00, right? Not really. We were only looking for rows between Mar 30 and Apr 10, 2006. A NULL can represent any value; it is unknown at this time, and we do not know whether NULL means the value was never entered or whether it is a placeholder for a valid date within the range we are interested in.

As we will find out a little later, SQL sometimes counts NULLs and other times ignores them. With the BETWEEN predicate, SQL ignores NULLs, which actually gives us seven rows to divide into the sum of the amounts. Also, one of the rows with a date in the range contains a NULL Amount, leaving only six Amounts to sum (both the SUM and AVG functions ignore NULLs). That gives a total of $750.00; divided by seven rows, the average is $107.14 (rounded), not $1000.00 divided by 10 rows, which would yield a $100.00 average.

SELECT SUM(Amount)/COUNT(*)
FROM [dbo].[Payment]
WHERE EffectiveDate BETWEEN '20060330' AND '20060410'

The result is $107.1428, which is the answer we were looking for. Let’s try the AVG function and see what is returned:

SELECT AVG(Amount)
FROM [dbo].[Payment]
WHERE EffectiveDate BETWEEN '20060330' AND '20060410'

This time, for the same set of data, we got $125.00 as the average. What’s going on here? The answer lies in how SQL treats NULLs: each form quietly decides for itself which rows and values to count. To make that decision explicit, you can spell it out yourself, as in the sketch below.
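
Here is a minimal sketch, my own addition to Tom’s example rather than part of it, showing two explicit policies against the same Payment table. Neither is the one “right” answer; the point is that once NULLs are present you have to choose.

-- Average over only the rows that actually have an Amount
-- (this matches AVG(Amount), i.e., $125.00):
SELECT SUM(Amount) / COUNT(Amount)     -- COUNT(Amount) skips NULL Amounts
FROM [dbo].[Payment]
WHERE EffectiveDate BETWEEN '20060330' AND '20060410'

-- Average over every row in the date range, treating a NULL Amount as $0.00
-- (this matches SUM(Amount)/COUNT(*), i.e., $107.14):
SELECT AVG(COALESCE(Amount, $0.00))
FROM [dbo].[Payment]
WHERE EffectiveDate BETWEEN '20060330' AND '20060410'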

The following provides a deeper look into how SQL handles NULLs.

Other than the special way SQL Server has to store a null and the extra logic it needs to identify a column containing a null, here are a few favorites (comments from C. J. Date[1]):

  • To test in a WHERE clause whether a field is null, SQL provides the special comparison “field IS NULL.” It is not intuitively obvious why the user has to write “field IS NULL” and not “field = NULL”, especially since “field = NULL” is used in the SET clause of the UPDATE statement (and the SET assignment process) to update a field to the null value. (In fact, the WHERE clause “WHERE field = NULL” is syntactically illegal in standard SQL, although SQL Server’s syntax does allow it, and some vendors provide a SET statement to control this behavior.)
  • SQL92 has solved the problem of comparing NULLs by adding a new predicate of the form <search condition> IS [NOT] TRUE | FALSE | UNKNOWN, which will let you map any combination of three-value logic to two-value logic. For example, ((Age < 18) AND (Gender = 'F')) IS NOT FALSE will return TRUE if (Age IS NULL) or (Gender IS NULL) and the remaining condition is not NULL.
  • NULL values are considered duplicates of each other for the purposes of UNIQUE, DISTINCT, and ORDER BY, but not for the purposes of WHERE and GROUP BY. NULL values are also considered greater than all non-null values for the purposes of ORDER BY, but not for the purposes of WHERE.
  • NULL values are always eliminated from the argument to a built-in function such as SUM or AVG, regardless of whether DISTINCT is specified in the function reference, except for the case of COUNT(*), which counts all rows, including duplicates and including all-null rows. Thus, for example, given:
  • SELECT AVG(Status) FROM S     -- Result: x
    SELECT SUM(Status) FROM S     -- Result: y
    SELECT COUNT(*) FROM S        -- Result: z

    there is no guarantee that x = y/z

  • Likewise, the function reference SUM(F) is not semantically equivalent to the expression f1 + f2 + ... + fn, where f1, f2, ..., fn are the values appearing in field F at the time the function is evaluated. Perhaps even more counter-intuitively, the expression

    SUM(F1 + F2)

    is not equivalent to the expression

    SUM(F1) + SUM(F2)

  • Since by definition NULL represents an unknown value, we define the results in every case to be unknown (i.e., NULL) also, rather than true or false. To deal with NULL values properly, therefore, it is necessary to adopt 3-valued logic in place of the usual 2-valued logic. The 3-valued logic is defined by the truth tables shown below. Note that the unknown or null truth value can reasonably be interpreted as “maybe.”
  • AND | T  ?  F        OR | T  ?  F        NOT |
     T  | T  ?  F         T | T  T  T          T | F
     ?  | ?  ?  F         ? | T  ?  ?          ? | ?
     F  | F  F  F         F | T  ?  F          F | T

  • Consider the question of whether sets are allowed to contain NULL values. Suppose, for example, that the collection C = {1, 2, 3, ?} is to be permitted as a legal set. There are two possibilities:
  • The particular null value appearing in C is of course unknown, but is known to be distinct from 1, 2, and 3.
  • The NULL value in C is completely unknown (i.e., it may in fact stand for one of the values 1, 2, or 3), in which case the cardinality of C is in turn unknown (it may be either 3 or 4).
  • Using the BETWEEN predicate with NULL values: the result of this predicate with NULL values for <value expression>, <low value expression>, or <high value expression> follows directly from the definition. If both <low value expression> and <high value expression> are NULL, the result is unknown for any value of <value expression>. If <low value expression> or <high value expression> is NULL, but not both, the result is determined by the value of <value expression> and its comparison with the remaining non-NULL term. If <value expression> is NULL, the result is unknown for any values of <low value expression> and <high value expression>.
  • Host languages have to handle NULLs in non-standard ways. The programmer should know how NULLs are handled when they are passed to the host language. No standard host language for which embeddings are defined supports NULLs, which is another good reason to avoid using them in database schemas.
  • On the whole I set all data types not to allow NULLs and provide default values when the column is not required. But what about dates and Boolean columns where the data is not known at time of entry? There are times in designing a healthcare database schema when NULL values may be allowed, but these columns need to be handled as exceptions rather than the norm. Usually I find a workaround; for example, if a patient’s gender is not always known at the time of entry, instead of using a Boolean, use a single character to hold a code (F=Female, M=Male, O=Other, U=Unknown, A=Ambiguous, N=Not applicable).

    Another example is for dates. If a date is not known at time of entry, I provide a default. Most dates record either an event in time or a date range (Begin and End dates). For dates that represent a point in time I choose a low (e.g., 01/01/1900) or high (e.g., 11/27/4637) default date, depending on how the column will be used in a query. If developers or anyone else want to see a NULL value returned, I create a view over the table that uses NULLIF[2] to return a NULL (e.g., NULLIF(MyDate, '19000101')). For columns that represent a date range, I default the From Date to 01/01/1900 and the Thru Date to 11/27/4637 ('46371127'). A short code sketch of this view pattern appears at the end of this list.

    When my queries are run against range dates with defaults, I get the behavior I expect. When the range date columns contain NULLs, the results are not what I expect, or I have to write special SQL to handle them. In the example above, by changing the table to disallow NULLs and replacing unknown dates with 19000101 and unknown Amounts with $0.00, both AVG and SUM(Amount)/COUNT(*) return the same value.

    I default all Character and Variable-Character data types to an empty string. When the column is selected, it is displayed as an empty value. By using NOT NULL data types you are assured that the results returned will be what you expect, without having to handle NULLs differently. In addition, some vendor databases’ performance improves because columns that allow NULLs must be evaluated differently for comparisons. If NULLs are used, try to minimize their use.
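
    Here is a minimal sketch, my own illustration rather than Tom’s code, of the date-default and NULLIF view pattern described above, applied to the Payment table after EffectiveDate has been made NOT NULL with a default of '19000101' as the “unknown” sentinel (the view name is arbitrary):

    -- Base table stores the sentinel date instead of NULL; the view hands NULL
    -- back to callers that prefer NULL semantics.
    CREATE VIEW [dbo].[vPayment]
    AS
    SELECT
        [Payment_Key],
        [Amount],
        NULLIF([EffectiveDate], '19000101') AS [EffectiveDate],  -- sentinel becomes NULL
        [Note],
        [PaymentRefNumber]
    FROM [dbo].[Payment]
    GO

    -- Callers that want NULL semantics query the view, e.g., to find unknown dates:
    SELECT * FROM [dbo].[vPayment] WHERE [EffectiveDate] IS NULL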



    [1] C. J. Date, Relational Database: Selected Writings, Addison-Wesley.

    [2] Microsoft SQL Server T-SQL

Filed under: — @ 2006-04-05 00:00:00