Wednesday, November 26, 2008

Return of the 4GL for eClinical - Part 2

In the first part of the series, I described how in the Technology business, Fourth Generation Languages provided a platform for the effective development of database driven application software.   With this instalment, I would like to examine how the principles of a 4GL might be applied in the design of an eClinical Application Development Tool. 

This part of the series will focus on the requirements - in particular, those for handling rules - and briefly on how these requirements might be met with a syntax. It should be noted that many of the principles apply directly to EDC systems; however, the target application area is not EDC specifically, but rather a full eClinical platform.

Here are the requirements as bullet items:

  • Triggers - capable of defining rules that are applied on the occurrence of an event
  • Powerful - capable of describing all required rules that might arise in capturing and cleaning eClinical data
  • Human readable syntax - it must act as a specification to a user as well as machine readable instructions when executed
  • Referencing - must provide a simple, unambiguous mechanism for referencing data
  • Repeatable - must be re-executable
  • Multi-action - queries, yes, but what about other actions?
  • Testable - through a black box paradigm, with built-in self testing
  • Speed - must execute very quickly: under 10ms per rule in a typical environment
  • Business aware - must provide features that automate common business elements typical of eClinical


Firing mechanism

Not Winnie the Pooh's buddy, but a logical method of controlling the execution of rules in an EDC system is to base the triggering mechanism on the change of an input value. For example, to check that a subject is between the ages of 18 and 65, simply set the rule's input to the field containing the 'Age'. When the age changes, the rule executes.

In practice, this works for the majority of requirements, but not for all. Sometimes, triggers are required based on the change of other study elements or attributes. For example, it might be required that when a query is raised for a subject, a rule is triggered. In this case, the triggering is at the subject level rather than the response value level.

To achieve this, the syntax should support the attachment of rules to study objects. For example, a possible syntax for triggering might be:

on change of [subject|visit|repeat|field].[Object Id].[attribute]:

Each object would have a default attribute, applied if none is specified. A sample reference might be:

on change of field.Age:


on change of subject.Status:

The default attribute of a field object might be the value. 

Also, the 'on change' trigger would be assumed based on the referenced values. If a rule referenced 'Age', then that would automatically become a trigger point, in addition to any 'on change' attributes explicitly defined.
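As a sketch of how implicit triggering might work, a rules engine could derive trigger points from the values a rule references. All names here (registerRule, fireChange, and so on) are illustrative, not from any real product:

```javascript
// Minimal sketch of a trigger registry: rules declare the values they
// reference, and the engine derives 'on change' trigger points from them.
const rules = [];

function registerRule(name, inputs, check) {
  // inputs: references such as 'field.Age' or 'subject.Status'
  rules.push({ name, inputs, check });
}

function fireChange(reference, data) {
  // Execute every rule that references the changed item.
  const results = [];
  for (const rule of rules) {
    if (rule.inputs.includes(reference)) {
      results.push({ rule: rule.name, passed: rule.check(data) });
    }
  }
  return results;
}

// The age range check from the example: referencing 'field.Age'
// automatically makes it the trigger point.
registerRule('AgeRange', ['field.Age'], d => d.Age >= 18 && d.Age <= 65);

const outcome = fireChange('field.Age', { Age: 70 });
// outcome: [{ rule: 'AgeRange', passed: false }]
```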

Should a rule trigger based on all values being in existence, or just some of the values? With EDC applications in particular, a positive response is required - a page that is left blank cannot be assumed to represent no data; it could simply mean the data hasn't yet been entered. Special consideration is required here. Readers familiar with eCRF implementations will recognize the common header question such as 'Have any Adverse Events occurred?' This is often used to determine whether subsequent values are missing because they were left blank on purpose. From a triggering perspective, rather than looking at a value that may not have been entered - such as AE Description - the code would need to first check the AnyAE flag.


This is a more advanced concept, but important from the perspective of keeping the resulting tool simple. Looking at the example above, more thought is required when developing rules where conditional questions are involved. So, with a dynamic eCRF, how can we simplify things?

A feature I will refer to as 'pre-requisites', known to the rules engine, would potentially solve this problem. It is necessary to capture, at the metadata level, the conditions under which a particular value is required. In our example, if AnyAE = 'No', then we would not expect an AEDescription. However, a rule that looked at AEDescription in isolation may not work; in reality, the rule would need to check AEDescription AND AnyAE. In fact, this would be the case for any check that cross-referenced the AE eCRF.

Now, let's imagine we could place a pre-requisite rule against the AEDescription field: AnyAE='Yes'. That rule could serve two purposes: 1) it could control the availability of the field on the page, and 2) it could act as an additional criterion for any rule that referenced the field. With the pre-requisite principle, the rule would simply check AEDescription. The AND condition would be added automatically (behind the scenes) based on the attached pre-requisite rule.

The end result is that the study developer wouldn't need to worry about whether or not a value should exist based on other criteria - this would be catered for by the pre-requisites. The resulting syntax for rules would be simpler.
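The pre-requisite idea can be sketched as follows: a prerequisite attached to a field is automatically AND-ed into any rule that references that field. The function and field names here are illustrative:

```javascript
// Sketch of pre-requisites: a prerequisite attached to a field is
// automatically AND-ed into any rule that references that field.
const prerequisites = {
  AEDescription: d => d.AnyAE === 'Yes',  // only expected when AnyAE = 'Yes'
};

function runRule(fields, check, data) {
  // Before evaluating the rule, confirm every referenced field's
  // prerequisite holds; if not, the rule resolves to 'not applicable'.
  for (const f of fields) {
    const prereq = prerequisites[f];
    if (prereq && !prereq(data)) return 'not applicable';
  }
  return check(data) ? 'pass' : 'fail';
}

// The study developer writes only the simple check...
const descriptionPresent = d =>
  d.AEDescription !== undefined && d.AEDescription !== '';

// ...and the engine supplies the AnyAE condition behind the scenes.
runRule(['AEDescription'], descriptionPresent, { AnyAE: 'No' });
// → 'not applicable'
runRule(['AEDescription'], descriptionPresent,
        { AnyAE: 'Yes', AEDescription: 'Headache' });
// → 'pass'
```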


Powerful

This requirement is at odds with some of the others. How do you create a syntax that can cater for each and every rule situation, while at the same time being human readable and re-executable? The answer is that you will never create the perfect syntax that does everything. What you need is a syntax that delivers on at least 99% of requirements. The enhancements to the syntax, over and above regular expression handling, need to be business aware. Constructs for things that are common to eClinical systems must be available as standard. By providing support for these, the need to go outside the 4GL bounds should be reduced to a negligible degree.

Human Readable Syntax

eClinicalOpinion raised the question of syntax in a comment on Part 1 of the series. I see syntax applying to all language components. What I mean by that is that all application components should be representable as a syntax that can be manipulated as free format text through a text editor OR from within an Application Development Environment. How the syntax is maintained, though, will depend on the activity being performed. For example, it might prove easier to create a data collection form (i.e. CRF) through a point and click UI. The end result, though, would be a human and machine readable syntax.

So - if the metadata is all described in the form of a syntax, does that mean that the preparation of the syntax cannot be table driven? Not at all. A syntax would consist of three things: basic language constructs, references to data/metadata, and references to application objects. The preparation of the syntax would be through a table driven approach. Lists of metadata (fields, forms etc) would be stored in tables, as would lists of application objects, such as what can be done to a subject or a visit event.

Should the syntax be XML? I don't think so. XML might be one of the languages for representing the metadata - à la CDISC ODM - but the most effective syntax should be optimized for purpose. XML is not easily human readable.

A number of other factors govern syntax. Writing a compiler or an interpreter is not that easy to do well. Also, the execution of the resulting interpreted or compiled code needs to be fast. If the code that is produced needs to be processed many times before it reaches executable machine code, then execution takes longer - 4GLs are typically processed through several such iterations. There are answers. For ECMAScript or JavaScript, SpiderMonkey is an open source JavaScript engine. This can be embedded into an application and extended with high level application area constructs. Other embeddable scripting tools are available.

By combining an open source script engine with an object model that is eClinical aware, it is possible to create a syntax that has all the power and flexibility required, while at the same time keeping the complexity of an underlying (potentially super-normalized) database hidden.
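As a rough sketch of the embedding idea (using the standard `Function` constructor in place of an embedded SpiderMonkey; the `field` object and `raiseQuery` construct are hypothetical), rule text written in ordinary ECMAScript can be evaluated against an eClinical-aware context that hides the underlying database:

```javascript
// Sketch of an embedded script engine: rule text is plain ECMAScript,
// evaluated against an eClinical-aware context. Names are illustrative.
const queries = [];

const context = {
  // In a real system these would read from the study database.
  field: { Age: 70, Weight: 82 },
  raiseQuery: text => queries.push(text),  // high-level business construct
};

function runScriptRule(source) {
  // Compile the rule text into a function whose inputs are the context.
  const fn = new Function('field', 'raiseQuery', source);
  fn(context.field, context.raiseQuery);
}

runScriptRule(`
  if (field.Age < 18 || field.Age > 65) {
    raiseQuery('Age out of range: ' + field.Age);
  }
`);
// queries now contains: ['Age out of range: 70']
```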

Value Referencing

This is a key consideration in the definition of a rules syntax. The inputs and outputs must be easy to reference (readable), unambiguous, and re-usable. Control of the inputs and outputs is important in order to assure stability of the rules once deployed. A form of 'black box' approach allows simplified and potentially even automated testing.


Version and Change Management

One of the critical factors in providing an application development solution for EDC is the need to support version and change management - this is a key factor in defining the scripting language scope. Clinical trials are unusual in that the structure and rules associated with the data may change during the deployment. More important still, data that has already been entered may need to be re-applied to the revised structure and rules. This can prove considerably challenging. If a study builder is given total control to add fields, CRF pages and visits to an existing study, it would be virtually impossible to programmatically manage the mapping of data from an old version of a study definition to the new, and still guarantee data integrity. What this means is that any environment must support metadata release management. In addition, to protect regulatory compliance, once data is entered, it must be impossible to delete it, even if a protocol change is applied.

Metadata Release Management

Two methods exist for the redeployment of a revised set of metadata against existing data. You can either process only the changes, or you can re-apply the existing data to the new definition. Both methods are workable, but the latter option can be slow, and requires an extensive and complete object based audit trail containing the former data and actions that can be re-executed against the new metadata. The former option - processing only the changes - requires that the released metadata is managed. Once data is entered into a system against a set of metadata, the metadata must go into a 'managed' state, where subsequent changes to the metadata can only be re-deployed when they are compatible with the data that has previously been entered.

So - how does this all impact the syntax?

Well, if the syntax is entirely open as far as the actions it performs and the inputs that it receives, then it is very difficult to handle changes. On the other hand, if the syntax is limited to operating in a 'black box' fashion - for example, comparing datapoints and returning a boolean, followed by the raising of a query - then the management of a change, or, specifically, the re-execution of the rule against existing data, is predictable.

Let's imagine a large study. It has 1000 rules associated with data. The study is up and running, and thousands of patients have been created. During the 4th visit, it is discovered that one of the rules needs to be changed. The many thousands of other data points, and the rules executed against them, are fine, but the data points associated with this one rule need to be re-checked, and the change may affect the previous outcome. With a combination of managed metadata - where the system is aware of the rule that has changed - and the object based audit trail, it is possible to limit the impact of the change to only the affected area of the study and its associated data. This is achieved by re-executing only the actions relative to the changes.
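A minimal sketch of that change-limited re-execution, assuming an audit trail of past rule executions and a set of changed rule ids (all names here are hypothetical):

```javascript
// Sketch: given the set of rule ids that changed in a new metadata
// release, only the audit-trail entries tied to those rules are re-run.
function rulesToReExecute(auditTrail, changedRuleIds) {
  // auditTrail: [{ ruleId, subject, dataPoint }, ...]
  return auditTrail.filter(entry => changedRuleIds.has(entry.ruleId));
}

const auditTrail = [
  { ruleId: 'R001', subject: 'S-001', dataPoint: 'Age' },
  { ruleId: 'R042', subject: 'S-001', dataPoint: 'Weight' },
  { ruleId: 'R042', subject: 'S-002', dataPoint: 'Weight' },
  { ruleId: 'R999', subject: 'S-003', dataPoint: 'AEDescription' },
];

// Only rule R042 changed between versions; of 4 recorded executions,
// just 2 need to be re-run.
const toRerun = rulesToReExecute(auditTrail, new Set(['R042']));
```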

Some Metadata will arrive from an unmanaged source - for example, as an import from an external tool - in this instance, all unmanaged metadata will be assumed to be 'unclean'  and therefore changed.

Rule Actions

So, if a rule executes - by whatever mechanism - and the result of the execution demands an action, what should the action or actions be?   

Some EDC solutions are limited to raising discrepancies or queries. Even for the systems that support other actions, queries are by far the most common. However, EDC systems are often differentiated by their ability to offer more advanced forms of actions.

Conditionally adding CRF Pages is one particular action that makes sense.  Changing the status of elements - such as a Subject status might also be useful.

However, one very specific consideration must be supported. Any actions carried out here may potentially need to be rolled back, or re-applied, as the result of a protocol update. Each action that is offered must be fully compatible with the need for re-execution with no adverse results.
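One way to make actions re-execution safe is to pair each action type with an explicit rollback, as in this sketch (the action names and state shape are illustrative, not from any real product):

```javascript
// Sketch of re-executable actions: each action type pairs an 'apply'
// with a 'rollback' so a protocol update can undo and re-apply safely.
const state = { queries: [], extraPages: [] };

const actions = {
  raiseQuery: {
    apply: q => state.queries.push(q),
    rollback: q => { state.queries = state.queries.filter(x => x !== q); },
  },
  addPage: {
    apply: p => state.extraPages.push(p),
    rollback: p => { state.extraPages = state.extraPages.filter(x => x !== p); },
  },
};

function execute(log, type, payload) {
  actions[type].apply(payload);
  log.push({ type, payload });  // recorded so it can be undone later
}

function undoAll(log) {
  while (log.length) {
    const { type, payload } = log.pop();
    actions[type].rollback(payload);
  }
}

const log = [];
execute(log, 'raiseQuery', 'Age out of range');
execute(log, 'addPage', 'Pregnancy Form');
undoAll(log);  // state returns to empty when a protocol change rolls back
```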


Testable

Many eClinical systems fall short when it comes to supporting the full requirements for implementations - in particular, support for testing. In a strictly controlled, regulated environment, it is important to be able to prove that a configured eClinical system has been fully tested. The underlying product must be fully validated, of course, but, over and above that, regulatory bodies are becoming increasingly aware of the need for configuration testing.

Good test support in an eClinical 4GL must be built in from the start. Adding it later is often impossible to achieve well.

To ensure a language is testable, the metadata objects - rules in many cases - need to be managed. The system must keep track of the elements that have been tested, and those that have not.


Speed

The underlying platform probably has a greater bearing on the performance of the 4GL than the syntax itself. Also, with web based systems, network latency is an issue. A potential language needs to be capable of rapid execution. On a typical eCRF, it should take less than 50ms to turn around the submission of a page, excluding network latency. Achieving this requires optimization at each step in the execution process. Extracting data from a super-normalized database - 1 value from 1 record - is an issue, and a means to address this is critical. Avoiding slow interpretation of high level 4GL code is also key. If the code that is manipulated by the user can be pre-compiled into a more CPU friendly form, then that will help.
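The pre-compilation point can be sketched with a simple cache: rule source is compiled once into a function and reused on every subsequent execution. The rule id and source here are made up for illustration:

```javascript
// Sketch of pre-compilation: rule text is compiled once into a
// JavaScript function and cached, so repeated executions avoid
// re-interpreting the high-level text.
const compiled = new Map();

function getRule(id, source) {
  if (!compiled.has(id)) {
    // Compile only on first use (or after a metadata release changes it).
    compiled.set(id, new Function('d', `return (${source});`));
  }
  return compiled.get(id);
}

const ageOk = getRule('R001', 'd.Age >= 18 && d.Age <= 65');
ageOk({ Age: 30 });  // true
ageOk({ Age: 70 });  // false - subsequent calls reuse the compiled function
```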

Business Aware

Object Orientation

I must admit to being a person who has struggled to get my head around true - or pure - object orientation. When people start talking about polymorphism, modularity and so on, it all becomes rather cryptic to me. For study developers, this could all be too much.

However, I do think that Object Oriented Programming, or OOP, does lend itself well to specific business problem modelling. With eClinical, you have certain rules that can be built into objects. For example, let's imagine we have created an object called a 'Visit'. A visit can have an attribute 'Name' with values such as 'Screening', 'Visit 1', etc. It can also have another attribute 'Number' with values such as '1.0', '2.0', etc. Visits belong to subjects. Visits can have CRF pages associated with them. By defining these 'business objects' and 'object attributes' within the application tool, we can take away some of the complexity of handling relationships and actions from the study programmer. Instead of having to create a SQL SELECT with an inner join between a Visit table and a CRF Form table, the relationship is pre-formed within the application business layer.
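The Visit example above might look like this as a sketch, with the Visit-to-page relationship held by the object model rather than a SQL join (class and method names are illustrative):

```javascript
// Sketch of eClinical business objects: the Visit-to-CRF-page
// relationship is pre-formed in the business layer, so the study
// developer never writes a join.
class CrfPage {
  constructor(name) { this.name = name; }
}

class Visit {
  constructor(name, number) {
    this.name = name;      // e.g. 'Screening'
    this.number = number;  // e.g. 1.0
    this.pages = [];       // relationship held by the object model
  }
  addPage(page) { this.pages.push(page); return this; }
  pageNames() { return this.pages.map(p => p.name); }
}

const screening = new Visit('Screening', 1.0)
  .addPage(new CrfPage('Demographics'))
  .addPage(new CrfPage('Medical History'));

screening.pageNames();  // ['Demographics', 'Medical History']
```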

So - object orientation - yes, but, only where the resulting user (study developer) experience when preparing studies could be described as 'simple'.


This is all very high level still, but, the above does contain some concepts that may have value in the definition of a potential eClinical 4GL.  In the next posting of the series, we will most likely look at how technologies such as xForms might be supportive in providing an interactive user interface over a web front end to an eClinical 4GL.

Comments welcome!

Thursday, November 13, 2008


I have read with great interest an article posted in ClinPage - The Future of ODM, SDTM and CDISC.   These discussions relate primarily to the proposed requirement from the FDA for data submissions to be made in XML format rather than SAS Transport file format.   I don't think we will see many arguments around this point - XML is now the accepted extensible method of describing the combined data and metadata.  What is more contentious is that it is requested that data be provided in the HL7 v3 Message format.  FDA Docket No. FDA-2008-N-0428 from August 2008 elaborates on where the FDA are in the process.

In addition to the move to an HL7 Message format rather than SAS XPT, commentary exists on a suggestion that a move to ODM rather than SDTM would be considered.   This point is also put forward by Jozef Aerts of xml4pharma.

I would like to comment on a comparison of SDTM versus ODM.

Operational Data Model

ODM was the first CDISC standard to successfully go through the authoring process. It was aimed as a means to represent data in the context of data capture. Data was indexed to Visits and Forms. The syntax was designed to describe data not as an effective storage format, but as a source-to-destination format. You could get data from System A, by Visit and Form, to System B, by Visit and Form. This is great where the presentation of the data has importance and meaning.

Submission Data Tabulation Model

SDTM, unlike ODM, focuses on groupings of data - not by CRF form, but by the use of the data. All demographics information appears on the same record, for example. The SDTM structure has now also become the basis for data delivery and storage within many organizations. A number of large PharmaBio companies have based internal cross-company standards on SDTM.

Modelling from Data Captured

The format of data will differ depending on the medium used to capture it. Some form factors might have 30 questions on a form; others, such as patient diaries, might only have 1 or 2 questions per form. In addition, when designing a CRF for ease of use, it may not make sense to apply the content of each SDTM domain as the basis for deciding what does and does not go onto a single form. Whether the data appeared on one form or across many forms is not important when it comes to the value of the data. Many EDC vendors have gone down the route of designing the database for data capture according to EAV rules - Entity-Attribute-Value form - where each value captured on any form is dropped into a single table. Once captured, data is then re-modelled into a relational structure that may or may not model the layout of the page. (xForms is a generic technology touted as a potential means of addressing this challenge - I will leave further discussion on this to a later article.)
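The EAV capture model and the subsequent re-modelling can be sketched as a pivot from a single value table into one record per subject (the attribute names here are made up for illustration):

```javascript
// Sketch of EAV (Entity-Attribute-Value) capture: every value lands in
// one table, then is pivoted into a relational row per subject.
const eavRows = [
  { subject: 'S-001', attribute: 'Sex', value: 'F'  },
  { subject: 'S-001', attribute: 'Age', value: '34' },
  { subject: 'S-002', attribute: 'Sex', value: 'M'  },
  { subject: 'S-002', attribute: 'Age', value: '51' },
];

function pivot(rows) {
  // Re-model the single-table capture into one record per subject.
  const records = {};
  for (const { subject, attribute, value } of rows) {
    records[subject] = records[subject] || { subject };
    records[subject][attribute] = value;
  }
  return Object.values(records);
}

pivot(eavRows);
// → [{ subject: 'S-001', Sex: 'F', Age: '34' },
//    { subject: 'S-002', Sex: 'M', Age: '51' }]
```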

Based on the above, it would seem logical that SDTM is of greater value when used as the method of delivery of data for submission or analysis than ODM.

However, that is not the only reason why SDTM makes sense over ODM when developing and executing eClinical studies.  The primary reason related to metadata re-use.

ODM is not a suitable format for modelling studies because it does not lend itself to ensuring that similar studies are able to effectively re-use metadata. Sure, I can take a study, copy the metadata, and I have another study... easy... but what about changes? What if I remove a few fields, add a few fields, or change the visit structure? If ODM were the format, that would of course change the data output format - an issue, as noted above - but, more importantly, it would greatly impact any rules that might exist on the forms. Rules that use some form of wildcarding mechanism may or may not work. Anyway, this is not a posting on metadata architecture, so I will leave it at that.

Bringing together SDTM and HL7 v3

So back to SDTM and HL7.  Is this the right way to go?   I can understand the logic behind this.  Being able to bring EHR and Clinical Trial data together within a common standard could be very useful.  However, at what cost?  

I am not aware of any eClinical application that automatically creates SDTM compliant data sets - regardless of transport layer. The mapping of proprietary metadata to SDTM is quite involved, with varying degrees of software development required from the various system vendors. Typically, either SAS macro transformations are used, or some form of ETL (Extract, Transform and Load) tool. This is all complicated enough. Creating a tool that creates SDTM datasets in HL7 v3 is considerably more complicated. Even for large companies it will be a major development undertaking. The complexity is such that smaller companies will simply fail to deliver the data in a cost effective way.

Tools providers may step in - they may offer a means to convert a basic SDTM ASCII file, with additional information, into an SDTM HL7 v3 file. XML4Pharma, based on its recent critique of the approach, does not appear keen to jump into supporting this, but if it becomes a mandate, some companies will.

Playing on the other side of the argument - one of the principles of XML is that the data is also human readable. In reality, once you add all of the 'overhead', especially with a complicated syntax such as HL7 v3, you end up with something that is only readable by technical gurus. But then, maybe it shouldn't be people that interpret these files; maybe the complexity has got to the point where it only makes sense for a computer application to interpret the files and then present the appropriate information to the user. Modern eClinical systems offer views on data. Maybe the presentation of the submission data should be managed in the same way - through an application that presents a view based on purpose.

Thursday, November 6, 2008

Cleaning the right data

We discussed recently the lack of significance given to endpoint data in today's EDC systems. I would like to put forward a model for raising the significance of endpoint information.

During a recent presentation by Paul Clarkson, Director of Clinical Data Management at Genentech, it was described, under the banner of Smart Clinical Trials, how a better focus is being placed on the definition of data that drives Primary, Secondary and Safety objectives in Genentech studies. Paul explained that the process he followed during the pilot of this approach was to simply create a spreadsheet of the events versus the procedures, and then drop the metadata that was due to be captured into categories of Primary, Secondary, Safety or Indeterminate purpose data, through color coding. Following this, the assignments were reviewed with appropriate personnel to agree the value, or otherwise, of capturing and cleaning the data.

Taking the above as a potentially valuable model - not only for identifying data that does not need to be captured, but also for identifying the relative significance of the data captured against the target end-points - I started thinking about how this might be effectively supported in an eClinical system.

The last end-point discussion posting highlighted a gap in the ability of eClinical systems to correctly prioritize the value behind different types of data. For example, the cleaning of a verbatim comment entered onto a CRF form, with no relevance to any of the study end-points, has as much procedural significance as the coding of an Adverse Event term. It is all just data that must be cleaned with equal significance.

For adaptive clinical trials, and for achieving end-point objectives, data is not all of equal significance. So, how do we support the definition and use of data of differing comparative value? Let's look at how Genentech did it. They took the metadata - the questions - and then categorized them against one (or more) endpoint objectives. From a study design perspective, without considerable effort, we could potentially place a category on the metadata during eCRF form preparation. Of course, the categorization in itself has limited value; the eClinical system would need to do something with it.

Today, EDC systems often indicate through workflow and task lists who has to do what. Currently, this is a blanket rule that does not consider the significance of types of data. With the Smart model above, the view of the workflow and tasks could be adjusted to present activities that meet specific end-point objectives. So, instead of presenting to a monitor or data manager all outstanding activities, why not provide a list that is ordered, or even filtered, by end-point categorization? This would allow the cleaning activity to focus on information that first and foremost achieves the primary, secondary and safety end points in as short a period of time as possible. That is not to say that other cleaning activity will not occur - it will - just that the priorities will be presented appropriately, based on the significance of the data to achieving the objectives of the study.
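The ordering idea can be sketched in a few lines. The category names follow the Primary/Secondary/Safety/Indeterminate scheme described above; the task items and function names are made up for illustration:

```javascript
// Sketch of endpoint-aware work prioritization: outstanding cleaning
// tasks are ordered by the endpoint category of the data they touch.
const priority = { Primary: 0, Safety: 1, Secondary: 2, Indeterminate: 3 };

function prioritizeTasks(tasks) {
  return [...tasks].sort((a, b) => priority[a.category] - priority[b.category]);
}

const tasks = [
  { item: 'Verbatim comment query', category: 'Indeterminate' },
  { item: 'AE term coding',         category: 'Safety' },
  { item: 'Tumour measurement',     category: 'Primary' },
];

prioritizeTasks(tasks).map(t => t.item);
// → ['Tumour measurement', 'AE term coding', 'Verbatim comment query']
```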

For Adaptive Clinical trials, a focus on end-point significance could be a differentiator in quickly achieving the statistically significant sample sizes required to drive dynamic randomization or decision making.

Tuesday, November 4, 2008

Running Rules in EDC - further commentary

Before returning with the 2nd part of the series - 4GL's for eClinical, I just wanted to discuss further the execution of rules in EDC. Here is further commentary on the topic.

All EDC solutions offer a means to define rules that check the data that has been entered. Typically, these rules compare one or more values against one or more other values and, based on either a true or a false result, log a query.

The traditional CDM systems used to run what are called consistency checks on a batch basis (aka batch checks). This was efficient in that the database would run across all data executing the rules in a sweep - typically once per rule. For a database, this was quite efficient. However, it was designed to run some time after the data was recorded. For centralized data entry, where the staff entering the data are not the staff responding to the queries, this was fine. Double data entry was often used to catch the data entry errors.

EDC systems work on a different model.  The personnel entering data into the system are typically at the site. It makes sense that once the data is entered into the EDC tool, that the rules run immediately giving the operator the opportunity to make corrections immediately. 

The question is, when should these rules run?

  1. As soon as the data has been entered and the user leaves the field?
  2. As soon as the data is submitted to the database?
  3. Later, as part of batch checking?

Many opinions are held on this topic.  Let me tackle the first, and easiest one.

Batch Checking

Option 3 - run later as part of batch checking. I don't believe anyone feels that running all rules on a batch basis for data entered at site makes sense. I have heard it argued that 'some' rules should run on a batch basis. The arguments for this have been a) performance reasons, and/or b) that all values are not available at the time the data is entered. I would respond by saying, to a), that a system should not have performance problems that would prevent 'any' check running during data entry - EDC systems should run even complex checks in under 100ms. As to b), this is a design issue. Most rules engines fire when all values are available, or do not resolve to do anything if values are missing. So, in my opinion at least, batch checking is largely superfluous.

Online Checking

Now - what about between the two online checking options?  At the field level, or on page submit?

Online Checking - Queries

If a user is recording data field by field, it can be distracting to see messages popping up repeatedly. This is partly a question of UI. If the focus of the cursor is not adversely affected, this may be fine; otherwise, it can be rather annoying. Some of you will be familiar with applications that 'steal' focus - you think you are correctly keying information into an application only to find that the focus has been grabbed by a popup! Very frustrating. So, provided the focus is not impacted, producing queries at least should be fine. But what about other activities? How about dynamic CRF pages?

Online Checking - Dynamics

It may make sense to insert a new field, or set of fields, based on a former response. On paper, the content is fixed - it will say something like 'If answered 'Y' then proceed to xxxx'. With an electronic medium, we have an opportunity to adjust the questions asked based on former responses. I believe dynamic forms only cause real problems for 'heads down data entry' staff [1]. With online EDC, heads down data entry is less common. What is more typical is that the user reads the question and completes a response. If the next question changes, the impact is limited. A common example: in a demographics form, the subject is recorded as 'Female', and a dynamic adds a question such as 'Is the subject of child-bearing potential?'
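The decision behind such a dynamic can be expressed as a small pure function. The field names mirror the demographics example above; the function itself is illustrative:

```javascript
// Sketch of a dynamic eCRF rule: given a changed response, decide which
// follow-up questions to insert into the form.
function dynamicQuestions(field, value) {
  const extras = [];
  if (field === 'Sex' && value === 'Female') {
    extras.push('Is subject of child-bearing potential?');
  }
  return extras;
}

dynamicQuestions('Sex', 'Female');  // ['Is subject of child-bearing potential?']
dynamicQuestions('Sex', 'Male');    // []
```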

From a technical perspective, with web applications it is somewhat easier to handle a full page submit.  On an HTML based form, the actual data entry operates in the same fashion as an old paged style terminal (for those of us old enough to remember them!). The communication between the client (web browser) and the server only occurs when the user hits the save or submit button. 

Web 2.0 / Ajax

Web 2.0 technologies mean lots of things. One feature typical of these new applications, though, is that they are more interactive than traditional basic HTML paged apps. Ajax is a method now commonly used to create an active response to data entry - an early example was Google's search-as-you-type. The technology provides the opportunity to capture data as it is entered and carry out an action immediately as a result; for Google, that was to perform a search and present the results based on the term entered so far. For EDC, this may result in some form of page dynamic, such as the adding of a question or block of questions. From a browser independence perspective, Ajax doesn't tend to cause problems, as code is available for virtually all browsers and the majority of work is completed on the server side.

So - with web2.0 Ajax technologies, what else can we do with online rules execution?  

Well, we can take all the values entered into a CRF Page, compare the values with other values entered on other pages, and execute any action that is suitable for execution prior to data submit.  From eclinical_revolutn's comment, some vendors such as PhaseForward are already doing this.

We could go as far as submitting the data as the page is completed - as the user leaves field 1 and goes to field 2, Ajax is used to submit the value for field 1. The argument against this approach is that users must make a positive statement to submit data. I don't concur with this; in my mind, the positive statement is that the user has tabbed or cursored out of the field. The argument for the save-as-you-go approach is that if a connection is lost, at least the data entered up to that point in time is saved. It is a training thing: if a user is trained that data is saved when you leave the field, then by leaving the field, they are confirming the save. A further argument against the save-as-you-go approach is that users are used to simply closing a browser, with data entered but not submitted being cancelled. Again - training, and the removal of a Save or Submit button. There are some challenges, though: if the user completes information to the last field, and then closes the browser, should the last field value be saved?...

So - are EDC Vendors currently looking at new ways to interact with users using Web 2.0 technologies - I think so. Will we see user interfaces that match the interactivity that is offered by a thick client style rich UI - yes, but not until around 2010.

[1] Heads down data entry - an odd term used to describe typically rapid keyboard data entry where the user does not look at the screen while keying - for example, a data entry clerk might be reading a paper CRF and entering the data into a CDM system.