An organic approach to developing data standards

This is the first in a series of blog posts about our work on data standards.  The intention is to present our work and thinking to a wider audience, learn from you about other work that may connect to this and explore new contexts and partnerships in which we can test these ideas.

Previous posts have covered work we’ve done implementing systems to help manage and monitor development programmes.  Since we’ve had the fortune to work on a number of related programmes (in the areas of social accountability and social protection), we’ve also been able to use this work to explore what it means to develop standards for data that similar programmes collect about activities, outputs and indicators.

There is of course a lot of work already done and ongoing on data standards.  Some examples are initiatives like IATI, Open Contracting and the Joined Up Data Alliance.  However, when I look for standards related to operational or performance-related data I see less progress.

Various sectors and countries have worked to create shared libraries of indicators.  Herb Caudill at DevResults has also made proposals for an indicator standard.  However, it can be time-consuming to facilitate, agree and implement a new standard.  This makes it impractical to invest the time (and by extension money) to develop data standards unless there is a clear return on investment.

From our perspective, indicators are also just the tip of the iceberg.  Programmes we support collect data on participants, sites, facilities and groups.  They track attendance, satisfaction, feedback and a range of other things.  Developing data standards for such a wide range of different needs seems like an impossible task. 

Mindful of this, we’ve been exploring a more organic approach to developing standards that we feel may work better in this kind of context.  We call it Self-aware Data Objects (or SDOs).

Before I can explain what we mean by an ‘organic approach’ I need to give some more background.  I’ll start with the building blocks that the approach builds upon.

The building blocks are:

  • A specific community which collaborates to share data definitions
  • A domain specific language
  • A shared register of data definitions
  • A shared database
  • A data governance team

Standards in the context of a specific community

Let’s explore each of these building blocks through an example.  Imagine in this case that DFID chooses to adopt this approach to manage data collected by programmes that it funds.  (We could equally well use a Government or a network of NGOs as an example.)

In this example DFID starts by setting up its own registry of data definitions.  Think of this as an online catalogue.  If you have access you can browse a list of data definitions, add a new one or adopt an existing one to use in your work.

Data governance 

We see this kind of registry as being managed by a data governance team.  Their specific role would depend on which organisation(s) is running the registry.  They might be expected to:

  • Manage who can access the registry and to what extent
  • Pro-actively create data definitions for ‘important’ data
  • Curate data definitions added by other registry users
  • Guide new registry users as to which data definitions may be relevant to them
  • Look for opportunities to facilitate the development of new standards

So far this isn’t that different from existing registries of data standards.

However, we propose one important difference: using a domain specific language to create the data definitions.  As we’ll see later, this allows a much more decentralised and organic approach to creating and sharing data definitions, which in turn makes the process of agreeing standards much faster.  But first, what is a domain specific language?

What is a Domain Specific Language?

A domain specific language (or DSL) is a programming language designed to be used by domain experts, not just programmers.  In this case the DSL we have developed and used in our work is designed to help M&E managers, business analysts (or others that understand data collection needs) create the forms that they need to collect data.  On one level it’s a tool that lets you build a form from a series of form elements.

Image courtesy of http://www.slideshare.net/glaforge/groovy-domain-specific-languages-springone2gx-2012
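
To make "building a form from a series of form elements" more concrete, here is a minimal sketch in Python.  It is an illustration only and not the DSL itself: the names Form, Field, text, integer and choice are hypothetical, and the attendance form is an invented example.

```python
from dataclasses import dataclass, field


@dataclass
class Field:
    name: str
    kind: str              # "text", "integer" or "choice"
    required: bool = False
    options: tuple = ()    # only meaningful for "choice" fields


@dataclass
class Form:
    """A form assembled from a series of form elements (hypothetical API)."""
    title: str
    fields: list = field(default_factory=list)

    def _add(self, new_field):
        self.fields.append(new_field)
        return self  # returning self lets the calls be chained

    def text(self, name, required=False):
        return self._add(Field(name, "text", required))

    def integer(self, name, required=False):
        return self._add(Field(name, "integer", required))

    def choice(self, name, options, required=False):
        return self._add(Field(name, "choice", required, tuple(options)))


# Reads close to how a domain expert might describe the form.
attendance = (
    Form("Workshop attendance register")
    .text("participant_name", required=True)
    .integer("age")
    .choice("sex", ["female", "male", "other"], required=True)
)
```

The chained style is one common way to make a definition readable to non-programmers; a real DSL could just as well use a dedicated syntax or a visual editor.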

Alongside the form it also creates: 

(i) A schema to validate data entered on the form

This means that data published against this data definition can be validated against the schema.  As the creator of the data definition you can use the schema to enforce business rules that maintain data quality.
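
As a rough illustration of what validating against a schema and enforcing business rules might involve, here is a small Python sketch.  The schema layout, the field names and the rules (for example the age range) are assumptions made up for this example, not the format we actually use.

```python
# Hypothetical schema: one entry per field, with an optional business rule.
ATTENDANCE_SCHEMA = {
    "participant_name": {"type": str, "required": True},
    "age": {"type": int, "required": False,
            "check": lambda value: 0 <= value <= 120},
    "sex": {"type": str, "required": True,
            "check": lambda value: value in {"female", "male", "other"}},
}


def validate(record, schema):
    """Return a list of error messages; an empty list means the record is valid."""
    errors = []
    for name, rule in schema.items():
        if name not in record:
            if rule["required"]:
                errors.append(f"missing required field: {name}")
            continue
        value = record[name]
        if not isinstance(value, rule["type"]):
            errors.append(f"{name}: expected {rule['type'].__name__}")
            continue
        check = rule.get("check")
        if check is not None and not check(value):
            errors.append(f"{name}: value {value!r} breaks a business rule")
    return errors
```

Calling validate with a complete, sensible record returns an empty list, while a record with a missing name or an implausible age returns one message per problem.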

(ii) Application independent data 

The DSL is also designed to create a standard view and edit model.  This is intended to de-couple data from the application that produces it, making it easy to interact with the data without the source application.  This has wide-ranging implications that I’ll return to in a future post.

(iii) A common data envelope

Each data definition shares common fields.  We anticipate that these will evolve over time as we understand different user requirements.  For now they cover information like who created the data and when, who last updated it and when, geographical coordinates and linkages to other data or data definitions.
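
One way to picture the envelope is as a wrapper that every record carries, whatever its content.  The Python sketch below is an assumption about how that could be modelled; the class and field names (Envelope, Record, definition_id, links and so on) are hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass
class Envelope:
    """Common fields shared by every record (hypothetical names)."""
    definition_id: str                 # which data definition the record follows
    created_by: str
    created_at: datetime
    updated_by: str
    updated_at: datetime
    latitude: Optional[float] = None   # geographical coordinates, if known
    longitude: Optional[float] = None
    links: list = field(default_factory=list)  # ids of related data or definitions


@dataclass
class Record:
    envelope: Envelope
    payload: dict                      # the form-specific data


def new_record(definition_id, user, payload, now=None):
    """Create a record whose creator is also its last updater."""
    now = now or datetime.now(timezone.utc)
    return Record(Envelope(definition_id, user, now, user, now), payload)
```

Because the envelope is the same for every definition, generic tools can sort, filter or link records without knowing anything about the payload.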

(iv) Programmatically transformable data definition files

Since the definition files have been created using a programming language, it is possible to create your own scripts to transform the definition file.  In our work we transform the definition files into a JSON file.  However, it could easily be transformed into other file types.

Also, since each data definition is created from the same elements, the task of merging or linking data from different definitions is much simpler.
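
The sketch below shows the general idea of transforming a definition with a script.  It uses Python's standard json module; the in-memory layout of the definition and the helper names to_json and to_csv_header are assumptions for illustration only.

```python
import json

# Hypothetical in-memory form of a data definition.
definition = {
    "title": "Workshop attendance register",
    "fields": [
        {"name": "participant_name", "kind": "text", "required": True},
        {"name": "age", "kind": "integer", "required": False},
    ],
}


def to_json(defn):
    """Serialise the definition as a JSON document."""
    return json.dumps(defn, indent=2, sort_keys=True)


def to_csv_header(defn):
    """A second target format: a header row for a CSV export."""
    return ",".join(f["name"] for f in defn["fields"])
```

The same definition drives both outputs, which is what makes it cheap to add further target formats later.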

Self-aware Data Objects

We call the data definitions created using the DSL Self-aware Data Objects.

SDOs can be used to define data at the lowest level at which it will be collected, for example a workshop attendance register or a group registration form.  By defining data at the operational level we can better assess its quality.  Indicators can instead be expressed as a query of the relevant SDO data.
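
To illustrate an indicator expressed as a query, the Python sketch below computes two simple figures from operational attendance records.  The records are invented sample data, and the function names and indicator wording are hypothetical.

```python
from collections import Counter

# Invented sample data collected against an attendance register definition.
attendance_records = [
    {"workshop": "W1", "participant_name": "Amina", "sex": "female"},
    {"workshop": "W1", "participant_name": "Joseph", "sex": "male"},
    {"workshop": "W2", "participant_name": "Grace", "sex": "female"},
    {"workshop": "W2", "participant_name": "Amina", "sex": "female"},
]


def women_reached(records):
    """Indicator: number of distinct women attending at least one workshop."""
    return len({r["participant_name"] for r in records if r["sex"] == "female"})


def attendance_by_workshop(records):
    """Indicator: attendance count per workshop."""
    return Counter(r["workshop"] for r in records)
```

Because the indicator is derived from the underlying records, anyone with access can check how it was calculated, or define a different indicator from the same data.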

Since SDOs are defined using a common DSL, it becomes possible to make connections between different definitions, or between data created from those definitions.  These connections can be made explicit by including a linkage in the definition, or they can be expressed via a query that combines data from different SDOs.

Organic standards

Returning to our fictional example, DFID now has a growing library of data definitions (or SDOs) in their registry.  Approved partners can browse this registry and adopt SDOs that they want to use.  The registry keeps track of who adopts an SDO.

If necessary they can modify the SDO, adding additional fields or perhaps translating it into a different language.  Provided these changes do not conflict with the schema, the modified SDO is still compatible with the original SDO.  Since the registry manages this adoption process, the SDO versions are linked automatically.
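
What counts as a conflict with the schema would need to be defined by the registry.  As one possible reading, the Python sketch below treats a modified SDO as compatible if it keeps every original field with the same type and does not relax a required field.  These rules, the definition layout and the function name is_compatible are all assumptions for illustration.

```python
def is_compatible(original, modified):
    """Check a modified definition against the one it was adopted from."""
    modified_fields = {f["name"]: f for f in modified["fields"]}
    for original_field in original["fields"]:
        other = modified_fields.get(original_field["name"])
        if other is None:
            return False  # an original field was removed
        if other["kind"] != original_field["kind"]:
            return False  # the field's type was changed
        if original_field["required"] and not other["required"]:
            return False  # a required field was made optional
    return True  # extra fields and translated labels are allowed


original = {"fields": [
    {"name": "participant_name", "kind": "text", "required": True, "label": "Name"},
    {"name": "age", "kind": "integer", "required": False, "label": "Age"},
]}

# Adopted version: labels translated into French and one field added.
adopted = {"fields": [
    {"name": "participant_name", "kind": "text", "required": True, "label": "Nom"},
    {"name": "age", "kind": "integer", "required": False, "label": "Âge"},
    {"name": "village", "kind": "text", "required": False, "label": "Village"},
]}
```

Here is_compatible(original, adopted) returns True, whereas changing the kind of age to "text" would make it return False.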

In this way a community of DFID programmes may share their data definitions with each other, adopting, tweaking and using those they consider useful.  Through this process of collaboration we see a more emergent and organic way of developing data standards.  Critically, though, these are standards within a specific context only - in this case the context of DFID programmes.

DFID might choose instead (or also) to take a more pro-active approach in some cases.  A DFID business analyst might assess which data is most needed to report to parliament on the outputs arising from DFID expenditure.  In this case they might pro-actively define a list of SDOs that provide the necessary data.  Programmes might now be required to use these ‘DFID standards’ (at least as a starting point) when selecting SDOs from the register.

While not so much an organic approach, it is a way in which a specific donor, government or NGO network can develop its own data standards for its own purposes.

Image courtesy of XKCD.com

Shared database

So far I’ve discussed only four of the key building blocks.  While data standards are important, shared data related to these standards is what we are ultimately interested in.

Since we have used a DSL to create the data definitions, the task of aggregating data published to one or more of these definitions becomes much easier.  There are three key steps.

First, map your existing data tables to one or more relevant SDOs from the registry.  Second, write a script that transforms your data into the SDO format.  We use JSON, but since SDOs can be transformed programmatically it could be any format.  Third, integrate with the API to publish your data.

The API has three calls: 

Validate - Your SDO data is validated against the schema in the registry

Publish - Your SDO data is transmitted to the DFID shared database

Query - Your application queries the DFID database for data on one or more SDOs
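
The three calls can be pictured with a small in-memory stand-in written in Python.  This is a sketch of the behaviour described above, not the real API: the class name SharedDatabase, the method signatures and the idea of a schema as a set of required field names are all hypothetical, and a real client would talk to a remote service.

```python
class SharedDatabase:
    """In-memory stand-in for the registry schemas and the shared database."""

    def __init__(self, schemas):
        self._schemas = schemas  # SDO id -> set of required field names
        self._store = {}         # SDO id -> list of published records

    def validate(self, sdo_id, record):
        """Validate: check the record against the schema held in the registry."""
        missing = sorted(self._schemas[sdo_id] - set(record))
        return [f"missing field: {name}" for name in missing]

    def publish(self, sdo_id, record):
        """Publish: store the record, but only if it passes validation."""
        errors = self.validate(sdo_id, record)
        if errors:
            raise ValueError("; ".join(errors))
        self._store.setdefault(sdo_id, []).append(record)

    def query(self, sdo_id, **filters):
        """Query: return published records matching every filter."""
        return [
            record for record in self._store.get(sdo_id, [])
            if all(record.get(key) == value for key, value in filters.items())
        ]


db = SharedDatabase({"attendance": {"participant_name", "workshop"}})
db.publish("attendance", {"participant_name": "Amina", "workshop": "W1"})
db.publish("attendance", {"participant_name": "Grace", "workshop": "W2"})
```

Validating before publishing keeps invalid records out of the shared store, so anything returned by a query can be trusted to match its definition.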

In this way any application can be modified to generate and publish data in this format to a central DFID database.  For DFID programmes this means that reporting on activities and outputs can happen in real-time, without the need for a separate PDF or other report.

Data aggregated in this way can be queried - either via the API or using dedicated business intelligence tools.  This makes it possible for different groups to perform their own analysis, limited only by the level of access that they are granted.

Challenges

While we’ve made a lot of progress on these ideas, there are still many challenges to work out.  I’d welcome your thoughts on these and others that we should consider.

Privacy and security: Centralising data in this way is a double-edged sword.  Clearly privacy and security implications are of critical importance.  One avenue we are exploring is enabling each publisher to encrypt their data.  They can then choose who they share the encryption key with.

Curation: It’s not hard to picture how quickly the register might fill up with data definitions.  Clear, well thought through rules and guidelines will be important.  Equally important is the need for a team to provide guidance to register users and to curate data definitions already published.  Without this we will quickly see an unusable mess. 

Barrier to entry: For smaller organisations with limited capacity, using a DSL may be daunting.  We need to consider carefully the possibility that this approach adds additional burdens on those least able to bear them.  Work on visual tools to create SDOs will certainly help, as will the option of simply adopting SDOs created by others.

Incentives:  What are the incentives that will help drive adoption of this approach?  In the first instance, those that manage the registry stand to benefit the most - from access to more and better quality data.  However, if the data is accessible it’s not hard to think of how the publishers could derive value too.

Examples include less time spent reporting (assuming that donors relax their current requirements), access to data from other related programmes for learning, and access to data definitions created by others.  These and other factors may serve as incentives.

What next?

Some of these ideas are well developed and widely used in our work.  Others are still under development.  Over the last four weeks I’ve had a series of interesting meetings and conversations with people actively involved in the world of data standards, both in relation to open development data and open government data.  Thanks again to those that took the time to talk.  It’s been a great learning experience.

We’d like to hear from people interested in partnering with us to develop these ideas further.  If the concepts are validated then our next move will be to launch this as an open source project to leverage wider engagement and adoption.