Making data available in machine-readable formats is addressed in a growing number of articles on open data as a way to improve its proliferation. The Swedish government is proposing a bill for better reuse of public transportation data through machine-readable formats, and we have addressed the importance of machine-readable formats in an article about liquid open data. What is often not mentioned is the importance of metadata for machine-readable formats in augmenting the use and reuse of data. Therefore, we want to give our perspective on why metadata are an inseparable part of open data and machine-readable formats.
All digital files are machine-readable in the sense that they were created by machines. But there is a big difference between files meant to be read by humans and files meant to be read by machines. Machine-readable data means that the data are well structured in a standardized format so that machines can deduce the meaning and process the content without manual intervention. For machines to understand the meaning, the data need to contain metadata that describe their characteristics and context. The ability to describe metadata is determined by the data format being used. If the format does not support detailed metadata descriptions, there is a risk that machines will misinterpret the nature of the data, rendering them unusable. For example, machines cannot guess whether a decimal is a geographical reference or an amount in a financial budget. For a basic understanding of metadata and the implications of data format choice, read our previous article.
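As a minimal sketch of the point about decimals, the Python snippet below interprets the same number in two ways depending on the metadata attached to it. The field names (`value`, `kind`, `unit`), the `describe` helper and the sample number are all hypothetical and chosen only for illustration.

```python
# Hypothetical records: the field names are illustrative, not a standard.
# On its own, the number 59.3293 says nothing about what it measures.
records = [
    {"value": 59.3293, "kind": "latitude", "unit": "degrees"},
    {"value": 59.3293, "kind": "budget_amount", "unit": "SEK"},
]


def describe(record):
    """Return an interpretation of the value based on its metadata."""
    if record["kind"] == "latitude":
        return f"{record['value']} {record['unit']} north"
    if record["kind"] == "budget_amount":
        return f"{record['value']:.2f} {record['unit']}"
    raise ValueError(f"unknown kind: {record['kind']}")


descriptions = [describe(r) for r in records]
for line in descriptions:
    print(line)
```

Without the `kind` and `unit` fields, a program would have no basis for choosing between the two interpretations.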
Machines cannot yet understand semantic meaning and context in the same way as humans. For example, it is difficult for a machine to interpret an Excel file that refers to an employee's personal expenses for the month of March 2016, since the format does not have sufficient metadata support and the context is not known. For a machine to understand the meaning of data, the format needs to support metadata that can describe the data's properties and context. In this example, the metadata need to explain that the employee is an object that is an instance of the class staff, and that the personal expenses are an instance of the class debt to the employee, which in turn inherits its properties from the class debt, which can be associated with the classes cost, month and year. Data formats such as XML and JSON can describe these relationships among classes by using metadata schemas, which are a natural part of these formats.
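The class relationships described above could be sketched in JSON roughly as follows. This is an illustration under stated assumptions, not a real schema standard: the keys `inherits` and `associated_with`, the class name `debt_to_employee`, the employee id and the expense amount are all made up for the example, and the `ancestry` helper is hypothetical.

```python
import json

# Hypothetical, ad hoc schema describing the classes and their relationships.
schema = {
    "classes": {
        "staff": {},
        "debt": {"associated_with": ["cost", "month", "year"]},
        "debt_to_employee": {"inherits": "debt"},
    }
}

# Hypothetical data record that refers to the classes in the schema.
record = {
    "employee": {"class": "staff", "id": "E-001"},
    "personal_expenses": {
        "class": "debt_to_employee",
        "cost": 1250.0,
        "month": 3,
        "year": 2016,
    },
}


def ancestry(class_name, classes):
    """Follow 'inherits' links from a class up to its root class."""
    chain = [class_name]
    while "inherits" in classes[chain[-1]]:
        chain.append(classes[chain[-1]]["inherits"])
    return chain


# Serialize schema and data together, then read them back as a consumer would.
document = json.dumps({"schema": schema, "data": record})
loaded = json.loads(document)

expense_class = loaded["data"]["personal_expenses"]["class"]
chain = ancestry(expense_class, loaded["schema"]["classes"])
print(chain)
```

Because the schema travels with the data, a consumer can work out that the personal expenses are a kind of debt without any prior knowledge of the file.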
Our definition of a machine-readable format in an open data context is as follows: data formats that are suitable for resource-efficient computation and able to define objects, attributes and hierarchies so that machines can interpret the content and context without manual intervention. This means that a data format such as comma-separated values (CSV) cannot be considered a machine-readable format, because it lacks metadata support and the possibility to describe meaning and context. However, comma-separated files are resource-efficient and appropriate for processing tabular data where the context of application is already known.
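A small sketch of this difference, using only Python's standard `csv` and `json` modules: the same made-up record is read once from CSV and once from JSON. The column names and values are hypothetical.

```python
import csv
import io
import json

# The same hypothetical record in two formats.
csv_text = "employee,amount,month\nE-001,1250.50,3\n"
json_text = '{"employee": "E-001", "amount": 1250.50, "month": 3}'

# CSV: the header row gives column names, but every value arrives as a string.
rows = list(csv.DictReader(io.StringIO(csv_text)))
csv_amount = rows[0]["amount"]

# JSON: basic data types (number, string) are part of the format itself.
obj = json.loads(json_text)
json_amount = obj["amount"]

print(type(csv_amount).__name__, type(json_amount).__name__)
```

With CSV, the consumer has to know in advance that `amount` is a number and `month` an integer. JSON carries at least these basic types with the data, and a schema can add further constraints on top.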
We analysed more than 20,000 links to CSV files on data.gov.uk – only around one third turned out to be machine-readable. – A case study of CSVs on data.gov.uk
If you undertake the task of washing and cleaning data before publication, the extra effort of transforming the data into XML or JSON is not that much greater, given that the internal metadata repository and structure are in order. The reason for choosing machine-readable formats with support for metadata is that machines can then interpret the data and merge them with other data sources from various countries by using standardized taxonomies in which attributes, classes and hierarchies are defined. An example of a metadata taxonomy is Datex, which is a European standard for traffic and transport data.
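To illustrate how a shared taxonomy makes merging possible, the sketch below renames the local field names of two sources to common terms before combining them. The taxonomy terms, field names and values are invented for this example and are not taken from Datex or any other real standard.

```python
# Hypothetical mappings from local field names to shared taxonomy terms.
SWEDISH_TO_SHARED = {"hastighet": "speed_kmh", "vag": "road_id"}
DANISH_TO_SHARED = {"fart": "speed_kmh", "vej": "road_id"}


def normalize(record, mapping):
    """Rename local field names to the shared taxonomy terms."""
    return {mapping[key]: value for key, value in record.items()}


# Made-up records from two national sources.
swedish = [{"hastighet": 70, "vag": "E4"}]
danish = [{"fart": 80, "vej": "E20"}]

merged = [normalize(r, SWEDISH_TO_SHARED) for r in swedish] + [
    normalize(r, DANISH_TO_SHARED) for r in danish
]
print(merged)
```

When both publishers already use the shared terms in their metadata, the mapping step disappears and the merge can be done without manual intervention.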
The table presents common formats and their support for metadata.
Format | Metadata | Description |
---|---|---|
ZIP (compressed file) | None | No support for metadata |
CSV (comma-separated file) | None | No support for metadata, except that the first row can contain column names |
 | Limited | Metadata about creator and date. |
Spreadsheet (Excel) | Limited | Metadata about the creator, date and data types. A special program or module is needed to extract the metadata. Metadata is not a natural part of the format, and the format is proprietary. |
JPG, PNG | Fair | Metadata about the creator, date, license regulations, geographic location, camera settings and more |
JSON, XML | Excellent | Allows defining metadata structures that describe owner, date, time zones, complex data types and validation of permitted values. The formats allow linking to metadata schemas that are part of a hierarchy of objects belonging to a taxonomy |