For data to be put to use, several common obstacles need to be addressed. The previous article explained the effect of deficient metadata and of format choices that make data less accessible to machines. Still, data are being published today in ways that raise barriers and impede usability instead of the opposite. A quick overview of the datasets registered on the Swedish national portal for open data, Öppnadata.se (similar to data.gov), shows that the most common format is ZIP. This format is one of the worst with regard to metadata and machine readability, which shows that open data owners and providers, in Sweden at least, do not follow good practice and principles for putting data on the web.
A closer examination reveals that only a few organizations choose to publish data in ZIP format. The Swedish Environmental Protection Agency (Naturvårdsverket) is the organization with the most registered datasets overall, and nearly all of them are in ZIP or PDF format. Both of these formats rank low on our list of formats with good metadata support. Additionally, several of the EPA's registered datasets have no link to the actual data and no explanation of where to find it. Another organization that has published many datasets on the portal is the Swedish Royal Library, which has neglected both to specify the format and to provide links to the data registered on the portal.
To a third-party user this gives a careless impression, since many of the registered datasets lack a specified data format or a link to the actual data. When the most common file format on the portal (aside from unspecified) is ZIP, the portal looks more like a file-sharing site than a place for open data. A user who hopes to find data on the portal that can be used to build applications and services will be disappointed, since most of the datasets use formats that are not suited to machine reading. According to the European Data Portal, only 26 percent of the datasets on the portal are machine readable. ZIP and PDF lock data in and make it difficult to access, because manual intervention is needed to get at the content. Now that the Swedish National Archives has taken over responsibility for the portal, we hope for a concentrated effort to remove barriers and promote data formats that are machine readable and carry adequate metadata.
The table below shows common data formats and their metadata support.
Format | Metadata | Description |
---|---|---|
ZIP (compressed file) | None | No support for metadata |
CSV (comma-separated file) | None | No support for metadata, except that the first row can contain the column names |
PDF | Limited | Metadata about creator and date. |
Spreadsheet (Excel) | Limited | Metadata about the creator, date, and data types. Extracting the metadata requires a special program or module. Metadata is not a natural part of the format, and the format is proprietary. |
JPG, PNG | Fair | Metadata about the creator, date, license terms, geographic location, camera settings, and more |
JSON, XML | Excellent | Allow defining metadata structures that describe owner, date, time zones, complex data types, and validation of permitted values. The formats allow linking to metadata schemas that are part of a hierarchy of objects belonging to a taxonomy |
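
To make the contrast between the CSV and JSON rows concrete, here is a minimal Python sketch. The dataset, the agency name, and the layout of the `metadata` block are all invented for illustration; they are assumptions and do not follow any particular standard or anything published on the portal. The point is only that a CSV file can tell a program its column names and nothing more, while a JSON document can carry its own description alongside the data.

```python
import csv
import io
import json

# Hypothetical sample data, invented for this illustration.
CSV_TEXT = "station,date,no2\nA1,2019-01-01,21.5\n"

# The "metadata" layout below is an assumption, not a standard schema.
JSON_TEXT = json.dumps({
    "metadata": {
        "owner": "Example Agency",
        "license": "CC0-1.0",
        "timezone": "Europe/Stockholm",
        "fields": {
            "station": {"type": "string"},
            "date": {"type": "date"},
            "no2": {"type": "number", "unit": "ug/m3"},
        },
    },
    "data": [{"station": "A1", "date": "2019-01-01", "no2": 21.5}],
})


def describe_csv(text):
    """Return everything a CSV can say about itself: its column names."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)  # the first row is assumed to be a header
    return {"columns": header}


def describe_json(text):
    """Return the metadata block embedded in the JSON document."""
    document = json.loads(text)
    return document["metadata"]


csv_info = describe_csv(CSV_TEXT)
json_info = describe_json(JSON_TEXT)

print(csv_info)                     # column names only
print(json_info["owner"])           # who published the data
print(json_info["fields"]["no2"])   # type and unit of one column
```

With the CSV, a consumer still has to guess types, units, owner, and license from somewhere outside the file. With the JSON variant, the same program can read them directly, which is what the "Excellent" rating in the table refers to.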