Core Datasets

Important, commonly-used datasets as high-quality, easy-to-use & open data packages

Core Datasets are important, commonly-used reference datasets, such as GDP or country codes, made available as high-quality, easy-to-use and open data packages. Find them online here on the DataHub:

http://datapackaged.com/core/

Key features are:

  • High Quality & Reliable – sourcing, normalizing and quality-checking a set of key reference and indicator datasets such as country codes, currencies, GDP and population
  • Standardized & Bulk – all datasets are provided in a standardized form and can be accessed in bulk as CSV together with a simple JSON schema (see the sketch below this list)
  • Versioned & Packaged – all data is provided as data packages and versioned using git, so all changes are visible and the data can be maintained collaboratively
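
To give a concrete feel for what "standardized and bulk" means in practice, here is a minimal sketch of reading a core data package with nothing but the Python standard library. The dataset URL, resource layout and field access shown here are assumptions for illustration, not a guaranteed layout:

    import csv
    import io
    import json
    import urllib.request

    # Hypothetical location of a core data package (illustration only).
    BASE = "http://datapackaged.com/core/country-codes/"

    # 1. Fetch the datapackage.json descriptor that describes the dataset.
    with urllib.request.urlopen(BASE + "datapackage.json") as resp:
        descriptor = json.load(resp)

    # 2. Take the first resource listed in the descriptor (assumed to be a
    #    CSV with a relative "path"; the descriptor also carries the schema).
    resource = descriptor["resources"][0]
    csv_url = BASE + resource["path"]

    # 3. Stream the CSV in bulk and read rows keyed by column name.
    with urllib.request.urlopen(csv_url) as resp:
        reader = csv.DictReader(io.TextIOWrapper(resp, encoding="utf-8"))
        for row in reader:
            print(row)

Because every core dataset follows the same descriptor-plus-CSV pattern, the same few lines should work largely unchanged across datasets.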

The “Core Datasets” effort is part of the broader Frictionless Data initiative.

Core Data Curators

The Core Data Curators are the team responsible for maintaining the core datasets.

Curation involves identifying and locating core (public) datasets, then packaging them up as high-quality, reliable, and easy-to-use data packages (standardized, structured, open).

New team members wanted: We are always seeking volunteers to join the Data Curators team. You will be part of a crack team, developing and honing your data wrangling skills while helping to provide high-quality data to the community.

What Roles and Skills Are Needed

We have a variety of roles, from identifying new “core” datasets, to collecting and packaging the data, to performing quality control.

Core Skills – at least one of these skills is strongly recommended:

  • Data Wrangling Experience. Many of our source datasets are not complex (just an Excel file or similar) and can be “wrangled” in a spreadsheet program. We therefore recommend at least one of:
    • Experience with a spreadsheet application such as Excel or (preferably) Google Sheets, including use of formulas and (desirably) macros (you should at least know how to quickly convert a cell containing ‘2014’ to ‘2014-01-01’ across 1000 rows; see the short sketch after this list)
    • Coding for data processing (especially scraping) in one or more of Python, JavaScript or Bash
  • Data sleuthing – the ability to dig up data on the web (specific desirable skills: you know how to search by filetype in Google, you know where the developer tools are in Chrome or Firefox, and you know how to find the URL a form posts to)
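
For orientation, the ‘2014’ to ‘2014-01-01’ conversion mentioned above is just as quick outside a spreadsheet. A minimal sketch in Python, where the file names and the ‘Year’ column are hypothetical:

    import csv

    # Hypothetical file and column names, for illustration only.
    with open("gdp_raw.csv", newline="") as src, \
         open("gdp_clean.csv", "w", newline="") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:
            # Normalize a bare year such as '2014' to an ISO date '2014-01-01'.
            year = row["Year"]
            if len(year) == 4 and year.isdigit():
                row["Year"] = year + "-01-01"
            writer.writerow(row)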

Desirable Skills (the more the better!):

  • Data vs Metadata: know the difference between data and metadata
  • Familiarity with Git (and Github)
  • Familiarity with a command line (preferably bash)
  • Know what JSON is
  • Mac or Unix is your default operating system (this will make access to the relevant tools that much easier)
  • Knowledge of Web APIs and/or HTML
  • Use of curl or a similar command-line tool for accessing Web APIs or web pages
  • Scraping using a command-line tool or (even better) by writing your own scraping code
  • Know what a Data Package and a Tabular Data Package are (a minimal example follows this list)
  • Know what a text editor is (e.g. notepad, textmate, vim, emacs, …) and know how to use it (useful for both working with data and for editing Data Package metadata)
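
For reference, a Data Package is simply a folder of data files plus a datapackage.json descriptor, and a Tabular Data Package additionally describes the columns of each CSV in a schema. The sketch below writes a minimal descriptor in Python; the dataset and field names are illustrative only and the real specification allows many more fields:

    import json

    # A minimal, illustrative Tabular Data Package descriptor: this shows
    # only the general shape, not the complete specification.
    descriptor = {
        "name": "gdp",
        "title": "Country GDP (example)",
        "resources": [
            {
                "path": "data/gdp.csv",       # the actual data file
                "schema": {                   # describes the CSV columns
                    "fields": [
                        {"name": "Country Code", "type": "string"},
                        {"name": "Year", "type": "date"},
                        {"name": "Value", "type": "number"},
                    ]
                },
            }
        ],
    }

    # Writing the descriptor next to the data file completes the package.
    with open("datapackage.json", "w") as f:
        json.dump(descriptor, f, indent=2)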

Get Involved – Sign Up Now!

Here’s what you need to know when you sign up:

  • Time commitment: Members of the team commit to 8-16 hours per month on average (if you are especially busy with other things one month and do less, that is fine)
  • Schedule: There is no fixed schedule, so you can contribute at any time that suits you – evenings, weekends, lunchtimes, etc.
  • Location: All activity is carried out online, so you can be based anywhere in the world
  • Skills: see above

To register your interest, fill in the following form. If you have any questions, please get in touch directly.