
Preparing data for Dojo

Dojo can accept CSV, Excel, GeoTIFF and NetCDF files. When you initially upload a file you may be presented with a set of options depending upon the detected file type. Dojo is not a data cleaning platform; your data should be clean prior to registering it to Dojo.

Tabular data

If your data is a CSV or Excel file, it is considered tabular. Tabular data must be provided in a clean format. If your file has extraneous rows or columns, includes arbitrary line breaks (e.g. for human readability), or is otherwise malformed, clean it up (for example, in Excel) before registering it to Dojo.

Additionally, consider removing any columns you do not need before uploading your CSV or Excel file; this simplifies the annotation task within Dojo.
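As a rough illustration of the kind of pre-upload check described above, the sketch below scans CSV text for blank rows and rows whose field count differs from the header. It is not part of Dojo; the function name `find_malformed_rows` and the sample data are assumptions made for this example.

```python
import csv
import io


def find_malformed_rows(csv_text):
    """Return (line_number, reason) pairs for rows that look malformed."""
    problems = []
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, None)
    if header is None:
        return [(1, "file is empty")]
    for row in reader:
        line = reader.line_num  # line of the source text just consumed
        if not any(cell.strip() for cell in row):
            problems.append((line, "blank row"))
        elif len(row) != len(header):
            problems.append(
                (line, f"expected {len(header)} fields, found {len(row)}")
            )
    return problems


# Hypothetical sample with a blank line and a row missing a value.
SAMPLE = """Year,Country,Crop_Index
2015,Djibouti,0.7

2016,Djibouti
2017,Djibouti,0.9
"""

problems = find_malformed_rows(SAMPLE)
for line, reason in problems:
    print(f"line {line}: {reason}")
```

An empty result suggests the file is at least rectangular; it does not guarantee the values themselves are clean.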

Tabular data must have one column per feature. For example, a table that looks like this would be acceptable:

| Year | Country | Crop_Index |
| --- | --- | --- |
| 2015 | Djibouti | 0.7 |
| 2016 | Djibouti | 0.8 |
| 2017 | Djibouti | 0.9 |

However, a transposed dataset in which time is spread across columns, such as the following, would be unacceptable:

| Country | 2015 | 2016 | 2017 |
| --- | --- | --- | --- |
| Djibouti | 0.7 | 0.8 | 0.9 |
| Eritrea | 0.6 | 0.7 | 0.9 |

A dataset such as the above should be transformed by the user beforehand so that each item of interest has its own column. Datasets with line breaks or non-standardized formatting are unacceptable for registration to Dojo.
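The transposed table above can be reshaped with a few lines of standard-library Python. This is a minimal sketch, not a Dojo utility: the function names are hypothetical, and the value column is assumed to be called `Crop_Index` because the transposed table does not name its measure.

```python
import csv
import io

# The unacceptable wide-format example, written out as CSV text.
WIDE_CSV = """Country,2015,2016,2017
Djibouti,0.7,0.8,0.9
Eritrea,0.6,0.7,0.9
"""


def wide_to_long(wide_text, id_column="Country", value_name="Crop_Index"):
    """Turn one-column-per-year data into one row per (year, country)."""
    reader = csv.DictReader(io.StringIO(wide_text))
    year_columns = [name for name in reader.fieldnames if name != id_column]
    rows = []
    for record in reader:
        for year in year_columns:
            rows.append(
                {"Year": year, id_column: record[id_column], value_name: record[year]}
            )
    return rows


def to_csv(rows):
    """Serialize the reshaped rows back to CSV text."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=list(rows[0].keys()), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


long_rows = wide_to_long(WIDE_CSV)
print(to_csv(long_rows))
```

The result has one `Year` column, one `Country` column, and one value column, which matches the acceptable layout shown earlier.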

For example, this dataset cannot be registered to Dojo as is:

| Survey 1 | Notes: survey collected by 3rd party enumerator | |
| --- | --- | --- |
| Year | Country | Crop_Index |
| 2015 | Djibouti | 0.7 |
| 2016 | Djibouti | 0.8 |
| Survey 2 | Notes: survey collected by World Bank | |
| Year | Country | Fertilizer_Index |
| 2015 | Djibouti | 1.8 |
| 2015 | Eritrea | 2.1 |

Before registration to Dojo, it should be cleaned and converted to the following format:

| Survey_Number | Year | Country | Crop_Index | Fertilizer_Index | Notes |
| --- | --- | --- | --- | --- | --- |
| 1 | 2015 | Djibouti | 0.7 | | survey collected by 3rd party enumerator |
| 1 | 2016 | Djibouti | 0.8 | | survey collected by 3rd party enumerator |
| 2 | 2015 | Djibouti | | 1.8 | survey collected by World Bank |
| 2 | 2015 | Eritrea | | 2.1 | survey collected by World Bank |
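One possible way to get from the stacked survey blocks to a single table like the one above is sketched below. It assumes the raw file is comma-separated, that each block starts with a `Survey N` row carrying the note, and that the next row is that block's header; the function name `flatten_surveys` is hypothetical.

```python
import csv
import io

# The stacked survey example, assumed here to be stored as CSV text.
RAW = """Survey 1,Notes: survey collected by 3rd party enumerator
Year,Country,Crop_Index
2015,Djibouti,0.7
2016,Djibouti,0.8
Survey 2,Notes: survey collected by World Bank
Year,Country,Fertilizer_Index
2015,Djibouti,1.8
2015,Eritrea,2.1
"""

COLUMNS = ["Survey_Number", "Year", "Country", "Crop_Index", "Fertilizer_Index", "Notes"]


def flatten_surveys(raw_text):
    """Merge the per-survey blocks into rows that share one set of columns."""
    rows = []
    survey_number, notes, header = None, "", None
    for cells in csv.reader(io.StringIO(raw_text)):
        if not cells:
            continue  # skip blank lines
        if cells[0].startswith("Survey "):
            # Block title row: remember the survey number and its note.
            survey_number = cells[0].split()[1]
            notes = cells[1].split(":", 1)[-1].strip() if len(cells) > 1 else ""
            header = None
        elif header is None:
            header = cells  # first row after a title is the block's header
        else:
            row = dict.fromkeys(COLUMNS, "")  # unmeasured indexes stay empty
            row.update(zip(header, cells))
            row["Survey_Number"] = survey_number
            row["Notes"] = notes
            rows.append(row)
    return rows


flat_rows = flatten_surveys(RAW)
for row in flat_rows:
    print(row)
```

Indexes a survey did not measure are left as empty strings, mirroring the blank cells in the cleaned table.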

Alternatively, a format like the following is also acceptable to Dojo:

| Survey_Number | Year | Country | Index_Name | Index_Value | Notes |
| --- | --- | --- | --- | --- | --- |
| 1 | 2015 | Djibouti | Crop Index | 0.7 | survey collected by 3rd party enumerator |
| 1 | 2016 | Djibouti | Crop Index | 0.8 | survey collected by 3rd party enumerator |
| 2 | 2015 | Djibouti | Fertilizer Index | 1.8 | survey collected by World Bank |
| 2 | 2015 | Eritrea | Fertilizer Index | 2.1 | survey collected by World Bank |
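If your data already has one column per index, the name/value layout above can be produced by emitting one row per non-empty index value. The sketch below illustrates the idea; the function name `unpivot_indexes` and the `INDEX_LABELS` mapping are assumptions made for this example, and the input rows are a subset of the tables in this section.

```python
# Hypothetical mapping from index column names to the labels used in Index_Name.
INDEX_LABELS = {"Crop_Index": "Crop Index", "Fertilizer_Index": "Fertilizer Index"}


def unpivot_indexes(rows):
    """Emit one output row per non-empty index value in each input row."""
    long_rows = []
    for row in rows:
        for column, label in INDEX_LABELS.items():
            value = row.get(column, "")
            if value == "":
                continue  # this survey did not measure this index
            long_rows.append(
                {
                    "Survey_Number": row["Survey_Number"],
                    "Year": row["Year"],
                    "Country": row["Country"],
                    "Index_Name": label,
                    "Index_Value": value,
                    "Notes": row["Notes"],
                }
            )
    return long_rows


wide_rows = [
    {"Survey_Number": "1", "Year": "2015", "Country": "Djibouti",
     "Crop_Index": "0.7", "Fertilizer_Index": "",
     "Notes": "survey collected by 3rd party enumerator"},
    {"Survey_Number": "2", "Year": "2015", "Country": "Eritrea",
     "Crop_Index": "", "Fertilizer_Index": "2.1",
     "Notes": "survey collected by World Bank"},
]

long_rows = unpivot_indexes(wide_rows)
for row in long_rows:
    print(row)
```

This layout avoids empty cells, since a row exists only where a survey actually recorded a value.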

Excel files require that you select a worksheet. When you upload an Excel file, Dojo detects the worksheet names and asks you to select the sheet of interest; if your file is large, please wait until the names appear. Currently, Dojo supports registering only one sheet at a time.

Excel sheet selector


Other formats

Data preparation is less of an issue for the other accepted formats. If your dataset is a GeoTIFF, you will be asked to provide the data band you wish to process and the name of the feature that resides in that band. You may optionally provide a date for that band and the value the GeoTIFF uses for nulls.

NetCDF files should be well formed and comply with the NetCDF standard.