1 Introduction

Hi, welcome to this sixth lesson, on uploading new datasets.

In the previous lesson, you built an advanced and dynamic dashboard using public datasets that are already stored in the epigraphhub database.

In this lesson, you will learn how to upload new datasets to the EpiGraphHub platform, so that you can eventually apply all what you have learned in the previous lessons to your own data.

2 Learning Objectives

By the end of this lesson, you will know several ways of uploading a new dataset to the EpiGraphHub platform.

Before uploading a table, please make sure that no column in the table has mixed types. For example, (1, 5, a, b, c, 3, 2, a) mixes strings and numbers.
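If you prepare your data with pandas before uploading, a quick way to spot such columns is to check how many distinct Python types each column holds. The snippet below is a small illustrative sketch (the toy data is made up, not part of the platform):

```python
import pandas as pd

# Hypothetical toy data: "mixed" combines numbers and strings,
# like the (1, 5, a, b, c, 3, 2, a) example above.
df = pd.DataFrame({
    "clean": [1, 5, 3, 2],
    "mixed": [1, 5, "a", "b"],
})

# Columns holding more than one Python type should be fixed
# before uploading.
mixed_cols = [
    col for col in df.columns
    if df[col].map(type).nunique() > 1
]
print(mixed_cols)  # only "mixed" is flagged
```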

3 Upload a CSV/Excel file

It’s possible to upload CSV and Excel files to EpiGraphHub and then explore them on the Explore page, create charts and build dashboards.

To upload a CSV or Excel file, just click on the plus button highlighted below:

Select the Data icon and the type of file you want to upload:

3.1 Upload a CSV File

For example, if you select Upload a CSV to database, you will see the screen below:

Let’s take an example to clarify each one of these steps. To do this, we need a .csv file. Open and download the table available at this link:

https://drive.google.com/file/d/1Gp-dNZCXfQT5LN5ivrTCoEEPM4hQrwRT/view?usp=sharing.

The slice_owid_covid table is a set of rows from the owid_covid dataset maintained in the public schema of the epigraphhub database.

Now that we have a .csv file, let’s upload it into our platform. To do this, let’s fill the fields in the figure above.

  • Table Name: Fill this field with the name of the table to be created from the CSV data. In our case, let’s name it slice_owid_covid (you can use any other name).

  • CSV File: Select the CSV file to be uploaded. Select the file downloaded from the link provided above.

  • Database: Select the Sandbox database. Datasets can only be uploaded to the Sandbox database.

  • Schema: Select the public schema. This is the only schema currently available in the Sandbox database.

  • Delimiter: Delimiter used by CSV file (for whitespace use +). It is the character used to separate the columns of the table. In our case, it’s the comma ‘,’. So, the default value works for our case.

  • Table Exists: If table exists do one of the following:

    • Fail (do nothing),
    • Replace (drop and recreate table) or
    • Append (insert data).

In our case let’s select Fail. If we have a dataset with the same name, we will change the name before saving it.

  • Header Row: Row containing the headers to use as column names (0 is first line of data). Leave empty if there is no header row. The default value, 0, works for our case.

  • Index Column: Column to use as the row labels of the dataframe. Leave empty if no index column. Let’s keep it empty.

  • Mangle Duplicate Columns: Checkbox. Specify duplicate columns as “X.0, X.1”. Let’s keep it selected (it will not change our final result, since we don’t have any duplicate columns).

  • Use Columns: JSON list of the column names that should be read. The default is None; if set, only these columns will be read from the file. Let’s select the columns date, iso_code, new_cases and new_deaths by typing in this field: ["date", "iso_code", "new_cases", "new_deaths"].

  • Skip Initial Space: Checkbox. Skip spaces after the delimiter. Let’s keep it as default, since we don’t want to skip initial spaces.

  • Skip Rows: Number of rows to skip at start of file. Let’s keep it as default, 0, since we don’t want to skip rows.

  • Rows to Read: Number of rows of the file to read. Let’s keep it empty to read all the rows of the file this time.

  • Skip Blank Lines: Skip blank lines rather than interpreting them as NaN values. Let’s keep it selected to avoid rows where all values are null.

  • Parse Dates: A comma separated list of columns that should be parsed as dates. In our case, let’s type date, as it is the unique date column we have.

  • Infer Datetime Format: Use Pandas to interpret the datetime format automatically. Let’s keep it selected.

  • Decimal Character: Character to interpret as decimal point. Let’s keep the default value ‘.’.

  • Dataframe Index: Write dataframe index as a column. Let’s keep the default value (unchecked box).

  • Column Label(s): Column label for index column(s). If None is given and Dataframe Index is True, Index Names are used. Let’s keep it empty.

  • Null Values: JSON list of the values that should be treated as null. Examples: [""], ["None", "N/A"], ["nan", "null"]. Warning: the Hive database supports only a single value; use [""] for an empty string. By default, the following values are interpreted as null: ["", "<NA>", "#N/A N/A", "N/A", "NA", "#N/A", "n/a", "-nan", "#NA", "NaN", "-1.#IND", "NULL", "null", "-1.#QNAN", "-NaN", "1.#IND", "nan", "1.#QNAN"]. Let’s keep the default value.
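For reference, these form fields closely mirror the parameters of pandas’ read_csv (the form’s own descriptions reference Pandas). Below is a minimal sketch of the equivalent call, using a tiny in-memory stand-in for the downloaded file; the column names come from the lesson, but the sample rows are made up:

```python
import io
import pandas as pd

# Tiny in-memory stand-in for slice_owid_covid.csv, so the sketch
# runs without the real download (sample values are invented).
csv_text = """date,iso_code,new_cases,new_deaths,extra
2021-01-01,CHE,100,2,x
2021-01-02,CHE,120,3,y
"""

df = pd.read_csv(
    io.StringIO(csv_text),
    sep=",",                                                  # Delimiter
    header=0,                                                 # Header Row
    usecols=["date", "iso_code", "new_cases", "new_deaths"],  # Use Columns
    skiprows=0,                                               # Skip Rows
    skip_blank_lines=True,                                    # Skip Blank Lines
    parse_dates=["date"],                                     # Parse Dates
    decimal=".",                                              # Decimal Character
)
print(df.dtypes)  # "date" is parsed as a datetime column
```

Note how the extra column is dropped by Use Columns and date becomes a proper datetime, just as the platform identifies it as a time column.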

After filling in this information about the dataset, click on the SAVE button. The table will be uploaded and will appear in the list of datasets. You will then be able to explore it and use it to make charts and dashboards, as shown in the previous lessons.

If you open this dataset on the Explore page, the following result will be displayed:

Note that the uploaded dataset only has the columns specified in the Use Columns field, and that date was properly identified as a TIME COLUMN. Also, if you apply the metric COUNT(*) in the AGGREGATE QUERY MODE, you can verify that the dataset has one thousand rows, as expected. This is shown in the figure below:

3.2 Upload an Excel File

By selecting Upload Excel file to database you will see the screen below:

Let’s take an example to clarify each one of these steps. To do this, we need a .xlsx file. Open this link:

https://docs.google.com/spreadsheets/d/13-o0GSVkTOd80VVS0Xuho3-16Kbf9rfZ/edit?usp=sharing&ouid=116313671710759075285&rtpof=true&sd=true

Click on File, then Download, then Microsoft Excel (.xlsx) to download the table as a .xlsx file. It contains the same data as the CSV file uploaded earlier, but in Excel (.xlsx) format.

Now that we have an excel file, let’s upload it into our platform. To do this, let’s fill the fields in the figure above.

  • Table Name: Fill this field with the name of the table to be created from the Excel data. In our case, let’s name it slice_owid_covid_excel (you can use any other name).

  • Excel File: Select the Excel file to be uploaded. Select the file downloaded from the link provided above.

  • Sheet Name: Name of the sheet to be uploaded. If left empty, the first sheet of the excel file will be uploaded. In our case, our file has only one sheet, so we will keep this field empty.

  • Database: Select the Sandbox database. Datasets can only be uploaded to the Sandbox database.

  • Schema: Select the public schema. This is the only schema currently available in the Sandbox database.

  • Table Exists: If table exists do one of the following:

    • Fail (do nothing),
    • Replace (drop and recreate table) or
    • Append (insert data).

In our case let’s select Fail. If we have a dataset with the same name, we will change the name before saving it.

  • Header Row: Row containing the headers to use as column names (0 is first line of data). Leave empty if there is no header row. The default value, 0, works for our case.

  • Index Column: Column to use as the row labels of the dataframe. Leave empty if no index column. Let’s keep it empty.

  • Mangle Duplicate Columns: Checkbox. Specify duplicate columns as “X.0, X.1”. Let’s keep it selected (it will not change our final result, since we don’t have any duplicate columns).

  • Skip Rows: Number of rows to skip at start of file. Let’s keep it as default, 0, since we don’t want to skip rows.

  • Rows to Read: Number of rows of the file to read. Let’s keep it empty to read all the rows of the file this time.

  • Parse Dates: A comma separated list of columns that should be parsed as dates. In our case, let’s type date, as it is the unique date column we have.

  • Infer Datetime Format: Use Pandas to interpret the datetime format automatically. Let’s keep it selected.

  • Decimal Character: Character to interpret as decimal point. Let’s keep the default value ‘.’.

  • Dataframe Index: Write dataframe index as a column. Let’s keep the default value (unchecked box).

  • Column Label(s): Column label for index column(s). If None is given and Dataframe Index is True, Index Names are used. Let’s keep it empty.

  • Null Values: JSON list of the values that should be treated as null. Examples: [""], ["None", "N/A"], ["nan", "null"]. Warning: the Hive database supports only a single value; use [""] for an empty string. By default, the following values are interpreted as null: ["", "<NA>", "#N/A N/A", "N/A", "NA", "#N/A", "n/a", "-nan", "#NA", "NaN", "-1.#IND", "NULL", "null", "-1.#QNAN", "-NaN", "1.#IND", "nan", "1.#QNAN"]. Let’s keep the default value.
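As with the CSV form, these fields closely mirror pandas’ read_excel parameters. A minimal sketch of the equivalent call, building a tiny in-memory .xlsx stand-in (made-up sample rows; writing .xlsx with pandas assumes the openpyxl package is installed):

```python
import io
import pandas as pd

# Build a tiny in-memory .xlsx stand-in for the downloaded file
# (requires openpyxl; the sample values are invented).
src = pd.DataFrame({
    "date": ["2021-01-01", "2021-01-02"],
    "iso_code": ["CHE", "CHE"],
    "new_cases": [100, 120],
    "new_deaths": [2, 3],
})
buf = io.BytesIO()
src.to_excel(buf, index=False)
buf.seek(0)

df = pd.read_excel(
    buf,
    sheet_name=0,          # Sheet Name (empty -> first sheet)
    header=0,              # Header Row
    skiprows=0,            # Skip Rows
    parse_dates=["date"],  # Parse Dates
)
print(df.dtypes)  # "date" is parsed as a datetime column
```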

After filling in this information about the dataset, click on the SAVE button; the uploaded table will appear in the list of datasets.

If you open this dataset on the Explore page, note that the date was properly identified as a TIME COLUMN. Also, if you apply the metric COUNT(*) in the AGGREGATE QUERY MODE, you can verify that the dataset has one thousand rows, as expected. This is shown in the figure below:

It’s recommended to upload new datasets in CSV format, since this format offers more options when uploading.

4 Upload a table from Google Sheets

Let’s now analyse a table saved in Google Sheets without the need to download it first in a different format and then upload it to our platform. We will upload it into the EpiGraphHub platform directly from Google Sheets.

As an example, let’s use this table: https://docs.google.com/spreadsheets/d/12rGwhwuIJuRZJ3eqfQUfuRK6yDSoTP9_Y4S_KidTyMY/edit?usp=sharing.

After clicking on the link, the page below will appear. As you can see, these are the same data that we uploaded previously in .csv and .xlsx formats.

(The steps explained below work with any table saved in Google Sheets.)

Click on the Share button highlighted above, and the two boxes below will be displayed on your screen:

In the Get link box, you need to allow your table to be shared with anyone with the link. If you have already done this, the following box will be displayed:

So, click on Copy link. Otherwise, if your table is not yet available to anyone with the link, you need to click on Change to anyone with the link, and you will see on your screen:

Click on Copy link and then on Done. This shares your table in Viewer mode only; it doesn’t allow other people with the link to edit your table. After copying the link, go back to the EpiGraphHub platform and select SQL Lab, then the SQL Editor option.

After clicking on it, the page below will be displayed on your screen. On this page, it’s possible to run multiple SQL queries. You can open a new query by clicking on the plus button highlighted below.

On the left side of the page, you can select the DATABASE and SCHEMA. On the right side, you can type your SQL query and run it by clicking on the RUN button.

To upload a table from Google Sheets you must use the Google sheets DATABASE and the main SCHEMA, as shown below:

In the field to type the query, just type:

SELECT * FROM "Google Sheets link";

and click on the RUN button, replacing "Google Sheets link" with the link copied above. Your final query is:

After running the query, the result below will be displayed on your screen:

To save this dataset and explore it click on the EXPLORE button. This window will be displayed:

You can save it as a new dataset (choosing a name) or overwrite an existing table. So, choose a name and click on the SAVE & EXPLORE button.

You will be redirected to the Explore page, where, as learned in the other lessons, you can make charts and explore your data.
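As a side note, a sheet shared with anyone with the link can also be read outside the platform, for example with pandas, by rewriting the edit link into Google’s CSV export form. This is a hedged sketch of that common pattern, using the lesson’s example sheet (the network read itself is left commented out):

```python
import re

# The edit link from the lesson's example sheet.
edit_link = ("https://docs.google.com/spreadsheets/d/"
             "12rGwhwuIJuRZJ3eqfQUfuRK6yDSoTP9_Y4S_KidTyMY/edit?usp=sharing")

# Extract the spreadsheet id (the token after /d/) and build
# the CSV export URL for the first sheet.
sheet_id = re.search(r"/d/([^/]+)", edit_link).group(1)
csv_url = f"https://docs.google.com/spreadsheets/d/{sheet_id}/export?format=csv"
print(csv_url)

# import pandas as pd
# df = pd.read_csv(csv_url)  # requires network access and a shared sheet
```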

5 Connect with other databases

The EpiGraphHub platform can be connected to a large number of databases, including:

If you want to connect to a specific database that is not yet available on our platform, just open an issue in our repository: https://github.com/thegraphnetwork/EpiGraphHub, and we will consider adding it :)

Congratulations! You have reached the end of this sixth lesson, where you uploaded new datasets in CSV, Excel and Google Sheets formats to the EpiGraphHub platform.

Hopefully, you now feel comfortable uploading your own test datasets to the platform, exploring them, and extracting valuable information that you can easily share online.

If you need your new datasets (public or private) to be persisted, maintained and regularly updated in the future, please contact the EpiGraphHub team, who will be happy to help :)

Contributors

The following team members contributed to this lesson:

This work is licensed under the Creative Commons Attribution-ShareAlike license.