Hi, welcome to this sixth lesson, on uploading new datasets.
In the previous lesson, you built an advanced, dynamic dashboard using public datasets already stored in the epigraphhub database.
In this lesson, you will learn different ways to upload new datasets to the EpiGraphHub platform, so that you can eventually apply everything you have learned in the previous lessons to your own data. Before uploading a table, please make sure that no column in the table has mixed types. For example, (1, 5, a, b, c, 3, 2, a) mixes strings and numbers.
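A quick way to run this check before uploading is to scan each column for mixed Python types with pandas. A minimal sketch, where the function name and the sample frame are ours for illustration:

```python
# Sketch: detect columns with mixed value types before uploading.
# The column and frame names are made up for illustration.
import pandas as pd

def mixed_type_columns(df: pd.DataFrame) -> list:
    """Return the names of columns whose non-null values mix Python types."""
    mixed = []
    for col in df.columns:
        types = df[col].dropna().map(type).unique()
        if len(types) > 1:
            mixed.append(col)
    return mixed

df = pd.DataFrame({"ok": [1, 5, 3, 2], "bad": [1, 5, "a", "b"]})
print(mixed_type_columns(df))  # ['bad']
```

Any column reported by this check should be cleaned (for example, cast to one type) before the upload.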
It’s possible to upload CSV and Excel files to the EpiGraphHub, to eventually explore them on the Explore page, create charts and build dashboards.
To upload a CSV or Excel file, just click on the plus button highlighted below:
Select the Data icon and the type of file you want to upload:
For example, if you select Upload a CSV to database, you
will see the screen below:
Let’s take an example to clarify each one of these steps. To do this,
we need a .csv file. Open and download the table available
at this link:
https://drive.google.com/file/d/1Gp-dNZCXfQT5LN5ivrTCoEEPM4hQrwRT/view?usp=sharing.
The slice_owid_covid file is a set of rows from the
owid_covid dataset maintained in the epigraphhub database
in the public schema.
Now that we have a .csv file, let’s upload it into our
platform. To do this, let’s fill the fields in the figure above.
Table Name: Fill this field with the name of the table to be created from the CSV data. In our case, let’s name it slice_owid_covid (you can use any other name).
CSV File: Select the CSV file to be uploaded. Select
the file downloaded from the link provided above.
Database: Select the Sandbox database. Datasets can only be uploaded to the Sandbox database.
Schema: Select the public schema.
This is the only schema currently available in the Sandbox
database.
Delimiter: The character used by the CSV file to separate the columns of the table (for whitespace, use +). In our case, it’s the comma ‘,’, so the default value works.
Table Exists: If the table already exists, do one of the following: Fail (do nothing), Replace (drop and recreate the table) or Append (insert data). In our case, let’s select Fail; if a dataset with the same name already exists, we will change the name before saving it.
Header Row: Row containing the headers to use as
column names (0 is first line of data). Leave empty if there is no
header row. The default value, 0, works for our
case.
Index Column: Column to use as the row labels of the
dataframe. Leave empty if no index column. Let’s keep it empty.
Mangle Duplicate Columns: Checkbox. Renames duplicate columns as “X.0, X.1”. Let’s keep it selected (it will not change our final result, since we don’t have any duplicate columns).
Use Columns: JSON list of the column names that should be read. The default is None; if not None, only these columns will be read from the file. Let’s select the columns date, iso_code, new_cases and new_deaths by typing in this field:
["date", "iso_code", "new_cases", "new_deaths"].
Skip Initial Space: Checkbox. Skip spaces after the delimiter. Let’s keep the default, since we don’t want to skip initial spaces.
Skip Rows: Number of rows to skip at start of file.
Let’s keep it as default, 0, since we don’t want to skip
rows.
Rows to Read: Number of rows of the file to read. Let’s keep it empty to read all the rows of the file this time.
Skip Blank Lines: Skip blank lines rather than
interpreting them as NaN values. Let’s keep it selected to avoid lines
with all the values null.
Parse Dates: A comma-separated list of columns that should be parsed as dates. In our case, let’s type date, as it is the only date column we have.
Infer Datetime Format: Use Pandas to interpret the
datetime format automatically. Let’s keep it selected.
Decimal Character: Character to interpret as decimal
point. Let’s keep the default value ‘.’.
Dataframe Index: Write dataframe index as a column.
Let’s keep the default value (unchecked box).
Column Label(s): Column label for index column(s).
If None is given and Dataframe Index is True, Index Names are used.
Let’s keep it empty.
Null Values: JSON list of the values that should be treated as null. Examples: [""], ["None", "N/A"], ["nan", "null"]. Warning: the Hive database supports only a single value. Use [""] for the empty string. By default, the following values are interpreted as null:
["", "<NA>", "#N/A N/A", "N/A", "NA", "#N/A", "n/a", "-nan", "#NA", "NaN", "-1.#IND", "NULL", "null", "-1.#QNAN", "-NaN", "1.#IND", "nan", "1.#QNAN"]. Let’s keep the default value.
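Notice that these form fields map almost one-to-one onto the parameters of pandas.read_csv (the datetime option above even mentions Pandas explicitly). As a rough sketch, here is the same configuration applied to a small made-up inline sample instead of the real file:

```python
# Sketch: the CSV upload form fields mirrored as pandas.read_csv
# parameters. The inline sample data is made up for illustration;
# in the lesson the data comes from the downloaded CSV file.
import io
import pandas as pd

csv_text = """date,iso_code,new_cases,new_deaths,extra
2021-01-01,CHE,100,2,x
2021-01-02,CHE,null,3,y
"""

df = pd.read_csv(
    io.StringIO(csv_text),
    sep=",",                 # Delimiter
    header=0,                # Header Row
    usecols=["date", "iso_code", "new_cases", "new_deaths"],  # Use Columns
    skipinitialspace=False,  # Skip Initial Space
    skiprows=0,              # Skip Rows
    skip_blank_lines=True,   # Skip Blank Lines
    parse_dates=["date"],    # Parse Dates
    decimal=".",             # Decimal Character
    na_values=None,          # Null Values (pandas defaults apply)
)
print(df.dtypes["date"])                # datetime64[ns]
print(df["new_cases"].isna().tolist())  # [False, True] -- "null" became NaN
```

Note how the "extra" column is dropped by usecols and the literal string "null" is read as a missing value, exactly as the default Null Values list describes.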
After filling this information about the dataset, click on the
SAVE button. The table will upload and appear in the list
of the datasets. You will then be able to explore it and use it to make
charts and dashboards as shown in the last lessons.
If you open this dataset on the Explore page, the following result will be displayed:
Note that the uploaded dataset only has the columns specified in the Use Columns field, and that date was properly identified as a TIME COLUMN. Also, if you apply the metric COUNT(*) in the AGGREGATE QUERY MODE, you can verify that the dataset has one thousand rows, as expected. This is shown in the figure below:
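If you prefer to double-check outside the platform, the COUNT(*) aggregate is easy to reproduce with pandas. In this sketch, a toy frame stands in for the downloaded CSV:

```python
# Sketch: reproducing the Explore page's COUNT(*) locally with pandas.
# The frame below is a toy stand-in for the downloaded slice_owid_covid CSV.
import pandas as pd

df = pd.DataFrame({"iso_code": ["CHE", "CHE", "FRA"], "new_cases": [1, 2, 3]})

print(len(df))                                  # COUNT(*) over the whole table: 3
print(df.groupby("iso_code").size().to_dict())  # COUNT(*) per group: {'CHE': 2, 'FRA': 1}
```

With the real file, len(df) should match the one-thousand-row count shown on the Explore page.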
By selecting Upload Excel file to database you will see
the screen below:
Let’s take an example to clarify each one of these steps. To do this,
we need a .xlsx file. If you open this link:
If you click on File, then on Download, and then on Microsoft Excel (.xlsx), you will download the table as a
.xlsx file. This file is the same as the CSV file uploaded
earlier, but in Excel (.xlsx) format.
Now that we have an excel file, let’s upload it into our platform. To do this, let’s fill the fields in the figure above.
Table Name: Fill this field with the name of the table to be created from the Excel data. In our case, let’s name it
slice_owid_covid_excel (You can use any other
name).
Excel File: Select the Excel file to be uploaded.
Select the file downloaded from the link provided above.
Sheet Name: Name of the sheet to be uploaded. If
left empty, the first sheet of the excel file will be uploaded. In our
case, our file has only one sheet, so we will keep this field
empty.
Database: Select the Sandbox database. Datasets can only be uploaded to the Sandbox database.
Schema: Select the public schema.
This is the only schema currently available in the Sandbox
database.
Table Exists: If the table already exists, do one of the following: Fail (do nothing), Replace (drop and recreate the table) or Append (insert data). In our case, let’s select Fail; if a dataset with the same name already exists, we will change the name before saving it.
Header Row: Row containing the headers to use as
column names (0 is first line of data). Leave empty if there is no
header row. The default value, 0, works for our
case.
Index Column: Column to use as the row labels of the
dataframe. Leave empty if no index column. Let’s keep it empty.
Mangle Duplicate Columns: Checkbox. Renames duplicate columns as “X.0, X.1”. Let’s keep it selected (it will not change our final result, since we don’t have any duplicate columns).
Skip Rows: Number of rows to skip at start of file.
Let’s keep it as default, 0, since we don’t want to skip
rows.
Rows to Read: Number of rows of the file to read. Let’s keep it empty to read all the rows of the file this time.
Parse Dates: A comma-separated list of columns that should be parsed as dates. In our case, let’s type date, as it is the only date column we have.
Infer Datetime Format: Use Pandas to interpret the
datetime format automatically. Let’s keep it selected.
Decimal Character: Character to interpret as decimal
point. Let’s keep the default value ‘.’.
Dataframe Index: Write dataframe index as a column.
Let’s keep the default value (unchecked box).
Column Label(s): Column label for index column(s).
If None is given and Dataframe Index is True, Index Names are used.
Let’s keep it empty.
Null Values: JSON list of the values that should be treated as null. Examples: [""], ["None", "N/A"], ["nan", "null"]. Warning: the Hive database supports only a single value. Use [""] for the empty string. By default, the following values are interpreted as null:
["", "<NA>", "#N/A N/A", "N/A", "NA", "#N/A", "n/a", "-nan", "#NA", "NaN", "-1.#IND", "NULL", "null", "-1.#QNAN", "-NaN", "1.#IND", "nan", "1.#QNAN"]. Let’s keep the default value.
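As with the CSV form, these fields mirror pandas.read_excel parameters. A minimal sketch on a made-up two-row sample, assuming the openpyxl engine is installed:

```python
# Sketch: the Excel upload form fields mirrored as pandas.read_excel
# parameters. The sample frame is made up for illustration; writing and
# reading .xlsx in memory requires the openpyxl engine.
import io
import pandas as pd

# Build a small in-memory .xlsx file to stand in for the downloaded one.
buf = io.BytesIO()
src = pd.DataFrame({
    "date": ["2021-01-01", "2021-01-02"],
    "iso_code": ["CHE", "CHE"],
    "new_cases": [100, 120],
})
src.to_excel(buf, index=False, sheet_name="Sheet1")
buf.seek(0)

df = pd.read_excel(
    buf,
    sheet_name=0,          # Sheet Name (first sheet when left empty)
    header=0,              # Header Row
    skiprows=0,            # Skip Rows
    parse_dates=["date"],  # Parse Dates
    na_values=None,        # Null Values (pandas defaults apply)
)
print(str(df["date"].dtype))  # datetime64[ns]
print(len(df))                # 2
```

The Excel form exposes fewer options than the CSV one (no Delimiter, Use Columns or Skip Blank Lines), which is one reason the CSV route is recommended below.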
After filling this information about the dataset, click on the
SAVE button, and the uploaded table will appear in the list of datasets.
If you open this dataset on the Explore page, note that the
date was properly identified as a TIME COLUMN. Also, if you apply the metric COUNT(*) in the AGGREGATE QUERY MODE, you can verify that the dataset has one thousand rows, as expected. This is shown in the figure below:
It’s recommended to upload new datasets in the .CSV format, since the CSV upload form offers more configuration options.
Let’s now analyse a table saved in Google Sheets without the need to download it first in a different format and then upload it to our platform. We will upload it into the EpiGraphHub platform directly from Google Sheets.
As an example, let’s use this table: https://docs.google.com/spreadsheets/d/12rGwhwuIJuRZJ3eqfQUfuRK6yDSoTP9_Y4S_KidTyMY/edit?usp=sharing.
After clicking on the link, the page below will appear. As you can see, these are the same data that we uploaded previously in .csv and .xlsx formats.
(The steps explained below work with any table saved in Google Sheets.)
Click on the Share button highlighted above and the two
boxes below will be displayed on your screen:
In the Get link box, you need to allow your table to be
shared with anyone with the link. If you have already done it, the
following box will be displayed:
So, click on Copy link. Otherwise, if your table is not yet available to anyone with the link, you need to click on Change to anyone with the link, and you will then see on your screen:
Click on Copy link and then on Done. Doing this grants only Viewer access: it doesn’t allow other people with the link to edit your table. After copying the link, go back to the EpiGraphHub platform and select SQL Lab and then the SQL Editor option.
After clicking on it, the page below will be displayed on your screen. On this page, it’s possible to make multiple SQL queries. You can open a new query by clicking on the plus button highlighted below.
On the left side of the page, you can select the DATABASE and SCHEMA.
On the right side, you can type your SQL query and run it by clicking on the RUN button.
To upload a table from Google Sheets you must use the
Google sheets DATABASE and the main SCHEMA, as
shown below:
In the field to type the query, just type:
SELECT * FROM "Google Sheets link";
and click on the RUN button. In the
Google Sheets link use the link copied above. Your final
query is:
After running the query, the result below will be displayed on your screen:
To save this dataset and explore it click on the EXPLORE
button. This window will be displayed:
You can save it as a new dataset (choosing a name) or overwrite an
existing table. So, choose a name and click on the
SAVE & EXPLORE button.
You will be redirected to the explore page, where as learned in the other lessons, you can make charts and explore your data.
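As an aside, a link-shared Google Sheet can also be read outside the platform with pandas, through Google’s CSV export endpoint. This sketch only builds the export URL (no network call is made); the helper function name is ours, not part of the platform:

```python
# Sketch: turn a Google Sheets share link into its CSV export URL,
# which pandas.read_csv can fetch directly for link-shared sheets.
# Only string manipulation happens here; no network request is made.
import re

def sheet_to_csv_url(share_url: str) -> str:
    """Extract the spreadsheet id and build the CSV export URL."""
    match = re.search(r"/spreadsheets/d/([A-Za-z0-9_-]+)", share_url)
    if match is None:
        raise ValueError("not a Google Sheets link")
    return f"https://docs.google.com/spreadsheets/d/{match.group(1)}/export?format=csv"

link = ("https://docs.google.com/spreadsheets/d/"
        "12rGwhwuIJuRZJ3eqfQUfuRK6yDSoTP9_Y4S_KidTyMY/edit?usp=sharing")
print(sheet_to_csv_url(link))
# To actually load the sheet (requires network access):
# import pandas as pd
# df = pd.read_csv(sheet_to_csv_url(link))
```

This relies on the sheet being shared with "anyone with the link", exactly the sharing setting configured above.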
The EpiGraphHub platform can be connected to a large number of databases, including:
If you want to connect with a specific database that is not yet available on our platform, just open an issue in our repo: https://github.com/thegraphnetwork/EpiGraphHub, and we will consider adding it :)
Congratulations! You reached the end of this sixth lesson, where you uploaded new datasets, in CSV, Excel and Google Sheets formats, to the EpiGraphHub platform.
Hopefully, you feel now comfortable with uploading your own test datasets to the platform, to explore them and extract valuable information that you can easily share online.
If you need your new datasets (public or private) to be persisted, maintained and regularly updated in the future, please contact the EpiGraphHub team, who will be happy to help :)
The following team members contributed to this lesson:
This work is licensed under the Creative Commons Attribution Share Alike license.