Usage
General usage options
run-pipeline [-h] [--first-month FIRST_MONTH] [--last-month LAST_MONTH]
[--meilisearch-url MEILISEARCH_URL] [--in-dir IN_DIR]
[--output-dir OUTPUT_DIR] [--graph-dir GRAPH_DIR] [--local-scheduler]
options:
-h, --help show this help message and exit
--first-month FIRST_MONTH
The first month to process. Defaults to '2017-01'.
--last-month LAST_MONTH
The last month to process. Defaults to the last month.
--meilisearch-url MEILISEARCH_URL
The URL of the Meilisearch server. Defaults to
'http://localhost:7700'
--in-dir IN_DIR The directory to store the TED XMLs. Defaults to '/tmp/ted_notices'
--output-dir OUTPUT_DIR
The directory to store the output data. Defaults to '/tmp/output'
--graph-dir GRAPH_DIR
The name of the KuzuDB graph. Defaults to '/tmp/graph'
--local-scheduler Use the local scheduler.
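As an illustration of how these options combine, here is a minimal Python sketch that assembles a run-pipeline command line for a given month window. The helper name build_pipeline_command is hypothetical and not part of the package. Only the flag names come from the help text above, and the YYYY-MM check is an assumption based on the documented default '2017-01'.

```python
import re
import shlex

# Assumption: months are written as YYYY-MM, like the documented default '2017-01'.
MONTH_PATTERN = re.compile(r"\d{4}-(0[1-9]|1[0-2])")


def build_pipeline_command(first_month=None, last_month=None, local_scheduler=False):
    """Assemble an argument list for run-pipeline (hypothetical helper)."""
    command = ["run-pipeline"]
    for flag, month in (("--first-month", first_month), ("--last-month", last_month)):
        if month is None:
            continue  # omit the flag so the pipeline's own default applies
        if not MONTH_PATTERN.fullmatch(month):
            raise ValueError(f"expected YYYY-MM, got {month!r}")
        command += [flag, month]
    # Zero-padded YYYY-MM strings sort chronologically, so string comparison is enough.
    if first_month and last_month and first_month > last_month:
        raise ValueError("first month must not be after last month")
    if local_scheduler:
        command.append("--local-scheduler")
    return command


if __name__ == "__main__":
    # Print the command for a two-month window instead of running it.
    print(shlex.join(build_pipeline_command("2017-01", "2017-02")))
```

The returned list can be passed to subprocess.run as is. Leaving a month unset keeps the corresponding default described in the help text.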
Using the PyPI package
After installation, you should be able to run both the Luigi scheduler and the pipeline:
run-server
# In a different window
run-pipeline
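Because the pipeline needs the scheduler to be up first, a script can replace the manual "different window" step by waiting until the scheduler accepts connections. The following is a minimal sketch with a hypothetical helper, wait_for_port. It assumes that run-server listens on localhost:8082, which is Luigi's default scheduler port. The documentation above does not state which port run-server actually uses.

```python
import socket
import time


def wait_for_port(host, port, timeout=30.0, interval=0.5):
    """Return True once host:port accepts TCP connections, False on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

Calling wait_for_port("localhost", 8082) before starting run-pipeline would, under the port assumption above, report whether the scheduler is reachable. It only checks that the port is open, not that the process behind it is Luigi.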
You can also run a Meilisearch instance so that the search indexes can be built.
It is NOT provided with the PyPI package; you can install it using your favourite package manager. Installing it is recommended if you plan to use the parsed data with the TEDective API.
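Before starting the pipeline, it can be useful to confirm that Meilisearch is actually reachable. The sketch below queries Meilisearch's /health endpoint at the default URL from the help text, 'http://localhost:7700'. The function name meilisearch_is_available is hypothetical, and the pipeline itself may perform its own checks.

```python
import http.client
import json
import urllib.error
import urllib.request


def meilisearch_is_available(base_url="http://localhost:7700", timeout=2.0):
    """Return True if GET <base_url>/health reports status 'available'."""
    url = base_url.rstrip("/") + "/health"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            payload = json.load(response)
    except (urllib.error.URLError, http.client.HTTPException, OSError, ValueError):
        # Unreachable server, malformed URL, or a non-JSON response.
        return False
    return isinstance(payload, dict) and payload.get("status") == "available"
```

If the check fails, the value passed to --meilisearch-url is worth verifying first.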
Using Nix
# The Nix build creates a result folder; inside it you will find these scripts.
# To see the arguments the script accepts, run:
result/bin/run-pipeline --help
# IMPORTANT: As mentioned above, the ETL has two parts. This starts the Luigi scheduler so the pipeline can run:
result/bin/run-server
# For development, we suggest using the --last-month flag to keep the run short.
# You can also set --first-month if you want a specific time window of data.
# By default, the first month is 2017-01.
run-pipeline --last-month 2017-02
In this case you can also run Meilisearch to build the search indexes. This can be done inside the devenv; see the contributing section for details.
Manually (using poetry)
Running the pipeline requires a running Luigi daemon. It is included in the project, and you can start it with the following command:
poetry run run-server
# And the pipeline itself in a different window
poetry run run-pipeline
Running Meilisearch as well is recommended. With this method, you have to install it manually.