
Usage

General usage options

run-pipeline [-h] [--first-month FIRST_MONTH] [--last-month LAST_MONTH]
                    [--meilisearch-url MEILISEARCH_URL] [--in-dir IN_DIR]
                    [--output-dir OUTPUT_DIR] [--graph-dir GRAPH_DIR] [--local-scheduler]
 
options:
  -h, --help            show this help message and exit
  --first-month FIRST_MONTH
                        The first month to process. Defaults to '2017-01'.
  --last-month LAST_MONTH
                        The last month to process. Defaults to the last month.
  --meilisearch-url MEILISEARCH_URL
                        The URL of the Meilisearch server. Defaults to
                        'http://localhost:7700'
  --in-dir IN_DIR       The directory to store the TED XMLs. Defaults to '/tmp/ted_notices'
  --output-dir OUTPUT_DIR
                        The directory to store the output data. Defaults to '/tmp/output'
  --graph-dir GRAPH_DIR
                        The name of the KuzuDB graph. Defaults to '/tmp/graph'
  --local-scheduler     Use the local scheduler.
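
As a rough illustration of how the options above combine, the following sketch spells out every documented default explicitly and narrows the run to a three-month window. The flags and default values are taken from the help text; the shell variable names and the chosen last month are only illustrative, and the script assumes `run-server` is already running in another window.

```shell
#!/bin/sh
# Sketch: run the pipeline for 2017-01 through 2017-03 with explicit settings.
# Variable names are illustrative; flags and defaults come from the help text.
FIRST_MONTH="2017-01"
LAST_MONTH="2017-03"
MEILISEARCH_URL="http://localhost:7700"
IN_DIR="/tmp/ted_notices"
OUTPUT_DIR="/tmp/output"
GRAPH_DIR="/tmp/graph"

# Build the argument list once so it can be inspected before running.
set -- run-pipeline \
  --first-month "$FIRST_MONTH" \
  --last-month "$LAST_MONTH" \
  --meilisearch-url "$MEILISEARCH_URL" \
  --in-dir "$IN_DIR" \
  --output-dir "$OUTPUT_DIR" \
  --graph-dir "$GRAPH_DIR"

PIPELINE_CMD="$*"
echo "$PIPELINE_CMD"

# Execute only if run-pipeline is installed; otherwise just report it.
if command -v run-pipeline >/dev/null 2>&1; then
  "$@"
else
  echo "run-pipeline not found on PATH; printed the command only" >&2
fi
```

Any option left out falls back to the default listed in the help text, so in practice only the values that differ from the defaults need to be passed.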

Using the PyPI package

After installation, you should be able to run both the Luigi scheduler and the pipeline:

run-server
# In a different window
run-pipeline

You can optionally also run a Meilisearch instance so that the search indexes can be built. Meilisearch is NOT provided with the PyPI package; you can install it using your favourite package manager. Installing it is recommended if you plan to use the parsed data with the TEDective API.
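
Before starting the pipeline, it can help to confirm that Meilisearch is actually reachable. The sketch below assumes the instance listens on the pipeline's default URL (`http://localhost:7700`) and uses Meilisearch's `/health` endpoint; it requires `curl`, and the variable names are illustrative.

```shell
#!/bin/sh
# Sketch: check that a Meilisearch instance answers before running the pipeline.
# Assumption: Meilisearch listens on the default URL used by run-pipeline.
MEILISEARCH_URL="http://localhost:7700"

# Drop a trailing slash (if any) before appending the health endpoint.
HEALTH_URL="${MEILISEARCH_URL%/}/health"

# -f: fail on HTTP errors, -sS: quiet but still report errors,
# --max-time: give up after 5 seconds instead of hanging.
if curl -fsS --max-time 5 "$HEALTH_URL" >/dev/null 2>&1; then
  echo "Meilisearch is reachable at $MEILISEARCH_URL"
else
  echo "Meilisearch is not reachable at $MEILISEARCH_URL" >&2
fi
```

If the instance runs elsewhere, change `MEILISEARCH_URL` here and pass the same value to `run-pipeline` via `--meilisearch-url`.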

Using Nix

# The Nix build creates a result folder; inside it you will find these scripts.
# To see which arguments the script accepts, run:
result/bin/run-pipeline --help
 
# IMPORTANT: As noted above, the ETL has two parts. This starts Luigi so that the pipeline can run:
result/bin/run-server
 
# For development, we suggest using the --last-month flag to limit the amount of data so that setup is quick.
# You can also set --first-month if you want a specific time window. By default, the first month is 2017-01.
run-pipeline --last-month 2017-02

In this case you can also run Meilisearch to build the search indexes. This can be done inside the devenv; see the contributing section for details.

Manually (using Poetry)

Running the pipeline requires a running Luigi daemon. It is included in the project, and you can start it with the following command:

poetry run run-server
# Then run the pipeline itself in a different window
poetry run run-pipeline

It is recommended to run Meilisearch as well. With this method, you have to install it manually.