mirror of https://github.com/52North/ecmwf-dataset-crawl synced 2025-03-12 16:00:54 +01:00

# Crawler

Based on Apache Storm + Flux. Requires a running Elasticsearch instance and a Python 3.6 installation. The crawler is configured via environment variables; all variables in `.env` must be set.
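The actual variable names are defined in the repository's `.env` file; purely as a hypothetical sketch of the format (these names are placeholders, not the real configuration keys):

```shell
# Hypothetical .env sketch -- one KEY=VALUE per line, no spaces around '='.
# The real keys live in the project's .env file.
ELASTICSEARCH_URL=http://localhost:9200
PROXY_URL=
```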

To run the topology without a `storm` executable:

```shell
env $(cat .env | xargs) \
  mvn compile exec:java -Dexec.mainClass=org.apache.storm.flux.Flux -Dexec.args="\
  --local --sleep 99999999 --env-filter es-crawler.flux"
```
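The `env $(cat .env | xargs)` prefix expands each `KEY=VALUE` line of `.env` into the child process environment. A small demonstration with a throwaway file and made-up variable names:

```shell
# Write a throwaway .env-style file (FOO_SETTING/BAR_SETTING are made up).
printf 'FOO_SETTING=hello\nBAR_SETTING=world\n' > /tmp/demo.env

# xargs joins the lines into KEY=VALUE arguments for env, which then
# runs the given command with those variables set.
env $(cat /tmp/demo.env | xargs) sh -c 'echo "$FOO_SETTING $BAR_SETTING"'
# prints: hello world
```

Note this simple expansion assumes values contain no spaces or shell metacharacters.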

To build a bundle that can be run with `storm`:

```shell
mvn package
env $(cat .env | xargs) storm jar target/crawler-alpha.jar org.apache.storm.flux.Flux --local --sleep 99999999 --env-filter ./es-crawler.flux
```

## Installation

The crawler makes use of Storm's multilang support to run a classifier written in Python. The resources for this classifier are located in `src/main/resources/resources` (due to classpath weirdness in the generated jar), with all dependencies vendored in that directory. To reinstall these dependencies, run

```shell
export target=src/main/resources/resources; pip3 install -r $target/requirements.txt --target $target
```
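`pip3 install --target` drops packages flat into the given directory instead of `site-packages`, and Python prepends a script's own directory to `sys.path`, so code sitting next to the vendored packages can import them directly. A minimal sketch of that mechanism, using made-up file names (`vendored_dep.py`, `classify.py`) in place of the real requirements and classifier:

```shell
# Simulate a vendor directory: a fake dependency module standing in for
# what 'pip3 install --target' would place there.
vendor=$(mktemp -d)
echo 'ANSWER = 42' > "$vendor/vendored_dep.py"

# A script in the same directory imports the vendored module without any
# site-packages installation, because its own directory is on sys.path.
echo 'import vendored_dep; print(vendored_dep.ANSWER)' > "$vendor/classify.py"
python3 "$vendor/classify.py"
# prints: 42
```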