# Crawler
Based on Apache Storm + Flux. Depends on a running Elasticsearch instance and a Python 3.6 installation.

The crawler is configured via environment variables; all variables in `.env` must be set.
To run the topology without a `storm` executable:

```sh
env $(cat .env | xargs) \
  mvn compile exec:java -Dexec.mainClass=org.apache.storm.flux.Flux -Dexec.args="\
  --local --sleep 99999999 --env-filter es-crawler.flux"
```
To build a bundle that can be run with `storm`:

```sh
mvn package
env $(cat .env | xargs) storm jar target/crawler-alpha.jar org.apache.storm.flux.Flux --local --sleep 99999999 --env-filter ./es-crawler.flux
```
## Installation
The crawler uses Storm's multilang support to run a classifier written in Python; a minimal sketch of such a multilang component follows the install command below. The resources for this classifier are located in `src/main/resources/resources` (due to classpath weirdness in the generated jar), with all dependencies vendored into that directory.
To reinstall these vendored dependencies, run:

```sh
export target=src/main/resources/resources; pip3 install -r $target/requirements.txt --target $target
```
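For orientation, a multilang component is an ordinary Python script that speaks Storm's multilang protocol over stdin/stdout, typically via the `storm.py` helper module that Apache Storm ships. The sketch below is purely illustrative and is not this project's classifier; the class name, tuple layout, and labeling logic are all made up.

```python
# Illustrative Storm multilang bolt (hypothetical; not the project's classifier).
# It relies on the storm.py helper that Apache Storm provides for multilang
# components, which would sit alongside this script in the resources directory.
import storm

class ClassifierBolt(storm.BasicBolt):
    def process(self, tup):
        # The tuple layout is defined by the upstream Java bolt; here we
        # assume the first value holds the text to classify.
        text = tup.values[0]
        label = "relevant" if "sensor" in text.lower() else "other"  # placeholder logic
        storm.emit([text, label])

# Enter the multilang loop: read tuples from stdin, write emissions to stdout.
ClassifierBolt().run()
```

Storm ships the `resources/` directory of the topology jar to the workers and launches multilang scripts from there, which is presumably why the classifier and its vendored dependencies live under `src/main/resources/resources` in this project.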