mirror of https://github.com/52North/ecmwf-dataset-crawl synced 2026-04-11 12:47:03 +02:00

stream webcrawler identifying environmental datasets


EHJ-52n 8efb21eb3c Update README.md		2024-09-20 17:05:15 +02:00
controller	show query score in result list	2018-09-14 11:12:19 +02:00
crawler	add project description to README	2018-09-18 12:52:31 +02:00
elasticsearch	update to ES 6.4.0	2018-08-27 17:21:43 +02:00
frontend	show query score in result list	2018-09-14 11:12:19 +02:00
kibana	update to ES 6.4.0	2018-08-27 17:21:43 +02:00
proxy	Revert "block kibana routes with write access in proxy"	2018-09-06 14:59:10 +02:00
.env	document api key setup, closes #23	2018-06-13 14:06:20 +02:00
docker-compose.yml	fixes, clean local working state of repo	2018-06-27 16:05:54 +02:00
LICENSE	Initial commit	2018-05-11 12:28:35 +02:00
Makefile	add makefile #19	2018-08-31 12:15:34 +02:00
README.md	Update README.md	2024-09-20 17:05:15 +02:00

README.md

ARCHIVED

This project is no longer maintained and will not receive any further updates. If you plan to continue using it, please be aware that future security issues will not be addressed.

ecmwf-dataset-crawl

A webcrawler for (hydrological) datasets. Developed as part of the ECMWF Summer of Weather Code 2018.

Within the project "Web Crawler for hydrological Data" we have developed a web crawling solution for multilingual discovery of environmental data sets. The discovered pages can help to add new data sources to global predictive weather forecasting models.

The application offers a specialised web search engine, which can be tasked to discover websites containing data sets based on keywords and countries. Keywords for each task are automatically translated into the languages of the desired countries to support multilingual discovery. A custom-trained machine learning model classifies each discovered web page's content by its probability of linking to data. Relevant content such as contact information, data license and direct links is extracted and indexed for faster access.

A web-based user interface offers a list of pages with their extracted content, sorted by relevance. These results can be filtered with a full-text search or on metadata such as content language or classification label. Each result can be manually classified into categories, to help in training new models for the machine learning classifier. The interface furthermore offers usability features such as direct links to translated pages and search queries. The different keywords can be compared in a visualization of the crawler's performance metrics.

Design notes & more information can be found in the wiki.

run (docker-compose)

You can also have a look in the wiki for more hints.

# get API keys for Google Custom Search and Azure Text Translator
# and insert them into configuration via environment vars.
# each required VAR is documented in the file.
vi .env

# start all the services
docker-compose up --build --force-recreate -d

# stop the services
docker-compose stop

# stop the services DELETING ALL DATA
docker-compose down --volumes
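Before starting the services, it can help to confirm that no variable in .env was left empty. The following is a minimal sketch, assuming .env consists of plain KEY=value lines; the function name check_env_file is hypothetical and not part of this project, and quoted empty values such as KEY="" are not detected.

```shell
# check_env_file FILE
# Prints the names of variables whose value is empty and returns 1 if any
# are found, 0 otherwise. Assumes simple KEY=value lines; comment lines (#)
# and blank lines never match the pattern and are therefore skipped.
check_env_file() {
    file="$1"
    # Match "KEY=" followed only by optional whitespace, keep the KEY part.
    missing=$(grep -E '^[A-Za-z_][A-Za-z0-9_]*=[[:space:]]*$' "$file" | cut -d= -f1 || true)
    if [ -n "$missing" ]; then
        printf 'empty variables in %s:\n%s\n' "$file" "$missing" >&2
        return 1
    fi
    return 0
}
```

It could be chained in front of the start command, for example: check_env_file .env && docker-compose up --build --force-recreate -d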

To configure Kibana visualizations:

1. Set action.auto_create_index to true in elasticsearch/config/elasticsearch.yml and restart Elasticsearch with docker-compose restart elasticsearch.
2. Visit http://localhost/kibana/app/kibana#/management/objects and click "Import".
3. Select kibana/saved_objects.json from this project's directory.
4. Mark any of the index patterns as "favorite" (star button).
5. Reset the Elasticsearch configuration and restart it again.
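The first and last steps above (changing the setting, then resetting it) could be scripted. This is only a sketch under assumptions: the helper name set_es_option is hypothetical, and it assumes the setting is written as a top-level "key: value" line in elasticsearch.yml, which may not match how the file in this repository is laid out.

```shell
# set_es_option FILE KEY VALUE
# Writes a top-level "KEY: VALUE" line to FILE, replacing an existing line
# that starts with "KEY:" if there is one. A copy of the original file is
# kept as FILE.bak so the change can be undone afterwards.
set_es_option() {
    file="$1"; key="$2"; value="$3"
    cp "$file" "$file.bak"
    # Keep every line that does not start with "KEY:" ...
    awk -v k="$key" 'index($0, k ":") != 1' "$file.bak" > "$file"
    # ... then append the new setting.
    printf '%s: %s\n' "$key" "$value" >> "$file"
}
```

With this helper, the first step would be set_es_option elasticsearch/config/elasticsearch.yml action.auto_create_index true followed by docker-compose restart elasticsearch. After the import, mv elasticsearch/config/elasticsearch.yml.bak elasticsearch/config/elasticsearch.yml and another restart would restore the previous configuration.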

dev

For information about the development environment, look at the readme of each component.

Licensed under the Apache License, Version 2.0.