Architecture
Norwin edited this page 6 years ago
- microservices (but not going wild with it)
- backend is stream-based and operates continuously, so results are available as they come in
- abstraction of external APIs within controller via generic classes
- all persistence in search index
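
A minimal sketch of what "abstraction of external APIs via generic classes" could look like. All names here (`ExternalApi`, `ApiAbstraction`, `fakeTranslator`) are hypothetical and not taken from the project's code. A real provider would make a network call and return a `Promise`; the stub below is synchronous to keep the example self-contained.

```typescript
// Hypothetical names throughout; this is an illustration, not the project's code.
interface ExternalApi<Req, Res> {
  readonly name: string;
  call(request: Req): Res;
}

// Generic wrapper: callers depend on this class, not on a concrete provider.
class ApiAbstraction<Req, Res> {
  private provider: ExternalApi<Req, Res>;

  constructor(provider: ExternalApi<Req, Res>) {
    this.provider = provider;
  }

  // Swap the underlying provider without touching any caller.
  use(provider: ExternalApi<Req, Res>): void {
    this.provider = provider;
  }

  execute(request: Req): Res {
    return this.provider.call(request);
  }
}

interface TranslationRequest {
  text: string;
  targetLang: string;
}

// Stand-in provider: tags the text instead of calling a real translation service.
const fakeTranslator: ExternalApi<TranslationRequest, string> = {
  name: "fake-translator",
  call: (req) => `[${req.targetLang}] ${req.text}`,
};

const translation = new ApiAbstraction(fakeTranslator);
const translated = translation.execute({ text: "hello", targetLang: "de" });
```

The same wrapper could be instantiated with other request and response types for the search, notification, and export APIs.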
Crawler + Content Extractor: Apache Storm
- gets seed URLs by polling Elasticsearch (ES), then starts crawling them
- extracts content as part of crawl topology
- scores content as part of crawl topology
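
The data flow of the crawl topology (fetch, extract, score) can be sketched as a chain of generators. This is not Storm code: the real topology runs in Apache Storm, and the TypeScript below only illustrates how each page passes through all stages as soon as it is fetched. The type names, the tag-stripping extractor, and the word-count score are placeholders assumed for illustration.

```typescript
// Hypothetical shapes; the real topology runs in Apache Storm, not in TypeScript.
interface FetchedPage {
  url: string;
  html: string;
}

interface ScoredPage {
  url: string;
  text: string;
  score: number;
}

// Stage 1: emit fetched pages one at a time (stubbed, no network access).
function* fetchPages(seedUrls: string[]): Generator<FetchedPage> {
  for (const url of seedUrls) {
    yield { url, html: `<p>content of ${url}</p>` };
  }
}

// Stage 2: naive content extraction that strips tags and collapses whitespace.
function extractText(html: string): string {
  return html.replace(/<[^>]*>/g, " ").replace(/\s+/g, " ").trim();
}

// Stage 3: placeholder score, here simply the number of words.
function scoreText(text: string): number {
  return text === "" ? 0 : text.split(" ").length;
}

// Each page flows through extraction and scoring as soon as it is fetched,
// so a consumer sees results before the whole crawl has finished.
function* crawl(seedUrls: string[]): Generator<ScoredPage> {
  for (const page of fetchPages(seedUrls)) {
    const text = extractText(page.html);
    yield { url: page.url, text, score: scoreText(text) };
  }
}

const scored = Array.from(crawl(["http://example.org/a"]));
```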
Controller: TypeScript + Express + Swagger
- serves UI, proxies ES
- translation API abstraction
- search API abstraction
- notification API abstraction
- result export API abstraction
- inserts seed URLs into ES to start crawl
- purges status indices to stop crawl once condition X is met
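
A sketch of how starting and stopping a crawl could map onto Elasticsearch REST calls. It only builds request descriptions and sends nothing. The index naming scheme (`status-<jobId>`), the document fields, and the helper names are assumptions. The stop condition is left as a plain boolean because the notes do not say what condition X is.

```typescript
// Describes an HTTP request against Elasticsearch without sending it.
interface EsRequest {
  method: "POST" | "DELETE";
  path: string;
  body?: { url: string; status: string };
}

// Assumed naming scheme: one status index per crawl job.
function statusIndex(jobId: string): string {
  return `status-${jobId}`;
}

// Starting a crawl: index a seed URL document into the job's status index.
// The field names and the "DISCOVERED" value are placeholders.
function seedRequest(jobId: string, url: string): EsRequest {
  return {
    method: "POST",
    path: `/${statusIndex(jobId)}/_doc`,
    body: { url, status: "DISCOVERED" },
  };
}

// Stopping a crawl: delete the job's whole status index.
function purgeRequest(jobId: string): EsRequest {
  return { method: "DELETE", path: `/${statusIndex(jobId)}` };
}

// The caller decides whether the (unspecified) stop condition is met.
function stopIfDone(jobId: string, conditionMet: boolean): EsRequest | null {
  return conditionMet ? purgeRequest(jobId) : null;
}

const start = seedRequest("job1", "http://example.org/");
const stop = stopIfDone("job1", true);
```

In the controller these descriptions would be handed to whatever HTTP or Elasticsearch client is in use.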
Elasticsearch: persistence
- "results" index with fetched & scored URLs
- "status" index for the recursive crawl: one per crawl job, so that jobs can be stopped independently
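
To illustrate how the "results" index might be read, here is a sketch of a document shape and a query body for the highest-scored URLs. The field names (`url`, `score`, `fetchedAt`) and the helper `topResultsQuery` are assumptions; the notes only say that the index holds fetched and scored URLs. The query uses the standard Elasticsearch `range` query and `sort` clause.

```typescript
// Assumed document shape for the "results" index.
interface ResultDocument {
  url: string;
  score: number;
  fetchedAt: string; // ISO 8601 timestamp
}

// Example document, for illustration only.
const exampleResult: ResultDocument = {
  url: "http://example.org/a",
  score: 0.8,
  fetchedAt: "2019-01-01T00:00:00Z",
};

// Builds a search body for the best-scored results at or above minScore.
function topResultsQuery(minScore: number, size: number): object {
  return {
    size,
    query: { range: { score: { gte: minScore } } },
    sort: [{ score: { order: "desc" } }],
  };
}

const queryBody = JSON.stringify(topResultsQuery(0.5, 10));
```

Such a body would be sent to the `_search` endpoint of the "results" index, for example through the controller's ES proxy.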