Tech Stack Research
Norwin edited this page 6 years ago
Translation APIs
- Azure Text Translator Free Tier
  - free up to 2 million characters, should be more than enough
- EU eTranslation Service
  - accessible to EU administrations + CEF members only
  - mostly trained on EU-internal documents, so best suited for policy documents :(
- Yandex.Translate
  - $15 per 1 million characters
- Google Translation
  - $20 per 1 million characters
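
To compare the options for a given translation volume, a rough estimate along these lines may help. This is only a sketch: the prices and the 2 million character free allowance are the figures from the list above, while the function names and the example volume of 500,000 characters are made up for illustration. The notes do not say which period the Azure allowance covers, so the check below treats it as a single budget.

```python
# Prices taken from the list above (USD per 1 million characters).
PRICE_USD_PER_MILLION = {
    "Yandex.Translate": 15.0,
    "Google Translation": 20.0,
}

# Free allowance of the Azure Text Translator free tier, as noted above.
AZURE_FREE_CHARACTERS = 2_000_000


def paid_cost_usd(characters: int, provider: str) -> float:
    """Cost of translating `characters` characters with a paid provider."""
    return characters / 1_000_000 * PRICE_USD_PER_MILLION[provider]


def fits_azure_free_tier(characters: int) -> bool:
    """True if the volume stays within the free allowance."""
    return characters <= AZURE_FREE_CHARACTERS


if __name__ == "__main__":
    volume = 500_000  # hypothetical volume, not a measured number
    for provider in PRICE_USD_PER_MILLION:
        print(f"{provider}: ${paid_cost_usd(volume, provider):.2f}")
    print("fits Azure free tier:", fits_azure_free_tier(volume))
```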
Search APIs
This is important to get right, as the initial search defines the result set of the crawl. Search could become costly: the expected volume is 20 queries per crawl request and language.
- Google Custom Search
  - 100 requests / day free, then $5 / 1000 requests
  - 👍 allows result localization, emphasis
- Azure Bing Search S2 Tier
  - €2.53 / 1000 requests
  - 👍 allows result localization, emphasis
- faroo
  - free
  - 👎 bad results?
- chatnoir
  - free? API key on request
  - based on CommonCrawl data
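
To get a feeling for how costly search could become, the expected volume can be turned into a rough daily estimate. The sketch below uses the 20 queries per crawl request and language and the prices listed above. The function names, the example load of 10 crawl requests in 5 languages, and the assumption that requests beyond Google's free quota are billed pro rata are illustrative only.

```python
# Expected search volume stated above.
QUERIES_PER_CRAWL_AND_LANGUAGE = 20


def search_queries(crawl_requests: int, languages: int) -> int:
    """Total search queries for a number of crawl requests and languages."""
    return crawl_requests * languages * QUERIES_PER_CRAWL_AND_LANGUAGE


def google_custom_search_cost_usd(queries_per_day: int) -> float:
    """100 requests / day free, then $5 / 1000 requests.

    Assumption: requests beyond the free quota are billed pro rata.
    """
    billable = max(0, queries_per_day - 100)
    return billable * 5.0 / 1000


def bing_s2_cost_eur(queries: int) -> float:
    """Azure Bing Search S2 Tier: EUR 2.53 / 1000 requests."""
    return queries * 2.53 / 1000


if __name__ == "__main__":
    # Hypothetical load: 10 crawl requests per day, 5 languages each.
    queries = search_queries(crawl_requests=10, languages=5)
    print(f"queries per day: {queries}")
    print(f"Google Custom Search: ${google_custom_search_cost_usd(queries):.2f} / day")
    print(f"Azure Bing Search S2: EUR {bing_s2_cost_eur(queries):.2f} / day")
```

Note that the two prices are in different currencies, so the results are not directly comparable without a conversion rate.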
Crawling
-
  - 👍 feature-complete web crawling application
  - 👎 operates in batch mode, slow
  - 👎 old, community rather inactive
-
  - web crawling SDK based on Apache Storm
  - 👍 stream-based, very efficient, results available as they come in
  - 👎 not as feature-complete, more work required, but probably friendlier
Indexing
- Elasticsearch
- Solr
Both work well with both crawlers; we have more know-how with Elasticsearch.
Content Analysis
The exact approach - and thus the tooling - is TBD.
- Apache Tika: content identification, extraction tool + SDK
- MALLET: Java package for statistical NLP, document classification, clustering, topic modeling, information extraction
- Apache OpenNLP: NLP toolkit
UI / Result Presentation
- Vue.js 2?
- Views:
  - "Launch Crawl"
  - "View Crawls (completed/in progress)"
  - "Search Results"
Deployment
- everything dockerized
- orchestration with Docker Compose for now?