12 Content Analysis

The goal of this module is twofold:

  1. Page Ranking: identify pages containing or linking to environmental datasets
  2. Content Extraction: find metadata about the datasets such as contact information

To reach these goals, several approaches of differing complexity come to mind. A subset of them can be implemented depending on time constraints, and most of them could be combined for enhanced results. As our experience in this field is limited, we would first evaluate these approaches further before deciding on a solution.

Approaches for Page Ranking (sorted by complexity)

We can identify several types of pages of interest provided by the various data providers:

  • Dedicated Dataportal
  • Agency Homepage with data link (linking to datasets, PDF reports)
  • Agency Homepage with data mention (only contact information provided)
  • Agency Homepage with inline data (HTML tables, maps, diagrams)
  • Dataset (API, raw data, ...)
  • ...

A page rank should account for these different page types by providing a separate score for each type as well as a weighted total score.
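
For illustration, the weighted total could be computed along these lines; the page type names and weights below are placeholders, not agreed-upon values:

```java
import java.util.Map;

// Illustrative only: combine per-page-type scores into a weighted total.
// Page type names and weights are placeholders.
public class PageScore {

    private static final Map<String, Double> WEIGHTS = Map.of(
            "data.portal",  1.0,
            "data.link",    0.8,
            "data.inline",  0.6,
            "data.mention", 0.3);

    /** Weighted sum of the per-type scores; missing types count as 0. */
    public static double totalScore(Map<String, Double> typeScores) {
        return WEIGHTS.entrySet().stream()
                .mapToDouble(e -> e.getValue() * typeScores.getOrDefault(e.getKey(), 0.0))
                .sum();
    }
}
```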

  1. Find metadata in the HTML that identifies known and widely used data sources (see the sketch of approaches 1 and 2 after this list).

    • e.g. CKAN: <meta name="generator" content="ckan 2.0.1" />, TODO: more sources!
    • identify dataset URLs (WFS GetCapabilities, GeoJSON, ...)
    • detect PDF reports, HTML data tables?
  2. Simple keyword search in the text (or HTML) that suggests the presence of a dataset

    • RegEx search over a manually curated list of (translated) keywords
    • fuzzy string search (broader matches for keywords)
  3. Topic Modeling of a known corpus using NLP approaches such as Latent Dirichlet Allocation (LDA)

    • Compare the extracted list of topics/keywords for classification, analogous to approach 2.
    • The topic model is generated from a training corpus of known dataset-containing pages using LDA.
      • Requires one model per language
    • The NLP domain is unknown to us; here be dragons.
  4. Classify text with one or more tags ("probability of a dataset / link to a dataset / data portal on this page")

    • Just today, a deep learning approach was published (1, 2) which could make this task practical (high accuracy with a small training dataset)
      • As this approach uses transfer learning, only a small dataset (100 - 500 documents) has to be manually classified for training!
      • This requires a model per language; for now this is only feasible for English text.
    • This involves the application of (deep) machine learning, which is mostly unknown to us.
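
A minimal sketch of approaches 1 and 2, assuming jsoup is used for HTML parsing; the keywords, URL patterns, and the example URL are illustrative placeholders:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Illustrative heuristics for approaches 1 and 2; keywords and patterns are placeholders.
public class DatasetHeuristics {

    // Approach 1: link targets that look like datasets or service endpoints.
    private static final Pattern DATASET_URL = Pattern.compile(
            "(?i)(request=getcapabilities|\\.geojson|\\.csv|\\.json|\\.pdf)");

    // Approach 2: manually curated (and translated) keywords, here only a tiny sample.
    private static final Pattern KEYWORDS = Pattern.compile(
            "(?i)\\b(open data|dataset|download|messdaten|umweltdaten)\\b");

    // e.g. CKAN announces itself via <meta name="generator" content="ckan 2.0.1" />
    public static boolean isKnownDataPortal(Document doc) {
        Element generator = doc.selectFirst("meta[name=generator]");
        return generator != null
                && generator.attr("content").toLowerCase().startsWith("ckan");
    }

    public static long countDatasetLinks(Document doc) {
        return doc.select("a[href]").stream()
                .filter(a -> DATASET_URL.matcher(a.attr("abs:href")).find())
                .count();
    }

    public static long countKeywordHits(Document doc) {
        Matcher matcher = KEYWORDS.matcher(doc.text());
        long hits = 0;
        while (matcher.find()) hits++;
        return hits;
    }

    public static void main(String[] args) throws Exception {
        Document doc = Jsoup.connect("https://example.org").get(); // placeholder URL
        System.out.printf("portal=%b dataset links=%d keyword hits=%d%n",
                isKnownDataPortal(doc), countDatasetLinks(doc), countKeywordHits(doc));
    }
}
```

The counts (or booleans) produced by such checks could feed directly into the per-type scores described above.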

With these methods, multiple scores can be calculated for each website. These can be used for sorting and filtering during result exploration. Which of these methods is feasible for this project is TBD (see also this flowchart on NLP feasibility).

Approaches for Content Extraction

"Extractability" of metadata depends on the way it is encoded in an HTML document:

  1. encoded with semantic structure such as Microdata or <meta> tags
    • rather easy to extract, as the structure is known; tooling is available as well (1, XPath)
  2. as plain text (likely the common case, e.g. even in CKAN)
    • keyword/regex-based extraction; content will be free-form text, possibly containing irrelevant sections

A keyword/regex-based approach will quickly hit its limits, as no detailed content extraction is possible and it is highly language-dependent. Alternatively, content extraction using a machine learning approach such as Conditional Random Fields could be explored, though the usual caveat of "here be dragons" applies.
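
For illustration, a keyword/regex-based extraction of two of the fields listed below (contact and license) might look like this; the patterns are deliberately simple placeholders and demonstrate exactly the limitation described above:

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative regex-based extraction; the patterns are placeholders
// and will miss many real-world variants.
public class PlainTextExtractor {

    private static final Pattern EMAIL = Pattern.compile(
            "[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}");

    private static final Pattern LICENSE = Pattern.compile(
            "(?i)\\b(CC[- ]BY(?:[- ]SA)?(?:[- ]\\d\\.\\d)?|ODbL|Datenlizenz Deutschland|public domain)\\b");

    public static Optional<String> firstMatch(Pattern pattern, String text) {
        Matcher matcher = pattern.matcher(text);
        return matcher.find() ? Optional.of(matcher.group()) : Optional.empty();
    }

    public static void main(String[] args) {
        String text = "Kontakt: umweltamt@example.org. Data is published under CC-BY 4.0.";
        System.out.println(firstMatch(EMAIL, text).orElse("no contact found"));   // contact
        System.out.println(firstMatch(LICENSE, text).orElse("no license found")); // license
    }
}
```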

The list of metadata fields to be extracted includes (@Ruth, please review/extend this!):

  • Agency / Contact
  • License
  • Timestamp (or Realtime / Historic flag)
  • Spatial Extent

Implementation Proposal

We propose an implementation without NLP techniques, leveraging simpler methods such as XML parsing and fuzzy string matching based on predefined queries. On a high level, page content is classified into several topics and scored accordingly. Relevant content is extracted during topic classification and stored as metadata.

Page Ranking and Content Extraction are implemented via custom StormCrawler ParseFilters, which can extract information from page content and store it in a Metadata dictionary. The next step of the pipeline is a custom bolt that evaluates the extracted metadata and scores the page.
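
A rough skeleton of such a ParseFilter, based on our current reading of the StormCrawler API (exact signatures and the casing of element names in the parsed DOM may need adjusting); the XPath expression and metadata key are placeholders:

```java
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;

import org.w3c.dom.DocumentFragment;

import com.digitalpebble.stormcrawler.Metadata;
import com.digitalpebble.stormcrawler.parse.ParseFilter;
import com.digitalpebble.stormcrawler.parse.ParseResult;

// Sketch of a custom ParseFilter that writes an extracted value into the page metadata.
// The XPath expression and metadata key are placeholders.
public class GeneratorMetaFilter extends ParseFilter {

    private final XPath xpath = XPathFactory.newInstance().newXPath();

    @Override
    public boolean needsDOM() {
        return true; // we need the parsed DOM, not just the raw bytes
    }

    @Override
    public void filter(String url, byte[] content, DocumentFragment doc, ParseResult parse) {
        try {
            String generator = xpath.evaluate("//META[@name='generator']/@content", doc);
            if (generator != null && !generator.isEmpty()) {
                Metadata md = parse.get(url).getMetadata();
                md.setValue("analysis.generator", generator);
            }
        } catch (Exception e) {
            // ignore pages where the XPath evaluation fails; scoring happens downstream
        }
    }
}
```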

Both goals (Page Ranking + Content Extraction) can be achieved using the same filters:

  • An XPath filter, extracting content or metadata of known structure (meta tags, anchor refs, embedded data, ...)
  • A fuzzy string matcher using Lucene, searching the visible text for a list of keywords (see the sketch after this list). Whether the Lucene query score can be used as a result score directly is TBD.
  • (Microdata filter)
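
A minimal sketch of the fuzzy string matcher using Lucene's MemoryIndex; the keyword list is a placeholder, and whether the raw relevance score is usable directly is the open question noted above:

```java
import java.util.List;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.memory.MemoryIndex;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.FuzzyQuery;

// Scores a page's visible text against a keyword list using fuzzy matching.
// Keywords are placeholders; per-topic lists would be loaded from configuration.
public class FuzzyKeywordScorer {

    private final List<String> keywords;
    private final StandardAnalyzer analyzer = new StandardAnalyzer();

    public FuzzyKeywordScorer(List<String> keywords) {
        this.keywords = keywords;
    }

    public float score(String pageText) {
        MemoryIndex index = new MemoryIndex();
        index.addField("text", pageText, analyzer);

        BooleanQuery.Builder query = new BooleanQuery.Builder();
        for (String keyword : keywords) {
            // maxEdits = 2 allows small spelling variations / inflections
            query.add(new FuzzyQuery(new Term("text", keyword), 2), BooleanClause.Occur.SHOULD);
        }
        // MemoryIndex.search returns Lucene's relevance score for this single document
        return index.search(query.build());
    }

    public static void main(String[] args) {
        FuzzyKeywordScorer scorer = new FuzzyKeywordScorer(List.of("dataset", "download", "messdaten"));
        System.out.println(scorer.score("Datasets can be downloaded as CSV below."));
    }
}
```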

Whether Selenium can be used for enhanced page crawling (with interpreted JavaScript) is unclear, as StormCrawler can only use a single protocol implementation for all fetches, and Selenium seems to be very slow.

Page Ranking

There are various topics, each consisting of filter definitions which, when applied to a page, result in one score per topic. There are topics for the various types of pages of interest as well as for general data properties (a sketch of such a topic definition follows after the table below).

Most topic queries should be translated into various languages to apply to international pages. This can be done once and then persisted. Language detection of the page text can be omitted, as queries in foreign languages should yield no matches, and thus not distort the result.

topic name      filter types      content extraction   translation required
data.portal     xpath, keywords   ?                    n
data.link       xpath             y                    n
data.inline     xpath, keywords   ?                    y
data.realtime   keywords          n                    y
data.historic   keywords          ?                    y
data.license    xpath, keywords   y                    y
contact         xpath, keywords   y                    y
...
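
For illustration, the topic definitions listed above could be represented as simple value objects and persisted as configuration; field names and example values are tentative:

```java
import java.util.List;
import java.util.Map;

// Tentative shape of a topic definition as listed in the table above.
// Field names and example values are placeholders.
public record Topic(
        String name,
        List<String> xpathExpressions,
        Map<String, List<String>> keywordsByLanguage, // translated once, then persisted
        boolean extractContent,
        double weight) {

    public static final Topic DATA_LICENSE = new Topic(
            "data.license",
            List.of("//META[@name='dc.rights']/@content"),
            Map.of(
                    "en", List.of("license", "terms of use", "open data"),
                    "de", List.of("Lizenz", "Nutzungsbedingungen", "offene Daten")),
            true,  // license text is extracted as metadata
            0.5);  // contribution to the weighted total score
}
```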