1 - Overview

3 minute read

The posts are based on the class CS410 Text information system. They will cover mostly on search engines, because search engines is widely used, and it supports for both querying and recommendation.

Text Retrieval slide from CS410

1. Natural Language Content Analysis

To use text data, computers need to understand natural language to some extent. Natural language content analysis is to transform raw text data into meaningful representations that computers can better use of them. Therefore, content analysis comes as first step for further text data mining or retrieval. It generally includes:

Figure from Text Data Management and Analysis

  • Lexical analysis: Find out the basic meaningful units such as word in text. For instance, we can tag boy as a noun.
  • Syntactic analysis: Determine relationships between words in a sentence, the syntactic structure of a sentence. a boy is a noun phrase.
  • Semantic analysis: Determine the meaning of a sentence. It is to map noun phrases to entities and verb phrases to relations. In the example, a dog is noun entity and chasing is a relation.
  • Pragmatic analysis: Find out the meaning in context. Deeper understanding of words than semantic analysis.
  • Discourse analysis: Learn connections between sentences in the context

2. Text Access

Text Access is to provide users the right information by retrieving most relevant text data to a particular topic, excluding unnecessary information, and interpreting knowledge in appropriate context. There are two ways of accessing text:

  • Pull(search engines): User takes the initiative and find for relevant text data. Ad hoc information need.
    • Querying: User accesses text data by entering keywords, and system returns relevant document. Works well when the user queries well.
    • Browsing: User navigates into relevant information by following defined structures. Works well when users do not have much information.
  • Push(recommender systems): System takes the initiative and good knowledge need about users

3. Text Retrieval

Text retrieval is a task of returning relevant documents for queries. So, there are collection of text documents, and user gives a query to get information among document. Then search engine system retrieves relevant documents to the query. It is also called as ‘search technology’. Text retrieval is different from database retrieval from several points:

  • Information: Information might be difficult to interpret, having some ambiguity.
  • Query: Unlike structured query language, query is usually short and incomplete.
  • Answers: While precise query retrieves correct answers, ambiguous query retrieves relevant documents.

Due to its ambiguity, TR relies on empirical evaluation, which includes users.

There are mainly two ways of deciding set of relevant documents:

  • Document selection: System decides whether a document is relevant or not(absolute relevance).
  • Document ranking: System decides if one document is more likely relevant than another(relative relevance).

Text Retrieval slide from CS410 1.3 Text Retrieval Problem

Based on the probability ranking principle by Robertson in 1997, the utility of a document to user is independent of the utility of any other document, and a user will browse the results sequentially. That is, descending order of relevant documents, ranking, comes out to be an optimal strategy.

4. Text Retrieval Methods

Here we introduce some widely used retrieval methods: vector space models and probabilistic retrieval models.

Notations are:

Query: \(q = q_1, ..., q_m, \; where \;q_i \ ∈ V\)

Document: \(d = d_1, ..., d_m, \; where \;d_i \ ∈ V ​\)

Ranking function: \(f(q,d)\)

Vector space model assumes the relevance between a document and a query can be measured by its similarity. Suppose a high dimensional space, and each dimension corresponds with a term. So, each query or document can be represented in this vector space, and the similarity between vectors show the relevance between queries and documents.

Basic probabilistic retrieval model, on the other hand, defines the ranking function based on the probability of relevance given a document and query.

\[f(d,q) = p(R = 1 | d,1), \;\; where \; R ∈ {0,1} ​\]

Probabilistic retrieval models have:

  • Classical probabilistic model
  • Language model
  • Divergence from randomness model

Each models and further implementations will be discovered in later posts.

Reference

  • CS410 Text Information System by Professor ChengXiang Zhai
  • Zhai, C., & Massung, S. (2016). Text data management and analysis: a practical introduction to information retrieval and text mining. Morgan & Claypool.
  • Robertson, S. E. (1977). The probability ranking principle in IR. Journal of documentation, 33(4), 294-304.

Categories:

Updated:

Leave a comment