Prof. Dr. Uwe Quasthoff, Universität Leipzig, Germany
Prof. Uwe Quasthoff works in the Natural Language Processing Group at the Department of Computer Science at Leipzig University in Germany. His main research topics are language-independent methods in Natural Language Processing, building very large corpora, and the structure of natural language. His research method is the analysis of large text corpora with statistical and pattern-based as well as machine learning methods. The Leipzig Corpora Collection (http://corpora.uni-leipzig.de/) started in 1995 and now contains pre-processed text collections and monolingual dictionaries in more than 250 languages. Because the approach is language independent, the algorithms for further processing usually apply to a large group of languages. The analysis of word co-occurrence patterns was the starting point for machine learning used for semantic similarities at different granularities.
Speech Title: Corpora as a resource for IR
Knowledge about words is helpful for IR. Knowledge about single words, such as word frequencies, and about replacement candidates, such as inflected forms or synonyms, is used heavily. Syntactic and semantic relations between consecutive words are of interest for text understanding. POS tagging and syntactic parsing are the basis for a deeper semantic analysis with statistical methods.
The talk will give an overview of the complete pipeline of corpus building and exploration: crawling and preprocessing (language identification, sentence segmentation, tokenization, de-duplication, POS tagging, etc.), word co-occurrences and semantic similarities using word embeddings, and word and text classification problems.
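As a rough illustration of a few of these pipeline steps, the following minimal Python sketch runs sentence segmentation, tokenization, de-duplication, and sentence-level co-occurrence counting on a toy text. It is not the Leipzig Corpora Collection tooling: the function names, the regular-expression segmenter, and the sample text are all assumptions made for this example, and real pipelines use language-specific rules and much larger data.

```python
import re
from collections import Counter
from itertools import combinations


def split_sentences(text):
    # Naive segmentation (assumption): split after ., ! or ? followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(sentence):
    # Lowercased word tokens; punctuation is dropped.
    return re.findall(r"\w+", sentence.lower())


def deduplicate(sentences):
    # Keep the first occurrence of each sentence, preserving order.
    return list(dict.fromkeys(sentences))


def sentence_cooccurrences(sentences):
    # Count how often two distinct word types appear in the same sentence.
    counts = Counter()
    for sentence in sentences:
        types = sorted(set(tokenize(sentence)))
        counts.update(combinations(types, 2))
    return counts


# Toy input (hypothetical); the second sentence is an exact duplicate.
text = "Corpora help retrieval. Corpora help retrieval. Large corpora help linguists."
sentences = deduplicate(split_sentences(text))
cooc = sentence_cooccurrences(sentences)

for pair, count in cooc.most_common(3):
    print(pair, count)
```

Raw counts like these are only a starting point; in practice a significance measure is usually applied so that frequent function words do not dominate the co-occurrence lists.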
As an approach to relation extraction and sentence understanding, so-called typical sentences are used: sentences of simple syntactic structure that are found repeatedly with rich lexical variability. The syntactic structures are selected by high frequency, and with large corpora they allow word similarities to be used to cluster such sentences into basic statements.
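The selection step can be sketched as follows, assuming a sentence's syntactic structure is approximated by its sequence of POS tags. The tiny lexicon `TOY_POS`, the function names, and the threshold `min_count` are hypothetical stand-ins for a real POS tagger and a corpus-scale frequency cutoff; the subsequent clustering by word similarity is not shown.

```python
from collections import defaultdict

# Hypothetical toy lexicon standing in for a real POS tagger.
TOY_POS = {
    "dogs": "NOUN", "cats": "NOUN", "birds": "NOUN",
    "mice": "NOUN", "seeds": "NOUN",
    "chase": "VERB", "hunt": "VERB", "eat": "VERB",
    "often": "ADV",
}


def pos_pattern(sentence):
    # Map each token to its tag; unknown words get the placeholder tag "X".
    tokens = sentence.lower().rstrip(".").split()
    return tuple(TOY_POS.get(token, "X") for token in tokens)


def typical_sentences(sentences, min_count=2):
    # Group sentences by POS pattern and keep only the frequent patterns.
    groups = defaultdict(list)
    for sentence in sentences:
        groups[pos_pattern(sentence)].append(sentence)
    return {p: g for p, g in groups.items() if len(g) >= min_count}


sentences = [
    "Dogs chase cats.",
    "Cats hunt mice.",
    "Birds eat seeds.",
    "Cats often hunt mice.",
]
typical = typical_sentences(sentences)

for pattern, group in typical.items():
    print(pattern, group)
```

Here the three NOUN VERB NOUN sentences share one frequent pattern with varying words, which is the kind of group that word similarities could then cluster into basic statements.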
More information will be released soon...