Mark Steedman

Treebanking in the Language of Thought

There has recently been some interest among computational linguists in the task of inducing grammar-based "semantic parsers" from sets of paired strings and meaning representations, following pioneering work by Zettlemoyer and Collins (2005). Work of this kind is currently limited by the paucity of datasets for training.
The talk reviews the state of the art in this field, then proposes a way to semi-automatically generate much larger language-independent datasets, on the same order of magnitude as syntactic treebanks, using linguistic knowledge that has only recently begun to become available. The intended use of these datasets is to induce semantic parsers for under-resourced languages, for application in statistical machine translation.

Nianwen Xue

Treebanking Chinese text: what it is like

The Chinese TreeBank (CTB) has been in development for over a decade, and as of this talk it has about 1.4M words fully segmented, POS-tagged, and syntactically bracketed. It is currently being expanded to informal genres such as on-line discussion forums under the DARPA BOLT Program.
In this talk, I will provide an overview of the annotation standards for the CTB and our annotation procedure. In particular, I will discuss how our revised annotation procedure enlarges the annotator pool and makes it possible to scale up our annotation efforts. I will also discuss some of the challenges in developing this corpus, resulting from some salient linguistic characteristics of the Chinese language. These linguistic characteristics include the lack of reliable sentence and word boundaries, the scarcity of formal morpho-syntactic cues, and pervasive dropped elements. Finally, I will touch upon some general methodological issues in treebanking and other related annotation tasks that still need to be clarified.