The probabilistic representation of linguistic knowledge: Linguistic data sets annotated for grammatical acceptability

The files contain crowdsourced (Amazon Mechanical Turk), speaker-annotated sentences in several domains and for several languages. The annotations are mean acceptability judgements collected in several modes of presentation. Full documentation of the experimental protocols used to obtain these annotations is provided on the Statistical Models of Grammaticality (SMOG) website; please see the related resources section to access it. This data collection contains the linguistic data sets in Excel format, together with two papers that describe the project, the data, and the experiments in greater detail.

SMOG is exploring the construction of an enriched stochastic model that represents the syntactic knowledge that native speakers of English have of their language. We hope that this kind of model will provide a straightforward explanation for the fact that individual native speakers generally judge the well-formedness of sentences along a continuum, rather than by imposing a sharp boundary between acceptable and unacceptable sentences. We are experimenting with different sorts of language models that contain a variety of parameters encoding properties of sentences and probability distributions over corpora. We are training these models on subsets of the British National Corpus (BNC) and testing them on additional subsets of the BNC into which we have introduced grammatical deformations and infelicities of varying degrees of severity and subtlety. We hope to show that a sufficiently complex enriched language model can encode a fair amount of what native speakers know about the syntax of their language.

This research holds out the prospect of important impact in two areas. (1) It can shed light on the relationship between the representation and acquisition of linguistic knowledge, on the one hand, and learning and the encoding of knowledge in other cognitive domains, on the other. This can, in turn, help to clarify the respective roles of biologically conditioned learning biases and data-driven learning in human cognition. (2) This work can contribute to the development of more effective language technology by providing insight into the way in which humans represent the syntactic properties of sentences in their language. To the extent that natural language processing systems take account of this class of representations, they will provide more efficient tools for parsing and interpreting text and speech.

Geographic Coverage:

Crowdsourced through Amazon Mechanical Turk

Temporal Coverage:

2012-10-01/2015-09-30

Resource Type:

dataset

Available in Data Catalogs:

UK Data Service

Topics: