Multi-Language Statistical Classification of Natural and Computer-Generated Texts

About the Project

This project concerns the statistics of written documents, whose properties are analogous to those of other type-token systems such as city populations, personal incomes and galactic superclusters.

Large-scale studies of the Project Gutenberg corpus reveal wide statistical variations between written documents [1][2]. Though many texts can be modelled with existing statistical theory, others are anomalous and deserve deeper examination. Significant differences also emerge between languages, so that languages from different groups (e.g. Finnish and English) could potentially be distinguished by their statistics alone.

It would be instructive to observe the differences between AI-generated and natural text in different languages, and to investigate how statistical detection of the former could operate across languages. We envisage the following program of study:

The compilation of a large body of literature from multiple corpora in which all significant language groups are evenly represented.
The identification and separation of statistical outliers within this group.
The generation of a corresponding corpus of AI-generated texts, of a similar size and covering the same language groups as the natural corpus.
The characterization of items within each corpus in terms of existing type-token models.
The subsequent ranking of these models by their ability to distinguish between languages and between AI and natural texts.
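
As an illustration of the characterization step, the sketch below measures vocabulary growth (distinct types against running tokens) for a single text and fits a Heaps-type power law, V = K * N^beta, by least squares on log-log axes. Heaps' law is used here only as one familiar example of a type-token model; the project description does not say which models would be examined. The function names (type_token_curve, fit_heaps), the simple word-character tokenizer and the sample sentence are all assumptions made for illustration, and the tokenizer would not suit languages written without word separators.

```python
import math
import re


def type_token_curve(text):
    """Return (tokens_seen, types_seen) pairs, one per token in the text."""
    # Assumption: a token is any run of Unicode word characters, lowercased.
    tokens = re.findall(r"\w+", text.lower())
    seen = set()
    curve = []
    for n, token in enumerate(tokens, start=1):
        seen.add(token)
        curve.append((n, len(seen)))
    return curve


def fit_heaps(curve):
    """Fit log V = log K + beta * log N by ordinary least squares.

    Returns the pair (K, beta). Needs at least two points on the curve.
    """
    if len(curve) < 2:
        raise ValueError("need at least two points to fit a power law")
    xs = [math.log(n) for n, _ in curve]
    ys = [math.log(v) for _, v in curve]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    beta = sxy / sxx
    k = math.exp(mean_y - beta * mean_x)
    return k, beta


# Hypothetical sample text; a real study would read whole documents.
sample = "the cat and the dog and the bird saw the other cat"
k, beta = fit_heaps(type_token_curve(sample))
print(f"K = {k:.3f}, beta = {beta:.3f}")
```

Fitted parameters such as K and beta, computed for every document, could then serve as features when comparing languages or comparing AI-generated with natural texts. A straight-line fit in log-log space is only one of several possible fitting procedures.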

If the results are positive, they could lead to the creation of a language classifier. If the results are less successful, they may nevertheless guide the development of better models, applicable to a wider range of documents and languages.

Funding Notes

There is no funding for this project.

References

[1] Tunnicliffe, Martin and Hunter, Gordon (2022) Random sampling of the Zipf-Mandelbrot distribution as a representation of vocabulary growth. Physica A: Statistical Mechanics and its Applications, p. 128259. ISSN (print) 0378-4371
[2] Tunnicliffe, Martin and Hunter, Gordon (2021) The predictive capabilities of mathematical models for the type-token relationship in English language corpora. Computer Speech & Language, 70, p. 101227. ISSN (print) 0885-2308


London, United Kingdom