Abstract
Heap’s Law
https://dl.acm.org/citation.cfm?id=539986 Heaps, H S 1978 Information Retrieval: Computational and Theoretical Aspects (Academic Press). states that in a large enough text corpus, the number of types as a function of tokens grows as N = K{M^\beta } for some free parameters K, \beta . Much has been written
http://iopscience.iop.org/article/10.1088/1367-2630/15/9/093033 Font-Clos, Francesc 2013 A scaling law beyond Zipf’s law and its relation to Heaps’ law (New Journal of Physics 15 093033).,
http://iopscience.iop.org/article/10.1088/1367-2630/11/12/123015 Bernhardsson S, da Rocha L E C and Minnhagen P 2009 The meta book and size-dependent properties of written language (New Journal of Physics 11 123015).,
http://iopscience.iop.org/article/10.1088/1742-5468/2011/07/P07013 Bernhardsson S, Ki Baek and Minnhagen 2011 A paradoxical property of the monkey book (Journal of Statistical Mechanics: Theory and Experiment, Volume 2011).,
http://milicka.cz/kestazeni/type-token_relation.pdf Milička, Jiří 2009 Type-token & Hapax-token Relation: A Combinatorial Model (Glottotheory. International Journal of Theoretical Linguistics 2 (1), 99–110).,
https://www.nature.com/articles/srep00943 Petersen, Alexander 2012 Languages cool as they expand: Allometric scaling and the decreasing need for new words (Scientific Reports volume 2, Article number: 943). about how this result and various generalizations can be derived from Zipf’s Law.
http://dx.doi.org/10.1037/h0052442 Zipf, George 1949 Human behavior and the principle of least effort (Reading: Addison-Wesley). Here we derive from first principles a completely novel expression of the type-token curve and prove its superior accuracy on real text. This expression naturally generalizes to equally accurate estimates for counting hapaxes and higher n-legomena.