Comparative Study
PLoS One. 2013 Jun 27;8(6):e66813.
doi: 10.1371/journal.pone.0066813. Print 2013.

Language Individuation and Marker Words: Shakespeare and His Maxwell's Demon

John Marsden et al. PLoS One. 2013.

Abstract

Background: Within the structural and grammatical bounds of a common language, all authors develop their own distinctive writing styles. Whether the relative occurrence of common words can be measured to produce accurate models of authorship is of particular interest. This work introduces a new score that helps to highlight such variations in word occurrence, and is applied to produce models of authorship of a large group of plays from the Shakespearean era.

Methodology: A text corpus containing 55,055 unique words was generated from 168 plays from the Shakespearean era (16th and 17th centuries) of undisputed authorship. A new score, CM1, is introduced to measure variation patterns based on the frequency of occurrence of each word for the authors John Fletcher, Ben Jonson, Thomas Middleton and William Shakespeare, compared to the rest of the authors in the study (which provides a reference of relative word usage at that time). A total of 50 WEKA methods were applied for Fletcher, Jonson and Middleton, to identify those which were able to produce models yielding over 90% classification accuracy. This ensemble of WEKA methods was then applied to model Shakespearean authorship across all 168 plays, yielding a Matthews' correlation coefficient (MCC) performance of over 90%. Furthermore, the best model yielded an MCC of 99%.
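The Matthews' correlation coefficient mentioned above can be computed directly from a binary confusion matrix. The following is a minimal sketch of the standard formula, not the authors' WEKA pipeline; the function name and the example counts are hypothetical. MCC ranges from -1 to 1, so a value of 0.99 corresponds to the "99%" quoted in the abstract.

```python
import math


def matthews_corrcoef(tp: int, tn: int, fp: int, fn: int) -> float:
    """Return the MCC for the four cells of a binary confusion matrix."""
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, return 0.0 when any marginal total is zero.
    return numerator / denominator if denominator else 0.0


# Hypothetical outcome (not from the paper): 28 plays by the target author
# and 140 by others, with one play missed and one wrongly attributed.
score = matthews_corrcoef(tp=27, tn=139, fp=1, fn=1)
print(f"MCC = {score:.3f}")
```

Unlike raw accuracy, MCC stays informative when the classes are unbalanced, as they are here with one author's plays set against all the others.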

Conclusions: Our results suggest that different authors, while adhering to the structural and grammatical bounds of a common language, develop measurably distinct styles by the tendency to over-utilise or avoid particular common words and phrasings. Considering language and the potential of words as an abstract chaotic system with a high entropy, similarities can be drawn to the Maxwell's Demon thought experiment; authors subconsciously favour or filter certain words, modifying the probability profile in ways that could reflect their individuality and style.
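The idea of an author over-using or avoiding a common word can be illustrated with a simple rate comparison. This sketch is an assumption for illustration only: it is a plain difference of mean rates, not the CM1 score (whose definition is not given on this page), and the helper names and toy texts are hypothetical.

```python
from statistics import mean


def rate_per_1000(text: str, word: str) -> float:
    """Occurrences of `word` per 1,000 whitespace-separated tokens."""
    tokens = text.lower().split()
    return 1000 * tokens.count(word) / len(tokens) if tokens else 0.0


def usage_gap(author_texts, other_texts, word: str) -> float:
    """Mean rate in the author's texts minus mean rate in all other texts.

    Positive values suggest over-use, negative values suggest avoidance.
    """
    author_rate = mean(rate_per_1000(t, word) for t in author_texts)
    other_rate = mean(rate_per_1000(t, word) for t in other_texts)
    return author_rate - other_rate


# Toy "plays" made of placeholder tokens; real input would be full play texts.
author_texts = ["a b a c", "a a b c"]
other_texts = ["a b c d", "b c d e"]
gap = usage_gap(author_texts, other_texts, "a")
print(gap)
```

Averaging per-play rates rather than pooling all tokens keeps one very long play from dominating the comparison.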

Conflict of interest statement

Competing Interests: The authors have declared that no competing interests exist.

Figures

Figure 1. Observed frequency of Fletcher's usage of the word ‘’ in his 15 plays, compared to that of the 153 plays by other authors in the text corpus dataset.
A significantly higher frequency of ‘formula image’ usage by Fletcher is demonstrated, indicating ‘formula image’ as an appropriate choice of marker to assist in the classification of his plays. Fletcher's predilection for the word ‘formula image’ has been previously shown by Hoy.
Figure 2. Observed frequency of Shakespeare's usage of the word ‘’ in his 28 plays, compared to that of the 140 plays by other authors in the text corpus dataset.
A significantly lower frequency of ‘formula image’ usage by Shakespeare is demonstrated, indicating ‘formula image’ as an appropriate choice of marker to assist in the classification of his plays.
Figure 3. CM1 scores for the 50 highest and 50 lowest ranked words for Fletcher, based on the 168 plays in the text corpus dataset.
The 20 highest and 20 lowest ranked words are shown in red and green respectively, and are presented in Tables 2 and 3. As expected, the CM1 score for ‘formula image’ is significantly higher than that of any other marker word.
Figure 4. Difference between the cumulative CM1 scores for Fletcher's 20 highest and 20 lowest scoring marker words, as presented in Tables 2 and 3.
Fletcher's plays are highlighted in green. It is observed that the majority of his plays score considerably higher than the majority of plays by the other authors. One notable exception, The Faithful Shepherdess, is considered to be of a significantly different genre to the remainder of Fletcher's plays, and has been omitted in two previous studies attempting to identify his stylistic signature.
Figure 5. CM1 scores for the 50 highest and 50 lowest ranked words for Jonson, based on the 168 plays in the text corpus dataset.
The 20 highest and 20 lowest ranked words are shown in red and green respectively, and are presented in Tables 2 and 3. CM1 ranks ‘formula image’ and ‘formula image’ as words that Jonson distinctively overuses, in contrast to ‘formula image’ and ‘formula image’, which are distinctively underused.
Figure 6. Difference between the cumulative CM1 scores for Jonson's 20 highest and 20 lowest scoring marker words, as presented in Tables 2 and 3.
Jonson's plays are highlighted in green. Although not as evident as with Fletcher, Jonson's plays demonstrate an overall higher score than the majority of plays by the other authors. The worst scoring of Jonson's plays, The Case is Altered, is generally regarded as a stylistic anomaly among his works.
Figure 7. CM1 scores for the 50 highest and 50 lowest ranked words for Middleton, based on the 168 plays in the text corpus dataset.
The 20 highest and 20 lowest ranked words are shown in red and green respectively, and are presented in Tables 2 and 3. CM1 ranks ‘formula image’, ‘formula image’, ‘formula image’ and the demonstrative form of ‘formula image’ among the words that Middleton distinctively overuses; ‘formula image’ is ranked amongst the words that Middleton underuses, as opposed to plays by Jonson, for which ‘formula image’ is a strong positive marker.
Figure 8. Difference between the cumulative CM1 scores for Middleton's 20 highest and 20 lowest scoring marker words, as presented in Tables 2 and 3.
Eight of these marker words appeared among the ten word-variables determined earlier by Craig (by discriminant analysis). Middleton's plays are highlighted in green. It is observed that the majority of his plays score higher than the majority of plays by the other authors. The worst scoring of Middleton's plays, A Game at Chess, is unusual stylistically among his works, being a satire on contemporary international politics.
Figure 9. CM1 scores for the 50 highest and 50 lowest ranked words for Shakespeare, based on the 168 plays in the text corpus dataset.
The 20 highest and 20 lowest ranked words are shown in red and green respectively, and are presented in Tables 2 and 3. CM1 ranks ‘formula image’, ‘formula image’ and ‘formula image’ as words that Shakespeare distinctively overuses, in contrast to ‘formula image’ (as discussed by Craig [29]), ‘formula image’ and the infinitive form of ‘formula image’, which are distinctively underused.
Figure 10. Difference between the cumulative CM1 scores for Shakespeare's 20 highest and 20 lowest scoring marker words, as presented in Tables 2 and 3.
Shakespeare's plays are highlighted in green. It is observed that the majority of his plays score higher than those by other authors, although the overall range of values is lower than for Fletcher, Jonson and Middleton. This supports previous research suggesting that Shakespeare generally adheres to the norms of the work of his peer group.
Figure 11. Authorship classification performance of 50 methods evaluated in terms of Matthews' correlation coefficient for Fletcher, resulting from a 10-by-10 fold cross validation of his 20 highest and 20 lowest CM1 scoring marker words.
The results of individual folds are presented in blue/green, with the average performance for each method in red.
Figure 12. Authorship classification performance of 50 methods evaluated in terms of Matthews' correlation coefficient for Jonson, resulting from a 10-by-10 fold cross validation of his 20 highest and 20 lowest CM1 scoring marker words.
The results of individual folds are presented in blue/green, with the average performance for each method in red.
Figure 13. Authorship classification performance of 50 methods evaluated in terms of Matthews' correlation coefficient for Middleton, resulting from a 10-by-10 fold cross validation of his 20 highest and 20 lowest CM1 scoring marker words.
The results of individual folds are presented in blue/green, with the average performance for each method in red.
Figure 14. Authorship classification performance of 8 methods evaluated in terms of Matthews' correlation coefficient for Shakespeare, resulting from a 10-by-10 fold cross validation of his 20 highest and 20 lowest CM1 scoring marker words.
These 8 methods were selected as those which yielded the best classification performance for Fletcher, Jonson and Middleton. The results of individual folds are presented in blue/green, with the average performance for each method in red. The performance across all 8 methods is demonstrated to be above 80%, with the best performing method (formula image) yielding classification performance of 99%.
Figure 15. Frequency of occurrence of words appearing among the 20 highest scoring marker words for Shakespeare, resulting from a 10 fold cross validation.
This process involved the removal of 10% of plays by Shakespeare (3), and 10% of plays by other authors (14). The 20 highest (left) and lowest (right) scoring marker words were calculated for every possible triplet of removed plays by Shakespeare (formula image combinations), and for each, a random selection of 14 plays by other authors. The marker words determined across the full text corpus are highlighted in green. This demonstrates this selection of words as valid for classification, and that the CM1 score is robust against the removal and addition of plays.
Figure 16. Difference between the cumulative CM1 scores for the 20 highest and 20 lowest scoring marker words for Fletcher, Jonson, Middleton and Shakespeare.
For each author, the left box represents the distribution of scores for their plays, and the right box the distribution of scores for plays by all other considered authors. The worst scoring play belonging to each author is indicated by a green cross. These are: a) The Faithful Shepherdess (Fletcher); b) The Case is Altered (Jonson); c) A Game at Chess (Middleton); and d) Love's Labour's Lost (Shakespeare).
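The resampling described in the Figure 15 caption (every possible triplet of removed Shakespeare plays, each paired with a random selection of 14 plays by other authors) can be sketched as follows. This is an illustration under stated assumptions, not the authors' procedure: the play counts of 28 and 140 are taken from the Figure 2 caption, plays are represented by hypothetical integer indices, and the random seed is arbitrary. The number of combinations elided from the caption is not recoverable from this page; the code only shows what the count would be if all unordered triplets of 28 plays were enumerated.

```python
import random
from itertools import combinations
from math import comb

N_SHAKESPEARE = 28   # from the Figure 2 caption
N_OTHERS = 140       # from the Figure 2 caption
HELD_OUT_SHAKESPEARE = 3
HELD_OUT_OTHERS = 14

# Closed-form count of unordered triplets.
n_triplets = comb(N_SHAKESPEARE, HELD_OUT_SHAKESPEARE)

rng = random.Random(0)  # arbitrary seed, for reproducibility only
other_indices = range(N_OTHERS)

# Pair each triplet of held-out Shakespeare plays with a random
# selection of held-out plays by other authors.
splits = [
    (triplet, tuple(rng.sample(other_indices, HELD_OUT_OTHERS)))
    for triplet in combinations(range(N_SHAKESPEARE), HELD_OUT_SHAKESPEARE)
]

print(n_triplets, len(splits))
```

Each entry in `splits` names the plays to leave out before recomputing the marker words on the remainder.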

References

    1. De Saussure F (2011) Course in general linguistics. Columbia University Press.
    2. Johnstone B, Bean JM (1997) Self-expression and linguistic variation. Language in Society 26: 221–246.
    3. Ellegård A (1962) A statistical method for determining authorship: The Junius Letters, 1769–1772. Acta Universitatis Gothoburgensis.
    4. Mosteller F, Wallace D (1964) Inference and disputed authorship: The Federalist. Addison-Wesley.
    5. Burrows J (1987) Word-patterns and story-shapes: The statistical analysis of narrative style. Literary and Linguistic Computing 2: 61–70.

Grants and funding

This work has been supported by The University of Newcastle thanks to a funding contribution to the Priority Research Centre for Bioinformatics, Biomarker Discovery and Information-based Medicine (2006–2012). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.