Methodology
Collecting, analysing, and visualising the data properly involves multiple steps. For each step, I used specific tools and methods, described below.
- Textual Content: It is difficult to extract data meaningfully from plain text files through computational means. Creating or finding XML-encoded literary works is therefore a necessity for data extraction. EarlyPrint provides a plethora of such machine-readable files encoded according to the TEI guidelines. I specifically downloaded the XML-encoded versions of Congreve's comedies from which the word-level tags have been stripped, to suit the purposes of this project.
- Data Extraction and Analysis: I extracted the lines spoken by the characters of the plays using Python and the lxml library, and stored the results in JSON files. I then used Python libraries such as the Natural Language Toolkit (NLTK), syllapy, and statistics to analyse the data and compute the gender-based token ratio, average speech length, Flesch-Kincaid Grade Level, and other metrics. The whole process is documented in a Jupyter notebook; the IPYNB and JSON files can be found in the GitHub repository.
- Data Visualisation: Finally, I used pandas to create DataFrames and plotly to make interactive graphs based on the analyses.
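The extraction step above could be sketched as follows. The project itself uses lxml on EarlyPrint's TEI files; to keep this example self-contained, it uses the standard library's ElementTree (whose API is similar) on a tiny inline TEI fragment. The element names (`sp`, `speaker`, `l`) follow TEI drama conventions, and the sample lines are merely illustrative, not project data.

```python
import json
import xml.etree.ElementTree as ET

# TEI documents live in this XML namespace.
TEI_NS = "{http://www.tei-c.org/ns/1.0}"

# A tiny, hypothetical TEI fragment standing in for an EarlyPrint file.
sample = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body>
    <sp who="Bellmour"><speaker>Bell.</speaker>
      <l>Vainlove, and abroad so early!</l>
    </sp>
    <sp who="Vainlove"><speaker>Vain.</speaker>
      <l>Business, Ned, as you see.</l>
    </sp>
  </body></text>
</TEI>"""

root = ET.fromstring(sample)
speeches = []
for sp in root.iter(f"{TEI_NS}sp"):
    # Collect the verse lines of each speech under its speaker.
    lines = [l.text.strip() for l in sp.iter(f"{TEI_NS}l") if l.text]
    speeches.append({"speaker": sp.get("who"), "lines": lines})

# Store the extracted speeches as JSON, as in the project workflow.
print(json.dumps(speeches, indent=2))
```

In the actual workflow, `ET.fromstring` would be replaced by parsing each downloaded play file, and the resulting list would be written to a JSON file per play.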
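The analysis metrics named above can be sketched like this. The project uses NLTK for tokenisation and syllapy for syllable counts; here a regex tokenizer and a crude vowel-group syllable counter stand in so the sketch is dependency-free. The character names, speeches, and gender mapping are hypothetical illustrations, not project data.

```python
import re

def tokenize(text):
    # Simple word tokenizer standing in for nltk.word_tokenize.
    return re.findall(r"[A-Za-z']+", text)

def count_syllables(word):
    # Crude vowel-group heuristic standing in for syllapy.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def fk_grade(text):
    # Flesch-Kincaid Grade Level:
    # 0.39*(words/sentences) + 11.8*(syllables/words) - 15.59
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = tokenize(text)
    syllables = sum(count_syllables(w) for w in words)
    return (0.39 * len(words) / len(sentences)
            + 11.8 * syllables / len(words) - 15.59)

# Hypothetical extracted speeches, keyed by character.
speeches = {
    "Araminta": ["Dear me, why so grave, cousin?"],
    "Bellmour": ["Vainlove, and abroad so early!"],
}
gender = {"Araminta": "F", "Bellmour": "M"}  # hypothetical mapping

# Average speech length (tokens per speech) for each character.
avg_len = {name: sum(len(tokenize(s)) for s in lines) / len(lines)
           for name, lines in speeches.items()}

# Gender-based token ratio: female tokens / male tokens.
tokens = {"F": 0, "M": 0}
for name, lines in speeches.items():
    tokens[gender[name]] += sum(len(tokenize(s)) for s in lines)
ratio = tokens["F"] / tokens["M"]
```

The same loops scale directly to the full per-play JSON files produced in the extraction step.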
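The visualisation step might look like the following minimal sketch: a pandas DataFrame of per-play metrics fed into plotly. The column names and numeric values are hypothetical placeholders, not results from the project; only the play titles are Congreve's.

```python
import pandas as pd
import plotly.express as px

# Hypothetical per-play metrics (illustrative values, not project results).
df = pd.DataFrame({
    "play": ["The Old Bachelor", "Love for Love"],
    "female_token_share": [0.41, 0.45],
})

# An interactive bar chart of the (hypothetical) metric per play.
fig = px.bar(
    df,
    x="play",
    y="female_token_share",
    title="Share of tokens spoken by female characters (illustrative)",
)
# In a notebook, fig.show() renders the interactive graph inline.
```

In the project notebook, the DataFrame would instead be built from the computed analysis results, one row per play or per character.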