Zareen Farooqui wrote a very interesting article, Harry Potter Text Analysis, in which she used Python to compare the seven Harry Potter novels. I highly recommend checking it out.
It got me thinking about how Harry Potter fanfiction compares to JK Rowling’s texts and I was inspired to do my own analyses using some of the tools I have learned over the last year, such as Python 3, the BASH Command Shell, AntConc and Raw Graphs, as well as Excel and Notepad++.
First, I sourced .txt files of the seven Harry Potter novels online and used BASH to concatenate them into a single .txt file. I then used Notepad++ to remove metadata and chapter numbers and to perform some basic analyses.
I counted the total number of words in the file (1,112,028) and set about building a corpus of Harry Potter fanfiction of similar length. For this, I searched for the most popular works of Harry Potter fanfiction and scraped the first 20 using a Python script, which also converted them to individual text files. Using BASH, I counted the words in each file and selected the 17 whose combined length most closely matched the total word count of the Harry Potter novels, ending up with 1,110,349 words in a single .txt file.
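The selection step can be sketched in Python. This is only an illustration under stated assumptions: the post does not say how the 17 works were chosen, so the brute-force search over combinations, the function names `count_words` and `closest_subset`, and the example file names and counts are all hypothetical. Words are counted as whitespace-separated tokens, which is roughly what `wc -w` does.

```python
from itertools import combinations


def count_words(text: str) -> int:
    """Count whitespace-separated tokens, similar in spirit to `wc -w`."""
    return len(text.split())


def closest_subset(word_counts: dict, target: int, size: int) -> tuple:
    """Return the `size` names whose combined word count is nearest to `target`.

    Tries every combination of the given size, which is cheap for
    20 files (choosing 17 of 20 gives only 1,140 combinations).
    """
    best_names, best_total, best_gap = [], 0, None
    for names in combinations(sorted(word_counts), size):
        total = sum(word_counts[name] for name in names)
        gap = abs(total - target)
        if best_gap is None or gap < best_gap:
            best_names, best_total, best_gap = list(names), total, gap
    return best_names, best_total


# Hypothetical word counts for five scraped works (not the author's data).
example_counts = {
    "fic_a.txt": 120,
    "fic_b.txt": 300,
    "fic_c.txt": 250,
    "fic_d.txt": 90,
    "fic_e.txt": 410,
}
chosen, total = closest_subset(example_counts, target=800, size=3)
print(chosen, total)
```

With the real data, `word_counts` would hold the 20 scraped files, `target` would be 1,112,028 and `size` would be 17.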
ANALYSES – WORD FREQUENCY
The top ten words in each corpus are very similar, so I compared them in a single visualisation using Excel.
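For readers who would rather stay in Python, a top-ten word list can be produced with `collections.Counter`. This is a minimal sketch, not the method used for the figures in this post: the tokenisation rule (lower-cased runs of letters, with internal apostrophes allowed) is an assumption, and a different rule would give slightly different counts.

```python
import re
from collections import Counter


def top_words(text: str, n: int = 10) -> list:
    """Return the n most frequent words as (word, count) pairs."""
    # Assumption: a word is a run of letters, optionally joined by apostrophes.
    words = re.findall(r"[a-z]+(?:['’][a-z]+)*", text.lower())
    return Counter(words).most_common(n)


sample = "The owl said the letter said the truth."
print(top_words(sample, 2))  # [('the', 3), ('said', 2)]
```

Running `top_words` on each corpus file and exporting the pairs to a .csv file would give the numbers needed for a chart in Excel.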
The largest difference is in the use of “said”, which is far less frequent in Harry Potter fanfiction. My intuition was that fanfiction uses other speech verbs in place of “said”, such as “asked”, “murmured” and “shouted”, so I went back to the corpora and used Notepad++ and AntConc to count the occurrences of quotation marks.
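The same counts could also be obtained with a few lines of Python. The sketch below is an assumption-laden illustration rather than a record of what Notepad++ and AntConc did: it assumes dialogue is marked with straight or curly double quotation marks, and the helper names `count_quote_marks` and `count_word` are made up for this example.

```python
import re


def count_quote_marks(text: str) -> int:
    """Count straight and curly double quotation marks."""
    return sum(text.count(mark) for mark in ('"', "“", "”"))


def count_word(text: str, word: str) -> int:
    """Count whole-word, case-insensitive occurrences of `word`."""
    pattern = rf"\b{re.escape(word)}\b"
    return len(re.findall(pattern, text, flags=re.IGNORECASE))


sample = '“Look,” said Harry. “Where?” Ron asked. "There," says Hermione.'
print(count_quote_marks(sample))   # 6
print(count_word(sample, "said"))  # 1
print(count_word(sample, "says"))  # 1
```

If a corpus marks dialogue with single quotation marks instead, the tuple in `count_quote_marks` would need adjusting, and apostrophes would then have to be told apart from quotation marks.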
The Harry Potter corpus contains 36,827 occurrences of quotation marks, indicating roughly 18,413 occurrences of speech (two marks per utterance). There are 14,396 uses of “said” and only 205 uses of “says”, leaving 3,812 potential other speech verbs, or 20.70% of the speech occurrences.
The Harry Potter fanfiction corpus contains 33,388 occurrences of quotation marks, indicating 16,694 occurrences of speech. There are 6,534 uses of “said” and 2,342 uses of “says”, leaving 7,818 potential other speech verbs, or 46.83% of the speech occurrences.
| |number of speech occurrences|number of uses of “said”|percentage occurrence of other speech verbs|
|Harry Potter|18,413|14,396|20.70%|
|Harry Potter Fanfiction|16,694|6,534|46.83%|
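The arithmetic behind these figures is easy to check in Python. The function name `other_speech_verbs` is hypothetical. The calculation assumes that every utterance uses exactly two quotation marks and that every “said” or “says” is a speech verb attached to an utterance, which is a simplification.

```python
def other_speech_verbs(quote_marks: int, said: int, says: int) -> tuple:
    """Estimate speech occurrences and the share not covered by said/says."""
    speech = quote_marks // 2          # two marks per utterance; an odd mark is dropped
    other = speech - said - says       # utterances left for other speech verbs
    share = round(100 * other / speech, 2)
    return speech, other, share


print(other_speech_verbs(36827, 14396, 205))   # (18413, 3812, 20.7)
print(other_speech_verbs(33388, 6534, 2342))   # (16694, 7818, 46.83)
```

The first call uses the counts for the novels and the second the counts for the fanfiction corpus.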
Therefore, the discrepancy between the two corpora’s use of “said” can be partly explained by Harry Potter fanfiction’s use of other speech verbs, but it is also caused by more frequent use of the present tense in fanfiction (2,342 uses of “says” against 205 in the novels).