I did a lot of experiments to decide the best way to go about it. I’ve tried most of the open-source named entity recognition models to compare which worked best, but in the end, I decided that none were good enough. Luckily for us, the Harry Potter fandom page contains a list of characters in the first book. We also know in which chapter they first appeared, which will help us disambiguate the characters even further.

Armed with this knowledge, we will use spaCy’s rule-based matcher to find all mentions of a character. Once we have found all the occurrences of entities, the only thing left is to define the co-occurrence metric and store the results in Neo4j. We will use the same co-occurrence threshold as was used in the Game of Thrones extraction. If two characters appear within 14 words of each other, we will assume they have interacted somehow and store the number of those interactions as the relationship weight.

- Preprocess book text (co-reference resolution)
- Entity recognition with spaCy’s rule-based matching

I have prepared a Google Colab notebook if you want to follow along.

As mentioned, we will begin by scraping the characters in the Harry Potter and the Philosopher’s Stone book. The list of characters by chapter is available under the CC-BY-SA license, so we don’t have to worry about any copyright infringement.

First, we add the full name as the pattern we are looking for. Then we split the name by whitespace and create a pattern out of every word of the term. So, for example, if we are defining matcher patterns for Albus Dumbledore, we will end up with three different text patterns that could represent the given character: “Albus Dumbledore”, “Albus”, and “Dumbledore”.

I have defined a list of stop words that should not be included in the single-word patterns for a given character. For example, there is a “Keeper of the Zoo” character present in the book. It is fairly intuitive that words like “of” or “the” should not be defined as matcher patterns of an entity.

Next, we want all single words to be title-cased. This is to avoid treating every mention of the color “black” as a reference to “Sirius Black”. Only if “Black” is title-cased will we assume that “Sirius Black” is referenced. It is not a perfect solution, as “Black” could be title-cased simply because it starts a sentence, but it is good enough.

A particular case I’ve introduced is that “Ronald Weasley” is mainly referred to as “Ron” in the text. Lastly, we do not split entities like “Vernon Dursley’s secretary” or “Draco Malfoy’s eagle owl”.

There are two obstacles we must overcome with this approach.
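To make the pattern rules and the 14-word co-occurrence window from this post more concrete, here is a minimal pure-Python sketch. It is not the code from the notebook: the function names `name_patterns` and `cooccurrence_weights`, the contents of `STOP_WORDS`, and the `(token_index, character)` input format are all assumptions made for illustration. It also assumes the resulting patterns are matched case-sensitively, which is what keeps the color “black” from matching “Black”.

```python
from collections import Counter

# Hypothetical stop-word list; the actual list used in the post is not shown.
STOP_WORDS = {"of", "the", "a", "an", "and"}


def name_patterns(name):
    """Return the text patterns that may refer to one character."""
    patterns = [name]  # the full name is always a pattern
    # Possessive entities ("Vernon Dursley's secretary") are not split.
    if "'s" in name or "’s" in name:
        return patterns
    for word in name.split():
        # Skip stop words and anything that does not start with a capital.
        if word.lower() in STOP_WORDS or not word[0].isupper():
            continue
        if word not in patterns:
            patterns.append(word)
    return patterns


def cooccurrence_weights(mentions, window=14):
    """Count pairs of distinct characters mentioned within `window` tokens.

    `mentions` is a list of (token_index, character) tuples sorted by index.
    Treating "within 14 words" as a distance of at most 14 is an assumption.
    """
    weights = Counter()
    for i, (pos_a, char_a) in enumerate(mentions):
        for pos_b, char_b in mentions[i + 1:]:
            if pos_b - pos_a > window:
                break  # mentions are sorted, so later ones are farther away
            if char_a != char_b:
                weights[tuple(sorted((char_a, char_b)))] += 1
    return weights


if __name__ == "__main__":
    print(name_patterns("Albus Dumbledore"))
    print(name_patterns("Keeper of the Zoo"))
    demo = [(0, "Harry"), (5, "Ron"), (30, "Hermione"), (40, "Ron"), (44, "Harry")]
    print(cooccurrence_weights(demo))
```

Each pair's count would become the relationship weight when the results are written to Neo4j. Sorting the pair before counting keeps the relationship undirected, so “Harry, Ron” and “Ron, Harry” end up in the same bucket.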