Thesis wordcloud

For one of my Covid-19 lockdown projects, I wanted to create a wordcloud of my PhD thesis. As the subject of my thesis was neuroscience, I thought it would be fun to make it in the shape of a brain.

Environment setup & installation of packages

I use the terminal on macOS and run Python through Anaconda, but you should be able to run these commands with minimal adjustments on other systems.

Run the commands below to create a new environment called wordcloud containing Python, the wordcloud package and its dependencies, and then to install nltk and pdfminer with pip.

conda create -n wordcloud python=3.7 pip wordcloud 
conda activate wordcloud

pip install nltk pdfminer

Loading packages

# Start from a fresh interpreter or kernel instead of clearing the module
# namespace: clearing __dict__ also removes __name__ and __builtins__

# Loading libraries
import numpy as np
import re # regular expression
from PIL import Image # Python Imaging Library, https://pypi.org/project/PIL/ (installed via the Pillow package)

import matplotlib.pyplot as plt
import matplotlib.colors as mcolors

from io import StringIO

import unicodedata
from nltk.corpus import stopwords

from wordcloud import WordCloud, ImageColorGenerator

# import packages from pdfminer
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfparser import PDFParser

Importing the pdf

I used the package pdfminer to read my PhD thesis into Python. I limited the pages to import in order to exclude the reference list.

page_numbers = (0, 135) # get pages up to references

# import pdf as text
output_string = StringIO()
with open('data/thesis_JochemVanKempen.pdf', 'rb') as in_file:
    parser = PDFParser(in_file)
    doc = PDFDocument(parser)
    rsrcmgr = PDFResourceManager()
    device = TextConverter(rsrcmgr, output_string, laparams=LAParams())
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    for page_number, page in enumerate(PDFPage.create_pages(doc)):
        if (page_number>=page_numbers[0]) and (page_number<=page_numbers[1]):
            interpreter.process_page(page)

# print(output_string.getvalue())

new_string = str(output_string.getvalue())
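The page filter above keeps zero-based page indices 0 through 135 inclusive, but it still iterates over every remaining page of the PDF. As a minimal sketch of an alternative, the hypothetical helper `select_pages` below (not part of pdfminer) uses `itertools.islice` to stop once the last wanted page has been reached. A generator of strings stands in for `PDFPage.create_pages(doc)`, so the example runs without a PDF.

```python
from itertools import islice

def select_pages(pages, first, last):
    """Return an iterator over items with zero-based index in [first, last]."""
    # islice stops consuming `pages` once index `last` has been reached
    return islice(pages, first, last + 1)

# Stand-in for PDFPage.create_pages(doc): any iterable of page objects works
fake_pages = (f"page-{i}" for i in range(200))

selected = list(select_pages(fake_pages, 0, 135))
print(len(selected), selected[0], selected[-1])
```

With real pages, each selected item would be passed to `interpreter.process_page(page)` as in the loop above.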

Cleaning up the text

Next, I cleaned up the text by removing bracketed text (such as citations), line and page breaks, and accented or other special characters. Finally, I made the text lowercase.

# Remove brackets and the text inside them (e.g. citations)
new_string = re.sub(r'[\(\[].*?[\)\]]', ' ', new_string)

# Replace line breaks with a space so words on adjacent lines stay separate
new_string = re.sub(r'\n', ' ', new_string)

# Replace page breaks (form feeds) with a space
new_string = re.sub(r'\x0c', ' ', new_string)

# Strip accents and drop any remaining non-ASCII characters using the unicodedata library
new_string = unicodedata.normalize('NFKD', new_string).encode('ascii', 'ignore').decode('utf-8', 'ignore')

# make lower case
new_string = new_string.lower()
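To see what these steps do, it can help to run them on a short string. The sketch below wraps them in a hypothetical function `clean_text` and applies it to a made-up sentence. It assumes line and page breaks are replaced with a space, so that words on either side of a break are not joined together.

```python
import re
import unicodedata

def clean_text(text):
    """Apply the cleaning steps from this section to a raw string."""
    text = re.sub(r'[\(\[].*?[\)\]]', ' ', text)  # drop bracketed text
    text = re.sub(r'\n', ' ', text)               # line breaks -> space (assumption)
    text = re.sub(r'\x0c', ' ', text)             # page breaks -> space (assumption)
    # NFKD splits accented letters into base letter + accent; the ASCII
    # round trip then discards the accents and other non-ASCII characters
    text = unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('ascii')
    return text.lower()

# Made-up sample text, not taken from the thesis
sample = "Attention (Posner, 1980) modulates\nneuronal activity [1] in Área V4.\x0c"
cleaned = clean_text(sample)
print(cleaned.split())
```

The citation and the bracketed reference number disappear, "Área" becomes "area", and the words on either side of the line break stay separate.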

Define a few acronyms, or make them upper case

I converted some acronyms back to upper case and expanded others into full words.

# convert specific acronyms back to upper case
new_string = new_string.replace(' cpp ', ' CPP ')
new_string = new_string.replace(' eeg ', ' EEG ')
new_string = new_string.replace(' fef ', ' FEF ')
new_string = new_string.replace(' hmm ', ' HMM ')
new_string = new_string.replace('hz', 'Hz')
new_string = new_string.replace(' itpc ', ' ITPC ')
new_string = new_string.replace(' lc ', ' LC ')
new_string = new_string.replace(' lfpb ', ' LFPb ')
new_string = new_string.replace(' n2c ', ' N2c ')
new_string = new_string.replace(' rf ', ' RF ')
new_string = new_string.replace(' rt ', ' RT ')
new_string = new_string.replace(' v1 ', ' V1 ')
new_string = new_string.replace(' v4 ', ' V4 ')

# replace some acronyms with words
new_string = new_string.replace(' ach ', ' acetylcholine ')
new_string = new_string.replace(' da ', ' dopamine ')
new_string = new_string.replace(' na ', ' noradrenaline ')
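The space-delimited replacements above only match an acronym with a space on both sides, so occurrences next to punctuation (for example "eeg." or "(v1)") are left in lower case. One possible alternative, sketched here under the assumption that whole-word matching is what is wanted, is a single regular expression with word boundaries. The dictionary `ACRONYMS` and the function `restore_acronyms` are illustrative names, and only a subset of the acronyms is included.

```python
import re

# Illustrative subset of the acronyms above, keyed by their lower-case form
ACRONYMS = {"eeg": "EEG", "lc": "LC", "v1": "V1", "rt": "RT"}

# \b matches at word boundaries, so acronyms next to punctuation or at the
# start or end of the text are matched as well
pattern = re.compile(r"\b(" + "|".join(map(re.escape, ACRONYMS)) + r")\b")

def restore_acronyms(text):
    """Replace each lower-case acronym with its canonical spelling."""
    return pattern.sub(lambda m: ACRONYMS[m.group(1)], text)

# Made-up sample sentence
sample = "eeg and lc activity (v1), with faster rt."
restored = restore_acronyms(sample)
print(restored)

# The space-delimited replace leaves "rt." untouched
print(sample.replace(' rt ', ' RT ') == sample)
```

Because the pattern matches whole words only, the letters "rt" inside a longer word are not affected.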

Excluding words

Using the Natural Language Toolkit package nltk, I obtained a list of stopwords to exclude from the wordcloud. I extended this list with some words specific to scientific referencing (for instance ‘doi’ and ‘et al’) as well as words specific to my thesis (for instance the names of authors I cited frequently).

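from nltk.corpus import stopwords

# If the stopword corpus is not installed yet, download it once with:
# import nltk; nltk.download('stopwords')
stopword_list = stopwords.words('english')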
stopword_list.extend(['et', 'al', 'doi',
                      'figure', 'chapter',
                      'and','well','thus','first','also',
                      'nat','neurosci','jneurosci','sci','science',
                      'one','two','although',
                      'thiele','aston','jones','cohen','arnsten','bellgrove','connell',
                      'non','within',
                      'single','additionally', 'p','ms','u', 'd1r', 'b', 'pre'])
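To illustrate why the stopword list matters, the sketch below counts words in a made-up sentence with and without filtering. It uses a small hand-written stopword set instead of the nltk list, and a simple regular expression as the tokeniser; both are assumptions for illustration and not what WordCloud does internally.

```python
import re
from collections import Counter

# Small hand-written stopword set for illustration; the post uses the much
# longer nltk English list plus the custom additions above
stopword_set = {"the", "of", "in", "and", "et", "al", "doi"}

# Made-up sample text
text = "the role of attention in the cortex and the effect of attention et al"

tokens = re.findall(r"[a-z0-9']+", text.lower())
counts_all = Counter(tokens)
counts_kept = Counter(t for t in tokens if t not in stopword_set)

print(counts_all.most_common(3))
print(counts_kept.most_common(3))
```

Without filtering, function words such as "the" dominate the counts. With filtering, content words such as "attention" rise to the top.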

Making/importing the mask

I created a mask for the wordcloud in the shape of a brain using Inkscape by adapting this image. The WordCloud package treats pure white pixels of the mask as masked out and is free to draw words everywhere else, so the words appear in the black region of the mask (RGB = [0, 0, 0]). I added a few white lines to mark the cerebellum and the temporal lobe so that these boundaries will still be visible in the wordcloud.

Importing the mask into Python and plotting it:

mask_brain = np.array(Image.open("Img/brain_blackw.png"))    

plt.imshow(mask_brain)
plt.axis("off")
plt.show()

Figure: wordcloud mask
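The effect of the white lines can be seen on a toy mask. The sketch below builds a hypothetical 4x6 greyscale array (not the brain image) and marks the pixels that are not pure white, which is where WordCloud is free to place words.

```python
import numpy as np

# Hypothetical 4x6 greyscale mask: 255 = white (masked out), 0 = black (drawable)
toy_mask = np.full((4, 6), 255, dtype=np.uint8)
toy_mask[1:3, 1:5] = 0    # black rectangle where words may be placed
toy_mask[1:3, 3] = 255    # a white line splitting the black region in two

drawable = toy_mask != 255    # True wherever the mask is not pure white
print(drawable.astype(int))
print("drawable fraction:", drawable.mean())
```

The white column stays empty in the same way the white lines around the cerebellum and temporal lobe stay empty in the brain mask.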

Making the wordcloud

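Making the wordcloud is fairly straightforward using the WordCloud package. I chose a black background and limited the cloud to the 200 most frequent words. Make sure to set mask = mask_brain; when a mask is supplied, WordCloud takes the image size from the mask and ignores width and height.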

wc = WordCloud(stopwords = stopword_list, 
               background_color = "black",                         
               mask = mask_brain, 
               collocations = False,
               max_words=200, width = 1600, height = 900, 
               random_state=1).generate(new_string)

print(*wc.words_, sep = "\n")

Set the colour scheme

To change the colour scheme, I wanted to only use part of the bone colormap, from 0.5 to 1. For this, I used the function truncate_colormap that I found here.

# truncate colormap
def truncate_colormap(cmap, minval=0.0, maxval=1.0, n=-1):
    if n == -1:
        n = cmap.N
        
    new_cmap = mcolors.LinearSegmentedColormap.from_list(
         'trunc({name},{a:.2f},{b:.2f})'.format(name=cmap.name, a=minval, b=maxval),
         cmap(np.linspace(minval, maxval, n)))
    
    return new_cmap

minColor = 0.5
maxColor = 1.00

cmap2use = "bone"
cmap_t = truncate_colormap(plt.get_cmap(cmap2use), minColor, maxColor)
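truncate_colormap works by sampling the original colormap at n evenly spaced positions between minval and maxval (that is what `np.linspace` provides) and building a new colormap from those colours. As a dependency-free sketch of the sampling step only, the hypothetical helper `sample_positions` below reproduces the positions for a small n.

```python
def sample_positions(minval, maxval, n):
    """Return n evenly spaced positions from minval to maxval, endpoints included."""
    if n == 1:
        return [minval]
    step = (maxval - minval) / (n - 1)
    return [minval + step * i for i in range(n)]

# Upper half of a colormap, sampled at 5 points instead of cmap.N
positions = sample_positions(0.5, 1.0, 5)
print(positions)
```

With minval = 0.5 the darker lower half of the bone colormap is never sampled, so only its lighter colours are used, which suits the black background.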

The final result

Finally, I plotted the wordcloud with the truncated colormap.

plt.imshow(wc.recolor(colormap=cmap_t, random_state=1), alpha=1, interpolation='bilinear')
plt.axis("off")
plt.show()