This post is for analysts who (1) would rather not copy and paste balance sheets and income statements from PDF files when starting to build a model, and (2) would like to pull the principal topics out of wordy sections.
Nothing beats practical examples for me, so the rest of this post applies various Python packages to PNB’s publicly available annual reports.
Part 1. Extracting financials from pdf to csv files
Step 1: Save all the relevant annual reports in the same folder as the Jupyter notebook.
In my case, I saved five annual reports ahead of time.
Step 2: Import the tabula package
Step 3: Manually indicate the pages you want to extract
Step 4: Convert to csv
This will produce two new files in the same folder, named after the output filenames you specify.
Once you open the files, you’ll see they need a little Excel work to clean them up, but for so few lines of code, I think it’s well worth it!
Part 2. Textual Analysis on the President’s Report section
Step 1: Import the PyPDF2 package
> import PyPDF2
Step 2: Load the files containing the text you want to analyze
Step 3: Indicate the specific pages for the section you want to analyze
In this case, I chose the “President’s Report” section, which is unfortunately on different pages in each file, so I had to indicate the ranges manually.
The code gets the pages you indicate and compiles them into one string. The “+=” signifies that we append the text from each relevant page.
A quick check on the extract using the print() function shows something is very wrong with the extract for year 2017. At this point, I considered analyzing another company instead, one that would yield “prettier” results, but that would be lying to myself. In reality, things like this happen, so I moved on.
Step 4: Clean Text
I convert all the text to lowercase and combine the extracts into one list called “texts” for ease of analysis moving forward. I then remove punctuation and numbers.
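A minimal cleaning helper along those lines, using only the standard library (the sample strings below are stand-ins for the five extracted President’s Reports):

```python
import re
import string


def clean(text):
    """Lowercase, strip numbers and punctuation, collapse whitespace."""
    text = text.lower()
    text = re.sub(r"\d+", "", text)                                   # remove numbers
    text = text.translate(str.maketrans("", "", string.punctuation))  # remove punctuation
    return re.sub(r"\s+", " ", text).strip()                          # tidy spacing


# stand-ins for the five extracted sections
raw_texts = ["Net income GREW 12% in 2019!", "Provisions rose, sharply."]
texts = [clean(t) for t in raw_texts]
```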
Step 5: Import necessary packages for textual analysis
Step 6: Produce a word count
The Python code and concepts used from this point forward were mostly learned from the EDHEC course on Coursera. It was a great course!
The fit_transform() method returns a sparse matrix object, so we convert it to a pandas DataFrame to see the counts it produced. counts.toarray() produces an array of counts, while get_feature_names() lists all the words across the texts. According to the len() function, there are 1197 unique words.
Since we now have the counts, we can compute the TF-IDF.
The Term Frequency indicates how many times a word appears in a document. MANY → HIGH importance.
The Inverse Document Frequency indicates how many documents the word appears in. MANY → LOW importance.
For instance, the word “bank” may appear a lot, which increases its importance, but it also appears in every document, which lessens its importance.
Step 7: Normalize the counts via log transform
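One common way to log-normalize counts is `log(1 + count)`, which keeps zero counts at zero while damping very large counts. The numbers in the toy table below are illustrative only:

```python
import numpy as np
import pandas as pd

# toy document-term counts (rows = terms, columns = years) — illustrative only
counts = pd.DataFrame({"2019": [30, 0], "2020": [25, 12]},
                      index=["bank", "pandemic"])

# log1p(x) = log(1 + x): zeros stay zero, large counts are damped
tf = np.log1p(counts)
```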
Step 8: Count the number of documents that contain each term then normalize the results as well
This adds all the counts for all 5 years.
This normalizes the counts found in the 5 documents we have.
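These two operations can be sketched as follows, again on a toy table standing in for the real counts. A term appearing in every document gets an IDF of zero, while a rare term gets a large weight:

```python
import numpy as np
import pandas as pd

# toy counts (rows = terms, columns = years) — illustrative only
counts = pd.DataFrame(
    {"2016": [20, 0], "2017": [22, 0], "2018": [25, 0],
     "2019": [30, 0], "2020": [25, 12]},
    index=["bank", "pandemic"],
)

doc_freq = (counts > 0).sum(axis=1)   # number of documents containing each term
n_docs = counts.shape[1]              # 5 documents, one per year
idf = np.log(n_docs / doc_freq)       # appears everywhere → 0; appears rarely → large
```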
Step 9: Compute TF-IDF by multiplying each column of the TF table by the IDF vector.
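In pandas this is a one-liner with `.multiply(..., axis=0)`. The toy TF table and IDF vector below are illustrative stand-ins for the ones computed in the previous steps:

```python
import pandas as pd

# toy TF table and IDF vector — the real ones come from the earlier steps
tf = pd.DataFrame({"2019": [3.4, 0.0], "2020": [3.3, 2.6]},
                  index=["bank", "pandemic"])
idf = pd.Series([0.0, 1.6], index=["bank", "pandemic"])

# scales every column of the TF table by each term's IDF weight
tf_idf = tf.multiply(idf, axis=0)
```

Note how “bank” zeroes out entirely: however often it appears, an IDF of zero removes it from consideration.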
Step 10: Compare results year on year
We get the differences for each column using .diff(axis=1), so the 2020 column will be 2020 minus 2019, the 2019 column will be 2019 minus 2018, and so on. Hence, we remove the 2016 column since it has nothing to subtract from and would be NA.
We first check the top 10 most important words for 2020 vs. 2019 which can be obtained using .iloc[:,-1].sort_values(ascending=False).head(10).
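The two lines above can be sketched together like this, on a toy TF-IDF table with made-up numbers:

```python
import pandas as pd

# toy TF-IDF table (rows = terms, columns = years) — illustrative numbers only
tf_idf = pd.DataFrame(
    {"2016": [0.5, 0.0], "2017": [0.6, 0.0], "2018": [0.6, 0.0],
     "2019": [0.7, 0.0], "2020": [0.2, 2.4]},
    index=["usd", "pandemic"],
)

# year-on-year changes; 2016 has nothing to subtract from, so drop it
changes = tf_idf.diff(axis=1).drop(columns="2016")

# last column is 2020 minus 2019: the biggest gainers for 2020
top_2020 = changes.iloc[:, -1].sort_values(ascending=False).head(10)
```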
Unsurprisingly, the most important word in 2020 that isn’t in 2019 is “pandemic.”
We look up the second most important word, “provisions,” and see that it was relevant because provisions for loan losses increased severalfold in this period. A snippet of the report is shown below:
I also noticed that both these words were only ever mentioned in 2020.
As another example, we explore year 2019’s most important words relative to both 2020 and 2018.
Notice we sort in ascending order this time to get the most important words for 2019 vs. 2020.
And we sort back in descending order when comparing 2019 with 2018, since that column is 2019 minus 2018.
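Both comparisons look like this on a toy `changes` table (already year-on-year differenced, with made-up numbers):

```python
import pandas as pd

# toy year-on-year changes table (already differenced) — illustrative only
changes = pd.DataFrame(
    {"2017": [0.1, 0.0], "2018": [0.0, 0.0], "2019": [0.1, 0.9], "2020": [0.0, -0.9]},
    index=["bank", "usd"],
)

# 2020 column is 2020 minus 2019, so ascending puts 2019-heavy words on top
top_2019_vs_2020 = changes["2020"].sort_values(ascending=True).head(10)

# 2019 column is 2019 minus 2018, so descending works as usual
top_2019_vs_2018 = changes["2019"].sort_values(ascending=False).head(10)
```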
Eyeing both lists, the most important word is “usd.” Searching the 2019 annual report will show that this was because PNB had a USD Senior Note issuance that year.
I still think nothing beats actually reading the reports, but I hope this can help both in getting a quick sense of potentially important points and in catching things we might otherwise miss.
I would be happy to hear any thoughts on other applications or improvements.