This script should run whenever a document is uploaded.
The idea is that we generate some statistics on the content, such as:
- how long the document is, based on the number of characters and the number of words,
- the proportion of ESG topics discussed, based on the measure developed in our ESRS study (Donau et al., 2025),
- how readable it is, based on the Gunning Fog Index, and
- the tone, based on the Loughran and McDonald (2011) word list.
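As a rough illustration of the length and readability measures listed above, here is a minimal Python sketch. It uses the standard Gunning Fog formula, 0.4 * (words per sentence + 100 * share of complex words), where a complex word has three or more syllables. The function names, the word regex, and the vowel-group syllable heuristic are assumptions made for this example, not the script's actual implementation.

```python
import re

# Assumption: a "word" is a run of letters, optionally with one apostrophe part.
WORD_RE = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)?")


def count_syllables(word: str) -> int:
    """Rough heuristic: count groups of consecutive vowels (at least 1)."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def length_stats(text: str) -> dict:
    """Document length in characters and in words."""
    return {"n_chars": len(text), "n_words": len(WORD_RE.findall(text))}


def gunning_fog(sentences: list[str]) -> float:
    """Gunning Fog Index over a list of already-parsed sentences."""
    words = [w for s in sentences for w in WORD_RE.findall(s)]
    if not sentences or not words:
        return 0.0  # assumption: empty input yields 0.0 rather than an error
    n_complex = sum(1 for w in words if count_syllables(w) >= 3)
    return 0.4 * (len(words) / len(sentences) + 100 * n_complex / len(words))
```

The syllable heuristic is deliberately crude (it overcounts silent vowels, for example), so a dedicated syllable counter may be preferable in practice.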
To do so, we first parse the sentences from the pages and then apply the measures to:
- (a) sentences from all pages,
- (b) sentences marked as discussing environmental or social matters,
- (c) sentences from pages of the sustainability statement, and
- (d) sentences from pages of the sustainability statement that are marked as discussing environmental or social matters.
Subsets (c) and (d) are only computed if pdfpage_sust_start and pdfpage_sust_end are specified.
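The selection of subsets (a) to (d) could be sketched as follows. The `Sentence` class, its field names, the `build_subsets` function, and the dictionary keys are hypothetical and chosen for illustration only. The sketch also assumes that pdfpage_sust_start and pdfpage_sust_end are inclusive page numbers.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Sentence:
    text: str
    page: int    # PDF page the sentence was parsed from
    is_es: bool  # marked as discussing environmental or social matters


def build_subsets(
    sentences: list[Sentence],
    pdfpage_sust_start: Optional[int] = None,
    pdfpage_sust_end: Optional[int] = None,
) -> dict[str, list[Sentence]]:
    """Return the sentence subsets the measures are applied to."""
    subsets = {
        "all": list(sentences),                   # (a)
        "es": [s for s in sentences if s.is_es],  # (b)
    }
    # (c) and (d) require both page bounds of the sustainability statement.
    if pdfpage_sust_start is not None and pdfpage_sust_end is not None:
        in_sust = [
            s for s in sentences
            if pdfpage_sust_start <= s.page <= pdfpage_sust_end
        ]
        subsets["sust"] = in_sust                             # (c)
        subsets["sust_es"] = [s for s in in_sust if s.is_es]  # (d)
    return subsets
```

Each measure can then be computed once per key, so the output contains the same statistics for every available subset.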
The output JSON should be stored together with the document.
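One way to keep the output next to the document is to write it into the same directory under a derived file name. The sketch below assumes the document lives on a local file system. The `save_stats` function and the `.stats.json` suffix are hypothetical, since the actual storage location and naming are not specified here.

```python
import json
from pathlib import Path


def save_stats(document_path, stats: dict) -> Path:
    """Write the statistics as JSON next to the uploaded document."""
    document_path = Path(document_path)
    # Assumption: "report.pdf" gets a sibling file "report.pdf.stats.json".
    out_path = document_path.with_name(document_path.name + ".stats.json")
    out_path.write_text(
        json.dumps(stats, indent=2, ensure_ascii=False),
        encoding="utf-8",
    )
    return out_path
```

Appending the suffix to the full file name, rather than replacing the extension, avoids collisions between documents that share a stem, such as report.pdf and report.docx.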