mlibrary/serial_analysis

Using U-M GPT Toolkit for Serial Document Analysis

Purpose

This document describes a process that allows a user to analyze individual documents in a serial fashion using the U-M GPT Toolkit.

The U-M GPT Toolkit allows a user to interact with one of several LLMs U-M has contracted to access (e.g., GPT-4o from OpenAI). This analysis is set to use GPT-4o by default, but this is configurable.

Why Conduct Serial Document Analysis?

The University of Michigan offers two generative AI (GenAI) tools with chat interfaces: U-M GPT and U-M Maizey. Each of these has strengths and limitations for analyzing a set of documents. For instance, U-M GPT is effective for detailed analysis of one or a small number of documents. U-M Maizey is effective for exploring a larger corpus of material, but less effective for detailed or comprehensive analysis (e.g., identifying all instances of a topic within one or many documents). You can read more about these strengths and weaknesses in the Prompt Guide created by the U-M Library GenAI Instruction Team.

The process described in this document uses the U-M GPT Toolkit to automate U-M GPT, which enables the individual document analysis strengths of U-M GPT to be applied to a large number of documents serially.

For instance, this process may be preferred if a user wanted to extract pre-determined information such as research methods and results or other topical information from a large set of research articles. A “large set” would be defined as more articles than it is convenient to submit for analysis one-by-one in U-M GPT.
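The serial pattern itself is simple: extract each PDF's text, submit it to the LLM one document at a time, and append each answer as a row in a results table. As a conceptual sketch only (this is not the toolkit's actual code; `analyze_document` is a hypothetical stand-in for whatever function submits one document to the LLM):

```python
import csv

def analyze_documents(pdf_texts, analyze_document, fieldnames, out_path):
    """Serially analyze documents: one LLM call per document,
    one CSV row per result.

    pdf_texts        -- dict mapping each PDF path to its extracted text
    analyze_document -- hypothetical stand-in for the function that sends
                        one document's text to the LLM and returns a dict
                        keyed by the desired field names
    """
    with open(out_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        for path, text in pdf_texts.items():
            writer.writerow(analyze_document(path, text))  # one call per document
```

Because each document is analyzed in its own request, the per-document detail of U-M GPT is preserved while the loop supplies the scale.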

Requirements

This method involves:

  1. Obtaining an API key from U-M ITS (see the link to request a key at ITS Services Features & Benefits).

  2. Having a U-M shortcode and the authority to use it.

  3. Installing and using Docker on your computer.

    1. Using Docker requires running typed commands from a command-line interface.
  4. Editing text files to configure and customize your analysis.

  5. PDF files to process (instructions on how to obtain samples for testing are included below).

    1. PDF files containing fewer than 260,000 tokens of text, which is equivalent to about 96,000 words, or roughly 300 pages of double-spaced text.

    Note:

    • At this time, this method can only be used with documents in PDF format.
    • The OpenAI GPT-4o model is used. The temperature is set to 0.0, which reduces randomness and diversity in responses; answers will be consistent but less imaginative or unexpected.
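At temperature 0.0 the model nearly always selects its highest-probability continuation, so repeated runs over the same document return near-identical answers, which is what you want for structured extraction. In a chat-completions-style request, temperature is a top-level parameter; the field names below follow the common OpenAI-style API and are illustrative, since the toolkit's exact request format is not shown in this guide:

```python
# Illustrative chat-completions-style request body (field names assumed
# from the common OpenAI-style API; the toolkit's actual code may differ).
request_body = {
    "model": "gpt-4o",   # default model for this process
    "temperature": 0.0,  # minimal randomness: consistent, repeatable answers
    "messages": [
        {"role": "system", "content": "You are a careful document analyst."},
        {"role": "user", "content": "Please analyze the following article."},
    ],
}
```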

Guide

Request an API key from ITS

  1. Click the link and go through the process to request a key at ITS Services Features & Benefits. ITS uses keys to limit who has access to the U-M GPT Toolkit and assign responsibility for payment. A U-M shortcode is required to use the service, though it is not required in the initial request.
  2. You will need to enter this key into the .env configuration file in order to run the serial analysis script (see below).
  3. If your request is approved, you will receive an email from ITS that notifies you of several requirements. As of Winter 2025, these include:
    1. Abiding by Proper Use of Information Resources, Information Technology, and Networks at the University of Michigan (SPG 601.07), in addition to all relevant state and federal laws.
    2. Only using the tool for University business purposes and not for personal purposes.
    3. Only accessing the API with the provided API keys or authentication methods.
    4. Keeping your API key and any data retrieved using the service secure.
    5. Using the service only with data approved to the level designated in the email (e.g., moderate data).
    6. Being responsible for all charges on your API key (including if the key is shared with students under your supervision).
    7. Sharing any of the provided resources only with others within the University of Michigan community.

Install Docker

  1. Download Docker Desktop.
    1. Mac: Choose the version for Apple Silicon or Intel chip based on your CPU type (see Apple's support page for guidance).
    2. Windows/Linux: Follow platform-specific installation instructions on Docker’s website.
    3. Docker is software that allows users to create and use “containers.” These are environments that run independently of other software on your computer and greatly simplify the task of developing applications and processes to run on multiple platforms (e.g., Windows, Mac, Linux).
    4. We use Docker for that purpose here. After installing Docker, you will create a container that has all the files necessary to run the serial analysis process. You will just need to modify a few files (see below) and copy the documents you want to analyze to a designated folder.
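For context on what "a container that has all the files necessary" means: a container image is described by a Dockerfile, which names a base system, installs dependencies, and copies in the project code. The sketch below is generic and illustrative only; the actual Dockerfile in the serial_analysis repository may differ.

```dockerfile
# Generic, illustrative Dockerfile for a Python analysis script;
# the repository's real Dockerfile may differ.
FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
CMD ["python", "main.py"]
```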

Use Docker

The instructions below are to build a Docker container using a terminal application, such as:
Mac: Terminal
Linux: the default terminal for your version of Linux
Windows: Windows Terminal

Running these processes will produce initial results (so you can see that everything is working), which you can then customize for your needs (see Edit Text Files to Customize Your Analysis).

  1. Use a text editor of your choice to open and edit text files when instructed.
    1. Mac: e.g., TextEdit or nano
    2. Linux: e.g., nano or vim
    3. Windows: e.g., Notepad
  2. Download the serial_analysis code directory from GitHub.
    1. Go to https://github.com/mlibrary/serial_analysis
    2. From the green "<> Code" pop-up menu, choose "Download ZIP".
    3. Double-click the .zip file to decompress it.
  3. Rename the resulting directory to "serial_analysis", with nothing else in the name (e.g., remove a "-main" suffix).
  4. Place the serial_analysis code directory in your home directory.
    1. Mac: /Users/username (or ~/)
    2. Linux: /home/username (or ~/)
    3. Windows: C:\Users\YourUsername\ (or %USERPROFILE%)
  5. Run the Docker Desktop application.
  6. In the terminal application, on the command line, type:
    1. Mac & Linux
      1. cd ~/serial_analysis
      2. cp env.example.txt .env
    2. Windows
      1. cd %USERPROFILE%\serial_analysis
      2. copy env.example.txt .env
  7. Open and read the env.example.txt file, which has important instructions within it.
    1. Populate the .env file with your key and shortcode, and take great care not to share the file or its contents.
  8. Create a folder in the input_and_output directory named "PDFs" and place the following PDFs within it for the initial test run. If you encounter paywalls, turn on the UMICH VPN.
    1. https://agsjournals.onlinelibrary.wiley.com/doi/10.1111/jgs.16817
    2. https://pmc.ncbi.nlm.nih.gov/articles/PMC4090288/
    3. https://europepmc.org/backend/ptpmcrender.fcgi?accid=PMC2442159&blobtype=pdf
  9. Back in the terminal application, on the command line, type (Mac, Linux, and Windows):
    1. docker build -t serial_analysis .

You will get a screen that looks something like the screenshot below as the container is built. It may take a few minutes. Docker is configuring the container, installing all of the code libraries specified in the serial_analysis files that are needed to run the serial analysis process.
[Screenshot of Mac Terminal window, showing the docker build command and its output.]

  1. On the command line, type:
    1. Mac and Linux
      1. docker run -v ~/serial_analysis/input_and_output:/input_and_output serial_analysis
    2. Windows
      1. docker run -v %USERPROFILE%\serial_analysis\input_and_output:/input_and_output serial_analysis
  2. The results (CSV) will be output to the following location:
    1. Mac and Linux: ~/serial_analysis/input_and_output/extracted_data.csv
    2. Windows: %USERPROFILE%\serial_analysis\input_and_output\extracted_data.csv
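You can open extracted_data.csv in a spreadsheet program, or load it with a few lines of Python. The field names in the commented example come from the sample configuration described later in this guide; your columns will match whatever you put in fieldnames.txt.

```python
import csv

def load_results(csv_path):
    """Load the serial-analysis output CSV into a list of dicts,
    one dict per analyzed document."""
    with open(csv_path, newline="") as f:
        return list(csv.DictReader(f))

# Example (adjust the field names to match your fieldnames.txt):
# for row in load_results("extracted_data.csv"):
#     print(row["PDF Path"], "->", row["Article title"])
```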

Edit Text Files to Customize Your Analysis

The main files to modify to customize your analysis are in the input_and_output folder.

  • PDFs

    • This is the folder where you will put your PDF documents for analysis. Add your own documents.
    • Only PDF documents are supported at this time.
  • You should see four files:

    • assistant_message.txt

      • The assistant message provides the LLM with an example of the text it will encounter and an example of how it should respond.
      • You can run the process without specifying anything in the assistant_message.txt file, but your results will likely be of lower quality.
      • We’ll look at the example response first:

      Example response

      • The Appendix is a copy of the default file. Lines 62-90 represent the information the user desires to extract or analyze from the documents. This includes:
        • PDF Path
        • Article title
        • Author Details
        • Year of publication
        • Sample size
        • Caregiver population(s)
        • Assessment tools used to capture details about the caregiver network
        • Outcomes measures that were evaluated
      • You can see that for each element of information, there is an example of how the user would like the LLM to respond. For instance, for Sample size, the user specifies:
        • "Sample size": ["Total: 2,589 community-living Medicare fee-for-service beneficiaries age ≥ 65 with self-care or mobility disability and their primary family or unpaid caregiver."],
      • By customizing the example text (e.g., [“Total: 2,589 community-living…”]), you can show the LLM how you would like it to respond.
      • You also need to update fieldnames.txt (see below) to match your list of information elements to extract, one per line.

      Example text

      • In lines 2-60 of the Appendix, the user has selected example text from one of the documents to be analyzed. The user selected the example text so that it includes the kinds of information that are desired for extraction.
      • Including these examples helps the LLM better understand the kinds of text it is looking for in its analysis.
    • fieldnames.txt

      • This file contains just the names of the fields specified in assistant_message.txt. For instance, in the example above, it would contain:
        • PDF Path
        • Article title
        • Author Details
        • Year of publication
        • Sample size
        • Caregiver population(s)
        • Assessment tools used to capture details about the caregiver network
        • Outcomes measures that were evaluated
    • system_message.txt

      • The system message corresponds to the “system prompt” in Maizey. You can put your instructions for the system prompt here.
    • user_message.txt

      • The user message specifies the task you would like the LLM to carry out. This corresponds to the prompt the user would type into the Maizey chat. In the example we’ve given, the system prompt gives instructions for what should happen when the user asks the LLM to “analyze” an article. The user prompt, therefore, is:

        Please analyze the following article. If you don't know the answer, just say that you don't know, don't try to make up an answer.

  • After you customize these files, you will need to run the docker run command again:

    • Mac and Linux
      • docker run -v ~/serial_analysis/input_and_output:/input_and_output serial_analysis
    • Windows
      • docker run -v %USERPROFILE%\serial_analysis\input_and_output:/input_and_output serial_analysis
  • The new results will overwrite the old results in:

    • Mac and Linux: ~/serial_analysis/input_and_output/extracted_data.csv
    • Windows: %USERPROFILE%\serial_analysis\input_and_output\extracted_data.csv
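The four files above line up with the message roles of a chat-style LLM API: system_message.txt supplies the system message, user_message.txt the user message, and assistant_message.txt an example response for the model to imitate (fieldnames.txt only controls the CSV columns). Below is a hedged sketch of how the pieces might be assembled for each document; the toolkit's actual request construction may order or combine them differently:

```python
def build_messages(system_text, assistant_example, user_text, document_text):
    """Illustrative mapping of the customization files onto chat roles.
    The toolkit's real request construction may differ."""
    return [
        {"role": "system", "content": system_text},           # system_message.txt
        {"role": "assistant", "content": assistant_example},  # assistant_message.txt
        # user_message.txt plus the extracted text of one PDF:
        {"role": "user", "content": user_text + "\n\n" + document_text},
    ]
```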

Keep in mind that trial and error is often necessary to configure prompts to obtain the desired output, and patience may be necessary to work with the files and the API. Please reach out to jjyork@umich.edu with any questions.

Appendix

Example:
Text: 'Author Manuscript Author Manuscript Author Manuscript Author Manuscript
HHS Public Access
Author manuscript
J Am Geriatr Soc . Author manuscript; available in PMC 2022 January 01.
Published in final edited form as:
J Am Geriatr Soc . 2021 January ; 69(1): 129–139. doi:10.1111/jgs.16817.
Do Caregiving Factors Affect Hospitalization Risk Among Disabled Older Adults?
Halima Amjad, MD, MPHa,b, John Mulcahy, MSPHc,d, Judith D. Kasper, PhDc,d, Julia Burgdorf, PhDc,d, David L. Roth, PhDa,b, Ken Covinsky, MD, MPHe, Jennifer L. Wolff, PhDc,d
aJohns Hopkins University School of Medicine, Division of Geriatric Medicine and Gerontology, Baltimore, MD
bCenter on Aging and Health, Johns Hopkins University, Baltimore, MD
cJohns Hopkins University Bloomberg School of Public Health, Department of Health Policy and Management, Baltimore, MD
dRoger C. Lipitz Center for Integrated Health Care, Johns Hopkins University Bloomberg School of Public Health, Baltimore, MD
eUniversity of California San Francisco, Division of Geriatrics, San Francisco, CA

Participants: 2,589 community-living Medicare fee-for-service beneficiaries age ≥ 65 (mean age 79, 63% women, 31% with dementia) with self-care or mobility disability and their primary family or unpaid caregiver.

Inclusion and exclusion criteria for this study maximized cross-year comparability in measurement of disability32 and caregiving33 and have been described previously.31,34,35

From NLTCS and NHATS, we included community-living older adult participants whoreceived help with at least one self-care (eating, dressing, bathing, or toileting) or indoor mobility (transferring or getting around inside) task from family or unpaid caregivers who completed a caregiver survey. We excluded residents of nursing homes or other residential facilities (n=4,318). We also excluded older adults insured by Medicare Advantage plans (n=833) and focused on those enrolled in fee-for-service Medicare, for whom we could assess hospitalization. The NLTCS-linked ICS was administered to the “primary” caregiver, defined as the family or unpaid nonfamily caregiver who helped the participant the most. In the NHATS-linked NSOC, we included a single “primary” caregiver as well, defined as the family or unpaid nonfamily caregiver providing the most hours of help weekly among the caregivers interviewed. Our sample included 2,589 older adult and caregiver dyads.

Caregiver Factors
Caregiver characteristics and caregiving circumstances were self-reported in ICS and NSOC. Sociodemographic caregiver characteristics included age, gender, and relationship to older adult (spouse or non-spouse). Caregiving circumstances included duration of caregiving, whether the caregiver lived with the older adult, employment status (holding paid employment unrelated to caregiving), hours of care provided each week, type of help provided, and self-rated health. Type of help encompassed self-care tasks, mobility, and healthcare tasks. Healthcare tasks included medication management or assistance, skin, wound, or dressing care, and administration of injections. We also examined measures of caregiving support and challenges. Caregiving support included support group and respite care use. Caregiving challenges refer to appraisal of financial strain, emotional strain, and physical strain based on responses to questions about whether helping the participant was financially, emotionally, or physically difficult for them.31'

{{
"PDF Path": ["PDFs/nihms-1645441.pdf"],
"Article title": ["Do Caregiving Factors Affect Hospitalization Risk Among Disabled Older Adults?"],
"Author Details": ["Amjad, H."],
"Year of publication": ["2021"],
"Sample size": ["Total: 2,589 community-living Medicare fee-for-service beneficiaries age ≥ 65 with self-care or mobility disability and their primary family or unpaid caregiver."],
"Caregiver population(s)": ["Caregivers are providing help to older adults with self-care or mobility disabilities, and 31% of these older adults have dementia." ],
"Caregiver definition":["Primary caregivers are defined as family or unpaid nonfamily caregivers who help the participant the most. In the NHATS-linked NSOC, the primary caregiver is the one providing the most hours of help weekly among the caregivers interviewed."],
"Assessment tools used to capture details about the caregiver network":["The study assessed caregiver characteristics and caregiving circumstances through self-reported surveys (ICS and NSOC). It included details such as caregiver age, gender, relationship to the older adult, duration of caregiving, living arrangement, employment status, hours of care provided weekly, type of help provided, and self-rated health. It also examined caregiving support and challenges, including support group and respite care use, and appraisal of financial, emotional, and physical strain."],
"Outcomes measures that were evaluated":["The study evaluated caregiver-reported characteristics and challenges, such as physical strain, emotional strain, financial strain, and the type and duration of care provided. It also assessed the impact of these factors on the risk of older adult hospitalization within 12 months."]
}}

Note: The lines immediately above, starting with "{{" and ending with "}}", represent the formatting (JSON) that must be passed to U-M GPT as an example for it to mimic. The exact structure of this needs to be maintained.
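The doubled braces are consistent with Python's str.format escaping, in which {{ and }} render as literal { and }; after that substitution, the example must be ordinary JSON. A quick way to sanity-check a customized example before a run (this assumes the doubled-brace convention; the toolkit may process the file differently):

```python
import json

def check_example(example_text):
    """Un-escape doubled braces ({{ -> {, }} -> }) and confirm the
    result parses as valid JSON. Raises ValueError if it does not."""
    unescaped = example_text.replace("{{", "{").replace("}}", "}")
    return json.loads(unescaped)
```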

About

UMGPT Toolkit Python example script to serially analyze PDFs and extract data to a CSV file.
