Skip to content

1 rewrite of scan batch dir#2

Open
bdgregg wants to merge 22 commits intomasterfrom
1-rewrite-of-scan-batch-dir
Open

1 rewrite of scan batch dir#2
bdgregg wants to merge 22 commits intomasterfrom
1-rewrite-of-scan-batch-dir

Conversation

@bdgregg
Copy link
Contributor

@bdgregg bdgregg commented Feb 26, 2026

This PR addresses issue #1.

This is a full rewrite of the original scan-batch-dir script to allow it to be more modular so that changes needed in the future should be easier to implement. This also added the ability to use PDF files as newspaper issues.

@bdgregg bdgregg self-assigned this Feb 26, 2026
@bdgregg bdgregg linked an issue Feb 26, 2026 that may be closed by this pull request
Copy link
Member

@ctgraham ctgraham left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Initial readthrough with some questions and suggestions.

@@ -1,4 +1,5 @@
#!/usr/bin/python3
#!/usr/bin/python3.12

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Explicitly depending on 3.12 should be fine as long as the machine running the script has 3.12 (in the future, some machines may have versions >3.12). It might help to add a warning like if sys.version_info() < (3, 12): logger.warn("unsupported python version")

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One way to not use the 3.6 default, but also not use an explicitly defined version would be to use #!/usr/bin/env python and then run the script within a virtual environment generated by python 3.12

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The choosing of what version of python being run has been an ongoing issue for a while. There are various ways this can be achieved. None seem to be the best way. RHEL 8 ships with 3.6, RHEL 9 ships with 3.9, and the Ansible version used to build our systems requires 3.12. It is recommended by RedHat to not remove the shipped version as it impacts their code. But 3.6 is too old for some of the python modules we need to use. How best to not have to worry about the required version? Your test of sys.version_info() will probably help but only if the OS can expose the correct version to the script at runtime even if 3.12 is installed. RHEL has the 'alternatives' command but again changing the default version of python may impact OS required scripts is not recommended. This is why I hardcoded the 3.12 version. Additionally, I don't believe logger.warn will work as the logger hasn't been setup at the time of the check which I would do very early in the script.

print(f"Invalid content in YAML file: {path}.")
raise
except Exception as e:
print(f"An unexpected error occurred while reading the YAML file: {str(e)}")

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Log yaml error so it can be seen in log file if script is executed from non cli environment

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ojas-uls-dev at the time of executing read_yaml_file the log hasn't been created yet for the function to write to the log. This is a catch 22 if we move the creation of the log file before the read_yaml_file as the config has the default log file settings. I'm not sure what the best approach is to resolve this.

parent (str) The parent directory of the object.
df (pd.DataFrame) The Pandas DataFrame we will be updating.
# Validate columns
if match_column not in df.columns:

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is match_column intended to be global/inherited from caller? if not, it should be passed as arg.

match_value and return_column also seem to be inherited

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ojas-uls-dev, the match_column and return_column are both passed into the get_value_from_df function as arguements. They are not per say global.

headers["Authorization"] = f"Bearer {auth_token}"

try:
response = requests.get(url, headers=headers, params=params)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

response.raise_for_status() can also be used to handle 403s and 404s in same try except block

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ojas-uls-dev, are you suggesting halting the program if a 403/404 is encountered? And yes, we can add stanzas to handle 403/404 as well. Just wasn't sure you were indicating a failure should trigger a stop or just handle the 403/404 differently.

return(df)


def get_directory(directory: str):

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

os.walk does something very similar to this function

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ojas-uls-dev, it does look like it might do the same as this function except the function sorts the result alphabetically. os.walk doesn't look like it does that, but maybe there is away to sort the os.walk result the same way. This would be something that could be looked into.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Rewrite of scan-batch-dir

3 participants