
add s3 glacier connector example #416

Open

fivetran-surabhisingh wants to merge 4 commits into main from s3_glacier_add_connector_example

Conversation

@fivetran-surabhisingh (Collaborator) commented

Jira ticket

Closes <ADD TICKET LINK HERE, EACH PR MUST BE LINKED TO A JIRA TICKET>

Description of Change

<MENTION A SHORT DESCRIPTION OF YOUR CHANGES HERE>

Testing

<MENTION ABOUT YOUR TESTING DETAILS HERE, ATTACH SCREENSHOTS IF NEEDED (WITHOUT PII)>

Checklist

Some tips and links to help validate your PR:

  • Tested the connector with fivetran debug command.
  • Added/Updated example specific README.md file, refer here for template.
  • Followed Python Coding Standards, refer here

@fivetran-surabhisingh fivetran-surabhisingh self-assigned this Oct 31, 2025
@fivetran-surabhisingh fivetran-surabhisingh requested review from a team as code owners October 31, 2025 09:43
@github-actions github-actions bot added the size/L PR size: Large label Oct 31, 2025
github-actions bot commented Oct 31, 2025

🧹 Python Code Quality Check

✅ No issues found in Python Files.

🔍 See how this check works

This comment is auto-updated with every commit.

@fivetran-surabhisingh fivetran-surabhisingh added hackathon For all the PRs related to the internal Fivetran 2025 Connector SDK Hackathon. and removed size/L PR size: Large labels Oct 31, 2025
@fivetran-dejantucakov fivetran-dejantucakov self-assigned this Oct 31, 2025
Copilot AI (Contributor) left a comment

Pull Request Overview

This PR adds a new S3 Glacier-aware connector example that demonstrates how to sync S3 object metadata using boto3, with special handling for Glacier storage classes and their restoration status.

  • Implements incremental sync using LastModified timestamps with state checkpointing
  • Handles Glacier storage classes by inspecting restore headers via head_object calls
  • Uses pagination for memory-efficient processing of large S3 buckets
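The Glacier-handling bullet above relies on the `Restore` header that `head_object` returns for archived objects. As a minimal, hypothetical sketch (function and field names are illustrative, not the PR's actual code), parsing that header could look like:

```python
def parse_restore_header(restore):
    """Parse the S3 'Restore' header returned by head_object, e.g.
    'ongoing-request="false", expiry-date="Fri, 21 Dec 2012 00:00:00 GMT"'.

    Returns a dict with a restore status and an optional expiry string.
    """
    if not restore:
        # No header means no restore has been requested for this object.
        return {"restore_status": "none", "restore_expiry": None}
    ongoing = 'ongoing-request="true"' in restore
    expiry = None
    marker = 'expiry-date="'
    if marker in restore:
        start = restore.index(marker) + len(marker)
        expiry = restore[start:restore.index('"', start)]
    return {
        "restore_status": "in_progress" if ongoing else "restored",
        "restore_expiry": expiry,
    }
```

In practice this helper would be fed `head_object(...).get("Restore")` for objects whose storage class is one of the Glacier classes.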

Reviewed Changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 26 comments.

File Description
connectors/s3_glacier/connector.py Main connector implementation with S3 listing, Glacier restore detection, and incremental sync logic
connectors/s3_glacier/configuration.json Configuration template defining AWS credentials and bucket parameters
connectors/s3_glacier/requirements.txt Declares boto3 and botocore dependencies
connectors/s3_glacier/README.md Comprehensive documentation covering setup, authentication, features, and table schema
Comments suppressed due to low confidence (1)

connectors/s3_glacier/connector.py:11

  • Import of 'Optional' is not used.
from typing import Any, Dict, List, Optional

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@github-actions github-actions bot added the size/L PR size: Large label Nov 3, 2025
@@ -0,0 +1,211 @@
# Amazon S3 (Glacier-aware) Connector Example

## Connector overview
Contributor left a comment

@fivetran-surabhisingh You need to list and link to the connector on the main README - https://github.com/fivetran/fivetran_connector_sdk/blob/main/README.md

- Windows: 10 or later (64-bit only)
- macOS: 13 (Ventura) or later (Apple Silicon [arm64] or Intel [x86_64])
- Linux: Distributions such as Ubuntu 20.04 or later, Debian 10 or later, or Amazon Linux 2 or later (arm64 or x86_64)
- Active **AWS account** with S3 access.
Contributor left a comment

Suggested change
- Active **AWS account** with S3 access.
- Active AWS account with S3 access.

@fivetran-surabhisingh fivetran-surabhisingh added the top-priority A top priority PR for review label Nov 6, 2025
Copilot AI (Contributor) left a comment

Pull Request Overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 13 comments.

Comments suppressed due to low confidence (1)

connectors/s3_glacier/connector.py:11

  • Import of 'Optional' is not used.
from typing import Any, Dict, List, Optional

@fivetran-sahilkhirwal (Contributor) left a comment

Please address these along with Copilot's and the tech writer's review comments.

Copilot AI (Contributor) left a comment

Pull Request Overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 14 comments.

log.info(f"Starting S3 sync bucket={bucket} prefix={prefix} watermark={watermark}")

new_wm = watermark
for row in _iterate_objects(s3, bucket, prefix, page_size):
Copilot AI commented on Nov 17, 2025

Function name mismatch: calling _iterate_objects but the function is defined as iterate_objects (without leading underscore) on line 94. This will cause a NameError at runtime.

Change to: for row in iterate_objects(s3, bucket, prefix, page_size):

Suggested change
for row in _iterate_objects(s3, bucket, prefix, page_size):
for row in iterate_objects(s3, bucket, prefix, page_size):

Copilot uses AI. Check for mistakes.
"aws_secret_access_key": "<YOUR_AWS_SECRET_ACCESS_KEY>",
"aws_region": "<YOUR_AWS_REGION>",
"bucket": "<YOUR_S3_BUCKET_NAME>",
"prefix": "<OPTIONAL_S3_PREFIX>"
Copilot AI commented on Nov 17, 2025

Configuration parameter mismatch with configuration.json. The README lists "prefix": "<OPTIONAL_S3_PREFIX>" as optional, but configuration.json shows "prefix": "<YOUR_S3_FOLDER_PATH>".

The placeholder descriptions should match. Use consistent wording:

  • Either <OPTIONAL_S3_PREFIX> in both files
  • Or <YOUR_S3_FOLDER_PATH_OPTIONAL> in both files

Additionally, since prefix is optional (has a default value of "" in code), the configuration.json could either omit this field or clearly mark it as optional in the placeholder.

Copilot generated this review using guidance from repository custom instructions.
"GLACIER_IR", "GLACIER_FLEXIBLE_RETRIEVAL"
}

def iso(dt):
Copilot AI commented on Nov 17, 2025

Missing type hint for parameter and return type. Function signature should be:

def iso(dt) -> str | None:

Or if using older Python typing:

from typing import Optional
def iso(dt) -> Optional[str]:

This improves code clarity and enables better IDE support.
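For illustration, a fully annotated version of such a helper might look like the following (the conversion body is a sketch under the assumption that `iso` formats a datetime as an ISO-8601 UTC string; the PR's actual logic may differ):

```python
from datetime import datetime, timezone
from typing import Optional


def iso(dt: Optional[datetime]) -> Optional[str]:
    """Convert a datetime to an ISO-8601 UTC string, or return None if absent."""
    if dt is None:
        return None
    # Normalize to UTC before formatting so all emitted timestamps agree.
    return dt.astimezone(timezone.utc).isoformat()
```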

aws_session_token=configuration.get("aws_session_token", ""),
region_name=configuration.get("aws_region"),
)
return session.client("s3", config=BotoConfig(retries={"max_attempts": 10, "mode": "standard"}))
Copilot AI commented on Nov 17, 2025

Magic number without constant definition. The retry count 10 should be defined as a constant at the module level:

__MAX_RETRY_ATTEMPTS = 10

Then use it in the config:

config=BotoConfig(retries={"max_attempts": __MAX_RETRY_ATTEMPTS, "mode": "standard"})

This makes the code more maintainable and the retry policy explicit.

Comment on lines 145 to 153
"""
Define the update function, which is a required function, and is called by Fivetran during each sync.
See the technical reference documentation for more details on the update function
https://fivetran.com/docs/connectors/connector-sdk/technical-reference#update
Args:
configuration: A dictionary containing connection details
state: A dictionary containing state information from previous runs
The state dictionary is empty for the first sync or for any full re-sync
"""
Copilot AI commented on Nov 17, 2025

Incorrect docstring for update function. The docstring has extra text that doesn't match the required template. Lines 150-152 should be:

    """
    Define the update function which lets you configure how your connector fetches data.
    See the technical reference documentation for more details on the update function:
    https://fivetran.com/docs/connectors/connector-sdk/technical-reference#update
    Args:
        configuration: a dictionary that holds the configuration settings for the connector.
        state: a dictionary that holds the state of the connector.
    """

Remove "which is a required function, and is called by Fivetran during each sync" and "The state dictionary is empty for the first sync or for any full re-sync" as these deviate from the required template.

Copilot AI (Contributor) left a comment

Pull Request Overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 9 comments.

@fivetran-sahilkhirwal (Contributor) left a comment

Please rebase the PR and address the copilot comments and re-request the review :)

Copilot AI (Contributor) left a comment

Pull Request Overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 15 comments.

"aws_secret_access_key": "<YOUR_AWS_SECRET_ACCESS_KEY>",
"aws_region": "<YOUR_AWS_REGION>",
"bucket": "<YOUR_S3_BUCKET_NAME>",
"prefix": "<OPTIONAL_S3_PREFIX>"
Copilot AI commented on Dec 8, 2025

The prefix placeholder in the configuration example is inconsistent with the table below. Change <OPTIONAL_S3_PREFIX> to <YOUR_S3_FOLDER_PATH> to match the configuration.json file, or update both to use a more descriptive placeholder like <YOUR_S3_OBJECT_PREFIX_OPTIONAL>.

Comment on lines +111 to +112
## Additional considerations
This example is provided to help teams integrate AWS S3 metadata and storage class history into their data pipelines. Fivetran makes no guarantees regarding support or maintenance. For assistance, contact Support or submit improvements via pull request.
Copilot AI commented on Dec 8, 2025

The "Additional considerations" section does not match the required disclaimer format. It must use this exact text:

## Additional considerations
The examples provided are intended to help you effectively use Fivetran's Connector SDK. While we've tested the code, Fivetran cannot be held responsible for any unexpected or negative consequences that may arise from using these examples. For inquiries, please reach out to our Support team.

| bucket | The target S3 bucket | Yes |
| prefix | Object prefix (folder) filter | No |

Note: Do not commit this file to source control.
Copilot AI commented on Dec 8, 2025

The "Configuration file" section does not include the required warning about not versioning sensitive data. Add this note after the table:

Note: Ensure that the `configuration.json` file is not checked into version control to protect sensitive information.
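One common way to enforce that note is a `.gitignore` entry in the connector directory (illustrative; the repository may handle this differently):

```
# .gitignore: keep connector credentials out of version control
configuration.json
```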

session = boto3.session.Session(
aws_access_key_id=configuration.get("aws_access_key_id"),
aws_secret_access_key=configuration.get("aws_secret_access_key"),
aws_session_token=configuration.get("aws_session_token", ""),
Copilot AI commented on Dec 8, 2025

Configuration field aws_session_token is used in the code but not declared in configuration.json. Either remove this field from the code if it's not needed, or add it to configuration.json with a placeholder value like "aws_session_token": "<YOUR_AWS_SESSION_TOKEN_OPTIONAL>".

Contributor left a comment

Please check this. This key is missing in the configuration.json. Please add it there as well as in the readme :)

botocore==1.40.59
```

Note: `fivetran_connector_sdk` and `requests` are pre-installed in the Fivetran runtime and should not be listed.
Copilot AI commented on Dec 8, 2025

The "Requirements file" section does not include the required note about pre-installed packages. Replace line 58 with:

Note: The `fivetran_connector_sdk:latest` and `requests:latest` packages are pre-installed in the Fivetran environment. To avoid dependency conflicts, do not declare them in your `requirements.txt`.

Comment on lines +191 to +198
# Required for SDK loader
connector = Connector(update=update, schema=schema)

# Entry point for local testing
if __name__ == "__main__":
with open("configuration.json", "r") as f:
configuration = json.load(f)
connector.debug(configuration=configuration)
Copilot AI commented on Dec 8, 2025

The comment for the main block must use the exact required format. Replace with:

# Create the connector object using the schema and update functions
connector = Connector(update=update, schema=schema)

# Check if the script is being run as the main module.
# This is Python's standard entry method allowing your script to be run directly from the command line or IDE 'run' button.
# This is useful for debugging while you write your code. Note this method is not called by Fivetran when executing your connector in production.
# Please test using the Fivetran debug command prior to finalizing and deploying your connector.
if __name__ == "__main__":
    # Open the configuration.json file and load its contents
    with open("configuration.json", "r") as f:
        configuration = json.load(f)

    # Test the connector locally
    connector.debug(configuration=configuration)

Copilot generated this review using guidance from repository custom instructions.
Contributor left a comment

Please follow this. This is the standard for defining the main module in the connector SDK examples. The comments make it more readable for the users. please use this exact template :)

| last_modified | UTC_DATETIME | Last modified timestamp |
| restore_status | STRING | Glacier restore status |
| restore_expiry | UTC_DATETIME | Glacier restore expiration time (if applicable) |
| _fivetran_deleted| BOOLEAN | Soft delete flag |
Copilot AI commented on Dec 8, 2025

The _fivetran_deleted column is listed in the table schema, but it's not defined in the schema() function in connector.py. This is a system column automatically added by Fivetran and should not be included in connector documentation unless explicitly managed by the connector code. Remove this row from the table.

Suggested change
| _fivetran_deleted| BOOLEAN | Soft delete flag |

new_wm = lm

if new_wm:
op.checkpoint({"s3_objects": {"last_modified": new_wm}})
Copilot AI commented on Dec 8, 2025

Missing required comment before op.checkpoint(). According to the coding guidelines, EVERY op.checkpoint() call must be preceded by this exact comment:

# Save the progress by checkpointing the state. This is important for ensuring that the sync process can resume
# from the correct position in case of next sync or interruptions.
# Learn more about how and where to checkpoint by reading our best practices documentation
# (https://fivetran.com/docs/connectors/connector-sdk/best-practices#largedatasetrecommendation).
op.checkpoint({"s3_objects": {"last_modified": new_wm}})

Contributor left a comment

Please add this comment before checkpoint()

Comment on lines +7 to +17
import json # For reading configuration from JSON file
from datetime import datetime, timezone # For working with UTC timestamps
from typing import Any, Dict, List, Optional # Type hints

import boto3 # AWS SDK for Python to interact with S3
from botocore.config import Config as BotoConfig # For setting retry and timeout configs
from botocore.exceptions import ClientError # Exception handling for AWS responses

from fivetran_connector_sdk import Connector # Core SDK functionality
from fivetran_connector_sdk import Logging as log # For logging
from fivetran_connector_sdk import Operations as op # For data operations (upsert, checkpoint)
Contributor left a comment

Suggested change
import json # For reading configuration from JSON file
from datetime import datetime, timezone # For working with UTC timestamps
from typing import Any, Dict, List, Optional # Type hints
import boto3 # AWS SDK for Python to interact with S3
from botocore.config import Config as BotoConfig # For setting retry and timeout configs
from botocore.exceptions import ClientError # Exception handling for AWS responses
from fivetran_connector_sdk import Connector # Core SDK functionality
from fivetran_connector_sdk import Logging as log # For logging
from fivetran_connector_sdk import Operations as op # For data operations (upsert, checkpoint)
# For reading configuration from JSON file
import json
# For working with UTC timestamps
from datetime import datetime, timezone
# Type hints
from typing import Any, Dict, List, Optional
# AWS SDK for Python to interact with S3
import boto3
# For setting retry and timeout configs
from botocore.config import Config as BotoConfig
# Exception handling for AWS responses
from botocore.exceptions import ClientError
# Import required classes from fivetran_connector_sdk
from fivetran_connector_sdk import Connector
# For enabling Logs in your connector code
from fivetran_connector_sdk import Logging as log
# For supporting Data operations like Upsert(), Update(), Delete() and checkpoint()
from fivetran_connector_sdk import Operations as op

@fivetran-sahilkhirwal (Contributor) left a comment

Please check these review comments and re-request the review once resolved :)

Comment on lines +20 to +23
_GLACIER_CLASSES = {"GLACIER", "DEEP_ARCHIVE", "GLACIER_IR", "GLACIER_FLEXIBLE_RETRIEVAL"}

_MAX_RETRY_ATTEMPTS = 10 # Max retry attempts for AWS API
_CHECKPOINT_EVERY = 1000 # Frequency of checkpointing rows
Contributor left a comment

Suggested change
_GLACIER_CLASSES = {"GLACIER", "DEEP_ARCHIVE", "GLACIER_IR", "GLACIER_FLEXIBLE_RETRIEVAL"}
_MAX_RETRY_ATTEMPTS = 10 # Max retry attempts for AWS API
_CHECKPOINT_EVERY = 1000 # Frequency of checkpointing rows
__GLACIER_CLASSES = {"GLACIER", "DEEP_ARCHIVE", "GLACIER_IR", "GLACIER_FLEXIBLE_RETRIEVAL"}
__MAX_RETRY_ATTEMPTS = 10 # Max retry attempts for AWS API
__CHECKPOINT_EVERY = 1000 # Frequency of checkpointing rows

session = boto3.session.Session(
aws_access_key_id=configuration.get("aws_access_key_id"),
aws_secret_access_key=configuration.get("aws_secret_access_key"),
aws_session_token=configuration.get("aws_session_token", ""),
Contributor left a comment

Please check this. This key is missing in the configuration.json. Please add it there as well as in the readme :)

session = boto3.session.Session(
aws_access_key_id=configuration.get("aws_access_key_id"),
aws_secret_access_key=configuration.get("aws_secret_access_key"),
aws_session_token=configuration.get("aws_session_token", ""),
Contributor left a comment

Please remove the default to correctly send None for unavailable key values.
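The distinction the reviewer is pointing at: `dict.get(key, "")` masks an absent key with an empty string, while `dict.get(key)` preserves absence as `None`, which is what boto3 expects when no session token is supplied. A quick sketch (the configuration values are illustrative placeholders):

```python
configuration = {
    "aws_access_key_id": "<YOUR_AWS_ACCESS_KEY_ID>",
    "aws_region": "us-east-1",  # illustrative value
}

# With a default of "", an absent key silently becomes an empty string:
masked = configuration.get("aws_session_token", "")

# Without a default, absence is preserved as None, so boto3 sees
# "no session token supplied" rather than an empty token value:
preserved = configuration.get("aws_session_token")
```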

Comment on lines +64 to +73
def schema(_: dict) -> List[Dict[str, Any]]:
"""
Define the output schema for the connector.

Args:
_ (dict): Unused config.

Returns:
List[Dict[str, Any]]: Schema definition for s3_objects table.
"""
Contributor left a comment

Please replace the current docstring with this one; it is the standard docstring we use for the schema() method.

]


def iterate_objects(s3, bucket: str, prefix: str, page_size: int):
Contributor left a comment

This method's cognitive complexity is 22, above the required maximum of 15. Please break it into smaller methods for better readability and maintainability :)
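As an illustration of the kind of decomposition being asked for (names and fields are hypothetical, not the PR's actual code), the per-object mapping can be pulled out of the pagination loop so each function stays small:

```python
_GLACIER_CLASSES = frozenset({"GLACIER", "DEEP_ARCHIVE", "GLACIER_IR"})


def is_glacier_class(storage_class):
    """Small predicate: is this storage class a Glacier tier?"""
    return storage_class in _GLACIER_CLASSES


def build_row(obj):
    """Map one ListObjectsV2 'Contents' entry to an output row."""
    storage_class = obj.get("StorageClass", "STANDARD")
    return {
        "key": obj["Key"],
        "size_bytes": obj.get("Size"),
        "storage_class": storage_class,
        "is_glacier": is_glacier_class(storage_class),
    }


def iter_object_rows(pages):
    """Yield rows across pages; 'pages' is any iterable of ListObjectsV2-style
    response dicts, which keeps the function testable without a live S3 client."""
    for page in pages:
        for obj in page.get("Contents", []):
            yield build_row(obj)
```

With the enrichment in `build_row`, the top-level loop reduces to iteration plus watermark bookkeeping.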


new_wm = watermark
for row in iterate_objects(s3, bucket, prefix, page_size):
lm = row.get("last_modified")
Contributor left a comment

Please use descriptive variable names for better readability :)

if watermark and lm and lm < watermark:
continue

op.upsert("s3_objects", row)
Contributor left a comment

Suggested change
op.upsert("s3_objects", row)
# The 'upsert' operation is used to insert or update data in the destination table.
# The op.upsert method is called with two arguments:
# - The first argument is the name of the table to upsert the data into.
# - The second argument is a dictionary containing the data to be upserted,
op.upsert("s3_objects", row)

new_wm = watermark
for row in iterate_objects(s3, bucket, prefix, page_size):
lm = row.get("last_modified")
if watermark and lm and lm < watermark:
Contributor left a comment

Also, one question: we fetch all objects from the source and skip every record before the watermark. Can we fetch only the required data from the source instead of fetching everything?
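For context on that question: S3's ListObjectsV2 API has no server-side LastModified filter; it only supports `Prefix` and `StartAfter` (lexicographic on keys), so some client-side skipping is generally unavoidable. One partial optimization, valid only under the assumption that object keys embed the date (a hypothetical layout like `logs/2025-11-17/...`), is deriving `StartAfter` from the watermark:

```python
def start_after_for_watermark(prefix, watermark_iso):
    """Derive a ListObjectsV2 StartAfter value from an ISO-8601 watermark.

    Only valid when keys are laid out as '<prefix><YYYY-MM-DD>/...', so that
    lexicographic key order matches chronological order. Returns "" (no
    StartAfter) when there is no watermark yet.
    """
    if not watermark_iso:
        return ""
    # Take the date portion of the ISO timestamp to skip whole older days.
    return f"{prefix}{watermark_iso[:10]}"
```

When keys do not encode time, listing everything and filtering on LastModified, as the connector does, is the standard approach.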

new_wm = lm

if new_wm:
op.checkpoint({"s3_objects": {"last_modified": new_wm}})
Contributor left a comment

Please add this comment before checkpoint()

Comment on lines +191 to +198
# Required for SDK loader
connector = Connector(update=update, schema=schema)

# Entry point for local testing
if __name__ == "__main__":
with open("configuration.json", "r") as f:
configuration = json.load(f)
connector.debug(configuration=configuration)
Contributor left a comment

Please follow this. This is the standard for defining the main module in the connector SDK examples. The comments make it more readable for the users. please use this exact template :)

@fivetran-rishabhghosh fivetran-rishabhghosh removed their request for review December 22, 2025 07:34
@fivetran fivetran deleted a comment from cla-assistant bot Jan 2, 2026
cla-assistant bot commented Jan 2, 2026

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.


Surabhi Singh seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you have already a GitHub account, please add the email address used for this commit to your account.
You have signed the CLA already but the status is still pending? Let us recheck it.


Labels

  • hackathon: For all the PRs related to the internal Fivetran 2025 Connector SDK Hackathon.
  • size/L: PR size: Large
  • top-priority: A top priority PR for review
