On-Premise Ingest Agent Documentation

minware offers an on-premise agent for data ingestion that runs inside your infrastructure. The on-premise agent is available with our enterprise-level plan.

The on-premise agent works by connecting to your source system APIs like GitHub and Jira (either cloud or on-premise enterprise), retrieving data, processing it, and then uploading it to a shared storage bucket. minware then reads the data from that bucket and proceeds with loading it into the system.

This page first offers guidance for deciding whether to use an on-premise agent, and then documents how to install and run it.

What Are the Security Benefits of an On-Premise Ingest Agent?

Most organizations opt to use the on-premise agent to reduce their security risk.

minware is SOC 2 compliant and holds a high standard for securing customer data.

However, the on-premise agent allows you to follow the principle of least privilege and only grant minware access to the minimum data it needs to run. Using an on-premise agent may also put minware in a lower risk category and reduce the burden of internal vendor review in your organization.

Here are the primary security benefits of using the on-premise agent for data ingest:

  • If you are ingesting data from another on-premise system (like GitHub Enterprise, GitLab Enterprise, or Jira Enterprise), then using an on-premise agent avoids having to create a firewall rule allowing access to those systems from an outside IP address.
  • There are some accessible API fields that minware does not ingest, or that minware can anonymize before storing. An on-premise agent limits minware’s read access to not include unused fields or the original values of anonymized fields.
  • The agent itself has a small code base with minimal dependencies. You vet and control updates to the agent. You can also run it in an environment with firewall rules that prevent the agent from accessing anything other than your source systems and the output storage bucket. This minimizes security risk.
  • The agent can output data to any storage bucket, so you can review the raw data that minware will receive in a private staging bucket before sharing it with minware if you’d like. This lets you double-check that the contents contain no sensitive information.

What Are the Downsides of an On-Premise Ingest Agent?

The downside of using an on-premise agent is that you are responsible for running it on a regular schedule in your infrastructure. This means that you have to provision resources to execute the agent, and monitor them to ensure that they are running properly.

Additionally, if you have a large amount of data, you may need to execute multiple agents with load balancing parameters to ensure that the ingest process completes in a reasonable amount of time.

Finally, though the agent performs a simple task and has been tested thoroughly, errors can still occur, including response failures from your source systems. minware normally handles errors like these behind the scenes, but with the on-premise agent, you may need to share log files if failures occur.

The on-premise agent is only available in our enterprise-level plan.

Installing the On-Premise Agent

Installing Docker

The on-premise agent is distributed as a Docker container, so you must install Docker before running the agent. See the Docker Installation Guide for more information about how to install Docker on your system.

Installing AWS CLI

You will also need the AWS CLI to authenticate to AWS and download the Docker images. Please see the AWS CLI documentation for details on how to install the CLI and authenticate to your AWS account.
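For example, once the CLI is installed, you can configure credentials and confirm which identity you are using. The exact authentication flow depends on your organization (e.g., SSO or access keys), so treat this as a sketch:

```shell
# One-time setup sketch: configure credentials, then print the caller
# identity. The ARN in the output is what you will share with minware
# (see "Obtaining Access" below).
aws configure                  # prompts for access key, secret key, region
aws sts get-caller-identity    # prints the account, user ID, and ARN
```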

Resource Requirements

The on-premise agent only needs a modest amount of memory and storage to run since the main bottleneck is usually the source system (i.e., Jira, GitHub). We recommend allocating 1 CPU, 2 GB of memory, and 20 GB of disk space to each instance of the on-premise agent.

Network Access Requirements

From a network access perspective, the on-premise agent only needs to connect to your source system to load data, and to AWS S3 to upload data to the output bucket.

Obtaining Access

To install and use the on-premise agent, the easiest method is to send the AWS IAM ARN of an existing user or role in your AWS account to your enterprise account manager and indicate which sources you want to load. We will then grant your account access to the on-premise agent docker images and output bucket.

If you do not have an AWS account, then please contact your account manager to discuss other access options.

Installing the Agent

Before pulling the image, you need to authenticate with the ECR repository that hosts the on-premise agent Docker images. Using the AWS CLI, log in as the user or role whose IAM ARN you previously provided, and then run the following command to load credentials into Docker for pulling the on-premise agent image:

aws ecr get-login-password --region us-east-1 | docker login --username AWS \
    --password-stdin 274599735868.dkr.ecr.us-east-1.amazonaws.com

You can then install the agent by pulling the Docker image. Please use one of the following commands to pull the appropriate agent for the system from which you’d like to load data:

  • GitHub - docker pull "274599735868.dkr.ecr.us-east-1.amazonaws.com/on-premise-agent:github-latest"
  • Jira - docker pull "274599735868.dkr.ecr.us-east-1.amazonaws.com/on-premise-agent:jira-latest"

Running the On-Premise Agent

S3 Bucket Access

To run the on-premise agent, you must specify an output S3 bucket name and AWS credentials.

After you provide your AWS IAM ARN, you will receive the names of the buckets for ingesting Jira and/or GitHub data, which your user account will be able to access.

Your output bucket will be configured to require encryption both in transit (using a minimum of TLS 1.2) and at rest (using the SYMMETRIC_DEFAULT key spec, which uses the AES-256-GCM algorithm). The at-rest encryption key will be unique to your organization and only you and minware will have access to the bucket.

When to Run the Agent

We recommend running the on-premise ingest agent at least once per day after the end of the regular work day for the time zone where most of your employees reside. This will ensure that work from the previous day will show up in minware the next day.

We recommend scheduling the agent to run using a regular daily cron job.
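As a sketch, a crontab entry like the following would run the Jira agent every night at 1:30 AM local time. The schedule, log path, and argument values are placeholders to adapt to your environment:

```shell
# m h dom mon dow  command
30 1 * * * docker run 274599735868.dkr.ecr.us-east-1.amazonaws.com/on-premise-agent:jira-latest --aws_access_key_id=... --aws_secret_access_key=... --s3_path=... --jira_username=... --jira_base_url=... --jira_password=... --command=all >> /var/log/minware-agent.log 2>&1
```

In practice, you would read credentials from a secrets manager or an environment file rather than embedding them directly in the crontab.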

If you decide to run the agent more frequently and would like your data to be updated in minware more than once per day, please contact your account manager to discuss your requirements.

Monitoring Agent Success

When you run the agent, it will output information about its progress and any errors to standard error.

When the agent completes, the process exit code will be 0 if it is successful, or a non-zero value if it fails. If it fails, error details will be included in the standard error output.

You should monitor the return code of the agent process with your cron system and review the error log if the return code is non-zero.
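A wrapper along these lines can capture the error log and surface a non-zero exit code to cron. This is a minimal sketch; the log directory and the agent command are placeholders, not part of the agent itself:

```shell
#!/bin/sh
# Minimal monitoring-wrapper sketch (paths and commands are placeholders).
# It runs the agent command passed as arguments, writes standard error to
# a dated log file, and propagates the exit code so your cron/alerting
# system can detect failures.
run_agent() {
  log_dir="$1"; shift
  mkdir -p "$log_dir"
  log_file="$log_dir/agent-$(date +%Y%m%d-%H%M%S).log"
  if "$@" 2>"$log_file"; then
    echo "agent succeeded"
  else
    status=$?
    echo "agent failed with exit code $status; see $log_file" >&2
    return "$status"
  fi
}

# Example (substitute the full docker run command for your source):
# run_agent /var/log/minware docker run 274599735868.dkr.ecr.us-east-1.amazonaws.com/on-premise-agent:jira-latest ... --command=all
```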

S3 Access Parameters

The agent commands listed in the sections that follow take, for all sources, the following common parameters to specify how to access the S3 bucket:

  • --aws_access_key_id (required) - The ID of the access key for your AWS identity
  • --aws_secret_access_key (required) - The secret access key for your AWS identity
  • --aws_session_token (optional) - If your identity requires a session token (e.g., for temporary credentials or multi-factor authentication), you can provide it with this parameter
  • --s3_path (required) - The name of the output bucket provided by minware (without any leading s3://), e.g., org-minware-jira-1f87b26477684a21

GitHub

GitHub Connection Parameters

All the commands to run the GitHub agent require the following parameters to connect to your GitHub instance:

  • --github_base_url - The API URL of your GitHub instance, including https:// and the trailing slash. To run against GitHub cloud, use https://api.github.com/.
  • --github_access_token - The access token to authenticate to GitHub. Follow these instructions to create a personal access token. If you are using fine-grained personal access tokens, you will need read-only access to: Actions, Contents, Deployments, Issues, Metadata, Projects (if enabled) and Pull Requests. For a classic access token, you will need repo and read:project.

Listing GitHub Repositories

The first step in running the GitHub agent is to execute in discovery mode to list repositories from your data source. This lets you verify the source connection and obtain a list of repositories that you can include or exclude in the data ingestion step.

For the GitHub on-premise agent, the command to list available repositories is as follows:

docker run \
  274599735868.dkr.ecr.us-east-1.amazonaws.com/on-premise-agent:github-latest \
  --github_base_url=<base URL parameter> \
  --github_access_token=<access token parameter> \
  --command=find_repos

If this command succeeds, the last line of output will be the list of available repositories, which will look like:

INFO Found repositories for include/exclude argument:
  orgname/repoA,orgname/repoB

You can then optionally provide some of these repositories to the ingest run command to include or exclude specific repositories.
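If you script the ingest, the repository list can be pulled out of saved find_repos output. This small helper is a sketch that assumes the list is the last line of output, as shown above:

```shell
# Sketch: extract the comma-separated repository list from saved
# find_repos output (assumes the list is the last line, as shown above).
extract_repos() {
  tail -n 1 "$1" | tr -d '[:space:]'
}

# Example:
#   docker run ...:github-latest ... --command=find_repos > find_repos.out 2>&1
#   extract_repos find_repos.out   # -> orgname/repoA,orgname/repoB
```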

Ingesting GitHub Data

To begin ingesting GitHub data, you can run the agent with the following command:

docker run \
  274599735868.dkr.ecr.us-east-1.amazonaws.com/on-premise-agent:github-latest \
  --aws_access_key_id=<access key parameter> \
  --aws_secret_access_key=<secret key parameter> \
  --s3_path=<s3 bucket name> \
  --github_base_url=<base URL parameter> \
  --github_access_token=<access token parameter> \
  --command=all

This will begin ingesting GitHub data for all of your repositories and save it in the provided S3 bucket.

Include/Exclude Repository Parameters

The GitHub data ingest command can also be run with --command=exclude or --command=include and the additional --github_repos parameter to exclude or only include certain repositories, like:

docker run \
  274599735868.dkr.ecr.us-east-1.amazonaws.com/on-premise-agent:github-latest \
  --aws_access_key_id=<access key parameter> \
  --aws_secret_access_key=<secret key parameter> \
  --s3_path=<s3 bucket name> \
  --github_base_url=<base URL parameter> \
  --github_access_token=<access token parameter> \
  --command=exclude \
  --github_repos=orgname/repoA,orgname/repoB

The --github_repos parameter should contain a comma-separated list of repository names from the previous find_repos command output that you want to include or exclude.

Alternatively, you can select repositories to include when creating a fine-grained access token and just use --command=all.

What GitHub Data is Ingested

The GitHub on-premise agent will ingest the following types of data:

  • Pull requests and reviews, including events related to pull requests
  • GitHub issues and projects, if you are using them for project management
  • All commit metadata, including commit messages, authors, times, SHAs, line added/removed counts, and events related to commits like pushes
  • Branch metadata
  • Team information
  • GitHub action, deployment, and job data

The GitHub agent does not read the contents of any commits (i.e., source code).

You can inspect the output bucket contents for a full list of all the schemas and fields ingested.

Jira

Jira Connection Parameters

All the commands to run the Jira agent require the following parameters to connect to your Jira instance:

  • --jira_base_url - The base URL of your Jira instance, including https:// and the trailing slash, e.g., https://orgname.atlassian.net/.
  • --jira_username - The email address of the user whose access token or password you will be using to connect to the Jira instance
  • --jira_password - An access token or password for the user. Follow these instructions to create an access token for a cloud instance. For a self-hosted instance, set this parameter to the user’s password. (Note: if you’d prefer to use OAuth with a self-hosted instance, please contact minware support for assistance.)

Listing Jira Projects

The first step in running the Jira agent is to execute in discovery mode to list projects from your data source. This lets you verify the source connection and obtain a list of projects that you can include or exclude in the data ingestion step.

For the Jira on-premise agent, the command to list available projects is as follows:

docker run \
  274599735868.dkr.ecr.us-east-1.amazonaws.com/on-premise-agent:jira-latest \
  --jira_username=<username parameter> \
  --jira_base_url=<url parameter> \
  --jira_password=<password parameter> \
  --command=find_projects

If this command succeeds, the last line of output will be the list of available project keys, which will look like:

INFO Found projects for include/exclude argument: ABC,XYZ

You can then optionally provide some of these projects to the ingest run command to include or exclude specific projects.

Ingesting Jira Data

To begin ingesting Jira data, you can run the agent with the following command:

docker run \
  274599735868.dkr.ecr.us-east-1.amazonaws.com/on-premise-agent:jira-latest \
  --aws_access_key_id=<access key parameter> \
  --aws_secret_access_key=<secret key parameter> \
  --s3_path=<s3 bucket name> \
  --jira_username=<username parameter> \
  --jira_base_url=<url parameter> \
  --jira_password=<password parameter> \
  --command=all

This will begin ingesting Jira data for all of your projects and save it in the provided S3 bucket.

Include/Exclude Project Parameters

The Jira ingest command can also be run with --command=exclude or --command=include and the additional --jira_projects parameter to exclude or only include certain projects, like:

docker run \
  274599735868.dkr.ecr.us-east-1.amazonaws.com/on-premise-agent:jira-latest \
  --aws_access_key_id=<access key parameter> \
  --aws_secret_access_key=<secret key parameter> \
  --s3_path=<s3 bucket name> \
  --jira_username=<username parameter> \
  --jira_base_url=<url parameter> \
  --jira_password=<password parameter> \
  --command=exclude \
  --jira_projects=XYZ,ABC

The --jira_projects parameter should contain a comma-separated list of project keys from the previous find_projects command output that you want to include or exclude.

What Jira Data is Ingested

The Jira on-premise agent will ingest the following types of data:

  • All issues and fields (including custom fields)
  • Changelogs for all issues and fields
  • Worklogs
  • Boards
  • Sprints
  • Projects
  • Users
  • Versions

You can inspect the output bucket contents for a full list of all the schemas and fields ingested.

Viewing Output

If you’d like to review the output of the on-premise agent before running it with the bucket provided by minware, you can create and target your own S3 bucket for output. To do this, set the --s3_path parameter to the name of another bucket you control.

The bucket contents are as follows:

  • /state.json - A state file that contains bookmarks for incremental ingest.
  • /output/YY/MM/DD/<schema>/HHMMSS_partXXXXX.jsonl - Files with data for a particular schema and run at a given time.

Files with the same combined YY-MM-DD HH:MM:SS timestamp are part of the same run, though one run may output file sets with multiple timestamps as the run progresses, which allows long runs to resume where they left off if there is a failure.

Each data file contains newline-delimited JSON, with one JSON object per line. The schema lines at the start of the file describe the shape of the data and all fields that can be included, in JSON Schema format. Each subsequent record line contains the JSON for an individual record matching that schema.

The last line of the final part file contains a state line describing the incremental ingest state of the original source loader (Jira/GitHub). It is included to indicate that the file set is complete and ingest did not fail part way through.
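As an illustration, a check along these lines can confirm that a downloaded part file ends with a state line. The "state" marker used here is an assumption about the line's contents; verify it against your actual output files before relying on it:

```shell
# Sketch: after downloading a day's output (e.g., with "aws s3 sync"),
# check whether the final part file ends with a state line. The '"state"'
# marker is an assumption -- verify it against your actual files.
is_run_complete() {
  tail -n 1 "$1" | grep -qi '"state"' && echo complete || echo incomplete
}

# Example usage:
#   aws s3 sync "s3://$BUCKET/output/24/05/17/" ./day
#   is_run_complete "$(ls ./day/*/*.jsonl | sort | tail -n 1)"
```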

If there is ever a failure, minware will correctly handle partial data and you should never need to manually change the files in the bucket.

Anonymizing Fields

If particular fields in your data contain sensitive information, it may be possible to omit those fields entirely or anonymize them using a hash function in the on-premise agent, depending on how they are used. Please contact your enterprise account manager for guidance if you have any anonymization requirements.

State and Reset State Parameter

When you run the agent, it will pick up where it left off last time by reading a state file from the provided bucket. This state file will be added to the bucket with information about the last successfully ingested data. Subsequent runs will read from this state file to avoid redoing work.

If you ever need to ignore this state and redo the ingest from the beginning of time, you can provide the --reset_state parameter by setting it to true:

  --reset_state=true

After the job runs with this parameter, it will upload a new state so that you can resume incremental ingest in subsequent runs.

You should not use this parameter unless instructed to do so by minware to perform a complete data reload, which won’t be necessary under normal circumstances.

Open-Source Notices

The on-premise agent code is written in Python and JavaScript, so all of the source code, including license notices, can be read from the Docker image.

The only applications in the container that include modified “copyleft”-licensed source code are listed below. Both are licensed under the GNU Affero General Public License v3.0. Here are links to the forked repositories on GitHub containing the modified source code:

  • tap-github - Python package for loading data from the GitHub API
  • tap-jira - Python package for loading data from the Jira API