Maintenance & Support

Frends services are continuously monitored and maintained.

Frends Cloud services are monitored and managed by the Platform Operations team. This page gives an overview of what is done and how often.

Maintenance

Cloud-hosted Frends resources are updated in line with our release cycle.

For customers with hybrid configurations (Cloud and On-Prem Agents), we arrange a suitable date and time to update the Frends Core service and Agents, so that the On-Prem resources are updated in sync.

OS and Security updates for Cloud Agents are kept in line with Microsoft’s monthly patching cycle.

To avoid Agent downtime during patching, we recommend using load-balanced Agents in a high-availability configuration for production environments.

Database Maintenance

Log and configuration databases for Frends Tenants are periodically maintained by purging old Process Instance logs, including Instance header data and promoted values, and by reorganizing the indices. This is automated for both cloud and self-hosted instances of Frends. Note that Process Instance step data is uploaded to Azure Blob Storage, so less data needs to be stored in each Agent's database.

By default, the Process Instance purge and index reorganization run on Log Service startup and are rescheduled to run every 24 hours after finishing successfully. The purge deletes Process Instances older than the retention period configured in the log settings of each Agent Group, which can be up to 60 days.
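The timing rules above can be illustrated with a small sketch. This is not Frends code: the function names are hypothetical, and which timestamp of a Process Instance is compared against the cutoff is not stated here, so the sketch only computes the cutoff and the next scheduled run.

```python
from datetime import datetime, timedelta, timezone

MAX_RETENTION_DAYS = 60               # upper bound mentioned in the text
RERUN_INTERVAL = timedelta(hours=24)  # rescheduling interval after a successful run


def purge_cutoff(now: datetime, retention_days: int) -> datetime:
    """Return the moment before which Process Instances count as expired.

    Rejecting values below 1 is an assumption of this sketch.
    """
    if not 1 <= retention_days <= MAX_RETENTION_DAYS:
        raise ValueError(f"retention_days must be between 1 and {MAX_RETENTION_DAYS}")
    return now - timedelta(days=retention_days)


def next_run(finished_at: datetime) -> datetime:
    """Return when maintenance would run again after finishing successfully."""
    return finished_at + RERUN_INTERVAL


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    print("Purge instances older than:", purge_cutoff(now, 30).isoformat())
    print("Next maintenance run:", next_run(now).isoformat())
```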

The purge is done by executing the stored procedure PurgeProcessHistory. The procedure has a 30-minute timeout. If it cannot finish in time or an error occurs, the execution is retried after 30 minutes, up to five times by default.
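The retry behaviour described above could be modelled roughly as follows. This is a minimal sketch, not the Log Service implementation: `run_with_retries` and its parameters are hypothetical, the `purge` callable stands in for executing PurgeProcessHistory and is expected to raise on timeout or error, and "up to five times" is read here as one initial attempt plus five retries.

```python
import time
from typing import Callable


def run_with_retries(
    purge: Callable[[], None],
    retry_count: int = 5,                   # default number of retries from the text
    retry_delay_seconds: float = 30 * 60,   # wait 30 minutes between attempts
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Run purge(); on failure wait and retry. Return True on success."""
    for attempt in range(retry_count + 1):  # initial attempt + retries (assumption)
        try:
            purge()
            return True
        except Exception:
            # A timeout or any other error counts as a failed attempt.
            if attempt < retry_count:
                sleep(retry_delay_seconds)
    return False
```

Passing `sleep` as a parameter keeps the sketch easy to exercise without actually waiting 30 minutes between attempts.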

You can configure the maintenance actions with the following optional settings in deploymentSettings.json when installing an Agent. These settings should be put directly under the root settings node:

  • maintenanceTimeWindowStart - string with a format of "[hour]:[minute]:[second]", e.g. 00:30:00 for half past midnight

  • maintenanceRetryCount - number, how many times a failed maintenance run is retried

  • disableDatabaseMaintenance - boolean, set to true if you have set up your own scheduled cleanup and maintenance procedures.
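As an illustration, the three settings listed above could be assembled and checked before being merged into deploymentSettings.json. This is a sketch under assumptions: the helper `maintenance_settings` is hypothetical, the strict two-digit time check is a choice made for this example, and where exactly the keys sit in your file should follow the note about the root settings node above.

```python
import json
import re
from typing import Optional

# Two-digit "HH:MM:SS"; this strictness is an assumption of the sketch.
TIME_PATTERN = re.compile(r"^([01]\d|2[0-3]):[0-5]\d:[0-5]\d$")


def maintenance_settings(
    window_start: Optional[str] = None,
    retry_count: Optional[int] = None,
    disable: Optional[bool] = None,
) -> dict:
    """Build a dict holding only the optional maintenance settings that were given."""
    settings: dict = {}
    if window_start is not None:
        if not TIME_PATTERN.match(window_start):
            raise ValueError("window_start must look like 00:30:00")
        settings["maintenanceTimeWindowStart"] = window_start
    if retry_count is not None:
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(retry_count, bool) or not isinstance(retry_count, int) or retry_count < 0:
            raise ValueError("retry_count must be a non-negative integer")
        settings["maintenanceRetryCount"] = retry_count
    if disable is not None:
        settings["disableDatabaseMaintenance"] = bool(disable)
    return settings


if __name__ == "__main__":
    # Print a JSON fragment that can be copied into the settings file.
    print(json.dumps(maintenance_settings("00:30:00", 5, False), indent=2))
```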

Backups

Frends Databases (LogStore, ConfigurationStore) have point-in-time restore and geo-redundant backups with 30-day retention. ConfigurationStore has an additional off-site backup with 7-day retention.

Recovery

Recovery in case of catastrophic failure depends partially on your deployment setup.

Recovery of Frends PaaS Agents

If a Frends PaaS Agent machine is destroyed, it can be recreated and configured to use the existing Azure SQL database that PaaS Agents use by default. This usually takes less than half an hour. Any customizations to the Agent machine in Frends Cloud, such as VPNs or certificates, need to be handled separately. PaaS Agents are meant to be kept as transient as possible so that they are easy to recreate.

Recovery for self-hosted Agents

Recovery of self-hosted Agents depends largely on the administrators, who are responsible for configuring backups and performing the recovery. However, reinstalling an Agent on a new machine and synchronizing it will re-establish the same Agent from the point of view of Frends. Any customizations, such as firewall settings or certificates, will have to be reconfigured unless the machine is recovered from a backup.

Recovery for databases

The Configuration store is the most important database for the survival of the Frends installation; it is backed up both in Azure and off-site. The Log store is backed up in Azure.

If the Frends Configuration store or Log store databases, or the core services, crash, the Agents continue to operate. However, if the Service Bus is still operational, it may fill up and stop operating, because messages are not processed while the databases are down.

The Agent databases can usually be recreated at will, but the databases for PaaS Agents are backed up in Azure. If an Agent database is recreated, File Triggers may be re-executed for files that have not been moved out of their directories. Leaving processed files in place should be avoided in any case.

If an Agent database crashes, the Agent cannot execute Schedule or File Triggers. If the connection to the database is lost, the Agent continues to execute integrations based on its cache and tries to reconnect to the database. This state cannot be sustained for long; for example, restarting the Agent clears its cache.

Recovery for Service Bus

Even if the Service Bus connection is down, the Agents will continue to execute integrations as usual, but any remote Subprocesses cannot be executed as they rely on the Service Bus connection.

The Agent will start storing log data locally until it can reconnect to the Service Bus. If the connection is down for a longer period of time, the Agent may run out of space for the log data and stop functioning.

Agents also cannot communicate with the Frends Core while the Service Bus is down. As a result, activity may not be updated in the Frends UI, and Process deployments to Agents will not work until the Service Bus is restored.

Major catastrophic failures

In a total destruction scenario, the Frends Environment can be recreated from the Configuration store database backup that is kept off-site.

Monitoring

All Frends Cloud resources are monitored around the clock with alert thresholds set for key metrics.

HTTPS Endpoints for Monitoring

Each Agent can have HTTPS status and monitoring endpoints enabled, from which the status and metrics of the Agent can be read. The endpoints are enabled by configuring HTTPS for the Agent, or by setting the HttpStatusinfoPort option in the Agent application settings. API key authentication for the endpoints can be configured with the HealthCheckApiKey option. If it is enabled, the API key must be provided in an HTTP request header named health-check-api-key.

Relative to your Agent's base URL, the /frendsstatusinfo endpoint can be used to check that the Agent is running and not paused. The response also reports Agent health, executing processes, and memory usage. The information is provided as JSON, so it can be used for automated monitoring.

For additional Agent metric details, the /metrics endpoint can be used, which similarly allows scraping by third-party tools.
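A monitoring script could call the two endpoints roughly as shown below. This is a minimal sketch using only the Python standard library. The helper names and the example base URL are hypothetical, the fields inside the JSON response are not assumed, and the /metrics response is returned as raw text because its format is not described here.

```python
import json
import urllib.request
from typing import Any, Optional


def build_request(agent_base_url: str, path: str,
                  api_key: Optional[str] = None) -> urllib.request.Request:
    """Create a GET request for an Agent endpoint such as /frendsstatusinfo."""
    url = agent_base_url.rstrip("/") + path
    headers = {}
    if api_key:
        # Header name from the text above; HTTP header names are case-insensitive.
        headers["health-check-api-key"] = api_key
    return urllib.request.Request(url, headers=headers, method="GET")


def fetch_status(agent_base_url: str, api_key: Optional[str] = None,
                 timeout: float = 10.0) -> Any:
    """Return the parsed JSON body of /frendsstatusinfo."""
    request = build_request(agent_base_url, "/frendsstatusinfo", api_key)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def fetch_metrics_text(agent_base_url: str, api_key: Optional[str] = None,
                       timeout: float = 10.0) -> str:
    """Return the raw body of /metrics without interpreting its format."""
    request = build_request(agent_base_url, "/metrics", api_key)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

For example, `fetch_status("https://agent.example.com", api_key="...")` would return whatever JSON document the Agent reports, where agent.example.com is a placeholder for your Agent's base URL. A failed connection or an HTTP error status raises an exception from `urlopen`, which a monitoring job can treat as the Agent being unavailable.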

Self-hosted Agents

For self-hosted Agents, monitoring is provided by the above-mentioned endpoints, as well as by any monitoring tools available in the Agent machine's operating system. For example, Windows Performance Counters can be used to retrieve Agent status information.

PaaS Agents

Frends Cloud Agent monitoring services include:

Frends Heartbeat Monitor Service

  • Agent Service Availability

Azure Log Analytics Agent (30 Day retention)

  • Event logs (System/Application)

    • Error Events (OOM)

    • Service Failures

  • Performance Counters (Ingested by Azure Log Analytics)

    • Memory

    • CPU

    • Page File

    • Threads

    • Disk

    • Frends Process Executions

    • Frends Process Logging

    • Network IO

Application Insights

  • Availability Checks

    • Monitor Agent API Endpoints (Ports 80 and/or 443)

    • 5 Second ping from 5 geographically different locations

Frends Core & UI

Monitoring for the Frends Core services and the UI includes:

Web UI

  • Performance Counters (Ingested by Azure Log Analytics)

    • Memory

    • CPU

    • Response Time

  • Application Insights (Ingested by Azure Log Analytics)

    • Application Exceptions

    • Availability Checks

      • Monitor Frends UI

      • 5 Second ping from 5 geographically different locations

Azure Storage

  • Performance Counters (Ingested by Azure Log Analytics)

    • Usage

    • Latency

    • Ingress/Egress

    • Transactions

    • Availability

Database

  • Performance Counters (Ingested by Azure Log Analytics)

    • Storage Usage & Growth

    • DTU Load

Service Bus

  • Performance Counters (Ingested by Azure Monitor)

    • Inbound/Outbound Messages

    • Requests

    • Connections

    • Active Messages

    • Inbound/Outbound Queue Size

    • Errors (User/Server)

    • Dead Letter Count

    • Message Count

    • Throttled Requests
