Recovery from Disaster

How your services are recovered after a disastrous event.

Frends has built comprehensive disaster recovery capabilities into the platform to ensure your integrations keep running even when things go wrong. This page explains how we protect your data, what happens during a disaster, and what you can expect based on your service tier.

Recovery Objectives

When disaster strikes, two things matter most: how quickly we get you back online (RTO) and how much data you might lose (RPO).

Recovery Time: Enterprise and Infinite tiers have the shortest recovery time objectives, followed by Business tier, then Startup and Growth tiers. Basic tier operates on best-effort recovery. For catastrophic multi-region failures, recovery times are longer across all tiers. These objectives guide our planning and are not legal guarantees, though the SLA percentages for your tier still apply.

Data Protection: Across all tiers, we target minimal data loss for standard scenarios through continuous database transaction logs. Only catastrophic failures affecting multiple regions would result in significant data loss.

Availability: We target high availability for all tiers, with Enterprise and Infinite getting the highest uptime commitments. This translates to minimal downtime per year, which is why we invest heavily in redundancy and automated failover for premium tiers.
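To make the link between an uptime percentage and yearly downtime concrete, the sketch below converts one into the other. The percentages are purely illustrative and are not Frends SLA figures; check your agreement for the commitment that applies to your tier. The helper name annual_downtime_hours is hypothetical.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def annual_downtime_hours(availability_percent: float) -> float:
    """Return the downtime per year allowed by an availability target."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100")
    return (1 - availability_percent / 100) * HOURS_PER_YEAR


# Illustrative targets only (assumed values, not taken from any Frends SLA).
for pct in (99.0, 99.5, 99.9):
    print(f"{pct}% availability allows about {annual_downtime_hours(pct):.2f} hours of downtime per year")
```

Each additional "nine" cuts the allowed downtime by roughly a factor of ten, which is why higher tiers depend on redundancy and automated failover rather than manual recovery.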

Data Protection and Backups

The foundation of good disaster recovery is solid backups. We use Azure's enterprise-grade backup infrastructure with Point-in-Time Restore capabilities, meaning we can restore your database to any recent moment—like a time machine for your data. Backups happen automatically and continuously in the background.

All data is geo-redundant, stored in multiple Azure regions simultaneously. Your data exists in triplicate within your primary region and is continuously replicated to a secondary region. If an entire Azure region goes offline, your data remains safe and accessible.

Database Backups: The LogStore and ConfigurationStore have point-in-time restore and geo-redundant backups. The ConfigurationStore—which holds all your process definitions and settings—gets extra protection through off-site backups stored separately from the main Azure infrastructure. This is critical for surviving catastrophic failures and gives us the ability to rebuild your entire Frends instance from scratch.

We keep process definitions indefinitely, execution logs for a configurable period (typically up to two months based on your Agent Group log settings), and audit logs for at least a year. Process Instance step data is uploaded to Azure Blob Storage to minimize database storage requirements.

Task NuGet Package Backups: Task NuGet packages are stored in an internal repository backed by Azure Blob Storage with Read-Access Geo-Redundant Storage (RA-GRS), ensuring they are stored in triplicate and replicated to a secondary region. The ConfigurationStore database tracks these versions and is protected by continuous point-in-time restore capabilities and off-site backups, allowing for a complete rebuild of the task catalog even if the primary infrastructure is lost.

High Availability Architecture

The best disaster recovery is preventing disasters from affecting you in the first place. We've built high availability into the Frends architecture using Azure's Availability Zones—services spread across multiple physically separate data centers. If one data center fails, the others keep running.

Cloud Agents run on virtual machines across these zones with load balancers distributing traffic. On the database side, Azure SQL Failover Groups provide automatic failover to a secondary region. The web application runs in active-passive configuration with Azure Traffic Manager monitoring health and redirecting traffic as needed.
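The active-passive pattern can be pictured as priority-based endpoint selection: traffic goes to the highest-priority endpoint that currently passes its health check. The following is a simplified illustration of that idea, not the Azure Traffic Manager API. The Endpoint type, the endpoint names, and select_endpoint are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Endpoint:
    name: str       # hypothetical label for a deployment
    priority: int   # lower number = preferred endpoint
    healthy: bool   # result of the most recent health probe


def select_endpoint(endpoints: Sequence[Endpoint]) -> Optional[Endpoint]:
    """Pick the healthy endpoint with the lowest priority number, if any."""
    healthy = [e for e in endpoints if e.healthy]
    if not healthy:
        return None
    return min(healthy, key=lambda e: e.priority)


# Assumed example: the primary region fails its probe, so the passive one is chosen.
endpoints = [
    Endpoint("primary-region", priority=1, healthy=False),
    Endpoint("secondary-region", priority=2, healthy=True),
]
chosen = select_endpoint(endpoints)
print(chosen.name if chosen else "no healthy endpoint")
```

When the primary endpoint becomes healthy again, the same rule routes traffic back to it, which matches the passive role of the secondary deployment.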

For Enterprise and Infinite tier customers, Agent High Availability configuration is available. This requires multiple agents in the same group, a load balancer, and a shared database for synchronization. With this setup, all trigger types can be handled by any agent in the group. Without a shared database, some trigger types only run on the primary agent while others are distributed—a trade-off between setup complexity and full redundancy.

When Disaster Strikes

Despite all the preventive measures, disasters happen. We follow a three-phase response process that we test regularly.

Response: Our monitoring detects issues immediately and alerts on-call engineers. We assess the situation, initiate recovery, and start customer communication quickly. Critical issues get immediate attention, while less severe issues get progressively longer response windows.

Resumption: We recover services by activating failover systems, restarting agents, or restoring databases. Throughout this phase, we validate that recovered services work correctly and your integration processes execute as expected.

Restoration: Once stable in the recovery environment, we conduct root cause analysis and plan the rollback to primary infrastructure. Every incident ends with a post-incident review where we document lessons learned and update procedures.

Agent Recovery

Frends PaaS Agents: If a PaaS Agent machine is destroyed, it can be recreated and configured to use the existing Azure SQL database within about half an hour. PaaS Agents are designed to be transient and easily recreatable. If there are customizations like VPNs or certificates on the Agent machine, those need separate handling.

Self-Hosted Agents: Recovery depends on your backup configuration. Reinstalling an Agent on a new machine and synchronizing it will re-establish the same Agent from Frends' perspective. Any customizations like firewall settings or certificates must be reconfigured unless the machine is recovered from backup.

When Agents operate without database connectivity, they continue executing integrations based on their cache and attempt to reconnect. This state cannot be sustained for long, because restarting the Agent clears its cache. Without database access, Agents cannot execute Schedule or File Triggers.

Database Recovery

Configuration and Log Stores: The Configuration Store is the most critical database for Frends survival—it's backed up both in Azure and off-site. The Log Store is backed up in Azure. When these databases crash, Agents continue operating independently, though the Service Bus may fill up and stop processing messages while databases are down.

Agent Databases: Agent databases for PaaS Agents are backed up in Azure and can usually be recreated when needed. However, if an Agent database is recreated, File Triggers may re-execute if files haven't been moved from their directories. When Agent databases crash, Agents lose the ability to execute Schedule or File Triggers.
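One way to make File Trigger re-execution harmless is to move each file out of the watched directory as soon as it has been handled, so a later rescan finds nothing to repeat. The sketch below shows that general pattern in plain Python. It is not a Frends Task or Frends API; process_and_archive, the directory arguments, and the handle callback are hypothetical.

```python
import shutil
from pathlib import Path
from typing import Callable, List


def process_and_archive(
    source_dir: Path,
    archive_dir: Path,
    handle: Callable[[Path], None],
) -> List[str]:
    """Handle every file in source_dir, then move it to archive_dir.

    Returns the names of the files that were processed in this run.
    """
    archive_dir.mkdir(parents=True, exist_ok=True)
    processed = []
    for path in sorted(source_dir.iterdir()):
        if not path.is_file():
            continue  # skip subdirectories such as a nested archive folder
        handle(path)
        # Move only after handling succeeded, so failed files are retried later.
        shutil.move(str(path), str(archive_dir / path.name))
        processed.append(path.name)
    return processed
```

Running the function a second time against the same directory returns an empty list, because the handled files are no longer in the source directory.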

Service Bus Recovery

Even when the Service Bus connection is down, Agents continue executing integrations normally. However, remote Subprocesses cannot execute as they rely on Service Bus connectivity.

During Service Bus outages, Agents store log data locally until reconnection. Extended outages may cause Agents to run out of local storage space. Additionally, Agents cannot communicate with Frends Core while the Service Bus is down, meaning activity updates won't appear in the UI and Process deployments won't reach Agents until Service Bus is restored.
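The degraded-mode behaviour described in the Agent, Database, and Service Bus sections can be collected into a small lookup table, which may help when planning for dependencies. The grouping below is our reading of those sections, and the names OutageImpact, IMPACTS, and unavailable_during are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class OutageImpact:
    still_works: Tuple[str, ...]
    unavailable: Tuple[str, ...]
    risk: str


# Keys name the component that is down; values summarise the text above.
IMPACTS = {
    "agent_database": OutageImpact(
        still_works=("Executing integrations from the Agent cache",),
        unavailable=("Schedule Triggers", "File Triggers"),
        risk="Restarting the Agent clears its cache",
    ),
    "configuration_or_log_store": OutageImpact(
        still_works=("Agents operating independently",),
        unavailable=(),
        risk="Service Bus may fill up and stop processing messages",
    ),
    "service_bus": OutageImpact(
        still_works=("Executing integrations", "Local buffering of log data"),
        unavailable=(
            "Remote Subprocesses",
            "Activity updates in the UI",
            "Process deployments to Agents",
        ),
        risk="Agents may run out of local storage space",
    ),
}


def unavailable_during(*outages: str) -> List[str]:
    """List everything that stops working when the given components are down."""
    lost = set()
    for outage in outages:
        lost.update(IMPACTS[outage].unavailable)
    return sorted(lost)


print(unavailable_during("service_bus", "agent_database"))
```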

Task Recovery

In the event of a disaster or if you need to revert to a previous version of a Task, the locally stored NuGet package can be used for re-importing. Tasks can be managed and imported from the Administration > Tasks menu in the Frends UI.

In a disaster, Frends initiates a failover to the secondary region by synchronizing the geo-replicated storage and restoring the ConfigurationStore database from the latest point-in-time or off-site backup. This process recovers all process definitions and the metadata required to link them to their specific task versions in the recovered repository.

Agents then automatically pull the necessary task packages to resume operations, while any missing or custom packages can be manually re-imported through the administration interface if necessary.

Regional Outage

If an entire Azure region fails, recovery is largely automated. Databases fail over to the secondary region, Traffic Manager redirects web traffic, and our team verifies that everything works correctly before notifying customers. This typically takes several hours.

Catastrophic Failure

Complete platform failure requiring full restoration is extremely rare. In total destruction scenarios, the Frends Environment can be recreated from the Configuration Store database backup kept off-site. Recovery involves provisioning new infrastructure, restoring all databases, deploying applications, and validating everything, and typically takes more than a day.

Disaster Recovery Testing

We don't wait for disasters to test our recovery procedures. Regular testing includes monthly backup verification, quarterly database restoration to test environments, and annual full disaster recovery rehearsals where we recover the complete platform in a secondary region. After every incident, we conduct post-incident reviews to document lessons learned and continuously improve our procedures.

Your Responsibilities

Cloud Customers: We handle disaster recovery of the platform entirely. On your side, you should understand your tier's RTO and RPO, test your integrations after recovery events, document external dependencies, and provide timely incident feedback.

On-Premise/Hybrid: Responsibilities are shared. You handle infrastructure, database backups, network redundancy, and agent deployment. We provide configuration guidance, the Database Initializer tool, technical support during recovery, and assistance with process restoration. For hybrid configurations with Cloud and On-Prem Agents, we coordinate updates to synchronize your resources.

Getting Help

Cloud Customers: Check the Frends Status Page and contact Support through your standard channel. Critical incidents are automatically detected and escalated.

On-Premise Customers: Follow your organization's incident response procedures, then contact Frends Support for technical assistance.
