Disaster recovery planning for Work Folders storage at Microsoft

Unlike a traditional file server, where data is stored only on the server, Work Folders is a sync technology and keeps data in multiple locations. Each sync client (e.g. a user’s Windows or iOS device) and the sync server (i.e. a Windows Server 2012 R2 file server) holds a local copy of the same data for a given user. With Work Folders, users can always access their data on their devices, even when the Work Folders server is down. This is very different from a traditional file share, where user access depends on the availability of the file server. What a Work Folders server failure does cause is that user data stops syncing across devices, because device sync depends on the server to mediate the process.

In general, there are two methods for data recovery. You can either use the client data to repopulate the server copy, or you can replicate the server data to another server at another site to pre-stage the file set.

Recovery using client data

If users in your environment typically have a single device, and devices are always connected to the file server over high-speed connections, you may evaluate recovering from the client data copy alone. With this approach, after a primary server failure you configure a replacement server, and the clients re-upload their data once it is up and running. The advantage is that you don’t need a standby server and storage. The downside is that the network can become congested when the replacement server comes online and all the clients try to upload their data at once. Carefully evaluate dataset size against network bandwidth before choosing this option.

Export Work Folders server configuration

To have clients automatically sync with the replacement server, you can give the replacement server the same name as the original. For the certificate to work on the replacement server (when you reuse the server name), export the certificate you acquired for the original Work Folders server together with its private key, then install the resulting .pfx file on the new server.
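For the client-only recovery option, the dataset size versus network bandwidth question can be approximated with simple arithmetic. The sketch below is only a rough estimate: the function name, the `utilization` factor (the share of the link you assume sync traffic can actually use), and the sample numbers are all illustrative assumptions, not measurements or Work Folders settings.

```python
def estimate_reupload_hours(total_data_gb: float,
                            link_mbps: float,
                            utilization: float = 0.5) -> float:
    """Rough time, in hours, for all clients to re-upload their data.

    total_data_gb: combined size of all users' data, in decimal gigabytes.
    link_mbps:     bandwidth of the link into the replacement server, in Mbit/s.
    utilization:   assumed fraction of that link available to sync traffic.
    """
    if total_data_gb < 0 or link_mbps <= 0 or not 0 < utilization <= 1:
        raise ValueError("sizes must be non-negative, bandwidth and utilization positive")
    total_megabits = total_data_gb * 1000 * 8   # GB -> MB -> megabits
    effective_mbps = link_mbps * utilization    # bandwidth sync can really use
    return total_megabits / effective_mbps / 3600


# Hypothetical example: 500 users with 2 GB each, a 1 Gbit/s link at 50% utilization.
hours = estimate_reupload_hours(total_data_gb=500 * 2, link_mbps=1000, utilization=0.5)
print(f"Estimated re-upload time: {hours:.1f} hours")
```

If the estimate runs into days rather than hours, that is a signal to consider one of the server-side recovery options instead.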

For recovery using server data, the goal is that both the file data and the metadata database are present on the replacement server, and that they are in a consistent state. It’s important to ensure the database and the file data are consistent with each other; otherwise data loss may occur. Taking a VSS snapshot with the Work Folders VSS writer, for example, ensures that the database and the file set are in a consistent state. A plain file copy of the file data and the metadata database does not guarantee consistency.

Using BCDR (business continuity and disaster recovery) products: many storage products offer block-level replication. If your data is stored on these products, you may consider leveraging them to build the DR plan. These products need to guarantee the data replication order and data consistency at a given point in time.

Scheduled file replication with Windows applications (e.g. Robocopy or DFSR): make sure only the file data is replicated, and not the metadata database. After failover, the client and server will reconcile the data and create the metadata database on the replacement server.

I won’t go into details for the BCDR option, as the configuration depends on the BCDR product. Follow the guidance for your BCDR product, and make sure the file data and the metadata database are in the same replication group, so that IO ordering is maintained across both data sets and the product can guarantee consistency between the file data and the metadata database at any given point in time.

Using VSS backup and restore

Backing up and restoring through VSS with the Work Folders VSS writer keeps the file set and the metadata database on the server consistent with each other. How you configure the backup to use the Work Folders VSS writer depends on your VSS backup application.

Upon restore, both the file data and the metadata are restored to the replacement server. Restoring data to the server this way is called a “non-authoritative” restore. That means the files restored on the server can be overwritten by the client if newer changes were made on the client.

(Note: in contrast, there is another mode of restore called “authoritative restore”, performed by copying and pasting files from the backup. The restored files are treated as a newer version and overwrite the client copy through sync, even when the client actually has a newer copy than the restored file. This approach can be used to recover individual corrupted files when necessary, but it is not the focus of this blog.)
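The difference between the two restore modes can be illustrated with a toy model. This is purely a conceptual sketch: the integer version numbers and the `winning_copy` function are hypothetical and do not represent how the Work Folders sync engine or its metadata database actually tracks changes.

```python
def winning_copy(client_version: int, restored_version: int, authoritative: bool) -> str:
    """Toy model of which copy of a file survives the first sync after a restore.

    Versions are hypothetical counters; a higher number means a newer change.
    """
    if authoritative:
        # The restored file is treated as the newest version, so it
        # overwrites the client copy even if the client was newer.
        return "server"
    # Non-authoritative: ordinary version comparison decides.
    if client_version > restored_version:
        return "client"
    if client_version < restored_version:
        return "server"
    return "in sync"


# Client edited the file after the backup was taken (version 7 vs. restored 5).
print(winning_copy(7, 5, authoritative=False))  # client change is kept
print(winning_copy(7, 5, authoritative=True))   # restored file overwrites it
```

The second call shows why an authoritative restore is best reserved for individual corrupted files: it discards newer client changes by design.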

With a non-authoritative restore, when the client comes to sync, the sync engine can compare the file versions on the client (current) with those on the server (restored to a past point in time), and sync the changes between the client and the server.

File replication with Robocopy or DFSR

You must exclude the sync share state folder on the server (both the staging files and the metadata database) from file replication. After failover, the metadata database must not be present before the syncsharesvc service starts. Its absence triggers data reconciliation, which may have a performance impact on the server if the file set is large.
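As an illustration of the exclusion, here is a minimal sketch that assembles a Robocopy command line which mirrors a sync share while skipping its state folder. All paths, the server name, and the `build_robocopy_command` helper are hypothetical; confirm the actual location of the sync share state folder on your own server. The only Robocopy switches used are `/MIR` (mirror the directory tree) and `/XD` (exclude directories).

```python
from pathlib import PureWindowsPath


def build_robocopy_command(source_share: str, dest_share: str, state_folder: str) -> list[str]:
    """Build (but do not run) a Robocopy command that skips the state folder."""
    # Guard against a typo: excluding a folder outside the source has no effect.
    if PureWindowsPath(source_share) not in PureWindowsPath(state_folder).parents:
        raise ValueError("state_folder must be located inside source_share")
    return [
        "robocopy",
        source_share,
        dest_share,
        "/MIR",              # mirror the tree, including deletions
        "/XD", state_folder, # exclude staging files and the metadata database
    ]


# Hypothetical paths, for illustration only.
cmd = build_robocopy_command(
    r"D:\SyncShares\Sales",
    r"\\dr-server\SyncShares$\Sales",
    r"D:\SyncShares\Sales\StateFolder",
)
print(" ".join(cmd))
```

On the primary server, the resulting list could be passed to `subprocess.run` from a scheduled task. Because `/MIR` also propagates deletions to the destination, a job like this has to be stopped before clients start syncing with the secondary server.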

Before the secondary server is put into service, make sure file replication (from the primary server to the secondary server) is stopped. Otherwise, once clients start syncing with the secondary server, replication could delete their changes, which may result in data loss.

Because there is no metadata database tracking the file changes, any change made on a client during the server failover can turn into a conflict file, deleted files can come back, and moved directories may surface again in their old location. This is not data loss, but users will need to manually figure out what to keep and what to delete or move again.

Although all three options are supported, we recommend building DR on BCDR products or on the VSS backup and restore approach, as the secondary server will be in a consistent state after failover. As you can see from the steps above, using a file replication application is very error prone and may result in data loss if the procedures are not followed correctly.