Introduction
A LucidLink Filespace is a data storage repository that functions as a global storage collaboration platform. Users anywhere in the world can connect to a Filespace from Windows, Linux and macOS machines. Although LucidLink offers excellent redundancy and availability, a service outage could affect users in multiple geographic regions.
A LucidLink Filespace is backed by cloud storage but mounts and functions as local storage on each client machine. Depending on the OS, a Filespace can be mounted as a local volume or as a network share/UNC path. This adds flexibility to users’ workflows and to the backup and recovery options and tools that can be used.
Before any backup and recovery plan can be designed, implemented, tested and eventually executed, a business continuity plan (BCP) must be created.
Basic Components
The 4 basic components of any BCP are:
- Assessment
- Preparedness
- Response
- Recovery
A business continuity plan at a macro level will typically encompass far more than just the data contained in a LucidLink Filespace. However, a business continuity plan will provide the foundation and workflow that will be required to successfully recover from an event such as a LucidLink outage or a ransomware attack affecting data in a LucidLink Filespace.
Let’s take a look at how these 4 components apply to the backup and recovery of data in a LucidLink Filespace.
Assessment
The data contained in a LucidLink Filespace may affect multiple business functions and processes. It is important to conduct a business impact analysis (BIA) to identify each business function and process that would be affected by the inability to access data in a LucidLink Filespace. According to ASIS International (the American Society for Industrial Security), the 4 main components of a BIA are:
- Identify critical processes.
- Assess impact if a crisis were to occur.
- Determine the maximum allowable outage and recovery time objective (RTO).
- Identify the resources required for recovery.
Simply put, not all the data in a LucidLink Filespace will necessarily have the same impact on business operations since not all business processes and functions have the same criticality. Thus, not all data in the Filespace will have the same recovery point objective (RPO) and recovery time objective (RTO).
This is very important. For example, if one has 50 TB of data in a Filespace, only a subset of that 50 TB will have the shortest RPO and RTO, which negates the need to restore all 50 TB at once. The BIA may have identified that “these specific folders in the Filespace amounting to a total of 500 GB” need to be restored for business-critical operations to resume within the identified RTO.
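To make the difference concrete, here is a rough back-of-the-envelope sketch of restore time for the full 50 TB versus the critical 500 GB. The sustained throughput of 1 Gbit/s is purely an assumption for illustration, as are the decimal units; real restore speed depends on the backup tool, storage and network.

```python
def restore_hours(size_bytes: float, throughput_bits_per_s: float) -> float:
    """Return the hours needed to move size_bytes at a sustained throughput."""
    seconds = (size_bytes * 8) / throughput_bits_per_s
    return seconds / 3600


TB = 10**12       # decimal terabyte
GB = 10**9        # decimal gigabyte
LINK_BPS = 10**9  # ASSUMPTION: 1 Gbit/s sustained restore throughput

full_restore = restore_hours(50 * TB, LINK_BPS)
critical_restore = restore_hours(500 * GB, LINK_BPS)

print(f"Full 50 TB restore:      {full_restore:.1f} h")
print(f"Critical 500 GB restore: {critical_restore:.1f} h")
```

Under these assumptions the full restore takes roughly 111 hours, while the critical subset takes a little over 1 hour, which is why identifying the critical subset in the BIA matters.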
Also important to note is that the RTO for specific data in the Filespace may change throughout the calendar year correlating to the fluctuation of the criticality of certain business functions. This must be identified in the BIA as well.
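One way to capture this in the BIA is to record a default RTO per folder plus an optional tighter RTO for peak months, then derive the restore order for the month in which the outage occurs. The sketch below is only an illustration: the folder names, sizes, RTO values and the `BiaEntry` structure are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class BiaEntry:
    folder: str                        # hypothetical Filespace folder
    size_gb: int
    rto_hours: float                   # RTO for most of the year
    peak_months: Set[int] = field(default_factory=set)  # months with a tighter RTO
    peak_rto_hours: Optional[float] = None

    def effective_rto(self, month: int) -> float:
        """RTO that applies in the given calendar month (1-12)."""
        if month in self.peak_months and self.peak_rto_hours is not None:
            return self.peak_rto_hours
        return self.rto_hours


def restore_order(entries: List[BiaEntry], month: int) -> List[str]:
    """Folders sorted so the shortest effective RTO is restored first."""
    ranked = sorted(entries, key=lambda e: e.effective_rto(month))
    return [e.folder for e in ranked]


# Hypothetical BIA results
bia = [
    BiaEntry("/archive", 40000, rto_hours=168),
    BiaEntry("/finance/year-end", 200, rto_hours=48,
             peak_months={12, 1}, peak_rto_hours=4),
    BiaEntry("/projects/active", 500, rto_hours=8),
]

print(restore_order(bia, month=6))   # an ordinary month
print(restore_order(bia, month=12))  # year-end peak
```

In this example the year-end finance folder moves from second to first in the restore order during December and January.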
Preparedness
Here is where the tools one will use to back up and recover data in the Filespace are identified, configured and tested. Options range from commercial backup products from Veeam, Quest (NetVault), Acronis and others to open source command line tools such as rsync, robocopy and rclone. Once the tool or tools are identified, configuration and testing begin.
Backups of Filespace data should follow the time-tested 3-2-1 rule:
- 3 copies of your data (Production + 2 backup copies).
- 2 different types of media storage.
- 1 offsite copy.
In today’s cloud storage centric world, a 3-2-2 methodology is sometimes used instead. It calls for 2 offsite copies, each in a different geographic location.
With a LucidLink Filespace, the user, group and permission structure should be backed up in a 3-2-1 fashion as well, especially if there are a large number of users and groups with complex Filespace permission structures. This information can easily be gathered via the LucidLink CLI.
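The 3-2-1 and 3-2-2 rules can be expressed as simple checks against an inventory of data copies. The following is a minimal sketch; the `DataCopy` structure, the media labels, the region names and the decision to treat the production Filespace as the non-offsite primary copy are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class DataCopy:
    name: str
    media: str                    # e.g. "cloud-object-storage", "disk", "tape"
    offsite: bool
    region: Optional[str] = None  # geographic location of an offsite copy


def meets_3_2_1(copies: List[DataCopy]) -> bool:
    """At least 3 copies, 2 media types and 1 offsite copy."""
    media_types = {c.media for c in copies}
    return (len(copies) >= 3
            and len(media_types) >= 2
            and any(c.offsite for c in copies))


def meets_3_2_2(copies: List[DataCopy]) -> bool:
    """At least 3 copies, 2 media types and 2 offsite copies in 2 regions."""
    media_types = {c.media for c in copies}
    offsite = [c for c in copies if c.offsite]
    regions = {c.region for c in offsite if c.region}
    return (len(copies) >= 3
            and len(media_types) >= 2
            and len(offsite) >= 2
            and len(regions) >= 2)


# Hypothetical backup plan
plan = [
    DataCopy("production Filespace", "cloud-object-storage", offsite=False),
    DataCopy("local backup", "disk", offsite=False),
    DataCopy("cloud backup", "cloud-object-storage", offsite=True, region="eu-west"),
]

print(meets_3_2_1(plan))  # True
print(meets_3_2_2(plan))  # False: only one offsite copy
```

Adding a second offsite copy in a different region would make this hypothetical plan satisfy 3-2-2 as well.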
At this point, disaster recovery (DR) testing comes into play. A well-documented plan for a DR test must be created and adhered to during the test. This plan is the playbook the company will rely upon in the event of a disaster. Successful DR testing a) must be performed as close to a real scenario as possible and, just as important, b) must include failback to the existing production Filespace, or to a new Filespace if the outage event left the original Filespace inaccessible. Simply restoring data from backups is not the aim here. In addition to the restoration of data, appropriate permissions to that data must be configured, and the end users who rely on this data in their workflows must be able to work successfully off the temporary DR solution.
Failover from backups can be architected in different ways. For example, one could back up a LucidLink Filespace every 4 hours using Veeam, with the backup data written to local storage. In the event of a disaster, one could restore the critical data from the Veeam backup to a Windows file server, assign permissions to the restored data on that file server, and let users access it over the LAN in the office and over VPN remotely.
Once LucidLink communicates that the outage is over, failback to the production LucidLink Filespace can be scheduled. Failback procedures are a reversal of failover and typically use the same tool(s) used to back up the original production Filespace. They begin after the successful failover of Filespace data to the temporary DR environment. Using our previous example, Veeam can be configured to back up the restored data in use on the Windows file server every 4 hours, storing those backups on local storage as before. These backups continue until the scheduled failback date, when the changed data is restored to the production Filespace. Failback is considered successful when users are able to use the production LucidLink Filespace in their workflows.
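As a rough illustration of what “restoration of changed data” involves, the sketch below lists files on the temporary DR server that were modified after the failover time. This is not how Veeam or any other backup product tracks changes. It is a simple modification-time comparison, and the function name and the idea of recording a failover timestamp are assumptions.

```python
import os
from pathlib import Path
from typing import List


def changed_since(root: str, failover_epoch: float) -> List[str]:
    """Return relative paths under root modified after failover_epoch.

    failover_epoch is a POSIX timestamp recorded when failover completed.
    """
    changed = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = Path(dirpath) / name
            if full.stat().st_mtime > failover_epoch:
                changed.append(full.relative_to(root).as_posix())
    return sorted(changed)
```

A list like this can serve as a sanity check on the scope of a failback. Note that a modification-time comparison does not detect deleted files and depends on accurate clocks, so the incremental tracking built into the chosen backup tool remains the authoritative source.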
After each DR test, it is important to have a debriefing session to document lessons learned and how to further improve recovery processes and the business continuity plan in general.
Response
First, the outage or unavailability of the LucidLink Filespace must be confirmed, and a crisis declared and documented. Basic service status from LucidLink is available 24/7 at lucidlink.com/status. From this and any additional information provided by LucidLink via its customer-facing Slack channel and/or email, one will be able to determine whether or not to put the business continuity plan into action. Just as LucidLink is responsible for communicating an outage event to its customers, a LucidLink customer must communicate the current status of the outage event to affected internal and, potentially, external parties.
Recovery
Here is where the regularly tested and refined business continuity plan goes into effect and the recovery work from Filespace backups begins. As critical business processes come back online following the restoration of critical Filespace data, a decision will be made on restoring the remaining Filespace data for the business processes that are still offline.
The business continuity plan is designed to provide successful DR failover during an outage and, ultimately, successful failback to the production Filespace once the outage is over. Only once failback has been successfully completed can the disaster or crisis event be declared over. This successful recovery must then be communicated to affected customers and internal personnel.
A post mortem will be conducted and any appropriate modifications will be made to the business continuity plan. These documented changes will then need to be tested regularly in the future.