Windows resiliency: Best practices and the path forward

John_Cable · Jul 26, 2024

The broad, open nature and scale of the Windows computing ecosystem are part of what makes it a powerful and unmatched choice across the globe. The recent CrowdStrike incident underscores the need for mission-critical resiliency within every organization, and our unique ability to support the change required.

When a major incident arises, we focus on remediation, learning, and change, all while communicating transparently to our ecosystem. On Saturday, David Weston described our "first responder" approach. Since the start, we have engaged more than 5,000 support engineers, working 24x7, to help bring critical services back online. We are providing ongoing updates via the Windows release health dashboard, where we detail remediation steps, including a signed Microsoft Recovery Tool.

Our goal is to be your trusted partner as you leverage technology and the end-to-end Microsoft stack to deliver amazing value for your workforce, your customers, and your partners. That means, when an issue arises, we immediately engage with partners and customers to dig into the details, help, learn, and evolve.

This incident shows clearly that Windows must prioritize change and innovation in the area of end-to-end resilience. These improvements must go hand in hand with ongoing improvements in security and be made in close cooperation with our many partners, who also care deeply about the security of the Windows ecosystem.

Examples of innovation include the recently announced VBS enclaves, which provide an isolated compute environment that does not require kernel mode drivers to be tamper resistant, and the Microsoft Azure Attestation service, which can help determine boot path security posture. These examples use modern Zero Trust approaches and show what can be done to encourage development practices that do not rely on kernel access. We will continue to develop these capabilities, harden our platform, and do even more to improve the resiliency of the Windows ecosystem, working openly and collaboratively with the broad security community.

There is always the chance that an outage will impact an organization. Over the last few days, we've been on thousands of calls with organizations around the world. We've observed that those who were able to remediate and recover the most quickly followed a similar set of practices. We want to share those best practices with you.

Best practices to support resiliency in your organization

Have business continuity planning (BCP) and a major incident response plan (MIRP) in place. Include response and recovery best practices that outline the steps needed to get your environment back up and operating, including who to call and how to get support.
Back up data securely and often. We recommend your organization utilize cloud storage and backup solutions, as these are great options for securely accessing, sharing, and collaborating on files from anywhere. Organizations utilizing cloud storage solutions have had better experiences getting back online, as this removed barriers to simply resetting the device.
Ensure that you can restore your Windows devices quickly. A key component of resiliency in the event of an issue is to regularly create system restore points and use Windows built-in recovery options to restore devices. If you use Azure virtual machines, you can take a snapshot of your VMs. Organizations with recent restore points were able to recover more quickly from the recent CrowdStrike issue, and we observed that virtualized/cloud environments were among the quickest to recover.
Utilize deployment rings. Extend safe deployment practices into your environment by creating deployment rings to manage the rollout of updates and new features. Utilize your existing device management tools to manage deployment risk using the same approach Microsoft does. Alternatively, take advantage of automated deployment with Windows Autopatch. If you are using non-Microsoft products in your environment, including antivirus solutions, ensure that they offer ring-based deployment so you can control the pace and scale for your environment. As an example, Microsoft Defender allows for custom configuration of both engine and intelligence update staging.
Use the latest Windows security defaults and enable Windows security baselines. Enable the security features that are available in Windows by default. Take advantage of Windows security baselines, which provide Microsoft-recommended, well-tested configurations based on feedback from Microsoft security engineering teams, product groups, partners, and organizations. Windows offers several built-in security features, from firewalls to encryption to biometrics, plus enterprise-level capabilities such as endpoint detection and response (EDR), data protection, vulnerability management, and compliance monitoring.
Adopt a cloud-native approach to managing Windows devices. Doing so can make it easier to deploy updates and support recovery efforts in outage scenarios. Look at ways to move away from on-premises solutions to cloud management solutions, cloud identity solutions, and ring-based deployment and update management solutions like Windows Autopatch.
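
To make the deployment ring practice above more concrete, here is a minimal Python sketch of a ring-based rollout. It is an illustration only, not how Windows Autopatch or any Microsoft tool works. The ring names and sizes, the 2% failure threshold, and the ring_for, rollout, and apply_update names are all assumptions made up for this example. The idea it shows: place each device in a ring deterministically, update the smallest ring first, and stop before the next ring if too many devices in the current ring fail.

```python
import hashlib

# Hypothetical ring layout: (name, cumulative share of the fleet reached).
RINGS = [("pilot", 0.01), ("early", 0.10), ("broad", 1.00)]


def ring_for(device_id: str) -> str:
    """Map a device to a ring deterministically by hashing its ID."""
    digest = hashlib.sha256(device_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # value in [0, 1)
    for name, cumulative_share in RINGS:
        if bucket < cumulative_share:
            return name
    return RINGS[-1][0]


def rollout(devices, apply_update, max_failure_rate=0.02):
    """Update ring by ring and halt when a ring exceeds the failure threshold.

    apply_update is a caller-supplied (hypothetical) function that returns
    True when a device updated and reported healthy, False otherwise.
    Returns (completed_rings, halted_ring); halted_ring is None on success.
    """
    by_ring = {name: [] for name, _ in RINGS}
    for device in devices:
        by_ring[ring_for(device)].append(device)

    completed = []
    for name, _ in RINGS:
        members = by_ring[name]
        failures = sum(1 for device in members if not apply_update(device))
        if members and failures / len(members) > max_failure_rate:
            return completed, name  # later rings are never touched
        completed.append(name)
    return completed, None
```

For example, rollout(fleet, apply_update) with an update that fails everywhere would stop at the first ring that contains devices, leaving the rest of the fleet unchanged. Hashing the device ID keeps ring membership stable between runs without storing an assignment table; a real environment would more likely assign rings by device group in its management tool.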

Our commitment to transparency

Our focus continues to be on helping our customers recover from this incident. We will practice transparency in sharing learnings, best practices, and, eventually, more detailed discussions that include changes designed to strengthen the broader ecosystem moving forward.
