Cloud data operations: an update on our progress

As the new cloud architecture for our data systems takes shape, it’s time to start sharing information about the new capabilities that will become available. There will be a lot to talk about in the coming months (and years!), but we want to start with a brief summary of why we’re doing this, where we currently stand, and what our near-term roadmap looks like. Don’t worry if some of these ideas are unfamiliar: there is a lot to introduce, and we’ll unpack it all in the future.

You may have noticed a survey on our mailing list recently, and there will be more in the future. We want to tailor informational materials and training resources to meet community needs as they evolve, so we appreciate your input.

Why migrate?

The preservation, discovery, and delivery of geophysical data are foundational to EarthScope’s commitment to advancing science. To that end, EarthScope is ending operations of the on-premises data centers in Seattle (formerly IRIS) and Boulder (formerly UNAVCO) and redesigning data flow and archiving for optimized operation on a commercial cloud service. This will improve the reliability and scalability of existing core geophysical data and services while also enabling us to accelerate data archiving and support emerging data processing methods.

In addition to the reliability offered by cloud systems, researchers will be able to use scalable cloud computing adjacent to the entire data archive. On-demand, serverless elasticity and resilient resources (in keeping with NIST’s definition of cloud computing) are essential characteristics of the EarthScope Data Services cloud offering. In cloud-optimized scientific workflows, data, services, and compute scale to experimental needs with minimal copying, downloading, or other data “toil”. Machine learning projects, for example, will benefit greatly from fast and simple access to large datasets.

We strongly believe this can also broaden opportunities by decreasing reliance on the availability of computing resources at an individual’s institution. As part of a broader movement towards more open and equitable cutting-edge science (e.g., NASA Openscapes, Pangeo, and ESIP), EarthScope will facilitate these cloud-based, data-adjacent workflows through a pilot compute platform and community engagement. For EarthScope systems and for our community, this will be a living, evolving process, but we are clear-eyed about our goal of supporting this community’s journey into leveraging the commercial cloud.

What will change for existing users?

You will still be able to download data as you have in the past—but this will no longer be the only option available to you. While we believe there are strong reasons to take advantage of these new capabilities, we also want to emphasize that we are doing everything possible to avoid disrupting your existing workflows.

Of course, we’ll be developing some new tools for data access, visualization, and other functions, and we know that even positive change can be frustrating, so we will focus on developing clear resources to help minimize that friction. We don’t expect to retire many existing tools, but in those few cases we intend to give ample notice and identify alternatives.

Our larger task is making sure we’re ready to assist those who want to employ new methods of data processing. That’s where our efforts will be the most visible.

What is the current status?

A lot of work has been done behind the scenes, especially over the last few months! This started with exploring and deciding how to configure our presence on Amazon Web Services. We have learned a lot, and design work at this stage will have considerable influence on how easily our systems can grow in the future.

Next was laying the foundation for a robust operations and development system, built on an approach commonly referred to as “infrastructure as code”. These are the kinds of tools that help us orchestrate running processes so they smoothly scale up and down based on demand, or manage software updates to avoid service interruption.

Operating efficiently in the cloud also requires a new data storage structure for some data types, which will provide a number of advantages. This change will allow our systems to more easily convert data selected by users to the file format of their choice. Importantly, the more efficient cloud-native storage format means that data processing tasks can run faster and at much larger scale—making it possible for you to process and analyze more data than you’ve been able to before.

Progress hasn’t been limited to these foundations, though. The GNSS data flow architecture for the USGS ShakeAlert® Earthquake Early Warning System is also in the final stages of evaluation for incorporation into the operational system. The rest of our real-time GNSS data access will function similarly, so this development extends beyond the ShakeAlert System project.

A great deal of effort went into lifting geodetic data operations out of the Boulder data center, which recently closed. These systems probably transitioned to the cloud without you noticing! Legacy systems have been preserved for continuity of service, with an ongoing process to optimize the way they function “under the hood” moving forward—a good example of the approach we’re aiming for with these transitions.

What comes next?

The cloud transition of geodetic data systems previously housed in Boulder will be mirrored for the seismic data systems in Seattle in the near future. There, too, continuity of service will be the initial priority. But once that’s complete, we expect to start providing some upgraded performance—increased speed, more simultaneous connections allowed, and the like.

Subsequently, the things we’re really excited about will start to appear—the tools that will enable you to take full advantage of cloud processing for your data analysis. There is still a lot more work to do to make that happen, but we’ll share more in the interim.

We know that change at this scale can be overwhelming. So to help everyone in the community acquire the skills and familiarity to make the most of these new opportunities, we’ll be providing a variety of tutorial documentation, webinars, short courses, and other informational resources. If you haven’t already, join our general announcements and data announcements mailing lists to stay connected. We’ll be inviting feedback on new releases as we go, but you can ask questions or get help anytime by visiting the Contact Us page.