PEARC '22: Practice and Experience in Advanced Research Computing

Full Citation in the ACM Digital Library

SESSION: Full Papers, Applications and Software

Parallel Multi-Physics Simulation of Biomass Furnace and Cloud-based Workflow for SMEs

Biomass combustion is a well-established process to produce energy that offers a credible alternative for reducing fossil fuel consumption. To optimize the biomass combustion process, numerical simulation is a less expensive and more time-effective approach than experimental methods. However, biomass combustion involves intricate physical phenomena that must be modeled (and validated) carefully, both in the fuel bed and in the surrounding gas. With this level of complexity, these simulations require High-Performance Computing (HPC) platforms and expertise, which are usually not affordable for manufacturing SMEs.

In this work, we developed a parallel tool for the simulation of biomass furnaces that relies on a parallel coupling between Computational Fluid Dynamics (CFD) and the Discrete Element Method (DEM). This approach is computation-intensive but provides accurate and detailed results for biomass combustion with a moving fuel bed. Our implementation combines FOAM-extend (for the gas phase) parallelized with MPI, and XDEM (for the solid particles) parallelized with OpenMP, to take advantage of HPC hardware. We also carry out a thorough performance evaluation of our implementation using an industrial biomass furnace setup. Additionally, we present a fully automated workflow that handles all steps from the user input to the analysis of the results. Hundreds of parameters can be modified, including the furnace geometry and fuel settings. The workflow prepares the simulation input, delegates the compute-intensive simulation to an HPC platform, and collects the results. Our solution is integrated into the Digital Marketplace of the CloudiFacturing EU project and is directly available to SMEs via a Cloud portal.

As a result, we provide a cutting-edge simulation of a biomass furnace running on HPC. With this tool, we demonstrate how HPC can benefit engineering and manufacturing SMEs, and empower them to compute and solve problems that cannot be tackled without it.

Scholarly Data Share: A Model for Sharing Big Data in Academic Research

The Scholarly Data Share (SDS) is a lightweight web interface that facilitates access to large, curated research datasets stored in a tape archive. SDS addresses the common needs of research teams working with and managing large and complex datasets, and the associated storage. The service adds several key features to the standard tape storage offerings that are of particular value to the research community: (1) the ability to capture and manage metadata, (2) metadata-driven browsing and retrieval over a web interface, (3) reliable and scalable asynchronous data transfers, and (4) an interface that hides the complexity of the underlying storage and access infrastructure. SDS is designed to be easy to implement and sustain over time by building on existing tool chains and proven open-source software and by minimizing bespoke code and domain-specific customization. In this paper, we describe the development of the SDS and the implementation of an instance to provide access to a large collection of geospatial datasets.

The C-MĀIKI Gateway: A Modern Science Platform for Analyzing Microbiome Data

In collaboration with the Center for Microbiome Analysis through Island Knowledge and Investigations (C-MĀIKI), the Hawaii EPSCoR Ike Wai project and the Hawaii Data Science Institute, a new science gateway, the C-MĀIKI gateway, was developed to support modern, interoperable and scalable microbiome data analysis. This gateway provides a web-based interface for accessing high-performance computing resources and storage to enable and support reproducible microbiome data analysis. The C-MĀIKI gateway is accelerating the analysis of microbiome data for Hawaii through ease of use and centralized infrastructure.

Performance Optimization of the Open XDMoD Datawarehouse

Open XDMoD is an open-source tool that facilitates the management of high performance computing resources. It is widely deployed at academic, industrial, and governmental HPC centers and is used to monitor large and small HPC and cloud systems. The core of Open XDMoD is a MySQL-based data warehouse designed to store historical information for hundreds of millions of jobs while maintaining fast query times for the interactive web portal. In this paper, we describe the transition that we made from the MyISAM to the InnoDB storage engine. Other improvements were also made to the database queries, such as reordering and adding indices. We were able to attain substantial performance improvements in both query execution and data ingestion/aggregation. Databases tend to grow in size and complexity throughout their lifetime; this work presents a practical guide to the practices and procedures that can be followed to maintain data retrieval and ingestion performance.
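
The kinds of changes described read, in spirit, like the following sketch (table and index names are illustrative, not Open XDMoD's actual schema):

```python
# Convert a warehouse table to InnoDB and add an index, from Python.
import mysql.connector  # assumes MySQL Connector/Python is installed

ddl = [
    # Move a fact table from MyISAM to the InnoDB storage engine.
    "ALTER TABLE jobfact ENGINE=InnoDB",
    # Add an index to speed a common aggregation query.
    "CREATE INDEX idx_jobfact_end_time ON jobfact (end_time, resource_id)",
]

conn = mysql.connector.connect(host="localhost", user="xdmod",
                               password="***", database="modw")
cur = conn.cursor()
for stmt in ddl:
    cur.execute(stmt)
conn.commit()
```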

Metrics of financial effectiveness: Return On Investment in XSEDE, a national cyberinfrastructure coordination and support organization

This paper explores the financial effectiveness of a national advanced computing support organization within the United States (US) called the eXtreme Science and Engineering Discovery Environment (XSEDE). XSEDE was funded by the National Science Foundation (NSF) in 2011 to manage delivery of advanced computing support to researchers in the US working on non-classified research. In this paper, we describe the methodologies employed to calculate the return on investment (ROI) for governmental expenditures on XSEDE and present a lower bound on the US government's ROI for XSEDE from 2014 to 2020. For each year of the XSEDE project considered, XSEDE delivered measurable value to the US that exceeded the cost incurred by the Federal Government to fund it; that is, the US Federal Government's ROI for XSEDE was at least 1 each year. Over the course of the study period, the ROI for XSEDE rose from 0.99 to 1.78. This increase was due partly to our ability to assign a value to more and more of XSEDE's services over time and partly to the value of certain XSEDE services increasing over time. From 2014 to 2020, XSEDE offered an ROI of more than $1.50 in value for every $1.00 invested by the US Federal Government. Because our estimations were very conservative, this figure represents a lower bound on the value created by XSEDE. The most important part of the "returns" created by XSEDE is the actual outcomes it enables in terms of education, enabling new discoveries, and supporting the creation of new inventions that improve quality of life. In future work, we will use newly developed accounting methodologies to begin assessing the value of the outcomes of XSEDE.
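
As a hedged reading of the metric (the paper's exact accounting methodology is more detailed than this ratio), the figures above are consistent with defining each year's ROI as the estimated value of services delivered divided by the federal cost of the program:

```latex
\mathrm{ROI}_t = \frac{V_t}{C_t},
\qquad \text{e.g.}\quad \frac{\$1.50\ \text{delivered}}{\$1.00\ \text{invested}} = 1.5,
```

so an ROI of at least 1 means the value delivered at least covered the cost incurred.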

A Framework to capture and reproduce the Absolute State of Jupyter Notebooks

Jupyter Notebooks are an enormously popular tool for creating and narrating computational research projects. They also have enormous potential for creating reproducible scientific research artifacts. Capturing the complete state of a notebook has additional benefits; for instance, the notebook execution may be split between local and remote resources, where the latter may have more powerful processing capabilities or store large or access-limited data. Examined in detail, there are several challenges to making notebooks fully reproducible. The notebook code must be replicated entirely, and the underlying Python runtime environments must be identical. More subtle problems arise in replicating referenced data, external library dependencies, and runtime variable states. This paper presents solutions to these problems using Jupyter's standard extension mechanisms to create an archivable system state for a running notebook. We show that these additional mechanisms, which involve interacting with the underlying Linux kernel, do not introduce substantial execution time overhead, demonstrating the approach's feasibility.
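
As a flavor of what capturing runtime variable state involves, here is a minimal sketch (not the authors' implementation) that snapshots an IPython kernel's picklable user variables with dill; the system described above additionally handles code, data, library, and kernel-level state:

```python
# Minimal sketch: snapshot/restore the interactive namespace of a running
# IPython kernel. Assumes the `dill` package and execution inside IPython.
import dill
from IPython import get_ipython

def snapshot(path="notebook_state.pkl"):
    """Serialize picklable user variables so they can be restored elsewhere."""
    ip = get_ipython()
    state = {}
    for name, value in ip.user_ns.items():
        if name.startswith("_"):
            continue  # skip IPython internals
        try:
            dill.dumps(value)  # keep only what can be serialized
        except Exception:
            continue
        state[name] = value
    with open(path, "wb") as f:
        dill.dump(state, f)

def restore(path="notebook_state.pkl"):
    """Merge a saved snapshot back into the current kernel's namespace."""
    with open(path, "rb") as f:
        get_ipython().user_ns.update(dill.load(f))
```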

C3F: Collaborative Container-based Model Coupling Framework

Solving complex real-world grand challenge problems requires in-depth collaboration of researchers from multiple disciplines. Such collaboration often involves harnessing multiscale and multi-dimensional data and combining models from different fields to simulate systems. However, progress on this front has been limited, mainly due to significant gaps in domain knowledge and to tools that are typically employed in disciplinary silos. Researchers from different fields face considerable barriers to understanding and reusing each other's data and models in order to collaborate effectively. For example, in solving global sustainability problems, researchers from hydrology, climate science, agriculture, and economics need to run their respective models to study different components of the global and local food, energy, and water systems while, at the same time, needing to interact with other researchers and integrate the results of one model with another. Developing this kind of model coupling workflow calls for (1) a large amount of data being processed and exchanged across domains and organizations, (2) identifying and processing the output of one model to make it ready for integration into another model, (3) controlling the workflow dynamically so that it runs until a certain convergence condition or other criterion is met, and (4) close collaboration among the modelers to explore, tune, and test the configuration and data transformation needed to link the models. We have developed C3F, a flexible collaborative model coupling framework that helps researchers accelerate their model integration and linking efforts by leveraging advanced cyberinfrastructure such as high-performance computing and virtual containers. In this paper, we describe our experience and lessons learned in developing this cyberinfrastructure solution to support the linking of the Water Balance Model (WBM) and the SIMPLE-G agricultural economic model in an NSF-funded INFEWS project and a DOE-funded Program on Coupled Human and Earth Systems (PCHES) to study the implications of groundwater scarcity for food-energy-water systems. The C3F model coupling framework can be extended to facilitate other model linkages as well.

SESSION: Full Papers, Systems and System Software

Migrating towards Single Sign-On and Federated Identity

This paper describes a two-tier architecture and implementation for single sign-on (SSO) federated identity support in Chameleon and the rationale that shaped it. We also describe how we migrated our users, a community of several thousand who had created hundreds of thousands of digital artifacts, to a new account management system in privacy-preserving ways.

Experiences in network and data transfer across large virtual organizations—a retrospective

The XSEDE Data Transfer Services (DTS) group focuses on streamlining and improving the data transfer experiences of the national academic research community, while also buttressing and future-proofing the underlying networks that support these transfers. In this paper, the DTS group shares how network and data transfer technologies have evolved over the past six years, with the backdrop of the Distributed Terascale Facility (DTF) and TeraGrid projects that served the national community before the advent of XSEDE. We delve into improvements, challenges, and trends in network and data transfer technologies, and the uses of these technologies in academic institutions across the country, which today translate into hundreds of CI users moving many terabytes each month. We also review the key lessons learned while serving the community in this regard, and what the future holds for academic networking and data transfer.

Comparing single-node and multi-node performance of an important fusion HPC code benchmark

Fusion simulations have traditionally required the use of leadership scale High Performance Computing (HPC) resources in order to produce advances in physics. The impressive improvements in compute and memory capacity of many-GPU compute nodes are now allowing some problems that once required a multi-node setup to be solved on a single node. When possible, the increased interconnect bandwidth can result in an order of magnitude higher science throughput, especially for communication-heavy applications. In this paper we analyze the performance of the fusion simulation tool CGYRO, an Eulerian gyrokinetic turbulence solver designed and optimized for collisional, electromagnetic, multiscale simulation, which is widely used in the fusion research community. Due to the nature of the problem, the application has to work on a large multi-dimensional computational mesh as a whole, requiring frequent exchange of large amounts of data between the compute processes. In particular, we show that the average-scale nl03 benchmark CGYRO simulation can be run at an acceptable speed on a single Google Cloud instance with 16 A100 GPUs, outperforming 8 NERSC Perlmutter Phase 1 nodes, 16 ORNL Summit nodes, and 256 NERSC Cori nodes. Moving from a multi-node to a single-node GPU setup, we get comparable simulation times using less than half the number of GPUs. Larger benchmark problems, however, still require a multi-node HPC setup due to GPU memory capacity needs, since at the time of writing no vendor offers nodes with sufficient GPU memory. The upcoming external NVSwitch does, however, promise to deliver an almost equivalent solution for up to 256 NVIDIA GPUs.

PROWESS: An Open Testbed for Programmable Wireless Edge Systems

Edge computing is a growing paradigm in which compute resources are provisioned between data sources and the cloud to decrease compute latency from data transfer, lower costs, comply with security policies, and more. Edge systems are as varied as their applications, serving internet services, IoT, and emerging technologies. Due to the tight constraints experienced by many edge systems, research computing testbeds have become valuable tools for edge research and application benchmarking. Current testbed infrastructure, however, fails to properly emulate many important edge contexts, leading to inaccurate benchmarking. Institutions with broad interests in edge computing can build testbeds, but prior work suggests that edge testbeds are often application- or sensor-specific. A general edge testbed should include access to many of the sensors, software, and accelerators on which edge systems rely, while slicing those resources to fit user-defined resource footprints. PROWESS is an edge testbed that answers this challenge. PROWESS provides access across an institution to sensors, compute resources, and software for testing constrained edge applications. PROWESS runs edge workloads as sets of containers with access to sensors and specialized hardware on an expandable cluster of lightweight edge nodes that leverage institutional networks to decrease implementation cost and provide wide access to sensors. We implemented a multi-node PROWESS deployment connected to sensors across Ohio State University's campus. Using three edge-native applications, we demonstrate that PROWESS is simple to configure, has a small resource footprint, scales gracefully, and minimally impacts institutional networks. We also show that PROWESS closely approximates native execution of edge workloads and facilitates experiments that other systems testbeds cannot.

Measuring XSEDE: Usage Metrics for the XSEDE Federation of Resources: A comparison of evolving resource usage patterns across TeraGrid and XSEDE

The Extreme Science and Engineering Discovery Environment (XSEDE) program and its predecessor, the TeraGrid, have provided a range of advanced computing resources to the U.S. research community for nearly two decades. The continuously collected data set of resource usage spanning these programs provides a unique opportunity to examine the behaviors of researchers on these resources. By revisiting analyses from the end of TeraGrid, we find both similarities and differences in ecosystem activity, not all of which can be explained by the technological advances of the past decade. Many of the basic metrics of computing system use show familiar patterns, but community composition and engagement have evolved. Along with growth in the number of individuals using the resources, we see significant changes in the fraction of students in the user community. And while many individuals have only short-term interaction with the resources, we can see signs of more sustained use by a growing portion of projects.

Phoenix: The Revival of Research Computing and the Launch of the New Cost Model at Georgia Tech

Originating from partnerships formed by central IT and researchers supporting their own clusters, the traditional condominium and dedicated cluster models for research computing are appealing and prevalent among emerging centers throughout academia. In 2008, Georgia Institute of Technology (GT) launched a campus strategy to centralize the hosting of computing resources across multiple science and engineering disciplines under a group of expert support personnel, and in 2009 the Partnership for an Advanced Computing Environment (PACE) was formed. Due to the increases in scale over the past decade, however, the initial models created challenges for the research community, systems administrators, and GT's leadership. In 2020, GT launched a strategic initiative to revitalize research computing through a refresh of the infrastructure and computational resources in parallel with the migration to a new state-of-the-art datacenter, Coda, followed by the transition to a new consumption-based cost model. These efforts have resulted in an overall increase in cluster utilization, access to more hardware, a decrease in queue wait times, a reduction in resource provision times, and an increase in return on investment, suggesting that such a model is highly advantageous for academic research computing centers. Presented here are the methods employed in making the change to the new cost model, data supporting these claims, and the ongoing improvements to continue meeting the needs of the GT research community, whose research is accelerated by the deployment of the new cost model and the Phoenix cluster, which ranked #277 on the November 2020 Top500 list.

CHI-in-a-Box: Reducing Operational Costs of Research Testbeds

Making scientific instruments for computer science research available and open to all is more important than ever given the constantly increasing pace of opportunity and innovation – yet, such instruments are expensive to build and operate given their complexity and need for rapid evolution to keep pace with the advancing frontier of science. This paper describes how we can lower the cost of computer science testbeds by making them easier to deploy and operate. We present CHI-in-a-Box, a packaging of CHameleon Infrastructure (CHI) underlying the Chameleon testbed, describe the practices that went into its design and implementation, and present three case studies of its use.

High Performance MPI over the Slingshot Interconnect: Early Experiences

The Slingshot interconnect designed by HPE/Cray is becoming more relevant in High-Performance Computing with its deployment on the upcoming exascale systems. In particular, it is the interconnect empowering the first exascale and highest-ranked supercomputer in the world, Frontier. It offers various features such as adaptive routing, congestion control, and isolated workloads. The deployment of newer interconnects raises questions about performance, scalability, and any potential bottlenecks, as they are a critical element contributing to the scalability across nodes on these systems. In this paper, we delve into the challenges the Slingshot interconnect poses for current state-of-the-art MPI libraries. In particular, we look at scalability performance when using Slingshot across nodes. We present a comprehensive evaluation using various MPI and communication libraries including Cray MPICH, OpenMPI + UCX, RCCL, and MVAPICH2-GDR on GPUs on the Spock system, an early access cluster deployed with Slingshot and AMD MI100 GPUs, to emulate the Frontier system.

A Fully Automated Scratch Storage Cleanup Tool for Heterogeneous Parallel Filesystems

Transitional data are a common component of most large-scale simulations and data analysis. Most research computing centers provide scratch storage to keep temporary data needed only during the runtime of jobs. Efficient management of scratch storage becomes critical for HPC centers with limited resources. Different research computing centers employ various policies and approaches to sustain the tricky balance between filesystem capabilities, user expectations, and excellence in customer support. In this paper, we present a homegrown, fully automated scratch storage cleanup tool, along with the policies and procedures that we have built around it. This tool runs without human intervention to clean up our scratch space periodically, and it is compatible with both GPFS and Lustre, two popular parallel filesystems that we use for our scratch service. Our approach takes into consideration both filesystems' unique features and makes the workflow generic enough while accommodating these differences. The workflow has successfully run at our center for several months. Due to the limited literature documenting this type of work, we share our experience with the community for the benefit of other centers with similar needs.
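
To make the mechanism concrete, the following is a simplified, hypothetical sketch of an age-based purge; the production tool described above additionally relies on the GPFS and Lustre policy engines, user notifications, and safety checks:

```python
# Simplified age-based scratch purge sketch (dry-run by default).
import argparse
import os
import time

def purge(root: str, max_age_days: int, dry_run: bool = True) -> None:
    cutoff = time.time() - max_age_days * 86400
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.lstat(path)
            except OSError:
                continue  # file vanished mid-scan; common on busy scratch
            # Judge activity by the most recent of access/modify times.
            if max(st.st_atime, st.st_mtime) < cutoff:
                print(("WOULD REMOVE " if dry_run else "REMOVING ") + path)
                if not dry_run:
                    os.unlink(path)

if __name__ == "__main__":
    p = argparse.ArgumentParser()
    p.add_argument("root")
    p.add_argument("--days", type=int, default=60)
    p.add_argument("--delete", action="store_true")
    args = p.parse_args()
    purge(args.root, args.days, dry_run=not args.delete)
```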

NetInfra - A Framework for Expressing Network Infrastructure as Code

NetInfra is a framework designed to manage both DHCP and DNS services. It does this through a single source of truth, formatted to be consumed by configuration management software, which reduces duplication of effort and unifies the configuration of these interrelated services across otherwise unrelated software systems. In this paper, we review a production deployment managing DHCP and DNS services via the NetInfra framework and discuss the strengths and weaknesses of the framework itself.
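
As an illustration of the single-source-of-truth idea (names and formats here are invented, not NetInfra's actual schema), one host inventory can be rendered into both DHCP and DNS configuration fragments:

```python
# One inventory drives both services, so the records can never disagree.
HOSTS = [
    {"name": "node01.example.edu", "mac": "aa:bb:cc:dd:ee:01", "ip": "10.0.0.11"},
    {"name": "node02.example.edu", "mac": "aa:bb:cc:dd:ee:02", "ip": "10.0.0.12"},
]

def dhcp_fragment(h: dict) -> str:
    """ISC dhcpd-style static host stanza."""
    return (f'host {h["name"]} {{ hardware ethernet {h["mac"]}; '
            f'fixed-address {h["ip"]}; }}')

def dns_fragment(h: dict) -> str:
    """BIND-style A record."""
    return f'{h["name"]}. IN A {h["ip"]}'

for h in HOSTS:
    print(dhcp_fragment(h))
    print(dns_fragment(h))
```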

Automatic Benchmark Testing with Performance Notification for a Research Computing Center

Automatic performance testing on HPC systems draws increasing interest. It provides great advantages by freeing operational staff time and making tests run more consistently with fewer errors. However, it remains challenging to automate a benchmark suite along with data analytics and notifications, often requiring software system redesign and additional code development. This paper extends our previous benchmark framework, ProvBench, with a decoupled workflow, allowing integration with the GitLab CI framework. The performance workflow includes data analytics for comparing performance against historical baseline data and generates recommendation notifications to PACE staff. The automatic benchmark testing has been deployed on GT PACE systems and has proven useful for the early detection of system failures.
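
The baseline-comparison step might look like the following sketch (the threshold and file format are invented; in a framework like the one described above this would run as a CI job after the benchmark completes):

```python
# Compare a new benchmark result against the historical baseline and flag
# regressions beyond a tolerance.
import json
import statistics

def check(history_file: str, new_gflops: float, tolerance: float = 0.10) -> str:
    with open(history_file) as f:
        baseline = [run["gflops"] for run in json.load(f)]
    median = statistics.median(baseline)
    if new_gflops < (1 - tolerance) * median:
        return (f"ALERT: {new_gflops:.1f} GFLOP/s is more than "
                f"{tolerance:.0%} below the median baseline {median:.1f}")
    return "OK"
```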

Benchmarking the Performance of Accelerators on National Cyberinfrastructure Resources for Artificial Intelligence / Machine Learning Workloads

Upcoming regional and National Science Foundation (NSF)-funded Cyberinfrastructure (CI) resources will give researchers opportunities to run their artificial intelligence / machine learning (AI/ML) workflows on accelerators. To effectively leverage this burgeoning CI-rich landscape, researchers need extensive benchmark data to maximize performance gains and map their workflows to appropriate architectures. This data will further assist CI administrators, NSF program officers, and CI allocation reviewers in making informed determinations on CI-resource allocations. Here, we compare the performance of two very different architectures, the commonly used Graphics Processing Units (GPUs) and the new generation of Intelligence Processing Units (IPUs), by running training benchmarks of common AI/ML models. We leverage the maturity of the software stacks and the ease of migration among these platforms, and find that performance and scaling are similar for both architectures. Exploring training parameters such as batch size, however, shows that owing to their memory processing structures, IPUs run efficiently with smaller batch sizes, while GPUs benefit from large batch sizes to extract sufficient parallelism in neural network training and inference. This comes with different advantages and disadvantages, as discussed in this paper. As such, considerations of inference latency, inherent parallelism, and model accuracy will play a role in researchers' selection of these architectures. The impact of these choices on a representative image compression model system is discussed.

Quantifying the Impact of Advanced Web Platforms on High Performance Computing Usage

The deployment of Science Gateways for High Performance Computing (HPC) systems can alter long-accepted usage patterns on supercomputing systems in positive ways as an ever-increasing number of users migrate their workflows to HPC systems. Idaho National Laboratory (INL) has deployed two separate advanced web platforms, Open OnDemand and NICE DCV, for integration with HPC resources to improve web accessibility for HPC users. We conducted a multi-year study on how HPC usage patterns changed in the presence of these platforms. This work reports the results of that study and quantifies the observed impacts, including adoption by visualization and Jupyter Notebook/Lab users, decreased job submission friction, rapid uptake of HPC by Windows users, and increased overall system utilization. The most significant impacts were observed from the deployment of Open OnDemand, and this work also identifies some best practices for Open OnDemand deployment for HPC datacenters.

Measuring the Relative Outputs of Computational Researchers in Higher Education

Studies have shown that in the aggregate, investment in high-performance computing contributes to research output at both the departmental and institutional levels. Missing from this picture is data about the impact on output and productivity at the level of the individual faculty member. This study will explore the impact of the use of high-performance computing on the output of the average university researcher.

Containerizing Visualization Software: Experiences and Best Practices

The standard process for software development has changed dramatically in the past decade. What was once a large effort of installing the same software across different systems has become much more streamlined with the rapid emergence and wide-scale adoption of Docker as the de facto container management ecosystem. This has had an impact in the HPC and scientific computing community, allowing system maintainers to maintain and install packages with less effort [12]. This can be seen through the adoption of containers on many large-scale systems, including those supported by the Texas Advanced Computing Center (TACC), DOE, XSEDE, and the wider NSF community [23, 34, 15]. An extra layer of work is necessary when developing containers that require visualization technologies. This includes applications that require a windowing system such as X [9] to render GUIs. It can be more difficult still to expose NVIDIA or other GPU-related capabilities to containers [32]. While the ability to containerize applications is widely available, there is no central resource for creating containers with visualization requirements. Our work aims to consolidate common issues for visualization containers, including both shared and unique solutions. We detail the various ways we have worked at TACC to make visualization containers more available to researchers using HPC systems, to ease their development, and to promote their application in lab research spaces. Our hope is to share challenges that other researchers may face, and to provide possible solutions, so that containers can be adopted more easily.

Anvil - System Architecture and Experiences from Deployment and Early User Operations

Anvil is a new XSEDE advanced capacity computational resource funded by NSF. Designed with a systematic strategy to meet the ever-increasing and diversifying research needs for advanced computational capacity, Anvil integrates a large capacity high-performance computing (HPC) system with a comprehensive ecosystem of software, access interfaces, programming environments, and composable services in a seamless environment to support a broad range of current and future science and engineering applications of the nation's research community. Anchored by a 1000-node CPU cluster featuring the latest AMD EPYC 3rd generation (Milan) processors, along with a set of 1TB large memory and NVIDIA A100 GPU nodes, Anvil integrates a multi-tier storage system, a Kubernetes composable subsystem, and a pathway to the Azure commercial cloud to support a variety of workflows and storage needs. Anvil was successfully deployed and integrated with XSEDE during the world-wide COVID-19 pandemic. Entering production operation in February 2022, Anvil will serve the nation's science and engineering research community for five years. This paper describes the Anvil system and services, including its various components and subsystems and its user-facing features, and shares the Anvil team's experience through its early user access program from November 2021 through January 2022.

SESSION: Full Papers, Workforce Development, Training, Diversity, and Education

Exchanging Best Practices for Supporting Computational and Data-Intensive Research, The Xpert Network

We present best practices for professionals who support computational and data-intensive (CDI) research projects. The practices resulted from the Xpert Network activities, an initiative that brings together major NSF-funded projects for advanced cyberinfrastructure, national projects, and university teams that include individuals or groups of such professionals. Additionally, our recommendations are based on years of experience building multidisciplinary applications and teaching computing to scientists. This paper focuses particularly on practices that differ from those in a general software engineering context. This paper also describes the Xpert Network initiative where participants exchange best practices, tools, successes, challenges, and general information about their activities, leading to increased productivity, efficiency, and coordination in the ever-growing community of scientists that use computational and data-intensive research methods.

Understanding Factors that Influence Research Computing and Data Careers

Research Computing and Data (RCD) professionals play a crucial role in supporting and advancing research that involves data and/or computing; however, there is a critical shortage of RCD professionals, and organizations face challenges in recruiting and retaining RCD staff. It is not obvious to people outside of RCD how their skills and experience map to the RCD profession, and staff currently in RCD roles lack resources for creating a professional development plan. To address these gaps, the CaRCC RCD Career Arcs working group has embarked upon an effort to gain a deeper understanding of the paths that RCD professionals follow across their careers. An important step in that effort is a recent survey the working group conducted of RCD professionals on key factors that influence decisions in the course of their careers. The survey gathered responses from over 200 respondents at institutions across the United States. This paper presents our initial findings and analyses of the data gathered. We describe how various genders, career stages, and types of RCD roles impact the ranking of these factors, and note that while there are differences across these groups, respondents were broadly consistent in their assessment of the importance of these factors. In some cases, the responses clearly distinguish RCD professionals from the broader workforce, and even from other Information Technology professionals.

Building the Research Innovation Workforce: Challenges and Recommendations from a Virtual Workshop to Advance the Research Computing Community

The workforce for research computing, cyberinfrastructure, and data analytics is a complex global ecosystem comprised of workers across academia, national laboratories, and industry. To explore the underlying factors that affect the growth and vitality of this workforce ecosystem, we conducted an NSF funded virtual workshop during the third quarter of 2020 attended by 100 participants. The workshop identified challenges affecting the workforce pipeline and ecosystem and generated recommendations to help address these challenges. This paper provides a summary of the workshop, challenges, and recommendations.

Characterizing the US Research Computing and Data (RCD) Workforce

A growing share of computationally and data-intensive research, both inside and outside of academia, requires the involvement and support of computing and data professionals. Yet little is known about the composition of the research computing and data (RCD) workforce. This paper presents the results of a survey (N=563) of RCD professionals’ demographic and educational backgrounds, work experience, current positions, job responsibilities, and views of working in the RCD field. We estimate the size of the RCD workforce and discuss how the demographic diversity and distribution of backgrounds of those in the RCD workforce fail to match that of the larger academic and technical workforces. These survey results additionally support the insights of those working in the field concerning the need to recruit a wider variety of professionals into the RCD profession, better define job descriptions and career pathways, and improve institutional recognition for the value of RCD work.

Evaluating research computing training and support as part of a broader digital research infrastructure needs assessment

Digital Research Infrastructure (DRI) refers to the suite of tools and services that enables the collection, processing, dissemination, and disposition of research data. This includes strategies for planning, organizing, storing, sharing, computing, and ultimately archiving or destroying one's research data. These services must be supported by highly qualified personnel with the appropriate expertise. From May 17 to June 12, 2021, the University of British Columbia (UBC) Advanced Research Computing (UBC ARC) and the UBC Library, from both the Vancouver and Okanagan campuses, launched the DRI Needs Assessment Survey to investigate UBC researchers' needs across 25 distinct DRI tools and services. The survey received a total of 241 responses, and following the survey, three focus groups were conducted with survey respondents to gain additional insights.

This paper outlines the DRI Needs Assessment Survey and its findings, focusing on those directly related to UBC ARC services and training in high-performance computing (HPC) and cloud computing (“Cloud”), and discusses next steps for implementing a more collaborative, comprehensive research computing training and support model. Key findings suggest that while advanced research computing infrastructure is a key pillar of DRI, researchers utilizing UBC ARC also rely on a number of other DRI tools and services to conduct their research. These services are widely scattered across various departments and groups within and outside the institution and are oftentimes not well communicated, impacting researchers' ability to find them. Current research computing training and support have been found to be inadequate, and duplicated service efforts occur in silos, resulting in an inefficient service model and wasted funds.

SESSION: Short Papers, Applications and Software

A Design Pattern for Recoverable Job Management

Processing scientific workloads involves staging inputs, executing and monitoring jobs, archiving outputs, and doing all of this in a secure, repeatable way. Specialized middleware has been developed to automate this process in HPC, HTC, cloud, Kubernetes and other environments. This paper describes the Job Management (JM) design pattern used to enhance workload reliability, scalability and recovery. We discuss two implementations of JM in the Tapis Jobs service, both currently in production. We also discuss the reliability and performance of the system under load, such as when 10,000 jobs are submitted at once.
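
The following sketches the core of such a pattern in miniature (it is not the Tapis Jobs internals): persist every state transition durably before acting on it, so a restarted worker can resume in-flight jobs rather than lose them:

```python
# Recoverable job lifecycle sketch backed by SQLite.
import sqlite3

STATES = ["PENDING", "STAGING_INPUTS", "RUNNING", "ARCHIVING", "FINISHED"]

db = sqlite3.connect("jobs.db")
db.execute("CREATE TABLE IF NOT EXISTS jobs (id TEXT PRIMARY KEY, state TEXT)")

def submit(job_id: str) -> None:
    db.execute("INSERT OR IGNORE INTO jobs VALUES (?, 'PENDING')", (job_id,))
    db.commit()

def advance(job_id: str) -> str:
    """Record the job's next lifecycle state durably, then return it."""
    (state,) = db.execute("SELECT state FROM jobs WHERE id=?",
                          (job_id,)).fetchone()
    nxt = STATES[min(STATES.index(state) + 1, len(STATES) - 1)]
    db.execute("UPDATE jobs SET state=? WHERE id=?", (nxt, job_id))
    db.commit()
    return nxt

def recover() -> list:
    """On restart, return every job that was still in flight."""
    rows = db.execute("SELECT id FROM jobs WHERE state != 'FINISHED'")
    return [job_id for (job_id,) in rows]
```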

Cyberinfrastructure value: a survey on perceived importance and usage

The research landscape in science and engineering is heavily reliant on computation and data storage. The intensity of computation required for many research projects illustrates the importance of the availability of high performance computing (HPC) resources and services. This paper summarizes the results of a recent study among principal investigators that attempts to measure the impact of the cyberinfrastructure resources allocated by the XSEDE (eXtreme Science and Engineering Discovery Environment) project on various research activities across the United States. Critical findings from this paper include: a majority of respondents report that the XSEDE environment is important or very important in completing their funded work, and two-thirds of our study's respondents developed products (e.g., datasets, websites, software, etc.) using XSEDE-allocated resources. About one-third of respondents cited the importance of XSEDE-allocated resources in securing research funding. Respondents of this survey self-reported securing approximately $3.3B in research funding from various sources.

Webots.HPC: A Parallel Simulation Pipeline for Autonomous Vehicles

In the rapidly evolving and maturing field of robotics, computer simulation has become an invaluable tool in the design and evaluation process. Autonomous vehicle (AV) and microscopic traffic simulators can be integrated to produce cost-effective tools for simulating AVs in the traffic stream. Our research sets out to develop a formalized parallel pipeline for running sequences of Webots simulations on powerful high performance computing (HPC) resources. Since running these simulations on personal lab computers is challenging, this paper presents a framework to support the Webots and Simulation of Urban Mobility (SUMO) simulation tools in an HPC environment. Simulations can be run in sequence, with a batch job being distributed across an arbitrary number of computing nodes and each node having multiple instances running in parallel. We have demonstrated parallel execution of Webots and SUMO with as many as 2304 simulations across 6 nodes in a 12-hour period. Overall, this capable pipeline can be used to extend existing research or to serve as a platform for new robotics simulation endeavors. This paper will serve as an important reference for researchers in efficiently simulating AVs in a mixed roadway traffic stream.
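
A per-node fan-out of simulator instances could be as simple as the following sketch (the command line is a placeholder; the actual pipeline coordinates Webots and SUMO configurations for each run):

```python
# Run several headless simulator instances concurrently on one node.
import subprocess
from concurrent.futures import ThreadPoolExecutor

def run_sim(seed: int) -> int:
    # Hypothetical invocation of a headless Webots run for one scenario.
    cmd = ["webots", "--batch", "--mode=fast", f"world_seed_{seed}.wbt"]
    return subprocess.run(cmd).returncode

with ThreadPoolExecutor(max_workers=8) as pool:  # instances per node
    codes = list(pool.map(run_sim, range(32)))   # this node's share of the batch
print(sum(c == 0 for c in codes), "of", len(codes), "simulations succeeded")
```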

Workflow management for scientific research computing with Tapis Workflows: Architecture and Design Decisions behind Software for Research Computing Pipelines

Developing research computing workflows often demands significant understanding of DevOps tooling and related software design patterns, requiring researchers to spend time learning skills that are often outside the scope of their domain expertise. In late 2021, we began development of the Tapis Workflows API to address these issues. Tapis Workflows provides researchers with a tool that simplifies the creation of their workflows by abstracting away the complexities of the underlying technologies behind a user-friendly API that integrates with HPC resources available at any institution with a Tapis deployment. Tapis Workflows Beta is slated to be released by the end of April 2022. In this paper, we discuss the high-level system architecture of Tapis Workflows, the project structure, the terminology and concepts employed in the project, use cases, design and development challenges, and the solutions we chose to overcome them.

ScriptManager: an interactive platform for reducing barriers to genomics analysis

ScriptManager was built to be a lightweight and easy-to-use genomics analysis tool for novice bioinformaticians. It includes a graphical interface for easy navigation of inputs and options while also supporting a command-line interface for automation and integration with workflow managers like Galaxy. We describe here how a user unfamiliar with the command line can leverage national supercomputing resources using a graphical desktop interface like Open OnDemand to perform their analyses and generate publication-quality figures for their research. Widespread adoption of this tool in the genomics community would lower technical barriers to accessing supercomputing resources and allow biochemists to prototype their own workflows that can be integrated into large-scale production pipelines. Source code and precompiled binaries are available at https://github.com/CEGRcode/scriptmanager.

Validating new Automated Computer Vision workflows to traditional Automated Machine Learning

This paper presents experiments to validate the design of an Automated Computer Vision (AutoCV) library for applications in scientific image understanding. AutoCV attempts to define a search space of algorithms used in common image analysis workflows and then uses a fitness function to automatically select individual algorithmic workflows for a given problem. The final goal is a semi-automated system that can assist researchers in finding computer vision algorithms that work for their specific research questions. As an example of this method, the researchers have built the SEE-Insight tool, which uses genetic algorithms to search for image analysis workflows. This tool has been used to implement an image segmentation workflow (SEE-Segment) and is being updated and modified to work with other image analysis workflows such as anchor point detection and counting. This work is motivated by analogous work being done in Automated Machine Learning (AutoML). As a way to validate the approach, this paper uses the SEE-Insight tool to recreate an AutoML solution (called SEE-Classify) and compares results to an existing AutoML solution (TPOT). As expected, the existing AutoML tool worked better than the prototype SEE-Classify tool. However, the goal of this work was to learn from these well-established tools and possibly identify one that could be modified as a mature replacement for the core SEE-Insight search algorithm. Although this drop-in replacement was not found, reproducing the AutoML experiments in the SEE-Insight framework provided quite a few insights into best practices for moving forward with this research.

CyberGIS for Scalable Remote Sensing Data Fusion

Satellite remote sensing data products are widely used in many applications and science domains ranging from agriculture and emergency management to Earth and environmental sciences. Researchers have developed sophisticated and computationally intensive models for processing and analyzing such data with varying spatiotemporal resolutions from multiple sources. However, the computational intensity and expertise in using advanced cyberinfrastructure have held back the scalability and reproducibility of such models. To tackle this challenge, this research employs the CyberGIS-Compute middleware to achieve scalable and reproducible remote sensing data fusion across multiple spatiotemporal resolutions by harnessing advanced cyberinfrastructure. CyberGIS-Compute is a cyberGIS middleware framework for conducting computationally intensive geospatial analytics with advanced cyberinfrastructure resources such as those provisioned by XSEDE. Our case study achieved remote sensing data fusion at high spatial and temporal resolutions based on integrating CyberGIS-Compute with a cutting-edge deep learning model. This integrated approach also demonstrates how to achieve computational reproducibility of scalable remote sensing data fusion.

Continuous Integration for HPC with Github Actions and Tapis

Continuous integration and deployment (CICD) are fundamental to modern software development. While many platforms such as GitHub and Atlassian provide cloud solutions for CICD, these solutions don’t fully meet the unique needs of high performance computing (HPC) applications. These needs include, but are not limited to, testing distributed memory and scaling studies, both of which require an HPC environment. We propose a novel framework for running CICD workflows on supercomputing resources. Our framework directly integrates with GitHub Actions and leverages TACC’s Tapis API for communication with HPC resources. The framework is demonstrated for PYthon Ocean PArticle TRAcking (PYOPATRA), an HPC application for Lagrangian particle tracking.
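
A CI step in such a framework might drive the HPC side through Tapis's Python client, roughly as sketched below (the app id, base URL, and credentials are placeholders, and the real framework handles authentication and result collection more carefully):

```python
# Submit an HPC test job from a CI runner via the tapipy client.
from tapipy.tapis import Tapis

t = Tapis(base_url="https://tacc.tapis.io",
          username="ci-bot", password="***")  # placeholder credentials
t.get_tokens()

job = t.jobs.submitJob(name="pyopatra-ci-test",
                       appId="pyopatra-tests",  # hypothetical Tapis app
                       appVersion="0.1")
print(job.uuid, t.jobs.getJobStatus(jobUuid=job.uuid).status)
```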

Extending Functionalities on a Web-based Portal for Research Computing

This paper introduces a research computing portal built as an extension to the Open OnDemand (OOD) framework. The portal was implemented entirely by students at Texas A&M University's (TAMU) High Performance Research Computing (HPRC) facility. It offers an intuitive way for researchers to see all their research computing information on a single web page. This information includes billing accounts, file quotas, recently completed jobs, and currently running jobs. A researcher can also view detailed job information, both for running and completed jobs. The dashboard also provides a fully visual interface for creating jobs and “offloading” user codes to the cluster, as well as functionality to manage accounts and to request quota increases and software installations.

Sophon: an Extensible Platform for Collaborative Research

In the last few years, the web-based interactive computational environment called Jupyter notebook has been gaining more and more popularity as a platform for collaborative research and data analysis, becoming a de-facto standard among researchers. In this paper we present a first implementation of Sophon, an extensible web platform for collaborative research based on JupyterLab. Our aim is to extend the functionality of JupyterLab and improve its usability by integrating it with Django. In the Sophon project, we integrate the deployment of dockerized JupyterLab instances into a Django web server, creating an extensible, versatile and secure environment, while also being easy to use for researchers of different disciplines.

AlphaFold2 Workflow Optimization for High Throughput Predictions in HPC Environment

In this work, we propose a high-throughput implementation that executes AlphaFold2 efficiently in a High-Performance Computing environment. We tested our proposed workflow with the T1050 CASP14 sequence on PSC's Bridges-2 HPC system. The results showed an improvement in computation-only runtimes and the opportunity to reuse the protein databases when calculating many structures simultaneously, which leads to massive time savings while maximizing the utilization of computing resources.

Custos Secrets: a Service for Managing User-Provided Resource Credential Secrets for Science Gateways

Custos is open source software that provides user, group, and resource credential management services for science gateways. This paper describes the resource credential, or secrets, management service in Custos that allows science gateways to safely manage security tokens, SSH keys, and passwords on behalf of users. Science gateways such as Galaxy are well-established mechanisms for researchers to access cyberinfrastructure and, increasingly, couple it with other online services, such as user-provided storage or compute resources. To support this use case, science gateways need to operate on behalf of the users to connect, acquire, and release these resources, which are protected by a variety of authentication and access mechanisms. Storing and managing the credentials associated with these access mechanisms must be done using “best of breed” software and established security protocols. The Custos Secrets Service allows science gateways to store and retrieve these credentials using secure protocols and APIs while the data is protected at rest. Here, we provide implementation details for the service, describe the available APIs and SDKs, and discuss integration with Galaxy as a use case.

Data Discoverability in Science Gateways at Scale using Elasticsearch Cluster Architecture

Science gateways allow science and engineering communities to access shared data, software, computing services, instruments, educational materials, and other resources specific to their disciplines. One specific example is the use of science gateways to connect researchers with HPC resources by providing a graphical interface to submit jobs and manage shared data sets. In addition to job and data management, robust search features are highly valuable additions to gateways because they enhance the navigability of a user's personal data as well as the discoverability of collaborative data resources. For a facility managing multiple science gateway products, maintaining up-to-date search indices is a challenge. In this paper, we discuss the framework, architecture, and operation of a multitenant Elasticsearch cluster designed to fulfill the search needs of an expanding portfolio of science gateways. By leveraging Elasticsearch's distributed data model and role-based access control, we designed a secure search solution that has scaled to over 200 million indexed entities representing approximately 700 terabytes of research data.
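
The tenant-per-index pattern can be illustrated with the official Python client (index names and documents here are invented; in production, each gateway's credentials would be bound to a role scoped to its own indices):

```python
# Index and search one gateway's file metadata in its own tenant index.
from elasticsearch import Elasticsearch

es = Elasticsearch("https://search.example.edu:9200",
                   api_key="***")  # per-gateway credential

es.index(index="gateway-a-files",
         document={"path": "/projects/a/data.nc", "owner": "alice",
                   "size_bytes": 123456, "tags": ["netcdf", "climate"]})

# Role-based access control confines this search to the tenant's index.
hits = es.search(index="gateway-a-files",
                 query={"match": {"tags": "climate"}})
print(hits["hits"]["total"])
```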

COSMO: a Research Data Service Platform and Experiences from the BlueTides Project

We present details and experiences related to the COSMO project, advanced by the Pittsburgh Supercomputing Center (PSC) and the McWilliams Center for Cosmology at Carnegie Mellon University for the BlueTides Simulation project. The design of COSMO focuses on expediting access to key information, minimizing data transfer, and offering an intuitive user interface and easy-to-use data-sharing tools. COSMO consists of a data-sharing web portal, API tools that enable quick data access and analysis for scientists, and a set of recommendations for scientific data sharing. The BlueTides simulation project, one of the most extensive cosmological hydrodynamic simulations ever performed, provides voluminous scientific data ideal for testing and validating COSMO. Successful experiences include COSMO enabling intuitive and efficient remote data access, which resulted in a successful James Webb Space Telescope proposal to observe the first quasars in its first observing cycle.

Towards Practical, Generalizable Machine-Learning Training Pipelines to build Regression Models for Predicting Application Resource Needs on HPC Systems

This paper explores the potential for cost-effectively developing generalizable and scalable machine-learning-based regression models for predicting the approximate execution time of an HPC application given its input data and parameters. This work examines: (a) to what extent models can be trained on scaled-down datasets on commodity environments and adapted to production environments, (b) to what extent models built for specific applications can generalize to other applications within a family, and (c) how the most appropriate model may change based on the type of data and its mix. As part of this work, we also describe and show the use of an automatable pipeline for generating the necessary training data and building the model.
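
The regression task itself is conventional, as in this sketch (the features and values are invented; the pipeline described above generates such training data automatically):

```python
# Predict approximate execution time from application inputs/parameters.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Illustrative feature columns: input size (GB), core count, solver tolerance.
X = np.array([[1, 16, 1e-6], [2, 16, 1e-6], [4, 32, 1e-7], [8, 64, 1e-7]])
y = np.array([120.0, 230.0, 310.0, 480.0])  # observed runtimes (seconds)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_tr, y_tr)
print(model.predict(X_te))  # approximate execution-time estimate (seconds)
```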

Experience Migrating a Pipeline for the C-MĀIKI gateway from Tapis v2 to Tapis v3

The C-MĀIKI gateway is a science gateway that leverages a computational workload management API called Tapis to support modern, interoperable, and scalable microbiome data analysis. This project is focused on migrating an existing C-MĀIKI gateway pipeline from Tapis v2 to Tapis v3 so that it can take advantage of the new, more robust Tapis v3 features and stay modern. This requires three major steps: 1) containerizing each existing microbiome workflow, 2) creating a new app definition for each of the workflows, and 3) enabling job submission to a SLURM scheduler inside a Singularity container to support the Nextflow workflow manager. This work presents the experience and challenges in upgrading the pipeline.

SESSION: Short Papers, Systems and System Software

SciAuth: A Lightweight End-to-End Capability-Based Authorization Environment for Scientific Computing

We introduce a new end-to-end software environment that enables experimentation with using SciTokens for capability-based authorization in scientific computing. This set of interconnected Docker containers enables science projects to gain experience with the SciTokens model prior to adoption. It is a product of our SciAuth project, which supports the adoption of the SciTokens model through community engagement, support for coordinated adoption of community standards, assistance with software integration, security analysis and threat modeling, training, and workforce development.

CyberGIS-Cloud: A unified middleware framework for cloud-based geospatial research and education

Interest in cloud-based cyberinfrastructure continues to grow within the geospatial community to tackle contemporary big data challenges. Distributed computing frameworks, deployed over the cloud, provide scalable and low-maintenance solutions to accelerate geospatial research and education. However, for scientists and researchers, the usage of such resources is highly constrained by the steep curve for learning diverse sets of platform-specific tools and APIs. This paper presents CyberGIS-Cloud as a unified middleware to streamline the execution of distributed geospatial workflows over multiple cloud backends with easy-to-use interfaces. CyberGIS-Cloud employs a bring-computation-to-data model, abstracting and automating job execution over distributed resources hosted in the cloud environment where the data resides. We present details of CyberGIS-Cloud with support for popular distributed computing frameworks backed by the research-oriented JetStream Cloud and the commercial Google Cloud Platform.

Buzzard: Georgia Tech’s Foray into the Open Science Grid

Open Science Grid (OSG) is a consortium that enables many scientific breakthroughs by providing researchers with access to shared High Throughput Computing (HTC) compute clusters in support of large-scale collaborative research. To meet the demand on campus, Georgia Institute of Technology (GT)’s Partnership for an Advanced Computing Environment (PACE) team launched a centralized OSG support project, powered by Buzzard, an NSF-funded OSG cluster. We describe Buzzard’s unique multi-tenant architecture, which supports multiple projects on a single CPU/GPU pool, for the benefit of other institutions considering a similar approach to support OSG on their campuses.

Corralling sensitive data in the Wild West: supporting research with highly sensitive data

Due to increased demand from researchers working with highly sensitive data, UC Berkeley developed the Secure Research Data and Compute (SRDC) platform and service. This article describes the design and architecture of the platform as well as key use cases for researchers working on SRDC. The article concludes with observations and lessons learned about the platform and service.

Federating CI Policy in Support of Multi-institutional Research: Lessons from the Ecosystem for Research Networking

The ERN (Ecosystem for Research Networking) works to address challenges that researchers face when participating in multi-campus team science projects. There are a variety of technical and collaborative coordination problems associated with shared access to research computing and data located across the national cyberinfrastructure ecosystem. One of these problems is the need to develop organizational policy that can work in parallel with policies at different institutions or facilities. Generally, universities are not set up to support science teams that are distributed across many locations, making policy alignment an even more complex issue. We describe some of the work of the ERN Policy Working Group and introduce some key issues that surfaced while developing a guiding policy framework.

Navigating Dennard, Carbon and Moore: Scenarios for the Future of NSF Advanced Computational Infrastructure

After a long period of steady improvement, scientific computing equipment (SCE, or HPC) is being disrupted by the end of Dennard scaling, the slowing of Moore's Law, and new pressure to reduce carbon emissions in the fight against climate change. What does this mean for the future? We develop a system and portfolio model based on historical NSF XSEDE site systems and apply it to examine potential technology scenarios and what they mean for future compute capacity, power consumption, carbon emissions, datacenter siting, and more.

Early Experiences with Tight Integration of Kubernetes in an HPC Environment

The Ohio Supercomputer Center has deployed a Kubernetes cluster with tight integration to a high performance computing (HPC) environment. This deployment leverages existing file systems for data sharing between HPC systems and Kubernetes objects, monitoring, account management, resource management, and accounting systems. This paper describes the motivation and overall design, the novel methods for the implementation, and the applications supported by this new resource. It also presents a short description of future work and some of the questions raised by this design.

The ERN Cryo-EM Federated Instrument Pilot Project

Feedback and survey data collected from hundreds of participants in the Ecosystem for Research Networking (formerly Eastern Regional Network) series of NSF-funded (OAC-2018927) community outreach meetings and workshops revealed that instrument-driven structural biology is being forced to transition from self-contained islands to federated, wide-area, internet-accessible instruments. This paper discusses phase 1 of the active ERN Cryo-EM Federated Instrument Pilot project, whose goal is to facilitate inter-institutional collaboration at the interface of computing and electron microscopy through the implementation of the ERN Federated OpenCI Lab’s Instrument CI Cloudlet design. The project will culminate in a web-based portal leveraging federated access to the instrument, workflows utilizing edge computing in conjunction with cloud computing, and real-time monitoring for experimental parameter adjustments and decisions. The intention is to foster team science and scientific innovation, with emphasis on under-represented and under-resourced institutions, through the democratization of these scientific instruments.

Integrating End-to-End Exascale SDN into the LHC Data Distribution Cyberinfrastructure

The Compact Muon Solenoid (CMS) experiment at the CERN Large Hadron Collider (LHC) distributes its data by leveraging a diverse array of National Research and Education Networks (NRENs), which CMS is forced to treat as an opaque resource. Consequently, CMS sees highly variable performance that already poses a challenge for operators coordinating the movement of petabytes around the globe. This kind of unpredictability, however, threatens CMS with a logistical nightmare as it barrels towards the High Luminosity LHC (HL-LHC) era in 2030, which is expected to produce roughly 0.5 exabytes of data per year. This paper explores one potential solution to this issue: software-defined networking (SDN). In particular, the prototypical interoperation of SENSE, an SDN product developed by the Energy Sciences Network, with Rucio, the data management software used by the LHC, is outlined. In addition, this paper presents the current progress in bringing these technologies together.

Artificial Intelligence to Classify and Detect Masquerading Users on HPC Systems from Shell Histories

Modern high-performance computing (HPC) systems are typically accessed through interactive Linux shell sessions. An HPC system comprises login and compute nodes; researchers first connect to a login node, where they are provided a bash shell session. By default, bash shell histories are recorded up to a certain number of commands. Since 2013, Arizona State University HPC clusters have enabled a login-sourced shell utility that records session histories to a hidden user home directory. These HPC shell histories are typically used to help diagnose researcher issues as they arise and have proved invaluable in that respect. However, these histories may have additional value: they can characterize how researchers engage with HPC systems, and perhaps be leveraged to foster collaboration or improve the HPC research cycle. This study documents a novel analysis of these prospective datasets by training two different machine learning methods on typical shell behavior to detect a masquerading user.
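
As one plausible formulation of such a detector (the abstract does not spell out the two methods, so this is an illustrative sketch rather than the authors' pipeline), a per-user model can be fit on historical commands and then used to score new sessions:

    # Sketch: per-user anomaly detection over shell histories. The TF-IDF +
    # one-class-SVM choice is illustrative, not necessarily the paper's method.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.svm import OneClassSVM

    def fit_user_profile(history_lines):
        """Learn one user's typical command-line behavior."""
        vec = TfidfVectorizer(token_pattern=r"\S+")
        X = vec.fit_transform(history_lines)
        return vec, OneClassSVM(nu=0.05).fit(X)  # tolerate ~5% training outliers

    def session_is_suspicious(vec, model, session_lines, threshold=0.5):
        """Flag a session if most of its commands look unlike the owner's."""
        preds = model.predict(vec.transform(session_lines))  # +1 normal, -1 outlier
        return (preds == -1).mean() > threshold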

Institutional Value of a Nobel Prize

The Nobel Prize is awarded each year to individuals who have conferred the greatest benefit to humankind in Physics, Chemistry, Medicine, Economics, Literature, and Peace, and is considered by many to be the most prestigious recognition for one’s body of work. Receiving a Nobel Prize confers a sense of financial independence and significant prestige, vaulting its recipients to global prominence. Apart from the prize money (approximately US$1,145,000), a Nobel laureate can expect to benefit in a number of ways, including increased success in securing grants, wider adoption and promulgation of one’s theories and ideas, increased professional and academic opportunities, and, in some cases, a measure of celebrity. A Nobel laureate’s affiliated institution, by extension, also benefits greatly. Because of this, many institutions seek to employ Nobel Prize winners or individuals who have a high likelihood of winning one in the future. Many of the recent discoveries and innovations recognized with a Nobel Prize were made possible only because of advanced computing capabilities. The ways in which advanced research computing facilities and services enable new and important discoveries cannot be overlooked when examining the value of a Nobel Prize. This paper explores the benefits to an institution of having a Nobel Prize winner among its ranks.

Designing a Vulnerability Management Dashboard to Enhance Security Analysts’ Decision Making Processes

Network vulnerability management reduces threats posed by weaknesses in software, hardware, or organizational practices. As networks and related threats grow in size and complexity, security analysts face the challenges of analyzing large amounts of data and prioritizing and communicating threats quickly and efficiently. In this paper, we report our work in progress on developing a vulnerability management dashboard that helps analysts overcome these challenges. The approach uses interviews to identify a typical security analyst workflow and proceeds with an iterative design that relies on real-world data. The dashboard was designed around this common security analyst workflow and includes functions that allow vulnerabilities to be prioritized according to their age, persistence, and impact on the system. Future work will look to execute full-scale user studies to evaluate the dashboard’s functionality and decision-making utility.

Auto-scaling HTCondor pools using Kubernetes compute resources

HTCondor has been very successful in managing globally distributed, pleasantly parallel scientific workloads, especially as part of the Open Science Grid. HTCondor’s system design makes it ideal for integrating compute resources provisioned from anywhere, but it has very limited native support for autonomously provisioning resources managed by other solutions. This work presents a solution that allows for autonomous, demand-driven provisioning of Kubernetes-managed resources. A high-level overview of the employed architectures is presented, paired with a description of the setups used in both on-prem and Cloud deployments in support of several Open Science Grid communities. The experience suggests that the described solution should be generally suitable for contributing Kubernetes-based resources to existing HTCondor pools.
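
The core of such a provisioner is a reconcile loop: measure idle demand in the HTCondor pool, then resize a set of Kubernetes-hosted worker pods to match. The sketch below uses the real htcondor Python bindings and the official Kubernetes client, but the deployment name, namespace, and worker cap are hypothetical, and this is not necessarily how the paper's solution is implemented.

    # Demand-driven reconcile loop (illustrative): count idle jobs in the
    # schedd, then scale a Deployment of HTCondor worker pods to match.
    import htcondor
    from kubernetes import client, config

    MAX_WORKERS = 50  # hypothetical cap on provisioned workers

    def reconcile():
        schedd = htcondor.Schedd()
        idle = schedd.query(constraint="JobStatus == 1",  # 1 == Idle
                            projection=["ClusterId", "ProcId"])
        desired = min(len(idle), MAX_WORKERS)

        config.load_kube_config()
        client.AppsV1Api().patch_namespaced_deployment_scale(
            name="condor-workers", namespace="osg-pool",  # hypothetical names
            body={"spec": {"replicas": desired}})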

The anachronism of whole-GPU accounting

NVIDIA has been making steady progress in increasing the compute performance of its GPUs, resulting in order-of-magnitude compute throughput improvements over the years. With several models of GPUs coexisting in many deployments, the traditional accounting method of treating all GPUs as equal no longer reflects compute output. Moreover, for applications that require significant CPU-based compute to complement the GPU-based compute, it is becoming harder and harder to make full use of the newer GPUs, requiring sharing of those GPUs between multiple applications in order to maximize the achievable science output. This further reduces the value of whole-GPU accounting, especially when the sharing is done at the infrastructure level. We thus argue that GPU accounting for throughput-oriented infrastructures should be expressed in GPU core hours, much as is normally done for CPUs. While GPU core compute throughput does change between GPU generations, the variability is similar to what we expect to see among CPU cores. To validate our position, we present an extensive set of run time measurements of two IceCube photon propagation workflows on 14 GPU models, using both on-prem and Cloud resources. The measurements also outline the influence of GPU sharing at both the HTCondor and Kubernetes infrastructure levels.
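
The proposed metric is straightforward arithmetic; the sketch below illustrates it with the published CUDA-core counts of two NVIDIA models, with a share factor capturing infrastructure-level GPU sharing:

    # GPU core-hour accounting: cores x wall-hours x share of the device,
    # rather than counting every GPU as one unit regardless of model.
    CUDA_CORES = {"V100": 5120, "A100": 6912}

    def gpu_core_hours(model: str, wall_hours: float, share: float = 1.0) -> float:
        return CUDA_CORES[model] * wall_hours * share

    # A 10-hour job on a whole V100 vs. the same job sharing half an A100:
    print(gpu_core_hours("V100", 10.0))        # 51200.0 core-hours
    print(gpu_core_hours("A100", 10.0, 0.5))   # 34560.0 core-hours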

Developing Accurate Slurm Simulator

A new Slurm simulator compatible with the latest Slurm version has been produced. It was constructed by systematically transforming the Slurm code step by step to maintain the proper scheduler output realization while reducing simulation time. To test this simulator, a container-based Virtual Cluster was generated that fully mimics a production HPC cluster. As for all Slurm simulators, the realization is a stochastic process dependent on the computational hardware. Under favorable conditions, the simulator is able to approximate the actual Slurm scheduling realization. The simulation fidelity is sufficient to use the simulator for its main function, that is, to test Slurm parameter configurations without having to experiment on full production systems.

Return on Investment in Research Cyberinfrastructure: State of the Art

“What is the Return On Investment (ROI) for a cyberinfrastructure system or service?” seems like a natural question to ask. Existing literature shows strong evidence of good return on investment in cyberinfrastructure. This paper summarizes key points from historical studies of ROI in cyberinfrastructure for the US research community. In so doing, we can draw new conclusions based on existing studies. A wide variety of studies show that many types of important “returns” increase in response to more investment in or use of advanced cyberinfrastructure facilities. Published analyses show a positive (>1) ROI for investment in cyberinfrastructure by higher education institutions and federal funding agencies.

Aggregating and Consolidating two High Performant Network Topologies: The ULHPC Experience

High Performance Computing (HPC) delivers advanced computation through parallel processing. The execution time of a given simulation depends upon many factors, such as the number of CPU/GPU cores, their utilisation factor and, of course, the interconnect performance, efficiency, and scalability. In practice, this last component and the associated topology remain the most significant differentiators between HPC systems and less performant systems. The University of Luxembourg has operated a large academic HPC facility since 2007, which remains one of the reference implementations within the country and offers a cutting-edge research infrastructure to Luxembourg public research. The main high-bandwidth, low-latency network of the operated facility relies on the dominant interconnect technology in the HPC market, i.e., InfiniBand (IB) over a Fat-Tree topology. It is complemented by an Ethernet-based network defined for management tasks, external access, and interactions with users’ applications that do not support InfiniBand natively. The recent acquisition of a new cutting-edge supercomputer, Aion, which was federated with the previous flagship cluster, Iris, was the occasion to aggregate and consolidate the two types of networks. This article depicts the architecture and the solutions designed to expand and consolidate the existing networks beyond their seminal capacity limits while best preserving their bisection bandwidth. At the IB level, and despite moving away from a non-blocking configuration, the proposed approach defines a blocking topology that maintains the previous Fat-Tree height. The leaf connection capacity is more than tripled (moving from 216 to 672 end-points) while exhibiting very marginal penalties, i.e., less than 3% (resp. 0.3%) Read (resp. Write) bandwidth degradation against reference parallel I/O benchmarks, and a stable and sustainable point-to-point bandwidth efficiency among all possible pairs of nodes (measured above 95.45% for bi-directional streams). With regard to the Ethernet network, a novel 2-layer topology aiming to improve the availability, maintainability, and scalability of the interconnect is described. It was deployed together with consistent network VLANs and subnets enforcing strict security policies via ACLs defined on layer 3, offering isolated and secure network environments. The implemented approaches are applicable to a broad range of HPC infrastructures and thus may help other HPC centres consolidate their own interconnect stacks when designing or expanding their network infrastructures.
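
The key quantity behind the IB trade-off described above is the leaf-level oversubscription (blocking) ratio; a toy calculation follows, with port splits assumed purely for illustration (the actual ULHPC wiring will differ):

    # Blocking ratio of a two-level fat-tree leaf: node-facing downlinks
    # divided by spine-facing uplinks. Port splits here are assumed examples.
    def oversubscription(downlinks: int, uplinks: int) -> float:
        return downlinks / uplinks

    print(oversubscription(20, 20))  # 1.0  -> non-blocking 20/20 split
    print(oversubscription(28, 12))  # ~2.33 -> blocking, but 40% more end-points per leaf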

SESSION: Short Papers, Workforce Development, Training, Diversity, and Education

A Partnership Framework for Scaling a Workforce of Research Cyberprofessionals

The research cyberinfrastructure community has long recognized challenges recruiting, developing, retaining, and scaling a strong workforce given the irregular cycle of sponsored research projects and institutional initiatives. The challenges are myriad and, across academic institutions, often vary based on the particular route through which the cyberinfrastructure enterprise evolved. Here we focus on a few specific challenges: building credibility with the research community, ensuring avenues for staff development and advancement, and providing diverse "hard money" positions to retain talented staff. This paper lays out the approach behind some recent successes in confronting these challenges at the Minnesota Supercomputing Institute in the hopes of providing a general framework for other institutes to adapt.

SESSION: Posters

HPC Outreach and Education at Nebraska

Outreach and education play a critical role in any high-performance computing (HPC) center to help researchers accelerate their research and analyses. At the University of Nebraska’s Holland Computing Center (HCC), this remains true, with numerous users, classes, and research groups utilizing the high-performance resources available at HCC. The Holland Computing Center is working to expand and grow the capabilities of researchers using HPC resources and to reduce the barrier to entry at all stages of experience. This is currently accomplished through training events such as workshops and tutorials, documentation, and various tools. With these, HCC aims to further improve training and learning opportunities for researchers utilizing HCC resources.

Expanding the Reach of Research Computing: A Landscape Study: Pathways Bringing Research Computing to Smaller Universities and Community Colleges

Research computing continues to play an ever-increasing role in academia. Access to computing resources, however, varies greatly between institutions. Sustaining the growing need for computing skills and access to advanced cyberinfrastructure requires that computing resources be available to students at all levels of scholarship, including community colleges. The National Science Foundation-funded Building Research Innovation in Community Colleges (BRICCs) community set out to understand the challenges faced by administrators, researchers, and faculty in building a sustainable research computing continuum that extends to smaller and two-year, terminal-degree-granting institutions. BRICCs’ purpose is to address the technology gaps and encourage the development of curriculum needed to grow a computationally proficient research workforce. Toward these goals, we performed a landscape study that culminated in a community workshop. Here, we present our key findings from workshop discussions and identify next steps to be taken by BRICCs, funding agencies, and the broader cyberinfrastructure community.

Broadening the Reach for Access to Advanced Computing: Leveraging the Cloud for Research

Many smaller, mid-sized, and under-resourced campuses, including MSIs, HSIs, HBCUs, and EPSCoR institutions, have compelling science research and education activities along with an awareness of the benefits associated with better access to cyberinfrastructure (CI) resources. These schools can benefit greatly from resources and expertise for cloud adoption for research to augment their in-house efforts. The Broadening the Reach (BTR) working group of the Ecosystem for Research Networking (ERN), formerly the Eastern Regional Network, is addressing this by focusing on learning directly from the institutions how best to support them. ERN BTR findings and recommendations, gathered as part of the NSF-sponsored CC* CRIA award (OAC-2018927), will be shared based on engagement with the community, including results of workshops and surveys on the challenges and opportunities institutions face as they evaluate using the cloud for research and education.

ITSM in Supercomputing: Improving service delivery, reliability, and user satisfaction

Supercomputing in small academic centers has traditionally been driven by informal “get it done” practices. As environments grow more complex and user needs become more diverse, Information Technology Service Management (ITSM) practices are a valuable tool for formalizing processes. Implementing ITSM practices allows centers to better manage risk, increase stability, and better record data about changes and support needs, furthering the science mission of their clients more effectively. Additionally, the maturity of ITSM practices within university IT has advanced the expectation that supercomputer centers work this way. In this paper, we describe the ongoing ITSM implementations at the Ohio Supercomputer Center (OSC) and explain the challenges and benefits we have seen.

Cybersecurity and Research are not a Dichotomy: How to form a productive operational relationship between research computing and data support teams and information security teams

Cybersecurity and research do not have to be opposed to each other. With increasing cyberattacks, it is more important than ever for cybersecurity and research to cooperate. The authors describe how Research Liaisons and Information Assurance: Michigan Medicine (IA:MM) collaborate at Michigan Medicine, an academic medical center subject to strict HIPAA controls and frequent risk assessments. IA:MM provides its own liaison to work with the Research Liaisons to better understand security processes and guide researchers through them. IA:MM has developed formal risk decision processes and informal engagements with the CISO to provide risk-based cybersecurity instead of controls-based. This collaboration has helped develop mitigating procedures for researchers when standard controls are not feasible.

Regional Collaborations Supporting Cyberinfrastructure-Enabled Research During a Pandemic: The Structure and Support Plan of the SWEETER CyberTeam

Cyberinfrastructure enthusiasts in the Southwest United States collaborated to form the National Science Foundation CC*-funded SWEETER CyberTeam. SWEETER offers CI support to foster research collaborations at several minority-serving institutions in Texas, New Mexico, and Arizona. Its training programs and student mentorship have supported participants, with several taking CI professional positions at research computing facilities. In this paper, we discuss the structure of the CyberTeam and the impact of the COVID-19 pandemic on its activities. The SWEETER CyberTeam has a hub-and-spoke structure that adopted a federated approach to ensure that each site maintained its own identity and was able to leverage local programs. It took a "boots on the ground" approach that ensured that services were up and running in a short period of time. To ensure adequate coverage of all fields of science, the project adopted an inclusive fractional service approach that leveraged expertise at the participating sites. The CyberTeam has organized several workshops, hackathons, and training events. Team members have participated in competitions, and several follow-on programs have been funded. We present the achievements and lessons learned from this effort and discuss efforts to make it sustainable.

Building Experience and Confidence in HPC Practitioners through the Project-Based, Hands-On Practical HPC Course

The MIT SuperCloud and Lincoln Laboratory Supercomputing Center have been introducing High Performance Computing (HPC) to a new audience through the "Practical High Performance Computing: Scaling Beyond your Laptop" class for the past four years. This informal class, open to the entire MIT community, introduces HPC, identifies canonical HPC workflows, and provides hands-on activities to explore the challenges encountered in the HPC environment. The students use their own research applications as project work to apply the class concepts, gaining experience and confidence in using an HPC system and in the scaling process. Survey data collected before and after each class demonstrate that students feel they gain familiarity and experience in the concepts taught in the course and confidence in their own ability to apply those concepts.

The Impact of Penn State Research Innovation with Scientists and Engineers (RISE) Team, a joint ICDS and NSF CC* Team Project: How the RISE Team has accelerated and facilitated cross-disciplinary research for Penn State's researchers statewide

The use of computing in science and engineering has become nearly ubiquitous. Whether researchers are using high performance computers to solve complex differential equations modeling climate change or using effective social media strategies to engage the public in a discourse about the importance of Science, Technology, Engineering, and Mathematics (STEM) education, cyberinfrastructure (CI) has become our most powerful tool for the creation and dissemination of scientific knowledge. With this sea change in the scientific process, tremendous discoveries have been made possible, but not without significant challenges.

The Research Innovation with Scientists and Engineers (RISE) team was created to address some of these challenges. Over the past two years, Penn State Institute for Computational and Data Sciences’ (ICDS) research staff have partnered with RISE CI experts who facilitate research through a variety of CI resources. These include, but are not limited to, Penn State's high performance computing resources (Roar), national resources such as the Open Science Grid and XSEDE, and cloud services provided by Amazon, Google, and Microsoft.

Using funds provided by the National Science Foundation (NSF) CC* program, the RISE team has had direct engagement through multiple activities that benefit research projects conducted at Penn State. In addition, the RISE team has conducted seminars, workshops, and other training activities to bolster the cyberinfrastructure literacy of students, postdocs and faculty across disciplines. The RISE team has grown as a workforce shared across investigators who have consulted on projects both large and small. We show that the RISE team has already paid substantial dividends through increased productivity of faculty and more efficient use of external funding.

Challenges and Lessons Learned of Formalizing the Partnership Between Libraries and Research Computing Groups to Support Research: The Center for Research Data and Digital Scholarship

At universities, collaboration between libraries and research computing personnel can enhance services for data-intensive research in a manner that encompasses the entire data lifecycle. The University of Colorado Boulder has undertaken such a libraries-research computing collaboration known as the Center for Research Data and Digital Scholarship (CRDDS). Here, the challenges, successes, and lessons learned during the first five years of CRDDS are shared. Differences in culture, nomenclature, tools, budgets, and operations can cause confusion and misunderstandings between the two groups and can hinder the goal of providing collaborative research support. CRDDS has mitigated these issues by implementing a coordinator role, developing and documenting standard procedures, and providing structured venues for members to share ideas and experiences. Some early successes include development of a comprehensive training program for data-oriented topics, implementation of support for "big data" publishing, and establishment of a graduate certificate program in Digital Humanities.

IndySCC: A New Student Cluster Competition That Broadens Participation

SimVascular Gateway for Education and Research

Over the last two decades, science gateways have become essential tools for supporting both research and education. The SimVascular application is an open-source software package providing a complete pipeline from medical image data segmentation to patient-specific blood flow simulation and analysis. With an ever-increasing user base of students, educators, clinicians, and researchers, the development group wanted a user-friendly web portal where users can run SimVascular flow simulations, one that can support a large number of users with minimal effort and hide the complexity of using HPC systems. This paper discusses how the SimVascular Science Gateway became a tool for students, educators, and researchers of all levels and continues to gather and grow a strong research community.

HPC Data Analysis Pipeline for Neuronal Cluster Detection

Obtaining neural clusters from data sets collected over different developmental stages poses a computational challenge that is complicated by the number of data sets, clustering methods, and hyperparameters. We used the MATLAB parallel toolkit to parallelize the execution of the hyperparameter sweeps and developed a workflow for parallelizing the data processing. We present a run-time performance comparison of the workflow for two clustering methods on the Stampede2 supercomputer. Our study explored the performance of MATLAB implementations of the K-means and Louvain algorithms for cluster detection, using covariance and cosine similarity matrices, and investigated hyperparameter settings for each algorithm.
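
The paper's sweep is implemented in MATLAB; as an illustration of the same pattern (parallel evaluation of a clustering hyperparameter grid), a Python analogue might look like the following, with K-means standing in for either clustering method:

    # Python analogue of the MATLAB hyperparameter sweep: score K-means for a
    # grid of cluster counts in parallel, one worker process per setting.
    from multiprocessing import Pool
    from sklearn.cluster import KMeans
    from sklearn.metrics import silhouette_score

    def evaluate_k(args):
        data, k = args
        labels = KMeans(n_clusters=k, n_init=10).fit_predict(data)
        return k, silhouette_score(data, labels)

    def sweep(data, k_values, workers=8):
        with Pool(workers) as pool:
            return dict(pool.map(evaluate_k, [(data, k) for k in k_values]))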

Developing a Data Science Outreach Program with Rural Native Americans: Southern California Tribal Youth Participate in DataJam via San Diego Supercomputer Center

Using Containers and Tapis to Structure Portable, Composable and Reproducible Climate Science Workflows

Provenance and reproducibility have been growing needs in scientific computing workflows. This project seeks to split the traditionally monolithic codebase of a climate data computing workflow into small, functional, and semi-independent containers. Each container image is built from public code repositories, allowing a researcher to determine the exact process that was executed for both technical and scientific validation. These containers are composed into workflows using the Tapis API’s Actor-Based Container (Abaco) system, which can be hosted on a variety of computing infrastructures. They may also be run as standalone containers on computers or virtual machines with Docker installed.
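
Under Abaco's actor model, each container receives the message that triggered it at startup and performs one step of the workflow. A minimal actor entrypoint is sketched below using the agavepy helper; the message field and processing step are hypothetical placeholders:

    # Minimal Abaco actor entrypoint: read the triggering message, run one
    # workflow step. "dataset_url" and process_dataset() are placeholders.
    from agavepy.actors import get_context

    def process_dataset(url):
        ...  # one small, independently verifiable step of the climate workflow

    def main():
        context = get_context()            # execution metadata injected by Abaco
        message = context["message_dict"]  # the JSON message that fired the actor
        process_dataset(message["dataset_url"])

    if __name__ == "__main__":
        main()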

Accelerating PET Image Reconstruction with CUDA

Yale MOLAR is an in-house Positron Emission Tomography (PET) image reconstruction application written in C++ and MPI. It deals with hundreds of millions of lines-of-response (LORs) independently to reconstruct an image. The nature of the image reconstruction process makes MOLAR an ideal candidate for GPU acceleration. In this study, we present our work on accelerating MOLAR with CUDA, and show the results that demonstrate the effectiveness and correctness of our CUDA implementation. Overall, Yale MOLAR with CUDA runs up to 6 times faster than the CPU-only code, reducing a typical high resolution image reconstruction time from several hours to less than one hour.
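
MOLAR's kernels are C++/CUDA; to illustrate why independent LORs suit the GPU so well, here is the one-thread-per-LOR pattern expressed with Numba's CUDA support in Python. The per-LOR update rule is a stand-in, not MOLAR's reconstruction physics:

    # One-thread-per-LOR GPU pattern (illustrative Numba CUDA, not MOLAR's
    # actual kernels; the per-LOR arithmetic below is a placeholder).
    import numpy as np
    from numba import cuda

    @cuda.jit
    def backproject(lor_weights, contributions):
        i = cuda.grid(1)              # global thread index == LOR index
        if i < lor_weights.size:      # every LOR is independent of the others
            contributions[i] = 2.0 * lor_weights[i]  # stand-in for real physics

    weights = np.random.rand(1_000_000).astype(np.float32)
    out = cuda.device_array_like(weights)
    threads = 256
    blocks = (weights.size + threads - 1) // threads
    backproject[blocks, threads](cuda.to_device(weights), out)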

Simplifying Scientific Application Access in Kubernetes with Push Button Deployments

The Geddes Composable Platform is a Kubernetes-based private cloud resource at Purdue University. To streamline adoption of the platform and lower the barrier to entry, we created push button deployments for some of the popular applications used by Purdue researchers and made them available via Geddes’ Rancher web-based user interface using Helm and Rancher Charts. With little knowledge of the underlying system, a new user can use a web form to deploy custom applications, including JupyterHub instances, Alphafold, CryoSPARC and the Triton Inference Server.

Extending Tapis Workflow Management Framework with Elastic Google Cloud Distributed System using CloudyCluster by Omnibond

The goal of a robust cyberinfrastructure (CI) ecosystem is to catalyze discovery and innovation. Tapis does this by offering a sustainable, production-quality set of API services that support modern science and engineering research, which increasingly spans geographically distributed data centers, instruments, experimental facilities, and a network of national and regional CI. Leveraging frameworks such as Tapis enables researchers to accomplish computational and data-intensive research in a secure, scalable, and reproducible way and allows them to focus on their research instead of the technology needed to accomplish it.

This project aims to enable the integration of the Google Cloud Platform (GCP) and CloudyCluster resources into Tapis-supported science gateways to provide on-demand scaling needed by computational workflows. The new functionality uses Tapis event-driven Abaco Actors and CloudyCluster to create an elastic distributed cloud computing system on demand. This integration allows researchers and science gateways to augment cloud resources on top of existing local and national computing resources.

HyperShell v2: Distributed Task Execution for HPC

HyperShell is an elegant, cross-platform, high-performance computing utility for processing shell commands over a distributed, asynchronous queue. It is a highly scalable workflow automation tool for many-task scenarios. Several existing tools serve a similar purpose but lack some aspect that HyperShell provides (e.g., distributed execution, detailed logging, automated retries, super scale). Novel aspects of HyperShell include, but are not limited to, (1) cross-platform support, (2) client-server design, (3) staggered launch for large scales, (4) persistent hosting of the server, and optionally (5) a database in-the-loop for restarts and persisting task metadata. HyperShell was originally created to support researchers at Purdue University, out of a specific unmet need, and has been in use for several years. With this next release, we have completely re-implemented HyperShell as both an application and a library to provide new features, scalability, flexibility, robustness, and wider support. (https://github.com/glentner/hyper-shell)

Methodology for Imagery Metadata Collection and Entry into a PostgreSQL Database using Stampede2

Agencies such as the National Oceanic and Atmospheric Administration (NOAA), the Texas Natural Resources Information System (TNRIS), and the National Geographic Society, to name a few, collect Light Detection and Ranging (LIDAR) imagery data through small surveys, which are often used to generate digital elevation models (DEMs). Surface water simulations and hazard planning simulations often use these DEM data sets for localized calculations. The data sets are at times in data silos, creating bottlenecks modelers must overcome when creating needed applications. Moreover, the imagery data sets are not standardized: they have different coordinate reference systems, spatial sizes, resolutions, and geographic locations. To ease usability, a database was needed that included the metadata for the different types of LIDAR imagery files: where the file was located in the file system, what the file included, what coordinate reference system was utilized, and what geographic location was associated with the file. This paper explains how this task was accomplished using the Texas Advanced Computing Center's Stampede2 supercomputer and associated storage systems.
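
A per-file harvest step of this kind can be sketched with rasterio (to read the CRS, extent, and resolution) and psycopg2 (to record them); the table and column names below are invented for illustration, not the project's actual schema:

    # Harvest one raster's metadata and record it in PostgreSQL.
    # Table/column names are illustrative, not the project's schema.
    import rasterio
    import psycopg2

    def ingest(path, conn):
        with rasterio.open(path) as src:
            crs = src.crs.to_string()              # coordinate reference system
            left, bottom, right, top = src.bounds  # geographic extent
            xres, yres = src.res                   # pixel resolution
        with conn.cursor() as cur:
            cur.execute(
                """INSERT INTO lidar_metadata
                   (path, crs, xmin, ymin, xmax, ymax, xres, yres)
                   VALUES (%s, %s, %s, %s, %s, %s, %s, %s)""",
                (path, crs, left, bottom, right, top, xres, yres))
        conn.commit()

    # Usage: ingest("tiles/tile_001.tif", psycopg2.connect("dbname=imagery"))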

Broadening Student Participation in Cyberinfrastructure Research and Development

Automated Support Request Categorization using Machine Learning

The automatic categorization of user support requests/tickets for Pennsylvania State University’s high-performance computing system is carried out using decision tree and artificial neural network models. We explore estimated model prediction performance across different text embedding techniques (TF-IDF and BERT) and prediction models (Gradient Boosted Decision Trees and Multi-Layer Perceptron Neural Networks) for this multiclass problem. The dataset comprises 6213 support tickets categorized using a broad (14 classes) and a specific (17 classes) set of labels. The results indicate that the optimal prediction accuracies for the "broad" and "specific" categories are 94.6% and 83.2%, respectively.
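
One of the four reported configurations (TF-IDF features into gradient-boosted trees) can be reconstructed generically with scikit-learn; the sketch below is illustrative and does not reproduce the authors' exact preprocessing or hyperparameters:

    # Generic TF-IDF + gradient-boosted-trees ticket classifier (illustrative;
    # not the authors' exact preprocessing or hyperparameters).
    from sklearn.pipeline import make_pipeline
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.model_selection import train_test_split

    def train(ticket_texts, labels):  # labels: 14 "broad" or 17 "specific" classes
        X_tr, X_te, y_tr, y_te = train_test_split(ticket_texts, labels, test_size=0.2)
        model = make_pipeline(TfidfVectorizer(stop_words="english"),
                              GradientBoostingClassifier())
        model.fit(X_tr, y_tr)
        print("held-out accuracy:", model.score(X_te, y_te))
        return model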

Halcyon: Unified HPC Center Operations: An extensible web framework for resource allocation, documentation, and more

Due to the increasing complexity of user and resource management under a shared campus cluster model, particularly with many research groups investing differing amounts in annual new hardware acquisitions, the Research Computing division at Purdue set out in 2011 to design a cluster management solution to empower faculty to manage access to their own purchased resources. As operations expanded, the internal portal took on many aspects of the operation of an HPC center beyond resource allocation and management. Eventually, components included HPC and storage resource management, user management and authorization, customer relations, communications, documentation, and ordering/purchasing. Halcyon reconstitutes these in a modular, extensible framework to allow for the growth and maintenance necessary to encompass all aspects of HPC center management. This will allow centers to operate not only more efficiently but more effectively, and to deliver better services to researchers.

Investigating Bias in Resource Allocation for Homelessness Prevention and Intervention

Building the RNAMake Gateway on PATh: a Student-Led Design Project

We summarize student-led work to build a science gateway for RNAMake, which is software for modeling the three-dimensional structure of RNA molecules. The gateway uses Apache Airavata, which has been extended to support HTCondor submissions. The students also extended the Airavata Django Portal to provide customized user interfaces. In the process, the students learned open source software and open governance practices.

Digital Evidence Acquisition and Deepfake Detection with Decentralized Applications

Data Management Workflows in Interdisciplinary Highly Collaborative Research

Data curation is an important aspect of research projects. Effective data management is critical for data curation; it not only contributes to the success of projects but also makes research outputs findable, accessible, interoperable, and reusable. We have examined interdisciplinary highly collaborative research (IHCR) practices in selected projects to propose data management workflows. This synopsis of work in progress discusses one of these workflows, which helps locate information when there are multiple collaborators and the digital assets are spread across multiple storage systems and institutions.

Azure-based Hybrid Cloud Extension to Campus Clusters

We provide an overview of recent successes integrating and using Microsoft Azure public cloud resources for scientific computing at Purdue University, including benchmarking efforts for new processor architectures and a hybrid cloud extension to on-campus computing resources. The architecture of the hybrid cloud extension is described, which allows users to seamlessly burst workloads from Purdue community clusters to the Azure cloud. We also cover two scientific computing use cases demonstrating bursting capabilities using 3rd Gen AMD EPYC "Milan"-based HBv3 Azure instances.

Migrating a Pipeline for the C-MĀIKI gateway from Tapis v2 to Tapis v3

The C-MĀIKI science gateway (https://cmaiki.its.hawaii.edu/) allows researchers to run microbial workflows on a computer cluster with just a click of a button. This is possible because of the Tapis [1] framework developed at the Texas Advanced Computing Center (TACC). Currently, the C-MĀIKI gateway uses the v2 version of Tapis, which has been refactored into a new v3 version that is more robust and has added capabilities such as support for containerized apps, a new Streaming Data API, and a multi-site security kernel. This project aims to keep the C-MĀIKI gateway up-to-date and modern by migrating the gateway from the pre-existing Tapis v2 framework to the new Tapis v3 framework, starting with one of the pipeline applications as a pilot. The migration required three major steps: (1) containerizing a microbiome pipeline, (2) developing a new app definition for the workflow, and (3) enabling the ability to submit jobs to a Slurm scheduler from inside a Singularity container. This initial pilot illustrates that it is possible to run these pipelines as a "container within a container" for parallelization, providing the ability to leverage a single application definition in Tapis that can execute across multiple compute infrastructures.
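
In Tapis v3, the app definition and job submission go through a single authenticated client; a sketch of the submission step with the tapipy SDK follows, where the base URL, credentials, and app identifiers are placeholders rather than the gateway's real values:

    # Submitting the containerized pipeline via Tapis v3 with tapipy.
    # Base URL, credentials, and app IDs below are placeholders.
    from tapipy.tapis import Tapis

    t = Tapis(base_url="https://tacc.tapis.io",
              username="cmaiki_user", password="********")
    t.get_tokens()  # obtain a JWT for subsequent calls

    job = t.jobs.submitJob(
        name="its-pipeline-run",
        appId="cmaiki-its-pipeline",  # app definition created in step (2)
        appVersion="0.1")
    print(job.uuid, job.status)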


The growing prevalence of online hate speech is concerning, given the massive growth of online platforms. Hate speech is defined as language that attacks, humiliates, or incites violence against specific groups. According to research, there is a link between online hate speech and real-world crimes, as well as victims’ deteriorating mental health. To combat the online prevalence of abusive speech, hate speech detection models based on machine learning and natural language processing are being developed to automatically detect the toxicity of online content. However, current models tend to mislabel African American English (AAE) text as hate speech at a significantly higher rate than texts written in Standard American English (SAE). To confirm the existence of systematic racism within these models, I evaluate a logistic regression model and a BERT model. Then, I determine the efficacy of the bias reduction method for the BERT model and the correlation between model performance and reduced bias.
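
The mislabeling claim is typically quantified as a gap in false-positive rates between dialect groups; a sketch of that evaluation is shown below, where the labels, predictions, and group tags are placeholders for a model's test-set output:

    # Dialect bias as a false-positive-rate gap: how often non-hateful AAE
    # vs. SAE texts get mislabeled as hate speech. Inputs are placeholders.
    import numpy as np

    def false_positive_rate(y_true, y_pred):
        y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
        negatives = y_true == 0                 # genuinely non-hateful texts
        return (y_pred[negatives] == 1).mean()

    def fpr_gap(y_true, y_pred, groups):
        y_true, y_pred, groups = map(np.asarray, (y_true, y_pred, groups))
        fprs = {g: false_positive_rate(y_true[groups == g], y_pred[groups == g])
                for g in ("AAE", "SAE")}
        return fprs, fprs["AAE"] - fprs["SAE"]  # positive gap => bias against AAE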

BioContainers on Purdue Clusters

Container technologies such as Docker, Kubernetes, and SingularityCE have been receiving an increasing level of attention in academic institutions. Containers wrap up an application into an isolated file system containing everything it needs to run, such as compilers, libraries, and dependencies. This enables containers to always run the same, regardless of the environment in which they are running, promoting container technology as a critical tool for reproducible research. In the high-performance computing (HPC) context, containers have gained popularity because they can significantly reduce administrators’ work in deploying applications. On Purdue University HPC clusters, several hundred SingularityCE containers have been deployed. Here, we introduce how SingularityCE containers are used to create the bioinformatics tool collection (biocontainers). Due to their ease of deployment and portability, biocontainers have been deployed on Purdue’s six HPC clusters as well as on XSEDE Anvil, providing a reliable and reproducible computing environment for life science researchers.