Pangeo framework for training: experience with FOSS4G, the CLIVAR bootcamp and the eScience course

Anne Fouilloux, Pier Lorenzo Marasco, Tina Odaka, Ruth Mottram, Paul Zieger, Michael Schulz, Alejandro Coca-Castro, Jean Iaquinta, Guillaume Eynard Bontemps

crossref (2023)


Abstract

<p>The ever-increasing number of scientific datasets made available by authoritative data providers (NASA, Copernicus, etc.) and by the scientific community opens new possibilities for advancing the state of the art in many areas of the natural sciences. As a result, researchers, innovators, companies and citizens need to acquire computational and data analysis skills to exploit these datasets optimally. Several educational programs offer basic courses to students, and initiatives such as “The Carpentries” (https://carpentries.org/) complement this offering and also reach out to established researchers to fill the skill gap, thereby empowering them to perform their own data analysis. However, most researchers find it challenging to go beyond these training sessions and face difficulties when trying to apply their newly acquired knowledge to their own research projects. In this regard, hackathons have proven to be an efficient way to support researchers in becoming competent practitioners, but organising good hackathons is difficult and time-consuming. In addition, the need for large amounts of computational and storage resources during training and hackathons requires a flexible solution. Here, we propose an approach where researchers work on realistic, large and complex data analysis problems similar to, or directly part of, their research work. Researchers access an infrastructure deployed on the European Open Science Cloud (EOSC) that supports intensive data analysis (large compute and storage resources). EOSC is a European Commission initiative for providing a federated and open multi-disciplinary environment where data, tools and services can be shared, published, found and re-used. We used Jupyter Book to deliver a collection of FAIR training materials for data analysis, relying on Pangeo EOSC deployments as the primary computing platform.
The training material (https://pangeo-data.github.io/foss4g-2022/intro.html, https://pangeo-data.github.io/clivar-2022/intro.html, https://pangeo-data.github.io/escience-2022/intro.html) is customised for different target communities (different datasets with similar analyses), and participants are taught how to use Xarray and Dask and, more generally, how to efficiently access and analyse large online datasets. The training can be complemented by group work in which attendees work on larger-scale scientific datasets: the classroom is split into several groups, each working on a different scientific question and possibly a different dataset. The Pangeo (http://pangeo.io) ecosystem is not new to every attendee, but applying Xarray (http://xarray.pydata.org) and Dask (https://www.dask.org/) to actual scientific “mini-projects” is often a showstopper for many researchers. With this approach, attendees have the opportunity to ask questions, collaborate with other researchers as well as Research Software Engineers, and apply Open Science practices without the burden of trying and failing alone. We find that involving scientific computing research engineers directly in the training is crucial to the success of the hackathon approach. Feedback from attendees shows that it provides a solid foundation for big data geoscience and helps attendees quickly become competent practitioners. It also gives infrastructure providers and EOSC useful feedback on the current and future needs of researchers for making their research FAIR and open. In this presentation, we will provide examples of achievements from attendees and present the feedback EOSC providers have received.</p>
