CIF21 DIBBs: EI: Virtual Data Collaboratory
A Regional Cyberinfrastructure for Collaborative Data Intensive Science
The Virtual Data Collaboratory (VDC) is a federated data cyberinfrastructure designed to drive data-intensive, interdisciplinary, and collaborative research and to enable data-driven science and engineering discoveries. VDC accomplishes this by providing seamless access to data and tools to researchers, educators, and entrepreneurs across a broad range of disciplines and scientific domains as well as institutional and geographic boundaries. In addition to enabling researchers to advance research frontiers across multiple disciplines, VDC also focuses on (1) training the next generation of scientists with deep disciplinary expertise and a high degree of competence in leveraging data, cyberinfrastructure, and tools to address research problems and (2) helping data scientists and engineers develop and apply advanced federated data management and analysis tools for high-impact scientific applications. To meet this mission, VDC extends beyond its collaborating institutions and leverages NSF investments to provide cyberinfrastructure typically not available to community colleges, state-associated colleges and universities, and regional liberal arts colleges and universities, and to stimulate intense user engagement and adoption by scientists across domains and institutions.
VDC combines state-of-the-art data-intensive computing, storage, and networking solutions, integrated with an innovative data services layer. VDC is federated and coordinated across three geographically distributed Rutgers University campuses in New Jersey and multiple campuses in Pennsylvania and New York, connected by a high-speed network, with the potential to incorporate academic/research institutions across the Mid-Atlantic and the nation. VDC builds on and integrates existing national/international and regional data repositories, including NSF-funded repositories, and leverages local/regional/national ACI investments. Central to the VDC vision are three infrastructural innovations: a regional Science DMZ network that provides services to enable efficient and transparent access to data and computing capabilities; an expandable and scalable architecture for data-centric infrastructure federation; and a data services layer to support research workflows that utilize cutting-edge semantic web technologies, support interdisciplinary research, expand access, and increase the impact of data science worldwide.
- Provide seamless access to data and tools to researchers, educators, and entrepreneurs across a broad range of disciplines and scientific domains as well as institutional and geographic boundaries.
- Enable data scientists and engineers to develop and apply advanced federated data management and analysis tools for high-impact scientific applications.
- Train the next generation of scientists with deep disciplinary expertise and a high degree of competence in leveraging data, cyberinfrastructure, and tools to address research problems.
Proposed VDC Architecture
- Deciphering Sequence and Structural Correlates of Protein Nucleic Acid Interactions (H. Berman & V. Honavar)
- High-Volume City Data Sharing and Processing for Smart, Resilient, and Sustainable Cities (J. Gong, RU; Z. Zhu, CUNY; X. Liang, University of Pittsburgh; M. Balduccini, Drexel University)
- Ocean Observatories Initiative (I. Rodero, M. Parashar)
Education and Outreach
Resources for Educators
The main goal of the Virtual Data Collaboratory is to leverage our partnership with the NJBDA in order to impact analytics and data science courses across the state, fostering learning communities through easy-to-use online modules and classes centered on research-based data science and analytics. This will include analytics and data science programs at universities such as Rutgers, Penn State, Drexel, and CUNY, and across academic levels from high school workshops to post-graduate seminars.
Part of this mission includes providing resources for educators and students, bringing Big Data skills into the classroom. The resources provided on this site via the VDC are ready and easy to use, both within the classroom and beyond.
High School Workshops
The “Dive Into Big Data” high school-level workshop introduces students to the fundamentals of data science: they complete an interactive experiment with live oceanographic data (courtesy of the Ocean Observatories Initiative, or OOI) and visualize the results of simple data transformations. The materials for this workshop have been made available so that educators can host their own Dive Into Big Data workshop.
These materials include a description of the program, a workshop agenda for those planning on touring the Advanced Cyberinfrastructure facilities, a presentation giving an overview of the objectives of the program, and hands-on step-by-step instructions for how to complete the workshop using live data from the OOI.
After the activity is completed, students take a short quiz and survey to assess how well the workshop met its objectives. Questions regarding the RDI² facilities tour may be omitted or modified if the workshop was hosted at an external location.
To set up a tour of the RDI² facilities for your class or club, contact Forough Ghahramani, Associate Director of Administration and Partnerships at RDI².
Seminars and Roundtables
RDI²'s VDC project team has hosted several events for undergraduates, graduate students, and beyond, including distinguished seminar series and roundtables.
Going forward, these events will be posted on our YouTube channel for instructors to utilize; one such event is the Data Science Career Panel, featuring industry speakers from within the field of Data Science. The speakers answer questions from prospective computer and data science students regarding what they can expect once entering the field professionally. This roundtable event is a useful resource for undergraduate and high school students interested in pursuing a career in data science.
As part of the NSF-funded Virtual Data Collaboratory project, RDI² is developing educational modules to help researchers solve their data issues and increase the impact of their research. One such module was the Introduction to Data Management seminar, held in May 2018, which invited career researchers to join RDI² for a discussion of best practices for managing research data. The data created as part of research is important and should be well-organized, well-preserved, accessible, understandable, and usable by the scholarly community.
The discussion included developments in data sharing, data collaboration, reproducible research, and more. Insights and feedback shared during the seminars are further incorporated into the educational modules, the materials for which have been made available as resources for educators.
For news on upcoming events, see the News & Events page of the RDI² main site.
List of Personnel
- Manish Parashar (Co-PI)
- Grace Agnew (Data Services Lead)
- Helen Berman
- Forough Ghahramani (Education Co-Lead)
- Charles Hedrick
- Thu Nguyen (Education Co-Lead)
- Ivan Rodero (Systems Lead)
- Vasant Honavar (Co-PI and Use Cases Lead)
- Chuck Gilbert
- Wayne Figurelle
- Karen Estlund
- Edward Chapel (Policy and Governance Lead)
- Wendy Huntoon (Network Lead)