WP4: JRA – Standards for data archiving & exploitation

  • Leader: Jan Dohnalek (MoB-IBT, Vestec, CZ)
  • Co-leader: Mark Williams (ISMB-BIOPHYSX, London, UK)

WP4 aims to support and consolidate the MOSBRI network by developing data formats, data processing procedures and database(s) for biophysical data archiving and retrieval.

Unlike other fields of biomolecular research, molecular-scale biophysics lacks databases or archives that support the deposition, mining and retrieval of biophysical data and interfacing with, e.g., structural data (PDB) or bioinformatics resources (EBI, SwissProt and others). WP4 aims to develop a pilot solution that could later be developed further into a universal access point for biophysical data.

WP4 has the following main objectives:

  • Provision of new data standards for biophysical data exchange and archiving, to ensure interoperability of information obtained by different biophysical methods and with other bioinformatics research infrastructures producing and processing biological/biomolecular data.

The field of molecular-scale biophysics suffers from a lack of unified data formats across the many different experimental techniques that may be applied to study a biological system. A focussed effort is needed to implement a unified format encompassing the principal existing biophysical methods, to enable efficient data exchange and reuse. Standardization would also facilitate deposition of experimental information in existing and emerging databases of biological and molecular data.

Survey of needs concerning the data formats for individual biophysical techniques
Given the diversity observed in the field of molecular-scale biophysics, a survey will be performed to identify and classify all the data types generated by current instruments, as well as data exchange formats and data storage requirements. It will cover the techniques provided within the framework of MOSBRI’s TNA scheme and other related techniques as deemed appropriate. The landscape of current data types ranges from simple one-dimensional spectroscopic data, through multidimensional stability, interaction or particle distribution data, to more complex scattering, microscopy, single-molecule or chemical library screening data sets. While data standards already exist for a few techniques, other output types will be considered for the first time.

Definition of data standards/formats with focus on interoperability
The activities under WP4 will contribute to the formation of a research data environment as set out in the EOSC strategic documents (European Open Science Cloud Strategic Implementation Plan, https://ec.europa.eu/info/publications/european-open-science-cloud-eosc-strategic-implementation-plan_en). In particular, the main principles of implementation regarding Cooperation, Decision Process, Common interest, Availability and Voluntary adoption will be adhered to. Development of data standards and formats will follow the FAIR (Findable, Accessible, Interoperable, Reusable) data principles.

  • Support for the development of improved data processing protocols for selected methods and for the implementation of standards for data storage to ensure the efficient safeguarding of biophysical data.

Activities under this task will focus on developing software tools that enable users to gather and record data in standard data formats and perform quality checks.

      • Data quality – ensuring sufficiency and technical correctness of original and processed instrument data, and of the metadata describing the experimental system
      • Data transparency – availability of a record of the data processing workflow, software and methods
      • Archiving-ready data – readiness for archiving with maximum information content and long-term sustainability
      • Technical requirements – implementation of critical technical details covering data security, long-term storage capacity, readable vs. proprietary data formats for processing, etc.

  • Provision of a pilot database for biophysical data to make the use and re-use of such data more efficient.

Areas of biomolecular research focused on structural data, bioinformatics or organic molecules have benefited greatly from the availability of curated, fully searchable, and electronically accessible databases. Over the past 20 years, the number of biophysical techniques, their routine use and their power to characterize samples have grown significantly. The amount of experimental biophysical data acquired to date in all biomolecular research fields worldwide represents a wealth of information that could be used very efficiently for optimizing methods and sample preparation, developing improved processing protocols, troubleshooting difficult problems in molecular or structural biology, and developing new drugs or biotechnologies. A properly designed database of experimental biophysical results and interpretations would be of high value. It would encourage or require deposition of control data, which are often not included in publications, and would enable reanalysis of published data and/or their reuse by combining data collected in different laboratories at different times. It could also serve to educate experts and to disseminate knowledge among non-experts in biochemistry, medical doctors, the private sector, and developers of computational tools.

WP4 will assess the actual need for a publicly accessible, searchable database of experimental biophysical data for biomolecular research, and will accordingly design and implement a pilot database to verify the principles, set standards, offer access to data, and prepare for its future expansion.
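
As a purely illustrative sketch of how the ideas above could fit together (a standardized record, a quality check before deposition, and a searchable store), the following Python example uses only the standard library. Everything in it is an assumption made for illustration: the field names, the `Record` class, the quality rules and the table layout are hypothetical and do not represent an actual MOSBRI data format or database design.

```python
import json
import sqlite3
from dataclasses import asdict, dataclass, field

# Hypothetical required metadata fields; not an actual MOSBRI standard.
REQUIRED_FIELDS = ("technique", "sample", "instrument", "processing_software")


@dataclass
class Record:
    """Illustrative record: measured values plus descriptive metadata."""
    technique: str
    sample: str
    instrument: str
    processing_software: str
    values: list = field(default_factory=list)


def quality_problems(record):
    """Return a list of problems; an empty list means the record passes."""
    problems = [f"missing {name}" for name in REQUIRED_FIELDS
                if not getattr(record, name).strip()]
    if not record.values:
        problems.append("no measured values")
    return problems


def open_store():
    """Create an in-memory SQLite store with a simple, assumed layout."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE records (technique TEXT, sample TEXT, payload TEXT)")
    return conn


def deposit(conn, record):
    """Store the record as JSON only if it passes the quality check."""
    if quality_problems(record):
        return False
    conn.execute(
        "INSERT INTO records VALUES (?, ?, ?)",
        (record.technique, record.sample, json.dumps(asdict(record))))
    return True


def search(conn, technique):
    """Return all deposited records for a given technique as dictionaries."""
    rows = conn.execute(
        "SELECT payload FROM records WHERE technique = ?", (technique,))
    return [json.loads(payload) for (payload,) in rows]
```

In this sketch the payload is stored in an open, text-based format (JSON) rather than a proprietary one, and incomplete records are rejected at deposition time; a real implementation would need technique-specific schemas and far richer metadata.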