In June 2020 I finished an online training course on the Neo4j database management system and its query language, Cypher. Neo4j is a NoSQL graph database that has emerged over the last couple of years as a major player in analyzing highly connected data. Graph databases enable faster queries over highly connected data than classical SQL-based database management systems.
It is safe to say that bioinformatics is a big-data field, and hence data-oriented tasks like searching and archiving massive datasets are the beating heart of the field. This implies that a reliable and efficient database management system (DBMS) is crucial. Traditionally, DBMSs based on the structured query language (SQL) have been the default systems for storing and archiving data. However, over the last decade a different type of DBMS has emerged, the so-called NoSQL DBMS.
In the SQL world, data are stored in separate tables, and the relations between these tables are stored either as additional columns in a table or in a separate table, depending on the relationship type. This works perfectly well if the major aim is to store the data. However, if the most important thing is to express the relationships between the elements of the data, then the performance of an SQL system deteriorates as the complexity of the query increases, because several tables have to be searched and joined. This is where graph databases truly shine, as they are designed to store the data elements and their relationships directly as a graph. Hence, if the aim is to analyze highly connected data, a graph database is a valuable tool to consider.
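To make the contrast above a bit more concrete, here is a minimal Python sketch. It is not how Neo4j or any SQL engine is implemented; it only illustrates the idea that a relationship table has to be searched for every hop, while a graph keeps each element's neighbours next to it. The protein names and the helper functions (neighbors_by_scan, build_adjacency, within_hops) are made up for this illustration.

```python
from collections import defaultdict

# Relational-style storage: one row per relationship, as in a separate
# "interaction" table. The protein identifiers are hypothetical.
interaction_rows = [
    ("P1", "P2"),
    ("P2", "P3"),
    ("P3", "P4"),
]


def neighbors_by_scan(rows, protein):
    """Find interaction partners by scanning every row of the table."""
    found = set()
    for a, b in rows:
        if a == protein:
            found.add(b)
        elif b == protein:
            found.add(a)
    return found


def build_adjacency(rows):
    """Graph-style storage: every node keeps its own set of neighbours."""
    adjacency = defaultdict(set)
    for a, b in rows:
        adjacency[a].add(b)
        adjacency[b].add(a)
    return adjacency


def within_hops(adjacency, start, max_hops):
    """Collect every node reachable from start in at most max_hops steps."""
    seen = {start}
    frontier = {start}
    for _ in range(max_hops):
        # Each hop is a direct lookup per node, not a scan of all rows.
        frontier = {n for node in frontier for n in adjacency.get(node, ())} - seen
        seen |= frontier
    return seen - {start}


adjacency = build_adjacency(interaction_rows)
print(neighbors_by_scan(interaction_rows, "P2"))
print(within_hops(adjacency, "P1", 2))
```

With the table layout, every extra hop means another pass over the rows (or another join), whereas the adjacency layout only touches the neighbours of the nodes already reached.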
Biological systems are inherently complex and connected. For example, in a protein-protein interaction network thousands of proteins interact with each other, and a single protein might interact with hundreds of others. Further examples of such highly connected data include metabolic networks, gene-regulatory networks and microbial networks. Thus, representing these data in a graph database, which excels at handling connected data, is crucial for an efficient data analysis pipeline. A second advantage of using a graph database is the ability to connect high-dimensional omics data, for example genomics, transcriptomics, proteomics, and metabolomics data. Here, entities like gene names, transcript names, protein names, and metabolite names can be represented as nodes in the graph, and the interactions within and between these different omic layers can be represented as edges. Once constructed, this database can be mined to extract biologically interesting patterns and correlations.
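The multi-omics idea above can be sketched in a few lines of Python. This is only a toy model under my own assumptions, not a Neo4j schema: the node identifiers, the relationship names (TRANSCRIBED_TO, TRANSLATED_TO, INTERACTS_WITH, PRODUCES) and the helper functions outgoing and paths_to_layer are all hypothetical. Nodes carry their omic layer as a label, edges carry a relationship type, and "mining" the graph here simply means following edges from a gene until a metabolite is reached.

```python
# Nodes: identifier -> omic layer (all identifiers are made up).
nodes = {
    "geneA": "Gene",
    "transcriptA": "Transcript",
    "proteinA": "Protein",
    "proteinB": "Protein",
    "metaboliteX": "Metabolite",
}

# Edges: (source, relationship type, target), within and between layers.
edges = [
    ("geneA", "TRANSCRIBED_TO", "transcriptA"),
    ("transcriptA", "TRANSLATED_TO", "proteinA"),
    ("proteinA", "INTERACTS_WITH", "proteinB"),
    ("proteinB", "PRODUCES", "metaboliteX"),
]


def outgoing(node):
    """Return (relationship, target) pairs for edges leaving node."""
    return [(rel, target) for source, rel, target in edges if source == node]


def paths_to_layer(start, target_layer, max_depth=5):
    """Depth-first search for paths from start to any node in target_layer.

    A path is returned as an alternating list of nodes and relationship
    types; max_depth limits the number of edges that may be followed.
    """
    results = []
    stack = [(start, [start], 0)]
    while stack:
        node, path, depth = stack.pop()
        if node != start and nodes[node] == target_layer:
            results.append(path)
            continue
        if depth >= max_depth:
            continue
        for rel, target in outgoing(node):
            if target not in path:  # avoid walking in circles
                stack.append((target, path + [rel, target], depth + 1))
    return results


for path in paths_to_layer("geneA", "Metabolite"):
    print(" -> ".join(path))
```

In a real graph database this kind of question would be written as a declarative pattern query rather than a hand-written search, but the data model (labelled nodes, typed edges) is the same.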
The use and adoption of graph databases within the bioinformatics community is still in its infancy. However, as both the field and the technology develop and evolve, it is highly likely that graph databases will become one of the driving engines of data analysis within the community. This training will be truly helpful for all the pipelines we are developing at the moment.
Hesham El Abd