
Clarifying the semantics of data matrices and results tables: a Frictionless Data Pilot

- July 21, 2020 in Frictionless Data, Genomics, pilot

As part of the Frictionless Data for Reproducible Research project, funded by the Sloan Foundation, we have started a Pilot collaboration with the Data Readiness Group at the Department of Engineering Science of the University of Oxford; the group will be represented by Dr. Philippe Rocca-Serra, an Associate Member of Faculty. This Pilot will focus on removing the friction in reported scientific experimental results by applying the Data Package specifications. Written with Dr. Philippe Rocca-Serra.

Scientific experimental results are frequently published in ad-hoc ways that are seldom consistent. For example, results are often deposited as idiosyncratic sets of Excel files or tabular files that contain very little structure or description, making them difficult to use, understand and integrate. Interpreting such tables requires human expertise, which is both costly and slow, and leads to low reuse. Ambiguous tables of results can lead researchers to rerun analysis or computation over the raw data before they understand the published tables. This current approach is broken, does not fit users’ data mining workflows, and limits meta-analysis. A better procedure for organizing and structuring information would reduce unnecessary use of computational resources, which is where the Frictionless Data project comes into play. This Pilot collaboration aims to help researchers publish their results in a more structured, reusable way.

In this Pilot, we will use (and possibly extend) Frictionless tabular data packages to devise both generic and specialized templates. These templates can be used to unambiguously report experimental results. Our short-term goal is to develop a set of Frictionless Data Packages for targeted use cases where impact is high. We will focus first on creating templates for statistical comparison results in genomics research, such as differential analysis, enrichment analysis, high-throughput screens, and univariate comparisons, by using the STATO ontology within tabular data packages.

Our longer-term goal is for these templates to be incorporated into publishing systems to allow for clearer reporting of results, more knowledge extraction, and more reproducible science. For instance, we anticipate that this work will allow for increased consistency of table structure in publications, as well as increased data reuse owing to predictable syntax and layout. We also hope this work will ease the creation of linked data graphs from tables of results thanks to clarified semantics. An additional goal is to create code that is compatible with R’s ggplot2 library, which would allow for easy generation of data analysis plots. To this end, we plan on working with R developers in the future to create a package that will generate Frictionless Data compliant data packages.

This work has recently begun, and will continue throughout the year. We have already met with some challenges, such as working out ways to transform, or normalize, data and ways to incorporate RDF linked data (you can read our related conversations in GitHub). We are also working on how to define a ‘generic’ table layout definition that is broad enough to be reused in as wide a range of situations as possible. If you are interested in staying up to date on this work, we encourage you to check out the project’s GitHub repositories. Additionally, we will (virtually) be at the eLife Sprint in September to work on closely related work. Throughout this Pilot, we are planning on reaching out to the community to test these ideas and get feedback. Please contact us on GitHub or in Discord if you are interested in contributing.
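
To make the idea of a results-table template more concrete, here is a minimal sketch of what a Tabular Data Package descriptor for a differential analysis table might look like, built as a plain Python dictionary. It is an illustration only, not a template produced by the Pilot: the package name, file path, column names and term IRIs are all hypothetical placeholders, and a real template would point `rdfType` at actual STATO terms.

```python
import json

# Hypothetical placeholder IRIs (assumption): a real template would use
# the IRIs of the corresponding STATO ontology terms instead.
TERMS = {
    "log2_fold_change": "http://example.org/stato/log2-fold-change",
    "p_value": "http://example.org/stato/p-value",
}


def build_descriptor():
    """Return a descriptor dict for a (hypothetical) differential analysis table."""
    fields = [
        # One row per gene; the identifier is kept as a plain string.
        {"name": "gene_id", "type": "string", "description": "Gene identifier"},
        # rdfType attaches a semantic term to the column.
        {
            "name": "log2_fold_change",
            "type": "number",
            "rdfType": TERMS["log2_fold_change"],
        },
        # A p-value is a probability, so it is constrained to [0, 1].
        {
            "name": "p_value",
            "type": "number",
            "rdfType": TERMS["p_value"],
            "constraints": {"minimum": 0, "maximum": 1},
        },
    ]
    return {
        "name": "differential-analysis-results",  # hypothetical package name
        "profile": "tabular-data-package",
        "resources": [
            {
                "name": "results",
                "path": "results.csv",  # hypothetical file path
                "profile": "tabular-data-resource",
                "schema": {"fields": fields},
            }
        ],
    }


descriptor = build_descriptor()
# Serialised like this, the descriptor would be saved as datapackage.json
# alongside the CSV file it describes.
print(json.dumps(descriptor, indent=2))
```

Because each column carries both a type and a semantic term, a reader (or a program) would not need to guess what a column labelled, say, "p" or "FC" means.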

Opening Genomic data – it’s debatable

- May 21, 2014 in Debate, Genome, Genomics, News, Open Health

DNA Origami by Alex Bateman CC BY 2.0

On a night threatening rain, about 25 people came together to discuss Open Genomics in a lab at the Edge, overlooking the Brisbane River. The debate was led by two experts: Mark Crowe, a Bioinformatics Scientist, and Naveen Sharma, an Information Risk and Compliance Manager. Mark spoke in favour of opening genomic data, and his paper can be read in full at the QFAB website. Naveen set out the risks of data misuse, and his paper is also available to read.

Tools for genetic testing are becoming more advanced and less expensive, and therefore increasingly prevalent, so more profiles are being created. Mark Crowe clearly set out the benefits of conducting genomic analysis on massive datasets. Naveen Sharma described several examples of risks with open data and, in conclusion, quoted Professor Ohm: “data can either be useful or perfectly anonymous but never both.”

An active discussion ensued and was focussed not on the future, but on what’s available to the public now, such as kits that provide genetic reports and uninterpreted raw genetic data. It was noted that 23andMe now have the world’s largest autosomal DNA database and they can no longer offer health-related genetic reports after legal action. The National Geographic Genographic project also allows anyone to identify their genome ancestry, and holds over 660,000 profiles – which can optionally be shared with other study participants.

We envisioned a hypothetical future Gattaca-like dystopia of genotype profiling, eugenics (genetic manipulation) and discrimination where DNA determines social class.  Will we see diseases eliminated via selective breeding, and as a further step will we see differences treated as abnormalities?

Will genetic data availability increase pressure on our medical services for non-threatening treatments and personalised medicine? And is this beneficial because it is less costly to prevent medical issues than treat them? What will the impact of increased demand be on Genomics experts and their analytical tools?

A theoretical issue was raised: insurers could use genomic tests to decline cover for illnesses to which a customer’s results indicate a predisposition. Mark Crowe replied that there is also a positive side, because some insurers, once a predisposition is identified, will pay for preventative measures, since these cost less than treatment.

Through discussion it became apparent that the question is really a highly personal and sensitive one: ‘am I, as an individual, prepared to make my private genomic data public and identifiable to me?’, as per the Personal Genome Project. A voluntary ‘show of hands’ vote was called for, and the count was only slightly in favour of yes.

However, we all agreed that the public, shareable release of genomic data is a complex and growing issue that the Australian government should address in more detail. Importantly, it must begin the process now to ensure a framework is in place before, rather than during or after, genomic data use becomes more prevalent in Australia. Although genomics analysis is still in its relative infancy, beginning the process of a regulatory response now is vital given that the US took ten years. A second step is forming agreed global industry standards for the use of genomic data, including open data. It appears this is in train, with the Global Alliance for Genomics and Health meeting in March to begin coordinating the development of standards, addressing openness, interoperability, regulatory barriers and discovery.

Discussion ran over time until we had to leave the venue as it was closing, so please feel free to add your comments and thoughts here to continue the conversation. Anna, James and I look forward to the next debate!
