Gaurav Vaidya


§Oct 2019 to Dec 2020 Center for Cancer Data Harmonization (CCDH)

I work on the CCDH Tools and Data Quality team at RENCI. We develop software tools to validate biomedical data against the CRDC-H data model for storing cancer-related biomedical data, to publish and transform the data model into different formats as needed, and to fulfil any other software development tasks needed to complete this project.

  • Software: csv2caDSR§ (Jul 2020 to Dec 2020): A tool for harmonizing biomedical data using the Cancer Data Standards Registry and Repository (caDSR) as a source of validation information.
    • Technologies used: Scala, caDSR, PFB, CEDAR Workbench.
    • Provides the following features:
      • Harmonize input data against the caDSR.
      • Store harmonization information (such as mappings from values in data files to concepts in vocabularies) in a JSON format that can be converted into other mapping formats if needed.
      • Export harmonized data in a variety of formats, such as the Portable Format for Biomedical Data (PFB) and the CEDAR Instance format.
    • Source code available at https://github.com/cancerDHC/csv2caDSR.
  • Software: UMLS-RRF-Scala§ (Mar 2020 to Nov 2020): A set of tools for identifying mappings from terms in one vocabulary to terms in others, intended to provide mappings from closed-source vocabularies to open-access vocabularies.
    • Technologies used: Scala, UMLS, SSSOM.
    • Provides the following features:
      • Extract individual mappings from the Unified Medical Language System (UMLS) Metathesaurus as SSSOM files.
      • Use web services (such as BioPortal and the EMBL-EBI Ontology Lookup Service) to find additional mappings between vocabularies.
    • Source code available at https://github.com/cancerDHC/umls-rrf-scala.
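At its core, exporting mappings as SSSOM means writing a TSV file with a few standard columns. A minimal sketch of that step, using hypothetical identifiers and Python's csv module (this is an illustration of the SSSOM format, not the tool's actual code):

```python
import csv
import io

# Hypothetical mappings: (subject, predicate, object). Real UMLS-derived
# mappings would be read from the Metathesaurus RRF files.
mappings = [
    ("NCIT:C2991", "skos:exactMatch", "MONDO:0000001"),
    ("NCIT:C3262", "skos:closeMatch", "MONDO:0005070"),
]

def write_sssom(mappings, out):
    """Write mappings as a minimal SSSOM TSV (core columns only)."""
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(["subject_id", "predicate_id", "object_id",
                     "mapping_justification"])
    for subj, pred, obj in mappings:
        # semapv:ManualMappingCuration is one of the standard SSSOM
        # justification values; the right one depends on the mapping source.
        writer.writerow([subj, pred, obj, "semapv:ManualMappingCuration"])

buf = io.StringIO()
write_sssom(mappings, buf)
print(buf.getvalue())
```

A file like this can then be enriched with the optional SSSOM metadata columns (confidence, mapping tool, and so on) as needed.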

§Oct 2019 to Dec 2020 Small projects at RENCI

While working as a Semantic Web technologist at RENCI, I developed a number of ad-hoc software tools to meet various needs from other RENCI teams.

  • Software: Omnicorp and OmniCORD§ (Oct 2019 to Nov 2020): I built upon the existing Omnicorp tool to improve the RDF data it produces, for example by adding authorship information. I also wrote OmniCORD, a variant of Omnicorp for extracting entities from the COVID-19 Open Research Dataset (CORD-19).
    • Technologies used: Scala, RDF, SHACL.
    • Provides the following features:
      • Extract terminologies from the MEDLINE/PubMed Baseline Repository and export them as RDF.
      • Extract terminologies from the COVID-19 Open Research Dataset (CORD-19) and export them as RDF.
      • Validate produced RDF information against SHACL shapes using SHACLI.
    • Source code available at https://github.com/NCATS-Gamma/omnicorp.
  • Software: SHACLI§ (Oct 2019 to May 2020): A command-line interface (CLI) for the Shapes Constraint Language (SHACL) that validates RDF data against SHACL shapes with improved error messages. This was part of a project to simplify data model creation, so that a single document can be used both to produce documentation and to validate input data.
  • §Citation: Daniel Korn, Tesia Bobrowski, Michael Li, Yaphet Kebede, Patrick Wang, Phillips Owen, Gaurav Vaidya, Eugene Muratov, Rada Chirkova, Chris Bizon, Alexander Tropsha (November 11, 2020). COVID-KOP: integrating emerging COVID-19 data with the ROBOKOP database. Bioinformatics.
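The core Omnicorp step above (reading bibliographic records and exporting them as RDF) can be sketched as follows. The record structure is heavily trimmed and the predicates are illustrative Dublin Core terms, not necessarily the ones Omnicorp emits:

```python
import xml.etree.ElementTree as ET

# A trimmed, hypothetical MEDLINE-style record; real records carry many
# more fields (abstract, MeSH headings, journal metadata, ...).
RECORD = """
<PubmedArticle>
  <PMID>12345678</PMID>
  <ArticleTitle>An example title</ArticleTitle>
  <Author><LastName>Doe</LastName><ForeName>Jane</ForeName></Author>
</PubmedArticle>
"""

def record_to_ntriples(xml_text):
    """Emit simple N-Triples for one record."""
    root = ET.fromstring(xml_text)
    pmid = root.findtext("PMID")
    subject = f"<https://pubmed.ncbi.nlm.nih.gov/{pmid}>"
    triples = []
    title = root.findtext("ArticleTitle")
    triples.append(f'{subject} <http://purl.org/dc/terms/title> "{title}" .')
    for author in root.findall("Author"):
        name = f'{author.findtext("ForeName")} {author.findtext("LastName")}'
        triples.append(f'{subject} <http://purl.org/dc/terms/creator> "{name}" .')
    return triples

for line in record_to_ntriples(RECORD):
    print(line)
```

The resulting N-Triples can then be checked against SHACL shapes, which is the role SHACLI played in this pipeline.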

§Aug 2011 to Dec 2020 PhD dissertation

In my PhD dissertation, I developed methods for quantifying the rate of change in taxonomic checklists, and the effect such changes can have on the interpretation of biodiversity data.

§Jan 2018 to Oct 2019 The Phyloreferencing Project

The Phyloreferencing project aims to digitize definitions of clades (groups of related taxa that are a key concept in evolutionary biology) as semantically rich OWL ontologies, which are both human-readable and computer-interpretable.

  • Software: Phyx.js§ (from Jan 2019): A software library for reading and validating Phyx files, which store clade definitions, and for converting them into OWL ontologies for publication and reasoning.
  • Software: Phyloref Ontology§ (Jun 2013 to Nov 2020): An ontology containing terms used to concisely and precisely define phyloreferences.
  • Software: Clade Ontology§ (Jun 2017 to May 2020): An ontology containing phyloreferences translated from published clade definitions.
    • Technologies used: JavaScript, Web Ontology Language, ontologies.
    • Provides the following features:
      • A Node.js-based workflow for converting folders of Phyx files into a single large OWL ontology.
      • Includes tools for testing resolution of all included Phyx files.
  • Software: JPhyloRef§ (Sep 2017 to May 2020): A Java-based command line tool for reasoning over and testing ontologies that include phyloreferences.
    • Technologies used: Java, Web Ontology Language, JSON-LD, Software testing, web APIs.
    • Provides the following features:
      • Reason over ontologies containing phyloreferences and phylogenies to report on which nodes each phyloreference resolves to.
      • Test whether ontologies containing phyloreferences and reference phylogenies resolve as expected.
      • Run as a web server to serve a web API as a backend for Klados and other services.
    • Source code available at https://github.com/phyloref/jphyloref.
  • Software: Open Tree Resolver§ (Feb 2019 to Sep 2020): A single-page application for testing phyloreference resolution on the Open Tree of Life.
    • Technologies used: JavaScript, Vue.js, web APIs.
    • Provides the following features:
      • Upload OWL ontologies representing phyloreferences as JSON-LD files.
      • Reason over the phyloreferences to resolve them on the relevant section of the Synthetic Tree as downloaded via Open Tree of Life APIs.
  • Software: Klados§ (Oct 2017 to Aug 2020): A single-page application for curating, testing and resolving phyloreferences.
    • Technologies used: JavaScript, Vue.js, JSON, web APIs.
    • Provides the following features:
      • Provides a graphical user interface for reading and writing Phyx files describing phyloreferences.
      • Allows users to add and visualize phylogenies, and test resolution of phyloreferences on those phylogenies.
      • Export Phyx files as OWL ontologies.
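Resolving the simplest kind of phyloreference (a minimum clade) amounts to finding the most recent common ancestor of its specifiers on a phylogeny. A minimal sketch of that idea on a toy tree (hypothetical taxa; the actual tools reason over OWL ontologies rather than a parent map):

```python
# A tiny rooted tree as a child -> parent map (hypothetical taxa).
PARENT = {
    "Alligator": "Crocodylia",
    "Crocodylus": "Crocodylia",
    "Crocodylia": "Archosauria",
    "Passer": "Aves",
    "Aves": "Archosauria",
    "Archosauria": None,  # root
}

def ancestors(node):
    """Path from a node up to the root, inclusive."""
    path = []
    while node is not None:
        path.append(node)
        node = PARENT[node]
    return path

def resolve_mrca(specifier_a, specifier_b):
    """Resolve a minimum-clade phyloreference: the MRCA of two specifiers."""
    lineage_a = ancestors(specifier_a)
    lineage_b = set(ancestors(specifier_b))
    for node in lineage_a:  # first shared ancestor walking rootward
        if node in lineage_b:
            return node

print(resolve_mrca("Alligator", "Crocodylus"))  # Crocodylia
print(resolve_mrca("Alligator", "Passer"))      # Archosauria
```

JPhyloRef performs the analogous computation by OWL reasoning, which also handles maximum-clade definitions and specifiers that are absent from a given phylogeny.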

§Jan 2011 to Jan 2017 Small projects during my PhD

Some small software projects I worked on during my PhD.

  • Software: BibURI§ (Oct 2013 to Jan 2014): A Ruby Gem to extract BibTeX information from a number of bibliographic resources, such as DOIs and COinS webpages. I wrote this at the first TaxonWorks hackathon.
    • Technologies used: Ruby, BibTeX.
    • Provides the following features:
      • Extract a BibTeX entry given a DOI.
      • Extract a BibTeX entry given a webpage with COinS.
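The DOI-to-BibTeX step relies on doi.org content negotiation: requesting a DOI URL with an Accept header of application/x-bibtex returns a BibTeX entry. A sketch of building such a request with Python's standard library (the DOI shown is a placeholder, and the request is constructed but not sent here):

```python
from urllib.request import Request

def bibtex_request(doi):
    """Build a content-negotiation request for BibTeX from doi.org."""
    return Request(
        f"https://doi.org/{doi}",
        headers={"Accept": "application/x-bibtex"},
    )

# Placeholder DOI for illustration; pass urlopen(req) to actually fetch it.
req = bibtex_request("10.1234/example-doi")
print(req.full_url)
print(req.get_header("Accept"))
```

BibURI does this in Ruby, and adds similar extractors for other sources such as COinS-bearing web pages.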

§Jan 2016 to May 2016 General Biology Labs 2 (EBIO 1240)

I led three lab sections of eighteen students each, taking my students on a tour of the tree of life while reinforcing concepts in evolutionary biology, ecology, anatomy and physiology.

§Aug 2015 to Dec 2015 General Biology Labs 1 (EBIO 1230)

I led three lab sections of eighteen students each, teaching the philosophical underpinnings of science through hands-on experiments in cellular and molecular biology.

§Jan 2015 to May 2015 Evolutionary Biology (EBIO 3080)

I led three lab sections of twenty students each, teaching evolutionary biology through R-based statistics and modeling labs, measurements and phylogenetics.

§May 2012 to Apr 2015 Art of Life

A project organized by the Biodiversity Heritage Library (BHL) to identify and annotate hundreds of thousands of illustrations from the documents in this digital library.

  • §Link: Art of Life Schema (Apr 2015): I worked with my PhD advisor and some librarians at the BHL to develop a data schema for annotating biological illustrations in a way that would make them useful for biodiversity researchers.

§Jan 2011 to Jan 2015 Map of Life

An NSF-funded project to synthesize different kinds of biodiversity data — from occurrences to range maps — into a single, easy-to-use tool. I worked on the Map of Life project, usually during summer holidays, under the supervision of my PhD advisor, Rob Guralnick.

  • Software: Vernacular Names§ (Feb 2014 to Jul 2015): A web application for managing the synthesis and verification of vernacular name information for Map of Life.
    • Technologies used: Python, PostgreSQL.
    • Provides the following features:
      • Lists all vernacular names across multiple languages for each taxonomic name.
      • Includes scripts for importing particular vernacular name databases into this system.
      • Allows curators to improve vernacular names.
      • Produces reports on vernacular name coverage across important languages.
      • Uses regular expressions to make multiple simultaneous changes.
    • Source code available at https://github.com/MapofLife/vernacular-names.
  • Software: TaxRefine§ (Jun 2013 to May 2014): Provides an OpenRefine reconciliation service API for matching taxonomic names against several services.
  • §Link: Validating scientific names with the GBIF Checklist Bank (Jul 2013): A blog post describing TaxRefine.
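An OpenRefine reconciliation service like TaxRefine answers name queries with a standard JSON shape: a "result" list of candidates, each with an id, name, score and match flag. A minimal sketch of that response shape, with a hypothetical in-memory matcher and made-up identifiers:

```python
import json

# Hypothetical candidate names; TaxRefine instead queried live services
# such as the GBIF Checklist Bank.
CANDIDATES = {
    "taxon:1001": "Puma concolor",
    "taxon:1002": "Panthera leo",
}

def reconcile(query):
    """Return candidates in the OpenRefine reconciliation response shape."""
    results = []
    q = query.lower()
    for ident, name in CANDIDATES.items():
        if name.lower() == q:
            results.append({"id": ident, "name": name,
                            "score": 100, "match": True})
        elif name.lower().startswith(q.split()[0]):
            # Genus-level partial match: lower score, not auto-matched.
            results.append({"id": ident, "name": name,
                            "score": 50, "match": False})
    return {"result": sorted(results, key=lambda r: -r["score"])}

print(json.dumps(reconcile("Puma concolor")))
```

OpenRefine consumes responses of exactly this shape, auto-accepting candidates whose match flag is true and offering the rest for manual review.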

§May 2014 to Aug 2014 DBpedia Commons at Google Summer of Code

For a Google Summer of Code project in summer 2014, I extended the DBpedia Extraction Framework to extract information from the Wikimedia Commons database dumps and make it available as RDF.

§Nov 2011 to Jul 2012 The Junius Henderson Field Notebook Project

This project began as an idea that my collaborators (Andrea Thomer and Rob Guralnick) discussed regarding the annotation of historical field notebooks to extract the biodiversity data found in them. I suggested that we use Wikisource, a sister project of Wikipedia, to crowdsource the annotation process, and that we develop the software tools needed to carry out such annotations of Wikisource content. We published our findings as a Darwin Core Archive and in the journal ZooKeys in 2012.

§Jan 2008 to Jan 2011 OCR Terminal

A start-up focused on providing services around optical character recognition (OCR) technology.

  • Software: OCR Terminal§ (Feb 2008 to Dec 2010):

    OCR Terminal was an online optical character recognition (OCR) service: it read text from uploaded images and returned the recognized text in an editable format such as Microsoft Word, Adobe PDF or plain text. Between mid-2008 and 2011, tens of thousands of user accounts were created and over 100,000 documents were processed on this website. Apart from the website itself, the service featured a simple API that could be used to submit documents for processing programmatically.

    I was the lead developer of OCR Terminal from project inception, and wrote all of OCR Terminal's underlying code, first as a Perl/CGI application and later as a Perl/Catalyst application. I am particularly proud of designing the public API, which was used by our own desktop client, several in-house tools, and several clients who used it both for bulk processing and as a backend for their own software.

    I was also OCR Terminal's main server administrator, responsible for maintaining all the servers and backend components. I learned about server monitoring with tools such as Munin. From early 2009, OCR Terminal was hosted on the Amazon EC2 cloud, giving me experience with setting up, bundling and managing EC2 instances.

    • Technologies used: Perl, Amazon Web Services, web APIs.
    • Provides the following features:
      • A web application allowing users to register an account, OCR a small number of documents for free, and then pay to OCR additional images.
      • Included a job management system, allowing jobs uploaded to OCR Terminal to be processed by the ABBYY OCR Engine we used for OCR.
      • Included a web API, allowing customers to submit multiple jobs for batch processing.
    • Source code available at https://github.com/gaurav/ocrterminal.

§Aug 2006 to Nov 2010 SequenceMatrix

Genetic analysis software at this time was designed to compare a single genetic or protein sequence between different taxa to perform comparative analyses. In the Evolutionary Biology lab, we would often perform multi-gene analyses, requiring aligned genetic sequences to be concatenated together. Modifying such datasets after concatenation could be problematic, especially if one of the constituent sequences turned out to be contamination. SequenceMatrix simplified concatenation while preserving gene boundaries, so that individual genes could be unconcatenated if necessary, and included some tools for detecting and removing contamination from multi-gene, multi-taxon datasets.
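The key idea (concatenate aligned genes per taxon while recording each gene's column boundaries, so any gene can later be excised) can be sketched as follows, using tiny hypothetical alignments rather than SequenceMatrix's actual Java implementation:

```python
# Hypothetical aligned genes: gene name -> {taxon: aligned sequence}.
GENES = {
    "COI": {"Taxon_A": "ACGT", "Taxon_B": "ACGA"},
    "16S": {"Taxon_A": "GGCC", "Taxon_B": "GGTC"},
}

def concatenate(genes):
    """Concatenate genes per taxon, recording each gene's column range."""
    taxa = sorted({t for seqs in genes.values() for t in seqs})
    matrix = {t: "" for t in taxa}
    boundaries = {}
    start = 0
    for gene, seqs in genes.items():
        length = len(next(iter(seqs.values())))
        boundaries[gene] = (start, start + length)
        for t in taxa:
            # Gap-fill taxa missing from this gene's alignment.
            matrix[t] += seqs.get(t, "-" * length)
        start += length
    return matrix, boundaries

def drop_gene(matrix, boundaries, gene):
    """'Unconcatenate': excise one gene's columns via its recorded range."""
    lo, hi = boundaries[gene]
    return {t: seq[:lo] + seq[hi:] for t, seq in matrix.items()}

matrix, bounds = concatenate(GENES)
print(matrix["Taxon_A"])                             # ACGTGGCC
print(drop_gene(matrix, bounds, "COI")["Taxon_A"])   # GGCC
```

Because the boundaries travel with the matrix, a gene flagged as contaminated can be removed cleanly without realigning the remaining data.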

§Jan 2003 to Jan 2006 Species Identifier

This project came about in response to a series of projects in the Evolutionary Biology lab investigating whether genetic distance methods could be used to correctly identify the species from which a genetic sequence had been obtained. The tool we built allowed us to compare several different distance-based approaches, and then to replicate these analyses on larger datasets than we had originally envisioned.
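The simplest distance-based identification method assigns a query sequence to the species of its closest reference sequence under uncorrected pairwise (p-)distance. A minimal sketch with hypothetical sequences (Species Identifier implemented this and several more sophisticated variants):

```python
# Hypothetical aligned reference sequences, labelled by species.
REFERENCES = {
    "Species_X": "ACGTACGT",
    "Species_Y": "ACGTTTTT",
}

def p_distance(a, b):
    """Uncorrected pairwise distance: fraction of differing sites."""
    assert len(a) == len(b), "sequences must be aligned to the same length"
    diffs = sum(1 for x, y in zip(a, b) if x != y)
    return diffs / len(a)

def identify(query):
    """Assign the query to the species with the smallest p-distance."""
    return min(REFERENCES, key=lambda sp: p_distance(query, REFERENCES[sp]))

print(identify("ACGTACGA"))  # Species_X
```

The interesting methodological questions, which this tool let us explore, are what distance threshold separates correct from spurious matches and how often the closest reference belongs to the wrong species.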