Gaurav Vaidya



§Oct 2019 to Dec 2020 Semantic Web technologist at the Renaissance Computing Institute (RENCI)

I work on several projects to facilitate the re-use of biomedical data by other scientists, with a particular focus on using Semantic Web technologies to make scientific data findable, accessible, interoperable, and reusable. To achieve this, I work with tools that use standards such as RDF and JSON-LD to store and transmit semantically rich data, use ontologies written in OWL to reason over data, and use descriptions in SHACL and ShEx to validate data and document data models. Where existing tools are unavailable or inadequate, I write software tools to fill this need. My current projects focus on genetic, disease and other biomedical data, but I look forward to working with other kinds of data in the future. I work primarily in Scala to build these tools.

  • §Project: Center for Cancer Data Harmonization (CCDH) (Oct 2019 to Dec 2020): I work on the CCDH Tools and Data Quality team at RENCI. We aim to develop software tools to validate biomedical data against the CRDC-H data model for storing cancer-related biomedical data, to publish and transform the data model into different formats as needed, and to fulfil any other software development tasks needed to complete this project.
    • Software: csv2caDSR (Jul 2020 to Dec 2020): A tool for harmonizing biomedical data using the Cancer Data Standards Registry and Repository (caDSR) as a source of validation information.
    • Software: UMLS-RRF-Scala (Mar 2020 to Nov 2020): A set of tools for identifying mappings from terms in one vocabulary to terms in others, intended to provide mappings from closed-source vocabularies to open-access vocabularies.
  • §Project: Small projects at RENCI (Oct 2019 to Dec 2020): While working as a Semantic Web technologist at RENCI, I developed a number of ad-hoc software tools to meet various needs from other RENCI teams.
    • Software: SHACLI (Oct 2019 to May 2020): A command-line interface (CLI) for the Shapes Constraint Language (SHACL) to validate RDF data against SHACL shapes with improved error messages. This was part of a project to simplify data model creation, so that one document can be used both to produce the documentation and to validate input data.
    • Software: Omnicorp and OmniCORD (Oct 2019 to Nov 2020): I built upon the existing Omnicorp tool to improve the RDF data being produced by it, such as authorship information. I also wrote OmniCORD, a variant of Omnicorp for extracting entities from the COVID-19 Open Research Dataset (CORD-19).
    • §Citation: Daniel Korn, Tesia Bobrowski, Michael Li, Yaphet Kebede, Patrick Wang, Phillips Owen, Gaurav Vaidya, Eugene Muratov, Rada Chirkova, Chris Bizon, Alexander Tropsha (November 11, 2020) COVID-KOP: integrating emerging COVID-19 data with the ROBOKOP database. Bioinformatics.
  • §Link: My page on the RENCI website

§Jan 2018 to Oct 2019 Postdoctoral associate at the Florida Museum of Natural History

I worked full-time as the lead software developer on the Phyloreferencing project. Our goal is to build an ontology of definitions for groups of related biological organisms, as well as the software and ontological infrastructure needed to create, edit, organize and test these definitions. We are building a demonstration website that will allow users to resolve these definitions on any evolutionary hypothesis. I built these tools using JavaScript in Node.js, Vue CLI, Java and Python.

§Aug 2011 to Dec 2017 PhD in Ecology and Evolutionary Biology at Ecology and Evolutionary Biology (EBIO)

I went into graduate school with two goals: (1) to become a scientific software developer, and (2) to spend a period of time doing a focussed study on the informatics of taxonomic names. Taxonomic names are one of the oldest information management schemes in science, and have had to change dramatically over the last 285 years as biologists' understanding of species and evolution has changed. I wondered what we could learn about this old-yet-new information management system. Apart from the projects listed here that were directly relevant to my PhD, I also spent my time in graduate school learning about several other tools and techniques in a variety of other roles, listed elsewhere.

§Jan 2015 to May 2016 Graduate teaching assistant at Ecology and Evolutionary Biology (EBIO)

While a graduate student, I taught three Evolutionary Biology and General Biology labs, in which I led classes of up to eighteen undergraduate students through hands-on exercises to build their understanding of biology and evolution. CU Boulder asks students to evaluate their instructors on a scale from 1 (worst) to 6 (best); my reviews increased from 4.8-5.5 in my first semester to 5.3-5.6 in my last semester.

  • §Project: General Biology Labs 2 (EBIO 1240) (Jan 2016 to May 2016): I led three lab sections of eighteen students each, taking my students on a tour of the tree of life while reinforcing concepts in evolutionary biology, ecology, anatomy and physiology.
  • §Project: General Biology Labs 1 (EBIO 1230) (Aug 2015 to Dec 2015): I led three lab sections of eighteen students each, teaching the philosophical underpinnings of science through hands-on experiments in cellular and molecular biology.
  • §Project: Evolutionary Biology (EBIO 3080) (Jan 2015 to May 2015): I led three lab sections of twenty students each, teaching evolutionary biology through R-based statistics and modeling labs, measurements and phylogenetics.

§Aug 2011 to Dec 2015 Graduate research assistant at the University of Colorado Museum of Natural History

During the first four years of my PhD, I worked on the Map of Life and VertNet projects, where I developed a web application for synthesizing and managing vernacular names extracted from multiple sources, a web API for efficiently searching a large database of taxonomic names, and a Python tool to identify errors in species names (now deprecated).

  • §Project: Art of Life (May 2012 to Apr 2015): A project organized by the Biodiversity Heritage Library (BHL) to identify and annotate hundreds of thousands of illustrations from the documents in this digital library.
    • §Link: Art of Life Schema (Apr 2015): I worked with my PhD advisor and some librarians at the BHL to develop a data schema for annotating biological illustrations in a way that would make them useful for biodiversity researchers.
  • §Project: Map of Life (Jan 2011 to Jan 2015): An NSF-funded project to synthesize different kinds of biodiversity data — from occurrences to rangemaps — into a single, easy-to-use tool. I worked on the Map of Life project, usually during summer holidays, under the supervision of my PhD advisor, Rob Guralnick.
    • Software: Vernacular Names (Feb 2014 to Jul 2015): A web application for managing the synthesis and verification of vernacular name information for Map of Life.
    • Software: TaxRefine (Jun 2013 to May 2014): Provides an OpenRefine reconciliation service API for matching taxonomic names against several services.
    • §Link: Validating scientific names with the GBIF Checklist Bank (Jul 2013): A blog post describing TaxRefine.
  • §Project: The Junius Henderson Field Notebook Project (Nov 2011 to Jul 2012): This project began as an idea that my collaborators (Andrea Thomer and Rob Guralnick) discussed regarding the annotation of historical field notebooks to extract the biodiversity data found in them. I suggested that we use Wikisource, a sister project of Wikipedia, to crowdsource the annotation process and to develop the software tools needed to carry out such annotations of Wikisource content. We published our findings as a Darwin Core Archive and in the journal ZooKeys in 2012.

§May 2014 to Aug 2014 Student developer at DBpedia
Funded by Google Summer of Code (part of Google)

I extended DBpedia's fact extraction software to support extracting facts in RDF from the Wikimedia Commons, an online repository that then contained around 25 million media files across a number of formats and licenses.

§Jan 2013 to May 2013 Graduate research assistant at the National Evolutionary Synthesis Center (NESCent)

This fellowship allowed me to work exclusively on my PhD for one semester with a mentor at the National Evolutionary Synthesis Center (NESCent) in Durham, North Carolina. My mentor at NESCent, Hilmar Lapp, would later hire me to work on the Phyloreferencing project.

§Nov 2007 to Feb 2011 Software architect at Paper Terminal Pte Ltd

I was the primary software developer, responsible for developing new web applications from prototypes to final deployment, including OCR Terminal, my company's flagship product. I also managed our computer systems, both on-site and on the Amazon EC2 cloud platform, across multiple operating systems.

  • §Project: OCR Terminal (Jan 2008 to Jan 2011): A start-up that focussed on providing services around optical character recognition (OCR) technology.
    • Software: OCR Terminal (Feb 2008 to Dec 2010):

      OCR Terminal was an online optical character recognition (OCR) service: it read text from uploaded images and provided the image files in an editable format such as Microsoft Word, Adobe PDF or plain text. Between mid-2008 and 2011, tens of thousands of user accounts had been created and over 100,000 documents had been processed on this website. Apart from the website itself, the service featured a simple API which could be used to submit documents for processing programmatically.

      I was the lead developer of OCR Terminal right from the project's inception. I wrote all of OCR Terminal's underlying code, first as a Perl/CGI application, and later as a Perl/Catalyst application. I am particularly proud of designing the public API, which was used by our own desktop client, several in-house tools, and several of our clients, who used it both for bulk processing and as a backend for their own software.

      I was also OCR Terminal's main server administrator, responsible for maintaining all the servers and backend components. I was able to learn about server monitoring with tools such as Munin. From early 2009, OCR Terminal was hosted on the Amazon EC2 processing cloud, giving me experience with setting up, bundling and managing EC2 instances.

  • §Link: Description of OCR Terminal on ABBYY's website

§Aug 2006 to Jun 2007 Lab officer at the Evolutionary Biology Laboratory

I helped manage computer-related infrastructure, from sending computers for servicing to installing scientific software on both local hardware and remote computing clusters. I also finished work on several scientific tools, which I have documented under my educational history below.

  • §Project: SequenceMatrix (Aug 2006 to Nov 2010): Genetic analysis software at this time was designed to compare a single genetic or protein sequence between different taxa to perform comparative analyses. In the Evolutionary Biology lab, we would often perform multi-gene analyses, requiring aligned genetic sequences to be concatenated together. Modifying such datasets after concatenation could be problematic, especially if one of the constituent sequences turned out to be contamination. Sequence Matrix was intended to simplify the process of concatenation by preserving gene boundaries so that datasets could be unconcatenated if necessary, and included some tools for detecting and removing contamination from multi-gene, multi-taxon datasets.
    • Software: SequenceMatrix (Aug 2006 to Jun 2015): Sequence Matrix facilitates the assembly of phylogenetic data matrices with multiple genes. Files for individual genes are dragged and dropped into a window and the sequences are concatenated. A table provides an overview of how much sequence information is available for the different genes and species. The user can request that Sequence Matrix generate a wide variety of character and taxon sets (e.g. a taxon set with all species that have more than a specified number of genes or basepairs). The concatenated sequences can be exported in NEXUS or TNT format. Individual sequences can be excluded from being exported.
    • §Citation: Gaurav Vaidya, David J. Lohman, Rudolf Meier (March 8, 2011) SequenceMatrix: concatenation software for the fast assembly of multi‐gene datasets with character set and codon information. Cladistics 27(2):171–180.
  • §Citation: Shiyang Kwong, Amrita Srivathsan, Gaurav Vaidya, Rudolf Meier (December 11, 2011) Is the COI barcoding gene involved in speciation through intergenomic conflict? Molecular Phylogenetics and Evolution 62(3):1009–1012.
  • §Link: My description on our lab website

§Jul 2002 to Jun 2006 Bachelor of Science (with Merit) in Life Sciences with minors in Computational Science and Economics at Department of Biological Sciences