Building a biological knowledge graph via Wikidata with a focus on the Human Cell Atlas

Tiago Lubiana

The Human Cell Atlas (HCA) is an international effort to characterize every cell type of the human body. HCA members are producing cell-level data from virtually all human tissues using techniques such as single-cell RNA sequencing, mass cytometry, and multiplexed in situ hybridization. The wealth of knowledge stemming from this wealth of data needs management to maximize the societal benefit of the HCA. While semantic technologies have emerged as key players in the data interoperability ecosystem, there are still gaps to bridge. Wikidata, a sister project of Wikipedia for structured data, is surfacing as a hub in the semantic web for multiple types of information, bridging experts and the general public. Connecting the Human Cell Atlas to Wikidata can quickly insert its products into the larger knowledge ecosystem, extending HCA’s reach. This PhD project aims at studying Wikidata as a platform for representing cell types.
We review the literature on cell types, refining and formalizing concepts for cell type delimitation. At the same time, we integrate biomedical databases (e.g. PanglaoDB) into the Wikidata infrastructure and enrich Wikidata with manually curated cell types. To optimize the curation, we develop Wikidata Bib, a framework for literature management and organized note-taking for dealing with the academic literature. Finally, we are improving the interplay of Wikidata, the Cell Ontology and software used for single-cell RNA-seq data, inserting Wikidata de facto as a tool for the Human Cell Atlas community.

Preface

The introduction contains an overview of the Human Cell Atlas project and the current state of classifying cells into types. Then, it introduces ontologies and knowledge graphs as tools for connecting what we know about cells.

The methodology section is an overview of the core methods used throughout the work. Of note, some chapters in the results section also display methods when they are intimately linked with the results.

It is worth noting that the results were not developed chronologically in the order shown. The different branches were developed in parallel, with overlapping periods of activity. They are organized into separate chapters and are part of different publications.

The discussion on the concept of cell type is presented first, as it is instrumental for the later steps. It is followed by an account of how PanglaoDB, a database of cell markers, was integrated into Wikidata. Then, we present Wikidata Bib, a framework for organized reading and curation of cell types. The framework, although used as a method throughout the PhD project, is also an intellectual product, and is thus presented in the results section. To end the results, we discuss how the work integrates with the Cell Ontology, the leading system for organizing cell types.

After the results, an account of other academic aspects of the project is presented as part of the qualification requirements. It gives an overview of collaborations, participation in events and academic courses taken during the first part of the PhD project.

Background

The Human Cell Atlas (HCA) Project

The Human Cell Atlas (HCA) project is, arguably, the biggest multinational biomedical project of the present time, running since 2017 to characterize every cell type in the human body [1]. The HCA consortium gathers people from all over the world to tackle different parts of the project to have a diverse and equitable account of the cell type diversity. [2]
It intends to “ultimately describe at least 10 billion cells, covering all tissues, organs, and systems” and, to achieve such bold goals, it commits “to open membership, to the open and immediate data release with no restrictions, and to open-source code for all computational approaches”. [1]

Building a complete atlas of human cells comes with multiple challenges. The project includes the detection, in single cells, of RNA species (scRNA-Seq), chromatin accessibility (scATAC-Seq), and protein markers (primarily by CyTOF), as well as spatial information on cells with multiplexed in situ hybridization (such as MERFISH) and imaging mass cytometry [1,3]. The HCA is set to revolutionize the biomedical sciences by creating tools and standards for basic research, allowing better characterization of disease, and improving diagnostics and therapy. Its products (data, information, knowledge and wisdom) need to be FAIR: findable, accessible, interoperable and reusable. Data stewardship and management are growing as core demands of the scientific community, ranging from data management plans [4] to specialized data personnel [4].

The Human Cell Atlas has a dedicated team for organizing data: the Data Coordination Platform (DCP) [5] [3]. The DCP is responsible for tracing the plan for computational interoperability, from the data generators to the consumers. [3] The Human Cell Atlas has its own data portal [6], which shares the data repository landscape with other resources, like the Broad Institute Single Cell Portal (https://singlecell.broadinstitute.org/single_cell) and the Chan-Zuckerberg Biohub Tabula Sapiens (https://tabula-sapiens-portal.ds.czbiohub.org/). In addition to its core team, the HCA is poised to grow by community interaction. As stated in its opening publication, “As with the Human Genome Project, a robust plan will best emerge from wide-ranging scientific discussions and careful planning”. [1]
Thus, this project inserts itself among the wide-ranging scientific discussions to improve data - and knowledge - interoperability.

The highlight of “knowledge” in the last paragraph is meant to stress that raw data per se is not enough to turn the Atlas objectives into reality. There is a long way from raw datasets to commonly agreed scientific knowledge. Currently, the gap between data and knowledge is bridged by writing and sharing scientific manuscripts, the de facto currency of exchange of claims about the natural world. A Human Cell Atlas Publication Committee reviews and selects publications [7], presenting the papers as one of the significant outputs of the whole endeavour. As of December 2021, the list includes 96 different publications, which, arguably, expose only a fraction of the knowledge extractable from the underlying data.

The challenge that arises is one of managing a wealth of information and casting it into useful knowledge. Ideally, we would like to understand, remember, and use every statement produced by the HCA. As this goal is humanly impossible, we need to develop tools to make the knowledge interoperable with the aid of computers. At that point, the challenges of the HCA resonate with the challenges of biocuration, and with the goals of this project.

One core step for knowledge management is the minting of identifiers for the concepts of interest. [8] The minting of identifiers, however, depends on clear, well-defined entities. The very notion of “cell type” is undefined [1], a challenge for organizing both data and knowledge. Accordingly, this PhD project concerns itself also with the theoretical basis of defining a “cell type”. Thus, the next chapter will introduce the state of the art of classifying cells into types, in preparation for the chapter on knowledge modelling and the theoretical discussions of the results section.

Classification of cells into types

Given that a core goal of the Human Cell Atlas is to advance knowledge about all human cell types [1], the definition of “cell type” becomes essential. Although a number of views exist [9,10,11,12,13,14,15,16,17,18,19,20,21], there is no formal, commonly agreed upon definition of cell type.

A 2017 article on the Human Cell Atlas mentions[12]: “Descriptors such as ‘cell type’ and ‘cell state’ can be difficult to define at the moment. An integrative, systematic effort by many teams of scientists working together and bringing different expertise to the problem could dramatically sharpen our terminology, and revolutionize the way we see our cells, tissues and organs. We invite you to join the effort.”

The article further highlights both the current gap in knowledge and the need for a community effort to work on finding definitions. The magnitude of the challenge justifies the position of the HCA to avoid attempting to propose a precise definition of ‘cell type’. [1]

One consequence of a lack of a definition is that there is no commonly agreed number of cell types, and not even on the order of magnitude. As of November 2021, the leading answers in the Google Search Engine for the question “How many different cell types are found in the human body?” all point to around 200 different types [22,23,24], an estimate that is agreed upon by Bionumbers, a database of useful biological numbers [25] [26]. A list of cell types in the adult human body on Wikipedia also amounts to around a couple hundred cell types [url?].

If we look at other sources, however, the estimates increase considerably. For example, the Cell Ontology has catalogued more than 2,311 cell types of interest for the Human Cell Atlas [27], increasing the estimate by at least one order of magnitude. Across the literature, we can find mentions of thousands to tens of thousands of cell types only in the brain. [28,29] Additionally, with an estimate of 37 trillion cells on average per human body [30] and an ever-increasing report of new cell types/clusters in single-cell transcriptomics [31], it is reasonable to assume the number of relevant cell types might be much larger. The Human Cell Atlas project itself does not commit to any estimates of numbers of cell types due to the sheer difficulty of estimating a number given current knowledge. (Aviv Regev; reply to a question in the HCA General Meeting conference [url?])

Even though there is no commonly agreed definition, different views on cell types are maturing, tailored to different research needs. One core line of thought to define “cell type” is based on the cell type as an evolutionary unit. That definition enables the drawing of parallels, from the evolution of other biological entities (such as genes, proteins, and species) to cell types’ evolution. Models of how multicellular life works greatly benefit from concepts such as sister types (sharing a single ancestor), cell type homology (sharing a common evolutionary origin), and cell type convergence (executing similar functions without direct evolutionary links). [32,33]

Another school of thought is based on attractors: regions of dynamical stability in a feature space. [34,35] In this theory, “basins of attraction” direct cell phenotypes, providing points in, say, a gene expression space towards which different cells “move” their expression programs. This dynamic view sees each cell type corresponding to “a self-stabilizing regulatory program, which acts to maintain and restore the cell type-specific program of gene expression.” [36] It aligns itself with dynamic systems theory, and some authors go as far as to say that “Lacking the idea of attractors we have no clear idea of what a cell type is.” [37]

As much as different species concepts coexist [38], the quest to define cell types may take various forms. The challenge of representing cell types in evolution is conceptually different from representing cell types for identifying different entities in biomedical data. In that second direction, the groundwork of the Cell Ontology [39,40,41] is notable, providing identifiers that are reused across a range of databases. [42] Theoretical discussions on cell types are also the topic of specialized conferences, namely the International Workshop on Cells in Experimental Life Sciences conference series [43,44].

Single-cell transcriptomics

Even though many sources of knowledge contribute to our understanding of cell types [45], arguably single-cell transcriptomics is the workhorse for current efforts of the Human Cell Atlas. [45] Current scRNA-seq data analyses often rely on unsupervised clustering followed by labeling clusters. For the clustering, bioinformaticians select parameter sets to a target resolution, i.e., the level of detail used to detect cell identities. [46] [1] When the clustering is finished, the groups of cells are annotated with class labels, representing the underlying biology in a language we can understand. [47]

Instead of assigning expression gates, as done for flow-cytometry, single-cell RNA-seq analysis pipelines start from de novo clustering of cells followed by cluster annotation. [46] While it is clear that clusters and cell types are different concepts [46], often cluster labels are treated as cell types. There are several ways to cluster cells to find groups of similarity, and arguably the current default is derived from the methodology proposed by PhenoGraph. [48] The protocol is to calculate the distances between cells in a reduced PCA space (with the number of dimensions chosen by the experimenters), followed by constructing a k-nearest-neighbours network. Each cell is a node connected by k (another parameter) edges to other cells. Once the network is built, network modules (i.e. cell clusters) are commonly found using the Louvain algorithm, published in 2008 by researchers of the Université Catholique de Louvain in Belgium. [49] The cell clusters found by the PhenoGraph (or any other) algorithm are then labelled by domain experts, often based on genes called “cluster markers”, differentially expressed on each cluster. [46]
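The graph-building step of the protocol above can be illustrated with a minimal sketch in plain Python. It assumes the cells are already projected into a reduced space (the toy 2D coordinates below are hypothetical) and uses Euclidean distance. It does not implement PCA or the Louvain algorithm: connected components are used only as a simple stand-in for module detection, and the helper names are illustrative.

```python
from math import dist


def knn_graph(points, k):
    """Map each cell index to the set of indices of its k nearest cells."""
    graph = {}
    for i, point in enumerate(points):
        others = [j for j in range(len(points)) if j != i]
        others.sort(key=lambda j: dist(point, points[j]))
        graph[i] = set(others[:k])
    return graph


def connected_components(graph):
    """Group cells linked by an edge in either direction.

    This is a simple stand-in for module detection, NOT the Louvain algorithm.
    """
    undirected = {node: set(neighbours) for node, neighbours in graph.items()}
    for node, neighbours in graph.items():
        for neighbour in neighbours:
            undirected[neighbour].add(node)
    seen, components = set(), []
    for start in sorted(undirected):
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(undirected[node] - component)
        seen |= component
        components.append(sorted(component))
    return components


# Hypothetical coordinates of six cells in a 2D reduced space.
cells = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (5.0, 5.0), (5.1, 5.2), (5.2, 4.9)]
graph = knn_graph(cells, k=2)
clusters = connected_components(graph)
print(clusters)  # two groups of three cells each
```

In this toy case the two groups are trivially separable. On real data the choices of k and of the number of PCA dimensions change the graph, which is why cluster boundaries depend on analysis parameters.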

While it is possible to manually assign labels to clusters, automatic methods have been developed to aid in the task. [47] Marker-based automatic annotation bases itself on crossing cluster markers in the analyzed dataset with previous knowledge from databases like PanglaoDB [50] and CellMarker [51] [47]. Reference-based automatic cell annotation, on the other hand, relies on expert-annotated reference datasets to transfer labels to the dataset under study. [47] Other methods bypass the clustering step and focus on labelling individual cells, which avoids lumping different cells together. [47] Clarke et al.’s recent review and tutorial [47] provides an extensive account of current techniques for clustering and annotating cells.
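As an illustration of the marker-based strategy, the sketch below scores a cluster against a small reference by Jaccard overlap of gene sets. The scoring choice, the function names and the tiny reference table are assumptions made for illustration: the gene sets are not taken from PanglaoDB or CellMarker, and published tools use more elaborate statistics.

```python
def jaccard(a, b):
    """Jaccard index between two collections of gene symbols."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def annotate(cluster_markers, reference):
    """Return the reference label with the highest marker overlap, and its score."""
    scores = {
        label: jaccard(cluster_markers, genes)
        for label, genes in reference.items()
    }
    best = max(scores, key=scores.get)
    return best, scores[best]


# Illustrative reference; a real analysis would load markers from a database.
reference_markers = {
    "T cell": {"CD3D", "CD3E", "CD2"},
    "B cell": {"CD79A", "MS4A1", "CD19"},
}

# Hypothetical markers found for one cluster in the dataset under study.
cluster_markers = {"CD3D", "CD3E", "IL7R"}

label, score = annotate(cluster_markers, reference_markers)
print(label, score)
```

Note that the returned label is a free-text string. Mapping such labels to stable identifiers, such as those of the Cell Ontology, is a separate step.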

Of note, even though a range of methods is available, most techniques and publications do not use standard identifiers for cell types. That omission disregards the advantages of using standard identifiers, such as those provided by the Cell Ontology. [47] [8] [27] [52] [53] [54] Reassuringly, projects that use stable identifiers for single-cell RNA-seq data are appearing [42], including Python and R packages (e.g. Besca [55], OnClass [56] and ontoProc [57]), data management projects and reference datasets (e.g. Tabula Muris [58], Tabula Sapiens [59], the Azimuth map [60] and HubMap’s ASCT+B Tables [61]) and annotation platforms (e.g. the Cell Annotation Platform [62] and CellTypist [63]).

The Cell Ontology is currently the biggest provider of such standard identifiers. As elegantly put by Meehan et al. [64], the Cell Ontology is a “manually constructed computer-readable resource that links cell types by different relationships”. It was first described in 2005 [39] and was oriented both at creating a species-neutral classification of cells and at allowing researchers to “learn a considerable amount about that cell type and its relationships to other biological objects”. [39]

The collaborative project gradually evolved and changed its design and scope to fit new needs. By 2011, a need for computable definitions led to an advance in the number and quality of immune cell types represented in CL. [65] It also included the addition of species-specific cell types to handle better marker-based definitions, usually presented at the species level. [65] Further developments over the years included both technical improvements and the addition of new cell types. By the time of the last official CL publication, in 2016, it contained approximately 2,200 classes. [52]
Currently, the Cell Ontology is growing as a resource for the Human Cell Atlas and as a provider of identifiers for cell types [42], and it contains over 2,500 classes of cells (Figure 1).

The solution to the challenges of cell type classification, thus, passes through ontologies and other best practices of knowledge management. [42] Consequently, in the following chapter, we present tools for computer-based knowledge processing. We introduce the technical aspects of ontologies and knowledge graphs, present Wikidata, and discuss how such tools can influence life-sciences research.

Ontologies

The classification of biological concepts is at the core of biology. At least since the Aristotelian endeavours to group classes of animals, a good part of the scientific work is to capture concepts into knowledge systems [66]. Linnaeus’ binomial system for naming species and Mendeleev’s periodic table are likely the two most famous classification systems but are part of a much larger ecosystem of structuring scientific knowledge.

In the 20th century, the development of the analytical philosophy of Russell and Wittgenstein and their search for formalizations [67] gradually laid the foundations for the logic of scientific descriptions. Karl Popper and his “The Logic of Scientific Discovery” [68] were heavily influenced by analytical philosophy. Less known among life scientists, Tarski’s inquiries on what can be considered to be “true” [69] were also foundational for formal ontologies. At the end of the 20th century, the tradition of analytic philosophy contributed to the rise of applied ontology, which provided the theoretical basis for the computational ontologies of the early 21st century. [70]

The whole movement for formalization of knowledge progressed on the computational end. In the late 20th century, the advent of computational ontologies and large-scale knowledge graphs was at the root of the functioning of the World Wide Web. This chapter will provide an overview of ontologies and knowledge graphs and their use in today’s biomedical sciences, alongside their future prospects.

The OBO Foundry and biomedical ontologies

An ontology, as used here, is a formal computational representation of reality, which tries to represent each concept (and their relations) as precisely as possible. [66]
Constructing an ontology is a process of selecting and defining terms of interest, selecting and defining relationships of interest and making statements about reality using terms and relationships. The Gene Ontology is probably the best-known biomedical ontology; it describes (among other things) different classes of biological processes related to each other by “is_a” and “part_of” relations. [71] [72]

The Gene Ontology is part of a much larger effort to formalize concepts across biology: the Open Biomedical and Biological Ontologies (OBO) Foundry. [73] Created in 2007, the OBO Foundry is a hub of biomedical ontologies that sets guidelines for designing and constructing high-quality ontologies. Under a common framework towards interoperability, the initial OBO Foundry united several independent ontologies (like the Cell Ontology, the Disease Ontology and the Protein Ontology). At the same time, the creation of the Relation Ontology (RO) provided a go-to point for relations in biology that different ontologies could reuse.

OWL and ontology languages

One of the OBO Principles for its ontologies is that they should be resolvable as a “syntactically valid OWL file using the RDF-XML syntax” (http://www.obofoundry.org/principles/fp-002-format.html). The OWL Web Ontology Language was introduced as a standard by the W3C in 2004. OWL is not a programming language, as it does not instruct computers to perform actions, but an ontology language, which allows computable descriptions of the world. Furthermore, it is an umbrella ontology language that includes several languages with varying levels of expressivity. Generally, more expressive languages can represent more complex ideas but make computations harder.

Regardless of an ontology’s sublanguage, it must be resolvable to an RDF-XML file. RDF stands for Resource Description Framework, another W3C standard built around a graph-based data model [74]. Statements in RDF are triples consisting of two nodes (a subject and an object) and an edge (a predicate) connecting the nodes. In RDF, edges and most nodes are identified by Internationalized Resource Identifiers (IRIs) (object nodes can also hold literal values such as strings), and there are many ways to lay out those IRIs in a text file to represent triples. One of those layouts is the RDF-XML syntax, inspired by the XML markup language. Arguably, other syntaxes (interchangeable with RDF-XML) are easier for a human to read. As an example of an RDF triple, here is how one would represent, in the Turtle RDF syntax, the notion that plasmacytoid dendritic cells are a type of dendritic cell:

<http://purl.obolibrary.org/obo/CL_0000784> <http://www.w3.org/2000/01/rdf-schema#subClassOf> <http://purl.obolibrary.org/obo/CL_0000451> .

Here, http://purl.obolibrary.org/obo/CL_0000784 and http://purl.obolibrary.org/obo/CL_0000451 are the unique IDs in the Cell Ontology for “plasmacytoid dendritic cells” and “dendritic cells”, respectively, and http://www.w3.org/2000/01/rdf-schema#subClassOf is the identifier for the “subclass of” relation as defined by RDF Schema.

A longer explanation of the details of OWL and RDF is outside the scope of this work. This brief introduction has a dual goal of introducing the architecture of formal representations and demonstrating the system’s complexity. There is a high energy barrier to acquiring the knowledge and the technical skills to engage in ontology building. That complexity might be one of the reasons why a tiny fraction of the biomedical communities represent data with ontologies, and an even smaller fraction engages with ontology building.

Wikidata

Even though the Semantic Web (which ontologies are a part of) emerged with promises of a revolution in the way knowledge is shared, it has yet to become widely known outside semantic engineering. Two recent projects are playing a significant role in bringing the Semantic Web to a broader audience: the Google Knowledge Graph and Wikidata.

The Google Knowledge Graph introduced the Semantic Web de facto in the daily life of users of Google. [75] Its underlying structure is similar to the triples in an ontology, but it is less concerned with being logically coherent and does not have the strict semantics of a formal representation. In that way, the Google Knowledge Graph can feed on a variety of sources and not crash if there is some data modelling that, rigorously, could be inconsistent. Even though there is not a strict boundary between ontologies and knowledge graphs, one reasonable interpretation is that a knowledge graph may not be perfectly coherent, as long as it still can provide enough knowledge and reasoning for the approach of interest. While the lack of formal semantics limits reasoning and inference, knowledge graphs are arguably easier to use, edit and understand, and so provide a user-friendly alternative for computable information with a lower entry barrier.

While the Google Knowledge Graph is widely used as a source of knowledge, it does not allow independent users to contribute information. On the other hand, Wikidata, the collaborative knowledge graph of the Wikimedia Foundation, allows users to contribute classes and statements in the same spirit as Wikipedia and shares its “epistemic virtues, like power, speed and availability”. [76] Its power is derived from its large community of contributors, closely linked to the hugely successful Wikipedia. With a community of more than 20,000 active editors [77] and growing, it can cover a much wider number of concepts than any user individually. It is fast because one does not need to install any software or ask for permissions to update it: any user can do it via a web interface. That speed makes it easier for newcomers to join and contribute, in contrast to OBO Foundry ontologies, which require extensive training on semantics and knowledge of Git/GitHub for contributions. Finally, the information on Wikidata is available via a user interface, via a SPARQL query service and as large, full-size database dumps, providing full reusability. The Wikidata model has been so successful that Google decided to migrate its knowledge base, Freebase, fully into Wikidata. [78]

The inner workings of Wikidata

Wikidata uses the same framework (RDF) that powers ontologies, and its model represents statements about the world in triples containing a subject, a property and an object. [79] Its data model is serialized both in JSON and RDF. The data model contains 17 different data types, including, for example, “Item”, an entry on Wikidata that refers to “a real-world object, concept, or event that is given an identifier in Wikidata”, and “String”, a “sequence of freely chosen characters interpreted as text”. [80] Knowledge is stored on Wikidata as basic triples containing a subject (of type “Item”), a property and a value (which can be of any of the 17 types). As of November 2021, Wikidata contains more than 90 million data items [77] and more than 9,000 properties that link them to values. As values are often other items, the database acquires a network format with labelled edges.

As seen in the example in Figure 2, each of the items in the database has an item identifier (Q followed by numbers). Items also have a label, a description, and a list of aliases, which can be recorded in more than 200 languages, making Wikidata a multilingual project. [81] Each item is decorated with statements comprising property-value pairs. These pairs can be further specified via qualifiers and references, which treat the full triple as the subject, adding metadata to it (a process called reification [82]). Qualifiers provide ways to extend the information on the triple, while references provide provenance, enabling users to judge the validity of the claims in the database.
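The structure of an item described above can be sketched with Python dataclasses. This is a simplified illustration and not the actual Wikibase JSON schema: the class and field names are assumptions, and all identifiers starting with Q_ or P_ are placeholders rather than real Wikidata entities. P279 (“subclass of”) and P248 (“stated in”) are real Wikidata properties.

```python
from dataclasses import dataclass, field


@dataclass
class Statement:
    """A property-value pair, optionally refined by qualifiers and references."""
    prop: str
    value: str
    qualifiers: dict = field(default_factory=dict)   # extra context for the claim
    references: list = field(default_factory=list)   # provenance of the claim


@dataclass
class Item:
    """A simplified Wikidata item with multilingual terms and statements."""
    qid: str
    labels: dict = field(default_factory=dict)        # language code -> label
    descriptions: dict = field(default_factory=dict)  # language code -> description
    aliases: dict = field(default_factory=dict)       # language code -> list of aliases
    statements: list = field(default_factory=list)

    def values(self, prop):
        """Return the values of all statements that use the given property."""
        return [s.value for s in self.statements if s.prop == prop]


# Placeholder identifiers (Q_*, P_*) are hypothetical, for illustration only.
item = Item(
    qid="Q_CELL_TYPE",
    labels={"en": "plasmacytoid dendritic cell"},
    aliases={"en": ["pDC"]},
    statements=[
        Statement(
            prop="P279",                              # subclass of
            value="Q_PARENT",
            qualifiers={"P_QUALIFIER": "Q_CONTEXT"},
            references=[{"P248": "Q_SOURCE"}],        # stated in
        )
    ],
)

print(item.values("P279"))
```

Keeping qualifiers and references attached to the statement, rather than to the item, mirrors the idea that the metadata is about the whole triple.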

All the information is available on a user interface and programmatically. Advanced users can download dumps in JSON and RDF and access the data via the MediaWiki API and a SPARQL endpoint. [83] Several wrappers of such services are available in languages such as R [84] and Python [85]. The data scheme can be seen in ?? model, where each item is connected to a statement node via a property in the “p:” namespace, from which references and qualifiers are accessible. To facilitate primary usage, the namespace “wdt:” connects items to values directly, simplifying, for example, the writing of SPARQL queries.
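To make the role of the “wdt:” namespace concrete, the sketch below builds a SPARQL query that follows a direct “subclass of” (P279) link and encodes it as a request URL for the Wikidata Query Service endpoint. The function names are illustrative, and the use of Q7868 as the item for “cell” is an assumption that should be checked on Wikidata before reuse. The code only builds the URL and does not send any request.

```python
from urllib.parse import urlencode

# Public SPARQL endpoint of the Wikidata Query Service.
ENDPOINT = "https://query.wikidata.org/sparql"


def subclass_query(class_qid, limit=10):
    """Build a query for direct subclasses of an item, using the wdt: shortcut."""
    return (
        "SELECT ?item ?itemLabel WHERE {\n"
        f"  ?item wdt:P279 wd:{class_qid} .\n"
        '  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }\n'
        "}\n"
        f"LIMIT {limit}"
    )


def request_url(query):
    """Encode the query as a GET URL asking for JSON results."""
    return ENDPOINT + "?" + urlencode({"query": query, "format": "json"})


# Assumption: Q7868 is the Wikidata item for "cell".
query = subclass_query("Q7868")
url = request_url(query)
print(query)
print(url)
```

The same question asked through the “p:” namespace would need an extra hop through the statement node, which is only worthwhile when qualifiers or references are required.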

Information on Wikidata is released under a CC0 license, which enables full reuse of the data. [86] One of the major points of access and reuse of the information is the Wikidata Query Service [87], a core resource of the community which enables live querying in the SPARQL language. [88] A number of services make use of embedded queries from the Wikidata Query Service [87] to create interactive, live dashboards like Scholia [89] and the SARS-CoV-2 Query Book [90].

Wikidata is accessible in many ways and writable in many ways. It provides a user-friendly, point-and-click interface for modifying the database, providing low entry barriers for newcomers. It is also possible to semi-automatically reconcile spreadsheets to Wikidata items and use batch tools such as OpenRefine [91] and QuickStatements [92], which enable batches on the order of thousands of edits. For larger amounts of edits, it is possible to ask for bot permissions [93] and deploy systems that integrate big data sources. Bot edits are made via the MediaWiki API and are predominantly written via Python wrappers, such as Pywikibot [94] and the Wikidata Integrator. [95]

Wikidata as a knowledge graph for the life sciences

Due to its privileged position inside the linked data ecosystem and its ease of writing and querying, Wikidata has been growing as a hub for interoperable data for the life sciences community. [96] [97] Even though Wikidata was only launched in 2012, the demand for a community-curated life sciences knowledge graph has been apparent since at least 2008. [98] [99] The Wikidata-like project proposed at the time was eventually discontinued, an example of the challenge of maintaining independent biomedical databases. [100] As Wikidata has a very large community, has stable funding and is at the core of modern technologies, like the Google Knowledge Graph [78] and Amazon’s Alexa [101], it is very likely that data in Wikidata will remain accessible for a long time, regardless of local funding schemes.

The Gene Wiki project [102] was likely the first large-scale biomedical project to rely directly on the Wikipedia infrastructure for community curation. It provided a direct connection between the generalist community of Wikipedia and domain experts. The interplay of both communities is a topic of discussion, and the opportunities and challenges were already discussed in NAR in 2012. [103]
Notably, Wikidata appeared chronologically after those efforts.
Nevertheless, the Gene Wiki research group has embraced the Wikidata environment for community biocuration and data interoperability [104] [105] [96] [106]. The information on Wikidata is still integrated into Wikipedias across multiple languages, often as a source of information in Wikipedia’s infoboxes.

Other projects outside the Gene Wiki initiative also started using Wikidata as a platform for knowledge integration. A list of several projects that use Wikidata as part of their service to their community is given in table ??. There is a movement exploring how Wikidata can be employed to advance Computational Biology and how it can be integrated into the current publication status quo. [107] In that direction, Wikidata is being developed as a platform for scholarly linked open data, mainly via the Scholia platform [108] [109] (https://scholia.toolforge.org/), which provides profiles of pre-templated SPARQL queries for entities like particular authors and articles (e.g. the Scholia profile on Prof. Helder Nakaya, available at https://scholia.toolforge.org/author/Q42614737).

During the COVID-19 pandemic, Wikidata emerged as a hotspot for modelling information about the virus and the pandemic in real time. [117] [wikidata:99196713?] The general scope of the database allowed representation in a shared system of molecular, epidemiologic and socio-economic aspects of the pandemic. [117] [118] Information curated in Wikidata was immediately available, feeding live dashboards and other applications based on SPARQL queries. [119] [120] [121] Additionally, as the information presented on Wikidata is multilingual and collaboratively edited, it presented itself as a resource for constructing structured vocabularies in non-English languages. [122]

In addition to its value as a structured database, Wikidata is tightly connected to Wikipedia. The gene identifiers in the context of Gene Wiki [104] are now fed to Wikipedias across languages, benefitting users directly. Additionally, gene expression information from the Bgee database [123] was added to Wikidata and connected to Wikipedia, which led to a sizeable increase in access to the Bgee database. Currently, Wikipedia is one of the top 3 sources from which people access Bgee (personal communication with Tarcisio Farias), thus leading to direct recognition for integrated databases. More generally, the connections between Wikidata and Wikipedia make it unique in its power to channel knowledge back to human-accessed interfaces. In the words of Matthias Samwald and colleagues [124], “Wikidata could emerge as a community-backed and highly visible structured knowledge base of medical and biological information, bringing concepts and methodologies such as controlled taxonomies, Semantic Web / semantic technologies and ontologies into mainstream use.”

In conclusion, Wikidata’s unique position, robustness and prospects of long-term stability prompt the need for work exploring new ways of integrating it into current knowledge management systems. In light of the speed and breadth of the Human Cell Atlas and the challenges of knowledge representation on cells, this PhD project addresses how Wikidata can play a role in organizing the discoveries about all human cell types.

Objectives

Methodology

This project’s methodology resembles practical research-action practices [125]. The “action” facet is done by contributing to projects in the Human Cell Atlas and knowledge management context.

Research in 3 forms:

- Philosophical investigation on knowledge representations of cell types, both in formal logic and in academic literature
- Applied investigations of database integration and data quality in the context of Wikidata and biomedical ontologies
- Data-driven biomedical research targeted at hypothesis generation and literature-based discovery using knowledge at the level of cell-type

All research branches are linked to the improvement of knowledge management in biomedical sciences, focusing on the Human Cell Atlas. The methods included the development and application of a framework for an organized reading of the scientific literature, providing contact with the different facets of biocuration and Human Cell Atlas-related research.

Organized reading

As much of the project is based directly on published research, we developed a reading framework, described in detail in the results section. The framework is based on GitHub and includes Python scripts, a file organizing the reading list, and another documenting the reading history in RDF. Notes and additional information are saved in a GitHub repository, and the structured information powers a live website with analytics on the user’s recent readings. The source code for Wikidata Bib is available at https://github.com/lubianat/wikidata_bib/tree/template and notes on my readings can currently be accessed at https://lubianat.github.io/wikidata_bib/.

Additionally, the methodology included a discipline of reading that entails the daily task of reading 2 articles, one about “cell types” and another about “biocuration”. The articles are obtained by a mixed manual and automatic approach, including a la carte selection of articles to read alongside Wikidata queries for Cell, Nature, Science and eLife papers about single-cell transcriptomics (query: https://w.wiki/4LHr) and papers on biocuration (query: https://w.wiki/4LHi).

Biocuration of cell classes for Wikidata

For each article about cell types read, cell types previously absent on Wikidata are added via a combination of curation in a Google Spreadsheet and a custom Python script (https://github.com/lubianat/wikidata_markers/tree/master/curation_of_classes).

Wikidata updates

Properties, which link items to values, need community approval. Under the scope of this PhD project, we obtained community approval for two properties: Cell Ontology ID (https://www.wikidata.org/wiki/Property:P7963), used to link cell types to their IDs in the Cell Ontology, and has marker (https://www.wikidata.org/wiki/Property:P8872), used to link cell types to genes and proteins considered their markers.
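As a minimal sketch of how the two properties can be used together, the snippet below builds a SPARQL query that lists items with a Cell Ontology ID (P7963) and, when present, their markers (P8872). The function name and the `limit` parameter are illustrative assumptions and are not part of the project's code.

```python
def cell_types_with_cl_ids_query(limit=10):
    """Build a SPARQL query for items with a Cell Ontology ID and optional markers.

    The property IDs come from the text above; everything else is illustrative.
    """
    return f"""
SELECT ?cellType ?cellTypeLabel ?clId ?markerLabel WHERE {{
  ?cellType wdt:P7963 ?clId .                    # Cell Ontology ID
  OPTIONAL {{ ?cellType wdt:P8872 ?marker . }}   # has marker
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en" . }}
}}
LIMIT {int(limit)}
""".strip()


if __name__ == "__main__":
    # Paste the printed query into https://query.wikidata.org/ to run it.
    print(cell_types_with_cl_ids_query(limit=5))
```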

Cell Ontology participation

As part of the research-action process, I have joined the Cell Ontology working group. I participate in the monthly meetings and sporadic workshops, learning and contributing to the discussions. Additionally, I contribute to the ontology development, actively engaging in the Cell Ontology GitHub repository (https://github.com/obophenotype/cell-ontology) and contributing new terms and assertions. I edit the ontology with the ontology editor Protégé v. 5.5.0 [126].

Status of cell type info on Wikidata and the Cell Ontology

Preliminary results

Concept of cell types

General work on the concept of cell type

As an initial step of this PhD project, we investigated the definition of “cell type” for knowledge management on Wikidata. The definition of “cell type” is a topic of avid debate [9,10,11,12,13,14,15,16,17,18,19,20]. Before handling data on a large scale, we dedicated time to solidifying our theoretical basis of “cell type”.

In a preprint derived from this PhD project [127], we proposed naming conventions for different classes of cell types. Much of the literature uses the same names for single species (e.g., when dealing with a cell type as an evolutionary unit) or multiple species (e.g., in the Cell Ontology). We find it helpful to distil these different cases into different categories. Given the importance of the species concept in biological classification [128], we derive a species-centric view on naming classes of cell types. Using the notion of “taxonomic scope” as the taxonomic range in which instances of the class are expected to materialize, we propose 4 classes (Figure 4):

By specifying 4 different categories of cell types, we can sharpen our organization. While resources often do not make the taxonomic scope explicit, it can usually be inferred. For example, the vast majority of biomedical articles present experiments with a single strain or a single species. While two articles using different species might call some cells by the same name, the cells have important intrinsic differences. Even if functionally equivalent, a mouse fibroblast can never become a human fibroblast, no matter the protocol.
While implicitly obvious, for computer-based, large scale knowledge management, it becomes necessary to state the obvious.

The division between archetypes and sensu stricto cell types is of particular importance for biocuration and data annotation. Usually, names and identifiers for genes and proteins are standardized for single species. [129] Thus, if we want to annotate marker genes, we would do better to associate them with a species-specific cell type (a sensu stricto cell type) instead of making a vaguer association with a species-neutral type. Of note, current scRNA-seq reference datasets and databases still use species-neutral cell type IDs (e.g. in the reference HuBMAP app; https://azimuth.hubmapconsortium.org/references/).

Our theoretical discussion on the notion of “cell type” extends the current state of the art and introduces new ways to organize our knowledge about cells. The technotype and the infratype are currently purely theoretical constructs, and almost no resources deal with cell types at the level of strains or below. While we reason that this level of granularity would provide a more precise description of research projects, these constructs are still far from being applicable in the present and are presented as tools for the future. The division between archetypes and sensu stricto cell types, on the other hand, was of immediate value. As an example, it was instrumental for the integration of the Panglao database of cell markers into Wikidata, described in the following section of the results.

A simplified definition

While theoretical discussions on the notion of “cell type” are important, a reasonable consensus is likely to take longer than the duration of this PhD project. Here we adopt a liberal view of cell type, defining, for our purposes, a cell type as any class of cells described by a domain expert with evidence of the reality of its instances. The requirement of evidence of existence in the material world is based on the principle of instantiation of ontological realism [130]. Barry Smith and Werner Ceusters state: “A term should be included in a reference ontology only if there is experimental evidence that instances to which that term refers exist in reality. (‘Exists’ here should be understood in a tenseless sense in order to accommodate, for example, universals pertaining to extinct species as well as universals such as swarm or hurricane which are instantiated only intermittently.)” Following their advice, our minimum requirement for a cell type is public evidence for materializations of its instances.

By “type”, or “class”, we mean an abstract entity in the sense intended by the multilevel theory (MLT) of conceptual modelling [131]. Figure 5 displays a simplified version of MLT adopted throughout this project. In this framework, real-world entities are materializations of individuals. Individuals are theoretical constructs that (1) are thought to exist or have existed, as per the principle of instantiation, and (2) refer to only one material entity at any point in time. For example, Wikidata’s entries for “Helder Nakaya (Q42614737)” and “Charles Darwin (Q1035)” are considered individuals by Multi-Level Theory. Other examples of individuals include “Albert Einstein’s brain (Q2464312)” and the “Christ the Redeemer statue (Q79961)”.

In MLT, individuals are instances of some classes. For example, both “Helder Nakaya (Q42614737)” and “Charles Darwin (Q1035)” could be represented as instances of the class “Homo sapiens (Q15978631)”. “Homo sapiens (Q15978631)” is only one of the classes that those individuals belong to. Another one is “animal (Q729)”. As all instances of “Homo sapiens (Q15978631)” are also instances of “animal (Q729)”, “Homo sapiens (Q15978631)” is a subclass of “animal (Q729)”. It is possible to continue the hierarchy of subclasses: “animal (Q729)” is a subclass of “organism (Q7239)”, and so on until the root class “entity (Q35120)”.

Classes can also behave as individuals in some aspects. For example, both “Homo sapiens (Q15978631)” and “animal (Q729)” are instances of “taxon (Q16521)”. “Taxon (Q16521)”, thus, is a metaclass, or, more precisely, a 1st-order metaclass. Other examples of metaclasses are “species (Q7432)” and “phylum (Q38348)”. These, in turn, are instances of “taxonomic rank (Q427626)”, a 2nd-order metaclass.
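The relations described above can be illustrated with a small, self-contained sketch. It encodes a few of the statements from the text as triples using Wikidata's “instance of” (P31) and “subclass of” (P279) properties, and derives (1) inherited class membership and (2) a level for each entity (0 for individuals, 1 for classes, 2 for 1st-order metaclasses). The triple set and function names are illustrative assumptions, not a formal implementation of MLT.

```python
INSTANCE_OF = "P31"
SUBCLASS_OF = "P279"

# A few statements from the text, written as (subject, property, object).
TRIPLES = {
    ("Charles Darwin (Q1035)", INSTANCE_OF, "Homo sapiens (Q15978631)"),
    ("Homo sapiens (Q15978631)", SUBCLASS_OF, "animal (Q729)"),
    ("animal (Q729)", SUBCLASS_OF, "organism (Q7239)"),
    ("Homo sapiens (Q15978631)", INSTANCE_OF, "taxon (Q16521)"),
    ("animal (Q729)", INSTANCE_OF, "taxon (Q16521)"),
    ("zygote of Mahatma Gandhi", INSTANCE_OF, "zygote (Q170145)"),
    ("zygote (Q170145)", INSTANCE_OF, "cell type (Q189118)"),
}


def superclasses(cls, triples=TRIPLES):
    """Return every class reachable from `cls` through 'subclass of'."""
    found, frontier = set(), [cls]
    while frontier:
        current = frontier.pop()
        for subject, prop, obj in triples:
            if subject == current and prop == SUBCLASS_OF and obj not in found:
                found.add(obj)
                frontier.append(obj)
    return found


def is_instance_of(entity, cls, triples=TRIPLES):
    """An instance of a class is also an instance of all its superclasses."""
    direct = {o for s, p, o in triples if s == entity and p == INSTANCE_OF}
    return any(cls == d or cls in superclasses(d, triples) for d in direct)


def level(entity, triples=TRIPLES):
    """0 = individual, 1 = class, 2 = 1st-order metaclass, and so on."""
    subjects = {s for s, _, _ in triples}
    instances = [s for s in subjects if is_instance_of(s, entity, triples)]
    if not instances:
        return 0
    return 1 + max(level(i, triples) for i in instances)
```

With these toy triples, `level("taxon (Q16521)")` and `level("cell type (Q189118)")` both evaluate to 2, matching their description as 1st-order metaclasses.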

Figure 5B shows a proposal for applying this version of MLT to cell types. As individual cells are rarely named, for the sake of example, we can consider the “zygote of Mahatma Gandhi” as an individual in the theoretical system, an instance of the class “zygote (Q170145)”. “Zygote (Q170145)” is an instance of the metaclass “cell type (Q189118)”. A more concrete example comes from single-cell RNA-sequencing datasets, where there are barcodes for each cell in a particular sample. Each barcode labels an individual. Thus, labelling single cells is a process of identification, where each individual is connected to a class of interest.

We avoid dissecting the differences between persistent classes of cells (often called “cell types”) and transient, fugacious classes of cells (often called “cell states”) (see the “Definition of cell identity” section in [132] for an example). We also consider only the cell as it was observed in an experiment, not necessarily the future conditions of any cell (i.e., the “cell fate”). [35]
Even though such a distinction is an important topic for theoretical research, it is outside the initial scope of this work.

Another consequence of this framing is that “subtype” becomes redundant with the idea of cell type; the two differ only stylistically. The notion of subtype only makes sense in the context of superclasses, with both belonging to the same kingdom (i.e. the kingdom of classes).

We opted to frame our work around the term “cell type” due to its historical usage and familiarity for the life sciences community. The term “cell class” is also used in the literature and is a suitable synonym for our notion of cell type. Other related terms present semantic ambiguities and were avoided whenever possible. For example, terms such as “cell set,” “cell population,” and “cell cluster” can evoke a specific, countable group of cells, frequently from the same experiment. The term “cell identity” has also been suggested for avoiding the cell type/cell state dilemma [46], but we avoid it to emphasize a nominalistic perspective (in the Popperian sense [133]). In doing so, we reinforce the intent to represent the cell types reported to exist instead of stating bluntly which cell types exist or are essential for human beings.

The employment of MLT and species-specific cell types is instrumental for the next chapters of this work. In the chapter about the PanglaoDB integration, we describe how we applied the theory to add marker information to Wikidata and cleaned up conceptual disarray throughout the platform. Later, in the chapter about Wikidata Bib, we describe how we performed a large-scale curation of the biomedical literature for new cell types, using the theories discussed here as a starting principle.

PanglaoDB integration to Wikidata

Introduction

Biomedical databases gather structured information for end users. They exist in different states of maintenance, and reconciling cell-oriented databases to Wikidata has the potential to increase interoperability and multiply the value of previous biocuration efforts. PanglaoDB [134] [135] is a publicly available database that contains data and metadata on hundreds of single-cell RNA sequencing experiments. It provides extensive information on cell types, genes, tissues, and cell type markers, obtained via automatic and manual methods. It also displays a rich web user interface for easy data acquisition, including database dumps for bulk downloads.

As of 8 December 2021, the article describing PanglaoDB had been cited 230 times. Despite its use by the community, the database is in the 3-star category for Linked Open Data [136], as it uses neither the open semantic standards from W3C (RDF and SPARQL) needed for a 4-star rank nor the links to external data via standard identifiers that make datasets 5-star. Improving the data format is a valuable step in making biological knowledge FAIR (Findable, Accessible, Interoperable, and Reusable). Thus, we provide a case study of making PanglaoDB available in a 5-star Linked Open Data format on Wikidata.

As of August 2020, Wikidata had 264 items categorized as a “cell type”, considerably fewer than the Cell Ontology, which counts over two thousand cell types [52,137]. Strikingly, there were also 23 items categorized as instances of “cell (Q7868)”. This classification is imprecise, as an instance of “cell” would be an individual, named cell from a single named individual. It is an example of the conceptual disarray that often occurs on Wikidata. [138]

Wikidata editors often mix 1st-order classes such as “cells” and “organs” with metaclasses like “cell types” and “organ types”. As mentioned in the chapter on the concept of cell type and Multi-Level Theory, individuals like the “Dolly sheep zygote” and the “brain of Albert Einstein” are instances of classes like “zygote” and “brain”, respectively. Classes like “zygote” and “brain” are, in turn, instances of metaclasses like “cell type” or “organ type”.

We diligently fixed and improved the conceptual consistency of cell type entries on Wikidata. As of 8 December 2021, the Wikidata database contains 2834 instances of “cell type” (see current status at https://w.wiki/b2t) and 0 instances of “cell” (https://w.wiki/4XAg), highlighting the improvements in both quantity and quality. This increase stems from the PanglaoDB integration (around 430 new types) and the Wikidata Bib curation described later.

Methodology for PanglaoDB integration

After obtaining approval from the database owners, we matched genes and cell types to Wikidata and performed Wikidata queries to demonstrate the value of the approach. An overview of the process is shown in the accompanying figure.

Overview of how data from a database is integrated into Wikidata

Class creation on Wikidata

Items corresponding to species-neutral cell classes were curated on Wikidata using its graphical user interface. A manually curated dictionary matching terms in PanglaoDB to Wikidata identifiers was assembled and used for integration. Cell types that were not represented on Wikidata were added to the database via the graphical user interface (https://www.wikidata.org/wiki/Special:NewItem) and logged in the reference table.

Species-specific cell types for human and mouse were created for every entry in the reference table and connected to the species-neutral concept via a “subclass of” property (e.g. every single “human neutrophil” is also a “neutrophil”). Our approach was analogous to the one taken by the CELDA ontology to create species-specific cell types. [139]
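The derivation of species-specific classes from a species-neutral entry can be sketched as follows. This is only an illustration of the modelling pattern: the `CellClassDraft` structure, the use of a separate taxon field, and the placeholder identifiers `Q_NEUTROPHIL` and `Q_MOUSE` are assumptions made for the example (only the Homo sapiens QID appears in the text).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CellClassDraft:
    """A draft species-specific class, ready for review before upload."""
    label: str
    subclass_of: str  # QID of the species-neutral cell type
    taxon: str        # QID of the species (assumed modelling choice)


def species_specific_drafts(neutral_label, neutral_qid, taxa):
    """Create one draft per species, e.g. 'human neutrophil' from 'neutrophil'."""
    return [
        CellClassDraft(f"{prefix} {neutral_label}", neutral_qid, taxon_qid)
        for prefix, taxon_qid in taxa.items()
    ]


if __name__ == "__main__":
    taxa = {"human": "Q15978631", "mouse": "Q_MOUSE"}  # Q_MOUSE is a placeholder
    for draft in species_specific_drafts("neutrophil", "Q_NEUTROPHIL", taxa):
        print(draft)
```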

Integration of PanglaoDB to Wikidata

After receiving authorization by e-mail from the PanglaoDB developer, Oscar Franzen, the PanglaoDB markers dataset was downloaded manually from PanglaoDB’s website (https://panglaodb.se/markers/PanglaoDB_markers_27_Mar_2020.tsv.gz) for integration. It contains 15 columns and 8256 rows. Only the columns species, official gene symbol, and cell type were used for the reconciliation. The reconciled dataset was uploaded to Wikidata via the WikidataIntegrator Python package [95], a wrapper for the Wikidata Application Programming Interface.
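A minimal sketch of the first reconciliation step is shown below: reading the three columns named above and emitting one (species, cell type, gene symbol) record per species. The column names follow the text; the species codes (`Hs`, `Mm`, possibly combined in one field) and the sample rows are assumptions about the file layout, and the upload through WikidataIntegrator is deliberately left out.

```python
import csv
import io

# Assumed species codes in the "species" column; adjust to the real file.
SPECIES_PREFIX = {"Hs": "human", "Mm": "mouse"}


def marker_rows(tsv_text):
    """Return (species, cell type, gene symbol) tuples from a markers TSV."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    rows = []
    for record in reader:
        # A single row may list more than one species, e.g. "Mm Hs".
        for code in record["species"].split():
            species = SPECIES_PREFIX.get(code)
            if species is not None:
                rows.append(
                    (species, record["cell type"], record["official gene symbol"])
                )
    return rows


if __name__ == "__main__":
    sample = (
        "species\tofficial gene symbol\tcell type\n"
        "Mm Hs\tCHAT\tCholinergic neurons\n"
        "Hs\tACHE\tCholinergic neurons\n"
    )
    for row in marker_rows(sample):
        print(row)
```

Each tuple can then be matched against the manually curated dictionaries of cell types and genes before any statement is written to Wikidata.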

SPARQL queries

Besides the Wikidata dumps, Wikidata provides a SPARQL endpoint with a graphical user interface (https://query.wikidata.org/). Updated data was immediately accessible via this endpoint, enabling queries that integrate the new statements with statements from other databases.
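For programmatic access, the same endpoint can be queried over HTTP. The sketch below uses only the Python standard library, sends a query to https://query.wikidata.org/sparql, and flattens the standard SPARQL JSON results into plain dictionaries. The function names and the User-Agent string are illustrative assumptions; this is not the code used in the project.

```python
import json
import urllib.parse
import urllib.request

ENDPOINT = "https://query.wikidata.org/sparql"


def build_request(query, user_agent="example-script/0.1 (illustrative)"):
    """Prepare a GET request asking the endpoint for JSON results."""
    url = ENDPOINT + "?" + urllib.parse.urlencode({"query": query, "format": "json"})
    headers = {"User-Agent": user_agent, "Accept": "application/sparql-results+json"}
    return urllib.request.Request(url, headers=headers)


def bindings_to_rows(payload):
    """Flatten SPARQL JSON results into a list of {variable: value} dicts."""
    return [
        {var: cell["value"] for var, cell in binding.items()}
        for binding in payload["results"]["bindings"]
    ]


def run_query(query):
    """Send the query and return flattened rows (requires network access)."""
    with urllib.request.urlopen(build_request(query)) as response:
        return bindings_to_rows(json.load(response))
```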

Overview of integrated information on Wikidata

Cell Marker information on Wikidata

Adding marker information on Wikidata was not possible before this study and became possible after we proposed and obtained community approval for the property “has marker” (P8872). Figure 6 shows 2 of the current markers of “human cholinergic neuron” (Q101405051), CHAT and ACHE, as they are seen on Wikidata. PanglaoDB is referenced both via a URL to the website (https://panglaodb.se/markers.html) and via a pointer to the PanglaoDB item on Wikidata, Q99936939.

Now that we have re-formatted the markers from PanglaoDB as Linked Open Data, we can make queries that were not possible before, including federated queries with other biological databases, such as UniProt [140] and WikiPathways [141]. Due to previous similar reconciliation projects, Wikidata already contains information about genes, including their relations to Gene Ontology (GO) terms.

PanglaoDB’s integration into the Wikidata ecosystem allows us to ask various questions (Figure 7).

“Which human cell types are related to neurogenesis via their markers?”

As expected, the query below retrieved a series of neuron types, such as “human Purkinje neuron” and “human Cajal-Retzius cell.” It also retrieved non-neural cell types such as the “human loop of Henle cell”, a kidney cell type, and “human osteoclast”. These seemingly unrelated cell types markedly express genes involved in neurogenesis, but that does not mean that they are involved in this process. The unexpected results reinforce that one needs to be careful when using curated pathways to analyze gene sets, as false positives abound.

The molecular processes in which gene products take part depend on the cell type. SPARQL allows us to seamlessly compare Gene Ontology processes with cell marker data, providing a sandbox to generate hypotheses and explore the biomedical knowledge landscape.

“Which cell types express markers associated with Parkinson’s disease?”

Besides integration with Gene Ontology, Wikidata reconciliation connects the marker gene info on PanglaoDB with disease markers.

Disease genes are often compiled from Genome-Wide Association Studies, which look for sequence variation in the DNA. These studies are commonly blind to the cell types related to the pathophysiology of the disease. In the query below, we can see cell types marked by genes genetically associated with Parkinson’s disease. Even considering the false positives, the overview can aid domain experts in coming up with novel hypotheses.
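Both questions above share the same shape: start from cell types with “has marker” (P8872) statements and follow the marker gene to an entity of interest. The following is a hedged sketch of such queries, not the exact queries used to produce the results. It assumes that species-specific cell types carry a “found in taxon” (P703) statement, that GO processes are reached through “encodes” (P688) and “biological process” (P682), and that disease links use “genetic association” (P2293). The target QID is left as a parameter, since the identifiers for neurogenesis and Parkinson’s disease are not given in the text.

```python
# Triple patterns linking a marker gene to the entity of interest.
GENE_LINKS = {
    # gene -> protein (P688, encodes) -> GO biological process (P682)
    "process": "?gene wdt:P688 ?protein . ?protein wdt:P682 wd:{target} .",
    # gene -> disease (P2293, genetic association)
    "disease": "?gene wdt:P2293 wd:{target} .",
}


def marker_query(kind, target_qid, taxon_qid="Q15978631"):
    """Build a query for cell types whose markers are linked to `target_qid`.

    `kind` is "process" or "disease"; the default taxon is Homo sapiens.
    """
    link = GENE_LINKS[kind].format(target=target_qid)
    return f"""
SELECT DISTINCT ?cellType ?cellTypeLabel ?geneLabel WHERE {{
  ?cellType wdt:P8872 ?gene .
  ?cellType wdt:P703 wd:{taxon_qid} .
  {link}
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en" . }}
}}
""".strip()
```

Calling `marker_query("disease", ...)` with the Wikidata QID of a disease, or `marker_query("process", ...)` with the QID of a GO term item, yields a query that can be pasted into the Wikidata Query Service.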

Discussion and conclusion

In this part of the PhD project, we re-released the knowledge curated in PanglaoDB on Wikidata, connecting it to the semantic web. Each cell-type/marker statement was added to Wikidata with a pointer to PanglaoDB and a citation of the article, providing proper provenance. Based on the theoretical considerations on the concept of cell type, we added species-specific terms to Wikidata for cell types of Homo sapiens and Mus musculus described in the PanglaoDB database.

This work exemplifies the power of releasing Linked Open Data via Wikidata and provides the biomedical community with the first semantically accessible, 5-star LOD dataset of cell markers, easily reachable from Wikidata’s SPARQL Query Service (https://query.wikidata.org/). Alongside other case studies of biomedical data integration into Wikidata (see [118]), it contributes tools and practices that can serve as a basis for contributors. The work also paves the way for the reconciliation of other databases of cell-type markers, such as CellMarker [51], Labome [142], CellFinder [137] and SHOGoiN/CELLPEDIA [143] (if the owners give proper authorization). The approach we took here can be applied to any knowledge set of public interest, providing a low-cost and low-barrier platform for sharing biocurated knowledge in a gold-standard format.

Wikidata Bib and a professional system for biocuration

Introduction

Reading scientific articles is an integral part of the routine of modern scientists. Although several literature-management tools are available [144], the process of reading is mainly artisanal. There are no standard guidelines for biomedical researchers on how to probe the literature and organize notes. Thus, while reading and studying are core activities, there are few (if any) protocols for the efficient screening of scientific articles.

Other professional traditions have dealt with similar issues in the past. Notetaking is vital to keep track of financial balances and avoid costly problems in accounting. Double-entry bookkeeping was developed in the 13th century as a professional solution for notetaking in accounting where “every entry to an account requires a corresponding and opposite entry to a different account.” [145, =Double-entry_bookkeeping&oldid=1055066428] In software development, Test-Driven Development (TDD) is a popular methodology where tests for code snippets are written before the code itself, therefore ensuring that written software passes minimum quality standards. The similarities between double-entry bookkeeping and TDD are diverse [146], but for our purposes it suffices to see both as professionalized systems that promote better quality and accountability of work.

In the humanities, there is a well-established practice of annotating readings. Annotation skills are part of standard academic training in the humanities [147][url?]. An influential work presenting methods for academic reading in the humanities is Umberto Eco’s book “How to Write a Thesis” [148], which outlines not only how to annotate the literature that underpins an academic thesis, but also why to do so. The book, originally written in 1977, is still influential today. Still, its theoretical scope (roughly the humanities) and its date preceding the digital era limit the extent to which it applies to the biomedical sciences.

Notably, the need for an organized reading system for biocuration studies stems from a difference in methodology. In the humanities, the main (if not sole) research material is the written text, the books and articles from which research stems [url?]. In the biomedical sciences, including a large part of bioinformatics, the object of study is the natural world, observed via experimentation. Thus, naturally, scientific training focuses on the theoretical and practical basis of experimentation and data analysis. With the boom of scientific articles, however, the scientific literature (and accompanying public datasets) already provides solid material for sculpting scientific projects. Thus, there is a need for a methodology for academic reading tailored to the digital environment.

This chapter concerns itself with presenting Wikidata Bib, a framework for large scale reading of scientific articles. It is presented in three parts, each with a technical overview alongside the theoretical foundations. First, Wikidata Bib is presented as a reading system for managing references and notes using a GitHub repository and plain text notes. Then, we present how the system ensures accountability, allowing users to get personalized analytics on their reading patterns. Finally, we demonstrate how Wikidata Bib fits an active curation environment, connecting the framework with the larger goal of this project of curating information about cell types on Wikidata.

Wikidata Bib as a reading system

The reading framework of Wikidata Bib is built upon a git repository integrated with GitHub, Python 3 scripts and SPARQL queries. It has a standard file structure, summarized as the following:

The docs/ directory contains the live dashboard of the readings, which is discussed in the following sections. The downloads/ directory hosts the PDFs of the articles read with the system. These are not committed to the repository and are only stored locally. The notes/ directory contains markdown files, one for each article read. The src/ directory contains the Python code with the system’s mechanics, including helper functions for the command-line commands discussed below:

- wread, which receives a Wikidata QID for an article and outputs (1) a notes document, (2) a PDF of the paper obtained from Unpaywall [149] and (3) an updated version of the dashboard HTML files in the docs/ directory
- pop, which “pops” an article from toread.md and runs wread for it
- wadd, which takes a URL for a Wikidata SPARQL query and adds new QIDs to toread.md
- wadd_all, which parses config.yaml for recurrent SPARQL queries and runs wadd for each
- wlog, which adds, commits and pushes recent readings and dashboard updates to GitHub

All the structures described so far are shared by every user of Wikidata Bib. To personalize the system, the user edits three plain text files:

- toread.md hosts plain text QIDs of the articles that will be read. These can be added either manually or via wadd. While the pop command only sees QIDs, article titles or other identifiers can temporarily be added to toread.md without breaking the system.
- index.md hosts a numbered list of topics of interest. This file plays the role of Umberto Eco’s work plan, with the topics of interest for the academic. [148] These are used to tag articles for retrieval in a later step.
- config.yaml contains shortcuts for different reading lists.

The shortcuts are better explained by example. In my toread.md file there are two reading lists, one following a # Cell types header and another following a # Biocuration header. My config.yaml contains the following snippet:

The config.yaml shortcuts are used as arguments by the pop command, where $ ./pop ct retrieves an article from the “Cell types” list, while $ ./pop bioc retrieves an article from the “Biocuration” list.
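The behaviour of pop can be illustrated with a short sketch. It is not the code in the Wikidata Bib repository: the layout of toread.md (one QID per line under “# ” headers), the shortcut mapping and the function name are assumptions made for the example. Lines that are not QIDs, such as temporary titles, are skipped, in line with the description above.

```python
import re

QID_PATTERN = re.compile(r"^Q\d+$")

# Hypothetical mapping from command-line shortcuts to toread.md headers.
SHORTCUTS = {"ct": "Cell types", "bioc": "Biocuration"}


def pop_qid(toread_text, header):
    """Remove and return the first QID listed under `header`.

    Returns (qid, updated_text); qid is None when the list has no QID left.
    """
    lines = toread_text.splitlines()
    in_section = False
    for index, line in enumerate(lines):
        stripped = line.strip()
        if stripped.startswith("# "):
            in_section = stripped[2:].strip() == header
        elif in_section and QID_PATTERN.match(stripped):
            remaining = lines[:index] + lines[index + 1:]
            return stripped, "\n".join(remaining) + "\n"
    return None, toread_text


if __name__ == "__main__":
    text = "# Cell types\nA title kept as a reminder\nQ111\nQ222\n# Biocuration\nQ333\n"
    qid, text = pop_qid(text, SHORTCUTS["ct"])
    print(qid)   # the QID that would be handed to wread
    print(text)
```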

The Wikidata Bib framework is coupled with a discipline of daily reading. The discipline is inspired by Robert Cecil Martin’s description of Test-Driven Development in the book “Clean Code”, which includes not only a technical description but also a school of thought on how software development might be approached. [150] Every day, I read one article from each list, using the notetaking station displayed in Figure [fig?]: notetaking. The constancy of reading allows steady coverage of the relevant literature. While the discipline has worked for this research project, it is not required to use the Wikidata Bib system.

The notetaking station of Wikidata Bib, opened in Visual Studio Code, is depicted in Figure [fig?]: notetaking A. The title and publication dates are displayed, and the reading process entails copying snippets from the text to the “Highlights” section. Copying the highlights into plain text makes the sections of interest searchable via the command line using grep (https://en.wikipedia.org/w/index.php?title=Grep&oldid=1039541979). Comments can be added either in the comment section or inline, alongside the highlights, using --> Comment goes here to differentiate them from highlights. Also searchable by grep are the tags, copied and pasted from index.md into the ## Tags section or alongside the main article.
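The same kind of search can be done from Python when grep is not at hand. The sketch below scans markdown notes for a term and flags whether each hit contains an inline comment, assuming the --> marker described above; the function name and the returned tuple layout are illustrative only.

```python
from pathlib import Path


def search_notes(notes_dir, term):
    """Case-insensitive search over the markdown notes in `notes_dir`.

    Returns (file name, line number, kind, line) tuples, where kind is
    "comment" for lines containing the inline marker "-->" and "text" otherwise.
    """
    hits = []
    for path in sorted(Path(notes_dir).glob("*.md")):
        lines = path.read_text(encoding="utf-8").splitlines()
        for number, line in enumerate(lines, start=1):
            if term.lower() in line.lower():
                kind = "comment" if "-->" in line else "text"
                hits.append((path.name, number, kind, line.strip()))
    return hits


if __name__ == "__main__":
    for hit in search_notes("notes", "cell type"):
        print(*hit, sep="\t")
```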

The discipline also includes, whenever possible, an improvement of the metadata about the article on Wikidata. The links included in the dashboard are shown in [fig?]: notetaking B. A link to a Scholia [108] profile allows identification of related articles from a series of pre-made SPARQL queries probing bibliographic data on Wikidata. While Scholia provides an overview of a given article, it does not allow direct curation of the metadata. For that, two links are provided, one to Wikidata and one to Author Disambiguator [151]. By accessing the Wikidata page for the entity, one can add new triples, for example, curating authors and topics of the article, which are then used by Scholia and by Wikidata Bib’s dashboard. Author Disambiguator is a wrapper of a Wikimedia API that facilitates disambiguating author names to unique identifiers on Wikidata, thus feeding the public knowledge graph of publications and authors.
Finally, a link to the article’s DOI or full-text URL is provided and serves as a fallback when the automatic download fails. Of note, while the metadata curation has a technical benefit to Wikidata and the dashboard, it also plays a theoretical role. By curating metadata on authors, the user of Wikidata Bib can better understand the people they read, and expand their metascientific perspective on their domain of interest.

Wikidata Bib as a dashboard

The Wikidata Bib system also enables the reader to get statistics on their readings. Two simple databases are stored in the GitHub repository:

- read.ttl - An RDF document recording the dates on which each article was read.
- read.csv - A simple, human-readable index connecting QIDs with article titles.

The CSV file is only stored for accountability and as a quick way to glance at the titles read. The .ttl file, on the other hand, is processed by the update_dashboard.py script to render 4 different HTML files under the docs/ folder:

- index.html
- last_day.html
- past_week.html
- past_month.html

All files are served via GitHub Pages. In the case of this work, they are displayed at https://lubianat.github.io/wikidata_bib/.
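The split of the reading log into the four pages can be sketched as below. The sketch assumes that the log has already been parsed into (QID, date) pairs, since the vocabulary used in read.ttl is not detailed here, and that the windows span 1, 7 and 30 days; both are assumptions, as is the function name.

```python
from datetime import date, timedelta

# Assumed window lengths, in days, for each dashboard page.
WINDOWS = {"last_day": 1, "past_week": 7, "past_month": 30}


def split_by_window(read_log, today):
    """Group read articles by dashboard page.

    `read_log` is an iterable of (qid, date) pairs; the result maps each page
    name to the sorted QIDs read within its window ("index" holds them all).
    """
    read_log = list(read_log)
    pages = {"index": sorted({qid for qid, _ in read_log})}
    for page, days in WINDOWS.items():
        cutoff = today - timedelta(days=days)
        pages[page] = sorted({qid for qid, day in read_log if day >= cutoff})
    return pages


if __name__ == "__main__":
    log = [("Q1", date(2021, 12, 8)), ("Q2", date(2021, 12, 3))]
    print(split_by_window(log, today=date(2021, 12, 8)))
```

Each list of QIDs would then be handed to the dashboard renderer to produce the corresponding HTML file.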

To organize the code for rendering the dashboard, we created a Python package, wbib, and deposited it in PyPI, making it available via pip [152]. The package implements the logic for rendering complex Wikidata-based academic dashboards and is available on GitHub at https://github.com/lubianat/wbib. It allows the user to build dashboards based on Wikidata records of information such as the gender of authors, the region of authors’ institutions, topics of articles and similar metascientific information. The dashboard is composed of SPARQL queries written for the Wikidata Query Service [87]. It also allows users to feed an arbitrary list of articles and obtain a custom dashboard. Wikidata Bib obtains the HTML dashboards after feeding wbib the lists of articles read in total (index.html) or in pre-determined time spans (last_day.html, past_week.html and past_month.html).

The dashboard includes not only a basic list of read articles, but also statistics on the most-read authors and most-read venues. It also displays an interactive map of the institutions of the articles read, permitting a glance at geographic biases in reading activity. An example of the queries is shown in Figure 9. As the queries are rendered live, they evolve in quality with the growth of Wikidata. Finally, the clean 5-star open data format enables users to adapt the queries to include different aspects of Wikidata. For example, Table 4 showcases 10 articles that (1) I have read in the past year and (2) were authored by a speaker of the 1st Human Cell Atlas Latin America Single Cell RNA-seq Data Analysis Workshop [153]. One practical application that the dashboard enables, thus, is to identify people at an event, institution or location whom the user has read before, therefore catalyzing the possibility of collaborations. Anecdotally, this strategy was tested successfully at BioHackathon Europe 2021 [154], where I used the system both to identify possible collaborators and as a conversation starter.
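As an illustration of the kind of query behind the “most read authors” panel, the sketch below builds a SPARQL query that counts articles per author for a given list of read articles, using a VALUES clause and Wikidata's “author” property (P50). It is a simplified stand-in written for this text, not the query shipped with wbib, and the function name is an assumption.

```python
import re

QID_PATTERN = re.compile(r"^Q\d+$")


def most_read_authors_query(article_qids, limit=10):
    """Build a query ranking authors by number of read articles."""
    invalid = [qid for qid in article_qids if not QID_PATTERN.match(qid)]
    if invalid:
        raise ValueError(f"Not valid QIDs: {invalid}")
    values = " ".join(f"wd:{qid}" for qid in article_qids)
    return f"""
SELECT ?author ?authorLabel (COUNT(DISTINCT ?article) AS ?articles) WHERE {{
  VALUES ?article {{ {values} }}
  ?article wdt:P50 ?author .
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en" . }}
}}
GROUP BY ?author ?authorLabel
ORDER BY DESC(?articles)
LIMIT {int(limit)}
""".strip()
```

Restricting `?author` further, for example to the speakers of an event, gives the kind of overlap list described above.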

Wikidata Bib for curation of cells to Wikidata

The Wikidata Bib system was devised originally to allow an overview of the fields of cell classification and biocuration. However, during the process, it was also repurposed for biocuration of new cell classes in Wikidata. By fast-tracking the reading of new articles, Wikidata Bib enables an efficient parsing of the literature and, thus, the identification of previously uncatalogued cell types.

Articles read with Wikidata Bib were screened for mentions of cell types absent from Wikidata. As discussed in the chapter about the concept of cell type, we considered a “cell type” to be any class of cells described by a domain expert with evidence of the reality of its instances. When a mention of such a class appears in an article, I first check Wikidata for the existence of a related class. If it is absent from the platform, I enter a class name, alongside a superclass and a QID, in a Google Spreadsheet, as shown in Figure 10.

The information from the spreadsheet is pulled by a Python script and processed locally with a series of dictionaries that match common terms to Wikidata IDs. In the example shown in Figure 10, the string “endothelial cell” was matched against a manually curated dictionary to the Wikidata entry Q11394395, the representation of that concept on Wikidata. After reconciling the data, the script uses the Wikidata Integrator Python package [95] to insert the new entries into Wikidata. The code for integrating a Google Spreadsheet with Wikidata is available at https://github.com/lubianat/wikidata_cell_curation.
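The dictionary-based reconciliation step can be sketched as below. This is a simplified illustration and not the code in the repository above: the row keys ("label", "superclass") and the function names are hypothetical, and the only mapping taken from the text is “endothelial cell” to Q11394395. The actual write to Wikidata via Wikidata Integrator is deliberately left out.

```python
# Manually curated dictionary of common terms. Only this one pair comes from the
# text; a real curation dictionary would hold many more entries.
LABEL_TO_QID = {"endothelial cell": "Q11394395"}


def normalize(label):
    """Lowercase a label and collapse whitespace so that lookups are forgiving."""
    return " ".join(label.lower().split())


def reconcile_rows(rows, dictionary):
    """Split spreadsheet rows into those with a known superclass QID and the rest.

    Each row is a dict with hypothetical keys "label" and "superclass".
    """
    matched, unmatched = [], []
    for row in rows:
        qid = dictionary.get(normalize(row["superclass"]))
        if qid is None:
            unmatched.append(row)  # needs manual curation before upload
        else:
            matched.append({**row, "superclass_qid": qid})
    return matched, unmatched


if __name__ == "__main__":
    rows = [
        {"label": "hypothetical new cell class", "superclass": "Endothelial  Cell"},
        {"label": "another hypothetical class", "superclass": "term not in dictionary"},
    ]
    matched, unmatched = reconcile_rows(rows, LABEL_TO_QID)
    print("ready for upload:", matched)
    print("needs manual review:", unmatched)
```

Keeping unmatched rows separate, rather than dropping them, makes it easy to extend the dictionary by hand before the next run.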

Wikidata contains 2940 subclasses of “cell (Q7868)” as of 8 December 2021. Of those, 550 cell classes are specific to humans, and 318 are specific to mice.
For comparison, as of the same date, Wikidata has more cell classes than the Cell Ontology, which lists 2577 classes. It is worth noting that classes in the Cell Ontology are added after careful consideration by ontologists and domain experts and should be considered of higher quality than the ones on Wikidata.
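A count of this kind can be obtained from the Wikidata Query Service. The sketch below builds such a query and the corresponding request URL; it is an assumption about how the number could be computed, not necessarily the query used for the figures above, and the result will differ from 2940 as Wikidata keeps changing. It uses the “subclass of” property (P279) followed transitively.

```python
from urllib.parse import urlencode

# Public SPARQL endpoint of the Wikidata Query Service.
WDQS_ENDPOINT = "https://query.wikidata.org/sparql"


def count_subclasses_query(root_qid="Q7868"):
    """Return a query counting direct and indirect subclasses of `root_qid`."""
    return (
        "SELECT (COUNT(DISTINCT ?cell) AS ?count) WHERE { "
        f"?cell wdt:P279+ wd:{root_qid} . "
        "}"
    )


def query_url(query):
    """Build a GET URL asking the endpoint for JSON results."""
    return WDQS_ENDPOINT + "?" + urlencode({"query": query, "format": "json"})


if __name__ == "__main__":
    # Printing the URL keeps the example offline; fetching it returns the count.
    print(query_url(count_subclasses_query()))
```

The `+` path operator excludes the root item itself, so “cell (Q7868)” is not counted as its own subclass.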

Of the 2940 cell classes on Wikidata, 2812 (95.6%) have been edited in some way by User:TiagoLubiana, and 1668 (56.7%) have been created by User:TiagoLubiana. Edits often consisted of connecting a dangling term, created automatically from a Wikipedia page, to the cell subclass hierarchy, and included adding identifiers, images, markers, and other pieces of information. Of the 1668 entities created, approximately 63 species-neutral cell types, 188 human and 188 mouse cell types were added based on PanglaoDB entries (439 in total). The remaining 1229 entries were created either via Wikidata’s web interface or via the curation workflow described in this chapter. These statistics are a simple demonstration of how the curation system contributes efficiently to the status of cell type information on Wikidata.
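The figures in this paragraph can be re-derived with a few lines of arithmetic. The snippet below only uses the numbers stated above; the variable names are illustrative.

```python
# Counts stated in the text (as of 8 December 2021).
TOTAL_CLASSES = 2940
EDITED = 2812
CREATED = 1668
FROM_PANGLAODB = {"species-neutral": 63, "human": 188, "mouse": 188}


def percentage(part, whole):
    """Return `part` as a percentage of `whole`, rounded to one decimal place."""
    return round(100 * part / whole, 1)


panglaodb_total = sum(FROM_PANGLAODB.values())
created_otherwise = CREATED - panglaodb_total

if __name__ == "__main__":
    print("edited (%):", percentage(EDITED, TOTAL_CLASSES))
    print("created (%):", percentage(CREATED, TOTAL_CLASSES))
    print("from PanglaoDB:", panglaodb_total)
    print("created via web interface or workflow:", created_otherwise)
```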

As mentioned by Aviv Regev at the Human Cell Atlas General Meeting 2021, after a shout-out to ontologies: “It’s everyone’s collective responsibility to participate in the annotation efforts, because that relies on domain expertise. To really tease apart things and give them names. Until we have names, people will have really a hard time working with things in biology.” [url?] We hope that by developing simplified curation tools we will engage more domain experts in the curation efforts.

Wikidata and the Cell Ontology interplay

The contributions to cell types on Wikidata will be of most value if they are integrated with the current state of the art in knowledge representation. Arguably, the Cell Ontology is the main source of cell type identifiers in the context of the Human Cell Atlas project [42]. Thus, data about cell types on Wikidata must be connected to the Cell Ontology.

To start improving the interplay of both databases, we proposed and obtained approval for a dedicated Wikidata identifier property for the Cell Ontology, the “Cell Ontology ID” (https://www.wikidata.org/wiki/Property:P7963). Such IDs can be added to Wikidata entities to connect them to external databases, enabling integrative SPARQL queries. Besides using the standard Wikidata interface, one can crowd-curate identifiers via a third-party service, Mix’N’Match, which provides a user-friendly framework for connecting identifier catalogues to Wikidata [155], as seen in Figure ??. Accordingly, we created a Mix’N’Match catalogue for harmonizing Cell Ontology IDs to Wikidata (https://mix-n-match.toolforge.org/#/catalog/4719), harnessing community support for the task.

As of early December 2021, more than 700 Cell Ontology IDs have been manually matched to Wikidata. The integration already enables queries that harness the previously existing information on Wikidata for Cell Ontology-based applications. For example, one can query Wikidata items that have (1) a cross-reference to a CL ID and (2) a picture in Wikimedia Commons (https://w.wiki/4F6e, Figure 12). The different possibilities of mutual benefit between the Cell Ontology and Wikidata will continue to be explored in the following years of this PhD project.
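A query of this kind can be sketched as follows. It is a plausible reconstruction, not necessarily identical to the query behind the short link above: it combines the Cell Ontology ID property (P7963) with the image property (P18), and the function name and default limit are illustrative.

```python
def cl_items_with_image_query(limit=100):
    """Return a query for Wikidata items that have a Cell Ontology ID and an image."""
    if not isinstance(limit, int) or limit < 1:
        raise ValueError("limit must be a positive integer")
    return f"""
SELECT ?item ?itemLabel ?clId ?image
WHERE {{
  ?item wdt:P7963 ?clId ;
        wdt:P18 ?image .
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en". }}
}}
LIMIT {limit}
""".strip()


if __name__ == "__main__":
    # The printed query can be run at https://query.wikidata.org/.
    print(cl_items_with_image_query(limit=20))
```

Swapping P18 for another property gives the same pattern for other kinds of information already present on Wikidata.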

Final considerations and next steps

To sum up, this PhD research project aims at improving knowledge representation in the context of the Human Cell Atlas. It is composed of a mixture of theoretical studies on conceptual modelling, practical contributions to knowledge organization projects (mainly the Cell Ontology and Wikidata), explorations of the data to generate biomedical insights, and a technical framework for organized reading. By approaching the object of study from a new perspective, we hope to make sizeable contributions and promote discussion and fruitful conflation of approaches.

The next years of study will be devoted to developing the projects presented here into mature, useful products. We hope to improve the interplay of Wikidata and the Cell Ontology, developing frameworks to combine community- and expert-based curation of knowledge on cell types. Furthermore, we plan to integrate Wikidata into current single-cell RNA-sequencing pipelines by adapting existing software packages to use Wikidata (e.g. the ontology-based packages OnClass [56] and ontoProc [57]). Finally, we aim to turn Wikidata Bib into a well-documented, user-friendly, mature system, testing its usability with other academics and distributing it as a durable open-source project.

Additional Work

Collaborations and manuscripts

fcoex

During the initial course of this PhD work, we also completed the development and reporting of fcoex, an R package for investigating cellular phenotypes using co-expression networks [156]. The software was maintained to withstand new releases of its dependencies and new R versions and was published as a preprint on bioRxiv. [doi:10.1101/2021.12.07.471603v1?]

Wikidata Bots

Alongside the editing of cell-type information on Wikidata, I have joined different efforts to improve biological information on Wikidata. I collaborated with the ComplexPortal curators as part of the Virtual Elixir BioHackathon 2020 (https://github.com/virtual-biohackathons/covid-19-bh20/wiki), and throughout the following year, to build a Wikidata bot that integrates information on protein complexes into Wikidata. An overview of the Wikidata integration is in Figure 13, presented in an article published in Nucleic Acids Research (re-use of the image and legend is possible under the CC-BY license of the article) [157].

I have also collaborated with the Cellosaurus database [158] to revive the CellosaurusBot [159], responsible for updating the metadata on more than 100,000 cell lines on Wikidata. The bot code, written in Python, was refactored entirely and runs semi-automatically after each release of the Cellosaurus database. A write-up of the integration is in progress and is planned for release/submission in the first semester of 2022.

Systematic Reviews and publishing of intermediary tables

Finally, in collaboration with Olavo Amaral and Kleber Neves, from the Brazilian Reproducibility Initiative [160], I wrote a commentary on the value of publishing intermediate datasets as citable products [161]. The piece discusses the value of small curations done both in systematic reviews and by experimentalists in the course of their research projects. Published curation tables can serve as a source for improving the ecosystem of open knowledge, not least through reconciliation to Wikidata (thereby bridging the commentary with this project).

WiseCube - enterprise biomedical question and answering

During part of this project, I worked part-time as a consultant for Wisecube, a company based in Seattle, United States [162]. The job was approved by FAPESP and consisted mainly of writing SPARQL queries that probe Wikidata for answers to the questions posed by the BioASQ competition [163]. It also entailed on-demand curation of biomedical topics on Wikidata based on requests by pharmaceutical companies, as well as the development of dashboards targeted at providing insights to customers.

Awards and participation in events

During the initial course of this PhD project, I have participated in several events:

Course work

During the first year of the PhD program, I took four different classes, acquiring 36 academic credits. Figure 14 displays the disciplines taken, available only in Portuguese.

References

1.

The Human Cell Atlas.

Aviv Regev, Sarah Teichmann, Eric Lander, Amir Giladi, Christophe Benoist, Ewan Birney, Bernd Bodenmiller, Peter Campbell, Piero Carninci, Menna R Clatworthy, … Human Cell Atlas Meeting Participants

eLife (2017-12-05) https://www.wikidata.org/wiki/Q46368626

DOI: 10.7554/elife.27041

2.

The Human Cell Atlas and equity: lessons learned

Partha P Majumder, Musa M Mhlanga, Alex K Shalek

Nature Medicine (2020-10-01) https://www.wikidata.org/wiki/Q100491106

DOI: 10.1038/s41591-020-1100-4

3.

The Human Cell Atlas White Paper

Aviv Regev, Sarah Teichmann, Orit Rozenblatt-Rosen, Michael JT Stubbington, Kristin Ardlie, Amir Giladi, Paola Arlotta, Gary D Bader, Christophe Benoist, Moshe Biton, … Human Cell Atlas Organizing Committee

(2018-10-11) https://www.wikidata.org/wiki/Q104450645

4.

Everyone needs a data-management plan

Nature

(2018-03-15) https://www.wikidata.org/wiki/Q56524391

DOI: 10.1038/d41586-018-03065-z

5.

About the Data Coordination Platform

HCA Data Portal

https://data.humancellatlas.org/about/

6.

Mapping the Human Body at the Cellular Level

HCA Data Portal

https://data.humancellatlas.org/

7.

Publications https://www.humancellatlas.org/publications/

8.

Identifiers for the 21st century: How to design, provision, and reuse persistent identifiers to maximize utility and impact of life science data

Julie A McMurry, Nick Juty, Niklas Blomberg, Tony Burdett, Tom Conlin, Nathalie Conte, Melanie Courtot, John Deck, Michel Dumontier, Donal K Fellows, … Helen Parkinson

PLOS Biology (2017-06-29) https://www.wikidata.org/wiki/Q33037209

DOI: 10.1371/journal.pbio.2001414

9.

What Is Your Conceptual Definition of "Cell Type" in the Context of a Mature Organism?

Paul Blainey, Hans Clevers, Cole Trapnell, Ed Lein, Emma Lundberg, Alfonso Martinez Arias, Joshua R Sanes, Jay Shendure, James Eberwine, Junhyong Kim, … Mathias Uhlén

Cell systems (2017-03-01) https://www.wikidata.org/wiki/Q87649649

DOI: 10.1016/j.cels.2017.03.006

10.

A periodic table of cell types

Bo Xia, Itai Yanai

Development (2019-06-15) https://doi.org/ggctwf

DOI: 10.1242/dev.169854 · PMID: 31249003 · PMCID: PMC6602355

11.

Exciting times to study the identity and evolution of cell types

Maria Sachkova, Pawel Burkhardt

Development (2019-09-15) https://doi.org/ghdb9v

DOI: 10.1242/dev.178996 · PMID: 31537583

12.

The Human Cell Atlas: from vision to reality.

Orit Rozenblatt-Rosen, Michael JT Stubbington, Aviv Regev, Sarah Teichmann

Nature (2017-10-01) https://www.wikidata.org/wiki/Q47565008

DOI: 10.1038/550451a

13.

Human Cell Atlas and cell-type authentication for regenerative medicine

Yulia Panina, Peter Karagiannis, Andreas Kurtz, Glyn N Stacey, Wataru Fujibuchi

Experimental and Molecular Medicine (2020-09-15) https://www.wikidata.org/wiki/Q99418657

DOI: 10.1038/s12276-020-0421-1

14.

A community-based transcriptomics classification and nomenclature of neocortical cell types

Rafael Yuste, Michael J Hawrylycz, Nadia Aalling, Argel Aguilar-Valles, Detlev Arendt, Rubén Armañanzas, Giorgio A Ascoli, Concha Bielza, Vahid Bokharaie, Tobias B Bergmann, … Ed S Lein

Nature Neuroscience (2020-08-24) https://www.wikidata.org/wiki/Q98665291

DOI: 10.1038/s41593-020-0685-8

15.

The evolving concept of cell identity in the single cell era

Samantha A Morris

Development (2019-06-27) https://www.wikidata.org/wiki/Q93086971

DOI: 10.1242/dev.169748

16.

Implications of Epigenetic Variability within a Cell Population for "Cell Type" Classification

Inna Tabansky, Joel Stern, Donald W Pfaff

Frontiers in Behavioral Neuroscience (2015-12-16) https://www.wikidata.org/wiki/Q26770736

DOI: 10.3389/fnbeh.2015.00342

17.

Geometry of the Gene Expression Space of Individual Cells

Yael Korem, Pablo Szekely, Yuval Hart, Hila Sheftel, Jean Hausser, Avi Mayo, Michael E Rothenberg, Tomer Kalisky, Uri Alon

PLOS Computational Biology (2015-07-10) https://www.wikidata.org/wiki/Q35688096

DOI: 10.1371/journal.pcbi.1004224

18.

Evolution of Cellular Differentiation: From Hypotheses to Models

Pedro Márquez-Zacarías, Rozenn M Pineau, Marcella Gomez, Alan Veliz-Cuba, David Murrugarra, William C Ratcliff, Karl J Niklas

Trends in Ecology & Evolution (2020-08-20) https://www.wikidata.org/wiki/Q98633613

DOI: 10.1016/j.tree.2020.07.013

19.

An era of single-cell genomics consortia

Yoshinari Ando, Andrew T Kwon, Jay W Shin

Experimental and Molecular Medicine (2020-09-15) https://www.wikidata.org/wiki/Q99418649

DOI: 10.1038/s12276-020-0409-x

20.

Inferring cell type innovations by phylogenetic methods-concepts, methods, and limitations

Koryu Kin

Journal of Experimental Zoology. Part B: Molecular and Developmental Evolution (2015-10-14) https://www.wikidata.org/wiki/Q40436539

DOI: 10.1002/jez.b.22657

21.

Towards a pragmatic definition of cell type

Tiago Lubiana, Helder Nakaya

(2021-01-04) https://www.wikidata.org/wiki/Q108723646

DOI: 10.22541/au.160979530.02627436/v1

22.

How Many Types of Cells Are in the Human Body?

ibswit

(2017-05-17) https://askabiologist.asu.edu/questions/human-cell-types

23.

How many cell types in a human body? How about the number of cell cycles in each species?

ResearchGate

https://www.researchgate.net/post/How-many-cell-types-in-a-human-body-How-about-the-number-of-cell-cycles-in-each-species

24.

Types of cells in the human body

Kenhub

https://www.kenhub.com/en/library/anatomy/types-of-cells-in-the-human-body

25.

BioNumbers--the database of key numbers in molecular and cell biology

Ron Milo, Paul Jorgensen, Uri Moran, Griffin M Weber, Michael Springer

Nucleic Acids Research (2010-01-01) https://www.wikidata.org/wiki/Q24643881

DOI: 10.1093/nar/gkp889

26.

Search BioNumbers - The Database of Useful Biological Numbers https://bionumbers.hms.harvard.edu/search.aspx

27.

Cell types and ontologies of the Human Cell Atlas

David Osumi-Sutherland, Chuan Xu, Maria C Keays, Peter V Kharchenko, Aviv Regev, Ed S Lein, Sarah Teichmann

(2021-06-28) https://www.wikidata.org/wiki/Q107373831

28.

The continued need for animals to advance brain research

Judith R Homberg, Roger AH Adan, Natalia Alenina, Antonis Asiminas, Michael Bader, Tom Beckers, Denovan P Begg, Arjan Blokland, Marilise E Burger, Gertjan van Dijk, … Lisa Genzel

Neuron (2021-08-01) https://www.wikidata.org/wiki/Q110088775

DOI: 10.1016/j.neuron.2021.07.015

29.

Neuronal diversity: too many cell types for comfort?

Stevens CF

Current Biology (1998-10-01) https://www.wikidata.org/wiki/Q48373178

DOI: 10.1016/s0960-9822(98)70454-3

30.

An estimation of the number of cells in the human body

Eva Bianconi, Allison Piovesan, Federica Facchin, Alina Beraudi, Raffaella Casadei, Flavia Frabetti, Lorenza Vitale, Maria Chiara Pelleri, Simone Tassani, Francesco Piva, … Silvia Canaider

Annals of Human Biology (2013-07-05) https://www.wikidata.org/wiki/Q34037445

DOI: 10.3109/03014460.2013.807878

31.

A curated database reveals trends in single-cell transcriptomics

Valentine Svensson, Eduardo da Veiga Beltrame, Lior Pachter

Database (2020-11-01) https://www.wikidata.org/wiki/Q103034964

DOI: 10.1093/database/baaa073

32.

The evolution of cell types in animals: emerging principles from molecular studies.

Detlev Arendt

Nature reviews. Genetics (2008-11) https://www.ncbi.nlm.nih.gov/pubmed/18927580

DOI: 10.1038/nrg2416 · PMID: 18927580

33.

The origin and evolution of cell types

Detlev Arendt, Jacob M Musser, Clare VH Baker, Aviv Bergman, Connie Cepko, Douglas H Erwin, Mihaela Pavlicev, Gerhard Schlosser, Stefanie Widder, Manfred D Laubichler, Günter P Wagner

Nature Reviews Genetics (2016-12) https://doi.org/f9b62x

DOI: 10.1038/nrg.2016.127 · PMID: 27818507

34.

Stem cell states, fates, and the rules of attraction.

Tariq Enver, Martin Pera, Carsten Peterson, Peter W Andrews

Cell Stem Cell (2009-05-01) https://www.wikidata.org/wiki/Q37475461

DOI: 10.1016/j.stem.2009.04.011

35.

Theory of cell fate

Michael J Casey, Patrick S Stumpf, Ben D MacArthur

Wiley interdisciplinary reviews. Systems biology and medicine (2019-12-12) https://www.wikidata.org/wiki/Q91908361

DOI: 10.1002/wsbm.1471

36.

Perspectives on defining cell types in the brain

Eran A Mukamel, John Ngai

Current Opinion in Neurobiology (2018-12-06) https://www.wikidata.org/wiki/Q90361677

DOI: 10.1016/j.conb.2018.11.007

37.

Ensembles, dynamics, and cell types: Revisiting the statistical mechanics perspective on cellular regulation

Stefan Bornholdt, Stuart Kauffman

Journal of Theoretical Biology (2019-01-31) https://www.wikidata.org/wiki/Q91316993

DOI: 10.1016/j.jtbi.2019.01.036

38.

Species Concepts and Species Delimitation

Kevin De Queiroz

Systematic Biology (2007-12-01) https://doi.org/c34kzf

DOI: 10.1080/10635150701701083 · PMID: 18027281

39.

An ontology for cell types

Jonathan Bard, Sue Rhee, Michael Ashburner

Genome Biology (2005-01-01) https://www.wikidata.org/wiki/Q21184168

DOI: 10.1186/gb-2005-6-2-r21

40.

Logical Development of the Cell Ontology

Terrence F Meehan, Anna Maria Masci, Amina Abdulla, Lindsay G Cowell, Judith A Blake, Christopher J Mungall, Alexander D Diehl

BMC Bioinformatics (2011-12) https://doi.org/c7kw6x

DOI: 10.1186/1471-2105-12-6 · PMID: 21208450 · PMCID: PMC3024222

41.

The Cell Ontology 2016: enhanced content, modularization, and ontology interoperability

Alexander D Diehl, Terrence F Meehan, Yvonne M Bradford, Matthew H Brush, Wasila M Dahdul, David S Dougall, Yongqun He, David Osumi-Sutherland, Alan Ruttenberg, Sirarat Sarntivijai, … Christopher J Mungall

Journal of Biomedical Semantics (2016-12) https://doi.org/gg99b9

DOI: 10.1186/s13326-016-0088-7 · PMID: 27377652 · PMCID: PMC4932724

42.

Cell type ontologies of the Human Cell Atlas

David Osumi-Sutherland, Chuan Xu, Maria Keays, Adam P Levine, Peter V Kharchenko, Aviv Regev, Ed Lein, Sarah Teichmann

Nature Cell Biology (2021-11-01) https://www.wikidata.org/wiki/Q109755180

DOI: 10.1038/s41556-021-00787-7

43.

Cells in experimental life sciences - challenges and solution to the rapid evolution of knowledge

Sirarat Sarntivijai, Alexander D Diehl, Yongqun He

BMC Bioinformatics (2017-12) https://doi.org/gg99b7

DOI: 10.1186/s12859-017-1976-2 · PMID: 29322916 · PMCID: PMC5763506

44.

Cells in ExperimentaL Life Sciences (CELLS-2018): capturing the knowledge of normal and diseased cells with ontologies

Sirarat Sarntivijai, Yongqun He, Alexander D Diehl

BMC Bioinformatics (2019-04) https://doi.org/gg99b8

DOI: 10.1186/s12859-019-2721-9 · PMID: 31272374 · PMCID: PMC6509796

45.

Scaled, high fidelity electrophysiological, morphological, and transcriptomic cell characterization

Brian R Lee, Agata Budzillo, Kristen Hadley, Jeremy A Miller, Tim Jarsky, Katherine Baker, DiJon Hill, Lisa Kim, Rusty Mann, Lindsay Ng, … Jim Berg

eLife (2021-08-13) https://www.wikidata.org/wiki/Q109717199

DOI: 10.7554/elife.65482

46.

Current best practices in single-cell RNA-seq analysis: a tutorial

Malte D Luecken, Fabian J Theis

Molecular Systems Biology (2019-06-19) https://www.wikidata.org/wiki/Q64974172

DOI: 10.15252/msb.20188746

47.

Tutorial: guidelines for annotating single-cell transcriptomic maps using automated and manual methods

Zoe A Clarke, Tallulah Andrews, Jawairia Atif, Delaram Pouyabahar, Brendan T Innes, Sonya A MacParland, Gary D Bader

Nature Protocols (2021-05-24) https://www.wikidata.org/wiki/Q107158224

DOI: 10.1038/s41596-021-00534-0

48.

Data-Driven Phenotypic Dissection of AML Reveals Progenitor-like Cells that Correlate with Prognosis

Jacob H Levine, Erin F Simonds, Sean C Bendall, Kara L Davis, El-ad D Amir, Michelle D Tadmor, Oren Litvin, Harris G Fienberg, Astraea Jager, Eli R Zunder, … Garry P Nolan

Cell (2015-06-18) https://www.wikidata.org/wiki/Q30975629

DOI: 10.1016/j.cell.2015.05.047

49.

Fast unfolding of communities in large networks

Vincent Blondel, Jean-Loup Guillaume, Renaud Lambiotte, Etienne Lefebvre

Journal of Statistical Mechanics: Theory and Experiment (2008-10-09) https://www.wikidata.org/wiki/Q29305711

DOI: 10.1088/1742-5468/2008/10/p10008

50.

PanglaoDB: a web server for exploration of mouse and human single-cell RNA sequencing data

Oscar Franzén, Li-Ming Gan, Johan LM Bjorkegren

Database (2019-01-01) https://www.wikidata.org/wiki/Q63664483

DOI: 10.1093/database/baz046

51.

CellMarker: a manually curated resource of cell markers in human and mouse

Xinxin Zhang, Yujia Lan, Jinyuan Xu, Fei Quan, Erjie Zhao, Chunyu Deng, Tao Luo, Liwen Xu, Gaoming Liao, Min Yan, … Yun Xiao

Nucleic Acids Research (2019-01-01) https://www.wikidata.org/wiki/Q56984510

DOI: 10.1093/nar/gky900

52.

The Cell Ontology 2016: enhanced content, modularization, and ontology interoperability.

Journal of Biomedical Semantics (2016-07-04) https://www.wikidata.org/wiki/Q36067763

DOI: 10.1186/s13326-016-0088-7

53.

Cell type discovery using single-cell transcriptomics: implications for ontological representation

Brian D Aevermann, Mark Novotny, Trygve E Bakken, Jeremy A Miller, Alexander D Diehl, David Osumi-Sutherland, Roger S Lasken, Ed S Lein, Richard H Scheuermann

Human Molecular Genetics (2018-05-01) https://www.wikidata.org/wiki/Q52625486

DOI: 10.1093/hmg/ddy100

54.

Cell ontology in an age of data-driven cell classification.

David Osumi-Sutherland

BMC Bioinformatics (2017-12-21) https://www.wikidata.org/wiki/Q49192555

DOI: 10.1186/s12859-017-1980-6

55.

Besca, a single-cell transcriptomics analysis toolkit to accelerate translational research

Sophia Clara Mädler, Alice Julien-Laferriere, Luis Wyss, Miroslav Phan, Albert SW Kang, Eric Ulrich, Roland Schmucki, Jitao David Zhang, Martin Ebeling, Laura Badi, … Klas Hatje

bioRxiv (2020-08-12) https://www.wikidata.org/wiki/Q104450593

DOI: 10.1101/2020.08.11.245795

56.

Leveraging the Cell Ontology to classify unseen cell types

Sheng Wang, Angela Oliveira Pisco, Aaron McGeever, Maria Brbić, Marinka Žitnik, Spyros Darmanis, Jure Leskovec, Jim Karkanias, Russ Altman

Nature Communications (2021-09-21) https://www.wikidata.org/wiki/Q108929315

DOI: 10.1038/s41467-021-25725-x

57.

ontoProc: processing of ontologies of anatomy, cell lines, and so on https://www.wikidata.org/wiki/Q101074371

58.

Tabula Muris https://tabula-muris.ds.czbiohub.org/

59.

Tabula Sapiens https://tabula-sapiens-portal.ds.czbiohub.org/celltypes

60.

Azimuth https://azimuth.hubmapconsortium.org/

61.

Construction and Usage of a Human Body Common Coordinate Framework Comprising Clinical, Semantic, and Spatial Ontologies

Katy Börner, Ellen Quardokus, Bruce W Herr II, Leonard E Cross, Elizabeth G Record, Yingnan Ju, Andreas D Bueckle, James P Sluka, Jonathan C Silverstein, Kristen M Browne, … Griffin M Weber

(2020-07-28) https://www.wikidata.org/wiki/Q109755184

62.

Cell Annotation Platform | Coming Soon http://celltype.info/

63.

Cross-tissue immune cell analysis reveals tissue-specific adaptations and clonal architecture across the human body

C Domínguez Conde, Tomás Gomes, Lorna B Jarvis, C Xu, SK Howlett, DB Rainbow, Ondrej Suchanek, Hamish W King, Lira Mamanova, Krzysztof Polański, … Sarah Teichmann

(2021-04-28) https://www.wikidata.org/wiki/Q107363182

DOI: 10.1101/2021.04.28.441762

64.

Ontology based molecular signatures for immune cell types via gene expression analysis

Terrence F Meehan, Nicole Vasilevsky, Christopher J Mungall, David S Dougall, Melissa Haendel, Judith A Blake, Alexander D Diehl

BMC Bioinformatics (2013-08-30) https://www.wikidata.org/wiki/Q34978215

DOI: 10.1186/1471-2105-14-263

65.

Logical development of the cell ontology

Terrence F Meehan, Anna Maria Masci, Amina Abdulla, Lindsay G Cowell, Judith A Blake, Christopher J Mungall, Alexander D Diehl

BMC Bioinformatics (2011-01-05) https://www.wikidata.org/wiki/Q33786317

DOI: 10.1186/1471-2105-12-6

66.

Ontologies for the life sciences

Steffen Schulze-Kremer, Barry Smith

(2005-11-15) https://www.wikidata.org/wiki/Q105870680

DOI: 10.1002/047001153x.g408213

67.

The Philosophy of Logical Atomism, Lecture 1: Facts and Propositions https://www.wikidata.org/wiki/Q105105637

68.

Logik der Forschung

Karl Popper

(1934-01-01) https://www.wikidata.org/wiki/Q1868040

69.

The semantic conception of truth: and the foundations of semantics

Alfred Tarski

Philosophy and Phenomenological Research (1944-03-01) https://www.wikidata.org/wiki/Q106090790

DOI: 10.2307/2102968

70.

Applied Ontology - An Introduction https://www.wikidata.org/wiki/Q110015932

71.

The Gene Ontology resource: enriching a GOld mine

Gene Ontology Consortium

Nucleic Acids Research (2020-12-08) https://www.wikidata.org/wiki/Q104130127

DOI: 10.1093/nar/gkaa1113

72.

Gene ontology: tool for the unification of biology. The Gene Ontology Consortium

M Ashburner, CA Ball, Judith A Blake, David Botstein, H Butler, JMichael Cherry, AP Davis, K Dolinski, Selina S Dwight, JT Eppig, … Gavin Sherlock

Nature Genetics (2000-05-01) https://www.wikidata.org/wiki/Q23781406

DOI: 10.1038/75556

73.

The OBO Foundry: coordinated evolution of ontologies to support biomedical data integration

Barry Smith, Michael Ashburner, Cornelius Rosse, Jonathan Bard, William Bug, Werner Ceusters, Louis J Goldberg, Karen Eilbeck, Amelia Ireland, Christopher J Mungall, … Suzanna Lewis

Nature Biotechnology (2007-11-01) https://www.wikidata.org/wiki/Q19671692

DOI: 10.1038/nbt1346

74.

RDF 1.1 Concepts and Abstract Syntax https://www.w3.org/TR/rdf11-concepts/

75.

Introducing the Knowledge Graph: things, not strings

Google

(2012-05-16) https://blog.google/products/search/introducing-knowledge-graph-things-not/

76.

Toward an epistemology of Wikipedia

Don Fallis

Journal of the Association for Information Science and Technology (2008-08-01) https://www.wikidata.org/wiki/Q101955295

DOI: 10.1002/asi.20870

77.

Wikidata:Statistics - Wikidata https://www.wikidata.org/wiki/Wikidata:Statistics

78.

From Freebase to Wikidata: The Great Migration

Thomas Pellissier Tanon, Denny Vrandečić, Sebastian Schaffert, Thomas Steiner, Lydia Pintscher

Proceedings of the 25th International Conference on World Wide Web (2016-01-01) https://www.wikidata.org/wiki/Q24074986

DOI: 10.1145/2872427.2874809

79.

Wikibase/DataModel - MediaWiki https://www.mediawiki.org/wiki/Wikibase/DataModel

80.

Help:Data type - Wikidata https://www.wikidata.org/wiki/Help:Data_type

81.

Help:Multilingual - Wikidata https://www.wikidata.org/wiki/Help:Multilingual

82.

RDF 1.1 Semantics https://www.w3.org/TR/rdf11-mt/

83.

Wikidata:Data access - Wikidata https://www.wikidata.org/wiki/Wikidata:Data_access

84.

WikidataR package - RDocumentation https://www.rdocumentation.org/packages/WikidataR/versions/2.2.0

85.

wikidata2df: Utility package for easily turning a SPARQL query into a dataframe

João Vitor F Cavalcante

https://github.com/jvfe/wikidata2df

86.

Wikidata:Licensing - Wikidata https://www.wikidata.org/wiki/Wikidata:Licensing

87.

https://query.wikidata.org/

88.

Getting the Most out of Wikidata: Semantic Technology Usage in Wikipedia’s Knowledge Graph

Stanislav Malyshev, Markus Krötzsch, Larry González, Julius Gonsior, Adrian Bielefeldt

The Semantic Web – ISWC 2018: 17th International Semantic Web Conference, Monterey, CA, USA, October 8–12, 2018, Proceedings, Part II (2018-01-01) https://www.wikidata.org/wiki/Q56010228

DOI: 10.1007/978-3-030-00668-6_23

89.

Scholia

Scholia

https://scholia.toolforge.org/

90.

SARS-CoV-2-Queries

SARS-CoV-2-Queries

https://egonw.github.io/SARS-CoV-2-Queries/

91.

Wikidata:Tools/OpenRefine - Wikidata https://www.wikidata.org/wiki/Wikidata:Tools/OpenRefine

92.

Help:QuickStatements - Wikidata https://www.wikidata.org/wiki/Help:QuickStatements

93.

Wikidata:Bots - Wikidata https://www.wikidata.org/wiki/Wikidata:Bots

94.

Wikidata:Pywikibot - Python 3 Tutorial - Wikidata https://www.wikidata.org/wiki/Wikidata:Pywikibot_-_Python_3_Tutorial

95.

GitHub - SuLab/WikidataIntegrator: A Wikidata Python module integrating the MediaWiki API and the Wikidata SPARQL endpoint

GitHub

https://github.com/SuLab/WikidataIntegrator

96.

Wikidata as a knowledge graph for the life sciences

Andra Waagmeester, Gregory Stupp, Sebastian Burgstaller-Muehlbacher, Benjamin M Good, Malachi Griffith, Obi Griffith, Kristina Hanspers, Henning Hermjakob, Toby Hudson, Kevin Hybiske, … Andrew I Su

eLife (2020-03-17) https://www.wikidata.org/wiki/Q87830400

DOI: 10.7554/elife.52614

97.

Wikidata: A large-scale collaborative ontological medical database

Houcemeddine Turki, Thomas Shafee, Mohamed Ali Hadj Taieb, Mohamed Ben Aouicha, Denny Vrandečić, Diptanshu Das, Helmi Hamdi

Journal of Biomedical Informatics (2019-09-23) https://www.wikidata.org/wiki/Q68471881

DOI: 10.1016/j.jbi.2019.103292

98.

Big data: Wikiomics

Mitch Waldrop

Nature (2008-09-04) https://www.wikidata.org/wiki/Q28292893

DOI: 10.1038/455022a

99.

Calling on a million minds for community annotation in WikiProteins

Barend Mons, Michael Ashburner, Christine Chichester, Erik M van Mulligen, Marc Weeber, Johan den Dunnen, Gert-Jan van Ommen, Mark A Musen, Matt Cockerill, Henning Hermjakob, … Amos Bairoch

Genome Biology (2008-01-01) https://www.wikidata.org/wiki/Q21183907

DOI: 10.1186/gb-2008-9-5-r89

100.

Ten Simple Rules for Developing Public Biological Databases

Mohamed Helmy, Alexander Crits-Christoph, Gary D Bader

PLOS Computational Biology (2016-11-01) https://www.wikidata.org/wiki/Q28595967

DOI: 10.1371/journal.pcbi.1005128

101.

Inside the Alexa-Friendly World of Wikidata

Tom Simonite

Wired https://www.wired.com/story/inside-the-alexa-friendly-world-of-wikidata/

102.

A gene wiki for community annotation of gene function

Jon W Huss, Camilo Orozco, James Goodale, Chunlei Wu, Serge Batalov, Tim J Vickers, Faramarz Valafar, Andrew I Su

PLOS Biology (2008-07-08) https://www.wikidata.org/wiki/Q21092744

DOI: 10.1371/journal.pbio.0060175

103.

Making your database available through Wikipedia: the pros and cons

Robert D Finn, Paul P Gardner, Alex Bateman

Nucleic Acids Research (2012-01-01) https://www.wikidata.org/wiki/Q28254676

DOI: 10.1093/nar/gkr1195

104.

Wikidata as a semantic framework for the Gene Wiki initiative

Sebastian Burgstaller-Muehlbacher, Andra Waagmeester, Elvira Mitraka, Julia Turner, Timothy Elliott Putman, Justin Leong, Chinmay Naik, Paul Pavlidis, Lynn Schriml, Benjamin M Good, Andrew I Su

Database (2016-01-01) https://www.wikidata.org/wiki/Q23712646

DOI: 10.1093/database/baw015

105.

WikiGenomes: an open web application for community consumption and curation of gene annotation data in Wikidata

Timothy Elliott Putman, Sebastien Lelong, Sebastian Burgstaller-Muehlbacher, Andra Waagmeester, Colin Diesh, Nathan Dunn, Monica Munoz-Torres, Gregory Stupp, Chunlei Wu, Andrew I Su, Benjamin M Good

Database (2017-03-08) https://www.wikidata.org/wiki/Q28529449

106.

ChlamBase: a curated model organism database for the Chlamydia research community

Timothy Elliott Putman, Kevin Hybiske, Derek Jow, Cyrus Afrasiabi, Sebastien Lelong, Marco Alvarado Cano, Chunlei Wu, Andrew I Su

Database (2019-01-01) https://www.wikidata.org/wiki/Q63286185

DOI: 10.1093/database/baz041

107.

Submit a Topic Page to PLOS Computational Biology and Wikipedia

Daniel Mietchen, Shoshana Wodak, Szymon Wasik, Natalia Szostak, Christophe Dessimoz

PLOS Computational Biology (2018-05-31) https://www.wikidata.org/wiki/Q54655231

DOI: 10.1371/journal.pcbi.1006137

108.

Scholia, Scientometrics and Wikidata

Finn Årup Nielsen, Daniel Mietchen, Egon Willighagen

The Semantic Web: ESWC 2017 Satellite Events (2017-10-01) https://www.wikidata.org/wiki/Q41799194

DOI: 10.1007/978-3-319-70407-4_36

109.

Robustifying Scholia: paving the way for knowledge discovery and research assessment through Wikidata

Lane Rasberry, Egon Willighagen, Finn Årup Nielsen, Daniel Mietchen

Research Ideas and Outcomes (2019-05-02) https://www.wikidata.org/wiki/Q63433973

DOI: 10.3897/rio.5.e35820

110.

The LOTUS Initiative for Open Natural Products Research: Knowledge Management through Wikidata

Adriano Rutz, Maria Sorokina, Jakub Galgonek, Daniel Mietchen, Egon Willighagen, Arnaud Gaudry, James G Graham, Ralf Stephan, Roderic Page, Jiří Vondrášek, … Pierre-Marie Allard

bioRxiv (2021-03-01) https://doi.org/gh6dqk

DOI: 10.1101/2021.02.28.433265

111.

GeneDB and Wikidata

Magnus Manske, Ulrike Böhme, Christoph Püthe, Matt Berriman

Wellcome Open Research (2019-10-14) https://doi.org/gnq75m

DOI: 10.12688/wellcomeopenres.15355.2 · PMID: 31723674 · PMCID: PMC6823904

112.

The Cellosaurus, a Cell-Line Knowledge Resource

Amos Bairoch

Journal of Biomolecular Techniques : JBT (2018-07) https://doi.org/gf75bq

DOI: 10.7171/jbt.18-2902-002 · PMID: 29805321 · PMCID: PMC5945021

113.

Complex Portal 2022: new curation frontiers

Birgit HM Meldal, Livia Perfetto, Colin Combe, Tiago Lubiana, João Vitor Ferreira Cavalcante, Hema Bye-A-Jee, Andra Waagmeester, Noemi del-Toro, Anjali Shrivastava, Elisabeth Barrera, … Sandra Orchard

Nucleic Acids Research (2021-10-29) https://doi.org/gnq75j

DOI: 10.1093/nar/gkab991 · PMID: 34718729

114.

WikiPathways: connecting communities

Marvin Martens, Ammar Ammar, Anders Riutta, Andra Waagmeester, Denise N Slenter, Kristina Hanspers, Ryan A Miller, Daniela Digles, Elisson N Lopes, Friederike Ehrhart, … Martina Kutmon

Nucleic Acids Research (2021-01-08) https://doi.org/gh6dq2

DOI: 10.1093/nar/gkaa1024 · PMID: 33211851 · PMCID: PMC7779061

115.

Complex Portal 2018: extended content and enhanced visualization tools for macromolecular complexes

Birgit HM Meldal, Hema Bye-A-Jee, Lukáš Gajdoš, Zuzana Hammerová, Aneta Horáčková, Filip Melicher, Livia Perfetto, Daniel Pokorný, Milagros Rodriguez Lopez, Alžběta Türková, … Sandra Orchard

Nucleic Acids Research (2019-01-08) https://doi.org/gnq75k

DOI: 10.1093/nar/gky1001 · PMID: 30357405 · PMCID: PMC6323931

116.

Wikidata as a knowledge graph for the life sciences

Andra Waagmeester, Gregory Stupp, Sebastian Burgstaller-Muehlbacher, Benjamin M Good, Malachi Griffith, Obi L Griffith, Kristina Hanspers, Henning Hermjakob, Toby S Hudson, Kevin Hybiske, … Andrew I Su

eLife (2020-03-17) https://doi.org/ggqqc6

DOI: 10.7554/elife.52614 · PMID: 32180547 · PMCID: PMC7077981

117.

Representing COVID-19 information in collaborative knowledge graphs: The case of Wikidata

Houcemeddine Turki, Mohamed Ali Hadj Taieb, Thomas Shafee, Tiago Lubiana, Dariusz Jemielniak, Mohamed Ben Aouicha, José Emilio Labra Gayo, Eric Youngstrom, Mossab Banat, Diptanshu Das, … WikiProject COVID-19

Semantic Web: Interoperability, Usability, Applicability (2021-09-28) https://www.wikidata.org/wiki/Q108766311

DOI: 10.3233/sw-210444

118.

A protocol for adding knowledge to Wikidata: aligning resources on human coronaviruses

Andra Waagmeester, Egon Willighagen, Andrew I Su, Martina Summer-Kutmon, José Emilio Labra Gayo, Daniel Fernández-Álvarez, Quentin J Groom, Peter J Schaap, Lisa M Verhagen, Jasper Koehorst

BMC Biology (2021-01-22) https://www.wikidata.org/wiki/Q105037759

DOI: 10.1186/s12915-020-00940-y

119.

Wikidata Queries around the SARS-CoV-2 virus and pandemic https://www.wikidata.org/wiki/Q88647643

120.

COVIWD: COVID-19 Wikidata Dashboard

Fariz Darari

Jurnal Ilmu Komputer dan Informasi (2021-03-01) https://www.wikidata.org/wiki/Q105833381

DOI: 10.21609/jiki.v14i1.941

121.

Painel de informação sobre a COVID-19: consultas SPARQL na Wikidata

Ana Carolina Simionato Arakaki, Fabiano Ferreira de Castro, Felipe Augusto Arakaki

AtoZ: Novas Práticas em Informação e Conhecimento (2020-12-03) https://www.wikidata.org/wiki/Q106249454

DOI: 10.5380/atoz.v9i2.76684

122.

Uso de Wikidata y Wikipedia para la generación asistida de un vocabulario estructurado multilingüe sobre la pandemia de Covid-19

Tomás Saorín, Juan-Antonio Pastor-Sánchez, María-José Baños-Moreno

Profesional de la Información (2020-09-13) https://www.wikidata.org/wiki/Q107377131

DOI: 10.3145/epi.2020.sep.09

123.

The Bgee suite: integrated curated expression atlas and comparative transcriptomics in animals

Frederic B Bastian, Julien Roux, Anne Niknejad, Aurélie Comte, Sara S Fonseca Costa, Tarcisio M Farias, Sébastien Moretti, Gilles Parmentier, Valentine Rech de Laval, Marta Rosikiewicz, … Marc Robinson-Rechavi

Nucleic Acids Research (2020-10-10) https://www.wikidata.org/wiki/Q100513179

DOI: 10.1093/nar/gkaa793

124.

Utilizing the Wikidata system to improve the quality of medical content in Wikipedia in diverse languages: a pilot study

Alexander Pfundner, Tobias Schönberg, John Horn, Richard David Boyce, Matthias Samwald

Journal of Medical Internet Research (2015-05-05) https://www.wikidata.org/wiki/Q21503276

DOI: 10.2196/jmir.4163

125.

Pesquisa-ação: uma introdução metodológica

David Tripp

Educação e Pesquisa (2005-12-01) https://www.wikidata.org/wiki/Q108479295

DOI: 10.1590/s1517-97022005000300009

126.

protégé https://protege.stanford.edu/

127.

Towards a pragmatic definition of cell type

Tiago Lubiana, Helder I Nakaya

Preprints (2021-01-04) https://doi.org/ghrxwf

DOI: 10.22541/au.160979530.02627436/v1

128.

PhyloCode https://www.wikidata.org/wiki/Q1189395

129.

Highlights of the 'gene nomenclature across species' meeting

Elspeth A Bruford

Human Genomics (2010-02-01) https://www.wikidata.org/wiki/Q42790699

DOI: 10.1186/1479-7364-4-3-213

130.

Ontological realism: A methodology for coordinated evolution of scientific ontologies

Barry Smith, Werner Ceusters

Applied Ontology (2010-11-15) https://www.wikidata.org/wiki/Q28239464

DOI: 10.3233/ao-2010-0079

131.

Multi-level ontology-based conceptual modeling

Victorio A Carvalho, João Paulo A Almeida, Claudenir M Fonseca, Giancarlo Guizzardi

Data and Knowledge Engineering (2017-05-01) https://www.wikidata.org/wiki/Q108926456

DOI: 10.1016/j.datak.2017.03.002

132.

The Human Cell Atlas: Technical approaches and challenges

Chung Chau Hon, Jay W Shin, Piero Carninci, Michael JT Stubbington

Briefings in Functional Genomics (2017-10-28) https://www.wikidata.org/wiki/Q48563763

DOI: 10.1093/bfgp/elx029

133.

Popper on Definitions

Wilhelm Büttemeyer

Zeitschrift für allgemeine Wissenschaftstheorie. Journal for general philosophy of science (2005-01-01) https://www.wikidata.org/wiki/Q108925548

DOI: 10.1007/s10838-005-6037-2

134.

PanglaoDB - A Single Cell Sequencing Resource For Gene Expression Data https://panglaodb.se/index.html

135.

PanglaoDB: a web server for exploration of mouse and human single-cell RNA sequencing data

Oscar Franzén, Li-Ming Gan, Johan LM Björkegren

Database (2019-01-01) https://doi.org/ggkzxr

DOI: 10.1093/database/baz046 · PMID: 30951143 · PMCID: PMC6450036

136.

Linked Data - Design Issues https://www.w3.org/DesignIssues/LinkedData.html

137.

CellFinder: a cell data repository

Harald Stachelscheid, Stefanie Seltmann, Fritz Lekschas, Jean-Fred Fontaine, Nancy Mah, Mariana Lara Neves, Miguel A Andrade-Navarro, Ulf Leser, Andreas Kurtz

Nucleic Acids Research (2013-12-03) https://www.wikidata.org/wiki/Q28660708

DOI: 10.1093/nar/gkt1264

138.

Type or Individual? Evidence of Large-Scale Conceptual Disarray in Wikidata

Atílio A Dadalto, João Paulo A Almeida, Claudenir M Fonseca, Giancarlo Guizzardi

Lecture Notes in Computer Science (2021-01-01) https://www.wikidata.org/wiki/Q109990743

139.

CELDA -- an ontology for the comprehensive representation of cells in complex systems

Stefanie Seltmann, Harald Stachelscheid, Alexander Damaschun, Ludger Jansen, Fritz Lekschas, Jean-Fred Fontaine, Throng Nghia Nguyen-Dobinsky, Ulf Leser, Andreas Kurtz

BMC Bioinformatics (2013-07-17) https://www.wikidata.org/wiki/Q21284308

DOI: 10.1186/1471-2105-14-228

140.

UniProt https://sparql.uniprot.org/sparql

141.

Portal:Semantic Web - WikiPathways https://www.wikipathways.org/index.php/Portal:Semantic_Web

142.

Cell Markers

Konstantin Yakimchuk

Materials and Methods (2013-05-02) https://doi.org/ghq494

DOI: 10.13070/mm.en.3.183

143.

SHOGoiN: Shogoin Human Omics database for the Generation of iPS and Normal cells https://stemcellinformatics.org/

144.

Comparison of reference management software - Wikipedia https://en.wikipedia.org/wiki/Comparison_of_reference_management_software

145.

Wikipedia, the free encyclopedia https://en.wikipedia.org/wiki/Main_Page

146.

Clean Coder Blog https://blog.cleancoder.com/uncle-bob/2017/12/18/Excuses.html

147.

Como fazer um fichamento

Priscilla de Carvalho Nunes

Blog da Biblioteca da ECA-USP (2019-09-30) https://bibliotecadaeca.wordpress.com/2019/09/30/como-fazer-um-fichamento/

148.

Come si fa una tesi di laurea https://www.wikidata.org/wiki/Q3684178

149.

Unpaywall https://unpaywall.org/

150.

Clean Code: A Handbook of Agile Software Craftsmanship https://www.wikidata.org/wiki/Q109996684

151.

Wikidata:Tools/Author Disambiguator - Wikidata https://www.wikidata.org/wiki/Wikidata:Tools/Author_Disambiguator

152.

wbib: A helper for building Wikidata-based literature dashboards via SPARQL queries

Tiago Lubiana

https://github.com/lubianat/wbib

153.

HCA Latin America - 2021 Workshop https://www.humancellatlas.org/hca-latin-america-2021-workshop/

154.

BioHackathon Europe https://biohackathon-europe.org/

155.

The Whelming › Tech, tools, and tribulations

Magnus Manske

http://magnusmanske.de/wordpress/

156.

fcoex: FCBF-based Co-Expression Networks for Single Cells

Tiago Lubiana, Helder Nakaya

Bioconductor version: Release (3.14) (2022) https://bioconductor.org/packages/fcoex/

157.

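Complex Portal 2022: new curation frontiers

Birgit HM Meldal, Livia Perfetto, Colin Combe, Tiago Lubiana, João Vitor Ferreira Cavalcante, Hema Bye-A-Jee, Andra Waagmeester, Noemi del-Toro, Anjali Shrivastava, Elisabeth Barrera, … Sandra Orchard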

Nucleic Acids Research (2021-10-29) https://www.wikidata.org/wiki/Q109348309

DOI: 10.1093/nar/gkab991

158.

The Cellosaurus, a cell-line knowledge resource

Amos Bairoch

Journal of Biomolecular Techniques (2018-05-01) https://www.wikidata.org/wiki/Q54370168

DOI: 10.7171/jbt.18-2902-002

159.

User:CellosaurusBot - Wikidata https://www.wikidata.org/wiki/User:CellosaurusBot

160.

The Brazilian Reproducibility Initiative

Ana P Wasilewska-Sampaio, Olavo Bohrer Amaral, Kleber Neves, Clarissa FD Carneiro

eLife (2019-02-05) https://www.wikidata.org/wiki/Q61799268

DOI: 10.7554/elife.41602

161.

Sharing intermediate datasets from systematic reviews

Tiago Lubiana, Olavo Bohrer Amaral, Kleber Neves

MetaArXiv (2021-11-26) https://doi.org/gnrf9v

DOI: 10.31222/osf.io/vbwa9

162.

Wisecube AI | Knowledge Graph Engine https://www.wisecube.ai/

163.

An overview of the BIOASQ large-scale biomedical semantic indexing and question answering competition

George Tsatsaronis, Georgios Balikas, Prodromos Malakasiotis, Ioannis Partalas, Matthias Zschunke, Michael R Alvers, Dirk Weissenborn, Anastasia Krithara, Sergios Petridis, Dimitris Polychronopoulos, … Georgios Paliouras

BMC Bioinformatics (2015-04-30) https://www.wikidata.org/wiki/Q28646342

DOI: 10.1186/s12859-015-0564-6

164.

No Budget Science Hack Week

reprodutibilidade

https://www.reprodutibilidade.bio.br/hack-week-2021

165.

Biocuration - Wikipedia https://en.wikipedia.org/wiki/Biocuration

166.

biohackathon-projects-2021/projects/32 at main · elixir-europe/biohackathon-projects-2021

GitHub

https://github.com/elixir-europe/biohackathon-projects-2021

name	project website	citation
LOTUS	https://lotus.naturalproducts.net/	[110]
GeneDB	https://www.genedb.org/	[111]
Cellosaurus	https://web.expasy.org/cellosaurus/	[112]
Complex Portal	https://www.ebi.ac.uk/complexportal/	[113]
WikiPathways	https://www.wikipathways.org/	[114]
Reactome	https://reactome.org/	[115]
CIViC	http://www.civicdb.org	[116]
PubChem	https://pubchem.ncbi.nlm.nih.gov	[116]
Human Disease Ontology	https://www.ebi.ac.uk/ols/ontologies/doid	[116]

geneLabel	cellTypeLabel
OMP	human purkinje neuron
OMP	human olfactory epithelial cell
OMP	human neuron
EPHB1	human oligodendrocyte
EPHB1	human osteoclast
PCSK9	human delta cell
PCSK9	human loop of Henle cell
CXCR4	human b cell
CXCR4	human T cell
CXCR4	human NK cell

geneLabel	diseaseLabel	cellTypeLabel
BST1	Parkinson’s disease	human b cell
BST1	Parkinson’s disease	human neutrophil
RIT2	Parkinson’s disease	human neuron
SH3GL2	Parkinson’s disease	human alpha cell
SH3GL2	Parkinson’s disease	human beta-cell

workLabel	authors
A promoter-level mammalian expression atlas	Jay W Shin
Single-cell RNA-seq reveals new types of human blood dendritic cells, monocytes, and progenitors.	Muzlifah Haniffa
The Human Cell Atlas.	Musa Mhlanga, Jay W Shin, Muzlifah Haniffa, Menna R Clatworthy, Dana Pe’er
The Human Cell Atlas: Technical approaches and challenges.	Jay W Shin
Innate Immune Landscape in Early Lung Adenocarcinoma by Paired Single-Cell Analyses.	Dana Pe’er
Single cell RNA sequencing of human liver reveals distinct intrahepatic macrophage populations	Sonya A MacParland
Single-cell reconstruction of the early maternal-fetal interface in humans	Muzlifah Haniffa
Distinct microbial and immune niches of the human colon	Rasa Elmentaite, Menna R Clatworthy
A cell atlas of human thymic development defines T cell repertoire formation	Muzlifah Haniffa, Menna R Clatworthy
Decoding human fetal liver haematopoiesis	Muzlifah Haniffa
