Overview of the data and preprocessing steps

Data and Preprocessing

We use data from a database listing the University of Basel’s research output between 2000 and 2020. The original database is composed of a total 67’269 publications including several different categories: 43’997 journal articles, 11’959 book chapters, 3’454 books (edited and authored), 2’014 proceedings articles, 1’671 theses, and 4’174 other items including news articles and discussion papers. We found that a number of publications were repeated and removed these so we end up with a total of 64’268 publications.

In addition, we use data from the Equal opportunities monitoring concerning the percentage of females in different positions (i.e., professorships, academic staff) available for the period 2013-2019.

Diversity Dimensions

We use information concerning diversity derived from the authors’ first names.

We recruit the services,, to determine the authors’ probable gender, age, and nation, respectively. Estimates are provided based on the services’ databases that include hundreds of thousands of confirmed mappings on the internet between first names and these three characteristics. In addition to probable categories, the services return information on confidence. For instance, for the name Alexandra, informs that 98% of 122’985 gender records point to a female gender, that the probable age is 27 based on 114,238 age records, and that there is a 11.7% probability that the nation is Romania, a 5.1% probability that the nation is Ecuador, and a 4.5% probability that the nation is Portugal.

You can use the links above to try out the three services for yourself.

We have processed the data in the following way to simplify our analysis. For gender and age, we only use the probable value, irrespective of the number of instances or the assigned probability. For nation, we determine whether any of the three most probable nations include a “western” nation defined as geographic Europe, the US, Canada, New Zealand, and Australia. Additional checking of names suggested that this categorization of nationality is suboptimal and we refrain to present results concerning this dimension in our report.

Organizational Units

In a previous project, CDS members identified faculty-level organizational units within the University of Basel (including associated and cross-disciplinary institutes) manually, using the descriptors in use at the University of Basel in 2021.

The data is shared via this Github repository.


Variable Description
author List of authors.
year Year of publication.
title Title of publication.
organization Original list of Unibas organizations involved in publication.
unibas Publication offically affiliated with University of Basel.
category Category of publication: Journal, Chapter, Book, Thesis, Proceedings, Other.
n_authors Number of authors
gender_proportion Proportion of female authors.
gender_first Is first author female: 1 = yes, 0 = no.
gender_middle Proportion of female authors excluding first and last author, where available.
gender_last Is last author female: 1 = yes, 0 = no.
gender_diversity measure of gender dispersion
age_average Average age of authors.
age_first Age of first author.
age_middle Average age of authors excluding first and last author, where available.
age_last Age of last author.
age_diversity measure of age dispersion
nation_proportion Proportion of western authors.
nation_first Is first author western: 1 = yes, 0 = no.
nation_middle Proportion of western authors excluding first and last author, where available.
nation_last Is last author western: 1 = yes, 0 = no.
nation_diversity measure of nationality dispersion
Org-* Logical vectors indicating Unibas faculty-level organizations involved in publication.