SWIB23 – Semantic Web in Libraries Conference


Programme

DAY 1 | Monday, 2023-09-11

10:00–12:15

Satellite Event

VIVO meetup

Christian Hauschke 1, Jose Francisco Salm Junior 2
1 Technische Informationsbibliothek (TIB), Germany; 2 Universidade Federal de Santa Catarina, Brazil

Abstract

VIVO is a linked-data-based research information system. This co-located event serves as a user meeting of the VIVO community to discuss current concerns and deepen cooperation. Topics include improving the ontologies for describing research projects and funding, and mapping initiatives, for example with the CERIF community. Anyone interested in research information systems and related ontologies (KDSF, Lattes, PTCris, ROH/Hercules, VIVO) is welcome. The meeting will be held in English. Agenda and meeting notes: link

13:00–19:00

Workshops

Introduction to the Annif automated indexing tool

Osma Suominen 1, Mona Lehtinen 1, Juho Inkinen 1, Anna Kasprzik 2, Lakshmi Bashyam 2
1 National Library of Finland, Finland; 2 ZBW – Leibniz Information Centre for Economics, Germany

Abstract

Many libraries and related institutions are looking at ways of automating their metadata production processes. In this hands-on tutorial, participants will be introduced to the multilingual automated subject indexing tool Annif as a potential component in a library’s metadata generation system. By completing exercises, participants will get practical experience on setting up Annif, training algorithms using example data, and using Annif to produce subject suggestions for new documents. The exercises cover basic usage as well as more advanced scenarios.

We provide a set of instructional videos and written exercises for participants to work through on their own before the tutorial; see the Annif-tutorial GitHub repository. The event itself will be dedicated mainly to solving problems, asking questions, and discussion. Participants are asked to use a computer with at least 8 GB of RAM and 20 GB of free disk space to complete the exercises. The organizers will provide the software as a preconfigured VirtualBox virtual machine; alternatively, Docker images and a native Linux install option are available. No prior experience with Annif is required, but participants are expected to be familiar with subject vocabularies (e.g. thesauri, subject headings, or classification systems) and with subject metadata that references those vocabularies.
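
For orientation, here is a minimal sketch of how a trained Annif project can be queried over its REST API from Python (the local URL, the project identifier "yso-en", and the sample text are assumptions for illustration; the tutorial itself covers the command line and web UI in detail):

    import requests

    # Ask a locally running Annif instance for subject suggestions.
    # Assumes the Annif web API listens on localhost:5000 and that a project
    # called "yso-en" has been trained -- both are illustrative values.
    ANNIF_URL = "http://localhost:5000/v1/projects/yso-en/suggest"

    text = "The study examines the effects of monetary policy on employment."
    response = requests.post(ANNIF_URL, data={"text": text, "limit": 5})
    response.raise_for_status()

    for result in response.json().get("results", []):
        # Each suggestion carries a vocabulary URI, a label and a confidence score.
        print(f'{result["score"]:.3f}  {result["uri"]}  {result["label"]}')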

Enno Meijers, Bob Coret
National Library of the Netherlands / Dutch Digital Heritage Network

Abstract

How can you find information in heterogeneous Linked Data sources, available at different locations and managed by different owners? At the Dutch Digital Heritage Network we developed a federated querying service to solve this problem: the Network of Terms. We encourage cultural institutions to publish their data as Linked Data and assign standardized terms to their digital heritage information. Terms are standardized descriptions of concepts or entities that make heritage easier to find for anyone interested in it. Yet it is quite a challenge for institutions to use these terms, because the sources in which the terms are managed – such as thesauri and reference lists – have different API protocols and data models. The Network of Terms removes these barriers. It is a service that searches Linked Data sources in a federated way using the Comunica framework. It has a unified SKOS-based GraphQL API that can be easily implemented in collection registration or other systems. It searches the sources for matching terms in real time, using SPARQL. The Network of Terms is published as open source and its API is already integrated in five commercial systems and one open source system for collection registration. It also provides a Reconciliation API for use in OpenRefine and other tools.

Although the core of the software is written in TypeScript, the configuration for adding terminology sources through a so-called catalogue requires only basic knowledge of JSON and SPARQL. For testing purposes, local instances can easily be set up in Node.js-compliant environments. In this workshop we demonstrate the functionality and work with participants to build their specific catalogue using the terminology sources of their preference.
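
To give a rough idea of what a client call looks like, here is a sketch in Python (the endpoint URL is the public Network of Terms API; the exact GraphQL field names and the source identifier are assumptions and may differ from the current schema):

    import requests

    # Illustrative sketch: search the Network of Terms for matching terms.
    # The GraphQL query shape and the source identifier are assumptions;
    # consult the Network of Terms documentation for the authoritative schema.
    ENDPOINT = "https://termennetwerk-api.netwerkdigitaalerfgoed.nl/graphql"

    QUERY = """
    query ($sources: [ID]!, $query: String!) {
      terms(sources: $sources, query: $query) {
        source { name }
        result {
          ... on Terms {
            terms { uri prefLabel altLabel }
          }
        }
      }
    }
    """

    variables = {
        "sources": ["https://example.org/terminology-source"],  # placeholder source ID
        "query": "molen",
    }
    resp = requests.post(ENDPOINT, json={"query": QUERY, "variables": variables})
    resp.raise_for_status()
    print(resp.json())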

Truly shared cataloguing ecosystem development workshop

Jason Kovari, Simeon Warner, Steven Folsom
Cornell University, United States of America

Abstract

Current cataloguing practice entails copying data from shared pools into local environments, which enables local editing, but at the cost of data divergence and substantial complexity when trying to aggregate data or perform large-scale enhancement operations. Meanwhile, many BIBFRAME proofs of concept are simply switching MARC for BIBFRAME and thus continuing the practice of copying data. To fulfill the promise of linked data, institutions must stop copying data and instead move to shared source data where groups of institutions consider their data of record to live in stores outside their sole control. This transition is as much a social challenge as it is a technical one.

In this workshop, participants will collaborate to develop the idea of what it means to move cataloguing practice from its current state to one where work is performed in shared data stores, using linked-data approaches rather than copying. Participants will directly engage in structured brainstorming and designing infrastructure components, workflows and metadata issues relevant to shifting to this model. The facilitators will outline some initial thoughts resulting from ten years of BIBFRAME and linked data for cataloguing work, offer these for discussion, and then proceed with several rounds of breakout and discussion. The facilitators will collect notes throughout the workshop and compile a summary report to be placed in an open-access repository soon afterwards.

Creating linked open usable data with Metafacture

Pascal Christoph, Tobias Bülte
hbz, Germany

Abstract

Metafacture is a powerful free and open source ETL tool. We will learn how to transform bibliographic data from MARC21 to JSON-LD (see link for an example of a result of the ETL from Alma’s MARC21 to lobid’s JSON-LD). Best practices for doing this sensibly are shown, i.e. defining a context and creating URIs instead of strings for identifiers (IDs). We will enrich our records with SKOS data to gain valuable IDs and strings. No installation is required – all you need is a browser.
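
To illustrate the kind of output such a transformation aims for – not Metafacture’s Flux/Fix syntax itself, just a hand-written sketch of the target structure – a record with an explicit JSON-LD context and URIs instead of bare strings might look like this:

    import json

    # Hand-written sketch of the target structure (not Metafacture syntax):
    # a JSON-LD record with an @context and URIs instead of bare strings.
    record = {
        "@context": {
            "title": "http://purl.org/dc/terms/title",
            "subject": {"@id": "http://purl.org/dc/terms/subject", "@type": "@id"},
        },
        "@id": "https://lobid.org/resources/99370799",  # illustrative record URI
        "title": "Example title taken from MARC field 245",
        # A GND identifier expressed as a URI rather than the string "4123456-7":
        "subject": ["https://d-nb.info/gnd/4123456-7"],
    }
    print(json.dumps(record, indent=2))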

Nightwatch – metadata management at your fingertips

Marie-Saphira Flug, Lena Klaproth
Staats- und Universitätsbibliothek Bremen, Germany

Abstract

In addition to the actual library catalogue, the provision of further specific search engines for a wide variety of search spaces is an increasingly important topic in many libraries. For this purpose, metadata from very different sources in diverse formats must be procured, normalized, and adequately indexed. With a large number of data sources and formats these processes often become confusing, and purely manual processing quickly exceeds the available personnel resources. To automate and monitor the numerous processes for continuously updating metadata as much as possible, we have been developing the internal process management tool “Nightwatch” since 2017. Nightwatch automatically controls all essential phases of the processing of the individual metadata packages in the “life cycle” of the metadata.

In this workshop, participants will create and execute a complete metadata-processing cycle themselves, from downloading metadata from Crossref to converting and indexing it. Requirements are a laptop with Docker (Compose) installed, preferably with the required images already downloaded, and beginner-level Python knowledge. A repository with all materials is available at https://gitlab.suub.uni-bremen.de/public-projects/nightwatch-workshop and will be updated before the workshop.
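
As a taste of the first step of that cycle, metadata can be pulled from the Crossref REST API as sketched below (the query and field selection are purely illustrative; Nightwatch’s contribution is the scheduling, monitoring, and error handling wrapped around such steps):

    import requests

    # Minimal sketch of the "download" phase: fetch a page of records from
    # the Crossref REST API. Query and field selection are illustrative only.
    resp = requests.get(
        "https://api.crossref.org/works",
        params={"query": "linked data", "rows": 5},
        timeout=30,
    )
    resp.raise_for_status()

    for item in resp.json()["message"]["items"]:
        doi = item.get("DOI")
        title = (item.get("title") or ["(no title)"])[0]
        print(doi, "-", title)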

DAY 2 | Tuesday, 2023-09-12

09:00–10:30

Opening

Moderator: Anna Kasprzik

Welcome

Klaus Tochtermann 1, Silke Schomburg 2, Hans-Jörg Lieder 3
1 Director of ZBW – Leibniz Information Centre for Economics, Germany; 2 Director of North Rhine-Westphalian Library Service Centre (hbz), Germany; 3 Berlin State Library – Prussian Cultural Heritage

Keynote: Inherently Social, Decentralized, and for Everyone

Sarven Capadisli
Web Standards Architect

Abstract

Let’s delve into the realms of the lawful, neutral, and chaotic social web. This talk examines the social and technical challenges involved in empowering individuals and communities to assert control over their identity, data, and privacy while having the freedom to choose applications that align with their needs. Specifically, we will explore the advancements in open standards development and deployments in the wild that strive to improve communication and knowledge-sharing for everyone. While there is a plethora of open approaches for the decentralized web, this talk will decipher the underlying philosophies and technical architectures behind initiatives like the Solid project and the Fediverse, with some focus on their implications for scholarly communication. We will analyze the possibilities for interoperability among different classes of products in this expansive ecosystem and consider the potential impact of these efforts on libraries, as well as the potential roles libraries can play in this dynamic space.

Slides   Video  

From ambition to go live: The National Library Board of Singapore’s journey to an operational linked data management & discovery system

Richard Wallis
Data Liberate, United Kingdom

Abstract

Like many institutions, the National Library Board (NLB) is responsible for curating, hosting, and managing many disparate systems across the National Library, the national archives, and the public libraries, covering both print and digital resources. The NLB team evolved an ambitious vision for a management and discovery system built upon linked open data and the web, encompassing the many resources they manage. Richard, the project’s consultant for linked, structured, and web data and for library metadata, explores the two-year journey to a live production system.

The agile project between NLB and commercial partners (utilizing a cloud-based environment hosting a semantic graph database, a knowledge graph development & management platform, and serverless compute services) overcame many interesting challenges, including the absence of a single source of truth. The requirements were to provide a continuously updated, reconciled aggregation of data sources – a standardized view of five separate source systems, each with its own data formats, data models, and curation teams – together with a data model capable of supporting a public discovery interface. This required developing a model that provides a consistent view of all entities regardless of source system. It was constructed using a combination of the BIBFRAME and Schema.org vocabularies and involved a data model capable of consolidating properties from multiple sources into single primary entities, automatic ingest and reconciliation of daily delta data dumps from source systems, and managing a project team spread across multiple geographies and timezones. There were lessons learnt and practical future plans made, which Richard will also discuss.

Slides   Video  

10:30–11:00

Coffee Break

11:00–12:30

Authorities

Moderator: Joachim Neubert

Supporting sustainable lookup services

Steven Folsom
Cornell University Library, United States of America

Abstract

As presented at previous SWIB conferences, the LD4P Authority Lookup Service, maintained by Cornell University Library, is designed to be a translation layer that provides an easily consumable normalized API response when searching across multiple authority data sources.

This presentation will reflect on Linked Data for Production (https://wiki.lyrasis.org/display/LD4P3)’s recent efforts to more sustainably support lookups in Sinopia (an RDF cataloguing tool, https://sinopia.io/), and share what we have learned about the requirements for robust lookup services. We will describe the opportunities and challenges of relying on caching for lookups, and explain how the significant maintenance costs associated with caching a large number of authorities led us to a no-cache approach, where we instead translate existing authority search APIs into a normalized response. We maintain hope that cache-based lookup services will be feasible for vendors and consortia to support. To this end, and based on the persistent challenge we faced in keeping cached data current, this talk will also describe an API specification we are developing as a possible way to standardize how data providers communicate changes to their data more frequently than periodic data dumps.
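
The translation-layer idea can be pictured with a small sketch like the following (it queries the OpenSearch-style suggest service at id.loc.gov; the normalized output shape is an assumption for illustration, not the actual response format of the LD4P lookup service):

    import requests

    def lookup_loc_names(query: str) -> list[dict]:
        """Search the LoC names suggest service and normalize the response.

        Sketch only: the normalized dict layout is an illustrative assumption,
        not the LD4P Authority Lookup Service's real response format.
        """
        resp = requests.get(
            "https://id.loc.gov/authorities/names/suggest/",
            params={"q": query},
            timeout=30,
        )
        resp.raise_for_status()
        _query, labels, _descriptions, uris = resp.json()
        return [{"label": label, "uri": uri} for label, uri in zip(labels, uris)]

    print(lookup_loc_names("Octavia Butler"))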

Slides   Video  

Enno Meijers
National Library of the Netherlands / Dutch Digital Heritage Network

Abstract

How can you find information in heterogeneous linked data sources, available at different locations and managed by different owners? At the Dutch Digital Heritage Network we developed a federated querying service to solve this problem: the Network of Terms. We encourage cultural institutions to publish their data as Linked Data and assign standardized terms to their digital heritage information. Terms are standardized descriptions of concepts or entities that make heritage easier to find for anyone interested in it. Yet it is quite a challenge for institutions to use these terms, because the sources in which the terms are managed – such as thesauri and reference lists – have different API protocols and data models.

The Network of Terms removes these barriers. It is a service that searches linked data sources in a federated way using the Comunica framework. It has a unified SKOS-based GraphQL API that can be easily implemented in collection registration or other systems. It searches the sources for matching terms in real time, using SPARQL. The Network of Terms is published as open source and its API is already integrated in five commercial systems and one open source system for collection registration. It also provides a Reconciliation API for use in OpenRefine and other tools. The Network of Terms has already been widely adopted in the cultural heritage domain in the Netherlands, but we think it has potential for use in other countries and domains too, and we would like to present our tool to the SWIB audience.

Slides   Video  

Wikibase as an institutional repository authority file

Joe Cera, Michael Lindsey
Berkeley Law Library, United States of America

Abstract

The Berkeley Law Library manages the institutional repository (IR) for Berkeley Law. The IR is part of the institutional TIND integrated library system (ILS). Since these systems are linked through many shared functionalities, we explored various options for creating authority records for the IR that wouldn’t interfere with the ILS authority record file. Prior to this effort, there were no authority records for the IR and all data was added manually, which created a great potential for data to get out of sync.

For the past 5 years, we have been using Wikidata QIDs as Berkeley Law faculty identifiers in the IR. To maintain consistency, we used the Wikidata API to create an external HTML page that gathers a handful of properties for each QID in the IR, which made manual entry easier and more consistent. In the past 6 months, the IR platform had an authority file update which motivated us to move away from a spreadsheet approach to managing the structure of these records to a more linked-data-friendly and dynamic method. We created a Wikibase using the wikibase.cloud platform to host this data and explore the potential of using our own Wikibase for internal applications. This instance helps us track data that is more local to our needs while also maintaining connections to other data sources. Entity modification timestamps and revision IDs are used to determine whether or not to initiate an update to our Wikibase or to Wikidata itself. Updates are limited to a controlled list of properties, such as Library of Congress identifiers, ORCID, CV, etc. The directionality of particular properties is governed at the application level by the narrative logic of the script.
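
A minimal sketch of the kind of lookup involved is shown below (the QID and the properties picked out are illustrative; the actual workflow also decides whether to write changes back to the local Wikibase or to Wikidata):

    import requests

    # Fetch selected properties for a Wikidata item via the MediaWiki API.
    API = "https://www.wikidata.org/w/api.php"

    def get_entity(qid: str) -> dict:
        resp = requests.get(
            API,
            params={"action": "wbgetentities", "ids": qid,
                    "props": "labels|claims", "format": "json"},
            timeout=30,
        )
        resp.raise_for_status()
        return resp.json()["entities"][qid]

    entity = get_entity("Q42")  # illustrative QID
    # P244 = Library of Congress authority ID, P496 = ORCID iD
    for prop in ("P244", "P496"):
        for statement in entity["claims"].get(prop, []):
            value = statement["mainsnak"].get("datavalue", {}).get("value")
            print(prop, value)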

Slides   Video  

12:30–14:00

Lunch

14:00–14:25

Lightning Talks

Moderator: Jakob Voß

Use the opportunity to share your latest projects or ideas in a short lightning talk. Talks are registered after the start of the conference.

paprika.idref.fr

Carole Melzac
ABES, France

Slides   Video  

Extracting metadata from grey literature using large language models

Osma Suominen
National Library of Finland

Slides   Video  

Integrating Network of Terms into Cocoda Mapping Tool

Stefan Peters
VZG, Germany

Slides   Video  

openglam-de@lists.wikimedia.org – New Mailing List on Wikimedia for Galleries, Libraries, Archives, Museums from German-speaking Countries

Eva Seidlmayer
ZB MED, Germany

Slides   Video  

Recent SkoHub developments

Adrian Pohl
hbz, Germany

Slides   Video  

14:30–15:30

Breakout Sessions

Shaping the future SWIB

Adrian Pohl 1, Anna Kasprzik 2
1 North Rhine-Westphalian Library Service Centre (hbz), Germany; 2 ZBW – Leibniz Information Centre for Economics, Germany

Abstract

hbz and ZBW have been organizing SWIB for 15 years and have decided not to go back to a pre-COVID-19 status with annual in-person meetings in Germany. In this breakout session led by Adrian and Anna, we come together to share ideas for a future SWIB that remains international and interactive while also being inclusive and climate-friendly.

Standardizing Changes to Entity Datasets

Simeon Warner, Steven Folsom
Cornell University, United States of America

Abstract

The Entity Metadata Management API (EMM API) is an effort within the LD4 community to define a specification for communicating changes to linked data entity datasets, so that data consumers are aware of new, updated, and deprecated entities as the dataset evolves over time. Understanding these types of changes is critical for a number of use cases, including local caching of labels and caching of full datasets. Simeon Warner and Steven Folsom will provide an introduction to the preliminary specification and facilitate a discussion to gather initial reactions and capture use cases for possible future versions of the API.

Libraries, Wikidata, and linked data projects

Will Kent1, Eduards Skvireckis2
1 Wiki Education, United States of America; 2 National Library of Latvia, Latvia

Abstract

Eduards Skvireckis and Will Kent facilitate a breakout session to explore how library professionals are working with Wikidata. Whether it is using Wikidata’s data, adding to Wikidata, or building projects with Wikidata in a library setting, we invite you to attend this session and share your experiences.

Best Practices for sharing and discovering ETL workflows

Tobias Bülte
North Rhine-Westphalian Library Service Centre (hbz), Germany

Abstract

Many libraries do some kind of metadata transformation, e.g. for indexing different data sources in a discovery service or for migrating to another LMS. However, ETL (extract, transform, load) workflows are usually not shared between libraries, let alone made discoverable for a wider community.

In this session we want to work out best practices to foster sharing and reuse of ETL workflows by discussing the following questions: What are use cases for finding and reusing ETL processes? How would people like to discover existing workflows? What core information helps discoverability, and how should it be provided?

SoVisu+: starting point and foundations of a national CRIS

David Reymond1, Joachim Dornbush2, Alessandro Buccheri2, Raphaëlle Lapôtre3
1 Institut Méditerranéen des Sciences de l’Information et de la Communication, Toulon, France; 2 University Paris, France; 3 Ecole des Hautes Etudes en Sciences Sociales, France

Abstract

In this session, David Reymond and Raphaëlle Lapôtre would like to discuss the SoVisu+ project.

SoVisu+ is a project aimed at developing a modular, open-source, and shared Current Research Information System (CRIS), tailored to the needs of the French higher education and research community. It grounds its low-level aspects in Solid protocols, pods, and a linked data information architecture. SoVisu+ consolidates two existing proofs of concept (SoVisu and EFS, the latter an expert finder system powered by LLMs) and federates a growing number of institutions. Its development is underway to support researchers and governance in implementing Open Science policies, with a focus on improving the quality of researchers’ bibliographical data. We are consolidating these developments and plan to use Semantic Web technologies to extend their capabilities. SoVisu+ makes it possible to capitalize on, trace, and dynamically index researchers’ areas of expertise, with the singularity of inviting them to be actors of the ecosystem: researchers check the quality of their own data and participate in this indexing. The project will impact many aspects of the information systems, but also data flows, data quality verification, and the relations between researchers and libraries. In this session, we will present SoVisu’s functionalities and SoVisu+’s planned architecture, and we would like to discuss the challenges and gather feedback.

15:30–16:00

Coffee Break

16:00–17:00

Data Modelling

Moderator: Osma Suominen

Hollinger’s Box: The retrieval object at the edge of the ontology

Ruth K. Tillman 1, Regine Heberlein 2
1 Penn State University Libraries, United States of America; 2 Princeton University Library, United States of America

Abstract

This presentation examines the extent to which recent and emerging linked data ontologies for archival description allow for a deliberate “contact zone” between intellectual and physical description of archival resources. We define as the contact zone the descriptive space occupied by the retrieval object: the place where the handoff occurs between a discovery system and an administrative system concerned with object management. Archival resources, due to their aggregate and unattested nature, present specific descriptive challenges not shared by other domains within cultural resource description. More often than not, archival resources are represented in intellectual groupings in a way that contextualizes their meaning, while being housed and shelved opportunistically in nested containers in a way that minimizes their footprint, thereby dissociating their physical presence from their intellectual groupings. As a result, the intellectual object identified and requested by a researcher rarely maps one-to-one to the retrieval object stored and paged by a repository.

In this presentation, we examine three recent or emerging linked data standards for archival resources – Records in Contexts, the Art and Rare Materials BIBFRAME extension, and Linked.Art – from the perspective of facilitating retrieval. To what extent do these graphs either engage with the concept of the retrieval object directly or position their edge as a contact zone that may seamlessly integrate with external models defining such an object? How comfortably do the graphs map onto current community practices for handling the ultimate goal of our researchers: getting their hands on the box?

Slides   Video  

Development of the Share-VDE ontology: goals, principles, and process

Tiziana Possemato 1, Jim Hahn 2, Oddrun Ohren 3
1 @Cult/Casalini Libri, Italy; 2 University of Pennsylvania, United States of America; 3 National Library of Norway

Abstract

Share-VDE (SVDE) is a library-driven initiative which brings together the bibliographic catalogues and authority files of a community of libraries in a shared discovery environment based on the linked data ontology BIBFRAME. The SVDE ontology is an extension of BIBFRAME, and its design choices support discovery tasks in the federated linked data environment. This presentation describes the ontology design process, goals, and principles. The overall goals of the SVDE ontology are: 1) to use the Web Ontology Language (OWL) to publish the classes, properties, and constraints used in the SVDE environment; 2) to clarify the relationship between Share-VDE entities and other linked data vocabularies and ontologies; and 3) to provide internal (to SVDE) and external (to Library of Congress BIBFRAME) consistency and clarity for the classes and properties used in the discovery layer of SVDE. The SVDE ontology is intended neither as a complete departure from BIBFRAME nor as an all-new ontology; rather, it is based on BIBFRAME, which it extends. An overarching design principle is to reuse existing vocabularies wherever possible to reduce the complexity of the SVDE ontology. The ontology editing process began by evaluating existing SVDE classes and documenting them in OWL, moved next to properties, and concluded by evaluating any needed restrictions for entities. Entities discussed in this presentation which are novel to SVDE include svde:Opus and svde:Work, as well as the property svde:hasExpression. The SVDE ontology is interoperable with other bibliographic models through direct references among BIBFRAME, LRM, and RDA entity sets. Interoperability is achieved by asserting that bibliographic entities are described by attribute sets.

Slides   Video  

19:00–21:00

Dinner

Please register for the dinner when you register for the conference.

DAY 3 | Wednesday, 2023-09-13

09:00–10:30

Utilizing Wikimedia

Moderator: Katherine Thornton

From EAD to MARC to Wikidata and back again: tools and workflows for archival metadata at the University of Washington libraries

Crystal Yragui, Adam Schiff
University of Washington Libraries, United States of America

Abstract

The Labor Archives of Washington (LAW) is a collaboration between the University of Washington Libraries (UWL) and the Harry Bridges Center for Labor Studies with support from the local labour community. With over two hundred collections, the archive is one of the largest repositories for labour materials in the United States. Since 2020, UWL has been finding creative ways to integrate traditional library metadata for archival collections into Wikidata. The goal of this work is to better serve researchers by contributing linked open data to Wikidata and to link back from Wikidata to local library metadata.

The presenters will describe semi-automated workflows for the creation of Wikidata items for the people, organizations, and collections in LAW. EAD finding aids are used to generate Wikidata items for agents using Python and OpenRefine. The EAD is also used to generate MARC bibliographic records for the collections, which are then mapped to Wikidata using MarcEdit and OpenRefine. The presenters will also discuss the creation and implementation of a new Wikidata property for Archives West finding aid identifiers and the UWL Wikidata application profile for archival collections. Finally, the presenters will share several workflows for interlinking MARC records, library authority records, archival finding aids, Wikipedia articles, and Wikidata items using open source tools such as MarcEdit, OpenRefine, and QuickStatements.
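
One small piece of such a pipeline can be sketched as follows (the agent data is made up; the real workflow derives it from EAD finding aids and adds many more statements, references, and the Archives West identifier property):

    # Sketch: turn agents extracted from EAD finding aids into QuickStatements
    # (v1) commands that create new Wikidata items. Sample data is made up.
    agents = [
        {"label": "Example Labor Council", "instance_of": "Q43229"},  # Q43229 = organization
    ]

    lines = []
    for agent in agents:
        lines.append("CREATE")
        lines.append(f'LAST\tLen\t"{agent["label"]}"')      # English label
        lines.append(f'LAST\tP31\t{agent["instance_of"]}')  # instance of

    print("\n".join(lines))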

Slides   Video  

Developing a linked data workflow using Wikidata

Will Kent
Wiki Education, United States of America

Abstract

As Wikidata becomes a more popular hub for linked data for libraries and the whole internet, more and more linked data projects use Wikidata. Because it is a decentralized, open-source, community-oriented platform, there is no manual for Wikidata work. On one hand this is a relief, as it encourages any approach to projects and linked data. On the other hand, it can be overwhelming to start, execute, and finish a project. This presentation will explore some essential takeaways from several project-oriented Wikidata courses.

These courses gather library professionals in an effort to create a working community around project-based work. The nature of these projects is diverse – every collection and institution has its own way of setting goals, scoping the work, handling collection specifics, and deciding how staff can engage with these projects. Drawing on the experience of these participants, we have identified some universal recommendations for realizing Wikidata projects. Embedded in these takeaways are broader themes of linked data education and training around crowdsourcing data models, contributing to community ontologies, and collaboration across collections. Rather than a prescribed set of steps, the conclusions of this presentation are meant to serve as a toolkit for anyone at any point of Wikidata work at their library. In sharing with this community, it is our hope that we can foster more project work and begin to address some of the systemic bias and major content gaps on Wikidata. We also hope to encourage more interest in linked data, its applications, and its adoption into new institutions, tools, and research.

Slides   Video  

Entity linking historical document OCR by combining Wikidata and Wikipedia

Kai Labusch, Clemens Neudecker
Berlin State Library, Germany

Abstract

Named entities like persons, locations and organisations are a prominent target for search in digitized collections. While named entity recognition can be used to automatically detect named entities in texts, through the additional disambiguation and linking of the entities to authority files their usability for retrieval and linking to other sources is significantly improved.

We used Wikidata to construct a comprehensive knowledge base that holds information on linkable entities and combined it with a Wikipedia-derived corpus of text references that can be used by a neural-network-based entity linking system to find references to entities in historical German texts. We demonstrate the feasibility of the approach on ~5,000,000 pages of historical German texts obtained by OCR and show how the entity linking results can be used to group the entire historical text corpus by latent Dirichlet allocation. All software components are also released as open source for others to adapt and reuse.
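
One way such a knowledge base can be seeded is sketched below, pulling a small sample of candidate entities from the Wikidata Query Service (the query is illustrative and far simpler than the authors’ actual pipeline):

    import requests

    # Sketch: fetch a sample of persons with German Wikipedia articles from the
    # Wikidata Query Service as seed data for an entity linking knowledge base.
    SPARQL = """
    SELECT ?item ?itemLabel ?article WHERE {
      ?item wdt:P31 wd:Q5 .
      ?article schema:about ?item ;
               schema:isPartOf <https://de.wikipedia.org/> .
      SERVICE wikibase:label { bd:serviceParam wikibase:language "de". }
    }
    LIMIT 10
    """
    resp = requests.get(
        "https://query.wikidata.org/sparql",
        params={"query": SPARQL, "format": "json"},
        headers={"User-Agent": "swib23-demo/0.1 (example)"},
        timeout=60,
    )
    resp.raise_for_status()
    for row in resp.json()["results"]["bindings"]:
        print(row["item"]["value"], row["itemLabel"]["value"], row["article"]["value"])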

Slides   Video  

10:30–11:00

Coffee Break

11:00–12:30

Collections

Moderator: Jim Hahn

Machine-based subject indexing and beyond for scholarly literature in psychology at ZPID

Florian A. Grässle, Tina Trillitzsch
Leibniz Institute for Psychology (ZPID), Germany

Abstract

PSYNDEX is a reference database for psychological literature from German-speaking countries, growing at a rate of 1,000 new publications per month. The mostly scholarly papers in PSYNDEX are extensively catalogued by human indexers at ZPID (Leibniz Institute for Psychology) along several dimensions and vocabularies specific to psychological research. For the past 15 years, we used a lexical system (AUTINDEX) to generate keyword suggestions for our indexers, based on our vocabulary’s main concepts, synonyms and hidden indicators. This system has recently been replaced by the machine-learning based software Annif, and we plan to move to fully automated indexing for part of our records. In this presentation, we will discuss how we integrated Annif into our workflow and most importantly, how we try to assess and improve its suggestions.

Indexers sporadically report specific concepts that Annif failed to suggest (false negatives), or that it wrongly suggested (false positives). We will discuss the “detective work” of classifying these concepts into problem categories and our strategies of dealing with each: e.g. exclusion lists for overly general concepts (“Diagnosis”), boosting new vocabulary concepts not appearing in the training set yet (“COVID-19”), or optimizing the vocabulary itself (like adding more synonyms so lexical parts of the backend can recognize infrequently used concepts). We will also report how we fared with automatically collecting exhaustive lists of such problem concepts by comparing Annif’s suggestions with the concepts actually accepted or added by human indexers. Finally, we will present our attempts at going beyond keyword indexing: automatically marking some suggestions as “weighted” (main vs secondary topics) and suggesting publication genre or type, study methodology, and study population.
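
The comparison behind that last step boils down to simple set arithmetic per record, roughly as sketched here (the record data is made up for illustration):

    from collections import Counter

    # Sketch: collect false negatives/positives by comparing Annif's suggestions
    # with the concepts human indexers finally assigned. Records are made up.
    records = [
        {"suggested": {"Diagnosis", "Depression"},
         "accepted": {"Depression", "COVID-19"}},
    ]

    false_negatives, false_positives = Counter(), Counter()
    for rec in records:
        false_negatives.update(rec["accepted"] - rec["suggested"])   # missed by Annif
        false_positives.update(rec["suggested"] - rec["accepted"])   # suggested but rejected

    print("often missed:", false_negatives.most_common(5))
    print("often wrongly suggested:", false_positives.most_common(5))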

Slides   Video  

Implementation of the Albrecht Haupt collection portal based on the general-purpose semantic web application Vitro

Georgy Litvinov, Birte Rubach, Tatiana Walther
Technische Informationsbibliothek Hannover, Germany

Abstract

The Albrecht Haupt Collection is a collection of European drawings and prints which became publicly available this year at sah.tib.eu. In this presentation, we describe how a standard Vitro instance was adapted as an indexing environment for an art-historical collection, taking into account both the special needs of art historians and linked data principles, and how the collection was made available for further use by experts, a broad audience, and art-historical web portals. We also address the technical challenges encountered in the process as well as further developments we are working on.

To describe the items in alignment with standards of cultural data description and export, such as the CIDOC CRM and LIDO, the GESAH Graphic Arts Ontology was created as an event-centered ontology. Thus, cultural objects are described by a number of activities, such as creation or production, by agents having various roles in these activities, and by other attributes, e.g. the technique or material used. The ontology and the collection metadata have been enriched with PIDs from the Art & Architecture Thesaurus, ICONCLASS, the Getty Thesaurus of Geographic Names, the GND, and Wikidata. Apart from the specific ontology, new entry forms and display modifications were implemented in an iterative, user-centered approach based on feedback from art historians for recording and representing metadata related to the collection items. We integrated the Cantaloupe IIIF image server to provide instant access to high-definition images. Furthermore, we developed a highly customizable search that allows users to narrow down results using various filters. A challenge we are currently facing is exporting collection object descriptions to other formats such as LIDO. We will introduce the Dynamic API for the VIVO project, with which we aim to solve the aforementioned challenges.

Slides   Video  

The DDB collection and the limits of artificial intelligence

Mary Ann Tan 1,2, Harald Sack 1,2
1 FIZ Karlsruhe, Germany; 2 Karlsruhe Institute of Technology, Germany

Abstract

The German Digital Library (DDB) contains a vast collection of over 45 million digitized objects from more than 400 memory institutions across Germany, making it both voluminous and heterogeneous. This presentation includes a discussion of the results of our data analysis of the metadata of digitized objects in the library sector, taking into account the historical background of bibliographic cataloguing – the card catalogue was developed in the context of the French Revolution – with a particular focus on determining the languages, content, and length of the metadata’s descriptive attributes. The presentation also covers the results of an automatic evaluation of metadata quality according to metrics such as completeness, accuracy, conformance, consistency, and coherence. This evaluation highlights the extent to which semantic web technologies, namely linked open data and controlled vocabularies, were employed in the DDB. The ultimate goal is to improve the searchability of the DDB collection. To this end, the presentation includes a demonstration of the advantages of the FRBR-ization of DDB’s bibliographic dataset in terms of searchability. The presentation concludes with a discussion of the challenges of the aforementioned proposal, as well as solutions for each of the identified challenges, taking into consideration the recent and rapid developments in the research fields of natural language processing, machine learning, and knowledge graph embeddings.

Slides   Video  

12:30–14:00

Lunch

14:00–15:00

Aggregators

Moderator: Julia Beck

FranceArchives – a portal for the French archives

Élodie Thieblin 1, Fabien Amarger 1, Katia Saurfelt 1, Mathilde Daugas 2
1 Logilab; 2 FranceArchives

Abstract

FranceArchives is an online aggregator portal designed to provide a single access point to the metadata of about 140 French archival institutions. Based on CubicWeb, an open source framework, it allows archivists to import archival metadata, publish articles, etc. Researchers, students, and genealogy enthusiasts can search more than 21 million documents. The authorities (people, institutions, places, themes) indexed in the archives are semi-automatically aligned with linked open data repositories (GeoNames, Wikidata, data.culture.fr). This allows disambiguation and cross-referencing of archival data from different sources, and the portal’s information on the authorities is enriched thanks to these alignments. To integrate FranceArchives into the linked open data cloud, the data is also published in RDF, described with the Records in Contexts ontology (RiC-O). Records in Contexts is a new standard promoted by the International Council on Archives to reconcile the four current description standards. FranceArchives is one of the first projects to use this standard successfully at such a large scale, and one of the first archival data repositories of this size published as linked open data. A SPARQL endpoint is currently being set up to make the data queryable.
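
Once the endpoint is public, querying the RiC-O data could look roughly like the sketch below (the endpoint URL is hypothetical, and the class and property used are a simple illustrative choice):

    import requests

    # Sketch only: the SPARQL endpoint URL is hypothetical (it is still being
    # set up), and the RiC-O pattern is a deliberately simple example.
    ENDPOINT = "https://example.org/francearchives/sparql"  # hypothetical URL

    QUERY = """
    PREFIX rico: <https://www.ica.org/standards/RiC/ontology#>
    SELECT ?record ?title WHERE {
      ?record a rico:RecordResource ;
              rico:title ?title .
    }
    LIMIT 10
    """
    resp = requests.get(ENDPOINT, params={"query": QUERY, "format": "json"}, timeout=60)
    resp.raise_for_status()
    for row in resp.json()["results"]["bindings"]:
        print(row["record"]["value"], "-", row["title"]["value"])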

Slides   Video  

Towards a methodology for validation of metadata enrichments in Europeana

Antoine Isaac 1, Hugo Manguinhas 1, Valentine Charles 1, Monica Marrero 1, Nuno Freire 1, Eleftheria Tsoupra 1, José Grano de Oro 1, Paolo Scalia 1, Lianne Heslinga 1, Alexander Raginsky 2, Vadim Shestopalov 2
1 Europeana Foundation, The Netherlands; 2 Jewish Heritage Network

Abstract

Metadata enrichment is a powerful way to augment the description of cultural heritage entities, improving data discoverability and supporting end-users in having access to critical context for a given entity. In the Europeana network, several aggregators and projects aim to enrich metadata and/or content for cultural heritage objects that they provide to Europeana. This includes semantic tagging using linked open data vocabularies, translations, transcriptions, etc. A variety of processes are used to produce enrichments, from fully manual (e.g., using crowdsourcing) to fully automatic (e.g., geo-enrichment), and these will vary in terms of quality, reliability and informational value. Therefore, before the enrichments can be integrated into the Europeana services, they must be validated. If the appropriate acceptance criteria are not met, the enrichments should either be rejected or pushed back for improvement.

Europeana is defining a general methodology to assess the overall quality of the enrichments produced by a project or a tool. This methodology does not need to aim at global accept/reject decisions – it can, for example, help to select a subset of the produced enrichments that is deemed trustable enough. The validation efforts may be carried out by members/partners of the network or by the Europeana Foundation – ideally both being involved in some way. Although many projects include an evaluation effort to assess the quality of the enrichments they produce, the Europeana Foundation, as the operator of the Europeana service, has final responsibility for the quality of Europeana’s data and should therefore always be involved in the final vetting. This presentation will introduce the methodology and describe the outcomes of its first application to validate the geo-enrichments provided by the Jewish History Tours project.

Slides   Video  

15:00–15:30

Closing

Moderator: Adrian Pohl

15:30–16:00

Farewell Coffee