SWIB22: Semantic Web in Libraries Conference


Programme

DAY 1 | Monday, 2022-11-28

12:00-13:00 h UTC

Collocated Events

DINI-AG KIM meeting

Letitia Mölck1, Alex Jahnke2

1 German National Library; 2 Göttingen State and University Library, Germany

Virtual public meeting of the DINI-AG Kompetenzzentrum Interoperable Metadaten (KIM). KIM is a forum for German-speaking metadata experts from LAM institutions. The meeting will be held in German. Everybody is invited to join. Agenda

14:00-15:15 h UTC

Conference Start

Moderators: Adrian Pohl, Katherine Thornton

Opening

Silke Schomburg1, Klaus Tochtermann2

1 Director of North Rhine-Westphalian Library Service Centre (hbz), Germany; 2 Director of ZBW - Leibniz Information Centre for Economics, Germany

Libraries, linked data, and decolonization (keynote)

Stacy Allison-Cassin

Dalhousie University in Halifax, Canada

Global decolonization movements have focused attention on the colonial nature of cultural heritage. Events such as the toppling and removal of statues of colonial figures, the fight for the return of stolen artefacts and ancestors, and questions regarding the ownership and usage of historical photographs in cultural heritage collections are not new; such activities have gained momentum and greater prominence in public discourse in recent years. While less visible than tangible objects, tendrils and structures of colonialism run through data practices and standards. Such practices continue to harm peoples and lands that have been, and continue to be, subject to colonialism. It is vital that linked data practice address this continued legacy and work toward anticolonial practice within contemporary library and cultural data work.

What does decolonial (data) practice mean for linked data in the library and broader cultural data environment? Why does this matter for all individuals involved in linked data work? This talk will delve into issues facing the linked data community, including the limitations of current standards and systems, the vital need for change in practice, the importance and implications of recognizing Indigenous rights and rights of nations and states, and opportunities on the horizon with emerging standards and principles such as CARE and projects related to respectful naming. The talk will also focus on concrete actions the library linked data community can take to support decolonization and a more just and ethical future for linked data.

Slides   Video  

15:15-15:30 h UTC

Coffee break

15:30-16:45 h UTC

Linked Library Data I

Moderators: Jakob Voß, Huda Khan

Mapping and transforming MARC21 bibliographic metadata to LRM/RDA/RDF

Theodore Gerontakos, Crystal Yragui, Zhuo Pan

University of Washington Libraries, United States of America

The MARC21-to-RDA/LRM/RDF Mapping Project is an international, cross-organizational project that aims to broaden the adoption of the RDA/LRM/RDF ontology (that is, the Resource Description and Access implementation of the Library Reference Model using the Resource Description Framework). One premise of the project is that RDA-based values situated in the rich RDA/LRM/RDF context using RDA entities and relationships are more consistent and potentially more useful than when situated in less-granular formats such as BIBFRAME or non-graph-based metadata formats such as MARC21.

The starting point of the project is MARC21 data accumulated over decades, and the potential to create a lossless conversion while at the same time creating high quality RDA/LRM/RDF has been a central concern. Once the mapping is complete, there is the expectation that abundant RDA/LRM/RDF datasets will be created and used to test the efficacy of RDA entities and relationships for specific uses. To that end, the project is producing an open-source, XSLT-based conversion tool that can process most MARC bibliographic data to output a close approximation of the intent of the mapping.

Presenters describe the theoretical foundation of the project, review the mapping structure and format in detail, demonstrate the conversion tool with a focus on its RDA/LRM/RDF output, and share a vision for the future of RDA/LRM/RDF metadata in the greater library linked data ecosystem. Mappings, code, and project information can be found at the project GitHub repository.
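As a hedged illustration, a generic SPARQL query of the kind one might run over the converted RDA/LRM/RDF output simply to see which classes are populated (no project-specific IRIs are assumed):

# inventory of classes in the converted output
SELECT ?class (COUNT(?s) AS ?instances)
WHERE {
  ?s a ?class .
}
GROUP BY ?class
ORDER BY DESC(?instances)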

Slides   Video  

A crosswalk in the park? Converting from MARC 21 to Linked Art

Martin Lovell, Timothy A. Thompson

Yale University, United States of America

Yale University is currently undertaking a multiyear effort to create a cross-collections discovery environment called LUX, or “light,” from the university’s Latin motto. LUX aggregates metadata from the catalogs of Yale’s four main collecting units, including the university library. The LUX platform has been designed using the Linked Art (LA) profile of the CIDOC-CRM ontology as its common data model. Each collecting unit has developed a crosswalk from its local domain standard to the LA model. In the library, metadata librarians and software engineers have collaborated to develop and document a crosswalk from MARC 21 to LA JSON-LD.

Implementing the crosswalk has meant designing a system that is more complex than a typical transformation. A single MARC record may be expanded into multiple top-level LA entities. In total, Yale’s 12.5 million catalog records translate into 47.5 million entities in LA. To implement the semantics of the LA model and manage data dependencies within and across MARC records, a new system was developed using Java and Spring Boot with Postgres. The system was designed to serve up LA entities, exposed through an activity stream, and to provide a framework for a modular set of transformation components.

Generating linked data at scale was a new challenge for the library. The initial design used a database schema based on triples, with resolvable identifiers for all objects; this approach provided an intuitive way to create and query relationships without using a native triple store or graph database, given the learning curve that an unfamiliar system would have entailed. However, the triples-based approach proved too resource-intensive at scale, and the system was modified to store some data in JSON and resolve only top-level entities, while still storing a subset of the relationship data for querying.
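For orientation, a rough Turtle sketch of how one MARC record can expand into separate top-level entities under a CIDOC-CRM-based model (hypothetical IRIs; LUX actually publishes Linked Art as JSON-LD, and the class and property choices here are assumptions rather than the project's mapping):

@prefix crm:  <http://www.cidoc-crm.org/cidoc-crm/> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

# hypothetical IRIs, for illustration only; CRM class local names vary slightly by version
<https://example.org/text/1> a crm:E33_Linguistic_Object ;
    rdfs:label "Textual work derived from one MARC record" .

<https://example.org/object/1> a crm:E22_Human-Made_Object ;
    rdfs:label "Printed copy held by the library" ;
    crm:P128_carries <https://example.org/text/1> .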

Slides   Video  

A LITL more quality: improving the correctness and completeness of library catalogs with a librarian-in-the-loop linked data workflow

Sven Lieber, Ann Van Camp, Hannes Lowagie

Royal Library of Belgium (KBR), Belgium

Traditionally, a library maintains both bibliographic data about publications and authority data about publication contributors such as authors or publishers. At the Royal Library of Belgium (KBR), we currently maintain more than 900,000 authority records about persons and metadata records of over 3.5 million publications. As an authoritative data source we are expected to have correct and complete data. In contrast to syntactical data quality issues – which can be detected automatically – issues regarding the correctness of data or completeness of the catalogue often require semantic checks, e.g., if the correct ISBN is used and if the nationality of the contributor is correct. However, the amount of Belgian cultural heritage data makes manual quality checks very time-consuming.

Within the BELTRANS project, which studies intra-Belgian book translation flows based on publication metadata (who?, when?, where?), we developed an approach to support librarians in the detection and correction of semantic data issues. The approach integrates data from heterogeneous sources via RDF and identifies contradictory data automatically with SPARQL. Librarians can inspect the contradictory fields, choose the correct value, and thus semi-automatically resolve the contradiction.
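A minimal sketch of such a semantic check, assuming (purely for illustration) that contributor data from two sources has been integrated into named graphs using schema.org terms:

PREFIX schema: <https://schema.org/>

# contributors whose nationality differs between two integrated sources
SELECT ?person ?nationalityA ?nationalityB
WHERE {
  GRAPH <urn:example:sourceA> { ?person schema:nationality ?nationalityA . }
  GRAPH <urn:example:sourceB> { ?person schema:nationality ?nationalityB . }
  FILTER (?nationalityA != ?nationalityB)
}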

This contribution describes the approach, which is implemented as a 5-step workflow using linked data and Python. We discuss how we use the workflow to improve the correctness of the data used in the BELTRANS project and the wider applicability of the workflow within the National Library to improve the completeness of our catalog with respect to legal deposit. Since the expertise of librarians is needed to interpret identified data contradictions, the next step is to improve the usability of our prototypical solution.

Slides   Video  

16:50-17:10 h UTC

Walkthrough Mattermost / BigBlueButton Integration

Gentle introduction to the Conference Communication Tools and how they interact with the SWIB22 livestream. Take your chance to ask any question!

This is not in the live stream. For the connection link, please log into SWIB Mattermost.

DAY 2 | Tuesday, 2022-11-29

14:00-15:15 h UTC

Linked Library Data II

Moderators: Julia Beck, Joachim Neubert

BIBFRAME for academic publishing in psychology

Tina Trillitzsch

ZPID (Leibniz Institute for Psychology), Germany

PSYNDEX, maintained by ZPID, is a publicly funded reference database for psychological research literature from German-speaking countries. We are currently rewriting our cataloging and indexing software and exposing all PSYNDEX publication data as linked open data; the resulting BIBFRAME-based knowledge graph serves as the basis for our new search portal PsychPorta. This case study discusses our rationale, modeling difficulties, and tools and strategies for migration and transformation.

Our cataloging software had accumulated many errors and problems over the years; we needed a modern, maintainable application. PubPsych, our search portal for researchers, teaching staff, practicing psychologists, and laypeople, is also showing its age. To compete with commercial search engines, its successor PsychPorta needs to offer an improved, modern search experience while also providing persistent, open, interoperable, and reusable data.

PSYNDEX has high indexing standards: scientific publications are described with many details about the actual studies they document, using various keyword thesauri, classification systems, controlled vocabularies for research methods and study samples, metadata about preregistration, funding etc. BIBFRAME has little to offer for such details. To represent all this data and make it usable by humans and machines, other ontologies must be integrated, and new properties and classes have to be created. A new requirement was grouping “versions” of the same content in different publication forms – thus the adoption of BIBFRAME Works and Instances. Our experiences may be useful to other domains with indexing needs not covered by BIBFRAME, and also offer guidance on where BIBFRAME seems vague or unfinished. Our approach to handling e.g. dependent parts of things (journal articles, chapters) or aggregates and serials may be reused, thus improving interoperability.
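A minimal sketch of the Work/Instance grouping in Turtle (illustrative IRIs and titles only, not ZPID's actual data or extension properties):

@prefix bf: <http://id.loc.gov/ontologies/bibframe/> .

<https://example.org/works/1> a bf:Work ;
    bf:title [ a bf:Title ; bf:mainTitle "Example study" ] ;
    bf:hasInstance <https://example.org/instances/1a> , <https://example.org/instances/1b> .

<https://example.org/instances/1a> a bf:Instance ;
    bf:title [ a bf:Title ; bf:mainTitle "Example study (journal article)" ] .

<https://example.org/instances/1b> a bf:Instance ;
    bf:title [ a bf:Title ; bf:mainTitle "Example study (preprint)" ] .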

Slides   Video  

On leveraging artificial intelligence and natural language processing to create an open source workflow for the rapid creation of archival linked data for digital collections

Jennifer Erin Proctor

College of Information Studies, University of Maryland, United States of America

This paper proposes, tests, and evaluates an Artificial-Intelligence-supported workflow to enhance the ability of librarians and archivists to convert standardized metadata to better-than-item-level archival linked data. The protocol combines elements of computer vision with natural language processing, entity extraction, and metadata linking techniques to provide new approaches for findability and usability of cultural resources in digital spaces.

Existing metadata and, optionally, images are taken as input. Metadata text is processed with natural language processing, including sentence segmentation, tokenization, part-of-speech tagging, chunking of phrases and clauses, and finally named entity recognition, extraction, and linking. Entities are used to query HIVE2, a search tool that matches ontology terms to linked data tags, which are then parsed into triples through semantic processing.

Each image is processed to identify the people pictured in it. A cropped sub-image is created for each person; each sub-image is given a unique identifier to act as its primary linked data entity, and a first triple is created stating that this entity is depicted in the image being processed.
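A hedged sketch of the kind of triples the workflow might emit (all IRIs and property choices are illustrative assumptions, not the workflow's actual vocabulary):

@prefix schema: <https://schema.org/> .
@prefix foaf:   <http://xmlns.com/foaf/0.1/> .

# linked entities extracted from the metadata text of a hypothetical item
<https://example.org/item/123>
    schema:mentions <https://example.org/entity/person-1> ;
    schema:about    <https://example.org/entity/subject-1> .

# a person detected and cropped from the item's image
<https://example.org/item/123/image>
    foaf:depicts <https://example.org/entity/person-1> .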

Once the spreadsheet of triples is output, it can be imported into OpenRefine in order to convert it into the Wikidata format which is useful for linked digital collections, crowdsourcing, and cooperative collections-as-data programs.

Slides   Video  

Library data on Wikidata: a case study of the National Library of Latvia

Eduards Skvireckis

National Library of Latvia, Latvia

Libraries have changed, but the core purpose of the library is very much the same – to give access to knowledge and learning. Knowledge organisation in libraries has moved from clay tablets to extensive catalogues, from the backs of playing cards to card indices. Then, in the 1970s and early 1980s, integrated library systems (ILS) appeared and changed the game. Now national libraries are increasingly leaving locally used ILS behind and steering towards new technologies and possibilities that were not feasible before – bridging silos, multilingual support, alternatives to traditional authority control, etc. Although there is no right or prevailing answer, the direction is set.

And that direction is toward the semantic web and linked data which are influencing how bibliographic and authority data are created, shared, and made available to potential users. In this presentation I will focus on modeling, mapping, and creating workflows for moving the bibliographic and authority data of the National Library of Latvia to Wikidata as a major hub of the semantic web. The main focus will be on modeling and integrating library authority files (persons, organizations, works, locations, etc.) to a full extent into Wikidata, reducing any possible data loss to a minimum. This way, Wikidata will not be used as just a hub for institutional identifiers but as a universal and collaborative knowledge base where libraries can truly share and disseminate their knowledge, connect it and use it to describe and query their resources.
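For illustration, a query of the kind that supports such matching work, listing Latvian writers already described in Wikidata (run on the Wikidata Query Service, which predefines the prefixes; the property and item identifiers are standard Wikidata ones):

# writers with Latvian citizenship already in Wikidata
SELECT ?person ?personLabel WHERE {
  ?person wdt:P106 wd:Q36180 ;   # occupation: writer
          wdt:P27  wd:Q211 .     # country of citizenship: Latvia
  SERVICE wikibase:label { bd:serviceParam wikibase:language "lv,en". }
}
LIMIT 100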

Slides   Video  

15:15-15:30 h UTC

Coffee break

15:30-16:45 h UTC

Continued Progress

Moderators: Osma Suominen, Uldis Bojars

New, newer, newest: incrementally integrating linked data into library catalog discovery

Huda Khan 1, Steven Folsom 1, Astrid Usong 2

1 Cornell University, United States of America; 2 Stanford University, United States of America

We are exploring how the use of linked data can enhance discovery in library catalogs as we design, evaluate, prototype, and integrate linked data solutions as part of the Linked Data for Production: Closing the Loop (LD4P3) grant.

In one phase of this grant, we used linked data to supplement information about authors and subjects in the Cornell production library catalog. This integration used data from Wikidata, DBpedia, and the Library of Congress (LOC) linked data service. We used feedback from user studies and from our library catalog user representatives team to implement and refine these features.

In a separate phase, we focused on how the linked data representation of creative works and related information could be integrated into a discovery layer. We examined how works are aggregated in multiple BIBFRAME representations of library catalog data, such as the Library of Congress Hubs and ShareVDE BIBFRAME conversion of library data, and how these aggregations may help us identify relationships between library catalog records. In addition to this data analysis, we also implemented a prototype of the Cornell library catalog which uses BIBFRAME relationships to display related items to the user. We used Sinopia, the linked data cataloging editor being developed as part of LD4P3, to define these relationships. One of the outcomes of this phase is a better understanding of how these linked data sources can demonstrate linkages between library catalog items representing the same or related works.
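As a hedged sketch of the kind of lookup such catalog-enrichment features rely on (the authority ID value is a placeholder; the property identifiers are standard Wikidata ones, and the query assumes the Wikidata Query Service with its predefined prefixes):

SELECT ?item ?itemLabel ?image WHERE {
  VALUES ?lccn { "nXXXXXXXX" }        # placeholder Library of Congress authority ID
  ?item wdt:P244 ?lccn .              # P244: Library of Congress authority ID
  OPTIONAL { ?item wdt:P18 ?image . } # P18: image
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}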

This presentation will provide an overview of the production integration from the first phase and the data analysis and experimental prototype from the second phase. This work is further described here. Related work is also captured in this article.

Slides   Video  

Improving language tags in cultural heritage data: a study of the metadata in Europeana

Nuno Freire 1, Paolo Scalia 1, Antoine Isaac 1, Eirini Kaldeli 2, Arne Stabenau 2

1 Europeana Foundation, Netherlands; 2 National Technical University of Athens, Greece

Enhancing the multilingual accessibility of its rich cultural heritage material is a constant objective of Europeana and key to improving the user experience it offers. Technological advances are opening new ways for multilingual access but for their successful application the language of the existing data must be identified. In RDF data, language is indicated by the language tag in a dedicated attribute (xml:lang in RDF/XML). Previous studies conducted on Europeana datasets show that language tags are not used as often as they should be, but we do not have precise statistics on this. Moreover, language tags found in Europeana have data quality issues – they do not always follow established standards even though Europeana already performs some (automatic) normalisation of tags.

We conducted a study on the language tags included in the metadata of Europeana with two objectives in mind: first, to inform decision-making about possible improvements in the current language tag normalisation process, and second, to enhance the quality and quantity of training data for specialising automatic translation systems in the cultural heritage domain (a crucial objective for the Europeana Translate project, which aims to translate 25 million records of Europeana). Our study analysed the totality of the data in Europeana, which contains over 1,700 million RDF literals, and identified that only 15.9% of the literals are language-tagged. We also determined that 3.3% of the language tags are not valid according to the IETF BCP 47 standard. In our presentation, we recount the results of this study along with the improvements in the normalisation process we applied to collect training data for machine translation.
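A minimal sketch of the kind of aggregate query such an analysis implies (Europeana's actual processing pipeline is not shown here, and checking tags against BCP 47 requires a separate step):

# share of RDF literals that carry a language tag
SELECT (COUNT(?o) AS ?literals)
       (SUM(IF(LANG(?o) != "", 1, 0)) AS ?tagged)
WHERE {
  ?s ?p ?o .
  FILTER(isLiteral(?o))
}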

Slides   Video  

Evaluation and evolution of the Share-VDE 2.0 linked data catalog

Jim Hahn 1, Beth Camden 1, Kayt Ahnberg 1, Filip Jakobsen 2

1 University of Pennsylvania, United States of America; 2 Samhæng, Denmark

Share-VDE (SVDE) is a library-driven initiative which brings together the bibliographic catalogs and authority files of a community of libraries in an innovative discovery environment based on linked data. The beta release of the SVDE 2.0 catalog (https://www.svde.org) was collaboratively shaped by multiple perspectives and stakeholder groups. A team at the University of Pennsylvania Libraries gathered feedback through structured interviews and observations with library catalogers working in linked data, university faculty, and new undergraduate students in order to understand how linked data supports the user tasks promulgated in the IFLA Library Reference Model (IFLA-LRM).

Specific user tasks evaluated in remote testing sessions include how library catalogers make use of the advanced search functionality provided in the linked data interface. Context-finding tasks included evaluating how Penn catalogers might find a linked data search useful for providing context to their searching or for helping to understand a research area. The LRM mapping focused on the Identify user task, particularly the disambiguation of similar name results. For comparison, similar questions were posed to students and faculty; several targeted questions for faculty explored which relationships in linked data are useful for future research planning. In compiling the results of the study, we describe the linked data functionality and scenarios which the Share-VDE 2.0 discovery system addresses and the ways in which user feedback is supporting the evolution of linked data discovery. We will show how we have evolved the system to align with user needs based on evaluations across multiple stakeholder groups.

Slides   Video  

DAY 3 | Wednesday, 2022-11-30

14:00-16:45 h UTC

Tutorials and Workshops

An Introduction to SKOS and SkoHub Vocabs

Adrian Pohl 1, Steffen Rörtgen 2

1 North Rhine-Westphalian Library Service Centre (hbz), Germany; 2 North Rhine-Westphalian Library Service Centre (hbz), Germany

With the Simple Knowledge Organization System (SKOS), the World Wide Web Consortium (W3C) published, more than 15 years ago, a clear and simple RDF-based data model for publishing controlled vocabularies on the web following linked data principles. Although a large part of controlled vocabularies – from simple value lists to thesauri and classifications – is created and maintained in libraries, SKOS has not been widely adopted in the library world yet.

This workshop gives an introduction to SKOS with hands-on exercises. Participants will create and publish their own SKOS vocabulary using GitHub/GitLab and SkoHub Vocabs, a static site generator for SKOS concept schemes.
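For orientation, a minimal SKOS concept scheme of the kind participants will build (illustrative IRIs and labels only):

@prefix skos: <http://www.w3.org/2004/02/skos/core#> .

<https://example.org/vocab/> a skos:ConceptScheme ;
    skos:prefLabel "Example vocabulary"@en ;
    skos:hasTopConcept <https://example.org/vocab/metadata> .

<https://example.org/vocab/metadata> a skos:Concept ;
    skos:inScheme <https://example.org/vocab/> ;
    skos:prefLabel "Metadata"@en , "Metadaten"@de ;
    skos:narrower <https://example.org/vocab/authority-data> .

<https://example.org/vocab/authority-data> a skos:Concept ;
    skos:inScheme <https://example.org/vocab/> ;
    skos:prefLabel "Authority data"@en ;
    skos:broader <https://example.org/vocab/metadata> .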

Getting Started with SPARQL and Wikidata

Katherine Thornton 1, Joachim Neubert 2

1 Yale University Library, United States of America; 2 ZBW – Leibniz Information Centre for Economics, Germany

Would you like to learn how to write SPARQL queries for the Wikidata Query Service? In this workshop we will demonstrate a variety of SPARQL queries that highlight useful features of the Wikidata Query Service. We will also practice writing SPARQL queries in a hands-on session. After participating in this workshop you will have familiarity with designing and writing your own SPARQL queries to ask questions of the Wikidata Query Service.

All are welcome to participate. No experience with SPARQL or Wikidata is necessary.
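A small example of the kind of query covered in the workshop (it runs on the Wikidata Query Service, which predefines the prefixes; the Q-identifier for "library" is an assumption worth double-checking):

# libraries with known coordinates, with English labels
SELECT ?library ?libraryLabel ?coordinates WHERE {
  ?library wdt:P31 wd:Q7075 ;      # instance of: library (Q-id assumed)
           wdt:P625 ?coordinates . # coordinate location
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
LIMIT 25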

Getting started with Wikidata and SPARQL: build a dynamic web page

Daniel Scott

Laurentian University Library and Archives, Canada

In this workshop, you will learn how to explore and retrieve data from linked data stores such as Wikidata using the SPARQL Protocol and RDF Query Language (SPARQL). Starting from a simple selection of random data, you will create progressively complex SPARQL queries. As you practice each element of SPARQL, you will also learn the structure of Wikidata and the SPARQL extensions offered by the Wikidata Query Service. Finally, you will create a simple web page that, using your new knowledge of SPARQL, dynamically queries Wikidata to display data.

After completing this workshop, you will know how to: 1) write SPARQL queries to retrieve data from linked data stores such as Wikidata, 2) use the Wikidata Query Service to extract data from Wikidata, including SPARQL extensions specific to Wikidata, and 3) dynamically incorporate data from any of Wikidata’s 100 million items into a web page.

No previous experience with SPARQL, linked data, or Wikidata is required. To complete the web page portion of the workshop, some familiarity with HTML and JavaScript would be beneficial, but is not necessary. You must have some means of running a local web server to test your web pages, such as a standard install of the PHP or Python programming languages, or the Node.js runtime environment.

Structured Data in Wikimedia Commons

Christian Erlinger

Lucerne Central and University Library, Switzerland

Since 2020, Structured Data on Wikimedia Commons (SDC) has been renewing the way media files are described on Wikimedia Commons, the open and central media repository. Instead of plain (unstructured) wikitext, media files can be described and categorized with structured data, specifically with properties and items from Wikidata, the free knowledge graph.

For GLAM (galleries, libraries, archives and museums) institutions, Wikimedia Commons has long been a good place to make their own collections more widely available and to see their materials reused, e.g. in Wikipedia articles. With SDC there is now a technical way to describe those files using existing local metadata and to facilitate the reuse of metadata edits and corrections made by the community. Media file descriptions based on linked data enable a completely new way to search and query the data, in particular through federated queries with the whole Wikidata knowledge graph.

This workshop gives an introduction to working with Structured Data on Wikimedia Commons:
- the SDC data model
- differences between the “classical” Commons image description in wikitext and SDC
- manual editing of SDC
- querying SDC with SPARQL
- a presentation of tool-based SDC editing (ISA Tool, Image Positioning Annotation, OpenRefine)
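For a taste of the SPARQL part, a minimal query of the kind that can be run against the Wikimedia Commons Query Service (endpoint and identifiers as assumed here; on Wikidata, P180 is "depicts" and Q146 is "house cat"):

# Commons files whose structured data states that they depict a house cat
SELECT ?file WHERE {
  ?file wdt:P180 wd:Q146 .
}
LIMIT 20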

Prerequisites

Please register as a Wikimedia user (here) at least 4 days before the workshop to have full access to all tools and editing capabilities.

Introduction into the Solid Project and its implementations

Patrick Hochstenbach

Ghent University, Belgium

This workshop will introduce the Solid project, its protocols, and one of its implementations: the IMEC/Inrupt Community Solid Server (CSS).

The Solid project promises a solution for getting control over your own data and for choosing who and which applications can have access to it. The project was started as a response to the mass centralization of personal data by Internet platforms and the effects that security breaches of these platforms had on the manufacturing of public opinion. Although the incentive for the project was dissatisfaction with the centralization of personal information, the solution it provides doesn’t require a reboot of the Web. In the tradition of open standards, decentralized services and permissionless innovation, Solid builds protocols to improve the current Web, not to replace it.

The core Solid protocols follow Semantic Web principles and apply them to the management of (personal) data and documents. In this workshop we will discuss the principles behind Solid and its protocols, and provide hands-on experience with the current state of the implementations.
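As a minimal sketch of the kind of resource Solid builds on, a WebID profile document in Turtle (hypothetical IRIs; the solid:oidcIssuer term is assumed from the Solid terms vocabulary):

@prefix foaf:  <http://xmlns.com/foaf/0.1/> .
@prefix solid: <http://www.w3.org/ns/solid/terms#> .

<https://alice.example/profile/card> a foaf:PersonalProfileDocument ;
    foaf:primaryTopic <https://alice.example/profile/card#me> .

<https://alice.example/profile/card#me> a foaf:Person ;
    foaf:name "Alice" ;
    solid:oidcIssuer <https://login.example/> .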

Introduction to the Annif automated indexing tool

Osma Suominen 1, Mona Lehtinen 1, Juho Inkinen 1, Moritz Fürneisen2, Anna Kasprzik 2

1 National Library of Finland, Finland; 2 ZBW – Leibniz Information Centre for Economics, Germany

Many libraries and related institutions are looking at ways of automating their metadata production processes. In this hands-on tutorial, participants will be introduced to the multilingual automated subject indexing tool Annif as a potential component in a library’s metadata generation system. By completing exercises, participants will get practical experience on setting up Annif, training algorithms using example data, and using Annif to produce subject suggestions for new documents.

The participants are provided with a set of instructional videos and written exercises, and are expected to attempt to complete them on their own before the tutorial event. Exercises and introductory videos can be found in the Annif-tutorial GitHub repository. The actual event will be dedicated to solving problems, asking questions and getting a feeling of the community around Annif.

Participants are instructed to use a computer with at least 8GB of RAM and at least 20 GB free disk space to complete the exercises. The organizers will provide the software as a preconfigured VirtualBox virtual machine. Alternatively, Docker images and a native Linux install option are provided. No prior experience with the Annif tool is required, but participants are expected to be familiar with subject vocabularies (e.g. thesauri, subject headings or classification systems) and subject metadata that reference those vocabularies.

Working with linked open data in the context of annotation and semantic enrichment of 3D media: A new FOSS toolchain

Lozana Rossenova, Zoe Schubert, Paul Duchesne, Lucia Sohmen, Lukas Günther, Ina Blümel

TIB – Leibniz Information Centre for Science and Technology, Germany

This workshop aims to help researchers, digital curators and data managers learn how to make datasets including 3D models and other media files available as linked open data within a collaborative annotation and presentation-ready environment. Participants will take part in practical demonstrations using an integrated toolchain that connects three existing open source software tools: 1) OpenRefine – for data reconciliation and batch upload; 2) Wikibase – for linked open data storage; and 3) Kompakkt – for rendering and annotating 3D models, and other 2D and AV media files. This toolchain and associated workflow was developed in the context of NFDI4Culture, a German consortium of research and cultural institutions working towards a shared infrastructure for research data that meets the needs of 21st-century data creators, maintainers and end users across the broad spectrum of the digital libraries and archives field, and the digital humanities. All components of the toolchain feature graphical user interfaces aiming to lower the barrier of participation in the semantic web for a wide range of practitioners and researchers. Furthermore, the toolchain development involves the specification of a common data model that aims to increase interoperability across datasets of digitised objects from different domains of culture. The workshop will be of interest to researchers, digital curators and information science professionals who work with datasets containing 3D media, and want to learn more about the possibilities of linked open data, open source software and collaborative annotation workflows.

DAY 4 | Thursday, 2022-12-01

14:00-15:15 h UTC

RDF Insights / Lightning Talks

Moderators: Adrian Pohl, Jakob Voß

Lightning talks

Use the opportunity to share your latest projects or ideas in a short lightning talk. Talks are registered after the start of the conference.

Shapes, forms and footprints: web generation of RDF data without coding

Patrick Hochstenbach

Ghent University, Belgium

While browsing the Web a user is often confronted with a form requiring them to enter their personal data. We share book reviews on Goodreads; share our names, addresses and affiliations information with conference tools; submit our bibliography to institutional websites and centralized services such as ORCID. Filling out this information is a repetitive task for many users, using data that in principle should already be available in a knowledge graph somewhere.

For the creation of (ad hoc) forms from scratch, users do not have many options other than using platforms such as Google Forms, which provide a limited set of input fields and a Google Sheet as the end result, or asking their IT department to build a form, which can take some time to implement and publish. During the COVID-19 pandemic, many IT departments were asked to provide such ad hoc forms for all kinds of crowdsourcing, where metadata was entered by library staff working from home. All these forms have hard-coded locations where the produced data needs to be stored, so a user has no choice in that matter.

I will present an abstract approach to reading, updating, and storing RDF data in a decentralized way using RDF forms. The RDF data is defined by shapes, the forms are defined using a form ontology, and the footprint (where to store the result) is also coded in RDF. All these inputs are web resources that declare to FormViewer apps how to read, update, and store data. An entire app can be written just by manipulating RDF resources, using the Solid protocol as the persistence layer.
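As a hedged illustration of the "shapes" part only (the talk's form and footprint ontologies are not shown here, and the shape language used in the talk may differ), a minimal SHACL shape describing a book review:

@prefix sh:     <http://www.w3.org/ns/shacl#> .
@prefix schema: <https://schema.org/> .
@prefix xsd:    <http://www.w3.org/2001/XMLSchema#> .
@prefix ex:     <https://example.org/shapes/> .

ex:ReviewShape a sh:NodeShape ;
    sh:targetClass schema:Review ;
    sh:property [
        sh:path schema:reviewBody ;
        sh:datatype xsd:string ;
        sh:minCount 1 ;
    ] .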

Slides   Video  

Performance comparison of select and construct queries of triplestores on the example of the JVMG project

Tobias Malmsheimer

Hochschule der Medien, Germany

In the Japanese Visual Media Graph (JVMG) project (Project blog, JVMG web frontend (Beta)) we use the Resource Description Framework (RDF) to create a knowledge graph for researchers working with contemporary popular Japanese media. The project is funded by the German Research Foundation and the main project partners are Stuttgart Media University and Leipzig University Library.

In order to easily access our RDF data we use a triplestore and SPARQL. We initially chose the Apache Fuseki triple store because of its open licensing terms and ease of installation and management.

Once the database was completed to a certain degree, we implemented and tested our knowledge graph using different triplestore software solutions (Apache Fuseki, Blazegraph, Virtuoso and GraphDB) and compared their performance on several tasks that we consider representative operations on our data. These tasks include simple queries such as aggregating all data for a given entity and more complex analyses such as finding co-occurrences. We found major performance discrepancies, both between the different software solutions and between SELECT and CONSTRUCT SPARQL queries. Query times differ by factors of up to 100 across the software solutions, and CONSTRUCT queries consistently perform much worse, even when using the exact same WHERE patterns.
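For illustration, a SELECT/CONSTRUCT pair sharing the same WHERE pattern, of the kind compared in the benchmark (ex: is a placeholder prefix, not the JVMG vocabulary):

PREFIX ex: <https://example.org/>

# SELECT: returns a result table
SELECT ?work ?title WHERE {
  ?work a ex:Work ;
        ex:title ?title .
}

# CONSTRUCT: same WHERE pattern, but returns an RDF graph (run as a separate query)
CONSTRUCT { ?work ex:title ?title } WHERE {
  ?work a ex:Work ;
        ex:title ?title .
}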

In summary, the Apache Fuseki triple store performed quite well across all tasks. While some other software solutions were faster for some tasks, the gains were not significant enough to consider migrating our infrastructure to a new solution.

Slides  

15:15-15:30 h UTC

Coffee break

15:30-16:45 h UTC

Machine Learning

Moderators: Nuno Freire, Huda Khan

How are data collections and vocabularies teaching AI systems human stereotypes?

Artem Reshetnikov

Barcelona Supercomputing Center, Spain

Bias is a concept in machine learning meaning that the data used for training is in some way not representative of the real world, and that the patterns or models generated from it are therefore systematically skewed. While bias is a technical concept, fairness is a more social concept that can have direct implications for users. Hannes Hapke et al. define fairness as “the ability to identify when some groups of people get a different experience than others in a problematic way”. The authors illustrate the problem of fairness with an example of people who tend not to pay back their loans. If an AI model is trying to predict who should have their credit extended, this group of people should have a different experience than others, i.e. not have their credit extended. An example of a situation to avoid is when people whose application for a loan is unfairly turned down are predominantly of a certain race.

The digitalization of cultural heritage (CH) objects for machine ingestion may have started for preservation reasons, but it also allows applying AI technology to extract knowledge that can improve the user experience and be a valuable resource for GLAM. Generating digitalized data for AI use implies that it is annotated according to what is envisioned to be meaningful and relevant for future ML tasks; Iconclass is a great example. These annotations are the basis on which AI models are built, so whatever ideas and concepts were included in these structures can be recognized in the final AI model, which will reflect the possible original bias.

Our presentation will show GLAM data collections and vocabularies such as Iconclass that contain prejudices against LGBT people, gender inequality, or colonial stereotypes, and we will illustrate how these systems can embed such human stereotypes in AI-generated data and why fairness is important for AI in GLAM.

Slides   Video  

Insight into the machine-based subject cataloguing at the German National Library

Christoph Poley

Deutsche Nationalbibliothek, Germany

The German National Library has the mandate to collect online publications published in Germany or in the German language, translated from German or relating to Germany. Our approach is to provide metadata including classification codes and subject headings to make them searchable and retrievable. To handle the amount of data, which increases by about 1–2 million records every year, we introduced machine-learning-based tools into productive operations in order to be able to assign GND index terms, DDC subject categories and DDC short numbers automatically.

The first steps in that direction were taken more than ten years ago. In the last three years we have set up a project called “Erschließungsmaschine” (“indexing machine”; EMa) to replace the legacy system with a new modular and flexible architecture for automated subject cataloguing. The core component is the toolkit Annif developed by the National Library of Finland, which provides us with lexical, statistical and neural-network-based machine learning algorithms behind a unified interface.

In this presentation we want to give practical insights into the architecture, main workflows, and quality measurements of EMa within our productive environment. We also talk about our experiences and challenges concerning both human and IT resources.

Furthermore, we give a brief overview of the DNB AI project. The goal of the project is to look into state-of-the-art algorithms, text data, and vocabularies in order to improve the quality of subject cataloguing using AI methods.

Slides   Video  

Multilingual BERT for library classification in Romance languages using Basisklassifikation

José Calvo Tello 1, Enrique Manjavacas 2, Susanne Al-Eryani 1

1 Göttingen State and University Library, Germany; 2 Leiden University, Netherlands

We apply the multilingual version of the language model BERT (mBERT) to predict classes from the library classification system Basisklassifikation (BK). We frame the task as a multi-label classification problem, where each input instance (a library record) must be assigned at least one class from the BK. As input, we only use data from the catalogue; we do not use the full text of the publications. Three feature sets are considered: title only, bibliographic data only, and extended bibliographic data. We apply two algorithms: mBERT, which we fine-tune using raw text from different metadata fields as input, and a multi-label Support Vector Machine classifier trained on a vector based on the mBERT tokenizer.

We decided to work with records from Romance Studies because it challenges the perspective of considering only one or only a few languages, such as in national libraries. The dataset contains 189,134 library records associated with Romance Studies from the catalogue K10plus.

The general results for the different approaches yield micro F1-scores between 0.6 and 0.8 (macro F1-scores between 0.2 and 0.4), with better performance for mBERT as classifier. In this presentation we will explore how these results are influenced by factors such as the language, specific classes, or the number of records per class. This is a promising approach for the generation of suggestions which should be considered by subject specialists.
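For readers comparing the reported scores, the two measures aggregate per-class results differently (standard definitions, not specific to this study): macro F1 averages the per-class F1 scores, while micro F1 is computed from counts pooled over all C classes.

\[
\mathrm{F1}_{\mathrm{macro}} = \frac{1}{C}\sum_{c=1}^{C}\mathrm{F1}_c,
\qquad
\mathrm{F1}_{\mathrm{micro}} = \frac{2\,P_\mu R_\mu}{P_\mu + R_\mu},
\quad
P_\mu = \frac{\sum_c \mathrm{TP}_c}{\sum_c(\mathrm{TP}_c + \mathrm{FP}_c)},
\quad
R_\mu = \frac{\sum_c \mathrm{TP}_c}{\sum_c(\mathrm{TP}_c + \mathrm{FN}_c)}
\]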

Slides   Video  

DAY 5 | Friday, 2022-12-02

14:00-15:20 h UTC

LOD Applications

Moderators: Joachim Neubert, Julia Beck

Digital Scriptorium 2.0: toward a community-driven LOD knowledge base and national union catalog for premodern manuscripts

L.P. Coladangelo 1, Lynn Ransom 2, Doug Emery 3

1 College of Communication and Information, Kent State University, United States of America; 3 Penn Libraries, United States of America; 2 Schoenberg Institute for Manuscript Studies, Penn Libraries, United States of America

Digital Scriptorium 2.0 (DS 2.0) is a project to redevelop the technical platform of a national union catalog of premodern manuscripts representing the full range of global manuscript cultures in US collections. This paper documents the development and implementation of a workflow for aggregating and transforming metadata records for premodern manuscripts in order to structure and publish them as linked open data (LOD). Strategies and practices for extracting, enhancing, and combining metadata from heterogeneous sources were explored. Data sources included library catalogs and other institutional data repositories, ranging from metadata in CSV files to XML-structured schemas. Automated scripts extracted and combined metadata by entity type for entities such as agents, places, materials, subjects, genres, languages, and date information recorded in catalog records as string values. Using OpenRefine, string values were reconciled against linked open databases and vocabularies such as Wikidata, FAST, and the Getty’s AAT and TGN. This resulted in the creation of open-source metadata repositories to be used in the transformation of DS member metadata records into LOD for uploading into a shared knowledge base (Wikibase), thus semantically enriching existing cultural heritage metadata and enabling end-user functions like faceted search and SPARQL querying.

This approach facilitated low institutional barriers to contribution, as participating institutions did not have to alter their metadata records to contribute data and retained control over their own catalog record formats and cataloging practices. The resulting database of heterogeneous, cross-institutional published LOD can be leveraged by the community of scholars, librarians, curators, and citizen scholars who participate in Digital Scriptorium to enhance and build upon existing knowledge about their collections.
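A hedged sketch of what a reconciled record might look like after this workflow (the property IRIs are placeholders rather than the properties actually defined in the DS 2.0 Wikibase, and the AAT identifier is hypothetical):

@prefix ex:  <https://example.org/ds/> .
@prefix wd:  <http://www.wikidata.org/entity/> .
@prefix aat: <http://vocab.getty.edu/aat/> .

ex:manuscript-123
    ex:placeOfOrigin wd:Q220 ;        # reconciled place (Wikidata: Rome)
    ex:material      aat:300000000 ;  # reconciled AAT material term (placeholder ID)
    ex:dateRecorded  "s. XV" .        # original string value retained alongside the links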

Slides   Video  

The application of IIIF and LOD in digital humanities: a case study of the dictionary of wooden slips

Shu-Jiun Chen 1,2, Lu-Yen Lu1,2

1 Institute of History and Philology, Academia Sinica, Taiwan; 2 Academia Sinica Center for Digital Cultures, Academia Sinica, Taiwan

This study explores the use of the International Image Interoperability Framework (IIIF) and linked open data (LOD) in digital humanities with regard to different layers of interoperability. Focusing on imaged texts, it takes the text interpretation and slip restoration of the “Juyan Han Wooden Slips (202-220 CE)” as use case, integrates the demands of scholars in wooden slip research in the humanities, and establishes a digital research environment.

By analysing and deconstructing historians’ work on character interpretation and manuscript restoration into discrete tasks and categories, the study proposes that the digital humanities research platform needs to provide a complete, structured annotation function that allows input of annotated information from different knowledge fields. In addition, it needs to provide functionalities for reading, comparing, and referring to images, for example for the annotation of interpretations in wooden slip images, zooming in or out of image areas, or side-by-side image comparison of multiple wooden slip regions, which allow arguments about the context surrounding a character to be displayed together in the image interface.
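A minimal sketch of such an annotation in the Web Annotation model (hypothetical IRIs; the project's actual annotation structure may use IIIF-specific selectors):

@prefix oa:  <http://www.w3.org/ns/oa#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .

<https://example.org/anno/1> a oa:Annotation ;
    oa:hasBody [ a oa:TextualBody ; rdf:value "Interpretation of the character in this region" ] ;
    oa:hasTarget <https://example.org/iiif/slip-1/canvas#xywh=100,200,40,60> .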

The study has adopted the Image, Presentation and Content Search APIs of IIIF, and developed a LOD lifecycle framework in order to transform the legacy data into LOD, enable access and cross-referencing for these resources, and satisfy different needs in research. “The Wooden Slips Character Dictionary – Database of Juyan Han Wooden Slips from the Institute of History and Philology Collections” (WCD) and the “Multi-database Search System for Historical Chinese Characters” were taken as practical examples to explain how interoperability is utilised in systems for the digital humanities.

Slides   Video  

Telma Peura 1,2, Petri Leskinen 1, Eero Hyvönen 1,2

1 Semantic Computing Research Group (SeCo), Aalto University, Finland; 2 Helsinki Centre for Digital Humanities (HELDIG), University of Helsinki, Finland

BookSampo – a linked data (LD) service and portal for Finnish fiction literature – was launched in 2011 by the public libraries of Finland. Since then, the original knowledge graph (KG) has grown from 400,000 subjects to over 8,740,000, including literary works, authors, book covers, reviews, literary awards, along with other detailed semantic metadata. The portal, maintained by a team of Finnish librarians, has nearly 2 million annual users and its KG resides in a SPARQL endpoint. However, the potential of the KG has not been explored from a Digital Humanities (DH) research perspective where digital tools could help with knowledge discovery. This study presents novel data analyses on the BookSampo KG to demonstrate how LD can be used in DH studies, building on the concepts of distant reading and literary geography, i.e., the application of computational methods to study literature and its geographical dimension. Our work reveals and illustrates trends and biases in the development of Finnish literature based on semantic metadata. Focusing on novels and their geographical settings, the analyses show interesting annotation patterns related to genres and themes. Although the geographical diversity in fictional settings has increased over time, our results suggest that Finnish fiction literature still focuses on national settings and historical topics. Moreover, our digital perspective shows that the Finnish literary geography is biased by gender. Finally, we discuss to which extent the findings depend on the KG annotation practices, and how this potential meta-level bias should be addressed in the future.

Slides   Video  

Closing