What role does cultural heritage play

People develop standards

The question of standards is fundamental to the discourse on the digitization of cultural heritage. This article mainly discusses motivations: Why is cultural heritage standardized at all? It also describes current challenges and opportunities for standardization.

There are four motivations for standardizing cultural heritage:

  • To enable new analyses of digital artifacts;

  • To model phenomena in digital representations;

  • To ensure the sustainability and interoperability of digitally recorded cultural assets; and

  • To preserve culture from extinction through its digital representation.

These motivations are described below through people who have played a key role in standardization for cultural heritage. This illustrates a central, general aspect of standardization: standards are developed by people. It takes “drivers” to push standards forward and communities to shape them. The technical quality of a standard is only one factor in its success.

Roberto Busa: The father of the digital humanities

Roberto Busa was an Italian Jesuit priest. He was one of the first to use computers to represent humanities texts. Busa focused on the work of Thomas Aquinas. Together with the founder of IBM, Tom Watson, Busa discussed how this work could be captured in digital form. In a project that spanned over thirty years, Busa created the Index Thomisticus: the digital recording of the writings of Thomas Aquinas.[1]

Busa's main motivation was to create new possibilities for analysis. Linguists were to be enabled to quickly find examples of grammatical phenomena in large texts. With the Index Thomisticus, literary scholars could quickly locate passages and analyze them in their overall context.

For both linguists and literary scholars, the Index Thomisticus offers analysis options that would not have been available without the digital representation. Standardization is relevant at various levels, for example in the definition of character sets so that script can be stored and searched in detail, or in marking linguistic features such as parts of speech using a standardized feature inventory. In this role, standards are at best invisible to the actual user, i.e. the humanities researcher, who does not have to deal with technical details and can pursue the scientific question at hand.

The work of Roberto Busa founded a new scientific field in which various humanities disciplines focus on the use of digital tools: the digital humanities.

Michael Sperberg-McQueen: The coding of texts as a scientific subject

In the digital humanities, standards are not always invisible to the scientist. There are areas in which standardization itself becomes the subject of scientific discourse.

The literary scholar Michael Sperberg-McQueen has dealt not only with the digital capture of texts, in the tradition of Roberto Busa, but also with text structuring. What are the metadata, chapters, paragraphs and words in a text; what are typographical units? For these questions, a standard on which Sperberg-McQueen worked intensively is very important: the Text Encoding Initiative (TEI).[2] The TEI defines an XML format (XML stands for eXtensible Markup Language) for annotating texts with corresponding metadata, so-called markup. The TEI provides a large number of modules that are relevant for different groups of humanities scholars, for example for the creation of digital editions, for transcribing spoken language, for the notation of music, etc.
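To make the idea of markup concrete, here is a minimal sketch that parses an invented TEI-style fragment with Python's standard library. The element names (`teiHeader`, `titleStmt`, `p`) and the namespace follow the TEI guidelines, but the fragment itself is fabricated and far simpler than a real TEI document.

```python
import xml.etree.ElementTree as ET

TEI_NS = "{http://www.tei-c.org/ns/1.0}"

# An invented, heavily simplified TEI fragment: metadata in the header,
# marked-up paragraphs in the body.
fragment = """
<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt><title>Sample Edition</title></titleStmt>
    </fileDesc>
  </teiHeader>
  <text>
    <body>
      <p>First paragraph of the encoded text.</p>
      <p>Second paragraph of the encoded text.</p>
    </body>
  </text>
</TEI>
"""

root = ET.fromstring(fragment)
# The markup lets software address metadata and text structure separately:
title = root.find(f".//{TEI_NS}title").text
paragraphs = [p.text for p in root.iter(f"{TEI_NS}p")]
print(title)             # Sample Edition
print(len(paragraphs))   # 2
```

This is exactly the benefit described above: once the structure is explicit, tools can count, search and extract units without guessing where a paragraph begins.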

On the one hand, the TEI serves as a tool for the humanities researcher. On the other hand, text encoding raises questions of its own. A prominent example that Sperberg-McQueen in particular has studied intensively is the OHCO hypothesis.[3] According to this hypothesis, a text is understood as an ordered hierarchy of content objects. However, many scenarios with which humanities scholars work reveal the problems of this hypothesis: for example, critical editions with overlapping comments; transcriptions of spoken language that take pauses between speakers into account; or the linguistic analysis of non-hierarchical relationships between grammatical constituents. Nevertheless, it is understandable that this hypothesis arose: the standardized XML technology allows the definition of hierarchically ordered units, whereas non-hierarchical relationships can only be expressed to a limited extent. This problem has given rise to a research topic of its own: the representation of overlapping units in text encoding. Common approaches to solving it are described in the TEI guidelines.[4]
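The core of the overlap problem can be shown with a few lines of Python. Two annotation layers over the same text, here an invented sentence span and speaker span given as character offsets, partially overlap, so no single XML tree can nest one inside the other:

```python
# Two annotation layers over the same stretch of text, as (start, end)
# character offsets. The spans are invented for illustration.
text = "Yes I think so but wait"
sentence = (0, 18)   # one sentence-like unit
speaker = (4, 23)    # one speaker turn

def overlaps(a, b):
    """True if the spans partially overlap (neither contains the other).

    Partial overlap is exactly the case XML's strict element nesting
    cannot express: neither span can be an ancestor of the other.
    """
    return a[0] < b[0] < a[1] < b[1] or b[0] < a[0] < b[1] < a[1]

print(overlaps(sentence, speaker))  # True: not expressible as nested elements
```

Workarounds described in the TEI guidelines, such as milestone elements or stand-off annotation, essentially keep one hierarchy in the XML tree and record the other layer as offsets or pointers, much like the tuples above.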

Sperberg-McQueen, the TEI and XML show not only how a standardized format (XML) can give rise to a scientific topic (overlapping hierarchies). Sperberg-McQueen and the TEI community were themselves centrally involved in the standardization of XML, which was defined in 1998 by the World Wide Web Consortium (W3C). In other areas of cultural heritage, too, people and standardization issues have achieved visibility that extends far beyond their respective communities.

Eric Miller: Sustainability, identification and interoperability for digitally recorded cultural assets

The library world knew standardization long before the start of the digital age. Classifications are an important means of indexing objects and making them discoverable. The motivations for standardization here are interoperability, findability and sustainability. Standards-based indexing allows cataloging information to be reused across institutional boundaries.

Library cataloging is usually very detailed and geared towards the needs of libraries. The information scientist Eric Miller has contributed much to building a bridge to more general developments. For a long time he chaired the “Dublin Core Metadata Initiative”.[5] Dublin Core's standardized metadata provide considerably less detail than traditional library cataloging. However, Dublin Core is widespread and has many uses outside the library community.

Eric Miller was also responsible for the Semantic Web activity at the W3C for a long time. The name “Semantic Web” comes from a time when complex, formal-semantic modeling was at the center of the underlying RDF (Resource Description Framework) technology.[6] The focus today is on linking resources; one speaks of linked data. A central standard for this are URIs (Uniform Resource Identifiers). In non-technical terms, URIs are simply web addresses. What is special about them: not only web pages that can be read in a browser, but every object can be identified by a URI. From a library point of view, this approach is nothing new: books are given unique identifiers, for example ISBNs (International Standard Book Numbers), to make them easy to find, to identify them sustainably, and also to link them with additional information (the author of a book, relevant topics, etc.).

Linked data, however, realizes these functionalities in the largest available information space: the web. Here URIs also identify abstract objects, indeed every resource one wants to capture on the web. Tim Berners-Lee summed up this approach in a blog post by telling the reader: “… give yourself a URI. You deserve it!”[7]
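What "linking resources" means technically can be sketched in a few lines: statements (triples) whose subjects, predicates and objects are URIs, serialized here in the standard N-Triples syntax using only the Python standard library. The example.org resource URIs are hypothetical; `dc/terms/creator` is a real Dublin Core property.

```python
# Hypothetical resources linked by a real Dublin Core property.
triples = [
    ("http://example.org/artifact/faust-manuscript",
     "http://purl.org/dc/terms/creator",
     "http://example.org/person/goethe"),
]

def to_ntriples(triples):
    """Serialize (subject, predicate, object) URI triples as N-Triples lines."""
    return "\n".join(f"<{s}> <{p}> <{o}> ." for s, p, o in triples)

print(to_ntriples(triples))
```

Because every part of the statement is a globally unique URI, anyone on the web can add further statements about the same manuscript or the same person without prior coordination, which is precisely the network effect described below.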

The use of the web as an information space is very attractive for cultural heritage. The web lives from the network effect: as soon as digital representations of cultural artifacts are uniquely identified by URIs on the web, users can link to them and make use of the digital representations. In this way, cultural assets are made more widely accessible without the cultural institutions having to organize or plan this in advance.

Many librarians have recognized the potential of linked data. Corresponding use cases are discussed in the “Library Linked Data Community Group”.[8] Projects such as Europeana and the Deutsche Digitale Bibliothek follow this trend, or indeed drive it: they are increasingly making information available as linked data. Here it is important to use standardized vocabularies, such as the Europeana Data Model, to describe cultural artifacts.

Deborah Anderson: Saving the cultural assets of language and writing from digital extinction

Deborah Anderson from the Department of Linguistics at Berkeley leads the Script Encoding Initiative.[9] Its main task is the standardized encoding of writing systems that are not as common as the Latin alphabet, but also of widely used Asian writing systems for Chinese, Korean and Japanese. This work is rarely discussed in the context of cultural heritage and standardization, because cultural heritage and language are seldom related to one another. Yet language in particular is of great relevance for safeguarding the cultural heritage of many regions of the world in the long term.

Anderson is a member of the Unicode Consortium and is committed to ensuring that such writing systems are included in the global Unicode character set, which is now available on almost every computer and can also be used on the web.

The standardized representation of characters is the basis for digitally recording language as a cultural asset over the long term. The imbalance within Wikipedia, for example, shows this:[10] the sizes of the individual language editions of Wikipedia vary considerably. Given that Wikipedia is the most comprehensive description of human knowledge, this situation is worrying from the point of view of poorly represented cultures. In the long term, they may even be threatened with digital extinction.[11]

This problem is particularly relevant for Europe. The META-NET initiative has made its relevance for Europe clear in the “META-NET White Paper” series.[12] The imbalance between languages persists here too. It also affects the availability and quality of tools for translating content, for example automatic translation programs.

What role can standardization play in helping here? Standardized links between individual language versions of content can facilitate translation or adaptation to the respective culture. Wikipedia is working on relevant technologies in the Wikidata project.[13] Wikidata is based on unique identifiers (i.e. URIs) for Wikipedia articles. These are made available separately for each single-language version of Wikipedia and linked to one another via Wikidata. The next step is to establish relationships between single-language versions at the content level, for example between paragraphs or even sentences. If these relationships were expressed in a standardized way, automatic translation tools of higher quality could produce content for underrepresented languages. However, this application scenario still lies in the future.
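The Wikidata linking model can be sketched as a simple lookup structure: one identifier per concept, with per-language article titles attached. Q5879 really is the Wikidata identifier for Goethe; the small title table below is only an illustrative excerpt, not a complete or authoritative record.

```python
# One shared identifier links the per-language Wikipedia articles.
# Illustrative excerpt of Wikidata-style sitelinks for item Q5879 (Goethe).
sitelinks = {
    "Q5879": {
        "de": "Johann Wolfgang von Goethe",
        "en": "Johann Wolfgang von Goethe",
    }
}

def article_title(qid, lang):
    """Resolve a language-specific article title via the shared identifier.

    Returns None if the item or the language version does not exist,
    mirroring how a missing sitelink behaves.
    """
    return sitelinks.get(qid, {}).get(lang)

print(article_title("Q5879", "de"))
```

The step the article describes as still lying in the future would extend exactly this structure below the article level, so that paragraphs or sentences in different language versions could be linked by identifiers in the same way.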

Challenges and opportunities for standardization and cultural heritage

The digital recording of texts, their structuring, the sustainable referencing of cultural artefacts and the safeguarding of language and writing as a cultural asset - standards have already achieved a great deal for cultural heritage in all of these areas. But there are still challenges and opportunities for further development.

Industry-driven standardization versus in-depth modeling

Standardization is mainly carried out by industrial companies. One often encounters the Pareto effect: 80 percent of standardization can be achieved in 20 percent of the time; the remaining 20 percent take up 80 percent of the time.

Industrial companies usually react to this effect with rapid standardization. Bringing new technologies to market is more important than complete modeling or the standardization of every aspect. Dealing with web technologies is therefore a particular challenge for cultural heritage, because science and culture rightly claim to specify their subject area 100 percent.

Standardization of semantics - but which one?

This tension is evident in the standardization of basic, delimitable objects that are relevant both to cultural heritage and to the web as a whole. The Schema.org initiative[14] was launched by the global search engine operators to define objects such as person, institution, place, product, etc. On this basis, authors of web content can embed structured information in web pages, which is interpreted by the search engines and leads to higher rankings in search results.
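A common way to embed such structured information is a JSON-LD script block inside the page, which can be generated with the standard library. `Person`, `name` and `birthDate` are real schema.org terms; the exact HTML context is omitted for brevity.

```python
import json

# A schema.org "Person" description, to be embedded in a web page.
person = {
    "@context": "https://schema.org",
    "@type": "Person",
    "name": "Johann Wolfgang von Goethe",
    "birthDate": "1749-08-28",
}

# Search engines read this block without it being visible to human readers.
script_tag = (
    '<script type="application/ld+json">\n'
    + json.dumps(person, indent=2)
    + "\n</script>"
)
print(script_tag)
```

This also illustrates the "invisible standard" theme of the article: the markup does its work for machines while remaining unseen by the person reading the page.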

Due to the 80/20 effect, only a few properties of objects are specified in detail in Schema.org. Data models such as the Europeana Data Model (EDM)[15] are much more expressive. This tension can be partly resolved by mappings between the definitions: the detailed object descriptions are mapped onto the general ones. However, a loss of information is inevitable in this process.
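Why such a mapping necessarily loses information can be shown in a few lines. In this sketch, two distinct fields of a detailed record collapse onto one generic field of the coarser vocabulary; the field names are invented for illustration and only loosely modeled on EDM and schema.org.

```python
# A detailed record with two distinct location properties (invented names,
# loosely modeled on EDM) and a coarser target vocabulary with only one.
detailed_record = {
    "edm:currentLocation": "Weimar",
    "edm:formerLocation": "Frankfurt",
    "dc:title": "Manuscript A",
}

# Both specific location fields map onto the same generic field.
mapping = {
    "edm:currentLocation": "location",
    "edm:formerLocation": "location",
    "dc:title": "name",
}

generic_record = {}
for field, value in detailed_record.items():
    generic_record[mapping[field]] = value  # later values overwrite earlier ones

# Only one "location" survives: the current/former distinction is lost.
print(generic_record)
```

The mapping is easy to write in one direction, but the inverse is not well-defined, which is exactly the information loss the text describes.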

What is content - the boundaries of the objects dissolve

Stefan Gradmann has repeatedly pointed out that, traditionally and especially in the library sector, cultural artefacts are viewed only from the outside and as a whole.[16] A book has a year of publication, an author and a publisher. But what does it say on page 13, in the second paragraph? What does one reader think of the comments of another? These questions make it clear that the indexing of cultural heritage increasingly extends into the content itself, without clear boundaries, and includes social networks in which scientific discourse also takes place. For standardization, this means that the referenceability mentioned above must also be available for the smallest units, such as chapters, paragraphs and sentences.

For these reasons, annotation technologies for digital artifacts that allow such deep referencing are currently a relevant topic in projects such as DM2E (Digitised Manuscripts to Europeana).[17] Here, too, it is necessary to relate these efforts to those of industry, for which a W3C workshop was held this year.[18] Only in this way can it be ensured that cultural heritage remains visible on the global web in the long term.

Persistence - no longer a problem

Persistent identifiers are used to identify digitally or physically available cultural artifacts uniquely and sustainably. They are often assigned by cultural institutions. An example: the Wikipedia page on “Johann Wolfgang von Goethe” provides a number of persistent identifiers:

Authority data (person): GND: 118540238 | LCCN: n79003362 | NDL: 00441109 | VIAF: 24602065
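The identifiers above become most useful once they are expressed as resolvable web URIs. The sketch below uses the well-known resolver patterns of the GND (Deutsche Nationalbibliothek) and VIAF services; treat the patterns as illustrative conventions rather than a normative specification.

```python
# Well-known resolver URI patterns for two authority file systems.
resolvers = {
    "GND": "https://d-nb.info/gnd/{}",
    "VIAF": "https://viaf.org/viaf/{}",
}

# The Goethe identifiers from the authority data line above.
identifiers = {"GND": "118540238", "VIAF": "24602065"}

# Turn each bare identifier into a web URI for which a resolver is known.
uris = {system: resolvers[system].format(value)
        for system, value in identifiers.items()
        if system in resolvers}
print(uris["GND"])   # https://d-nb.info/gnd/118540238
print(uris["VIAF"])  # https://viaf.org/viaf/24602065
```

Once in URI form, the same identifiers can serve as subjects and objects of linked data statements, connecting the authority record to everything else published about the person.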

One question is: what format should such identifiers have? For a long time this was a point of contention between the various communities. On the web, the URIs mentioned above are the technical basis for unambiguous identification. Communities in the field of cultural heritage often rely on DOIs (Digital Object Identifiers) or other technologies (URNs, ARKs, etc.).

Discussions about an either-or are rare these days. It has become clear to all communities that no such decision has to be made. Variants of persistent identifiers, once as a complete URI including the DOI and once as a stand-alone DOI, can coexist. The DOI Foundation guarantees the persistence of domains such as http://dx.doi.org, which are used as a prefix for a URI containing a DOI identifier.[19]
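The coexistence of the two variants amounts to simple string composition, sketched here with the dx.doi.org prefix named in the text. The sample DOI 10.1000/182 is the DOI of the DOI Handbook itself, a commonly cited example.

```python
def doi_as_uri(doi, prefix="http://dx.doi.org/"):
    """Express a stand-alone DOI as a complete, resolvable URI.

    The bare DOI and the URI form identify the same object; the prefix
    domain is kept persistent by the DOI Foundation.
    """
    return prefix + doi

doi = "10.1000/182"  # the DOI Handbook's own DOI
print(doi_as_uri(doi))  # http://dx.doi.org/10.1000/182
```

A system can therefore store the compact DOI and generate the web-resolvable URI on demand, rather than having to commit to one representation.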


Standards and cultural heritage often form a fruitful relationship. The results frequently prove the relevance of this relationship well beyond the boundaries of cultural heritage.

One question has so far remained unanswered: which organizations develop the standards? Owing to his personal background, the author has placed the W3C in the foreground. However, there are numerous committees in other organizations, such as the IETF, ISO, OASIS, the TEI Consortium and the Unicode Consortium, that carry out standardization relevant to cultural heritage. Coordination between these organizations is important in order to avoid overlapping standards.

For the end users of standards, these aspects are mostly unimportant. They want to know: for what purpose do I have to use which technology? Once a standard has become widely used, it often works only covertly. One uses it without realizing it; XML is a good example of this. So it remains to be hoped that consensus on relevant standards for cultural heritage will advance vigorously. Then many standards could, in the sense of Yuri Rubinsky,[20] do their work invisibly “in the tunnels of Disneyland”.

About the author

Prof. Dr. Felix Sasaki is Senior Researcher at DFKI (German Research Center for Artificial Intelligence) and W3C Fellow, as well as co-head of the German-Austrian office of the W3C (World Wide Web Consortium), which is based at DFKI. From 1993 to 1999 Felix Sasaki studied Japanese studies and linguistics in Berlin, Nagoya (Japan) and Tokyo. From 1999 he worked at the Department of Computational Linguistics and Text Technology at Bielefeld University, where he completed a PhD in 2004 on the integration of linguistic, multilingual data on the basis of XML and RDF. Felix Sasaki has many years of experience in various standardization areas such as internationalization, web services and multimedia metadata. His main focus is on the application of web technologies for the representation and processing of multilingual information.