Introduction to Metadata
Metadata and the Web

Tony Gill

When the first edition of this book was published in 1998, the term metadata was comparatively esoteric, having originated in the information science and geospatial data communities before being co-opted and partially redefined by the library, archive, and museum information communities at the end of the twentieth century. Today, nearly a decade later, a Google search on "metadata" yields about 58 million results (see Web Search Engines sidebar). Metadata has quietly hit the big time; it is now a consumer commodity. For example, almost all consumer-level digital cameras capture and embed Exchangeable Image File Format (EXIF) 1 metadata in digital images, and files created using Adobe's Creative Suite of software tools (e.g., Photoshop) contain embedded Extensible Metadata Platform (XMP) 2 metadata.

As the term metadata has been increasingly adopted and co-opted by more diverse audiences, the definition of what constitutes metadata has grown in scope to include almost anything that describes anything else. The standard concise definition of metadata is "data about data," a relationship that is frequently illustrated using the metaphor of a library card catalog. The first few lines of the following Wikipedia entry for metadata are typical:

Metadata (Greek: meta- + Latin: data "information"), literally "data about data," are information about another set of data. A common example is a library catalog card, which contains data about the contents and location of a book: They are data about the data in the book referred to by the card. 3

The library catalog card metaphor is pedagogically useful because it is nonthreatening. Most people are familiar with the concept of a card catalog as a simple tool to help readers find the books they are looking for and to help librarians manage a library's collection as a whole. However, the example is problematic from an ontological perspective, because neither catalog cards nor books are, in fact, data. They are containers or carriers of data. This distinction between information and its carrier is increasingly being recognized; for example, the CIDOC Conceptual Reference Model (CRM), 4 a domain ontology for the semantic interchange of museum, library, and archive information, models the relationship between information objects—identifiable conceptual entities such as a text, an image, an algorithm, or a musical composition—and their physical carrier as follows:

E73 Information Object P128 is carried by E24 Physical Man-Made Stuff

The IFLA Functional Requirements for Bibliographic Records (FRBR) 5 model makes a similar four-tier distinction between Works, Expressions, Manifestations, and Items: the first three entities are conceptual entities, and only Items are actual physical instances represented by bibliographic entities.

Web Search Engines

Web search engines such as Google are automated information retrieval systems that continuously traverse the Web, visiting Web sites and saving copies of the pages and their locations as they go in order to build up a huge catalog of fully indexed Web pages. They typically provide simple yet powerful keyword searching facilities and extremely large result sets that are relevance ranked using closely guarded proprietary algorithms in an effort to provide the most useful results. The most well known Web search engines are available at no cost to the end-user and are primarily supported by advertising revenue. Web search engines rely heavily on Title HTML tags (a simple but very important type of metadata that appears in the title bar and favorites/bookmarks menus of most browsers), the actual words on the Web page (unstructured data), and referring links (indicating the popularity of the Web resource).

Of course, most library catalogs are now stored as 0s and 1s in computer databases, and the "items" representing the "works" that they describe (to use the nomenclature of the FRBR model) are increasingly likely to be digital objects on a Web server, as opposed to ink, paper, and cardboard objects on shelves. This is even more true now in light of large-scale bibliographic digitization initiatives such as the Google Book Search Library Project, the Million Book Project, and the Open Content Alliance, about which more later.

So if we use the term metadata in a strict sense, to refer only to data about data, we end up in the strange predicament whereby a record in a library catalog can be called metadata if it describes an electronic resource but cannot be called metadata if it describes a physical object such as a book. This is clearly preposterous and illustrates the shortcomings of the standard concise definition.

Another property of metadata that is not addressed adequately by the standard concise definition is that metadata is normally structured to model the most important attributes of the type of object that it describes. Returning to the library catalog example, each component of a standard MARC bibliographic record is clearly delineated by field labels that identify the meaning of each atomic piece of information, for example, author, title, subject.

The structured nature of metadata is important. By accurately modeling the most essential attributes of the class of information objects being described, metadata in aggregate can serve as a catalog—a distillation of the essential attributes of the collection of information objects—thereby becoming a useful tool for using and managing that collection. In the context of this chapter, then, metadata can be defined as a structured description of the essential attributes of an information object.

The Web Continues to Grow

The World Wide Web is the largest collection of documents the world has ever seen, and its growth is showing no signs of slowing. Although it is impossible to determine the exact size of the Web, some informative metrics are available. The July 2007 Netcraft survey of Web hosts received responses to HTTP (HyperText Transfer Protocol, the data transmission language of the Web) requests for server names from 125,626,329 "sites." 6 A site in this case represents a unique hostname such as http://www.hostname.com. The same survey in January 1996 received responses from just 77,128 Web servers; the number of Web servers connected to the Internet has grown exponentially over the past decade or so. (Fig. 1.)
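To put the two Netcraft figures quoted above in perspective, the following is a rough back-of-the-envelope sketch in Python. It assumes a constant exponential growth rate over the whole period and an elapsed time of 11.5 years (January 1996 to July 2007); both are simplifying assumptions made here for illustration, not claims from the survey itself.

```python
import math

# Figures quoted in the text from the Netcraft surveys.
sites_jan_1996 = 77_128
sites_jul_2007 = 125_626_329

# Assumed elapsed time between the two surveys, in years.
years = 11.5

# Overall growth multiple across the whole period.
growth_multiple = sites_jul_2007 / sites_jan_1996

# Assumption: constant exponential growth, N(t) = N0 * exp(rate * t).
rate = math.log(growth_multiple) / years
annual_factor = math.exp(rate)
doubling_time_years = math.log(2) / rate

print(f"growth multiple: {growth_multiple:,.0f}x")
print(f"average annual growth factor: {annual_factor:.2f}")
print(f"doubling time: {doubling_time_years:.2f} years")
```

Under these assumptions, the number of responding sites roughly doubled every year or so, which is consistent with the description of the growth as exponential.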

Although the Netcraft Web hosts survey clearly demonstrates the continuing upward trend in the growth of the Web, it does not tell the whole story because it does not address how many Web sites are hosted on each server or how many accessible pages are contained in each site.

<META NAME="KEYWORDS" CONTENT="data standards, metadata, Web resources, World Wide Web, cultural heritage information, digital resources, Dublin Core, RDF, Semantic Web">
<META NAME="DESCRIPTION" CONTENT="Version 3.0 of the site devoted to metadata: what it is, its types and uses, and how it can improve access to Web resources; includes a crosswalk.">

The original intention was that the "keyword" metadata could be used to provide more effective retrieval and relevance ranking, whereas the "description" tag would be used in the display of search results to provide an accurate, authoritative summary of the particular Web resource.
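As a minimal sketch of how a search engine or other tool might read these two tags, the Python standard library's html.parser module can collect the NAME and CONTENT pairs. The class name MetaTagParser and the shortened sample markup are illustrative assumptions; real crawlers are considerably more involved.

```python
from html.parser import HTMLParser


class MetaTagParser(HTMLParser):
    """Collects NAME/CONTENT pairs from <meta> tags (illustrative helper)."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = attributes.get("name")
        content = attributes.get("content")
        if name and content is not None:
            self.meta[name.lower()] = content


# Shortened sample modeled on the tags shown above.
sample = (
    '<META NAME="KEYWORDS" CONTENT="data standards, metadata, Dublin Core">'
    '<META NAME="DESCRIPTION" CONTENT="A site devoted to metadata.">'
)

parser = MetaTagParser()
parser.feed(sample)

# Keywords are a comma-separated list; the description is free text.
keywords = [k.strip() for k in parser.meta.get("keywords", "").split(",")]
description = parser.meta.get("description", "")

print(keywords)
print(description)
```

The keyword list would feed retrieval and ranking, while the description string would be shown alongside a search result, mirroring the division of labor described above.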

Dublin Core
The Dublin Core Metadata Element Set (DCMES) 29 is a set of fifteen information elements that can be used to describe a wide variety of resources for the purpose of simple cross-disciplinary resource discovery. Although originally intended solely as the equivalent of a quick and simple "catalog card" for networked resources, the scope of the Dublin Core gradually expanded over the past decade to encompass the description of almost anything. The fifteen elements are Contributor, Coverage, Creator, Date, Description, Format, Identifier, Language, Publisher, Relation, Rights, Source, Subject, Title, and Type.

The fifteen Dublin Core elements and their meanings have been developed and refined by an international group of librarians, information professionals, and subject specialists through an ongoing consensus-building process that has included more than a dozen international workshops to date, various working groups, and several active electronic mailing lists. The element set has been published as both a national and an international standard (NISO Z39.85-2001 and ISO 15836-2003, respectively). There are now a significant number of large-scale deployments of Dublin Core metadata around the globe.30
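To make the element set concrete, here is a small Python sketch that builds a simple Dublin Core description as XML using only the standard library. The namespace URI is the one published for the fifteen-element set; the wrapper element name record, the helper build_dc_record, and the sample values are assumptions chosen for illustration.

```python
import xml.etree.ElementTree as ET

# Namespace of the fifteen-element Dublin Core Metadata Element Set.
DC_NS = "http://purl.org/dc/elements/1.1/"

# The fifteen elements listed in the text, in lowercase as used in XML.
DC_ELEMENTS = {
    "contributor", "coverage", "creator", "date", "description",
    "format", "identifier", "language", "publisher", "relation",
    "rights", "source", "subject", "title", "type",
}


def build_dc_record(fields):
    """Build an XML record from a mapping of element name to list of values."""
    unknown = set(fields) - DC_ELEMENTS
    if unknown:
        raise ValueError(f"not Dublin Core elements: {sorted(unknown)}")
    ET.register_namespace("dc", DC_NS)
    record = ET.Element("record")  # hypothetical wrapper element
    for name, values in fields.items():
        for value in values:  # every element is optional and repeatable
            child = ET.SubElement(record, f"{{{DC_NS}}}{name}")
            child.text = value
    return record


record = build_dc_record({
    "title": ["Metadata and the Web"],
    "creator": ["Tony Gill"],
    "subject": ["metadata", "World Wide Web"],
})
xml_text = ET.tostring(record, encoding="unicode")
print(xml_text)
```

Rejecting names outside the fifteen elements is a design choice for this sketch; real applications often extend the set with qualifiers or additional vocabularies.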

Resource Description Framework
The Resource Description Framework (RDF)31 is a standard developed by the World Wide Web Consortium (W3C) for encoding resource descriptions (i.e., metadata) in a way that computers can "understand," share, and process in useful ways. RDF metadata is normally encoded using XML, the Extensible Markup Language.32 However, as the name suggests, RDF only provides a framework for resource description; it provides the formal syntax, or structure, component of the resource description language but not the semantic component. The semantics, or meaning, must also be specified for a particular application or community in order for computers to be able to make sense of the metadata. The semantics are specified by an RDF vocabulary, which is a knowledge representation or model of the metadata that unambiguously identifies what each individual metadata element means and how it relates to the other metadata elements in the domain. RDF vocabularies can be expressed either as RDF schemas33 or as more expressive Web Ontology Language (OWL)34 ontologies.
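The split between syntax and semantics can be sketched in a few lines of Python. This is a toy model, not an RDF library: statements are held as subject, predicate, object triples, and a plain dictionary stands in for the vocabulary that an RDF schema or OWL ontology would supply. The resource URI, the Triple type, the describe helper, and the short glosses for each property are all assumptions made for illustration.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """One RDF-style statement: subject, predicate, object."""
    subject: str
    predicate: str
    obj: str


DC = "http://purl.org/dc/elements/1.1/"

# Hypothetical URI identifying the resource being described.
RESOURCE = "http://example.org/metadata-and-the-web"

# The framework part: well-formed statements, with no meaning attached yet.
triples = [
    Triple(RESOURCE, DC + "title", "Metadata and the Web"),
    Triple(RESOURCE, DC + "creator", "Tony Gill"),
]

# The semantic part: a toy stand-in for a vocabulary, mapping each
# predicate URI to a short gloss of what the property means.
vocabulary = {
    DC + "title": "A name given to the resource",
    DC + "creator": "An entity primarily responsible for making the resource",
}


def describe(triple, vocab):
    """Interpret a triple using whatever vocabulary is available."""
    meaning = vocab.get(triple.predicate)
    if meaning is None:
        return f"{triple.predicate}: meaning unknown (no vocabulary loaded)"
    return f"{meaning}: {triple.obj}"


for t in triples:
    print(describe(t, vocabulary))
```

Calling describe with an empty vocabulary still processes the triples but cannot say what they mean, which is the gap that RDF vocabularies are intended to fill.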

The CIDOC CRM35 is a pertinent example of an ontology that provides the semantics for a specific application domain—the interchange of rich museum, library, and archive collection documentation. By expressing the classes and properties of the CIDOC CRM as an RDF schema or OWL ontology, information about cultural heritage collections can be expressed in RDF in a semantically unambiguous way, thereby facilitating information interchange of cultural heritage information across different computer systems.

Using the highly extensible and robust logical framework of RDF, RDF schemas, and OWL, rich metadata descriptions of networked resources can be created that draw on a theoretically unlimited set of semantic vocabularies. Interoperability for automated processing is maintained, however, because the strict underlying XML syntax requires that each vocabulary be explicitly specified.

RDF, RDF schemas, and OWL are all fundamental building blocks of the W3C's Semantic Web36 activity. The Semantic Web is the vision of Sir Tim Berners-Lee, director of the W3C and inventor of the original World Wide Web: Berners-Lee's vision is for the Web to evolve into a seamless network of interoperable data that can be shared and reused across software, enterprise, and community boundaries.

A Bountiful Harvest

The Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH)37 provides an alternative method for making Deep Web metadata more accessible. Rather than embed metadata in the actual content of Web pages, the OAI-PMH is a set of simple protocols that allows metadata records to be exposed on the Web in a predictable way so that other OAI-PMH-compatible computer systems can access and retrieve them. (Fig. 2.)

The OAI-PMH supports interoperability (which can be thought of as the ability of two systems to communicate meaningfully) between two different computer systems: an OAI data provider and an OAI harvester, which in most cases is also an OAI service provider (see Glossary). As the names suggest, an OAI data provider is a source of metadata records, whereas the OAI harvester retrieves (or "harvests") metadata records from one or more OAI data providers. Since both an OAI data provider and an OAI harvester must conform to the same basic information exchange protocols, metadata records can be reliably retrieved from the provider(s) by the harvester.

Although the OAI-PMH can support any metadata schema that can be expressed in XML, it mandates that all OAI data providers must be able to deliver Dublin Core XML metadata records as a minimum requirement. In this way, the OAI-PMH supports interoperability of metadata between different systems.
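As a hedged illustration of what a harvester's request might look like, the sketch below builds an OAI-PMH request URL in Python. The verb ListRecords and the metadataPrefix value oai_dc (the mandatory Dublin Core format) come from the protocol; the endpoint http://example.org/oai and the helper list_records_url are hypothetical, and no request is actually sent.

```python
from urllib.parse import urlencode

# Hypothetical data provider endpoint; substitute a real repository's base URL.
BASE_URL = "http://example.org/oai"


def list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None):
    """Build the URL a harvester would request to list metadata records."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        # Optional: restrict harvesting to one set defined by the provider.
        params["set"] = set_spec
    return f"{base_url}?{urlencode(params)}"


url = list_records_url(BASE_URL)
print(url)
```

Because every data provider must answer requests for oai_dc records, a harvester written this way can ask any conforming repository the same question, whatever richer schemas that repository also offers.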

Google's Sitemap, part of a suite of Webmaster tools offered by that search engine, also supports the OAI-PMH. By exposing a metadata catalog as an OAI data provider and registering it with Google's Sitemap, otherwise Deep Web content can be made accessible to Google's Web crawler, indexed, and made available to the search engine's users.

Meta-Utopia or Metagarbage?

In his oft-quoted diatribe, "Metacrap: Putting the Torch to the Seven Straw-men of the Meta-Utopia,"38 journalist, blogger, and science fiction writer Cory Doctorow enumerates what he describes as the "seven insurmountable obstacles between the world as we know it and meta-utopia." In this piece, Doctorow, a great proponent of making digital content as widely available as possible, puts forth his arguments for the thesis that metadata created by humans will never have widespread utility as an aid to resource discovery on the Web. These arguments are paraphrased below.
  • "People lie." Metadata on the Web cannot be trusted, because there are many unscrupulous Web content creators that publish misleading or dishonest metadata in order to draw additional traffic to their sites.
  • "People are lazy." Most Web content publishers are not sufficiently motivated to do the labor involved in carefully cataloging the content that they publish.
  • "People are stupid." Most Web content publishers are not smart enough to catalog effectively the content that they publish.
  • "Mission: Impossible—know thyself." Metadata on the Web cannot be trusted, because there are many Web content creators who inadvertently publish misleading metadata.
  • "Schemas aren't neutral." 39 Classification schemes are subjective.
  • "Metrics influence results." Competing metadata standards bodies will never agree.
  • "There's more than one way to describe something." Resource description is subjective.

Although obviously intended as a satirical piece, Doctorow's short essay nevertheless contains several grains of truth when considering the Web as a whole.
  • Descriptive data structure standards for different kinds of community resource descriptions, for example, MARC,42 Dublin Core, MODS,43 EAD,44 CDWA Lite,45 and VRA Core;46
  • Markup languages and schemas for encoding metadata in machine-readable syntaxes, for example, XML and RDF;
  • Ontologies for semantic mediation between data standards, for example, CIDOC CRM and IFLA FRBRoo;47
  • Protocols for distributed search and metadata harvesting, for example, the Z39.50 family of information retrieval protocols (Z39.50,48 SRU/SRW49), SOAP,50 and OAI-PMH.51

By combining these various components in imaginative ways to provide access to the rich information content found in museums, libraries, and archives, it should be possible to build a distributed global Semantic Web of digital cultural content and the appropriate vertically integrated search tools to help users find the content they are seeking therein.

Libraries and the Web

The Web has dramatically changed the global information landscape—a fact that is felt particularly keenly by libraries, the traditional gateways to information for the previous two millennia or so. Whereas previous generations of scholars relied almost entirely on libraries for their research needs, the current generation of students, and even of more advanced scholars, is much more likely to start (and often end) their research with a Web search.

Faced with this new reality, libraries and related service organizations have been working hard to bring information from their online public access catalogs (OPACs), traditionally resources hidden in the Deep Web beyond the reach of the search engines' Web crawlers, out into the open. For example, OCLC has collaborated with Google, Yahoo!, and Amazon.com to make an abbreviated version of its WorldCat union catalog accessible as Open WorldCat. The full WorldCat catalog is available only by subscription.

But the most striking example of collaboration between libraries and a search engine company to date is undoubtedly the Google Book Search–Library Project.1 This massive initiative, announced late in 2004, aims to make the full text of the holdings of five leading research libraries (Harvard University Library, the University of Michigan Library, the New York Public Library, Oxford University Library, and Stanford University Library) searchable on the Visible Web via Google. By adding the full text of millions of printed volumes to its search index, the Google Book Search–Library Project will enable users to search for words in the text of the books themselves. However, the results of searches will depend on the works' copyright status. For a book that is in the public domain, Google will provide a brief bibliographic record, links to buy it online, and the full text. For a book that is still in copyright, however, Google will provide only a brief bibliographic record, small excerpts of the text in which the search term appears (the size of the excerpts depends on whether the copyright holder is a participant in the Google Books Partner Program,2 a companion program for publishers), and links to various online booksellers where it can be purchased.

It is perhaps ironic that, due to the dysfunctional and anachronistic state of existing copyright legislation, this scenario is almost the exact reverse of the familiar library catalog metadata example: Rather than search metadata catalogs in order to gain access to full online texts, the Google model helps users to search full online texts in order to find metadata records!

But open access to the rich content of printed books is clearly an idea whose time has come. The Google Book Search–Library Project may be the most ambitious project of its kind to date, but it is neither the first large-scale book digitization project (e.g., the Million Book Project has already digitized over 600,000 volumes) 3 nor the last. At the same time that Google was striking deals with libraries to digitize their collections, the Internet Archive and its partner, Yahoo!, were busy recruiting members for the Open Content Alliance.4

The Open Content Alliance is a diverse consortium that includes cultural, nonprofit, technology, and government organizations that offer both technological expertise and resources (e.g., Adobe Systems, HP Labs, Internet Archive, MSN, Yahoo!) and rich content (e.g., Columbia University, the UK's National Archives, the National Library of Australia, Smithsonian Institution Libraries, the University of California). It has a broad mission to "build a permanent archive of multilingual digitized text and multimedia content" and "to offer broad, public access to a rich panorama of world culture."5

The Open Content Alliance has launched the Open Library,6 which, like Google Book Search, will make the full texts of large quantities of books accessible via Yahoo!'s search engine while simultaneously respecting copyright restrictions. However, unlike the Google initiative, the Open Library is committed to making the full text of every digitized book available free of charge on the Web.

The undeniably positive result of these various initiatives is that within the next decade or so the Web will be vastly enriched by the addition of a huge and freely accessible corpus of the world's literature. Unfortunately, however, unless the copyright situation improves dramatically (e.g., through the introduction of proposed new legislation for "orphan works"),7 it seems that the corpus of literature soon to be freely available on the Web will not include any significant quantity of copyrighted material from the twentieth and twenty-first centuries.