June 9, 1995
Larry Masinter <masinter@parc.xerox.com>
Document management is used to manage the entire life cycle of a document, from creation through multiple revisions and finally into long-term storage and records management. For example, workgroup document management systems often offer library services for preserving update consistency, similar to check-out and check-in capabilities of software source code control systems. When a user checks out a document, the system locks the document from other users' changes. When the document is checked back in, the document management system makes it available for others to revise. Along with maintaining update consistency, the document management application tracks revisions in a multi-author/editor setting.
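The check-out and check-in behavior described above can be illustrated with a minimal sketch. The class and method names here (Library, check_out, check_in) are hypothetical and do not correspond to any particular product; the sketch only assumes an exclusive lock per document and an append-only revision list.

```python
class CheckoutError(Exception):
    """Raised when a check-out or check-in request conflicts with the lock."""


class Library:
    """Toy document library with exclusive check-out locks (hypothetical)."""

    def __init__(self):
        self._revisions = {}  # document name -> list of revision contents
        self._locks = {}      # document name -> user currently holding the lock

    def add(self, name, content):
        """Register a new document with its first revision."""
        self._revisions[name] = [content]

    def check_out(self, name, user):
        """Lock the document for one user and return the latest revision."""
        latest = self._revisions[name][-1]
        holder = self._locks.get(name)
        if holder is not None:
            raise CheckoutError(f"{name} is already checked out by {holder}")
        self._locks[name] = user
        return latest

    def check_in(self, name, user, content):
        """Store a new revision and release the lock; return the revision count."""
        if self._locks.get(name) != user:
            raise CheckoutError(f"{user} does not hold the lock on {name}")
        self._revisions[name].append(content)
        del self._locks[name]
        return len(self._revisions[name])

    def history(self, name):
        """Return a copy of all revisions, oldest first."""
        return list(self._revisions[name])
```

In this sketch a second user who calls check_out while the document is locked gets a CheckoutError, and the revision history gains one entry per check-in.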
Document management systems usually feature searching in repositories of documents both by externally applied information about the documents (e.g., user who entered it, date of revision, or version relationship) and by content (e.g., search on words contained within the document).
Frequently, document management systems are integrated with imaging capabilities: the ability to deal with scanned raster images (fax quality or higher) of documents that originated in paper form, as well as with documents that originated in electronic form. While imaging applications traditionally had been a separate domain, the line between image management and general document management has been increasingly blurred in recent years. In image document management systems, optical character recognition (OCR) is used to analyze the document content and index the corpus for content retrieval, even when the documents themselves are retained in image form.
Document management systems are usually integrated with the desktop applications. That means that the user's application program -- word processor, spreadsheet, graphic editor -- is modified to work directly with the document management system. For example, if a user running WordPerfect pulls down on the "File/Open" menu, a search interface to the document management repository might appear rather than the standard file system dialog interface.
Document management systems are sometimes connected to or integrated with workflow systems, though the latter is strictly speaking a different application. While document management systems deal with storing and searching documents in repositories, workflow systems are organized around work processes. Thus, a workflow system contains a model of the tasks of an organization and the roles that individuals play in that organization, and routes the work according to the model of the work process. Of course, the results of that process are often stored in document management repositories, and document management operations are often steps in the tasks managed by the workflow system.
To make clear the function of document management applications, it may help to give some typical examples of how these systems are used:
There are a large number of vendors of document management systems. Some of the major products and vendors include Documentum, PC Docs, SoftSolutions from WordPerfect/Novell, FileNet, Visual Recall from Xerox, and Mezzanine from Saros. Many other products include document management capabilities, including offerings from Verity, Oracle, and Lotus (Notes).
As document management products have developed, there has been a growing demand for standards to allow interoperability between them. Large enterprises discover that different workgroups within their organization have, for various reasons, chosen different document management products. As they attempt to integrate these products across the enterprise, enterprise-wide standard interfaces and interoperability become increasingly important.
To this end, consortia have organized to define standards for document management. For example, the Open Document Management API (ODMA) is a simple Application Program Interface (API) designed to let desktop applications (such as an editor or spreadsheet) integrate with any of a number of document management systems[3][4][5]. It redefines file access menu items such as "Open", "Save", and "Save as..." to call the document management system (if one is installed) instead of the file system.
At another level, there have been recent attempts by industry groups to define a middleware layer between the user interface and back-end document repositories, so that users in an enterprise can access documents stored in multiple document management systems across their enterprise. The two efforts by the Shamrock Document Management Coalition (Shamrock's Enterprise Library Services) and the Document Enabled Networking[6] specification are being merged into a new Document Management Alliance (DMA)[7] to promote a single standard interface. These initiatives are creating a set of standard interfaces that define system elements such as "document", "repository", and "attribute" as well as operations such as searching, checking out a document, and retrieving it.
Digital libraries usually possess large corpora of information of generally high value. Not only is the material of high quality, but also some care is placed on cataloging the material, and making sure that the origin, date, and other external descriptive information is accurate. Many digital library projects are concerned with providing digital access to material that already exists within traditional library collections, and thus concentrate on material that was originally intended for analog media: libraries of scanned images of photographs or printed texts, digitized video segments and so forth. Other projects extend the library metaphor to other collections such as scientific data sets, software libraries or multimedia works. A great deal of work in this area concentrates on providing enhanced content or access methods, with the problem often couched as one of providing a way of satisfying the individual's particular "information needs". This might be a chemistry graduate student looking for information for a research project, a high-school student downloading a multi-media chemistry text, or a market researcher looking for information about chemical companies.
While much digital library work is in its early phase of development, there is a rich tradition in the library community that has influenced the thinking and design of systems for Digital Libraries. Historically, library automation has taken the form of Online Public Access Catalogs (OPACs). The standards for online library catalogs include MARC[13] and Z39.50[27]. Another kind of metadata is represented by the Scientific and Technical Attribute Set (STAS), which defines a standard for metadata elements to describe scientific datasets as opposed to traditional bibliographic material.
More recently, a number of research initiatives have proposed systems and mechanisms for future digital libraries, including the six NSF/ARPA/NASA joint initiative projects, initiatives of the national libraries and library system vendors. Previous work in copyright management[14][15], document identifiers[16], and the Computer Science Technical Report project [17] also contribute to digital library technology.
These days, it is hardly necessary to define "the web" at an Internet conference. (It's hardly necessary to define "the web" to the cab driver who takes you to the conference from the airport.) For the sake of contrast, though, it will be useful to lay out the web's key features here.
By "the web", I mean information on the Internet, as accessed by individuals using a World-Wide Web browser or some other network information access tool. The web is accessed using one of the many web browsers now available. The web provides a document interface to information. That is, a user is presented with a document that includes links to follow and forms to fill out. By interacting with the document, the user causes a new document to be presented. The web, as an Internet service, is primarily public. A web site can provide access to a very large number of users across the world.
The web is used for institutional public relations and product information, personal communication, online publishing, and scientific, technical and scholarly interchange. For example, companies put up web sites about their products and services; a growing number of newspapers and information service providers are producing web sites. Students put up `home pages' covering their hobbies. Professional organizations and educational institutions give out information about their organizations and their resources.
There are a growing number of web systems and software packages, including those produced by sponsored research, university researchers and commercial vendors. Dozens of start-ups compete for attention.
The web systems and protocols, originally defined in the research community, are being refined by a number of companies and consortia (the World Wide Web Consortium, W3C, for example) and being standardized by working groups of the Internet Engineering Task Force (IETF). The IETF is developing standards for Uniform Resource Locators (URLs), Uniform Resource Names (URNs), the HyperText Transfer Protocol (HTTP), and the HyperText Markup Language (HTML). These are the principal elements of the World Wide Web. The web also includes other network search protocols and access systems. For example, the Gopher protocol defined by the University of Minnesota is part of the web, while the Internet use of the Z39.50 standard is defined by the Z39.50 Implementors Group (ZIG)[18].
How does one identify a piece of something else? For example, if there is a volume of collected papers, do the individual papers get separate identifiers? If so, is the identifier for each element somehow syntactically related to the identifier for the whole? If not, how is the relationship established? Is there a database that links the part to the whole?
When an object is revised, does it retain its identifier? For example, in System 33[23], every document had two identifiers: one that was assigned to `this version' and another that specified `the latest version of whatever this becomes'.
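The two-identifier idea can be sketched as follows. This is not a description of how System 33 worked; the store, the identifier formats ("doc-N", "ver-N"), and the method names are all assumptions made for illustration.

```python
import itertools


class VersionedStore:
    """Toy store giving each document a stable id and per-version ids."""

    def __init__(self):
        self._counter = itertools.count(1)
        self._versions = {}  # version id -> content of that exact version
        self._latest = {}    # lineage id -> id of the most recent version

    def create(self, content):
        """Create a document; return (lineage id, id of the first version)."""
        lineage_id = f"doc-{next(self._counter)}"
        self._latest[lineage_id] = None
        return lineage_id, self._add_version(lineage_id, content)

    def revise(self, lineage_id, content):
        """Add a revision; the lineage id stays the same, the version id is new."""
        if lineage_id not in self._latest:
            raise KeyError(lineage_id)
        return self._add_version(lineage_id, content)

    def _add_version(self, lineage_id, content):
        version_id = f"ver-{next(self._counter)}"
        self._versions[version_id] = content
        self._latest[lineage_id] = version_id
        return version_id

    def resolve(self, identifier):
        """A lineage id resolves to the latest version; a version id to itself."""
        version_id = self._latest.get(identifier, identifier)
        return self._versions[version_id]
```

A reference that should track future revisions stores the lineage identifier, while a citation that must stay fixed stores the version identifier.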
In the office environment, a document with a cover memo attached might be considered a different object. However, in some situations, the `cover' material is merely an external attribute, and the document hasn't changed and should not get a different identifier.
In general, there are a large number of relationships between objects that can be expressed as relationships of the identifiers of the objects, and relevant design decisions are currently made in an ad hoc fashion. Publishers are allowed to retain the same ISBN number for minor printing revisions, but the paperback and hardcover of a book are given different ISBN numbers. On the web, the URL of a document doesn't change if the content changes. Moreover, different vendors' document management systems seem to take different approaches to dealing with revision and identity.
In a hierarchical uniqueness system, there is a tree of 'naming authorities'. Every naming authority guarantees that it will not give out the same identifier to two different documents. If it delegates some of the naming authority to sub-authorities, it also delegates that promise. ("Here, you can give out names, but you make sure you never give out the same name twice.") For example, the Internet's Domain Name System is a hierarchical service; the owner of "xerox.com" can hand out unique names under that suffix, and can delegate the naming system underneath to the owner of "parc.xerox.com". Many of the proposals for URNs on the Web are hierarchical.
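A minimal sketch of hierarchical delegation follows, using DNS-style dotted names as in the example above. The NamingAuthority class and its methods are hypothetical; the only property modeled is that each authority refuses to hand out the same label twice and that a delegated label can never also be issued as a document name.

```python
class NamingAuthority:
    """Toy naming authority that issues unique names under its own suffix."""

    def __init__(self, suffix):
        self.suffix = suffix
        self._issued = set()  # labels already handed out by this authority

    def _claim(self, label):
        """Reserve a label, refusing to give out the same one twice."""
        if label in self._issued:
            raise ValueError(f"{label!r} was already issued under {self.suffix}")
        self._issued.add(label)
        return f"{label}.{self.suffix}"

    def issue(self, label):
        """Issue a unique name for a document."""
        return self._claim(label)

    def delegate(self, label):
        """Hand a whole sub-namespace, and the uniqueness promise, to a sub-authority."""
        return NamingAuthority(self._claim(label))
```

Because each authority only has to check its own set of labels, uniqueness across the whole tree follows without any global coordination.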
Some distributed naming systems are hierarchical but have a fixed depth of the hierarchy. For example, ISBN numbers have three parts: a country code (the country of registry for the publisher), the publisher identifier, and, for each publisher, the document identifier. Each publisher is allowed to assign their own ISBN numbers. Some naming systems are not distributed, but guarantee uniqueness by keeping a single source of identifiers; for example, the Library of Congress Control Number is assigned uniquely by the U.S. Library of Congress.
A random naming authority is one in which names are given out using random numbers; each authority uses enough information to make the probability of two documents getting the same identifier quite small. For example, some schemes use the one-way hash (MD5, SHA) of the document as the document identifier. The LIFN system [24] uses a randomly assigned document identifier in this way.
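The two flavors mentioned above, a content hash and a randomly assigned identifier, can be sketched with the Python standard library. The function names are hypothetical, and SHA-256 is used here simply as a currently available member of the SHA family; it is not the specific algorithm used by any system cited in the text.

```python
import hashlib
import secrets


def content_identifier(data: bytes) -> str:
    """Derive an identifier from the document bytes with a one-way hash.

    Identical content always maps to the same identifier; any change in
    the bytes yields, with overwhelming probability, a different one.
    """
    return hashlib.sha256(data).hexdigest()


def random_identifier(num_bytes: int = 16) -> str:
    """Assign an identifier from random bits, independent of the content.

    With 16 random bytes (128 bits) the chance that two independently
    assigned identifiers collide is negligible for practical purposes.
    """
    return secrets.token_hex(num_bytes)
```

Note the difference in revision behavior: a content-hash identifier necessarily changes when the document changes, while a randomly assigned one can be retained across revisions if the naming policy calls for it.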
Libraries have traditionally been quite concerned with cataloging -- a process which associates metadata with bibliographic material. The card catalog entries for an item in the library provide metadata about the item. There are a variety of standards used for online cataloging. The most prominent is USMARC. Various attempts have been made to extend and enhance USMARC to deal with online material[25][26]. The Z39.50 standard contains extensive mechanisms for both communicating search parameters (requested metadata) and document attributes (output metadata). More recently, attempts to define online document standards for the humanities arrived at a standard set of metadata for humanities texts[28].
The Uniform Resource Identifier working group[31] has been trying to develop a standard syntax and representation for information citations in a scheme called Uniform Resource Citations (URCs) to describe information on the Internet as a way of discovering or describing more about a referenced resource (via URL or URN) before retrieving the item, as well as a way of cataloging Internet information.
There are a number of design issues in representing metadata for online information, some semantic (what does it mean and how do you say it?), some structural (does metadata have structure?) and some syntactic (how do the semantics and structure get represented as a sequence of characters or bytes?) These issues span the three application areas.
Are there well known attributes? MARC takes a strong stand: MARC defines a set of well-known attributes with descriptions of each. Some of them take on values within a controlled vocabulary. There are standards for the completeness and quality of a catalog entry. The set of attributes is defined and used universally by nearly all online library catalogs. In document management systems, on the other hand, the system administrator for a workgroup generally establishes conventions for the attributes used and what they mean. When multiple document management systems are brought together, though, combining the semantics of the disparate sources is a serious problem. The Internet community is struggling with standardization of semantics for attribute sets. While there are some attributes that are well-known (content attributions in mail messages, mapping to ISO protocols in X.400), these are by no means universal.
If there is not a single well-known set of attributes that spans all known objects, then it is still possible to create a system of entities -- classes of documents which share the same schema of attributes. For each class, the attribute set can then be defined. For example, a document management system might allow for 'memo' and 'spreadsheet' and 'expense report'. Every memo might be catalogued by its distribution list, while an expense report might be required to have a budget center and a signature status. More complex schema systems allow for inheritance and specialization of classes, as is found in object-oriented programming. There are variations among different implementations, just as there are in different object-oriented programming systems.
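A class-based attribute schema with inheritance might look like the following sketch. The DocumentClass type and the attribute names (author, distribution_list, budget_center, signature_status) are illustrative assumptions that follow the memo and expense report examples above, not the schema of any real system.

```python
class DocumentClass:
    """Toy document class whose required attributes are inherited from a parent."""

    def __init__(self, name, required=(), parent=None):
        self.name = name
        self.parent = parent
        self._required = set(required)

    def required_attributes(self):
        """Return this class's required attributes plus all inherited ones."""
        inherited = self.parent.required_attributes() if self.parent else set()
        return inherited | self._required

    def missing(self, attributes):
        """List required attributes absent from a catalog entry, sorted by name."""
        return sorted(self.required_attributes() - set(attributes))


# Hypothetical schema: every document has an author; subclasses add their own.
document = DocumentClass("document", {"author"})
memo = DocumentClass("memo", {"distribution_list"}, parent=document)
expense_report = DocumentClass(
    "expense report", {"budget_center", "signature_status"}, parent=document
)
```

Here a catalog entry for an expense report that supplies only an author and a budget center would be reported as missing its signature status.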
Frequently it is difficult to tell the `boundaries' of an online electronic work. If one describes a site's `home page', does the description apply to the site, or just to the introductory `splash page'? If an object contains parts, do the parts have separate attributes? For example, if a report in a document management system has a cover memo, in what way are the author of the report and the author of the cover memo distinguished or reported in the description of the overall object?
Metadata itself can also have structure. It is sometimes necessary and occasionally critical to know the author of an attribute or the time when the attribute was assigned. If metadata itself can be updated and revised, then the history of its editing may be of relevance. How does one distinguish between `the title' and `the title, translated into French', and `the title, translated into English from Italian by D.H.Lawrence'? The relationships between elements of the metadata are problematic for some flat attribute-value representation schemes like MARC.
While it might seem straightforward, standardization of the syntactic mechanisms for representing the semantics and structure of attributes is quite difficult. First, attributes might have a fixed, extensible, or uncontrolled set of values. The mechanisms for assigning the allowable elements of the controlled set are difficult to establish. Each attribute or field might need to deal with alternative syntaxes (e.g., for names, is it last name first or given name first?), multiple character sets (names in Chinese or Arabic), or even non-textual data.
Despite the more complex needs, some document management systems rely on either their database manager or the host network operating system to provide authentication and access control, if for no other reason than to avoid providing separate authentication and administrative domains.
The Internet community has a large number of separate efforts defining security standards. The web community is exploring two systems, Secure HTTP (S-HTTP)[32] and Secure Socket Layer (SSL)[33]. S-HTTP is a modification of the HTTP web protocol that includes security features. SSL is an application-independent protocol for negotiating secure network communication. Recently these efforts have joined forces. In addition, new authentication mechanisms for web access (other than simple passwords) are being proposed using Digest Access Authentication[33] and Multi-party Digest Authentication[XX].
In addition, the Internet mail community has produced two complementary systems for secure electronic mail, Pretty Good Privacy (PGP)[36] and Privacy Enhanced Mail (PEM)[37]. PGP is a public key cryptosystem with a number of utilities for dealing with keys and mail. PEM is a system for providing privacy enhancement services (confidentiality, authentication, message integrity assurance and non-repudiation of origin) using either symmetric (secret-key) or asymmetric (public-key) approaches for encryption of data encrypting keys. There is some hope that all of these separate efforts will eventually converge.
Beyond the mechanisms for dealing with security, copyright and intellectual property, the Web is capable of providing for spontaneous financial transactions. A number of mechanisms for handling payment and billing are being explored, either through credit card settlement methods or digital cash[38].
The most serious issue is the design of an authorization scheme that will scale to the size of 'all users on the Internet', given the enormous international scope of the Internet and the wide variety of needs and policies requiring support.
Finally, US export control laws that govern the export of cryptographic software have been perceived as a difficult impediment to widespread deployment of secure software solutions to the Web's problems.
One common issue in all of the systems is detecting the boundary of the item to which a particular authorization might apply. Access control and authorization might need to apply to a different granularity of object than is denoted with a single identifier.
In general, one of the most troubling elements of AAA design is that it is difficult to retrofit security in an architecture that doesn't already have it. The analysis of likely threats often requires revisiting optimizations made for performance reasons. For example, a design which employs distribution and caching of documents close to the site of access for performance reasons needs to account for the risks embodied in having a repository of cached documents which might be compromised.
Individual vendors of document management systems have frequently created their own ad hoc registries, to allow their systems to deal with multiple document types in a consistent way. More recently, work in the electronic mail vendors association and the ODMA group has created registries of well-known document types. Most generally, though, document management systems restrict themselves to dealing with the document types that either are common in desktop applications in the workplace or else are registered by the system administrator of the document management system.
The range of kinds of media and digital objects that potentially might be stored in a digital library is enormous. Currently, most attempts to catalog material have used fairly ad hoc descriptions of the files and their formats. A critical issue in the library community, though, is preservation[39][40]. It is important to make sure recorded material will be available in 10, 20, or 100 years. This is an issue not only of the longevity of the storage medium (which can be mitigated by refreshing the media), but, more importantly, of the longevity of any particular storage representation. If one were to preserve a file that was created with Microsoft Word in 1995, for how long can one expect a Microsoft Word-capable reader to remain available?[39]
The method for indicating the media type of an object in the Internet arose from work on MIME: the Multipurpose Internet Mail Extensions standard. MIME extended Internet electronic mail -- formerly confined to the interchange of ASCII text -- by allowing for a rich representation of objects and object types. The MIME standard allows for the labeling of an object by its media type. Media types are defined as a two part name (e.g., "text/html" or "application/postscript") along with optional parameters. Media types are categorized into several top-level types ("text", "image", "audio", "application", "multipart") and then, within each top-level type, an extensible set of subtypes. Each type can also define parameters; for example, "text" types can have a "charset" parameter where the character encoding used for the text is given. There is a formal process for defining new media types, where information about the type and its required and allowed parameters is supplied.
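The structure of a media type label (top-level type, subtype, optional parameters) can be shown by parsing one with the Python standard library's email package. The helper name parse_media_type is hypothetical; the Message methods it calls are part of the standard library.

```python
from email.message import Message


def parse_media_type(header_value):
    """Split a Content-Type value into (top-level type, subtype, parameters)."""
    msg = Message()
    msg["Content-Type"] = header_value
    # get_params() lists the media type itself first, then each parameter
    # as a (name, value) pair with surrounding quotes removed.
    parameters = dict(msg.get_params()[1:])
    return msg.get_content_maintype(), msg.get_content_subtype(), parameters


top, sub, params = parse_media_type('text/html; charset="iso-8859-1"')
print(top, sub, params)
```

A receiving system can dispatch on the top-level type when it does not recognize the subtype, which is one reason the two-part naming is useful.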
A related problem is that many document types are merely references to specifications that are evolving over time. For example, when the "application/postscript" type was originally proposed, there was one version of PostScript. Now, there are two levels. The GIF specification for images has two versions and a third under development. A system element might be able to deal with some versions and not others. Many type specification systems do not explicitly allow for versioning.
Some organizations are offering services to search the Internet, by traversing the known Internet web, gathering together the pages, and indexing them. The search capability is offered as a service, for a fee, as a demonstration of text retrieval capabilities or as a way of advertising other products and services[42].
One fundamental choice, made differently by different applications, is whether search is expressed by a search language or by a programming interface or some combination. Search languages include SQL (originally designed for relational databases) or enhancements of it, intended to deal with full text search, geographic information, etc. For example, Documentum's DQL[42] is a query language extended with versioning. The WAIS system originally left the `question' as a full text (presumably English) query. On the other hand, interfaces such as DEN allow the programmer of an interface to construct a query using API calls, without an expression in a query language. This has several advantages: it allows for more extensibility than is generally found in a predefined syntax, allows for the query to be expressed in non-textual terms, and does not require a parser in the search engine.
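The programming-interface style of query construction can be sketched as a small tree of query objects. This is not the DEN interface; the classes Equals, Contains, and And, the search function, and the document fields ("id", "text") are all assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Equals:
    """Match documents whose attribute has exactly the given value."""
    attribute: str
    value: object

    def matches(self, doc):
        return doc.get(self.attribute) == self.value


@dataclass(frozen=True)
class Contains:
    """Match documents whose text contains the given word (case-insensitive)."""
    word: str

    def matches(self, doc):
        return self.word.lower() in doc.get("text", "").lower().split()


@dataclass(frozen=True)
class And:
    """Match documents that satisfy both sub-queries."""
    left: object
    right: object

    def matches(self, doc):
        return self.left.matches(doc) and self.right.matches(doc)


def search(documents, query):
    """Return the ids of all documents matching the query object."""
    return [doc["id"] for doc in documents if query.matches(doc)]
```

The query is built with ordinary constructor calls, for example And(Equals("author", "smith"), Contains("budget")), so no parser is involved, and a new query type can be added by defining another class with a matches method.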
Much effort in each domain is being placed on enhancing user interface systems to deal with multiple sources. When a user queries more than one database at a time, it is necessary to merge the results from those sources. If two search engines have quite different capabilities, however, it is difficult to know how to express a combined search in a simple manner. Also, if the query language allows the expression of capabilities that are not present in the search database, there is a conflict. Some systems attempt to gloss over this or return results that are only approximately what the original search entailed.
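Merging results from several sources can be sketched as below, under a strong simplifying assumption: each source returns (score, document id) pairs already sorted by descending score, and the scores of different sources are directly comparable. In practice that comparability is exactly what is hard to obtain when engines have different capabilities. The function name merge_ranked is hypothetical.

```python
import heapq


def merge_ranked(*result_lists):
    """Lazily merge ranked result lists, best score first, dropping duplicates.

    Each input must already be sorted by descending score. A document
    returned by several sources is reported once, at its best position.
    """
    merged = heapq.merge(*result_lists, key=lambda hit: hit[0], reverse=True)
    seen = set()
    for score, doc_id in merged:
        if doc_id not in seen:
            seen.add(doc_id)
            yield score, doc_id
```

Because the merge is lazy, a user interface can display the first hits without waiting for every source to be exhausted.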
Most models of database query and search allow for a single call/return sequence, where a search produces a result set, and then the result set is sequentially accessed to get back individual documents. However, in many cases, searching a corpus is a time-consuming process. Advanced user interfaces allow better feedback on the operation of the system and the state of the search; in order to provide that feedback, though, the search engine needs to provide updates as to the state, and these updates from multiple sources need to be merged.
The boundaries between these separate domains are blurring. Most digital library projects are exploring ways of making their libraries available to the entire Internet community, usually in spite of the perceived limitations of the current suite of web protocols and standards. As enterprise boundaries become more flexible with corporate outsourcing, dynamic enterprise construction and the increasing use of the Internet in the commercial sector, there is growing pressure to blur the boundary between enterprise and workgroup repositories and those accessible on the Internet. And as companies and workgroups build larger repositories of archival quality documents--beyond those useful only momentarily--the distinction between an enterprise document management repository and a digital library is being blurred.
There is an opportunity to merge the interfaces for systems originally intended for document management, digital libraries or deployment in the web, in a way that will allow for several kinds of synergy. More specifically, there are several near-term opportunities.
For example, those charged with building and maintaining an Internet presence for an organization are discovering that, with the growth of their site, they have a large collection of documents with interdependencies, and need tools to help them manage their sites. One possible scenario is to use a tool originally designed as a document management system as the back-end to a web site. The version management, check-in and check-out, access control features of the document management system can be used by the web development staff, while the results are exported to the world over the Internet. Some explicit support for this kind of operation has been announced by a handful of document management companies.
Because workgroup document management systems are designed to integrate with office applications, it would be useful, for those office workers, to also be able to access other resources in repositories, whether in online libraries or other kinds of Internet resources. This could be accomplished by connecting the document management standard interfaces with Internet services.
Another possibility is to extend current Internet protocols for web access (HTTP and current browsers) to add protocol elements for document management, including check-out, check-in, and a more rigorous approach to document attribute management. This effort has also begun in some quarters.
Other combinations of these technology elements are also possible, as long as the protocols and architectures of the systems are compatible. Bringing together document management, digital libraries and the web is an important goal.