Transition Strategy for Moving from URLs to URNs

Problems with URLs

URLs are widely used today, but they have the following three basic problems:
  1. they are tied to host names,
  2. they are tied to filenames on a particular host,
  3. they are tied to access protocols.
Since all of these are likely to change over time, URLs are not stable names.

Host names pose two problems: stability and scalability. Host names based on DNS are frequently unstable because they (by design) reflect administrative hierarchies, which tend to change fairly frequently relative to the lifetime of a resource. Host names also reflect the names of organizations, which also tend to change over time. Finally, a resource owned or maintained by one organization can migrate to another organization, thus necessitating a change in the host name at which the resource is located. All of these changes invalidate old URLs.

As long as the only way to access an object named by a URL is to contact the host whose name appears in that URL, it is difficult to provide scalable access to those objects. Popular world wide web sites assign several server machines to a single host name (perhaps using a modified name server that randomizes responses), but this requires that the entire collection of files made available using that host name be consistently mirrored to each server machine. As a practical consequence, all of those server machines are usually maintained at a single location, even though better service and conservation of bandwidth would result from distributing the servers around the network.

Filenames embedded in URLs serve two purposes which are in conflict. They need to be stable so that a reference to a named resource will continue to be valid for as long as needed. On the other hand, files are usually organized into hierarchies to help humans browse through the file system. The hierarchies need to change from time to time as new files are added and old ones are removed. When the file system hierarchy is changed, the old URLs become invalid.

Having access protocols implicit in a resource name imposes an additional barrier: The client must support the access protocol (perhaps via a proxy) in order to access the resource. This is effectively true even if the resource is available via other protocols, because other URLs for those protocols may not be available. Even if several URLs are available, the user must be knowledgeable enough to choose one which is supported by his client, will pass through his security firewall, etc.

Solution Using URNs

The widely expected solution to these problems is something like the following: Instead of using URLs as resource names, we will migrate to using uniform resource names, or URNs. URNs will not be tied to locations, but there will be resolution services available that will allow a user to obtain the "characteristics" of the resource (also known as the URC). Locations (URLs) of resources will also be obtainable via the URN, either as part of the URC, indirectly (through a location-independent file name or LIFN which appears in a URC), or perhaps via some separate resolution service.

To solve the problem of stale links, users will have to upgrade to browsers that support URNs directly, or access the web via proxy servers which convert URNs to URLs. References to existing objects (several million of them) will need to be converted to provide URNs instead of or in addition to the URLs.

Cost of Transition to URNs

We have resource names now -- they're called URLs. URLs have all kinds of undesirable properties, but one nice thing is that they're cheap. All you need is a host on the Internet and a DNS name, and you have your own very large chunk of URL-space. Since DNS is a well-established part of the Internet infrastructure, it's "free".

There's a cost to providing another level of indirection. New protocols have to be defined and tested; new (and more complex) clients and servers must be debugged. The new services have to be managed; the new clients have to be configured correctly. Information providers (which potentially include nearly everybody) will have to learn how to use the new tools.

There are also issues of serviceability and reliability. If the client doesn't do what the user thinks it should, the system administrator has one or two more possible culprits. Every additional level of lookup imposes an overhead in network bandwidth and delay (as seen by the user). When adding a layer to a service that millions of people depend on, the new layer must integrate smoothly with those that already exist so as not to adversely impact reliability.

Even after a URN lookup service exists and is available in clients, most clients will still support only URLs for some time, and most references to resources will still use URLs. During that period, there will be an increased cost due not only to the need to maintain both sets of names, but also to having multiple sources of failure. (Did the attempt to access the resource fail because the URL was stale, or because the URN server returned the wrong answer?) Due to the large number of resources presently named by URLs, one can imagine needing a service to map from URLs to URNs, in addition to the other way around.

For any new protocol, naming scheme, or additional level of lookup, the costs and benefits need to be analyzed to see whether the anticipated gain is likely to be worth the cost. An important question is: when will the user see a benefit from using a client that supports URNs? Another is: how much investment is required to develop the URN infrastructure to the point that the user does see that benefit?

We probably cannot anticipate the needs of web users and information providers for more than the next few years, and we are deluding ourselves if we believe we can impose a discipline on the use of the web simply by creating a new space for resource names. We need to design a system which is adaptable to future needs without knowing precisely what those needs are. At the same time, if a solution for possible future problems does not address today's needs, it is doomed to failure.

An Alternative Solution: Location-Independent URLs

The usual explanation for why URLs are a Bad Thing (tm) is "URLs are tied to locations". But the Internet has already had one successful transition away from location-based names -- in electronic mail. Once upon a time, email addresses were of the form user@host, and were therefore tied to the network "locations" of those hosts. To send mail to a user at a host, you connected to that host's SMTP server (before that, it was the FTP server), and told it to deliver the message to that user's mailbox. Along came DNS and the MX record. Addresses are now of the form user@domain. Instead of connecting to the host associated with that domain, one now connects to one of the mail exchangers for that domain listed in the DNS. One result is that email domains are increasingly decoupled from host names. It is not uncommon for a single email domain to serve hundreds of hosts, which may not even be directly connected to the Internet. Email addresses are now less likely to be tied to individual host names. And since there can be multiple mail exchangers for a domain, other results have been increased fault-tolerance via redundancy, and better ability to handle load.

Imagine a new DNS record type called RCS (for "resource catalog server") which performed an MX-like function for URLs. For example, the records:

www.netlib.org.         RCS     10 netlib2.cs.utk.edu.
                        RCS     10 netlib1.epm.ornl.gov.

would inform a web client that meta-information for any URL containing the domain www.netlib.org could be found using the resource catalog servers at netlib2.cs.utk.edu and netlib1.epm.ornl.gov.

These records would be obtained in a single DNS query for www.netlib.org, along with the IP addresses of the RCS servers. If there had been no RCS records, the same query would have returned the IP addresses of www.netlib.org. (Existing TXT records could be used instead of RCS records, but additional DNS queries would be required to look up the IP addresses of any resource catalog servers.)
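The lookup rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the RCS record type is the hypothetical one proposed here and is not defined in DNS, so the sketch parses the zone-file text shown above rather than issuing a real query. The function names parse_rcs_records and resolution_targets are invented for this example, and breaking preference ties alphabetically is an arbitrary choice made so the result is deterministic.

```python
from urllib.parse import urlsplit

# The example records from the text, in zone-file syntax.
ZONE_TEXT = """\
www.netlib.org.         RCS     10 netlib2.cs.utk.edu.
                        RCS     10 netlib1.epm.ornl.gov.
"""


def parse_rcs_records(zone_text):
    """Parse hypothetical RCS lines into {owner: [(preference, target), ...]}."""
    records = {}
    owner = None
    for line in zone_text.splitlines():
        if not line.strip():
            continue
        fields = line.split()
        # A line that does not start with whitespace names a new owner;
        # an indented line inherits the owner of the previous record.
        if not line[0].isspace():
            owner = fields.pop(0).rstrip(".").lower()
        if owner is None or len(fields) != 3 or fields[0] != "RCS":
            continue
        preference, target = int(fields[1]), fields[2].rstrip(".").lower()
        records.setdefault(owner, []).append((preference, target))
    return records


def resolution_targets(url, records):
    """Return ("rcs", servers) if the URL's domain has RCS records,
    otherwise ("direct", [host]) so the client contacts the host as before."""
    host = urlsplit(url).hostname
    servers = sorted(records.get(host, []))  # lowest preference value first
    if servers:
        return "rcs", [target for _, target in servers]
    return "direct", [host]


records = parse_rcs_records(ZONE_TEXT)
print(resolution_targets("http://www.netlib.org/lapack/", records))
print(resolution_targets("http://example.org/index.html", records))
```

A URL whose domain has no RCS records falls through to the "direct" case, which mirrors the behavior of a legacy client.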

Once the addresses of RCS servers were known for the domain in a URL, the client would use a special-purpose resolution protocol to obtain characteristics of the resource named by that URL, alternate locations (i.e., additional URLs) at which that resource could be accessed, or both.

New clients that supported this scheme would use the same URLs as the old clients, but would gain immediate benefit from being able to discover alternate servers for their resources (with little penalty for trying). Eventually they would be able to make use of URCs as well. When coupled with a mechanism such as SONAR for finding network proximity information, the client would gain the ability to automatically choose a nearby location for that resource (thereby improving access times), and to recover from the failure of any single resource server.
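The client behavior described above (prefer a nearby location, recover when one server fails) might look roughly like the following Python sketch. Everything in it is an assumption made for illustration: the mirror URLs and distances are made up, estimate_distance merely stands in for a proximity service such as SONAR, and fake_fetch simulates network access instead of performing it.

```python
def fetch_with_failover(locations, estimate_distance, fetch):
    """Try each location, nearest first, and return (url, content)
    from the first one that answers. Raise OSError if all of them fail."""
    failures = []
    for url in sorted(locations, key=estimate_distance):
        try:
            return url, fetch(url)
        except OSError as exc:
            failures.append(f"{url}: {exc}")
    raise OSError("all locations failed: " + "; ".join(failures))


# Hypothetical alternate locations with made-up proximity estimates.
DISTANCE = {
    "http://mirror-far.example/lapack/": 180.0,
    "http://mirror-near.example/lapack/": 12.0,
    "http://mirror-mid.example/lapack/": 45.0,
}
# Pretend the nearest server is currently down.
DOWN = {"http://mirror-near.example/lapack/"}


def fake_fetch(url):
    """Simulated retrieval; a real client would speak the access protocol."""
    if url in DOWN:
        raise ConnectionError("server unreachable")
    return f"contents from {url}"


chosen, body = fetch_with_failover(DISTANCE, DISTANCE.get, fake_fetch)
print(chosen)  # http://mirror-mid.example/lapack/
```

Because the nearest location is simulated as unreachable, the client falls back to the next-nearest one without the user having to choose among URLs.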

Legacy clients would still be able to access those resources as long as there were a host with the same name as that used in the URL which provided such access. Information providers would therefore continue to maintain such servers until most legacy clients had been replaced.

It should be possible to reserve a few chunks of DNS for naming authorities for long-term resource names. Subdomains within these spaces would be required to NOT be meaningful to humans; thus, the names themselves need never become obsolete. While such a subdomain would initially be assigned to a publisher, the responsibility for serving that subdomain would be transferred as necessary when that publisher or its intellectual property assets changed hands. Finally, resource names would only be assigned for any particular subdomain for a short time (perhaps a year), after which a new subdomain would be used. This would allow the resolution service for older subdomains to be shifted from primary servers to "custodians".
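As a rough illustration of the two constraints above (labels with no human-readable meaning, and a limited window during which a subdomain accepts new names), consider the Python sketch below. The text does not specify any label format or bookkeeping, so the random hexadecimal label, the 365-day window, and both function names are assumptions chosen only to make the idea concrete.

```python
import secrets
from datetime import date


def new_authority_label(nbytes=8):
    """Return an opaque label (random hex) that carries no meaning for
    humans, so it never needs renaming when a publisher changes hands."""
    return secrets.token_hex(nbytes)


def accepts_new_names(opened, today, window_days=365):
    """True while the subdomain is still inside its assignment window.
    Afterwards it only resolves existing names and could be handed
    over to a custodian."""
    return 0 <= (today - opened).days < window_days


label = new_authority_label()
print(label, accepts_new_names(date(1996, 1, 1), date(1996, 6, 1)))
```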

One nice feature of this scheme is that it still works for ordinary users and their home pages. Sites could set up their location and/or URC servers with their existing DNS names; they need not obtain new naming authority names. If a user's web page becomes unexpectedly popular, a resource catalog server and appropriate RCS records could be installed to inform clients of alternate locations, even though no prior arrangements were made.

But publishers and others who had an interest in making resources available over the long term (including probably anybody who wanted to make money selling access to his works) would see the benefit in using the new name space. And new naming authority names could be distinguished from ordinary DNS names.

Such a system would not provide scalable resolution of resource names for several centuries into the future. But it probably would work for a few decades, during which usage patterns are almost certain to change beyond what we can anticipate. It would also encourage the building of an infrastructure for maintaining meta-information about resources.