Chapter 1. HTML, XHTML, and the World Wide Web
In many ways, the Web — the open community of hypertext-enabled
document servers and readers on the Internet — is responsible for the meteoric
rise in the network's popularity. You, too, can become a valued member by
contributing: writing HTML and XHTML documents and then making them available to
web surfers worldwide.
Let's climb up the Internet family tree to gain some deeper
insight into its magnificence, not only as an exercise of curiosity, but to help
us better understand just who and what it is we are dealing with when we go
online.
1.1 The Internet
Although popular media accounts are
often confused and confusing, the concept of the Internet really is rather
simple: it's a worldwide collection of computer networks — a network of networks
— sharing digital information via a common set of networking and software
protocols.
Networks are not new to computers. What makes the Internet
unique is its worldwide collection of digital telecommunication links that share
a common set of computer-network technologies, protocols, and applications.
Whether you run Microsoft Windows XP, Linux, Mac OS X, or even the now ancient
Windows 3.1, when connected to the Internet, computers all speak the same
networking language and use functionally identical programs, so you can exchange
information — even multimedia pictures and sound — with someone next door or
across the planet.
The common and now quite familiar programs people use to
communicate and distribute their work over the Internet have also found their
way into private and semi-private networks. These so-called intranets and extranets use the same software, applications, and
networking protocols as the Internet. But unlike the Internet, intranets are
private networks, with access restricted to members of the institution.
Likewise, extranets restrict access but use the Internet to provide services to
members.
The Internet, on the other hand, seemingly has no restrictions.
Anyone with a computer and the right networking software and connection can "get
on the Net" and begin exchanging words, sounds, and pictures with others around
the world, day or night: no membership required. And that's precisely what is
confusing about the Internet.
Like an oriental bazaar, the Internet is not well organized: there are
few content guides, and it can take a lot of time and technical
expertise to tap its full potential. That's because . . .
1.1.1 In the Beginning
The resulting network
was a marvelous technical success, but it was limited in size and scope. For the
most part, only defense contractors and academic institutions could gain access
to what was then known as the ARPAnet (Advanced Research Projects Agency Network
of the Department of Defense).
With the advent of high-speed modems for digital communication
over common phone lines, some individuals and organizations not directly tied to
the main digital pipelines began connecting and taking advantage of the
network's advanced and global communications. Nonetheless, it wasn't until the
last decade (around 1993, actually) that the Internet really took off.
Several crucial events led to the meteoric rise in popularity
of the Internet. First, in the early 1990s, businesses and individuals eager to
take advantage of the ease and power of global digital communications finally
pressured the largest computer networks on the mostly U.S. government-funded
Internet to open their systems for nearly unrestricted traffic. (Remember, the
network wasn't designed to route information based on content — meaning that
commercial messages went through university computers that at the time forbade
such activity.)
True to their academic traditions of free exchange and sharing,
many of the original Internet members continued to make substantial portions of
their electronic collections of documents and software available to the
newcomers — free for the taking! Global communications, a wealth of free
software and information: who could resist?
Well, frankly, the Internet was a tough row to hoe back then.
Getting connected and using the various software tools, if they were even
available for a given computer, presented an insurmountable technology barrier
for most people. And most available information was plain-vanilla ASCII text
about academic subjects, not the neatly packaged fare that attracts users to
services such as America Online. The Internet was just too disorganized, and,
outside of the government and academia, few people had the knowledge or interest
to learn how to use the arcane software or the time to spend rummaging through
documents looking for ones of interest.
1.1.2 HTML and the Web
Lift-off happened when some bright students and faculty at the
National Center for Supercomputing Applications (NCSA) at the University of
Illinois, Urbana-Champaign wrote a web browser called Mosaic. Although designed
primarily for viewing HTML documents, the software also had built-in tools to
access the much more prolific resources on the Internet, such as FTP archives of
software and Gopher-organized collections of documents.
With versions based on easy-to-use graphical user interfaces
familiar to most computer owners, Mosaic became an instant success. It, like
most Internet software, was available on the Net for free. Millions of users
snatched up copies and began surfing the Internet for "cool web pages."
1.1.3 Golden Threads
There you have the history of the Internet and the Web in a
nutshell: from rags to riches in just a few short years. The Internet has
spawned an entirely new medium for worldwide information exchange and commerce.
For instance, when the marketers caught on to the fact that they could cheaply
produce and deliver eye-catching, wow-and-whizbang commercials and product
catalogs to those millions of web surfers around the world, there was no
stopping the stampede of blue suede shoes. Even the key developers of Mosaic and
related web server technologies sensed potential riches. They left NCSA and made
their fortunes with Netscape Communications by producing commercial web browsers and server
software. That was until the sleeping giant Microsoft awoke. But that's another
story . . .
Business users and marketing opportunities have helped
invigorate the Internet and fuel its phenomenal growth. Internet-based commerce
has become Very Big Business and is expected to approach US$150 billion annually
by 2005.
For some, particularly us Internet old-timers, business and
marketing have also trashed the medium. In many ways, the Web has become a vast
strip mall and an annoying advertising medium. Believe it or not, once upon a
time, Internet users adhered to commonly held (but not formally codified) rules
of netiquette that prohibited such things as
"spamming" special-interest newsgroups with messages unrelated to the topic at
hand or sending unsolicited email.
Nonetheless, the power of HTML and network distribution of
information goes well beyond marketing and monetary rewards: serious
informational pursuits also benefit. Publications, complete with images and
other media like executable software, can get to their intended audiences in the
blink of an eye, instead of the months traditionally required for printing and
mail delivery. Education takes a great leap forward when students gain access to
the great libraries of the world. And at times of leisure, the interactive
capabilities of HTML links can reinvigorate our otherwise television-numbed
minds.
1.2 Talking the Internet Talk
Every computer connected to the Internet
(even a beat-up old Apple II) has a unique address: a number whose format is
defined by the Internet protocol (IP), the
standard that defines how messages are passed from one machine to another on the
Net. An IP address is made
up of four numbers, each less than 256, joined together by periods, such as
192.12.248.73 or 131.58.97.254.
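To make that format concrete, here is a minimal Python sketch that checks whether a string fits the four-number pattern just described. The function name is_dotted_quad is our own invention for illustration, and the check covers only the layout of the address, not whether any machine actually answers to it.

```python
def is_dotted_quad(address: str) -> bool:
    """Return True if address is four decimal numbers, each less than 256, joined by periods."""
    parts = address.split(".")
    if len(parts) != 4:
        return False
    for part in parts:
        # Each piece must be plain ASCII digits, so "", "-1", and "1e2" are all rejected.
        if not (part.isascii() and part.isdigit()):
            return False
        if int(part) >= 256:
            return False
    return True


if __name__ == "__main__":
    for candidate in ("192.12.248.73", "131.58.97.254", "256.1.1.1", "1.2.3"):
        print(candidate, is_dotted_quad(candidate))
```

The two sample addresses from the text pass; the last two candidates fail because one number is too large and the other address has only three parts.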
This naming stuff is
easier than it sounds. For example, the fully qualified domain name www.oreilly.com translates to a machine named "www"
that's part of the domain known as "oreilly," which, in turn, is part of the
commercial (com) branch of the Internet. Other branches of the Internet include
educational institutions (edu), nonprofit organizations (org), the U.S.
government (gov), and Internet service providers (net). Computers and networks
outside the United States may have two-letter abbreviations at the end of their
names: for example, "ca" for Canada, "jp" for Japan, and "uk" for the United
Kingdom.
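The way a name breaks down can be sketched in a few lines of Python. The function split_domain_name and the BRANCHES table are hypothetical names of ours; the table holds only the branches mentioned above, and the sketch assumes the simple case where the first label names the machine.

```python
# Branch meanings copied from the paragraph above; illustrative, not a complete list.
BRANCHES = {
    "com": "commercial",
    "edu": "educational institution",
    "org": "nonprofit organization",
    "gov": "U.S. government",
    "net": "Internet service provider",
    "ca": "Canada",
    "jp": "Japan",
    "uk": "United Kingdom",
}


def split_domain_name(name: str) -> dict:
    """Split a fully qualified domain name into machine, domain(s), and branch."""
    labels = name.lower().rstrip(".").split(".")
    # Simplifying assumption: we need at least machine.domain.branch, with no empty labels.
    if len(labels) < 3 or not all(labels):
        raise ValueError(f"expected machine.domain.branch, got {name!r}")
    branch = labels[-1]
    return {
        "machine": labels[0],
        "domains": labels[1:-1],
        "branch": branch,
        "branch_meaning": BRANCHES.get(branch, "unknown"),
    }


if __name__ == "__main__":
    print(split_domain_name("www.oreilly.com"))
```

For www.oreilly.com, the sketch reports the machine "www", the domain "oreilly", and the commercial branch, matching the reading given in the text.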
Special computers, known as name servers, keep tables of machine names and their
associated unique numerical IP addresses and translate one into the other for us
and for our machines. Domain names must be registered and paid for through any
one of the now many for-profit registrars.[1] Once it is registered, the owner of the
unique domain name broadcasts it and its address to other domain name servers
around the world. Each domain and subdomain has an associated name server, so
ultimately every machine is known uniquely by both a name and an IP address.
[1] At one time, a single nonprofit organization known as InterNIC handled that function. Now ICANN.org coordinates U.S. government-related name servers, but other organizations or individuals must work through a for-profit company to register their unique domain names.
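At heart, a name server's job is a two-way table lookup. The toy class below is a sketch of that idea only: ToyNameServer, its method names, and the sample name and address are all made up for illustration, and real name servers are distributed and considerably more involved. It follows the text's simplification that each machine has exactly one name and one address.

```python
class ToyNameServer:
    """A tiny in-memory table that translates names to addresses and back."""

    def __init__(self):
        self._by_name = {}
        self._by_address = {}

    def register(self, name: str, address: str) -> None:
        key = name.lower()  # machine names are compared without regard to case
        if key in self._by_name or address in self._by_address:
            raise ValueError("name or address already registered")
        self._by_name[key] = address
        self._by_address[address] = key

    def address_of(self, name: str) -> str:
        return self._by_name[name.lower()]  # raises KeyError for an unknown name

    def name_of(self, address: str) -> str:
        return self._by_address[address]  # raises KeyError for an unknown address


if __name__ == "__main__":
    server = ToyNameServer()
    # Hypothetical entry; 192.0.2.x addresses are reserved for documentation.
    server.register("www.example.com", "192.0.2.10")
    print(server.address_of("www.example.com"))
    print(server.name_of("192.0.2.10"))
```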
1.2.1 Clients, Servers, and Browsers
The Internet connects two kinds of computers: servers, which serve up documents, and clients, which retrieve and display documents for us
humans. Things that happen on the server machine are said to be on the server side, while activities on the client machine
occur on the client side.
To access and display
HTML documents, we run programs called browsers
on our client computers. These browser clients talk to special web servers over the Internet to access and retrieve
electronic documents.
Several web browsers
are available (most for free), each offering a different set of features. For
example, browsers like Lynx run on character-based clients
and display documents only as text. Others run on clients with graphical
displays and render documents using proportional fonts and color graphics on a
1024 x 768, 24-bit-per-pixel display. Others still — Netscape Navigator, Microsoft's Internet
Explorer, and Opera, to name the leading few — have special features that allow
you to retrieve and display a variety of electronic documents over the Internet,
including audio and video multimedia.
1.2.2 The Flow of Information
All web activity begins on the client side, when a user starts
his or her browser. The browser begins by loading a home
page document, either from local storage or from a server over some
network, such as the Internet, a corporate intranet, or a town extranet. In
these latter cases, the client browser first consults a domain name system (DNS)
server to translate the home page document server's name, such as www.oreilly.com, into an IP address, before sending a
request to that server over the Internet. This request (and the server's reply)
is formatted according to the dictates of the Hypertext Transfer Protocol (HTTP) standard.
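To give a feel for what such a request looks like on the wire, here is a small Python sketch that turns a URL into the server name the browser must look up and the text of a bare-bones HTTP request. The function name build_get_request is ours, and the request is deliberately minimal; a real browser sends many more header lines.

```python
from typing import Tuple
from urllib.parse import urlsplit


def build_get_request(url: str) -> Tuple[str, bytes]:
    """Return (server name to resolve via DNS, raw HTTP request bytes) for a URL."""
    parts = urlsplit(url)
    if parts.scheme != "http" or not parts.hostname:
        raise ValueError(f"not a plain http URL: {url!r}")
    path = parts.path or "/"  # an empty path means the server's top-level document
    if parts.query:
        path += "?" + parts.query
    request = (
        f"GET {path} HTTP/1.0\r\n"   # request line: method, document, protocol version
        f"Host: {parts.hostname}\r\n"  # which named server the request is meant for
        "\r\n"                        # a blank line ends the request headers
    )
    return parts.hostname, request.encode("ascii")


if __name__ == "__main__":
    host, request = build_get_request("http://www.oreilly.com/")
    print(host)
    print(request.decode("ascii"))
```

The sketch stops short of touching the network. In a real client, the returned server name would first be handed to a DNS lookup (Python's socket.gethostbyname is one way to do that), and the request bytes would then be sent to the resulting address.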
A server spends most
of its time listening to the network, waiting for document requests with the
server's unique address stamped on them. Upon receipt of a request, the server
verifies that the requesting browser is allowed to retrieve documents from the
server and, if so, checks for the requested document. If found, the server sends
(downloads) the document to the browser. The server usually logs the request,
the client computer's name, the document requested, and the time.
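The server's side of the conversation can be sketched the same way. The function below is a simplified model, not how any particular web server is written: handle_request and its parameters are hypothetical, the document collection is a plain dictionary, and the numbers 200, 403, and 404 are the standard HTTP status codes for success, forbidden, and not found.

```python
from datetime import datetime, timezone


def handle_request(client_name, path, documents, allowed_clients, log):
    """Decide how to answer one document request and record it in the log."""
    if client_name not in allowed_clients:
        status, body = 403, b""   # this client may not retrieve documents here
    elif path not in documents:
        status, body = 404, b""   # allowed, but no such document
    else:
        status, body = 200, documents[path]  # found: send the document
    # Log the client's name, the document requested, and the time, as described above.
    log.append({
        "client": client_name,
        "document": path,
        "time": datetime.now(timezone.utc),
    })
    return status, body


if __name__ == "__main__":
    documents = {"/index.html": b"<html><body>Hello</body></html>"}
    allowed = {"client.example.org"}  # hypothetical client name
    log = []
    print(handle_request("client.example.org", "/index.html", documents, allowed, log))
    print(handle_request("client.example.org", "/missing.html", documents, allowed, log))
    print(len(log), "requests logged")
```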
Back on the browser, the document arrives. If it's a
plain-vanilla ASCII text file, most browsers display it in a common,
plain-vanilla way. Document directories, too, are treated like plain documents,
although most graphical browsers display folder icons that the user can select
with the mouse to download the contents of subdirectories.
Browsers also retrieve binary files from
a server. Unless assisted by a helper program or
specially enabled by plug-in software or applets, which display an image or video file or play
an audio file, the browser usually stores downloaded binary files directly on a
local disk for later use.
For the most part, however, the browser retrieves a special
document that appears to be a plain text file but that contains both text and
special markup codes called tags. The browser
processes these HTML or XHTML documents, formatting the text based on the tags
and downloading special accessory files, such as images.
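A rough idea of that processing step, sketched with Python's standard html.parser module: the class below walks a small document, notes each tag it meets, and collects the image files a browser would have to download separately. The class name AccessoryFinder and the sample document are ours, and a real browser does far more than this.

```python
from html.parser import HTMLParser


class AccessoryFinder(HTMLParser):
    """Record the tags in a document and the image files it refers to."""

    def __init__(self):
        super().__init__()
        self.tags = []     # every start tag, in document order
        self.images = []   # values of src attributes on img tags
        self.text = []     # the readable text between the tags

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.images.append(src)

    def handle_data(self, data):
        if data.strip():
            self.text.append(data.strip())


SAMPLE = (
    "<html><head><title>Hello</title></head>"
    "<body><h1>Welcome</h1><p>Some <b>bold</b> text.</p>"
    '<img src="logo.gif" alt="Logo"></body></html>'
)

if __name__ == "__main__":
    finder = AccessoryFinder()
    finder.feed(SAMPLE)
    finder.close()
    print(finder.tags)
    print(finder.images)
```

The tags themselves never appear in the collected text, which mirrors the point made above: the browser reads the tags as instructions and shows the reader only the content.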
The user reads the document, selects a hyperlink to another
document, and the entire process starts over.
1.2.3 Beneath the Web
Isolating web documents is good for the author, too, since it
gives you the opportunity to finish, in the editorial sense of the word, a
document collection for later distribution. Diligent authors work locally to
write and proof their documents before releasing them for general distribution,
thereby sparing readers the agonies of broken image files and bogus
hyperlinks.[2]
[2] Vigorous testing of HTML documents once they are made available on the Web is, of course, also highly recommended and necessary to rid them of various linking bugs.
Organizations, too, can be connected to the Internet but also
maintain private webs and document collections for distribution to clients on
their local networks, or intranets. In fact, private webs are fast becoming the
technology of choice for the paperless offices we've heard so much about during
these last few years. With HTML and XHTML document collections, businesses can
maintain personnel databases complete with employee photographs and online
handbooks, collections of blueprints, parts, assembly manuals, and so on — all
readily and easily accessed electronically by authorized users and displayed on
a local computer.
1.2.4 Standards Organizations
Like many popular technologies, HTML
started out as an informal specification used by only a few people. As more and
more authors began to use the language, it became obvious that more formal means
were needed to define and manage — i.e., to standardize — the language's
features, making it easier for everyone to create and share documents.
1.2.4.1 The World Wide Web Consortium
The World Wide Web
Consortium (W3C) was formed with the charter to define the standards for HTML
and, later, XHTML. Members are responsible for drafting, circulating for review,
and modifying the standard based on cross-Internet feedback to best meet the
needs of the many.
Beyond HTML and XHTML, the W3C has the broader responsibility
of standardizing any technology related to the Web; they manage the HTTP,
Cascading Style Sheets (CSS), and Extensible Markup Language (XML) standards, as
well as related standards for document addressing on the Web. They also solicit
draft standards for extensions to existing web technologies.
If you want to track HTML, XML, XHTML, CSS, and other exciting
web development and related technologies, contact the W3C at http://www.w3.org.
Also, several Internet newsgroups are devoted to the Web, each
a part of the comp.infosystems.www hierarchy.
These include comp.infosystems.www.authoring.html
and comp.infosystems.www.authoring.images.
1.2.4.2 The Internet Engineering Task Force
Even broader in reach
than W3C, the Internet Engineering Task Force (IETF) is responsible for defining
and managing every aspect of Internet technology. The Web is just one small area
under the purview of the IETF.
The IETF defines all of the technology of the Internet via
official documents known as Requests
for Comments, or RFCs. Individually numbered for easy reference, each RFC
addresses a specific Internet technology — everything from the syntax of domain
names and the allocation of IP addresses to the format of electronic mail
messages.
To learn more about the IETF and follow the progress of various
RFCs as they are circulated for review and revision, visit the IETF home page,
http://www.ietf.org.
1.3 HTML and XHTML: What They Are
HTML and XHTML are document-layout and
hyperlink-specification languages. They define the syntax and placement of
special, embedded directions that aren't displayed by the browser but tell it
how to display the contents of the document, including text, images, and other
support media. The languages also tell you how to make a document interactive
through special hypertext links, which connect your document with other
documents — on either your computer or someone else's — as well as with other
Internet resources.
You've certainly heard of HTML, and perhaps XHTML too, but
did you know that they are just two of many markup languages? Indeed, HTML
is the black sheep in the family of document markup languages. HTML was based on
SGML, the Standard Generalized Markup
Language. The powers-that-be created SGML with the intent that it be the one and
only markup metalanguage from which all other document markup elements would be
created. Everything from hieroglyphics to HTML can be defined using SGML,
negating any need for any other markup language.
The problem with SGML is that it is so broad and
all-encompassing that mere mortals cannot use it. Using SGML effectively
requires very expensive and complex tools that are completely beyond the scope
of regular people who just want to bang out an HTML document in their spare
time. As a result, HTML adheres to some, but not all, SGML standards,[3] eliminating many of
the more esoteric features so that it is readily usable and used.
[3] The HTML DTD in Appendix D uses a subset of SGML to define the HTML 4.01 standard.
Besides the fact that SGML is unwieldy and not well suited to
describing the very popular HTML in a useful way, there was also a growing need
to define other HTML-like markup languages to handle different network
documents. Accordingly, the W3C defined the Extensible Markup Language (XML). Like SGML, XML is a separate
formal markup metalanguage that uses select features of SGML to define markup
languages. It eliminates many features of SGML that aren't applicable to
languages like HTML and simplifies other SGML elements in order to make them
easier to use and understand.
However, HTML Version 4.01 is not XML-compliant. Hence, the W3C
offers XHTML, a reformulation of HTML that is compliant with XML. XHTML attempts
to support every last nit and feature of HTML 4.01 using the more rigid rules of
XML. It generally succeeds, but it has enough differences to make life difficult
for the standards-conscious HTML author.
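One way to see what "the more rigid rules of XML" means in practice is to hand a few fragments to an XML parser. The sketch below uses Python's standard xml.etree.ElementTree module; the helper name is_well_formed and the sample fragments are ours. Note the limits of the test: it checks only XML well-formedness, not whether the markup is valid XHTML.

```python
import xml.etree.ElementTree as ET


def is_well_formed(markup: str) -> bool:
    """Return True if an XML parser accepts the markup, False otherwise."""
    try:
        ET.fromstring(markup)
    except ET.ParseError:
        return False
    return True


# Markup in a style HTML browsers tolerate: unclosed tags, unquoted attribute values.
HTML_STYLE = [
    "<p>First line<br>Second line",
    "<P>Mixed case</p>",
    "<td align=center>Cell</td>",
]

# The same content written the way XML insists on.
XHTML_STYLE = [
    "<p>First line<br />Second line</p>",
    "<p>Mixed case</p>",
    '<td align="center">Cell</td>',
]

if __name__ == "__main__":
    for fragment in HTML_STYLE + XHTML_STYLE:
        print(is_well_formed(fragment), fragment)
```

Every fragment in the first list is rejected and every fragment in the second is accepted, which is a fair summary of the adjustments an HTML author makes when moving to XHTML.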
1.4 HTML and XHTML: What They Aren't
Despite all their new, multimedia-enabling page-layout
features, and the hot technologies that give life to HTML/XHTML documents over the Internet, it is also important
to understand the languages' limitations. They are not word-processing tools,
desktop-publishing solutions, or even programming languages. Their fundamental
purpose is to define the structure and appearance of documents and document
families so that they may be delivered quickly and easily to a user over a
network for rendering on a variety of display devices. Jack of all trades, but
master of none, so to speak.
1.4.1 Content Versus Appearance
HTML and its progeny, XHTML, provide many different ways to let
you define the appearance of your documents: font specifications, line breaks,
and multicolumn text are all features of the language. Of course, appearance is
important, since it can have either detrimental or beneficial effects on how
users access and use the information in your documents.
Nonetheless, we believe that content is paramount; appearance
is secondary, particularly since it is less predictable, given the variety of
browser graphics and text-formatting capabilities. In fact, HTML and XHTML
contain many ways for structuring your document content without regard to the
final appearance: section headers, structured lists, paragraphs, rules, titles,
and embedded images are all defined by the standard languages without regard for
how these elements might be rendered by a browser. Consider, for example, a
browser for the blind, wherein graphics on the page come with audio descriptions
and alternative rules for navigation. The HTML/XHTML standards define such a
thing: content over visual presentation.
If you treat HTML or XHTML as a document-generation tool, you
will be sorely disappointed in your ability to format your document in a
specific way. There is simply not enough capability built into the languages to
allow you to create the kinds of documents you might whip up with tools like
FrameMaker or Microsoft Word. Attempts to subvert the supplied structuring
elements to achieve specific formatting tricks seldom work across all browsers.
In short, don't waste your time trying to force HTML and XHTML to do things they
were never designed to do.
Instead, use HTML and XHTML in the manner for which they were
designed: indicating the structure of a document so that the browser can then
render its content appropriately. HTML and XHTML are rife with tags that let you
indicate the semantics of your document content, something that is missing from
tools like FrameMaker and Word. Create your documents using these tags and
you'll be happier, your documents will look better, and your readers will benefit immensely.
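Structural tags pay off because programs other than graphical browsers can use them, too. As a small illustration, the Python sketch below pulls an outline out of a document using nothing but its header tags, paying no attention to how anything looks. The class name OutlineBuilder and the sample markup are ours; the parsing relies on the standard html.parser module.

```python
from html.parser import HTMLParser


class OutlineBuilder(HTMLParser):
    """Collect (level, text) pairs for every h1 through h6 header in a document."""

    HEADERS = {"h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__()
        self.outline = []
        self._level = None   # level of the header we are inside, if any
        self._pieces = []    # text collected for that header

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADERS:
            self._level = int(tag[1])
            self._pieces = []

    def handle_data(self, data):
        if self._level is not None:
            self._pieces.append(data)

    def handle_endtag(self, tag):
        if tag in self.HEADERS and self._level is not None:
            # Join the pieces and collapse runs of whitespace into single spaces.
            text = " ".join("".join(self._pieces).split())
            self.outline.append((self._level, text))
            self._level = None


SAMPLE = (
    "<h1>Guide</h1><p>Introductory text.</p>"
    "<h2>Setup <em>quickly</em></h2><p>More text.</p>"
)

if __name__ == "__main__":
    builder = OutlineBuilder()
    builder.feed(SAMPLE)
    builder.close()
    for level, text in builder.outline:
        print("  " * (level - 1) + text)
```

Had the author faked those headers with font-size changes, there would be nothing for a program like this to find.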
1.5 Standards and Extensions
The basic syntax and semantics of HTML are
defined in the HTML standard, now in its final version, 4.01. HTML matured
quickly, in barely a decade. At one time, a new version would appear before you
had a chance to finish reading an earlier edition of this book. Today, HTML has
stopped evolving. As far as the W3C is concerned, XHTML has taken over. Now the
wait is for browser manufacturers to implement the standards.
The XHTML standard
currently is Version 1.0. Fortunately, XHTML Version 1.0 is, for the most part,
a reconstitution of HTML Version 4.01. There are some differences, which we
explore in Chapter 16. The popular browsers continue to support HTML documents, so there is no
cause to stampede to XHTML. Do, however, start walking in that direction: a
newer XHTML version, 1.1, is under consideration at the W3C, and browser
developers are slowly but surely dropping nonstandard HTML features from their
products.
Obviously, browser developers rely upon standards to
have their software properly format and display common HTML and XHTML documents.
Authors use the standards to make sure they are writing effective, correct
documents that get displayed properly by the browsers.
However, standards are not always explicit; manufacturers have
some leeway in how their browsers might display an element. And to complicate
matters, commercial forces have pushed developers to add into their browsers
nonstandard extensions meant to improve the language.
Confused? Don't be: in this book, we explore in detail the
syntax, semantics, and idioms of the HTML Version 4.01 and XHTML Version 1.0
languages, along with the many important extensions that are supported in the
latest versions of the most popular browsers.
1.5.1 Nonstandard Extensions
It doesn't take an advanced degree in The
Obvious to know that distinction draws attention. So, too, with browsers. Extra
whizbang features can give the edge in the otherwise standardized browser
market. That can be a nightmare for authors. A lot of people want you to use the
latest and greatest gimmick or even useful HTML/XHTML extension. But it's not
part of the standard, and not all browsers support it. In fact, on occasion, the
popular browsers support different ways of doing the same thing.
1.5.2 Extensions: Pro and Con
Every software vendor adheres to the technological standards;
it's embarrassing to be incompatible, and your competitors will take every
opportunity to remind buyers of your product's failure to comply, no matter how
arcane or useless that standard might be. At the same time, vendors seek to make
their products different from and better than the competition's offerings.
Netscape's and Internet Explorer's extensions to standard HTML are perfect
examples of these market pressures.
Fortunately, with HTML Version 4.0, the W3C standards caught up
with the browser manufacturers. In fact, the tables turned somewhat. The many
extensions to HTML that originally appeared in Netscape Navigator
and Microsoft Internet Explorer are now part of the HTML 4 and XHTML 1.0
standards, and there are other parts of the new standard that are not yet
features of the popular browsers.
1.5.3 Avoiding Extensions
In general, we urge you to resist using
extensions unless you have a compelling and overriding reason to do so. By using
them, particularly in key portions of your documents, you run the risk of losing
a substantial portion of your potential readership. Sure, the Internet Explorer
community is large enough to make this point moot now, but even so, you are
excluding from your pages millions of people who use Netscape.
Of course, there are varying degrees of dependency on
extensions. If you use some of the horizontal rule extensions, for example, most
other browsers will ignore the extended attributes and render a conventional
horizontal rule. On the other hand, reliance upon a number of font-size changes
and text-alignment extensions to control your document's appearance will make
your document look terrible on many alternative browsers. It might not even
display at all on browsers that don't support the extensions.
We admit that it is disingenuous of us to decry the use of
extensions while presenting complete descriptions of their use. In keeping with
the general philosophy of the Internet, we'll err on the side of handing out
rope and guns to all interested parties while hoping you have enough smarts to
keep from hanging yourself or shooting yourself in the foot.
Our advice still holds, though: use an extension only where it
is necessary or very advantageous, and do so with the understanding that you are
disenfranchising a portion of your audience. To that end, you might even
consider providing separate, standards-based versions of your documents to
accommodate users of other browsers.
1.5.4 Extensions Through Modules
The upcoming XHTML
Version 1.1 provides a mechanism for extending the language in a standard way:
XML modules. In fact, XHTML 1.1 is itself composed of modules.
XHTML modules divide the HTML language into discrete document
types, each defining features and functions that are parts of the language.
There are separate modules for XHTML forms, text, scripting, tables, and so on —
all the nondeprecated elements of XHTML 1.0.
The advantage of modules is extensibility. In addition to using
the markup features from the XHTML modules normally included in the standard,
the new language lets you easily blend other XML modules into your documents,
extending their features and capabilities in a standard way. For instance, the
W3C has defined a MathML module that provides explicit markup elements for
mathematical equations that you could use in your next XHTML-based math thesis.
Modules, let alone the XHTML Version 1.1 language, are
experimental and are not well supported by the popular browsers. Accordingly, we
don't recommend that you use XHTML modules just yet. For now, the subject is
beyond the scope of this book. Consult the W3C web site for more details.
1.6 Tools for the Web Designer
While you can use the barest of barebones text editors
to create HTML and XHTML documents, most authors keep a more elaborate
toolbox of software utilities than a simple word processor. You also need a
browser, so you can test and refine your work. Beyond the essentials are some
specialized software tools for developing and preparing HTML documents and
accessory multimedia files.
1.6.1 Essentials
At the very least, you'll need an editor, a browser to check
your work, and, ideally, a connection to the Internet.
1.6.1.1 Word processor or WYSIWYG editor?
Some authors use the word-processing capabilities of their
specialized HTML/XHTML editing software. Some use the WYSIWYG
(what-you-see-is-what-you-get) composition tools that come with their browsers
or the latest versions of the popular word processors. Others, such as
ourselves, prefer to compose their work on a general word processor and later
insert the markup tags and their attributes. Still others include markup as they
compose.
We think the stepwise approach — compose, then mark up — is the
better way. We find that once we've defined and written the document's content,
it's much easier to make a second pass to judiciously and effectively add the
HTML/XHTML tags to format the text. Otherwise, the markup can obscure the
content. Note, too, that unless specially trained (if they can be),
spell-checkers and thesauruses typically choke on markup tags and their various
parameters. You can spend what seems to be a lifetime clicking the Ignore button
on all those otherwise valid markup tags when syntax- or spell-checking a
document.
When and how you embed markup tags into your document dictates
the tools you need. We recommend that you use a good word processor, which comes
with more and better writing tools than simple text editors or the browser-based
markup-language editors. You'll find, for instance, that an outliner,
spell-checker, and thesaurus will best help you craft the document's flow and
content, disregarding for the moment its look. The latest word processors encode
your documents with HTML, too, but don't expect miracles. Except for boilerplate
documents, you will probably need to nurse those automated HTML documents to
full health. (Not to mention put them on a diet when you see how long the
generated HTML is.) And it'll be a while before you'll see XHTML-specific markup
tools in the popular word processors.
Another word of caution about automated composition tools: they
typically change or insert content (e.g., replacing relative hyperlinks with
full ones) and arrange your document in ways that will annoy you, particularly
since they rarely give you the opportunity to do things your own way.
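If you haven't met the distinction between relative and full hyperlinks yet, the short Python sketch below shows the kind of rewriting such a tool performs. It uses urljoin from the standard urllib.parse module; the function name make_full and the example.com addresses are hypothetical.

```python
from urllib.parse import urljoin


def make_full(base_url: str, link: str) -> str:
    """Resolve a hyperlink against the address of the document that contains it."""
    return urljoin(base_url, link)


# Hypothetical address of the document being edited.
BASE = "http://www.example.com/docs/chapter1.html"

if __name__ == "__main__":
    for link in ("images/fig1.gif", "../index.html", "http://www.w3.org/"):
        print(link, "->", make_full(BASE, link))
```

The first two links are relative and get expanded against the document's own location; the third is already a full address and comes back unchanged. Relative links keep working when you move a document collection to another server, which is one reason to object when a tool expands them for you.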
Become fluent in native HTML/XHTML. Be prepared to reverse some
of the things a composition tool will do to your documents. And make sure you
can wrest your document away from the tool so you can make it do your bidding.
1.6.1.2 Browser software
Obviously, you should view your newly composed documents and
test their functionality before you release them for use by others. For serious
authors, particularly those looking to push their documents beyond the
HTML/XHTML standards, we recommend that you have several browsers, perhaps with
versions running on different computers, just to be sure one's delightful
display isn't another's nightmare.
The currently popular — and therefore most important — browsers
are Netscape Navigator (the browser portion of Netscape
Communicator) and Microsoft Internet Explorer. Download the latest
versions from their web sites.
By the way, Netscape Communicator includes a fine HTML WYSIWYG
editor called Composer.
1.6.2 An Extended Toolkit
If you're serious about creating documents, you'll soon find
there are all sorts of nifty tools that make life easier. The list of freeware,
shareware, and commercial products grows daily, so it's not very useful to
provide a list here. This is, in fact, another good reason to frequent the
various newsgroups and web sites that keep updated lists of HTML and XHTML
resources on the Web. If you are really dedicated to writing in HTML and XHTML,
you will visit those sites, and you will visit them regularly to keep abreast of
the language, tools, and trends.