DRAFT IN PROGRESS

Last modified 18 July 1997

History and Context

SKETCH THE HISTORY OF CJTCS, JFLP. EXPLAIN TEX/LATEX.

Editing for Robustness and Flexibility

How Much Should We Edit?

How much editing of articles should editors of online journals do? This question generates a fair bit of controversy, and proposed answers range from

I propose that online scholarly articles can benefit as much from editing as printed articles, but some of the editing is substantially different and serves a new purpose not achievable in print.

Why Do We Edit?

Content Editing (Keep it Up)

Editing of scholarly articles divides naturally into content and format editing, although the line between them is not precise, and the division of labor may involve other issues. Content editing is essentially the same problem whether the content is to be printed or published online. I have a strong conviction that good content editing is an important part of the service provided by a journal publisher. Although the question is not peculiar to online publishing, some of the relevant issues have broader application for online publication than for print. The value of content editing hinges on the type of reader that we cater to, which might include:

I claim that research articles are not worth archiving unless they are written for a relatively general audience. Although scholarly publication supports the credentials business, that is a derivative value which disappears without the more fundamental value of sharing enlightenment through communication of ideas. The quick communication of results to fellow experts is well supported by unpublished or informally published technical reports. The point of archival publication is to communicate ideas to a broad range of readers, not all of them expert in the topic. The farther into the future we go, the less we can count on readers' familiarity with the technical context of an article. In my experience, most authors lapse several times per article into highly ambiguous, and sometimes erroneous, expressions that can only be decoded correctly by those who already understand the topic. I see no effective alternative to good content editing for producing genuinely readable articles.

Format Editing (a New Sort of Format)

Format editing for print publication is essentially the creation of good legible typographical layout for the content. For online publishing, layout is less relevant to editors, because authors often do a pretty good job, and because readers also take over some of the responsibility in return for the flexibility of producing custom formats tailored to their needs and facilities. There is an attractive argument that we should abandon format editing, making online publishing cheaper and more efficient.

Instead of dropping format editing, I propose to spend a similar effort toward a new goal. I propose that we replace most of the editing that produces good layout with editing for robust and flexible source structure. We will need some discipline to do so, since the benefits of well structured source will be invisible in the short run. But, in the longer term, well structured source may make the difference between journal archives that offer a perpetual intellectual resource, and journal archives that disappear into obsolescence.

MENTION SEPARATION OF STRUCTURE, LAYOUT.

What Do We Edit?

To understand format editing for robustness and flexibility, we need first to understand what the ``format'' of an online article is. The right sort of format is the subject of controversy, probably more intense than the controversy regarding editing. I have argued elsewhere that the definitive archival format of articles published online should present the textual structure of articles as transparently as possible, leaving details of the display, such as typographical layout, to readers. Layouts that are attractive to large numbers of readers should be provided when convenient as a derivative service, but the definitive copy of an article should be a structural source format, such as SGML or a disciplined form of LaTeX. Key reasons for preferring structural formats to typographical or pictorial formats include:

  1. Structural formats come closest to presenting the information contributed by the author.
  2. It is much easier to produce good typographical layout from a structural presentation than vice versa.
  3. Structural presentations are much easier to parse automatically, making them more useful for
    1. innovative information processing and retrieval applications and
    2. automatic conversion to new formats when the current one becomes obsolete.
  4. Structural formats give maximum power to readers to process the information in articles however they please, not just in ways explicitly supported by publishers.
SGML (HTML is essentially a subset of SGML) is a reasonable structural format for plain text, but there are no satisfactory utilities for displaying mathematical formulae in SGML documents today. The Astrophysical Journal, published by the University of Chicago Press, uses SGML as a source format, with automatic conversion to LaTeX to render the mathematics. CJTCS depends even more than the Astrophysical Journal on the presentation of mathematics, and we do not enjoy the programming manpower that they have, so we chose a disciplined form of LaTeX as our source format.

The case for structural formats is a bit hard to grasp concretely, since it depends, not on the look of a standard display of an article right now, but on the potential for displaying and processing the article in unpredictable ways in the unknown future. For the most part, we have not yet enjoyed the payoff. So far, Evan Owens of the Astrophysical Journal reports that the choice to base the old printed journal on SGML made rapid conversion to online publication very feasible, instead of horrifically difficult. CJTCS made the transition from the obsolescent LaTeX version 2.09 to the new LaTeX2e without a hitch, largely due to the robustness of our source format. CJTCS is also experimenting with audible display of articles, using T. V. Raman's AsTeR system. Dr. Raman discusses in his dissertation why a structural format, such as disciplined LaTeX, allows automatic audio display and even browsing of mathematics and text, while popular typographical formats, such as PostScript and PDF, do not.

Since it is treated elsewhere, I ask you to postulate the value of structural formats as the definitive archival forms of published articles. This paper treats the need for intelligent editing to produce such formats.

A Sample Editing History

I will demonstrate the value of editing authors' LaTeX source by tracing the key changes in an article recently published in CJTCS. Uriel Feige and Joe Kilian have kindly allowed me to quote from their original manuscript, which is normally held confidential. This manuscript is fairly typical of what authors provide. Feige and Kilian provided me with a LaTeX and BIBTeX source that produced very readable text, but was not a reliable presentation of textual structure. I edited their source into a disciplined subset of LaTeX, supported by my cjstruct macros (available as freeware under the GNU General Public License). cjstruct does very little to determine typographical layout, leaving most of that to parameters that may be set by each reader. Rather, it provides a translation between a form that is structurally similar to SGML, and the typographical macro calls that are supported directly by LaTeX. You should be able to understand LaTeX and cjstruct well enough for current purposes just by following the examples in the edit history.

The Manuscript From the Authors

``On Limited versus Polynomial Nondeterminism''

Here are two sample typeset pages from the draft accepted by the CJTCS editors, essentially as it was transmitted by the authors, in PostScript and DVI formats. I say ``essentially,'' because the authors submitted LaTeX source, and the precise look of the result may be somewhat different at different sites. This example is typical of the quality of typeset mathematical copy produced naively by authors using LaTeX. It is highly readable. Most computer scientists are thoroughly satisfied to read material of this visual quality, particularly when it is printed on a high-resolution laser printer. Most of the free PostScript browsers widely available on UNIX systems are not very satisfactory for reading such text, but they will improve quickly. LaTeX produces a less widely used typographical format, called DVI, for which very nice free browsers are already available to run on UNIX. If this typeset display were the real published product, there might be little or no need for format editing.

Here are corresponding pages from a standard display of the final published article, also in PostScript and DVI. The difference is not dramatic. The nearly identical pagination is an unusual and insignificant coincidence.

Now, look at the LaTeX source for Sections 3.1 and 3.2 of the original draft versus the published version. I suggest that you open these two files in two additional windows, and compare them as you read ahead. In LaTeX source, ``%'' makes the remainder of a line into an unprocessed comment, ``\'' introduces a macro command, ``{'' and ``}'' provide grouping, mostly for arguments to macros. I describe other special forms in LaTeX as they come up. The remainder of this section describes the key format-editing steps that transformed the former LaTeX source into the latter. You may also notice some minor changes introduced by copy editing. I follow a conceptual, and nearly chronological, order.

Bracketing Mathematical Formulae

LaTeX provides two different ways to bracket mathematical formulae. There is no good reason for the difference; it is just there. Authors normally use ``$'' both before and after mathematical formulae that are worked into a line of text, and ``$$'' before and after formulae displayed on a separate line. This form is the easier to type. CJTCS uses ``\('' and ``\)'' to bracket inline formulae, and ``\['' and ``\]'' to bracket displayed formulae, because this form is more robust to parse: the opening and closing brackets are distinct. This is a very minor point, but it may confuse you when comparing the two sources. A substantial minority of articles have a mathematical formula contained in text that is itself part of a mathematical formula, leading to nesting of these brackets. In principle, they may be nested quite deeply in LaTeX source, but I have never seen more than one level of nesting.
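
As a small illustration of the two bracketing styles, and of one level of nesting, here is a sketch of a complete LaTeX2e file. The formulae are invented for this example and are not taken from the article, and the nested case assumes the amsmath package for ``\text''.

```latex
\documentclass{article}
\usepackage{amsmath} % assumed here only for \text

\begin{document}

% Typical author style: the same token opens and closes a formula,
% so a parser must count occurrences to tell an opening from a closing.
The bound $f(n) \le n^2$ holds, and in particular
$$f(2) \le 4.$$

% Bracketed style: opening and closing delimiters are distinct.
The bound \(f(n) \le n^2\) holds, and in particular
\[f(2) \le 4.\]

% One level of nesting: a formula inside text inside a formula.
\[g(n) = 0 \text{ whenever \(n\) is even}\]

\end{document}
```

In the nested case, a parser reading the bracketed form can match each ``\('' to its ``\)'' directly, which is not possible when ``$'' serves as both.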

Good Macros, Clarifying Formula Structure

Mathematical formulae use single letters in various alphabets to denote variables. This is the most common use of single letters. But, a very significant minority of single letters denote special operations. In this article, large and small roman letters ``O'' and ``o'', and large greek letters theta and omega, are used systematically for special operators that denote four different qualities of the growth rates of functions. To encourage flexible uses of formulae from articles, including searching for particular forms, and exporting formulae into symbolic math utilities, CJTCS articles distinguish such operators with macro names, in this case ``\orderle'', ``\ordereq'', ``\orderlt'', and ``\orderge''. The definitions

\newcommand{\orderle}[1]{O(#1)}
\newcommand{\ordereq}[1]{\Theta(#1)}
\newcommand{\orderlt}[1]{o(#1)}
\newcommand{\orderge}[1]{\Omega(#1)}
at the head of the published LaTeX source define these operators and their standard typographical displays. These four operators are very common in CJTCS articles, so they are standardized across all articles. Other operators may be peculiar to a single article. I contemplated creating a catalog of all the operators likely to be used widely, but rejected the idea as too costly in effort at this stage. Eventually, someone will surely create such a standard catalog, and it will be extremely valuable in promoting interoperability between different systems that process mathematical formulae. CJTCS articles are reasonably well positioned to adapt to such a standard when it arises.

Another special operator is denoted ``\inputsize'' in the published source. Notice that ``\inputsize{x}'' in the published version is ``|x|'' in the original draft. In the typeset copy you see the variable ``x'' surrounded by vertical bars. The definition

\newcommand{\inputsize}[1]{\mathopen{|}#1\mathclose{|}}
at the head of the published source not only defines this common standard operator uniformly, but also provides layout information to LaTeX so that the spacing around the vertical bars is appropriately asymmetric. I try, usually with success, to isolate all such detailed layout information in the same way, so that the main body of the source reflects structure alone, and not layout.
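
To see the operator macros at work in the body of a source file, here is a minimal sketch of a complete LaTeX2e document. The macro definitions are the ones quoted from the published source; the sentence and its bound are invented for illustration only.

```latex
\documentclass{article}

% Definitions quoted from the published source.
\newcommand{\orderle}[1]{O(#1)}
\newcommand{\ordereq}[1]{\Theta(#1)}
\newcommand{\orderlt}[1]{o(#1)}
\newcommand{\orderge}[1]{\Omega(#1)}
\newcommand{\inputsize}[1]{\mathopen{|}#1\mathclose{|}}

\begin{document}

% Structural form: the body names the operators, and the layout
% of each operator lives only in the definitions above.
On input \(x\), the algorithm halts within
\(\orderle{\inputsize{x}^2}\) steps.

% The same invented formula with layout written directly into the body;
% nothing here says that O is an operator or that |x| is a size.
On input $x$, the algorithm halts within $O(|x|^2)$ steps.

\end{document}
```

A search for ``\orderle'' finds every upper bound in the first form, while a search for the letter ``O'' in the second form also finds every variable, set, or word that happens to use that letter.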

Bad Macros, Used for Speed Typing

Authors who create articles in LaTeX often define short macros for longer pieces of text and/or layout instructions just to speed up their typing. These macros are often recognizable by names that refer to typographical, rather than mathematical, concepts. They are perfectly sensible from the authors' point of view, but they confuse structure in the definitive published form, so I remove them. None of these occurs in Sections 3.1 and 3.2, but elsewhere in the original draft I find

A set $U$ of cardinality $n$ and
a family $\cF$ of $n$ subsets of $U$.
where ``\cF'' is defined at the head of the source by
\newcommand{\cF}{{\cal F}}
(``\cal'' selects a calligraphic shape for the letter ``F''). I revised the source for this sentence to
A set \(U\) of cardinality \(n\) and
a family \(\setofset{F}\) of \(n\) subsets of \(U\)\@.
with the macro definition
\newcommand{\setofset}[1]{\mathcal{#1}}
In the revised version, the letter ``F'' is visible directly in the presentation of the formula, and the macro ``\setofset'' indicates the mathematical type of object denoted by ``F''. The change from ``\cal'' to ``\mathcal'' is merely a conversion from an older style for specifying font choices to a newer and more systematic one in the latest version of LaTeX.

Notice that authors tend to prefer forms that are shorter to type, while I try to impose forms that are more transparent to parse.

Is it a Formula, or is it Text?

In technical articles, with mathematical formulae woven into nearly every line of text, it is sometimes subtle to determine the appropriate boundaries for a formula. Authors are usually content if their printed copy looks reasonable, but I want formulae to be identified correctly for automatic search engines. Feige and Kilian did a pretty good job of marking formulae, but they lapsed once into the form

A major open question in computational complexity is whether P$=$NP.
``P'' and ``NP'' here are formal mathematical symbols, denoting classes of problems requiring certain amounts of time to compute. The equation ``P=NP'' is a mathematical formula, but the authors' source suggests that it is a strange concatenation of two pieces of text with the mathematical symbol ``=''. I revised this sentence to
A major open question in computational complexity is whether
\(\complexityclass{P}=\complexityclass{NP}\)\@.
with the macro definition
\mathstyleclass{\complexityclass}{\mathord}{\textsl}
which marks ``P'' and ``NP'' as special symbols denoting complexity classes, and causes them to be displayed with the spacing of ordinary mathematical variables, but using a slanted type shape. (For those who know LaTeX, ``\mathstyleclass'' is a new command defined in cjstruct, not a standard LaTeX command.)
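
For readers without cjstruct at hand, the effect described above can be approximated in standard LaTeX2e. The definition below is a hypothetical stand-in written for this illustration, not the cjstruct implementation of ``\mathstyleclass''.

```latex
\documentclass{article}

% Hypothetical stand-in for the cjstruct declaration
%   \mathstyleclass{\complexityclass}{\mathord}{\textsl}
% using standard LaTeX only: an ordinary math atom set in slanted text type.
\newcommand{\complexityclass}[1]{\mathord{\textsl{#1}}}

\begin{document}

A major open question in computational complexity is whether
\(\complexityclass{P}=\complexityclass{NP}\)\@.

\end{document}
```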

Once I have decided that forms, such as ``P'' and ``NP'', are really mathematical symbols, rather than text, I mark them consistently as math, even when they appear alone in the middle of text. Look at the subsection title at the very beginning of the source for sections 3.1 and 3.2. The authors entered ``NP'' as bare text, but I marked it as mathematics. (The macro ``\protect'' is required due to a bêtise of LaTeX: it has no structural significance.)

The scholarly literature on complexity theory is full of long acronyms and phrases, used in formulae as single mathematical symbols. These cause headaches for typographical layout. Feige and Kilian used phrases presented in capital letters for this purpose, the longest being

$k$-MONOTONE CIRCUIT SATISFIABILITY
In many articles, such phrases appear in substantial formulae, and I must mark them in a similar way to ``P'' and ``NP''. I try to negotiate shorter names, and more graceful typography, with the authors when possible. In this article, I found only one short formula involving such a phrase cum symbol. So, I replaced that formula with English text, and converted all of the phrases similarly into text, for example
the \(k\)-Monotone Circuit Satisfiability problem

In some articles, there are short textual phrases embedded in mathematical formulae. These are a bit subtle to represent correctly. The examples from actual articles are tangled with other issues, so here is a phony example of what might have been submitted.

$E=\{(i,j)~|~\sigma_{n,k}(i,j)=1$ or $(\sigma_{n,k}(i,j)=x_i$ and $x_i=1)\}$
LaTeX lays this out reasonably, but the structure is all wrong, since the individual marked formulae are incomplete and make no sense on their own. I would revise this to
\(E=\setof{(i,j)}{\sigma_{n,k}(i,j)=1\text{ or }
(\sigma_{n,k}(i,j)=x_i\text{ and }x_i=1)}\)
with the macro definition
\newcommand{\setof}[2]{\{\,#1\mid#2\,\}}
Here is the typeset result in PostScript and DVI formats. ``\text'' formats normal text inside a mathematical formula. The important point is that ``or'' and ``and'' are marked correctly as textual elements inside the mathematical formula. The other changes are mainly for typographical layout: ``\mid'' is the vertical bar symbol (``|''), treated as a mathematical relation, which affects typographical spacing. ``\,'' provides a custom adjustment of spacing.
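
To try the revised form on its own, it can be wrapped in a minimal LaTeX2e file such as the sketch below. It assumes that ``\text'' is supplied by the amsmath package, which may differ from the setup actually used for CJTCS; the second formula is invented for illustration.

```latex
\documentclass{article}
\usepackage{amsmath} % assumption: \text is taken from amsmath

\newcommand{\setof}[2]{\{\,#1\mid#2\,\}}

\begin{document}

% The revised formula: one complete formula with textual connectives inside.
\(E=\setof{(i,j)}{\sigma_{n,k}(i,j)=1\text{ or }
(\sigma_{n,k}(i,j)=x_i\text{ and }x_i=1)}\)

% Another invented use of the same macro.
\(S=\setof{x}{x>0\text{ and }x<n}\)

\end{document}
```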

Markup of Text Structure

LaTeX is designed to help authors write and revise drafts of documents, leading to a final permanent printed copy. To a great extent, the structure that is helpful in controlling a sequence of revisions is also good for other sorts of information processing. But, some forms of sloppiness are tolerated in drafts, yet very harmful in archival material. Also, good support for revising drafts requires that systematic elements of a document, such as section numbers, are calculated by the processor. Many of these calculated elements should be stored explicitly in an archival version.

In the authors' source, notice the form in which subsection 3.1 is labelled:

\subsection{The goal: subexponential algorithms for NP}
\label{sec:goal}
Each time the draft is formatted, LaTeX automatically assigns an appropriate number for the subsection, and displays that number in the section title. Also, the ``\label{sec:goal}'' command creates a name for that section number, so that the form ``see section \ref{sec:goal}'' automatically fills in the section number in a cross reference. This automatic numbering saves authors immense labor when they add or delete sections. Labelled elements, such as theorems and definitions, are treated analogously. In an archival version, these dynamically computed numbers should be fixed permanently. So, I replace the section heading with
\asubsection{3.1}{The Goal: Subexponential Algorithms for
\protect\(\complexityclass{NP}\protect\)}
giving the number 3.1 explicitly. I substitute permanent numbers into cross references, also. This is much more convenient for browsers that need to search for a specific numbered element, and for other information processors that use these elements. A simple thought experiment shows why it is important to bind numerical labels into archival source. Imagine that a reader quotes Definition 1 from this article. The best way to do so is to cut and paste the LaTeX source, since that leaves the least chance of error. But, it is certainly wrong to have the definition renumbered into the scheme of the paper in which it is quoted. The macros ``\asection'', ``\alabsubtext'', etc. are defined by cjstruct to display labels explicitly in the source text. The ``a'' stands for ``absolute.'' These macros merely substitute the explicitly given labels for those that LaTeX generates dynamically, leaving all layout decisions to the normal LaTeX code.
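
The contrast between computed and explicit numbering can be sketched in standard LaTeX2e as follows. The macro ``\fixedsubsection'' is a hypothetical name invented for this illustration; it is not the cjstruct definition of ``\asubsection'', and only mimics its described effect of substituting a given label for the computed one.

```latex
\documentclass{article}

% Hypothetical sketch, NOT the cjstruct definition: print the label
% given in the source in place of the number LaTeX would compute.
\newcommand{\fixedsubsection}[2]{%
  \renewcommand{\thesubsection}{#1}%
  \subsection{#2}}

\begin{document}

% Draft style: the number is computed at each run, and \ref looks it
% up by name (a second LaTeX run is needed to resolve the reference).
\section{Draft Style}
\subsection{The goal}
\label{sec:goal}
See section \ref{sec:goal}.

% Archival style: the number 3.1 is written in the source, in the
% heading and in the cross reference alike.
\section{Archival Style}
\fixedsubsection{3.1}{The Goal}
See section 3.1.

\end{document}
```

In the archival form, a program that searches the source for ``3.1'' finds both the heading and the reference without running LaTeX at all.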

In addition to converting dynamically numbered elements to explicit numbering, I also add explicit markup and numbering for paragraphs. Essentially all authors use blank lines to denote paragraph breaks in LaTeX, because this is so easy to type and to recognize in text editors. They also tend to leave blank lines around elements other than paragraphs, such as theorems and definitions, in some cases because the resulting layout looks better, or in some cases by accident. This leaves a lot of ambiguity regarding the actual identities of paragraphs. In printed text, precise identification of paragraphs is not very important. In online text, search engines of the future are very likely to use paragraph structure to interpret texts, and other information processing applications may find unforeseen uses. Also, since readers may produce their own custom layouts of articles, page numbers are useless in citations of CJTCS articles, and paragraph numbers are an attractive substitute. To resolve ambiguity in paragraphing, and to allow easy citation of arbitrary segments of articles, I provide explicit paragraph marks with numbers, such as ``\apar{3.1-2}'' (the second paragraph in section 3.1). One of the few layout features in cjstruct is a feature for printing these paragraph numbers in the margins.
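
A paragraph mark of this kind might be approximated in standard LaTeX2e as in the sketch below. The macro ``\parmark'' is a hypothetical name invented here; the actual ``\apar'' of cjstruct may be defined quite differently.

```latex
\documentclass{article}

% Hypothetical sketch, NOT the cjstruct definition: end the previous
% paragraph, start a new one, and print its label in the margin.
\newcommand{\parmark}[1]{%
  \par\leavevmode\marginpar{\footnotesize #1}\ignorespaces}

\begin{document}

\parmark{3.1-1}
First paragraph of section 3.1.

\parmark{3.1-2}
Second paragraph of section 3.1. A citation can point to 3.1-2
however the reader chooses to paginate the text.

\end{document}
```

Because each paragraph begins with an explicit mark, blank lines around theorems or definitions no longer create doubt about where a paragraph starts.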

Having made paragraph breaks unambiguous, I decided to mark up sentence boundaries as well. It's not clear how useful this will be, but it's very easy to do. It is likely, but not certain, that future information processing and retrieval will use sentence structure. For sentences ending in a punctuation mark, normally a period, question mark, or exclamation mark, I precede the punctuation with the macro ``\@''. This is a LaTeX macro intended to prevent certain periods from being misunderstood as ends of abbreviations, rather than ends of sentences. It has no impact on layout otherwise. For sentences not ending in punctuation, I use the macro ``\sentence'', defined in cjstruct to have no effect on layout. Many sentences end with displayed math, and since periods or similar dots are often used in formulae, it is confusing to have a sentence-ending period in such a display. Sentence markup impresses many people as outrageously compulsive. It may turn out to have little value, but it is extremely easy to do. And, automatic parsing of sentences from unmarked text is much more difficult than it looks at first. There are remarkably many variant uses of punctuation, the use of periods for abbreviations being the most common. There are remarkably many cases where one sentence contains another. Embedded quotes, and formal elements such as definitions, often appear as noun phrases in larger sentences.
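
The two kinds of sentence mark can be sketched in standard LaTeX2e as follows. The empty definition of ``\sentence'' is a hypothetical stand-in for the cjstruct macro, and the placement of ``\sentence'' directly after the display is an assumption made for this illustration; the sentences themselves are invented.

```latex
\documentclass{article}

% Hypothetical stand-in for the cjstruct macro: it marks the end of
% a sentence in the source and produces nothing in the output.
\newcommand{\sentence}{}

\begin{document}

% \@ before the period: without it, LaTeX treats a period that follows
% a capital letter as ending an abbreviation, not a sentence.
The problem is in NP\@. The proof follows.

% A sentence that ends in displayed math carries no final period;
% the marker records the boundary instead.
The bound is
\[f(n) \le n^2\]\sentence

\end{document}
```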

What Can be Automated?