DRAFT IN PROGRESS

# History and Context

SKETCH THE HISTORY OF CJTCS, JFLP. EXPLAIN TEX/LATEX.

# Editing for Robustness and Flexibility

## How Much Should We Edit?

How much editing of articles should editors of online journals do? This question generates a fair bit of controversy, and proposed answers range from

• editors should only select articles, and publish unchanged the materials provided by authors to
• editors should polish articles into beautiful page images, which are stored online instead of printed.

I propose that online scholarly articles can benefit as much from editing as printed articles, but some of the editing is substantially different and serves a new purpose not achievable in print.

## Why Do We Edit?

### Content Editing (Keep it Up)

Editing of scholarly articles divides naturally into content and format editing, although the line between them is not precise, and the division of labor may involve other issues. Content editing is essentially the same problem whether the content is to be printed or published online. I have a strong conviction that good content editing is an important part of the service provided by a journal publisher. Although the question is not peculiar to online publishing, some of the relevant issues have broader application for online publication than for print. The value of content editing hinges on the type of reader that we cater to, which might include:

• someone who wants to verify the credentials of the author as a researcher, and particularly the author's claim to precedence for the particular results in the article,
• an expert on the topic who wants to learn the results in detail,
• a more general reader who wants to understand one or more aspects of the research.

I claim that research articles are not worth archiving unless they are written for a relatively general audience. Although scholarly publication supports the credentials business, that is a derivative value which disappears without the more fundamental value of sharing enlightenment through communication of ideas. The quick communication of results to fellow experts is well supported by unpublished or informally published technical reports. The point of archival publication is to communicate ideas to a broad range of readers, not all of them expert in the topic. The farther into the future we go, the less we can count on readers' familiarity with the technical context of an article. In my experience, most authors lapse several times per article into highly ambiguous, and sometimes erroneous, expressions that can only be decoded correctly by those who already understand the topic. I see no effective alternative to good content editing for producing genuinely readable articles.

### Format Editing (a New Sort of Format)

Format editing for print publication is essentially the creation of good legible typographical layout for the content. For online publishing, layout is less relevant to editors, because authors often do a pretty good job, and because readers also take over some of the responsibility in return for the flexibility of producing custom formats tailored to their needs and facilities. There is an attractive argument that we should abandon format editing, making online publishing cheaper and more efficient.

Instead of dropping format editing, I propose to spend a similar effort toward a new goal. I propose that we replace most of the editing that produces good layout with editing for robust and flexible source structure. We will need some discipline to do so, since the benefits of well structured source will be invisible in the short run. But, in the longer term, well structured source may make the difference between journal archives that offer a perpetual intellectual resource, and journal archives that disappear into obsolescence.

MENTION SEPARATION OF STRUCTURE, LAYOUT.

## What Do We Edit?

To understand format editing for robustness and flexibility, we need first to understand what the "format" of an online article is. The right sort of format is the subject of controversy, probably more intense than the controversy regarding editing. I have argued elsewhere that the definitive archival format of articles published online should present the textual structure of articles as transparently as possible, leaving details of the display, such as typographical layout, to readers. Layouts that are attractive to large numbers of readers should be provided when convenient as a derivative service, but the definitive copy of an article should be a structural source format, such as SGML or a disciplined form of LaTeX. Key reasons for preferring structural formats to typographical or pictorial formats include:

1. Structural formats come closest to presenting the information contributed by the author.
2. It is much easier to produce good typographical layout from a structural presentation than vice versa.
3. Structural presentations are much easier to parse automatically, making them more useful for
   1. innovative information processing and retrieval applications and
   2. automatic conversion to new formats when the current one becomes obsolete.
4. Structural formats give maximum power to readers to process the information in articles however they please, not just in ways explicitly supported by publishers.

SGML (HTML is essentially a subset of SGML) is a reasonable structural format for plain text, but there are no satisfactory utilities for displaying mathematical formulae in SGML documents today. The Astrophysical Journal, published by the University of Chicago Press, uses SGML as a source format, with automatic conversion to LaTeX to render the mathematics. CJTCS depends even more than the Astrophysical Journal on the presentation of mathematics, and we do not enjoy the programming manpower that they have, so we chose a disciplined form of LaTeX as our source format.
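To make the distinction between structural and typographical markup concrete, here is a minimal sketch of the same sentence marked up both ways in LaTeX. The macro name \term and the sample sentence are assumptions introduced only for illustration; they are not taken from cjstruct or from any CJTCS article.

```latex
\documentclass{article}
% \term is a hypothetical structural macro: it records that its argument
% is a newly defined term, and leaves the appearance to this one definition.
\newcommand{\term}[1]{\emph{#1}}
\begin{document}
% Typographical markup: says only "italic", not why.
A graph is {\itshape bipartite} if its vertices split into two independent sets.

% Structural markup: says "this is a defined term"; a reader may
% redefine \term to change the display without touching the text.
A graph is \term{bipartite} if its vertices split into two independent sets.
\end{document}
```

Both sentences typeset identically with the definition shown, but only the second tells an automatic processor what the italics mean.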

The case for structural formats is a bit hard to grasp concretely, since it depends, not on the look of a standard display of an article right now, but on the potential for displaying and processing the article in unpredictable ways in the unknown future. For the most part, we have not yet enjoyed the payoff. So far, Evan Owens of the Astrophysical Journal reports that the choice to base the old printed journal on SGML made rapid conversion to online publication very feasible, instead of horrifically difficult. CJTCS made the transition from the obsolescent LaTeX version 2.09 to the new LaTeX2e without a hitch, largely due to the robustness of our source format. CJTCS is also experimenting with audible display of articles, using T. V. Raman's AsTeR system. Dr. Raman discusses in his dissertation why a structural format, such as disciplined LaTeX, allows automatic audio display and even browsing of mathematics and text, while popular typographical formats, such as PostScript and PDF, do not.

Since it is treated elsewhere, I ask you to postulate the value of structural formats as the definitive archival forms of published articles. This paper treats the need for intelligent editing to produce such formats.

## A Sample Editing History

I will demonstrate the value of editing authors' LaTeX source by tracing the key changes in an article recently published in CJTCS. Uriel Feige and Joe Kilian have kindly allowed me to quote from their original manuscript, which is normally held confidential. This manuscript is fairly typical of what authors provide. Feige and Kilian provided me with a LaTeX and BIBTeX source that produced very readable text, but was not a reliable presentation of textual structure. I edited their source into a disciplined subset of LaTeX, supported by my cjstruct macros (available as freeware under the GNU General Public License). cjstruct does very little to determine typographical layout, leaving most of that to parameters that may be set by each reader. Rather, it provides a translation between a form that is structurally similar to SGML, and the typographical macro calls that are supported directly by LaTeX. You should be able to understand LaTeX and cjstruct well enough for current purposes just by following the examples in the edit history.

### The Manuscript From the Authors

Here are two sample typeset pages from the draft accepted by the CJTCS editors, essentially as it was transmitted by the authors, in PostScript and DVI formats. I say "essentially," because the authors submitted LaTeX source, and the precise look of the result may be somewhat different at different sites. This example is typical of the quality of typeset mathematical copy produced naively by authors using LaTeX. It is highly readable. Most computer scientists are thoroughly satisfied to read material of this visual quality, particularly when it is printed on a high-resolution laser printer. Most of the free PostScript browsers widely available on UNIX systems are not very satisfactory for reading such text, but they will improve quickly. LaTeX produces a less widely used typographical format, called DVI, for which very nice free browsers are already available to run on UNIX. If this typeset display were the real published product, there might be little or no need for format editing.

Here are corresponding pages from a standard display of the final published article, also in PostScript and DVI. The difference is not dramatic. The nearly identical pagination is an unusual and insignificant coincidence.

Now, look at the LaTeX source for Sections 3.1 and 3.2 of the original draft versus the published version. I suggest that you open these two files in two additional windows, and compare them as you read ahead. In LaTeX source, "%" makes the remainder of a line into an unprocessed comment, "\" introduces a macro command, and "{" and "}" provide grouping, mostly for arguments to macros. I describe other special forms in LaTeX as they come up. The remainder of this section describes the key format-editing steps that transformed the former LaTeX source into the latter. You may also notice some minor changes introduced by copy editing. I follow a conceptual, and nearly chronological, order.


LaTeX lays this out reasonably, but the structure is all wrong, since the individual marked formulae are incomplete and make no sense on their own. I would revise this to
$$E=\setof{(i,j)}{\sigma_{n,k}(i,j)=1\text{ or } (\sigma_{n,k}(i,j)=x_i\text{ and }x_i=1)}$$

with the macro definition
\newcommand{\setof}[2]{\{\,#1\mid#2\,\}}

Here is the typeset result in PostScript and DVI formats. "\text" formats normal text inside a mathematical formula. The important point is that "or" and "and" are marked correctly as textual elements inside the mathematical formula. The other changes are mainly for typographical layout: "\mid" is the vertical bar symbol ("|"), treated as a mathematical relation, which affects typographical spacing. "\," provides a custom adjustment of spacing.
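One benefit of a structural macro such as \setof is that the layout of every set expression is governed by a single definition. The following is a minimal sketch of that idea, assuming the amsmath package for \text; the sample set and the alternative colon layout are illustrations of mine, not taken from the article under discussion.

```latex
\documentclass{article}
\usepackage{amsmath} % provides \text
% Definition quoted in the text above.
\newcommand{\setof}[2]{\{\,#1\mid#2\,\}}
\begin{document}
% Illustrative formula; typeset with a vertical bar as separator.
\[ S=\setof{x}{x>0\text{ and }x\text{ is even}} \]
% A reader who prefers a colon changes one definition;
% the source of the formula itself stays untouched.
\renewcommand{\setof}[2]{\{\,#1 : #2\,\}}
\[ S=\setof{x}{x>0\text{ and }x\text{ is even}} \]
\end{document}
```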

#### Markup of Text Structure

LaTeX is designed to help authors write and revise drafts of documents, leading to a final permanent printed copy. To a great extent, the structure that is helpful in controlling a sequence of revisions is also good for other sorts of information processing. But some forms of sloppiness are tolerated in drafts, yet very harmful in archival material. Also, good support for revising drafts requires that systematic elements of a document, such as section numbers, are calculated by the processor. Many of these calculated elements should be stored explicitly in an archival version.

In the authors' source, notice the form in which subsection 3.1 is labelled:

\subsection{The goal: subexponential algorithms for NP}
\label{sec:goal}

Each time the draft is formatted, LaTeX automatically assigns an appropriate number for the subsection, and displays that number in the section title. Also, the "\label{sec:goal}" command creates a name for that section number, so that the form "see section \ref{sec:goal}" automatically fills in the section number in a cross reference. This automatic numbering saves authors immense labor when they add or delete sections. Labelled elements, such as theorems and definitions, are treated analogously. In an archival version, these dynamically computed numbers should be fixed permanently. So, I replace the section heading with
\asubsection{3.1}{The Goal: Subexponential Algorithms for
\protect\(\complexityclass{NP}\protect\)}

giving the number 3.1 explicitly. I substitute permanent numbers into cross references, also. This is much more convenient for browsers that need to search for a specific numbered element, and for other information processors that use these elements. A simple thought experiment shows why it is important to bind numerical labels into archival source. Imagine that a reader quotes Definition 1 from this article. The best way to do so is to cut and paste the LaTeX source, since that leaves the least chance of error. But it is certainly wrong to have the definition renumbered into the scheme of the paper in which it is quoted. The macros "\asection", "\alabsubtext", etc. are defined by cjstruct to display labels explicitly in the source text. The "a" stands for "absolute." These macros merely substitute the explicitly given labels for those that LaTeX generates dynamically, leaving all layout decisions to the normal LaTeX code.
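The contrast between dynamic and absolute numbering can be sketched in a small self-contained document. The definition of \asubsection below is a hypothetical stand-in written for this illustration, built on the standard starred \subsection*; the actual cjstruct definition may well differ, and the section titles and numbers are invented.

```latex
\documentclass{article}
% Hypothetical stand-in for the cjstruct macro of the same name:
% prints the label given in the source instead of a computed one.
\newcommand{\asubsection}[2]{\subsection*{#1\quad #2}}
\begin{document}
\section{Preliminaries}
% Draft style: LaTeX computes the number and resolves the cross
% reference (a second run is needed before \ref prints the number).
\subsection{The goal}
\label{sec:goal}
See section \ref{sec:goal}.

% Archival style: the number is written out in the heading
% and in the cross reference, so cut and paste preserves it.
\asubsection{1.2}{The Goal, With an Explicit Number}
See section 1.2.
\end{document}
```

If the archival fragment is pasted into another document, the heading and the reference still read 1.2, whereas the draft fragment would be renumbered by its new surroundings.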

In addition to converting dynamically numbered elements to explicit numbering, I also add explicit markup and numbering for paragraphs. Essentially all authors use blank lines to denote paragraph breaks in LaTeX, because this is so easy to type and to recognize in text editors. They also tend to leave blank lines around elements other than paragraphs, such as theorems and definitions, in some cases because the resulting layout looks better, or in some cases by accident. This leaves a lot of ambiguity regarding the actual identities of paragraphs. In printed text, precise identification of paragraphs is not very important. In online text, search engines of the future are very likely to use paragraph structure to interpret texts, and other information processing applications may find unforeseen uses. Also, since readers may produce their own custom layouts of articles, page numbers are useless in citations of CJTCS articles, and paragraph numbers are an attractive substitute. To resolve ambiguity in paragraphing, and to allow easy citation of arbitrary segments of articles, I provide explicit paragraph marks with numbers, such as "\apar{3.1-2}" (the second paragraph in section 3.1). One of the few layout features in cjstruct is a feature for printing these paragraph numbers in the margins.
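A paragraph mark of this kind might be sketched as follows. The definition of \apar here is an assumption of mine built from the standard \marginpar command, offered only to show the idea; it is not the cjstruct definition, and the paragraph text is invented.

```latex
\documentclass{article}
% Hypothetical stand-in for the cjstruct paragraph macro.
% It ends the previous paragraph, starts a new one, and prints
% the explicit paragraph label in the margin.
\newcommand{\apar}[1]{\par\leavevmode\marginpar{\footnotesize #1}\ignorespaces}
\begin{document}
\apar{3.1-1}
First paragraph of section 3.1.

\apar{3.1-2}
Second paragraph, citable as paragraph 3.1-2 whatever the page layout.
\end{document}
```

Because the paragraph boundary is carried by the macro rather than by a blank line, stray blank lines around theorems or displays no longer create ambiguous paragraphs.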

Having made paragraph breaks unambiguous, I decided to mark up sentence boundaries as well. It's not clear how useful this will be, but it's very easy to do. It is likely, but not certain, that future information processing and retrieval will use sentence structure. For sentences ending in a punctuation mark, normally period, question mark, or exclamation mark, I precede the punctuation with the macro "\@". This is a LaTeX macro intended to prevent certain periods from being misunderstood as ends of abbreviations, rather than ends of sentences. It has no impact on layout otherwise. For sentences not ending in punctuation, I use the macro "\sentence", defined in cjstruct to have no effect on layout. Many sentences end with displayed math, and since periods or similar dots are often used in formulae, it is confusing to have a sentence-ending period in such a display. Sentence markup impresses many people as outrageously compulsive. It may turn out to have little value, but it is extremely easy to do. And automatic parsing of sentences from unmarked text is much more difficult than it looks at first. There are remarkably many variant uses of punctuation, the use of periods for abbreviations being the most common. There are remarkably many cases where one sentence contains another. Embedded quotes, and formal elements such as definitions, often appear as noun phrases in larger sentences.
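The two kinds of sentence mark can be seen together in a short sketch. \@ is standard LaTeX; the empty definition of \sentence is a hypothetical stand-in for the cjstruct macro, and the sample text and formula are invented for illustration.

```latex
\documentclass{article}
% Hypothetical stand-in for the cjstruct macro: marks the end of a
% sentence that has no closing punctuation, and prints nothing.
\newcommand{\sentence}{}
\begin{document}
% \@ before the period marks it as sentence-ending,
% even though it follows a capital letter.
The problem is in NP\@. The proof follows.

% A sentence ending in displayed math carries no period;
% \sentence records the boundary instead.
The bound is
\[ t(n) = 2^{n^{\epsilon}} \]
\sentence
We use it below.
\end{document}
```

A processor that scans for "\@" followed by punctuation, or for "\sentence", can then find sentence ends without guessing which periods belong to abbreviations or formulae.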