How much editing of articles should editors of online journals do? This question generates a fair bit of controversy, and proposed answers range from
Editing of scholarly articles divides naturally into content and format editing, although the line between them is not precise, and the division of labor may involve other issues. Content editing is essentially the same problem whether the content is to be printed or published online. I have a strong conviction that good content editing is an important part of the service provided by a journal publisher. Although the question is not peculiar to online publishing, some of the relevant issues have broader application for online publication than for print. The value of content editing hinges on the type of reader that we cater to, which might include:
Format editing for print publication is essentially the creation of good legible typographical layout for the content. For online publishing, layout is less relevant to editors, because authors often do a pretty good job, and because readers also take over some of the responsibility in return for the flexibility of producing custom formats tailored to their needs and facilities. There is an attractive argument that we should abandon format editing, making online publishing cheaper and more efficient.
Instead of dropping format editing, I propose to spend a similar effort toward a new goal. I propose that we replace most of the editing that produces good layout with editing for robust and flexible source structure. We will need some discipline to do so, since the benefits of well structured source will be invisible in the short run. But, in the longer term, well structured source may make the difference between journal archives that offer a perpetual intellectual resource, and journal archives that disappear into obsolescence.
To understand format editing for robustness and flexibility, we need first to understand what the ``format'' of an online article is. The right sort of format is the subject of controversy, probably more intense than the controversy regarding editing. I have argued elsewhere that the definitive archival format of articles published online should present the textual structure of articles as transparently as possible, leaving details of the display, such as typographical layout, to readers. Layouts that are attractive to large numbers of readers should be provided when convenient as a derivative service, but the definitive copy of an article should be a structural source format, such as SGML or a disciplined form of LaTeX. Key reasons for preferring structural formats to typographical or pictorial formats include:
The case for structural formats is a bit hard to grasp concretely, since it depends not on the look of a standard display of an article right now, but on the potential for displaying and processing the article in unpredictable ways in the unknown future. For the most part, we have not yet enjoyed the payoff. So far, Evan Owens of the Astrophysical Journal reports that the choice to base the old printed journal on SGML made rapid conversion to online publication feasible, instead of horrifically difficult. CJTCS made the transition from the obsolescent LaTeX version 2.09 to the new LaTeX2e without a hitch, largely due to the robustness of our source format. CJTCS is also experimenting with audible display of articles, using T. V. Raman's AsTeR system. Dr. Raman discusses in his dissertation why a structural format, such as disciplined LaTeX, allows automatic audio display and even browsing of mathematics and text, while popular typographical formats, such as PostScript and PDF, do not.
Since it is treated elsewhere, I ask you to postulate the value of structural formats as the definitive archival forms of published articles. This paper treats the need for intelligent editing to produce such formats.
I will demonstrate the value of editing authors' LaTeX source by tracing the key changes in an article recently published in CJTCS. Uriel Feige and Joe Kilian have kindly allowed me to quote from their original manuscript, which is normally held confidential. This manuscript is fairly typical of what authors provide. Feige and Kilian provided me with a LaTeX and BIBTeX source that produced very readable text, but was not a reliable presentation of textual structure. I edited their source into a disciplined subset of LaTeX, supported by my cjstruct macros (available as freeware under the GNU General Public License). cjstruct does very little to determine typographical layout, leaving most of that to parameters that may be set by each reader. Rather, it provides a translation between a form that is structurally similar to SGML, and the typographical macro calls that are supported directly by LaTeX. You should be able to understand LaTeX and cjstruct well enough for current purposes just by following the examples in the edit history.
``On Limited versus Polynomial Nondeterminism''
Here are two sample typeset pages from the draft accepted by the CJTCS editors, essentially as it was transmitted by the authors, in PostScript and DVI formats. I say ``essentially,'' because the authors submitted LaTeX source, and the precise look of the result may be somewhat different at different sites. This example is typical of the quality of typeset mathematical copy produced naively by authors using LaTeX. It is highly readable. Most computer scientists are thoroughly satisfied to read material of this visual quality, particularly when it is printed on a high-resolution laser printer. Most of the free PostScript browsers widely available on UNIX systems are not very satisfactory for reading such text, but they will improve quickly. LaTeX produces a less widely used typographical format, called DVI, for which very nice free browsers are already available to run on UNIX. If this typeset display were the real published product, there might be little or no need for format editing.
Here are corresponding pages from a standard display of the final published article, also in PostScript and DVI. The difference is not dramatic. The nearly identical pagination is an unusual and insignificant coincidence.
Now, look at the LaTeX source for Sections 3.1 and 3.2 of the original draft versus the published version. I suggest that you open these two files in two additional windows, and compare them as you read ahead. In LaTeX source, ``%'' makes the remainder of a line into an unprocessed comment, ``\'' introduces a macro command, ``{'' and ``}'' provide grouping, mostly for arguments to macros. I describe other special forms in LaTeX as they come up. The remainder of this section describes the key format-editing steps that transformed the former LaTeX source into the latter. You may also notice some minor changes introduced by copy editing. I follow a conceptual, and nearly chronological, order.
LaTeX provides two different ways to bracket mathematical formulae. There is no good reason for the difference; it is just there. Authors normally use ``$'' both before and after mathematical formulae that are worked into a line of text, and ``$$'' before and after formulae displayed on a separate line. This form is easier to type. CJTCS uses ``\('' and ``\)'' to bracket inline formulae, and ``\['' and ``\]'' to bracket displayed formulae, because this form is more robust to parse: the opening and closing brackets are distinct. This is a very minor point, but it may confuse you when comparing the two sources. A substantial minority of articles have a mathematical formula contained in text that is itself part of a mathematical formula, leading to nesting of these brackets. In principle, they may be nested quite deeply in LaTeX source, but I have never seen more than one level of nesting.
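As a minimal sketch of the two bracketing styles, the fragment below sets the same inline and displayed formula both ways, and then shows one level of nesting. The sample formulae are invented for illustration and are not taken from the Feige and Kilian source; the amsmath package is assumed only because it supplies ``\text''.

```latex
\documentclass{article}
\usepackage{amsmath} % assumed here only to supply \text for the nested case
\begin{document}
% Author style: compact to type, but the opening and closing
% delimiters are the same token.
The bound $f(n) \le n^2$ holds, and in display: $$f(n) \le n^2$$

% CJTCS style: distinct opening and closing delimiters.
The bound \(f(n) \le n^2\) holds, and in display: \[f(n) \le n^2\]

% One level of nesting: a formula inside text inside a formula.
\[ S = \{\, x \mid x \text{ is a multiple of \(k\)} \,\} \]
\end{document}
```

With the second style, a parser can tell an opening bracket from a closing one without counting, which matters most in the nested case.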
Mathematical formulae use single letters in various alphabets to denote variables. This is the most common use of single letters. But a significant minority of single letters denote special operations. In this article, large and small roman letters ``O'' and ``o'', and large greek letters theta and omega, are used systematically for special operators that denote four different qualities of the growth rates of functions. To encourage flexible uses of formulae from articles, including searching for particular forms and exporting formulae into symbolic math utilities, CJTCS articles distinguish such operators with macro names, in this case ``\orderle'', ``\ordereq'', ``\orderlt'', and ``\orderge''. The definitions
\newcommand{\orderle}[1]{O(#1)}
\newcommand{\ordereq}[1]{\Theta(#1)}
\newcommand{\orderlt}[1]{o(#1)}
\newcommand{\orderge}[1]{\Omega(#1)}
at the head of the published LaTeX source define these
operators and their standard typographical displays. These four
operators are very common in CJTCS articles, so they are
standardized across all articles. Other operators may be peculiar to a
single article. I contemplated creating a catalog of all the operators
likely to be used widely, but rejected the idea as too costly at this
stage. Eventually, someone will surely create such a
standard catalog, and it will be extremely valuable in promoting
interoperability between different systems that process mathematical
formulae. CJTCS articles are reasonably well positioned to
adapt to such a standard when it arises.
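The following sketch shows how the four definitions might be used in practice. The definitions are the ones quoted above; the sample sentence about sorting and the redefinition at the end are hypothetical, added only to show that the display of an operator can change without touching the body text.

```latex
\documentclass{article}
% Definitions as given at the head of the published source.
\newcommand{\orderle}[1]{O(#1)}
\newcommand{\ordereq}[1]{\Theta(#1)}
\newcommand{\orderlt}[1]{o(#1)}
\newcommand{\orderge}[1]{\Omega(#1)}
\begin{document}
% Hypothetical author form: the operator is just a letter.
Sorting takes $O(n \log n)$ comparisons.

% Structural form: the operator is named, the display is unchanged.
Sorting takes \(\orderle{n \log n}\) comparisons\@.

% A hypothetical reader preference: redefine one operator's display
% without editing the body of the article.
\renewcommand{\orderle}[1]{\mathcal{O}(#1)}
Sorting takes \(\orderle{n \log n}\) comparisons\@.
\end{document}
```

A search tool looking for upper bounds can match ``\orderle'' directly, rather than guessing which occurrences of the letter ``O'' are operators.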
Another special operator is denoted ``\inputsize'' in the published source. Notice that ``\inputsize{x}'' in the published version is ``|x|'' in the original draft. In the typeset copy you see the variable ``x'' surrounded by vertical bars. The definition
\newcommand{\inputsize}[1]{\mathopen{|}#1\mathclose{|}}
at the head of the published source not only defines this common
standard operator uniformly, but it provides layout information to
LaTeX so that the spacing around the vertical bars is
appropriately asymmetric. I try, usually with success, to isolate
all such detailed layout information in the same way, so that the main body
of the source reflects structure alone, and not layout.
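One place where the asymmetric spacing becomes visible is a leading minus sign inside the bars. The sketch below uses the ``\inputsize'' definition quoted above; the argument ``-x'' is an invented test case, chosen only because it makes the spacing difference easy to see.

```latex
\documentclass{article}
\newcommand{\inputsize}[1]{\mathopen{|}#1\mathclose{|}}
\begin{document}
% Bare bars are ordinary symbols, so TeX treats the minus sign
% after the first bar as a binary operator and pads it with space.
$|-x|$

% With \mathopen, the minus sign follows an opening delimiter and
% is set as a unary sign, with no extra space around it.
\(\inputsize{-x}\)
\end{document}
```

The body of the article never mentions ``\mathopen'' or ``\mathclose''; that layout detail lives only in the one definition.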
Authors who create articles in LaTeX often define short macros for longer pieces of text and/or layout instructions just to speed up their typing. These macros are often recognizable by names that refer to typographical, rather than mathematical, concepts. They are perfectly sensible from the authors' point of view, but they confuse structure in the definitive published form, so I remove them. None of these occurs in Sections 3.1 and 3.2, but elsewhere in the original draft I find
A set $U$ of cardinality $n$ and a family $\cF$ of $n$ subsets of $U$.
where ``\cF'' is defined at the head of the source by
\newcommand{\cF}{{\cal F}}
(``\cal'' selects a calligraphic shape for the letter ``F'').
I revised the source for this sentence to
A set \(U\) of cardinality \(n\) and
a family \(\setofset{F}\) of \(n\) subsets of \(U\)\@.
with the macro definition
\newcommand{\setofset}[1]{\mathcal{#1}}
In the revised version, the letter ``F'' is visible directly in the
presentation of the formula, and the macro ``\setofset'' indicates the
mathematical type of object denoted by ``F''. The change from ``\cal''
to ``\mathcal'' is merely a conversion from an older style for
specifying font choices to a newer and more systematic one in the
latest version of LaTeX.
Notice that authors tend to prefer forms that are shorter to type, while I try to impose forms that are more transparent to parse.
In technical articles, with mathematical formulae woven into nearly every line of text, it is sometimes subtle to determine the appropriate boundaries for a formula. Authors are usually content if their printed copy looks reasonable, but I want formulae to be identified correctly for automatic search engines. Feige and Kilian did a pretty good job of marking formulae, but they lapsed once into the form
A major open question in computational complexity is whether P$=$NP.
``P'' and ``NP'' here are formal mathematical symbols, denoting classes of problems requiring certain amounts of time to compute. The equation ``P=NP'' is a mathematical formula, but the authors' source suggests that it is a strange concatenation of two pieces of text with the mathematical symbol ``=''. I revised this sentence to
A major open question in computational complexity is whether
\(\complexityclass{P}=\complexityclass{NP}\)\@.
with the macro definition
\mathstyleclass{\complexityclass}{\mathord}{\textsl}
which marks ``P'' and ``NP'' as special symbols denoting complexity
classes, and causes them to be displayed with the spacing of ordinary
mathematical variables, but using a slanted type shape. (For those who
know LaTeX, ``\mathstyleclass'' is a new command defined in
cjstruct, not a standard LaTeX command.)
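For readers without cjstruct at hand, a rough stand-in with a similar visible effect might be written with standard LaTeX2e commands alone. This is an assumption about the intended behavior (ordinary-symbol spacing, slanted shape), not the cjstruct implementation of ``\mathstyleclass''.

```latex
\documentclass{article}
% Stand-in for the cjstruct declaration
%   \mathstyleclass{\complexityclass}{\mathord}{\textsl}
% written with standard LaTeX2e commands only (an assumption, not
% the cjstruct implementation).
\newcommand{\complexityclass}[1]{\mathord{\textsl{#1}}}
\begin{document}
A major open question in computational complexity is whether
\(\complexityclass{P}=\complexityclass{NP}\)\@.
\end{document}
```

The whole equation now sits inside one pair of math brackets, and each class name is tagged with its mathematical role.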
Once I have decided that forms, such as ``P'' and ``NP'', are really mathematical symbols, rather than text, I mark them consistently as math, even when they appear alone in the middle of text. Look at the subsection title at the very beginning of the source for sections 3.1 and 3.2. The authors entered ``NP'' as bare text, but I marked it as mathematics. (The macro ``\protect'' is required due to a bêtise of LaTeX: it has no structural significance.)
The scholarly literature on complexity theory is full of long acronyms and phrases, used in formulae as single mathematical symbols. These cause headaches for typographical layout. Feige and Kilian used phrases presented in capital letters for this purpose, the longest being
$k$-MONOTONE CIRCUIT SATISFIABILITY
In many articles, such phrases appear in substantial formulae, and I must mark them in a similar way to ``P'' and ``NP''. I try to negotiate shorter names, and more graceful typography, with the authors when possible. In this article, I found only one short formula involving such a phrase cum symbol. So, I replaced that formula with English text, and converted all of the phrases similarly into text, for example
the \(k\)-Monotone Circuit Satisfiability problem
In some articles, there are short textual phrases embedded in mathematical formulae. These are a bit subtle to represent correctly. The examples from actual articles are tangled with other issues, so here is a phony example of what might have been submitted.
$E=\{(i,j)~|~\sigma_{n,k}(i,j)=1$ or $(\sigma_{n,k}(i,j)=x_i$ and $x_i=1)\}$
LaTeX lays this out reasonably, but the structure is all wrong,
since the individual marked formulae are incomplete and make no
sense on their own. I would revise this to
\(E=\setof{(i,j)}{\sigma_{n,k}(i,j)=1\text{ or }
(\sigma_{n,k}(i,j)=x_i\text{ and }x_i=1)}\)
with the macro definition
\newcommand{\setof}[2]{\{\,#1\mid#2\,\}}
Here is the typeset result in PostScript and DVI formats. ``\text''
formats
normal text inside a mathematical formula. The important point is that
``or'' and ``and'' are marked correctly as textual elements inside the
mathematical formula. The other changes are mainly for typographical
layout: ``\mid'' is the vertical bar symbol (``|''), treated as a
mathematical relation, which affects typographical spacing. ``\,''
provides a custom adjustment of spacing.
LaTeX is designed to help authors write and revise drafts of documents, leading to a final permanent printed copy. To a great extent, the structure that is helpful in controlling a sequence of revisions is also good for other sorts of information processing. But some forms of sloppiness are tolerated in drafts, yet very harmful in archival material. Also, good support for revising drafts requires that systematic elements of a document, such as section numbers, are calculated by the processor. Many of these calculated elements should be stored explicitly in an archival version.
In the authors' source, notice the form in which subsection 3.1 is labelled:
\subsection{The goal: subexponential algorithms for NP}
\label{sec:goal}
Each time the draft is formatted, LaTeX automatically assigns
an appropriate number for the subsection, and displays that number in
the section title. Also, the ``\label{sec:goal}'' command creates a
name for that section number, so that the form ``see section
\ref{sec:goal}'' automatically fills in the section number in a cross
reference. This automatic numbering saves authors immense labor when
they add or delete sections. Labelled elements, such as theorems and
definitions, are treated analogously. In an archival version, these
dynamically computed numbers should be fixed permanently. So, I
replace the section heading with
\asubsection{3.1}{The Goal: Subexponential Algorithms for
\protect\(\complexityclass{NP}\protect\)}
giving the number 3.1 explicitly. I substitute permanent numbers into
cross references, also. This is much more convenient for browsers that
need to search for a specific numbered element, and for other
information processors that use these elements. A simple thought
experiment shows why it is important to bind numerical labels into
archival source. Imagine that a reader quotes Definition 1 from this
article. The best way to do so is to cut and paste the LaTeX
source, since that leaves the least chance of error. But, it is
certainly wrong to have the definition renumbered into the scheme of
the paper in which it is quoted. The macros ``\asection'',
``\alabsubtext'', etc. are defined by cjstruct to display
labels explicitly in the source text. The ``a'' stands for
``absolute.'' These macros merely substitute the explicitly given
labels for those that LaTeX generates dynamically, leaving
all layout decisions to the normal LaTeX code.
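A hedged sketch of how an absolute-numbering macro of this kind could be built on top of standard LaTeX follows. The definition of ``\asubsection'' below is a hypothetical stand-in, not the cjstruct code, and the ``\complexityclass'' definition is likewise a simplified substitute so that the fragment compiles on its own.

```latex
\documentclass{article}
% Simplified substitute for the cjstruct-based definition.
\newcommand{\complexityclass}[1]{\mathord{\textsl{#1}}}
% Hypothetical stand-in for cjstruct's \asubsection: print the number
% given in the source instead of the one LaTeX would compute, and
% leave all layout to the standard \subsection.
\newcommand{\asubsection}[2]{%
  \renewcommand{\thesubsection}{#1}%
  \subsection{#2}}
\begin{document}
% Draft form: number computed on each run, referenced by name.
%   \subsection{The goal: subexponential algorithms for NP}
%   \label{sec:goal}
%   ... see Section~\ref{sec:goal} ...

% Archival form: number fixed in the source, referenced literally.
\asubsection{3.1}{The Goal: Subexponential Algorithms for
  \protect\(\complexityclass{NP}\protect\)}
As discussed in Section 3.1, \ldots
\end{document}
```

Because the number is part of the source text, a fragment cut from the archival file and pasted elsewhere still carries the label ``3.1''.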
In addition to converting dynamically numbered elements to explicit numbering, I also add explicit markup and numbering for paragraphs. Essentially all authors use blank lines to denote paragraph breaks in LaTeX, because this is so easy to type and to recognize in text editors. They also tend to leave blank lines around elements other than paragraphs, such as theorems and definitions, in some cases because the resulting layout looks better, or in some cases by accident. This leaves a lot of ambiguity regarding the actual identities of paragraphs. In printed text, precise identification of paragraphs is not very important. In online text, search engines of the future are very likely to use paragraph structure to interpret texts, and other information processing applications may find unforeseen uses. Also, since readers may produce their own custom layouts of articles, page numbers are useless in citations of CJTCS articles, and paragraph numbers are an attractive substitute. To resolve ambiguity in paragraphing, and to allow easy citation of arbitrary segments of articles, I provide explicit paragraph marks with numbers, such as ``\apar{3.1-2}'' (the second paragraph in section 3.1). One of the few layout features in cjstruct is a feature for printing these paragraph numbers in the margins.
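The paragraph marks can be imitated in standard LaTeX along the following lines. The definition of ``\apar'' here is a guess at the general idea (close the previous paragraph, open a new one, print the label in the margin); the real cjstruct macro may work differently, and the sample sentences are invented.

```latex
\documentclass{article}
% Hypothetical stand-in for cjstruct's \apar: end the previous
% paragraph, start a new one, and print the given paragraph number
% in the margin. The real macro may differ.
\newcommand{\apar}[1]{%
  \par\leavevmode\marginpar{\footnotesize #1}\ignorespaces}
\begin{document}
\apar{3.1-1}
The first paragraph of Section 3.1 starts here\@.
It may run over several sentences\@.
\apar{3.1-2}
The second paragraph starts here, with no blank line needed to
separate it from the first\@.
\end{document}
```

With explicit marks, blank lines in the source carry no structural meaning, so stray ones around theorems or definitions can no longer be mistaken for paragraph breaks.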
Having made paragraph breaks unambiguous, I decided to mark up sentence boundaries as well. It's not clear how useful this will be, but it's very easy to do. It is likely, but not certain, that future information processing and retrieval will use sentence structure. For sentences ending in a punctuation mark, normally a period, question mark, or exclamation mark, I precede the punctuation with the macro ``\@''. This is a LaTeX macro intended to prevent certain periods from being misunderstood as ends of abbreviations, rather than ends of sentences. It has no impact on layout otherwise. For sentences not ending in punctuation, I use the macro ``\sentence'', defined in cjstruct to have no effect on layout. Many sentences end with displayed math, and since periods or similar dots are often used in formulae, it is confusing to have a sentence-ending period in such a display. Sentence markup strikes many people as outrageously compulsive. It may turn out to have little value, but it is extremely easy to do. And automatic parsing of sentences from unmarked text is much more difficult than it looks at first. There are remarkably many variant uses of punctuation, the use of periods for abbreviations being the most common. There are remarkably many cases where one sentence contains another. Embedded quotes, and formal elements such as definitions, often appear as noun phrases in larger sentences.
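The two sentence marks might appear in source as follows. The empty definition of ``\sentence'' is an assumption based only on the statement that it has no effect on layout, and the sample sentences and formula are invented for illustration.

```latex
\documentclass{article}
% Stand-in for cjstruct's \sentence: the text describes it as having
% no effect on layout, so an empty macro is assumed here.
\newcommand{\sentence}{}
\begin{document}
% Sentence ending in punctuation: \@ goes just before the mark.
The reduction runs in polynomial time\@.
Does the converse hold\@?

% Sentence ending in displayed math, with no closing period:
% \sentence marks the boundary instead.
The running time is bounded by
\[ T(n) \le 2^{\sqrt{n}} \]
\sentence
The next sentence starts here\@.
\end{document}
```

A tool that splits on ``\@'' followed by punctuation, or on ``\sentence'', does not need to decide whether a given period ends an abbreviation or a sentence.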