CafebrMan -- Cafebr Manual

1. Introduction

Cafebr (Citation Amender/Formatter for Biological Research) is a tool to generate a reference list in a preferred format.

Cafebr was originally developed as a command-line program written in Perl (cafebr.pl). Many of its functions were later copied to a GUI program written in HTML5 and JavaScript (cafebr.html). All functions of the GUI version are available in its online version. This online version also offers links to download stand-alone (command-line and GUI) versions of Cafebr.

The command-line version works at least on Ubuntu 16.04 LTS with Perl v5.22.1 and on Windows 10 Home version 1709 with ActivePerl (from ActiveState, https://www.activestate.com/activeper, for Perl v5.24.3) installed. In such a Perl environment, Cafebr can be executed by changing to the directory that contains cafebr.pl on the command line and typing

perl cafebr.pl [options and arguments]

On Ubuntu, it may be necessary to make the cafebr.pl file executable (typically with the chmod command).
On Ubuntu, if 'PATH="$PATH:[path to the cafebr.pl directory]"' is added to the end of the ~/.profile file, Cafebr can be executed from any directory by typing "cafebr.pl [options and arguments]".

The GUI version works at least on Firefox (https://www.mozilla.org/ja/firefox/new/), Google Chrome (https://www.google.co.jp/chrome/index.html) and Microsoft Edge (https://www.microsoft.com/ja-jp/windows/microsoft-edge).

Cafebr tries to

(i) collect necessary information from either PubMed or given references, and

(ii) output it in a preferred format (to the standard output).

For step (i), Cafebr by default tries to collect the information from a given reference file.
To collect more precise information, the following options are also available.

--browse, -b: DB browsing option (*no argument is taken)
--search, -s: PubMed search option (*no argument is taken)
--xmlcorr, -x: PubMed XML information correction option (*no argument is taken)
--abscorr, -a: PubMed abstract text correction option (*no argument is taken)
--delimit, -d: Delimiter designation option (*no argument is taken)

For step (ii), the following options are available.

--fldfmt, -f [argument]: Field formatting option
--order, -o [argument]: Output ordering option

The following options are also available for convenience.

How they work is described in the following sections.

2. Methods to collect paper information

a. Default mode (*only for the command-line version)

Cafebr first tries to open the file whose path is given as the first argument. If Cafebr can open this file (the "reference file") and none of the options -b, -s, -x, -a and -d is given, it regards each line as the record of one paper and tries to output the records in the format designated by the option -f (see section 3). For output formats other than 'raw' or 'unchanged', Cafebr tries to extract the following pieces of information from each record:

Authors, Article Title, Publication Year, Journal, Volume, Issue, Pages, PubMed ID (PMID), PubMed Central ID (PMCID), and DOI.

To do this, Cafebr uses a journal name list ("jlist.txt") and a species name list ("slist.txt"). Both of these files were prepared from all relevant records in PubMed, and are present in the same directory as the cafebr execution file.

Cafebr regards the words starting with 'JournalTitle: ', 'MedAbbr: ' or 'IsoAbbr: ' in each line of "jlist.txt" as a journal name, and stores all the journal names. It regards the words in each line of "slist.txt" as the name of a species, and stores all the species names as well. Adding journal names and species names to "jlist.txt" and "slist.txt", or removing them, may improve output results.

When Cafebr finds neither a reference file nor a manuscript file (see the section 4 for the output ordering option), it asks whether the option -s, -x, -a or -d should be activated.

The GUI version of Cafebr does not support this function.

c. PubMed search option (--search, -s)

When the option -s is selected, Cafebr searches PubMed for given keywords. Hit records of papers are used for final output.

If Cafebr can open a reference file, it tries to find PMID, PMCID, and then DOI in each line, using 'pmid', 'pmc', and 'doi' (case-insensitive) as search keys. If any of them is available, it is submitted as a query key to PubMed.

If none of them is available, Cafebr generates a query key from each line by replacing all the characters other than the word characters (those in the regular expression "\w") with '+'. The resulting query key is then submitted to PubMed.
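The query-key logic described above can be sketched in Python as follows. This is only an illustration, not the Perl code of cafebr.pl: the function name build_query and the exact regular expressions for locating PMID, PMCID and DOI are assumptions, since the manual only states that 'pmid', 'pmc' and 'doi' are used as case-insensitive search keys in that order.

```python
import re

# Hypothetical patterns: the manual only says that 'pmid', 'pmc' and 'doi'
# are used as case-insensitive search keys, tried in this order.
ID_PATTERNS = [
    re.compile(r"pmid\W*(\d+)", re.IGNORECASE),
    re.compile(r"pmc\W*(\d+)", re.IGNORECASE),
    re.compile(r"doi\W*(10\.\S+)", re.IGNORECASE),
]


def build_query(line):
    """Return an identifier found in the line, or a '+'-joined fallback key."""
    for pattern in ID_PATTERNS:
        match = pattern.search(line)
        if match:
            return match.group(1)
    # Fallback: every character other than a word character (\w) becomes '+'.
    # re.ASCII restricts \w to [A-Za-z0-9_]; this choice is an assumption.
    return re.sub(r"\W", "+", line.strip(), flags=re.ASCII)


print(build_query("Tsugama (2014) Diceman"))
```

With this sketch, a line without any identifier, such as "Tsugama (2014) Diceman", becomes "Tsugama++2014++Diceman", because each space and parenthesis is replaced individually.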

When Cafebr finds neither a reference file nor a manuscript file, it asks for a line of keywords on the standard input. If either 'M' or 'm' is given at this point, Cafebr accepts multiple lines of keywords, and each line is used as one query.

When a query key has only one hit, Cafebr automatically uses it for final output. This behavior can be changed by the option -i (see section 5). If a query has multiple hits, Cafebr asks what to do and displays navigation messages.

The maximum number of hits is limited to 20 for the command-line version of Cafebr in order not to overload the NIH server. If 20 hits are present, it asks if more records should be fetched. Navigation messages appear again in this case.

Records of papers are fetched in the XML format, and the following pieces of information for final output are extracted from each record:
Authors, Article Title, Publication Year and Date, Journal, Volume, Issue, Pages, Abstract, Author Affiliation, PMID, PMCID, DOI, Attributes (Erratum, Comment, etc.).

e. PubMed abstract text correction option (--abscorr, -a)

When the option -a ("PubMed abstract correction" on the GUI version) is selected, Cafebr regards input references as PubMed abstract text-formatted. Multiple records of papers in the input are automatically split, and information for output is collected from each record.

If the command-line version of Cafebr cannot find a reference file, it asks for PubMed abstract text-formatted paper information on the standard input. The input can contain multiple records in this case as well.
For the GUI version, the PubMed abstract-formatted text should be entered in the text area for step 2. It is then separated into records and fields when the button "Use the above for step 3" is clicked.

Both the -x and -a options allow Cafebr to correctly collect necessary information, but the -x option may return better results in some cases.

f. Delimiter designation option (--delimit, -d)

When the option -d ("Delimiter" on the GUI version) is selected, Cafebr asks for delimiters that split input references into records (each corresponding to one paper) and fields (corresponding to authors, title, etc.). Carriage returns and line feeds ("[\r\n]+" as a regular expression) are used as the default record delimiter. Field delimiters are not defined by default and thus need to be input.

The record delimiter can be designated by either RS='[delimiter]' or RS='[delimiter]'=REX. The former uses the characters in [delimiter] literally, whereas the latter uses them as a regular expression.
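The difference between a literal record delimiter and one marked with '=REX' can be illustrated with a short Python sketch. The function name split_records and the decision to drop empty records are assumptions made for this illustration; only the RS='...' syntax and the default "[\r\n]+" come from the description above.

```python
import re

DEFAULT_RECORD_DELIMITER = r"[\r\n]+"  # default stated in the manual


def split_records(text, rs_spec=None):
    """Split text into records using RS='...' or RS='...'=REX."""
    if rs_spec is None:
        pattern = DEFAULT_RECORD_DELIMITER
    else:
        match = re.fullmatch(r"RS='(.*)'(=REX)?", rs_spec)
        if match is None:
            raise ValueError(f"unrecognized record delimiter: {rs_spec!r}")
        delimiter = match.group(1)
        is_regex = match.group(2) is not None
        # Without =REX the delimiter is literal, so metacharacters are escaped.
        pattern = delimiter if is_regex else re.escape(delimiter)
    # Dropping empty records is an assumption of this sketch.
    return [record for record in re.split(pattern, text) if record]


print(split_records("a.b.c", "RS='.'"))
print(split_records("a1b22c", r"RS='\d+'=REX"))
```

In the first call the dot is treated as a plain character, whereas in the second call "\d+" is interpreted as a regular expression because of the '=REX' suffix.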

Field delimiters can be designated by either FS='[comma-delimited characters]' (=REX) or F1[delimiter]F2[delimiter]F3[delimiter] ... (=REX). In the former pattern, the comma-delimited characters are used as field delimiters, whereas in the latter pattern, the characters between F1, F2, F3 ... ('F' and digits) are used as field delimiters. In either case, on the command-line version, the delimiters are used as regular expressions if '=REX' is present. The use of regular expressions for field delimiters is not allowed in the GUI version.

The prefix for F1, F2, F3 ... (i.e., 'F') can be changed by inputting PF='[prefix]'.

As the input format, !FJ, !FJ1, !FJ2, ... (prefix '!FJ' with or without digits) can be used as journal name fields. When Cafebr finds such a pattern, it tries to associate the corresponding field key with a journal name in the journal name list, which is present either as the file "jlist.txt" for the command-line version or a part of the HTML/JavaScript file for the GUI version (see the section 2a). The prefix '!FJ' cannot be changed.

After setting delimiters, Cafebr asks how to organize obtained fields (output format), if no preset format is selected by the -f option (see the section 3). It is possible to choose a preset format at this point. Alternatively, the output format can be designated by F1[delimiter]F2[delimiter]F3[delimiter] ...

By default, the command-line version of Cafebr regards the patterns F2, F5, ... as names of fields and ignores their order. This means that a "prefix + digits" pattern in the output format works only when that pattern is present in the input format. This behavior can be changed by adding '=NUM' at the end of the output format. In this case, F1, F2, F3, ... correspond to the first, second, third, ... fields in the input. If '=NUM' is present in the output format, the prefix can be changed by PF='[prefix]'.

Example:
The input
Tsugama (2014) Diceman. PA, USA; Mike (2018) a nice man. Hokkaido, Japan
is output as
In PA (USA) in 2014, Tsugama was called Diceman.
In Hokkaido (Japan) in 2018, Mike was called a nice man.
if the input format is
P1 (P2) P3. P4, P5 RS=';' PF='P'
and the output format is
In F4 (F5) in F2, F1 was called F3. =NUM PF='F'
In this case, both of the following output formats
In P4 (P5) in P2, P1 was called P3. =NUM and
In P4 (P5) in P2, P1 was called P3.
return the same result.
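The example above can be reproduced with a small Python sketch. It is not Cafebr's implementation: the function names are hypothetical, the directives (RS, PF, =NUM) are passed as ordinary arguments instead of being parsed from the format strings, and whitespace around each record is stripped as an assumption. The output template is handled in the '=NUM' manner, i.e., F1, F2, ... refer to the first, second, ... fields of the input.

```python
import re


def template_to_regex(template, prefix):
    """Turn an input format such as 'P1 (P2) P3. P4, P5' into a regex."""
    # re.split with a capturing group puts the placeholders at odd indices.
    parts = re.split(rf"({re.escape(prefix)}\d+)", template)
    pieces = []
    for index, part in enumerate(parts):
        if index % 2 == 1:
            pieces.append("(.*?)")          # one field
        else:
            pieces.append(re.escape(part))  # literal field delimiter
    return re.compile("".join(pieces), re.DOTALL)


def format_records(text, in_fmt, out_fmt, rs=";", in_prefix="P", out_prefix="F"):
    """Split text into records and fields, then fill the output template."""
    pattern = template_to_regex(in_fmt, in_prefix)
    placeholder = re.compile(rf"{re.escape(out_prefix)}(\d+)")
    results = []
    for record in text.split(rs):
        match = pattern.fullmatch(record.strip())
        if match is None:
            continue  # skipping unparsable records is an assumption
        fields = match.groups()
        results.append(
            placeholder.sub(lambda m: fields[int(m.group(1)) - 1], out_fmt)
        )
    return results


lines = format_records(
    "Tsugama (2014) Diceman. PA, USA; Mike (2018) a nice man. Hokkaido, Japan",
    "P1 (P2) P3. P4, P5",
    "In F4 (F5) in F2, F1 was called F3.",
)
print("\n".join(lines))
```

Each literal piece of the input format acts as a field delimiter, and the fields are captured in the order in which they appear, which is what makes positional references in the output template possible.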

If a preset output format is selected, Cafebr asks which fields correspond to Author, Article Title, Publication Year, etc. Fields for these pieces of information should be indicated by the order of the field (1, 2, 3, ...) rather than F1, F2, F3, ..., in this case.

The GUI version of Cafebr asks for delimiters and for the fields containing specific information (Author, Article Title, etc.), according to the formatting and/or ordering option selected.
It behaves a little differently from the command-line version: the GUI version does not support the use of "=REX" for field separators (it does allow "=REX" for the record separator), and it always uses "=NUM" to organize fields for output.

3. Field formatting option (--fldfmt, -f)

A preset format for output can be designated by the option -f with an argument. If the argument matches one of the preset formats, that format is used for output. Arguments for the formatting option for the command-line version are:

'all', 'DB', 'default', 'smode', 'hokudai', 'lcp', 'febs', 'npg', 'abstracts', 'raw', 'unchanged', 'custom', and 'own' (given below are examples)

Some of these formatting methods are not supported by the GUI version of Cafebr.

The GUI version of Cafebr displays how fields are organized by the selected formatting option ("output template") in the 'Output style' text area for the step '4. Choose a style for output references'. For the output template, "!FAUTH", "!FYEAR", "!FTITLE", etc. are used as field names. For the 'Delimiter' option, "F1", "F2", "F3", etc. can also be used as field names (see the 'f. Delimiter designation option' section for details). These field names are replaced with the corresponding pieces of information for each article for output.
The output template is editable. It is therefore possible to customize an output format even when a specific output format has been selected.

4. Output ordering option (--order, -o)

Formatted references can be output in a certain order. Arguments for the output ordering options for the command-line version are:

'num', 'index', 'author', 'name', 'year', 'pm', 'title'

The arguments 'num' and 'index' are equivalent to each other. When either of them is selected, the first digits in each reference are used to order the references. These arguments are useful when, for example, index numbers are present at the beginnings of references.

The arguments 'author' and 'name' are equivalent to each other, and order references by author names. The argument 'year', 'pm' or 'title' orders references by publication years, PubMed IDs or article titles, respectively.

In all cases, the ascending order is used by default. When the argument contains "=r" (e.g., when "-o author=r" is used), the descending order is used.
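The ordering arguments can be pictured with the following Python sketch. The data layout (one dictionary per reference with the keys "raw", "author", "year", "pmid" and "title"), the function names, the case-insensitive comparison and the treatment of references without digits are all assumptions made for illustration; only the argument names and the "=r" suffix come from this manual.

```python
import re


def first_number(reference):
    """Return the first run of digits in the raw reference text."""
    match = re.search(r"\d+", reference["raw"])
    # References without digits sort first; this is an assumption.
    return int(match.group()) if match else -1


SORT_KEYS = {
    "num": first_number,
    "index": first_number,
    "author": lambda ref: ref["author"].lower(),
    "name": lambda ref: ref["author"].lower(),
    "year": lambda ref: int(ref["year"]),
    "pm": lambda ref: int(ref["pmid"]),
    "title": lambda ref: ref["title"].lower(),
}


def order_references(references, argument):
    """Order references by an argument such as 'author' or 'author=r'."""
    key_name, _, suffix = argument.partition("=")
    return sorted(references, key=SORT_KEYS[key_name], reverse=(suffix == "r"))


# Made-up records used only to exercise the sketch.
refs = [
    {"raw": "2. Mike (2018)", "author": "Mike", "year": "2018",
     "pmid": "200", "title": "B title"},
    {"raw": "1. Tsugama (2014)", "author": "Tsugama", "year": "2014",
     "pmid": "100", "title": "A title"},
]
print([ref["author"] for ref in order_references(refs, "year=r")])
```

Mapping each pair of equivalent arguments ('num'/'index', 'author'/'name') to the same key function keeps the equivalence explicit.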

Some of these ordering methods are not supported by the GUI version of Cafebr.

For both the command-line version and the GUI version of Cafebr, references can be ordered as they appear in a manuscript. On the command-line version, a manuscript can be loaded if the name (or path) of a manuscript file (in a plain text format) is provided as the argument for the ordering (-o) option. On the GUI version, a manuscript input area becomes visible when the radio button 'As in a manuscript' is checked.

When manuscript text is provided, word characters ([A-Za-z] as a regular expression) followed by four digits starting with either 1 or 2, enclosed in either '()' or '[]', are regarded as potential citations. The word characters and the four digits are used as an author name and a publication year, respectively, and the input references are searched for references with that combination of author name and publication year. If such a combination corresponds to two or more references, the command-line version asks which is relevant to the citation of interest, whereas the GUI version regards all of them as relevant and proceeds. If no potentially relevant reference is found, the command-line version asks whether PubMed searching should be performed, whereas the GUI version ignores the corresponding combination of author name and publication year.
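How potential citations might be picked out of a manuscript can be sketched in Python as below. The two regular expressions are assumptions: the manual does not specify what may stand between the author name and the year, so this sketch allows any characters other than digits and semicolons (e.g., " et al., ") and takes the first word of each citation as the author name.

```python
import re

# Text enclosed in '()' or '[]' (nesting is not handled in this sketch).
BRACKETED = re.compile(r"\(([^()]*)\)|\[([^\[\]]*)\]")
# A run of letters, then a four-digit number starting with 1 or 2.
CITATION = re.compile(r"([A-Za-z]+)[^0-9;]*?\b([12][0-9]{3})\b")


def extract_citations(manuscript):
    """Return (author, year) pairs for potential citations, in text order."""
    citations = []
    for bracket in BRACKETED.finditer(manuscript):
        inner = bracket.group(1)
        if inner is None:
            inner = bracket.group(2)
        citations.extend(CITATION.findall(inner))
    return citations


text = ("Growth was reduced (Tsugama et al., 2014; Mike 2018) "
        "as reported [Smith 1999]. See (Fig. 2).")
print(extract_citations(text))
```

Here "(Fig. 2)" is not reported because it contains no four-digit number starting with 1 or 2, while each of the other bracketed spans yields one author-year pair per citation.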

5. Examples of use

Below are some examples of (tricky) use of Cafebr.

i. Getting articles either using PubMed or from your own database, and outputting them in a formatted style

Select the 'DB browsing' option and add your Cafebr database file. Alternatively, select the 'PubMed searching' option, input keywords and click 'View in Cafebr' (either for up to 20 records or > 20 records). Extracted fields should be automatically displayed as a table for the step '3. Choose references for formatted output'. In this table, each row contains the information for one article. Click each row to select/unselect the necessary article information, and click the 'Add selected references to output' button to output the selected articles in a formatted style. Formats can be changed in the step '4. Choose a style for output references'. Results are then displayed as text and HTML for the step '6. Review and edit formatted references'.

ii. Changing styles of a preexisting reference list

It is sometimes necessary to output the same reference list in a different format. The 'Delimiter' option may work in this case. Select this option, add a reference list, and indicate delimiters. Delimited records and fields are then displayed as a table for the step '3. Choose references for formatted output'. The records and fields can be reorganized using options in the steps '4. Choose a style for output references' and '5. Choose how to order output references'. These options may also be useful to find formatting errors in a reference list before submitting it as a part of a manuscript.

iii. Customizing output formats

On the GUI version of Cafebr, output references are all based on the output template (what is written in the 'Output style' text area for the step '4. Choose a style for output references', see the '3. Field formatting option' section).

For example, the preset format 'NPG' uses the output template "!FAUTH. !FTITLE. <i>!FJOURNAL</i> <b>!FVOL</b>, !FPAGE (!FYEAR)." to output journal names and volumes as italic and bold styles, respectively, but such italic and bold styles are not applied if the output template has been changed to "!FAUTH. !FTITLE. !FJOURNAL !FVOL, !FPAGE (!FYEAR)." (i.e., if the HTML tags have been removed).
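The substitution of field names in an output template can be sketched in Python as follows. The GUI itself is written in JavaScript, so this is only an illustration of the idea: the function name apply_template, the dictionary layout of a record, and the replacement of missing fields with empty strings are assumptions, and the record shown is a made-up placeholder.

```python
import re

# Field names listed in this manual for the output template.
FIELD_NAMES = [
    "!FAUTH", "!FTITLE", "!FYEAR", "!FJOURNAL", "!FVOL", "!FISS",
    "!FPAGE", "!FDOI", "!FPMID", "!FPMC", "!FABSTRACT", "!FATTR",
]
# Longer names are tried first so that no name can shadow another.
FIELD_PATTERN = re.compile(
    "|".join(re.escape(name)
             for name in sorted(FIELD_NAMES, key=len, reverse=True))
)


def apply_template(template, record):
    """Replace every field name in the template with the record's value."""
    # Missing fields become empty strings; this is an assumption.
    return FIELD_PATTERN.sub(lambda m: record.get(m.group(), ""), template)


NPG = "!FAUTH. !FTITLE. <i>!FJOURNAL</i> <b>!FVOL</b>, !FPAGE (!FYEAR)."
record = {  # placeholder values, not a real article
    "!FAUTH": "Author A", "!FTITLE": "An example title",
    "!FJOURNAL": "Example J", "!FVOL": "1",
    "!FPAGE": "1-10", "!FYEAR": "2018",
}
print(apply_template(NPG, record))
```

Removing the HTML tags from the template changes only the literal text around the fields, which is why the italic and bold styles disappear while the field values stay the same.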

On the command-line version of Cafebr, the output template is used only when the 'Delimiter (--delimit, -d)' option has been selected. For more flexible and/or precise formatting, open the source file in a text editor, designate an output format using "$myownfmt", save the file, and run it with either the "-f custom" or the "-f own" option. For $myownfmt, $author, $title, $year, $journal, $volume, $issue, $pages, $doi, $pmid, $pmcid, $abstract, $attribute, etc. can be used. For example, '$myownfmt=$author."|".$title."|".$year."|".$journal' will output authors, article titles, publication years and journal names in the pipe (|)-delimited form. Please see the subroutine "formatting" for more examples of output formats.

iv. Completing citations in a manuscript

The ordering option "-o [a manuscript file (path)]" or "As in a manuscript" returns a warning message if input references have no reference relevant to a certain citation in a manuscript. Cafebr may therefore be useful to find "missing" references.
Possible citations extracted by Cafebr from a manuscript are letters with at least one word character followed by four digits starting with either '1' or '2' in either '()' or '[]'. Parts needing any references in a manuscript can therefore be tentatively marked by a pattern such as 'R 1111', and later be revised and modified on Cafebr.

6. Point-by-point helps for the GUI version

How to collect and/or handle input references

The 'DB browsing' option expects the text area for the step "2. Give and edit input references" to have input references in the following format:
Authors [tab] Article Title [tab] Publication Year [tab] Journal [tab] Volume [tab] Issue [tab] Pages [tab] PubMed ID (PMID) [tab] PubMed Central ID (PMCID) [tab] DOI [tab] Attributes [tab] Author Information
These are then displayed as a table for the step "3. Choose references for formatted output".
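The tab-delimited database format above can be read with a few lines of Python. The dictionary keys and the function name parse_db_line are names chosen for this sketch, and padding short lines with empty strings is an assumption; the field order is the one given above.

```python
# Field order as listed in this manual; the key names are hypothetical.
DB_FIELDS = [
    "authors", "title", "year", "journal", "volume", "issue", "pages",
    "pmid", "pmcid", "doi", "attributes", "author_information",
]


def parse_db_line(line):
    """Split one tab-delimited database line into named fields."""
    values = line.rstrip("\r\n").split("\t")
    # Pad short lines so that every field name is present (an assumption).
    values += [""] * (len(DB_FIELDS) - len(values))
    return dict(zip(DB_FIELDS, values))


# Placeholder line, not a real article.
line = "\t".join(["Author A", "An example title", "2018", "Example J",
                  "1", "2", "1-10", "", "", "", "", ""])
entry = parse_db_line(line)
print(entry["journal"], entry["year"])
```

Each parsed line corresponds to one row of the table shown for step 3.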

The 'Delimiter' option is to split input references in the text area for the step 2 into records and fields using designated delimiters. Field delimiters are required, and can be designated by either
"F1[delimiter 1]F2[delimiter 2]F3..." or "FS='delimiter 1, delimiter 2, delimiter 3, ...'"
The record delimiter is carriage returns and line feeds ([\r\n]+) by default, and can be changed by "RS='[record delimiter]'". The default prefix 'F' for fields can be changed by "PF='[prefix]'". More details are available in the 'f. Delimiter designation option' section.

The 'PubMed searching' option performs a keyword search on PubMed. If a 'View in Cafebr' button (either for 'up to 20 records' or '> 20 records') is clicked, results are handled as when the 'DB browsing' option is selected.

The 'PubMed XML correction' and 'PubMed abstract correction' options extract information from PubMed search results in the XML format and the abstract (text) format, respectively. The extracted information is then handled as when the 'DB browsing' option is selected.

-------------------------------------------------------------------------------

Input references

Input references can be provided by either designating a file or typing directly. When 'Add new input to old' is checked, new input references generated by designating a file, PubMed searching, or PubMed result text correction are added to the beginning of old input references.

If the button 'Use the above for step 3' is clicked, references in the text area are displayed as a table for the step 3, according to the option and delimiters designated at the step 1.

When words are present in the 'Keywords' box, only records with those words are displayed in the table for the step 3. Spaces between words in the 'Keywords' box are replaced with ".*?", and the resulting characters are used to search input references (i.e., spaces are used for the AND search). Words in the 'Keywords' box are used as a regular expression. Some characters need to be escaped by a backslash. Metacharacters can also be used. The OR search can be performed using a pipe ('|') as "OR".
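The keyword filtering described above can be sketched in Python. The function name keyword_filter and the case-sensitive matching are assumptions for this illustration, and the records are made-up placeholders. Because the words are joined by ".*?", a record matches only if the words occur in the order in which they were typed.

```python
import re


def keyword_filter(keywords, records):
    """Keep records matching the keywords, with spaces turned into '.*?'."""
    pattern = re.compile(re.sub(r"\s+", ".*?", keywords.strip()))
    return [record for record in records if pattern.search(record)]


records = [
    "Tsugama (2014) Diceman. PA, USA",
    "Mike (2018) a nice man. Hokkaido, Japan",
    "Smith (2014) an example. Tokyo, Japan",
]
print(keyword_filter("Tsugama 2014", records))   # AND search
print(keyword_filter("Mike|Smith", records))     # OR search
print(keyword_filter(r"\(2014\)", records))      # escaped parentheses
```

The last call shows why some characters need a backslash: unescaped parentheses would be read as a regular-expression group rather than as literal characters.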

Text in the text area can be downloaded as a text file by clicking the 'Download/Add to the database' button. When references are imported to the text area from a file and the 'DB browsing' option in the step 1 has been selected, the

-------------------------------------------------------------------------------

Style for output

Field names such as !FAUTH, !FYEAR, !FTITLE, F1, F2, F3, ... in the 'Output style' text area are replaced for each record (article) with pieces of information collected in the steps 1-3.

All the specific field names are !FAUTH (for authors), !FTITLE (for article titles), !FYEAR (for publication years), !FJOURNAL (for journals), !FVOL (for volumes of journals), !FISS (for issues of journals), !FPAGE (for pages of journals), !FDOI (for DOIs), !FPMID (for PubMed IDs), !FPMC (for PubMed Central IDs), !FABSTRACT (for abstracts), and !FATTR (for attributes).

F1, F2, ... (i.e., patterns of prefix + digits) are for the 'Delimiter' option in the step 1. The prefix 'F' can be changed by typing "PF='[prefix]'" in the 'Output style' text area.

-------------------------------------------------------------------------------

Input manuscript for the 'As in a manuscript' ordering option

The manuscript text should be in a plain text format. Word characters ([A-Za-z]) and four digits starting with either 1 or 2 in either '()' or '[]' are regarded as an author name and a publication year, respectively. References with such an author and a publication year are then searched for, and used for output.

The input manuscript is copied to the 'Modified manuscript' text area, and the possible author names and publication years are replaced with sequential numbers in '[]'.

CafebrMan
-- Manual for Citation Amender/Formatter for Biological Research

Contents

1. Introduction

2. Methods to collect paper information

a. Default mode (*only for the command-line version)

b. DB browsing option (--browse, -b)

c. PubMed search option (--search, -s)

d. PubMed XML information correction option (--xmlcorr, -x)

e. PubMed abstract text correction option (--abscorr, -a)

f. Delimiter designation option (--delimit, -d)

3. Field formatting option (--fldfmt, -f)

4. Output ordering option (--order, -o)

5. Examples of use

6. Point-by-point helps for the GUI version
