Scholarly references in Jekyll

Filed under jekyll, liquid, python, references.

This post is about properly referencing scholarly work in Jekyll. Traditional academic media use strictly codified reference systems, including citations (inline acknowledgements of specific findings, ideas, or quotations) and bibliographies (complete lists of relevant work, alphabetically or chronologically organized, usually found at the end of the document). Unfortunately, references in non-traditional media, such as blog posts, can be frustratingly lackluster.

Since this website is mostly about science, I wanted it to have a strong reference system, like those found in academic journals, with both inline citations and bibliographies. Crucially, I didn’t want to type everything by hand every time, since doing so would take forever and would likely introduce many errors. Like many academics, I routinely use reference management software in my day job, and it takes the sting out of references when writing manuscripts. So I cobbled together a loose “system” for this website that vaguely resembles a reference manager. It’s not a particularly elegant solution, but it gets the job done reasonably well. This post describes my system.

Getting references

I’ve used practically every major reference manager over the years. My favorite, by a long way, is Zotero. Among Zotero’s many useful features is the “magic wand” button, which quickly adds an item to the library, along with all its metadata, given an ISBN, DOI, or PubMed ID. The metadata differ slightly but annoyingly depending on which identifier you use, but so long as you use the DOI wherever possible, it works well most of the time.

My Zotero library contains all the references I need for work. Since the references for this website are a subset of those, I wanted to make Jekyll talk to my Zotero library to grab the references needed for each blog post. I couldn’t come up with a fully automated way of doing this, unfortunately. Instead, I wrote a Python script to extract metadata from each item in my Zotero library and store it in a YAML file. By placing this YAML file in Jekyll’s special _data directory, the metadata gets read into Jekyll and is accessible via Liquid whenever my website is rebuilt. This Python script, reproduced below, uses the Pyzotero third-party package. (Zotero’s web API is another great advantage of this reference manager!) I run this script as part of a collection of scripts every time I rebuild my site. I’ll describe another script in this collection shortly.

"""Download my zotero library and convert it to YAML.

"""
import re

from string import ascii_lowercase

from pyzotero import zotero
from unidecode import unidecode


def get_ids():
    """Read file containing ID and API key.

    """
    # Use a context manager so the file handle is closed after reading.
    with open("../../_data/zotero.txt") as fp:
        return [line.strip() for line in fp]


def download_refs():
    """Download my entire zotero library.

    """
    library_id, api_key = get_ids()
    zot = zotero.Zotero(library_id, "user", api_key)
    results = zot.everything(zot.top())
    keys = []
    with open("../../_data/refs.yaml", "w") as fw:
        for item in results:
            dic = item["data"]
            authors = []
            _cite = []
            for author in dic["creators"]:
                if author["creatorType"] == "author":
                    if "lastName" in author:
                        _cite.append(author["lastName"])
                        name = author["lastName"] + ","
                        for n in author["firstName"].split():
                            if n not in ("Jr", "Jr.", "Jnr", "Jnr."):
                                n = n[0] + "."
                            name += f" {n}"
                    else:
                        _cite.append(author["name"])
                        name = author["name"]
                    authors.append(name)
            if len(authors) > 1:
                authors[-1] = f"& {authors[-1]}"
            authors = ", ".join(authors)
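            # Pull the year out of Zotero's free-form date field. Default to
            # an empty string so that later string concatenation cannot fail
            # when no year-like token is present.
            parts = re.split(" |,|-|/", dic["date"])
            date = ""
            for d in parts:
                try:
                    if int(d) > 100:
                        date = str(int(d))
                except ValueError:
                    pass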
            for s in ascii_lowercase:
                key = unidecode(authors.split(" ")[0].rstrip(",")) + date + s
                if key not in keys:
                    keys.append(key)
                    fw.write(f"{key}:\n")
                    break
            fw.write(f'   authors: "{authors}"\n')
            fw.write(f'   year: "{date}"\n')
            title = dic["title"].replace('"', "''")
            if dic["itemType"] != "book":
                fw.write(f'   title: "{title}"\n')
            else:
                fw.write(f'   book: "{title}"\n')
            if "publicationTitle" in dic:
                if "arXiv" in dic["publicationTitle"]:
                    arXiv = dic["publicationTitle"].split(":")[1].split()[0]
                    fw.write(f'   arXiv: "{arXiv}"\n')
                    del dic["publicationTitle"]
            _keys = {
                "publicationTitle": "journal",
                "volume": "volume",
                "issue": "issue",
                "bookTitle": "book",
                "publisher": "publisher",
                "edition": "edition",
                "DOI": "doi",
            }
            for k, v in _keys.items():
                if k in dic:
                    if dic[k] != "":
                        fw.write(f'   {v}: "{dic[k]}"\n')
            if "pages" in dic:
                if dic["pages"] != "":
                    pages = re.split("-|–", dic["pages"])
                    fw.write(f'   first_page: "{pages[0]}"\n')
                    if len(pages) > 1:
                        fw.write(f'   last_page: "{pages[1]}"\n')
            editors = []
            for editor in dic["creators"]:
                if editor["creatorType"] == "editor":
                    if "lastName" in editor:
                        name = ""
                        for n in editor["firstName"].split():
                            name += f"{n[0]}. "
                        name += editor["lastName"]
                    else:
                        name = editor["name"]
                    editors.append(name)
            if len(editors) > 1:
                editors[-1] = f"& {editors[-1]} (Eds.)"
            elif len(editors) == 1:
                editors[0] = f"{editors[0]} (Ed.)"
            editors = ", ".join(editors)
            if editors != "":
                fw.write(f'   editors: "In {editors}"\n')
            if len(_cite) == 1:
                citep = f"({_cite[0]}, {date})"
                citet = f"{_cite[0]} ({date})"
            elif len(_cite) == 2:
                citep = f"({_cite[0]} & {_cite[1]}, {date})"
                citet = f"{_cite[0]} and {_cite[1]} ({date})"
            else:
                citep = f"({_cite[0]} <i>et al.</i>, {date})"
                citet = f"{_cite[0]} <i>et al.</i> ({date})"
            fw.write(f'   citep: "[{citep}](#{key})"\n')
            fw.write(f'   citenp: "[{citep[1:-1]}](#{key})"\n')
            fw.write(f'   citet: "[{citet}](#{key})"\n')
            fw.write("\n")


if __name__ == "__main__":
    download_refs()

A typical entry in the resulting refs.yaml file is:

Myung2003a:
   authors: "Myung, I. J."
   year: "2003"
   title: "Tutorial on maximum likelihood estimation"
   journal: "Journal of Mathematical Psychology"
   volume: "47"
   issue: "1"
   doi: "10.1016/S0022-2496(02)00028-7"
   first_page: "90"
   last_page: "100"
   citep: "[(Myung, 2003)](#Myung2003a)"
   citenp: "[Myung, 2003](#Myung2003a)"
   citet: "[Myung (2003)](#Myung2003a)"
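
Keys such as Myung2003a are built from the first author's surname, the year, and the first lowercase letter that has not already been used. The sketch below isolates that step. Note that make_key is a hypothetical helper written for illustration, not a function from the script above, and it skips the unidecode step that strips accents from surnames.

```python
from string import ascii_lowercase


def make_key(surname, year, existing_keys):
    """Return surname + year + first unused letter (hypothetical helper)."""
    for suffix in ascii_lowercase:
        key = f"{surname}{year}{suffix}"
        if key not in existing_keys:
            existing_keys.add(key)
            return key
    # Assumption: running out of letters is treated as an error here.
    raise ValueError(f"no free key left for {surname}{year}")


keys = set()
first = make_key("Myung", "2003", keys)
second = make_key("Myung", "2003", keys)  # same author and year again
print(first, second)
```

A second item by the same first author in the same year simply gets the next letter, so existing keys stay stable as long as the library is processed in the same order.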

Citations

In the YAML snippet from the previous section, the meanings of most of the fields are probably obvious. The last three are used to create inline citations via Liquid. For example, to cite a paper, I put the following in the body of the post:

{{ site.data.refs.Myung2003a.citet }}

This produces: Myung (2003). Notice how the citation is also an internal link to the corresponding item in the bibliography! I’m quite proud of that bit. citet is for text citations, citep is for parenthetical citations (Myung, 2003), and citenp is for parenthetical citations without the parentheses. citenp is useful for constructing parenthetical citations containing extra text (e.g., Myung, 2003).
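
To make the relationship between the three variants concrete, here is a minimal sketch of how they can be derived from a list of surnames, a year, and a key. The function format_citations is a hypothetical helper that mirrors the logic of the download script; the two-author example uses made-up names, and raising an error for an empty author list is my assumption rather than something the original script does.

```python
def format_citations(surnames, year, key):
    """Build the citet, citep, and citenp strings (hypothetical helper)."""
    if not surnames:
        # Assumption: an item with no authors is treated as an error.
        raise ValueError("at least one author surname is needed")
    if len(surnames) == 1:
        names_p = names_t = surnames[0]
    elif len(surnames) == 2:
        names_p = f"{surnames[0]} & {surnames[1]}"
        names_t = f"{surnames[0]} and {surnames[1]}"
    else:
        names_p = names_t = f"{surnames[0]} <i>et al.</i>"
    citep = f"({names_p}, {year})"
    return {
        "citet": f"[{names_t} ({year})](#{key})",
        "citep": f"[{citep}](#{key})",
        # citenp is citep with the outer parentheses stripped.
        "citenp": f"[{citep[1:-1]}](#{key})",
    }


cites = format_citations(["Myung"], "2003", "Myung2003a")
pair = format_citations(["Smith", "Jones"], "2010", "Smith2010a")  # made-up entry
print(cites["citet"])
print(pair["citep"])
```

Each string is a Markdown link whose target is the anchor of the matching bibliography entry, which is what makes the inline citation clickable.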

Pulling out citations

My system requires every post that cites a reference to list that reference’s key (e.g., Myung2003a) in a YAML list called references in its front matter. I have another Python script that scans through all my existing posts, finds every citation, and creates this YAML list. It would be nice to make Jekyll do this instead of Python, but I don’t know how. Here’s that script:

"""Add cited references to the front matter of each markdown page.

"""
import os
import re
import sys

import yaml


def get_frontmatter(f):
    """Return front matter from a markdown file in dictionary format.

    """
    with open(f) as fp:
        s = fp.read().partition("---")[2].partition("---")[0]
        d = yaml.safe_load(s)
    return d


def find_cites(f):
    """Return keys to cited papers.

    """
    with open(f) as fp:
        lst = re.findall(r"{{(.+?)}}", fp.read())
    refs = []
    for l in lst:
        if "site.data.refs" in l:
            refs.append(l.split(".")[3])
    return sorted(set(refs))


def replace_frontmatter(f, d):
    """Replace the front matter with new front matter.

    """
    with open(f) as fp:
        s = fp.read().partition("---\n")[2].partition("---\n")[2]
    with open(f, "w") as fw:
        fw.write("---\n")
        yaml.safe_dump(d, fw)
        fw.write(f"---\n{s}")


def add_refs():
    """Add all references.

    """
    # Match on the file extension rather than any ".md" substring.
    posts = [p for p in os.listdir("../../_posts") if p.endswith(".md")]
    for p in posts:
        f = f"../../_posts/{p}"
        d = get_frontmatter(f)
        r = find_cites(f)
        if r:
            d["include_references"] = True
            d["references"] = r
        replace_frontmatter(f, d)


if __name__ == "__main__":
    add_refs()
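
The key extraction in find_cites is easiest to follow on a small input. The sketch below applies the same regular expression and split to a hypothetical post body (the text is invented for illustration); it strips whitespace from each tag before splitting, which does not change the result.

```python
import re

# Hypothetical post body, used only for illustration.
text = (
    "Maximum likelihood is reviewed by {{ site.data.refs.Myung2003a.citet }}. "
    "An unrelated tag: {{ page.title }}. "
    "A repeat ({{ site.data.refs.Myung2003a.citenp }}, p. 92)."
)

keys = []
for tag in re.findall(r"{{(.+?)}}", text):
    if "site.data.refs" in tag:
        # "site.data.refs.Myung2003a.citet" splits into
        # ["site", "data", "refs", "Myung2003a", "citet"]; index 3 is the key.
        keys.append(tag.strip().split(".")[3])

unique_keys = sorted(set(keys))
print(unique_keys)
```

Tags that do not mention site.data.refs are ignored, and repeated citations of the same paper collapse to a single key, so each reference appears once in the front matter list.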

Generating the bibliography

The final part of my system is to generate the post’s bibliography. I use a boolean variable in the front matter called include_references to enable bibliography generation (I wrote about toggling features previously). Below is the relevant Liquid snippet within my post.md template, placed immediately after {{ content }}.

{% if page.include_references %}
  {% include references.html %}
{% endif %}

In my _includes folder is a template called references.html, which looks like this:

<h2>References</h2>
{% assign refs = site.data.refs | sort %}
{% for paper in refs %}
    {% for ref in page.references %}
        {% if ref == paper[0] %}
            {% include citation.html %}
        {% endif %}
    {% endfor %}
{% endfor %}

This loops over each reference in the site-wide refs.yaml. If a given reference’s key also appears in the references list in the post’s front matter, the current entry, held in the loop variable paper, is rendered by a third template called citation.html:

<p class="refs"><a name="{{ paper[0] }}"></a>
    {{ paper[1].authors }} ({{ paper[1].year }}).
    {% if paper[1].title %}
        {{ paper[1].title }}.
    {% endif %}
    {% if paper[1].editors %}
        {{ paper[1].editors }},
    {% endif %}
    {% if paper[1].book and paper[1].collection and paper[1].volume %}
        <i>{{ paper[1].book }}</i> ({{ paper[1].collection }} vol {{ paper[1].volume }}, pp. {{ paper[1].first_page }}–{{ paper[1].last_page }}).
    {% else %}
        {% if paper[1].book and paper[1].volume %}
            <i>{{ paper[1].book }}</i> (vol {{ paper[1].volume }}, pp. {{ paper[1].first_page }}–{{ paper[1].last_page }}).
        {% else %}
            {% if paper[1].book and paper[1].first_page %}
                <i>{{ paper[1].book }}</i> (pp. {{ paper[1].first_page }}–{{ paper[1].last_page }}).
            {% else %}
                {% if paper[1].book and paper[1].edition %}
                    <i>{{ paper[1].book }}</i> ({{ paper[1].edition }} ed.).
                {% else %}
                    {% if paper[1].book %}
                        <i>{{ paper[1].book }}</i>.
                    {% endif %}
                {% endif %}
            {% endif %}
        {% endif %}
    {% endif %}
    {% if paper[1].book %}
        {{ paper[1].publisher }}.
    {% endif %}
    {% if paper[1].journal %}
        {% if paper[1].volume or paper[1].first_page %}
            <i>{{ paper[1].journal }}</i>,
            {% if paper[1].volume %}
                {% if paper[1].issue %}
                    {% if paper[1].first_page %}
                        <i>{{ paper[1].volume }}</i>({{ paper[1].issue }}),
                    {% else %}
                        <i>{{ paper[1].volume }}</i>({{ paper[1].issue }}).
                    {% endif %}
                {% else %}
                    {% if paper[1].first_page %}
                        <i>{{ paper[1].volume }}</i>,
                    {% else %}
                        <i>{{ paper[1].volume }}</i>.
                    {% endif %}
                {% endif %}
            {% endif %}
            {% if paper[1].first_page %}
                {% if paper[1].last_page %}
                    {{ paper[1].first_page }}–{{ paper[1].last_page }}.
                {% else %}
                    {{ paper[1].first_page }}.
                {% endif %}
            {% endif %}
        {% else %}
            <i>{{ paper[1].journal }}</i>.
        {% endif %}
    {% endif %}
    {% if paper[1].doi %}
    <a href="https://doi.org/{{ paper[1].doi }}">{{ paper[1].doi }}</a>
    {% endif %}
    {% if paper[1].arXiv %}
        arXiv:<a href="https://arxiv.org/abs/{{ paper[1].arXiv }}">{{ paper[1].arXiv }}</a>.
    {% endif %}
</p>

The above code is quite involved and difficult to read. Basically, it takes the information stored within paper and formats it into a style I call “APAish.” You can see the results immediately below.

References

Myung, I. J. (2003). Tutorial on maximum likelihood estimation. Journal of Mathematical Psychology, 47(1), 90–100. 10.1016/S0022-2496(02)00028-7

Version history

Originally posted September 02, 2020.


The Cracked Bassoon