25.10.2020 linux hugo static-site-generator

Building a full text search for Hugo

Hugo is a static site generator. This means that there is no dynamic code server-side. That is the strength of this kind of tool: no attack vector for the bad guys. They have to target the server itself, which is more difficult if the sysadmin didn’t completely mess up.

Note: this is a direct port of my solution for Jekyll. Most of the code is over there, so I am not going to write it again; I am only going to explain how I did it in Hugo. This is also not a ctrl-c ctrl-v article: I only explain what I am doing, and you will need to adapt it to your needs.

What are the options for a full text search with an SSG?

  1. Keep a JSON dictionary with titles and URLs, and search only the titles with JS.
    While this may work, the bigger your site gets, the more data your users have to download with that JSON file, and data costs money. Your users’ money.

  2. Use an external service
    I am not going to name any; the Hugo website has a list of services that offer a search, if you are interested. However, putting your users’ data into the hands of others is something I despise deeply.

  3. Use a database
    It’s fast, and it lets us search for something and send only the relevant data to the user. However, we do introduce the security risk that someone takes over the database, with SQL injection for example.
    That is why we are going to nail it down hard.
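
One way to read "nail it down" is a database account that can do nothing but read the search table. The statements below are a minimal sketch under that assumption, not the setup used on this site: the account name `search_ro`, the host `localhost` and the password are placeholders, `db.blog` is the table created further down in this article, and a MySQL-compatible server is assumed.

```sql
-- Hypothetical read-only account for the search script.
-- Name, host and password are placeholders, not taken from the article.
CREATE USER 'search_ro'@'localhost' IDENTIFIED BY 'change-me';

-- Run this once db.blog exists: the account may read that one table
-- and nothing else. No INSERT, UPDATE, DELETE or DROP.
GRANT SELECT ON db.blog TO 'search_ro'@'localhost';
```

With an account like this, even a successful injection through the search script cannot change or delete anything. MySQL does not revoke table privileges when a table is dropped, so the grant should survive an import that drops and recreates the table. That import would run under a separate, more privileged account.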

Hugo’s custom output formats

This is a bit fuzzy to learn and it took me a while to understand it. The plain documentation for this feature is hard to read, but maybe that is just my personal language barrier.

The first steps are these:

  1. Create the media type

    [mediaTypes]
    [mediaTypes."text/sql"]
        suffixes = ["sql"]
    
  2. Define the output format

    [outputFormats]
    [outputFormats.SQL]
        mediaType = "text/sql"
    
  3. Request the output from Hugo

    [outputs]
        home = ["sql", "html"]
    

Put this into config.toml in your project root. Now Hugo is aware that the mediaType with the suffix sql exists, and we want it to be generated when Hugo is creating the homepage, which is using the template called home. Why the home, you may ask? It gives us access to all pages aka articles, without having to stitch together the files later, when using the single template. There is one caveat though.

In your layouts/_default folder, create two new files: home.sql.sql and baseof.sql.sql. Yes, it needs the suffix two times, this isn’t a typo.

I don’t want to write a lot of logic just to update the database with new articles. This would require a lot more code: keeping track of entries and assigning identifiers to articles. To avoid duplicates, I simply drop the table and reinsert everything. Yes, this may bite me in the ass later, when I have so many articles that it becomes very slow. I fully agree.

So, I have 2 parts of SQL that need to be written: the table creation code and the SQL commands for filling it. Previously I had shoved the whole code into home.sql.sql, but that resulted in the table creation code appearing after each row of data. That is why we need baseof.sql.sql. Just like the normal baseof, it only gets inserted once into the template.

This is the SQL for baseof.sql.sql. Pay attention to the part where I create a fulltext index. This is paramount for later queries.

    DROP TABLE IF EXISTS db.blog;
    
    CREATE TABLE db.blog (
        id INT NOT NULL AUTO_INCREMENT,
        published DATE NOT NULL,
        title TEXT NOT NULL,
        body TEXT NOT NULL,
        url varchar(2048) NOT NULL,
        CONSTRAINT id_PK PRIMARY KEY (id)
    )
    ENGINE=InnoDB
    DEFAULT CHARSET=utf8mb4
    COLLATE=utf8mb4_general_ci;
    CREATE FULLTEXT INDEX blog_body_IDX ON db.blog (body);
    
    {{ block "main" .}}
    {{ end }}
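
To illustrate why the index matters: a search against this table would typically use `MATCH ... AGAINST`, which only works on columns covered by a FULLTEXT index. This is a minimal sketch, assuming a MySQL-compatible server. The search term and the `LIMIT` are made up for the example, and this is not the query from the actual search script, which is not shown in this article.

```sql
-- Hypothetical search query. 'output formats' stands in for user input
-- and should be bound as a parameter by the calling script, never
-- concatenated into the statement.
SELECT published, title, url
FROM db.blog
WHERE MATCH(body) AGAINST('output formats' IN NATURAL LANGUAGE MODE)
LIMIT 10;
```

The query selects only the date, title and URL, so the full body is used for matching but never leaves the database.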

And here is the SQL for home.sql.sql:

    {{ define "main" }}
    {{ range (where .Pages "Section" "ne" "gist") }}
        {{ range .Pages }}
        INSERT INTO db.blog (published, title, body, url)
        VALUES(
            '{{ .Date.Format "2006-01-02" }}',
            '{{ plainify .Title }}',
            '{{ (plainify .Content) }}',
            '{{ .Permalink }}'
        );
        {{ end }}
    {{ end }}
    {{ end }}

This is basically no different from any other Hugo template, and it generates a file called index.sql in your website’s root directory. I use this to automatically enter the data into my database upon committing the project to git. Then the file gets deleted. This code also leaves out the gist category, since I do not need it to be full-text searchable.
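
For orientation, the generated index.sql should look roughly like the sketch below. The two articles, their dates and the `example.org` URLs are invented for illustration, and the real file will contain more blank lines and indentation left over from the template.

```sql
-- Sketch of a rendered index.sql with the whitespace tidied up.
-- Both articles below are invented for illustration.
DROP TABLE IF EXISTS db.blog;

-- The CREATE TABLE and CREATE FULLTEXT INDEX statements from
-- baseof.sql.sql appear here, exactly once.

INSERT INTO db.blog (published, title, body, url)
VALUES(
    '2020-10-25',
    'Hello world',
    'First post. Just plain text, no HTML tags.',
    'https://example.org/posts/hello-world/'
);

INSERT INTO db.blog (published, title, body, url)
VALUES(
    '2020-11-02',
    'Second post',
    'Another article body, flattened by plainify.',
    'https://example.org/posts/second-post/'
);
```

It is worth checking how apostrophes in titles and bodies come out in your own generated file, since a stray `'` would end the SQL string early.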

What’s next?

Now you have a bare-bones database with all the important data:

  1. Publishing date
  2. Title
  3. The full text body without html tags
  4. A link to the article

I have written a PHP script that puts out 3 of those values as JSON. I never give out the full text via my little REST-like script, so people can’t scrape it easily. I query that with JavaScript, with CORS enabled in Apache. You can see the JS code in the article linked above. I think this solution gives you the best of both worlds: you keep the technological simplicity of a static site generator and still get a usable search that does not burden the user with unnecessary downloads. Try it out, the search bar is in the footer.

If you have flames, comments or cookies, please mail to contact@tuxstash.de or give me a callout on Twitter. Thanks for reading.
