# Posts

## Digest: Pivot Tables in SQL - Converting Rows to Columns

(This post is cross-posted from http://herbert.id/2017/11/pivot-tables-in-sql/.)

A few weeks ago I gave a talk at NUS Hackers’s Friday Hacks on some advanced tips in using SQL. (It was my first technical talk, btw!) Here, I will elaborate upon one of those tips in more detail.

# Background

I have been working on a website that hosts math olympiad contests monthly. (Shameless plug - source code at https://github.com/donjar/kontes-terbuka, website at https://ktom.tomi.or.id. Even though it is in Bahasa Indonesia, Google Translate gives a good enough translation.) In each contest, there are several structured questions that a contestant can solve. The schema is something like this:

Submissions:

• id: primary key
• user_id: foreign key to users
• problem_id: foreign key to problems
• score: integer

(There is some sample SQL data that can be found here. The file dump.sql is of your interest; check out the contest_scores table. The file contest_pivot_table.sql is the solution to the problem below. This repo was originally made for my SQL Wizardry presentation.)

# Problem

Let’s say the current data looks like this:

 id | user_id | problem_id | score
----+---------+------------+-------
1 |       1 |          1 |     3
2 |       1 |          2 |     8
3 |       1 |          3 |     1
4 |       2 |          1 |     7
5 |       2 |          2 |     3
6 |       2 |          3 |     3
7 |       3 |          1 |     4
8 |       3 |          2 |     2
9 |       3 |          3 |     3
10 |       4 |          1 |     7
11 |       4 |          2 |     5
12 |       5 |          1 |     4
13 |       5 |          3 |     7
14 |       6 |          2 |     3
15 |       6 |          3 |     4
16 |       7 |          1 |     4
17 |       7 |          2 |     4
18 |       7 |          3 |     9
19 |       8 |          1 |     7
20 |       8 |          2 |     2
21 |       8 |          3 |     4
22 |       9 |          1 |     8
23 |       9 |          2 |     0
24 |       9 |          3 |     4
25 |      10 |          1 |     3
26 |      10 |          2 |     3
27 |      10 |          3 |     6

Obviously data like this is not suitable to be shown to the end user. What we want is something like this:

 user_id | problem_1 | problem_2 | problem_3
---------+-----------+-----------+-----------
1 |         3 |         8 |         1
2 |         7 |         3 |         3
3 |         4 |         2 |         3
4 |         7 |         5 |
5 |         4 |           |         7
6 |           |         3 |         4
7 |         4 |         4 |         9
8 |         7 |         2 |         4
9 |         8 |         0 |         4
10 |         3 |         3 |         6

Notice that this involves converting rows of the table into columns: in this case, the problem_id needs to be “split” into problem_1, problem_2, and problem_3.

You can’t do this with a normal SQL SELECT statement! Normally, when you select columns, you can only select based on operations on the existing columns. There are no operations that allow you to convert rows into columns in this way.

You can work around this with table joins though, for example:

WITH (
SELECT
user_id,
problem_1 AS score
FROM submissions
WHERE
problem_id = 1
) AS problem_1_table, (
SELECT
user_id,
problem_2 AS score
FROM submissions
WHERE
problem_id = 2
) AS problem_2_table, (
SELECT
user_id,
problem_3 AS score
FROM submissions
WHERE
problem_id = 3
) AS problem_3_table

SELECT
problem_1_table.user_id,
problem_1,
problem_2,
problem_3
FROM problem_1_table
FULL OUTER JOIN problem_2_table
ON problem_1_table.user_id = problem_2_table.user_id
FULL OUTER JOIN problem_3_table
ON problem_1_table.user_id = problem_3_table.user_id

Basically, problem_1_table is a table that contains the (user_id, score) pair for submissions with problem_id = 1, and so on. Afterwards, we join with them on the user_id to produce the table we want.

In fact, this was our original approach to the problem! The website was built with Ruby on Rails, and we used a loop in Ruby to loop through all problems and generate the corresponding SQL query. I would see SQL queries like this often:

SELECT user_contests.*, marks.short_mark, marks.long_mark, marks.total_mark,
case when marks.total_mark >= gold_cutoff then 'Emas' when marks.total_mark >=
silver_cutoff then 'Perak' when marks.total_mark >= bronze_cutoff then
'Perunggu' else '' end as award, "long_problem_marks_269"."problem_no_269",
"long_problem_marks_271"."problem_no_271",
"long_problem_marks_270"."problem_no_270",
"long_problem_marks_268"."problem_no_268", RANK() OVER(ORDER BY
marks.total_mark DESC) AS rank FROM "user_contests" INNER JOIN "contests" ON
"contests"."id" = "user_contests"."contest_id" INNER JOIN "users" ON
"users"."id" = "user_contests"."user_id" INNER JOIN (SELECT user_contests.*,
short_marks.short_mark, long_marks.long_mark, (short_marks.short_mark +
long_marks.long_mark) as total_mark FROM "user_contests" INNER JOIN (SELECT
user_contests.id as id, sum(case when short_submissions.answer =
short_problems.answer then 1 else 0 end) as short_mark FROM "user_contests"
LEFT OUTER JOIN "short_submissions" ON "short_submissions"."user_contest_id" =
"user_contests"."id" LEFT OUTER JOIN "short_submissions"
"short_submissions_user_contests_join" ON
"short_submissions_user_contests_join"."user_contest_id" = "user_contests"."id"
LEFT OUTER JOIN "short_problems" ON "short_problems"."id" =
"short_submissions_user_contests_join"."short_problem_id" WHERE
"user_contests"."contest_id" = 29 AND ("short_submissions"."short_problem_id" =
"short_problems"."id" OR ("short_submissions"."short_problem_id" IS NULL AND
"short_problems"."id" IS NULL)) GROUP BY "user_contests"."id") short_marks ON
"user_contests"."id" = "short_marks"."id" INNER JOIN (SELECT user_contests.id
as id, sum(coalesce(long_submissions.score, 0)) as long_mark FROM
"user_contests" LEFT OUTER JOIN "long_submissions" ON
"long_submissions"."user_contest_id" = "user_contests"."id" WHERE
"user_contests"."contest_id" = 29 GROUP BY "user_contests"."id") long_marks ON
"user_contests"."id" = "long_marks"."id" WHERE "user_contests"."contest_id" =
29) marks ON "user_contests"."id" = "marks"."id" LEFT OUTER JOIN (SELECT
user_contests.id as id, long_submissions.score as problem_no_269 FROM
"user_contests" LEFT OUTER JOIN "long_submissions" ON
"long_submissions"."user_contest_id" = "user_contests"."id" WHERE
"long_submissions"."long_problem_id" = 269) long_problem_marks_269 ON
"user_contests"."id" = "long_problem_marks_269"."id" LEFT OUTER JOIN (SELECT
user_contests.id as id, long_submissions.score as problem_no_271 FROM
"user_contests" LEFT OUTER JOIN "long_submissions" ON
"long_submissions"."user_contest_id" = "user_contests"."id" WHERE
"long_submissions"."long_problem_id" = 271) long_problem_marks_271 ON
"user_contests"."id" = "long_problem_marks_271"."id" LEFT OUTER JOIN (SELECT
user_contests.id as id, long_submissions.score as problem_no_270 FROM
"user_contests" LEFT OUTER JOIN "long_submissions" ON
"long_submissions"."user_contest_id" = "user_contests"."id" WHERE
"long_submissions"."long_problem_id" = 270) long_problem_marks_270 ON
"user_contests"."id" = "long_problem_marks_270"."id" LEFT OUTER JOIN (SELECT
user_contests.id as id, long_submissions.score as problem_no_268 FROM
"user_contests" LEFT OUTER JOIN "long_submissions" ON
"long_submissions"."user_contest_id" = "user_contests"."id" WHERE
"long_submissions"."long_problem_id" = 268) long_problem_marks_268 ON
"user_contests"."id" = "long_problem_marks_268"."id" WHERE
"user_contests"."contest_id" = 29 AND "users"."province_id" = 35  ORDER BY
"marks"."total_mark" DESC

It wasn’t very pretty Ruby and SQL code, but hey, it works. But of course, this does not sit really well with me - using Ruby to generate SQL and using that generated code to query data seems weird, isn’t it?

Of course we can just select all data in SQL and do the processing in Ruby on Rails. However, as the saying goes - do the data processing in the database level, where it is suitable for the job right?

Enter pivot tables.

# Pivot Tables

You might have heard about pivot tables from Microsoft Excel. To quote Wikipedia:

A pivot table is a table that summarizes data in another table, and is made by applying an operation such as sorting, averaging, or summing to data in the first table, typically including grouping of the data.

The “grouping of the data” is of interest here.

## Caveat

Pivot tables have different implementations across different databases. I am only going to discuss how to do it in PostgreSQL, as that is the database I am using in my application. You should be able to find implementations for other databases by searching for something like “pivot tables MySQL”.

## Crosstab

In PostgreSQL, the relevant function is called “crosstab”. It is available as an extension, and hence, you should install it first by running CREATE EXTENSION tablefunc; as a superuser.

The relevant documentation can be found here. One thing you might notice, though, is that there are actually four different crosstab functions!

The functions are actually generating pivot tables, only with different abstraction levels. I found that the crosstab(text source_sql, text category_sql) function produces the result I needed. To quote the documentation:

source_sql is a SQL statement that produces the source set of data. This statement must return one row_name column, one category column, and one value column. It may also have one or more “extra” columns. The row_name column must be first. The category and value columns must be the last two columns, in that order. Any columns between row_name and category are treated as “extra”. The “extra” columns are expected to be the same for all rows with the same row_name value.

So basically, source_sql is in the form:

SELECT key(s), categories, values FROM ...

and the category_sql matches the values that we want in the category to separate.

An example will make these concepts clearer. Going to the example we discussed previously, we notice that the key is the user_id - basically, this is the column we want as our first column. The category is problem_id, since this is what we want the next columns of the resulting pivot table to be. And finally, the values to be filled in inside the resulting table would be score. It is normal to use an ORDER BY key as well here, so that the resulting pivot table is not jumbled. Combining all of them, we have this source_sql query:

SELECT user_id, problem_id, score FROM submissions ORDER BY user_id

For the category, we know that the problem IDs we want are 1, 2, and 3. Hence, the corresponding category_sql is:

SELECT 1, 2, 3

(in reality we would want something like SELECT id FROM problems ORDER BY id).

And at the end, we would need to specify the columns to be generated as well, with this format:

("column1" category1, "column2" category2, ...)

In this case, it would be:

("user_id" int, "problem_1" int, "problem_2" int, "problem_3" int)

Hence, the resulting SQL query is:

SELECT * FROM crosstab(
'SELECT user_id, problem_id, score FROM submissions ORDER BY user_id',
'SELECT 1, 2, 3'
) AS (
"user_id" int,
"problem_1" int,
"problem_2" int,
"problem_3" int
)

This will return just what we wanted!

## Problems

While this query is “cleaner” than the SQL joins “workaround” I wrote previously, there are still some problems with this solution. Firstly, there is a need to specify the columns in the resulting pivot table. While this can also be generated with the underlying framework (with Ruby etc.), we still need to query two times: once to get the column definitions (for example, SELECT id FROM problems), and once to do the actual crosstab query. If I remember correctly, this problem is not specific to PostgreSQL; other databases also have this problem. This solution will not solve the problem of using the application to craft SQL queries.

The other problem which might not be obvious is about ORMs - it is highly unlikely that ORMs have support for pivot tables. This may cause problems with using ORM tools - for example, while ActiveRecord (the ORM in Ruby on Rails) has a find_by_sql method, chaining it with another ActiveRecord method can cause bugs. In my case, Submissions.find_by_sql(crosstab_query).count does not work (as I assume Rails doesn’t know where to put the COUNT query).

With two queries running, performance might also be a problem. From my observations, the SQL joins query takes around 70 ms, while these two queries take around 50 ms in total; however, this does not account for network overheads etc. A more systematic benchmarking is needed to confirm if this solution is indeed faster than the joining solution.

# Conclusion

If you need to somehow convert rows into columns, and vice versa, to do grouping/aggregation, pivot tables might be for you. While several problems exist, I still believe that pivot table is the right tool for the job and the resulting solution is “cleaner”.

# Introduction

Not liking macOS? Prefer linux? In this post we’ll show you how.

## Requirements

You should have a USB Hub or similar with enough ports for

• One LiveUSB
• One USB Storage Device fast enough to run an OS from (e.g. SanDisk Ultra Fit); we shall call this the target device from here on.
• One input device (USB keyboard and mouse required; for ease two ports recommended)

As well as a MacBook running a macOS version between El Capitan and High Sierra (inclusive).

You should be comfortable running commands on terminal; bash is assumed.

## Caveats

• Internal speaker and headphone jack output is still not working (probably).
• Chaining GRUB2 bootloader probably needs to be manually (possibly automatable) rebuilt every kernel update, and probably the keyboard and touchpad drivers too.
• Seems like installing GRUB2 adds a folder to the internal SSD’s EFI partition that may mess with the MacBook’s native bootloader’s bootable-partition discovery process and its BOOTCAMP bootloading process. This is not a major issue as
• The bootloading process where one holds the Option key and chooses which volume to boot into still works fine,
• Booting into Windows through GRUB2 is possible,
• Removing the ubuntu folder from the EFI Partition, resetting the default boot partition from macOS, and not connecting the external drive containing Ubuntu should revert things to a normal state.
• I strongly do not recommend installing Ubuntu to the internal drive as macOS upgrades tend to assume several things about the partitioning state of the internal SSD and messing around with it outside of Disk Utility and the Bootcamp-proscribed instructions may mess stuff up. In particular, note the macOS upgrades that converted HFS+ partitions so that they were within CoreStorage partitions, and the macOS upgrades that converted HFS+ filesystems into APFS filesystems.

## The Stages

1. Preparing the LiveUSB.
• After this stage you should have a USB containing a live image that you can boot into from your MacBook
2. Preparing the target device to make it bootable.
• After this stage you should have a partition that your MacBook recognizes as a partition containing macOS that it can boot into.
3. Installing Ubuntu to the target device.
• After this stage you should have an Ubuntu installation in your target device that might not be bootable into.
• After this stage you should be able to boot into the Ubuntu installation, but keyboard and mouse support may not be present.
5. Installing and configuring internal keyboard and touchpad drivers.
• After this stage your Ubuntu should have internal keyboard and touchpad support, but it may no longer be bootable.
6. Rebuilding and reinstalling the bootloader.
• After this stage you should be able to boot into the Ubuntu installation and be able to use your mouse and keyboard.
• This stage is similar to Stage 4, and I think it must be re-performed everytime you update the kernel of your installation.

## Stage 1: Preparing the LiveUSB

1. Download the Ubuntu 17.10 x86_64 Desktop image. I have not tried the other versions, which may have slightly different installation instructions in stage 3. The 17.10 Beta 2 desktop image suffices.
2. We convert the .iso image file into a .dmg disk image file that the MacBook recognizes as bootable.

hdiutil convert -format UDRW -o /path/to/img.dmg /path/to/image.iso
3. We insert our Live-USB-to-be confirm the location our LiveUSB-to-be is at with

diskutil list

It should be identified by a name of the form diskN for some integer N. Then ensure that none of the partitions present on the disk are mounted with

diskutil unmountDisk /dev/diskN
4. We now burn the .dmg file to the disk with

sudo dd if=/path/to/img.dmg of=/dev/rdiskN

Note that writing to rdiskN is speedier than writing to diskN as it skips several layers of software abstraction. Nevertheless, depending on the USB standard your LiveUSB supports, it may still take quite a while.

We are done with this stage. Unmount your LiveUSB.

## Stage 2: Preparing the Target Device to Make It Bootable

1. Insert the target device. Let diskN be its identifier; as before, we may discover its identifier with diskutil list.
2. Erase the disk, write a GPT and EFI partition and a Journaled HFS+ partition to it with

diskutil eraseDisk JHFS+ Ubuntu GPT diskN
3. Split the JHFS partition into a 128MB (size is pretty arbitrary; ex. we have used MB and not MiB) partition for the GRUB2 bootloader and a partition for our installation with a command like

diskutil splitPartition diskNs2 2 JHFS+ "Ubuntu Boot Loader" 128M ExFAT "Ubuntu" R
4. We now mount the “Ubuntu Boot Loader” partition and navigate our terminal shell into its root; the standard Finder mounting suffices, whereupon it will be located at /Volumnes/Ubuntu Boot Loader.

5. We create the necessary folders necessary for the MacBook to recognize it as a macOS installation

mkdir mach_kernel
mkdir -p System/Library/CoreServices
6. We create a .plist text file at System/Library/CoreServices/SystemVersion.plist with the contents

<xml version="1.0" encoding="utf-8"?>
<plist version="1.0">
<dict>
<key>ProductBuildVersion</key>
<string></string>
<key>ProductName</key>
<string>Linux</string>
<key>ProductVersion</key>
<string>Ubuntu Linux</string>
</dict>
</plist>
7. Finally, we set the boot flag for the partition with

sudo bless --device /dev/diskNsM

where M is the partition identifier of the “Ubuntu Boot Loader” partition, which can be discovered with diskutil list.

We are done. Insert the LiveUSB and shutdown or restart the MacBook.

## Stage 3: Installing Ubuntu

We assume that the LiveUSB and the target device are both plugged in, and that the reader shall connect external input devices whenever required for input.

1. When the MacBook starts, immediately during or before the bootup chime, hold down the Option button to enter the native bootloader. Select any of the EFI Boot options.
2. You should boot into the LiveUSB’s GRUB2 bootloader. Select “Try Ubuntu without installing”. Your internal keyboard still works in the GRUB2 bootloader
3. You should arrive at the Live Ubuntu desktop. You should need an external keyboard and mouse while in this environment. Proceed with installation as per usual, except for the following part.
4. At “Installation type”, and presented with where and how on the disks you wish to install Ubuntu, select “Something else”. I do not like connecting to the Internet and updating just yet; we may do that later.
5. Select to format the partition that was diskNsM in macOS as ext4, and use it as the root / mount point. Choose to install the bootloader in the same partition, that is, the partition that was diskNsM in macOS. Leave every other partitions and drives alone.
6. Proceed and complete the installation, but do not reboot just yet.

## Stage 4: Making Ubuntu bootable

You should now be booted into a Live Ubuntu OS. You should have both the LiveUSB connected from which you are running the OS, and also have your target device connected. We shall identify the partition that you have installed Ubuntu 17.10 to by sdAN where A is some small-caps letter and N is some integer.

We now build the GRUB2 bootloader.

1. We mount the Ubuntu 17.10 partition. It suffices to use GNOME’s default mounting, whereupon it will be availabe at some location like /media/ubuntu/some_uuid_string/
2. We bind our Live Ubuntu’s special files so that they are available when we chroot into our Ubuntu 17.10 installation with

cd /media/ubuntu/some_uuid_string/
sudo mount -B /dev dev
sudo mount -B /dev/pts dev/pts
sudo mount -B /proc proc
sudo mount -B /sys sys
sudo mount -B /run run
3. We chroot into our Ubuntu 17.10 installation with

sudo chroot .
4. We configure GRUB2 with

grub-mkconfig -o boot/grub/grub.cfg
5. We build GRUB2 into a boot.efi located at the root of our Ubuntu 17.10 installation with

grub-mkstandalone -o boot.efi -d usr/lib/grub/x86_64-efi -O x86_64-efi --compress=xz boot/grub/grub.cfg
6. From outside the chroot’d shell (that is, from the Live Ubuntu desktop), save your boot.efi file somewhere from your macOS installation (e.g. Google Drive).

7. Reboot into macOS. Due to how the macOS searches for bootable partitions, from now on you may have to always hold down the Option button and select the location you would like to boot into, else you may arrive at a GRUB2 fallback shell.

8. Mount the “Ubuntu Boot Loader” partition. From the Terminal, (Finder glitches out) copy the GRUB2 image into the partition with

cp /path/to/boot.efi "/Volumes/Ubuntu Boot Loader/System/Library/CoreServices/"
9. It does not seem likely, but you may have to re-bless the “Ubuntu Boot Loader” partition.

We are done. Your Ubuntu 17.10 installation should now be bootable.

## Stage 5: Building, configuring, and installing the keyboard and touchpad drivers

You do not need the LiveUSB from here on. Reboot the MacBook with Option key held down while booting, and select “Ubuntu Boot Loader”. You should boot into GRUB2, and should be able to select an Ubuntu menu entry to boot onto your Ubuntu 17.10 installation. From here on you may need to perform input via an external keyboard and mouse. Log in.

1. Connect to the Internet.

sudo apt update
sudo apt upgrade

You may be prompted to restart your system, but it is important not to just yet.

3. Install git and the build tools you will need with

sudo apt install git build-essential

git clone https://github.com/roadrunner2/macbook12-spi-driver
5. Build the drivers as kernel modules

cd macbook12-spi-driver
git checkout touchbar-driver-hid-driver
make
6. Install the kernel modules

sudo mkdir /lib/modules/uname -r/custom/
sudo cp applespi.ko appletb.ko /lib/modules/uname -r/custom/
sudo depmod
7. Write a configuration file to set the touchpad to an appropriate DPI. The file should be located at /etc/udev/hwdb.d/61-evdev-local.hwdb and its contents should be

# MacBookPro13,* (Late 2016), MacBookPro14,* (Mid 2017)
EVDEV_ABS_00=::96
EVDEV_ABS_01=::94
EVDEV_ABS_35=::96
EVDEV_ABS_36=::94

EVDEV_ABS_00=::96
EVDEV_ABS_01=::95
EVDEV_ABS_35=::96
EVDEV_ABS_36=::95
8. We configure the modules to load on boot with

sudo su
echo 'add_drivers+="applespi intel_lpss_pci spi_pxa2xx_platform appletb"' >> /etc/initramfs-tools/modules
update-initramfs -u

You should now have a system that should load the drivers upon boot. But (I’m unsure about this point) you may have to rebuild your GRUB2 bootloader to correctly identify the kernel to boot into.

## Stage 6: Rebuilding the GRUB2 Bootloader

After every system update (upgrade) that rebuilds the kernel, you should re-run this step.

1. Reconfigure GRUB2.

sudo grub-mkconfig -o /boot/grub/grub.cfg
2. Rebuild GRUB2

sudo grub-mkstandalone -o /boot.efi -d /usr/lib/grub/x86_64-efi -O x86_64-efi --compress=xz /boot/grub/grub.cfg
3. Upload boot.efi to some place accessible by macOS.

4. Reboot into macOS.

5. Mount “Ubuntu Boot Loader” and replace the old boot.efi file in that partition with the new boot.efi file. Remember that it must be done with the Terminal as Finder glitches out.

6. I doubt this must be done, but you may need to re-bless the partition.

## Acknowledgments

• Nailen Matschke (nailen@caltech.edu) for instructions on how to boot into Ubuntu from Apple’s native bootloader via intermediately booting into a standalone GRUB2 bootloader.
• github.com/chisNaN for easy instructions to install and configure the keyboard and touchpad drivers.

# Introduction

Over the summer, we developed NUSMods Planner, an augmented version of NUSMods that comes with an automatic timetable planner. After much shameless marketing, it received much fanfare and user traffic. According to Google Analytics, we have about 2,600 users. Not bad for our first software project.

Snapshot of our Google Analytics Dashboard, taken in 27th September 2017

Screenshot of NUSMods Planner, observe how the generated timetable has no lessons on Tuesday

# How it all started

The same way every hack does: Over a casual chat with Wei Heng, we discovered we both found manual timetable planning too much of a chore.

There happened to be a school program (how to ahref this to nus orbital) sponsored by Google SG (tears we didnt get the Google trip) at that time, something like a summer-long hackathon, so “hey, why not”.

# Features

Of course, our application would not have gotten so much attention if it merely prevented clashes given a fixed set of modules. Anyone can do that.

Features were actually implemented iteratively after various rounds of testing, by asking our peers what they would like to see in our app.

We had to make certain design decisions when considering whether to include X or Y in our final release, maintaining a balance between usability and usefulness. Allowing users overly fine-grained control usually meant more clutter in the UI, and we wanted a pleasant first-time user experience.

The final list of features are as follows (but why not check it out (at NUSMods Planner yourself?):

• Optional Modules
• Locking lesson slots
• Free Days
• Lunch Hours
• Blocking out too early/late slots
• Website Tour

# First attempt at a solution

(the more technically inclined may check out our technical report)

After some research, we realised that since timetable planning can be expressed as a constraint problem, we could use powerful constraint solvers to help us. We started by learning about smtlibv2, what it could and could not achieve, and how long it would take to do so.

Then began the experimentation. For this part, we used z3, a solver developed by Microsoft Research, which had exposed python api. We initially modeled the problem as an assignment of slots to hours, then used the Distinct API call to make sure there were no clashes. Even with no optional modules and no other constraints, results were dismal. One call took >3s on average, much slower than a naive recursive backtracking algorithm.

# What went wrong

The reason for this was that Distinct internally actually creates $$n^2$$ (where $n$ is the total number of hours across all lessons in the specified modules) variables $$a_i\neq a_j$$ and adds their conjunction to the constraint. The number of clauses was thus polynomially bigger than what we expected. We didn’t want to give up on z3 and revert to the naive solution (recursive backtracking), since that would make supporting the features above impractical, so…

# Back to the drawing board

We had the Eureka moment while getting a drink at a cafe (did I mention we both love food): Instead of mapping lessons to hours, we could map each hour to a lesson (since timetables in NUS are fortnight-periodic)! Of course, there was still work to be done to properly express this concept in terms of first-order logic (we used selector variables and multiple implications), but once we did this, the solving time drastically improved. This method also could be naturally scaled to support all our intended features.

# We have a working, efficient algorithm, now what?

We met up with the NUSMods core team (Zhi An and Li Kai) for a demo, and they were impressed and agreed to be our mentors. They also tentatively agreed to code integration once we were done, subject to code quality. Over the course of the project, they provided invaluabe guidance and helped us navigate through their codebase, and we plan to finish codebase merging before their release of NUSMods V3.

# Modifying the NUSMods codebase

Being total newbies to Web App development, this was a huge challenge, and in fact what we spent the most time on. We started by taking a refresher course in JavaScript, then familiarising ourselves with React and Redux

At first, we were confused as to why there was a need for so much boilerplate code, but as we implemented more and more features and bug-fixes, we were grateful that the codebase was so clean, structured and modular. Hot reloading and tools like Redux DevTools also made previewing changes a lot faster and debugging a lot easier.

# Pushing Work to the Client

Once we had written our timetable-generating scripts, we wrote a simple server app that would respond to queries by calling the scripts and then returning the results. We then deployed it on a Digital Ocean (DO) instance and benchmarked the response time using Postman. To our horror, each query took the server at least 7 seconds to respond. Turns out our laptops were way more powerful than the DO instance (the cheapest option: 512 MB, we’re both poor kids).

Clearly, doing the solving at the server side would not be scalable - there was no way NUSMods Planner could support the same user traffic as NUSMods unless we had access to some computing cluster.

We tried to improve the running time of the solving algorithm by optimizing the problem representation. While we managed to slash the average solving time by half, scalability remained an issue. We had no choice but to push the solving workload to the client.

So we had to find some way to run a SMT solver in the web browser. We came across the project called ResearchJS that aims to share computer science research by compiling it to JavaScript. One of the shared research projects was BoolectorJS, a JavaScript version of Boolector compiled with Emscripten. The client would now load the BoolectorJS scripts and invoke them to solve our queries.

Here we would like to formally thank the above open-source project contributors - our project would not have been possible without them.

After three months of hardwork, this was how our full stack implementation looked like:

# Further Work

Sadly our summer break was only 3 months; we were looking to add more features. One idea was to optimize the timetable based on some heuristic such as travelling distance or compactness of lessons. Solving optimization problems subject to constraints is an active research topic (see here) but sadly not supported by most SMT solvers (we know that z3 does).

# Conclusion

As SMT solvers get more powerful and efficient, we can see SMT solvers being used as a catch-all solution for constraint-based problems. We look forward to more active development in JavaScript based SMT solvers and their application in web applications.

Overall we individually spent 200+ hours… It was heartwarming to have peers thanking us directly or providing postive feedback that they found the website useful.

## Digest: Preparing for a Coding Interview

Despite scoring decent grades in both my algorithm class and my data structures class in university, I shudder at the thought of going through a coding interview that focuses on algorithms.

Hence I spent the last three months figuring out how to improve my coding interview skills and eventually received offers from the big tech companies. In this post, I’ll be sharing the insights and tips I gained along the way. Experienced candidates can also expect system design questions, but that is out of the scope of this post.

Many of the algorithmic concepts tested in coding interviews are not what I usually use at work, where I am a front-end web engineer. Naturally, I have forgotten quite a bit about these algorithms and data structures, which I learned mostly during my freshmen and sophomore years of college.

It’s stressful to have to produce (working) code in an interview, while someone scrutinizes every keystroke that you make. What’s worse is that as an interviewee, you’re encouraged to communicate your thought process out loud to the interviewer.

I used to think that being able to think, code, and communicate simultaneously was an impossible feat, until I realized that most people are just not good at coding interviews when they first start out. Interviewing is a skill that you can get better at by studying, preparing, and practicing for it.

My recent job search has led me on a journey to improve my coding interview skills. Front-end engineers like to rant about how the current hiring process is broken because technical interviews can include skills not related to front-end development. For example, writing a maze solving algorithm and merging two sorted lists of numbers. As a front-end engineer myself, I can empathize with them.

Front end is a specialized domain where engineers have to care about many issues related to browser compatibilities, the Document Object Model, JavaScript performance, CSS layouts, and so on. It is uncommon for front-end engineers to implement some of the complex algorithms tested in interviews. At companies like Facebook and Google, the people are software engineers first, domain experts second.

Unfortunately, the rules are set by the companies, not by the candidates. There is a high emphasis on general computer science concepts like algorithms, design patterns, data structures; core skills that a good software engineer should possess. If you want the job, you have to play by the rules set by the game masters — improve your coding interview skills!

The content for this post can be found in my Tech Interview Handbook repo and future updates will be made there. Pull requests for suggestions and corrections are welcome!

# Preparing for a Coding Interview

### Picking a Programming Language

Before anything else, you need to pick a programming language to do your interviews in. Most companies will let you code in any language you want, the only exception I know being Google, where they only allow candidates to pick from Java, C++ or Python for their algorithmic coding interviews. Most of the time, I would recommend that you use a language that you are extremely familiar with rather than picking up a new language just for doing interviews because the company uses that language heavily.

There are some languages which are more suitable than others for coding interviews and some languages you absolutely want to avoid. From my experience interviewing as an interviewer, most candidates pick Python or Java. Other commonly seen languages include JavaScript, Ruby and C++. I would absolutely avoid lower level languages like C or Go, simply because they lack in many standard library functions and data structures.

Personally, Python is my de facto choice for coding algorithms during interviews because it is succinct and has a pretty huge library of functions and data structures available. One of my top reasons for recommending Python is that it uses consistent APIs that operate on different data structures, such as len(), for ... in ... and slicing notation on sequences (strings/lists/tuples). Getting the last element in a sequence is arr[-1] and reversing it is simply arr[::-1]. You can achieve a lot with minimal syntax in Python.

Java is a decent choice too but having to constantly declare types in your code means extra keystrokes which results in slower coding/typing speed. This issue will be more apparent when you have to write on a whiteboard during on-site interviews. The reasons for choosing/not choosing C++ are similar to Java. Ultimately, Python, Java and C++ are decent choices of languages. If you have been using Java at work for a while now and do not have time to be comfortably familiar with another language, I would recommend just sticking to Java instead of picking up Python from scratch just for interviews to avoid having to context switch between languages during work vs interviews. Most of the time, the bottleneck is in the thinking and not the writing.

One exception to the convention of allowing you to “pick any programming language you want” is when you are interviewing for a domain-specific position, such as Front End/iOS/Android Engineer roles, in which you would need to be familiar with coding algorithms in JavaScript, Objective-C/Swift and Java respectively. If you need to use a data structure that the language does not support, such as a Queue or Heap in JavaScript, perhaps try asking the interviewer whether you can assume that you have a data structure that implements certain methods with specified time complexities. If the implementation of that data structure is not crucial to solving the problem, the interviewer will usually allow it. In reality, being aware of existing data structures and selecting the appropriate ones to tackle the problem at hand is more important than knowing the intricate implementation details.

If you have been out of college for a while, it is highly advisable to review CS fundamentals — Algorithms and Data Structures. Personally, I prefer to review as I practice, so I scan through my college notes and review the various algorithms as I work on algorithm problems from LeetCode and Cracking the Coding Interview.

This interviews repository by Kevin Naughton Jr. served as a quick refresher for me.

The Medium publication basecs by Vaidehi Joshi is also a great and light-hearted resource to recap on the various data structures and algorithms.

### Mastery through Practice

Next, gain familiarity and mastery of the algorithms and data structures in your chosen programming language:

1. Practice coding algorithms using your chosen language. While Cracking the Coding Interview is a good resource for practice, I prefer being able to type code, run it and get instant feedback. There are various Online Judges such as LeetCode, HackerRank and CodeForces for you to practice questions online and get used to the language. From experience, LeetCode questions are the most similar to the kind of questions being asked in interviews whereas HackerRank and CodeForces questions are more similar to competitive programming questions. If you practice enough LeetCode questions, there is a good chance that you would have seen/done your actual interview question (or some variant) on LeetCode before.
2. Learn and understand the time and space complexities of the common operations in your chosen language. For Python, this page will come in handy. Also find out the underlying sorting algorithm that is being used in the language’s sort() function and its time and space complexity (in Python its Timsort which is a hybrid sort). After completing a question on LeetCode, I usually add the time and space complexities of the written code as comments above the function body to remind myself to analyze the algorithm after I am done with the implementation.
3. Read up on the recommended coding style for your language and stick to it. If you have chosen Python, refer to the PEP 8 Style Guide. If you have chosen Java, refer to Google’s Java Style Guide.
4. Find out and be familiar with the common pitfalls and caveats of the language. If you point out them out during the interview and intelligently avoid falling into them, you will usually impress the interviewer and that results in bonus points in your feedback, regardless of whether the interviewer is familiar with the language or not.
5. Gain a broad exposure to questions from various topics. In the second half of the article I mention algorithm topics and practice questions for each topic. Do around 100–200 LeetCode questions and you should be good.

Practice, practice and more practice!

### Phases of a Coding Interview

Congratulations, you are ready to put your skills into practice! In a real coding interview, you will be given a technical question by the interviewer, write code in a real-time collaborative editor (phone screen) or on a whiteboard (on-site) to solve the problem within 30–45 minutes. This is where the real fun begins!

Your interviewer will be looking out for signals that you fit the requirements of the role and it is up to you to display those signals to them. Initially it may feel weird to be talking while you are coding as most programmers do not have the habit of explaining out loud as they are typing code. However, it is hard for the interviewer to know what you are thinking just by looking at the code that you type. If you communicate your approach to the interviewer before you start coding, you can validate your approach with them and the both of you can agree upon an acceptable approach.

Before the Interview (Remote)

For phone screens/remote interviews, prepare paper and pen/pencil to jot down and visualize stuff. If you are given a question on trees and graphs, it usually helps if you draw out some examples of the data structure given in the question.

Use earphones and make sure you are in a quiet environment. You definitely do not want to be holding a phone in one hand and only be able to type with the other. Try avoiding using speakers because if the echo is bad, communication is harder and repeating of words will just result in loss of valuable time.

Upon Getting the Question

Many candidates jump into coding the moment they hear the question. That is usually a big mistake. Take a moment to repeat the question back at the interviewer and make sure that you understand exactly what they are asking. If you misunderstood and when you repeat back the question, they will clarify.

Always seek clarification about the question upon hearing it even if it you think it is clear to you. You might discover something that you have missed out and it also sends a signal to the interviewer that you are a careful person who pays attention to details. Consider asking the following questions:

• How big is the size of the input?
• How big is the range of values?
• What kind of values are there? Are there negative numbers? Floating points? Will there be empty inputs?
• Are there duplicates within the input?
• What are some extreme cases of the input?
• How is the input stored? If you are given a dictionary of words, is it a list of strings or a Trie?

After you have sufficiently clarified the scope and intention of the problem, explain your high level approach to the interviewer even if it is a naive solution. If you are stuck, consider various approaches and explain out loud why it will/will not work. Sometimes your interviewer might drop hints and lead you towards the right path.

Start with a brute force approach, communicate it to the interviewer, explain the time and space complexity and why it is bad. It is unlikely that the brute force approach will be one that you will be coding. At this point, the interviewer will usually pop the dreaded “Can we do better?” question, meaning that they are looking for a more optimal approach. In my opinion, this is usually the hardest part of the interview. In general, look for repeated work and try to optimize them by potentially caching the calculated result somewhere and reference it later, rather than having to compute it all over again. There are some tips on tackling topic-specific questions that I dive into details below.

Only start coding after you and your interviewer have agreed on an approach and has given you the green light.

Starting to Code

Write your code with good coding style. Reading code written by others is usually not an enjoyable task. Reading horribly-formatted code by others makes it worse. Your goal is to make your interviewer understand the code you have written so that they can quickly evaluate if your code does what you say it does and whether it solves the given problem. Use clear variable names, avoid single letter names unless they are for iteration. However, if you are coding on a whiteboard, you might not want to use extremely verbose variable names for the sake of reducing the amount you have to write.

Always be explaining what you are currently writing/typing to the interviewer. This is not about literally reading out what you are typing to the interviewer. Talk about the section of the code you are currently implementing at a higher level, explain why it is written as such and what it is trying to achieve.

When you copy and paste code, consider whether it is necessary. Sometimes it is, sometimes it is not. If you find yourself copying and pasting one large chunk of code spanning multiple lines, it is probably an indicator that you can refactor by containing those lines into a function. If it is just a single line you copied, usually it is fine. Do remember to change the respective variables in your copied line of code where relevant. Copy-paste errors are a common source of bugs even in day-to-day coding!

After Coding

After you have finished coding, do not immediately announce to the interviewer that you are done. In most cases, your code is usually not perfect and contains some bugs or syntax errors. What you need to do now is to review your code.

Firstly, look through your code from start to finish as if it is the first time you are seeing it, as if it was written by someone else and you are trying to spot bugs in it. That’s exactly what your interviewer will be doing. Look through and fix any minor issues you may find.

Next, come up with small test cases and step through the code (not your algorithm!) with those sample input. Interviewers like it when you read their mind and what they usually do after you have finished coding would be to get you to write tests. It is a huge plus if you write tests for your code even before prompts from them. You should be emulating a debugger when stepping through and jot down or say out the values of certain variables as you step through the lines of code.

If there are huge duplicated chunks of code in your solution, it would be a good chance to refactor it and demonstrate to the interviewer that you are one who values code quality. Also look out for places where you can do short-circuit evaluation.

Lastly, give the time/space complexity of your code and explain why it is such. You can even annotate certain chunks of your code with the various time/space complexities to demonstrate your understanding of your code and the APIs of your chosen programming language. Explain any trade-offs in your current approach vs alternative approaches, possibly in terms of time/space.

If your interviewer is happy with the solution, the interview usually ends here. It is also not uncommon that the interviewer asks you extension questions, such as how you would handle the problem if the whole input is too large to fit into memory, or if the input arrives as a stream. This is a common follow-up question at Google where they care a lot about scale. The answer is usually a divide-and-conquer approach — perform distributed processing of the data and only read certain chunks of the input from disk into memory, write the output back to disk and combine them later on.

### Practicing via Mock Interviews

Interviewing is a skill that you can get better at. The steps mentioned above can be rehearsed over and over again until you have fully internalized them and following those steps become second nature to you. A good way to practice is to find a friend to partner with and the both of you can take turns to interview each other.

A great resource for practicing mock coding interviews would be interviewing.io. interviewing.io provides free, anonymous practice technical interviews with Google and Facebook engineers, which can lead to real jobs and internships. By virtue of being anonymous during the interview, the inclusive interview process is de-biased and low risk. At the end of the interview, both interviewer and interviewees can provide feedback to each other for the purpose of improvement. Doing well in your mock interviews will unlock the jobs page and allow candidates to book interviews (also anonymously) with top companies like Uber, Lyft, Quora, Asana and more. For those who are totally new to technical interviews, you can even view a demo interview on the site (requires sign in). Read more about them here.

I have used interviewing.io both as an interviewer and an interviewee and found the experience to be really great! Aline Lerner, the CEO and co-founder of interviewing.io and her team are passionate about revolutionizing the technical interview process and helping candidates to improve their skills at interviewing. She has also published a number of technical interview-related articles on the interviewing.io blog. interviewing.io is still in beta now but I recommend signing up as early as possible to increase the likelihood of getting an invite.

Another platform that allows you to practice coding interviews is Pramp. Where interviewing.io matches potential job seekers with seasoned technical interviewers, Pramp differs takes a different approach. Pramp pairs you up with another peer who is also a job seeker and both of you take turns to assume the role of interviewer and interviewee. Pramp also prepares questions for you, along with suggested solutions and prompts to guide the interviewee. Personally, I do not really like Pramp’s approach because if I were to interview someone, I would rather choose a question I am familiar with. Also, many users of the platform do not have the experience of being interviewers and that can result in a horrible interview experience. There was once where my matched peer, as the interviewer, did not have the right understanding of the question and attempted to lead me down the wrong path of solving the question.

### Conclusion

Coding interviews are tough. But fortunately, you can get better at them by studying and practicing for them, and doing mock interviews.

To recap, to do well in coding interviews:

1. Decide on a programming language
2. Study CS fundamentals
3. Practice solving algorithm questions
4. Internalize the Do’s and Don’ts of interviews
5. Practice doing mock interviews
6. Interview successfully to get the job

By following these steps, you will improve your coding interview skills, and be one step closer (or probably more) to landing your dream job.

All the best!

Yang Shun is an astute NUS alumni and co-creator of NUSMods, among many other things. He is currently working as a Front End Engineer at Facebook. This post was originally published on Medium.

## Digest: NUS Hackers site migration and redesign

This summer, the NUS Hackers site went through a huge redesign (you’re looking at it right now). It took the efforts of everyone on the core team to make the blog better. In terms of improvements we managed to improve not only the developer experience but also the speed and performance of the website itself. In this week’s digest, we will be talking about some of the improvements we’ve made and how we’ve done it.

## Static site generators

Before we dive into the details, let’s take the time to explain what a static site generator is. A static site generator, as the name implies, generates a website. It is static, because once the html files are generated, they will always look and stay the same, which is in contrast to dynamic sites such as Facebook that present different content every time you visit. These generators takes a bunch of content files, normally written in a format called markdown, along with some template HTML files and combine them to make a website. Add in some CSS for styling, and you’ve got a beautiful website. At NUS Hackers, we originally used a generator called Jekyll. More alternatives can be found at staticgen.

## Migrating

When the proposal was raised to redesign the site, that idea seemed incredibly difficult because Jekyll, which was written in Ruby, took roughly 11 seconds for a single change to reflect on the browser, which was simply too long and frustrating. As a result, we decided to move to Hugo.

Hugo boasted about its speed and it proved itself when it would take less than a millisecond to rebuild our 500 page blog. it was an astounding change and made developing much easier and convenient.

Hugo also provided a command line tool to migrate from Jekyll to Hugo, which made the migration incredibly easy - hugo import jekyll was all it took. However that was not to say that there weren’t any difficulties encountered along the way, an easily overlooked feature in Jekyll was that templates could be placed anywhere, even within markdown itself. That was not available in Hugo, and we had to make use of a feature called shortcodes to replace the templates that we had. Shortcodes were simple tags you could write to provide functionality, but they were not as flexible as inlined templates

## Maintainable css

Another feature that was lost was support for SCSS, a superset of CSS, which makes CSS easier to maintain.

We managed to add it back with a build process which also concurrently spawns the Hugo instance. The build process, written in a javascript library called Gulp, allowed us to work on styling the site where every change would automatically reload the browser to reflect the change. In this build process, we also included libraries Prettier and Stylelint-order, which auto-formatted SCSS. This meant we would never have to worry about how our SCSS code was written, creating stress-free code reviews.

## Faster site & build previews

The last feature that we added was to move our hosting to a service called Netlify.

Not only did it improve the load time of our site through their incredibly fast Content-Distribution Networks (CDN), they also provided build previews whenever we pushed to GitHub. This meant that any new changes could be previewed in an actual fork of the site. It certainly made reviewing a lot easier and more trustworthy.

All said and done, the new blog is incredibly fast, convenient and sleek. We hope that this digest encouraged you to start your own. Our code is always publicly available at GitHub!

## Digest: A parallelized ray tracer in the browser

This post was originally published at staceytay.com.

This semester, I got a chance to write a ray tracer for my parallel programming class. I took this class because I thought it’d be fun to learn about parallel programming constructs and practices, but I didn’t think that I’d end up writing a ray tracer for a class assignment. Although I’ve never done any computer graphics programming before, I thought this assignment was neat since it gave me an idea of how computers can generate images and I ended up with a very visual result of my work. So, I wrote this post to consolidate what I learnt on ray tracing and doing it in the browser using the GPU, with help from GPU.js.

Since I’ve never had any direct experience with programming computer graphics before, I started this assignment by trying to understand what ray tracing is and how it works. In doing so, I found two particularly useful articles. The first, Ray Tracing: Graphics for the Masses, is a very readable and useful introduction to ray tracing’s basic concepts and terminology. The second, a narrative-style tutorial that implements a ray tracer in JavaScript, focuses more on implementing an actual ray tracer in JavaScript. It was especially helpful in demonstrating how to generate rays from a camera.

## How ray tracing works

Ray tracing is a method for generating realistic looking images based on tracing the path of light rays through a “world”, i.e. the scene in which we would like to generate these images. It mimics the way in which light enters a camera or an eye to form images and simulates that process to construct images. Imagine you’re shooting a movie. You’d have a camera, lights, and things in your set. When recording your movie, your camera records whatever light rays that enter its viewpoint. In ray tracing, you have the opposite effect for the light rays. Instead of calculating the rays based on light sources in the scene, we generate the rays out of the camera so that we only compute the colours of the rays that we know will be inside the picture. This way, we don’t waste resources computing rays that will never be part of the image.

An illustration from Wikipedia on how raytracing works

Ray tracing, especially when compared to other techniques for generating images, allows for realistic rendering of images since we can naturally mimic properties of light—such as absorption, reflection, and refraction—when it hits a surface. As a result, ray tracing is computationally costly and hence typically suited to situations where we can pre-render the images ahead of time. But, since each ray trace, and correspondingly its pixel colour on the screen can be determined independently, determining the colours for each pixel and hence generating the image can be done in parallel and hence makes this problem particularly suited for the GPU. This is where GPU.js comes in.

## Parallelizing ray traces using GPU.js

GPU.js is an in-browser transpiler that transpiles a subset of JavaScript into OpenGL Shading Language (GLSL). Shading languages are graphics programming languages for characterizing surfaces, volume, and colour. In GPU.js, the GLSL produced uses WebGL—a JavaScript API for rendering interactive 3D computer graphics that’s supported in most modern browsers—to access the GPU. Hence, GPU.js allows the programmer to write in JavaScript and perform parallel computations on the GPU and fall back to JavaScript when the GPU is unavailable. This is very neat since the programmer now doesn’t need to know GLSL to use it and can write once and use it on both CPU and GPU.

GPU.js has been rather apt for this task. You create a kernel in graphical mode and pass in a function that computes the colour for each pixel on a canvas that the kernel returns. This can be as simple as calling this.color (r, g, b) to set the pixel colours, using RGB values, that corresponds to the x and y coordinates that each thread is assigned to.

let gpu = new GPU ()
let kernel = gpu.createKernel (function () {
/* This pixel's coordinates on the canvas */

/* RGB values */
this.color (0, 0, 1)
}, {
/* Options */
dimensions: [800, 600],
graphical: false,
mode: 'gpu'
});

let canvas = kernel.getCanvas ()
document.body.appendChild (canvas)

A simple GPU.js kernel that creates a 800 x 600 blue canvas using the GPU and inserts it into a web page’s body.

However, there are some caveats. GPU.js is still in its early stages of development and so only supports a limited subset of JavaScript. It does not support passing in or returning arrays in kernel functions and does not support recursion (since GLSL does not support recursion). Hence, it took a bit more coding than usual to get the ray tracer up and running, especially since the trace () function does a number of vector operations and is typically used recursively. I worked around this by making trace () iterative and writing functions such unitVectorX (Vx, Vy, Vz) to compute the x value for a vector V’s unit vector, and similar functions unitVectorY (Vx, Vy, Vz) and unitVectorZ (Vx, Vy, Vz) to return its y and z values. It was a tiny bit annoying, but necessary.

## Results

It took a number of nights working on this and getting used to GPU.js’ quirks, but I’m very pleased with the end product. It was especially gratifying to be able to see the visible differences when implementing specific light tracing properties such as Lambertian reflectance and specular reflection—giving a surface a “matte” appearance and a mirror-like reflection respectively. I also like that the ray tracer runs in a browser, which allowed me to make an interactive ray tracer online. I thought Jingwen did a really great job in making his ray tracer more interactive (you can move the camera and dynamically add more spheres in it). Additionally, when toggling between CPU and GPU mode, the differences in rendering speed is notable, so GPU.js was definitely helpful in making tracing fast. I don’t think it would have been as cool if the ray tracer wasn’t fast enough to see the animations run smoothly. This has been an unexpected and unexpectedly fun assignment.

A screenshot of my ray tracer, at http://staceytay.com/raytracer/.

## References

2. Raytracing. http://www.macwright.org/literate-raytracer/

## Digest: Editor Internals - MS Word

Recently, I delivered a presentation on techniques that enable real-time collaboration, with a focus on text editors. Since, I’ve developed an interest in how text editors are implemented. This series is an attempt at an overview of the internals of all popular text editor implementations (including, but not exhaustively: MS Word, Vim, Emacs and Atom).

The implementations I present are far from an exhaustive list of all possible text editor implementations. Simpler implementations can be easily conjured (eg. simply represent the text as an array of characters). However, these naive implementations are inefficient, and never found in product-ready text editors, and are hence not discussed.

# Abstracting the text editor

Before any discussion about text editors is made, we need to define a few terms. An item is the data type in question. In most implementations, the item refers to an ASCII character. A text editor maintains a sequence (ordered set) of items. The position of an item is the index of the sequence which identifies the item (usually 0-indexed).

All operations on a text editor can be derived from the following primitive operations:

1. Empty(): Create a new empty sequence
2. Insert(seq, pos, i): Insert new item i at position pos in sequence seq.
3. Delete(seq, pos): Delete the item at position pos in sequence seq.
4. ItemAt(seq, pos): Returns the item at position pos in sequence seq.

Compound operations such as pasting (multiple inserts) can be obtained with a combination of these operations.

## Buffers, Spans and Descriptors

A buffer is a set of sequentially addressed memory locations. Think of it as a slice of memory (although not technically correct, it happens to be true with regards to text editors). This sequence of memory locations do not have to correspond to that of the sequence. For example, the sequence “abc” need not be represented as “abc” in the buffer. In the scenario where both the buffer and the sequence are logically contiguous, we call it a span. Descriptors are simply pointers to these spans.

Because many text implementations require buffers to be able to grow infinitely large, buffers tend to be stored in disk files, rather than in memory.

# Piece Table

We start our journey by looking at piece tables, found in early versions of Microsoft Word, and Abiword, the document editor I swore by several years ago. Both Abiword and MS Word use an in-RAM piece table.

## Implementation

The piece table uses two buffers. The first buffer is called the file buffer, and contains the original contents of the file, and is read only; the second buffer is called the add buffer, and is a dynamically-sized buffer that is append-only.

The sequence starts off as one big span, and is subsequently divided into smaller spans after several editor operations.

The piece table is implemented as a doubly-linked list, often with sentinel nodes at the head and tail. The choice of a doubly-linked list is because of its simplicity, and is not the only data structure compatible with piece tables.

While pieces contain no text; they contain information about the text object it represents. Specifically, piece objects store the start, end and buffer ID, and all together they identify a span in the sequence.

struct piece
{
piece   * next;
piece   * prev;

size_t   start;
size_t   length;
size_t   buffer;
};

It follows from this design that pieces don’t know their logical position in the sequence. They only know the physical location of the data they represent, and this makes insertion and deletion quick. It does, however, suffer from the same limitations of a doubly-linked list: There is no way to directly access a specific text-offset within the document. This issue is circumvented if a tree structure were to be used instead of a linked list. All this will be clear in the next section, where the piece table is put to the test.

## Initializing a piece table

We start off by opening our file, containing the text “piece outz”.

This creates a read-only file buffer containing the ten characters, and an empty add buffer of arbitrary size.

Empty() is called, creating a doubly-linked list with just the sentinel head and tail nodes. A span is then created, which represents the entirety of the file buffer contents. This span is appended to the linked list.

## Removing text

Suppose we want to change the text from “piece outz” to “pizza outz”. First, we’d have to remove 3 characters “ece”.

Here, the original big span is split into two. The two spans represent, respectively:

1. The items before the delete (in this case “PI”)
2. The items after the delete (in this case “ OUTZ”)

In general, the delete operation increases the number of spans in the piece table by 1.

## Insert text

Now, we have to insert the characters “zza” in the correct position. Note that if several items are inserted in a row, the items are combined into a single span. These characters are first appended to the add buffer. Next, the span is split into 3:

1. The first span represents the items of the old span before the insert. (“PI”)
2. The second span represents the inserted items (“ZZA”)
3. The third span represents the items of the old span after the insert (“”)

Notice that the original file is read-only; this is friendly for caching. Also, everything is append-only, lending itself to cheap undos by smartly storing previous descriptors.

In addition, text editors like MS Word require the storage of formatting information. This is trivial, since appended text in the add-buffer never changes its memory address, and can be safely refered to with pointers.

The size of the piece table is a function of the number of insert and delete operations, and not the size of the file, which makes this ideal for editing large files.

The idea behind the piece table I described in this article is simple. Implementation, however, can get unwieldy, and is certainly lengthier than other data structures. The versatility of piece tables lend themselves to the availability of a whole host of optimizations, which tend to end up as a huge ball of mud.

Modern text editors seldom implement piece tables underneath. Piece tables were invented some 30 years ago, in an era of limited memory and slow CPUs. Computers then were unable to handle less efficient sequence data structures, such as arrays of lines as spans. These archaic data structures power Vim and Emacs, and no one’s complaining about the editors being slow. This, however, is an article for another day.

In the next article of this series I’ll be covering Emacs and some of the common misconceptions surrounding it.

## Introduction

When I first learned Rails and web development in general, I wanted to create a website for my organization. One requirement for the website was to have an authentication system, so I searched for Ruby gems that could help with this. One of the most popular gems is devise. However, this gem is not recommended for beginners to Rails as mentioned in the README!

Based on my experiences, this article will walk through the steps needed to create a sufficiently secure authentication system in Ruby on Rails 5.0.1 on Ruby 2.4.0.

## What to store in database?

It is totally not advisable to store plain passwords in the database. For one, developers can see the password you store, which is already a security breach. An even more important reason, though, is to limit the damage when your database is compromised. Imagine if you store plain passwords in the database and your database is stolen by hackers - now the hackers are able to compromise all accounts in your system easily. Now imagine if you store them in a secure format such that even if the hackers steal your database, they still can’t get the passwords easily - less damage would be done that way.

To achieve this, we will hash the passwords in the database. Basically, what this means is that we will use a hash function that takes your passwords as an input and gives out a (seemingly) random string of a certain length. There are some properties that make certain functions suitable as a hash function: - it has finite output possibilites (since it has a fixed length); however, it is almost impossible to find two inputs that hash to the same output - computing the hash is quick and deterministic (meaning that hashing the same input always produces the same output) - given a hash, it is nearly impossible to find an input that hashes into it - a small change to the input makes a big change in the output

For our purposes, we will be using BCrypt that is conviniently bundled already in Ruby on Rails.

There is another problem though - if we only store the password hash, an attacker can compute as many hashes as they can, and store it in some table. When the database is compromised, the attacker can just use this table to lookup inputs that hash to these passwords! This attack is called rainbow tables.

Hence, we need to “upgrade” our hashing system by including password salts. Essentially, this means our hashes depend not only on our initial passwords, but also on the randomly generated salts. Each salt is associated with a user in our system and every user should have different salts, so that precomputing the hashes would take much more time (since the attacker also needs to use various salts as well).

Hence, in our database, we will store three things: 1. Username - string, not null, must be unique 2. Password hash - string, not null 3. Password salt - string, not null

So much about security. Let’s get to work!

## Creating our application

Let’s create our app! Assuming you have installed Rails (check this out if you haven’t), in your project root, run rails new simple-auth (here simple-auth is the name of my app). After the installation is done, enter your app folder.

Now, let’s generate the models we need for the application. To do so, run: bin/rails g model User username:string password_hash:string password_salt:string

Check the migration file generated (in db/migrate folder). Let’s add not null and unique constraints on the database by changing the migration file into:

class CreateUsers < ActiveRecord::Migration[5.0]
def change
create_table :users do |t|
t.string :username, null: false, unique: true

t.timestamps
end
end
end

Afterwards, run bin/rails db:migrate to migrate the database.

Next, enable the BCrypt gem by uncommenting the corresponding line in the Gemfile: # gem 'bcrypt', '~> 3.1.7'. Run bundle install afterwards (or just bundle).

class User < ApplicationRecord
validates :username, presence: true, uniqueness: true
validates :password, presence: true, on: :create

before_validation(on: :create) do
end

end

private

end
end

Here’s some explanation: - attr_accessor :password adds a “password” field in your User model that is not saved into the database. - The next two lines validate both username and password, by making sure they exist. Username is also ensured to be unique. Now you might realize that the username is already guaranteed in the database to be not null and unique; why bother doing it again in the model? Actually, it is enough to have it in either the model or the database (assuming the database only interacts with this app). I’m just going to including it in both. - The before_validation block makes sure that the password is encrypted before creating. - The authenticate method provides a method in the user to check if the given password (as argument) is the user’s correct password. - The encrypt_password method generates the salt and the hash. It is set private because there is no need to expose this method.

Now that we’ve done this, let’s open the console by using bin/rails c.

Let’s create a user! For example, I’ll create a user named donjar with password NUSh4ck3r5. Run in irb: u = User.create(username: 'donjar', password: 'NUSh4ck3r5')

The user should now be saved into the database. If you run User.all it should show your user, along with its hash and salt. Here’s what happened in my computer: #<ActiveRecord::Relation [#<User id: 1, username: "donjar", password_hash: "$2a$10$xb9MMJaEm7aP.lKdIPG0aeH/etrXqE4A/ObLri/8IMT...", password_salt: "$2a$10$xb9MMJaEm7aP.lKdIPG0ae", created_at: "2017-02-16 15:03:23", updated_at: "2017-02-16 15:03:23">]>

This can be different from the user in your computer - it’s okay! In fact, this shows that the salt is randomly generated, and with the same password yet a different salt, the hash will be very different.

Now we can test authenticating our user:

> u.authenticate('NUSh4ck3r5')
=> true
> u.authenticate('NUShackers')
=> false
> u.authenticate('nush4ck3r5')
=> false

So this means our user system is working. Yay!

## What’s next?

For a simple authentication system that is only working in the model, this is sufficient. However, if you are going to create a full application with view and controller, you also need to add the login and logout functions, which need more than this. There are many tutorials online - we might even add one in the future!

## Digest: Summary of the Current State of LaTeX

This article will be structured in two parts. The first discusses reasons to use or not use LaTeX to typeset your document, while the second discusses the current landscape of TeX engines, document typesetting systems, and distributions.

## 1. Introduction: do you really need to use TeX?

Back in its early days, the TeX typesetting system may have been the premier typesetting solution. This is no longer true for many use cases. Popular, commercial typesetting systems have matured considerably over the last decade. Word 2007 incorporated native equation composition which markup is actually more semantically expressive than TeX’s, and Word 2010 incorporated support for ligatures. Barring mathematical typesetting, I believe (but cannot testify) that solutions offered by Adobe (especially InDesign) and Apple come close to, or surpass, the TeX system. In addition, the WYSIWYG nature of most of these solutions’ UX lend them a tighter REPL.

What redeeming qualities, then, does typesetting in the TeX system offer? To me, there are three app-killers; one is the ability to make document-global edits through two mechanisms: TeX macros, and good ol’ search-and-replace. These features, lacking in affordable WYSIWYG editors, allow us to replace a fallible hunt-and-peck editing operation with a quick, straightforward one. The other is the number of packages that facilitate typesetting many non-conventional texts. The last, of course, is that it is free.

### Macros

It is not infrequent, during the composition of a document, that you find it necessary to change the appearance or treatment of a certain class of text, say the placement of floats or bullets of a list. As far as I know, outside of theming, Word does not offer a reliable mechanism to perform such document-wide changes reliably, and I am uncertain, but skeptical, about other commercial WYSIWYG solutions.

A TeX system, on the other hand, allows us to define macros that specify markup, and define them in the document preamble, hence localizing markup. TeX macros offer another advantage: they allow us to specify semantic markup for (document-local) classes of text rather than having to resign ourselves to presentation markup. By semantic markup for document-local classes of text, I mean notions that are not universal in all documents that need to be typeset and hence are not offered in WYSIWTG systems. Examples include conjugation tables template for language books, theorem environments, the first occurrence of new jargon in a textbook, todo placeholders to indicate further writing/editing needed, etc. Tne advantage of semantic markup is that we can easily change the presentation of these classes fromj one location, and that we can easily extract such classes of text for other purposes (e.g. creating flashcards from jargon introduced in a textbook)

### Visible markup

The other mechanism is its amenability to search-and-replace operations. Where we want to make changes on some but not necessarily all markup that fit a pattern, we can easily create a regex for it (though not foolproof, since regex, being regular, has trouble with context-free notions). This is not possible in WYSIWYG documents, since much presentation markup is hidden behind menus and not directly editable as text, and since there is the occasional piece of text that appears the same but has a different markup.

Macros and visible markup taken together mean that it is far easier to ensure consistency in a LaTeX-typeset document than one from a WYSIWYG editor.

### Packages and versatility

TeX was born in academia, and the packages in a typical TeX distribution reflects that. There are packages that provides theorem environments for mathematicians. There are packages for typesetting editions of ancient manuscripts. There are packages that enable more creatively typeset passages required by some genres of poetry. There are packages that allow one to typeset a deluge of footnotes as a paragraph, which is often found in critical editions of literature. There are packages for typesetting code listings and for pseudocode. Hence if your document requires unusual typesetting that nevertheless is already popular and conventional among a group of specialists, chances are you’ll find a package for that.

After weighing the pros and cons, if you are still interested in typesetting a document in LaTeX, and are interested in producing a document beyond the generic Computer Modern look, read on.

## 2. Engines, Document Preparation Systems, Distributions

### Beyond TeX and pdfTeX, the XeTeX and LuaTeX engines.

#### pdfTeX

The TeX program by Knunth, while stable and complete, nevertheless was born too early and misses out on three important developments in digital typesetting: the PDF format, and OpenType. The output of TeX is a DVI file, which while convertible to a PostScript or PDF file, lacks PDF features like PDF hyperlinking, PDF forms, and PDF table of contents, etc.

To remedy this (and introduce a couple other features), pdfTeX was born, that primarily compiled directly to PDF files.

#### XeTeX

Nevertheless, neither TeX nor pdfTeX support newer font technologies such as TrueType and OpenType. Among others, TrueType and OpenType formats allow font distributors a standardized way to specify and provide support for many typographic features; specifying different features for different languages and scripts, support for small caps, superscripts, subscripts, lining figures, old-style figures, proportional figures, ligatures, optional ligatures, combining accents, contextual alternatives, etc., all in a single, customer-friendly font file. Hence typographers have embraced OpenType. My favorite examples of the versatility of OpenType comes from Kazuraki and Zapfino Arabic.

The XeTeX engine introduced compatibility with OpenType fonts.

#### LuaTeX

Current development, however, is focused on LuaTeX. As the name suggests, LuaTeX is an engine that exposes Lua bindings. It is also designed to be compatible with XeTeX documents, supporting OpenType as well, though having different internals.

The TeX program, along with its macro language, was invented before the benefits of proper encapsulation in structured programming and in object-oriented programming were widespread and accepted. In addition, TeX lacks first-class support for numerical computation. All these legacy decisions mean that writing involved macros and macro packages often involve a fair number of kludges and side cases, arcane, non-obvious code, and non-obvious incompatibilities with other packages. LuaTeX is a step towards alleviating this situation, along with LaTeX3 (though the latter’s pace of development makes Perl 6 look like the Road Runner.)

In between, there have been other TeX-like engines being developed such as NTS, ExTeX, Omega and Aleph, though their functionality have been largely superseded by newer engines.

### TeX, LaTeX and ConTeXt

Documents compiled by any one of the above engines can be said to be TeX, LaTeX, or ConTeXt documents. TeX documents have to specify themselves the presentation of their elements somewhere. LaTeX is a macro framework sitting on top of TeX that specifies sane defaults, guided by an objective of allowing the author to care as little about the presentation of their document if desired; LaTeX documents assume that they will be compiled along with the LaTeX. As a result, it is not always straightforward to change certain aspects of a LaTeX document. ConTeXt is another macro framework incompatible with LaTeX, but ConTeXt is far more customization-friendly out of the box. The latest version of ConTeXt supports only LuaTeX, utilizing its modern feature set. Both LaTeX and ConTeXt specify mechanisms to import third-party packages. Ultimately, however, for better or worse, LaTeX, perhaps by virtue of being more accessible than TeX and of having been started earlier than ConTeXt, is the target of most third-pary packages being distributed freely.

### TeX Live

Whereas there used to be a variety of TeX distributions, TeX Live for Linux and Windows and its twin MacTeX have emerged to be the superior, albeit slightly weighty, standard distribution. They are maintained by the TeX Users Group and come with utilities to regularly update or auto-update from the CTAN (Comprehensive Tex Archive Network) mirrors. Being somewhat of a niche piece of software, CTAN packages do not often update as frequently as those in repositories for OSes, popular programming languages, etc.

## Ending Note

I hope this clarifies for the reader what the words and differences between TeX, LaTeX, ConTeXt, pdfTeX, XeTeX, LuaTeX, CTAN, etc. are, and gives an insight into the development history of the TeX ecosystem. If TeX is right for you, I encourage you to try out the newer engines and try out some novel customizations. To that end, I am writing a companion post which is an annotation of a template that I base my documents off.

## Digest: Dates with JavaScript

This post is about the various ways you can create a Date in JavaScript, both documented and undocumented, how it works under the hood, and the (maybe surprising) results you get.

I first came across this while working on NUSMods and Flow warned that passing an array of values to Date is not valid. However if you actually try it, say in node interpreter or in your browser’s console, it works:

new Date([2016, 8, 8])
> Mon Aug 08 2016 00:00:00 GMT+0800 (SGT)

And indeed, this is not a specified way of constructing a Date - you don’t see it in MDN reference for Date.

However, my code still ran fine. And when I tried to change it to use multiple arguments instead, the result was different:

new Date(2016, 8, 8)
> Thu Sep 08 2016 00:00:00 GMT+0800 (SGT)

Notice how it is Sep now instead of Aug.

But if I passed in a string, it works as expected:

new Date('2016, 8, 8')
> Mon Aug 08 2016 00:00:00 GMT+0800 (SGT)

This led to me investigating how the Date constructor works.

## Creating a Date

The information below comes from this very informative MDN reference on Date

Creating a Date looks pretty straightforward in JavaScript, you call the constructor with the new keyword - new Date().

There are 4 specified ways, of varying convenience, that you can call the constructor.

1. No arguments, gives you the current time: new Date()

2. Integer value (really a number) - treated as milliseconds since epoch (1 January 1970 00:00:00 UTC): new Date(1483372800000)

3. String representing a date - parsed by Date.parse(): new Date('2017-01-03')

4. Multiple number arguments - minimally requires a year and month, but you can specify up to 7 (extra arguments are ignore): new Date(2017, 0, 3)

## An aside on months

If we read the reference in more detail, we can understand why this happens:

new Date(2016, 8, 8)
> Thu Sep 08 2016 00:00:00 GMT+0800 (SGT)

month

Integer value representing the month, beginning with 0 for January to 11 for December.

So 8 is treated as the 9th month, which is Sep.

How about the case when we pass in a string?

new Date(Date.parse('2016, 8, 8'))
> Mon Aug 08 2016 00:00:00 GMT+0800 (SGT)

The month is treated more naturally, in this case 8 is parsed as August.

So what’s left is the case when an array is passed into Date.

## Creating Dates with every possible type

It helps to first examine the data types in JavaScript, and pass each of them to Date to see what happens.

There are 6 types: - number - string - null - undefined - boolean - object (an array is an object)

An integer in JavaScript is really a double (number type):

(new Date(10)).valueOf()
> 10
(new Date(10.9)).valueOf()
> 10

we use valueOf to get the milliseconds represented by the date (which is also how Date is implemented)

Seems like a double is rounded down.

Passing null and boolean seem to give us the same effect as passing 0:

(new Date(0)).valueOf()
> 0
(new Date(null)).valueOf()
> 0
(new Date(false)).valueOf()
> 0
(new Date(true)).valueOf()
> 0

How about undefined?

(new Date(undefined))
> Invalid Date
(new Date(undefined)).valueOf()
> NaN

It sets the internal value of Date to NaN.

And the same goes for an object:

(new Date({})).valueOf()
> NaN

However, when we pass in an array (which is an object), even with just 1 value, we get a perhaps surprising result:

new Date([0])
> Sat Jan 01 2000 00:00:00 GMT+0800 (SGT)

## Arrays, primitive value, and default value

So what’s going on? A quick Google search turned up this StackOverflow result.

The answer mentions something about converting the argument into a primitive value by calling an internal method called [[DefaultValue]], which converts the array into a string.

I like how concise the article is, but I also felt that it will be instructive to dig deeper into the spec and also the source code.

The relevant part of the ES5 spec is section 15.9.3.2. It describes what to do when a single argument is passed to the Date constructor.

1. Let v be ToPrimitive(value)

The first step is to convert the input to a primitive value using ToPrimitive, which behavior is detailed in section 9.1 of the same spec.

Of the 6 types mentioned above, only Object is not a primitive value. So it needs be converted.

An array in JavaScript is really an object, so this falls to the last case in the table, which is to “Return a default value for the Object” by calling [[DefaultValue]]. This is also detailed in the spec in section 8.12.8.

The description slightly more involved than what we’ve seen, so in short, the [[DefaultValue]] of an array is the result of calling toString() on the array.

[0].toString()
> "0"

So what really happened was that when we passed [0] to Date, it is first converted to the string "0", and this falls into case 3 of creating a Date. "0" is successfully parsed into a Date which gives us Jan 1st of 2000.

We can verify that this is the case:

(new Date([0])).valueOf() === (new Date('0')).valueOf()
> true

So now we have an answer to the original example:

[2016, 8, 8].toString()
> "2016,8,8"

> new Date([2016, 8, 8])
Mon Aug 08 2016 00:00:00 GMT+0800 (SGT)

> new Date([2016, 8, 8].toString())
Mon Aug 08 2016 00:00:00 GMT+0800 (SGT)

(new Date([2016, 8, 8])).valueOf() === (new Date('2016,8,8')).valueOf()
> true

## Implementation

I also wanted to see how the implementation is like so I went around digging. I thought I could find this in V8, but I only found a date.cc that didn’t seem to implement the Date constructor.

Digging a bit more led me to builtins-date.cc, which is basically what is called when we write new Date() in JavaScript.

Here you can see the checks for the number of arguments, as well as calling Object::ToPrimitive on the value, and then parsing it if it is a string.

Looking around a bit more, I stumbled upon a test file in node, date-constructor.js, which mentions a WebKit bug, that led me to look into WebKit.

This has a similar implementation to V8, in DateConstructor.cpp.

Slight aside on modern JavaScript engines: V8 is the JavaScript engine that powers chrome (and node), whereas WebKit uses JavaScriptCore, which is what powers Safari.

And if you’re willing to dig deeper, you can figure out how toPrimitive is implemented starting from builtins-conversion.cc in V8 and JSObject.cpp in WebKit.

But for now, my Date with JavaScript is over.

## Digest: Text Generation with Markov Models

Imagine if you could write a program to write you an essay, or produce a tweet, or even generate a research paper? Such an activity can be performed with reasonable quality by the use of something known as a Markov Model, with the help of an input text. In this post, I’ll be covering a little about what Markov Models are and how they can be used to generate potentially interesting text.

Disclaimer: I am by no means an expert on this topic, and this post is merely a way of sharing a cool idea I learnt during one of my classes (CS 2020) as a student at NUS.

## Background

A Markov Model is used to model a system that randomly evolves with time, where the future state depends only on the current state, and not on any of the previous states. To clarify what that means, take for example a simple game, such as snakes and ladders. Each position on the board reflects a state, and your next position is completely random (determined by the dice roll). Furthermore, your next position only depends on your current position on the board, and not where you were earlier. Such a process, that evolves randomly with time, is called a stochastic process, and one where the future state depends only on the current state is called a Markov Chain, named after the Russian mathematician Andrey Markov.

A Markov Model is used for such processes, to find the probabilities of being in particular states given an initial state and the probabilities of transitioning from one state to another. So, in the case of snakes of ladders, every state can lead to six new states, each one with a probability of 16.

An example of a Markov Chain from Wikipedia:

### Text Generation

Claude Shannon, in his famous paper A Mathematical Theory of Communication (which laid the groundwork of modern information theory), proposed, among other things, using a Markov chain to statistically model the sequences of letters in a piece of English text. The idea is to have every state represented by a k-gram, or a string of length k, and to assign probabilities of transitioning from one k-gram to another, based on the frequency of characters appearing after the k-gram in the input text. Such a model is called a Markov Model of order k.

Here is an example of the idea, for k = 1. Suppose the input text is “a g g c g a g g g a g c g g c a g g g g”  We can construct our Markov Model of order 1 as follows

k-gram(k = 1) Subsequent Character Probability
a g 44
g g 712
g a 212
g c 312
c g 23
c a 13

What this means is that the probability of finding a ‘g’ after an ‘a’ is 1, as in the text, every occurrence of an ‘a’ is followed by a ‘g’. Similarly, the probability of finding a ‘c’ after a g is 23 because that’s the number of times a ‘c’ appears after a ‘g’ in the text, divided by the number of times ‘g’ appears.

In this manner, we can construct the model by finding the probabilities of single characters appearing after each k-gram in the input text. Once that is done, we start from a particular state, and then randomly move to new states based on the probabilities, each movement to a new state represents adding a new character to our generated text.

A Diagrammatic Illustration of Text Generation:

You may wonder why such a simple idea like this could give rise to meaningful results. In practice, a lot of it has to do with the value of k. If k is a small value, then the generated text is often one filled with gibberish. On the other hand, if k is too large, the generated text will end up sounding a lot more like the input text. It’s also worth noting that the quality of the text generated depends on the input itself, i.e. you’re more likely to get a better play as your output if the input is a Shakesperean text, than if it is a poltician’s speech.

## Closing Thoughts

Implementing this idea is not hard, and potentially a lot of fun (CS2020 students can testify!) You could also try using words instead of characters, where your k-gram is a string of k words, and you find the probability of a particular word appearing after a given k-gram, rather than a character.

Here are some implementations online of Markov Models:

On a final note, text generation is just one application of Markov Models, they find applications in a large variety of spheres, such as the GeneMark algorithm for gene prediction, the Metropolis algorithm for measuring thermodynamical properties, and Google’s PageRank algorithm.

## Digest: Do you wanna PCB?

The scHnRk and Neander was born in the first half of this year.

Designed and Assembled in Singapore, these boards represent leaps of faith into the unknown world of electronics fabrication.

### Prerequisites

You should find yourself a working electronics lab with a functioning set of tools - multimeter - power supply - soldering iron - oscilloscope

You can access these tools at Ida Labs @ NLB/NDC, HackerspaceSG, Ground Up Intiative, One Maker Group and NUS Hackerspace.

Try to find some money as electronics don’t really come cheap :D

Estimated duration for subprocesses are given.

### Design

The best designs often start with a meaningful problem, and the passion to follow it through to its solution.

You may refer to my previous presentation and video on the scHnRk’s design process.

Time: [Couple of weeks - Months] - scHnRk : 2 months - Neander : 2 months

Unless you’re designing something peculiar, remember to breadboard your designs while laying the schematics out on the CAD software! This way you - have an idea of what connections need to be made - can iron out potential issues with individual components (incompatiblity, interferences, power requirements)

#### Software

In terms of which CAD software to use, I’d recommend KiCad because it’s - open source - “limitless”

#### Visualization

Most PCB vendors ship your designs immediately to the board house. This means that you cannot edit your boards after submission. One of the most common issues with PCB design is alignment errors.

You may want to check your PCBs for these issues with an online tool such as webGerber.

#### Sourcing Materials

The best way to get your hands on some scrummy electronics is Element14 and Taobao. Google Translate is your best friend! Aliexpress too!

A Bill Of Materials helps to keep track of your project materials and costs. Here’s a sample.

#### Microcontroller programming

Some microcontrollers have propietary interfaces that allows you to program the microcontrollers “in circuit”. These interfaces are usually called [iscp]s or [jtag] you may want to check out how they work.

Some chips have its own USB transceiver so you can program it directly from computer.

You may consider investing in a set of test pins/clips.

### Prototyping

This process really depends on the scope of your product and reach.

Time: [Month(s)] - scHnRk : 1 month - Neander : 2 months

#### How many prototyping phases?

Excluding breadboarding, you should have at least 2 prototyping phases. This way you can rectify any wrinkles in your design.

#### PCB vendors

Most PCB vendors usually take on the order of a week to a month to fulfill your order. I’d recommend getting the rush option.

You can test a variety of PCB vendors to see which one gives you the best bang for bucks. I’d recommend OSHPark, dirtyPCBs and seeedstudio.

Electronics don’t assemble themselves. Allocate about 1-2 minutes per component for assembly. Seasoned soldering people can take as little as 10 seconds to populate one component.

#### To PCB with Chemistry

Understand the chemistry behind soldering. It will guide you to - better maintainence of the soldering iron - easier soldering experience

### Production/Assembly

You can get away with soldering low component count boards (up to 5 boards). For added ease, you may consider PCB.NG.

For assemblies of more than 5 boards, I’d recommend getting PCB Assembly services like seeedstudio’s for peace of mind! Test your designs first as they may be hard to interpret at the assembler’s side -> you may get parts in the wrong orientation or placement!

The typical process goes like this

• Solder Paste Application
• Pick and Place
• Reflow Soldering
• Programming
• Soldering (Thru Hole Components)
• Reprogramming
• Packaging

Time: [Couple of weeks - Month] - scHnRk : 2 weeks - Neander : 3 weeks

### Other resources

There are many writeups of people in similar positions. Particularly from Darell, Sudharshan, Bunnie and vonger.

Conorpp has a great article on producing a USB authentication token for the masses.

Andrew Yong had great experiences building an energy monitor using seedstudio’s PCBA. Enclosures can be found here.

## Digest: My experience interning at 3 companies

From Oct 2015 to Aug 2016, I had the good fortune of interning at 3 respectable companies: Google, Stripe, and Uber.

In this post I wish to share my experiences at the 3 different places.

## Interview process

The interview processes at the 3 companies were pretty similar, with a initial HR phone call, followed by multiple technical screens. Of special note is Stripe’s, who flew me from Singapore to San Francisco for an on-site interview.

Initial HR phone call with recruiter, followed by 3 back-to-back technical interviews via phone and Google Docs on the same day. After I passed the technical interviews, there are N phone calls with teams that are interested to pick up an intern. For me, N = 1, but it can vary.

### Stripe

Initial HR phone call with recruiter, technical interview via Skype and shared screen coding, on site with 3 or 4 (can’t remember) technical interviews and a lunch with an engineering manager. More details on the on site can be found in this Quora post.

### Uber

Initial HR phone call with recruiter, followed by 2 technical phone calls with online collaborative IDE, and 1 phone call with a manager.

## Team I was in, projects I worked on

Details will be scarce in this section due to confidentiality, but I try to give a higher level look as to the kind of work interns got at each company.

Internal tool to convert a service written in old framework style into a service written in the new framework style. This was a internship task that my mentor scoped out for me. I finished this ahead of time, so I went on to another task related to a bot that comments on the internal bug tracker when certain conditions are met.

My project was clearly scoped out to be an intern project: it wasn’t anything critical, but was something that an Intern could do and would be helpful to the team. Throughout my internship, I was not involved in the projects that my colleagues were doing.

### Stripe

I was initially given smaller tasks that were on the sprints to get warmed up, so I dealt a little with merchant search (that’s the search feature you see if you are a logged into your Stripe dashboard), and also a little with API permission. I had 2 big tasks scoped out by my mentor, the first had to do with the frontend of a new Connect feature, the second has to do with creating a new data pipeline to calculate fees.

The features that I was tasked with are additions to the core product, especially the data pipeline with actually affects the company’s earnings! I felt like the entire team trusts me. My team ran a 2 week sprint system, with weekly sprint meetings, and I was encouraged to participate like a full timer.

### Uber

The first task was to run an experiment on web sign up form to get warmed up. While this was in progress (experiments take a while to run), I was in discussion with my mentor for subsequent tasks. Eventually we settled extracting a microservice, which was the work the rest of my team was going to be working on for the subsequent months.

My team runs a weekly sprint system, and the bigger team that we’re part of has a 2-week sprint cycle. I mostly worked within my immediate team, knocking tasks of a task board we set up to track our own tasks.

## Tech stack

Well known for its giant monorepo, has extremely well thought out tools for developing, building, testing, etc. Main languages used are C++, Java, Python. A lot of internal tools that are slowly being open sourced, build tool bazel, code search kythe, etc. Uses internal tool for running servers, open sourced as kubernetes.

### Stripe

Main languages are Ruby and Scala. Uses GitHub for code, PR, code review, phabricator for issue tracking. Deploys to AWS using a custom tool, and developers can develop on AWS instances too.

### Uber

I worked mainly on Python, phabricator for code review, has a custom tool to deploy, each dev spawns an AWS instance to develop on. There are 2 blog posts that goes into their stack more in depth, encourage you to read them here, and here.

## Day to day life

Life was really good at all 3 companies, I felt extremely looked after, to the extent of being pampered.

3 meals were provided at every company. At Google and Uber there would be some special pop-up events, hot chocolate, cupcakes, etc. Google has baristas in house, and they make really good lattes with very pretty art.

There is a weekly 1 on 1 with my mentor where we could talk about anything, problems at work, questions about company, issues settling in etc. Stripe has a interesting culture of taking coffee walks, so all my 1 on 1 at Stripe was done while making our way to a coffee shop.

All 3 companies had a weekly all hands meeting as well, where there would be some presentation and a Q&A with heads of the company.

There are tech talks too, and the frequency correlates with the size of the company.

## That’s all folks

I had a really good time at all 3 companies, I felt technically challenged, learned a lot, had very good mentors. If you are interested in applying for an internship and am not sure where to start, check out Project Intern.

## The Problem

Suppose you have a huge number of robots/vehicles and you want all of them to track some global value, maybe the average of the weight of the fuel that each contains.

One way to do this is to have a master server that takes in everyone’s input and generates the output. So others can get it from the master. But this approach results in a single point of failure and a huge traffic to one server.

The other way is to let all robots talk to each other, so each robot will have information from others, which can then be used to compute the sum. Obviously this will incur a huge communication overhead. Especially if we need to generate the value frequently.

If we can tolerate approximate results, we have a third approach: consensus filters.

## Consensus Filters

Let’s begin with an example:

$$\dot{x}i(t)=\sum\limits{j\in\text{Neightbour of }i} x_j(t) - x_i(t)$$

$$x_i(0)=c_i \in \mathbb{R}^n$$

Here $x_i$ is the value obtained by robot $i$, $x_i(t)$ refers to the value of $x_i$ at time $t$ and $\dot x_i(t)$ is the change of value of $x_i$ at time $t$. Then $c_i$ is some initial value. We can think of the update as $x_i(t + 1) = x_i(t) + \sigma\dot x_i(t)$, where $\sigma$ is some small value.

And guess what, it can be proven that with these update rules, every $x_i$ converge to the average of the original $c_i$s, i.e.

$$\lim\limits_{t\rightarrow\infty}xi(t)=\frac{1}{n}\sum\limits{c}c_i$$

The proof involve some kind of matrix/eigenvalue and other math blah blah so I will omit it here. You can refer to the reference below for the paper to it.

The update rule given above is one example of consensus filters. It’s a filter that lets each robot compute the average of every other robots’ initial values. This average value is unique so every robot will arrived at a ‘consensus’, asymptotically.

Notice that every robot only relies on the $x$ values of its neighbors, yet their initial values ‘propagate’ to the entire network, assuming that the network is connected.

That is the essence of consensus filtering - it allows all agents, or robots, in the network to arrive at the same value asymptotically by only communicating with their neighbours. The advantage of doing this is:

1. The communication overhead is very low. You don’t even need to set up routing tables etc. because agents are only doing single-hop communication.
2. In some case, the immediate, approximate value can be used, even when the network has not arrived at a consensus. i.e. if we do not require the consensus value to be very accurate, or we are OK with agents having values more related to nearby agents than all agents, we can let each agent use the approximate one very quickly.

## Dynamic filters

The example given above is a static filter - static because it tracks only the initial value $c_i$. What if as the time goes, the value also changes? For example, the fuel each robot has may reduce over time, so the average value of such is also reducing.

That’s where dynamic filters come in. Dynamic filters are similar to static ones, but they can track changing values - or signals.

One example, called highpass filter, is given below:

$$\dot{\mathbf{x}}(t)=\sum\limits_{j\in\text{Neightbour of }i} x_j(t) - x_i(t)+\dot{u}_i(t)$$

The only change here is then when computing $\dot x$, we add $\dot u$, which is the change in the local value (in the example given, it will the change in weight of the fuel).

This filter also allows convergence to a consensus - even when the tracked value is changing! Of course, the change of the value cannnot be too drastic, otherwise there’s no opportunity for it to ‘propagate’ to the network. But if the rate of change is more or less constant, the output is pretty accurate.

It’s called ‘highpass’ because if you do some math to derive the Laplace transform, it apparently propagate high frequency signals, which most of the time are noises. There are other more complicated, but stable (less susceptible to drastic changes) filters, but they incur more math :P, and also other kinds of constraints.

## Applications

Obviously tracking fuel isn’t that fun, but you can apply consensus filters to other decentralized algorithms like Kalman filters and Sparse Gaussian Process to allow a lower communication overhead. Also, as mentioned above, these algorithms are OK with approximate average, so consensus filters work great. You can get more information in the references below.

That being said, this technique is still a hot research area, so it’s not as mature yet.

## References

### Math and proof for static filters:

Akyildiz, Ian F, Weilian Su, Yogesh Sankarasubramaniam, and Erdal Cayirci. 2002. “A Survey on Sensor Networks.” Communications Magazine, IEEE 40 (8): 102–114.

Olfati-Saber, R., J.A. Fax, and R.M. Murray. 2007. “Consensus and Cooperation in Networked Multi-Agent Systems.” Proceedings of the IEEE 95 (1): 215–233. doi:10.1109/JPROC.2006.887293.

### Consensus filter applying to Kalman Filter

Olfati-Saber, Reza. 2005. “Distributed Kalman Filter with Embedded Consensus Filters.” In Decision and Control, 2005 and 2005 European Control Conference. CDC-ECC ’05. 44th IEEE Conference on, 8179–8184. doi:10.1109/CDC.2005.1583486.

### Other dynamic filters:

Freeman, R.A., Peng Yang, and K.M. Lynch. 2006. “Stability and Convergence Properties of Dynamic Average Consensus Estimators.” In Decision and Control, 2006 45th IEEE Conference on, 338–343. doi:10.1109/CDC.2006.377078.

Spanos, D. P., R. Olfati-saber, and R. M. Murray. 2005. “Dynamic Consensus for Mobile Networks.”

## Digest: Consistent Hashing

This is our first digest article. What’s a digest article? Basically we source around NUS and write something that’s technically interesting. Let us know if you would like to submit an article!

If you’ve ever taken a Data Structures course or written a sufficiently complex program, you probably know what a Hash Table is.

## Some background

In computing, a hash table (also hash map) is a data structure used to implement an associative array, a structure that can map keys to values. -Wikipedia

The key idea behind a Hash Table is the computation of an index using a hash function to determine which ‘bucket’ or ‘slot’ to insert the given value.

Assuming we choose a fairly good hash function, and have sufficiently large number of slots, we are able to map each key to a unique bucket (less collisions) and therefore achieve an O(1) constant cost per operation (get, set).

When the load on the hash table gets above a certain threshold, we do a resize operation.

Resizing is accompanied by a full or incremental table rehash whereby existing items are mapped to new bucket locations. -Wikipedia

If we choose when to resize carefully, this still leaves us with amortized constant time operations.

## The problem

Imagine you decide to design and implement caching layer on top of you extremely popular application. You decide that a hash table is a perfect data structure to implement this cache.

You start caching various thing and your application is now blazing fast. Your cache size starts to grow and you resize your cache. Remember, each time you resize the cache, you need to re-hash all the existing keys and remap them to a larger number of buckets.

What happens when you can’t fit your entire cache on a single machine?

2 machines! A distributed hash table (DHT)!

## Keyspace partitioning

If you’ve been following along, the question you should have right now, is how do you resize a DHT across machines?

If you think about it, if you had say, a million key-value pairs distributed across 5 machines, and now wanted to add 2 more machines (i.e. grow the hash table), resizing and rehashing all of them seems like a bad idea and probably a last resort.

While running collections of caching machines some limitations are experienced. A common way of load balancing n cache machines is to put object o in cache machine number hash(o) mod n. But this will not work if a cache machine is added or removed because n changes and every object is hashed to a new location. -Wikipedia

## Consistent Hashing to the rescue!

The key benefit of consistent hashing is that we can avoid rehashing every object when a new machine is added. The way we can visualize this, is via a circle and some modulo arithmetic.

1. Say for instance, we randomly scatter each machine on the circle (perhaps by hashing its IP address, or some unique identifier)
2. We end up with machines along random points of the circle.
3. For each key, we similarly hash it as before but using modulo arithmetic, place the resulting hash value on a point on this circle.
4. The closest machine to this point is responsible for storing this key.

Now when we add a new machine, it is easy to see how only a couple of keys (to the left and right of this new machine) need to be “re-hashed” and we end up with more “slots” in out DHT.

I’m obviously glossing over a lot of the intricate details here but hopefully you get the idea.

## Closing thoughts

You’ll find the technique of Consistent Hashing pretty prevalent in DHTs and on particular instance of this you might be familiar with is Apache’s Cassandra.

There are also some caveats to this, as you can imagine.

For instance, the distribution of keys might not be uniform among all machines (due to the way we select points on the circle and the resulting distribution of keys). Cassandra, for instance, chooses manage this by monitoring the load on each machine and moving their locations on the ‘circle’ accordingly.

Another interesting approach is using ‘Virtual Nodes’, as described in Amazon’s Dynamo paper to overcome this problem.