Day 3 - Cleanup & Identify Visitor Source
I was reading up on using Lambda, Node and DynamoDB and found some samples with much cleaner/easier code.
So first I will clean up some of the code from Day 1 and Day 2.
🔗Goodbye Express, Serverless-HTTP
The next part is blatantly stolen from the post below. If you want to know why, please read the original. All credits to Van Huynh.
Create a file called src/utils/request.util.js for our request utility.
For now, the request utility will contain a single function that will accept some parser function and return a new function that uses it to parse some text we pass to it.
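A minimal sketch of what that could look like (the name parseWith is my own, based on the description above):

```js
// src/utils/request.util.js
// Accepts a parser function and returns a new function that uses it
// to parse whatever text we pass in (e.g. an API Gateway event body).
const parseWith = (parser) => (text) => {
  if (!text) return null;
  return parser(text);
};

module.exports = { parseWith };
```

So parseWith(JSON.parse) gives us a small, reusable JSON body parser.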
Create another file called src/utils/response.util.js for our response utility.
The withStatusCode response utility function is similar to the request utility, but instead of parsing text, it formats data into text. It also contains additional checks to make sure our status codes are within the range of allowable status codes.
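Based on that description, a sketch could look like this (the exact range check and response shape are assumptions on my part):

```js
// src/utils/response.util.js
// Takes a status code and an optional formatter and returns a function
// that turns data into an API Gateway response object.
const withStatusCode = (statusCode, format = null) => {
  // Guard against status codes outside the valid HTTP range.
  if (statusCode < 100 || statusCode > 599) {
    throw new Error('statusCode must be between 100 and 599');
  }

  return (data) => ({
    statusCode,
    body: format ? format(data) : data,
  });
};

module.exports = { withStatusCode };
```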
Then we create a factory, since we keep instantiating DynamoDB with the same options:
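Roughly, the factory boils down to something like this (the file name, default region and option set are placeholders):

```js
// src/dynamodb.factory.js
// One place to build DocumentClient instances, so every handler
// gets the same configuration.
const AWS = require('aws-sdk');

const options = {
  region: process.env.AWS_REGION || 'eu-west-1', // placeholder default
};

const makeClient = () => new AWS.DynamoDB.DocumentClient(options);

module.exports = { makeClient };
```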
Using those new utils, our index.js looks like this:
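It ends up as a plain Lambda handler along these lines (handler and env var names are illustrative, not the exact code):

```js
// index.js - plain Lambda handler, no express/serverless-http required
const { parseWith } = require('./src/utils/request.util');
const { withStatusCode } = require('./src/utils/response.util');
const { makeClient } = require('./src/dynamodb.factory');

const parseJson = parseWith(JSON.parse);
const ok = withStatusCode(200, JSON.stringify);
const dynamoDb = makeClient();

module.exports.handle = async (event) => {
  const body = parseJson(event.body);

  // Pick the table for this Segment event type (page, track, identify, ...).
  // PAGE_TABLE / TRACK_TABLE / IDENTIFY_TABLE are hypothetical env var names.
  const tableName = process.env[`${body.type.toUpperCase()}_TABLE`];

  await dynamoDb.put({ TableName: tableName, Item: body }).promise();

  return ok({ success: true });
};
```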
That’s right: no more express, body-parser or serverless-http. So remove them from the code and make sure to uninstall the packages.
🔗More cleanup
I had some inconsistencies in serverless.yml: using track instead of page for table names. Fix that in serverless.yml.
A quick sls deploy to push everything out. We did rename a table, so the page events will be empty 🤨
Alright. Ready to do some new stuff. 😎
We want to do two things today:
- Store the identify() calls together with anonymousId so we can map visitor sessions (pages, tracks, …) to real users
- Run page() calls through some external library and detect what type of traffic it is (search, paid, social,…)
🔗Streaming DB inserts
Because I have my own dataset of historic segment events, I might import that old archive later. For that reason, I have decided to use streaming to launch a second Lambda function after each record is inserted. This way I can safely batch-import old page events and know they will be handled by the same processing logic.
Let’s register two new functions and set up the streaming link in serverless.yml
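As a sketch, the relevant serverless.yml additions look something like this (handler paths, the logical resource names IdentifyTable/PageTable, and the stream view type are assumptions; only the processIdentify and processPage functions come from the article):

```yaml
functions:
  processIdentify:
    handler: processIdentify.handle
    events:
      - stream:
          type: dynamodb
          arn:
            Fn::GetAtt: [IdentifyTable, StreamArn]
  processPage:
    handler: processPage.handle
    events:
      - stream:
          type: dynamodb
          arn:
            Fn::GetAtt: [PageTable, StreamArn]

resources:
  Resources:
    IdentifyTable:
      Type: AWS::DynamoDB::Table
      Properties:
        # ...existing table definition...
        StreamSpecification:
          StreamViewType: NEW_IMAGE
```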
Hit sls deploy
In some cases, I had to erase my deployment with sls remove (or even sls remove -f), because some changes to the tables (setting up the streaming) could not be completed.
Don’t forget to export BASE_DOMAIN after that to point to the new deployment.
🔗Processing Identify calls
Every time an identify is stored (Day 2), another Lambda function is automatically called. Let’s add the code (in /processIdentify.js) to store a mapping (userId to anonymousId).
Two new files come with that too. The first one is a model to keep our Lambda handler clean and DRY.
The second file deals with the fact that DynamoDB stream events come in a special (marshalled) format. As we need the original event, we have to unmarshall it. Read about it on Stack Overflow.
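Put together, processIdentify might look roughly like this (the handler name and table env var are my own, and the model/unmarshall helpers are inlined here for brevity):

```js
// processIdentify.js - triggered by the DynamoDB stream on the identify table
const AWS = require('aws-sdk');

const dynamoDb = new AWS.DynamoDB.DocumentClient();

// Stream records arrive marshalled; Converter.unmarshall turns the
// NewImage back into a plain JavaScript object.
const unmarshall = (record) =>
  AWS.DynamoDB.Converter.unmarshall(record.dynamodb.NewImage);

module.exports.handle = async (event) => {
  for (const record of event.Records) {
    if (record.eventName !== 'INSERT') continue;

    const identify = unmarshall(record);

    // Store the anonymousId -> userId mapping so later sessions
    // can be linked to a known user.
    await dynamoDb.put({
      TableName: process.env.USER_MAPPING_TABLE, // hypothetical env var
      Item: {
        anonymousId: identify.anonymousId,
        userId: identify.userId,
      },
    }).promise();
  }
};
```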
After that, deploy only that function:
sls deploy --function=processIdentify
Now try to hit your segment endpoint with a new identify event
http POST $BASE_DOMAIN/events < events/identify.json
In a new tab monitor the output
sls logs --function=processIdentify
After that, you will see that a new record was written to the user mapping table. Mine (with the example event) looked like this:
🔗Processing Track calls
Next up is processing track calls in a very similar way. We already have the DB table and streaming function created.
As mentioned in the plan I wanted to use a library to parse the HTTP Referrer and current URL and group sessions in some kind of grouping (paid, social, referrer, search…).
After a quick check of those libraries, I found segment/inbound the easiest to use. But when using it I bumped into some issues:
- Not maintained
- Does not detect referrer: https://www.google.com as search
- Outdated social networks and search engines
- Tests were broken
- Problem with a sublibrary that was accessing an external URL.
I found a better fork (better maintenance), but then I had to fork that again to apply some of my own fixes. Too bad I didn’t know about Monkey Referrer then :-( For now we’ll use the fork I created.
I need two new utils to keep the function clean:
The first one uses this inbound library and returns a promise that parses the HTTP referrer and/or current URL and returns something useful.
The second one extracts the important properties from a page() event.
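They could look roughly like this, both shown in one snippet for brevity (I'm assuming the fork keeps upstream inbound's callback-style referrer.parse(url, referrer, cb); the property paths in the page() event are illustrative):

```js
// src/utils/referrer.util.js - wrap the callback-style inbound API in a promise
const inbound = require('inbound');

const parseReferrer = (url, referrer) =>
  new Promise((resolve, reject) => {
    inbound.referrer.parse(url, referrer, (err, description) => {
      if (err) return reject(err);
      resolve(description); // e.g. { referrer: { type: 'search', ... } }
    });
  });

// src/utils/page.util.js - pull out the properties we care about
const extractPageProperties = (page) => ({
  anonymousId: page.anonymousId,
  url: page.properties && page.properties.url,
  referrer: page.context && page.context.page && page.context.page.referrer,
  timestamp: page.timestamp,
});

module.exports = { parseReferrer, extractPageProperties };
```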
This makes the processPage file (also automatically called on DB inserts) look like this:
In short this will:
Get the page() payload -> run it through inbound lib -> store that together with url and HTTP referrer in a new table optimised for reads by anonymousId.
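A sketch of that handler, with the model's DB write inlined for brevity (the attribution table env var and util paths are assumptions):

```js
// processPage.js - triggered by the DynamoDB stream on the page table
const AWS = require('aws-sdk');
const { parseReferrer, extractPageProperties } = require('./src/utils/referrer.util');

const dynamoDb = new AWS.DynamoDB.DocumentClient();

module.exports.handle = async (event) => {
  for (const record of event.Records) {
    if (record.eventName !== 'INSERT') continue;

    // Unmarshall the stream record back into the original page() event.
    const page = AWS.DynamoDB.Converter.unmarshall(record.dynamodb.NewImage);
    const { anonymousId, url, referrer, timestamp } = extractPageProperties(page);

    // Classify the traffic source (search, social, paid, ...) via the inbound fork.
    const source = await parseReferrer(url, referrer);

    // Store it in a table keyed by anonymousId, optimised for reads.
    await dynamoDb.put({
      TableName: process.env.ATTRIBUTION_TABLE, // e.g. sma-event-attribution-dev
      Item: { anonymousId, url, referrer, source, timestamp },
    }).promise();
  }
};
```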
One more file (the model)
Give it a quick deploy
And after feeding the test event (make sure to clear the page table first) you should see some events.
So this call feeds the webhook (created in Day 2) a new page() event, just like segment would. After that insert, DynamoDB automatically calls another Lambda function (processPage) that runs the HTTP referrer and current URL through the inbound library and stores that payload (together with other important properties) in sma-event-attribution-dev.
Feed it some more events (copy them from segment) and have a look in the DynamoDB interface. Here is what I saw with the test event:
There you go. Some great progress today. Tomorrow we’ll start by moving this to production and then creating an API to generate a stream per user or anonymousId.
Other articles in the series
- Day 11 - Sales Attribution (05/07/2021)
- Day 10 - Six months later (03/07/2021)
- Day 9 - Dealing with tracking/ad blockers (03/06/2020)
- Day 8 - Feeding in sales data (18/05/2020)
- Day 7 - Reporting on visitor sources (06/05/2020)
- Day 6 - Feeding source attribution data back to Segment.com (01/05/2020)
- Day 5 - Feed old events (27/04/2020)
- Day 4 - Run in production + API (24/04/2020)
- Day 3 - Cleanup & Identify Visitor Source (22/04/2020)
- Day 2 - Capture segment events (21/04/2020)
- Day 1 - The Masterplan (20/04/2020)
- Solving marketing attribution (using segment) (19/04/2020)