Day 5 - Feed old events
I have a full archive of all Segment events in our Postgres warehouse that I now want to feed into the system. Let’s try to make that happen.
I have exported the Postgres pages and identifies tables to CSV files of 500k rows each.
The database runs on Amazon Aurora (the Postgres variant), so exploring the files went pretty smoothly.
Now I have created two scripts to get that data into our new setup by:
- reading the CSV file and converting it to JSON (using a library)
- converting the flat rows (one column per property) to the event structure we need
- importing the records in batches of 25 (the maximum per request) using DynamoDB batchWrite
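As a rough illustration of the second step and the batching in the third (a minimal sketch, not the exact script): the column names below and the shape of the resulting event are assumptions modelled on a typical Segment pages export, and `rowToPageEvent` and `chunk` are hypothetical helper names.

```javascript
// Hypothetical column names, modelled on a typical Segment pages export.
const PAGE_PROPERTY_COLUMNS = ['url', 'path', 'referrer', 'title', 'search'];

// Turn one flat CSV row (one column per property) into a nested page event.
// The target shape is an assumption, not necessarily the one the pipeline uses.
function rowToPageEvent(row) {
  const properties = {};
  for (const column of PAGE_PROPERTY_COLUMNS) {
    // Empty CSV cells are treated as missing values and skipped.
    if (row[column] !== undefined && row[column] !== '') {
      properties[column] = row[column];
    }
  }
  return {
    type: 'page',
    messageId: row.id,
    anonymousId: row.anonymous_id,
    userId: row.user_id || null,
    timestamp: row.timestamp,
    properties,
  };
}

// Split an array into chunks of at most `size` items (25 by default,
// the per-request limit of DynamoDB batch writes).
function chunk(items, size = 25) {
  const chunks = [];
  for (let i = 0; i < items.length; i += size) {
    chunks.push(items.slice(i, i + size));
  }
  return chunks;
}
```

Keeping the conversion a pure function makes it easy to check against a handful of rows before pointing the script at a 500k-row file.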
Here is the code, which I copied (and modified) from a Stack Overflow post
And another one
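For the import step, a minimal sketch of a batched write with retries could look like this. It is not the exact script: `batchWriteAll` and its parameters are hypothetical, and `docClient` is assumed to behave like the AWS SDK v2 `DynamoDB.DocumentClient`, whose `batchWrite(params).promise()` resolves with an `UnprocessedItems` map.

```javascript
// Write items in batches of 25 and retry whatever DynamoDB reports back
// as unprocessed. `docClient` is assumed to be an AWS SDK v2
// DynamoDB.DocumentClient (or anything with the same batchWrite shape).
async function batchWriteAll(docClient, tableName, items, maxRetries = 5, baseDelayMs = 100) {
  let written = 0;
  for (let i = 0; i < items.length; i += 25) {
    let requests = items
      .slice(i, i + 25)
      .map((Item) => ({ PutRequest: { Item } }));

    for (let attempt = 0; requests.length > 0; attempt++) {
      if (attempt > maxRetries) {
        throw new Error(`Gave up on ${requests.length} unprocessed items`);
      }
      if (attempt > 0) {
        // Exponential backoff before retrying the leftovers.
        await new Promise((resolve) => setTimeout(resolve, baseDelayMs * 2 ** attempt));
      }
      const result = await docClient
        .batchWrite({ RequestItems: { [tableName]: requests } })
        .promise();
      const unprocessed = (result.UnprocessedItems || {})[tableName] || [];
      written += requests.length - unprocessed.length;
      requests = unprocessed;
    }
  }
  return written;
}
```

A successful batchWrite call can still leave some items unwritten when the table is throttled, which is why the sketch loops on `UnprocessedItems` instead of assuming every batch went through.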
This dataset goes all the way back to 2016 and has millions of records. Because batchWrite uploads the records quickly and triggers the process functions at a high rate, provisioned capacity would throttle the import. So we have to change the DynamoDB tables and switch to On-Demand pricing.
Update the Resources part in serverless.yml.
A quick sls deploy will do the job. I did remove the entire infrastructure a few times while testing this.
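As a hedged sketch of what that change can look like: the logical ID, table name and key below are placeholders rather than the real definitions. The relevant part is `BillingMode: PAY_PER_REQUEST`, which takes the place of a `ProvisionedThroughput` block.

```yaml
resources:
  Resources:
    # Hypothetical logical ID, table name and key; keep your own definitions.
    EventsTable:
      Type: AWS::DynamoDB::Table
      Properties:
        TableName: events
        AttributeDefinitions:
          - AttributeName: id
            AttributeType: S
        KeySchema:
          - AttributeName: id
            KeyType: HASH
        # On-Demand capacity: remove any ProvisionedThroughput block.
        BillingMode: PAY_PER_REQUEST
```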
Now let’s feed some events:
It took about an hour to get all the events in, but now all history from 2018, 2019 and 2020 should be in the attribution tables.
With a larger dataset it might be more effective to go straight from SQL to DynamoDB or to use one of Amazon's migration services. In my case it was a great chance to see how it all performs under some stress and to clean up the console.log calls throughout the code :-)
Tomorrow I will explore how I can feed some events back to Segment using analytics.js.
Other articles in the series
- 05/07/2021: Day 11 - Sales Attribution
- 03/07/2021: Day 10 - Six months later
- 03/06/2020: Day 9 - Dealing with tracking/ad blockers
- 18/05/2020: Day 8 - Feeding in sales data
- 06/05/2020: Day 7 - Reporting on visitor sources
- 01/05/2020: Day 6 - Feeding source attribution data back to Segment.com
- 27/04/2020: Day 5 - Feed old events
- 24/04/2020: Day 4 - Run in production + API
- 22/04/2020: Day 3 - Cleanup & Identify Visitor Source
- 21/04/2020: Day 2 - Capture segment events
- 20/04/2020: Day 1 - The Masterplan
- 19/04/2020: Solving marketing attribution (using segment)