I have exported the postgres pages and identifies table in CSV files of 500k rows per file.
The database is using Amazon Aurora (postgres variant) so exploring the files went pretty smooth.
Now I have created two scripts to get that data into our new setup by:
- reading the file converting to JSON (some library)
- converting the flat (one column per property) to the event structure we need
- importing the records by 25 using dynamo batchWrite
Here is the code which I copied (and modified) from a StackOverflow article
And another one
This dataset goes all the way back to 2016 and has millions of records. Because the batchWrite will upload this stuff pretty efficiently and trigger the process functions at a really good speed we have to make some changes to the DynamoDB and switch to On-Demand pricing.
Update the Resources part in serverless
A quick sls deploy will do the job. I did remove the entire infrastructure a few times testing this.
Now let’s feed some events:
It took about an hour to get all the events in there but now all history from 2018, 2019 and 2020 should be in the attribution tables.
With a larger dataset it might be more effective to go from SQL straight to DynamoDB or use one of Amazon Migration Services. In my case it was a great chance to see how it all performs under some stress and clean up the console.log calls throughout the code :-)
Tomorrow I will explore how I can feed some events back to segment using analytics.js.