Published: 08-11-2022
A brief round-up of some of the stuff I got up to at Xor Systems.
While also working on various smaller projects, the main challenge at Xor Systems was continued development of a "legacy" PHP-based IoT system. The system provided noise, vibration, dust and weather monitoring around the UK and Europe, mostly for regulation compliance in various industries. It was a monolithic PHP-MySQL system running on bare metal.
The hardest part of working on this project was balancing the need for improved infrastructure against delivering constant end-customer value. It was one of those projects you see and immediately wish you could start again, but as is often the case (and often the correct choice), starting again was not an option. A large backlog of required features and a growing number of devices and volume of data, combined with a small team, made it interesting. Lots of different hats were worn!
One of the first things I tackled was creating a system to automatically configure the infrastructure, so it could be replicated, repaired or replaced reliably. I settled on Ansible as the provisioning tool and built a series of playbooks that deconstructed the application setup. These were used to deploy development and staging servers before replacing the production server. An AWX environment was added, allowing team members to deploy playbooks from a centralised dashboard. This system paved the way for future automated deployments of new services.
Monitoring was next. For this, I chose to set up a TICK stack with Grafana for visualisations and alerting. Infrastructure and application level logs were sent to this system using Telegraf. This built further confidence for automated deployments and helped debug several system performance issues.
Large sections of the application were developed as tightly coupled PHP MVC-style pages. Part of the process of making the code testable involved decoupling several areas of code. This allowed better use of unit tests where appropriate, but the work mainly centred on shifting away from MVC to a well-defined "RESTful" API. When there is no defined contract
A cron-triggered, single-threaded script would pull audio files from remote locations when a noise exceedance had occurred. The slow and unreliable bandwidth to remote sites often caused bottlenecks and download delays, which not only made the system slow but also risked a download missing the window in which a file was still available before being overwritten.
To solve this issue I architected a queue-based worker system, deployed as a separate service from the main application. Python worker scripts would pull "exceedances" from a RabbitMQ queue and process them in parallel, removing bottlenecks. There were nuances: devices could only handle one download at a time (limited embedded system resources), and a device could also legitimately be offline (power saving mode), so retries needed to take that into account.
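A minimal sketch of those two nuances, not the production code: a standard-library `queue.Queue` stands in for RabbitMQ, and `Exceedance`, `DeviceOffline`, `run_workers` and the `MAX_ATTEMPTS` value are hypothetical names chosen for illustration. Each worker takes a per-device lock so a device only ever serves one download at a time, and a job for an offline device goes back on the queue until its retry budget is spent.

```python
import queue
import threading
from collections import defaultdict
from dataclasses import dataclass

MAX_ATTEMPTS = 3  # assumed retry budget; the real value is not given in the post


class DeviceOffline(Exception):
    """Raised by the download callable when a device cannot be reached."""


@dataclass
class Exceedance:
    device_id: str
    file_name: str
    attempts: int = 0


def run_workers(jobs, download, num_workers=4):
    """Drain the queue in parallel, allowing one download per device at a time.

    Returns a (done, failed) pair of lists of Exceedance objects.
    """
    pending = queue.Queue()  # in-process stand-in for the RabbitMQ queue
    for job in jobs:
        pending.put(job)

    device_locks = defaultdict(threading.Lock)
    guard = threading.Lock()  # protects device_locks and the result lists
    done, failed = [], []

    def worker():
        while True:
            try:
                job = pending.get_nowait()
            except queue.Empty:
                return  # nothing left for this worker to pick up
            with guard:
                device_lock = device_locks[job.device_id]
            with device_lock:  # the device can only serve one download at once
                try:
                    download(job)
                    offline = False
                except DeviceOffline:
                    offline = True
            if not offline:
                with guard:
                    done.append(job)
                continue
            job.attempts += 1
            if job.attempts < MAX_ATTEMPTS:
                pending.put(job)  # a real system would delay this retry
            else:
                with guard:
                    failed.append(job)

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return done, failed
```

With a real broker the requeue would more likely be a delayed redelivery than an immediate put, so that a device in power saving mode has time to wake up before the next attempt.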
Workers were containerised alongside a RabbitMQ instance and deployed using Ansible. Telegraf was used to log worker and queue metrics for monitoring through Grafana.
There were places in the application that displayed a lot of aggregated data and, because device data was siloed in separate tables, a high number of individual reads were needed, putting the database under strain. To solve this issue I suggested we adopt a CQRS-based approach and do more of the processing work on "write". A materialised view table was added to the database and populated during data ingest, reducing the read pressure significantly.
The final few years saw me architecting and embedding a new user interface into the IoT platform. This consisted of an embedded React App that progressively consumed other pages/routes, providing a far more interactive and visually appealing experience. It also helped drive further development of the API surface area. Even though deployments were automated, having a decoupled frontend also made it easier to push out smaller UI changes and bug fixes.
I introduced a more systematic approach of mocking up all UI components in Figma before writing any UI code. This made it much easier to discuss changes and relay them to the client, especially during the fully remote working of the Covid period.
The app made use of Redux Toolkit for state management; Jest and RTL for testing; styled-components for CSS.
In combination with other greenfield IoT projects, I did some work exploring and setting up an IoT system on AWS. Being a smaller team, we hoped to leverage some of the higher level services to focus more on product development than operations.
One main area of work was determining how to use IoT Core in a multi-tenanted environment. IoT policies for allowing MQTT access to devices needed to be dynamically matched with Cognito users. MQTT communication also needed to be implemented in the frontend. Amplify JS libraries were used to provide sign-in and MQTT over WebSockets. A lot was learned about the shortcomings of Cognito and Amplify (the library, not the service).
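One possible shape for a per-tenant policy is sketched below. It is an assumption about how such scoping could look, not the policy that was actually used: the `tenants/<tenant_id>/` topic layout, the client ID prefix and the `tenant_policy` function are hypothetical. Only the document structure, the `iot:*` action names and the ARN resource types (`client`, `topic`, `topicfilter`) come from AWS IoT Core itself.

```python
import json
import re


def tenant_policy(tenant_id, region, account_id):
    """Build an IoT policy document limiting MQTT access to one tenant's topics."""
    # Reject anything that could smuggle a wildcard or extra path segment in.
    if not re.fullmatch(r"[A-Za-z0-9_-]+", tenant_id):
        raise ValueError(f"invalid tenant id: {tenant_id!r}")

    arn = f"arn:aws:iot:{region}:{account_id}"
    topics = f"tenants/{tenant_id}/*"  # hypothetical topic layout
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": "iot:Connect",
                # Hypothetical convention: client IDs are prefixed by tenant.
                "Resource": f"{arn}:client/{tenant_id}-*",
            },
            {
                "Effect": "Allow",
                "Action": "iot:Subscribe",
                "Resource": f"{arn}:topicfilter/{topics}",
            },
            {
                "Effect": "Allow",
                "Action": ["iot:Publish", "iot:Receive"],
                "Resource": f"{arn}:topic/{topics}",
            },
        ],
    }


if __name__ == "__main__":
    print(json.dumps(tenant_policy("acme", "eu-west-2", "123456789012"), indent=2))
```

A document like this would then be created as a named IoT policy and attached to the signed-in user's Cognito identity, which is the "dynamic matching" step; how that step was wired up here is not described in the post.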
I wrote a service for storing time series data that could be used across projects. Requests went through API Gateway V2 and Lambda. DynamoDB acted as a long(er) term cache/DB (TTL ~1yr), with data also being written to S3 in an efficient format. Being fully managed and with no VPC complexities, this proved an effective way to create a decoupled service with low operational overhead and enough performance for our use cases.
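As an illustration of the DynamoDB side, here is a sketch of how a data point might be shaped before being written. The key names (`pk`, `sk`), the `expires_at` attribute and the exact 365-day retention are assumptions; the post only says the TTL was roughly a year. DynamoDB's TTL feature expects the expiry as a Number attribute holding epoch seconds, which is what `expires_at` carries.

```python
from datetime import datetime, timedelta, timezone
from decimal import Decimal

TTL_DAYS = 365  # "~1yr" in the post; the exact figure is assumed


def build_item(series_id, ts, value):
    """Shape one time series point as a DynamoDB item with an expiry time."""
    if ts.tzinfo is None:
        raise ValueError("timestamp must be timezone-aware")
    ts = ts.astimezone(timezone.utc)
    return {
        "pk": series_id,       # partition key: one series per partition
        "sk": ts.isoformat(),  # sort key: ISO 8601 UTC sorts chronologically
        # boto3's resource API rejects float, so numbers are stored as Decimal.
        "value": Decimal(str(value)),
        # TTL attribute: epoch seconds after which DynamoDB may delete the item.
        "expires_at": int((ts + timedelta(days=TTL_DAYS)).timestamp()),
    }
```

Keying on series and timestamp keeps a time-range read to a single query on one partition, and letting TTL expire old items means no cleanup job is needed for the cache layer while S3 holds the long-term copy.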
All infrastructure was developed using the CDK and TypeScript.
Working on a legacy system in an environment with limited resources can be quite challenging, but it also tends to make the engineering more interesting. A lot of time was spent architecting ways to strategically split out high value parts of a monolith into decoupled services, but the hard part here is not the architecting or splitting out, it's doing so while providing continuous end customer value. This involves crafting vertical slices of work based on customer needs and being able to articulate these changes effectively through written proposals and mockups.
Avoiding leaky abstractions plays a big role in preventing simple upgrades from turning into massive rabbit holes.