Effective product development with small engineering teams

Small engineering teams working in a startup usually have the luxury of trying several approaches and ideas toward their main goals when it comes to optimizing product development: _leverage integrations_, _worship simplicity_ and maintain _cost-effectiveness_.

I have been leading one of those "small teams", let's say one with fewer than 10 members, for quite some time at Remagine, where we have recently entered the sweet phase where our product finally hit production quality and we went live! This is the time when, in a startup, a lot of things change, get real and become more interesting and engaging.

Another side effect of going _live_ is that the small engineering things you kept postponing as a tech lead all of a sudden become crucial for the success of the Product and the health of the Team. Let me share some of the techniques and methodologies that are now an integral part of our daily routine and help us deliver and develop the product with relative ease and at good speed.

Disclaimer: I hope it's clear that these techniques work relatively well in small teams, in a company structure that doesn't (yet) have many layers and where everything related to product development is within close reach of everyone. As soon as the teams' dynamics become more complex, many of these ideas won't scale well and will need to be rethought (and probably other tools employed as well).

The goals behind these choices are:

  • keep things as simple as possible, but not simpler. Each team member is familiar with the solutions because they make sense in the current, actual context
  • keep costs under control, buy services only when hitting a scalability issue and always try to leverage what you already have
  • do not over-engineer solutions for a future that might not even become a reality

Background and context

Our tech stack uses Node.js extensively, TypeScript everywhere, and the infrastructure is deployed entirely on AWS. We use AWS ECS (Fargate), some Lambda functions (not too many, though), RDS with PostgreSQL and several other services in that cloud. We also develop an amazing Mobile App using Flutter.

All our projects share the same 3-tier architecture: a frontend in React (or Flutter), a BFF (backend for frontend) which speaks GraphQL, and a backend which speaks REST-ish (HTTP API with JSON payloads, basically). We do not use microservices (yet), but we lean more toward an [SoA](https://en.wikipedia.org/wiki/Service-oriented_architecture) with a monolith at its core.

Our Infrastructure as Code is written using AWS CDK, in TypeScript, and it lives close to the application code: each repository also contains the code needed to deploy it, as a whole. We don't have dedicated DevOps personnel.
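To make this concrete, here is a minimal sketch of what such a stack can look like with CDK v2; the construct names and sizing below are hypothetical, not our actual setup:

```typescript
import { App, Stack, StackProps } from 'aws-cdk-lib';
import * as ec2 from 'aws-cdk-lib/aws-ec2';
import * as ecs from 'aws-cdk-lib/aws-ecs';
import * as ecs_patterns from 'aws-cdk-lib/aws-ecs-patterns';
import { Construct } from 'constructs';

// A service stack that lives in the same repository as the application code
class ApiStack extends Stack {
  constructor(scope: Construct, id: string, props?: StackProps) {
    super(scope, id, props);

    const vpc = new ec2.Vpc(this, 'Vpc', { maxAzs: 2 });
    const cluster = new ecs.Cluster(this, 'Cluster', { vpc });

    // Fargate service behind a load balancer; the container image is
    // built from the Dockerfile found in this very repository
    new ecs_patterns.ApplicationLoadBalancedFargateService(this, 'Api', {
      cluster,
      cpu: 256,
      memoryLimitMiB: 512,
      taskImageOptions: { image: ecs.ContainerImage.fromAsset('.') },
    });
  }
}

const app = new App();
new ApiStack(app, 'ApiDev'); // one stack per environment
new ApiStack(app, 'ApiLive');
app.synth();
```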

We maintain only two environments at the moment, Dev and Live. A third environment, Staging, is planned for the near future.

All our code is hosted on GitHub and deployments are executed using GitHub Actions, which we all dearly love.

Our engineering team is remote-first and our main communication tool is Slack. Email is also used, but mostly as a read-only channel; we use several Google Groups to which we route messages from our systems.

Sprints, planning and agile practices

This is definitely one of the hardest parts to get working properly, and in my career I have seen it work very well in very few instances. The goal is always the same and can be broken down into:

  • avoid miscommunication among team members
  • avoid conflicting priorities
  • avoid lack of vision and lack of shared and understood product direction
  • provide a clear mid-term delivery plan for features and integrations
  • understand how to effectively estimate effort in a way that's tailored to your team's skills and availability
  • find a good balance between writing good specs and leaving freedom to designers and developers to express themselves
  • keep things interesting so that everyone keeps happily blowing in the sail, moving the ship forward

This is not the place to discuss the details of what we are doing and how we are doing it in that context (maybe a topic for a future post), but let me just say that very recently we decided to switch to 4-week iterations, with 3 weeks allocated to "pure" product development and a cool-down week to deal with tech debt, non-critical bugs and preparation for the next iteration. It makes sense for us at the moment, but it may not for your situation. YMMV, as always. During the iterations we use Kanban as our methodology.

Development process and deployments

In a nutshell, we do not use the concept of "releases" (except for the Mobile App, of course): as soon as a pull request is merged into main, the code gets automatically deployed to the Dev environment. Live deployments are always manually triggered at the moment, at least until we fully trust our test suites (more on that later).

The development workflow is really the simplest: the main branch always contains the most recent version of the codebase, and when we want to start working on a new feature we create a new branch out of it. Naming the branch is important, and we recommend prefixing the branch name and the PR title with the number of the ticket/issue you are going to work on. It will help when you have to introduce some form of Change Management.

We also recommend opening the PR as soon as possible (using the draft PR feature in GitHub) for two reasons: to improve the quality (and number) of commits, and to allow other folks to take a look at your work before it is too late.

Once the feature is ready for review, its owner asks one or more teammates to review it. The PR must contain a nice description of the changes to help the reviewers and, if it applies, one or more screenshots: if I need to check your CSS and HTML, I would rather look at something rendered than have to run your branch locally. Screenshots also help QA and Product see what's about to be merged before it is.

Finally, after at least one approval, the PR is squash-merged by its owner, the person who created it: it is their responsibility to know when it can be safely merged, especially when a dependency on another system must be merged first. Squash merge is the only merge strategy we employ. There are several reasons for this choice, but ultimately we like to have one nice final commit in main with a proper, long description and the issue number as part of it (this also matters in our heavily regulated business, where we need to be able to track any change to the codebase and tie it to a specific ticket). In more than a year we haven't needed to git bisect anything or felt that "fat commits" were a problem for other reasons, and our git log looks nice.

As I said before, we use GitHub Actions for all our deployments, and for each application we have one workflow dedicated to Dev and another one for Live. The Dev deployment is triggered as soon as we merge a PR, whereas the Live one is always manually triggered when we feel "ready". We know that this approach is not scalable in the long run, but with smart usage of Feature Flags (read below) and extensive e2e test coverage, nothing should prevent us in the near future from unleashing everything to Live automatically.

Here are some hints specifically related to GitHub Actions; a workflow sketch illustrating them follows the list:

  • always use the timeout-minutes property. You pay for the minutes your scripts run, so you don't want a deployment stuck for hours because your ECS task doesn't become healthy (yeah...)
  • always use the concurrency property, so you can merge more than one PR and have the deployments nicely queued up
  • we use a Slack notification plugin, so that we get notified when a Live deployment starts and finishes (with the status). Very easy to set up, and very nice to have
    • by the way, we also use GitHub's Scheduled Reminders (it's in your organization's Settings) to remind people via Slack when they have one or more long-pending PRs to review
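Putting those hints together, a hypothetical Live workflow could look like the sketch below (the deploy step is just a placeholder, not our actual pipeline):

```yaml
name: deploy-live
on:
  workflow_dispatch: # Live deployments stay manually triggered

concurrency: deploy-live # runs in the same group queue up instead of overlapping

jobs:
  deploy:
    runs-on: ubuntu-latest
    timeout-minutes: 30 # never pay for a deployment stuck for hours
    steps:
      - uses: actions/checkout@v4
      - run: npx cdk deploy ApiLive --require-approval never
      # a Slack notification step would go here, reporting start and final status
```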

What is actually running in Live?

To summarize, our setup is the following:

  • we don't fully use Continuous Deployment/Delivery
  • we don't use release branches
  • we deploy in Live every day or two

In a setup like this, perhaps the biggest issue we need to solve is answering a simple question: what has been deployed to Live at any given moment or, put differently, what's the difference between what is in Dev and what is in Live?

To solve this problem we introduced automatic tagging of our main branch: at the end of any successful Live deployment, the main branch is tagged, from within the GitHub Action itself, with a lightweight tag containing the current version (as semver). Knowing what is in Live and what is not is then just a matter of running a git log: every commit above the newest tag is not in Live.
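For example, assuming the newest Live tag is v1.42.0 (a made-up version number), the difference between Dev and Live is just:

```bash
# fetch the tags created by the deployment workflow
git fetch --tags

# every commit above the newest Live tag is not in Live yet
git log v1.42.0..main --oneline
```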

To enable this feature we use another GitHub Actions plugin, which is even smart enough to automatically tag with a -beta semver tag in case, for some reason, we want to deploy a specific branch to Live instead of main. We only increase the minor semver version for now.

Did you know? Inside a GitHub Actions script you have access to a temporary GITHUB_TOKEN that gives you read/write access to the repository where the script is running. The token is destroyed once the action script ends.
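For the curious, a hand-rolled version of the tagging job could look like this workflow fragment (we use a plugin instead, and the tag name here is illustrative):

```yaml
  tag-live:
    runs-on: ubuntu-latest
    permissions:
      contents: write # lets the temporary GITHUB_TOKEN push the tag
    steps:
      - uses: actions/checkout@v4
      - run: |
          # actions/checkout configures git auth with GITHUB_TOKEN by default
          git tag v1.43.0
          git push origin v1.43.0
```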

There are also two other common scenarios to solve:

  • Rolling back a disastrous deployment is solved by simply creating a branch out of the previous version (the "previous tag") and deploying that branch. A tag is just an alias for a commit.
  • Hotfix deployment: if you need to deploy a hotfix to Live but you don't want to deploy the whole main branch because it contains too many untested differences, you can create a branch out of the tag that's in Live, merge the PR with the hotfix into it and deploy that branch (or you can cherry-pick commits away, of course). Both scenarios are sketched below.
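In git terms, and assuming v1.42.0 is the tag currently in Live and v1.41.0 the previous one (both made up), the two scenarios boil down to:

```bash
# rollback: branch off the previous known-good tag and deploy that branch
git checkout -b rollback/v1.41.0 v1.41.0
git push origin rollback/v1.41.0

# hotfix: branch off the tag currently in Live and apply only the fix
git checkout -b hotfix/v1.42.1 v1.42.0
git cherry-pick <fix-commit-sha>
git push origin hotfix/v1.42.1
```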

Feature Flags all the way

One of the best decisions the team took early on was to introduce the concept of Feature Flags. They are absolutely necessary (even more than tests, I would dare to say) to keep things under control, and the earlier you introduce them the better, since it's not only a matter of implementing them but also a matter of getting used to using them properly (and not abusing them).

Continuous deployment is about delivering features, not delivering code.

Indeed, the importance of this concept cannot be stressed enough: wrap a feature into a feature flag that is only enabled in, say, Dev, and you can freely keep deploying the codebase to Live as much as you want. If you don't use Feature Flags, you have to keep committed-but-unreleased features under control with fancy and relatively complex branch juggling in your git repo. Not an ideal situation at our stage.

Although there are centralized solutions for managing Feature Flags, we decided not to rely on any of them (for now). We use a client library for our React components and a small library that each BFF customizes according to the needs of the specific application it drives. Once the client starts, it asks the BFF for the list of enabled features and that's it. It's not really DRY, but the code for toggling feature flags hasn't changed in almost a year, so we are happy with that.
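A minimal sketch of this pattern, with hypothetical names (this is not our actual library):

```typescript
// BFF side: the per-environment list of enabled features,
// served to the client in a single call at startup
const ENABLED_FEATURES: Record<'dev' | 'live', string[]> = {
  dev: ['new-onboarding', 'csv-export'],
  live: ['csv-export'],
};

export function enabledFeatures(env: 'dev' | 'live'): string[] {
  return ENABLED_FEATURES[env];
}

// Client side: fetch the list once, then check flags synchronously
let features = new Set<string>();

export async function loadFeatures(): Promise<void> {
  const res = await fetch('/api/features');
  features = new Set<string>(await res.json());
}

export function isEnabled(feature: string): boolean {
  return features.has(feature);
}

// Usage: gate the unreleased code path so main stays always deployable
// if (isEnabled('new-onboarding')) { renderNewOnboarding(); }
```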

Documentation is key

When it comes to documentation, we all know the situation:

  • everyone agrees that documentation is important, and
  • nobody wants to write documentation, also
  • documentation is always out-of-date

Conscious of the situation, we employ a very pragmatic approach to tackle the issue:

  • we implemented Swagger for our REST API very early on and now the API documentation is quite curated and everybody loves and uses it;
  • whenever a task gets interesting and needs discussion and understanding by several parties and stakeholders, we write an "RFC" (request for comments), a document that details the problem and the solution we are going to implement in as much detail as possible, and where a (possibly short) discussion takes place. The RFC may start in Google Docs (the editor and real-time editing are superior) but it then gets stored in our Notion knowledge base. This is also the weakest part of our documentation because, in theory, an RFC is supposed to be closed after it has been implemented, but as things change and the code is updated you need to remember to also update the RFC, which does not always happen;
  • we favor unit tests over comments in the code, although there is always something, a corner case, a weird assumption, a "TODO", that wants to be left in the code for the future you, or for your colleague;
  • we made a habit of opening a ticket whenever someone starts a conversation with "we should..."; OK, not that early on, but you get the idea. Tickets are just our distributed, communal memory, and we also want to open a ticket whenever someone writes a "TODO" or a "FIXME" inside the code.

Miscellaneous tips

  • we use several template repositories to select from when creating a new repository on GitHub. Unfortunately, this only gets you to a certain point, and for a more specific configuration you need to write a script. Here is an example of a script I personally wrote
  • for monitoring our Live systems we just connect AWS CloudWatch to AWS SNS, with slightly different metrics between Dev and Live; we then have a Google Group subscribed to the SNS Topics of interest. Too bad we still haven't found a way to have the actual issue from CloudWatch also reported in the email, but it's a start
  • as we all know, AWS gives you a lot of potential in terms of infrastructure, but the usability of their tools lags behind the speed at which they deliver new features. For this reason, we use a couple of precious scripts that allow a simpler "*ops" life: one is awslogs ("aws logs for humans", shown below) and another one is aws-cost-saver (the idea here is to switch off services in the Dev environment during the night or the weekends to save costs)
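As an example, tailing the logs of a Live service with awslogs looks roughly like this (the log group name is made up):

```bash
# follow a CloudWatch log group from the terminal, starting one hour back
awslogs get /ecs/api-live --start='1h ago' --watch
```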

What are we still improving

Nothing is perfect, obviously, and even though we are reasonably happy with what we have so far, enough that we don't feel we are lagging behind on tech debt or losing control of our systems as they develop and get bigger and more complex, we are still hard at work on the following topics:

  • Release management is still a bit rough and needs to be better defined in terms of expectations and direct responsibilities
  • Test coverage is not ideal, in that we still don't trust it completely enough to deploy to Live automatically
  • Feature testing is not performed in isolation, which is by itself another big topic altogether
  • Almost everything uses IaC, but some areas still need to be ported
  • Code always moves faster than documentation

Comments? Opinions? Let's keep the conversation going on Twitter or LinkedIn!