Reliable, Scalable, and Maintainable Applications #
An application has to meet various requirements in order to be useful. Some nonfunctional requirements: security
, reliability
, compliance
, scalability
, compatibility
, and maintainability
.
Applications usually need to:
- store data (databases),
- speed up reads (caches),
- search data (search indexes),
- send a message to another process asynchronously (stream processing),
- periodically crunch data (batch processing).
The border is quite blurred because there are datastores that are also used as message queues (Redis
), and there are message queues with database-like durability guarantees (Apache Kafka
).
Reliability #
Reliability
means making systems work correctly, even when faults occur.
A fault is not the same as a failure, first is about one component, the other about the whole system. Prevent faults from causing failures. Faults in a system can be handled with fault-tolerant or resilient systems.
Examples of hardware faults:
- hard disk crashes (hard disks have an MTTF of 10 to 50 years),
- faulty RAM,
- blackouts in the power grid,
- an accidental disconnection of network cables.
To decrease the failure rate, redundancy
is typically added to hardware components.
Software errors are systematic faults that can cause many system failures. Unlike hardware faults, these errors are often correlated across nodes.
Measures to reduce the impact of software errors include:
- thorough testing,
- process isolation,
- allowing processes to crash and restart,
- careful consideration of assumptions and interactions in the system.
Human errors are a leading cause of outages, with configuration errors being the primary culprit. Only a small percentage of outages are due to hardware faults.
Prevention:
- Design systems to minimize opportunities for error, including well-designed abstractions, APIs, and admin interfaces.
- Provide fully-featured non-production sandbox environments for experimentation.
- Test thoroughly at all levels, from unit tests to whole-system integration tests and manual - tests.
- Allow quick and easy recovery from errors.
- Set up detailed and clear monitoring of system behaviour.
- Implement good management practices.
- Provide training to operators.
Scalability #
Scalability
maintains good performance as load increases by adding processing capacity.
An architect needs to describe the right parameters to measure. Load parameters
describe system load and may include requests per second to a web server, read-to-write ratios in a database, active users in a chat room, cache hit rates, or other relevant factors. Once you have described the load on your system, you can investigate what happens when the load increases.
What happens when the load increases:
- How is the performance affected?
- How much do you need to increase your resources?
Response time and latency are not synonyms. Response time includes network and queuing delays in addition to the actual service time, whereas latency is the time a request waits to be handled. Percentiles are commonly used to measure response time, with high percentiles (95th
, 99th
, and 99.9th
) important for understanding tail latencies.
How to deal with increased load?
Scaling up or vertical scaling: Moving to a more powerful machine
Scaling out or horizontal scaling: Distributing the load across multiple smaller machines.
Elastic systems: Automatically add computing resources when detected load increases. Quite useful if the load is unpredictable.
It is relatively easy to distribute stateless services across multiple machines. However, transitioning stateful data systems from a single node to a distributed setup can be quite complex. In the past, it was generally advised to keep databases on a single node until it became necessary to distribute them due to scaling costs or high availability requirements.
Shared-nothing architecture: each processor has its own set of disks. Data is horizontally partitioned across nodes, so each node has a subset of the rows from each table in the database. Each node is then responsible for processing only the rows on its own disks.
To design an architecture that scales effectively for a particular application, it is crucial to make assumptions about the frequency of common and rare operations, referred to as load parameters.
Maintainability #
Maintainability aims to improve the experience of engineering and operations teams working with the system. Three critical design principles for software systems to enhance maintainability are operability, simplicity, and evolvability.
Operability
ensures that the system is easy for operations teams to manage and maintain to ensure it runs smoothly.
Simplicity
reduces complexity as much as possible to make it easier for new engineers to understand the system. However, note that simplicity in this context doesn’t refer to the simplicity of the user interface.
Evolvability
makes it easy for engineers to make changes to the system as the requirements change, adapting it for unanticipated use cases. This principle is also known as extensibility, modifiability, or plasticity.
An operations team is responsible for tasks such as:
- Monitoring the system’s health and restoring service promptly during issues
- Identifying the root cause of problems like system failures or reduced performance
- Ensuring that software and platforms are updated, including security patches
- Monitoring the system’s interdependencies with other systems
- Maintaining the system’s security
Good operability is achieved by simplifying routine tasks, freeing up the operations team’s time to focus on higher-value activities.
Symptoms of complexity:
- Explosion of state space
- Tight coupling of modules
- Tangled dependencies
- Inconsistent naming and terminology
- Performance-related hacks
- Special cases to work around issues elsewhere
Abstraction as a tool for reducing complexity:
- Hides implementation details behind a simple interface
- High-level programming languages abstract machine code, CPU registers, and syscalls
- SQL abstracts complexities of on-disk and in-memory data structures, concurrent requests, and inconsistencies after crashes.