Control in Amazon Web Services
Any sufficiently sensitive deployment to Amazon Web Services (AWS) requires a well-defined and complete model for operational control of that deployment. In an enterprise context, this model must cover multiple workloads from multiple teams, deployed into an environment in which compliance and security are paramount. We must know when change events in the infrastructure indicate violations, and act appropriately. In the cloud, we expect this oversight to be automated using a combination of managed services and bespoke code. AWS provides a number of tools for managing security and compliance in the cloud; we will consider a few of these before moving on to more subtle tools and techniques for controlling a cloud infrastructure.
Identity and Access Management
The core tool for controlling actions within AWS is AWS Identity and Access Management (IAM). IAM provides basic entities such as users, groups and roles with which to define authentication and authorisation rules. There is a lot of detail here around authentication, including federation, multi-factor authentication and the ability to assume roles, which we will skip over for the purposes of this post. The key point is that actors in the system operate as a user or role which has one or more IAM policies attached. To understand the extent of control that IAM policies offer, we'll look at them in more detail here.
IAM Policies
IAM policies define what resources an entity can access. They are written in JSON as a series of statements defining which Actions can be performed on what Resources under what Conditions (a minimal example follows the list below). This is a powerful construct that allows granular permissions to be defined. However, we find that in an enterprise context, where very specific rules are required, the following issues emerge:
- Complex requirements mean long and complicated JSON definitions with significant repetition
- JSON is somewhat limited when trying to express interdependencies
- There is incomplete coverage of resource attributes for defining rules. No doubt this is being actively worked on; however, with the rate of change at AWS, it is likely that omissions and frustrating edge cases will continue.
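As a minimal sketch of the Action/Resource/Condition structure, the following hypothetical policy allows EC2 instances to be stopped, but only in a single region. It uses boto3; the policy name and condition are illustrative only, and the caller is assumed to have permission to create IAM policies.

```python
import json

import boto3

# A hypothetical policy document: allow stopping EC2 instances,
# but only when the request targets eu-west-1.
policy_document = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["ec2:StopInstances"],
            "Resource": "*",
            "Condition": {
                "StringEquals": {"aws:RequestedRegion": "eu-west-1"}
            },
        }
    ],
}

iam = boto3.client("iam")
response = iam.create_policy(
    PolicyName="example-stop-instances",  # hypothetical name
    PolicyDocument=json.dumps(policy_document),
)
print(response["Policy"]["Arn"])
```

Even this small example hints at the issues above: expressing the same intent across many services and teams means repeating near-identical statements, with JSON offering no way to factor out the common parts.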
So, what are the alternatives and supporting technologies provided by AWS?
AWS Organisations
The recently released Organisations API provides a single parent entity around which AWS accounts can be aggregated. From a security perspective it provides two key features:
- Service Control Policies: Service control policies are roughly equivalent to IAM policies applied at an organisation or organisational unit level. They control which resources within an account are available to be authorised by IAM. Typically this means that services are globally blacklisted to prevent premature adoption within specific accounts (see the sketch after this list).
- The ability to create and destroy accounts via the Organisations API. In our opinion, this is a critical new feature in the development of AWS security. By making the management of accounts in AWS programmable, it strongly facilitates the use of accounts as the primary isolation mechanism within AWS. Whilst Security Groups and subnet ACLs can provide solid east-west protection for EC2 instances within a VPC, for non-VPC based services it is difficult to separate out the activities of different teams. Account-level isolation provides a full separation of activities between different projects. Now that management is programmable, the account creation process can be included in your project build scripts.
- The use of accounts also limits the “blast radius” of any issue, such as a performance problem, security breach or outage, to the extent of that account.
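As a rough sketch of how programmable account management and service control policies fit together, the following boto3 snippet creates a new member account and attaches an SCP that blacklists an entire service. The account email, policy name, denied service and target id are all hypothetical.

```python
import json

import boto3

org = boto3.client("organizations")

# Create a new member account for a project team. CreateAccount is
# asynchronous: the response carries a status to poll, not the account.
status = org.create_account(
    Email="team-a-aws@example.com",   # hypothetical
    AccountName="project-team-a",     # hypothetical
)["CreateAccountStatus"]

# A minimal SCP that blacklists a whole service (Redshift here,
# purely as an example of a service not yet approved for adoption).
scp = {
    "Version": "2012-10-17",
    "Statement": [
        {"Effect": "Deny", "Action": "redshift:*", "Resource": "*"}
    ],
}

policy = org.create_policy(
    Name="deny-unapproved-services",
    Description="Blacklist services pending security review",
    Type="SERVICE_CONTROL_POLICY",
    Content=json.dumps(scp),
)

org.attach_policy(
    PolicyId=policy["Policy"]["PolicySummary"]["Id"],
    TargetId="r-exmp",  # hypothetical organisation root or OU id
)
```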
AWS Config
AWS Config takes an inventory of all resources in a region and allows you to set rules against them. Changes in the resource inventory are streamed to an S3 bucket and can be delivered by SNS to endpoints including email, HTTP endpoints and SQS queues. Rules can be set against changes using a pick-list of off-the-shelf rules, or bespoke rules created using AWS Lambda which prevent or notify on actions.

Config takes an administrative viewpoint: specific events within the system are checked and action taken where necessary. The use of AWS Lambda allows a more accessible, and potentially portable, approach to the implementation of business logic using code written in Python, JavaScript, C# or Java. The delivery of Lambda functions into an AWS environment can be automated, such that the code can be tested using standard test tools and then delivered into production environments using Jenkins or another CI tool.
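To make the Lambda-backed rule idea concrete, here is a minimal sketch of a bespoke Config rule handler in Python. The rule itself, requiring a CostCentre tag on EC2 instances, is hypothetical; the event structure and put_evaluations call follow the custom Config rule programming model.

```python
import json

import boto3

config = boto3.client("config")

def handler(event, context):
    # Config delivers the changed configuration item as a JSON string.
    invoking_event = json.loads(event["invokingEvent"])
    item = invoking_event["configurationItem"]

    # Hypothetical rule: every EC2 instance must carry a CostCentre tag.
    if item["resourceType"] != "AWS::EC2::Instance":
        compliance = "NOT_APPLICABLE"
    elif "CostCentre" in (item.get("tags") or {}):
        compliance = "COMPLIANT"
    else:
        compliance = "NON_COMPLIANT"

    # Report the verdict back to Config against the original change.
    config.put_evaluations(
        Evaluations=[
            {
                "ComplianceResourceType": item["resourceType"],
                "ComplianceResourceId": item["resourceId"],
                "ComplianceType": compliance,
                "OrderingTimestamp": item["configurationItemCaptureTime"],
            }
        ],
        ResultToken=event["resultToken"],
    )
```

Because the rule is ordinary Python, it can be unit tested with canned events and promoted through environments like any other code, as described above.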
Complementary technologies
Another approach utilises CloudWatch Events to trigger Lambda functions. This allows the environment to respond to other types of event within the system, such as billing alarms and low utilisation statistics, and regular compliance checks can be run using scheduled events or Config. We have also seen success with the Cloud Custodian open source framework from Capital One, which provides a YAML DSL for a range of control functions and deployment options across AWS.
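As an illustration of the scheduled-event approach, the following sketch wires a daily CloudWatch Events schedule to a compliance-checking Lambda using boto3. The rule name, function name and ARN are hypothetical placeholders for resources you would already have deployed.

```python
import boto3

events = boto3.client("events")
lam = boto3.client("lambda")

FUNCTION_ARN = "arn:aws:lambda:eu-west-1:123456789012:function:compliance-check"  # hypothetical

# Fire once a day; the Lambda sweeps the environment for violations.
rule = events.put_rule(
    Name="daily-compliance-check",
    ScheduleExpression="rate(1 day)",
)

events.put_targets(
    Rule="daily-compliance-check",
    Targets=[{"Id": "compliance-check", "Arn": FUNCTION_ARN}],
)

# CloudWatch Events also needs permission to invoke the function.
lam.add_permission(
    FunctionName="compliance-check",
    StatementId="allow-cloudwatch-events",
    Action="lambda:InvokeFunction",
    Principal="events.amazonaws.com",
    SourceArn=rule["RuleArn"],
)
```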
Beyond Compliance to Observability
The technologies we have reviewed so far are excellent choices for ensuring compliance in the cloud. As events occur, we can check them against compliance rules and verify whether the changes are allowed or not. This approach, while effective against clear or mandatory requirements, runs into problems when applied to requirements which are more fluid. The context of events is essential when considering security, cost and other controls which are less well defined. For example:
- Even a very large EC2 instance spun up for one hour will generally cost less than a smaller instance left running for a month (a quick back-of-the-envelope sketch follows this list).
- If a user's access key is compromised, we will see a change in the types of services being provisioned with that access key. Only by understanding the standard behaviour of the user can we expect to detect the change in behaviour.
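To make the first point concrete, a quick back-of-the-envelope comparison with made-up prices:

```python
# Hypothetical on-demand hourly prices, purely for illustration;
# real prices vary by instance type, region and over time.
very_large_per_hour = 3.00
small_per_hour = 0.05

cost_large_one_hour = very_large_per_hour * 1
cost_small_one_month = small_per_hour * 24 * 30

print(f"Very large instance for 1 hour: ${cost_large_one_hour:.2f}")   # $3.00
print(f"Small instance for 30 days:     ${cost_small_one_month:.2f}")  # $36.00
```

A rule that simply flags "large instance launched" misses the point entirely: duration, not size, dominates the cost here.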
We must build up a picture of context and behaviour across the system in order to interpret events effectively. With limited resources and restricted timescales for execution, Lambda functions are poorly suited to working with global state, so what should we use?

The context we require is captured and made available by core AWS services such as CloudTrail, Config, CloudWatch and VPC Flow Logs, as well as logs and metrics from AWS services and applications running on EC2. From these sources, we can hope to build a near-complete record of activity, change and other events within an AWS infrastructure.

As log and event data, it is well suited to an Event Sourcing approach, wherein events are stored in an append-only log, processed by one or more systems and then materialised in multiple views for different end consumers to analyse and visualise. AWS Kinesis Streams provides a managed service with append-only semantics and a number of native integrations with event producers (such as CloudWatch). With records retained for at most 7 days, it works best as an event middleware: for data which must be retained, one consumer can publish it to a permanent system of record such as Glacier, or to a more interactive system like Cassandra.

With an event sourcing setup and a Kinesis log stream, we allow multiple systems to work with the infrastructure data and process events in context. Each system can perform a specialist function and work with a data store or view suited to its task, whether this be machine learning, stream processing with Spark, or visualisation with a time series database such as Prometheus. We demonstrate this approach in our webinar: “Detecting AWS Stolen Credentials with Apache Spark”.
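As a sketch of one consumer in such a fan-out, the following Python reads records from a hypothetical Kinesis stream of infrastructure events and archives them to S3 before the retention window expires. A production consumer would use the Kinesis Client Library and handle multiple shards, resharding and checkpointing; this is deliberately simplified to a single shard.

```python
import time

import boto3

kinesis = boto3.client("kinesis")
s3 = boto3.client("s3")

STREAM = "infrastructure-events"   # hypothetical stream name
ARCHIVE_BUCKET = "events-archive"  # hypothetical system-of-record bucket

# One consumer in the fan-out: archive every record permanently.
# Other consumers (stream processing, ML) read the same stream independently.
shard_id = kinesis.describe_stream(StreamName=STREAM)[
    "StreamDescription"]["Shards"][0]["ShardId"]
iterator = kinesis.get_shard_iterator(
    StreamName=STREAM,
    ShardId=shard_id,
    ShardIteratorType="TRIM_HORIZON",
)["ShardIterator"]

while True:
    batch = kinesis.get_records(ShardIterator=iterator, Limit=100)
    for record in batch["Records"]:
        # The sequence number gives a stable, ordered key for the archive.
        s3.put_object(
            Bucket=ARCHIVE_BUCKET,
            Key=f"events/{record['SequenceNumber']}.json",
            Body=record["Data"],
        )
    iterator = batch["NextShardIterator"]
    time.sleep(1)  # stay under the per-shard read limits
```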
Infrastructure Mapping
One possible output from the event streaming system we have described is a graph database such as Neo4j. Graph databases allow simple, native modelling of the connections between resources in an infrastructure system, and allow the impact of system changes to be modelled prior to deployment. This is a relatively embryonic approach, but one we will be actively developing to provide impact mapping and prediction in infrastructure deployment.
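As a small sketch of what this might look like, the following uses the Neo4j Python driver to materialise a toy dependency between an instance and a database, as a stream consumer might, and then runs the kind of Cypher query that impact mapping relies on. Connection details, labels and ids are all hypothetical.

```python
from neo4j import GraphDatabase

# Hypothetical connection details for a local Neo4j instance.
driver = GraphDatabase.driver("bolt://localhost:7687",
                              auth=("neo4j", "password"))

with driver.session() as session:
    # Materialise resources and their dependency as nodes and an edge,
    # as a consumer of the event stream might on each change event.
    session.run(
        "MERGE (i:Instance {id: $instance_id}) "
        "MERGE (d:Database {id: $db_id}) "
        "MERGE (i)-[:DEPENDS_ON]->(d)",
        instance_id="i-0abc123", db_id="prod-db",
    )

    # Impact mapping: which resources are affected, directly or
    # transitively, if prod-db fails?
    result = session.run(
        "MATCH (r)-[:DEPENDS_ON*]->(d:Database {id: $db_id}) "
        "RETURN r.id",
        db_id="prod-db",
    )
    for record in result:
        print(record["r.id"])

driver.close()
```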
Conclusion
It is vital that a rule-based approach to infrastructure control is taken to ensure that well-defined compliance and security rules are enforced. Beyond these controls, however, a rule-based approach limits agility and experimentation. Where controls are based on expectations rather than hard data this is exacerbated: our control assumptions may be overly restrictive or simply wrong.

By taking an event-based approach to processing infrastructure events, we can learn about behaviour within the system and fold this insight back into control structures. It also allows patterns of activity to be assessed in a flexible fashion, and more subtle intrusions or other threats to be detected using bespoke, but easily assembled, processing pipelines over a near-complete record of infrastructure events.

Graph structures which model dependencies within your infrastructure and service architecture allow you to reason about your distributed system and perform impact mapping, in terms of hardware failure, outage, system change and threat.