Cloud Architecture Resources

DZone's Featured Cloud Architecture Resources

Cybersecurity in the Cloud: Integrating Continuous Security Testing Within DevSecOps

By Prithvish Kovelamudi

Cloud computing has revolutionized software organizations' operations, offering unprecedented scalability, flexibility, and cost-efficiency in managing digital resources. This transformative technology enables businesses to rapidly deploy and scale services, adapt to changing market demands, and reduce operational costs. However, the transition to cloud infrastructure is challenging. The inherently dynamic nature of cloud environments and the escalating sophistication of cyber threats have made traditional security measures insufficient. In this rapidly evolving landscape, proactive and preventative strategies have become paramount to safeguard sensitive data and maintain operational integrity. Against this backdrop, integrating security practices within the development and operational workflows—DevSecOps—has emerged as a critical approach to fortifying cloud environments. At the heart of this paradigm shift is Continuous Security Testing (CST), a practice designed to embed security seamlessly into the fabric of cloud computing. CST facilitates the early detection and remediation of vulnerabilities and ensures that security considerations keep pace with rapid deployment cycles, thus enabling a more resilient and agile response to potential threats. By weaving security into every phase of the development process, from initial design to deployment and maintenance, CST embodies the proactive stance necessary in today's cyber landscape. This approach minimizes the attack surface and aligns with cloud services' dynamic and on-demand nature, ensuring that security evolves in lockstep with technological advancements and emerging threats. As organizations navigate the complexities of cloud adoption, embracing Continuous Security Testing within a DevSecOps framework offers a comprehensive and adaptive strategy to confront the multifaceted cyber challenges of the digital age. Most respondents (96%) of a recent software security survey believe their company would benefit from DevSecOps' central idea of automating security and compliance activities. This article describes the details of how CST can strengthen your cloud security and how you can integrate it into your cloud architecture. Key Concepts of Continuous Security Testing Continuous Security Testing (CST) helps identify and address security vulnerabilities in your application development lifecycle. Using automation tools, it analyzes your complete security structure and discovers and resolves the vulnerabilities. The following are the fundamental principles behind it: Shift-left approach: CST promotes early adoption of safety measures by bringing security testing and mitigation to the start of the software development lifecycle. This method reduces the possibility of vulnerabilities in later phases by assisting in the early detection and resolution of security issues. Automated security testing: Critical to CST is automation, which allows for consistent and quick evaluation of security measures, scanning for vulnerabilities, and code analysis. Automation ensures consistent and rapid security evaluation. Continuous monitoring and feedback: As part of CST, safety incidents and feedback chains are monitored in real-time, allowing security vulnerabilities to be identified and fixed quickly. Integrating Continuous Security Testing Into the Cloud Let's explore the phases involved in integrating CST into cloud environments. Laying the Foundation for Continuous Security Testing in the Cloud To successfully integrate Continuous Security Testing (CST), you must prepare your cloud environment first. Use a manual tool like OWASP or an automated security testing process to perform a thorough security audit and ensure your cloud environments are well-protected to lay a robust groundwork for CST. Before diving into integrating Continuous Security Testing (CST) within your cloud infrastructure, it's crucial to lay a solid foundation by meticulously preparing your cloud environment. This preparatory step involves conducting a comprehensive security audit to identify vulnerabilities and ensure your cloud architecture is fortified against threats. Leveraging tools such as the Open Web Application Security Project (OWASP) for manual evaluations or employing sophisticated automated security testing processes can significantly aid this endeavor. Conduct a detailed inventory of all assets and resources within your cloud architecture to assess your cloud environment's security posture. This includes everything from data storage solutions and archives to virtual machines and network configurations. By understanding the full scope of your cloud environment, you can better identify potential vulnerabilities and areas of risk. Next, systematically evaluate these components for security weaknesses, ensuring no stone is left unturned. This evaluation should encompass your cloud infrastructure's internal and external aspects, scrutinizing access controls, data encryption methods, and the security protocols of interconnected services and applications. Identifying and addressing these vulnerabilities at this stage sets a robust groundwork for the seamless integration of Continuous Security Testing, enhancing your cloud environment's resilience to cyber threats and ensuring a secure, uninterrupted operation of cloud-based services. By undertaking these critical preparatory steps, you position your organization to leverage CST effectively as a dynamic, ongoing practice that detects emerging threats in real-time and integrates security seamlessly into every phase of your cloud computing operations. Establishing Effective Security Testing Criteria The cornerstone of implementing Continuous Security Testing (CST) within cloud ecosystems is meticulously defining the security testing requirements. This pivotal step involves identifying a holistic suite of testing methodologies encompassing your security landscape, ensuring thorough coverage and protection against potential vulnerabilities. A multifaceted approach to security testing is essential for a robust defense strategy. This encompasses a variety of criteria, such as: Vulnerability scanning: Systematic examination of your cloud environment to identify and classify security loopholes. Penetration testing: Simulated cyber attacks against your system to evaluate the effectiveness of security measures. Compliance inspections: Assessments to ensure that cloud operations adhere to industry standards and regulatory requirements. Source code analysis: Examination of application source code to detect security flaws or vulnerabilities. Configuration analysis: Evaluation of system configurations to identify security weaknesses stemming from misconfigurations or outdated settings. Container security analysis: Analysis focused on the security of containerized applications, including their deployment, management, and orchestration. Organizations can proactively identify and rectify security vulnerabilities within their cloud architecture by selecting the appropriate mix of these testing criteria. This proactive stance enhances the overall security posture and embeds a culture of continuous improvement and vigilance across the cloud computing landscape. Adopting a comprehensive and systematic approach to security testing ensures that your cloud environment remains resilient against evolving cyber threats, safeguarding your critical assets and data effectively. Choosing the Right Security Testing Tools for Automation The transition to automated security testing tools is critical for achieving faster and more accurate security assessments, significantly reducing the manual effort, workforce involvement, and resources dedicated to routine tasks. A diverse range of tools exists to support this need, including Static Application Security Testing (SAST), Dynamic Application Security Testing (DAST), and safety measures for Infrastructure as Code (IaC) etc. These technologies are easy to integrate into Continuous Integration/Continuous Deployment (CI/CD) pipelines and improve security by finding and fixing vulnerabilities before development. More than half of DevOps teams conduct SAST scans, 44% conduct DAST scans, and almost 50% inspect containers and dependencies as part of their security measures. However, when choosing the right tool for automation, consider features like ease of use, the ability to get updated with the vulnerability, and ROI vs. the cost of the tool. When choosing the right automation tools, evaluating them based on several critical factors beyond their primary functionalities is vital. The ease of integration into existing workflows, their capacity for timely updates in response to new vulnerabilities, and the balance between their cost and the return on investment they offer are crucial considerations. These factors ensure that the selected tools enhance security measures and align with the organization's overall security strategy and resource allocation, facilitating a more secure and efficient development lifecycle. Continuous Monitoring and Improvement The bedrock of maintaining an up-to-date and secure cloud infrastructure lies in the practices of continuous monitoring and iterative improvement throughout the entirety of its lifecycle. Integrate your cloud log with Security Information and Event Management (SIEM) capabilities to get centralized security intelligence and initiate continuous monitoring and improvement. Similarly, ELK Stack (Elasticsearch, Logstash, Kibana) is another tool that can help you visualize, collect, and analyze your log data. Regularly monitoring your security landscape and adapting based on the insights gleaned from testing and monitoring outputs are essential. Such a proactive approach not only aids in preemptively identifying and mitigating potential threats but also ensures that your security framework remains robust and adaptive to the ever-evolving cyber threat landscape. Strategic Risk Management and Mitigation Efforts Effective security management requires a strategic approach to evaluating and mitigating vulnerabilities, guided by their criticality, exploitability, and potential repercussions for the organization. Utilizing threat modeling techniques enables a targeted allocation of resources, focusing on areas of highest risk to reduce exposure and avert potential security incidents. Following identifying critical vulnerabilities, devising and executing a comprehensive risk mitigation strategy is imperative. This strategy should encompass a range of solutions tailored to diminish the identified risks, including the deployment of software patches and updates, the establishment of enhanced security protocols, the integration of additional safeguarding measures, or even the strategic overhaul of existing systems and processes. Organizations can fortify their defenses by prioritizing and systematically addressing vulnerabilities based on severity and impact, ensuring a more secure and resilient operational environment. Benefits of Continuous Security Testing in the Cloud There are numerous benefits of using continuous security testing in cloud environments. Early vulnerability detection: Using CST, you can identify security issues early on and address them before they pose a risk. Enhanced security quality: To better defend your cloud infrastructure against cyberattacks, security testing gives it an additional layer of protection. Enhanced innovation and agility: CST enables faster release cycles by identifying risks early on, allowing you to take proactive measures to counter them. Enhanced team collaboration: CST promotes collaboration between different teams to cultivate a culture of collective accountability for security. Compliance with industry standards: By routinely assessing its security controls and procedures, you can lessen the possibility of fines and penalties for noncompliance with corporate policies and legal requirements. Conclusion In the rapidly evolving landscape of cloud computing, Continuous Security Testing (CST) emerges as a cornerstone for safeguarding cloud environments against pervasive cyber threats. By weaving security seamlessly into the development fabric through automation and vigilant monitoring, CST empowers organizations to detect and neutralize vulnerabilities preemptively. The adoption of CST transcends mere risk management; it fosters an environment where security, innovation, and collaboration converge, propelling businesses forward. This synergistic approach elevates organizations' security posture and instills a culture of continuous improvement and adaptability. As businesses navigate the complexities of the digital age, implementing CST positions them to confidently address the dynamic nature of cyber threats, ensuring resilience and securing their future in the cloud. More

Participate in DZone Research Surveys: You Can Shape Trend Reports! (+ Enter the Raffles)

By Caitlin Candelmo

Hello, DZone Community! We have several surveys in progress as part of our research for upcoming Trend Reports. We would love for you to join us by sharing your experiences and insights (anonymously if you choose) — readers just like you drive the content that we cover in our Trend Reports. you can find details for each research survey below Over the coming months, we will compile and analyze data from hundreds of respondents; results and observations will be featured in the "Key Research Findings" of our Trend Reports. Low and No Code Research DZone's research will be shaping our June Low-Code Development Trend Report. Our 2024 low code survey explores: Low code's impact on software quality, performance, maintainability, and scalability Opinions on low code use cases and experience with implementation How teams are integrating AI into their low-code practices Concerns about security and governance over low code Join the Low Code Research You'll also have the chance to enter the $300 raffle at the end of the survey — four random people will be drawn and receive $75 each (USD)! Cloud and Kubernetes Research This year, we're combining our annual cloud native and Kubernetes research into one 12-minute survey that dives further into these topics as they relate to both one another and at the intersection of security, observability, AI, and more. DZone's research will be informing these Trend Reports: May – Cloud Native: Championing Cloud Development Across the SDLC September – Kubernetes in the Enterprise Our 2024 cloud native survey covers: Microservices, container orchestration, and tools/solutions Kubernetes use cases, pain points, and security measures Cloud infrastructure, costs, tech debt, and security threats AI for release management and monitoring/observability Join the Cloud Native Research Don't forget to enter the $750 raffle at the end of the survey! Five random people will be selected to each receive $150 (USD). Your responses help inform the narrative of our Trend Reports, so we truly cannot do this without you. Stay tuned for each report's launch and see how your insights align with the larger DZone Community. We thank you in advance for your help! —The DZone Publications team More

Understanding the 2024 Cloud Security Landscape

By Rajat Gupta

Automatic Snapshots Using Snapshot Manager

By Shailendra Suryawanshi

Data Processing in GCP With Apache Airflow and BigQuery

By Sreenath Devineni

Securing Cloud Storage Access: Approach to Limiting Document Access Attempts

In today's digital age, cloud-hosted applications frequently use storage solutions like AWS S3 or Azure Blob Storage for images, documents, and more. Public URLs allow direct access to publicly accessible resources. However, sensitive images require protection and are not readily accessible via public URLs. Accessing such an image involves a JWT-protected API endpoint, which returns the needed image. We must pass the JWT token in the header to fetch the image using the GET API. The standard method for rendering these images in HTML uses JavaScript, which binds the byte content from the API to the img src attribute. Though straightforward, this approach might not always be suitable, especially when avoiding JavaScript execution. Could we simplify the process by assigning a direct URL to the img src attribute to render images without JavaScript and don't need to pass the JWT token in the header? This is possible with pre-signed URLs provided by AWS S3 and Azure Blob Storage, which grant temporary access to private resources by appending a unique, expiring token to the URL. While enhancing security by limiting access time, pre-signed URLs don't restrict the number of access attempts, allowing potentially unlimited access within the time window. Acknowledging this, a solution is needed that restricts access time for images within the HTML attribute and limits access attempts, ensuring sensitive images are safeguarded against unauthorized distribution. Time and Attempts Limited Cloud Storage Resource Access To address this challenge, we developed a solution combining cloud resources and associated database mapping with unique identifiers (GUIDs) and a token system that gets appended to the URL. We employ a GET API for secure image rendering that combines the base URL, document identifier, and token as query parameters. This method circumvents the limitations of embedding tokens in headers for image src attributes. Unique Identifier for the Image For the image for which we have the requirement to be rendered in an HTML document through the image src attribute, we generate a unique identifier for this image and persist the image cloud storage path and associated unique identifier (GUID) in the database; there is a dedicated API which does this functionality. This API is part of the microservice responsible for managing all the cloud documents for us. Token Management In a master token table, we define token types with attributes such as description, reusability, expiry, and access limits. Using the token type above, we generate a limited-time use token for the image identifier. Each image identifier is assigned a token, stored in a transaction token table with the image identifier GUID, enabling us to track access attempts. We have a microservice to manage these tokens. We will generate this limited-time token using one of the APIs from that microservice. Now we have both a unique identifier for the image and a time use token; based on that, we build the URL for the cloud storage resource something like below: {{baseUrl}/v1/document/{{imageIdentifier}?token={{limitedTimeUseToken} Sample URL: https://api.fabrikam.com/v1/document/e8655967-3d85-4a5c-b1a8-bb885cc4b81b?token=d5c68f04-b674-4df8-8729-081fe7a8f6b7 The above URL is for the API endpoint, which will send image bytes as a response once the token is validated. We will be covering this API in detail below. API To Render Image This API is straightforward. It will fetch the image cloud storage path based on the document UUID sent and then go to AWS S3 to pull the image. But before it performs all this, it goes through the token validation process through a filter. Sometimes, we need to perform certain operations on client requests before they reach the controller. Similarly, we may need to process controller responses before they are returned to clients. We can accomplish this by utilizing filters in Spring web applications. The above URL is an API endpoint in the documents microservice; the call goes through the filter DocAccessTokenFilter before reaching the controller. DocAccessTokenFilter acts as a gatekeeper for incoming requests to our documents or images serving API. This filter intercepts HTTP requests before they reach their intended controller, performing token validation to ensure that the requestor has permission to access the requested resource. Below is the implementation of the filter: Java @Order(1) public class DocAccessTokenFilter implements Filter { private Logger logger = CoreLoggerFactory.getLogger(DocAccessTokenFilter.class); private String tokenValidationApiUrl; public DocAccessTokenFilter(String dlApiBaseUrl) { this.tokenValidationApiUrl = String.format("%s/%s", dlApiBaseUrl, "doctoken/validate"); } @Override public void doFilter(ServletRequest request, ServletResponse response, FilterChain chain) throws IOException, ServletException { final HttpServletRequest req = (HttpServletRequest) request; final HttpServletResponse res = (HttpServletResponse) response; final String docToken = req.getParameter("token"); if (StringUtils.isNullOrEmpty(docToken)) { sendError(res, "Missing Document Access Token"); } else { RestTemplate restTemplate = new RestTemplate(); HttpComponentsClientHttpRequestFactory requestFactory = new HttpComponentsClientHttpRequestFactory(); restTemplate.setRequestFactory(requestFactory); HttpHeaders headers = new HttpHeaders(); headers.setContentType(MediaType.APPLICATION_JSON); TokenValidateRequest tknValidateReq = new TokenValidateRequest(); tknValidateReq.setToken(docToken); ResponseEntity<ApiResult> tknValidateResp = null; HttpEntity<TokenValidateRequest> tknValidateReqEntity = new HttpEntity<>(tknValidateReq, headers); try { tknValidateResp = restTemplate.postForEntity(tokenValidationApiUrl, tknValidateReqEntity, ApiResult.class); if (tknValidateResp.getStatusCode() == HttpStatus.OK) { logger.warn("Token validation successful"); chain.doFilter(request, response); } else { sendError(res, "Invalid Token"); } } catch (Exception ex) { logger.warn(String.format("Exception while validating token %s", ex.getMessage())); sendError(res, "Invalid Token"); } } } private void sendError(HttpServletResponse response, String errorMsg) throws IOException { response.resetBuffer(); response.setStatus(HttpServletResponse.SC_UNAUTHORIZED); response.setHeader("Content-Type", "application/json"); ApiResult result = new ApiResult(); result.setStatus(HttpStatus.UNAUTHORIZED); result.setMessage(errorMsg); ObjectMapper mapper = new ObjectMapper(); String concatenatedMsg = mapper.writeValueAsString(result); response.getOutputStream().print(concatenatedMsg); response.flushBuffer(); } } In the filter, we take the token query parameter, pass, and validate that token against the service responsible for managing the tokens, including validating the tokens. If this API returns HTTP status 200, then it's a valid token, and in all other cases, it would be treated as an invalid token. In cases where the token is not passed or token validation fails, the filter throws an HTTP 401 unauthorized error code back to the consuming client application. The filter calls another API responsible for validation and managing all tokens. This filter ensures that every request to access a document or image passes through a security checkpoint, verifying that the requestor possesses a valid, unexpired access token. We need to configure the filter as part of the Spring Boot application as below: Java @EnableWebSecurity @Configuration @Order(2) public class DocSecurityConfiguration extends WebSecurityConfigurerAdapter { @Autowired Environment env; @Value("${token.validation.api}") String tokenValidationApiUrl;; @Override public void configure(HttpSecurity http) throws Exception { http .csrf().disable() .requestMatchers() .antMatchers("/api/v1/document/**").and() .addFilterBefore(new DocAccessTokenFilter(this.tokenValidationApiUrl), UsernamePasswordAuthenticationFilter.class) .authorizeRequests().anyRequest().permitAll(); } } We configure security for a Spring web application, specifically applying token validation for requests accessing document resources. We insert a custom filter to validate access tokens, ensuring that document or image access is securely controlled based on token validation rules of attempts and time. Conclusion The solution transcends the limitations of pre-signed URLs with an access control system based on time and attempts, enhancing security for cloud-stored images. It simplifies their integration into HTML documents, especially for generating digital documents with HTML content in mobile applications. An application example includes displaying a user's digital wet ink signature, securely stored in cloud storage and seamlessly embedded without relying on JavaScript.

By Amol Gote

CORE

Decoding Synchronous and Asynchronous Communication in Cloud-Native Applications

Imagine building a complex machine with numerous independent parts, each performing its function, but all needing to communicate effectively with each other to accomplish a task. This is the challenge we face when designing cloud-native applications, which consist of interconnected microservices and serverless components. In this article, we explore the specifics of designing robust and resilient communication systems that can effectively coordinate these independent elements both within and beyond the application boundaries. These finely-grained services engage in internal and external interactions, employing various communication methods, synchronous or asynchronous. In synchronous communication, a service invokes another service using HTTP or gRPC, awaiting a response within a specified timeframe before proceeding. Conversely, asynchronous communication involves exchanging messages without expecting an immediate response. Message brokers such as RabbitMQ or Kafka serve as intermediaries, buffering messages to ensure reliable delivery. In cloud-native applications, embracing a combination of communication patterns is often a practical approach. Let's begin with synchronous communications first. What Is Synchronous Communication? Synchronous communication is like a conversation. One service (let’s call it Service A) initiates a request and then waits for a response from another service (Service B) or external APIs. This is akin to asking a question and waiting for an answer. Service A sends a request over HTTP and waits. It’s either waiting for a response from Service B or for a maximum waiting time to expire. During this waiting period, Service A is temporarily blocked, much like a person who pauses their activities to wait for a response. This pattern, often referred to as a request-reply pattern, is relatively simple to implement. However, using it extensively can introduce challenges that require careful consideration. Synchronous communication in the cloud Challenges of Synchronous Communication While synchronous communication is a powerful tool in our cloud-native toolkit, it comes with its own set of challenges that require careful consideration. Temporal Coupling Excessive reliance on synchronous communication throughout the solution can lead to temporal coupling issues. It occurs when numerous synchronous calls are chained together, resulting in extended wait times for client applications to receive responses. Availability Dependency Synchronous communication necessitates simultaneous availability of all communicating services. If backend services are under unexpected loads, client applications may experience failures due to timeout errors, impacting overall performance. Network Quality Impact Network quality can directly impact the performance of synchronous communication, including available bandwidth and the duration required for responses to traverse between serving backend services. Despite these challenges, synchronous communication can prove invaluable in specific scenarios. Let us explore some use cases in the next section where synchronous communication might be the better choice. When To Use Synchronous Communication Here are some situations where using synchronous communication can prove to be a better choice. Real-Time Data Access or Guaranteed Outcome When immediate or real-time feedback is needed, synchronous communication offers efficiency. For instance, when a customer places an order on an e-commerce website, the e-commerce front end needs to check the inventory system to ensure the item is in stock. This is a synchronous operation because the application needs to wait for the inventory system’s response before it can proceed with the order. Orchestrating Sequence of Dependent Tasks In cases where a service must execute a sequence of tasks, each dependent on the previous one, synchronous communication maintains order. It is specifically appropriate for workflows where task order is critical. Maintaining Transactional Integrity When maintaining data consistency across multiple components is vital, synchronous communication can help maintain atomic transactions. It is relevant for scenarios like financial transactions where data integrity is paramount. Synchronous communication is a powerful tool and has its challenges. The good news is that we also have the option of asynchronous communication—a complementary style that can work alongside synchronous methods. Let us explore this further in the next section. What Is Asynchronous Communication? Asynchronous communication patterns offer a dynamic and efficient approach to inter-service communication. Unlike synchronous communication, asynchronous communication allows a service to initiate a request without awaiting an immediate response. In this model, responses may not be immediate or arrive asynchronously on a separate channel, such as a callback queue. This mode of communication relies on protocols like the Advanced Message Queuing Protocol (AMQP) and messaging middleware, including message brokers or event brokers. This messaging middleware acts as an intermediary with minimal business logic. It receives messages from the source or producer service and then channels them to the intended consuming service. Integrating message middleware can significantly boost the resilience and fault tolerance of this decoupled approach. Asynchronous communication encompasses various implementations. Let us explore those further. One-To-One Communication In a One-to-One message communication, a producer dispatches messages exclusively to a receiver using a messaging broker. Typically, the message broker relies on queues to ensure reliable communication and offer delivery guarantees, such as at least once. The implementation resembles a command pattern, where the delivered messages act as commands consumed by the subscriber service to trigger actions. Let us consider an example of an online retail shop to illustrate its use. An online business heavily depends on the reliability of its website. The pattern provides fault tolerance and message guarantees, ensuring once a customer has placed an order on the website, the backend fulfillment systems receive the order to be processed. The messages broker preserves the message even if the backend system is down and will deliver when these can be processed. For instance, in an e-commerce application, when a customer places an order, the order details can be sent as a message from the order service (producer) to the fulfillment service (consumer) using a message broker. This is an example of one-to-one communication. Asynchronous One-to-One communication in the cloud An extension of the one-to-one message pattern is the asynchronous request-reply pattern. In this scenario, the dispatcher sends a message without expecting a response. But in a few specific scenarios, the consumer must respond to the producing service, utilizing the queues in the same message broker infrastructure queues. The response from the consumer may contain additional metadata such as ID to correlate the initial request or address to respond. Since the producer does not expect an immediate response, an independent producer workflow manages these replies. The fulfillment service (consumer) responds back to the frontend order service (producer) once the order has been dispatched so that the customer can be updated on the website. Asynchronous One-to-One Request Reply communication in cloud The single consumer communication comes in handy when two services communicate point to point. However, there could be scenarios when a publisher must send a particular event to multiple subscribers, which leads us to the following pattern. One-To-Many Communication This communication style is valuable when a single component (publisher) needs to broadcast an event to multiple components and services (subscribers). The one-to-many communication uses a concept of topic, which is analogous to an online forum. It is like an online forum where multiple users can post articles, and their followers can read them in their own time, responding as they fit. Similarly, applications can have topics where producer services write to these topics, and consuming services can read from the topic. It is one of the most popular patterns in real-world applications. Consider again the e-commerce platform has a service that updates product prices and multiple services need this information (like a subscription service, a recommendation service, etc.), the price update can be sent as a message to a topic in a message broker. All interested services (subscribers) can listen to this topic and receive the price update. This is an example of one-to-many communication. Several tools are available to implement this pattern, with Apache Kafka, Redis Pub/Sub, Amazon SNS, and Azure Event Grid ranking among the most popular choices. Asynchronous One-to-Many communication in cloud Challenges of Asynchronous Communication While asynchronous communication offers many benefits, it also introduces its own set of challenges. Resiliency and Fault Tolerance With numerous microservices and serverless components, each having multiple instances, failures are inevitable. Instances can crash, become overwhelmed, or experience transient faults. Moreover, the sender does not wait for the message to be processed, so it might not be immediately aware if an error occurs. We must adopt strategies like: Retry mechanisms: Retrying failed network calls for transient faults Circuit Breaker pattern: Preventing repeated calls to failing services to avoid resource bottlenecks Distributed Tracing Asynchronous communication can span multiple services, making it challenging to monitor overall system performance. Implementing distributed tracing helps tie logs and metrics together to understand transaction flow. Complex Debugging and Monitoring Asynchronous communications can be more difficult to debug and monitor because operations do not follow a linear flow. Specialized tools and techniques are often required to effectively debug and monitor these systems. Resource Management Asynchronous systems often involve long-lived connections and background processing, which can lead to resource management challenges. Care must be taken to manage resources effectively to prevent memory leaks or excessive CPU usage. Understanding these challenges can help design more robust and resilient asynchronous communication systems in their cloud-native applications. Final Words The choice between synchronous and asynchronous communication patterns is not binary but rather a strategic decision based on the specific requirements of the application. Synchronous communication is easy to implement and provides immediate feedback, making it suitable for real-time data access, orchestrating dependent tasks, and maintaining transactional integrity. However, it comes with challenges such as temporal coupling, availability dependency, and network quality impact. On the other hand, asynchronous communication allows a service to initiate a request without waiting for an immediate response, enhancing the system’s responsiveness and scalability. It offers flexibility, making it ideal for scenarios where immediate feedback is not necessary. However, it introduces complexities in resiliency, fault tolerance, distributed tracing, debugging, monitoring, and resource management. In conclusion, designing robust and resilient communication systems for cloud-native applications requires a deep understanding of both synchronous and asynchronous communication patterns. By carefully considering the strengths and weaknesses of each pattern and aligning them with the requirements, architects can design systems that effectively coordinate independent elements, both within and beyond the application boundaries, to deliver high-performing, scalable, and reliable cloud-native applications.

By Gaurav Gaur

CORE

Leveraging Feature Flags With IBM Cloud App Configuration in React Applications

In modern application development, delivering personalized and controlled user experiences is paramount. This necessitates the ability to toggle features dynamically, enabling developers to adapt their applications in response to changing user needs and preferences. Feature flags, also known as feature toggles, have emerged as a critical tool in achieving this flexibility. These flags empower developers to activate or deactivate specific functionalities based on various criteria such as user access, geographic location, or user behavior. React, a popular JavaScript framework known for its component-based architecture, is widely adopted in building user interfaces. Given its modular nature, React applications are particularly well-suited for integrating feature flags seamlessly. In this guide, we'll explore how to integrate feature flags into your React applications using IBM App Configuration, a robust platform designed to manage application features and configurations. By leveraging feature flags and IBM App Configuration, developers can unlock enhanced flexibility and control in their development process, ultimately delivering tailored user experiences with ease. IBM App Configuration can be integrated with any framework be it React, Angular, Java, Go, etc. React is a popular JavaScript framework that uses a component-based architecture, allowing developers to build reusable and modular UI components. This makes it easier to manage complex user interfaces by breaking them down into smaller, self-contained units. Adding feature flags to React components will make it easier for us to handle the components. Integrating With IBM App Configuration IBM App Configuration provides a comprehensive platform for managing feature flags, environments, collections, segments, and more. Before delving into the tutorial, it's important to understand why integrating your React application with IBM App Configuration is necessary and what benefits it offers. By integrating with IBM App Configuration, developers gain the ability to dynamically toggle features on and off within their applications. This capability is crucial for modern application development, as it allows developers to deliver controlled and personalized user experiences. With feature flags, developers can activate or deactivate specific functionalities based on factors such as user access, geographic location, or user preferences. This not only enhances user experiences but also provides developers with greater flexibility and control over feature deployments. Additionally, IBM App Configuration offers segments for targeted rollouts, enabling developers to gradually release features to specific groups of users. Overall, integrating with IBM App Configuration empowers developers to adapt their applications' behavior in real time, improving agility, and enhancing user satisfaction. To begin integrating your React application with App Configuration, follow these steps: 1. Create an Instance Start by creating an instance of IBM App Configuration on cloud.ibm.com. Within the instance, create an environment, such as Dev, to manage your configurations. Now create a collection. Creating collections comes in handy when there are multiple feature flags created for various projects. Each project can have a collection in the same App Configuration instance and you can tag these feature flags to the collection to which they belong. 2. Generate Credentials Access the service credentials section and generate new credentials. These credentials will be required to authenticate your React application with App Configuration. 3. Install SDK In your React application, install the IBM App Configuration React SDK using npm: CSS npm i ibm-appconfiguration-react-client-sdk 4. Configure Provider In your index.js or App.js, wrap your application component with AppConfigProvider to enable AppConfig within your React app. The Provider must be wrapped at the main level of the application, to ensure the entire application has access. The AppConfigProvider requires various parameters as shown in the screenshot below. All of these values can be found in the credentials created. 5. Access Feature Flags Now, within your App Configuration instance, create feature flags to control specific functionalities. Copy the feature flag ID for further integration into your code. Integrating Feature Flags Into React Components Once you've set up the AppConfig in your React application, you can seamlessly integrate feature flags into your components. Enable Components Dynamically Use the feature flag ID copied from the App Configuration instance to toggle specific components based on the flag's status. This allows you to enable or disable features dynamically without redeploying your application. Utilizing Segments for Targeted Rollouts IBM App Configuration offers segments to target specific groups of users, enabling personalized experiences and controlled rollouts. Here's how to leverage segments effectively: Define Segments Create segments based on user properties, behaviors, or other criteria to target specific user groups. Rollout Percentage Adjust the rollout percentage to control the percentage of users who receive the feature within a targeted segment. This enables gradual rollouts or A/B testing scenarios. Example If the rollout percentage is set to 100% and a particular segment is targeted, then the feature is rolled out to all the users in that particular segment. If the rollout percentage is set between 1% to 99% and the rollout percentage is 60%, for example, and a particular segment is targeted, then the feature is rolled out to randomly 60% of the users in that particular segment. If the rollout percentage is set to 0% and a particular segment is targeted, then the feature is rolled out to none of the users in that particular segment. Conclusion Integrating feature flags with IBM App Configuration empowers React developers to implement dynamic feature toggling and targeted rollouts seamlessly. By leveraging feature flags and segments, developers can deliver personalized user experiences while maintaining control over feature deployments. Start integrating feature flags into your React applications today to unlock enhanced flexibility and control in your development process.

By Pradeep Gopalgowda

Optimizing Kubernetes Clusters for Better Efficiency and Cost Savings

Optimizing resource utilization is a universal aspiration, but achieving it is considerably more complex than one might express in mere words. The process demands extensive performance testing, precise server right-sizing, and numerous adjustments to resource specifications. These challenges persist and, indeed, become more nuanced within Kubernetes environments than in traditional systems. At the core of constructing a high-performing and cost-effective Kubernetes cluster is the art of efficiently managing resources by tailoring your Kubernetes workloads. Delving into the intricacies of Kubernetes, it's essential to comprehend the different components that interact when deploying applications on k8s clusters. During my research for this article, an enlightening piece on LinkedIn caught my attention, underscoring the tendency of enterprises to overprovision their Kubernetes clusters. I propose solutions for enterprises to enhance their cluster efficiency and reduce expenses. Before we proceed, it's crucial to familiarize ourselves with the terminology that will be prevalent throughout this article. This foundational section is designed to equip the reader with the necessary knowledge for the detailed exploration ahead. Understanding the Basics Pod: A Pod represents the smallest deployable unit that can be created and managed in Kubernetes, consisting of one or more containers that share storage, network, and details on how to run the containers. Replicas: Replicas in Kubernetes are multiple instances of a Pod maintained by a controller for redundancy and scalability to ensure that the desired state matches the observed state. Deployment: A Deployment in Kubernetes is a higher-level abstraction that manages the lifecycle of Pods and ensures that a specified number of replicas are running and up to date. Nodes: Nodes are the physical or virtual machines that make up the Kubernetes cluster, each responsible for running the Pods or workloads and providing them with the necessary server resources. Kube-scheduler: The Kube-scheduler is a critical component of Kubernetes that selects the most suitable Node for a Pod to run on based on resource availability and other scheduling criteria. Kubelet: The kubelet runs on each node in a Kubernetes cluster and ensures that containers are running in a Pod as specified in the pod manifest file. It also manages the lifecycle of containers, monitors their health, and handles the instructions to start and stop containers directed by the control plane. Understanding Kubernetes Resource Management In Kubernetes, during the deployment of a pod, you can specify the necessary CPU and memory—a decision that shapes the performance and stability of your applications. The kube-scheduler uses the resource requests you set to determine the optimal node for your pod. At the same time, kubelet enforces the resource limits, ensuring that containers operate within their allocated share. Resource requests: Resource requests guarantee that a container will have a minimum amount of CPU or memory available. The kube-scheduler considers these requests to ensure a node has sufficient resources to host the pod, aiming for an even distribution of workloads. Resource limits: Resource limits, on the other hand, act as a safeguard against excessive usage. Should a container exceed these limits, it may face restrictions like CPU throttling or, in memory's case, termination to prevent resource starvation on the node. YAML apiVersion: apps/v1 kind: Deployment metadata: name: example-deployment spec: replicas: 2 selector: matchLabels: app: example template: metadata: labels: app: example spec: containers: - name: example-container image: nginx:1.17 ports: - containerPort: 80 resources: requests: memory: "64Mi" cpu: "250m" limits: memory: "128Mi" cpu: "500m" Let’s break down these concepts with two illustrative cases: Case 1 (No Limits Specified) Imagine a pod with a memory request of 64Mi and a CPU request of 250m on a node with ample resources—4GB of memory and 4 CPUs. This pod can utilize more resources than requested without defined limits, borrowing from the node's surplus. However, this freedom comes with potential side effects; it can influence the availability of resources for other pods and, in extreme cases, lead to system components like kubelet becoming unresponsive. Case 2 (With Defined Requests and Limits) In another scenario, a pod with a memory request of 64Mi and a limit of 128Mi, along with a CPU request of 250m and a limit of 500m, finds itself on the same resource-rich node. Kubernetes will reserve the requested resources for this pod but enforce the set limits strictly. Exceeding these limits in the case of memory can result in the kubelet restarting or terminating the container or throttling its CPU usage if the CPU maintains a harmonious balance on the node. The Double-Edged Sword of CPU Limits CPU limits are designed to protect the node from overutilization but can be a mixed blessing. They might trigger CPU throttling, impacting container performance and response times. This was observed by Buffer, where containers experienced throttling even when CPU usage was below the defined limits. To navigate this, they isolated "No CPU Limits" services on specific nodes and fine-tuned their CPU and memory requests through vigilant monitoring. While this strategy reduced container density, it also improved service latency and performance—a delicate trade-off in the quest for optimal resource utilization. Understanding Kubernetes Scaling Now that we've covered the critical roles of requests and limits in workload deployment let's explore their impact on Kubernetes' automated scaling. Kubernetes offers two primary scaling methods: one for pod replicas and another for cluster nodes, both crucial for maximizing resource utilization, cost efficiency, and performance. Horizontal Pod Autoscaling (HPA) Horizontal Pod Autoscaling (HPA) in Kubernetes dynamically adjusts the number of pod replicas in a deployment or replica set based on observed CPU, memory utilization, or other specified metrics. It's a mechanism designed to automatically scale the number of pods horizontally—not to be confused with vertical scaling, which increases the resources for existing pods. The HPA operates within the defined minimum and maximum replica parameters and relies on metrics provided by the cluster's metrics server to make scaling decisions. It is essential to specify resource requests for CPU and memory in your pod specifications, as these inform the HPA's understanding of each pod's resource utilization and guide its scaling actions. The HPA evaluates resource usage at regular intervals, scaling the number of replicas up or down to meet the desired target metrics efficiently. This process ensures that your application maintains performance and availability, even as workload demands fluctuate. The example below automatically adjusts the number of pod replicas within the range of 1 to 10 based on CPU utilization, aiming to maintain an average CPU utilization of 50% across all pods. YAML apiVersion: autoscaling/v2beta2 kind: HorizontalPodAutoscaler metadata: name: example-hpa namespace: default spec: scaleTargetRef: apiVersion: apps/v1 kind: Deployment name: example-deployment minReplicas: 1 maxReplicas: 10 metrics: - type: Resource resource: name: cpu target: type: Utilization averageUtilization: 50 Cluster Autoscaler The Cluster Autoscaler can automatically adjust the size of cluster nodes so that all pods have a place to run and there are no unneeded nodes. It works by increasing the number of nodes during high demand when pods fail to launch due to insufficient resources, and it decreases the number when nodes are underutilized. The autoscaler estimates the necessity for node scaling based on Pod resource requests—pods that cannot be scheduled due to lack of resources will trigger the autoscaler to add nodes. In contrast, nodes that have been underutilized for a set period of time and have pods that can be comfortably moved to other nodes will be considered for removal. This ensures a cost-effective and performance-optimized cluster operation. Conclusion Optimization is not a one-time event but a continuous process. Rigorous load testing is essential to comprehend how an application performs under varying levels of demand. Utilizing observability tools such as NewRelic, Dynatrace, or Grafana can reveal resource consumption patterns. Take the average resource utilization from several load tests and consider adding a 10-15% buffer to accommodate unexpected spikes, adjusting as necessary for your specific application needs. Once you establish baseline resource needs, deploy workloads with appropriately configured resource requests and limits. Ongoing monitoring is paramount to ensure resources are being used efficiently. Set up comprehensive alerting systems to notify you of underutilization and potential performance issues, such as throttling. This vigilance ensures your workloads are not just running but running optimally. Organize your infrastructure by creating distinct node groups tailored to different application types—like those requiring GPUs or high memory. In cloud environments, smart utilization of spot instances can lead to substantial cost savings. However, always prioritize non-critical applications for these instances to safeguard business continuity should the cloud provider need to reclaim resources.

By Harshavardhan Nerella

How Secure Cloud Development Replaces Virtual Desktop Infrastructures

Why Do Organizations Need Secure Development Environments? The need to secure corporate IT environments is common to all functions of organizations, and software application development is one of them. At its core, the need for securing IT environments in organizations arises from the digital corporate assets that they carry. It’s often data attached to privacy concerns, typically under regulations such as GDPR or HIPAA, or application source code, credentials, and most recently operational data that can have strategic significance. Threat scenarios attached to corporate data are not only bound to leaking data to outsiders but also preventing insiders with nefarious intent to exfiltrate data. Hence the security problem is multifaceted: it spans from careless asset handling to willful mishandling. In the case of environments for software application development, the complexity of the security problem lies in addressing the diversity of these environments’ settings. They range from data access needs and environment configuration to the developer’s relationship with the company; e.g., internal employee, consultant, temporary, etc. Security left aside, development environments have notoriously complex setups and often require significant maintenance because many applications and data are locally present on the device’s internal storage; for example, the integrated development environment (IDE) and the application’s source code. Hence, for these environments data protection against leaks will target locally stored assets such as source code, credentials, and potentially sensitive data. Assessing the Risk of Locally Stored Data Let’s first take a quick step back in ICT history and look at an oft-cited 2010 benchmark study named "The Billion Dollar Lost Laptop Problem". The study looks at 329 organizations over 12 months and reports that over 86,000 laptops were stolen or lost, resulting in a loss of 2.1 billion USD, an average of 6.4 million per organization. In 2010, the use of the Cloud as a storage medium for corporate data was nascent; hence today, the metrics to determine the cost and impact of the loss of a corporate laptop would likely look very different. For example, for many of the business functions that were likely to be impacted at that time, Cloud applications have brought today a solution by removing sensitive data from employees’ laptops. This has mostly shifted the discussion on laptop security to protecting the credentials required to access Cloud (or self-hosted) business resources, rather than protecting locally stored data itself. Figure 1: In 2024, most business productivity data has already moved to the cloud. Back in the 2010s, a notable move was CRM data, which ended up greatly reducing the risk of corporate data leaks. There is, though, a notable exception to the above shift in technology: the environments used for code development. For practical reasons, devices used for development today have a replica of projects’ source code, in addition to corporate secrets such as credentials, web tokens, cryptographic keys and perhaps strategic data to train machine learning models or to test algorithms. In other words, there is still plenty of interesting data stored locally in development environments that warrant protection against loss or theft. Therefore, the interest in securing development environments has not waned. There are a variety of reasons for malicious actors to go after assets in these environments, from accessing corporate intellectual property (see the hack of Grand Theft Auto 6), to understanding existing vulnerabilities of an application in order to compromise it in operation. Once compromised, the application might provide access to sensitive data such as personal user information, including credit card numbers. See, for example, the source code hack at Samsung. The final intent here is again to leak potentially sensitive or personal data. Recent and notorious hacks of this kind were password manager company LastPass and the Mercedes hack in early 2024. Despite all these potential downfalls resulting from the hacking of a single developer’s environment, few companies today can accurately determine where the replicas of their source code, secrets, and data are (hint: likely all over the devices of their distributed workforce), and are poorly shielded against the loss of a laptop or a looming insider threat. Recall that, using any online or self-hosted source code repositories such as GitHub does not get rid of any of the replicas in developers’ environments. This is because local replicas are needed for developers to update the code before sending it back to the online Git repository. Hence protecting these environments is a problem that grows with the number of developers working in the organization. Use Cases for Virtual Desktops and Secure Developer Laptops The desire to remove data from developers’ environments is prevalent across many regulated industries such as Finance and Insurance. One of the most common approaches is the use of development machines accessed remotely. Citrix and VMware have been key actors in this market by enabling developers to remotely access virtual machines hosted by the organization. In addition, these platforms implement data loss prevention mechanisms that monitor user activities to prevent data exfiltration. Figure 2: Left - Developers to remotely access virtual machines hosted by the organization. Right - Virtualization has evolved from emulating machines to processes, which is used as a staple for DevOps. Running and accessing a virtual machine remotely for development has many drawbacks in particular on the developer’s productivity. One reason is that the streaming mechanism used to access the remote desktop requires significant bandwidth to be truly usable and often results in irritating lags when typing code. The entire apparatus is also complex to set up, as well as costly to maintain and operate for the organization. In particular, the use of a virtual machine is quite a heavy mechanism that requires significant computational resources (hence cost) to run. Finally, such a setup is general-purpose; i.e., it is not designed in particular for code development and requires the installation of the entire development tool suite. For the reasons explained above, many organizations have reverted to securing developer laptops using end-point security mechanisms implementing data loss prevention measures. In the same way, as for the VDI counterpart, this is also often a costly solution because such laptops have complex setups. When onboarding remote development teams, organizations often send these laptops through the mail at great expense, which complicates the maintenance and monitoring process. The Case for Secure Cloud Development Environments Recently, virtualization has evolved from emulating entire machines to the granularity of single processes with the technology of software containers. Containers are well-suited for code development because they provide a minimal and sufficient environment to compile typical applications, in particular web-based ones. Notably, in comparison to virtual machines, containers start in seconds instead of minutes and require much fewer computational resources to execute. Containers are typically tools used locally by developers on their devices to isolate software dependencies related to a specific project in a way that the source code can be compiled and executed without interference with potentially unwanted settings. The great thing about containers is that they don’t have to remain a locally used development tool. They can be run online and used as an alternative to a virtual machine. This is the basic mechanism used to implement a Cloud Development Environment (CDE). Figure 3: Containers can be run online and become a lightweight alternative to a virtual machine. This is the basic mechanism to implement a Cloud Development Environment. Running containers online has been one of the most exciting recent trends in virtualization aligned with DevOps practices where containers are critical to enable efficient testing and deployments. CDEs are accessed online with an IDE via network connection (Microsoft Visual Studio Code has such a feature as explained here) or using a Cloud IDE (an IDE running in a web browser such as Microsoft Visual Studio Code, Eclipse Theia, and others.) A Cloud IDE allows a developer to access a CDE with the benefit that no environment needs to be installed on the local device. Access to the remote container is done transparently. Compared to a remotely executing desktop as explained before, discomfort due to a streaming environment does not apply here since the IDE is executing as a web application in the browser. Hence the developer will not suffer display lags in particular in low bandwidth environments as is the case with VDI and DaaS. Bandwidth requirements between the IDE and the CDE are low because only text information is exchanged between the two. Figure 4: Access to the remote container is done with an IDE running in a web browser; hence, developers will not suffer display lags, particularly in low bandwidth environments As a result, in the specific context of application development, the use of CDEs is a lightweight mechanism to remove development data from local devices. However, this still does not achieve the security delivered by Citrix and other VDI platforms, because CDEs are designed for efficiency and not for security. They do not provide any data loss prevention mechanism. This is where the case to implement secure Cloud Development Environments lies: CDEs with data loss prevention provide a lightweight alternative to the use of VDI or secure development laptops, with the additional benefit of an improved developer experience. The resulting platform is a secure Cloud Development platform. Using such a platform, organizations can significantly start to reduce the cost of provisioning secure development environments for their developers. Figure 5: To become a replacement for VDIs or secure laptops, Cloud Development Environments need to include security measures against data leaks. Moving From Virtual Desktops To Secure Cloud Development Environments As a conclusion to this discussion, below I briefly retrace the different steps to build the case for a secure Cloud-based development platform that combines the efficient infrastructure of CDE with end-to-end data protection against data exfiltration, leading to a secure CDE. Initially, secure developer laptops were used to directly access corporate resources sometimes using a VPN when outside the IT perimeter. According to the benchmark study that I mentioned at the beginning of this article, 41% of laptops routinely contained sensitive data according to the study that I mentioned at the beginning of this article. Then, the use of virtual machines and early access to web applications has allowed organizations to remove data from local laptop storage. But code development on remote virtual machines was and remains strenuous. Recently, the use of lightweight virtualization based on containers has allowed quicker access to online development environments, but all current vendors in this space do not have data security since the primary use case is productivity. Figure 6: A representation of the technological evolution of mechanisms used by organizations to provision secure development environments, across the last decade. Finally, a secure Cloud Development Environment platform (as shown in the rightmost figure below) illustrates the closest incarnation of the secure development laptop. Secure CDEs benefit from the experiences of pioneering companies like Citrix, seizing the chance to separate development environments from traditional hardware. This separation allows for a blend of infrastructure efficiency and security without compromising developers' experience.

By Laurent Balmelli, PhD

Harnessing the Power of Observability in Kubernetes With OpenTelemetry

In today's dynamic and complex cloud environments, observability has become a cornerstone for maintaining the reliability, performance, and security of applications. Kubernetes, the de facto standard for container orchestration, hosts a plethora of applications, making the need for an efficient and scalable observability framework paramount. This article delves into how OpenTelemetry, an open-source observability framework, can be seamlessly integrated into a Kubernetes (K8s) cluster managed by KIND (Kubernetes IN Docker), and how tools like Loki, Tempo, and the kube-prometheus-stack can enhance your observability strategy. We'll explore this setup through the lens of a practical example, utilizing custom values from a specific GitHub repository. The Observability Landscape in Kubernetes Before diving into the integration, let's understand the components at play: KIND offers a straightforward way to run K8s clusters within Docker containers, ideal for development and testing. Loki is a horizontally scalable, highly available, multi-tenant log aggregation system inspired by Prometheus. Tempo is a high-volume, minimal-dependency trace aggregator, providing a robust way to store and query distributed traces. kube-prometheus-stack bundles Prometheus together with Grafana and other tools to provide a comprehensive monitoring solution out-of-the-box. OpenTelemetry Operator simplifies the deployment and management of OpenTelemetry collectors in K8s environments. Promtail is responsible for gathering logs and sending them to Loki. Integrating these components within a K8s cluster orchestrated by KIND not only streamlines the observability but also leverages the strengths of each tool, creating a cohesive and powerful monitoring solution. Setting up Your Kubernetes Cluster With KIND Firstly, ensure you have KIND installed on your machine. If not, you can easily install it using the following command: Shell curl -Lo ./kind https://kind.sigs.k8s.io/dl/v0.11.1/kind-$(uname)-amd64 chmod +x ./kind mv ./kind /usr/local/bin/kind Once KIND is installed, you can create a cluster by running: Shell kind create cluster --config kind-config.yaml kubectl create ns observability kubectl config set-context --current --namespace observability kind-config.yaml should be tailored to your specific requirements. It's important to ensure your cluster has the necessary resources (CPU, memory) to support the observability tools you plan to deploy. Deploying Observability Tools With HELM HELM, the package manager for Kubernetes, simplifies the deployment of applications. Here's how you can install Loki, Tempo, and the kube-prometheus-stack using HELM: Add the necessary HELM repositories: helm repo add grafana https://grafana.github.io/helm-charts helm repo add prometheus-community https://prometheus-community.github.io/helm-charts helm repo update Install Loki, Tempo, and kube-prometheus-stack: For each tool, we'll use a custom values file available in the provided GitHub repository. This ensures a tailored setup aligned with specific monitoring and tracing needs. Loki: helm upgrade --install loki grafana/loki --values https://raw.githubusercontent.com/brainupgrade-in/kubernetes/main/observability/opentelemetry/01-loki-values.yaml Tempo: helm install tempo grafana/tempo --values https://raw.githubusercontent.com/brainupgrade-in/kubernetes/main/observability/opentelemetry/02-tempo-values.yaml kube-prometheus-stack: helm install kube-prometheus-stack prometheus-community/kube-prometheus-stack --values https://raw.githubusercontent.com/brainupgrade-in/kubernetes/main/observability/opentelemetry/03-grafana-helm-values.yaml Install OpenTelemetry Operator and Promtail: The OpenTelemetry Operator and Promtail can also be installed via HELM, further streamlining the setup process. OpenTelemetry Operator: helm install opentelemetry-operator open-telemetry/opentelemetry-operator Promtail: helm install promtail grafana/promtail --set "loki.serviceName=loki.observability.svc.cluster.local" Configuring OpenTelemetry for Optimal Observability Once the OpenTelemetry Operator is installed, you'll need to configure it to collect metrics, logs, and traces from your applications. OpenTelemetry provides a unified way to send observability data to various backends like Loki for logs, Prometheus for metrics, and Tempo for traces. A sample OpenTelemetry Collector configuration might look like this: YAML apiVersion: opentelemetry.io/v1alpha1 kind: OpenTelemetryCollector metadata: name: otel namespace: observability spec: config: | receivers: filelog: include: ["/var/log/containers/*.log"] otlp: protocols: grpc: endpoint: 0.0.0.0:4317 http: endpoint: 0.0.0.0:4318 processors: memory_limiter: check_interval: 1s limit_percentage: 75 spike_limit_percentage: 15 batch: send_batch_size: 1000 timeout: 10s exporters: # NOTE: Prior to v0.86.0 use `logging` instead of `debug`. debug: prometheusremotewrite: endpoint: "http://prometheus-kube-prometheus-prometheus.observability:9090/api/v1/write" loki: endpoint: "http://loki.observability:3100/loki/api/v1/push" otlp: endpoint: http://tempo.observability.svc.cluster.local:4317 retry_on_failure: enabled: true tls: insecure: true service: pipelines: traces: receivers: [otlp] processors: [memory_limiter, batch] exporters: [debug,otlp] metrics: receivers: [otlp] processors: [memory_limiter, batch] exporters: [debug,prometheusremotewrite] logs: receivers: [otlp] processors: [memory_limiter, batch] exporters: [debug,loki] mode: daemonset This configuration sets up the collector to receive data via the OTLP protocol, process it in batches, and export it to the appropriate backends. To enable auto-instrumentation for java apps, you can define the following. YAML apiVersion: opentelemetry.io/v1alpha1 kind: Instrumentation metadata: name: java-instrumentation namespace: observability spec: exporter: endpoint: http://otel-collector.observability:4317 propagators: - tracecontext - baggage sampler: type: always_on argument: "1" java: env: - name: OTEL_EXPORTER_OTLP_ENDPOINT value: http://otel-collector.observability:4317 Leveraging Observability Data for Insights With the observability tools in place, you can now leverage the collected data to gain actionable insights into your application's performance, reliability, and security. Grafana can be used to visualize metrics and logs, while Tempo allows you to trace distributed transactions across microservices. Visualizing Data With Grafana Grafana offers a powerful platform for creating dashboards that visualize the metrics and logs collected by Prometheus and Loki, respectively. You can create custom dashboards or import existing ones tailored to Kubernetes monitoring. Tracing With Tempo Tempo, integrated with OpenTelemetry, provides a detailed view of traces across microservices, helping you pinpoint the root cause of issues and optimize performance. Illustrating Observability With a Weather Application Example To bring the concepts of observability to life, let's walk through a practical example using a simple weather application deployed in our Kubernetes cluster. This application, structured around microservices, showcases how OpenTelemetry can be utilized to gather crucial metrics, logs, and traces. The configuration for this demonstration is based on a sample Kubernetes deployment found here. Deploying the Weather Application Our weather application is a microservice that fetches weather data. It's a perfect candidate to illustrate how OpenTelemetry captures and forwards telemetry data to our observability stack. Here's a partial snippet of the deployment configuration. Full YAML is found here. YAML apiVersion: apps/v1 kind: Deployment metadata: labels: app: weather tier: front name: weather-front spec: replicas: 1 selector: matchLabels: app: weather tier: front template: metadata: labels: app: weather tier: front app.kubernetes.io/name: weather-front annotations: prometheus.io/scrape: "true" prometheus.io/port: "8888" prometheus.io/path: /actuator/prometheus instrumentation.opentelemetry.io/inject-java: "true" # sidecar.opentelemetry.io/inject: 'true' instrumentation.opentelemetry.io/container-names: "weather-front" spec: containers: - image: brainupgrade/weather:metrics imagePullPolicy: Always name: weather-front resources: limits: cpu: 1000m memory: 2048Mi requests: cpu: 100m memory: 1500Mi env: - name: APP_NAME valueFrom: fieldRef: apiVersion: v1 fieldPath: metadata.labels['app.kubernetes.io/name'] - name: NAMESPACE valueFrom: fieldRef: apiVersion: v1 fieldPath: metadata.namespace - name: OTEL_SERVICE_NAME value: $(NAMESPACE)-$(APP_NAME) - name: spring.application.name value: $(NAMESPACE)-$(APP_NAME) - name: spring.datasource.url valueFrom: configMapKeyRef: name: app-config key: spring.datasource.url - name: spring.datasource.username valueFrom: secretKeyRef: name: app-secret key: spring.datasource.username - name: spring.datasource.password valueFrom: secretKeyRef: name: app-secret key: spring.datasource.password - name: weatherServiceURL valueFrom: configMapKeyRef: name: app-config key: weatherServiceURL - name: management.endpoints.web.exposure.include value: "*" - name: management.server.port value: "8888" - name: management.metrics.web.server.request.autotime.enabled value: "true" - name: management.metrics.tags.application value: $(NAMESPACE)-$(APP_NAME) - name: otel.instrumentation.log4j.capture-logs value: "true" - name: otel.logs.exporter value: "otlp" ports: - containerPort: 8080 This deployment configures the weather service with OpenTelemetry's OTLP (OpenTelemetry Protocol) exporter, directing telemetry data to our OpenTelemetry Collector. It also labels the service for clear identification within our telemetry data. Visualizing Observability Data Once deployed, the weather service starts sending metrics, logs, and traces to our observability tools. Here's how you can leverage this data. Trace the request across services using Tempo datasource Metrics: Prometheus, part of the kube-prometheus-stack, collects metrics on the number of requests, response times, and error rates. These metrics can be visualized in Grafana to monitor the health and performance of the weather service. For example, grafana dashboard (ID 17175) can be used to view Observability for Spring boot apps Logs: Logs generated by the weather service are collected by Promtail and stored in Loki. Grafana can query these logs, allowing you to search and visualize operational data. This is invaluable for debugging issues, such as understanding the cause of an unexpected spike in error rates. Traces: Traces captured by OpenTelemetry and stored in Tempo provide insight into the request flow through the weather service. This is crucial for identifying bottlenecks or failures in the service's operations. Gaining Insights With the weather application up and running, and observability data flowing, we can start to gain actionable insights: Performance optimization: By analyzing response times and error rates, we can identify slow endpoints or errors in the weather service, directing our optimization efforts more effectively. Troubleshooting: Logs and traces help us troubleshoot issues by providing context around errors or unexpected behavior, reducing the time to resolution. Scalability decisions: Metrics on request volumes and resource utilization guide decisions on when to scale the service to handle load more efficiently. This weather service example underscores the power of OpenTelemetry in a Kubernetes environment, offering a window into the operational aspects of applications. By integrating observability into the development and deployment pipeline, teams can ensure their applications are performant, reliable, and scalable. This practical example of a weather application illustrates the tangible benefits of implementing a comprehensive observability strategy with OpenTelemetry. It showcases how seamlessly metrics, logs, and traces can be collected, analyzed, and visualized, providing developers and operators with the insights needed to maintain and improve complex cloud-native applications. Conclusion Integrating OpenTelemetry with Kubernetes using tools like Loki, Tempo, and the kube-prometheus-stack offers a robust solution for observability. This setup not only simplifies the deployment and management of these tools but also provides a comprehensive view of your application's health, performance, and security. With the actionable insights gained from this observability stack, teams can proactively address issues, improve system reliability, and enhance the user experience. Remember, the key to successful observability lies in the strategic implementation and continuous refinement of your monitoring setup. Happy observability!

By Rajesh Gheware

Effective Log Data Analysis With Amazon CloudWatch: Harnessing Machine Learning

In today's cloud computing world, all types of logging data are extremely valuable. Logs can include a wide variety of data, including system events, transaction data, user activities, web browser logs, errors, and performance metrics. Managing logs efficiently is extremely important for organizations, but dealing with large volumes of data makes it challenging to detect anomalies and unusual patterns or predict potential issues before they become critical. Efficient log management strategies, such as implementing structured logging, using log aggregation tools, and applying machine learning for log analysis, are crucial for handling this data effectively. One of the latest advancements in effectively analyzing a large amount of logging data is Machine Learning (ML) powered analytics provided by Amazon CloudWatch. It is a brand new capability of CloudWatch. This innovative service is transforming the way organizations handle their log data. It offers a faster, more insightful, and automated log data analysis. This article specifically explores utilizing the machine learning-powered analytics of CloudWatch to overcome the challenges of effectively identifying hidden issues within the log data. Before deep diving into some of these features, let's have a quick refresher about Amazon CloudWatch. What Is Amazon CloudWatch? It is an AWS-native monitoring and observability service that offers a whole suite of capabilities: Monitoring: Tracks performance and operational health. Data collection: Gathers logs, metrics, and events, providing a comprehensive view of AWS resources. Unified operational view: Provides insights into applications running on AWS and on-premises servers. Challenges With Logs Data Analysis Volume of Data There's too much log data. In this modern era, applications emit a tremendous amount of log events. Log data can grow so rapidly that developers often find it difficult to identify issues within it; it is like finding a needle in a haystack. Change Identification Another common problem we have often seen is the fundamental problem of log analysis that goes back as long as logs have been around, identifying what has changed in your logs. Proactive Detection Proactive detection is another common challenge. It's great if you can utilize logs to dive in when an application's having an issue, find the root cause of that application issue, and fix it. But how do you know when those issues are occurring? How do you proactively detect them? Of course, you can implement metrics, alarms, etc., for the issues you know about. But there's always the problem of unknowns. So, we're often instrumenting observability and monitoring for past issues. Now, let's dive deep into the machine learning capabilities from CloudWatch that will help you overcome the challenges we have just discussed. Machine Learning Capabilities From CloudWatch Pattern Analysis Imagine you are troubleshooting a real-time distributed application accessed by millions of customers globally and generating a significant amount of application logs. Analyzing tens of thousands of log events manually is challenging, and it can take forever to find the root cause. That is where the new AWS CloudWatch machine learning-based capability can quickly help by grouping log events into patterns within the Logs Insight page of CloudWatch. It is much easier to identify through a limited number of patterns and quickly filter the ones that might be interesting or relevant based on the issue you are trying to troubleshoot. It also allows you to expand the specific pattern to look for the relevant events along with related patterns that might be pertinent. In simple words, Pattern Analysis is the automated grouping and categorization of your log events. Comparison Analysis How can we elevate pattern analysis to the next level? Now that we've seen how pattern analysis works let's see how we can extend this feature to perform comparison analysis. "Comparison Analysis" aims to solve the second challenge of identifying the log changes. Comparison analysis lets you effectively profile your logs using patterns from one time period and then compare them to the patterns extracted for another period and analyze the differences. This will help us answer this fundamental question of what changed to my logs. You can quickly compare your logs while your application's having an issue to a known healthy period. Any changes between two time periods are a strong indicator of the possible root cause of your problem. CloudWatch Logs Anomaly Detection Anomaly detection, in simple terms, is the process of identifying unusual patterns or behaviors in the logs that do not conform to expected norms. To use this feature, we need to first select the LogGroup for the application and enable CloudWatch Logs anomaly detection for it. At that point, CloudWatch will train a machine-learning model on the expected patterns and the volume of each pattern associated with your application. CloudWatch will take five minutes to train the model using logs from your application, and the feature will become active and automatically start servicing these anomalies any time they occur. So things like a brand new error message occurring that wasn't there before, a sudden spike in the volume, or if there's a spike in HTTP 400s are some examples that will result in an anomaly being generated for that. Generate Logs Insight Queries Using Generative AI With this capability, you can give natural language commands to filter log events, and CloudWatch can generate queries using Generative AI. If you are unfamiliar with CloudWatch query language or are from a non-technical background, you can easily use this feature to generate queries and filter logs. It's an iterative process; you need to learn precisely what you want from the first query. So you can update and iterate the query based on the results you see. Let's look at a couple of examples: Natural Language Prompt: "Check API Response Times" Auto-generated query by CloudWatch: In this query: fields @timestamp, @message selects the timestamp and message fields from your logs. | parse @message "Response Time: *" as responseTime parses the @message field to extract the value following the text "Response Time: " and labels it as responseTime. | stats avg(responseTime) calculates the average of the extracted responseTime values. Natural Language Prompt: "Please provide the duration of the ten invocations with the highest latency." Auto-generated query by CloudWatch In this query: fields @timestamp, @message, latency selects the @timestamp, @message, and latency fields from the logs. | stats max(latency) as maxLatency by @message computes the maximum latency value for each unique message. | sort maxLatency desc sorts the results in descending order based on the maximum latency, showing the highest values at the top. | limit 10 restricts the output to the top 10 results with the highest latency values. We can execute these queries in the CloudWatch “Logs Insights” query box to filter the log events from the application logs. These queries extract specific information from the logs, such as identifying errors, monitoring performance metrics, or tracking user activities. The query syntax might vary based on the particular log format and the information you seek. Conclusion CloudWatch's machine learning features offer a robust solution for managing the complexities of log data. These tools make log analysis more efficient and insightful, from automating pattern analysis to enabling anomaly detection. The addition of generative AI for query generation further democratizes access to these powerful insights.

By Rajat Gupta

Building and Deploying a Chatbot With Google Cloud Run and Dialogflow

In this tutorial, we will learn how to build and deploy a conversational chatbot using Google Cloud Run and Dialogflow. This chatbot will provide responses to user queries on a specific topic, such as weather information, customer support, or any other domain you choose. We will cover the steps from creating the Dialogflow agent to deploying the webhook service on Google Cloud Run. Prerequisites A Google Cloud Platform (GCP) account Basic knowledge of Python programming Familiarity with Google Cloud Console Step 1: Set Up Dialogflow Agent Create a Dialogflow Agent: Log into the Dialogflow Console (Google Dialogflow). Click on "Create Agent" and fill in the agent details. Select the Google Cloud Project you want to associate with this agent. Define Intents: Intents classify the user's intentions. For each intent, specify examples of user phrases and the responses you want Dialogflow to provide. For example, for a weather chatbot, you might create an intent named "WeatherInquiry" with user phrases like "What's the weather like in Dallas?" and set up appropriate responses. Step 2: Develop the Webhook Service The webhook service processes requests from Dialogflow and returns dynamic responses. We'll use Flask, a lightweight WSGI web application framework in Python, to create this service. Set Up Your Development Environment: Ensure you have Python and pip installed. Create a new directory for your project and set up a virtual environment: Shell python -m venv env source env/bin/activate # `env\Scripts\activate` for windows Install Dependencies: Install Flask and the Dialogflow library: Shell pip install Flask google-cloud-dialogflow Create the Flask App: In your project directory, create a file named app.py. This file will contain the Flask application: Python from flask import Flask, request, jsonify app = Flask(__name__) @app.route('/webhook', methods=['POST']) def webhook(): req = request.get_json(silent=True, force=True) # Process the request here. try: query_result = req.get('queryResult') intent_name = query_result.get('intent').get('displayName') response_text = f"Received intent: {intent_name}" return jsonify({'fulfillmentText': response_text}) except AttributeError: return jsonify({'fulfillmentText': "Error processing the request"}) if __name__ == '__main__': app.run(debug=True) Step 3: Deploy To Google Cloud Run Google Cloud Run is a managed platform that enables you to run containers statelessly over a fully managed environment or in your own Google Kubernetes Engine cluster. Containerize the Flask App: Create a Dockerfile in your project directory: Dockerfile FROM python:3.8-slim WORKDIR /app COPY requirements.txt requirements.txt RUN pip install -r requirements.txt COPY . . CMD ["flask", "run", "--host=0.0.0.0", "--port=8080"] Don't forget to create a requirements.txt file listing your Python dependencies: Flask==1.1.2 google-cloud-dialogflow==2.4.0 Build and Push the Container: Use Cloud Build to build your container image and push it to the container registry. Shell gcloud builds submit --tag gcr.io/YOUR_CHATBOT_PRJ_ID/chatbot-webhook . Deploy to Cloud Run: Deploy your container image to Cloud Run. Shell gcloud run deploy --image gcr.io/YOUR_PROJECT_ID/chatbot-webhook --platform managed Follow the prompts to enable the required APIs, choose a region, and allow unauthenticated invocations. Step 4: Integrate With Dialogflow In the Dialogflow Console, navigate to the Fulfillment section. Enable Webhook, paste the URL of your Cloud Run service (you get this URL after deploying to Cloud Run), and click "Save." Testing and Iteration Test your chatbot in the Dialogflow Console's simulator. You can refine your intents, entities, and webhook logic based on the responses you receive. Conclusion You have successfully built and deployed a conversational chatbot using Google Cloud Run and Dialogflow. This setup allows you to create scalable, serverless chatbots that can handle dynamic responses to user queries. This foundation allows for further customization and expansion, enabling the development of more complex and responsive chatbots to meet a variety of needs. Continue to refine your chatbot by adjusting intents, entities, and the webhook logic to improve interaction quality and user experience.

By Ashok Gorantla

CORE

The Enterprise Journey to Cloud Adoption

"Migrate" comes from the Latin "migratio," meaning to move from one place to another. In information technology, migration entails understanding new systems' benefits, identifying current system shortfalls, planning, and transferring selected applications. Not all IT assets must be moved; migration can mean moving a part of them. This article will delve into the details of transferring IT assets to public clouds like AWS, Azure, or GCP. Many factors can influence the decision to switch to the cloud, such as expiring data center leases, the high costs of data center management, outdated hardware, software license renewals, geographical compliance needs, market growth, and the need to adjust resources to match demand quickly. Executive backing is crucial for a company to begin its cloud migration journey. This support is the cornerstone for any large-scale migration success. Leadership must unify their teams for the journey, as collaboration is essential. Attempts by isolated teams can lead to problems. Regular leadership meetings, whether weekly or bi-weekly, can overcome hurdles and keep the migration process on track. A pivotal step in achieving successful cloud migration is the formation of a Cloud Steering Committee. This team unites technical experts to forge the initial cloud adoption patterns. The optimal team structure includes infrastructure, security, application, and operations engineers alongside a lead architect, all steered by a Cloud Steering Committee leader. Together, they establish objectives for security, availability, reliability, scalability, data cleansing, and compliance. Embarking on the cloud migration journey can seem daunting when confronted with the vast array of applications and servers awaiting transition. There's no universal solution for migration; each enterprise faces unique challenges and opportunities. However, many organizations have successfully navigated their way to the cloud by exploring established frameworks and diverse migration strategies. Framework Evaluate and Discover Enterprises must establish a strong business case by aligning their objectives with an understanding of their current applications' age, architecture, and limitations. Leadership engagement, clear communication, and a defined purpose are essential to unify the organization and set feasible goals and timelines for the migration. In addition, a comprehensive portfolio analysis is critical, including discovering the applications, mapping interdependencies, and formulating migration strategies and priorities. This stage determines the complexity and business impact of applications, guiding the migration sequence. Starting with non-critical, simpler applications helps the team to understand the new platform and understand the gaps. Design and Migrate After classifying the applications identified in the discovery phase—whether they are web, mobile, or database systems—a standard blueprint must be created for each type. Each application requires careful design, migration, and validation following one of six migration strategies, which will be discussed later in this paper. An ethos of continuous improvement is advised, involving regular assessments of the blueprint to identify and rectify gaps. Modernize Migration isn't just about moving applications; it's about optimizing them. This means decommissioning outdated systems and steadily refining the operational model. Consider this as an evolving synergy among people, processes, and technology that improves incrementally throughout the migration journey. Migration Strategies Retain Some applications or segments of IT assets remain on-premises because they aren't suited for the cloud, don't deliver business value, or are not prepared for cloud migration. This could be due to dependencies on on-premises systems or data sovereignty issues. Retire Applications that no longer yield value to the business can be phased out. By retiring these systems, resources can be reallocated to more impactful business areas. Rehost Commonly known as "lift and shift," this strategy is popular among enterprises for its ability to facilitate a swift migration to the cloud. It requires minimal alterations to the code, database, or architecture, allowing for a more straightforward transition. Replatform Often termed "lift-tinker-and-shift," this process involves minor optimizations to applications for cloud efficiency, such as software updates, configuration tweaks, or the use of cloud-native services like Kubernetes-as-a-Service or Database-as-a-Service. An example includes transitioning from traditional database services to a cloud-based option like Amazon RDS. Repurchase This strategy comes into play when an application isn't suitable for cloud deployment or existing licensing agreements don't support a Bring-Your-Own-License (BYOL) model in the cloud. It involves switching to a Software-as-a-Service (SaaS) platform or working with Independent Software Vendors (ISVs). For example, replacing an on-premises customer relationship management (CRM) system with Salesforce. Refactor/Re-Architect This more intensive approach requires redesigning and rebuilding the application from scratch to fully exploit cloud-native features, significantly enhancing agility, scalability, and performance. Though it's the most costly and time-intensive strategy, it positions businesses to seamlessly integrate future cloud innovations with minimal effort. A typical scenario is transforming an application from a monolithic architecture to microservices. Conclusion While there is no one-size-fits-all solution for cloud migration, enterprises can benefit from analyzing successful migration journeys. Organizations can optimize their approach by emulating effective strategies and adapting them to their unique requirements. Taking the time to understand the new platform, learning from past missteps thoroughly, and refining processes are key steps toward meaningful outcomes. Moreover, it's strategic to prioritize migration projects based on business needs, considering factors such as complexity, revenue impact, operational criticality, and the timing of hardware upgrades. Additionally, investing in training for staff to master new tools and technologies is essential for a smooth transition to the cloud.

By Harshavardhan Nerella

A Brief History of DevOps and the Link to Cloud Development Environments

The history of DevOps is definitely worth reading in a few good books about it. On that topic, “The Phoenix Project,” self-characterized as “a novel of IT and DevOps,” is often mentioned as a must-read. Yet for practitioners like myself, a more hands-on one is “The DevOps Handbook” (which shares Kim as an author in addition to Debois, Willis, and Humble) that recounts some of the watershed moments around the evolution of software engineering and provides good references around implementation. This book actually describes how to replicate the transformation explained in the Phoenix Project and provides case studies. In this brief article, I will use my notes on this great book to regurgitate a concise history of DevOps, add my personal experience and opinion, and establish a link to Cloud Development Environments (CDEs), i.e., the practice of providing access to and running, development environments online as a service for developers. In particular, I explain how the use of CDEs concludes the effort of bringing DevOps “fully online.” Explaining the benefits of this shift in development practices, plus a few personal notes, is my main contribution in this brief article. Before clarifying the link between DevOps and CDEs, let’s first dig into the chain of events and technical contributions that led to today’s main methodology for delivering software. The Agile Manifesto The creation of the Agile Manifesto in 2001 sets forth values and principles as a response to more cumbersome software development methodologies like Waterfall and the Rational Unified Process (RUP). One of the manifesto's core principles emphasizes the importance of delivering working software frequently, ranging from a few weeks to a couple of months, with a preference for shorter timescales. The Agile movement's influence expanded in 2008 during the Agile Conference in Toronto, where Andrew Shafer suggested applying Agile principles to IT infrastructure rather than just to the application code. This idea was further propelled by a 2009 presentation at the Velocity Conference, where a paper from Flickr demonstrated the impressive feat of "10 deployments a day" using Dev and Ops collaboration. Inspired by these developments, Patrick Debois organized the first DevOps Days in Belgium, effectively coining the term "DevOps." This marked a significant milestone in the evolution of software development and operational practices, blending Agile's swift adaptability with a more inclusive approach to the entire IT infrastructure. The Three Ways of DevOps and the Principles of Flow All the concepts that I discussed so far are today incarnated into the “Three Ways of DevOps,” i.e., the foundational principles that guide the practices and processes in DevOps. In brief, these principles focus on: Improving the flow of work (First Way), i.e., the elimination of bottlenecks, reduction of batch sizes, and acceleration of workflow from development to production, Amplifying feedback loops (Second Way), i.e., quickly and accurately collect information about any issues or inefficiencies in the system and Fostering a culture of continuous learning and experimentation (Third Way), i.e., encouraging a culture of continuous learning and experimentation. Following the leads from Lean Manufacturing and Agile, it is easy to understand what led to the definition of the above three principles. I delve more deeply into each of these principles in this conference presentation. For the current discussion, though, i.e., how DevOps history leads to Cloud Development Environments, we just need to look at the First Way, the principle of flow, to understand the causative link. Chapter 9 of the DevOps Handbook explains that the technologies of version control and containerization are central to implementing DevOps flows and establishing a reliable and consistent development process. At the center of enabling the flow is the practice of incorporating all production artifacts into version control to serve as a single source of truth. This enables the recreation of the entire production environment in a repeatable and documented fashion. It ensures that production-like code development environments can be automatically generated and entirely self-serviced without requiring manual intervention from Operations. The significance of this approach becomes evident at release time, which is often the first time where an application's behavior is observed in a production-like setting, complete with realistic load and production data sets. To reduce the likelihood of issues, developers are encouraged to operate production-like environments on their workstations, created on-demand and self-serviced through mechanisms such as virtual images or containers, utilizing tools like Vagrant or Docker. Putting these environments under version control allows for the entire pre-production and build processes to be recreated. Note that production-like environments really refer to environments that, in addition to having the same infrastructure and application configuration as the real production environments, also contain additional applications and layers necessary for development. Developers are encouraged to operate production-like environments (Docker icon) on their workstations using mechanisms such as virtual images or containers to reduce the likelihood of execution issues in production. From Developer Workstations to a CDE Platform The notion of self-service is already emphasized in the DevOps Handbook as a key enabler to the principle of flow. Using 2016 technology, this is realized by downloading environments to the developers’ workstations from a registry (such as DockerHub) that provides pre-configured, production-like environments as files (dubbed infrastructure-as-code). Docker is often a tool to implement this function. Starting from this operation, developers create an application in effect as follows: They access and copy files with development environment information to their machines, Add source code to it in the local storage, and Build the application locally using their workstation computing resources. This is illustrated in the left part of the figure below. Once the application works correctly, the source code is sent (“pushed) to a central code repository, and the application is built and deployed online, i.e., using Cloud-based resources and applications such as CI/CD pipelines. The three development steps listed above are, in effect, the only operations in addition to the authoring of source code using an IDE that is “local,” i.e., they use workstations’ physical storage and computing resources. All the rest of the DevOps operations are performed using web-based applications and used as-a-service by developers and operators (even when these applications are self-hosted by the organization.). The basic goal of Cloud Development Environments is to move these development steps online as well. To do that, CDE platforms, in essence, provide the following basic services, illustrated in the right part of the figure below: Manage development environments online as containers or virtual machines such that developers can access them fully built and configured, substituting step (1) above; then Provide a mechanism for authoring source code online, i.e., inside the development environment using an IDE or a terminal, substituting step (2); and finally Provide a way to execute build commands inside the development environment (via the IDE or terminal), substituting step (3). Figure: (left) The classic development data flow requires the use of the local workstation resources. (right) The cloud development data flow replaced local storage and computing while keeping a similar developer experience. On each side, operations are (1) accessing environment information, (2) adding code, and (3) building the application. Note that the replacement of step (2) can be done in several ways. For example, for example, the IDE can be browser-based (aka a Cloud IDE), or a locally installed IDE can implement a way to remotely author the code in the remote environment. It is also possible to use a console text editor via a terminal such as vim. I cannot conclude this discussion without mentioning that, often multiple containerized environments are used for testing on the workstation, in particular in combination with the main containerized development environment. Hence, cloud IDE platforms need to reproduce the capability to run containerized environments inside the Cloud Development Environment (itself a containerized environment). If this recursive process becomes a bit complicated to grasp, don’t worry; we have reached the end of the discussion and can move to the conclusion. What Comes Out of Using Cloud Development Environments in DevOps A good way to conclude this discussion is to summarize the benefits of moving development environments from the developers’ workstations online using CDEs. As a result, the use of CDEs for DevOps leads to the following advantages: Streamlined Workflow: CDEs enhance the workflow by removing data from the developer's workstation and decoupling the hardware from the development process. This ensures the development environment is consistent and not limited by local hardware constraints. Environment Definition: With CDEs, version control becomes more robust as it can uniformize not only the environment definition but all the tools attached to the workflow, leading to a standardized development process and consistency across teams across the organization. Centralized Environments: The self-service aspect is improved by centralizing the production, maintenance, and evolution of environments based on distributed development activities. This allows developers to quickly access and manage their environments without the need for Operations manual work. Asset Utilization: Migrating the consumption of computing resources from local hardware to centralized and shared cloud resources not only lightens the load on local machines but also leads to more efficient use of organizational resources and potential cost savings. Improved Collaboration: Ubiquitous access to development environments, secured by embedded security measures in the access mechanisms, allows organizations to cater to a diverse group of developers, including internal, external, and temporary workers, fostering collaboration across various teams and geographies. Scalability and Flexibility: CDEs offer scalable cloud resources that can be adjusted to project demands, facilitating the management of multiple containerized environments for testing and development, thus supporting the distributed nature of modern software development teams. Enhanced Security and Observability: Centralizing development environments in the Cloud not only improves security (more about secure CDEs) but also provides immediate observability due to their online nature, allowing for real-time monitoring and management of development activities. By integrating these aspects, CDEs become a solution for modern, in particular cloud-native software development, and align with the principles of DevOps to improve flow, but also feedback, and continuous learning. In an upcoming article, I will discuss the contributions of CDEs across all three ways of DevOps. In the meantime, you're welcome to share your feedback with me.

By Laurent Balmelli, PhD

Cloud Architecture

DZone's Featured Cloud Architecture Resources

Top Cloud Architecture Experts

The Latest Cloud Architecture Topics