Power of Amazon S3: Transforming Storage into a Data Lakehouse and Avoiding Costly Pitfalls


I’m thrilled to kick off this blog with some groundbreaking developments in the world of data management. Today, we’re exploring how Amazon S3 is evolving beyond mere storage to become your new database, thanks to the introduction of S3 Iceberg Tables and S3 Metadata. Buckle up as we dive deep into these transformative features.

Transforming S3 into a First-Class Database

Amazon S3 has long been the backbone of data storage for countless organizations, offering durability, scalability, and flexibility. However, with the introduction of S3 Tables and S3 Metadata, AWS is taking a bold step toward redefining S3 as a comprehensive database solution. These new features leverage the Apache Iceberg open table format to provide powerful data management capabilities directly within S3, bridging the gap between data lakes and traditional databases.

What Are S3 Tables?

S3 Tables represent a new bucket type within Amazon S3, seamlessly integrating the storage layer with the Iceberg table format. Here’s how it works:

  • Data Ingestion: When you write data to an S3 Tables bucket via the S3 API, the data is automatically converted into Parquet files and organized following the Iceberg table structure.
  • Enhanced Functionality: S3 Tables offer advanced table management features, including automatic maintenance and optimization. These tables are designed to be compatible with a wide range of query engines, adhering to the open standards set by Apache Iceberg.
  • Cost Considerations: While S3 Tables provide significant functionality enhancements, they come at a higher cost: storage is approximately 15% more expensive than standard S3 storage (with additional monitoring and compaction charges on top).

Understanding S3 Metadata

S3 Metadata is another innovative feature that transforms how metadata is handled within S3:

  • Automatic Metadata Management: When enabled, S3 Metadata automatically captures and maintains metadata for all objects within any S3 bucket. This metadata is stored in an Iceberg table, ensuring it is kept up-to-date in near real-time.
  • Ease of Use: With metadata stored in a structured Iceberg table, you can leverage modern query engines to analyze, visualize, and process your data lake’s metadata effortlessly.

The Evolution of S3 Bucket Types

Before the introduction of S3 Tables, Amazon S3 offered two primary bucket types:

  1. S3 General Purpose Buckets: These are the standard, highly replicated S3 buckets that most users are familiar with.
  2. S3 Directory Buckets: Introduced alongside the S3 Express One Zone storage class in 2023, these are single-zone, non-replicated buckets with a hierarchical, directory-like structure.

With S3 Tables, AWS continues its trend of offering specialized bucket types tailored to specific use cases, further enhancing S3’s versatility.

Delving into S3 Tables: A Managed Iceberg Catalog

S3 Tables function similarly to an Iceberg Catalog, providing a centralized source of truth for table metadata. Here’s a breakdown of their key features:

  • Single Source of Truth: All metadata is managed centrally, ensuring consistency and reliability.
  • Automated Table Maintenance: S3 Tables handle various maintenance tasks automatically, including:
    • Compaction: Merging small files into larger ones to optimize storage and query performance.
    • Snapshot Management: Managing table snapshots by expiring and deleting outdated ones.
    • Unreferenced File Removal: Cleaning up stale or orphaned objects to maintain table integrity.
  • Security and Access Control: Leveraging AWS’s existing IAM policies, S3 Tables provide robust, table-level role-based access control (RBAC), ensuring secure data management.
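The compaction behavior described above can be illustrated with a toy model. This is a conceptual sketch only, not the actual S3 Tables algorithm (which operates on Parquet files inside the table bucket): small files are greedily merged until a target size is reached, dramatically reducing the number of objects a query engine must list and open.

```python
def compact(file_sizes_mib, target_mib=512):
    """Greedily merge small files into groups of roughly target_mib.

    Returns the sizes of the merged files. A conceptual model of
    compaction, not the real S3 Tables implementation.
    """
    merged = []
    current = 0
    for size in sorted(file_sizes_mib):
        if current + size > target_mib and current > 0:
            merged.append(current)
            current = 0
        current += size
    if current > 0:
        merged.append(current)
    return merged

# 1,000 files of 10 MiB each collapse into ~20 files of ~500 MiB,
# meaning far fewer objects to list, open, and monitor per query.
small_files = [10] * 1000
large_files = compact(small_files)
print(len(small_files), "->", len(large_files))  # 1000 -> 20
```

Fewer, larger files is precisely why compaction improves both query performance and the per-object monitoring bill discussed in the pricing section.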

Performance Advantages

AWS touts substantial performance improvements with S3 Tables:

  • 3x Faster Query Performance: Optimized storage and metadata management lead to significantly faster query responses.
  • 10x More Transactions Per Second (TPS): Enhanced throughput capabilities support high-volume transactional workloads.

These performance gains are particularly notable when compared to manually implementing Iceberg tables on S3, positioning AWS as a leader in managed data lakehouse solutions.

Compatibility and Integration

S3 Tables are designed for broad compatibility:

  • Open Source Integration: Out-of-the-box support for Apache Spark allows seamless integration with popular open-source data processing frameworks.
  • AWS Services Integration: Proprietary AWS services like Athena, Redshift, and EMR can easily interact with S3 Tables through AWS Glue integration, often with just a few clicks.

For a practical demonstration, check out Roy Hasson’s LinkedIn demo, which showcases how to work with S3 Tables using Spark.

Pricing Breakdown: Understanding the Costs

S3 Tables introduce a nuanced pricing structure, comprising four main components:

  1. Storage Costs:
    • 15% Higher Than Standard S3:
      • S3 Standard Rates: $0.023 / $0.022 / $0.021 per GiB for the first 50TB, next 450TB, and over 500TB each month, respectively.
      • S3 Tables Rates: $0.0265 / $0.0253 / $0.0242 per GiB.
    • Tiered Pricing: Costs vary based on storage volume, encouraging efficient data management as usage scales.
  2. PUT and GET Request Costs:
    • PUT Requests: $0.005 per 1,000 PUT requests.
    • GET Requests: $0.0004 per 1,000 GET requests.
  3. Monitoring Costs:
    • $0.025 per 1,000 Objects per Month: Similar to S3 Intelligent Tiering’s Archive Access monitoring, this cost ensures real-time visibility into object metadata.
  4. Compaction Costs:
    • $0.004 per 1,000 Objects Processed: Charged based on the number of objects involved in the compaction process.
    • $0.05 per GiB Processed: Applies to the volume of data being compacted.

Cost Estimation Example

Let’s break down the costs with an example:

  • Storage: Storing 1 TiB in S3 Standard costs approximately $21.5-$23.5 per month. In S3 Tables, this would be around $25-$27, reflecting the 15% increase.
  • Compaction: For ingesting 1 TB of new data monthly:
    • GiB-Processed Cost: 1,024 GiB * $0.05 = $51.20 (one-time fee).
    • Per-Object Cost: Assuming 100 MiB files, 1 TiB of new data amounts to roughly 10,500 objects, costing $0.042. Even with smaller 10 MiB files (about 105,000 objects), the cost remains minimal at $0.42.
  • Monitoring: Post-compaction, storing 2,048 objects incurs a monitoring cost of approximately $0.0512 per month, compared to $0.2625 pre-compaction.

Overall, while S3 Tables introduce higher costs, the benefits in performance and management often justify the investment, especially for large-scale data operations.
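The arithmetic in the example above can be reproduced directly. The rates come from the pricing breakdown; the 100 MiB / 10 MiB input file sizes and 512 MiB compacted file size are assumptions of the example, not AWS defaults:

```python
TIB_GIB = 1024       # GiB per TiB
MIB_PER_GIB = 1024

def objects_for(total_gib, file_mib):
    """Number of files needed to hold total_gib at a given file size."""
    return total_gib * MIB_PER_GIB // file_mib

# Compaction: $0.05 per GiB processed plus $0.004 per 1,000 objects.
gib_cost = 1 * TIB_GIB * 0.05                           # $51.20
obj_cost_100mib = objects_for(TIB_GIB, 100) / 1000 * 0.004  # ~$0.042
obj_cost_10mib = objects_for(TIB_GIB, 10) / 1000 * 0.004    # ~$0.42

# Monitoring: $0.025 per 1,000 objects per month.
pre = objects_for(TIB_GIB, 100) / 1000 * 0.025          # ~$0.26 before compaction
post = objects_for(TIB_GIB, 512) / 1000 * 0.025         # ~$0.05 after compaction

print(round(gib_cost, 2), round(obj_cost_100mib, 3),
      round(pre, 4), round(post, 4))
```

The per-GiB compaction fee dominates; the per-object charges and monitoring are rounding errors by comparison, and monitoring shrinks further once compaction reduces the object count.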

S3 Metadata: Simplifying Data Management

S3 Metadata enhances the usability of S3 by automating metadata management across any S3 bucket:

  • Activation: Simply enable S3 Metadata on your desired S3 bucket, and AWS handles the rest.
  • Metadata Storage: All metadata is stored in a read-only S3 Metadata Table, an Iceberg table that AWS maintains in near real-time.

Types of Metadata

S3 Metadata categorizes information into:

  1. User-Defined Metadata:
    • Custom Key-Value Pairs: Assign arbitrary metadata such as product SKUs, item IDs, hashes, etc., directly to objects.
  2. System-Defined Metadata:
    • Standard Attributes: Includes essential details like object size, last modified date, encryption algorithm, and more.
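Once metadata lands in a queryable table, filtering it becomes trivial. Here is a toy illustration using plain Python dicts standing in for rows of a metadata table; the field names are hypothetical stand-ins, not the actual S3 Metadata table schema:

```python
# Hypothetical metadata rows: system-defined attributes (key, size,
# last_modified) plus user-defined key-value pairs set at upload time.
rows = [
    {"key": "raw/2024/11/events-001.json", "size": 5_242_880,
     "last_modified": "2024-11-02", "user_metadata": {"team": "analytics"}},
    {"key": "raw/2024/11/events-002.json", "size": 734_003_200,
     "last_modified": "2024-11-03", "user_metadata": {"team": "analytics"}},
    {"key": "models/churn-v3.bin", "size": 104_857_600,
     "last_modified": "2024-10-15", "user_metadata": {"team": "ml"}},
]

# "Which objects over 100 MiB does the analytics team own?" -- the kind
# of question a query engine would answer with SQL over the Iceberg table.
big_analytics = [
    r["key"] for r in rows
    if r["size"] > 100 * 1024 * 1024
    and r["user_metadata"]["team"] == "analytics"
]
print(big_analytics)  # ['raw/2024/11/events-002.json']
```

In practice you would express this as a SQL predicate in Athena, Spark, or any other engine that can read the Iceberg-format metadata table.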

Cost Structure

The pricing for S3 Metadata is straightforward:

  • Metadata Updates: $0.00045 per 1,000 updates, equating to $0.45 per million updates—a cost comparable to standard GET requests.
  • S3 Tables Costs: Since metadata is stored in S3 Tables, you also incur storage and maintenance costs as outlined previously. However, the data volume for metadata is typically small, making these costs negligible.

Preventing Data Swamps with S3 Metadata

One of the persistent challenges in data lake management is avoiding the transformation of a data lake into a data swamp—a repository of unmanaged, disorganized data that’s hard to navigate and utilize.

S3 Metadata addresses this by:

  • Automated Organization: By capturing and maintaining metadata automatically, S3 Metadata ensures that your data remains organized and searchable.
  • Enhanced Discoverability: Leveraging Iceberg tables allows you to use powerful query engines to efficiently explore and analyze your data’s metadata, making data governance and utilization much more manageable.

This seamless metadata management simplifies data classification, categorization, and organization, especially when dealing with massive data volumes like 1,000 Petabytes.

The Competitive Landscape: AWS vs. Lakehouse Providers

AWS’s strategic enhancements to S3 pose significant challenges to other data lakehouse providers:

  • Managed Services Advantage: AWS leverages its robust infrastructure and extensive sales networks to offer fully managed services that are deeply integrated with their ecosystem.
  • Ease of Use: With just a few clicks, users can deploy complex data management solutions without the overhead of setting up and maintaining separate systems.
  • Vendor Lock-In Concerns: Other vendors may struggle to compete with AWS’s seamless integration and comprehensive feature set, potentially impacting their market share and revenue projections.

For instance, Confluent’s efforts to compete with AWS’s MSK highlight the formidable challenge posed by AWS’s polished and heavily invested offerings.

The Open Table Format Wars: Iceberg Takes the Lead

The introduction of S3 Tables and S3 Metadata is part of the broader open table format wars, where different formats vie for dominance in the data management ecosystem. Here’s a closer look:

Apache Iceberg vs. Competitors

Apache Iceberg has emerged as a frontrunner in this battle, favored for its open standards and flexibility. Key milestones include:

  • Late 2022: Google announced Iceberg support in BigLake.
  • August 2023: Snowflake integrated Iceberg support into their unified tables.
  • March 2024: Confluent introduced Iceberg support in Kafka via Tableflow.

The Tabular Acquisition and Its Implications

A pivotal moment in the open table format wars was Databricks’s acquisition of Tabular for an astonishing $1-2 billion despite Tabular generating only $1-5 million annually. This move underscored the intense competition and strategic importance of Iceberg, signaling AWS’s opportunity to capitalize on Iceberg’s strengths without directly engaging in the protocol war.

The Role of Catalogs in Lock-In

At the heart of the competition is the Iceberg Catalog, which manages Iceberg tables and their metadata. Catalogs are critical because:

  • Centralized Metadata Management: They act as a metastore, defining what constitutes a table and managing access controls.
  • Access Control Mechanisms: AWS’s integration of IAM policies into S3 Tables simplifies security management, reducing the need for separate catalog access policies.

This tight integration effectively locks customers into AWS’s ecosystem, as migrating to another provider would require reconfiguring complex access controls and metadata management systems.

The Proliferation of Catalog Implementations

Despite efforts to standardize catalogs via Apache Iceberg’s REST API, the market is saturated with various catalog implementations, including:

  • LakeKeeper
  • Project Nessie
  • Gravitino
  • Starburst Polaris Catalog
  • Dremio Iceberg REST Integration
  • Databricks Unity Catalog
  • Snowflake Polaris

This fragmentation complicates the landscape, making it challenging for organizations to choose and maintain the right catalog solution.

AWS’s Strategic Play: Moving Up the Stack

AWS’s introduction of S3 Tables and S3 Metadata exemplifies a classic “move up the stack” strategy:

  • From Storage to Data Management: What began with the commoditization of storage through S3 is now expanding into data management and database functionalities.
  • Leveraging Scale and Integration: AWS capitalizes on its massive infrastructure, seamless service integration, and extensive sales capabilities to dominate this expanded role.

This strategic evolution threatens to overshadow specialized lakehouse providers, as AWS offers a comprehensive, managed solution that minimizes the need for third-party services.

The Future of Data Management: S3 as the Central Hub

With S3 Tables and S3 Metadata, Amazon S3 is poised to become the central hub for data management, blending the scalability of data lakes with the functionality of traditional databases. Key benefits include:

  • Unified Storage and Compute: Achieve a data lakehouse architecture where storage and compute are seamlessly integrated, reducing data duplication and associated costs.
  • Openness and Interoperability: Support for Apache Iceberg ensures that data remains accessible and manageable across various query engines and tools, fostering a flexible and vendor-agnostic ecosystem.
  • Enhanced Performance and Efficiency: Automated maintenance tasks and optimized storage formats lead to significant performance improvements, making data operations more efficient and cost-effective.

Conclusion: Embracing S3 as Your New Database

Amazon S3’s evolution into a full-fledged database solution through S3 Tables and S3 Metadata marks a significant milestone in data management. By leveraging the open standards of Apache Iceberg and offering robust, managed services, AWS is redefining what’s possible with cloud storage.

As organizations increasingly seek scalable, efficient, and integrated data solutions, S3 stands out as a compelling choice, offering the combined strengths of data lakes and databases in a single, unified platform. Embrace the future of data management with S3 as your new database, and unlock unprecedented levels of performance, flexibility, and ease of use.


Stay Connected! Subscribe to Big Data Stream to stay updated on the latest trends and innovations in data management. Don’t miss out on future insights that can transform your data strategy!


Appendix: The Hidden Costs of an Empty S3 Bucket

How an Empty S3 Bucket Can Make Your AWS Bill Explode

Update (May 7, 2024)

The S3 team is working on a fix:
https://twitter.com/jeffbarr/status/1785386554372042890

The Unexpected AWS Bill from an Empty S3 Bucket

Imagine creating an empty, private AWS S3 bucket in your preferred region. The next morning, you check your AWS bill and discover an unexpected charge of over $1,300, despite the bucket containing no intentional data. How did this happen? Let’s delve into this alarming scenario.

The Mysterious Surge in PUT Requests

A few weeks ago, while developing a proof-of-concept (PoC) for a document indexing system for a client, I created a single S3 bucket in the eu-west-1 region and uploaded some test files. Two days later, I was shocked to see nearly 100,000,000 S3 PUT requests in my billing details for just one day!
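At the published PUT price of $0.005 per 1,000 requests (quoted in the pricing section above), that volume of traffic translates into real money very quickly. A back-of-the-envelope check; the actual bill also included redirected requests billed in us-east-1, as explained below:

```python
PUT_PRICE_PER_1000 = 0.005  # $ per 1,000 PUT requests

def put_cost(requests):
    """Cost of a given number of billed PUT requests."""
    return requests / 1000 * PUT_PRICE_PER_1000

# ~100 million unauthorized PUTs in a single day:
daily = put_cost(100_000_000)
print(f"${daily:,.2f} per day")            # $500.00 per day
print(f"${daily * 30:,.2f} over a month")  # $15,000.00 over a month
```

Note that the sender needs no AWS account and no permissions on the bucket: every denied request still counts.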

Where did these requests come from?

By default, AWS doesn’t log requests made to your S3 buckets. To investigate, I enabled AWS CloudTrail and S3 Server Access Logging. The logs revealed thousands of write requests from multiple accounts and external sources.

Unveiling the Culprit

The barrage of unauthorized requests wasn’t a targeted attack but rather a result of a misconfigured open-source tool. This tool had a default configuration to back up data to an S3 bucket using a placeholder name—the same name as my bucket. Consequently, every deployment of this tool with default settings attempted to store backups in my S3 bucket, leading to an explosion of PUT requests.

Note: I cannot disclose the name of the affected tool to prevent potential data leaks and protect the involved companies.

Understanding AWS’s Billing for Unauthorized Requests

AWS charges for unauthorized incoming requests, including those that result in 4xx errors. As AWS support confirmed:

“Yes, S3 charges for unauthorized requests (4xx) as well. That’s expected behavior.”

This means that even if a request is denied with an AccessDenied error, the requester incurs costs. Remarkably, no AWS account is needed to generate these charges.

Additionally, over half of my bill originated from the us-east-1 region, despite having no buckets there. This occurred because S3 requests without a specified region default to us-east-1 and are redirected accordingly, with the bucket owner bearing the extra costs for these redirected requests.

Security Implications and Preventive Measures

To prevent such scenarios, consider the following lessons:

  1. Bucket Name Exposure Leads to Cost Risks:
    • Risk: Anyone aware of your S3 bucket name can potentially generate massive AWS bills by sending numerous PUT requests.
    • Mitigation: Avoid using predictable or common bucket names. Instead, use unique, random suffixes to make it harder for misconfigured tools to target your bucket.
  2. Default Configurations Can Backfire:
    • Risk: Open-source tools with default S3 configurations may unintentionally target your buckets if names collide.
    • Mitigation: Always review and customize default configurations of third-party tools to ensure they don’t inadvertently use your bucket names.
  3. Specify AWS Regions Explicitly:
    • Risk: Unspecified regions in API requests can lead to unexpected redirections and additional costs.
    • Mitigation: When making API requests to S3, always specify the intended AWS region to avoid unnecessary redirection charges.
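The first mitigation, unpredictable bucket names, is easy to automate. A minimal sketch using only the standard library; the prefix and 12-character suffix length are arbitrary choices, and the validation reflects the S3 rule that bucket names must be 3-63 characters:

```python
import secrets
import string

def unique_bucket_name(prefix: str, suffix_len: int = 12) -> str:
    """Append a cryptographically random lowercase-alphanumeric suffix
    so the name is practically unguessable by misconfigured tools."""
    alphabet = string.ascii_lowercase + string.digits
    suffix = "".join(secrets.choice(alphabet) for _ in range(suffix_len))
    name = f"{prefix}-{suffix}"
    if not 3 <= len(name) <= 63:
        raise ValueError("S3 bucket names must be 3-63 characters long")
    return name

print(unique_bucket_name("docindex-poc"))  # e.g. docindex-poc-k3f9x2m8q1wz
```

A 12-character random suffix over a 36-symbol alphabet gives roughly 36^12 possibilities, which makes an accidental collision with someone's default placeholder name effectively impossible.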

A Risky Experiment: Public Writes

Curious about the extent of this issue, I temporarily opened my bucket for public writes. Within 30 seconds, my bucket accumulated over 10GB of data. This experiment highlighted how easily data could be exfiltrated due to configuration oversights, posing significant security risks beyond just unexpected billing.

Aftermath and Actions Taken

  1. Reporting the Issue:
    • I informed the maintainers of the vulnerable open-source tool, leading them to promptly fix the default configuration. However, existing deployments remained vulnerable.
  2. Engaging with AWS:
    • I contacted the AWS security team, recommending they implement safeguards to prevent such misconfigurations from causing financial and security issues. Unfortunately, AWS opted not to address third-party product misconfigurations.
  3. Notifying Affected Parties:
    • I reached out to two companies whose data had been unintentionally stored in my bucket. They did not respond, possibly treating my notifications as spam.
  4. Billing Resolution:
    • AWS generously canceled my inflated S3 bill, though they clarified this was an exception rather than the norm.

Final Thoughts

This experience underscores the importance of:

  • Vigilant Configuration Management: Always ensure that your S3 bucket configurations are secure and not easily guessable.
  • Monitoring and Logging: Enable comprehensive logging (e.g., CloudTrail) to detect and respond to unauthorized activities promptly.
  • Understanding AWS Billing Mechanisms: Be aware of how AWS charges for different types of requests to prevent unexpected costs.

By taking these precautions, you can safeguard your AWS resources from similar pitfalls and maintain control over your cloud expenditures.