Matthew Koscak

How I've Learned Generative AI

Mon, 23 Sep 2024 00:00:00 GMT

Life update since my last post… I got a new job!

In June of this year, I started a new role as a Solutions Architect at Cohere. In this role I help customers design AI applications that use Cohere's Large Language Models.

Moving from Data Integration to Machine Learning and Generative AI has been a technical change for me. While there have been dozens of resources that helped me get up to speed on GenAI/ML, today's post focuses on the top two I'd recommend to anyone looking to understand how this stuff actually works.

Large Language Model University (LLMu)

This is the best resource I've encountered for teaching Generative AI to newbies.

Completing this will get you from 0 to intermediate-level knowledge on all things GenAI and Natural Language Processing. AI is a difficult topic to learn because we are dealing with text inputs/outputs. When data is in numerical/tabular format, it's easier to conceptualize and to work with. Deterministic software has guaranteed outputs. But when you're working with random generation (stochastic software), it can be difficult to conceptualize and implement in a trustworthy manner. The best part about LLMu is that it's created for beginners and explained at an easy-to-understand level. If you want to understand how computers work with text, RAG, Agentic AI, and more, LLMu is your one-stop shop.

Full disclosure that this was created by my current company, Cohere, but I promise I am recommending this based on merit. This whole course will take you roughly 20 hours.

Check it out at: cohere.com/llmu

Google's Machine Learning Crash Course

This course is about Machine Learning in general. Knowing the basics of machine learning is table stakes for any AI practitioner. What is a machine learning model? How does model training work? Why? How do we select the data to train on? This course does a great job of getting into these technical concepts. You'll finish this course understanding those concepts and how exactly different types of models can predict future outcomes/trends from the training process.

Check it out — this is a little more in depth and will take roughly 30 hours to complete: developers.google.com/machine-learning/crash-course

A final note about my blog

As a closing note, I plan to refresh my blog's content to better align with my day to day duties and my ongoing learning journey in the field of Generative AI. My blogging will focus on a variety of real-world AI applications, highlight the trends and customer demands I encounter, and provide practical guides on implementing these cutting-edge solutions.

This space is moving insanely fast and I look forward to sharing what I learn along the way.

—Matt

Tool Use, AI Agents, and the next few years of AI

Tue, 04 Jun 2024 00:00:00 GMT

With AI hype at an all-time high… what is the actual value?

In the near term, the answer is: Tool Use, Tool Chains, and finally, AI Agents.

Quick note: the code used in this post is from LangChain's open-source documentation and Quickstart examples.

What is Tool Use?

Tool use, also referred to as function-calling, is a process that certain Large Language Models can utilize to intelligently reason and help a user invoke external tools with natural language (for example — APIs, search engines, database calls, functions, and more). This gives us the ability to not just ask LLMs questions… but to accomplish real-world tasks.

By the end of this article, you will fully understand how to use natural language to:

Access revenue data in a Snowflake Data Warehouse
Access revenue projections stored in a Google Drive document
Ask the LLM the question "Did our Black Friday revenue this year beat our projections?"
Get the correct answer.

This is a perfect example of Tool Use — which is, in my opinion, the number one area AI will add massive enterprise value in the coming years.

How does Tool Use work?

First, the user registers the tools with the LLM. In Python, this commonly starts with LangChain's @tool decorator, which wraps an ordinary function as a tool the model can be given.

Then, any time a prompt is sent to the LLM that requires using those pre-defined tools, the LLM can reason which tool it will call. The model can also understand the parameters with which it will execute that call. Finally, the user executes the tool call and accomplishes whatever task was in the prompt.
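To make those three steps concrete, here is a minimal sketch in plain Python with no LLM library involved. The names `TOOLS`, `register_tool`, and `execute_tool_call` are hypothetical, and the model's decision is hard-coded as an assumption. In a real application that dictionary would come back from the LLM.

```python
from typing import Any, Callable, Dict

# Hypothetical in-memory registry; real frameworks keep richer tool schemas.
TOOLS: Dict[str, Callable[..., Any]] = {}


def register_tool(func: Callable[..., Any]) -> Callable[..., Any]:
    """Step 1: record a function under its name so a tool call can find it."""
    TOOLS[func.__name__] = func
    return func


@register_tool
def multiply(first_int: int, second_int: int) -> int:
    """Multiply two integers together."""
    return first_int * second_int


def execute_tool_call(tool_call: Dict[str, Any]) -> Any:
    """Step 3: run the tool the model selected with the arguments it chose."""
    func = TOOLS[tool_call["name"]]
    return func(**tool_call["args"])


# Step 2 (assumed): what a model might return for "What is 5 times 4?".
model_tool_call = {"name": "multiply", "args": {"first_int": 5, "second_int": 4}}

result = execute_tool_call(model_tool_call)
print(result)
```

The point of the sketch is the division of labor: the model only picks a tool name and its arguments, and our own code does the actual execution.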

Live example of Tool Use

Looking at the QuickStart in LangChain, we can easily create a sample tool that multiplies two numbers together.

@tool
def multiply(first_int: int, second_int: int) -> int:
    """Multiply two integers together."""
    return first_int * second_int

And when we invoke that tool, we see that we get the right answer!

multiply.invoke({"first_int": 4, "second_int": 5})

20

You might be thinking "Multiplication isn't new. What's the big deal?"

Look at the first line of the first code block above. Do you see that @tool decorator? This is how we turn the multiply function into a tool that can be registered with the Large Language Model (LLM). Once this tool is registered with our LLM, the model can decide on its own when this functionality is needed. When I ask "What is 5 times 4," the LLM can reason and decide that it should utilize my multiplication function to solve that, and the model can deduce that the two parameters are 5 and 4.

That's pretty impressive. Let's map that same Tool Use capability to a more impressive enterprise use case: querying a Snowflake Data Warehouse containing a table with your company's historical sales data.

If I utilize LangChain's built-in Snowflake integration, I can create a tool to query Snowflake tables and load Snowflake documents with natural language. Suddenly, I can ask questions like "How much revenue did we generate on Black Friday this year?" and that query would be automatically answered for me, without any need to connect to the database and write a custom SQL statement!

Now THAT is valuable.

But one tool is not that impressive. What is impressive? Invoking multiple tools. That's where Tool Chains come in.

What is a Tool Chain?

A Tool Chain is when multiple tools are called sequentially. It's literally a "chain" of "tools." Pretty easy concept to grasp.

Keeping on the math example, let's add in a second tool — addition!

@tool
def add(first_int: int, second_int: int) -> int:
    "Add two integers."
    return first_int + second_int

This simple tool adds two numbers together, and functions the same way as the multiplication tool. But now, since there are two tools registered, I can prompt the LLM on either one.

chain.invoke("What's 5 times 4")

20

chain.invoke("What's 20 plus 100")

120

We now have multiple tools in our chain. We can call either of them as needed!

Earlier we created our Snowflake connection and queried a Snowflake table via natural language to find out our Black Friday revenue. Now, we set up a Google Drive connector. In our fake scenario here, this is where we keep our company's financial projections for this year. We can use natural language to ask both:

"How much revenue did we make on Black Friday this year?" and "What were our 2024 Black Friday revenue projections?"

This is a perfect example of a Tool Chain — a chain of tools that each accomplish individual tasks, which all together accomplish a goal.

But there is one limitation: I can only invoke one of these tools at a time. What if I wanted to ask "Did our Black Friday revenue this year beat our projections?"

This is where AI Agents come in.

AI Agents

The core idea of AI Agents (sometimes called Intelligent Agents) is to use a language model to choose a sequence of actions to take. In agents, a language model is used as a reasoning engine to determine which actions to take and in which order.
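Here is a rough sketch of that idea in plain Python. The "reasoning engine" is faked: `PLANNED_STEPS` is a hard-coded stand-in for what a model might decide, and the `"$prev"` placeholder is a convention I made up for this illustration. A real agent would usually choose each step after seeing the previous result rather than planning everything up front.

```python
from typing import Any, Callable, Dict, List


def multiply(first_int: int, second_int: int) -> int:
    return first_int * second_int


def add(first_int: int, second_int: int) -> int:
    return first_int + second_int


TOOLS: Dict[str, Callable[..., int]] = {"multiply": multiply, "add": add}

# Assumed model output for "What is 5 times 4, plus 100?".
# "$prev" marks where the previous step's result should be substituted.
PLANNED_STEPS: List[Dict[str, Any]] = [
    {"name": "multiply", "args": {"first_int": 5, "second_int": 4}},
    {"name": "add", "args": {"first_int": "$prev", "second_int": 100}},
]


def run_agent(steps: List[Dict[str, Any]]) -> Any:
    """Execute each chosen tool in order, feeding results forward."""
    previous = None
    for step in steps:
        args = {
            key: (previous if value == "$prev" else value)
            for key, value in step["args"].items()
        }
        previous = TOOLS[step["name"]](**args)
    return previous


answer = run_agent(PLANNED_STEPS)
print(answer)
```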

Looking back at our addition and multiplication example from the LangChain documentation, we can now invoke both tools in one question!

chain.invoke("What is 5 times 4, plus 100?")

"5 × 4 equals 20, and adding 100 to that total gives us 120"

This ability to take my natural language prompt, deduce what functions should be used, and what parameters to use in those functions is all quite impressive! Let's relate this to our enterprise example one final time.

With the ability to invoke both my Snowflake tool and my Google Drive tool in one call, I can ask the question "Did our Black Friday jean sales this year beat our projections?" and the LLM will call the Snowflake tool to return the revenue, call the Google Drive tool to find our Black Friday projection, compare those two results, and then return an answer such as:

chain.invoke("Did our Black Friday jean sales this year beat our projections?")

"Black Friday sales were $100,000, which is higher than the projected $80,000 for this year."

Conclusion

Tool Use and Intelligent Agents are the next phase of AI.

With NVIDIA at almost a 3 trillion dollar market cap, and every single tech company spending money on AI, I think it's safe to say AI hype is at an all-time high.

But, as with any hyped technology, it seems like we could be approaching a bubble. So where is the real return on investment for companies when it comes to AI?

In the near term? Tool Use and Intelligent Agents.

Tool Use and Intelligent Agents are going to completely transform knowledge work, and give users the ability to automate tasks that they wouldn't have been able to dream of before the advent of LLMs.

These agents will expand far beyond asking questions to a Snowflake table. Soon, entire workflows will be automated away without any human intervention. The LLM will function as a sort of human-like brain, which can do everything within the limits of the target tools' APIs.

The next few years will bring in massive changes in enterprise automation and knowledge work.

Apache Iceberg: The Quickstart Guide

Thu, 04 Apr 2024 00:00:00 GMT

In case you missed it… 2024 is the year of Apache Iceberg.

Today we are going to discuss:

What is Apache Iceberg?
How does it work?
What are some real-world use cases?
The open-source community around Iceberg vs. competitors

What is Apache Iceberg

Apache Iceberg has quickly become the most popular Open Table Format. So what is it?

Apache Iceberg is a truly open-source table format for Parquet, ORC, and Avro files. Businesses can capture data fast and cheaply in these file formats in their data lake, and then use Apache Iceberg tables as an abstraction layer over those files to introduce the following functionality:

Schema evolution
SQL querying on data lakes
Incremental processing of data
Consistent, reliable data states for all users
Time travel — querying current or past snapshots

How exactly does it work?

Iceberg tables don't actually house the data. Instead, the data is kept in Parquet, ORC, or Avro files, and Iceberg is used as an abstraction layer. What does that mean?

Apache Iceberg utilizes a system of pointers and metadata files to keep track of CHANGES to the underlying data files. The pointers and metadata files comprise our Apache Iceberg table!

We will now dive into how Iceberg works under the covers, and learn how a SELECT statement would execute in this architecture.

Architecturally, there are three layers of an Iceberg table format:

Layer 1 — The Iceberg catalog

The catalog is the highest level, and is the starting point for any interactions with Iceberg tables. It contains the current metadata pointer, which points to the metadata file of the current Iceberg table. If we have database 1 (db1), and table1 to represent our first Apache Iceberg table, the metadata pointer kept in the catalog would be db1.table1. This points to the current metadata file.

Layer 2 — The Metadata layer

The metadata layer consists of a few reference files:

Metadata file — This file stores high-level metadata about your table at a certain point in time. The most important information contained in this file is the current snapshot. The current snapshot gives us the CURRENT table, which consists of a manifest list, manifest files, and data files (stay with me, it will make sense by the end).
Manifest list — a list of manifest files. This list contains the path (the location) of each manifest file contained in a snapshot.
Manifest files — The purpose of a manifest file is to track the data files. Manifest files contain information about the underlying data files in object storage. Information like location, record count, and partition information are stored in the manifest file, and can be used to make querying more efficient.

Layer 3 — The Data (Storage) layer

Data files — This layer contains the actual data in your Iceberg table. These would be in either the Parquet, ORC, or Avro file format. These data files are managed by the files in the metadata layer.
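To tie the three layers together, here is a toy sketch of how a SELECT could find its data files: catalog, then metadata file, then manifest list, then manifest files, then data files. Everything here is a plain Python dictionary with made-up paths and field names. It is not the Iceberg API or its real file layout, just the lookup order described above.

```python
from typing import Dict, List

# Layer 1: the catalog maps a table name to its current metadata file.
catalog: Dict[str, str] = {"db1.table1": "metadata/v2.metadata.json"}

# Layer 2: metadata file -> snapshots -> manifest list -> manifest files.
metadata_files = {
    "metadata/v2.metadata.json": {
        "current_snapshot_id": 2,
        "snapshots": {1: "metadata/snap-1.avro", 2: "metadata/snap-2.avro"},
    }
}
manifest_lists = {
    "metadata/snap-1.avro": ["metadata/manifest-a.avro"],
    "metadata/snap-2.avro": ["metadata/manifest-a.avro", "metadata/manifest-b.avro"],
}
manifest_files = {
    "metadata/manifest-a.avro": [{"path": "data/file-1.parquet", "record_count": 100}],
    "metadata/manifest-b.avro": [{"path": "data/file-2.parquet", "record_count": 50}],
}


def plan_scan(table: str) -> List[str]:
    """Walk from the catalog down to the Layer 3 data files for a table."""
    metadata = metadata_files[catalog[table]]
    manifest_list = metadata["snapshots"][metadata["current_snapshot_id"]]
    data_files: List[str] = []
    for manifest in manifest_lists[manifest_list]:
        data_files.extend(entry["path"] for entry in manifest_files[manifest])
    return data_files


files_to_read = plan_scan("db1.table1")
print(files_to_read)
```

Notice that nothing in the data files had to be opened to work out which files the query needs. That is the job of the metadata layer.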

Real-world use cases

Now you understand how Iceberg tables work under the hood. But what use cases are becoming more prevalent that make 2024 the year of Apache Iceberg?

Strict data privacy laws — There are data privacy laws that require deleting data after a certain period. If that data is kept in a data lake, Iceberg allows for easy deletion of relevant records or tables.
Updates at the record level — If you sell something and that transaction is stored in your data lake, and then the customer returns it… what do you do? In immutable data stores, we must reprocess the entire data set. With Iceberg, we can make record-level changes in the data lake.
ACID transactions — allows data lakes to function as transactional data stores.

Open Source Community

The last point, and probably the most important, is that there are only a few open table format options for a modern data repository — Iceberg, Delta Lake, and Hudi. Iceberg, in many experts' opinion, has the best open-source community contributing to its development.

Delta Lake (a competing Open Table Format) is open-source, but its two biggest contributors by far are Databricks and Microsoft. If your business works with those companies, then Delta Lake may be a good choice for your lakehouse architecture. But if you are looking for true open source, Apache Iceberg is likely the better option. Iceberg has an incredibly diverse and talented community across a plethora of companies contributing to its advancement.

It's the most feature-rich and the most truly open table format out there.

2024 is the year of Apache Iceberg. Are you ready for the Iceberg takeover?

The Benefits of Open Table Formats

Thu, 29 Feb 2024 00:00:00 GMT

If you work with data, you've probably heard of the term "Open Table Format."

If you haven't, or if you want to learn what Open Table Formats (OTFs) are and why they are all the rage, this post is for you.

What is an Open Table Format

You're probably familiar with a table in a relational database. It's a grouping of columns and rows of data that we can query.

So what's the difference between this "Table Format" and an "Open Table Format"? Let me explain.

Remember that data lakes are composed of files (Parquet, ORC, etc.) in HDFS or Object Storage. These files are visible to us as the end user. Using Parquet as an example, that file could be Example.parquet.

These file formats are different from Open Table Formats like Apache Iceberg, Delta Lake, or Apache Hudi. File formats and open table formats work together.

An Open Table Format is an abstraction layer on top of a data lake's files/storage that introduces functionality traditional database tables have.

What functionality are we talking about?

Benefits of an Open Table Format

Schema and partition evolution + CRUD operations

Relational database tables allow for C.R.U.D. operations (Create, Read, Update, and Delete). In a typical data lake, however, users can only create objects (or files) and read them. Data lakes typically utilize object storage (or HDFS), which does not provide an easy way for users to update the data. These storage mechanisms are designed to hold immutable (unchangeable) copies of data. That is, until OTFs came onto the scene. With OTFs, you can update columns/records, schemas, and partitions across object stores without completely reprocessing the data.

Improved performance

Open Table Formats allow analytical engines (Spark, Presto) to filter by metadata BEFORE executing a query. This drastically reduces the number of compute operations and records to read through for queries over large data sets. Quick example…

Let's say our company, Rockford Corp, has customers aged 20–70. We store all purchase transactions in Parquet files, partitioned by decade of age (20–29, 30–39, etc.). We want to analyze just those individuals aged 20–29. With an Open Table Format, we have the metadata of these files stored in our catalog, allowing us to search ONLY those files that meet the condition of age = 20–29. This allows us to skip over all other age groups, drastically improving our time to query and the performance of that query versus if we had to query all records.
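A quick sketch of that pruning step, assuming the catalog keeps a minimum and maximum age for each file. The file paths and the `prune_files` helper are hypothetical, and real engines track far more statistics than this.

```python
from typing import Dict, List, Tuple

# Hypothetical file-level metadata: path -> (min_age, max_age) in that file.
file_stats: Dict[str, Tuple[int, int]] = {
    "purchases/age_20_29.parquet": (20, 29),
    "purchases/age_30_39.parquet": (30, 39),
    "purchases/age_40_49.parquet": (40, 49),
    "purchases/age_50_59.parquet": (50, 59),
    "purchases/age_60_69.parquet": (60, 69),
    "purchases/age_70_79.parquet": (70, 70),
}


def prune_files(stats: Dict[str, Tuple[int, int]], low: int, high: int) -> List[str]:
    """Keep only files whose age range overlaps the requested range."""
    return [
        path
        for path, (min_age, max_age) in stats.items()
        if max_age >= low and min_age <= high
    ]


selected = prune_files(file_stats, 20, 29)
print(f"Reading {len(selected)} of {len(file_stats)} files: {selected}")
```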

ACID functionality

ACID, which stands for Atomicity, Consistency, Isolation, and Durability, describes four key characteristics of a database table. These four properties together ensure database operations across groups of records can happen concurrently without issues. If any single step in a transaction fails, the entire transaction fails and the database reverts to the last stable state. This is extremely important for certain applications and use cases where multiple reads and writes are happening concurrently.

Atomicity — Guarantees all commands in a transaction either succeed together or fail together.
Consistency — Guarantees all transactions follow the constraints or rules set.
Isolation — Transactions run in an isolated environment, allowing two transactions to run concurrently.
Durability — Transactions that complete successfully are guaranteed to persist in the database.
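Atomicity is the easiest of the four to see in code. The sketch below uses SQLite from Python's standard library as a stand-in, since it is a transactional store that needs no setup. The `accounts` table and `transfer` function are made up for the demo. It is not an Open Table Format, but the all-or-nothing behavior is the same idea.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 0)])
conn.commit()


def transfer(amount: int, fail_midway: bool) -> None:
    """Move money from alice to bob as one transaction."""
    # "with conn" commits if the block succeeds and rolls back if it raises.
    with conn:
        conn.execute(
            "UPDATE accounts SET balance = balance - ? WHERE name = 'alice'", (amount,)
        )
        if fail_midway:
            raise RuntimeError("simulated crash between the two updates")
        conn.execute(
            "UPDATE accounts SET balance = balance + ? WHERE name = 'bob'", (amount,)
        )


try:
    transfer(40, fail_midway=True)
except RuntimeError:
    pass  # the debit above was rolled back along with the failed transaction

balances_after_failure = dict(conn.execute("SELECT name, balance FROM accounts"))

transfer(40, fail_midway=False)
balances_after_success = dict(conn.execute("SELECT name, balance FROM accounts"))

print(balances_after_failure, balances_after_success)
```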

Time travel

Open Table Formats utilize metadata to create each "snapshot" or version of that table and its contents at a point in time. Each snapshot is a grouping of metadata across the files and object stores. One unique aspect of OTFs is that because these snapshots are captured and kept, users can roll back to previous snapshots. This allows for "time travel," as we can utilize older versions of these tables whenever we need.
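Here is a toy sketch of why keeping snapshots makes time travel possible. The `SnapshotTable` class is hypothetical and tracks only a list of file paths per snapshot. Real formats store much more, but the principle is the same: old snapshots are never modified, and new ones are appended.

```python
from typing import Iterable, List, Optional


class SnapshotTable:
    """Toy table that never mutates old snapshots, only appends new ones."""

    def __init__(self) -> None:
        # Snapshot 0 is the empty table.
        self.snapshots: List[List[str]] = [[]]

    def commit(self, added: Iterable[str] = (), removed: Iterable[str] = ()) -> int:
        """Create a new snapshot from the current one and return its id."""
        removed_set = set(removed)
        current = self.snapshots[-1]
        new_files = [f for f in current if f not in removed_set] + list(added)
        self.snapshots.append(new_files)
        return len(self.snapshots) - 1

    def files(self, snapshot_id: Optional[int] = None) -> List[str]:
        """Return the data files for a snapshot (latest if none is given)."""
        if snapshot_id is None:
            snapshot_id = len(self.snapshots) - 1
        return list(self.snapshots[snapshot_id])


table = SnapshotTable()
s1 = table.commit(added=["data/jan.parquet"])
s2 = table.commit(added=["data/feb.parquet"])
s3 = table.commit(removed=["data/jan.parquet"])

print("current:", table.files())
print("as of snapshot", s2, ":", table.files(s2))
```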

Note: each of the main Open Table Formats (Iceberg, Delta Lake, and Hudi) works slightly differently from the others.

The important point here is that without an open table format, data lakes DO NOT have the critical functionality listed above. But with data files organized under a standardized table format, we get the full data warehouse experience on the data lake.

This is where the term "Data Lakehouse" comes from.

In summary, an Open Table Format is an abstraction layer on top of modern file formats that gives us functionality such as schema/partition evolution, CRUD operations, better performance, ACID transactions, and time travel.

Thanks for reading! If you've read this far, I'd love to hear your thoughts on this article and would appreciate any feedback you may have.

The Rise of Object Storage

Fri, 26 Jan 2024 00:00:00 GMT

The Blob.

The Bucket.

The almighty Object Store.

Just under 80% of the world's data is unstructured or in an object store. If you work with data, this affects you. You should know what object storage is.

For this blog post, give me a few minutes of your time and I'll explain:

What object storage is
Why you should care

What is Object Storage?

Object storage is the basis of the modern data repository. Yep, it's that important.

Object storage is where all of our unstructured, semi-structured, and sometimes structured files reside in a data lake or data lakehouse. Sure, there are other aspects of a data lakehouse, but object storage reigns king as the holder of information.

As a refresher, data lakes are where we dump data (usually unstructured) to potentially clean and analyze later. Remember how unstructured data is hard to work with? Lakehouses fixed that — think of a data lakehouse as a sort of queryable data lake.

So is object storage new or something?

No. Object storage is actually like 30 years old. It was originally invented in the 1990s to help companies meet new compliance laws. During the '90s, a bunch of naughty companies were deleting or changing their financial records data. To stop people from doing this, new laws came into effect that changed how companies could store, change, and delete data for record keeping.

Object storage initially came at a time when companies needed:

Auditable data trails
Unchangeable data stores

And object stores are great for those two things! But they can also:

Have expansive, customizable metadata
Scale cheaply

All four of these benefits together are what make object storage unique versus other storage mechanisms like block storage. If block storage was a covered parking lot with valet service, object storage was economy parking.

But object storage wasn't crazy popular at first. The need for good metadata wasn't apparent. Object storage wasn't yet available on the cloud.

That all changed around 2010, as enterprises increasingly needed a place to cheaply dump massive amounts of unstructured and other data to work with later. The data lake was born! It needed storage — Cloud Object Storage was the cheap, safe, and scalable option that worked best. It became the standard storage for data lakes.

Fast forward to today — object storage dominates the world's data. According to IDC, just under 80% of the world's data is unstructured or object storage.

But object storage isn't just popular because it is a cheap, scalable storage option. A bunch of dirty unstructured data together? That sounds like a data swamp! We needed the ability to search and query our object storage.

Remember how "expansive, customizable metadata" was a benefit of object storage? The need to make sense of all this object storage demanded a solution.

Enter the Data Lakehouse. With the advent of Data Lakehouses, and more specifically the open table formats that emerged in the second half of the 2010s, object stores eventually became easier to navigate and analyze.

Open table formats made it possible to query data across object stores. There's a lot of work involved in getting there, but that's the gist of it.

In summary, object storage is cheap, scalable storage with descriptive metadata.

It became popular in the 2010s because of the advent of Cloud Object Storage plus the creation of open table formats. Together, these two advancements ushered in an explosion of unstructured data analytics and a dominant period for Object Storage.

Why it matters to you

This matters to you if you work with data — specifically, unstructured data. Think emails, audio files, photos, logs, videos, and other sources of "information." These files are likely landing in object storage to be analyzed later.

These sources don't have rows and columns to get straightforward insights from. But they do have valuable data to analyze. Let's look at a quick example of what I mean.

Let's say you own a clothing store. You just released a new pair of blue jeans you think are great, but know you can improve to bring your business to new heights. You obtain an audio file of one of your customers explaining what they like and don't like about your new blue jeans. This is great news, as this audio has information that can be analyzed to help you improve your blue jeans!

In a perfect world, this audio file would be a spreadsheet. But it's not. That's not how the world works. Data is captured in all sorts of unstructured formats that you can cleanse into something you can query.

So how do we "query" or analyze this? The answer starts with object storage.

Object stores include both the audio recording file and metadata about that file's contents. After cleansing this audio file and object, users can then query that metadata and the (cleansed) file contents to derive insights.

Utilizing metadata about object stores also gives us improved query performance and cost. For example, if we wanted to query all audio files created in January, we could just query metadata containing a January timestamp, eliminating the need to search through 11 other months of results.

By utilizing metadata to pre-filter our query results, we drastically speed up execution time (less data to analyze) and, as a result of fewer computations, save on compute costs. Talk about good data engineering!
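A small sketch of that pre-filtering idea, assuming each object carries a `created` date in its metadata. The `StoredObject` class, the keys, and the metadata fields are all hypothetical. Real object stores expose metadata through their own APIs.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class StoredObject:
    key: str                      # unique name of the object in the bucket
    data: bytes                   # the payload itself (audio, image, log...)
    metadata: Dict[str, str] = field(default_factory=dict)


bucket: List[StoredObject] = [
    StoredObject("reviews/call-001.wav", b"...", {"created": "2024-01-12", "product": "blue-jeans"}),
    StoredObject("reviews/call-002.wav", b"...", {"created": "2024-03-03", "product": "blue-jeans"}),
    StoredObject("reviews/call-003.wav", b"...", {"created": "2024-01-28", "product": "jackets"}),
]


def keys_created_in(objects: List[StoredObject], year_month: str) -> List[str]:
    """Select objects by metadata alone; payloads are never opened."""
    return [
        obj.key
        for obj in objects
        if obj.metadata.get("created", "").startswith(year_month)
    ]


january_keys = keys_created_in(bucket, "2024-01")
print(january_keys)
```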

With object storage, a whole new data repository has been created right before our eyes. By piecing together object storage, open file formats, and open table formats, we have officially entered into the golden age of unstructured data analytics and AI.

The Big 3 AI File Formats

Wed, 10 Jan 2024 00:00:00 GMT

Jordan, Rodman, Pippen.

Parquet, ORC, Avro?

Big 3s are cool in basketball. Big 3s in file formats? Probably something you gloss over.

Today's blog will convince you otherwise. Knowing the basics of these three file formats, and when to use which, will make you lethal in the world of AI.

So let's dive in.

The main file formats we deal with in our day-to-day data job are typically CSV or JSON. They work for some use cases. But these file formats weren't particularly designed to deal with BIG data. When files of this format encounter massive amounts of data and are used for analytical workloads, compute resources are typically used up much faster. Searching through millions or billions of records, value by value, to find an answer… it just isn't efficient for computers. Not with a basic file format, at least!

To be specific, older file formats lack the built-in compression, fast read/write performance, and efficient handling of nested data structures often required by big data projects. Because of these shortfalls, more sophisticated file formats eventually came to be.

Today we will run through the big three file formats you'll encounter in data lakes and the world of big data and AI:

ORC
Parquet
Avro

Before we overview each, we need to establish two important concepts.

First concept — the structure of a data file is very important.

Namely, row-based storage vs. columnar storage. The way your data is organized can change your time to answer a question or query from 6 minutes to 6 milliseconds. You must optimize your format to match your storage method and your data use case.

Second concept — what is a column-oriented data file format?

This is a fundamental concept of two of these three formats. A lot of folks think of typical row-based storage when storing records. But another helpful option is column-oriented files. In a column-oriented file, data is stored by — you guessed it — each column. So each column contains all the values for that specific attribute across all records.

Let's say our business, Rockford Corp., has customer records stored with some basic information. In row-based storage, records are stored one full record at a time. In columnar storage, the data is stored by each column.

Take a second to understand the differences in the way these are stored…

So now… who cares? You should! The columnar type of file format is often ideal for analytics and machine learning.

A real-world example of this would be finding the average age of Rockford Corp's customers. Let's find the average age using both the row storage and column storage file formats.

ROW STORAGE — Since the "age" attribute is stored as one value in each record containing 4 total values (FirstName, LastName, Age, ZipCode), a computer must load all 16 total values into memory to calculate the average age of 36.25.

COLUMNAR STORAGE — Here the ages are all in the same "record." Searching for the answer to this particular question would be a lot quicker, as we just load the 4 age values into memory for our average.

That's 4 operations instead of 16!
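If you'd like to see that in code, here is a small sketch. The customer records are made up, with ages chosen so they average 36.25 as in the example. It simply counts how many values each layout has to touch to answer the question.

```python
# Hypothetical customer records; ages are chosen to average 36.25.
rows = [
    {"FirstName": "Ana", "LastName": "Diaz", "Age": 25, "ZipCode": "61101"},
    {"FirstName": "Ben", "LastName": "Cole", "Age": 31, "ZipCode": "61102"},
    {"FirstName": "Cy", "LastName": "Park", "Age": 42, "ZipCode": "61103"},
    {"FirstName": "Dee", "LastName": "Shaw", "Age": 47, "ZipCode": "61104"},
]

# Columnar layout: one list per attribute instead of one dict per record.
columns = {name: [row[name] for row in rows] for name in rows[0]}

# Row storage: whole records are loaded, so every value is touched.
row_values_read = sum(len(row) for row in rows)
row_average = sum(row["Age"] for row in rows) / len(rows)

# Columnar storage: only the Age column is loaded.
column_values_read = len(columns["Age"])
column_average = sum(columns["Age"]) / len(columns["Age"])

print(row_values_read, row_average)
print(column_values_read, column_average)
```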

And thus goes columnar storage. While this is an overly simplified example with only a few data points, you can see how this drastically improves compute resources and efficiency over large data sets. Numerous machine learning and AI use cases are better accomplished with data in columnar file formats.

Other benefits of columnar file formats and storage include:

Better compression
Faster query performance
Scalability

Got it? Good stuff. Let's move on to the big three file formats.

Apache Parquet

There's a reason I put Parquet with Michael Jordan in the opening picture. Apache Parquet is the most popular file format on this list.

Parquet is an open-source, column-oriented data file format that came out in 2013. The main draw of Parquet is improved analytical querying performance. That's fancy speak for better data storage and data retrieval. It is extremely popular in Python-based projects, and Python is the most popular programming language in the world.

Parquet provides:

Efficient data compression — your data doesn't take up tons of space
Ability to handle complex data in bulk
Availability in multiple languages (Python, Java, C++, etc.) — useful in lots of big data projects
Availability to any project in the Hadoop ecosystem, regardless of the specific data processing framework, model, or language. This is huge as you aren't locked into one framework.

Parquet is a write-once, read-many format: files take more work to write than row-based formats but are cheap to scan, which suits read-heavy analytical workloads. It also has excellent support for complex nested data structures, making Parquet a great candidate for JSON and other nested data types.

Apache ORC

The second file format you'll run into is Apache ORC, or Optimized Row Columnar. This format is again columnar and is designed for big data processing systems like Hadoop.

Inside the ORC file, data is stored in stripes (which are just groupings of rows of data). Within each stripe, the data is organized column by column and then compressed into much smaller storage. The result is a massive data set that doesn't take up much space!

Need proof? Facebook (Meta) uses ORC in their data warehouse to save tens of petabytes of data versus other formats.

ORC also stores indexes and vast metadata in the file, so that certain query results can be retrieved quickly instead of searching through the entire file.

The Apache ORC file format is an excellent candidate for read-heavy use cases, especially streaming, with support for finding the required records with speed.

You're probably thinking "Hey, this sounds a lot like Parquet." It does: both are columnar, and both are built for read-heavy analytics. The main difference to remember is the ecosystem. ORC grew out of Apache Hive and is most at home there, while Parquet is the more common default in Spark and across other engines. It's a little more complicated than that… but if you remember this one fact, you'll be ahead of most data and AI learners.

Apache Avro

Avro is a ROW-based storage format for Hadoop. Avro stores the schema as JSON, making it easy to read by almost any program. The data itself is stored in binary, which makes it compact. One important feature of Avro is support for data schemas that change over time (this is called schema evolution). Avro can handle schema changes like missing, added, or changed fields.

The Avro format also provides support for numerous rich data structures, and even support for multiple data structures in the same record. Avro is often recommended for Kafka, and when serializing data in Hadoop. Avro is splittable and compressible and is a really good candidate for the Hadoop ecosystem and for running in parallel.

One thing that's special about Avro is that it is self-describing. Serialized data AND that data's schema are bundled in the same Avro file. This allows different programs to easily deserialize messages.
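To show what "self-describing" means, here is a toy sketch using only Python's standard library. It writes a JSON schema header followed by binary rows, and the reader decodes the rows using nothing but the schema found in the file. This is not the real Avro encoding or the Avro library. The layout, `write_container`, and `read_container` are simplified assumptions for illustration.

```python
import io
import json
import struct
from typing import Any, Dict, List


def write_container(schema: Dict[str, Any], records: List[Dict[str, Any]]) -> bytes:
    """Write a JSON schema header followed by binary-encoded rows."""
    header = json.dumps(schema).encode("utf-8")
    out = io.BytesIO()
    out.write(struct.pack(">I", len(header)))
    out.write(header)
    out.write(struct.pack(">I", len(records)))
    for record in records:
        for fld in schema["fields"]:
            value = record[fld["name"]]
            if fld["type"] == "long":
                out.write(struct.pack(">q", value))
            else:  # this toy treats every other type as a UTF-8 string
                raw = value.encode("utf-8")
                out.write(struct.pack(">I", len(raw)))
                out.write(raw)
    return out.getvalue()


def read_container(blob: bytes) -> List[Dict[str, Any]]:
    """Decode rows using only the schema embedded in the blob itself."""
    src = io.BytesIO(blob)
    (header_len,) = struct.unpack(">I", src.read(4))
    schema = json.loads(src.read(header_len))
    (count,) = struct.unpack(">I", src.read(4))
    records = []
    for _ in range(count):
        record: Dict[str, Any] = {}
        for fld in schema["fields"]:
            if fld["type"] == "long":
                (record[fld["name"]],) = struct.unpack(">q", src.read(8))
            else:
                (size,) = struct.unpack(">I", src.read(4))
                record[fld["name"]] = src.read(size).decode("utf-8")
        records.append(record)
    return records


schema = {
    "type": "record",
    "name": "Customer",
    "fields": [{"name": "name", "type": "string"}, {"name": "age", "type": "long"}],
}
blob = write_container(schema, [{"name": "Ana", "age": 25}, {"name": "Ben", "age": 31}])
decoded = read_container(blob)
print(decoded)
```

The reader never receives the schema as an argument, which is the whole point: any program handed the bytes can work out how to decode them.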

So now to the important question… what use case is best for Avro?

Based on the file format's strengths, Avro is an ideal candidate for your data lake's landing zone. This is because data in this zone is typically read in its entirety downstream (row is better than column for this), PLUS those downstream systems retrieving that data can also easily retrieve the schemas (since they are stored with the file). Another great use case is standardizing data on Avro across your different systems as a consistent communication format.

And those are your big three file formats in the age of AI! Just like Jordan, Pippen, and Rodman dominated the 1990s, these three file formats dominate big data.

The main takeaway here is that your file format should match your downstream use case. Yes, that takes planning. But that planning will pay for itself a thousandfold in time and money down the road. And that, my data-driven friend, is what good engineering is.

Thanks for tuning in!

The Modern Data Repository Crash Course

Thu, 21 Dec 2023 00:00:00 GMT

Database → Data Warehouse → Data Lake → Data Lakehouse

By the end of this blog post, you'll have a solid understanding of each, and finally understand (in simple terms) how the Data Lakehouse came to be and why it is all the rage.

When you think of collecting data, the first thing that pops into your head is likely a database.

And that's a great start! Databases were the first way to efficiently store and recall large amounts of data in an organized fashion. The RDBMS reigned supreme for years.

Then came this thing called the internet. As the internet and e-commerce exploded through the late 1990s and early 2000s, the amount of data being generated exploded with them. These events gave rise to the age of "Big Data."

Big Data gave rise to the era of analytics and reporting. These activities, under the umbrella of "Business Intelligence," came to dominate enterprises. The ability to accurately obtain historical data for reporting, forecasting, customer analysis, market trends, etc., quickly became a key focus of every business everywhere.

But we needed a technology to power our Business Intelligence.

So you might be thinking, why not use a database? Doesn't it store lots of data?

Databases are designed to write individual records really fast, and that's what made them special. The problem is, reporting and analytics require reading large amounts of data fast. A system purpose-built for storing and reading massive amounts of historical data was not widely available at the time. That is, until the Data Warehouse went mainstream around the turn of the Millennium.

The purpose of an Enterprise Data Warehouse (EDW) is to consolidate data in an organized fashion from a variety of databases to help businesses slice and dice their data. The ultimate goal of this was to use data to make better business decisions.

It's worth noting that EDWs don't actually do the slicing and dicing (Business Intelligence tools do that). Instead, the EDW gives those tools a trustworthy foundation for reliably slicing and dicing historical data.

A Data Warehouse can be broken down into 4 components:

Ingestion — use ETL tools to bring data from siloed sources
Storage — stores the data in a central database
Metadata — data about your data, specifying things like usage, values, statistics, and other insights
Consumption — tools to access the data within your data warehouse such as querying, reporting, development, and OLAP tooling
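
The four components above can be sketched end to end in a few lines of Python. This is a toy, assuming an in-memory SQLite database standing in for the warehouse and two made-up sources (`web_orders` and `store_orders`); a real EDW would use dedicated ETL and BI tooling, but the flow is the same.

```python
import sqlite3

# Two hypothetical siloed sources, each with its own shape.
web_orders = [("2023-12-01", "widget", 3), ("2023-12-02", "gadget", 1)]
store_orders = [{"day": "2023-12-01", "item": "widget", "qty": 2}]

# Storage: one central table with a schema defined up front.
warehouse = sqlite3.connect(":memory:")
warehouse.execute(
    "CREATE TABLE sales (order_date TEXT, product TEXT, quantity INTEGER, channel TEXT)"
)

# Ingestion: extract from each source, transform to the shared schema, load.
rows = [(d, p, q, "web") for d, p, q in web_orders]
rows += [(o["day"], o["item"], o["qty"], "store") for o in store_orders]
warehouse.executemany("INSERT INTO sales VALUES (?, ?, ?, ?)", rows)

# Metadata: simple statistics about the data we just loaded.
row_count = warehouse.execute("SELECT COUNT(*) FROM sales").fetchone()[0]

# Consumption: the kind of slice-and-dice query a BI tool would send.
totals = warehouse.execute(
    "SELECT product, SUM(quantity) FROM sales GROUP BY product ORDER BY product"
).fetchall()

print(row_count)
print(totals)
```

Notice that the schema exists before any data lands, and every source has to be reshaped to fit it. Keep that in mind for the data lake discussion later in this post.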

With Data Warehouses, enterprises could finally do BI, analytics, and reporting at scale. At last we had a solid way to slice and dice swathes of historical data!

But the continued explosion of data soon created a new problem… unstructured and semi-structured data from sources like social media, IoT sensors, email, and more.

Data warehouses were designed to work really well with structured data. And while EDWs could technically work with unstructured/semi-structured data, that data had to be drastically cleaned up first (which wasn't really practical). This presented a big problem, as organizations were unable to get much (if any) value from their non-structured data sources.

This all changed around 2010, when Pentaho CTO James Dixon introduced the concept of a Data Lake.

A data lake is a repository that stores files or objects in their original format.

In this case, there is no pre-defined schema like in a data warehouse. This allows us to consolidate and analyze data of all kinds for a variety of business purposes. The end result? Valuable insights from previously scrambled data.

The key advantage of a data lake is being able to store almost any type/size/format of data in its original state (both structured and unstructured). The main trade-off here, however, is that data lakes can lack governance and guardrails on that data.

As data lakes emerged, they were (and sometimes still are) custom-built. This gives data engineers great flexibility, as they can choose what each component is made of. The key components include:

Data ingestion — ETL tools for batch, as well as Kafka (and others) for real-time/streaming. You want a standardized ingestion framework.
Storage — Early data lakes were built using on-premises HDFS clusters. But the high cost of these systems ultimately ushered in the era of cloud data lakes (which are based on object storage).
Processing (trusted) zone — This is where data is transformed and enriched for use (quality checks and remediation).
Consumption zone — how the data is accessed for business use.
Data governance and management zone — Data auditing, metadata management, lineage, cataloging, security, monitoring, operations, etc. This zone applies to the other four as an overlay.

You're probably thinking "Hey, these components seem similar to those of a data warehouse."

The components are similar, but the key differentiator is whether you need to specify the schema and cleanse the data beforehand. In a data warehouse, you had to do all this before landing the data in storage. With a data lake, we can simply dump copies of the original data into object storage. This offers us greater flexibility and scalability.
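
This difference is often called schema-on-write (warehouse) versus schema-on-read (lake). Here is a minimal Python sketch of the two approaches, assuming some made-up JSON events and three hypothetical helper functions. It is an illustration of the idea, not how any particular product implements it.

```python
import json

# Raw events from hypothetical sources, with inconsistent fields.
raw_events = [
    '{"user": "ada", "action": "login"}',
    '{"user": "grace", "action": "purchase", "amount": 42.5}',
    '{"sensor": "t-17", "temp_c": 21.3}',
]

REQUIRED_FIELDS = {"user", "action"}


def warehouse_load(lines: list[str]) -> list[dict]:
    """Schema-on-write: validate before storing, reject what does not fit."""
    table = []
    for line in lines:
        event = json.loads(line)
        if REQUIRED_FIELDS.issubset(event):
            table.append({"user": event["user"], "action": event["action"]})
    return table


def lake_load(lines: list[str]) -> list[str]:
    """Schema-on-read: store every line untouched."""
    return list(lines)


def lake_query_temps(stored: list[str]) -> list[float]:
    """Apply a schema only when reading, for one specific question."""
    events = (json.loads(line) for line in stored)
    return [e["temp_c"] for e in events if "temp_c" in e]


warehouse_table = warehouse_load(raw_events)
lake_files = lake_load(raw_events)
print(len(warehouse_table), len(lake_files))
print(lake_query_temps(lake_files))
```

The warehouse path silently loses the sensor reading because it doesn't fit the table. The lake keeps everything, so the temperature question can still be answered later.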

Data lakes and data warehouses are typically used in tandem. Data lakes act as a catch-all system for new data, and data warehouses apply downstream structure to specific data from this system. The problem is, coordinating these systems to provide reliable data can be costly in both time and resources.

So like Rocky Balboa and Apollo Creed in the third Rocky movie, the data lake and data warehouse inevitably joined forces — giving us a best-of-both-worlds data repository, the Data Lakehouse.

The data lakehouse merged the best aspects of the data lake and the data warehouse. That is, the ability to quickly land data in its original format in cheap, scalable storage, while providing the data structure of a data warehouse.

Lakehouses pair data structures similar to a warehouse's with the object storage component of data lakes. This gives companies the ability to access trusted big data quickly. Lakehouses also support structured, semi-structured, and unstructured data. This allows users to tackle both BI and complex data science or machine learning use cases.

Data lakehouses are somewhat similar to data lakes, at least at the start. Typically, however, data within a lakehouse will be converted to a format like Delta Lake, which is an open-source storage layer that brings reliability, metadata management, and ACID transaction functionality (like a data warehouse) to a data lake.

The Delta Lake framework is a bit out of scope for this crash course, but for those who want to learn more, check out their website: delta.io

Why should you care?

Because 70% of enterprises say that the majority of all analytics workloads will be on their data lakehouse within three years. These same organizations project a 75% cost savings with the lakehouse architecture versus their current data repository architectures.

It's not some fancy new architecture that makes data lakehouses all the rage… it's the benefits this architecture brings:

Drastic cost reduction — By utilizing lower-cost Cloud Object Storage, operational costs are drastically lower than data warehouses.
Scales better — With warehouses, compute and storage are coupled together. Since lakehouses decouple these two, folks can access the same storage while using their own compute.
Real-time support — With the continued rise of streaming and real-time ingestion, this is huge for enterprises.
Improved governance — Normal data lakes lack governance. But with lakehouses, ingested data can meet defined schema requirements (eliminating data quality issues and data swamps).
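
The governance point is easier to picture with a small example of schema enforcement at ingestion time. This is a plain-Python sketch under my own assumptions (the `TABLE_SCHEMA`, `conforms`, and `ingest` names are hypothetical, and this is not Delta Lake's API): records that don't match the declared schema are turned away instead of quietly polluting the table.

```python
# Hypothetical table schema: column name -> required Python type.
TABLE_SCHEMA = {"customer_id": int, "plan": str, "monthly_fee": float}


def conforms(record: dict, schema: dict) -> bool:
    """True when the record has exactly the schema's columns with the right types."""
    if record.keys() != schema.keys():
        return False
    return all(isinstance(record[col], typ) for col, typ in schema.items())


def ingest(records: list[dict], schema: dict) -> tuple[list[dict], list[dict]]:
    """Split incoming records into accepted rows and rejected rows."""
    accepted, rejected = [], []
    for record in records:
        (accepted if conforms(record, schema) else rejected).append(record)
    return accepted, rejected


incoming = [
    {"customer_id": 1, "plan": "pro", "monthly_fee": 49.0},
    {"customer_id": "2", "plan": "basic", "monthly_fee": 9.0},  # wrong type for customer_id
    {"customer_id": 3, "plan": "pro"},                          # missing monthly_fee
]
accepted, rejected = ingest(incoming, TABLE_SCHEMA)
print(len(accepted), len(rejected))
```

A plain data lake would have stored all three records as-is. The check at the door is what keeps a lake from turning into a swamp.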

Data Lakehouses yield cost-efficiency AND are easier to use. What a win-win!

In summary — while databases, data warehouses, and data lakes offer businesses a ton of value and will remain in use… the Data Lakehouse is the data repository of the future.

A Simple Framework for Enterprise AI

Wed, 20 Dec 2023 00:00:00 GMT

Picture this:

It's Monday morning, you arrive at work, and you get an email with a list of customers that are set to churn from your business this week. You send them a marketing promotion and… BAM, they renew for another year instead of churning!

This used to be fantasy. With AI, it's quickly becoming a reality. Businesses can take their raw data and turn it into unprecedented insights to get ahead of their competition.

But let's be real — it's difficult to do!

Enter the AI Ladder: a four-step framework to get from messy, siloed data to AI-powered business insights across your company. Each step of the AI ladder involves data. So before we jump into this framework, let's quickly answer the question — what exactly is data in the age of AI?

What is Data?

We know data as the information that flows through our digital world. There are various types of data, which fall into three categories:

Structured — neatly organized into tables and rows, like a spreadsheet
Semi-structured — think web documents, JSON files, etc.
Unstructured — like the freeform text you find in emails or social media posts
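
Here is what those three categories look like when you actually load them in Python. The sample values are made up; the point is how much structure you get for free in each case.

```python
import csv
import io
import json

# Structured: fixed columns, every row has the same shape.
structured = "customer_id,plan\n1,pro\n2,basic\n"
rows = list(csv.DictReader(io.StringIO(structured)))

# Semi-structured: self-describing keys, but the shape can vary and nest.
semi_structured = '{"customer_id": 1, "tags": ["vip"], "address": {"city": "Toronto"}}'
document = json.loads(semi_structured)

# Unstructured: free text with no fields at all; we have to derive structure.
unstructured = "Loving the new plan, but support took two days to reply."
mentions_support = "support" in unstructured.lower()

print(rows[1]["plan"])
print(document["address"]["city"])
print(mentions_support)
```

The further down the list you go, the more work it takes to turn the data into something a query or a model can use.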

People tend to think of data as just the structured kind. That is far from true! The companies that will get ahead are those that utilize ALL available data.

Okay, so you have your data. It's in its raw format, siloed and scattered. How do we get from raw, messy data to actionable business insights (real AI)? Enter our framework — the AI Ladder.

The four rungs of the AI ladder, which we will go through one by one, are:

Collect your data
Organize that data
Analyze the data (this is the AI part)
Infuse the results throughout your organization

Disclaimer — I did not invent this framework. This is IBM's AI framework that has resulted in successful AI projects at thousands of businesses worldwide.

Collect

Data is the lifeblood of AI. If a business wants to predict churn, learn more about their customers and industry, or understand which trends to invest in, it all starts with their data. Data is harvested from various sources, like:

Databases storing customer information
Social media posts about your company
Enterprise Resource Planning (ERP) systems
Customer Relationship Management (CRM) systems

Each piece of data is a clue, a tiny fragment of the puzzle that AI systems aim to solve.

Collecting relevant data from all available sources is the first step toward AI. But collecting data isn't enough. We have to make sure our data is high quality and accessible before it becomes useful.

Organize

Data, in its natural state, can be messy and unruly. Cleaning and preparing data is the artisanal craft of data scientists, ensuring that AI algorithms can work their magic effectively.

Businesses have to make sure that their data is:

Protected
Accessible
High quality
Trustworthy
Traceable (lineage)

If your data is lacking in any of these departments, you're just wasting your time. You can have the most talented data scientists in the world working with the most modern machine learning algorithms available… but if the data is bad? It's all a waste.

Garbage in = garbage out.
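
A first pass at catching garbage can be very simple. The sketch below assumes a hypothetical list of customer records and a made-up `quality_report` helper; it only counts duplicate keys and missing required values, which is a small slice of what real data quality tooling covers.

```python
def quality_report(records: list[dict], key: str, required: list[str]) -> dict:
    """Count duplicate keys and records with missing required values."""
    seen, duplicates, incomplete = set(), 0, 0
    for record in records:
        if record.get(key) in seen:
            duplicates += 1
        seen.add(record.get(key))
        if any(record.get(field) in (None, "") for field in required):
            incomplete += 1
    return {"total": len(records), "duplicates": duplicates, "incomplete": incomplete}


customers = [
    {"id": 1, "email": "ada@example.com", "plan": "pro"},
    {"id": 2, "email": "", "plan": "basic"},                  # missing email
    {"id": 2, "email": "grace@example.com", "plan": None},    # duplicate id, missing plan
]
report = quality_report(customers, key="id", required=["email", "plan"])
print(report)
```

Even a check this basic would flag two of the three records before they ever reach a model.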

Tools like data catalogs, data warehouses, and data integration tools are helpful when it comes to creating a trusted, accessible, business-ready data foundation.

Once data is organized and accessible, the real fun begins.

Analyze

Now it's time for the magic. Here is where we apply AI to our data to give us insights and answers we couldn't find ourselves. The thing is, AI has turned into a catch-all term… but what is it really?

Summed up: Artificial Intelligence is when machines have the ability to process information like humans.

When most people think of AI, they think of SkyNet and the Terminator.

But you aren't most people. You are a technologist! You likely think of AI as ChatGPT and the Transformer architecture (a type of advanced AI model used for understanding and generating human language).

In reality, AI is more of a concept. The Transformer is simply a deep learning architecture that has become popular in recent years. ChatGPT, the Transformer architecture, and most of data science can be more accurately described as Machine Learning.

Machine learning is a branch of AI which focuses on using data and algorithms to imitate the way humans learn. These models are typically built with frameworks like TensorFlow and PyTorch.

Some great examples of machine learning in action are the Netflix recommendation engine and self-driving cars.

Using data and statistics, algorithms are trained to produce insights and predictions about the subject at hand. For us at home, it's the right Netflix movie. For enterprises? It's scenarios like:

Who is the next customer to churn from my business?
How much should we budget for advertising in the coming fiscal year?
How much should we charge for our products to meet next year's revenue targets?
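
To show what "trained to predict" means for the churn question, here is a toy logistic regression written in plain Python. Everything in it is an assumption for illustration: the two features, the eight made-up customers, and the learning rate. In practice you would reach for a library and far more data, but the mechanics are the same: start with zero weights, compare predictions to known outcomes, and nudge the weights to shrink the error.

```python
import math

# Toy training data (made up): [months_inactive, support_tickets] -> churned?
features = [[0, 0], [1, 0], [0, 1], [1, 1], [4, 3], [5, 2], [6, 4], [5, 5]]
churned = [0, 0, 0, 0, 1, 1, 1, 1]


def sigmoid(z: float) -> float:
    """Squash any number into the 0..1 range so it reads as a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def predict(weights: list[float], bias: float, x: list[float]) -> float:
    """Estimated churn probability for one customer."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


def train(xs, ys, learning_rate=0.1, epochs=2000):
    """Fit logistic regression with plain batch gradient descent."""
    weights, bias = [0.0] * len(xs[0]), 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(weights), 0.0
        for x, y in zip(xs, ys):
            error = predict(weights, bias, x) - y
            for j, xj in enumerate(x):
                grad_w[j] += error * xj
            grad_b += error
        # Step against the average gradient to reduce the prediction error.
        weights = [w - learning_rate * g / len(xs) for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / len(xs)
    return weights, bias


weights, bias = train(features, churned)
engaged = predict(weights, bias, [0, 0])   # active customer, no tickets
at_risk = predict(weights, bias, [6, 3])   # inactive customer, several tickets
print(round(engaged, 2), round(at_risk, 2))
```

The model ends up assigning a low churn probability to the engaged customer and a high one to the inactive customer, which is exactly the kind of ranked list you'd want in that Monday morning email.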

These insights subsequently drive decision-making within different segments of the business, ideally impacting key growth metrics. All because of the data!

Infuse

The journey up the AI Ladder reaches its pinnacle in the "Infuse" stage — where the true value of your efforts comes to life.

Imagine this: data is not just analyzed; it's woven into the very fabric of your enterprise, driving innovation and efficiency at every level.

From enhancing customer experiences to streamlining operations, from bolstering risk management to revolutionizing financial strategies — infusion is the critical leap from potential to reality. Infusing results ensures that the insights gathered from your data don't just remain a theoretical exercise.

Summary

By following the AI Ladder — collecting and organizing your data, analyzing your trusted data using Machine Learning and AI, and infusing the insights across your business…

You and your business don't just adapt to the future; you actively create it.

In the game of business, you either adapt or get left behind. When it comes to data, where does your business stand?