Client-Side Compute: A Greener Approach to Natural Language Data Queries

By yodobrin

Introduction

Using natural language to interact with data can significantly enhance our ability to work with and understand information, making data more accessible and useful for everyone. Given the latest advances in large language models (LLMs), they seem like the obvious solution. However, while we have made strides in interacting with unstructured data using NLP and AI, structured data interaction still poses challenges. Using LLMs to convert natural language into domain-specific languages like SQL is a common and valid use case, showcasing a strong capability of these models. This blog identifies the limitations of current solutions and introduces novel, energy-efficient approaches that add flexibility while reducing cost and server load.

My team focuses on ISVs and how each design decision impacts them. For example, an ISV that wants to offer "chat with data" as a feature must also address the challenges of hosting, monetizing, and securing it. We present two key strategies:

  • Leveraging deterministic tools to execute the domain-specific language on the appropriate systems, and
  • Offloading compute to client devices.

These strategies not only improve performance and scalability but also reduce server load, making them ideal for ISVs looking to provide seamless and sustainable data access to their customers.

The Challenge: Efficiently Interacting with Structured Data

Structured data, typically stored in databases, structured files, and spreadsheets, is the backbone of business intelligence and analytics. However, querying and extracting insights from this data often requires knowledge of specific query languages like SQL, creating a barrier for many users. ISVs also face the challenge of anticipating the diverse ways their customers want to interact with their data. As customers increasingly demand natural language interfaces that make data access simple and intuitive, ISVs are under pressure to develop solutions that bridge the gap between users and the structured data they need.

While using LLMs to convert natural language queries into domain-specific languages such as SQL is a powerful capability, it alone doesn't solve the problem. The next step is to execute these queries efficiently on the appropriate systems. Implementing such a solution must include several fundamental guardrails to ensure the generated SQL is safe to execute. Moreover, there is the additional challenge of managing the computational load. Hosting these capabilities on ISV servers can be resource-intensive and costly.
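
To make such a guardrail concrete, here is a minimal sketch (TypeScript) of a conservative pre-execution check: it accepts only a single read-only SELECT statement and rejects anything containing mutating or escape keywords. The function name and keyword list are illustrative assumptions; a production guardrail would also use a real SQL parser and an allow-list of tables.

```typescript
// Keywords that must never appear in a query we are willing to execute.
const FORBIDDEN =
  /\b(INSERT|UPDATE|DELETE|MERGE|DROP|ALTER|CREATE|TRUNCATE|GRANT|ATTACH|PRAGMA|EXEC)\b/i;

// Conservative check: accept only a single read-only SELECT statement.
// It may reject some legitimate queries (e.g. a string literal containing a
// forbidden word), which is the safe failure mode for a guardrail.
export function isSafeSelect(sqlText: string): boolean {
  const trimmed = sqlText.trim().replace(/;\s*$/, ""); // drop one trailing semicolon
  if (trimmed.includes(";")) return false;       // no statement chaining
  if (!/^SELECT\b/i.test(trimmed)) return false; // must start with SELECT
  return !FORBIDDEN.test(trimmed);               // no mutating or escape keywords
}

// Usage: gate the LLM output before it ever reaches the execution layer.
// if (!isSafeSelect(generatedSql)) throw new Error("Query rejected by guardrail");
```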

Therefore, an effective solution must not only translate natural language into executable queries but also optimize how these queries are processed. This involves leveraging deterministic tools to execute domain-specific languages and offloading compute tasks to client devices. By doing so, ISVs can provide more efficient, scalable, and cost-effective data interaction solutions to their customers.

A Common Use Case

An ISV collects data from various sources, some public and most from its customers (or tenants). These tenants could come from various industries such as retail, healthcare, and finance, each requiring tailored data solutions. The ISV implements a medallion pattern for data ingestion, a design pattern that organizes data into layers (bronze, silver, and gold) to ensure data quality and accessibility. In this pattern, raw data is ingested into the bronze layer, cleaned and enriched into the silver layer, and then aggregated into the gold layer for analysis. The gold tables, containing the aggregated data, are generally smaller than 20MB per tenant.

The data ingestion pipeline runs periodically, populating the gold tables hosted on Azure SQL Database. Data isolation is managed using row-level security or multiple schemas, tailored to the ISV's requirements. The next step for the ISV is to provide access for its tenants to the data through a web application, leveraging homegrown dashboards and reporting capabilities. Often, these ISVs are small companies that do not have the resources to implement a full Business Continuity and Disaster Recovery (BCDR) approach or afford paid tools like Power BI, and thus rely on homegrown or free packages.
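
To illustrate the row-level security option, the hedged sketch below (TypeScript, using the mssql package) sets the tenant ID in SESSION_CONTEXT and runs the tenant's query in the same batch, so an existing RLS filter predicate keyed on SESSION_CONTEXT(N'TenantId') returns only that tenant's rows. The table name, connection-string variable, and the policy itself are assumptions for illustration, not details from the original setup.

```typescript
import sql from "mssql";

// Runs a tenant-scoped query against the shared gold tables. Assumes an RLS
// security policy whose filter predicate reads SESSION_CONTEXT(N'TenantId');
// dbo.GoldMetrics and SQL_CONNECTION_STRING are hypothetical names.
export async function queryGoldForTenant(tenantId: number) {
  const pool = await sql.connect(process.env.SQL_CONNECTION_STRING!);
  // Set the session context and query in one batch so both statements run on
  // the same pooled connection; RLS then filters the rows to this tenant.
  const result = await pool
    .request()
    .input("tid", sql.Int, tenantId)
    .query(
      `EXEC sp_set_session_context @key = N'TenantId', @value = @tid;
       SELECT * FROM dbo.GoldMetrics;`
    );
  return result.recordset;
}
```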

Despite having a robust infrastructure, the ISV faces several challenges:

  • Complex Query Language: Users often struggle with the complexity of SQL or other query languages required to extract insights from the data. This creates a barrier to effective data utilization.
  • Performance and Scalability: The server load increases significantly with complex queries, especially when multiple tenants access the data simultaneously. This can lead to performance bottlenecks and scalability issues.
  • Cost and Resource Management: Hosting the necessary computational resources to handle data queries on the ISV’s servers is resource-intensive and costly. This includes maintaining high-performance databases and application servers.
  • User Experience: Customers increasingly demand the ability to interact with their data using natural language, expecting a seamless and intuitive user experience.

For more detailed information on the medallion pattern, you can refer to this link.

[Architecture diagram: current setup]

The architecture diagram above illustrates the current setup:

  • Data Sources: Public sources and tenant data are ingested into the system.
  • Storage: The data lake (or lakehouse) processes the data from multiple sources, performs cleansing, and periodically stores the results in the gold tables.
  • Orchestrator: ELT/ETL is orchestrated using Microsoft Fabric, Azure Synapse, or Azure Data Factory pipelines.
  • Serving: The web application is hosted on Azure App Service, and the data is queried from Azure SQL Database.
  • Visualize: Data is reported using Power BI or other reporting tools, including homegrown dashboards.

Enhanced Approach: Energy-Efficient Data Interaction

To address the challenges mentioned earlier, the ISV can adopt the following strategies:

  • Leveraging Deterministic Tools for Query Execution:
    • Translation: Utilize LLMs to convert natural language queries into SQL.
    • Execution: Create a sandbox environment for each customer's data. This sandbox is hosted on lower-cost storage, such as a storage container per customer, which contains a snapshot of the data they can interact with.
    • Data Management: The same data ingestion pipeline that updates the gold table in Azure SQL is adapted to update a customer-specific data set stored in their respective storage container. The idea is to use SQLite to store the customer-specific data, ensuring it is lightweight and portable.
    • Benefits:
      • Efficiency and Security: Ensures that queries are executed efficiently and securely, leveraging the robust capabilities of SQL databases while minimizing risks. By isolating each customer's data in a sandbox, the need for sophisticated guardrails against bad queries and overloading the reporting database is significantly reduced.
      • Cost & Energy Savings: No need to manage or host a dedicated reporting database. Since the customer-specific data is hosted on Azure storage containers, the ISV avoids the costs and energy consumption associated with maintaining high-performance database infrastructure.
      • Scalability and Reliability: The ISV does not need to plan for the worst-case scenario of all customers running queries simultaneously, which could impact the health of a centralized reporting database. Each customer's queries are isolated to their data, ensuring system stability and performance.
  • Offloading Compute to Client Devices:
    • Data Transmission: The client-side application ensures it has the current data snapshot available for the customer to work with. For example, it can check the data's timestamp (or an ETag) to verify whether the local copy is up to date, downloading the latest version only when necessary. The snapshot is encapsulated in a portable format such as JSON, SQLite, or Parquet.
    • Local Processing: The client-side application processes the data locally using the translated SQL queries (a sketch of both steps follows this list).
    • Benefits:
      • Performance: Reduces server load, enhances scalability, and provides faster query responses by utilizing the client’s computational resources.
      • Cost & Energy Savings: Significant cost savings by reducing the need for high-performance server infrastructure. Hosting a static website and leveraging client devices' processing power also reduces overall energy consumption.
      • Flexibility: Ensures that customers always work with the most current data without the need for constant server communication.
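
To make the client-side flow concrete, here is a minimal browser sketch of both steps, assuming the snapshot is a SQLite file served from a per-tenant blob container (with SAS access and CORS configured) and queried locally with the sql.js library. The URL and cache variables are placeholders.

```typescript
import initSqlJs from "sql.js";
import type { Database } from "sql.js";

// Hypothetical per-tenant snapshot URL (e.g. with a SAS token appended).
const SNAPSHOT_URL = "https://<account>.blob.core.windows.net/tenant-123/gold.sqlite";

let cachedStamp: string | null = null;
let cachedDb: Database | null = null;

// Refresh the local snapshot only when the blob's Last-Modified header
// differs from the copy we already hold in memory.
async function getDatabase(): Promise<Database> {
  const head = await fetch(SNAPSHOT_URL, { method: "HEAD" });
  const remoteStamp = head.headers.get("Last-Modified") ?? "";
  if (cachedDb && cachedStamp === remoteStamp) return cachedDb;

  const res = await fetch(SNAPSHOT_URL);
  const bytes = new Uint8Array(await res.arrayBuffer());
  const SQL = await initSqlJs({ locateFile: (f) => `https://sql.js.org/dist/${f}` });
  cachedDb?.close();
  cachedDb = new SQL.Database(bytes);
  cachedStamp = remoteStamp;
  return cachedDb;
}

// Run the LLM-translated (and guardrail-checked) SQL entirely on-device.
export async function runLocalQuery(sqlText: string) {
  const db = await getDatabase();
  return db.exec(sqlText); // array of { columns, values } result sets
}
```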

[Architecture diagram: revised setup]

Revised Architecture

  • Data Sources: Public sources and tenant data are ingested into the system.
  • Storage: The data lake (or lakehouse) processes the data from multiple sources, performs cleansing, and stores the results in customer-specific containers (a sketch of this publishing step appears below). This enhances security and isolation.
  • Orchestrator: ELT/ETL is orchestrated using Microsoft Fabric, Azure Synapse, or Azure Data Factory pipelines.

The above components are hosted in the ISV's infrastructure.
The client-side web application pulls the data from the customer-specific containers and processes it locally. Please visit our Azure OpenAI .NET Starter Kit for further reading; focus on the 07_ChatWithJson and 08_ChatWithData notebooks.
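
For illustration, here is a hedged Node.js sketch of the snapshot-publishing step: it materializes one tenant's gold rows as a SQLite file with better-sqlite3 and uploads it to that tenant's container with @azure/storage-blob. All table, container, and variable names are hypothetical, and in practice this step could equally be a Fabric, Synapse, or Data Factory pipeline activity.

```typescript
import Database from "better-sqlite3";
import { BlobServiceClient } from "@azure/storage-blob";

// Hypothetical row shape for a tenant's gold-level aggregates.
interface GoldRow {
  day: string;
  revenue: number;
}

export async function publishSnapshot(tenantId: string, rows: GoldRow[]): Promise<void> {
  const path = `/tmp/${tenantId}-gold.sqlite`;

  // 1. Materialize the gold data as a small, portable SQLite file.
  const db = new Database(path);
  db.exec("DROP TABLE IF EXISTS gold_metrics; CREATE TABLE gold_metrics (day TEXT, revenue REAL);");
  const insert = db.prepare("INSERT INTO gold_metrics (day, revenue) VALUES (?, ?)");
  const insertAll = db.transaction((items: GoldRow[]) => {
    for (const r of items) insert.run(r.day, r.revenue);
  });
  insertAll(rows);
  db.close();

  // 2. Upload the file to the tenant-specific, low-cost storage container.
  const service = BlobServiceClient.fromConnectionString(process.env.STORAGE_CONNECTION_STRING!);
  const container = service.getContainerClient(`tenant-${tenantId}`);
  await container.createIfNotExists();
  await container.getBlockBlobClient("gold.sqlite").uploadFile(path);
}
```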

Why This Approach?

  • Efficiency: Data queries are executed locally, reducing the load on the server and improving performance.
  • Security: Data is securely isolated within a client-side sandbox, ensuring customers can only query what is provided.
  • Cost & Energy Saving: Hosting a static website is significantly cheaper and more energy-efficient than hosting a web application with a database. This approach leverages the processing power of client devices, further reducing infrastructure costs and energy consumption.
  • Scalability: By isolating each customer's data in a sandbox, the ISV does not need to worry about the impact of simultaneous queries on a centralized database, ensuring system reliability and scalability.
  • Flexibility: Ensures that customers always have access to the most current data without the need for constant server communication.

Potential Downsides and Pitfalls

  • Client-Side Performance Variability: The approach relies on the computational power of client devices, which varies widely; large snapshots or complex queries may perform poorly on low-end hardware.
  • Data Synchronization: Ensuring that the local data snapshot on client devices is up-to-date can be challenging. Delays in synchronization could lead to users working with outdated data; conditional requests (sketched below) keep the freshness check cheap.
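
One inexpensive mitigation, sketched below: Azure blobs carry an ETag, so the client can send it back as If-None-Match and receive 304 Not Modified when the snapshot is unchanged, re-downloading only on real updates. The helper is an illustrative assumption, not part of the original design.

```typescript
// Conditional download: returns fresh bytes plus the new ETag, or null bytes
// when the server answers 304 Not Modified. The caller persists the ETag
// (e.g. in localStorage) between sessions.
export async function fetchIfChanged(
  url: string,
  cachedEtag: string | null
): Promise<{ bytes: Uint8Array | null; etag: string | null }> {
  const res = await fetch(url, {
    headers: cachedEtag ? { "If-None-Match": cachedEtag } : {},
  });
  if (res.status === 304) {
    return { bytes: null, etag: cachedEtag }; // local snapshot is still current
  }
  return { bytes: new Uint8Array(await res.arrayBuffer()), etag: res.headers.get("ETag") };
}
```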

Conclusion

By adopting these strategies, ISVs can provide a more efficient, scalable, and cost-effective solution for natural language querying of structured data. Leveraging deterministic tools for executing domain-specific languages within isolated sandboxes ensures robust and secure query execution. Offloading compute to client devices not only reduces server load but also enhances performance and scalability, providing a seamless and intuitive user experience.
