Song_Minseok
Teach ChatGPT to Answer Questions Based on PDF content: Using Azure Cognitive Search and Azure OpenAI
Can't I just copy and paste text from a PDF file to teach ChatGPT?
This tutorial explains how to efficiently extract and use information from a large number of PDFs. Dealing with a 5-page PDF can be straightforward, but it's a different story when you're dealing with complex documents of 100+ pages. In these situations, integrating Azure Cognitive Search with Azure OpenAI enables fast and accurate information retrieval and processing. In this tutorial we handle 5 PDFs, but the same method can scale to more than 10,000 files.
In this two-part series, we will explore how to build an intelligent service using Azure. In Series 1, we'll use Azure Cognitive Search to extract key phrases from unstructured data stored in Azure Blob Storage. In Series 2, we'll create a feature that answers questions based on PDF documents using Azure OpenAI. Here is an overview of this tutorial.
This tutorial is related to the following topics
- AI Engineer
- Developer
- Azure Blob Storage
- Azure Cognitive Search
- Azure OpenAI
Learning objectives
In this tutorial, you'll learn the following:
- How to store your unstructured data in Azure Blob Storage.
- How to create search experiences based on data stored in Blob Storage with Azure Cognitive Search.
- How to teach ChatGPT to answer questions based on your PDF content using Azure Cognitive Search and Azure OpenAI.
Prerequisites
- Azure subscription
- Visual Studio Code
Microsoft Cloud Technologies used in this Tutorial
- Azure Blob Storage
- Azure Cognitive Search
- Azure OpenAI Service
Table of Contents
Series 1: Extract Key Phrases for Search Queries Using Azure Cognitive Search
1. Create a Blob Container
2. Store PDF Documents in Azure Blob Storage
3. Create a Cognitive Search Service
4. Connect to Data from Azure Blob Storage
5. Add Cognitive Skills
6. Customize Target Index and Create an Indexer
7. Extract Key Phrases for Search Queries Using Azure Cognitive Search
Series 2: Implement a ChatGPT Service with Azure OpenAI
1. Change your indexer settings to use Azure OpenAI
2. Create an Azure OpenAI
3. Set up the project and install the libraries
4. Set up the project in VS Code
5. Search with Azure Cognitive Search
6. Get answers from PDF content using Azure OpenAI and Cognitive Search
7. Note: Full code for example.py and config.py
Series 1: Extract Key Phrases for Search Queries Using Azure Cognitive Search
1. Create a Blob Container
Azure Blob Storage is a service designed for storing large amounts of unstructured data, such as PDFs.
1. To begin, create an Azure Storage account by typing `storage` in the search bar and selecting Services - Storage accounts.
2. Select the +Create button.
3. Enter the resource group name that will serve as the folder for the storage account, enter the storage account name, and select a region. When you're done, click the Next button and you can continue with the defaults.
2. Store PDF Documents in Azure Blob Storage
1. After your storage account is set up, navigate to Storage Browser by typing `storage browser` in the search bar.
2. Add a new container to store PDF documents.
- Select your storage account.
- Select the Blob containers button.
- Select the +Add Container button to create a new container.
3. Once the container is set up, upload your PDFs into this container.
- Select the container you created.
- Select the Upload button and upload your PDF documents.
For this tutorial, I downloaded 5 PDF documents of recent papers on GPT from Microsoft Academic and uploaded them to the container.
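Before uploading, it can help to check which local files will end up in the container and what their blob URLs will look like. The following is a minimal sketch using only the Python standard library. The account name `yourstorageaccount`, the container name `testcontainer`, and the helper `planned_blob_urls` are illustrative assumptions, and the code does not upload anything.

```python
from pathlib import Path
from urllib.parse import quote


def planned_blob_urls(folder, account="yourstorageaccount", container="testcontainer"):
    """Return {file name: expected blob URL} for every PDF directly inside `folder`.

    The account and container names are placeholders (assumptions). The URL
    follows the https://<account>.blob.core.windows.net/<container>/<blob> form.
    """
    base = f"https://{account}.blob.core.windows.net/{container}/"
    urls = {}
    for path in sorted(Path(folder).iterdir()):
        # Only regular files with a .pdf extension (case-insensitive) are considered.
        if path.is_file() and path.suffix.lower() == ".pdf":
            # Percent-encode characters such as spaces so the URL is valid.
            urls[path.name] = base + quote(path.name)
    return urls
```

Calling `planned_blob_urls("papers")` on a hypothetical local folder named `papers` would list only the PDF files in it, which is a quick way to confirm that nothing unexpected gets uploaded.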
3. Create a Cognitive Search Service
1. Type `cognitive service` into the search bar and select Services – Cognitive Search.
2. Select the +Create button.
3. Create a new Cognitive Search Service.
- Select your Resource Group.
- Specify the Service name.
- Select your Location.
NOTE:
Azure OpenAI resource is currently available in limited regions.
If Azure OpenAI resource is not available in your region, I recommend setting your location to East US.
- Choose a Pricing tier that suits your needs; since semantic search is available from the basic tier, I recommend setting your Pricing tier to basic for the tutorial.
NOTE:
In this tutorial we will use the Basic tier to explore semantic search with Azure Cognitive Search.
You can expect a cost of approximately $2.50 per 100 program runs with this tier.
If you plan to use the free plan, please note that the code demonstrated in this tutorial may differ from what you'll need.
- Select the Review + create button.
4. Navigate to the Cognitive Search service you created and select Semantic search (Preview), then select the Free plan. (If you chose the free pricing tier, you can skip this step.)
4. Connect to Data from Azure Blob Storage
1. Navigate to the Cognitive Search service you created and select Import data.
2. Select Azure Blob Storage as the data source and connect it to the Blob Storage where your PDFs are stored.
3. Specify your Data source name.
4. Select Choose an existing connection and select the blob storage container you created.
5. Select Next: Add cognitive skills button.
5. Add Cognitive Skills
1. To power your cognitive skills, select an existing AI Services resource or create a new one; the Free resource is sufficient for this tutorial.
2. Specify the Skillset name.
TIP:
If you want to search for text in a photo, you need to check Enable OCR and merge all text into merged_content field. In this tutorial, we will not check it because we will search based on the text in the paper.
3. Select Enrichment granularity level. (In this tutorial, we'll use a page-by-page granularity, so we'll select Pages (5000 characters chunk).)
4. Select Extract Key phrases. (You can select additional checkboxes depending on the type of PDF data.)
5. Select Next: Customize target index button.
NOTE:
Why set the Enrichment granularity level to Pages (5000 characters chunk)?
To get ChatGPT responses based on a PDF, we need to call the GPT-3.5-turbo model through the ChatGPT API. The GPT-3.5-turbo model can handle up to 4096 tokens, counting both the input text and the answer that the API returns. For this reason, documents that are too long cannot be sent all at once; they must be broken into multiple chunks and processed across multiple calls to the ChatGPT API. (Tokens can be words, punctuation, spaces, etc.)
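To make the token limit concrete, here is a rough sketch of splitting text into 5000-character chunks and estimating whether each chunk fits the model's context window. The 4-characters-per-token ratio is only a rule-of-thumb assumption for English text (the `tiktoken` library gives exact counts), the function names are hypothetical, and the actual splitting in this tutorial is done by the Cognitive Search skillset, not by this code.

```python
CONTEXT_LIMIT = 4096      # token limit of gpt-3.5-turbo stated above
CHARS_PER_TOKEN = 4       # rough rule of thumb for English text (assumption)


def split_into_chunks(text, chunk_size=5000):
    """Split text into consecutive chunks of at most chunk_size characters."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


def estimate_tokens(text):
    """Very rough token estimate: ceil(len(text) / CHARS_PER_TOKEN)."""
    return -(-len(text) // CHARS_PER_TOKEN)


def fits_in_context(text, answer_tokens=500):
    """Check that the text plus the reserved answer length stays within the limit."""
    return estimate_tokens(text) + answer_tokens <= CONTEXT_LIMIT


document = "x" * 20000                 # stand-in for text extracted from a PDF
chunks = split_into_chunks(document)
print([len(c) for c in chunks])        # [5000, 5000, 5000, 5000]
print(fits_in_context(document))       # False: about 5000 tokens plus the answer
print(all(fits_in_context(c) for c in chunks))  # True: about 1250 tokens each
```

Under this estimate, a single 5000-character chunk leaves room for a few chunks plus the answer in one request, while the whole document does not fit.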
6. Customize Target Index and Create an Indexer
1. Customize target Index.
- Specify your Index name.
- Check the boxes as shown in the image below.
TIP:
You can change the boxes to suit your data. To help you understand, I've written a description of each field in the index at the bottom of this page.
2. Add a new field.
In this tutorial, we have selected the Enrichment granularity level of Pages (5000 characters chunk). So, we need to create a field to search for pages that are separated by 5000 character chunks.
- Select + Add field button.
- Create a field named `pages`.
- Select Collection(Edm.String) as the type for the `pages` field.
- Check the box Retrievable.
3. Delete unnecessary fields.
4. Create an Indexer.
- Specify your Indexer Name.
- Select the Schedule – Once.
(For data that arrives continuously, you would set up a recurring schedule, but since we're dealing with unchanging PDF data in this tutorial, we only need to run the indexer once.)
- Select the Submit button.
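For reference, the `pages` field configured in the portal corresponds roughly to the following entry in the index definition JSON. This is a sketch: only the name, type, and Retrievable setting come from the steps above, the other attribute values are assumptions, and your portal-generated definition may differ.

```python
import json

# Approximate index-definition entry for the `pages` field created above.
# Only name, type, and retrievable come from the tutorial steps; the other
# attribute values are assumptions for illustration.
pages_field = {
    "name": "pages",
    "type": "Collection(Edm.String)",
    "retrievable": True,
    "searchable": False,
    "filterable": False,
    "sortable": False,
    "facetable": False,
}

print(json.dumps(pages_field, indent=2))
```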
TIP:
Description of each field in the index
These fields represent the attributes for each PDF document.
For example, suppose we have a PDF document named `example.pdf` stored in blob storage. If we checked Retrievable for metadata_storage_size, we would be able to search for and find the size of the PDF document `example.pdf`.
1. content (Edm.String)
This field indicates the actual content of the stored data.
2. metadata_storage_content_type (Edm.String)
This field indicates the type of content stored.
Ex) The metadata_storage_content_type of `example.pdf` is `pdf`
3. metadata_storage_size (Edm.Int64)
This field indicates the size of the stored data. The size information is stored as an integer.
Ex) The metadata_storage_size of `example.pdf` is `487743` (bytes).
4. metadata_storage_last_modified (Edm.DateTimeOffset)
This field indicates the most recent modification date and time of the stored data.
Ex) The metadata_storage_last_modified of `example.pdf` is `2023-10-06T18:45:32+00:00`.
5. metadata_storage_content_md5 (Edm.String)
This field indicates a checksum value for the data, which is used to validate the integrity of the content during transmission or storage. The MD5 hash value is represented as a string of alphanumeric characters.
Ex) The metadata_storage_content_md5 of `example.pdf` is `d41d8cd98f00b204e9800998ecf8427e`.
6. metadata_storage_name (Edm.String)
This field indicates a file name stored in blob storage.
Ex) The metadata_storage_name of `example.pdf` is `example.pdf`.
7. metadata_storage_path (Edm.String)
This field indicates the storage path where the data file or object resides within the Azure storage architecture.
Ex) The metadata_storage_path of `example.pdf` is `https://yourstorageaccount.blob.core.windows.net/testcontainer/example.pdf`.
8. metadata_storage_file_extension (Edm.String)
This field indicates the file extension.
Ex) The metadata_storage_file_extension of `example.pdf` is `.pdf`.
9. metadata_content_type (Edm.String)
This field indicates the nature of the internal content, such as whether it is text, HTML, JSON, etc.
Ex) The metadata_content_type of `example.pdf` is `text`.
10. metadata_language (Edm.String)
This field indicates the language in which the content is written, facilitating language-specific processing and searching.
Ex) The metadata_language of `example.pdf` is `EN`.
11. metadata_creation_date (Edm.DateTimeOffset)
This field indicates the date and time when the data was originally created.
Ex) The metadata_creation_date of `example.pdf` is `2023-09-30T14:32:10+00:00`.
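As an illustration of how these fields might be consumed in code, the sketch below parses a hand-written result dictionary that reuses the example values above. The dictionary is made up for illustration (it is not real service output), and `summarize_document` is a hypothetical helper.

```python
from datetime import datetime

# Hand-written sample reusing the example values above (not real service output).
sample_result = {
    "metadata_storage_name": "example.pdf",
    "metadata_storage_size": 487743,
    "metadata_storage_last_modified": "2023-10-06T18:45:32+00:00",
    "metadata_storage_file_extension": ".pdf",
}


def summarize_document(result):
    """Build a one-line summary from the metadata fields of a search result."""
    size_kib = result["metadata_storage_size"] / 1024
    modified = datetime.fromisoformat(result["metadata_storage_last_modified"])
    return "{} ({:.1f} KiB, modified {})".format(
        result["metadata_storage_name"], size_kib, modified.date().isoformat())


print(summarize_document(sample_result))
# example.pdf (476.3 KiB, modified 2023-10-06)
```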
7. Extract Key Phrases for Search Queries Using Azure Cognitive Search
1. Once your indexer and index creation are complete, navigate to your Cognitive Search service and select the Indexes page.
2. Select the index you created.
3. You can use a query string or simply enter text to perform a search.
Ex) In this tutorial, I entered the following question: `How to prompt GPT to be reliable?`
4. Set Semantic configurations.
- Semantic configurations are available from the basic price tier onwards. If you chose the free tier, you can skip it.
- Select Semantic configurations, then select + Add semantic configuration.
- Specify your semantic configuration Name.
- Select the Title field – content.
- Select the Save button.
- When you've finished setting up your semantic configuration, return and select the Save button.
We have now extracted key phrases and run search queries using Azure Cognitive Search.
In the next series, we'll connect this Cognitive Search service with Azure OpenAI to build a ChatGPT service that answers questions based on PDFs stored in Blob Storage.
Series 2: Implement a ChatGPT Service with Azure OpenAI
1. Change your indexer settings to use Azure OpenAI
1. Navigate to your Cognitive Search service and select the Indexers page.
2. Select the indexer you created.
3. Select the Indexer Definition (JSON).
4. In the JSON, modify the "outputFieldMappings" part as shown below.
"outputFieldMappings": [
  {
    "sourceFieldName": "/document/content/pages/*/keyphrases/*",
    "targetFieldName": "keyphrases"
  },
  {
    "sourceFieldName": "/document/content/pages/*",
    "targetFieldName": "pages"
  }
]
5. Select the Save button.
6. Select the Reset button.
7. Select the Run button.
TIP:
Description of “outputFieldMappings”
"outputFieldMappings" are settings that map data processed by the Cognitive Search service to specific fields in the index.
For example, in the path "/document/content/pages/*/keyphrases/*", key phrases are extracted from each page and mapped to the "keyphrases" field.
Similarly, for the "pages" field that we created earlier, we need to specify what data will be mapped to this field. In this tutorial, we have selected the Enrichment granularity level of Pages (5000 characters chunk). So, we need to specify that the 5000-character chunks from "/document/content/pages/*" are mapped to the "pages" field. We add this JSON mapping so that we can send chunks of 5000 characters to OpenAI instead of sending entire documents.
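To build intuition for how a source path with `*` wildcards selects values, here is a toy resolver over a nested dictionary. It only mimics the idea: the enriched-document structure shown and the `resolve_path` function are simplified assumptions, not the service's actual implementation.

```python
def resolve_path(node, path):
    """Collect all values matching a slash-separated path; '*' iterates over a list."""
    parts = [p for p in path.split("/") if p]
    current = [node]
    for part in parts:
        next_level = []
        for item in current:
            if part == "*":
                # A wildcard expands every element of a list.
                if isinstance(item, list):
                    next_level.extend(item)
            elif isinstance(item, dict) and part in item:
                next_level.append(item[part])
        current = next_level
    return current


# Simplified stand-in for an enriched document (assumed structure).
enriched = {
    "document": {
        "content": {
            "pages": [
                {"text": "first 5000-character chunk", "keyphrases": ["GPT-3", "prompting"]},
                {"text": "second 5000-character chunk", "keyphrases": ["calibration"]},
            ]
        }
    }
}

print(resolve_path(enriched, "/document/content/pages/*/keyphrases/*"))
# ['GPT-3', 'prompting', 'calibration']
print(len(resolve_path(enriched, "/document/content/pages/*")))
# 2
```

The first path flattens the key phrases of every page into one collection, which is why the target "keyphrases" field holds a list of strings; the second path yields one entry per chunk.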
2. Create an Azure OpenAI
Currently, access to the Azure OpenAI service is granted by request only. You can request access to the Azure OpenAI service by filling out the form at https://aka.ms/oai/access/ .
1. Type 'azure openai' in the search bar and select Services - Azure OpenAI.
2. Select the + Create button.
3. Select a network security Type.
4. Select the Create button.
5. Deploy your Azure OpenAI model.
- Navigate to your Azure OpenAI resource, then select the Go to Azure OpenAI Studio button.
- In Azure OpenAI Studio, select the Deployments button.
- Select the + Create new deployment button, then create deployments for the gpt-35-turbo and text-embedding-ada-002 models.
NOTE:
In this tutorial we will use the gpt-35-turbo and text-embedding-ada-002 models. I recommend using the same name for both the deployment name and the model name.
3. Set up the project and install the libraries
1. Create a folder where you can work.
- We will create an `azure-proj` folder inside the `User` folder and work inside the `gpt-proj1` folder.
- Open a command prompt window and create a folder named `azure-proj` in the default path.
mkdir azure-proj
- Navigate to the `azure-proj` folder you just created.
cd azure-proj
- In the same way, create a `gpt-proj1` folder inside the `azure-proj` folder. Navigate to the `gpt-proj1` folder.
mkdir gpt-proj1
cd gpt-proj1
2. Create a virtual environment.
- Type the following command to create a virtual environment named `.venv`.
python -m venv .venv
- Once the virtual environment is created, type the following command to activate the virtual environment.
.venv\Scripts\activate.bat
- Once activated, the name of the virtual environment will appear on the far left of the command prompt window.
3. Install the required packages.
- At the Command prompt, type the following command.
(.venv) C:\Users\sms79\azure-proj\gpt-proj1>pip install openai
(.venv) C:\Users\sms79\azure-proj\gpt-proj1>pip install langchain
(.venv) C:\Users\sms79\azure-proj\gpt-proj1>pip install faiss-cpu
(.venv) C:\Users\sms79\azure-proj\gpt-proj1>pip install tiktoken
TIP:
How to use CMD in VS Code
Select TERMINAL at the bottom of VS Code, then select the + button, then select the Command Prompt.
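After installing, a quick way to confirm that the packages are visible inside the virtual environment is to check whether their modules can be found. The sketch below uses only the standard library; the `missing_packages` helper is hypothetical, and the mapping relies on `faiss-cpu` being imported as `faiss`.

```python
from importlib.util import find_spec

# pip package name -> module name used in `import` statements.
REQUIRED = {
    "openai": "openai",
    "langchain": "langchain",
    "faiss-cpu": "faiss",
    "tiktoken": "tiktoken",
}


def missing_packages(required=REQUIRED):
    """Return the pip names of packages whose modules cannot be found."""
    return [pip_name for pip_name, module in required.items()
            if find_spec(module) is None]


missing = missing_packages()
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are installed.")
```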
4. Set up the project in VS Code
1. In VS Code, select the folder that you have created.
- Open VS Code and select File > Open Folder from the menu. Select the `gpt-proj1` folder that you created earlier, which is located at C:\Users\yourUserName\azure-proj\gpt-proj1.
2. Create a new file.
- In the left pane of VS Code, right-click and select 'New File' to create a new file named `example.py`.
3. Import the required packages.
- Type the following code in the `example.py` file in VS Code.
# Library imports
from collections import OrderedDict
import requests
# Langchain library imports
from langchain.chains import RetrievalQAWithSourcesChain
from langchain.chat_models import AzureChatOpenAI
from langchain.docstore.document import Document
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import FAISS
4. Create a configuration file - `config.py`.
NOTE:
Complete folder structure:
└── YourUserName
    └── azure-proj
        └── gpt-proj1
            ├── example.py
            └── config.py
- Create a `config.py` file. This file should contain the connection information for your Azure resources.
- Add the code below to your `config.py` file.
# Azure Search Service settings
SEARCH_SERVICE_NAME = 'your-search-service-name' # 'test-search-service1'
SEARCH_SERVICE_ENDPOINT = f'https://{SEARCH_SERVICE_NAME.lower()}.search.windows.net/'
SEARCH_SERVICE_KEY = 'your-search-service-key'
SEARCH_SERVICE_API_VERSION = 'your-API-version' # '2023-07-01-preview'
# Azure Search Service Index settings
SEARCH_SERVICE_INDEX_NAME1 = 'your-search-service-index-name' # 'azureblob-index1'
# Azure Cognitive Search Service Semantic configuration settings
SEARCH_SERVICE_SEMANTIC_CONFIG_NAME = 'your-semantic-configuration-name' # 'test-configuration'
# Azure OpenAI settings
AZURE_OPENAI_NAME = 'your-openai-name' # 'testopenai1004'
AZURE_OPENAI_ENDPOINT = f'https://{AZURE_OPENAI_NAME.lower()}.openai.azure.com/'
AZURE_OPENAI_KEY = 'your-openai-key'
AZURE_OPENAI_API_VERSION = 'your-API-version' # '2023-08-01-preview'
5. Fill in your `config.py` file with your Azure information.
NOTE:
You'll need to include information about your Azure Cognitive Search Service name, index name, semantic configuration name, key, and API version, and Azure OpenAI name, key, and API version.
TIP:
Find your Azure information
1. Find the Azure Cognitive Search Keys.
- Navigate to your Cognitive Search service, then select Keys, then copy and paste your key into the `config.py` file.
2. Find the Azure Cognitive Search Index name.
- Navigate to your Cognitive Search service, then select Indexes, then copy and paste your index name into the `config.py` file.
3. Find the Azure Cognitive Search Semantic configuration name.
- Navigate to your Cognitive Search service, select Indexes, and then click your index name.
- Select Semantic configurations and copy and paste your Semantic configuration name into the `config.py` file.
4. Find the Azure OpenAI Keys.
- Navigate to your Azure OpenAI resource, then select Keys and Endpoint, then copy and paste your key into the `config.py` file.
5. Choose your Azure Cognitive Search API and Azure OpenAI version.
- Select your version of the Azure Cognitive Search API and Azure OpenAI API using the hyperlinks below.
- Select an Azure Cognitive Search API version
- Select an Azure OpenAI API version
- I have selected the latest version of the Azure Cognitive Search API, 2023-07-01-preview, and the Azure OpenAI API, 2023-08-01-preview.
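Hard-coding keys in `config.py` is convenient for a tutorial, but keys can leak if the file is shared or committed. One common alternative, sketched below, reads the values from environment variables and falls back to the placeholders. The environment variable names simply mirror the `config.py` names and are an assumption, as is the `read_setting` helper.

```python
import os


def read_setting(name, default):
    """Return the environment variable `name`, or `default` when it is unset or empty."""
    value = os.environ.get(name, "").strip()
    return value or default


# Same names as in config.py; values come from the environment when available.
SEARCH_SERVICE_NAME = read_setting("SEARCH_SERVICE_NAME", "your-search-service-name")
SEARCH_SERVICE_ENDPOINT = f"https://{SEARCH_SERVICE_NAME.lower()}.search.windows.net/"
SEARCH_SERVICE_KEY = read_setting("SEARCH_SERVICE_KEY", "your-search-service-key")
AZURE_OPENAI_KEY = read_setting("AZURE_OPENAI_KEY", "your-openai-key")
```

With this variant, `example.py` can keep importing the same names from `config.py` while the actual secrets stay outside the source file.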
5. Search with Azure Cognitive Search
In this section, we'll use Azure Cognitive Search within VS Code. We have already installed all the necessary packages in the previous chapter. Now we will focus on how to use Azure Cognitive Search and Azure OpenAI in VS Code.
In Chapters 5 and 6, we'll create functions that use Azure Cognitive Search and Azure OpenAI and call them from the main function in `example.py`.
To use Azure Cognitive Search and Azure OpenAI, we need to import the Azure information that we entered in `config.py` into the `example.py` file that we created earlier.
All the following code goes in `example.py`.
The full code is provided at the end of the tutorial for your convenience.
1. Add code to `example.py` that imports the values from `config.py`.
# Configuration imports
from config import (
SEARCH_SERVICE_ENDPOINT,
SEARCH_SERVICE_KEY,
SEARCH_SERVICE_API_VERSION,
SEARCH_SERVICE_INDEX_NAME1,
SEARCH_SERVICE_SEMANTIC_CONFIG_NAME,
AZURE_OPENAI_ENDPOINT,
AZURE_OPENAI_KEY,
AZURE_OPENAI_API_VERSION,
)
2. Add the Azure Cognitive Search Service header.
# Cognitive Search Service header settings
HEADERS = {
'Content-Type': 'application/json',
'api-key': SEARCH_SERVICE_KEY
}
3. Now, we will create functions related to Azure Cognitive Search and run them from the main function.
- Add the two functions.
# Function to search documents using Azure Cognitive Search
def search_documents(question):
    url = (SEARCH_SERVICE_ENDPOINT + 'indexes/' +
           SEARCH_SERVICE_INDEX_NAME1 + '/docs')
    params = {
        'api-version': SEARCH_SERVICE_API_VERSION,
        'search': question,
        'select': '*',
        '$top': 3,
        'queryLanguage': 'en-us',
        'queryType': 'semantic',
        'semanticConfiguration': SEARCH_SERVICE_SEMANTIC_CONFIG_NAME,
        '$count': 'true',
        'speller': 'lexicon',
        'answers': 'extractive|count-3',
        'captions': 'extractive|highlight-false'
    }
    resp = requests.get(url, headers=HEADERS, params=params)
    return resp.json()

# Extract documents that score above a certain threshold in semantic search
def extract_documents(search_results):
    file_content = OrderedDict()
    for result in search_results['value']:
        # The @search.rerankerScore value ranges from 0 to 4, where a higher score indicates a better semantic match.
        if result['@search.rerankerScore'] > 1.5:
            file_content[result['metadata_storage_path']] = {
                'chunks': result['pages'][:10],
                'captions': result['@search.captions'][:10],
                'score': result['@search.rerankerScore'],
                'file_name': result['metadata_storage_name']
            }
    return file_content
4. Now we'll run the code using the above functions in the main function.
- Add the code below.
- When you run it, you'll see the total number of PDFs in the blob storage, the top few documents adopted, and the number of chunks.
- I asked the question, 'Tell me about effective prompting strategies' based on the paper I had stored on the blob storage.
- If you want to see the full search results, add `print(search_results)` to your main function.
def main():
    QUESTION = 'Tell me about effective prompting strategies'
    # Search for documents with Azure Cognitive Search
    search_results = search_documents(QUESTION)
    file_content = extract_documents(search_results)
    print('Total Documents Found: {}, Top Documents: {}'.format(
        search_results['@odata.count'], len(search_results['value'])))
    # 'chunks' is the value that corresponds to the Pages field that you set up in the Cognitive Search service.
    # Find the number of chunks
    docs = []
    for key, value in file_content.items():
        for page in value['chunks']:
            docs.append(Document(page_content=page,
                                 metadata={"source": value["file_name"]}))
    print("Number of chunks: ", len(docs))

# execute the main function
if __name__ == "__main__":
    main()
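The threshold filtering in `extract_documents` can be tried without calling the service. The sketch below repeats the function, with the threshold exposed as a `min_score` parameter (my addition), and feeds it a hand-written response. The sample values are invented for illustration and are not real search output.

```python
from collections import OrderedDict


def extract_documents(search_results, min_score=1.5):
    """Keep results whose semantic reranker score is above min_score."""
    file_content = OrderedDict()
    for result in search_results['value']:
        if result['@search.rerankerScore'] > min_score:
            file_content[result['metadata_storage_path']] = {
                'chunks': result['pages'][:10],
                'captions': result['@search.captions'][:10],
                'score': result['@search.rerankerScore'],
                'file_name': result['metadata_storage_name'],
            }
    return file_content


# Invented sample shaped like the fields the tutorial code reads.
mock_results = {
    '@odata.count': 2,
    'value': [
        {'@search.rerankerScore': 2.7, 'metadata_storage_path': 'path-a',
         'metadata_storage_name': 'a.pdf', 'pages': ['chunk 1', 'chunk 2'],
         '@search.captions': [{'text': 'caption'}]},
        {'@search.rerankerScore': 0.9, 'metadata_storage_path': 'path-b',
         'metadata_storage_name': 'b.pdf', 'pages': ['chunk 3'],
         '@search.captions': []},
    ],
}

kept = extract_documents(mock_results)
print([doc['file_name'] for doc in kept.values()])  # ['a.pdf']
```

Only the first result passes the 1.5 threshold, so only its chunks would be handed to the embedding and question-answering steps.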
6. Get answers from PDF content using Azure OpenAI and Cognitive Search
Now that Azure Cognitive Search is working well in VS Code, it's time to start using Azure OpenAI.
In this chapter, we'll create functions related to Azure OpenAI and ultimately create and run a program in `example.py` that answers a question with Azure OpenAI based on the search information from Azure Cognitive Search.
1. We will create functions related to Azure OpenAI and LangChain and run them from the main function.
- Add the following functions above the main function.
# Function to create an embedding model
def create_embeddings():
    return OpenAIEmbeddings(
        openai_api_type='azure',
        openai_api_key=AZURE_OPENAI_KEY,
        openai_api_base=AZURE_OPENAI_ENDPOINT,
        openai_api_version=AZURE_OPENAI_API_VERSION,
        deployment='text-embedding-ada-002',
        model='text-embedding-ada-002',
        chunk_size=1
    )

# Function to create a Vectorstore
def create_vector_store(docs, embeddings):
    return FAISS.from_documents(docs, embeddings)

# Function to retrieve search results using Langchain and Chatgpt
def search_with_langchain(vector_store, question):
    llm = AzureChatOpenAI(
        openai_api_key=AZURE_OPENAI_KEY,
        openai_api_base=AZURE_OPENAI_ENDPOINT,
        openai_api_version=AZURE_OPENAI_API_VERSION,
        openai_api_type='azure',
        deployment_name='gpt-35-turbo',
        temperature=0.0,
        max_tokens=500
    )
    chain = RetrievalQAWithSourcesChain.from_chain_type(
        llm=llm,
        chain_type='stuff',
        retriever=vector_store.as_retriever(),
        return_source_documents=True
    )
    return chain({'question': question})
2. Add the code below to your main function.
    # Create an embedding model
    embeddings = create_embeddings()
    # Create a Vectorstore
    vector_store = create_vector_store(docs, embeddings)
    # Search results using Langchain and Chatgpt
    result = search_with_langchain(vector_store, QUESTION)
    print('Question: ', QUESTION)
    print('Answer: ', result['answer'])
    print('Reference: ', result['sources'].replace(",", "\n"))
3. Now let's run it and see if it answers your question.
- The result of executing the code.
```
Total Documents Found: 5, Top Documents: 3
Number of chunks: 10
Question: Tell me about effective prompting strategies
Answer: Effective prompting strategies for improving the reliability of GPT-3 include
establishing simple prompts that improve GPT-3's reliability in terms of generalizability,
social biases, calibration, and factuality. These strategies include prompting with
randomly sampled examples from the source domain, using examples sampled from
a balanced demographic distribution and natural language intervention to reduce
social biases, calibrating output probabilities, and updating the LLM's factual
knowledge and reasoning chains. Natural language intervention can also effectively
guide model predictions towards better fairness.
Reference: Prompting GPT-3 To Be Reliable.pdf
```
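The `chain_type='stuff'` setting used above means the retrieved chunks are placed ("stuffed") together into a single prompt. The sketch below illustrates that idea in plain Python. The prompt wording, the placeholder chunk texts, and the `build_stuff_prompt` helper are assumptions for illustration and do not reproduce LangChain's actual prompt template.

```python
def build_stuff_prompt(question, chunks):
    """Concatenate all retrieved chunks, each tagged with its source, into one prompt."""
    context = "\n\n".join(
        "Content: {}\nSource: {}".format(text, source) for text, source in chunks)
    return ("Answer the question using only the context below and cite the sources.\n\n"
            "{}\n\nQuestion: {}\nAnswer:".format(context, question))


# Placeholder chunk texts (made up for illustration), paired with a source file name.
retrieved = [
    ("placeholder text of the first retrieved chunk", "Prompting GPT-3 To Be Reliable.pdf"),
    ("placeholder text of the second retrieved chunk", "Prompting GPT-3 To Be Reliable.pdf"),
]
prompt = build_stuff_prompt("Tell me about effective prompting strategies", retrieved)
print(prompt)
```

Because everything goes into one prompt, the combined chunks plus the answer must still fit within the model's token limit, which is why the chunk size chosen in Series 1 matters.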
NOTE: Full code for example.py and config.py
This chapter is designed to provide all the code used in the tutorial. It is a separate section from the rest of the tutorial.
For your convenience, I've attached the full code used in the tutorial.
1. config.py
# Azure Search Service settings
SEARCH_SERVICE_NAME = 'your-search-service-name' # 'test-search-service1'
SEARCH_SERVICE_ENDPOINT = f'https://{SEARCH_SERVICE_NAME.lower()}.search.windows.net/'
SEARCH_SERVICE_KEY = 'your-search-service-key'
SEARCH_SERVICE_API_VERSION = 'your-API-version' # '2023-07-01-preview'
# Azure Search Service Index settings
SEARCH_SERVICE_INDEX_NAME1 = 'your-search-service-index-name' # 'azureblob-index1'
# Azure Cognitive Search Service Semantic configuration settings
SEARCH_SERVICE_SEMANTIC_CONFIG_NAME = 'your-semantic-configuration-name' # 'test-configuration'
# Azure OpenAI settings
AZURE_OPENAI_NAME = 'your-openai-name' # 'testopenai1004'
AZURE_OPENAI_ENDPOINT = f'https://{AZURE_OPENAI_NAME.lower()}.openai.azure.com/'
AZURE_OPENAI_KEY = 'your-openai-key'
AZURE_OPENAI_API_VERSION = 'your-API-version' # '2023-08-01-preview'
2. example.py
# Library imports
from collections import OrderedDict
import requests
# Langchain library imports
from langchain.chains import RetrievalQAWithSourcesChain
from langchain.chat_models import AzureChatOpenAI
from langchain.docstore.document import Document
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import FAISS
# Configuration imports
from config import (
SEARCH_SERVICE_ENDPOINT,
SEARCH_SERVICE_KEY,
SEARCH_SERVICE_API_VERSION,
SEARCH_SERVICE_INDEX_NAME1,
SEARCH_SERVICE_SEMANTIC_CONFIG_NAME,
AZURE_OPENAI_ENDPOINT,
AZURE_OPENAI_KEY,
AZURE_OPENAI_API_VERSION,
)
# Cognitive Search Service header settings
HEADERS = {
'Content-Type': 'application/json',
'api-key': SEARCH_SERVICE_KEY
}
# Function to search documents using Azure Cognitive Search
def search_documents(question):
    url = (SEARCH_SERVICE_ENDPOINT + 'indexes/' +
           SEARCH_SERVICE_INDEX_NAME1 + '/docs')
    params = {
        'api-version': SEARCH_SERVICE_API_VERSION,
        'search': question,
        'select': '*',
        '$top': 3,
        'queryLanguage': 'en-us',
        'queryType': 'semantic',
        'semanticConfiguration': SEARCH_SERVICE_SEMANTIC_CONFIG_NAME,
        '$count': 'true',
        'speller': 'lexicon',
        'answers': 'extractive|count-3',
        'captions': 'extractive|highlight-false'
    }
    resp = requests.get(url, headers=HEADERS, params=params)
    return resp.json()

# Extract documents that score above a certain threshold in semantic search
def extract_documents(search_results):
    file_content = OrderedDict()
    for result in search_results['value']:
        # The '@search.rerankerScore' range is 0 to 4, where a higher score indicates a stronger semantic match.
        if result['@search.rerankerScore'] > 1.5:
            file_content[result['metadata_storage_path']] = {
                'chunks': result['pages'][:10],
                'captions': result['@search.captions'][:10],
                'score': result['@search.rerankerScore'],
                'file_name': result['metadata_storage_name']
            }
    return file_content

# Function to create an embedding model
def create_embeddings():
    return OpenAIEmbeddings(
        openai_api_type='azure',
        openai_api_key=AZURE_OPENAI_KEY,
        openai_api_base=AZURE_OPENAI_ENDPOINT,
        openai_api_version=AZURE_OPENAI_API_VERSION,
        deployment='text-embedding-ada-002',
        model='text-embedding-ada-002',
        chunk_size=1
    )

# Function to create a Vectorstore
def create_vector_store(docs, embeddings):
    return FAISS.from_documents(docs, embeddings)

# Function to retrieve search results using Langchain and Chatgpt
def search_with_langchain(vector_store, question):
    llm = AzureChatOpenAI(
        openai_api_key=AZURE_OPENAI_KEY,
        openai_api_base=AZURE_OPENAI_ENDPOINT,
        openai_api_version=AZURE_OPENAI_API_VERSION,
        openai_api_type='azure',
        deployment_name='gpt-35-turbo',
        temperature=0.0,
        max_tokens=500
    )
    chain = RetrievalQAWithSourcesChain.from_chain_type(
        llm=llm,
        chain_type='stuff',
        retriever=vector_store.as_retriever(),
        return_source_documents=True
    )
    return chain({'question': question})

def main():
    QUESTION = 'Tell me about effective prompting strategies'
    # Search for documents with Azure Cognitive Search
    search_results = search_documents(QUESTION)
    file_content = extract_documents(search_results)
    print('Total Documents Found: {}, Top Documents: {}'.format(
        search_results['@odata.count'], len(search_results['value'])))
    # 'chunks' is the value that corresponds to the Pages field that you set up in the Cognitive Search service.
    # Find the number of chunks
    docs = []
    for key, value in file_content.items():
        for page in value['chunks']:
            docs.append(Document(page_content=page,
                                 metadata={"source": value["file_name"]}))
    print("Number of chunks: ", len(docs))
    # Create an embedding model
    embeddings = create_embeddings()
    # Create a Vectorstore
    vector_store = create_vector_store(docs, embeddings)
    # Search results using Langchain and Chatgpt
    result = search_with_langchain(vector_store, QUESTION)
    print('Question: ', QUESTION)
    print('Answer: ', result['answer'])
    print('Reference: ', result['sources'].replace(",", "\n"))

# execute the main function
if __name__ == "__main__":
    main()
Congratulations!
You've completed this tutorial and learned how to integrate Azure Cognitive Search with Azure OpenAI.
In this tutorial, we have navigated through a practical journey of integrating Azure Blob Storage, Azure Cognitive Search, and Azure OpenAI to create a powerful search and response mechanism.
1. Storing Data in Azure Blob Storage
Our first step was to efficiently store PDF files in Azure Blob Storage, an unstructured data store known for its scalability and security. This storage served as a foundational base, housing the search material that would later be indexed and queried to retrieve relevant information.
2. Implementing Azure Cognitive Search
In the next step, we used Azure Cognitive Search to search based on the data we had stored in Azure Blob Storage. This powerful service was instrumental in indexing and searching the data stored in Azure Blob Storage.
3. Integrating Azure OpenAI with VS Code
The final step of our tutorial was to integrate Azure OpenAI through a program created in VS Code. This program was designed to use the search information processed and refined by Azure Cognitive Search to generate accurate and contextually relevant answers. The synergy between these technologies illustrated the seamless interplay of storage, search, and response mechanisms.
I hope the knowledge and skills covered here prove useful in your future projects. Together, Azure Blob Storage, Azure Cognitive Search, and Azure OpenAI offer a practical way to manage and query unstructured data.
Thank you for your commitment and hard work throughout this learning journey.
Next Steps
Documentation
Azure Blob Storage documentation
Introduction to Azure Blob Storage
Azure Cognitive Search Documentation
Azure OpenAI Service Documentation
Azure OpenAI on your data
Training Content
Configure Azure Blob Storage
Store application data with Azure Blob Storage
Implement advanced search features in Azure Cognitive Search
Use semantic search to get better search results in Azure Cognitive Search
Get started with Azure OpenAI Service
Apply prompt engineering with Azure OpenAI Service
Continue reading...
Can't I just copy and paste text from a PDF file to teach ChatGPT?
The purpose of this tutorial is to explain how to efficiently extract and use information from large amounts of PDFs. Dealing with a 5-page PDF can be straightforward, but it's a different story when you're dealing with complex documents of 100+ pages. In these situations, the integration of Azure Cognitive Search with Azure OpenAI enables fast and accurate information retrieval and processing. In this tutorial, we handle 5 PDFs, but you can apply this method to scale to handle more than 10,000 files. In this two-part series, we will explore how to build intelligent service using Azure. In Series 1, we'll use Azure Cognitive Search to extract keywords from unstructured data stored in Azure Blob Storage. In Series 2, we'll Create a feature to answer questions based on PDF documents using Azure OpenAI. Here is an overview of this tutorial.
This tutorial is related to the following topics
- AI Engineer
- Developer
- Azure Blob Storage
- Azure Cognitive Search
- Azure OpenAI
Learning objectives
In this tutorial, you'll learn the following:
- How to store your unstructured data in Azure Blob Storage.
- How to create search experiences based on data stored in Blob Storage with Azure Cognitive Search.
- How to teach ChatGPT to answer questions based on your PDF content using Azure Cognitive Search and Azure OpenAI.
Prerequisites
- Azure subscription
- Visual Studio Code
Microsoft Cloud Technologies used in this Tutorial
- Azure Blob Storage
- Azure Cognitive Search
- Azure OpenAI Service
Table of Contents
Series 1: Extract Key Phrases for Search Queries Using Azure Cognitive Search
1. Create a Blob Container
2. Store PDF Documents in Azure Blob Storage
3. Create a Cognitive Search Service
4. Connect to Data from Azure Blob Storage
5. Add Cognitive Skills
6. Customize Target Index and Create an Indexer
7. Extract Key Phrases for Search Queries Using Azure Cognitive Search
Series 2: Implement a ChatGPT Service with Azure OpenAI
1. Change your indexer settings to use Azure OpenAI
2. Create an Azure OpenAI
3. Set up the project and install the libraries
4. Set up the project in VS Code
5. Search with Azure Cognitive Search
6. Get answers from PDF content using Azure OpenAI and Cognitive Search
7. Note: Full code for example.py and config.py
Series 1: Extract Key Phrases for Search Queries Using Azure Cognitive Search
1. Create a Blob Container
Azure Blob Storage is a service designed for storing large amounts of unstructured data, such as PDFs.
1. To begin, create an Azure Storage account by typing `storage` in the search bar and selecting Services - Storage accounts.
2. Select the +Create button.
3. Enter the resource group name that will serve as the folder for the storage account, enter the storage account name, and select a region. When you're done, click the Next button and you can continue with the defaults.
2. Store PDF Documents in Azure Blob Storage
1. After your storage account is set up, navigate to Storage Browser by typing `storage browser` in the search bar.
2. Add a new container to store PDF documents.
- Select your storage account.
- Select the Blob containers button.
- Select the +Add Container button to create a new container.
3. Once the container is set up, upload your PDFs into this container.
- Select the container you created.
- Select the Upload button and upload your PDF documents.
For the tutorial, I downloaded 5 PDF documents of recent papers on GPT from Microsoft Academic and uploaded them to the container.
3. Create a Cognitive Search Service
1. Type `cognitive service` into the search bar and select Services – Cognitive Search.
2. Select the +Create button.
3. Create a new Cognitive Search Service.
- Select your Resource Group.
- Specify the Service name.
- Select your Location.
NOTE:
The Azure OpenAI resource is currently available only in a limited set of regions.
If the Azure OpenAI resource is not available in your region, I recommend setting your location to East US.
- Choose a Pricing tier that suits your needs; since semantic search is available from the basic tier, I recommend setting your Pricing tier to basic for the tutorial.
NOTE:
In this tutorial we will use the Basic tier to explore semantic search with Azure Cognitive Search.
You can expect a cost of approximately $2.50 per 100 program runs with this tier.
If you plan to use the free plan, please note that the code demonstrated in this tutorial may differ from what you'll need.
- Select the Review + create button.
4. Navigate to the Cognitive Search service you created and select Semantic search (Preview), then select the Free plan. (If you chose the Free pricing tier for the service, you can skip this step.)
4. Connect to Data from Azure Blob Storage
1. Navigate to the Cognitive Search service you created and select Import data.
2. Select Azure Blob Storage as the data source and connect it to the Blob Storage where your PDFs are stored.
3. Specify your Data source name.
4. Select Choose an existing connection and select the blob storage container you created.
5. Select the Next: Add cognitive skills button.
5. Add Cognitive Skills
1. To power your cognitive skills, select an existing AI Services resource or create a new one; the Free resource is sufficient for this tutorial.
2. Specify the Skillset name.
TIP:
If you want to search for text in a photo, you need to check Enable OCR and merge all text into merged_content field. In this tutorial, we will not check it because we will search based on the text in the paper.
3. Select Enrichment granularity level. (In this tutorial, we'll use a page-by-page granularity, so we'll select Pages (5000 characters chunk).)
4. Select Extract Key phrases. (You can select additional checkboxes depending on the type of PDF data.)
5. Select the Next: Customize target index button.
NOTE:
Why set the Enrichment granularity level to Pages (5000 characters chunk)?
To get ChatGPT responses based on a PDF, we need to call the GPT-3.5-turbo model through the ChatGPT API. The GPT-3.5-turbo model can handle up to 4096 tokens, counting both the text you send as input and the answer the API returns. For this reason, a document that is too long cannot be sent all at once; it must be broken into multiple chunks and processed over multiple calls to the API. (Tokens can be words, parts of words, punctuation, spaces, etc.)
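The arithmetic behind this choice can be sketched as follows. This is a rough estimate only: the figure of about 4 characters per token is a common rule of thumb for English text, not an exact property of the tokenizer, and the helper names are hypothetical. The 500-token answer budget matches the `max_tokens` value used in the Python code of Series 2. The service's own splitter also tries to respect sentence boundaries, which the plain slicing here ignores.

```python
# Rough token budget for one request to a model with a 4096-token context window.
CONTEXT_WINDOW = 4096   # input + output tokens, as described in the note above
ANSWER_BUDGET = 500     # tokens reserved for the answer (max_tokens in Series 2)
CHARS_PER_TOKEN = 4     # ASSUMPTION: rule-of-thumb average for English text
CHUNK_CHARS = 5000      # "Pages (5000 characters chunk)" granularity


def estimate_tokens(num_chars, chars_per_token=CHARS_PER_TOKEN):
    """Estimate a token count from a length in characters (rounded up)."""
    return -(-num_chars // chars_per_token)  # ceiling division


def split_into_chunks(text, chunk_chars=CHUNK_CHARS):
    """Split text into consecutive chunks of at most chunk_chars characters."""
    return [text[i:i + chunk_chars] for i in range(0, len(text), chunk_chars)]


prompt_budget = CONTEXT_WINDOW - ANSWER_BUDGET          # tokens left for the prompt
tokens_per_chunk = estimate_tokens(CHUNK_CHARS)         # estimated size of one chunk
chunks_per_request = prompt_budget // tokens_per_chunk  # whole chunks that fit

print("Tokens left for the prompt:", prompt_budget)
print("Estimated tokens per chunk:", tokens_per_chunk)
print("Chunks that fit in one request:", chunks_per_request)

# A 12,000-character document is split into three chunks.
chunks = split_into_chunks("x" * 12000)
print([len(chunk) for chunk in chunks])
```

Under these assumptions only a couple of chunks fit into a single request, which is why the whole document is never sent at once.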
6. Customize Target Index and Create an Indexer
1. Customize target Index.
- Specify your Index name.
- Check the boxes as shown in the image below.
TIP:
You can change the boxes to suit your data. To help you understand, I've written a description of each field in the index at the bottom of this page.
2. Add a new field.
In this tutorial, we have selected the Enrichment granularity level of Pages (5000 characters chunk). So, we need to create a field that holds the 5000-character chunks so that they can be searched and retrieved.
- Select + Add field button.
- Create a field named `pages`.
- Select Collection(Edm.String) as the type for the `pages` field.
- Check the box Retrievable.
3. Delete unnecessary fields.
4. Create an Indexer.
- Specify your Indexer Name.
- Select the Schedule – Once.
(For data that arrives continuously, you would set up a recurring schedule, but since we're dealing with unchanging PDF data in this tutorial, we only need to run the indexer Once.)
- Select the Submit button.
TIP:
Description of each field in the index
These fields represent the attributes for each PDF document.
For example, suppose we have a PDF document named `example.pdf` stored in blob storage. If we checked Retrievable for metadata_storage_size, we would be able to search for and find the size of the PDF document `example.pdf`.
1. content (Edm.String)
This field indicates the actual content of the stored data.
2. metadata_storage_content_type (Edm.String)
This field indicates the type of content stored.
Ex) The metadata_storage_content_type of `example.pdf` is `application/pdf`.
3. metadata_storage_size (Edm.Int64)
This field indicates the size of the stored data. The size information is stored as an integer.
Ex) The metadata_storage_size of `example.pdf` is `487743` (bytes).
4. metadata_storage_last_modified (Edm.DateTimeOffset)
This field indicates the most recent modification date and time of the stored data.
Ex) The metadata_storage_last_modified of `example.pdf` is `2023-10-06T18:45:32+00:00`.
5. metadata_storage_content_md5 (Edm.String)
This field indicates a checksum value for the data, which is used to validate the integrity of the content during transmission or storage. The MD5 hash value is represented as a string of alphanumeric characters.
Ex) The metadata_storage_content_md5 of `example.pdf` is `d41d8cd98f00b204e9800998ecf8427e`.
6. metadata_storage_name (Edm.String)
This field indicates the file name stored in blob storage.
Ex) The metadata_storage_name of `example.pdf` is `example.pdf`.
7. metadata_storage_path (Edm.String)
This field indicates the storage path where the data file or object resides within the Azure storage architecture.
Ex) The metadata_storage_path of `example.pdf` is `https://yourstorageaccount.blob.core.windows.net/testcontainer/example.pdf`.
8. metadata_storage_file_extension (Edm.String)
This field indicates the file extension.
Ex) The metadata_storage_file_extension of `example.pdf` is `.pdf`.
9. metadata_content_type (Edm.String)
This field indicates the nature of the internal content, such as whether it is text, HTML, JSON, etc.
Ex) The metadata_content_type of `example.pdf` is `text`.
10. metadata_language (Edm.String)
This field indicates the language in which the content is written, facilitating language-specific processing and searching.
Ex) The metadata_language of `example.pdf` is `EN`.
11. metadata_creation_date (Edm.DateTimeOffset)
This field indicates the date and time when the data was originally created.
Ex) The metadata_creation_date of `example.pdf` is `2023-09-30T14:32:10+00:00`.
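As a small illustration, here is how a few of these fields might be read once a search result has been loaded into Python. The dictionary below is a hand-written stand-in that reuses the example values from the descriptions above; it is not real service output.

```python
from datetime import datetime

# Hand-written stand-in for one search result (values taken from the examples above).
result = {
    "metadata_storage_name": "example.pdf",
    "metadata_storage_size": 487743,
    "metadata_storage_last_modified": "2023-10-06T18:45:32+00:00",
    "metadata_storage_file_extension": ".pdf",
}

# An Edm.Int64 value is a plain integer, so it can be used in arithmetic directly.
size_kib = result["metadata_storage_size"] / 1024

# An Edm.DateTimeOffset value is an ISO 8601 string.
# ASSUMPTION: the offset is numeric, as in the example; a trailing "Z"
# is only accepted by datetime.fromisoformat from Python 3.11 onwards.
modified = datetime.fromisoformat(result["metadata_storage_last_modified"])

print(f"{result['metadata_storage_name']}: {size_kib:.1f} KiB")
print("Last modified on", modified.date(), "with UTC offset", modified.utcoffset())
```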
7. Extract Key Phrases for Search Queries Using Azure Cognitive Search
1. Once your indexer and index creation are complete, navigate to your Cognitive Search service and select the Indexes page.
2. Select the index you created.
3. You can use a query string or simply enter text to perform a search.
Ex) In this tutorial, I entered the following question: `How to prompt GPT to be reliable?`
4. Set Semantic configurations.
- Semantic configurations are available from the Basic pricing tier onwards. If you chose the Free tier, you can skip this step.
- Select Semantic configurations, then select + Add semantic configuration.
- Specify your semantic configuration Name.
- Select the Title field – content.
- Select the Save button.
- When you've finished setting up your semantic configuration, return and select the Save button.
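For reference, the question from step 3 can also be written as a URL query string. The sketch below only builds and decodes the string with the Python standard library; nothing is sent to the service. The parameter names `search` and `$count` are the ones used by the Python code in Series 2.

```python
from urllib.parse import parse_qs, urlencode

question = "How to prompt GPT to be reliable?"

# Build the query string; urlencode escapes spaces and punctuation
# so that the text is safe to place in a URL.
query_string = urlencode({"search": question, "$count": "true"})
print(query_string)

# Decoding the string gives back the original values.
decoded = parse_qs(query_string)
print(decoded["search"][0])
```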
We have now finished extracting key phrases and searching them with our questions using Azure Cognitive Search.
In the next series, we'll connect this Cognitive Search service with Azure OpenAI to build a ChatGPT service that answers questions based on the PDFs stored in blob storage.
Series 2: Implement a ChatGPT Service with Azure OpenAI
1. Change your indexer settings to use Azure OpenAI
1. Navigate to your Cognitive Search service and select the Indexers page.
2. Select the indexer you created.
3. Select the Indexer Definition (JSON).
4. In the JSON, modify the "outputFieldMappings" part as shown below.
"outputFieldMappings": [
  {
    "sourceFieldName": "/document/content/pages/*/keyphrases/*",
    "targetFieldName": "keyphrases"
  },
  {
    "sourceFieldName": "/document/content/pages/*",
    "targetFieldName": "pages"
  }
]
5. Select the Save button.
6. Select the Reset button.
7. Select the Run button.
TIP:
Description of “outputFieldMappings”
"outputFieldMappings" are settings that map data processed by the Cognitive Search service to specific fields in the index.
For example, with the path "/document/content/pages/*/keyphrases/*", key phrases are extracted from each page and mapped to the "keyphrases" field.
Similarly, for the “pages” field that we created earlier, we need to specify what data will be mapped to this field. In this tutorial, we have selected the Enrichment granularity level of Pages (5000 characters chunk). So, we need to specify that the 5000-character chunks from "/document/content/pages/*" are mapped to the "pages" field. Adding this mapping lets us send 5000-character chunks to OpenAI instead of sending entire documents.
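To make the path syntax concrete, the following sketch mimics the effect of these two mappings on a hand-built enrichment tree. It is an illustration only: `resolve_path` is a hypothetical helper, the nested dictionary is a simplified stand-in for the enriched document, and the real service resolves these paths internally.

```python
# Simplified stand-in for an enriched document with two 5000-character pages.
enriched = {
    "document": {
        "content": {
            "pages": [
                {"value": "text of chunk 1", "keyphrases": ["prompting", "reliability"]},
                {"value": "text of chunk 2", "keyphrases": ["calibration"]},
            ]
        }
    }
}


def resolve_path(tree, path):
    """Follow a path such as '/document/content/pages/*'.

    A '*' segment fans out over every item of a list, so the result is
    always a flat list of the nodes that the path reaches.
    """
    nodes = [tree]
    for segment in path.strip("/").split("/"):
        next_nodes = []
        for node in nodes:
            if segment == "*":
                next_nodes.extend(node)        # one entry per list item
            else:
                next_nodes.append(node[segment])
        nodes = next_nodes
    return nodes


# Apply the two mappings: each source path fills one field of the index document.
index_document = {
    "keyphrases": resolve_path(enriched, "/document/content/pages/*/keyphrases/*"),
    "pages": [page["value"] for page in resolve_path(enriched, "/document/content/pages/*")],
}
print(index_document)
```

The point to notice is that both target fields end up as flat collections: one entry per chunk in `pages`, and the key phrases of all chunks merged into `keyphrases`.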
2. Create an Azure OpenAI
Currently, access to the Azure OpenAI service is granted by request only. You can request access to the Azure OpenAI service by filling out the form at https://aka.ms/oai/access/ .
1. Type 'azure openai' in the search bar and select Services - Azure OpenAI.
2. Select the + Create button.
3. Select a network security Type.
4. Select the Create button.
5. Deploy your Azure OpenAI model.
- Navigate to your Azure OpenAI resource, then select the Go to Azure OpenAI Studio button.
- In Azure OpenAI Studio, select the Deployments button.
- Select the + Create new deployment button, then create deployments for the gpt-35-turbo and text-embedding-ada-002 models.
NOTE:
In this tutorial we will use the gpt-35-turbo and text-embedding-ada-002 models. I recommend using the same name for both the deployment name and the model name.
3. Set up the project and install the libraries
1. Create a folder where you can work.
- We will create an `azure-proj` folder inside the `User` folder and work inside the `gpt-proj1` folder.
- Open a command prompt window and create a folder named `azure-proj` in the default path.
mkdir azure-proj
- Navigate to the `azure-proj` folder you just created.
cd azure-proj
- In the same way, create a `gpt-proj1` folder inside the `azure-proj` folder. Navigate to the `gpt-proj1` folder.
mkdir gpt-proj1
cd gpt-proj1
2. Create a virtual environment.
- Type the following command to create a virtual environment named `.venv`.
python -m venv .venv
- Once the virtual environment is created, type the following command to activate the virtual environment.
.venv\Scripts\activate.bat
- Once activated, the name of the virtual environment will appear on the far left of the command prompt window.
3. Install the required packages.
- At the Command prompt, type the following command.
(.venv) C:\Users\sms79\azure-proj\gpt-proj1>pip install openai
(.venv) C:\Users\sms79\azure-proj\gpt-proj1>pip install langchain
(.venv) C:\Users\sms79\azure-proj\gpt-proj1>pip install faiss-cpu
(.venv) C:\Users\sms79\azure-proj\gpt-proj1>pip install tiktoken
TIP:
How to use CMD in VS Code
Select TERMINAL at the bottom of VS Code, then select the + button, then select the Command Prompt.
4. Set up the project in VS Code
1. In VS Code, select the folder that you have created.
- Open VS Code and select File > Open Folder from the menu. Select the `gpt-proj1` folder that you created earlier, which is located at C:\Users\yourUserName\azure-proj\gpt-proj1.
2. Create a new file.
- In the left pane of VS Code, right-click and select 'New File' to create a new file named `example.py`.
3. Import the required packages.
- Type the following code in the `example.py` file in VS Code.
# Library imports
from collections import OrderedDict
import requests
# Langchain library imports
from langchain.chains import RetrievalQAWithSourcesChain
from langchain.chat_models import AzureChatOpenAI
from langchain.docstore.document import Document
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import FAISS
4. Create a configuration file - `config.py`.
NOTE:
Complete folder structure:
└── YourUserName
    └── azure-proj
        └── gpt-proj1
            ├── example.py
            └── config.py
- Create a `config.py` file. This file should contain the connection details for your Azure resources.
- Add the code below to your `config.py` file.
# Azure Search Service settings
SEARCH_SERVICE_NAME = 'your-search-service-name' # 'test-search-service1'
SEARCH_SERVICE_ENDPOINT = f'https://{SEARCH_SERVICE_NAME.lower()}.search.windows.net/'
SEARCH_SERVICE_KEY = 'your-search-service-key'
SEARCH_SERVICE_API_VERSION = 'your-API-version' # '2023-07-01-preview'
# Azure Search Service Index settings
SEARCH_SERVICE_INDEX_NAME1 = 'your-search-service-index-name' # 'azureblob-index1'
# Azure Cognitive Search Service Semantic configuration settings
SEARCH_SERVICE_SEMANTIC_CONFIG_NAME = 'your-semantic-configuration-name' # 'test-configuration'
# Azure OpenAI settings
AZURE_OPENAI_NAME = 'your-openai-name' # 'testopenai1004'
AZURE_OPENAI_ENDPOINT = f'https://{AZURE_OPENAI_NAME.lower()}.openai.azure.com/'
AZURE_OPENAI_KEY = 'your-openai-key'
AZURE_OPENAI_API_VERSION = 'your-API-version' # '2023-08-01-preview'
5. Fill in your `config.py` file with your Azure information.
NOTE:
You'll need to include information about your Azure Cognitive Search Service name, index name, semantic configuration name, key, and API version, and Azure OpenAI name, key, and API version.
TIP:
Find your Azure information
1. Find the Azure Cognitive Search Keys.
- Navigate to your Cognitive Search service, then select Keys, then copy and paste your key into the `config.py` file.
2. Find the Azure Cognitive Search Index name.
- Navigate to your Cognitive Search service, then select Indexes, then copy and paste your index name into the `config.py` file.
3. Find the Azure Cognitive Search Semantic configuration name.
- Navigate to your Cognitive Search service, select Indexes, and then click your index name.
- Select Semantic configurations and copy and paste your Semantic configuration name into the `config.py` file.
4. Find the Azure OpenAI Keys.
- Navigate to your Azure OpenAI resource, then select Keys and Endpoint, then copy and paste your key into the `config.py` file.
5. Choose your Azure Cognitive Search API and Azure OpenAI API versions.
- Select your version of the Azure Cognitive Search API and Azure OpenAI API using the hyperlinks below.
- Select an Azure Cognitive Search API version
- Select an Azure OpenAI API version
- I have selected the latest version of the Azure Cognitive Search API, 2023-07-01-preview, and the Azure OpenAI API, 2023-08-01-preview.
5. Search with Azure Cognitive Search
In this section, we'll use Azure Cognitive Search within VS Code. We have already installed all the necessary packages in the previous chapter. Now we will focus on how to use Azure Cognitive Search and Azure OpenAI in VS Code.
In Chapters 5 and 6, we'll create functions that use Azure Cognitive Search and Azure OpenAI and call them from the main function in `example.py`.
To use Azure Cognitive Search and Azure OpenAI, we need to import the information from Azure that we entered in `config.py` into `example.py` that we created earlier.
All the following code comes from `example.py`.
The full code is provided at the end of the tutorial for your convenience.
1. Add code to `example.py` that imports the values from `config.py`.
# Configuration imports
from config import (
    SEARCH_SERVICE_ENDPOINT,
    SEARCH_SERVICE_KEY,
    SEARCH_SERVICE_API_VERSION,
    SEARCH_SERVICE_INDEX_NAME1,
    SEARCH_SERVICE_SEMANTIC_CONFIG_NAME,
    AZURE_OPENAI_ENDPOINT,
    AZURE_OPENAI_KEY,
    AZURE_OPENAI_API_VERSION,
)
2. Add the Azure Cognitive Search Service header.
# Cognitive Search Service header settings
HEADERS = {
    'Content-Type': 'application/json',
    'api-key': SEARCH_SERVICE_KEY
}
3. Now, we will create functions related to Azure Cognitive Search and run them from the main function.
- Add the two functions.
# Function to search documents using Azure Cognitive Search
def search_documents(question):
    url = (SEARCH_SERVICE_ENDPOINT + 'indexes/' +
           SEARCH_SERVICE_INDEX_NAME1 + '/docs')
    params = {
        'api-version': SEARCH_SERVICE_API_VERSION,
        'search': question,
        'select': '*',
        '$top': 3,
        'queryLanguage': 'en-us',
        'queryType': 'semantic',
        'semanticConfiguration': SEARCH_SERVICE_SEMANTIC_CONFIG_NAME,
        '$count': 'true',
        'speller': 'lexicon',
        'answers': 'extractive|count-3',
        'captions': 'extractive|highlight-false'
    }
    resp = requests.get(url, headers=HEADERS, params=params)
    return resp.json()

# Extract documents that score above a certain threshold in semantic search
def extract_documents(search_results):
    file_content = OrderedDict()
    for result in search_results['value']:
        # The @search.rerankerScore value ranges from 0 to 4, where a higher score indicates a better semantic match.
        if result['@search.rerankerScore'] > 1.5:
            file_content[result['metadata_storage_path']] = {
                'chunks': result['pages'][:10],
                'captions': result['@search.captions'][:10],
                'score': result['@search.rerankerScore'],
                'file_name': result['metadata_storage_name']
            }
    return file_content
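To see the threshold filter in isolation, the sketch below runs the same `extract_documents` logic on a hand-written response, so it works without a search service. The scores, paths, file names, and text in `mock_results` are made up for illustration and do not come from the service.

```python
from collections import OrderedDict


def extract_documents(search_results):
    """Keep only the results whose reranker score is above 1.5."""
    file_content = OrderedDict()
    for result in search_results['value']:
        if result['@search.rerankerScore'] > 1.5:
            file_content[result['metadata_storage_path']] = {
                'chunks': result['pages'][:10],
                'captions': result['@search.captions'][:10],
                'score': result['@search.rerankerScore'],
                'file_name': result['metadata_storage_name']
            }
    return file_content


# Made-up response with one strong and one weak semantic match.
mock_results = {
    '@odata.count': 2,
    'value': [
        {
            '@search.rerankerScore': 2.8,
            '@search.captions': [{'text': 'a made-up caption'}],
            'metadata_storage_path': 'https://example.invalid/container/a.pdf',
            'metadata_storage_name': 'a.pdf',
            'pages': ['chunk 1 of a.pdf', 'chunk 2 of a.pdf'],
        },
        {
            '@search.rerankerScore': 1.1,
            '@search.captions': [{'text': 'another made-up caption'}],
            'metadata_storage_path': 'https://example.invalid/container/b.pdf',
            'metadata_storage_name': 'b.pdf',
            'pages': ['chunk 1 of b.pdf'],
        },
    ],
}

file_content = extract_documents(mock_results)
for path, info in file_content.items():
    print(info['file_name'], info['score'], len(info['chunks']), 'chunks')
```

Only `a.pdf` survives the filter, because the score of `b.pdf` is below the 1.5 threshold. Raising or lowering the threshold trades precision against the amount of context passed on to the model.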
4. Now we'll run the code using the above functions in the main function.
- Add the code below.
- When you run it, you'll see the total number of PDFs found in the blob storage, the number of top documents returned, and the number of chunks.
- I asked the question, 'Tell me about effective prompting strategies', based on the papers I had stored in the blob storage.
- If you want to see the full search results, add `print(search_results)` to your main function.
def main():
    QUESTION = 'Tell me about effective prompting strategies'

    # Search for documents with Azure Cognitive Search
    search_results = search_documents(QUESTION)
    file_content = extract_documents(search_results)
    print('Total Documents Found: {}, Top Documents: {}'.format(
        search_results['@odata.count'], len(search_results['value'])))

    # 'chunks' is the value that corresponds to the Pages field that you set up in the Cognitive Search service.
    # Find the number of chunks
    docs = []
    for key, value in file_content.items():
        for page in value['chunks']:
            docs.append(Document(page_content=page,
                                 metadata={"source": value["file_name"]}))
    print("Number of chunks: ", len(docs))

# execute the main function
if __name__ == "__main__":
    main()
6. Get answers from PDF content using Azure OpenAI and Cognitive Search
Now that Azure Cognitive Search is working well in VS Code, it's time to start using Azure OpenAI.
In this chapter, we'll create functions related to Azure OpenAI and ultimately create and run a program in `example.py` that answers a question with Azure OpenAI based on the search information from Azure Cognitive Search.
1. We will create functions related to Azure OpenAI and LangChain and run them from the main function.
- Add the following functions above the main function.
# Function to create an embedding model
def create_embeddings():
    return OpenAIEmbeddings(
        openai_api_type='azure',
        openai_api_key=AZURE_OPENAI_KEY,
        openai_api_base=AZURE_OPENAI_ENDPOINT,
        openai_api_version=AZURE_OPENAI_API_VERSION,
        deployment='text-embedding-ada-002',
        model='text-embedding-ada-002',
        chunk_size=1
    )

# Function to create a Vectorstore
def create_vector_store(docs, embeddings):
    return FAISS.from_documents(docs, embeddings)

# Function to retrieve search results using LangChain and ChatGPT
def search_with_langchain(vector_store, question):
    llm = AzureChatOpenAI(
        openai_api_key=AZURE_OPENAI_KEY,
        openai_api_base=AZURE_OPENAI_ENDPOINT,
        openai_api_version=AZURE_OPENAI_API_VERSION,
        openai_api_type='azure',
        deployment_name='gpt-35-turbo',
        temperature=0.0,
        max_tokens=500
    )
    chain = RetrievalQAWithSourcesChain.from_chain_type(
        llm=llm,
        chain_type='stuff',
        retriever=vector_store.as_retriever(),
        return_source_documents=True
    )
    return chain({'question': question})
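Conceptually, the vector store turns each chunk into a vector and returns the chunks whose vectors are closest to the question's vector; the 'stuff' chain type then places the retrieved chunks into a single prompt. The sketch below imitates that retrieval step in plain Python. It is not how FAISS or the embedding model is implemented: the word-count vectors are a toy substitute for `text-embedding-ada-002` embeddings, and all function names and sample sentences are hypothetical.

```python
import math
from collections import Counter


def embed(text):
    """Toy embedding: a bag-of-words count vector (ASSUMPTION: stand-in for a real model)."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(chunks, question, k=1):
    """Return the k chunks most similar to the question."""
    question_vector = embed(question)
    ranked = sorted(
        chunks,
        key=lambda chunk: cosine_similarity(embed(chunk), question_vector),
        reverse=True,
    )
    return ranked[:k]


chunks = [
    "simple prompts improve the reliability of large language models",
    "blob storage keeps unstructured data such as pdf files",
    "effective prompting strategies reduce social biases",
]
question = "effective prompting strategies"

top_chunks = retrieve(chunks, question, k=1)

# "Stuff" the retrieved chunks into one prompt for the chat model.
prompt = (
    "Answer the question using only this context:\n"
    + "\n".join(top_chunks)
    + "\nQuestion: " + question
)
print(prompt)
```

A real embedding model captures meaning rather than exact word overlap, so it would also rank the first chunk as relevant; the toy vectors here only match identical words.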
2. Add the code below to your main function.
    # Create an embedding model
    embeddings = create_embeddings()

    # Create a Vectorstore
    vector_store = create_vector_store(docs, embeddings)

    # Search results using LangChain and ChatGPT
    result = search_with_langchain(vector_store, QUESTION)
    print('Question: ', QUESTION)
    print('Answer: ', result['answer'])
    print('Reference: ', result['sources'].replace(",", "\n"))
3. Now let's run it and see if it answers your question.
- The result of executing the code.
```
Total Documents Found: 5, Top Documents: 3
Number of chunks: 10
Question: Tell me about effective prompting strategies
Answer: Effective prompting strategies for improving the reliability of GPT-3 include
establishing simple prompts that improve GPT-3's reliability in terms of generalizability,
social biases, calibration, and factuality. These strategies include prompting with
randomly sampled examples from the source domain, using examples sampled from
a balanced demographic distribution and natural language intervention to reduce
social biases, calibrating output probabilities, and updating the LLM's factual
knowledge and reasoning chains. Natural language intervention can also effectively
guide model predictions towards better fairness.
Reference: Prompting GPT-3 To Be Reliable.pdf
```
NOTE: Full code for example.py and config.py
This chapter is designed to provide all the code used in the tutorial. It is a separate section from the rest of the tutorial.
For your convenience, I've attached the full code used in the tutorial.
1. config.py
# Azure Search Service settings
SEARCH_SERVICE_NAME = 'your-search-service-name' # 'test-search-service1'
SEARCH_SERVICE_ENDPOINT = f'https://{SEARCH_SERVICE_NAME.lower()}.search.windows.net/'
SEARCH_SERVICE_KEY = 'your-search-service-key'
SEARCH_SERVICE_API_VERSION = 'your-API-version' # '2023-07-01-preview'
# Azure Search Service Index settings
SEARCH_SERVICE_INDEX_NAME1 = 'your-search-service-index-name' # 'azureblob-index1'
# Azure Cognitive Search Service Semantic configuration settings
SEARCH_SERVICE_SEMANTIC_CONFIG_NAME = 'your-semantic-configuration-name' # 'test-configuration'
# Azure OpenAI settings
AZURE_OPENAI_NAME = 'your-openai-name' # 'testopenai1004'
AZURE_OPENAI_ENDPOINT = f'https://{AZURE_OPENAI_NAME.lower()}.openai.azure.com/'
AZURE_OPENAI_KEY = 'your-openai-key'
AZURE_OPENAI_API_VERSION = 'your-API-version' # '2023-08-01-preview'
2. example.py
# Library imports
from collections import OrderedDict
import requests
# Langchain library imports
from langchain.chains import RetrievalQAWithSourcesChain
from langchain.chat_models import AzureChatOpenAI
from langchain.docstore.document import Document
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import FAISS
# Configuration imports
from config import (
    SEARCH_SERVICE_ENDPOINT,
    SEARCH_SERVICE_KEY,
    SEARCH_SERVICE_API_VERSION,
    SEARCH_SERVICE_INDEX_NAME1,
    SEARCH_SERVICE_SEMANTIC_CONFIG_NAME,
    AZURE_OPENAI_ENDPOINT,
    AZURE_OPENAI_KEY,
    AZURE_OPENAI_API_VERSION,
)

# Cognitive Search Service header settings
HEADERS = {
    'Content-Type': 'application/json',
    'api-key': SEARCH_SERVICE_KEY
}
# Function to search documents using Azure Cognitive Search
def search_documents(question):
    url = (SEARCH_SERVICE_ENDPOINT + 'indexes/' +
           SEARCH_SERVICE_INDEX_NAME1 + '/docs')
    params = {
        'api-version': SEARCH_SERVICE_API_VERSION,
        'search': question,
        'select': '*',
        '$top': 3,
        'queryLanguage': 'en-us',
        'queryType': 'semantic',
        'semanticConfiguration': SEARCH_SERVICE_SEMANTIC_CONFIG_NAME,
        '$count': 'true',
        'speller': 'lexicon',
        'answers': 'extractive|count-3',
        'captions': 'extractive|highlight-false'
    }
    resp = requests.get(url, headers=HEADERS, params=params)
    return resp.json()

# Extract documents that score above a certain threshold in semantic search
def extract_documents(search_results):
    file_content = OrderedDict()
    for result in search_results['value']:
        # The '@search.rerankerScore' range is 0 to 4, where a higher score indicates a stronger semantic match.
        if result['@search.rerankerScore'] > 1.5:
            file_content[result['metadata_storage_path']] = {
                'chunks': result['pages'][:10],
                'captions': result['@search.captions'][:10],
                'score': result['@search.rerankerScore'],
                'file_name': result['metadata_storage_name']
            }
    return file_content

# Function to create an embedding model
def create_embeddings():
    return OpenAIEmbeddings(
        openai_api_type='azure',
        openai_api_key=AZURE_OPENAI_KEY,
        openai_api_base=AZURE_OPENAI_ENDPOINT,
        openai_api_version=AZURE_OPENAI_API_VERSION,
        deployment='text-embedding-ada-002',
        model='text-embedding-ada-002',
        chunk_size=1
    )

# Function to create a Vectorstore
def create_vector_store(docs, embeddings):
    return FAISS.from_documents(docs, embeddings)

# Function to retrieve search results using LangChain and ChatGPT
def search_with_langchain(vector_store, question):
    llm = AzureChatOpenAI(
        openai_api_key=AZURE_OPENAI_KEY,
        openai_api_base=AZURE_OPENAI_ENDPOINT,
        openai_api_version=AZURE_OPENAI_API_VERSION,
        openai_api_type='azure',
        deployment_name='gpt-35-turbo',
        temperature=0.0,
        max_tokens=500
    )
    chain = RetrievalQAWithSourcesChain.from_chain_type(
        llm=llm,
        chain_type='stuff',
        retriever=vector_store.as_retriever(),
        return_source_documents=True
    )
    return chain({'question': question})

def main():
    QUESTION = 'Tell me about effective prompting strategies'

    # Search for documents with Azure Cognitive Search
    search_results = search_documents(QUESTION)
    file_content = extract_documents(search_results)
    print('Total Documents Found: {}, Top Documents: {}'.format(
        search_results['@odata.count'], len(search_results['value'])))

    # 'chunks' is the value that corresponds to the Pages field that you set up in the Cognitive Search service.
    # Find the number of chunks
    docs = []
    for key, value in file_content.items():
        for page in value['chunks']:
            docs.append(Document(page_content=page,
                                 metadata={"source": value["file_name"]}))
    print("Number of chunks: ", len(docs))

    # Create an embedding model
    embeddings = create_embeddings()

    # Create a Vectorstore
    vector_store = create_vector_store(docs, embeddings)

    # Search results using LangChain and ChatGPT
    result = search_with_langchain(vector_store, QUESTION)
    print('Question: ', QUESTION)
    print('Answer: ', result['answer'])
    print('Reference: ', result['sources'].replace(",", "\n"))

# execute the main function
if __name__ == "__main__":
    main()
Congratulations!
You've completed this tutorial and learned how to integrate Azure Cognitive Search with Azure OpenAI.
In this tutorial, we have navigated through a practical journey of integrating Azure Blob Storage, Azure Cognitive Search, and Azure OpenAI to create a powerful search and response mechanism.
1. Storing Data in Azure Blob Storage
Our first step was to efficiently store PDF files in Azure Blob Storage, an unstructured data store known for its scalability and security. This storage served as a foundational base, housing the search material that would later be indexed and queried to retrieve relevant information.
2. Implementing Azure Cognitive Search
In the next step, we used Azure Cognitive Search to index the PDFs stored in Azure Blob Storage and to run searches against that index.
3. Integrating Azure OpenAI with VS Code
The final step of our tutorial was to integrate Azure OpenAI through a program created in VS Code. This program was designed to use the search information processed and refined by Azure Cognitive Search to generate accurate and contextually relevant answers. The synergy between these technologies illustrated the seamless interplay of storage, search, and response mechanisms.
I hope that the knowledge and skills covered here will be useful in your future projects. Together, Azure Blob Storage, Azure Cognitive Search, and Azure OpenAI offer a practical way to manage and use unstructured data.
Thank you for your commitment and hard work throughout this learning journey.
Next Steps
Documentation
Azure Blob Storage documentation
Introduction to Azure Blob Storage
Azure Cognitive Search Documentation
Azure OpenAI Service Documentation
Azure OpenAI on your data
Training Content
Configure Azure Blob Storage
Store application data with Azure Blob Storage
Implement advanced search features in Azure Cognitive Search
Use semantic search to get better search results in Azure Cognitive Search
Get started with Azure OpenAI Service
Apply prompt engineering with Azure OpenAI Service