Build a vector search application in Python using BigFrames

Vector Search & AI: Illuminating Hidden Data
Tired of keyword searches missing the mark? Imagine finding songs, products, or articles that truly feel right. That’s where Vector Search comes in. It goes beyond simple keyword matching, understanding the meaning behind your data.
How? With embeddings: digital fingerprints that turn text, images, video, audio, or any other kind of data into a list of numbers (a vector) that captures its essence. Similar items get similar vectors, making them easy to find. Think of it as mapping everything into a space where closeness equals similarity. Vector search finds what you mean, not just what you say.
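To make "closeness equals similarity" concrete, here is a toy sketch in plain NumPy (not BigQuery, and with made-up 4-dimensional vectors standing in for real embeddings) that scores two hypothetical song descriptions against an unrelated document:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity: close to 1.0 means the vectors point the same way."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical 4-dimensional embeddings (real models produce hundreds of dimensions)
song_a = np.array([0.9, 0.1, 0.0, 0.2])   # "upbeat summer anthem"
song_b = np.array([0.8, 0.2, 0.1, 0.3])   # "feel-good pop track"
invoice = np.array([0.0, 0.1, 0.9, 0.8])  # "quarterly expense report"

print(cosine_similarity(song_a, song_b))   # high: similar meaning
print(cosine_similarity(song_a, invoice))  # low: unrelated meaning
```

The two song vectors land close together in the space, while the invoice lands far away, which is exactly the property vector search exploits.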
And in today’s world, where mountains of unstructured data — images, videos, documents — remain hidden, vector search, powered by generative AI, is the key to finally bringing those valuable insights to light.
BigQuery (Google Cloud’s unified Data to AI platform) provides a complete vector analytics experience, handling everything from embedding generation to search within its data platform, all in a serverless and integrated environment using just SQL.
BigQuery’s Vector Power, Now in Python with BigFrames
Data scientists and developers often struggle with the complexity of integrating vector search into their workflows. They face the challenge of moving data between different systems, managing separate vector databases and indexes, and dealing with the performance bottlenecks of traditional Python-based vector search implementations, especially at scale. In this blog, I will demonstrate how we are making the platform more flexible by bringing the same experience to Python practitioners using BigFrames.
BigQuery DataFrames, aka BigFrames, is an open source Python library offered by Google. BigFrames scales Python data processing by transpiling common Python data science APIs to BigQuery SQL. You can read more in the official introduction to BigFrames and browse its public Git repository.
BigFrames empowers data scientists and application developers to leverage familiar Python tools like Pandas and Scikit-learn for vector search within BigQuery. This means they can seamlessly integrate vector search workflows into their existing Python-based data pipelines, benefiting from BigQuery’s scalability and performance without needing to learn new languages or paradigms.
Python for Patent Insights: A real-world example
Let’s demonstrate this with a real-world example using the Google BigQuery public patent dataset.
The following code block showcases how you would read a BigQuery public dataset, use BigFrames functions to create a text embedding model, generate vector embeddings over the abstract column, and store them back to a BigQuery table for future use. Note that during this workflow, data is never copied over to the client side. BigFrames transpiles the calls into BigQuery SQL and uses server-side processing to do all the work. BigFrames lets you work with large datasets like you would with pandas, but without the usual memory and processing limits: you can process massive amounts of data directly in your notebook, without worrying about compute constraints.
import bigframes.pandas as bf
import bigframes.ml.llm as bf_llm

## Read the public dataset into a dataframe
publications = bf.read_gbq("patents-public-data.google_patents_research.publications")
## Rename the abstract column to "content" so we can generate embeddings for that column
publications = publications[["publication_number", "title", "abstract"]].rename(columns={"abstract": "content"})
## Create a text embedding remote model
text_model = bf_llm.TextEmbeddingGenerator()
## Create embeddings for the abstracts of the filed patents.
## The original table has over 100M rows, so run this on a subset while testing;
## see the example notebook linked at the end of the blog for details.
## Do not run the embedding model on the whole table, as that would invoke the
## model on all 100M+ rows.
embedding = text_model.predict(publications)[["publication_number", "title", "content", "ml_generate_embedding_result", "ml_generate_embedding_status"]]
## Store the embeddings in a BigQuery table
embedding.to_gbq(f"{DATASET_ID}.{TEXT_EMBEDDING_TABLE_ID}", if_exists="replace")
Indexing for Speed: Unleashing Scalable Search
The patent data has over 100M entries, so a brute-force search would be computationally very expensive. That’s where vector indexes shine. They employ approximate nearest neighbor (ANN) techniques, such as IVF and ScaNN, to build efficient data structures that enable sub-linear search times. To learn more about the performance gains vector indexes bring to high-volume queries, particularly ScaNN in BigQuery vector search, see this article.
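To build intuition for how an IVF-style index achieves this, here is a toy sketch in plain NumPy (not BigQuery’s actual implementation): vectors are grouped around centroids up front, and a query scans only the few closest groups instead of the whole corpus. The data size, cluster count, and centroid-picking shortcut are all illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy corpus: 1,000 embeddings of dimension 8 (real patent embeddings are far larger)
corpus = rng.normal(size=(1000, 8))

# "Train" the index: pick k centroids (a crude stand-in for k-means clustering)
k = 10
centroids = corpus[rng.choice(len(corpus), size=k, replace=False)]
# Assign each vector to its nearest centroid, forming one inverted list per centroid
assignments = np.argmin(np.linalg.norm(corpus[:, None] - centroids[None], axis=2), axis=1)

def ivf_search(query: np.ndarray, nprobe: int = 2, top_k: int = 3) -> np.ndarray:
    """Scan only the nprobe closest clusters instead of the whole corpus."""
    nearest_clusters = np.argsort(np.linalg.norm(centroids - query, axis=1))[:nprobe]
    candidates = np.where(np.isin(assignments, nearest_clusters))[0]
    dists = np.linalg.norm(corpus[candidates] - query, axis=1)
    return candidates[np.argsort(dists)[:top_k]]

query = rng.normal(size=8)
print(ivf_search(query))  # indices of (approximately) nearest neighbors
```

With `nprobe=2` only about a fifth of the corpus is examined, which is where the speedup comes from; raising `nprobe` trades speed back for recall, and probing every cluster recovers the exact brute-force result.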
Use BigFrames to create an index:
import bigframes.bigquery as bf_bq

bf_bq.create_vector_index(
    table_id=f"{DATASET_ID}.{TEXT_EMBEDDING_TABLE_ID}",
    column_name="ml_generate_embedding_result",
    replace=True,
    index_name="bf_python_index",
    distance_type="cosine",
    index_type="ivf",
)
Use case 1: Is It Already Patented: Finding What’s Been Done Before
This section demonstrates how to use vector search to quickly identify patents that are closely related to a specific idea. This approach significantly reduces the computational complexity of patent searches, enabling scalable and accurate identification of relevant patents. The code below shows the implementation and returns the top 3 results semantically closest to the user query.
## Replace with whatever search string you want to use for the patent prior art search
TEXT_SEARCH_STRING = "Chip assemblies employing solder bonds to back-side lands including an electrolytic nickel layer"
## Convert the search string to a dataframe
TEXT_SEARCH_DF = bf.DataFrame([TEXT_SEARCH_STRING], columns=["search_string"])
## Generate the embedding of the search query
search_query = text_model.predict(TEXT_SEARCH_DF)
## Vector search using BigFrames
vector_search_results = bf_bq.vector_search(
    base_table=f"{DATASET_ID}.{TEXT_EMBEDDING_TABLE_ID}",
    column_to_search="ml_generate_embedding_result",
    query=search_query,
    distance_type="COSINE",
    query_column_to_search="ml_generate_embedding_result",
    top_k=3,
)
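Conceptually, the vector search step is a nearest-neighbor lookup by cosine distance. The following local pandas sketch, with a made-up four-row table standing in for the embedding table and short illustrative vectors, shows the equivalent brute-force computation for top_k=3:

```python
import numpy as np
import pandas as pd

def cosine_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine distance = 1 - cosine similarity; smaller means more similar."""
    return 1.0 - float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical embedding table (the BigQuery table holds one such row per patent)
table = pd.DataFrame({
    "publication_number": ["US-1", "US-2", "US-3", "US-4"],
    "embedding": [
        np.array([1.0, 0.0, 0.1]),
        np.array([0.9, 0.1, 0.2]),
        np.array([0.0, 1.0, 0.9]),
        np.array([0.1, 0.8, 1.0]),
    ],
})

query = np.array([1.0, 0.1, 0.0])  # embedding of the search string
table["distance"] = table["embedding"].apply(lambda e: cosine_distance(query, e))
top_3 = table.nsmallest(3, "distance")  # the 3 rows closest to the query
print(top_3[["publication_number", "distance"]])
```

The index created earlier lets BigQuery avoid exactly this full-table scan; the results are the same in spirit, just retrieved in sub-linear time.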
Use case 2: AI-Powered Patent Summarization with RAG
Patent documents can be dense and time-consuming to digest. AI-powered patent summarization uses Retrieval Augmented Generation (RAG) to streamline this process. By retrieving relevant patent information through vector search and then synthesizing it with a large language model, we can generate concise, human-readable summaries, saving valuable time and effort. The code sample below walks through how to set this up, continuing with the same user query as in the previous use case.
import json
from IPython.display import Markdown

## Create the Gemini model
llm_model = bf_llm.GeminiTextGenerator(model_name="gemini-1.5-flash-002")
## Sampling temperature for the model (0 makes the output more deterministic)
TEMPERATURE = 0.0
## Extract the retrieved abstracts into a list of JSON strings
ALL_ABSTRACTS = [json.dumps({"abstract": s}) for s in vector_search_results["content_1"]]
## Define the prompt
prompt = f"""
You are an expert patent analyst. I will provide you the abstracts of the top 5 patents in JSON format retrieved by a vector search based on a user's query.
Your task is to analyze these abstracts and generate a concise, coherent summary that encapsulates the core innovations and concepts shared among them.
In your output, share the original user query.
Then output the concise, coherent summary that encapsulates the core innovations and concepts shared among the top 5 abstracts. The heading for this section should
be: Summary of the top 5 abstracts that are semantically closest to the user query.
User Query: {TEXT_SEARCH_STRING}
Top 5 abstracts: {ALL_ABSTRACTS}
"""

## Helper function to call the Gemini model
def predict(prompt: str, temperature: float = TEMPERATURE) -> str:
    # Wrap the prompt in a single-row dataframe
    input_df = bf.DataFrame({"prompt": [prompt]})
    # Return the generated text for that row
    return llm_model.predict(input_df, temperature=temperature).ml_generate_text_llm_result.iloc[0]

## Invoke the LLM with the prompt
response = predict(prompt, temperature=TEMPERATURE)
## Render the response as Markdown
Markdown(response)
Here is the response the code provides:
User Query: Chip assemblies employing solder bonds to back-side lands
including an electrolytic nickel layer
Summary of the top 5 abstracts that are semantically closest to the user query.
The top 5 patent abstracts describe various chip packaging techniques focusing
on improved interconnect density, heat dissipation, and miniaturization.
While none explicitly mention "electrolytic nickel layers" in the back-side
lands, a common theme is the use of multiple layers and advanced materials
to facilitate high-density chip-to-substrate connections. Several patents
detail the use of solder bumps (or balls) for electrical connections
between chips and a substrate or carrier, often incorporating intermediary
metal layers (e.g., under bump metallization (UBM)) for improved reliability
and electrical performance. The methods described aim to achieve smaller
pitches between components and higher integration density through innovative
arrangements of pads, conductors, and insulating layers. The use of different
materials such as silicon dioxide, silicon nitride, and various metals
(including copper and potentially nickel-containing alloys, inferred from
the context of the user query) is prevalent in building robust and efficient
chip packaging solutions. The patents also highlight techniques for managing
the thermal challenges associated with high-density packaging.
Your Journey to Intelligent Data Discovery Starts Now
We’ve just scratched the surface of how BigFrames unlocks the power of vector search within BigQuery. By bringing familiar Python tools directly to your data, you can now seamlessly integrate vector search into your analytics and RAG pipelines, accelerating your journey from raw data to actionable insights.
Ready to dive deeper and explore the endless possibilities? Start building your own vector search applications with BigFrames and BigQuery today! Check out our documentation, run the entire code discussed above in your environment, explore our other sample notebooks, and unleash the power of vector analytics on your data.
The BigFrames team would also love to hear from you. If you would like to reach out, please send an email to bigframes-feedback@google.com or file an issue at the open source BigFrames repository. To receive updates about BigFrames, subscribe to the BigFrames email list.