Tutorial: Timescale Vector (PostgreSQL) Python Client Library

This notebook shows how to use the PostgreSQL as vector database via the Python Vector python client library. You’ll learn how to use the client for (1) semantic search, (2) time-based vector search, (3) and how to create indexes to speed up queries.

Follow along by downloading the Jupyter notebook version of this tutorial here.

Sample dataset: We’ll analyze a gitlog dataset (git commit messages) and use the vector embeddings of the commit messages to find relevant commit messages to a given query question. Each git commit entry has a timestamp associated with it, as well as natural language message and other metadata (e.g author, commit hash etc).

What is Timescale Vector?

Timescale Vector is PostgreSQL++ for AI applications.

Timescale Vector enables you to efficiently store and query millions of vector embeddings in PostgreSQL.

Enhances pgvector with faster and more accurate similarity search on 100M+ vectors via DiskANN inspired indexing algorithm.
Enables fast time-based vector search via automatic time-based partitioning and indexing.
Provides a familiar SQL interface for querying vector embeddings and relational data.

Timescale Vector is cloud PostgreSQL for AI that scales with you from POC to production:

Simplifies operations by enabling you to store relational metadata, vector embeddings, and time-series data in a single database.
Benefits from rock-solid PostgreSQL foundation with enterprise-grade feature liked streaming backups and replication, high-availability and row-level security.
Enables a worry-free experience with enterprise-grade security and compliance.

How to access Timescale Vector

Timescale Vector is available on Timescale, the cloud PostgreSQL platform. (There is no self-hosted version at this time.)

Good news! You get a 90-day free trial for Timescale Vector:

To get started, signup to Timescale, create a new database and follow this notebook!
See the Timescale Vector explainer blog for more details and performance benchmarks.

0. Setup

Download the Jupyter notebook version of this tutorial.

# install needed packages
!pip install timescale-vector
!pip install openai
!pip install tiktoken
!pip install python-dotenv

# import needed packages
import os
from dotenv import load_dotenv, find_dotenv
import timescale_vector
from timescale_vector import client
import openai
import pandas as pd
from pathlib import Path
import numpy as np
import json
import tiktoken
import ast
import math
import uuid
from datetime import datetime
from datetime import timedelta
from typing import List, Tuple

We’ll use OpenAI’s embedding models so let’s load our OpenAI API keys from a .env file.

If you do not have an OpenAI API Key, signup for an OpenAI Developer Account and create an API Key. See OpenAI’s developer platform for more information.

# Run export OPENAI_API_KEY=sk-YOUR_OPENAI_API_KEY...
# Get openAI api key by reading local .env file

_ = load_dotenv(find_dotenv())
openai.api_key  = os.environ['OPENAI_API_KEY']

Next let’s load our Timescale service URL which we’ll use to connect to our cloud PostgreSQL database hosted on Timescale.

Launch a PostgreSQL database on Timescale and download the .env file after your service is created.

# Get the service url by reading local .env file
# The .env file should contain a line starting with `TIMESCALE_SERVICE_URL=postgresql://`
_ = load_dotenv(find_dotenv())
TIMESCALE_SERVICE_URL = os.environ["TIMESCALE_SERVICE_URL"]

# OR set it explicitly
# TIMESCALE_SERVICE_URL = "postgres://tsdbadmin:<password>@<id>.tsdb.cloud.timescale.com:<port>/tsdb?sslmode=require"

1. Create a table to store the vectors and metadata

First, we’ll define a table name, which will be the name of our table in the PostgreSQL database.

We set the time_partition_interval argument in the client creation function to enable automatic time-based partitioning of the table. This will partition the table into time-based chunks (in this case each containing data for 7 days) and create indexes on the time-based chunks to speed up time-based queries.

Each partition will consist of data for the specified length of time. We’ll use 7 days for simplicity, but you can pick whatever value make sense for your use case – for example if you query recent vectors frequently you might want to use a smaller time delta like 1 day, or if you query vectors over a decade long time period then you might want to use a larger time delta like 6 months or 1 year.

# Table information
TABLE_NAME = "commit_history"
EMBEDDING_DIMENSIONS = 1536
TIME_PARTITION_INTERVAL = timedelta(days=7)

# Create client object
vec = client.Async(TIMESCALE_SERVICE_URL, 
                   TABLE_NAME,  
                   EMBEDDING_DIMENSIONS, 
                   time_partition_interval=TIME_PARTITION_INTERVAL)

# create table
await vec.create_tables()

The create_tables() function will create a table with the following schema:

id | metadata | contents | embedding

id is the UUID which uniquely identifies each vector.
metadata is a JSONB column which stores the metadata associated with each vector.
contents is the text column which stores the content we want vectorized (in this case the commit message).
embedding is the vector column which stores the vector embedding representation of the content.

2. Load in dataset, create vector embeddings, and prepare data for ingestion

Load sample dataset

First, you’ll need to download the sample dataset and place it in the same directory as this notebook.

You can use following command:

# Download the file using curl and save it as commit_history.csv
# Note: Execute this command in your terminal, in the same directory as the notebook
# curl -O https://s3.amazonaws.com/assets.timescale.com/ai/commit_history.csv

Then, we’ll load in the gitlog dataset CSV file into a pandas dataframe.

Note: Since this is a demo, we will only work with the first 1000 records. In practice, you can load as many records as you want.

# Set the path to the dataset file relative to this notebook
file_path = Path("commit_history.csv")

# Read the CSV file into a DataFrame
df = pd.read_csv(file_path)

# Light data cleaning on CSV
df.dropna(inplace=True)
df = df.astype(str)
df = df[:1000]

# Take a look at the data in the csv (optional)
df.head()

	commit	author	date	change summary	change details
0	44e41c12ab25e36c202f58e068ced262eadc8d16	Lakshmi Narayanan Sreethar<lakshmi@timescale.com>	Tue Sep 5 21:03:21 2023 +0530	Fix segfault in set_integer_now_func	When an invalid function oid is passed to set_...
1	e66a40038e3c84fb1a68da67ad71caf75c64a027	Bharathy<satish.8483@gmail.com>	Sat Sep 2 09:24:31 2023 +0530	Fix server crash on UPDATE of compressed chunk	UPDATE query with system attributes in WHERE c...
2	c6a930897e9f9e9878db031cc7fb6ea79d721a74	Jan Nidzwetzki<jan@timescale.com>	Tue Aug 29 21:13:51 2023 +0200	Use Debian Bookworm for 32-bit tests	So far, we have used Debian Buster (10) for ou...
3	8e941b80ae1b0e0b6affe5431454cdc637628d99	Lakshmi Narayanan Sreethar<lakshmi@timescale.com>	Mon Aug 28 23:19:22 2023 +0530	Fix incorrect row count in EXPLAIN ANALYZE INS...	INSERT ... ON CONFLICT statements record few m...
4	caada43454e25d3098744fa6b675ac7d07390550	Lakshmi Narayanan Sreethar<lakshmi@timescale.com>	Tue May 30 20:32:29 2023 +0530	PG16: Fix concurrent update issues with MERGE.	PG16 commit postgres/postgres@9321c79c fixes a...

Create vector embeddings

Next we’ll create vector embeddings of the commit messages using the OpenAI API. We’ll use the text-embedding-ada-002 model to create the embeddings (Learn more here).

# Create embeddings for each commit message

embedding_list = []
content_list = []

openai_client = openai.Client()

# Helper function: get embeddings for a text
def get_embeddings(text):
    response = openai_client.embeddings.create(
        model="text-embedding-ada-002",
        input = text.replace("\n"," ")
    )
    embedding = response.data[0].embedding
    return embedding

for index, row in df.iterrows():
    #construct text we want to embed
    text = row['author'] + " "+ row['date'] + " " +row['commit']+ " " + row['change summary'] + " "+ row['change details']
    content_list.append(text)
    embedding = get_embeddings(text)
    embedding_list.append(embedding)

# Append embddings and content to dataframe
df['content'] = content_list
df['embedding'] = embedding_list

Prepare data for ingestion

Next, we’ll create a uuid for each git log entry.

We’ll define a helper funciton create_uuid() to create a uuid for commit message and associated vector embedding based on its timestamp.

In the helper function, we’ll use the timescale vector client library’s uuid_from_time() method to take a date and create a uuid with a datetime portion that reflects the date string.

# Create uuids for each message
uuid_list = []

# helper function to take in a date string in the past and return a uuid v1
def create_uuid(date_string: str):
    if date_string is None:
        return None
    time_format = '%a %b %d %H:%M:%S %Y %z'
    datetime_obj = datetime.strptime(date_string, time_format)
    uuid = client.uuid_from_time(datetime_obj)
    return str(uuid)

for index, row in df.iterrows():
    uuid = create_uuid(row['date'])
    uuid_list.append(uuid)

# Add uuids to dataframe
df['uuid'] = uuid_list

Finally, let’s create a json of metadata for each entry in our dataset.

We’ll again use some helper functions to ensure we have data in the right format for easy filtering.

# Helper functions
# Helper function to split name and email given an author string consisting of Name Lastname <email>
def split_name(input_string: str) -> Tuple[str, str]:
    if input_string is None:
        return None, None
    start = input_string.find("<")
    end = input_string.find(">")
    name = input_string[:start].strip()
    return name

def create_date(input_string: str) -> datetime:
    if input_string is None:
        return None
    # Define a dictionary to map month abbreviations to their numerical equivalents
    month_dict = {
        "Jan": "01",
        "Feb": "02",
        "Mar": "03",
        "Apr": "04",
        "May": "05",
        "Jun": "06",
        "Jul": "07",
        "Aug": "08",
        "Sep": "09",
        "Oct": "10",
        "Nov": "11",
        "Dec": "12",
    }

    # Split the input string into its components
    components = input_string.split()
    # Extract relevant information
    day = components[2]
    month = month_dict[components[1]]
    year = components[4]
    time = components[3]
    timezone_offset_minutes = int(components[5])  # Convert the offset to minutes
    timezone_hours = timezone_offset_minutes // 60  # Calculate the hours
    timezone_minutes = timezone_offset_minutes % 60  # Calculate the remaining minutes
    # Create a formatted string for the timestamptz in PostgreSQL format
    timestamp_tz_str = f"{year}-{month}-{day} {time}+{timezone_hours:02}{timezone_minutes:02}"
    return timestamp_tz_str

metadata_list = []

for index, row in df.iterrows():
    metadata = {
        "author": split_name(row['author']),
        "date": create_date(row['date']),
        "commit": row['commit'],
    }
    metadata_list.append(metadata)

# Add metadata to dataframe
df['metadata'] = metadata_list

print(metadata_list[0])

{'author': 'Lakshmi Narayanan Sreethar', 'date': '2023-09-5 21:03:21+0850', 'commit': '44e41c12ab25e36c202f58e068ced262eadc8d16'}

We’ve finished loading in our dataset and preparing it for storage in Timescale Vector.

Note that we’ve explained each step of the data preparation process in detail for the purposes of this tutorial. In practice, you can create embeddings, uuids and metadata in a single step.

3. Add data to Timescale Vector

Next, we’ll add our commit history data to the commit_history table in Timescale Vector.

We’ll prepare a list of tuples to add to the table. Each tuple will contain the following data:

(id, metadata, contents, embedding)

to match the schema of our table.

# Remind ourselves of the data in the dataframe
# df.head()

# convert dataframe to array of tuples
records = []
for index, row in df.iterrows():
    record = (row['uuid'], row['metadata'], row['content'], row['embedding'])
    records.append(record)

Lastly, we batch upsert the data into the table using the .upsert() method passing in our list of tuples we prepared above.

# batch upsert vectors into table
await vec.upsert(records)

4. Similarity search with Timescale Vector

Simple similarity search

Let’s see an example of performing a simple similarity search using Timescale Vector, where we’ll find the most similar commit messages to a given query by performing a nearest neighbor search.

# define search query and query_embedding
query_string = "What's new with continuous aggregates?"
query_embedding = get_embeddings(query_string)

# search table for similar vectors to query_embedding
records = await vec.search(query_embedding)

The response from the .search() method returns a list Records objects. Each Record object contains the following attributes: id, metadata, contents, embedding, distance.

The results are sorted by distance. The first result is the most similar to our query.

Let’s inspect them below:

records

[<Record id=UUID('18331d00-fc57-11ec-a166-06cee12dbc78') metadata={'date': '2022-07-5 13:39:14+0320', 'author': 'Fabrízio de Royes Mello', 'commit': ' e34218ce2963358a500f6bc315aace0fad29c450'} contents="Fabrízio de Royes Mello<fabriziomello@gmail.com> Tue Jul 5 13:39:14 2022 +0200  e34218ce2963358a500f6bc315aace0fad29c450 Migrate Continuous Aggregates to the new format Timescale 2.7 released a new version of Continuous Aggregate (#4269) that store the final aggregation state instead of the byte array of the partial aggregate state, offering multiple opportunities of optimizations as well a more compact form.  When upgrading to Timescale 2.7, new created Continuous Aggregates are using the new format, but existing Continuous Aggregates keep using the format they were defined with.  Created a procedure to upgrade existing Continuous Aggregates from the old format to the new format, by calling a simple procedure:  test=# CALL cagg_migrate('conditions_summary_daily');  Closes #4424 " embedding=array([-0.02072006, -0.00232497, -0.00290987, ..., -0.00420762,
        -0.00879542, -0.02118798], dtype=float32) distance=0.15402132505614874>,
 <Record id=UUID('18331d00-fc57-11ec-8f40-352ea14812b8') metadata={'date': '2022-07-5 13:39:14+0320', 'author': 'Fabrízio de Royes Mello', 'commit': ' e34218ce2963358a500f6bc315aace0fad29c450'} contents="Fabrízio de Royes Mello<fabriziomello@gmail.com> Tue Jul 5 13:39:14 2022 +0200  e34218ce2963358a500f6bc315aace0fad29c450 Migrate Continuous Aggregates to the new format Timescale 2.7 released a new version of Continuous Aggregate (#4269) that store the final aggregation state instead of the byte array of the partial aggregate state, offering multiple opportunities of optimizations as well a more compact form.  When upgrading to Timescale 2.7, new created Continuous Aggregates are using the new format, but existing Continuous Aggregates keep using the format they were defined with.  Created a procedure to upgrade existing Continuous Aggregates from the old format to the new format, by calling a simple procedure:  test=# CALL cagg_migrate('conditions_summary_daily');  Closes #4424 " embedding=array([-0.02072006, -0.00232497, -0.00290987, ..., -0.00420762,
        -0.00879542, -0.02118798], dtype=float32) distance=0.15402132505614874>,
 <Record id=UUID('c98d1c00-6c13-11ed-b2e5-ba3746d2d4a5') metadata={'date': '2022-11-24 13:19:36+-500', 'author': 'Fabrízio de Royes Mello', 'commit': ' 35c91204987ccb0161d745af1a39b7eb91bc65a5'} contents='Fabrízio de Royes Mello<fabriziomello@gmail.com> Thu Nov 24 13:19:36 2022 -0300  35c91204987ccb0161d745af1a39b7eb91bc65a5 Add Hierarchical Continuous Aggregates validations Commit 3749953e introduce Hierarchical Continuous Aggregates (aka Continuous Aggregate on top of another Continuous Aggregate) but it lacks of some basic validations.  Validations added during the creation of a Hierarchical Continuous Aggregate:  * Forbid create a continuous aggregate with fixed-width bucket on top of   a continuous aggregate with variable-width bucket.  * Forbid incompatible bucket widths:   - should not be equal;   - bucket width of the new continuous aggregate should be greater than     the source continuous aggregate;   - bucket width of the new continuous aggregate should be multiple of     the source continuous aggregate. ' embedding=array([-0.03262706, -0.0018098 , -0.01641467, ...,  0.00157952,
        -0.01413165, -0.01476743], dtype=float32) distance=0.15454035563716317>,
 <Record id=UUID('c98d1c00-6c13-11ed-9626-aeca4bbf6c5d') metadata={'date': '2022-11-24 13:19:36+-500', 'author': 'Fabrízio de Royes Mello', 'commit': ' 35c91204987ccb0161d745af1a39b7eb91bc65a5'} contents='Fabrízio de Royes Mello<fabriziomello@gmail.com> Thu Nov 24 13:19:36 2022 -0300  35c91204987ccb0161d745af1a39b7eb91bc65a5 Add Hierarchical Continuous Aggregates validations Commit 3749953e introduce Hierarchical Continuous Aggregates (aka Continuous Aggregate on top of another Continuous Aggregate) but it lacks of some basic validations.  Validations added during the creation of a Hierarchical Continuous Aggregate:  * Forbid create a continuous aggregate with fixed-width bucket on top of   a continuous aggregate with variable-width bucket.  * Forbid incompatible bucket widths:   - should not be equal;   - bucket width of the new continuous aggregate should be greater than     the source continuous aggregate;   - bucket width of the new continuous aggregate should be multiple of     the source continuous aggregate. ' embedding=array([-0.03262706, -0.0018098 , -0.01641467, ...,  0.00157952,
        -0.01413165, -0.01476743], dtype=float32) distance=0.15454035563716317>,
 <Record id=UUID('2144db80-88bd-11ec-8cea-ce147abb2c4b') metadata={'date': '2022-02-8 09:57:23+0140', 'author': 'Erik Nordström', 'commit': ' 5af9f45488d51027804cac16362811f71a89bb64'} contents='Erik Nordström<erik@timescale.com> Tue Feb 8 09:57:23 2022 +0100  5af9f45488d51027804cac16362811f71a89bb64 Add extra telemetry for continuous aggregates Add the following telemetry fields for continuous aggregates:  * The number of continuous aggregates created on distributed   hypertables * The number of continuous aggregates using real-time aggregation ' embedding=array([-0.02287812, -0.02124397,  0.01628467, ..., -0.01387608,
        -0.01325794, -0.01354214], dtype=float32) distance=0.15761212923621704>,
 <Record id=UUID('2144db80-88bd-11ec-a0f8-155e201a9074') metadata={'date': '2022-02-8 09:57:23+0140', 'author': 'Erik Nordström', 'commit': ' 5af9f45488d51027804cac16362811f71a89bb64'} contents='Erik Nordström<erik@timescale.com> Tue Feb 8 09:57:23 2022 +0100  5af9f45488d51027804cac16362811f71a89bb64 Add extra telemetry for continuous aggregates Add the following telemetry fields for continuous aggregates:  * The number of continuous aggregates created on distributed   hypertables * The number of continuous aggregates using real-time aggregation ' embedding=array([-0.02287812, -0.02124397,  0.01628467, ..., -0.01387608,
        -0.01325794, -0.01354214], dtype=float32) distance=0.15761212923621704>,
 <Record id=UUID('dddb6100-d17a-11ec-8fe6-3457c1f43f23') metadata={'date': '2022-05-11 19:36:58+-500', 'author': 'Fabrízio de Royes Mello', 'commit': ' f266f5cf564fcc5509b91493a39eb201c6f5914a'} contents="Fabrízio de Royes Mello<fabriziomello@gmail.com> Wed May 11 19:36:58 2022 -0300  f266f5cf564fcc5509b91493a39eb201c6f5914a Continuous Aggregates finals form Following work started by #4294 to improve performance of Continuous Aggregates by removing the re-aggregation in the user view.  This PR get rid of `partialize_agg` and `finalize_agg` aggregate functions and store the finalized aggregated (plain) data in the materialization hypertable.  Because we're not storing partials anymore and removed the re-aggregation, now is be possible to create indexes on aggregated columns in the materialization hypertable in order to improve the performance even more.  Also removed restrictions on types of aggregates users can perform with Continuous Aggregates: * aggregates with DISTINCT * aggregates with FILTER * aggregates with FILTER in HAVING clause * aggregates without combine function * ordered-set aggregates * hypothetical-set aggregates  By default new Continuous Aggregates will be created using this new format, but the previous version (with partials) will be supported.  Users can create the previous style by setting to `false` the storage paramater named `timescaledb.finalized` during the creation of the Continuous Aggregate.  Fixes #4233 " embedding=array([-0.03077092,  0.0143465 , -0.01135488, ..., -0.00501059,
        -0.01490651, -0.02304872], dtype=float32) distance=0.1637590571138441>,
 <Record id=UUID('dddb6100-d17a-11ec-825d-6556061133f2') metadata={'date': '2022-05-11 19:36:58+-500', 'author': 'Fabrízio de Royes Mello', 'commit': ' f266f5cf564fcc5509b91493a39eb201c6f5914a'} contents="Fabrízio de Royes Mello<fabriziomello@gmail.com> Wed May 11 19:36:58 2022 -0300  f266f5cf564fcc5509b91493a39eb201c6f5914a Continuous Aggregates finals form Following work started by #4294 to improve performance of Continuous Aggregates by removing the re-aggregation in the user view.  This PR get rid of `partialize_agg` and `finalize_agg` aggregate functions and store the finalized aggregated (plain) data in the materialization hypertable.  Because we're not storing partials anymore and removed the re-aggregation, now is be possible to create indexes on aggregated columns in the materialization hypertable in order to improve the performance even more.  Also removed restrictions on types of aggregates users can perform with Continuous Aggregates: * aggregates with DISTINCT * aggregates with FILTER * aggregates with FILTER in HAVING clause * aggregates without combine function * ordered-set aggregates * hypothetical-set aggregates  By default new Continuous Aggregates will be created using this new format, but the previous version (with partials) will be supported.  Users can create the previous style by setting to `false` the storage paramater named `timescaledb.finalized` during the creation of the Continuous Aggregate.  Fixes #4233 " embedding=array([-0.03077092,  0.0143465 , -0.01135488, ..., -0.00501059,
        -0.01490651, -0.02304872], dtype=float32) distance=0.1637590571138441>,
 <Record id=UUID('0df31a00-44f7-11ed-82a8-18143619d1eb') metadata={'date': '2022-10-5 18:45:40+-500', 'author': 'Fabrízio de Royes Mello', 'commit': ' 3749953e9704e45df8f621607989ada0714ce28d'} contents='Fabrízio de Royes Mello<fabriziomello@gmail.com> Wed Oct 5 18:45:40 2022 -0300  3749953e9704e45df8f621607989ada0714ce28d Hierarchical Continuous Aggregates Enable users create Hierarchical Continuous Aggregates (aka Continuous Aggregates on top of another Continuous Aggregates).  With this PR users can create levels of aggregation granularity in Continuous Aggregates making the refresh process even faster.  A problem with this feature can be in upper levels we can end up with the "average of averages". But to get the "real average" we can rely on "stats_aggs" TimescaleDB Toolkit function that calculate and store the partials that can be finalized with other toolkit functions like "average" and "sum".  Closes #1400 ' embedding=array([-0.02771199,  0.01358744,  0.00311197, ..., -0.00547272,
        -0.01917629, -0.03033948], dtype=float32) distance=0.16440806282766374>,
 <Record id=UUID('0df31a00-44f7-11ed-99ea-57328005937d') metadata={'date': '2022-10-5 18:45:40+-500', 'author': 'Fabrízio de Royes Mello', 'commit': ' 3749953e9704e45df8f621607989ada0714ce28d'} contents='Fabrízio de Royes Mello<fabriziomello@gmail.com> Wed Oct 5 18:45:40 2022 -0300  3749953e9704e45df8f621607989ada0714ce28d Hierarchical Continuous Aggregates Enable users create Hierarchical Continuous Aggregates (aka Continuous Aggregates on top of another Continuous Aggregates).  With this PR users can create levels of aggregation granularity in Continuous Aggregates making the refresh process even faster.  A problem with this feature can be in upper levels we can end up with the "average of averages". But to get the "real average" we can rely on "stats_aggs" TimescaleDB Toolkit function that calculate and store the partials that can be finalized with other toolkit functions like "average" and "sum".  Closes #1400 ' embedding=array([-0.02771199,  0.01358744,  0.00311197, ..., -0.00547272,
        -0.01917629, -0.03033948], dtype=float32) distance=0.16440806282766374>]

Let’s look at how to access the fields of a record object. The client provides helper methods to easily access the fields of a record object.

We’ll use them inspect the content of the results to see that they are indeed relevant to the query we searched for.

for result in records:
    print("-" * 80)
    print(result[client.SEARCH_RESULT_CONTENTS_IDX])

--------------------------------------------------------------------------------
Fabrízio de Royes Mello<fabriziomello@gmail.com> Tue Jul 5 13:39:14 2022 +0200  e34218ce2963358a500f6bc315aace0fad29c450 Migrate Continuous Aggregates to the new format Timescale 2.7 released a new version of Continuous Aggregate (#4269) that store the final aggregation state instead of the byte array of the partial aggregate state, offering multiple opportunities of optimizations as well a more compact form.  When upgrading to Timescale 2.7, new created Continuous Aggregates are using the new format, but existing Continuous Aggregates keep using the format they were defined with.  Created a procedure to upgrade existing Continuous Aggregates from the old format to the new format, by calling a simple procedure:  test=# CALL cagg_migrate('conditions_summary_daily');  Closes #4424 
--------------------------------------------------------------------------------
Fabrízio de Royes Mello<fabriziomello@gmail.com> Tue Jul 5 13:39:14 2022 +0200  e34218ce2963358a500f6bc315aace0fad29c450 Migrate Continuous Aggregates to the new format Timescale 2.7 released a new version of Continuous Aggregate (#4269) that store the final aggregation state instead of the byte array of the partial aggregate state, offering multiple opportunities of optimizations as well a more compact form.  When upgrading to Timescale 2.7, new created Continuous Aggregates are using the new format, but existing Continuous Aggregates keep using the format they were defined with.  Created a procedure to upgrade existing Continuous Aggregates from the old format to the new format, by calling a simple procedure:  test=# CALL cagg_migrate('conditions_summary_daily');  Closes #4424 
--------------------------------------------------------------------------------
Fabrízio de Royes Mello<fabriziomello@gmail.com> Thu Nov 24 13:19:36 2022 -0300  35c91204987ccb0161d745af1a39b7eb91bc65a5 Add Hierarchical Continuous Aggregates validations Commit 3749953e introduce Hierarchical Continuous Aggregates (aka Continuous Aggregate on top of another Continuous Aggregate) but it lacks of some basic validations.  Validations added during the creation of a Hierarchical Continuous Aggregate:  * Forbid create a continuous aggregate with fixed-width bucket on top of   a continuous aggregate with variable-width bucket.  * Forbid incompatible bucket widths:   - should not be equal;   - bucket width of the new continuous aggregate should be greater than     the source continuous aggregate;   - bucket width of the new continuous aggregate should be multiple of     the source continuous aggregate. 
--------------------------------------------------------------------------------
Fabrízio de Royes Mello<fabriziomello@gmail.com> Thu Nov 24 13:19:36 2022 -0300  35c91204987ccb0161d745af1a39b7eb91bc65a5 Add Hierarchical Continuous Aggregates validations Commit 3749953e introduce Hierarchical Continuous Aggregates (aka Continuous Aggregate on top of another Continuous Aggregate) but it lacks of some basic validations.  Validations added during the creation of a Hierarchical Continuous Aggregate:  * Forbid create a continuous aggregate with fixed-width bucket on top of   a continuous aggregate with variable-width bucket.  * Forbid incompatible bucket widths:   - should not be equal;   - bucket width of the new continuous aggregate should be greater than     the source continuous aggregate;   - bucket width of the new continuous aggregate should be multiple of     the source continuous aggregate. 
--------------------------------------------------------------------------------
Erik Nordström<erik@timescale.com> Tue Feb 8 09:57:23 2022 +0100  5af9f45488d51027804cac16362811f71a89bb64 Add extra telemetry for continuous aggregates Add the following telemetry fields for continuous aggregates:  * The number of continuous aggregates created on distributed   hypertables * The number of continuous aggregates using real-time aggregation 
--------------------------------------------------------------------------------
Erik Nordström<erik@timescale.com> Tue Feb 8 09:57:23 2022 +0100  5af9f45488d51027804cac16362811f71a89bb64 Add extra telemetry for continuous aggregates Add the following telemetry fields for continuous aggregates:  * The number of continuous aggregates created on distributed   hypertables * The number of continuous aggregates using real-time aggregation 
--------------------------------------------------------------------------------
Fabrízio de Royes Mello<fabriziomello@gmail.com> Wed May 11 19:36:58 2022 -0300  f266f5cf564fcc5509b91493a39eb201c6f5914a Continuous Aggregates finals form Following work started by #4294 to improve performance of Continuous Aggregates by removing the re-aggregation in the user view.  This PR get rid of `partialize_agg` and `finalize_agg` aggregate functions and store the finalized aggregated (plain) data in the materialization hypertable.  Because we're not storing partials anymore and removed the re-aggregation, now is be possible to create indexes on aggregated columns in the materialization hypertable in order to improve the performance even more.  Also removed restrictions on types of aggregates users can perform with Continuous Aggregates: * aggregates with DISTINCT * aggregates with FILTER * aggregates with FILTER in HAVING clause * aggregates without combine function * ordered-set aggregates * hypothetical-set aggregates  By default new Continuous Aggregates will be created using this new format, but the previous version (with partials) will be supported.  Users can create the previous style by setting to `false` the storage paramater named `timescaledb.finalized` during the creation of the Continuous Aggregate.  Fixes #4233 
--------------------------------------------------------------------------------
Fabrízio de Royes Mello<fabriziomello@gmail.com> Wed May 11 19:36:58 2022 -0300  f266f5cf564fcc5509b91493a39eb201c6f5914a Continuous Aggregates finals form Following work started by #4294 to improve performance of Continuous Aggregates by removing the re-aggregation in the user view.  This PR get rid of `partialize_agg` and `finalize_agg` aggregate functions and store the finalized aggregated (plain) data in the materialization hypertable.  Because we're not storing partials anymore and removed the re-aggregation, now is be possible to create indexes on aggregated columns in the materialization hypertable in order to improve the performance even more.  Also removed restrictions on types of aggregates users can perform with Continuous Aggregates: * aggregates with DISTINCT * aggregates with FILTER * aggregates with FILTER in HAVING clause * aggregates without combine function * ordered-set aggregates * hypothetical-set aggregates  By default new Continuous Aggregates will be created using this new format, but the previous version (with partials) will be supported.  Users can create the previous style by setting to `false` the storage paramater named `timescaledb.finalized` during the creation of the Continuous Aggregate.  Fixes #4233 
--------------------------------------------------------------------------------
Fabrízio de Royes Mello<fabriziomello@gmail.com> Wed Oct 5 18:45:40 2022 -0300  3749953e9704e45df8f621607989ada0714ce28d Hierarchical Continuous Aggregates Enable users create Hierarchical Continuous Aggregates (aka Continuous Aggregates on top of another Continuous Aggregates).  With this PR users can create levels of aggregation granularity in Continuous Aggregates making the refresh process even faster.  A problem with this feature can be in upper levels we can end up with the "average of averages". But to get the "real average" we can rely on "stats_aggs" TimescaleDB Toolkit function that calculate and store the partials that can be finalized with other toolkit functions like "average" and "sum".  Closes #1400 
--------------------------------------------------------------------------------
Fabrízio de Royes Mello<fabriziomello@gmail.com> Wed Oct 5 18:45:40 2022 -0300  3749953e9704e45df8f621607989ada0714ce28d Hierarchical Continuous Aggregates Enable users create Hierarchical Continuous Aggregates (aka Continuous Aggregates on top of another Continuous Aggregates).  With this PR users can create levels of aggregation granularity in Continuous Aggregates making the refresh process even faster.  A problem with this feature can be in upper levels we can end up with the "average of averages". But to get the "real average" we can rely on "stats_aggs" TimescaleDB Toolkit function that calculate and store the partials that can be finalized with other toolkit functions like "average" and "sum".  Closes #1400

Here’s how to access each field of a record object:

result = records[0]

record_id = result[client.SEARCH_RESULT_ID_IDX]
record_metadata = result[client.SEARCH_RESULT_METADATA_IDX]
record_embedding = result[client.SEARCH_RESULT_EMBEDDING_IDX]
record_distance = result[client.SEARCH_RESULT_DISTANCE_IDX]
record_contents = result[client.SEARCH_RESULT_CONTENTS_IDX]

print("-" * 80)
print(record_id)
print(record_metadata)
print(record_contents)
print(record_distance)
print("-" * 80)

--------------------------------------------------------------------------------
18331d00-fc57-11ec-a166-06cee12dbc78
{'date': '2022-07-5 13:39:14+0320', 'author': 'Fabrízio de Royes Mello', 'commit': ' e34218ce2963358a500f6bc315aace0fad29c450'}
Fabrízio de Royes Mello<fabriziomello@gmail.com> Tue Jul 5 13:39:14 2022 +0200  e34218ce2963358a500f6bc315aace0fad29c450 Migrate Continuous Aggregates to the new format Timescale 2.7 released a new version of Continuous Aggregate (#4269) that store the final aggregation state instead of the byte array of the partial aggregate state, offering multiple opportunities of optimizations as well a more compact form.  When upgrading to Timescale 2.7, new created Continuous Aggregates are using the new format, but existing Continuous Aggregates keep using the format they were defined with.  Created a procedure to upgrade existing Continuous Aggregates from the old format to the new format, by calling a simple procedure:  test=# CALL cagg_migrate('conditions_summary_daily');  Closes #4424 
0.15402132505614874
--------------------------------------------------------------------------------

Similarity search with metadata filtering

Timecale Vector also supports filtering results by metadata. Let’s see an example of this, where we’ll specify the number of results to return using the limit argument and filter the results by the author metadata field.

records_filtered = await vec.search(query_embedding, limit=3, filter={"author": "Rafia Sabih"})

for result in records_filtered:
    print("-" * 80)
    print(result[client.SEARCH_RESULT_CONTENTS_IDX])

--------------------------------------------------------------------------------
Rafia Sabih<rafia.sabih@gmail.com> Wed Feb 8 11:54:28 2023 +0100  98218c1d079231a9aa469b37ddd0ed39e77c2adb Enable joins for heirarchical continuous aggregates The joins could be between a continuous aggregate and hypertable, continuous aggregate and a regular Postgres table, and continuous aggregate and a regular Postgres view. 
--------------------------------------------------------------------------------
Rafia Sabih<rafia.sabih@gmail.com> Wed Feb 8 11:54:28 2023 +0100  98218c1d079231a9aa469b37ddd0ed39e77c2adb Enable joins for heirarchical continuous aggregates The joins could be between a continuous aggregate and hypertable, continuous aggregate and a regular Postgres table, and continuous aggregate and a regular Postgres view. 
--------------------------------------------------------------------------------
Rafia Sabih<rafia.sabih@gmail.com> Thu Apr 27 15:01:38 2023 +0200  d9849325d0d0f81a13db1e41aa56f8b567945e72 Improve test suite Add more regression tests for Continuous aggregates with joins.

We can from the output above that we only get results back from the author we filtered for!

5. Using ANN search indexes to speed up queries

You can speed up similarity queries by creating an index on the embedding column. We recommend doing this if you have large number (10k+) vectors in your table.

Timescale Vector supports the following indexes: - timescale_vector_index: a disk-ann inspired graph index for fast similarity search (default). - pgvector’s HNSW index: a hierarchical navigable small world graph index for fast similarity search. - pgvector’s IVFFLAT index: an inverted file index for fast similarity search.

To learn more about Timescale Vector’s new DiskANN inspired index, see the Timescale Vector explainer blog for more details and performance benchmarks.

Important note: In PostgreSQL, each table can only have one index on a particular column. So if you’d like to test the performance of different index types, you can do so either by (1) creating multiple tables with different indexes, (2) creating multiple vector columns in the same table and creating different indexes on each column, or (3) by dropping and recreating the index on the same column and comparing results.

Let’s look at how to create each type of index, starting with the StreamingDiskANN index.

# Create a timescale vector (DiskANN) search index on the embedding column
await vec.create_embedding_index(client.DiskAnnIndex())

Timescale Vector also supports HNSW and ivfflat indexes:

await vec.drop_embedding_index()

# Create HNSW search index on the embedding column
await vec.create_embedding_index(client.HNSWIndex())

await vec.drop_embedding_index()

# Create IVFFLAT search index on the embedding column
await vec.create_embedding_index(client.IvfflatIndex())

In general we recommend using the timescale vector or HNSW index for most use cases, as they are most useful for high dimension and large datasets.

await vec.drop_embedding_index()
await vec.create_embedding_index(client.DiskAnnIndex())

6. Similarity search with time filtering

We’ll use Timescale Vector to find similar commits to a given query within a time range.

A key use case for Timescale Vector is efficient time-based vector search. Timescale Vector enables this by automatically partitioning vectors (and associated metadata) by time. This allows you to efficiently query vectors by both similarity to a query vector and time.

Time-based vector search functionality is helpful for applications like: - Storing and retrieving LLM response history (e.g. chatbots) - Finding the most recent embeddings that are similar to a query vector (e.g recent news). - Constraining similarity search to a relevant time range (e.g asking time-based questions about a knowledge base)

Let’s look at how to run similarity searches with time range filters using the client.

The first step to using time filtering with Timescale Vector is to create a table with the time_partition_interval argument set to the desired time interval. This will automatically partition the table into time-based chunks to speed up queries. We completed this step in Part 1 above.
Next, we ensure the id of our row is a uuid with a datetime portion that reflects the date and time we want to associated with the embedding. We completed this step in Part 2 above, where we used the uuid_from_time() method provided by the Timescale Vector library.
Finally, we can run similarity searches with time range filters using the client. We’ll illustrate this below.

# define search query
query_string = "What's new with TimescaleDB functions?"
query_embedding = get_embeddings(query_string)

# Time filter variables for query
start_date = datetime(2023, 8, 1, 22, 10, 35) # Start date = 1 August 2023, 22:10:35
end_date = datetime(2023, 8, 30, 22, 10, 35) # End date = 30 August 2023, 22:10:35
td = timedelta(days=7) # Time delta = 7 days

# Method 1: Filter within a provided start date and end date.
records_time_filtered = await vec.search(query_embedding,
                        limit=3, 
                        uuid_time_filter=client.UUIDTimeRange(start_date, end_date))

for result in records_time_filtered:
    print("-" * 80)
    print(result[client.SEARCH_RESULT_CONTENTS_IDX])

--------------------------------------------------------------------------------
Sven Klemm<sven@timescale.com> Tue Aug 29 18:13:24 2023 +0200  e4facda540286b0affba47ccc63959fefe2a7b26 Add compatibility layer for _timescaledb_internal functions With timescaledb 2.12 all the functions present in _timescaledb_internal were moved into the _timescaledb_functions schema to improve schema security. This patch adds a compatibility layer so external callers of these internal functions will not break and allow for more flexibility when migrating. 
--------------------------------------------------------------------------------
Sven Klemm<sven@timescale.com> Tue Aug 29 18:13:24 2023 +0200  e4facda540286b0affba47ccc63959fefe2a7b26 Add compatibility layer for _timescaledb_internal functions With timescaledb 2.12 all the functions present in _timescaledb_internal were moved into the _timescaledb_functions schema to improve schema security. This patch adds a compatibility layer so external callers of these internal functions will not break and allow for more flexibility when migrating. 
--------------------------------------------------------------------------------
Dmitry Simonenko<dmitry@timescale.com> Thu Aug 3 14:30:23 2023 +0300  7aeed663b9c0f337b530fd6cad47704a51a9b2ec Feature flags for TimescaleDB features This PR adds several GUCs which allow to enable/disable major timescaledb features:  - enable_hypertable_create - enable_hypertable_compression - enable_cagg_create - enable_policy_create

Only vectors within the specified time range are returned. These queries are very efficient as they only need to search the relevant partitions.

Resources and next steps

To continue your learning journey check out the following resources:

Timescale Vector AI tutorials and guides
Timescale Vector explainer blog and performance benchmarks for more details and performance benchmarks.
Using Timescale Vector with LangChain
Using Timescale Vector with LlamaIndex

And finally, if you haven’t already, remember to claim your 90 days free of Timescale Vector by signing up here.