自查詢 | akasha 0.9 使用手冊

自查詢(Self Query)

自查詢是指能夠自我查詢的檢索問題方法。透過結構化、包含metadata的文件集，能將使用者的問題先透過metadata進行篩選，再藉由向量來查詢語意相似的文檔，比起直接做向量查詢能更精確的找到需要的文件。

流程

自查詢的流程為

0. create metadata

添加結構化資料到文件集與向量資料庫中

1. query constructor

將使用者問題轉換成 query(用來做向量查詢的字串)與filter(用來篩選文件集的metatada keywords)
例如，使用者問題為: A公司在2024年對電動車的銷售量為多少?

透過語言模型將其轉換為

{
    "query": "電動車的銷售量",
    "filter": "and(eq(\"公司名稱\", \"A公司\"), eq(\"日期年\", \"2024\"))"
}

2. filter translator

將filter處理為程式可執行的dictionary

3. filter docs

根據filter篩選需要的文件集

1
2
3

{'filter': {'$and': [{'拜訪年': {'$eq': '2024'}}, {'公司名稱': {'$eq': 'A公司'}}]}}

4. similarity search

從篩選出來的文件集做向量相似度搜尋

範例

1. load db from data source and add metadata to db


import akasha.utils.db as adb
import akasha.helper as ah
data_source = ["docs/pns_query_small"]
embed_name = "openai:text-embedding-3-small"
chunk_size = 1000
db, ignore_files = adb.process_db(
    data_source=data_source, embeddings=embed_name, chunk_size=chunk_size, verbose=True
)


### you should create your own metadata function to add metadata for every text chunks in the db ###
### in this example, we use the source(file name) of the text chunk to map text chunks and metadatas ###
def add_metadata(db: adb.dbs):
    """this function is used to add metadata to the old_db object.

    Args:
        old_db (adb.dbs):

    Returns:
        (adb.dbs):
    """
    import json
    from pathlib import Path

    for metadata in db.metadatas:
        file_path = metadata["source"]  # source is the file path
        try:
            with Path(file_path).open("r", encoding="utf-8") as file:
                dictionary = json.load(file)

            # dictionary = helper.extract_json(text)
            metadata["課別"] = dictionary["課別"]
            metadata["業務擔當"] = dictionary["業務擔當"]
            ddate = dictionary["拜訪日期"]
            metadata["拜訪年"] = int(ddate.split("-")[0])
            metadata["拜訪月"] = int(ddate.split("-")[1])
            metadata["產品"] = dictionary["產品"]
            metadata["公司名稱"] = dictionary["公司名稱"]
            metadata["大分類"] = dictionary["大分類"]
            metadata["中分類"] = dictionary["中分類"]
        except Exception as e:
            print(f"JSONDecodeError: {e}")
    return


add_metadata(db)
adb.update_db(db, data_source, embed_name, chunk_size=chunk_size)

2. use self-query to filter docs and use query for similarity search

import akasha.utils.db as adb
import akasha.helper as ah
from akasha.utils.search.retrievers.base import get_retrivers
import akasha

db, ignore_files = adb.process_db(
    data_source=data_source, embeddings=embed_name, chunk_size=chunk_size, verbose=True
)
prompt = "A公司在2024年對電動車的銷售量為多少?"
search_type = "knn"
model_obj = ah.handle_model("openai:gpt-4o", True)


## each metadata attribute should include name, description and type(integer, float, string) ##
metadata_field_info = [
    {"name": "拜訪年", "description": "此訪談紀錄的拜訪年份", "type": "integer"},
    {"name": "拜訪月", "description": "此訪談紀錄的拜訪月份", "type": "integer"},
    {"name": "業務擔當", "description": "業務的名稱", "type": "string"},
    {"name": "中分類", "description": "訪談產品的中等分類", "type": "string"},
    {"name": "公司名稱", "description": "訪談對象的公司名稱", "type": "string"},
    {"name": "大分類", "description": "訪談產品的大分類", "type": "string"},
    {"name": "產品", "description": "訪談的產品名稱/型號", "type": "string"},
    {"name": "課別", "description": "公司部門的課別名稱或代號", "type": "string"},
]

document_content_description = "業務與客戶的訪談紀錄"

####################


### use self-query to filter docs
new_dbs, query, matched_fields = ah.self_query(
    prompt, model_obj, db, metadata_field_info, document_content_description
)

### option1   use knn similarity search to sort docs from filtered docs

retriver = get_retrivers(new_dbs, embed_name, threshold=0.0, search_type=search_type)[0]

docs, scores = retriver.get_relevant_documents_and_scores(query)
print(docs)

### option2    use new_dbs(filtered docs) to run other akasha functions


ak = akasha.RAG(
    model=model_obj,
    embeddings=embed_name,
)
resposne = ak(data_source=new_dbs, prompt=prompt)

自訂parser函式

若您使用的語言模型回答無法使用預設的parser找出query與filter，可以自訂一個parser函式，輸入為語言模型的回答(string)，輸出為[query(string), filter(dictionary)]。

import akasha.utils.db as adb
import akasha.helper as ah
import json
def just_an_example(input_text: str):
    # input_text :
    #```json
    #{
    #   "query": "產品疑慮",
    #   "filter": "and(eq(\"公司名稱\", \"a公司\"), eq(\"拜訪年\", \"2024\"))"
    #}
    #```

    jstr = '{"$and": [{"公司名稱": {"$eq": "a公司"}}, {"拜訪年": {"$eq": "2024"}}]}'
    dic = json.loads(jstr)
    return "產品疑慮", dic

### use self-query to filter docs
new_dbs, query, matched_fields = ah.self_query(
            prompt, model_obj, db, metadata_field_info,
            document_content_description, just_an_example)

loose filter

使用參數 loose_filter=True會將filter中的$and替換成$or，只要文件有符合任意attribute便會選取

import akasha.utils.db as adb
import akasha.helper as ah
### use self-query to filter docs
## filter become '{"$or": [{"公司名稱": {"$eq": "a公司"}}, {"拜訪年": {"$eq": "2024"}}]}'
new_dbs, query, matched_fields = ah.self_query(
            prompt, model_obj, db, metadata_field_info,
            document_content_description, loose_filter = True)