Self Query (自查詢)

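Self query is a retrieval method in which the system builds its own structured query from the user's question. Given a structured document set that carries metadata, the question is first used to filter documents by metadata, and a vector search then finds the semantically similar documents among those that remain. Compared with a plain vector search, this finds the needed documents more precisely.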

Workflow

The self-query workflow is:

0. create metadata

Add structured data (metadata) to the document set and the vector database.

1. query constructor

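Convert the user's question into a query (the string used for the vector search) and a filter (the metadata keywords used to filter the document set).
For example, suppose the user's question is: A公司在2024年對電動車的銷售量為多少? ("What were Company A's electric-vehicle sales in 2024?")

The language model converts it into: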

{
  "query": "電動車的銷售量",
  "filter": "and(eq(\"公司名稱\", \"A公司\"), eq(\"拜訪年\", \"2024\"))"
}

2. filter translator

Convert the filter into a dictionary that the program can execute.

3. filter docs

Select the needed documents from the document set according to the filter.

{'filter': {'$and': [{'拜訪年': {'$eq': '2024'}}, {'公司名稱': {'$eq': 'A公司'}}]}}
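
How such a dictionary selects documents can be sketched in plain Python. This is a minimal illustration, not akasha's implementation: `matches` is a hypothetical helper that supports only the `$and`, `$or` and `$eq` operators shown on this page, and the sample metadata values are made up.

```python
def matches(metadata: dict, flt: dict) -> bool:
    """Return True if one chunk's metadata satisfies the filter dictionary."""
    for key, cond in flt.items():
        if key == "$and":
            if not all(matches(metadata, sub) for sub in cond):
                return False
        elif key == "$or":
            if not any(matches(metadata, sub) for sub in cond):
                return False
        else:
            # cond looks like {"$eq": value}; compare as strings so that
            # the integer 2024 and the string "2024" are treated as equal
            # (an assumption made for this sketch)
            if str(metadata.get(key)) != str(cond["$eq"]):
                return False
    return True


flt = {"$and": [{"拜訪年": {"$eq": "2024"}}, {"公司名稱": {"$eq": "A公司"}}]}

# hypothetical metadata of three text chunks
metadatas = [
    {"公司名稱": "A公司", "拜訪年": 2024},
    {"公司名稱": "A公司", "拜訪年": 2023},
    {"公司名稱": "B公司", "拜訪年": 2024},
]

# indices of the chunks that pass the filter
kept = [i for i, md in enumerate(metadatas) if matches(md, flt)]
print(kept)  # [0]
```

Only the first chunk satisfies both conditions, so only it moves on to the similarity search.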


Finally, run a vector similarity search over the filtered documents.
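
This last step can be pictured as a cosine-similarity ranking restricted to the chunks that survived the filter. The three-dimensional vectors below are made-up stand-ins for real embeddings, and `cosine` and `rank_filtered` are hypothetical helper names, not akasha functions.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_filtered(query_vec, chunk_vecs, kept_ids, top_k=2):
    """Rank only the chunks whose index is in kept_ids, best match first."""
    scored = [(i, cosine(query_vec, chunk_vecs[i])) for i in kept_ids]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


chunk_vecs = [
    [1.0, 0.0, 0.0],  # chunk 0
    [0.9, 0.1, 0.0],  # chunk 1 (closest to the query, but filtered out)
    [0.0, 1.0, 0.0],  # chunk 2
    [0.6, 0.8, 0.0],  # chunk 3
]
query_vec = [1.0, 0.1, 0.0]
kept_ids = [0, 2, 3]  # indices that passed the metadata filter

ranked = rank_filtered(query_vec, chunk_vecs, kept_ids)
print([i for i, _ in ranked])  # [0, 3]
```

Chunk 1 would win a plain vector search, but it never enters the ranking because the metadata filter removed it first.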



Example

1. load db from data source and add metadata to db

import akasha.utils.db as adb
import akasha.helper as ah

data_source = ["docs/pns_query_small"]
embed_name = "openai:text-embedding-3-small"
chunk_size = 1000
db, ignore_files = adb.process_db(
    data_source=data_source, embeddings=embed_name, chunk_size=chunk_size, verbose=True
)


### you should create your own metadata function to add metadata to every text chunk in the db ###
### in this example, we use the source (file name) of each text chunk to map text chunks to metadata ###
def add_metadata(db: adb.dbs):
    """Add metadata, in place, to every text chunk of the db object.

    Args:
        db (adb.dbs): the db whose chunk metadata will be extended

    Returns:
        (adb.dbs): the same db object, with the added metadata
    """
    import json
    from pathlib import Path

    for metadata in db.metadatas:
        file_path = metadata["source"]  # source is the file path
        try:
            with Path(file_path).open("r", encoding="utf-8") as file:
                dictionary = json.load(file)

            # dictionary = helper.extract_json(text)
            metadata["課別"] = dictionary["課別"]
            metadata["業務擔當"] = dictionary["業務擔當"]
            ddate = dictionary["拜訪日期"]
            metadata["拜訪年"] = int(ddate.split("-")[0])
            metadata["拜訪月"] = int(ddate.split("-")[1])
            metadata["產品"] = dictionary["產品"]
            metadata["公司名稱"] = dictionary["公司名稱"]
            metadata["大分類"] = dictionary["大分類"]
            metadata["中分類"] = dictionary["中分類"]
        except Exception as e:
            # covers unreadable files, invalid JSON and missing keys
            print(f"failed to add metadata for {file_path}: {e}")
    return db


add_metadata(db)
adb.update_db(db, data_source, embed_name, chunk_size=chunk_size)
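
`add_metadata` expects each source file to be a JSON record containing the keys it reads. The sketch below shows the assumed shape of one record and the metadata derived from it. The field values are invented for illustration, and `record_to_metadata` is a hypothetical helper, not part of akasha. It also assumes `拜訪日期` is formatted as `YYYY-MM-DD`, which is what the `split("-")` calls imply.

```python
import json

# hypothetical content of one source file
raw = """
{
  "課別": "業務一課",
  "業務擔當": "王小明",
  "拜訪日期": "2024-03-15",
  "產品": "電動車電池",
  "公司名稱": "A公司",
  "大分類": "車用",
  "中分類": "電池"
}
"""


def record_to_metadata(record: dict) -> dict:
    """Build the metadata fields for one visit record."""
    year, month, _day = record["拜訪日期"].split("-")
    copied_keys = ("課別", "業務擔當", "產品", "公司名稱", "大分類", "中分類")
    metadata = {key: record[key] for key in copied_keys}
    # year and month are stored as integers
    metadata["拜訪年"] = int(year)
    metadata["拜訪月"] = int(month)
    return metadata


metadata = record_to_metadata(json.loads(raw))
print(metadata["拜訪年"], metadata["拜訪月"])  # 2024 3
```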


import akasha
import akasha.utils.db as adb
import akasha.helper as ah
from akasha.utils.search.retrievers.base import get_retrivers

# same settings as in step 1
data_source = ["docs/pns_query_small"]
embed_name = "openai:text-embedding-3-small"
chunk_size = 1000

db, ignore_files = adb.process_db(
    data_source=data_source, embeddings=embed_name, chunk_size=chunk_size, verbose=True
)
prompt = "A公司在2024年對電動車的銷售量為多少?"
search_type = "knn"
model_obj = ah.handle_model("openai:gpt-4o", True)


## each metadata attribute should include name, description and type (integer, float, string) ##
metadata_field_info = [
    {"name": "拜訪年", "description": "此訪談紀錄的拜訪年份", "type": "integer"},
    {"name": "拜訪月", "description": "此訪談紀錄的拜訪月份", "type": "integer"},
    {"name": "業務擔當", "description": "業務的名稱", "type": "string"},
    {"name": "中分類", "description": "訪談產品的中等分類", "type": "string"},
    {"name": "公司名稱", "description": "訪談對象的公司名稱", "type": "string"},
    {"name": "大分類", "description": "訪談產品的大分類", "type": "string"},
    {"name": "產品", "description": "訪談的產品名稱/型號", "type": "string"},
    {"name": "課別", "description": "公司部門的課別名稱或代號", "type": "string"},
]

document_content_description = "業務與客戶的訪談紀錄"

####################


### use self-query to filter docs
new_dbs, query, matched_fields = ah.self_query(
    prompt, model_obj, db, metadata_field_info, document_content_description
)

### option 1: use knn similarity search to sort the filtered docs

retriever = get_retrivers(new_dbs, embed_name, threshold=0.0, search_type=search_type)[0]

docs, scores = retriever.get_relevant_documents_and_scores(query)
print(docs)

### option 2: use new_dbs (the filtered docs) to run other akasha functions

ak = akasha.RAG(
    model=model_obj,
    embeddings=embed_name,
)
response = ak(data_source=new_dbs, prompt=prompt)
print(response)



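Custom parser function

If the default parser cannot extract the query and filter from your language model's response, you can supply a custom parser function. Its input is the language model's response (string) and its output is [query (string), filter (dictionary)].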

import akasha.utils.db as adb
import akasha.helper as ah
import json


def just_an_example(input_text: str):
    # input_text :
    # ```json
    # {
    #   "query": "產品疑慮",
    #   "filter": "and(eq(\"公司名稱\", \"a公司\"), eq(\"拜訪年\", \"2024\"))"
    # }
    # ```

    # this placeholder ignores input_text and returns a fixed query and filter
    jstr = '{"$and": [{"公司名稱": {"$eq": "a公司"}}, {"拜訪年": {"$eq": "2024"}}]}'
    dic = json.loads(jstr)
    return "產品疑慮", dic


### use self-query to filter docs
### (prompt, model_obj, db, metadata_field_info and document_content_description
### are the objects defined in the example above)
new_dbs, query, matched_fields = ah.self_query(
    prompt, model_obj, db, metadata_field_info,
    document_content_description, just_an_example)
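
A parser that actually reads the response might strip the Markdown code fence, load the JSON and translate each `eq(...)` clause. The following is a minimal sketch under stated assumptions: the response has the shape shown in the comment of `just_an_example`, the filter is a single `eq("field", "value")` or a flat `and(...)` / `or(...)` of such clauses, and `parse_response` is a hypothetical name. Returning an empty dictionary when no clause is found is also an assumption, not documented akasha behavior.

```python
import json
import re

FENCE = "`" * 3  # Markdown code-fence marker, spelled out to keep this example readable


def parse_response(input_text: str):
    """Return (query, filter_dict) parsed from a model response."""
    # keep only the outermost {...} so a surrounding code fence is ignored
    start, end = input_text.find("{"), input_text.rfind("}")
    data = json.loads(input_text[start : end + 1])

    filter_str = data.get("filter", "")
    # collect every eq("field", "value") clause
    pairs = re.findall(r'eq\(\s*"([^"]+)"\s*,\s*"([^"]+)"\s*\)', filter_str)
    clauses = [{field: {"$eq": value}} for field, value in pairs]

    if not clauses:
        return data["query"], {}  # assumption: empty filter when nothing matched
    if len(clauses) == 1:
        return data["query"], clauses[0]
    operator = "$or" if filter_str.lstrip().startswith("or(") else "$and"
    return data["query"], {operator: clauses}


# a sample response wrapped in a json code fence
response = (
    FENCE + "json\n"
    "{\n"
    '  "query": "產品疑慮",\n'
    '  "filter": "and(eq(\\"公司名稱\\", \\"a公司\\"), eq(\\"拜訪年\\", \\"2024\\"))"\n'
    "}\n" + FENCE
)
query, flt = parse_response(response)
print(query)
print(flt)
```

Nested expressions such as `and(or(...), eq(...))` are outside what this regular-expression approach can handle; they would need a small recursive parser.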



loose filter

Setting the parameter loose_filter=True replaces $and in the filter with $or, so a document is selected as soon as it matches any one of the attributes.

import akasha.utils.db as adb
import akasha.helper as ah

### use self-query to filter docs
## the filter becomes '{"$or": [{"公司名稱": {"$eq": "a公司"}}, {"拜訪年": {"$eq": "2024"}}]}'
new_dbs, query, matched_fields = ah.self_query(
    prompt, model_obj, db, metadata_field_info,
    document_content_description, loose_filter=True)
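
The effect of `loose_filter=True` on the filter can be pictured with the sketch below. `loosen` is a hypothetical helper that mimics the `$and` to `$or` substitution described above; how akasha performs it internally is not shown here, and the sample chunk is made up.

```python
def loosen(flt):
    """Recursively replace every "$and" key with "$or"."""
    if isinstance(flt, dict):
        return {("$or" if k == "$and" else k): loosen(v) for k, v in flt.items()}
    if isinstance(flt, list):
        return [loosen(item) for item in flt]
    return flt


strict = {"$and": [{"公司名稱": {"$eq": "a公司"}}, {"拜訪年": {"$eq": "2024"}}]}
loose = loosen(strict)
print(loose)

# hypothetical chunk: right company, wrong year
chunk = {"公司名稱": "a公司", "拜訪年": "2023"}
hits = [
    chunk.get(field) == cond["$eq"]
    for clause in strict["$and"]
    for field, cond in clause.items()
]
print(all(hits), any(hits))  # False True
```

The chunk fails the strict `$and` filter because the year differs, but passes the loose `$or` filter because the company name matches.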