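Self Query

Self-querying is a retrieval method in which the system constructs its own structured query from the user's question. Given a structured document collection that carries metadata, the question is first used to filter documents by metadata, and a vector search then finds the semantically similar documents among those that remain. This finds the required documents more precisely than running a vector search directly.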

Workflow

The self-query workflow consists of the following steps:

0. create metadata

Add structured data (metadata) to the document collection and the vector database.

1. query constructor

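Convert the user's question into a query (the string used for the vector search) and a filter (the metadata keywords used to filter the document collection).
For example, given the user question: A公司在2024年對電動車的銷售量為多少? (What were Company A's electric vehicle sales in 2024?)

the language model converts it into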

{
    "query": "電動車的銷售量",
    "filter": "and(eq(\"公司名稱\", \"A公司\"), eq(\"拜訪年\", \"2024\"))"
}

2. filter translator

Convert the filter into a dictionary that the program can execute.
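The translation step can be illustrated with a small recursive parser. This is only a minimal sketch, not akasha's implementation: the function names (`tokenize`, `parse`, `translate_filter`) and the set of supported operators are assumptions made for illustration.

```python
import re

# Assumed operator names; the real translator may support a different set.
LOGICAL = {"and": "$and", "or": "$or"}
COMPARISON = {"eq": "$eq", "ne": "$ne", "gt": "$gt", "gte": "$gte",
              "lt": "$lt", "lte": "$lte"}

# One token: an operator name, a double-quoted string, a number, or punctuation.
TOKEN = re.compile(r'\s*(?:([A-Za-z_]+)|"([^"]*)"|(-?\d+(?:\.\d+)?)|([(),]))')


def tokenize(text):
    """Split a filter expression into (kind, value) tokens."""
    text = text.strip()
    tokens, pos = [], 0
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if match is None:
            raise ValueError(f"unexpected character at position {pos}")
        name, string, number, punct = match.groups()
        if name is not None:
            tokens.append(("name", name))
        elif string is not None:
            tokens.append(("value", string))
        elif number is not None:
            tokens.append(("value", float(number) if "." in number else int(number)))
        else:
            tokens.append(("punct", punct))
        pos = match.end()
    return tokens


def parse(tokens, pos):
    """Parse one expression starting at tokens[pos]; return (dict, next_pos)."""
    kind, op = tokens[pos]
    if kind != "name" or tokens[pos + 1] != ("punct", "("):
        raise ValueError(f"expected an operator call at token {pos}")
    pos += 2
    if op in LOGICAL:
        children = []
        while True:
            child, pos = parse(tokens, pos)
            children.append(child)
            if tokens[pos] != ("punct", ","):
                break
            pos += 1
        if tokens[pos] != ("punct", ")"):
            raise ValueError("missing closing parenthesis")
        return {LOGICAL[op]: children}, pos + 1
    if op in COMPARISON:
        field, comma, value, close = tokens[pos:pos + 4]
        if (field[0], comma, value[0], close) != (
                "value", ("punct", ","), "value", ("punct", ")")):
            raise ValueError(f"malformed {op}(...) call")
        return {field[1]: {COMPARISON[op]: value[1]}}, pos + 4
    raise ValueError(f"unknown operator: {op}")


def translate_filter(expression):
    """Turn e.g. and(eq("a", "b"), ...) into a {"$and": [...]} dictionary."""
    tokens = tokenize(expression)
    result, pos = parse(tokens, 0)
    if pos != len(tokens):
        raise ValueError("unexpected trailing tokens")
    return result


print(translate_filter('and(eq("公司名稱", "A公司"), eq("拜訪年", "2024"))'))
```

Quoted values stay strings and unquoted numbers become `int` or `float`, so whether a year compares as `"2024"` or `2024` depends on how the language model wrote it.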

3. filter docs

Select the required documents from the collection according to the filter.

{'filter': {'$and': [{'拜訪年': {'$eq': '2024'}}, {'公司名稱': {'$eq': 'A公司'}}]}}


Then run a vector similarity search over the filtered documents.
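The last two steps (filter by metadata, then rank by vector similarity) can be sketched in plain Python. This is an illustrative toy, not how akasha or chromadb implement it: the helper names (`matches`, `cosine`, `filtered_search`), the document IDs, and the two-dimensional vectors are all made up for the example, and only `$and`, `$or` and `$eq` are handled.

```python
import math


def matches(metadata, flt):
    """Return True if a metadata dict satisfies a $and/$or/$eq filter."""
    for key, cond in flt.items():
        if key == "$and":
            if not all(matches(metadata, sub) for sub in cond):
                return False
        elif key == "$or":
            if not any(matches(metadata, sub) for sub in cond):
                return False
        else:
            for op, expected in cond.items():
                if op != "$eq":
                    raise NotImplementedError(f"operator {op} is not covered here")
                if metadata.get(key) != expected:
                    return False
    return True


def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def filtered_search(docs, flt, query_vec, k=2):
    """Keep docs whose metadata matches, then rank them by similarity."""
    candidates = [d for d in docs if matches(d["metadata"], flt)]
    candidates.sort(key=lambda d: cosine(d["vector"], query_vec), reverse=True)
    return candidates[:k]


# Toy data: the vectors are invented stand-ins for real embeddings.
docs = [
    {"id": "a-2024-ev", "vector": [0.9, 0.1],
     "metadata": {"公司名稱": "A公司", "拜訪年": "2024"}},
    {"id": "a-2024-other", "vector": [0.1, 0.9],
     "metadata": {"公司名稱": "A公司", "拜訪年": "2024"}},
    {"id": "b-2024-ev", "vector": [1.0, 0.0],
     "metadata": {"公司名稱": "B公司", "拜訪年": "2024"}},
    {"id": "a-2023-ev", "vector": [1.0, 0.0],
     "metadata": {"公司名稱": "A公司", "拜訪年": "2023"}},
]
flt = {"$and": [{"拜訪年": {"$eq": "2024"}}, {"公司名稱": {"$eq": "A公司"}}]}

top_ids = [d["id"] for d in filtered_search(docs, flt, [1.0, 0.0])]
print(top_ids)
```

Documents from other companies or years are excluded before ranking, even when their vectors are closer to the query. That is the reason for filtering first.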



Example

1. create chromadb & add metadatas


import json
from pathlib import Path

import akasha


def add_metadata(metadata_list: list) -> list:
    """Add fields read from each source JSON file to its metadata dictionary."""
    for metadata in metadata_list:
        file_path = metadata['source']  # path of the source file
        try:
            with Path(file_path).open('r', encoding='utf-8') as file:
                dictionary = json.load(file)

            metadata['課別'] = dictionary['課別']
            metadata['業務擔當'] = dictionary['業務擔當']
            # the date is split on '-', so the year and month come first
            ddate = dictionary['拜訪日期']
            metadata['拜訪年'] = int(ddate.split('-')[0])
            metadata['拜訪月'] = int(ddate.split('-')[1])
            metadata['產品'] = dictionary['產品']
            metadata['公司名稱'] = dictionary['公司名稱']
            metadata['大分類'] = dictionary['大分類']
            metadata['中分類'] = dictionary['中分類']
        except Exception as e:
            # covers unreadable files, invalid JSON and missing keys
            print(f"Failed to add metadata for {file_path}: {e}")
    return metadata_list


### set parameters ###
dir = "docs/pns_query"
embed_name = "openai:text-embedding-ada-002"
chunk_size = 99999  # large enough that each file becomes a single chunk
emb_obj = akasha.handle_embeddings(embed_name)
####################

# create chromadb from docs
db, _ = akasha.db.processMultiDB(dir, False, emb_obj, chunk_size, True)


### add metadata to chromadb ###
# get the original metadata from chromadb (a list of dictionaries)
metadata_list = akasha.db.get_db_metadata(dir, embed_name, chunk_size)
# update/add metadata; you can write your own function for this step
metadata_list = add_metadata(metadata_list)
# save the updated metadata back to chromadb
akasha.db.update_db_metadata(metadata_list, dir, embed_name, chunk_size)
print(akasha.db.get_db_metadata(dir, embed_name, chunk_size)[0])



import akasha
from akasha.self_query import query_filter

### set parameters ###
# dir, emb_obj and chunk_size are the values defined in the previous block
db, _ = akasha.db.processMultiDB(dir, False, emb_obj, chunk_size, True)
prompt = "A公司在2024年對電動車的銷售量為多少?"
search_type = "bm25"
model_obj = akasha.helper.handle_model("openai:gpt-4o", True)

## each metadata attribute should include a name, a description and a type (integer, float, string) ##
metadata_field_info = [
    {
        "name": "拜訪年",
        "description": "此訪談紀錄的拜訪年份",
        "type": "integer"
    },
    {
        "name": "拜訪月",
        "description": "此訪談紀錄的拜訪月份",
        "type": "integer"
    },
    {
        "name": "業務擔當",
        "description": "業務的名稱",
        "type": "string"
    },
    {
        "name": "中分類",
        "description": "訪談產品的中等分類",
        "type": "string"
    },
    {
        "name": "公司名稱",
        "description": "訪談對象的公司名稱",
        "type": "string"
    },
    {
        "name": "大分類",
        "description": "訪談產品的大分類",
        "type": "string"
    },
    {
        "name": "產品",
        "description": "訪談的產品名稱/型號",
        "type": "string"
    },
    {
        "name": "課別",
        "description": "公司部門的課別名稱或代號",
        "type": "string"
    },
]

document_content_description = "業務與客戶的訪談紀錄"

####################


### use self-query to filter docs
new_dbs, query, matched_fields = query_filter(
    prompt, model_obj, db, metadata_field_info,
    document_content_description)

### option 1: rank the filtered docs with the chosen search_type (bm25 here)
retriever = akasha.search.get_retrivers(new_dbs, emb_obj, 0.0,
                                        search_type)[0]
retri_docs, retri_scores = retriever._gs(query)


### option 2: pass new_dbs (the filtered docs) to other akasha functions
ak = akasha.Doc_QA(embeddings=emb_obj, model=model_obj)
ak.get_response(doc_path=new_dbs, prompt=prompt)




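Custom parser function

If the default parser cannot extract the query and filter from your language model's response, you can supply a custom parser function. It takes the language model's response (string) as input and returns [query (string), filter (dictionary)].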


import json


def just_an_example(input_text: str):
    # input_text :
    #```json
    #{
    #    "query": "產品疑慮",
    #    "filter": "and(eq(\"公司名稱\", \"a公司\"), eq(\"拜訪年\", \"2024\"))"
    #}
    #```

    # this example ignores input_text and returns a hard-coded result
    jstr = '{"$and": [{"公司名稱": {"$eq": "a公司"}}, {"拜訪年": {"$eq": "2024"}}]}'
    dic = json.loads(jstr)
    return "產品疑慮", dic


### use self-query to filter docs
new_dbs, query, matched_fields = query_filter(
    prompt, model_obj, db, metadata_field_info,
    document_content_description, just_an_example)
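A custom parser would normally read the model's response instead of returning a fixed result. The following is a hedged sketch of one way to do that, assuming the response is a JSON object (optionally wrapped in a json code fence) whose filter contains only `eq("field", "value")` conditions joined by `and`. The name `simple_parser` is hypothetical, and nested or non-`eq` filters are not handled.

```python
import json
import re

FENCE_MARK = "`" * 3  # three backticks, built this way to keep the listing clean
FENCE = re.compile(FENCE_MARK + r"(?:json)?\s*(.*?)" + FENCE_MARK, re.DOTALL)
# Matches eq("field", "value"); other operators are ignored in this sketch.
EQ = re.compile(r'\beq\(\s*"([^"]+)"\s*,\s*"([^"]*)"\s*\)')


def simple_parser(input_text: str):
    """Return (query, filter_dict) parsed from a model response."""
    fenced = FENCE.search(input_text)
    payload = fenced.group(1) if fenced else input_text
    data = json.loads(payload)

    conditions = [{field: {"$eq": value}}
                  for field, value in EQ.findall(data.get("filter", ""))]
    if not conditions:
        filter_dict = {}
    elif len(conditions) == 1:
        filter_dict = conditions[0]
    else:
        filter_dict = {"$and": conditions}
    return data.get("query", ""), filter_dict


# Build a sample response shaped like the one in the comment above.
sample = FENCE_MARK + "json\n" + json.dumps({
    "query": "產品疑慮",
    "filter": 'and(eq("公司名稱", "a公司"), eq("拜訪年", "2024"))',
}, ensure_ascii=False) + "\n" + FENCE_MARK

print(simple_parser(sample))
```

Such a function would be passed to `query_filter` in the same position as `just_an_example`.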



loose filter

Setting the parameter loose_filter=True replaces $and in the filter with $or, so a document is selected as long as it matches any one of the attributes.


### use self-query to filter docs
## the filter becomes '{"$or": [{"公司名稱": {"$eq": "a公司"}}, {"拜訪年": {"$eq": "2024"}}]}'
new_dbs, query, matched_fields = query_filter(
    prompt, model_obj, db, metadata_field_info,
    document_content_description, loose_filter=True)
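The effect of the replacement can be shown with a short recursive function. This is only an illustration of the idea, not akasha's code, and the name `loosen` is made up for the example.

```python
def loosen(flt):
    """Return a copy of the filter with every "$and" key replaced by "$or"."""
    if isinstance(flt, dict):
        return {("$or" if key == "$and" else key): loosen(value)
                for key, value in flt.items()}
    if isinstance(flt, list):
        return [loosen(item) for item in flt]
    return flt  # plain values (strings, numbers) are left unchanged


strict = {"$and": [{"公司名稱": {"$eq": "a公司"}}, {"拜訪年": {"$eq": "2024"}}]}
loose = loosen(strict)
print(loose)
```

With the loose filter, a record for a公司 from any year and a record for any company from 2024 are both selected, so recall goes up and precision goes down.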