
QAmatch (qa_match) / text matching / text classification / text embedding / text clustering / text retrieval (bow/tfidf/ngram-tfidf/bert/albert/bm25/…/nn/gbdt/xgb/kmeans/dbscan/faiss/….)


MachineLP/TextMatch


TextMatch

TextMatch is a semantic matching model library for QA & text search ... It's easy to train models and to export representation vectors.

Let's run the examples!

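Test models list

| Model | models | tests |
|---|---|---|
| Bow | 1 | test |
| TFIDF | 2 | test |
| Ngram-TFIDF | 3 | test |
| W2V | 4 | test |
| BERT | 5 | BERT-whitening corrects the distribution of BERT sentence vectors so that cosine similarity becomes more meaningful. SimCSE |
| ALBERT | 6 | test (link: https://pan.baidu.com/s/1HSVS104iBBOsfw7hXdyqLQ password: 808k) |
| DSSM | | |
| bm25 | 8 | test |
| edit_sim | 9 | test |
| jaccard_sim | 10 | test |
| wmd | 11 | test |
| Kmeans | 12 | test |
| DBSCAN | 13 | test |
| PCA | 14 | test |
| FAISS | 15 | test |
| .... | | |
| lr | 92 | test |
| gbdt | 93 | test |
| gbdt_lr | 94 | test |
| lgb | 95 | test |
| xgb | 96 | test |
| Bagging | 97 | test |
| QA | 98 | test |
| Text Embedding | 99 | test |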
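BERT-whitening, mentioned in the BERT row above, centers the sentence vectors and then rotates and rescales them so that their covariance becomes the identity matrix. The block below is a minimal NumPy sketch of that idea, not the code used in this repository. The function names `compute_whitening` and `apply_whitening` and the sample data are assumptions made for illustration.

```python
import numpy as np


def compute_whitening(vectors):
    """Return (kernel, bias) such that (x + bias) @ kernel has zero mean
    and identity covariance over the given vectors."""
    mu = vectors.mean(axis=0, keepdims=True)
    cov = np.cov(vectors.T)               # (dim, dim) covariance matrix
    u, s, _ = np.linalg.svd(cov)          # cov = u @ diag(s) @ u.T
    kernel = u @ np.diag(1.0 / np.sqrt(s))
    return kernel, -mu


def apply_whitening(vectors, kernel, bias):
    """Center the vectors and project them with the whitening kernel."""
    return (vectors + bias) @ kernel


# Hypothetical stand-in for sentence vectors: correlated, non-centered data.
rng = np.random.default_rng(0)
mixing = np.array([
    [2.0, 0.5, 0.0, 0.0],
    [0.0, 1.0, 0.3, 0.0],
    [0.0, 0.0, 1.5, 0.2],
    [0.1, 0.0, 0.0, 1.0],
])
sample = rng.normal(size=(200, 4)) @ mixing + 3.0

kernel, bias = compute_whitening(sample)
whitened = apply_whitening(sample, kernel, bias)
```

Cosine similarity would then be computed on the whitened vectors instead of the raw ones. Keeping only the first few columns of `kernel` would also reduce the dimensionality.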

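Train models list

| Model | models | train |
|---|---|---|
| Bow | 1 | train |
| TFIDF | 2 | train |
| Ngram-TFIDF | 3 | train |
| W2V | 4 | train |
| BERT | 5 | train |
| ALBERT | 6 | train |
| DSSM | | |
| Kmeans | 12 | train |
| DBSCAN | 13 | train |
| PCA | 14 | train |
| .... | | |
| lr | 92 | train |
| gbdt | 93 | train |
| gbdt_lr | 94 | train |
| lgb | 95 | train |
| xgb | 96 | train |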

TODO

(1) knn (2) dssm (3) named entity recognition (4) text error correction

  • wechat ID: lp9628

Example:

```bash
git clone https://github.com/MachineLP/TextMatch
cd TextMatch
pip install -r requirements.txt
export PYTHONPATH=${PYTHONPATH}:../TextMatch
python examples/text_search.py
```

examples/text_search.py

```python
import sys
from textmatch.models.text_embedding.model_factory_sklearn import ModelFactory

if __name__ == '__main__':
    # Documents to search, keyed by id
    doc_dict = {"0": "我去玉龙雪山并且喜欢玉龙雪山玉龙雪山", "1": "我在玉龙雪山并且喜欢玉龙雪山", "2": "我在九寨沟", "3": "你好"}
    # Query
    query = "我在九寨沟,很喜欢"

    # Model factory: add the models you need to the list:
    # 'bow', 'tfidf', 'ngram_tfidf', 'bert', 'albert', 'w2v'
    mf = ModelFactory(match_models=['bow', 'tfidf', 'ngram_tfidf'])
    # Initialize the models on the documents
    mf.init(words_dict=doc_dict, update=True)

    # Similarity between the query and each document
    search_res = mf.predict(query)
    print('search_res>>>>>', search_res)
    # search_res>>>>>{'bow': [('0', 0.2773500981126146), ('1', 0.5303300858899106), ('2', 0.8660254037844388), ('3', 0.0)], 'tfidf': [('0', 0.2201159065358879), ('1', 0.46476266418455736), ('2', 0.8749225357988296), ('3', 0.0)], 'ngram_tfidf': [('0', 0.035719486884261346), ('1', 0.09654705406841395), ('2', 0.9561288696241232), ('3', 0.0)]}

    # Embedding of the query
    query_emb = mf.predict_emb(query)
    print('query_emb>>>>>', query_emb)
    '''
    query_emb>>>>>{'bow': array([1., 0., 0., 1., 1., 0., 1., 0.]), 'tfidf': array([0.61422608, 0. , 0. , 0.4842629 , 0.4842629 , 0. , 0.39205255, 0. ]), 'ngram_tfidf': array([0. , 0. , 0.37156534, 0.37156534, 0. , 0. , 0. , 0.29294639, 0. , 0.37156534, 0.37156534, 0. , 0. , 0.37156534, 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0.29294639, 0.37156534, 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. ])}
    '''
```
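To make the `bow` scores above easier to interpret, here is a minimal standard-library sketch of bag-of-words cosine similarity. It is not the library's implementation. The word segmentation of each sentence is an assumption made for illustration, and the helper names `bow_vector` and `cosine` are hypothetical.

```python
import math
from collections import Counter

# Assumed word segmentation of the documents and query in text_search.py.
docs = {
    "0": ["我", "去", "玉龙雪山", "并且", "喜欢", "玉龙雪山", "玉龙雪山"],
    "1": ["我", "在", "玉龙雪山", "并且", "喜欢", "玉龙雪山"],
    "2": ["我", "在", "九寨沟"],
    "3": ["你好"],
}
query = ["我", "在", "九寨沟", "很", "喜欢"]

# The vocabulary is built from the documents only; unseen query words are dropped.
vocab = sorted({token for tokens in docs.values() for token in tokens})


def bow_vector(tokens):
    """Count how often each vocabulary word occurs in the token list."""
    counts = Counter(tokens)
    return [float(counts[word]) for word in vocab]


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


query_vec = bow_vector(query)
scores = {doc_id: cosine(query_vec, bow_vector(tokens)) for doc_id, tokens in docs.items()}
print(scores)
```

Under this assumed segmentation the scores agree with the `bow` values in the sample output: document "2" shares three of the query's four in-vocabulary words and ranks first, while document "3" shares none and scores 0.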


Example:

```bash
git clone https://github.com/MachineLP/TextMatch
cd TextMatch
export PYTHONPATH=${PYTHONPATH}:../TextMatch
python tests/tools_test/faiss_test.py
```

tests/tools_test/faiss_test.py

```python
import sys
import json
import time
import faiss
import numpy as np
from faiss import normalize_L2
from textmatch.config.constant import Constant as const
from textmatch.core.text_embedding import TextEmbedding
from textmatch.tools.decomposition.pca import PCADecomposition
from textmatch.tools.faiss.faiss import FaissSearch

test_dict = {"id0": "其实事物发展有自己的潮流和规律",
             "id1": "当你身处潮流之中的时候,要紧紧抓住潮流的机会",
             "id2": "想办法脱颖而出,即使没有成功,也会更加洞悉时代的脉搏",
             "id3": "收获珍贵的知识和经验。而如果潮流已经退去",
             "id4": "这个时候再去往这个方向上努力,只会收获迷茫与压抑",
             "id5": "对时代、对自己都没有什么帮助",
             "id6": "但是时代的浪潮犹如海滩上的浪花,总是一浪接着一浪,只要你站在海边,身处这个行业之中,下一个浪潮很快又会到来。你需要敏感而又深刻地去观察,略去那些浮躁的泡沫,抓住真正潮流的机会,奋力一搏,不管成败,都不会遗憾。",
             "id7": "其实事物发展有自己的潮流和规律",
             "id8": "当你身处潮流之中的时候,要紧紧抓住潮流的机会"}

if __name__ == '__main__':
    # ['bow', 'tfidf', 'ngram_tfidf', 'bert']
    # ['bow', 'tfidf', 'ngram_tfidf', 'bert', 'w2v']
    # text_embedding = TextEmbedding( match_models=['bow', 'tfidf', 'ngram_tfidf', 'w2v'], words_dict=test_dict )
    text_embedding = TextEmbedding(match_models=['bow', 'tfidf', 'ngram_tfidf', 'w2v'], words_dict=None, update=False)
    feature_list = []
    for sentence in test_dict.values():
        pre = text_embedding.predict(sentence)
        feature = np.concatenate([pre[model] for model in ['bow', 'tfidf', 'ngram_tfidf', 'w2v']], axis=0)
        feature_list.append(feature)
    pca = PCADecomposition(n_components=8)
    data = np.array(feature_list)
    pca.fit(data)
    res = pca.transform(data)
    print('res>>', res)

    pre = text_embedding.predict("潮流和规律")
    feature = np.concatenate([pre[model] for model in ['bow', 'tfidf', 'ngram_tfidf', 'w2v']], axis=0)
    test = pca.transform([feature])

    faiss_search = FaissSearch(res, sport_mode=False)
    faiss_res = faiss_search.predict(test)
    print("faiss_res:", faiss_res)
    '''
    faiss kmeans result times 8.0108642578125e-05
    faiss_res: [{0: 0.7833399, 7: 0.7833399, 3: 0.63782495}]
    '''

    faiss_search = FaissSearch(res, sport_mode=True)
    faiss_res = faiss_search.predict(test)
    print("faiss_res:", faiss_res)
    '''
    faiss kmeans result times 3.266334533691406e-05
    faiss_res: [{0: 0.7833399, 7: 0.7833399, 3: 0.63782495}]
    '''
```
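The test above returns, for each query vector, a mapping from the index of a stored vector to a similarity score. As a rough illustration of what such a search computes, here is a brute-force top-k cosine search in plain Python. It is a stand-in that assumes cosine similarity over L2-normalized vectors. It does not use FAISS and is not how `FaissSearch` is implemented. The names `l2_normalize` and `top_k_cosine` and the toy vectors are hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def top_k_cosine(index_vectors, query, k=3):
    """Return {row index: cosine similarity} for the k best matches,
    ordered from most to least similar."""
    q = l2_normalize(query)
    scored = []
    for i, vec in enumerate(index_vectors):
        v = l2_normalize(vec)
        scored.append((i, sum(a * b for a, b in zip(q, v))))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return dict(scored[:k])


# Toy 2-d vectors standing in for the PCA-reduced sentence features.
index_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
result = top_k_cosine(index_vectors, [1.0, 0.2], k=2)
print(result)
```

A brute-force scan like this costs time proportional to the number of stored vectors per query. Index structures such as those in FAISS exist to make the same lookup fast on large collections.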

Run train_model/ (train embeddings: bow/tfidf/ngram tfidf/bert/albert...; train classifiers)

```bash
git clone https://github.com/MachineLP/TextMatch
cd TextMatch
pip install -r requirements.txt
export PYTHONPATH=${PYTHONPATH}:../TextMatch
python train_model/train_bow.py               # text embedding
python train_model/train_tfidf.py             # text embedding
python train_model/train_ngram_tfidf.py       # text embedding
python train_model/train_bert.py              # text embedding
python train_model/train_albert.py            # text embedding
python train_model/train_w2v.py               # text embedding
python train_model/train_dssm.py              # text embedding
python train_model/train_lr_classifer.py      # text classification
python train_model/train_gbdt_classifer.py    # text classification
python train_model/train_gbdlr_classifer.py   # text classification
python train_model/train_lgb_classifer.py     # text classification
python train_model/train_xgb_classifer.py     # text classification
python train_model/train_dnn_classifer.py     # text classification
```

Run tests/core_test (QA / text embedding)

```bash
git clone https://github.com/MachineLP/TextMatch
cd TextMatch
pip install -r requirements.txt
export PYTHONPATH=${PYTHONPATH}:../TextMatch
python tests/core_test/qa_match_test.py
python tests/core_test/text_embedding_test.py
```

Run tests/models_test (model tests)

```bash
git clone https://github.com/MachineLP/TextMatch
cd TextMatch
pip install -r requirements.txt
export PYTHONPATH=${PYTHONPATH}:../TextMatch
python tests/models_test/bm25_test.py                  # bm25
python tests/models_test/edit_sim_test.py              # edit distance
python tests/models_test/jaccard_sim_test.py           # jaccard
python tests/models_test/bow_sklearn_test.py           # bow
python tests/models_test/tf_idf_sklearn_test.py        # tf_idf
python tests/models_test/ngram_tf_idf_sklearn_test.py  # ngram_tf_idf
python tests/models_test/w2v_test.py                   # w2v
python tests/models_test/albert_test.py                # bert
```

Run tests/ml_test (machine learning model tests)

```bash
git clone https://github.com/MachineLP/TextMatch
cd TextMatch
pip install -r requirements.txt
export PYTHONPATH=${PYTHONPATH}:../TextMatch
python tests/ml_test/lr_test.py        # lr
python tests/ml_test/gbdt_test.py      # gbdt
python tests/ml_test/gbdt_lr_test.py   # gbdt_lr
python tests/ml_test/lgb_test.py       # lgb
python tests/ml_test/xgb_test.py       # xgb
```

Run tests/tools_test (clustering / dimensionality reduction tool tests)

```bash
git clone https://github.com/MachineLP/TextMatch
cd TextMatch
pip install -r requirements.txt
export PYTHONPATH=${PYTHONPATH}:../TextMatch
python tests/tools_test/kmeans_test.py   # kmeans
python tests/tools_test/dbscan_test.py   # dbscan
python tests/tools_test/pca_test.py      # pca
python tests/tools_test/faiss_test.py    # faiss
```

Run tests/tools_test (word cloud)

```bash
git clone https://github.com/MachineLP/TextMatch
cd TextMatch
pip install -r requirements.txt
cd tests/tools_test
python generate_word_cloud.py
```

