Vector-Based Recall Algorithms and Their Application in Personalized Ads and News - Zheng Liu
Embedding Based Recall: Practice, Progress and Perspectives
Zheng Liu, Jianxun Lian, Xing Xie
Social Computing Group, MSRA
Aug 15th, 2021
Reinforced Anchor Knowledge Graph Generation for News Recommendation Reasoning, Liu et al., KDD 2021

Outline
- Overview
  - Multi-Stage Pipeline
  - EBR: Pros and Cons
- Embedding learning algorithms
  - Negative Augmentation
  - Hard Negative Sampling
  - Diversified representation
  - Training as knowledge distillation
- Things beyond learning algorithms
  - Efficiency issues
  - Combo of sparse and dense

Overview: Multi-Stage Pipeline
- L1 Stage (Recall) -> L2 Stage (Rank) -> L3 Stage (Re-rank); the candidate set shrinks by orders of magnitude at each stage.
- Recall: fast, accurate, comprehensive.
- Rank: high-precision, KPI-oriented.

Overview: Multi-Stage Pipeline (cont.)
Recall must be fast and accurate. Example: for the query "Xbox 360 4GB Slim Console", relevant ads include "Microsoft Xbox 360", "Microsoft Xbox 360 E 250GB", and "Xbox 360 Game System HDMI".

Overview: Multi-Stage Pipeline (cont.)
Vocabulary mismatch: the query "Microsoft game console" shares few terms with relevant ads such as "Microsoft Xbox 360 E console 250GB" and "Xbox 360 Game System HDMI", so purely lexical matching fails.

Overview: EBR, Pros and Cons
- Pro: highly generalizable and relatively fast. The query "Microsoft game console" and ads such as "Xbox 360 Game System HDMI" can be matched in embedding space via an ANN index (PQ, HNSW).
- Con: model training is data-intensive, requiring many labeled pairs such as <Q: "Xbox 360 Game System HDMI", K: "Microsoft game console">.
- Con: embeddings can be ambiguous; e.g., "Nintendo Switch console", "the Xbox 360 console hard drive" ("Hard drive will be at least 20GB"), and "Xbox 360 Game System HDMI" ("model supports HD graphics in 16x9 wide-screen, with anti-aliasing") may sit close together.

Overview: EBR, Pros and Cons (cont.)
Two desirable properties of the embedding space: alignment (positive pairs stay close) and uniformity (embeddings spread evenly over the space).
[1] SimCSE, Gao et al.
[2] Understanding Contrastive Representation Learning, Wang et al., ICML 2020.
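The alignment and uniformity properties above (SimCSE; Wang et al., ICML 2020) have simple closed forms on the unit hypersphere. A minimal pure-Python sketch, with function names of my own choosing:

```python
import math

def sq_dist(a, b):
    # Squared Euclidean distance between two (unit-normalized) vectors.
    return sum((x - y) ** 2 for x, y in zip(a, b))

def alignment(pairs):
    # Mean squared distance between positive pairs; lower = better aligned.
    return sum(sq_dist(q, k) for q, k in pairs) / len(pairs)

def uniformity(embs, t=2.0):
    # Log of the mean Gaussian potential over all distinct pairs;
    # lower (more negative) = embeddings spread more evenly on the sphere.
    pots = [math.exp(-t * sq_dist(embs[i], embs[j]))
            for i in range(len(embs)) for j in range(len(embs)) if i != j]
    return math.log(sum(pots) / len(pots))
```

Positive pairs collapsing onto the same point drive alignment toward 0, while embeddings covering the whole sphere drive uniformity down; contrastive objectives such as in-batch NCE trade these two properties off.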
Outline: Embedding learning algorithms

Algos: Negative Augmentation
In-batch negatives (DPR, Karpukhin et al.): within a batch of <Q, K> pairs, each query's own key is its positive, and every other key in the batch is used as a negative. A larger batch size brings more negatives and higher accuracy.

Algos: Negative Augmentation (cont.)
Cross-device negative sampling (RocketQA, Ding et al.): expand the number of negatives by a factor of the number of devices (Dev-1, ..., Dev-N) by sharing each device's in-batch negatives with all other devices.

Algos: Negative Augmentation (cont.)
Virtual Differentiable Cross-Device Sharing (V-DCS) (SoPQ, Xiao and Liu et al.): cross-device values are made virtually differentiable.
1. Generate embeddings for each batch, one batch per device.
2. Broadcast the embeddings to all devices.
3. Compute the global NCE loss symmetrically on all devices, based on the broadcasted embeddings.
4. Back-propagate and reduce the gradients on all devices.

Algos: Hard Negative Sampling
Approximate Nearest Neighbor Negative Contrastive Learning for Dense Text Retrieval (ANCE, Xiong et al.): obtain hard negatives by sampling from ANN search.
1. Learn the embedding model with in-batch negatives.
2. Build an ANN index and retrieve hard negatives.
3. Update the embedding model with the hard negatives.
4. Repeat steps 2 and 3 until convergence.

Algos: Training as distillation
(RocketQA, Ding et al.; Weak Annotation, Li et al.)
1. Train the teacher model with labeled <Query, Keyword> data.
2. Annotate unlabeled data with the teacher: keywords are sorted by estimated relevance to yield weak-annotated labels.
3. Train the student with both the labeled and the weak-annotated data.

Algos: Diversified representation
A user's history (web browsings) may consist of highly diverse events, so a single user embedding can be ambiguous with respect to the target (news/ads).

Algos: Diversified representation (cont.)
Elastic Multi-embedding Retrieval (Octopus, Liu et al.): a Bloom-filter-style interest extractor that is comprehensive, elastic, and parameter-efficient, backed by an ANN index.
1. Generate item embeddings.
2. Compute each item embedding's membership via learned hash functions.
3. Group items based on their binary codes (e.g., 0100, 0010, 1000).
4. Aggregate the items sharing a binary code into one user embedding.
With 3 hash classes and 4 hash functions each, this yields 3^4 latent partitions.

Outline: Things beyond learning algorithms

Things beyond: Efficiency
Desired properties of ANN indexes (HNSW, PQ, ANNOY; see also FAISS from Facebook AI Research):
- Accurate (high recall)
- Fast (low latency)
- Light (low memory cost): there is often not enough budget to host the whole index in memory.
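To make the memory point concrete, here is a toy pure-Python sketch of product quantization (PQ), one of the index types named above. The helper names (pq_train, pq_encode, pq_adc) and the tiny k-means are my own illustration, not the FAISS API:

```python
import random

def sq_dist(a, b):
    # Squared Euclidean distance between two vectors.
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(v, centroids):
    # Index of the centroid closest to subvector v.
    return min(range(len(centroids)), key=lambda i: sq_dist(v, centroids[i]))

def kmeans(vecs, k, iters=10):
    # Tiny k-means used to build one subspace codebook.
    centroids = random.sample(vecs, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for v in vecs:
            clusters[nearest(v, centroids)].append(v)
        for ci, cl in enumerate(clusters):
            if cl:  # keep the old centroid if the cluster went empty
                centroids[ci] = [sum(dim) / len(cl) for dim in zip(*cl)]
    return centroids

def pq_train(vecs, m, k):
    # Split each vector into m subvectors; learn a k-entry codebook per subspace.
    d = len(vecs[0]) // m
    return [kmeans([v[i * d:(i + 1) * d] for v in vecs], k) for i in range(m)]

def pq_encode(v, codebooks):
    # Compress a vector to m small integers (codebook indices).
    d = len(v) // len(codebooks)
    return [nearest(v[i * d:(i + 1) * d], cb) for i, cb in enumerate(codebooks)]

def pq_adc(query, code, codebooks):
    # Asymmetric distance: exact query subvectors vs. the quantized item.
    d = len(query) // len(codebooks)
    return sum(sq_dist(query[i * d:(i + 1) * d], codebooks[i][code[i]])
               for i in range(len(codebooks)))
```

Each indexed item is stored as m small integers instead of d floats (e.g., an 8-dim float32 vector of 32 bytes shrinks to 4 sub-byte codes), which is how PQ-style indexes keep billion-scale recall within a memory budget at some cost in accuracy.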