原文出处:https://discuss.nebula-graph.com.cn/t/topic/11873
本文是一个基于图数据库 NebulaGraph 上的图算法、图数据库、图神经网络的 ID-Mapping 方法综述,除了基本方法思想的介绍之外,我还给大家弄了可以跑的 Playground。
基于图数据库的用户 ID 识别方法用户ID 识别,是一个很常见的图技术应用场景,在不同的语境下它可能还被叫做 Entity Correlation(实体关联)、Entity Linking(实体链接)、ID Mapping(身份映射)等等。ID 识别解决的问题是找出相同的用户在同一个系统或者不同系统中的不同账号。
由于 ID 识别天然地是一个关联关系问题,也是一个典型的图、图数据库应用场景。
建立图谱图建模我们从一个最简单、直接的图谱开始,如下边的图结构示意显示,我们定义了点:
- userProp: [name, email, birthday, address, phone_num]
- phone
- device
- ip
- address
在他们之间有很自然的边:
- used_deviceProp: time
- logged_in_fromProp: time
- has_phone
- has_address
- has_email
这份数据是开源的,地址在 https://github.com/wey-gu/identity-correlation-datagen
数据写入数据写入我们用一行部署图数据库服务的 nebula-up:https://github.com/wey-gu/nebula-up/
curl -fsSL nebula-up.siwei.io/install.sh | bash
图建模的 Schema 对应的 NebulaGraph DDL 是:
# 创建一个叫做 entity_resolution 的图空间
CREATE SPACE entity_resolution (vid_type=FIXED_STRING(30));
USE entity_resolution;
# 创建点的类型 TAG
CREATE TAG `user` (`name` string NOT NULL, `email` string NOT NULL, `phone_num` string NOT NULL, `birthday` date NOT NULL, `address` string NOT NULL);
CREATE TAG `address` (`address` string NOT NULL);
CREATE TAG `device` (`uuid` string NOT NULL);
CREATE TAG `email` ();
CREATE TAG `ip` ();
CREATE TAG `phone` ();
# 创建边的类型 Edge Type
CREATE EDGE `used_device` (`time` timestamp NOT NULL);
CREATE EDGE `logged_in_from` (`time` timestamp NOT NULL);
CREATE EDGE `has_phone` ();
CREATE EDGE `has_address` ();
CREATE EDGE `has_email` ();
对于写入数据的 DML,这里只给出 user ,email 类型点、has_email 类型边的例子:
INSERT VERTEX `user` (`email`, `name`, `birthday`, `address`, `phone_num`) VALUES
"user_1":("heathermoore@johnson.com","Miranda Miller",date("1957-08-27"),"Brittany Forge Apt. 718 East Eric WV 97881"," 1-652-450-5443x00562"),
"user_2":("holly@welch.org","Holly Pollard",date("1990-10-19"),"1 Amanda Freeway Lisaland NJ 94933","600-192-2985x041"),
"user_3":("julia.h.24@gmail.com","Julia Hall",date("1927-08-24"),"Rodriguez Track East Connorfort NC 63144","1248361783"),
"user_4":("franklin.b@gibson.biz","Franklin Barnett",date("2020-03-01"),"Richard Curve Kingstad AZ 05660","(224)497-9312"),
"user_5":("4kelly@yahoo.com","April Kelly",date("1967-12-01"),"Schmidt Key Lake Charles AL 36174","410.138.1816x98702"),
"user_6":("steven.web@johnson.com","Steven Webb",date("1955-04-24"),"5 Joanna Key Suite 704 Frankshire OK 03035","3666519376"),
"user_7":("Jessica_Torres@morris.com","Jessica Torres",date("1958-09-03"),"1 Payne Circle Mitchellfort LA 73053","535-357-3112x4903"),
"user_8":("brettglenn@gmail.com","Brett Glenn",date("1992-09-03"),"Weber Unions Eddieland MT 64619","660.391.3730"),
"user_9":("veronica.j@yahoo.com","Veronica Jordan",date("1947-06-08"),"2 Klein Mission New Annetteton HI 05775","810-252-6218"),
"user_10":("steven@phelps-craig.info","Steven Brooks",date("1954-06-14"),"1 Vanessa Stravenue Suite 184 Baileyville NY 46381"," 1-665-328-8103x3448"),
"user_11":("ReginaldTheMan@hotmail.com","Reginald Mccullough",date("1915-04-12"),"John Garden Port John LA 54602","030.088.4523x94511"),
"user_12":("Jennifer.f@carroll-acosta.com","Jennifer Foster",date("1988-04-30"),"11 Webb Groves Tiffanyside MN 14566","(489)306-8558x98227"),
"user_13":("Philip66@yahoo.com","Philip Garcia",date("1955-12-01"),"70 Robinson Locks Suite 113 East Veronica ND 87845","490-088-7610x9437"),
"user_14":("Ann@hernandez.com","Ann Williams",date("1947-05-28"),"24 Mcknight Port Apt. 028 Sarahborough MD 38195","868.057.4056x4814"),
"user_15":("Jessica@turner.com","Jessica Stewart",date("1951-11-28"),"0337 Mason Corner Apt. 900 Toddmouth FL 61464","(335)408-3835x883"),
"user_16":("Sandra311@hotmail.com","Sandra Dougherty",date("1908-06-03"),"7 Davis Station Apt. 691 Pittmanfort HI 29746"," 1-189-827-0744x27614"),
"user_17":("Sharon91@gmail.com","Sharon Mccoy",date("1958-09-01"),"1 Southport Street Apt. 098 Westport KY 85907","(814)898-9079x898"),
"user_18":("Sharon91 001@gmail.com","Kathryn Miller",date("1958-09-01"),"1 Southport Street Apt. 098 Westport KY 85907","(814)898-9079x898"),
"user_19":("brettglenn@googlemail.com","Bretty Glenn",date("1991-09-03"),"Weber Unions Eddieland MT 64619","660-391-3730"),
"user_20":("julia.h.24@yahoo.com","Julia H.",date("1927-08-24"),"Rodriguez Track East Connorfort NC 63144","1248361783"),
"user_21":("holly@welch.org","Holly",date("0000-10-19"),"1 Amanda Freeway Lisaland NJ 94933","(600)-192-2985"),
"user_22":("veronica.j@yahoo.com","Veronica Jordan",date("0000-06-08"),"2 Klein HI 05775","(810)-252-6218"),
"user_23":("4kelly@hotmail.com","Kelly April",date("2010-01-01"),"Schmidt Key Lake Charles AL 13617","410-138-1816");
INSERT VERTEX `email` () VALUES
"heathermoore@johnson.com":(),
"holly@welch.org":(),
"julia.h.24@gmail.com":(),
"franklin.b@gibson.biz":(),
"4kelly@yahoo.com":(),
"steven.web@johnson.com":(),
"Jessica_Torres@morris.com":(),
"brettglenn@gmail.com":(),
"veronica.j@yahoo.com":(),
"steven@phelps-craig.info":(),
"ReginaldTheMan@hotmail.com":(),
"Jennifer.f@carroll-acosta.com":(),
"Philip66@yahoo.com":(),
"Ann@hernandez.com":(),
"Jessica@turner.com":(),
"Sandra311@hotmail.com":(),
"Sharon91@gmail.com":(),
"Sharon91 001@gmail.com":(),
"brettglenn@googlemail.com":(),
"julia.h.24@yahoo.com":(),
"holly@welch.org":(),
"veronica.j@yahoo.com":(),
"4kelly@hotmail.com":();
INSERT VERTEX `ip` () VALUES
"202.123.513.12":(),
"202.41.23.11":(),
"143.1.23.4":(),
"143.1.23.12":(),
"153.42.2.8":(),
"9.1.4.1":();
INSERT VERTEX `device`(`uuid`) VALUES
"device_0":("2a8e791d-0183-4df2-aa36-5ac82151be93"),
"device_1":("f9be6a11-f74b-45f5-a9ea-bb3af5a868a2"),
"device_2":("ae083379-91f5-4cd3-b2b3-273960979dab"),
"device_3":("c0981d43-1e59-4cd5-a1e1-e88cd9e792a5"),
"device_4":("e730dd8a-fcd3-47b4-be4a-0190610e6f02");
INSERT EDGE `has_email` () VALUES
"user_1"->"heathermoore@johnson.com":(),
"user_2"->"holly@welch.org":(),
"user_3"->"julia.h.24@gmail.com":(),
"user_4"->"franklin.b@gibson.biz":(),
"user_5"->"4kelly@yahoo.com":(),
"user_6"->"steven.web@johnson.com":(),
"user_7"->"Jessica_Torres@morris.com":(),
"user_8"->"brettglenn@gmail.com":(),
"user_9"->"veronica.j@yahoo.com":(),
"user_10"->"steven@phelps-craig.info":(),
"user_11"->"ReginaldTheMan@hotmail.com":(),
"user_12"->"Jennifer.f@carroll-acosta.com":(),
"user_13"->"Philip66@yahoo.com":(),
"user_14"->"Ann@hernandez.com":(),
"user_15"->"Jessica@turner.com":(),
"user_16"->"Sandra311@hotmail.com":(),
"user_17"->"Sharon91@gmail.com":(),
"user_18"->"Sharon91 001@gmail.com":(),
"user_19"->"brettglenn@googlemail.com":(),
"user_20"->"julia.h.24@yahoo.com":(),
"user_21"->"holly@welch.org":(),
"user_22"->"veronica.j@yahoo.com":(),
"user_23"->"4kelly@hotmail.com":();
INSERT EDGE `used_device` (`time`) VALUES
"user_2"->"device_0":(timestamp("2021-03-01T08:00:00")),
"user_21"->"device_0":(timestamp("2021-03-01T08:01:00")),
"user_18"->"device_1":(timestamp("2021-03-01T08:02:00")),
"user_17"->"device_1":(timestamp("2021-03-01T08:03:00")),
"user_22"->"device_2":(timestamp("2021-03-01T08:04:00")),
"user_9"->"device_3":(timestamp("2021-03-01T08:05:00")),
"user_9"->"device_2":(timestamp("2021-03-01T08:06:00")),
"user_23"->"device_4":(timestamp("2021-03-01T08:07:00"));
INSERT EDGE `logged_in_from` (`time`) VALUES
"user_2"->"202.123.513.12":(timestamp("2021-03-01T08:00:00")),
"user_21"->"202.41.23.11":(timestamp("2021-03-01T08:01:00")),
"user_18"->"143.1.23.4":(timestamp("2021-03-01T08:02:00")),
"user_17"->"143.1.23.12":(timestamp("2021-03-01T08:03:00")),
"user_22"->"153.42.2.8":(timestamp("2021-03-01T08:04:00")),
"user_9"->"153.42.2.8":(timestamp("2021-03-01T08:05:00")),
"user_9"->"153.42.2.8":(timestamp("2021-03-01T08:06:00")),
"user_23"->"9.1.4.1":(timestamp("2021-03-01T08:07:00"));
根据确定规则获取 ID 映射关系
这种方式是最简单、直接的方法,在特定的场景下也可能是有用的。试想象 email、IP 地址、上网设备这些有严格的结构的数据,在它们成为图谱中的点的时候,简单的相等关系就足以找出这样对应关系,比如:
- 拥有相同的 email
- 使用过相同的 IP 地址
- 使用过相同的设备
在前面的图谱、图数据库中,拥有相同的 email 可以直接表达为如下的图模式(Graph Pattern)。
(:user)-[:has_email]->(:email)<-[:has_email]-[:user]
下图为顶点 user 与边 has_email 的一个图的可视化结果,可以看到这其中有两个三个点相连的串正是符合拥有相同 email 的模式的点。
这个结果的数据源在 https://github.com/wey-gu/identity-correlation-datagen/tree/main/sample/hand_crafted。如果通过线上访问原文,你可以鼠标悬停(获取点上的属性)和框选放大每一个点和子图哦。
在构建 ID Mapping 系统的过程中,我们通过图数据库直接查询,可视化渲染结果来看到等效的洞察。这个查询可以写成:
MATCH p=(:user)-[:has_email]->(:email)<-[:has_email]-(:user)
RETURN p limit 10
可视化图探索工具 NebulaGraph Studio 中的查询结果: