Datasets

Overview

This section presents sample datasets that Corvic recommends for kick-starting your embedding spaces. The datasets cover a variety of use cases and show the range of the Corvic platform. They are provided in .parquet format and can be accessed from the corvic-sample-datasets G-Drive folder. For each dataset, we give a brief overview and the results of embedding it with the Corvic platform.

LDBC-SF0.01

This sample dataset is derived from the LDBC Financial Benchmark, as detailed in its specification document. The benchmark aims to establish a standard that captures the data and query patterns typical of the financial industry, so that graph database systems can be tested on workloads representative of real-world financial use cases. This makes evaluations of different graph databases reliable and comparable, which supports better-informed decisions when selecting a graph database system for financial applications.

Schema Definition

Entity (Dimension) Files
- Person.parquet: Real-world individuals.
- Company.parquet: Entities that people or other companies invest in.
- Account.parquet: Financial accounts registered and owned by persons and companies.
- Loan.parquet: Loans applied for by individuals and companies.
- Medium.parquet: Things used to sign in to an account (IP, MAC, phone numbers).

Relation (Fact) Files
- AccountTransferAccount.parquet: Fund transfers between accounts.
- AccountWithdrawAccount.parquet: Funds moved from one account to a card account.
- AccountRepayLoan.parquet: Loan repayment from an account.
- LoanDepositAccount.parquet: Loan funds deposited to an account.
- MediumSignInAccount.parquet: Account signed in to with a medium.
- CompanyInvestCompany.parquet: Company invests in a company.
- PersonInvestCompany.parquet: Person invests in a company.
- CompanyApplyLoan.parquet: Company applies for a loan.
- PersonApplyLoan.parquet: Person applies for a loan.
- CompanyGuaranteeCompany.parquet: Company guarantees another company.
- PersonGuaranteePerson.parquet: Person guarantees another person.
- CompanyOwnAccount.parquet: Company owns an account.
- PersonOwnAccount.parquet: Person owns an account.
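The relation file names above appear to follow a SourceVerbTarget pattern (for example, PersonOwnAccount links a Person to an Account). The sketch below relies on that observation to derive a typed edge schema from the file names alone. The helper `parse_relation` is hypothetical and is not part of the Corvic platform; it only illustrates how the entity and relation files fit together as a graph.

```python
# Entity names taken from the entity (dimension) files listed above.
ENTITIES = ("Person", "Company", "Account", "Loan", "Medium")

# Relation (fact) file names as listed above.
RELATION_FILES = [
    "AccountTransferAccount.parquet",
    "AccountWithdrawAccount.parquet",
    "AccountRepayLoan.parquet",
    "LoanDepositAccount.parquet",
    "MediumSignInAccount.parquet",
    "CompanyInvestCompany.parquet",
    "PersonInvestCompany.parquet",
    "CompanyApplyLoan.parquet",
    "PersonApplyLoan.parquet",
    "CompanyGuaranteeCompany.parquet",
    "PersonGuaranteePerson.parquet",
    "CompanyOwnAccount.parquet",
    "PersonOwnAccount.parquet",
]


def parse_relation(file_name):
    """Split 'PersonOwnAccount.parquet' into ('Person', 'Own', 'Account').

    Hypothetical helper: assumes the SourceVerbTarget naming pattern.
    """
    stem = file_name.removesuffix(".parquet")  # requires Python 3.9+
    for source in ENTITIES:
        if not stem.startswith(source):
            continue
        rest = stem[len(source):]
        for target in ENTITIES:
            # Require a non-empty verb between the source and target names.
            if rest.endswith(target) and len(rest) > len(target):
                return source, rest[: -len(target)], target
    raise ValueError(f"unrecognized relation file name: {file_name}")


# Map each relation file to its (source entity, verb, target entity) triple.
schema = {name: parse_relation(name) for name in RELATION_FILES}

if __name__ == "__main__":
    for name, (source, verb, target) in schema.items():
        print(f"{source} -[{verb}]-> {target}  ({name})")
```

Read this way, each relation file contributes one edge type to the graph, and each entity file contributes one node type.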

Sample Embedding Space

The left figure below illustrates a feature view that incorporates all primary entities from the LDBC-SF dataset as the entities of interest for embedding. These entities can be embedded using our graph transformation and structural encoding algorithm. The right figure presents a UMAP plot of these graph structural embeddings consolidated into a single vector space.

LDBC-SF0.01 Sample Embedding Space

AmazonReviews

This collection comprises more than 34,000 consumer reviews of Amazon products, including popular items such as the Kindle and Fire TV Stick. The reviews are sourced from a Kaggle project that uses a representative sample of a larger dataset made available by Datafiniti's Product Database. Beyond basic product information and customer ratings, the dataset includes the review text, which gives more detailed insight into customer experiences, along with additional information about each product. This makes it a useful resource for analyzing consumer behavior or conducting market research.

Schema Definition

Entity (Dimension) Files
- products.parquet: Information about the products: name, brand, category, etc.
- users.parquet: Information about the customers: username, age, state, etc.

Relation (Fact) Files
- reviews.parquet: Customer review of a product: title, date, text, etc.
- transaction.parquet: Product purchase transactions made by users.

Sample Embedding Space

The feature view (depicted on the left) represents a mixture of users and products, linked by transactions. The UMAP plot (shown on the right) illustrates the structural embeddings of these entities in a unified vector space.
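To make "users and products, linked by transactions" concrete, the following minimal sketch builds a bipartite graph from toy transaction rows using only the Python standard library. The column names `user_id` and `product_id` and the sample rows are assumptions made for illustration; they are not taken from the actual files, and this is not the Corvic structural encoding algorithm, only the graph shape such an algorithm would start from.

```python
from collections import defaultdict

# Toy rows standing in for transaction.parquet.
# `user_id` and `product_id` are hypothetical column names.
transactions = [
    {"user_id": "u1", "product_id": "p1"},
    {"user_id": "u2", "product_id": "p1"},
    {"user_id": "u2", "product_id": "p2"},
]


def build_bipartite(rows):
    """Return an adjacency map whose nodes are ('user', id) or ('product', id)."""
    neighbors = defaultdict(set)
    for row in rows:
        user = ("user", row["user_id"])
        product = ("product", row["product_id"])
        # Each transaction becomes one undirected user-product edge.
        neighbors[user].add(product)
        neighbors[product].add(user)
    return dict(neighbors)


graph = build_bipartite(transactions)

if __name__ == "__main__":
    for node in sorted(graph):
        print(node, "->", sorted(graph[node]))
```

In this toy graph, users u1 and u2 share the neighbor p1, which is the kind of structural similarity a graph embedding can pick up.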

Amazon Sample Embedding Space

ZacharyKarateClub

The Zachary karate club network is an undirected social network published by Wayne Zachary in 1977, in which each node represents a club member and each edge represents a tie between two members. It is often used as a community detection benchmark: the task is to recover the two groups that formed after a dispute between the club's instructor and its administrator. The network, featured in Zachary's paper and later popularized by Girvan and Newman in 2002, includes 34 nodes, 156 edges (78 undirected ties, each counted in both directions), and 2 classes. Official website: http://konect.cc/networks/ucidata-zachary/

Schema Definition

Entity (Dimension) Files
- adherent.parquet: Information about the club members: id, name, club.

Relation (Fact) Files
- relation.parquet: Connections between the club members.
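Because the network is undirected, a tie between two members can be stored either once or as two directed rows, which doubles the edge count (156 is twice 78). The sketch below shows that conversion and a per-member degree count on toy data. The member ids and ties are invented for illustration and are not the real club data; how relation.parquet actually stores its rows is not stated here, so the two-rows-per-tie layout is an assumption.

```python
from collections import Counter

# Toy undirected ties between hypothetical member ids (not the real club data).
ties = [(1, 2), (1, 3), (2, 3), (3, 4)]


def to_directed(undirected):
    """Expand each undirected tie (a, b) into both (a, b) and (b, a)."""
    directed = set()
    for a, b in undirected:
        directed.add((a, b))
        directed.add((b, a))
    return sorted(directed)


def degrees(undirected):
    """Count how many ties each member takes part in."""
    counts = Counter()
    for a, b in undirected:
        counts[a] += 1
        counts[b] += 1
    return counts


directed_edges = to_directed(ties)
member_degrees = degrees(ties)

if __name__ == "__main__":
    print(len(ties), "undirected ties ->", len(directed_edges), "directed edges")
    print(dict(member_degrees))
```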

Sample Embedding Space

The feature view (depicted on the left) represents the club members, linked by their relations. The UMAP plot (shown on the right) illustrates the structural embeddings of these members in a unified vector space.

ZacharyKarateClub Sample Embedding Space