From GraphPowered Machine Learning by Alessandro Negro This article discusses managing data in graphpowered machine learning projects. 
Take 40% off GraphPowered Machine Learning by entering fccnegro into the discount code box at checkout at manning.com.
Managing data sources
Figure 1. Managing data sources in the mental model
As you are probably aware, graphs are extremely useful for encoding information, and data in graph format is increasingly plentiful. Many areas of machine learning, including natural language processing, computer vision, and recommendations, graphs are used to model local relationships between isolated data items (users, items, events, and others) and to construct global structures from local information.^{[1]} Representing data as graphs is often a necessary step (and at other times a desirable one) when dealing with problems arising from applications in machine learning or data mining. In particular, it becomes crucial when we want to apply graphbased learning methods to the datasets.
The transformation from structured or unstructured data to a graph representation can be performed in a lossless manner, but this isn’t always necessary (or desirable) for the purpose of the learning algorithm. Sometimes, a better model is an “aggregated view” of the data. For instance, if you’re modeling a phone call between two people you can decide to have a relationship between the two entities (the caller and the receiver) for each call, or you can have a single relationship that aggregates all the calls.
A graph can be constructed from the input dataset in two ways:
 By designing a graph model that represents the data
 By using some convenient graph formation criteria
In the first case, the graph model is an alternative representation of the same information available in the dataset itself, or in multiple datasets. The nodes and the relationships in the graph are a mere representation (aggregated or not) of the data available in the original sources. Furthermore, in this case the graph acts as a connected data source that merges data coming from multiple heterogeneous sources, operating, at the end of the process, as the single trusted source of truth. Multiple methodologies, techniques, and best practices should be applied this model shift and represent data in a graph format, and we l discuss some of them here, considering multiple scenarios.
In the second scenario (using some convenient graph formation criteria), the data items are stored in the graph (generally as nodes) and a graph is created using some edge construction mechanism. As an example, suppose that in your graph each node represents some text, such as a sentence or an entire document. They’re isolated entries. No relationship exists between them unless they’re connected explicitly (via a citation in a paper, for instance). In machine learning, text is generally represented as a vector where each entry contains the weight of a word or a “feature” in the text. Edges can be created (constructed) using the similarity or dissimilarity value between the vectors. A new graph is created starting from unrelated information. In such a case, the resulting graph embeds more information than the original datasets. This additional information is made up of several ingredients, the most important of which is the structural or topological information of the data relationships.
The result of both processes is a graph that represents the input data and that becomes the training dataset for the relevant machine learning algorithms. In some cases, these algorithms are themselves graph algorithms, and they require a graph representation of the data, and in other cases the graph is a better way of accessing the same data. The examples and scenarios described here represent both cases.
The required steps in the process of converting data into its graph representation (or creating it as a graph) are shown in figure 2.
Figure 2. Process of converting data into a graph representation
 Identify the data sources. Identify the data available for algorithm training purposes as well as the sources from which such data can be extracted. This corresponds to the second phase in a machine learning project, after defining the goals (the data preparation phase of the CRISPDM data model).
 Analyze the data available. Analyze each data source available and evaluate the content, in terms of quality and quantity. To achieve good results from the training phase, it’s imperative to have a large amount of goodquality data.
 Design the graph data model. This step is twofold. According to the specific analytics requirements, you must:

 Identify the meaningful information to be extracted from the data sources.
 Design a specific graph model, considering the data available, access patterns, and extensibility.
 Define the data flow. Design the architecture of the ingestion process (known as the ETL process) that extracts, transforms, and loads the data from the multiple sources in the graph database using the schema designed.
 Import the data into the graph. Start the ETL process defined in step 4. Generally, steps 3, 4, and 5 are iterative until you arrive at the right model and the right ETL process.
 Perform postimport tasks. Before starting the analysis, the data in the graph might require some preprocessing. These tasks include:

 Data cleaning—Remove or correct incomplete or incorrect data.
 Data enrichment—Extend the existing data sources with external sources of knowledge or with knowledge extracted from the data itself. The latter case falls under graph creation.
 Data merging—Because the data comes from multiple sources, related elements in the dataset can be merged in a single element or they can be connected through new relationships.
Steps 5 and 6 can be inverted or mixed; in some cases, the data may pass through a process of cleaning before the ingestion happens. In any case, at the end of those six steps the data is ready for the next phase, which involves the learning process.
The new representation of the data provides several advantages that we investigate here through the lens of multiple scenarios and multiple models. For each scenario presented, the context and purpose are described. These are key aspects to define the right model and to understand the value of the graph approach for storing and managing data and of graphs as input for the next steps in the analysis.
MONITOR A SUBJECT Suppose again that you’re a police officer. You want to track a suspect and predict their future movements using cellular tower data collected from the continuous monitoring signals every phone sends to (or receives from) all towers it can reach. Using graph models and graph clustering algorithms it’s possible to structure cellular tower data and represent a subject’s positions and movements in a simple and clear manner. A predictive model can then be created.</
The goal in this scenario is to monitor a subject and create a predictive model that identifies location clusters relevant to the subject’s life which is able to predict and anticipate subsequent movements according to the subject’s current position and last movements.^{[2]}
The data in this scenario is cellular tower data generated by the interaction between the subject’s phone and the cellular towers, as represented in figure 3.
Figure 3. Phones communicating with cellular towers
For the purpose of such as analysis, data can be collected from the towers or from the phones belonging to the monitored subjects. The data from the cellular towers is easy to obtain with the necessary permissions, but it requires a lot of cleaning (removing the irrelevant numbers) and merging (data from multiple cellular towers). Gathering data from the phones requires hacking, which isn’t always possible, but this data is clean and already merged. In their paper, Eagle, Quinn, and Clauset consider this second data source, and we do the same here, but the results and the considerations are the same regardless of the source of the data.
Let’s suppose that the phones will provide data in the format shown in table 1.
Table 1. Examples of the data provided by a single phone with the four towers (identified by their id) reached at each timestamp.
Phone identifier 
Timestamp 
Cellular tower 1 
Cellular tower 2 
Cellular tower 3 
Cellular tower 4 
562d6873b0fe 
1530713007 
eca5b35d 
f7106f86 
1d00f5fb 
665332d8 
562d6873b0fe 
1530716500 
f7106f86 
1d00f5fb 
2a434006 
eca5b35d 
562d6873b0fe 
1530799402 
f7106f86 
eca5b35d 
2a434006 
1d00f5fb 
562d6873b0fe 
1531317805 
1d00f5fb 
665332d8 
f7106f86 
eca5b35d 
562d6873b0fe 
1533391403 
2a434006 
665332d8 
eca5b35d 
1d00f5fb 
For the sake of simplicity, this table represents the data provided by a single phone (the phone identifier is always the same). Each phone records the four towers with the strongest signals at thirtysecond intervals.
The analysis requires us to identify locations in which the monitored subject spends time. The cellular tower data available on the phone can’t provide this information by itself because it contains the identifiers of the four towers with the highest signal strengths, but starting from such data it’s possible to identify key locations by passing through a graph representation of the data and a graph algorithm. This data can therefore be modeled as a graph that represents a cellular tower network (CTN). Each node in this graph is a unique cellular tower; edges exist between any two nodes that cooccur in the same record, and each edge is assigned a weight according to the total amount of time that pair of nodes cooccurred in a record, across all records. A CTN is generated for each subject that shows every tower logged by that individual’s phone during the monitoring period. The result looks like figure 4.
Figure 4. A graph representation of the CTN for a single subject
A graph clustering algorithm is then applied to this graph to identify the main locations where the subject spent a significant amount of time (for example, at the office, at home, at the supermarket, at church, and other locations). The result of the analysis looks like figure 5, in which multiple clusters are identified and isolated.
Figure 5. A clustered view of the CTN
This scenario shows how well adapted a graph model is to represent the data for the specific purposes of this analysis. By performing a graphbased analysis using the community detection algorithm, we can easily identify areas where the subject spends a lot of time—a task which is difficult, if not impossible, with other representations or analysis methods.
The graph modeling described here illustrates a graph construction technique. The resulting graph can be generalized as a cooccurrence graph. The nodes represent entities (in this case, cellular towers), and the relationships represent the fact that the two entities belong to a common set or group (in the CTN, the set is the row in the table that indicates that at a specific point in time the phone can reach both towers). This is a powerful technique used in many algorithms and machine learning applications as a data preprocessing step before performing the analysis. Quite often, this type of graph construction technique is used in applications related to text analysis.
DETECT A FRAUD Suppose again that you want to create a fraud detection platform for banks that reveals the point of origin of a credit card theft. A graph representation of the transactions can help you identify, even visually, the location of the theft.
In this scenario, the data available is the credit card transactions, with details about the date (timestamp), the amount, the merchant, and whether the transaction is “disputed” or “undisputed.” Once a person’s credit card details have been stolen, in the transaction history, real operations are mixed with illegal or fraudulent operations. The goal of the analysis is to identify the point where the fraud started—the shop where the theft occurred. The transactions at that shop are real, but any transactions that taken place afterward may be fraudulent.
The data available looks like table 2.
Table 2. A subset of user transactions
User identifier 
Timestamp 
Amount 
Merchant 
Validity 
User A 
01/02/2018 
$250 
Hilton Barcelona 
undisputed 
User A 
02/02/2018 
$220 
AT&T 
undisputed 
User A 
12/03/2018 
$15 
Burger King New York 
undisputed 
User A 
14/03/2018 
$100 
Whole Foods 
disputed 
User B 
12/04/2018 
$20 
AT&T 
undisputed 
User B 
13/04/2018 
$20 
Hard Rock Cafe 
undisputed 
User B 
14/04/2018 
$8 
Burger King 
undisputed 
User B 
20/04/2018 
$8 
Starbucks 
disputed 
User C 
03/05/2018 
$15 
Whole Foods 
disputed 
User C 
05/05/2018 
$15 
Burger King 
undisputed 
User C 
12/05/2018 
$15 
Starbucks 
disputed 
Our aim is to identify some common pattern that reveals at which point users start disputing their transactions, which helps us to locate the establishment where the card details were stolen. This analysis can be performed using a graph representation of the transactions. The data in table 2 can be modeled in a transaction graph as shown in figure 6.
Figure 6. A transaction graph for credit card fraud detection
By starting from this representation of the transactions and using graph queries, it’s possible to determine that the theft occurred at Burger King.
The graph of the transactions allows us to easily identify where the card thief operates. In this case the analysis is performed on the graph as it appears after the ETL phase; no other intermediate transformation is required. This data representation expresses the information in such a way that it’s possible to quickly recognize behavioral patterns in a long list of transactions and spot where the issue is.
Transaction graphs like the one shown here can represent any kind of event that involves two entities. Generally, they’re used for modeling monetary transactions in an unaggregated way, which means every single operation can be related to a specific portion of the graph. For the majority of cases, in the resulting graph each transaction is represented in one of the following two ways:
 As a directed edge between the two entities involved in the transaction. For example, if User A makes a purchase at Shop B, this event is translated into a directed edge that starts with User A and terminates at Shop B. In this case, all the relevant details about the purchase, such as the date and amount, are stored as properties of the edge (figure 6 (a)).
 As a node that contains all the relevant information about the event and it’s connected via edges to the related nodes. In the case of the purchase, the transaction itself is modeled as a node and it’s then connected to the “source” and “destination” of the purchase (figure 7 (b)).
Figure 7. Transaction modeling examples
The first approach is generally used when the amount of information related to the event is small or when a simpler model is preferable for the purpose of the analysis. The second approach is generally preferred when the event itself contains valuable information that could be connected to other information items, or when the event involves more than two items.
Transaction graphs are quite common in fraud detection analysis and all machine learning projects in which each event contains relevant information that, if the events were aggregated, would be lost.
IDENTIFY RISKS IN A SUPPLY CHAIN Suppose that you need to implement a risk management system that identifies or predicts possible risks in a supply chain. A supply chain network (SCN) is a common way to represent supply chain elements and their interactions in a graph. Such a representation, together with proper graph analysis algorithms, makes it easy and fast to spot issues throughout the chain.
This scenario has become more and more relevant in recent years, for multiple reasons. For example:
 With the development of the global economy, which means any supply chain can have a global dimension, the problems that supply chain management faces are becoming more complex.
 Customers located at the end of the chain are becoming more interested in the origins of the products they buy.
Managing the disruption risk and making the chain more transparent are currently mandatory tasks in any supply chain. Supply chains are inherently fragile and face a variety of threats, from natural disasters and attacks to the contamination of raw products, delivery delays, and labor shortages.^{[3]} Furthermore, because the different parts of the chain are complex and interrelated, the normal operation of one member—and the efficient operation of the chain as a whole—often relies on the normal operation of other components. The members within a supply chain include suppliers, manufacturers, distributors, and customers. They’re all dependent on one another and cooperate with each other through material flows, information flows, and financial flows, but they’re also independent entities operating on their own, and perhaps providing the same services to multiple companies. Therefore, the data available in such a scenario is distributed across multiple companies which have different structures. Any kind of analysis based on data in this shape is a complex task; gathering the required information from the multiple members and organizing it requires a lot of effort.
The purpose of the analysis here is to spot elements in the chain that, if compromised, can disrupt the entire network (or a large part of it), or significantly affect the normal behavior. A graph model can support such an analysis task through different network analysis algorithms.
A supply chain can be represented by a graph using the following approach:
 Each member of the supply chain is represented by a node. The members could be raw product suppliers (primary suppliers), secondary suppliers, intermediate distributors, transformation processes, organizational units in a big company, and final retailers. The granularity of the graph is related to the risk evaluation required.
 Each relation in the graph represents a dependency between two members of the chain. The relationships could include transport from a supplier to an intermediate distributor, a dependency between two processing steps, and the delivery to the final retailer.
 To each node it’s possible to relate some “temporal” data that could store historic as well as forecasting information.
The network structure might evolve over time as well. The graph model can be designed to keep track of the changes, but that makes it too complicated for the purpose of this example.
Our example graph model is shown in figure 8.
Figure 8. A supply chain network.
This model represents an important method for gathering data and organizing it in an organic and homogeneous way, and it provides a suitable representation for the type of analysis that risk management requires.
RECOMMEND ITEMS Suppose that you want to recommend items to users in an ecommerce shop, using the data from previous interactions (clicks, purchases, ratings). Graphs can help you store the user–item dataset in a way that speeds up access to it, and storing the predictive models in the graph not only facilitates the predictions but also allows you to merge multiple models smoothly.
One of the most common use cases for graphs in machine learning is recommendations. I wrote the first recommendation engine ever built on top of Neo4j in 2012. This is how my career with graphs started, and it’s why this specific topic is close to my heart. Let’s start with a simple example by considering the most basic implementation.
It’s possible to use multiple approaches to provide recommendations. In this specific example the approach selected is based on a technique called collaborative filtering. The main idea of collaborative approaches to recommendations is to exploit information about the past behavior or opinions of an existing user community to predict which items the current user of the system most probably like or are interested in.^{[4]} Pure collaborative approaches take a matrix of given user–item interactions of any type—views, past purchases, and ratings—as input and produce the following types of output:
 A (numerical) prediction indicating the likelihood that the current user likes or dislikes a certain item (the relevance score)
 An ordered list of n recommended items based on the value predicted
The relevance is measured with a utility function f which is estimated based on the user feedback.^{[5]} More formally, the relevance function can be defined as:
where User is the set of all users and Item is the set of all items. This function can then be used to compute the relevance scores for all the elements for which no information is available. The data the predictions are based on can be either directly provided by the users (through ratings, and likes/dislikes) or implicitly collected by observing the users’ actions (page clicks, and purchases). The type of information available determines the types of techniques that can be used to build the recommendations. A contentbased approach is possible if information about the users (profile attributes, preferences) and items (intrinsic properties) can be drawn upon. If only implicit feedback is available, a collaborative filtering approach is required.
After predicting relevance scores for all the unseen (or unbought) items, we can rank them and show the top n items to the user, performing the recommendation.
As usual, we start our discussion from the data available. The data source in this case looks like table 3 (one means a low rating and a five means the user has a great opinion of the item).
Table 3. An example of a User–Item dataset represented using a matrix
User 
Item 1 
Item 2 
Item 3 
Item 4 
Item 5 
Bob 
– 
3 
– 
4 
? 
User 2 
3 
5 
– 
– 
5 
User 3 
– 
– 
4 
4 
– 
User 4 
2 
– 
4 
– 
3 
User 5 
– 
3 
– 
5 
4 
User 6 
– 
– 
5 
4 
– 
User 7 
5 
4 
– 
– 
5 
User 8 
– 
– 
3 
4 
5 
This table is a classic user–item matrix that contains the interactions (in this case the ratings) between the users and the items. The cells with the symbol ““ mean that the user hasn’t bought or rated that item. In an ecommerce scenario like the one we’re considering, there could be a large number of users and items, and the resulting table could be quite sparse—each user buys only a small subset of the available items, and the resulting matrix has a lot of empty cells. In our example table, the unseen or unbought element that we want to predict interest in is item number five for the user Bob.
Starting from the data available (in the shape described) and from the basic idea of collaborative filtering, multiple ways of implementing this prediction exist. For the purpose of this scenario, in this part of the article we consider the itembased algorithms. The main idea of itembased algorithms is to compute predictions using the similarity between items. Therefore, we consider the table column by column, with each column describing a vector of elements (called the rating vector) where the ““ symbol is replaced with a zero value. Let’s examine our User–Item dataset and make a prediction for Bob for item number five. We first compare all the rating vectors of the other items and look for items that are similar to item number five. The idea of itembased recommendation is now to look at Bob’s ratings for these similar items. The itembased algorithm computes a weighted average of these other ratings and uses this to predict a rating for item number five for the user Bob.
To compute the similarity between items, a similarity measure must be defined. Cosine similarity is the standard metric in itembased recommendation approaches: it determines the similarity between two vectors by calculating the cosine of the angle between them. In machine learning applications, this measure is often used to compare two text documents, which are represented as vectors of terms; we use it frequently in this article.
The formula to compute the cosine of the angle between the two vectors, and therefore the similarity between two items a and b, is as follows:
The · symbol indicates the dot product of the two vectors. is the Euclidian length of the vector, which is defined as the square root of the dot product of the vector with itself.
Figure 9 shows a representation of cosine distance in twodimensional space.
Figure 9. Cosine distance representation in twodimensional space.
To further explain the formula, let’s consider the cosine similarity of item number five, described by the rating vector [0, 5, 0, 3, 4, 0, 5, 5], and item one, described by the vector [0, 3, 0, 2, 0, 0, 5, 0]. It’s calculated as follows:
The numerator is the dot product between the two vectors, computed from the sum of the products of the corresponding entries of the two sequences of numbers. The denominator is the product of the Euclidian lengths of the two vectors. The Euclidian distance is the distance between two points in the multidimensional space of the vectors. Figure 10 illustrates the concept in a twodimensional space.
Figure 10. Euclidian distance in twodimensional space
The formula is as follows:
The Euclidian length is the Euclidian distance of the vector from the origin of the space (the vector [0,0,0,0,0,0,0,0] in our case):
The similarity values range between zero and one, with one indicating the strongest similarity. Consider now that we need to compute this similarity between each pair of items in the database, and if we’ve 1M products we need to compute 1M * 1M similarity values. We can reduce this number by half because similarity is commutative—cos(a, b) = cos(b, a)—but it’s still a lot of computations. In this case a graph can be a valuable helper to speed up the machine learning algorithm for the recommendations.
The User–Item dataset described previously can be converted easily into a graph like the one in figure 11.
Figure 11. Bipartite graph representing the User–Item dataset.
In this graph representation, all the users are on the left and all the items are on the right. The relationships go only from nodes in one subset to nodes in the other subset; no relationships occur between nodes of the same set. This is an example of a bipartite graph, or bigraph.
More formally, a bigraph is a special type of graph whose vertices (nodes) can be divided into two disjoint and independent sets U and V such that every edge connects a vertex in U to one in V, or vice versa. Vertex sets U and V are usually called the parts of the graph.^{[6]}
How can a bipartite graph representation reduce the number of similarity computations we have to perform? To understand this, it’s necessary to understand cosine similarity a little better (although the principle can be extended to a wider set of similarity functions). The cosine similarity metric measures the angle between two ndimensional vectors, and the two vectors have a cosine similarity equal to zero when they’re orthogonal (perpendicular). In the context of our example, this happens when there are no overlapping users between two items (users that rate both the items). In such cases, the numerator of the fraction is zero. For example, we can compute the distance between item number two, described by the vector [3, 5, 0, 0, 3, 0, 4, 0], and item number three, described by the vector [0, 0, 4, 4, 0, 5, 0, 3], as follows:
In this case the similarity value is zero (see figure 12). In a sparse User–Item dataset the probability of orthogonality is high, and the number of useless computations is correspondingly high. Using the graph representation makes it easy, with a simple query, to find all the items that have at least one rating user in common. The similarity can then be computed only between the current item and the overlapping items, greatly reducing the number of computations required. In a native graph engine, the query for searching overlapping items is fast.
Figure 12. Two nonoverlapping items: these items have no rating users in common, and the cosine similarity between them is zero.
Another approach is to separate the bipartite graph into clusters and compute the distance only between the items belonging to the same cluster. In the second part other techniques are presented for representing such a graph in a way that simplifies the identification of areas in which it’s easier to find similar items or users.
In this scenario, the graph model helps to improve performance by reducing the amount of time required to compute the similarities between the items, and therefore the recommendations. It’s possible to store the results of similarity computations to perform fast recommendations. Furthermore, cosine similarity is used as a technique for graph construction.
That’s all for this article. If you want to learn more about the book, you can check it out on our browserbased liveBook platform here.
^{[1]} Machine Learning in Complex Networks, Zhao and Silva, 2016
^{[2]} International Conference on Pervasive Computing: Methodologies for Continuous Cellular Tower Data Analysis, Eagle, Quinn, and Clauset, 2009
^{[3]} Managing Disruption Risk in Supply Chains, Kleindorfer and Saad, 2005
^{[4]} Recommender Systems: and Introduction, Jannach et al., 2010
^{[5]} Tensor Methods and Recommender Systems, Frolov and Oseledets, 2016
^{[6]} Graph Theory, Diestel, 2017