Chapter 13. Building Knowledge Bases

This application is about organizing information and making it easy for humans and computers alike to access. Such a system is known as a knowledge base. The popularity of knowledge bases in the field of NLP has waned in recent decades as the focus has moved away from “expert systems” toward statistical machine learning approaches.

An expert system is a system that attempts to use knowledge to make decisions. This knowledge consists of entities, relationships between entities, and rules. Generally, expert systems had inference engines that allowed the software to use the knowledge base to make a decision. They are sometimes described as collections of if-then rules, but in practice these systems were much more complicated than that. The knowledge bases and rule sets could be quite large for the technology of the time, so the inference engines needed to evaluate many logical statements efficiently.

Generally, an expert system has a number of actions it can take, and rules that determine which action it should take. When the time to take an action comes, the system has a collection of statements, and it must use these to identify the best action. For example, let’s say we have an expert system for controlling the temperature in a house. We need to be able to make decisions based on temperature and time. Whenever the system decides whether to toggle the heater or the air conditioner, or to do nothing, it must combine the current temperature (or perhaps a collection of temperature measurements) and the current time with the rule set to determine what action to take. This system has a small number of entities: the temperatures and the time. Imagine a system with thousands of entities, multiple kinds of relationships, and a growing rule set. Resolving the statements available at decision time in a knowledge base this large can be expensive.
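To make this concrete, here is a minimal sketch of what such a rule-based decision could look like. The thresholds, night-time schedule, and action names are hypothetical, made up for illustration only.

# A minimal, hypothetical rule set for the thermostat example.
# The thresholds and the night-time schedule are made up for illustration.

def decide_action(temperature_f, hour_of_day):
    night = hour_of_day >= 22 or hour_of_day < 6
    target_low, target_high = (62, 68) if night else (68, 74)

    # Each rule maps a condition on the current "statements"
    # (temperature, time) to an action.
    if temperature_f < target_low:
        return 'HEAT'
    if temperature_f > target_high:
        return 'COOL'
    return 'DO_NOTHING'

print(decide_action(60, 14))  # HEAT
print(decide_action(71, 23))  # COOL

A real expert system differs mainly in scale: many more statements, many more rules, and an inference engine that decides which rules apply.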

In this chapter we will be building a knowledge base. We want a tool for building a knowledge base from a wiki, and a tool for querying it. For now, this system should fit on a single machine. We also want to be able to update the knowledge base with new kinds of entities and relationships. Such a system could be used by a domain expert exploring a topic, or it could be integrated with an expert system. This means that it should have a human-usable interface and a responsive API.

Our fictional scenario is a company that is building a machine learning platform and primarily sells to other businesses. The sales engineers sometimes fall out of sync with the current state of the system. The engineers are diligent about updating the wiki where appropriate, but the sales engineers still have a hard time keeping up to date, so they create help tickets asking engineers to help them update sales demos. The engineers do not like this. This application will be used to create a knowledge base that makes it easier for the sales engineers to see what may have changed.

Problem statement & Constraints

  1. What is the problem we are trying to solve?

     We want to take a wiki and produce a knowledge base. There should also be a way for humans and other software to query the knowledge base. We can assume that the knowledge base will fit on a single machine.

  2. What constraints are there?

     • The knowledge base builder should be easily updatable. It should be easy to configure new types of relationships.
     • The storage solution should allow us to easily add new entities and relationships.
     • Answering queries will require less than 50 GB of disk space and less than 16 GB of memory.
     • There should be a query for getting related entities. For example, at the end of a wiki article there are often links to related pages. The “get related” query should return these entities.
     • The “get related” query should take less than 500 ms.

  3. How do we solve the problem with the constraints?

     • The knowledge base builder can be a script that takes a wiki dump and processes the XML and the text. This is where we can use Spark NLP in a larger Spark pipeline.
     • Our building script should monitor resources, to warn if we are nearing the prescribed limits.
     • We will need a database to store our knowledge base. There are many options; we will use Neo4j, a graph database. Neo4j is relatively well known, and although other solutions are possible, graph databases inherently structure data in the way we need.
     • Another benefit of Neo4j is that it comes with a GUI for humans to query, and a REST API for programmatic queries.

Plan the project

Let’s define the acceptance criteria.

We want a script that

  • takes a wiki dump, generally a compressed XML file
  • extracts entities; each article title is an entity, and other entities are possible
  • extracts relationships; links between articles are relationships, and other relationships are possible
  • stores the entities and relationships in Neo4j
  • warns if we are producing too much data

We want a service that

  • allows a “get related” query for a given entity; the results must include at least the articles linked in that entity’s article
  • performs the “get related” query in under 500 ms
  • has a human-usable frontend
  • has a REST API
  • requires less than 16 GB of memory to run

This is somewhat similar to the application in the previous chapter. We have a script that will build a “model”, but now we also want a way to serve the “model”. An additional, and important, difference is that the knowledge base does not come with a simple score (e.g., an F1 score). This means that we will have to put more thought into metrics.

Design the solution

First, we will need to start up Neo4j. Go to neo4j.com for installation instructions. Once you have it installed and running, you should be able to go to localhost:7474 for the UI.

Since we are using an off-the-shelf solution, we will not go very deep into graph databases. Here are the important facts.

Graph databases are built to store data as nodes and edges between nodes. A “node” in this context is usually some kind of entity, and an edge is some kind of relationship. There can be different types of nodes and different types of relationships. Outside of a database, graph data can easily be stored in CSVs. There will be a CSV for each type of node, with an ID column, some sort of name, and properties, depending on the type. Edge CSVs are similar, except that the row for an edge will also contain the IDs of the two nodes the edge connects. We will not be storing properties.
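For example, a node CSV and an edge CSV could be built like the sketch below. The column names mirror the CSVs we create later in this chapter; the rows and file names are made-up examples.

import pandas as pd

# Hypothetical example of the node and edge CSV layout described above.
# Nodes: an ID and a name.
nodes = pd.DataFrame(
    [(0, 'Paper'), (1, 'Cardboard'), (2, 'Paper size')],
    columns=['id', 'entity'])
nodes.to_csv('example-entities.csv', index=False, header=True)

# Edges: the IDs of the two nodes each edge connects.
edges = pd.DataFrame(
    [(0, 1), (0, 2)],
    columns=['id1', 'id2'])
edges.to_csv('example-related.csv', index=False, header=True)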

Since we don’t have access to a company’s internal wiki, we will be using an actual wikipedia dump. But rather than getting the full English language dump, which would be enormous, we will use the Simple English wiki dump.

Simple English is a subset of the English language. It uses about 1,500 words, not counting proper nouns and some technical terms. This is useful for us because it helps simplify the code we need to write. If this were a real company wiki, there would likely need to be a few iterations of data cleaning. You can find a dump of the Simple English Wikipedia at https://dumps.wikimedia.org/simplewiki/.
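If you want to fetch the dump programmatically, a download could look like the sketch below. The date (20191020, matching the file loaded later in this chapter) is an assumption; older dumps are periodically removed, so you may need to pick a more recent one from the dumps page.

import urllib.request

# Hypothetical download of the Simple English dump used in this chapter.
# Adjust the date if this particular dump is no longer available.
dump_date = '20191020'
filename = f'simplewiki-{dump_date}-pages-articles-multistream.xml.bz2'
url = f'https://dumps.wikimedia.org/simplewiki/{dump_date}/{filename}'

urllib.request.urlretrieve(url, filename)
print('downloaded', filename)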

Here is our plan.

  1. Get the data
  2. Explore the data
  3. Parse the wiki for entities and relationships
  4. Save the entities and relationships in CSVs
  5. Load the CSVs into Neo4J

Implement the solution

First, let’s load the data. Most wiki dumps are available as bzip2-compressed XML files. Fortunately, Spark can handle this kind of data without trouble.

import json
import re
import pandas as pd

from pyspark.ml import Pipeline
from pyspark.sql import SparkSession, Row
from pyspark.sql.functions import lit, col

import sparknlp
from sparknlp import DocumentAssembler, Finisher
from sparknlp.annotator import *

# Packages for Spark NLP and for parsing XML
packages = [
    'JohnSnowLabs:spark-nlp:2.2.2',
    'com.databricks:spark-xml_2.11:0.6.0'
]

spark = SparkSession.builder \
    .master("local[*]") \
    .appName("Knowledge Graph") \
    .config("spark.driver.memory", "12g") \
    .config("spark.jars.packages", ','.join(packages)) \
    .getOrCreate()

sparknlp.version()
2.2.2
spark
SparkSession - in-memory
  Version: v2.4.3
  Master: local[*]
  AppName: Knowledge Graph
We need to give Spark some hints for parsing the XML. We need to configure the rootTag, the name of the element that contains all of our “rows”, and the rowTag, the name of the elements that represent the rows themselves.

df = spark.read \
    .format('xml') \
    .option("rootTag", "mediawiki") \
    .option("rowTag", "page") \
    .load("simplewiki-20191020-pages-articles-multistream.xml.bz2") \
    .persist()

Now, let’s see what the schema looks like.

df.printSchema()
root
 |-- id: long (nullable = true)
 |-- ns: long (nullable = true)
 |-- redirect: struct (nullable = true)
 |    |-- _VALUE: string (nullable = true)
 |    |-- _title: string (nullable = true)
 |-- restrictions: string (nullable = true)
 |-- revision: struct (nullable = true)
 |    |-- comment: struct (nullable = true)
 |    |    |-- _VALUE: string (nullable = true)
 |    |    |-- _deleted: string (nullable = true)
 |    |-- contributor: struct (nullable = true)
 |    |    |-- _VALUE: string (nullable = true)
 |    |    |-- _deleted: string (nullable = true)
 |    |    |-- id: long (nullable = true)
 |    |    |-- ip: string (nullable = true)
 |    |    |-- username: string (nullable = true)
 |    |-- format: string (nullable = true)
 |    |-- id: long (nullable = true)
 |    |-- minor: string (nullable = true)
 |    |-- model: string (nullable = true)
 |    |-- parentid: long (nullable = true)
 |    |-- sha1: string (nullable = true)
 |    |-- text: struct (nullable = true)
 |    |    |-- _VALUE: string (nullable = true)
 |    |    |-- _space: string (nullable = true)
 |    |-- timestamp: string (nullable = true)
 |-- title: string (nullable = true)

That is somewhat complicated, so we should try to simplify. Let’s see how many documents we have.

df.count()
284812

Let’s look at the page for “Paper”.

row = df.filter('title = "Paper"').first()

print('ID', row['id'])
print('Title', row['title'])
print()
print('redirect', row['redirect'])
print()
print('text')
print(row['revision']['text']['_VALUE'])
ID 3319
Title Paper

redirect None

text
[[File:...
[[File:...
[[File:...
[[File:...
[[File:...
[[File:...

Modern '''paper''' is a thin [[material]] of (mostly) 
[[wood fibre]]s pressed together. People write on paper, and 
[[book]]s are made of paper. Paper can absorb [[liquid]]s such as 
[[water]], so people can clean things with paper.

...

==Related pages==
* [[Paper size]]
* [[Cardboard]]

== References ==
{{Reflist}}

[[Category:Basic English 850 words]]
[[Category:Paper| ]]
[[Category:Writing tools]]

It looks like the text is stored in revision.text._VALUE. There seem to be a few special kinds of entries, namely categories and redirects. In most wikis, pages are organized into categories, and pages are often in multiple categories. These categories have their own pages that link back to the articles. Redirects are pointers from an alternate name for an article to the actual entry.

Let’s look at some categories.

df.filter('title RLIKE "Category.*"').select('title') \
    .show(10, False, True)
-RECORD 0--------------------------
 title | Category:Computer science 
-RECORD 1--------------------------
 title | Category:Sports           
-RECORD 2--------------------------
 title | Category:Athletics        
-RECORD 3--------------------------
 title | Category:Body parts       
-RECORD 4--------------------------
 title | Category:Tools            
-RECORD 5--------------------------
 title | Category:Movies           
-RECORD 6--------------------------
 title | Category:Grammar          
-RECORD 7--------------------------
 title | Category:Mathematics      
-RECORD 8--------------------------
 title | Category:Alphabet         
-RECORD 9--------------------------
 title | Category:Countries        
only showing top 10 rows

Now let’s look at a redirect. It looks like the redirect target, where the redirect points, is stored under redirect._title.

df.filter('redirect IS NOT NULL') \
    .select('redirect._title', 'title') \
    .show(1, False, True)
-RECORD 0-------------
 _title | Catharism   
 title  | Albigensian 
only showing top 1 row

This essentially gives us a synonymy relationship. So our entities will be the titles of articles, and our relationships will be redirects and the links in the related-pages section of each article. First, let’s get our entities.

entities = df.select('title').collect()
entities = [r['title'] for r in entities]
entities = set(entities)
print(len(entities))
284812

We may want to introduce a same-category relationship later, so we extract the categories too.

categories = [e for e in entities if e.startswith('Category:')]
entities = [e for e in entities if not e.startswith('Category:')]

Now, let’s get the redirects.

redirects = df.filter('redirect IS NOT NULL') \
    .select('redirect._title', 'title').collect()
redirects = [(r['_title'], r['title']) for r in redirects]
print(len(redirects))
63941

Now we can get the articles.

data = df.filter('redirect IS NULL').selectExpr(
    'revision.text._VALUE AS text',
    'title'
).filter('text IS NOT NULL')

In order to get the related links, we need to know what section we are in, so let’s split the texts into sections. We can then use the RegexMatcher annotator to identify links. Looking at the data, section headers have the form ==Section Title==. Let’s define a regex for this, allowing for extra whitespace.

section_ptn = re.compile(r'^ *==[^=]+ *== *$')

Now, we will define a function that will take a partition of the data, and generate new rows for the sections. We will need to keep track of the article title, the section, and the text of the section.

def sectionize(rows):
    for row in rows:
        title = row['title']
        text = row['text']
        lines = text.split('\n')
        buffer = []
        section = 'START'
        for line in lines:
            if section_ptn.match(line):
                # emit the section we just finished
                yield (title, section, '\n'.join(buffer))
                section = line.strip('=').strip().upper()
                buffer = []
                continue
            buffer.append(line)
        # emit the final section of the article
        yield (title, section, '\n'.join(buffer))

Now we will call mapPartitions to create a new RDD, and convert that to a DataFrame.

sections = data.rdd.mapPartitions(sectionize)
sections = spark.createDataFrame(sections, 
    ['title', 'section', 'text'])

Let’s look at the most common sections.

sections.select('section').groupBy('section') \
    .count().orderBy(col('count').desc()).take(10)
[Row(section='START', count=115586),
 Row(section='REFERENCES', count=32993),
 Row(section='RELATED PAGES', count=8603),
 Row(section='HISTORY', count=6227),
 Row(section='CLUB CAREER STATISTICS', count=3897),
 Row(section='INTERNATIONAL CAREER STATISTICS', count=2493),
 Row(section='GEOGRAPHY', count=2188),
 Row(section='EARLY LIFE', count=1935),
 Row(section='CAREER', count=1726),
 Row(section='NOTES', count=1724)]

Plainly, START is the most common, since it captures the text between the start of the article and the first section header, so almost all articles have it. Because this is Wikipedia, REFERENCES is the next most common. It looks like RELATED PAGES occurs in only 8603 articles. Now we will use Spark NLP to extract all the links from the texts.

%%writefile wiki_regexes.csv
[[[^]]+]]~link
{{[^}]+}}~anchor
Overwriting wiki_regexes.csv

assembler = DocumentAssembler() \
    .setInputCol('text') \
    .setOutputCol('document')
matcher = RegexMatcher() \
    .setInputCols(['document']) \
    .setOutputCol('matches') \
    .setStrategy("MATCH_ALL") \
    .setExternalRules('wiki_regexes.csv', '~')
finisher = Finisher() \
    .setInputCols(['matches']) \
    .setOutputCols(['links'])

pipeline = Pipeline() \
    .setStages([assembler, matcher, finisher]) \
    .fit(sections)
extracted = pipeline.transform(sections)

We could define a relationship based on links occurring anywhere in an article. For now, we will stick to the links in the related-pages section only.

links = extracted.select('title', 'section','links').collect()
links = [(r['title'], r['section'], link) for r in links for link in r['links']]
links = list(set(links))
print(len(links))
4012895
related = [(l[0], l[2]) for l in links if l[1] == 'RELATED PAGES']
related = [(e1, e2.strip('[').strip(']').split('|')[-1]) for e1, e2 in related]
related = list(set([(e1, e2) for e1, e2 in related]))
print(len(related))
20726

Now, we have extracted our entities, redirects and related links. Let’s create CSVs for them.

# Write the entity nodes, assigning each entity an integer ID
entities_df = pd.Series(entities, name='entity').to_frame()
entities_df.index.name = 'id'
entities_df.to_csv('wiki-entities.csv', index=True, header=True)

# Map entity names back to their IDs
e2id = entities_df.reset_index().set_index('entity')['id'].to_dict()

# Write the redirect edges
redirect_df = []
for e1, e2 in redirects:
    if e1 in e2id and e2 in e2id:
        redirect_df.append((e2id[e1], e2id[e2]))
redirect_df = pd.DataFrame(redirect_df, columns=['id1', 'id2'])
redirect_df.to_csv('wiki-redirects.csv', index=False, header=True)

# Write the related edges
related_df = []
for e1, e2 in related:
    if e1 in e2id and e2 in e2id:
        related_df.append((e2id[e1], e2id[e2]))
related_df = pd.DataFrame(related_df, columns=['id1', 'id2'])
related_df.to_csv('wiki-related.csv', index=False, header=True)
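One of our constraints was that the build script should warn if we are producing too much data. A simple check on the sizes of the generated CSVs could look like the following sketch; the 50 GB threshold comes from the constraints above, and the warning mechanism is just one possible choice.

import os
import warnings

# Warn if the generated CSVs are approaching the 50 GB disk budget.
DISK_BUDGET_GB = 50
csv_files = ['wiki-entities.csv', 'wiki-redirects.csv', 'wiki-related.csv']

total_gb = sum(os.path.getsize(f) for f in csv_files) / (1024 ** 3)
if total_gb > 0.8 * DISK_BUDGET_GB:
    warnings.warn(
        f'CSV output is {total_gb:.1f} GB, '
        f'nearing the {DISK_BUDGET_GB} GB budget')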

Now that we have our CSVs, we can copy them to /var/lib/neo4j/import/ and import them with the following Cypher statements.

  • Load entities

    LOAD CSV WITH HEADERS FROM "file:///wiki-entities.csv" AS csvLine
    CREATE (e:Entity {id: toInteger(csvLine.id), entity: csvLine.entity})
  • Load “REDIRECTED” relationship

    USING PERIODIC COMMIT 500
    LOAD CSV WITH HEADERS FROM "file:///wiki-redirects.csv" AS csvLine
    MATCH (entity1:Entity {id: toInteger(csvLine.id1)}),(entity2:Entity {id: toInteger(csvLine.id2)})
    CREATE (entity1)-[:REDIRECTED {conxn: "redirected"}]->(entity2)
  • Load “RELATED” relationship

    USING PERIODIC COMMIT 500
    LOAD CSV WITH HEADERS FROM "file:///wiki-related.csv" AS csvLine
    MATCH (entity1:Entity {id: toInteger(csvLine.id1)}),(entity2:Entity {id: toInteger(csvLine.id2)})
    CREATE (entity1)-[:RELATED {conxn: "related"}]->(entity2)
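  • Optionally, create an index on the id property. Without an index, the MATCH lookups above and the queries below must scan every Entity node. This statement uses Neo4j 3.x syntax; newer Neo4j versions use a different CREATE INDEX form.

    CREATE INDEX ON :Entity(id)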

Let’s see what we can query. We will get all entities related to “Language”, along with entities related to those entities (i.e., second-order relations).

import requests
query = '''
MATCH (e:Entity {entity: 'Language'})
RETURN e
UNION ALL
MATCH (:Entity {entity: 'Language'})--(e:Entity)
RETURN e
UNION ALL
MATCH (:Entity {entity: 'Language'})--(e1:Entity)--(e:Entity)
RETURN e
'''
payload = {'query': query, 'params': {}}
endpoint = 'http://localhost:7474/db/data/cypher'

response = requests.post(endpoint, json=payload)
response.status_code
200
related = json.loads(response.content)
related = [entity[0]['data']['entity'] 
           for entity in related['data']]
related = sorted(related)
related
1989 in movies
Alphabet
Alphabet (computer science)
Alphabet (computer science)
American English
...
Template:Jctint/core
Testing English as a foreign language
Vowel
Wikipedia:How to write Simple English pages
Writing

Test & Measure the solution

We now have an initial implementation, so let’s go through metrics.

Business metrics

This will depend on the specific use case of the application. If this knowledge base is used for organizing a company’s internal information, then we can look at usage rates. This is not a great metric, since it does not tell us that the system is actually helping the business, only that it is getting used. Let’s consider a hypothetical scenario.

Using our example, the sales engineer can query for a feature they want to demo and get related features. Hopefully, this will decrease the number of help tickets. This is a business-level metric we can monitor.

If we implement this system and do not see sufficient change in the business metrics, we still need metrics to help us understand whether the problem is with the basic idea of the application or with the quality of the knowledge base.

Model-centric metrics

Measuring the quality of a collection is not as straightforward as measuring a classifier. Let’s consider what intuitions we have about what should be in the knowledge base, and turn these intuitions into metrics.

  • Sparsity versus density: if too many entities have no relationship to any other entity, the usefulness of the knowledge base decreases; similarly, relationships that are ubiquitous cost resources while providing little benefit. We can monitor:
    • the average number of relationships per entity
    • the proportion of entities with no relationships
    • the ratio of the number of edges of a relationship type to the number of edges in a fully connected graph
  • The entities and relationships that people actually query are the ones we must focus on. Similarly, relationships that are almost never used may be superfluous. Once the system is deployed and queries are logged, we can monitor:
    • the number of queries where an entity was not found
    • the number of relationships that are not queried in a given time period (day, week, month)

A benefit of having an intermediate step that outputs CSVs is that we don’t need to do a large extraction from the database; we can calculate these graph metrics directly from the CSV data, as sketched below.
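Here is a minimal sketch of how the sparsity and density metrics listed above could be computed from the CSVs we wrote earlier, using pandas; treat it as an illustration rather than a full metrics job.

import pandas as pd

# Load the node and edge CSVs produced by the build script
entities_df = pd.read_csv('wiki-entities.csv')
related_df = pd.read_csv('wiki-related.csv')

n_entities = len(entities_df)
n_edges = len(related_df)

# Average number of RELATED relationships per entity
avg_rels = 2 * n_edges / n_entities

# Proportion of entities with no RELATED relationship
connected = set(related_df['id1']) | set(related_df['id2'])
prop_isolated = 1 - len(connected) / n_entities

# Ratio of actual edges to the edges of a fully connected graph
density = n_edges / (n_entities * (n_entities - 1) / 2)

print(f'average relationships per entity: {avg_rels:.3f}')
print(f'proportion with no relationships: {prop_isolated:.3f}')
print(f'density: {density:.6f}')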

Now that we have some idea of how to measure the quality of the knowledge base, let’s talk about measuring the infrastructure.

Infrastructure metrics

We will want to make sure that our single-server approach is sufficient. For a small to medium-sized platform company, this should be fine. If the company is large, or if the application were intended for much broader use, we would want to consider replication: that is, having multiple servers, each with a copy of the database, with users directed to them through a load balancer.

In the Neo4j browser, you can look at system information by running :sysinfo. This will give you information about the amount of data being stored.

For an application like this, you would want to monitor the response time of queries, as well as the update time when adding new entities or relationships.
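A simple way to spot-check the 500 ms constraint is to time the “get related” query against the REST endpoint we used above. This is just a rough client-side measurement, and the entity name is an arbitrary example.

import time
import requests

endpoint = 'http://localhost:7474/db/data/cypher'
query = '''
MATCH (:Entity {entity: 'Language'})--(e:Entity)
RETURN e
'''
payload = {'query': query, 'params': {}}

# Measure round-trip time for one get-related query
start = time.time()
response = requests.post(endpoint, json=payload)
elapsed_ms = (time.time() - start) * 1000

print(response.status_code, f'{elapsed_ms:.0f} ms')
assert elapsed_ms < 500, 'get-related query exceeded the 500 ms budget'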

Process metrics

On top of the generic process metrics, for this project you want to monitor how long it takes for someone to update the graph. There are a few ways this graph is likely to be updated.

  • Periodic updates to capture wiki updates.
  • Adding a new type of relationship
  • Adding properties to entities or relationships

The first of these is the most important to monitor. The whole point of this application is to keep sales engineers up to date, so the data itself has to stay up to date; ideally, this process should be monitored. The latter two are important to monitor because the hope of this project is to decrease the workload on developers and data scientists. We don’t want to replace the work needed to support sales efforts with effort spent maintaining this application.
