Spaces:

DerwenAI
/

textgraphs

Running

App Files Files Community

textgraphs / docs /ex0_0.md

Paco Nathan

A new start

91eaff6 5 months ago

preview code

raw

history blame contribute delete

No virus

18.5 kB



	!!! note
	To run this notebook in JupyterLab, load [`examples/ex0_0.ipynb`](https://github.com/DerwenAI/textgraphs/blob/main/examples/ex0_0.ipynb)



	# demo: TextGraphs + LLMs to construct a 'lemma graph'

	_TextGraphs_ library is intended for iterating through a sequence of paragraphs.

	## environment


	```python
	from IPython.display import display, HTML, Image, SVG
	import pathlib
	import typing

	from icecream import ic
	from pyinstrument import Profiler
	import matplotlib.pyplot as plt
	import pandas as pd
	import pyvis
	import spacy

	import textgraphs
	```


	```python
	%load_ext watermark
	```


	```python
	%watermark
	```

	Last updated: 2024-01-16T17:41:51.229985-08:00

	Python implementation: CPython
	Python version : 3.10.11
	IPython version : 8.20.0

	Compiler : Clang 13.0.0 (clang-1300.0.29.30)
	OS : Darwin
	Release : 21.6.0
	Machine : x86_64
	Processor : i386
	CPU cores : 8
	Architecture: 64bit




	```python
	%watermark --iversions
	```

	sys : 3.10.11 (v3.10.11:7d4cc5aa85, Apr 4 2023, 19:05:19) [Clang 13.0.0 (clang-1300.0.29.30)]
	spacy : 3.7.2
	pandas : 2.1.4
	matplotlib: 3.8.2
	textgraphs: 0.5.0
	pyvis : 0.3.2



	## parse a document

	provide the source text


	```python
	SRC_TEXT: str = """
	Werner Herzog is a remarkable filmmaker and an intellectual originally from Germany, the son of Dietrich Herzog.
	After the war, Werner fled to America to become famous.
	"""
	```

	set up the statistical stack profiling


	```python
	profiler: Profiler = Profiler()
	profiler.start()
	```

	set up the `TextGraphs` pipeline


	```python
	tg: textgraphs.TextGraphs = textgraphs.TextGraphs(
	factory = textgraphs.PipelineFactory(
	spacy_model = textgraphs.SPACY_MODEL,
	ner = None,
	kg = textgraphs.KGWikiMedia(
	spotlight_api = textgraphs.DBPEDIA_SPOTLIGHT_API,
	dbpedia_search_api = textgraphs.DBPEDIA_SEARCH_API,
	dbpedia_sparql_api = textgraphs.DBPEDIA_SPARQL_API,
	wikidata_api = textgraphs.WIKIDATA_API,
	min_alias = textgraphs.DBPEDIA_MIN_ALIAS,
	min_similarity = textgraphs.DBPEDIA_MIN_SIM,
	),
	infer_rels = [
	textgraphs.InferRel_OpenNRE(
	model = textgraphs.OPENNRE_MODEL,
	max_skip = textgraphs.MAX_SKIP,
	min_prob = textgraphs.OPENNRE_MIN_PROB,
	),
	textgraphs.InferRel_Rebel(
	lang = "en_XX",
	mrebel_model = textgraphs.MREBEL_MODEL,
	),
	],
	),
	)

	pipe: textgraphs.Pipeline = tg.create_pipeline(
	SRC_TEXT.strip(),
	)
	```

	## visualize the parse results


	```python
	spacy.displacy.render(
	pipe.ner_doc,
	style = "ent",
	jupyter = True,
	)
	```


	<span class="tex2jax_ignore"><div class="entities" style="line-height: 2.5; direction: ltr">
	<mark class="entity" style="background: #aa9cfc; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em;">
	Werner Herzog
	<span style="font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; vertical-align: middle; margin-left: 0.5rem">PERSON</span>
	</mark>
	is a remarkable filmmaker and an intellectual originally from
	<mark class="entity" style="background: #feca74; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em;">
	Germany
	<span style="font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; vertical-align: middle; margin-left: 0.5rem">GPE</span>
	</mark>
	, the son of
	<mark class="entity" style="background: #aa9cfc; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em;">
	Dietrich Herzog
	<span style="font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; vertical-align: middle; margin-left: 0.5rem">PERSON</span>
	</mark>
	.<br>After the war,
	<mark class="entity" style="background: #aa9cfc; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em;">
	Werner
	<span style="font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; vertical-align: middle; margin-left: 0.5rem">PERSON</span>
	</mark>
	fled to
	<mark class="entity" style="background: #feca74; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em;">
	America
	<span style="font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; vertical-align: middle; margin-left: 0.5rem">GPE</span>
	</mark>
	to become famous.</div></span>



	```python
	parse_svg: str = spacy.displacy.render(
	pipe.ner_doc,
	style = "dep",
	jupyter = False,
	)

	display(SVG(parse_svg))
	```



	![svg](ex0_0_files/ex0_0_17_0.svg)



	## collect graph elements from the parse


	```python
	tg.collect_graph_elements(
	pipe,
	debug = False,
	)
	```


	```python
	ic(len(tg.nodes.values()));
	ic(len(tg.edges.values()));
	```

	ic\| len(tg.nodes.values()): 36
	ic\| len(tg.edges.values()): 42


	## perform entity linking


	```python
	tg.perform_entity_linking(
	pipe,
	debug = False,
	)
	```

	## infer relations


	```python
	inferred_edges: list = await tg.infer_relations_async(
	pipe,
	debug = False,
	)

	inferred_edges
	```




	[Edge(src_node=0, dst_node=10, kind=<RelEnum.INF: 2>, rel='https://schema.org/nationality', prob=1.0, count=1),
	Edge(src_node=15, dst_node=0, kind=<RelEnum.INF: 2>, rel='https://schema.org/children', prob=1.0, count=1),
	Edge(src_node=27, dst_node=22, kind=<RelEnum.INF: 2>, rel='https://schema.org/event', prob=1.0, count=1)]



	## construct a lemma graph


	```python
	tg.construct_lemma_graph(
	debug = False,
	)
	```

	## extract ranked entities


	```python
	tg.calc_phrase_ranks(
	pr_alpha = textgraphs.PAGERANK_ALPHA,
	debug = False,
	)
	```

	show the resulting entities extracted from the document


	```python
	df: pd.DataFrame = tg.get_phrases_as_df()
	df
	```




	<div>
	<style scoped>
	.dataframe tbody tr th:only-of-type {
	vertical-align: middle;
	}

	.dataframe tbody tr th {
	vertical-align: top;
	}

	.dataframe thead th {
	text-align: right;
	}
	</style>
	<table border="1" class="dataframe">
	<thead>
	<tr style="text-align: right;">
	<th></th>
	<th>node_id</th>
	<th>text</th>
	<th>pos</th>
	<th>label</th>
	<th>count</th>
	<th>weight</th>
	</tr>
	</thead>
	<tbody>
	<tr>
	<th>0</th>
	<td>0</td>
	<td>Werner Herzog</td>
	<td>PROPN</td>
	<td>dbr:Werner_Herzog</td>
	<td>1</td>
	<td>0.080547</td>
	</tr>
	<tr>
	<th>1</th>
	<td>10</td>
	<td>Germany</td>
	<td>PROPN</td>
	<td>dbr:Germany</td>
	<td>1</td>
	<td>0.080437</td>
	</tr>
	<tr>
	<th>2</th>
	<td>15</td>
	<td>Dietrich Herzog</td>
	<td>PROPN</td>
	<td>dbo:Person</td>
	<td>1</td>
	<td>0.079048</td>
	</tr>
	<tr>
	<th>3</th>
	<td>27</td>
	<td>America</td>
	<td>PROPN</td>
	<td>dbr:United_States</td>
	<td>1</td>
	<td>0.079048</td>
	</tr>
	<tr>
	<th>4</th>
	<td>24</td>
	<td>Werner</td>
	<td>PROPN</td>
	<td>dbo:Person</td>
	<td>1</td>
	<td>0.077633</td>
	</tr>
	<tr>
	<th>5</th>
	<td>4</td>
	<td>filmmaker</td>
	<td>NOUN</td>
	<td>owl:Thing</td>
	<td>1</td>
	<td>0.076309</td>
	</tr>
	<tr>
	<th>6</th>
	<td>22</td>
	<td>war</td>
	<td>NOUN</td>
	<td>owl:Thing</td>
	<td>1</td>
	<td>0.076309</td>
	</tr>
	<tr>
	<th>7</th>
	<td>32</td>
	<td>a remarkable filmmaker</td>
	<td>noun_chunk</td>
	<td>None</td>
	<td>1</td>
	<td>0.076077</td>
	</tr>
	<tr>
	<th>8</th>
	<td>7</td>
	<td>intellectual</td>
	<td>NOUN</td>
	<td>owl:Thing</td>
	<td>1</td>
	<td>0.074725</td>
	</tr>
	<tr>
	<th>9</th>
	<td>13</td>
	<td>son</td>
	<td>NOUN</td>
	<td>owl:Thing</td>
	<td>1</td>
	<td>0.074725</td>
	</tr>
	<tr>
	<th>10</th>
	<td>33</td>
	<td>an intellectual</td>
	<td>noun_chunk</td>
	<td>None</td>
	<td>1</td>
	<td>0.074606</td>
	</tr>
	<tr>
	<th>11</th>
	<td>34</td>
	<td>the son</td>
	<td>noun_chunk</td>
	<td>None</td>
	<td>1</td>
	<td>0.074606</td>
	</tr>
	<tr>
	<th>12</th>
	<td>35</td>
	<td>the war</td>
	<td>noun_chunk</td>
	<td>None</td>
	<td>1</td>
	<td>0.074606</td>
	</tr>
	</tbody>
	</table>
	</div>



	## visualize the lemma graph


	```python
	render: textgraphs.RenderPyVis = tg.create_render()

	pv_graph: pyvis.network.Network = render.render_lemma_graph(
	debug = False,
	)
	```

	initialize the layout parameters


	```python
	pv_graph.force_atlas_2based(
	gravity = -38,
	central_gravity = 0.01,
	spring_length = 231,
	spring_strength = 0.7,
	damping = 0.8,
	overlap = 0,
	)

	pv_graph.show_buttons(filter_ = [ "physics" ])
	pv_graph.toggle_physics(True)
	```


	```python
	pv_graph.prep_notebook()
	pv_graph.show("tmp.fig01.html")
	```

	tmp.fig01.html






	![png](ex0_0_files/tmp.fig01.png)




	## generate a word cloud


	```python
	wordcloud = render.generate_wordcloud()
	display(wordcloud.to_image())
	```



	![png](ex0_0_files/ex0_0_37_0.png)



	## cluster communities in the lemma graph

	In the tutorial
	<a href="https://towardsdatascience.com/how-to-convert-any-text-into-a-graph-of-concepts-110844f22a1a" target="_blank">"How to Convert Any Text Into a Graph of Concepts"</a>,
	Rahul Nayak uses the
	<a href="https://en.wikipedia.org/wiki/Girvan%E2%80%93Newman_algorithm"><em>girvan-newman</em></a>
	algorithm to split the graph into communities, then clusters on those communities.
	His approach works well for unsupervised clustering of key phrases which have been extracted from many documents.
	In contrast, Nayak was working with entities extracted from "chunks" of text, not with a text graph.


	```python
	render.draw_communities();
	```



	![png](ex0_0_files/ex0_0_40_0.png)



	## graph of relations transform

	Show a transformed graph, based on _graph of relations_ (see: `lee2023ingram`)


	```python
	graph: textgraphs.GraphOfRelations = textgraphs.GraphOfRelations(
	tg
	)

	graph.seeds()
	graph.construct_gor()
	```


	```python
	scores: typing.Dict[ tuple, float ] = graph.get_affinity_scores()
	pv_graph: pyvis.network.Network = graph.render_gor_pyvis(scores)

	pv_graph.force_atlas_2based(
	gravity = -38,
	central_gravity = 0.01,
	spring_length = 231,
	spring_strength = 0.7,
	damping = 0.8,
	overlap = 0,
	)

	pv_graph.show_buttons(filter_ = [ "physics" ])
	pv_graph.toggle_physics(True)

	pv_graph.prep_notebook()
	pv_graph.show("tmp.fig02.html")
	```

	tmp.fig02.html






	![png](ex0_0_files/tmp.fig02.png)




	What does this transform provide?

	By using a _graph of relations_ dual representation of our graph data, first and foremost we obtain a more compact representation of the relations in the graph, and means of making inferences (e.g., _link prediction_) where there is substantially more invariance in the training data.

	Also recognize that for a parse graph of a paragraph in the English language, the most interesting nodes will probably be either subjects (`nsubj`) or direct objects (`pobj`). Here in the _graph of relations_ we see illustrated how the important details from _entity linking_ tend to cluster near either `nsubj` or `pobj` entities, connected through punctuation. This is not as readily observed in the earlier visualization of the _lemma graph_.

	## extract as RDF triples

	Extract the nodes and edges which have IRIs, to create an "abstraction layer" as a semantic graph at a higher level of detail above the _lemma graph_:


	```python
	triples: str = tg.export_rdf()
	print(triples)
	```

	@base <https://github.com/DerwenAI/textgraphs/ns/> .
	@prefix dbo: <http://dbpedia.org/ontology/> .
	@prefix dbr: <http://dbpedia.org/resource/> .
	@prefix schema: <https://schema.org/> .
	@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
	@prefix wd_ent: <http://www.wikidata.org/entity/> .

	dbr:Germany skos:definition "Germany (German: Deutschland, German pronunciation: [ˈdɔʏtʃlant]), constitutionally the Federal"@en ;
	skos:prefLabel "Germany"@en .

	dbr:United_States skos:definition "The United States of America (USA), commonly known as the United States (U.S. or US) or America"@en ;
	skos:prefLabel "United States"@en .

	dbr:Werner_Herzog skos:definition "Werner Herzog (German: [ˈvɛɐ̯nɐ ˈhɛɐ̯tsoːk]; born 5 September 1942) is a German film director"@en ;
	skos:prefLabel "Werner Herzog"@en .

	wd_ent:Q183 skos:definition "country in Central Europe"@en ;
	skos:prefLabel "Germany"@en .

	wd_ent:Q44131 skos:definition "German film director, producer, screenwriter, actor and opera director"@en ;
	skos:prefLabel "Werner Herzog"@en .

	<entity/america_PROPN> a dbo:Country ;
	skos:prefLabel "America"@en ;
	schema:event <entity/war_NOUN> .

	<entity/dietrich_PROPN_herzog_PROPN> a dbo:Person ;
	skos:prefLabel "Dietrich Herzog"@en ;
	schema:children <entity/werner_PROPN_herzog_PROPN> .

	<entity/filmmaker_NOUN> skos:prefLabel "filmmaker"@en .

	<entity/intellectual_NOUN> skos:prefLabel "intellectual"@en .

	<entity/son_NOUN> skos:prefLabel "son"@en .

	<entity/werner_PROPN> a dbo:Person ;
	skos:prefLabel "Werner"@en .

	<entity/germany_PROPN> a dbo:Country ;
	skos:prefLabel "Germany"@en .

	<entity/war_NOUN> skos:prefLabel "war"@en .

	<entity/werner_PROPN_herzog_PROPN> a dbo:Person ;
	skos:prefLabel "Werner Herzog"@en ;
	schema:nationality <entity/germany_PROPN> .

	dbo:Country skos:definition "Countries, cities, states"@en ;
	skos:prefLabel "country"@en .

	dbo:Person skos:definition "People, including fictional"@en ;
	skos:prefLabel "person"@en .




	## statistical stack profile instrumentation


	```python
	profiler.stop()
	```




	<pyinstrument.session.Session at 0x141446080>




	```python
	profiler.print()
	```


	_ ._ __/__ _ _ _ _ _/_ Recorded: 17:41:51 Samples: 11163
	/_//_/// /_\ / //_// / //_'/ // Duration: 57.137 CPU time: 72.235
	/ _/ v4.6.1

	Program: /Users/paco/src/textgraphs/venv/lib/python3.10/site-packages/ipykernel_launcher.py -f /Users/paco/Library/Jupyter/runtime/kernel-8ffadb7d-3b45-4e0e-a94f-f098e5ad9fbe.json

	57.136 _UnixSelectorEventLoop._run_once asyncio/base_events.py:1832
	└─ 57.135 Handle._run asyncio/events.py:78
	[12 frames hidden] asyncio, ipykernel, IPython
	41.912 ZMQInteractiveShell.run_ast_nodes IPython/core/interactiveshell.py:3394
	├─ 20.701 <module> ../ipykernel_5151/1245857438.py:1
	│ └─ 20.701 TextGraphs.perform_entity_linking textgraphs/doc.py:534
	│ └─ 20.701 KGWikiMedia.perform_entity_linking textgraphs/kg.py:306
	│ ├─ 10.790 KGWikiMedia._link_kg_search_entities textgraphs/kg.py:932
	│ │ └─ 10.787 KGWikiMedia.dbpedia_search_entity textgraphs/kg.py:641
	│ │ └─ 10.711 get requests/api.py:62
	│ │ [37 frames hidden] requests, urllib3, http, socket, ssl,...
	│ ├─ 9.143 KGWikiMedia._link_spotlight_entities textgraphs/kg.py:851
	│ │ └─ 9.140 KGWikiMedia.dbpedia_search_entity textgraphs/kg.py:641
	│ │ └─ 9.095 get requests/api.py:62
	│ │ [37 frames hidden] requests, urllib3, http, socket, ssl,...
	│ └─ 0.768 KGWikiMedia._secondary_entity_linking textgraphs/kg.py:1060
	│ └─ 0.768 KGWikiMedia.wikidata_search textgraphs/kg.py:575
	│ └─ 0.765 KGWikiMedia._wikidata_endpoint textgraphs/kg.py:444
	│ └─ 0.765 get requests/api.py:62
	│ [7 frames hidden] requests, urllib3
	└─ 19.514 <module> ../ipykernel_5151/1708547378.py:1
	├─ 14.502 InferRel_Rebel.__init__ textgraphs/rel.py:121
	│ └─ 14.338 pipeline transformers/pipelines/__init__.py:531
	│ [39 frames hidden] transformers, torch, <built-in>, json
	├─ 3.437 PipelineFactory.__init__ textgraphs/pipe.py:434
	│ └─ 3.420 load spacy/__init__.py:27
	│ [20 frames hidden] spacy, en_core_web_sm, catalogue, imp...
	├─ 0.900 InferRel_OpenNRE.__init__ textgraphs/rel.py:33
	│ └─ 0.888 get_model opennre/pretrain.py:126
	└─ 0.672 TextGraphs.create_pipeline textgraphs/doc.py:103
	└─ 0.672 PipelineFactory.create_pipeline textgraphs/pipe.py:508
	└─ 0.672 Pipeline.__init__ textgraphs/pipe.py:216
	└─ 0.672 English.__call__ spacy/language.py:1016
	[11 frames hidden] spacy, spacy_dbpedia_spotlight, reque...
	14.363 InferRel_Rebel.gen_triples_async textgraphs/pipe.py:188
	├─ 13.670 InferRel_Rebel.gen_triples textgraphs/rel.py:259
	│ ├─ 12.439 InferRel_Rebel.tokenize_sent textgraphs/rel.py:145
	│ │ └─ 12.436 TranslationPipeline.__call__ transformers/pipelines/text2text_generation.py:341
	│ │ [42 frames hidden] transformers, torch, <built-in>
	│ └─ 1.231 KGWikiMedia.resolve_rel_iri textgraphs/kg.py:370
	│ └─ 0.753 get_entity_dict_from_api qwikidata/linked_data_interface.py:21
	│ [8 frames hidden] qwikidata, requests, urllib3
	└─ 0.693 InferRel_OpenNRE.gen_triples textgraphs/rel.py:58




	## outro

	_\[ more parts are in progress, getting added to this demo \]_