---
title: ISBN Finder
emoji: 📚
colorFrom: indigo
colorTo: purple
sdk: streamlit
sdk_version: 1.10.0
app_file: app.py
pinned: false
license: mit
---
Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
This simple application searches Word files, PDFs, images, and other files for 13-digit ISBNs.
A colleague's tweet inspired this app at the end of the work day. I was bogged down in other projects, and the prospect of something shiny and new was very appealing. I was excited that someone I knew had a pattern-matching problem and I could help.
<blockquote class="twitter-tweet"><p lang="en" dir="ltr">No one tells you how much of collection development is hour after hour of copy-pasting ISBNs from word doc vendor catalogs. My clicking hand longs for structured data pls</p>&mdash; Brie Gettleson (@brie_marina) <a href="https://twitter.com/brie_marina/status/1551647829621444608?ref_src=twsrc%5Etfw">July 25, 2022</a></blockquote> <script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>
## What is pattern matching?
To your computer, the text is just a sequence of characters. It's one letter after another. Perhaps there's a space here and there. When you press control-f to search within a text, the computer finds a specific sequence of characters. If you want to find an exact ISBN, for example, 978-0-520-30443-7, you can search for it. The computer looks over the text character by character, looking for 9, followed by 7, followed by 8, and so on. It matches the sequence.
What if I want to match a pattern of characters, such as an ISBN? In that case, I want the computer to find three numbers followed by a hyphen, then one number, then a hyphen, and so on. ISBNs are a standard format, so the only thing that varies in the pattern is the numbers.
The traditional way of approaching this task is regular expressions.
```python
import re
regex = re.compile("^(?:ISBN(?:-1[03])?:? )?(?=[0-9X]{10}$|(?=(?:[0-9]+[- ]){3})[- 0-9X]{13}$|97[89][0-9]{10}$|(?=(?:[0-9]+[- ]){4})[- 0-9]{17}$)(?:97[89][- ]?)?[0-9]{1,5}[- ]?[0-9]+[- ]?[0-9]+[- ]?[0-9X]$")
regex.search('978-0-520-30443-7')
```
See [this chapter in the O'Reilly Regular Expressions Cookbook](https://www.oreilly.com/library/view/regular-expressions-cookbook/9781449327453/ch04s13.html) for more.
I'm also a big fan of [Regex101](https://regex101.com/) as a way to make sense of regular expressions.
All that said, I still find regex difficult to use. I spent about thirty minutes looking for a regex solution to the problem, but nothing I found did the job out of the box. I'd need to adapt the expressions, and that's not how I wanted to spend the last few minutes of my workday. Other people love regex, so they'd do things differently.
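To make the "out of the box" problem concrete, here is a small sketch. The expression above is anchored with `^` and `$`, so it validates a string that is only an ISBN rather than finding ISBNs inside a longer passage. The catalog line is invented for illustration, and this may not be the only adaptation the expression would need.

```python
import re

# Same expression as above; the ^ and $ anchors require the whole string to be an ISBN
regex = re.compile("^(?:ISBN(?:-1[03])?:? )?(?=[0-9X]{10}$|(?=(?:[0-9]+[- ]){3})[- 0-9X]{13}$|97[89][0-9]{10}$|(?=(?:[0-9]+[- ]){4})[- 0-9]{17}$)(?:97[89][- ]?)?[0-9]{1,5}[- ]?[0-9]+[- ]?[0-9]+[- ]?[0-9X]$")

# A string that contains nothing but the ISBN matches
standalone = regex.search('978-0-520-30443-7')
print(standalone is not None)  # True

# Hypothetical catalog line: the same ISBN inside a sentence is not found
catalog_line = 'New title, ISBN 978-0-520-30443-7, paperback'
embedded = regex.search(catalog_line)
print(embedded is not None)  # False
```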
Given that this was all supposed to be fun, I turned to spaCy's matcher. There's an [excellent demo app](https://explosion.ai/demos/matcher) that shows how it works. The main difference between matcher and regex is that spaCy approaches the text as human language. It's not just a sequence of characters but words with grammar and syntax. As a human, I find this more intuitive. spaCy splits the text into word tokens, and each token has part of speech, tense, and other linguistic attributes. spaCy's tokenizer also splits on punctuation, so that:
```
this: 978-0-520-30443-7
becomes: 978 - 0 - 520 - 30443 - 7
```
The pattern that I want to find is a number followed by a hyphen, then a number, hyphen, number, hyphen, number, hyphen, and then a number. I create a list to capture this idea in Python for the matcher. Each item in the list is a rule for a word token. The first rule, `{'IS_DIGIT': True}`, asks, "is this a number?" Matcher then looks at the next token and asks, "is it a hyphen?" (`{'ORTH': '-'}`). If it is, we continue through the remaining rules. If every token in the sequence fits our pattern, then we have a match.
```python
pattern = [{'IS_DIGIT': True},
{'ORTH': '-'},
{'IS_DIGIT': True},
{'ORTH': '-'},
{'IS_DIGIT': True},
{'ORTH': '-'},
{'IS_DIGIT': True},
{'ORTH': '-'},
{'IS_DIGIT': True}]
```
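To show the idea behind the rule list without any dependencies, here is a plain-Python sketch. It is not how spaCy implements the matcher. The `tokenize` and `find_matches` helpers and the sample sentence are hypothetical stand-ins: each rule is a function that checks one token, and a match is a window of tokens where every rule holds.

```python
import re

# Each rule inspects one token, loosely mirroring
# {'IS_DIGIT': True} and {'ORTH': '-'} in the matcher pattern.
def is_digit(token):
    return token.isdigit()

def is_hyphen(token):
    return token == '-'

PATTERN = [is_digit, is_hyphen, is_digit, is_hyphen, is_digit,
           is_hyphen, is_digit, is_hyphen, is_digit]

def tokenize(text):
    # Crude stand-in for spaCy's tokenizer:
    # runs of word characters, or single punctuation marks
    return re.findall(r"\w+|[^\w\s]", text)

def find_matches(tokens, pattern):
    """Return (start, end) token offsets where every rule in the pattern holds."""
    matches = []
    for start in range(len(tokens) - len(pattern) + 1):
        window = tokens[start:start + len(pattern)]
        if all(rule(tok) for rule, tok in zip(pattern, window)):
            matches.append((start, start + len(pattern)))
    return matches

tokens = tokenize("Order 978-0-520-30443-7 today")
for start, end in find_matches(tokens, PATTERN):
    print("".join(tokens[start:end]))  # 978-0-520-30443-7
```

The `(start, end)` offsets play the same role as the `start` and `end` values that spaCy's matcher returns for each match.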
There are lots of different ways that we could articulate the pattern. ORTH calls for an exact match with the token's text.
If I wanted to ignore the text's case, for example, I could use
`{'LOWER': '-'}`. Or I might ask if it's punctuation `{'IS_PUNCT': True}`.
This particular approach relies on the four hyphens to find an ISBN. If there's a typo, or a publisher doesn't use the hyphens, then we won't get a match. In the future, we may want to handle these exceptions.
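One possible direction for those exceptions, sketched here with the standard library rather than spaCy, is to accept 13 digits with optional hyphens or spaces and then verify the ISBN-13 check digit. This is not part of the app. The names `CANDIDATE`, `is_valid_isbn13`, and `find_isbn13s` and the sample text are made up for the example.

```python
import re

# Candidate: 978/979 prefix plus 10 more digits, each optionally preceded by a hyphen or space
CANDIDATE = re.compile(r"\b97[89](?:[- ]?[0-9]){10}\b")

def is_valid_isbn13(digits):
    """ISBN-13 check: weights alternate 1, 3 and the total must be a multiple of 10."""
    if not re.fullmatch(r"[0-9]{13}", digits):
        return False
    total = sum(int(d) * (3 if i % 2 else 1) for i, d in enumerate(digits))
    return total % 10 == 0

def find_isbn13s(text):
    """Return normalized (digits-only) ISBN-13s whose check digit is correct."""
    found = []
    for m in CANDIDATE.finditer(text):
        digits = re.sub(r"[- ]", "", m.group())
        if is_valid_isbn13(digits):
            found.append(digits)
    return found

sample = "Hyphenated 978-0-520-30443-7, bare 9780520304437, typo 9780520304438"
print(find_isbn13s(sample))  # ['9780520304437', '9780520304437']
```

The check digit catches many single-digit typos, such as the last number in the sample, but it cannot recover what the intended ISBN was.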
For now, matcher gives very good results when the ISBN is properly formatted. Here's my code:
```python
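import spacy
from spacy.matcher import Matcher

# A blank multi-language pipeline; only the tokenizer is needed here
nlp = spacy.blank('xx')
matcher = Matcher(nlp.vocab)
pattern = [{'IS_DIGIT': True},
           {'ORTH': '-'},
           {'IS_DIGIT': True},
           {'ORTH': '-'},
           {'IS_DIGIT': True},
           {'ORTH': '-'},
           {'IS_DIGIT': True},
           {'ORTH': '-'},
           {'IS_DIGIT': True}]
matcher.add("ISBN", [pattern])

# Example input; in the app, `text` is the text extracted from each file
text = "ISBN 978-0-520-30443-7"
doc = nlp(text)
matches = matcher(doc)
for match_id, start, end in matches:
    match = doc[start:end]  # the span of tokens that matched
    print(match.text)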
```
Now how to empower my colleague to use this approach? I needed to create a web app that takes multiple files in various formats and returns the matches.
Streamlit is a very convenient Python library for making demonstration applications. I added the file upload widget and the download button, using the [Streamlit documentation](https://docs.streamlit.io/).
```python
import streamlit as st
st.title('Find 13-digit ISBN Numbers')
uploaded_files = st.file_uploader("Select files to process", accept_multiple_files=True)
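# The third positional argument is the file name, so pass the MIME type by keyword.
# `isbn` is the string of matches built later; 'isbns.txt' is just an example name.
st.download_button('Download', isbn, file_name='isbns.txt', mime='text/plain')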
```
Now, I need something to extract the text from the uploaded files. My friend currently has Word files, but I also wanted to support PDF and other formats. [textract](https://textract.readthedocs.io/) is my go-to library for this work. It'll extract text from 23 different types of files, including images, audio, and standard word processing formats. I didn't want the user to have to worry about file formats, so this adds a vital feature to the app.
The code below processes the uploaded files. First, it saves the file to disk, then runs `textract.process(file_path)` to extract the text. I then use spaCy matcher to find ISBNs in the extracted text. The results are saved to a plain text string and exported as a text file.
```python
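from pathlib import Path

import streamlit as st
import textract

# `nlp` and `matcher` are the spaCy objects created earlier
isbn = ""
uploaded_files = st.file_uploader("Select files to process", accept_multiple_files=True)
for uploaded_file in uploaded_files:
    file_type = uploaded_file.type
    # TODO Just read bytes and extract text without saving to disk
    # textract works from a file path, so the upload is written to disk first
    Path(uploaded_file.name).write_bytes(uploaded_file.read())
    text = textract.process(uploaded_file.name)  # returns bytes
    text = text.decode('utf-8')
    doc = nlp(text)
    matches = matcher(doc)
    st.write(f'Found {len(matches)} ISBN numbers')
    for match_id, start, end in matches:
        isbn += f"{doc[start:end]}\n"
# 'isbns.txt' is an example file name; the MIME type is passed by keyword
st.download_button('Download', isbn, file_name='isbns.txt', mime='text/plain')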
```
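The TODO above isn't resolved here, since `textract.process` takes a file path. A smaller step, sketched below, is to write each upload to a uniquely named temporary file instead of the working directory, so two uploads with the same name can't overwrite each other. `save_upload_to_temp` is a hypothetical helper, not part of the app.

```python
import tempfile
from pathlib import Path


def save_upload_to_temp(name, data):
    """Write uploaded bytes to a uniquely named temp file, keeping the extension."""
    suffix = Path(name).suffix  # e.g. ".docx"; textract chooses its parser from the extension
    with tempfile.NamedTemporaryFile(delete=False, suffix=suffix) as tmp:
        tmp.write(data)
    return Path(tmp.name)


# Hypothetical usage inside the upload loop:
#   tmp_path = save_upload_to_temp(uploaded_file.name, uploaded_file.read())
#   try:
#       text = textract.process(str(tmp_path)).decode('utf-8')
#   finally:
#       tmp_path.unlink()  # remove the temp file when done
```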
## Deployment
With a working solution to the problem and a functional demo application, I now only needed to put the app on the Web. There are many ways to do this. I've used Heroku in the past, but it requires some configuration files and the app falls asleep when not used. I find the delay in loading the app frustrating. I'd never tried [HuggingFace Spaces](https://huggingface.co/spaces), so novelty and learning won the day.
To deploy my Streamlit app on Spaces, I just went to the Spaces page and clicked on ["Create new Space"](https://huggingface.co/new-space) and selected Streamlit. I already had an account. If you don't have one, you'll need to create one. I followed the instructions to clone the space on my machine using git and added my `app.py` file. I also created a `requirements.txt`. When HuggingFace builds the app, it will read that file and `pip install` anything listed there. In this case, I want it to install spacy and textract. Textract relies on some operating system-level dependencies, so I told HuggingFace to install those in a `packages.txt` file. HF will `apt install` each of the libraries listed in the file during the build.
The result is a deployed demo application that finds ISBNs in files. I'll talk with my colleagues to see if the app serves their needs and if there are future use cases that could be accounted for. But my first effort was intentionally minimal to keep the scope manageable.
Oh, one last note. I'm a big fan of adding fun touches to a project. I learned this from the spaCy developers. So to add some personality, I found an open-license image using [WikiView](https://wikiview.net/) and added it to the app.
You can see and use the end result here:
https://huggingface.co/spaces/ajanco/isbn-finder