Benjamin Bossan committed
Commit 01ae0bb
1 Parent(s): 64d4f97

Use transformers agents where applicable
.gitignore CHANGED
@@ -10,3 +10,6 @@ build
 htmlcov

 *.db
+notebooks/
+*.ipynb
+.env
README.md CHANGED
@@ -19,18 +19,21 @@ python -m pip install -e .

 ## Starting

+### Preparing environment
+
+Set an environment variable called "HF_HUB_TOKEN" with your Hugging Face token
+or create a `.env` file with that env var.
+
 In one terminal, start the background worker:

 ```sh
-cd src
-python worker.py
+python src/gistillery/worker.py
 ```

 In another terminal, start the web server:

 ```sh
-cd src
-uvicorn webservice:app --reload --port 8080
+uvicorn src.gistillery.webservice:app --reload --port 8080
 ```

 For example requests, check `requests.org`.
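The new "Preparing environment" section above can be satisfied in either of two ways; a minimal sketch (the token value below is a placeholder, not a real token):

```sh
# Option 1: export the variable in the shell that will run the worker
export HF_HUB_TOKEN="hf_your_token_here"

# Option 2: put it in a .env file at the project root
echo 'HF_HUB_TOKEN=hf_your_token_here' > .env
```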
pyproject.toml CHANGED
@@ -18,5 +18,5 @@ no_implicit_optional = true
 strict = true

 [[tool.mypy.overrides]]
-module = "transformers,trafilatura"
+module = "huggingface_hub,trafilatura,transformers.*"
 ignore_missing_imports = true
requests.org CHANGED
@@ -10,19 +10,18 @@ curl -X 'GET' \
 : OK

 #+begin_src bash
-# curl command to localhost and post the message "hi there"
 curl -X 'POST' \
   'http://localhost:8080/submit/' \
   -H 'accept: application/json' \
   -H 'Content-Type: application/json' \
   -d '{
   "author": "ben",
-  "content": "SAN FRANCISCO, May 2, 2023 PRNewswire -- GitLab Inc., the most comprehensive, scalable enterprise DevSecOps platform for software innovation, and Google Cloud today announced an extension of its strategic partnership to deliver secure AI offerings to the enterprise. GitLab is trusted by more than 50% of the Fortune 100 to secure and protect their most valuable assets, and leads with a privacy-first approach to AI. By leveraging Google Cloud'\''s customizable foundation models and open generative AI infrastructure, GitLab will provide customers with AI-assisted features directly within the enterprise DevSecOps platform."
+  "content": "In literature discussing why ChatGPT is able to capture so much of our imagination, I often come across two narratives: Scale: throwing more data and compute at it. UX: moving from a prompt interface to a more natural chat interface. A narrative that is often glossed over in the demo frenzy is the incredible technical creativity that went into making models like ChatGPT work. One such cool idea is RLHF (Reinforcement Learning from Human Feedback): incorporating reinforcement learning and human feedback into NLP. RL has been notoriously difficult to work with, and therefore, mostly confined to gaming and simulated environments like Atari or MuJoCo. Just five years ago, both RL and NLP were progressing pretty much orthogonally different stacks, different techniques, and different experimentation setups. It’s impressive to see it work in a new domain at a massive scale. So, how exactly does RLHF work? Why does it work? This post will discuss the answers to those questions."
 }'
 #+end_src

 #+RESULTS:
-: Submitted job 04deee1a2a9b4d6ea986ffe0fa4017d9
+: Submitted job fef72c3aa4394bc7a299291c80a5c06b

 #+begin_src bash
 curl -X 'POST' \
@@ -31,12 +30,12 @@ curl -X 'POST' \
   -H 'Content-Type: application/json' \
   -d '{
   "author": "ben",
-  "content": "In literature discussing why ChatGPT is able to capture so much of our imagination, I often come across two narratives: Scale: throwing more data and compute at it. UX: moving from a prompt interface to a more natural chat interface. A narrative that is often glossed over in the demo frenzy is the incredible technical creativity that went into making models like ChatGPT work. One such cool idea is RLHF (Reinforcement Learning from Human Feedback): incorporating reinforcement learning and human feedback into NLP. RL has been notoriously difficult to work with, and therefore, mostly confined to gaming and simulated environments like Atari or MuJoCo. Just five years ago, both RL and NLP were progressing pretty much orthogonally – different stacks, different techniques, and different experimentation setups. It’s impressive to see it work in a new domain at a massive scale. So, how exactly does RLHF work? Why does it work? This post will discuss the answers to those questions."
+  "content": "https://en.wikipedia.org/wiki/Goulburn_Street"
 }'
 #+end_src

 #+RESULTS:
-: Submitted job 730352e00e8145b39971fdc386c28a8f
+: Submitted job f37729bb36104ab4a23cefd0480e4862

 #+begin_src bash
 curl -X 'POST' \
@@ -45,21 +44,21 @@ curl -X 'POST' \
   -H 'Content-Type: application/json' \
   -d '{
   "author": "ben",
-  "content": "https://en.wikipedia.org/wiki/Goulburn_Street"
+  "content": "https://upload.wikimedia.org/wikipedia/commons/thumb/e/e1/Cattle_tyrant_%28Machetornis_rixosa%29_on_Capybara.jpg/1920px-Cattle_tyrant_%28Machetornis_rixosa%29_on_Capybara.jpg"
 }'
 #+end_src

 #+RESULTS:
-: Submitted job 1738d7daa96147198d80b93ea040863d
+: Submitted job dc3da7b1d5aa47c38dc6713952104f5f

 #+begin_src bash
 curl -X 'GET' \
-  'http://localhost:8080/check_job_status/1738d7daa96147198d80b93ea040863d' \
+  'http://localhost:8080/check_job_status/' \
   -H 'accept: application/json'
 #+end_src

 #+RESULTS:
-| {"id":"1738d7daa96147198d80b93ea040863d" | status:"pending" | last_updated:"2023-05-09T13:24:42"} |
+: Found 3 pending job(s): fef72c3aa4394bc7a299291c80a5c06b, f37729bb36104ab4a23cefd0480e4862, dc3da7b1d5aa47c38dc6713952104f5f

 #+begin_src bash
 curl -X 'GET' \
@@ -68,4 +67,13 @@ curl -X 'GET' \
 #+end_src

 #+RESULTS:
-| [{"id":"1738d7daa96147198d80b93ea040863d" | author:"ben" | summary:"Goulburn Street is a street in the central business district of Sydney | New South Wales | Australia. It runs from Darling Harbour and Chinatown in the west to Crown Street in the east at Darlinghurst and Surry Hills. The only car park operated by Sydney City Council within the CBD is at the corner of Goulburn and Elizabeth Streets. It was the first air rights car park in Australia | opening in 1963 over six tracks of the City Circle line.[3][4]" | tags:["#centralbusinessdistrict" | #darlinghurst | #general | #goulburnstreet | #surryhills | #sydney | #sydneymasoniccentre] | date:"2023-05-09T13:24:42"} | {"id":"730352e00e8145b39971fdc386c28a8f" | author:"ben" | summary:"A new approach to NLP that incorporates reinforcement learning and human feedback. How does it work? Why does it work? In this post | I’ll explain how it works. RLHF is a new approach to NLP that incorporates reinforcement learning and human feedback. It’s a new approach to NLP that incorporates reinforcement learning and human feedback. It’s a new approach to NLP that incorporates reinforcement learning and human feedback. It’s a new approach to NLP that incorporates reinforcement learning and human feedback. It’s a new approach to NLP that incorporates reinforcement learning and human feedback." | tags:["#" | #general | #rlhf] | date:"2023-05-09T13:24:38"} | {"id":"04deee1a2a9b4d6ea986ffe0fa4017d9" | author:"ben" | summary:"GitLab | the most comprehensive | scalable enterprise DevSecOps platform for software innovation | and Google Cloud today announced an extension of their strategic partnership to deliver secure AI offerings to the enterprise. By leveraging Google Cloud's customizable foundation models and open generative AI infrastructure | GitLab will provide customers with AI-assisted features directly within the enterprise DevSecOps platform. The company's AI capabilities are designed to help enterprises improve productivity and reduce costs." | tags:["#ai-assistedfeatures" | #enterprisedevsecopsplatform | #general | #gitlab | #googlecloud] | date:"2023-05-09T13:24:36"}] |
+| [{"id":"dc3da7b1d5aa47c38dc6713952104f5f" | author:"ben" | summary:"A small bird is perched on the back of a capy capy. It's looking for a place to nestle. It doesn't seem to be finding a suitable place for it | though | because it's not very big. The place is not very flat. " | tags:["#back" | #bird | #capy | #general | #perch | #perched] | date:"2023-05-11T13:16:48"} | {"id":"f37729bb36104ab4a23cefd0480e4862" | author:"ben" | summary:"Goulburn Street is a street in the central business district of Sydney in New South Wales | Australia. It runs from Darling Harbour and Chinatown in the west to Crown Street in the east at Darlinghurst and Surry Hills. It is the only car park operated by Sydney City Council within the CBD and was the first air rights car park in Australia." | tags:["#centralbusinessdistrict" | #darlinghurst | #general | #goulburnstreet | #surryhills | #sydney | #sydneymasoniccentre] | date:"2023-05-11T13:16:47"} | {"id":"fef72c3aa4394bc7a299291c80a5c06b" | author:"ben" | summary:"ChatGPT is able to capture our imagination because of its scale. RLHF (Reinforcement Learning from Human Feedback) is a new approach to NLP that incorporates reinforcement learning and human feedback into NLP. It's impressive to see it work in a new domain at a massive scale." | tags:["#" | #general | #rlhf] | date:"2023-05-11T13:16:45"}] |
+
+#+begin_src bash
+curl -X 'GET' \
+  'http://localhost:8080/recent/rlhf' \
+  -H 'accept: application/json'
+#+end_src
+
+#+RESULTS:
+| [{"id":"fef72c3aa4394bc7a299291c80a5c06b" | author:"ben" | summary:"ChatGPT is able to capture our imagination because of its scale. RLHF (Reinforcement Learning from Human Feedback) is a new approach to NLP that incorporates reinforcement learning and human feedback into NLP. It's impressive to see it work in a new domain at a massive scale." | tags:["#" | #general | #rlhf] | date:"2023-05-11T13:16:45"}] |
requirements-dev.txt CHANGED
@@ -4,3 +4,4 @@ mypy
 ruff
 pytest
 pytest-cov
+types-Pillow
requirements.txt CHANGED
@@ -2,6 +2,8 @@ fastapi
 httpx
 uvicorn[standard]
 torch
-transformers
+transformers>=4.29.0
+accelerate
 charset-normalizer
 trafilatura
+pillow
src/gistillery/config.py ADDED
@@ -0,0 +1,24 @@
+import os
+from pathlib import Path
+
+from pydantic import BaseSettings
+
+
+class Config(BaseSettings):
+    hf_hub_token: str = "missing"
+    hf_agent: str = "https://api-inference.huggingface.co/models/bigcode/starcoder"
+    db_file_name: Path = Path("sqlite-data.db")
+
+    class Config:
+        # load .env file by default, with provision to use other .env files if set
+        env_file = os.getenv('ENV_FILE', '.env')
+
+
+_config = None
+
+
+def get_config() -> Config:
+    global _config
+    if _config is None:
+        _config = Config()
+    return _config
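The lazy-singleton pattern used by `get_config` above can be sketched with the standard library alone, without the pydantic dependency. `SketchConfig` below is a hypothetical stand-in mirroring the fields of the real `Config`:

```python
import os
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class SketchConfig:
    # illustrative mirror of gistillery.config.Config's fields and defaults
    hf_hub_token: str = field(
        default_factory=lambda: os.getenv("HF_HUB_TOKEN", "missing")
    )
    db_file_name: Path = Path("sqlite-data.db")


_config = None


def get_config() -> SketchConfig:
    # build the config on first access only, then reuse the same object
    global _config
    if _config is None:
        _config = SketchConfig()
    return _config


# repeated calls return the identical instance
assert get_config() is get_config()
```

This keeps expensive or environment-dependent initialization out of import time, which is why the commit routes all config access through `get_config()`.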
src/gistillery/db.py CHANGED
@@ -1,15 +1,14 @@
 import logging
-import os
 import sqlite3
 from collections import namedtuple
 from contextlib import contextmanager
 from typing import Generator

+from gistillery.config import get_config
+
 logger = logging.getLogger(__name__)
 logger.setLevel(logging.DEBUG)

-db_file = os.getenv("DB_FILE_NAME", "sqlite-data.db")
-

 schema_entries = """
 CREATE TABLE entries
@@ -91,7 +90,7 @@ def _get_db_connection() -> sqlite3.Connection:
     global TABLES_CREATED

     # sqlite cannot deal with concurrent access, so we set a big timeout
-    conn = sqlite3.connect(db_file, timeout=30)
+    conn = sqlite3.connect(get_config().db_file_name, timeout=30)
     conn.row_factory = namedtuple_factory
     if TABLES_CREATED:
         return conn
src/gistillery/preprocessing.py CHANGED
@@ -1,16 +1,33 @@
 import abc
+import io
 import logging
 import re
+from typing import Optional

+import trafilatura
 from httpx import Client
-from trafilatura import extract
+
+from PIL import Image

 from gistillery.base import JobInput
+from gistillery.tools import get_agent
+

 logger = logging.getLogger(__name__)
 logger.setLevel(logging.DEBUG)


+RE_URL = re.compile(r"(https?://[^\s]+)")
+
+
+def get_url(text: str) -> str | None:
+    urls: list[str] = list(RE_URL.findall(text))
+    if len(urls) == 1:
+        url = urls[0]
+        return url
+    return None
+
+
 class Processor(abc.ABC):
     def get_name(self) -> str:
         return self.__class__.__name__
@@ -40,25 +57,55 @@ class RawTextProcessor(Processor):


 class DefaultUrlProcessor(Processor):
-    # uses trafilatura to extract text from html
     def __init__(self) -> None:
         self.client = Client()
-        self.regex = re.compile(r"(https?://[^\s]+)")
-        self.url = None
+        self.url: Optional[str] = None
         self.template = "{url}\n\n{content}"

     def match(self, input: JobInput) -> bool:
-        urls = list(self.regex.findall(input.content.strip()))
-        if len(urls) == 1:
-            self.url = urls[0]
-            return True
-        return False
+        url = get_url(input.content.strip())
+        if url is None:
+            return False
+
+        self.url = url
+        return True

     def process(self, input: JobInput) -> str:
         """Get content of website and return it as string"""
-        assert isinstance(self.url, str)
+        if not isinstance(self.url, str):
+            raise TypeError("self.url must be a string")
+
         text = self.client.get(self.url).text
         assert isinstance(text, str)
-        extracted = extract(text)
+        extracted = trafilatura.extract(text)
         text = self.template.format(url=self.url, content=extracted)
-        return text
+        return str(text)
+
+
+class ImageUrlProcessor(Processor):
+    def __init__(self) -> None:
+        self.client = Client()
+        self.url: Optional[str] = None
+        self.template = "{url}\n\n{content}"
+        self.image_suffixes = {'jpg', 'jpeg', 'png', 'gif'}
+
+    def match(self, input: JobInput) -> bool:
+        url = get_url(input.content.strip())
+        if url is None:
+            return False
+
+        suffix = url.rsplit(".", 1)[-1].lower()
+        if suffix not in self.image_suffixes:
+            return False
+
+        self.url = url
+        return True
+
+    def process(self, input: JobInput) -> str:
+        if not isinstance(self.url, str):
+            raise TypeError("self.url must be a string")
+
+        response = self.client.get(self.url)
+        image = Image.open(io.BytesIO(response.content)).convert('RGB')
+        caption = get_agent().run("Caption the following image", image=image)
+        return str(caption)
src/gistillery/registry.py CHANGED
@@ -1,10 +1,14 @@
-from gistillery.ml import Summarizer, Tagger
-from gistillery.preprocessing import Processor, RawTextProcessor
-
 from gistillery.base import JobInput
+from gistillery.tools import Summarizer, Tagger, HfDefaultSummarizer, HfDefaultTagger
+from gistillery.preprocessing import (
+    Processor,
+    RawTextProcessor,
+    ImageUrlProcessor,
+    DefaultUrlProcessor,
+)


-class MlRegistry:
+class ToolRegistry:
     def __init__(self) -> None:
         self.processors: list[Processor] = []
         self.summerizer: Summarizer | None = None
@@ -39,3 +43,24 @@ class MlRegistry:
     def get_tagger(self) -> Tagger:
         assert self.tagger
         return self.tagger
+
+
+_registry = None
+
+
+def get_tool_registry() -> ToolRegistry:
+    global _registry
+    if _registry is not None:
+        return _registry
+
+    summarizer = HfDefaultSummarizer()
+    tagger = HfDefaultTagger()
+
+    _registry = ToolRegistry()
+    _registry.register_processor(ImageUrlProcessor())
+    _registry.register_processor(DefaultUrlProcessor())
+    _registry.register_processor(RawTextProcessor())
+    _registry.register_summarizer(summarizer)
+    _registry.register_tagger(tagger)
+
+    return _registry
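Registration order in `get_tool_registry` matters: `ImageUrlProcessor` comes before `DefaultUrlProcessor`, which comes before the catch-all `RawTextProcessor`, so the most specific matcher wins. A minimal sketch of that first-match dispatch, using hypothetical simplified processors (the real ones take a `JobInput`, not a string):

```python
class Processor:
    def match(self, content: str) -> bool:
        raise NotImplementedError


class ImageUrl(Processor):
    def match(self, content: str) -> bool:
        return content.startswith("http") and content.endswith(".jpg")


class AnyUrl(Processor):
    def match(self, content: str) -> bool:
        return content.startswith("http")


class RawText(Processor):
    def match(self, content: str) -> bool:
        return True  # fallback: accepts anything


# order matters: most specific first, fallback last
processors = [ImageUrl(), AnyUrl(), RawText()]


def dispatch(content: str) -> str:
    # first processor whose match() returns True handles the job
    for proc in processors:
        if proc.match(content):
            return type(proc).__name__
    raise RuntimeError("no processor matched")


print(dispatch("http://x/cat.jpg"))  # ImageUrl
print(dispatch("http://x/page"))     # AnyUrl
print(dispatch("plain text"))        # RawText
```

Swapping the list order would send image URLs to the generic URL processor, which is why the fallback is registered last.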
src/gistillery/{ml.py → tools.py} RENAMED
@@ -1,17 +1,26 @@
 import abc
-from typing import Any
-import logging

-logger = logging.getLogger(__name__)
-logger.setLevel(logging.DEBUG)
+from huggingface_hub import login
+from transformers.tools import TextSummarizationTool
+from transformers import HfAgent
+from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, GenerationConfig

+from gistillery.config import get_config
+
+
+agent = None


-class Summarizer(abc.ABC):
-    def __init__(
-        self, model_name: str, model: Any, tokenizer: Any, generation_config: Any
-    ) -> None:
-        raise NotImplementedError
+def get_agent() -> HfAgent:
+    global agent
+    if agent is None:
+        login(get_config().hf_hub_token)
+        agent = HfAgent(get_config().hf_agent)
+    return agent
+
+
+class Summarizer(abc.ABC):
+    @abc.abstractmethod
     def get_name(self) -> str:
         raise NotImplementedError

@@ -20,12 +29,21 @@ class Summarizer(abc.ABC):
         raise NotImplementedError


-class Tagger(abc.ABC):
-    def __init__(
-        self, model_name: str, model: Any, tokenizer: Any, generation_config: Any
-    ) -> None:
-        raise NotImplementedError
+class HfDefaultSummarizer(Summarizer):
+    def __init__(self) -> None:
+        self.summarizer = TextSummarizationTool()
+
+    def get_name(self) -> str:
+        return "hf_default"
+
+    def __call__(self, x: str) -> str:
+        summary = self.summarizer(x)
+        assert isinstance(summary, str)
+        return summary
+

+class Tagger(abc.ABC):
+    @abc.abstractmethod
     def get_name(self) -> str:
         raise NotImplementedError

@@ -34,39 +52,19 @@ class Tagger(abc.ABC):
         raise NotImplementedError


-class HfTransformersSummarizer(Summarizer):
-    def __init__(
-        self, model_name: str, model: Any, tokenizer: Any, generation_config: Any
-    ) -> None:
-        self.model_name = model_name
-        self.model = model
-        self.tokenizer = tokenizer
-        self.generation_config = generation_config
-
-        self.template = "Summarize the text below in two sentences:\n\n{}"
-
-    def __call__(self, x: str) -> str:
-        text = self.template.format(x)
-        inputs = self.tokenizer(text, return_tensors="pt")
-        outputs = self.model.generate(
-            **inputs, generation_config=self.generation_config
-        )
-        output = self.tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]
-        assert isinstance(output, str)
-        return output
-
-    def get_name(self) -> str:
-        return f"{self.__class__.__name__}({self.model_name})"
-
-
-class HfTransformersTagger(Tagger):
-    def __init__(
-        self, model_name: str, model: Any, tokenizer: Any, generation_config: Any
-    ) -> None:
+class HfDefaultTagger(Tagger):
+    def __init__(self, model_name: str = "google/flan-t5-large") -> None:
         self.model_name = model_name
-        self.model = model
-        self.tokenizer = tokenizer
-        self.generation_config = generation_config
+
+        config = GenerationConfig.from_pretrained(self.model_name)
+        config.max_new_tokens = 50
+        config.min_new_tokens = 25
+        # increase the temperature to make the model more creative
+        config.temperature = 1.5
+
+        self.model = AutoModelForSeq2SeqLM.from_pretrained(self.model_name)
+        self.tokenizer = AutoTokenizer.from_pretrained(self.model_name)
+        self.generation_config = config

         self.template = (
             "Create a list of tags for the text below. The tags should be high level "
@@ -37,8 +37,28 @@ def submit_job(input: RequestInput) -> str:
37
  return f"Submitted job {_id}"
38
 
39
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
40
  @app.get("/check_job_status/{_id}")
41
- def check_job_status(_id: str) -> JobStatusResult:
42
  with get_db_cursor() as cursor:
43
  cursor.execute(
44
  "SELECT status, last_updated FROM jobs WHERE entry_id = ?", (_id,)
 
37
  return f"Submitted job {_id}"
38
 
39
 
40
+ @app.get("/check_job_status/")
41
+ def check_job_status() -> str:
42
+ with get_db_cursor() as cursor:
43
+ cursor.execute(
44
+ "SELECT entry_id "
45
+ "FROM jobs WHERE status = 'pending' "
46
+ "ORDER BY last_updated ASC"
47
+ )
48
+ result = cursor.fetchall()
49
+
50
+ if not result:
51
+ return "No pending jobs found"
52
+
53
+ entry_ids = [r.entry_id for r in result]
54
+ num_entries = len(entry_ids)
55
+ if len(entry_ids) > 3:
56
+ entry_ids = entry_ids[:3] + ["..."]
57
+ return f"Found {num_entries} pending job(s): {', '.join(entry_ids)}"
58
+
59
+
60
  @app.get("/check_job_status/{_id}")
61
+ def check_job_status_id(_id: str) -> JobStatusResult:
62
  with get_db_cursor() as cursor:
63
  cursor.execute(
64
  "SELECT status, last_updated FROM jobs WHERE entry_id = ?", (_id,)
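The new parameter-free endpoint reports the total pending count but truncates the displayed id list to three entries plus `"..."`; note the count is taken before truncation. The formatting logic in isolation (extracted from the handler body above, with plain strings standing in for DB rows):

```python
def format_pending(entry_ids: list[str]) -> str:
    # mirrors the body of the new check_job_status endpoint
    if not entry_ids:
        return "No pending jobs found"
    num_entries = len(entry_ids)  # total count, taken before truncation
    if len(entry_ids) > 3:
        entry_ids = entry_ids[:3] + ["..."]
    return f"Found {num_entries} pending job(s): {', '.join(entry_ids)}"


print(format_pending([]))
print(format_pending(["a1", "b2", "c3"]))
print(format_pending(["a1", "b2", "c3", "d4", "e5"]))
```

Since FastAPI resolves routes in declaration order when paths overlap, declaring the static `/check_job_status/` route before the `/check_job_status/{_id}` variant (as the diff does) is the safe ordering.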
src/gistillery/worker.py CHANGED
@@ -3,9 +3,7 @@ from dataclasses import dataclass

 from gistillery.base import JobInput
 from gistillery.db import get_db_cursor
-from gistillery.ml import HfTransformersSummarizer, HfTransformersTagger
-from gistillery.preprocessing import DefaultUrlProcessor, RawTextProcessor
-from gistillery.registry import MlRegistry
+from gistillery.registry import ToolRegistry, get_tool_registry

 SLEEP_INTERVAL = 5

@@ -13,7 +11,7 @@ SLEEP_INTERVAL = 5
 def check_pending_jobs() -> list[JobInput]:
     """Check DB for pending jobs"""
     with get_db_cursor() as cursor:
-        # fetch pending jobs, join authro and content from entries table
+        # fetch pending jobs, join author and content from entries table
         query = """
         SELECT j.entry_id, e.author, e.source
         FROM jobs j
@@ -21,7 +19,7 @@ def check_pending_jobs() -> list[JobInput]:
         ON j.entry_id = e.id
         WHERE j.status = 'pending'
         """
-        res = list(cursor.execute(query))
+        res = cursor.execute(query).fetchall()
     return [
         JobInput(id=_id, author=author, content=content) for _id, author, content in res
     ]
@@ -37,7 +35,7 @@ class JobOutput:
     tagger_name: str


-def _process_job(job: JobInput, registry: MlRegistry) -> JobOutput:
+def _process_job(job: JobInput, registry: ToolRegistry) -> JobOutput:
     processor = registry.get_processor(job)
     processor_name = processor.get_name()
     processed = processor(job)
@@ -79,7 +77,7 @@ def store(job: JobInput, output: JobOutput) -> None:
     )


-def process_job(job: JobInput, registry: MlRegistry) -> None:
+def process_job(job: JobInput, registry: ToolRegistry) -> None:
     tic = time.perf_counter()
     print(f"Processing job for (id={job.id[:8]})")

@@ -105,41 +103,8 @@ def process_job(job: JobInput, registry: MlRegistry) -> None:
     print(f"Finished processing job (id={job.id[:8]}) in {toc - tic:0.3f} seconds")


-def load_mlregistry(model_name: str) -> MlRegistry:
-    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, GenerationConfig
-
-    model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
-    tokenizer = AutoTokenizer.from_pretrained(model_name)
-
-    config_summarizer = GenerationConfig.from_pretrained(model_name)
-    config_summarizer.max_new_tokens = 200
-    config_summarizer.min_new_tokens = 100
-    config_summarizer.top_k = 5
-    config_summarizer.repetition_penalty = 1.5
-
-    config_tagger = GenerationConfig.from_pretrained(model_name)
-    config_tagger.max_new_tokens = 50
-    config_tagger.min_new_tokens = 25
-    # increase the temperature to make the model more creative
-    config_tagger.temperature = 1.5
-
-    summarizer = HfTransformersSummarizer(
-        model_name, model, tokenizer, config_summarizer
-    )
-    tagger = HfTransformersTagger(model_name, model, tokenizer, config_tagger)
-
-    registry = MlRegistry()
-    registry.register_processor(DefaultUrlProcessor())
-    registry.register_processor(RawTextProcessor())
-    registry.register_summarizer(summarizer)
-    registry.register_tagger(tagger)
-
-    return registry
-
-
 def main() -> None:
-    model_name = "google/flan-t5-large"
-    registry = load_mlregistry(model_name)
+    registry = get_tool_registry()

     while True:
         jobs = check_pending_jobs()
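The `check_pending_jobs` query joins the `jobs` table against `entries` and filters on `status = 'pending'`. A self-contained in-memory sketch of that query shape, with deliberately simplified stand-in schemas (the real tables have more columns):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# simplified stand-ins for the real entries/jobs tables
cur.execute("CREATE TABLE entries (id TEXT PRIMARY KEY, author TEXT, source TEXT)")
cur.execute("CREATE TABLE jobs (entry_id TEXT, status TEXT)")
cur.executemany(
    "INSERT INTO entries VALUES (?, ?, ?)",
    [("j1", "ben", "first text"), ("j2", "ben", "second text")],
)
cur.executemany(
    "INSERT INTO jobs VALUES (?, ?)",
    [("j1", "pending"), ("j2", "done")],
)

# same shape as the worker's query: only pending jobs come back
query = """
SELECT j.entry_id, e.author, e.source
FROM jobs j
JOIN entries e ON j.entry_id = e.id
WHERE j.status = 'pending'
"""
res = cur.execute(query).fetchall()
print(res)  # only the pending j1 row
```

The switch from `list(cursor.execute(query))` to `cursor.execute(query).fetchall()` in the diff is behaviorally equivalent for sqlite3 here; `fetchall()` just makes the materialization explicit.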
tests/test_app.py CHANGED
@@ -35,18 +35,14 @@ class TestWebservice:
         return client

     @pytest.fixture
-    def mlregistry(self):
         # use dummy models
-        from gistillery.ml import Summarizer, Tagger
         from gistillery.preprocessing import RawTextProcessor
-        from gistillery.registry import MlRegistry

         class DummySummarizer(Summarizer):
             """Returns the first 10 characters of the input"""
-
-            def __init__(self, *args, **kwargs):
-                pass
-
             def get_name(self):
                 return "dummy summarizer"
@@ -55,24 +51,20 @@ class TestWebservice:
         class DummyTagger(Tagger):
             """Returns the first 3 words of the input"""
-
-            def __init__(self, *args, **kwargs):
-                pass
-
             def get_name(self):
                 return "dummy tagger"

             def __call__(self, x):
                 return ["#" + word for word in x.split(maxsplit=4)[:3]]

-        registry = MlRegistry()
         registry.register_processor(RawTextProcessor())

         # arguments don't matter for dummy summarizer and tagger
-        summarizer = DummySummarizer(None, None, None, None)
         registry.register_summarizer(summarizer)

-        tagger = DummyTagger(None, None, None, None)
         registry.register_tagger(tagger)
         return registry
@@ -128,7 +120,7 @@ class TestWebservice:
         }
         assert last_updated is None

-    def test_submitted_job_failed(self, client, mlregistry, monkeypatch):
         # monkeypatch uuid4 to return a known value
         job_id = "abc1234"
         monkeypatch.setattr("uuid.uuid4", lambda: SimpleNamespace(hex=job_id))
@@ -143,7 +135,7 @@ class TestWebservice:
             "gistillery.worker._process_job",
             lambda job, registry: raise_(RuntimeError("something went wrong")),
         )
-        self.process_jobs(mlregistry)

         resp = client.get(f"/check_job_status/{job_id}")
         output = resp.json()
@@ -153,12 +145,12 @@ class TestWebservice:
             "status": "failed",
         }

-    def test_submitted_job_status_done(self, client, mlregistry, monkeypatch):
         # monkeypatch uuid4 to return a known value
         job_id = "abc1234"
         monkeypatch.setattr("uuid.uuid4", lambda: SimpleNamespace(hex=job_id))
         client.post("/submit", json={"author": "ben", "content": "this is a test"})
-        self.process_jobs(mlregistry)

         resp = client.get(f"/check_job_status/{job_id}")
         output = resp.json()
@@ -169,7 +161,28 @@ class TestWebservice:
         }
         assert is_roughly_now(last_updated)

-    def test_recent_with_entries(self, client, mlregistry):
         # submit 2 entries
         client.post(
             "/submit", json={"author": "maxi", "content": "this is a first test"}
@@ -178,7 +191,7 @@ class TestWebservice:
             "/submit",
             json={"author": "mini", "content": "this would be something else"},
         )
-        self.process_jobs(mlregistry)
         resp = client.get("/recent").json()

         # results are sorted by recency but since dummy models are so fast, the
@@ -196,7 +209,7 @@ class TestWebservice:
         assert resp1["summary"] == "this would"
         assert resp1["tags"] == sorted(["#this", "#would", "#be"])

-    def test_recent_tag_with_entries(self, client, mlregistry):
         # submit 2 entries
         client.post(
             "/submit", json={"author": "maxi", "content": "this is a first test"}
@@ -205,7 +218,7 @@ class TestWebservice:
             "/submit",
             json={"author": "mini", "content": "this would be something else"},
         )
-        self.process_jobs(mlregistry)

         # the "this" tag is in both entries
         resp = client.get("/recent/this").json()
@@ -220,22 +233,22 @@ class TestWebservice:
         assert resp0["summary"] == "this would"
         assert resp0["tags"] == sorted(["#this", "#would", "#be"])

-    def test_clear(self, client, cursor, mlregistry):
         client.post("/submit", json={"author": "ben", "content": "this is a test"})
-        self.process_jobs(mlregistry)
         assert cursor.execute("SELECT COUNT(*) c FROM entries").fetchone()[0] == 1

         client.get("/clear")
         assert cursor.execute("SELECT COUNT(*) c FROM entries").fetchone()[0] == 0

-    def test_inputs_stored(self, client, cursor, mlregistry):
         client.post("/submit", json={"author": "ben", "content": " this is a test\n"})
-        self.process_jobs(mlregistry)
         rows = cursor.execute("SELECT * FROM inputs").fetchall()
         assert len(rows) == 1
         assert rows[0].input == "this is a test"

-    def test_submit_url(self, client, cursor, mlregistry, monkeypatch):
         class MockClient:
             """Mock httpx Client, return www.example.com content"""

@@ -269,7 +282,7 @@ class TestWebservice:
         from gistillery.preprocessing import DefaultUrlProcessor

         # register url processor, put it before the default processor
-        mlregistry.register_processor(DefaultUrlProcessor(), last=False)
         client.post(
             "/submit",
             json={
@@ -277,7 +290,7 @@ class TestWebservice:
                 "content": "https://en.wikipedia.org/wiki/non-existing-page",
             },
         )
-        self.process_jobs(mlregistry)

         rows = cursor.execute("SELECT * FROM inputs").fetchall()
         assert len(rows) == 1
36
 
37
  @pytest.fixture
38
+ def registry(self):
39
  # use dummy models
40
+ from gistillery.tools import Summarizer, Tagger
41
  from gistillery.preprocessing import RawTextProcessor
42
+ from gistillery.registry import ToolRegistry
43
 
44
  class DummySummarizer(Summarizer):
45
  """Returns the first 10 characters of the input"""
 
 
 
 
46
  def get_name(self):
47
  return "dummy summarizer"
48
 
 
51
 
52
  class DummyTagger(Tagger):
53
  """Returns the first 3 words of the input"""
 
 
 
 
54
  def get_name(self):
55
  return "dummy tagger"
56
 
57
  def __call__(self, x):
58
  return ["#" + word for word in x.split(maxsplit=4)[:3]]
59
 
60
+ registry = ToolRegistry()
61
  registry.register_processor(RawTextProcessor())
62
 
63
  # arguments don't matter for dummy summarizer and tagger
64
+ summarizer = DummySummarizer()
65
  registry.register_summarizer(summarizer)
66
 
67
+ tagger = DummyTagger()
68
  registry.register_tagger(tagger)
69
  return registry
70
 
 
120
  }
121
  assert last_updated is None
122
 
123
+ def test_submitted_job_failed(self, client, registry, monkeypatch):
124
  # monkeypatch uuid4 to return a known value
125
  job_id = "abc1234"
126
  monkeypatch.setattr("uuid.uuid4", lambda: SimpleNamespace(hex=job_id))
 
135
  "gistillery.worker._process_job",
136
  lambda job, registry: raise_(RuntimeError("something went wrong")),
137
  )
138
+ self.process_jobs(registry)
139
 
140
  resp = client.get(f"/check_job_status/{job_id}")
141
  output = resp.json()
 
145
  "status": "failed",
146
  }
147
 
148
+ def test_submitted_job_status_done(self, client, registry, monkeypatch):
149
  # monkeypatch uuid4 to return a known value
150
  job_id = "abc1234"
151
  monkeypatch.setattr("uuid.uuid4", lambda: SimpleNamespace(hex=job_id))
152
  client.post("/submit", json={"author": "ben", "content": "this is a test"})
153
+ self.process_jobs(registry)
154
 
155
  resp = client.get(f"/check_job_status/{job_id}")
156
  output = resp.json()
 
161
  }
162
  assert is_roughly_now(last_updated)
163
 
164
+ def test_status_pending_jobs(self, client, registry, monkeypatch):
165
+ resp = client.get("/check_job_status/")
166
+ output = resp.json()
167
+ assert output == "No pending jobs found"
168
+
169
+ monkeypatch.setattr("uuid.uuid4", lambda: SimpleNamespace(hex="abc0"))
170
+ client.post("/submit", json={"author": "ben", "content": "this is a test"})
171
+ resp = client.get("/check_job_status/")
172
+ output = resp.json()
173
+ expected = "Found 1 pending job(s): abc0"
174
+ assert output == expected
175
+
176
+ for i in range(1, 10):
177
+ monkeypatch.setattr("uuid.uuid4", lambda: SimpleNamespace(hex=f"abc{i}"))
178
+ client.post("/submit", json={"author": "ben", "content": "this is a test"})
179
+
180
+ resp = client.get("/check_job_status/")
181
+ output = resp.json()
182
+ expected = "Found 10 pending job(s): abc0, abc1, abc2, ..."
183
+ assert output == expected
184
+
185
+ def test_recent_with_entries(self, client, registry):
186
  # submit 2 entries
187
  client.post(
188
  "/submit", json={"author": "maxi", "content": "this is a first test"}
 
191
  "/submit",
192
  json={"author": "mini", "content": "this would be something else"},
193
  )
194
+ self.process_jobs(registry)
195
  resp = client.get("/recent").json()
196
 
197
  # results are sorted by recency but since dummy models are so fast, the
 
209
  assert resp1["summary"] == "this would"
210
  assert resp1["tags"] == sorted(["#this", "#would", "#be"])
211
 
212
+ def test_recent_tag_with_entries(self, client, registry):
213
  # submit 2 entries
214
  client.post(
215
  "/submit", json={"author": "maxi", "content": "this is a first test"}
 
218
  "/submit",
219
  json={"author": "mini", "content": "this would be something else"},
220
  )
221
+ self.process_jobs(registry)
222
 
223
  # the "this" tag is in both entries
224
  resp = client.get("/recent/this").json()
 
233
  assert resp0["summary"] == "this would"
234
  assert resp0["tags"] == sorted(["#this", "#would", "#be"])
235
 
236
+ def test_clear(self, client, cursor, registry):
237
  client.post("/submit", json={"author": "ben", "content": "this is a test"})
238
+ self.process_jobs(registry)
239
  assert cursor.execute("SELECT COUNT(*) c FROM entries").fetchone()[0] == 1
240
 
241
  client.get("/clear")
242
  assert cursor.execute("SELECT COUNT(*) c FROM entries").fetchone()[0] == 0
243
 
244
+ def test_inputs_stored(self, client, cursor, registry):
245
  client.post("/submit", json={"author": "ben", "content": " this is a test\n"})
246
+ self.process_jobs(registry)
247
  rows = cursor.execute("SELECT * FROM inputs").fetchall()
248
  assert len(rows) == 1
249
  assert rows[0].input == "this is a test"
250
 
251
+ def test_submit_url(self, client, cursor, registry, monkeypatch):
252
  class MockClient:
253
  """Mock httpx Client, return www.example.com content"""
254
 
 
282
  from gistillery.preprocessing import DefaultUrlProcessor
283
 
284
  # register url processor, put it before the default processor
285
+ registry.register_processor(DefaultUrlProcessor(), last=False)
286
  client.post(
287
  "/submit",
288
  json={
 
290
  "content": "https://en.wikipedia.org/wiki/non-existing-page",
291
  },
292
  )
293
+ self.process_jobs(registry)
294
 
295
  rows = cursor.execute("SELECT * FROM inputs").fetchall()
296
  assert len(rows) == 1