from typing import Any, List, Optional, Sequence

from langchain_core.documents import BaseDocumentTransformer, Document

from langchain.utils import get_from_env


class DoctranPropertyExtractor(BaseDocumentTransformer):
    """Extract properties from text documents using doctran.

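    Arguments:
        properties: A list of the properties to extract.
        openai_api_key: OpenAI API key. Can also be specified via environment variable
            ``OPENAI_API_KEY``.
        openai_api_model: OpenAI model name. Can also be specified via environment
            variable ``OPENAI_API_MODEL``.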

    Example:
        .. code-block:: python

            from langchain.document_transformers import DoctranPropertyExtractor

            properties = [
                {
                    "name": "category",
                    "description": "What type of email this is.",
                    "type": "string",
                    "enum": ["update", "action_item", "customer_feedback", "announcement", "other"],
                    "required": True,
                },
                {
                    "name": "mentions",
                    "description": "A list of all people mentioned in this email.",
                    "type": "array",
                    "items": {
                        "name": "full_name",
                        "description": "The full name of the person mentioned.",
                        "type": "string",
                    },
                    "required": True,
                },
                {
                    "name": "eli5",
                    "description": "Explain this email to me like I'm 5 years old.",
                    "type": "string",
                    "required": True,
                },
            ]

            # Pass in openai_api_key and openai_api_model, or set the env vars
            # OPENAI_API_KEY and OPENAI_API_MODEL
            property_extractor = DoctranPropertyExtractor(properties)
            transformed_documents = await property_extractor.atransform_documents(documents)
    """  # noqa: E501

    def __init__(
        self,
        properties: List[dict],
        openai_api_key: Optional[str] = None,
        openai_api_model: Optional[str] = None,
    ) -> None:
        self.properties = properties
        self.openai_api_key = openai_api_key or get_from_env(
            "openai_api_key", "OPENAI_API_KEY"
        )
        self.openai_api_model = openai_api_model or get_from_env(
            "openai_api_model", "OPENAI_API_MODEL"
        )

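    def transform_documents(
        self, documents: Sequence[Document], **kwargs: Any
    ) -> Sequence[Document]:
        """Synchronous extraction is not implemented; use ``atransform_documents``."""
        raise NotImplementedError(
            "DoctranPropertyExtractor only supports atransform_documents."
        )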

    async def atransform_documents(
        self, documents: Sequence[Document], **kwargs: Any
    ) -> Sequence[Document]:
        """Extract properties from text documents using doctran.

        Results are stored under ``metadata["extracted_properties"]`` on each document.
        """
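        # Keep the try block limited to the import so that errors raised while
        # constructing the client are not reported as a missing package.
        try:
            from doctran import Doctran, ExtractProperty
        except ImportError as e:
            raise ImportError(
                "Install doctran to use this transformer. (pip install doctran)"
            ) from e
        doctran = Doctran(
            openai_api_key=self.openai_api_key, openai_model=self.openai_api_model
        )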
        # Use ``prop`` rather than ``property`` to avoid shadowing the builtin.
        properties = [ExtractProperty(**prop) for prop in self.properties]
        for d in documents:
            doctran_doc = (
                await doctran.parse(content=d.page_content)
                .extract(properties=properties)
                .execute()
            )

            # Documents are updated in place; the same objects are returned.
            d.metadata["extracted_properties"] = doctran_doc.extracted_properties
        return documents
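

# Because transform_documents raises NotImplementedError, synchronous callers
# need to drive atransform_documents themselves. The sketch below shows one way
# to do that with asyncio.run. StubDocument, StubPropertyExtractor and
# transform_sync are hypothetical names introduced only for illustration; the
# stub makes no doctran or OpenAI calls and just fills in placeholder values,
# so it mirrors only the shape of the interface above, not its real results.

```python
import asyncio
from dataclasses import dataclass, field
from typing import Any, Dict, List, Sequence


@dataclass
class StubDocument:
    """Hypothetical stand-in for langchain_core.documents.Document."""

    page_content: str
    metadata: Dict[str, Any] = field(default_factory=dict)


class StubPropertyExtractor:
    """Hypothetical stand-in that mimics the async-only interface above."""

    def __init__(self, properties: List[dict]) -> None:
        self.properties = properties

    async def atransform_documents(
        self, documents: Sequence[StubDocument], **kwargs: Any
    ) -> Sequence[StubDocument]:
        for d in documents:
            # A real extractor would query the model here. The stub only
            # records each requested property name with a None placeholder.
            d.metadata["extracted_properties"] = {
                p["name"]: None for p in self.properties
            }
        return documents


def transform_sync(extractor: Any, documents: Sequence[Any]) -> Sequence[Any]:
    """Run the async transform from synchronous code.

    asyncio.run starts a new event loop, so this must not be called from
    code that is already running inside one.
    """
    return asyncio.run(extractor.atransform_documents(documents))


docs = [StubDocument(page_content="Quarterly update from Alice.")]
extractor = StubPropertyExtractor([{"name": "category"}, {"name": "mentions"}])
result = transform_sync(extractor, docs)
print(result[0].metadata["extracted_properties"])
```

# As in the class above, the documents are modified in place and the same
# objects are returned, so callers that need the originals should copy first.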