sidphbot commited on
Commit
a8d4e3d
1 Parent(s): e7f6d6b

spaces init

Files changed (44)
  1. LICENSE +674 -0
  2. README.md +267 -13
  3. app.py +58 -0
  4. arxiv_public_data/__init__.py +0 -0
  5. arxiv_public_data/__pycache__/__init__.cpython-310.pyc +0 -0
  6. arxiv_public_data/__pycache__/config.cpython-310.pyc +0 -0
  7. arxiv_public_data/__pycache__/fixunicode.cpython-310.pyc +0 -0
  8. arxiv_public_data/__pycache__/fulltext.cpython-310.pyc +0 -0
  9. arxiv_public_data/__pycache__/internal_citations.cpython-310.pyc +0 -0
  10. arxiv_public_data/__pycache__/pdfstamp.cpython-310.pyc +0 -0
  11. arxiv_public_data/__pycache__/regex_arxiv.cpython-310.pyc +0 -0
  12. arxiv_public_data/authors.py +469 -0
  13. arxiv_public_data/config.py +55 -0
  14. arxiv_public_data/embeddings/__init__.py +0 -0
  15. arxiv_public_data/embeddings/tf_hub.py +185 -0
  16. arxiv_public_data/embeddings/util.py +151 -0
  17. arxiv_public_data/fixunicode.py +108 -0
  18. arxiv_public_data/fulltext.py +349 -0
  19. arxiv_public_data/internal_citations.py +128 -0
  20. arxiv_public_data/oai_metadata.py +282 -0
  21. arxiv_public_data/pdfstamp.py +83 -0
  22. arxiv_public_data/regex_arxiv.py +195 -0
  23. arxiv_public_data/s3_bulk_download.py +397 -0
  24. arxiv_public_data/slice_pdfs.py +93 -0
  25. arxiv_public_data/tex2utf.py +206 -0
  26. logo.png +0 -0
  27. requirements.txt +22 -0
  28. setup.py +89 -0
  29. src/Auto_Research.egg-info/PKG-INFO +313 -0
  30. src/Auto_Research.egg-info/SOURCES.txt +10 -0
  31. src/Auto_Research.egg-info/dependency_links.txt +2 -0
  32. src/Auto_Research.egg-info/entry_points.txt +2 -0
  33. src/Auto_Research.egg-info/requires.txt +24 -0
  34. src/Auto_Research.egg-info/top_level.txt +1 -0
  35. src/Surveyor.py +1518 -0
  36. src/__pycache__/Surveyor.cpython-310.pyc +0 -0
  37. src/__pycache__/defaults.cpython-310.pyc +0 -0
  38. src/defaults.py +20 -0
  39. src/packages.txt +0 -0
  40. survey.py +72 -0
  41. tests/__init__.py +0 -0
  42. tests/__pycache__/__init__.cpython-310.pyc +0 -0
  43. tests/__pycache__/test_survey_files.cpython-310-pytest-7.1.2.pyc +0 -0
  44. tests/test_survey_files.py +10 -0
LICENSE ADDED
@@ -0,0 +1,674 @@
1
+ GNU GENERAL PUBLIC LICENSE
2
+ Version 3, 29 June 2007
3
+
4
+ Copyright (C) 2007 Free Software Foundation, Inc. <https://fsf.org/>
5
+ Everyone is permitted to copy and distribute verbatim copies
6
+ of this license document, but changing it is not allowed.
7
+
8
+ Preamble
9
+
10
+ The GNU General Public License is a free, copyleft license for
11
+ software and other kinds of works.
12
+
13
+ The licenses for most software and other practical works are designed
14
+ to take away your freedom to share and change the works. By contrast,
15
+ the GNU General Public License is intended to guarantee your freedom to
16
+ share and change all versions of a program--to make sure it remains free
17
+ software for all its users. We, the Free Software Foundation, use the
18
+ GNU General Public License for most of our software; it applies also to
19
+ any other work released this way by its authors. You can apply it to
20
+ your programs, too.
21
+
22
+ When we speak of free software, we are referring to freedom, not
23
+ price. Our General Public Licenses are designed to make sure that you
24
+ have the freedom to distribute copies of free software (and charge for
25
+ them if you wish), that you receive source code or can get it if you
26
+ want it, that you can change the software or use pieces of it in new
27
+ free programs, and that you know you can do these things.
28
+
29
+ To protect your rights, we need to prevent others from denying you
30
+ these rights or asking you to surrender the rights. Therefore, you have
31
+ certain responsibilities if you distribute copies of the software, or if
32
+ you modify it: responsibilities to respect the freedom of others.
33
+
34
+ For example, if you distribute copies of such a program, whether
35
+ gratis or for a fee, you must pass on to the recipients the same
36
+ freedoms that you received. You must make sure that they, too, receive
37
+ or can get the source code. And you must show them these terms so they
38
+ know their rights.
39
+
40
+ Developers that use the GNU GPL protect your rights with two steps:
41
+ (1) assert copyright on the software, and (2) offer you this License
42
+ giving you legal permission to copy, distribute and/or modify it.
43
+
44
+ For the developers' and authors' protection, the GPL clearly explains
45
+ that there is no warranty for this free software. For both users' and
46
+ authors' sake, the GPL requires that modified versions be marked as
47
+ changed, so that their problems will not be attributed erroneously to
48
+ authors of previous versions.
49
+
50
+ Some devices are designed to deny users access to install or run
51
+ modified versions of the software inside them, although the manufacturer
52
+ can do so. This is fundamentally incompatible with the aim of
53
+ protecting users' freedom to change the software. The systematic
54
+ pattern of such abuse occurs in the area of products for individuals to
55
+ use, which is precisely where it is most unacceptable. Therefore, we
56
+ have designed this version of the GPL to prohibit the practice for those
57
+ products. If such problems arise substantially in other domains, we
58
+ stand ready to extend this provision to those domains in future versions
59
+ of the GPL, as needed to protect the freedom of users.
60
+
61
+ Finally, every program is threatened constantly by software patents.
62
+ States should not allow patents to restrict development and use of
63
+ software on general-purpose computers, but in those that do, we wish to
64
+ avoid the special danger that patents applied to a free program could
65
+ make it effectively proprietary. To prevent this, the GPL assures that
66
+ patents cannot be used to render the program non-free.
67
+
68
+ The precise terms and conditions for copying, distribution and
69
+ modification follow.
70
+
71
+ TERMS AND CONDITIONS
72
+
73
+ 0. Definitions.
74
+
75
+ "This License" refers to version 3 of the GNU General Public License.
76
+
77
+ "Copyright" also means copyright-like laws that apply to other kinds of
78
+ works, such as semiconductor masks.
79
+
80
+ "The Program" refers to any copyrightable work licensed under this
81
+ License. Each licensee is addressed as "you". "Licensees" and
82
+ "recipients" may be individuals or organizations.
83
+
84
+ To "modify" a work means to copy from or adapt all or part of the work
85
+ in a fashion requiring copyright permission, other than the making of an
86
+ exact copy. The resulting work is called a "modified version" of the
87
+ earlier work or a work "based on" the earlier work.
88
+
89
+ A "covered work" means either the unmodified Program or a work based
90
+ on the Program.
91
+
92
+ To "propagate" a work means to do anything with it that, without
93
+ permission, would make you directly or secondarily liable for
94
+ infringement under applicable copyright law, except executing it on a
95
+ computer or modifying a private copy. Propagation includes copying,
96
+ distribution (with or without modification), making available to the
97
+ public, and in some countries other activities as well.
98
+
99
+ To "convey" a work means any kind of propagation that enables other
100
+ parties to make or receive copies. Mere interaction with a user through
101
+ a computer network, with no transfer of a copy, is not conveying.
102
+
103
+ An interactive user interface displays "Appropriate Legal Notices"
104
+ to the extent that it includes a convenient and prominently visible
105
+ feature that (1) displays an appropriate copyright notice, and (2)
106
+ tells the user that there is no warranty for the work (except to the
107
+ extent that warranties are provided), that licensees may convey the
108
+ work under this License, and how to view a copy of this License. If
109
+ the interface presents a list of user commands or options, such as a
110
+ menu, a prominent item in the list meets this criterion.
111
+
112
+ 1. Source Code.
113
+
114
+ The "source code" for a work means the preferred form of the work
115
+ for making modifications to it. "Object code" means any non-source
116
+ form of a work.
117
+
118
+ A "Standard Interface" means an interface that either is an official
119
+ standard defined by a recognized standards body, or, in the case of
120
+ interfaces specified for a particular programming language, one that
121
+ is widely used among developers working in that language.
122
+
123
+ The "System Libraries" of an executable work include anything, other
124
+ than the work as a whole, that (a) is included in the normal form of
125
+ packaging a Major Component, but which is not part of that Major
126
+ Component, and (b) serves only to enable use of the work with that
127
+ Major Component, or to implement a Standard Interface for which an
128
+ implementation is available to the public in source code form. A
129
+ "Major Component", in this context, means a major essential component
130
+ (kernel, window system, and so on) of the specific operating system
131
+ (if any) on which the executable work runs, or a compiler used to
132
+ produce the work, or an object code interpreter used to run it.
133
+
134
+ The "Corresponding Source" for a work in object code form means all
135
+ the source code needed to generate, install, and (for an executable
136
+ work) run the object code and to modify the work, including scripts to
137
+ control those activities. However, it does not include the work's
138
+ System Libraries, or general-purpose tools or generally available free
139
+ programs which are used unmodified in performing those activities but
140
+ which are not part of the work. For example, Corresponding Source
141
+ includes interface definition files associated with source files for
142
+ the work, and the source code for shared libraries and dynamically
143
+ linked subprograms that the work is specifically designed to require,
144
+ such as by intimate data communication or control flow between those
145
+ subprograms and other parts of the work.
146
+
147
+ The Corresponding Source need not include anything that users
148
+ can regenerate automatically from other parts of the Corresponding
149
+ Source.
150
+
151
+ The Corresponding Source for a work in source code form is that
152
+ same work.
153
+
154
+ 2. Basic Permissions.
155
+
156
+ All rights granted under this License are granted for the term of
157
+ copyright on the Program, and are irrevocable provided the stated
158
+ conditions are met. This License explicitly affirms your unlimited
159
+ permission to run the unmodified Program. The output from running a
160
+ covered work is covered by this License only if the output, given its
161
+ content, constitutes a covered work. This License acknowledges your
162
+ rights of fair use or other equivalent, as provided by copyright law.
163
+
164
+ You may make, run and propagate covered works that you do not
165
+ convey, without conditions so long as your license otherwise remains
166
+ in force. You may convey covered works to others for the sole purpose
167
+ of having them make modifications exclusively for you, or provide you
168
+ with facilities for running those works, provided that you comply with
169
+ the terms of this License in conveying all material for which you do
170
+ not control copyright. Those thus making or running the covered works
171
+ for you must do so exclusively on your behalf, under your direction
172
+ and control, on terms that prohibit them from making any copies of
173
+ your copyrighted material outside their relationship with you.
174
+
175
+ Conveying under any other circumstances is permitted solely under
176
+ the conditions stated below. Sublicensing is not allowed; section 10
177
+ makes it unnecessary.
178
+
179
+ 3. Protecting Users' Legal Rights From Anti-Circumvention Law.
180
+
181
+ No covered work shall be deemed part of an effective technological
182
+ measure under any applicable law fulfilling obligations under article
183
+ 11 of the WIPO copyright treaty adopted on 20 December 1996, or
184
+ similar laws prohibiting or restricting circumvention of such
185
+ measures.
186
+
187
+ When you convey a covered work, you waive any legal power to forbid
188
+ circumvention of technological measures to the extent such circumvention
189
+ is effected by exercising rights under this License with respect to
190
+ the covered work, and you disclaim any intention to limit operation or
191
+ modification of the work as a means of enforcing, against the work's
192
+ users, your or third parties' legal rights to forbid circumvention of
193
+ technological measures.
194
+
195
+ 4. Conveying Verbatim Copies.
196
+
197
+ You may convey verbatim copies of the Program's source code as you
198
+ receive it, in any medium, provided that you conspicuously and
199
+ appropriately publish on each copy an appropriate copyright notice;
200
+ keep intact all notices stating that this License and any
201
+ non-permissive terms added in accord with section 7 apply to the code;
202
+ keep intact all notices of the absence of any warranty; and give all
203
+ recipients a copy of this License along with the Program.
204
+
205
+ You may charge any price or no price for each copy that you convey,
206
+ and you may offer support or warranty protection for a fee.
207
+
208
+ 5. Conveying Modified Source Versions.
209
+
210
+ You may convey a work based on the Program, or the modifications to
211
+ produce it from the Program, in the form of source code under the
212
+ terms of section 4, provided that you also meet all of these conditions:
213
+
214
+ a) The work must carry prominent notices stating that you modified
215
+ it, and giving a relevant date.
216
+
217
+ b) The work must carry prominent notices stating that it is
218
+ released under this License and any conditions added under section
219
+ 7. This requirement modifies the requirement in section 4 to
220
+ "keep intact all notices".
221
+
222
+ c) You must license the entire work, as a whole, under this
223
+ License to anyone who comes into possession of a copy. This
224
+ License will therefore apply, along with any applicable section 7
225
+ additional terms, to the whole of the work, and all its parts,
226
+ regardless of how they are packaged. This License gives no
227
+ permission to license the work in any other way, but it does not
228
+ invalidate such permission if you have separately received it.
229
+
230
+ d) If the work has interactive user interfaces, each must display
231
+ Appropriate Legal Notices; however, if the Program has interactive
232
+ interfaces that do not display Appropriate Legal Notices, your
233
+ work need not make them do so.
234
+
235
+ A compilation of a covered work with other separate and independent
236
+ works, which are not by their nature extensions of the covered work,
237
+ and which are not combined with it such as to form a larger program,
238
+ in or on a volume of a storage or distribution medium, is called an
239
+ "aggregate" if the compilation and its resulting copyright are not
240
+ used to limit the access or legal rights of the compilation's users
241
+ beyond what the individual works permit. Inclusion of a covered work
242
+ in an aggregate does not cause this License to apply to the other
243
+ parts of the aggregate.
244
+
245
+ 6. Conveying Non-Source Forms.
246
+
247
+ You may convey a covered work in object code form under the terms
248
+ of sections 4 and 5, provided that you also convey the
249
+ machine-readable Corresponding Source under the terms of this License,
250
+ in one of these ways:
251
+
252
+ a) Convey the object code in, or embodied in, a physical product
253
+ (including a physical distribution medium), accompanied by the
254
+ Corresponding Source fixed on a durable physical medium
255
+ customarily used for software interchange.
256
+
257
+ b) Convey the object code in, or embodied in, a physical product
258
+ (including a physical distribution medium), accompanied by a
259
+ written offer, valid for at least three years and valid for as
260
+ long as you offer spare parts or customer support for that product
261
+ model, to give anyone who possesses the object code either (1) a
262
+ copy of the Corresponding Source for all the software in the
263
+ product that is covered by this License, on a durable physical
264
+ medium customarily used for software interchange, for a price no
265
+ more than your reasonable cost of physically performing this
266
+ conveying of source, or (2) access to copy the
267
+ Corresponding Source from a network server at no charge.
268
+
269
+ c) Convey individual copies of the object code with a copy of the
270
+ written offer to provide the Corresponding Source. This
271
+ alternative is allowed only occasionally and noncommercially, and
272
+ only if you received the object code with such an offer, in accord
273
+ with subsection 6b.
274
+
275
+ d) Convey the object code by offering access from a designated
276
+ place (gratis or for a charge), and offer equivalent access to the
277
+ Corresponding Source in the same way through the same place at no
278
+ further charge. You need not require recipients to copy the
279
+ Corresponding Source along with the object code. If the place to
280
+ copy the object code is a network server, the Corresponding Source
281
+ may be on a different server (operated by you or a third party)
282
+ that supports equivalent copying facilities, provided you maintain
283
+ clear directions next to the object code saying where to find the
284
+ Corresponding Source. Regardless of what server hosts the
285
+ Corresponding Source, you remain obligated to ensure that it is
286
+ available for as long as needed to satisfy these requirements.
287
+
288
+ e) Convey the object code using peer-to-peer transmission, provided
289
+ you inform other peers where the object code and Corresponding
290
+ Source of the work are being offered to the general public at no
291
+ charge under subsection 6d.
292
+
293
+ A separable portion of the object code, whose source code is excluded
294
+ from the Corresponding Source as a System Library, need not be
295
+ included in conveying the object code work.
296
+
297
+ A "User Product" is either (1) a "consumer product", which means any
298
+ tangible personal property which is normally used for personal, family,
299
+ or household purposes, or (2) anything designed or sold for incorporation
300
+ into a dwelling. In determining whether a product is a consumer product,
301
+ doubtful cases shall be resolved in favor of coverage. For a particular
302
+ product received by a particular user, "normally used" refers to a
303
+ typical or common use of that class of product, regardless of the status
304
+ of the particular user or of the way in which the particular user
305
+ actually uses, or expects or is expected to use, the product. A product
306
+ is a consumer product regardless of whether the product has substantial
307
+ commercial, industrial or non-consumer uses, unless such uses represent
308
+ the only significant mode of use of the product.
309
+
310
+ "Installation Information" for a User Product means any methods,
311
+ procedures, authorization keys, or other information required to install
312
+ and execute modified versions of a covered work in that User Product from
313
+ a modified version of its Corresponding Source. The information must
314
+ suffice to ensure that the continued functioning of the modified object
315
+ code is in no case prevented or interfered with solely because
316
+ modification has been made.
317
+
318
+ If you convey an object code work under this section in, or with, or
319
+ specifically for use in, a User Product, and the conveying occurs as
320
+ part of a transaction in which the right of possession and use of the
321
+ User Product is transferred to the recipient in perpetuity or for a
322
+ fixed term (regardless of how the transaction is characterized), the
323
+ Corresponding Source conveyed under this section must be accompanied
324
+ by the Installation Information. But this requirement does not apply
325
+ if neither you nor any third party retains the ability to install
326
+ modified object code on the User Product (for example, the work has
327
+ been installed in ROM).
328
+
329
+ The requirement to provide Installation Information does not include a
330
+ requirement to continue to provide support service, warranty, or updates
331
+ for a work that has been modified or installed by the recipient, or for
332
+ the User Product in which it has been modified or installed. Access to a
333
+ network may be denied when the modification itself materially and
334
+ adversely affects the operation of the network or violates the rules and
335
+ protocols for communication across the network.
336
+
337
+ Corresponding Source conveyed, and Installation Information provided,
338
+ in accord with this section must be in a format that is publicly
339
+ documented (and with an implementation available to the public in
340
+ source code form), and must require no special password or key for
341
+ unpacking, reading or copying.
342
+
343
+ 7. Additional Terms.
344
+
345
+ "Additional permissions" are terms that supplement the terms of this
346
+ License by making exceptions from one or more of its conditions.
347
+ Additional permissions that are applicable to the entire Program shall
348
+ be treated as though they were included in this License, to the extent
349
+ that they are valid under applicable law. If additional permissions
350
+ apply only to part of the Program, that part may be used separately
351
+ under those permissions, but the entire Program remains governed by
352
+ this License without regard to the additional permissions.
353
+
354
+ When you convey a copy of a covered work, you may at your option
355
+ remove any additional permissions from that copy, or from any part of
356
+ it. (Additional permissions may be written to require their own
357
+ removal in certain cases when you modify the work.) You may place
358
+ additional permissions on material, added by you to a covered work,
359
+ for which you have or can give appropriate copyright permission.
360
+
361
+ Notwithstanding any other provision of this License, for material you
362
+ add to a covered work, you may (if authorized by the copyright holders of
363
+ that material) supplement the terms of this License with terms:
364
+
365
+ a) Disclaiming warranty or limiting liability differently from the
366
+ terms of sections 15 and 16 of this License; or
367
+
368
+ b) Requiring preservation of specified reasonable legal notices or
369
+ author attributions in that material or in the Appropriate Legal
370
+ Notices displayed by works containing it; or
371
+
372
+ c) Prohibiting misrepresentation of the origin of that material, or
373
+ requiring that modified versions of such material be marked in
374
+ reasonable ways as different from the original version; or
375
+
376
+ d) Limiting the use for publicity purposes of names of licensors or
377
+ authors of the material; or
378
+
379
+ e) Declining to grant rights under trademark law for use of some
380
+ trade names, trademarks, or service marks; or
381
+
382
+ f) Requiring indemnification of licensors and authors of that
383
+ material by anyone who conveys the material (or modified versions of
384
+ it) with contractual assumptions of liability to the recipient, for
385
+ any liability that these contractual assumptions directly impose on
386
+ those licensors and authors.
387
+
388
+ All other non-permissive additional terms are considered "further
389
+ restrictions" within the meaning of section 10. If the Program as you
390
+ received it, or any part of it, contains a notice stating that it is
391
+ governed by this License along with a term that is a further
392
+ restriction, you may remove that term. If a license document contains
393
+ a further restriction but permits relicensing or conveying under this
394
+ License, you may add to a covered work material governed by the terms
395
+ of that license document, provided that the further restriction does
396
+ not survive such relicensing or conveying.
397
+
398
+ If you add terms to a covered work in accord with this section, you
399
+ must place, in the relevant source files, a statement of the
400
+ additional terms that apply to those files, or a notice indicating
401
+ where to find the applicable terms.
402
+
403
+ Additional terms, permissive or non-permissive, may be stated in the
404
+ form of a separately written license, or stated as exceptions;
405
+ the above requirements apply either way.
406
+
407
+ 8. Termination.
408
+
409
+ You may not propagate or modify a covered work except as expressly
410
+ provided under this License. Any attempt otherwise to propagate or
411
+ modify it is void, and will automatically terminate your rights under
412
+ this License (including any patent licenses granted under the third
413
+ paragraph of section 11).
414
+
415
+ However, if you cease all violation of this License, then your
416
+ license from a particular copyright holder is reinstated (a)
417
+ provisionally, unless and until the copyright holder explicitly and
418
+ finally terminates your license, and (b) permanently, if the copyright
419
+ holder fails to notify you of the violation by some reasonable means
420
+ prior to 60 days after the cessation.
421
+
422
+ Moreover, your license from a particular copyright holder is
423
+ reinstated permanently if the copyright holder notifies you of the
424
+ violation by some reasonable means, this is the first time you have
425
+ received notice of violation of this License (for any work) from that
426
+ copyright holder, and you cure the violation prior to 30 days after
427
+ your receipt of the notice.
428
+
429
+ Termination of your rights under this section does not terminate the
430
+ licenses of parties who have received copies or rights from you under
431
+ this License. If your rights have been terminated and not permanently
432
+ reinstated, you do not qualify to receive new licenses for the same
433
+ material under section 10.
434
+
435
+ 9. Acceptance Not Required for Having Copies.
436
+
437
+ You are not required to accept this License in order to receive or
438
+ run a copy of the Program. Ancillary propagation of a covered work
439
+ occurring solely as a consequence of using peer-to-peer transmission
440
+ to receive a copy likewise does not require acceptance. However,
441
+ nothing other than this License grants you permission to propagate or
442
+ modify any covered work. These actions infringe copyright if you do
443
+ not accept this License. Therefore, by modifying or propagating a
444
+ covered work, you indicate your acceptance of this License to do so.
445
+
446
+ 10. Automatic Licensing of Downstream Recipients.
447
+
448
+ Each time you convey a covered work, the recipient automatically
449
+ receives a license from the original licensors, to run, modify and
450
+ propagate that work, subject to this License. You are not responsible
451
+ for enforcing compliance by third parties with this License.
452
+
453
+ An "entity transaction" is a transaction transferring control of an
454
+ organization, or substantially all assets of one, or subdividing an
455
+ organization, or merging organizations. If propagation of a covered
456
+ work results from an entity transaction, each party to that
457
+ transaction who receives a copy of the work also receives whatever
458
+ licenses to the work the party's predecessor in interest had or could
459
+ give under the previous paragraph, plus a right to possession of the
460
+ Corresponding Source of the work from the predecessor in interest, if
461
+ the predecessor has it or can get it with reasonable efforts.
462
+
463
+ You may not impose any further restrictions on the exercise of the
464
+ rights granted or affirmed under this License. For example, you may
465
+ not impose a license fee, royalty, or other charge for exercise of
466
+ rights granted under this License, and you may not initiate litigation
467
+ (including a cross-claim or counterclaim in a lawsuit) alleging that
468
+ any patent claim is infringed by making, using, selling, offering for
469
+ sale, or importing the Program or any portion of it.
470
+
471
+ 11. Patents.
472
+
473
+ A "contributor" is a copyright holder who authorizes use under this
474
+ License of the Program or a work on which the Program is based. The
475
+ work thus licensed is called the contributor's "contributor version".
476
+
477
+ A contributor's "essential patent claims" are all patent claims
478
+ owned or controlled by the contributor, whether already acquired or
479
+ hereafter acquired, that would be infringed by some manner, permitted
480
+ by this License, of making, using, or selling its contributor version,
481
+ but do not include claims that would be infringed only as a
482
+ consequence of further modification of the contributor version. For
483
+ purposes of this definition, "control" includes the right to grant
484
+ patent sublicenses in a manner consistent with the requirements of
485
+ this License.
486
+
487
+ Each contributor grants you a non-exclusive, worldwide, royalty-free
488
+ patent license under the contributor's essential patent claims, to
489
+ make, use, sell, offer for sale, import and otherwise run, modify and
490
+ propagate the contents of its contributor version.
491
+
492
+ In the following three paragraphs, a "patent license" is any express
493
+ agreement or commitment, however denominated, not to enforce a patent
494
+ (such as an express permission to practice a patent or covenant not to
495
+ sue for patent infringement). To "grant" such a patent license to a
496
+ party means to make such an agreement or commitment not to enforce a
497
+ patent against the party.
498
+
499
+ If you convey a covered work, knowingly relying on a patent license,
500
+ and the Corresponding Source of the work is not available for anyone
501
+ to copy, free of charge and under the terms of this License, through a
502
+ publicly available network server or other readily accessible means,
503
+ then you must either (1) cause the Corresponding Source to be so
504
+ available, or (2) arrange to deprive yourself of the benefit of the
505
+ patent license for this particular work, or (3) arrange, in a manner
506
+ consistent with the requirements of this License, to extend the patent
507
+ license to downstream recipients. "Knowingly relying" means you have
508
+ actual knowledge that, but for the patent license, your conveying the
509
+ covered work in a country, or your recipient's use of the covered work
510
+ in a country, would infringe one or more identifiable patents in that
511
+ country that you have reason to believe are valid.
512
+
513
+ If, pursuant to or in connection with a single transaction or
514
+ arrangement, you convey, or propagate by procuring conveyance of, a
515
+ covered work, and grant a patent license to some of the parties
516
+ receiving the covered work authorizing them to use, propagate, modify
517
+ or convey a specific copy of the covered work, then the patent license
518
+ you grant is automatically extended to all recipients of the covered
519
+ work and works based on it.
520
+
521
+ A patent license is "discriminatory" if it does not include within
522
+ the scope of its coverage, prohibits the exercise of, or is
523
+ conditioned on the non-exercise of one or more of the rights that are
524
+ specifically granted under this License. You may not convey a covered
525
+ work if you are a party to an arrangement with a third party that is
526
+ in the business of distributing software, under which you make payment
527
+ to the third party based on the extent of your activity of conveying
528
+ the work, and under which the third party grants, to any of the
529
+ parties who would receive the covered work from you, a discriminatory
530
+ patent license (a) in connection with copies of the covered work
531
+ conveyed by you (or copies made from those copies), or (b) primarily
532
+ for and in connection with specific products or compilations that
533
+ contain the covered work, unless you entered into that arrangement,
534
+ or that patent license was granted, prior to 28 March 2007.
535
+
536
+ Nothing in this License shall be construed as excluding or limiting
537
+ any implied license or other defenses to infringement that may
538
+ otherwise be available to you under applicable patent law.
539
+
540
+ 12. No Surrender of Others' Freedom.
541
+
542
+ If conditions are imposed on you (whether by court order, agreement or
543
+ otherwise) that contradict the conditions of this License, they do not
544
+ excuse you from the conditions of this License. If you cannot convey a
545
+ covered work so as to satisfy simultaneously your obligations under this
546
+ License and any other pertinent obligations, then as a consequence you may
547
+ not convey it at all. For example, if you agree to terms that obligate you
548
+ to collect a royalty for further conveying from those to whom you convey
549
+ the Program, the only way you could satisfy both those terms and this
550
+ License would be to refrain entirely from conveying the Program.
551
+
552
+ 13. Use with the GNU Affero General Public License.
553
+
554
+ Notwithstanding any other provision of this License, you have
555
+ permission to link or combine any covered work with a work licensed
556
+ under version 3 of the GNU Affero General Public License into a single
557
+ combined work, and to convey the resulting work. The terms of this
558
+ License will continue to apply to the part which is the covered work,
559
+ but the special requirements of the GNU Affero General Public License,
560
+ section 13, concerning interaction through a network will apply to the
561
+ combination as such.
562
+
563
+ 14. Revised Versions of this License.
564
+
565
+ The Free Software Foundation may publish revised and/or new versions of
566
+ the GNU General Public License from time to time. Such new versions will
567
+ be similar in spirit to the present version, but may differ in detail to
568
+ address new problems or concerns.
569
+
570
+ Each version is given a distinguishing version number. If the
571
+ Program specifies that a certain numbered version of the GNU General
572
+ Public License "or any later version" applies to it, you have the
573
+ option of following the terms and conditions either of that numbered
574
+ version or of any later version published by the Free Software
575
+ Foundation. If the Program does not specify a version number of the
576
+ GNU General Public License, you may choose any version ever published
577
+ by the Free Software Foundation.
578
+
579
+ If the Program specifies that a proxy can decide which future
580
+ versions of the GNU General Public License can be used, that proxy's
581
+ public statement of acceptance of a version permanently authorizes you
582
+ to choose that version for the Program.
583
+
584
+ Later license versions may give you additional or different
585
+ permissions. However, no additional obligations are imposed on any
586
+ author or copyright holder as a result of your choosing to follow a
587
+ later version.
588
+
589
+ 15. Disclaimer of Warranty.
590
+
591
+ THERE IS NO WARRANTY FOR THE PROGRAM, TO THE EXTENT PERMITTED BY
592
+ APPLICABLE LAW. EXCEPT WHEN OTHERWISE STATED IN WRITING THE COPYRIGHT
593
+ HOLDERS AND/OR OTHER PARTIES PROVIDE THE PROGRAM "AS IS" WITHOUT WARRANTY
594
+ OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING, BUT NOT LIMITED TO,
595
+ THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
596
+ PURPOSE. THE ENTIRE RISK AS TO THE QUALITY AND PERFORMANCE OF THE PROGRAM
597
+ IS WITH YOU. SHOULD THE PROGRAM PROVE DEFECTIVE, YOU ASSUME THE COST OF
598
+ ALL NECESSARY SERVICING, REPAIR OR CORRECTION.
599
+
600
+ 16. Limitation of Liability.
601
+
602
+ IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING
603
+ WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MODIFIES AND/OR CONVEYS
604
+ THE PROGRAM AS PERMITTED ABOVE, BE LIABLE TO YOU FOR DAMAGES, INCLUDING ANY
605
+ GENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING OUT OF THE
606
+ USE OR INABILITY TO USE THE PROGRAM (INCLUDING BUT NOT LIMITED TO LOSS OF
607
+ DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY YOU OR THIRD
608
+ PARTIES OR A FAILURE OF THE PROGRAM TO OPERATE WITH ANY OTHER PROGRAMS),
609
+ EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE POSSIBILITY OF
610
+ SUCH DAMAGES.
611
+
612
+ 17. Interpretation of Sections 15 and 16.
613
+
614
+ If the disclaimer of warranty and limitation of liability provided
615
+ above cannot be given local legal effect according to their terms,
616
+ reviewing courts shall apply local law that most closely approximates
617
+ an absolute waiver of all civil liability in connection with the
618
+ Program, unless a warranty or assumption of liability accompanies a
619
+ copy of the Program in return for a fee.
620
+
621
+ END OF TERMS AND CONDITIONS
622
+
623
+ How to Apply These Terms to Your New Programs
624
+
625
+ If you develop a new program, and you want it to be of the greatest
626
+ possible use to the public, the best way to achieve this is to make it
627
+ free software which everyone can redistribute and change under these terms.
628
+
629
+ To do so, attach the following notices to the program. It is safest
630
+ to attach them to the start of each source file to most effectively
631
+ state the exclusion of warranty; and each file should have at least
632
+ the "copyright" line and a pointer to where the full notice is found.
633
+
634
+ <one line to give the program's name and a brief idea of what it does.>
635
+ Copyright (C) <year> <name of author>
636
+
637
+ This program is free software: you can redistribute it and/or modify
638
+ it under the terms of the GNU General Public License as published by
639
+ the Free Software Foundation, either version 3 of the License, or
640
+ (at your option) any later version.
641
+
642
+ This program is distributed in the hope that it will be useful,
643
+ but WITHOUT ANY WARRANTY; without even the implied warranty of
644
+ MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
645
+ GNU General Public License for more details.
646
+
647
+ You should have received a copy of the GNU General Public License
648
+ along with this program. If not, see <https://www.gnu.org/licenses/>.
649
+
650
+ Also add information on how to contact you by electronic and paper mail.
651
+
652
+ If the program does terminal interaction, make it output a short
653
+ notice like this when it starts in an interactive mode:
654
+
655
+ <program> Copyright (C) <year> <name of author>
656
+ This program comes with ABSOLUTELY NO WARRANTY; for details type `show w'.
657
+ This is free software, and you are welcome to redistribute it
658
+ under certain conditions; type `show c' for details.
659
+
660
+ The hypothetical commands `show w' and `show c' should show the appropriate
661
+ parts of the General Public License. Of course, your program's commands
662
+ might be different; for a GUI interface, you would use an "about box".
663
+
664
+ You should also get your employer (if you work as a programmer) or school,
665
+ if any, to sign a "copyright disclaimer" for the program, if necessary.
666
+ For more information on this, and how to apply and follow the GNU GPL, see
667
+ <https://www.gnu.org/licenses/>.
668
+
669
+ The GNU General Public License does not permit incorporating your program
670
+ into proprietary programs. If your program is a subroutine library, you
671
+ may consider it more useful to permit linking proprietary applications with
672
+ the library. If this is what you want to do, use the GNU Lesser General
673
+ Public License instead of this License. But first, please read
674
+ <https://www.gnu.org/licenses/why-not-lgpl.html>.
README.md CHANGED
@@ -1,13 +1,267 @@
1
- ---
2
- title: Surveyor
3
- emoji: 📊
4
- colorFrom: gray
5
- colorTo: pink
6
- sdk: streamlit
7
- sdk_version: 1.9.0
8
- app_file: app.py
9
- pinned: false
10
- license: apache-2.0
11
- ---
12
-
13
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces#reference
1
+ # Auto-Research
2
+ ![Auto-Research][logo]
3
+
4
+ [logo]: https://github.com/sidphbot/Auto-Research/blob/main/logo.png
5
+ A no-code utility to generate a detailed, well-cited survey with topic-clustered sections (draft paper format) and other interesting artifacts from a single research query.
6
+
7
+ Data Provider: [arXiv](https://arxiv.org/) Open Archive Initiative OAI
8
+
9
+ Requirements:
10
+ - python 3.7 or above
11
+ - poppler-utils - `sudo apt-get install build-essential libpoppler-cpp-dev pkg-config python-dev`
12
+ - list of requirements in requirements.txt - `cat requirements.txt | xargs pip install`
13
+ - 8GB disk space
14
+ 13GB CUDA (GPU) memory - for a survey of 100 searched papers (max_search) and 25 selected papers (num_papers)
15
+
16
+ #### Demo :
17
+
18
+ Video Demo : https://drive.google.com/file/d/1-77J2L10lsW-bFDOGdTaPzSr_utY743g/view?usp=sharing
19
+
20
+ Kaggle Re-usable Demo : https://www.kaggle.com/sidharthpal/auto-research-generate-survey-from-query
21
+
22
+ (`[TIP]` click 'edit and run' to run the demo for your custom queries on a free GPU)
23
+
24
+
25
+ #### Steps to run (pip coming soon):
26
+ ```
27
+ apt install -y poppler-utils libpoppler-cpp-dev
28
+ git clone https://github.com/sidphbot/Auto-Research.git
29
+
30
+ cd Auto-Research/
31
+ pip install -r requirements.txt
32
+ python survey.py [options] <your_research_query>
33
+ ```
34
+
35
+ #### Artifacts generated (zipped):
36
+ - Detailed survey draft paper as txt file
37
+ - A curated list of top 25+ papers as pdfs and txts
38
+ Images extracted from the above papers as jpegs, bmps, etc.
39
+ Heading/section-wise highlights extracted from the above papers as a re-usable pure python joblib dump
40
+ Tables extracted from papers (optional)
41
+ - Corpus of metadata highlights/text of top 100 papers as a re-usable pure python joblib dump
42
+
43
+ ## Example run #1 - python utility
44
+
45
+ ```
46
+ python survey.py 'multi-task representation learning'
47
+ ```
48
+
49
+ ## Example run #2 - python class
50
+
51
+ ```
52
+ from survey import Surveyor
53
+ mysurveyor = Surveyor()
54
+ mysurveyor.survey('quantum entanglement')
55
+ ```
56
+
57
+ ### Research tools:
58
+
59
+ These are independent tools for your research or document text handling needs (a short usage sketch follows the list below).
60
+
61
+ ```
62
+ *[Tip]* : (models can be changed in defaults or passed in during init along with `refresh_models=True`)
63
+ ```
64
+
65
+ - `abstractive_summary` - takes a long text document (`string`) and returns a 1-paragraph abstract or “abstractive” summary (`string`)
66
+
67
+ Input:
68
+
69
+ `longtext` : string
70
+
71
+ Returns:
72
+
73
+ `summary` : string
74
+
75
+ - `extractive_summary` - takes a long text document (`string`) and returns a 1-paragraph “extractive” summary of extracted highlights (`string`)
76
+
77
+ Input:
78
+
79
+ `longtext` : string
80
+
81
+ Returns:
82
+
83
+ `summary` : string
84
+
85
+ - `generate_title` - takes a long text document (`string`) and returns a generated title (`string`)
86
+
87
+ Input:
88
+
89
+ `longtext` : string
90
+
91
+ Returns:
92
+
93
+ `title` : string
94
+
95
+ - `extractive_highlights` - takes a long text document (`string`) and returns a list of extracted highlights (`[string]`), a list of keywords (`[string]`) and key phrases (`[string]`)
96
+
97
+ Input:
98
+
99
+ `longtext` : string
100
+
101
+ Returns:
102
+
103
+ `highlights` : [string]
104
+ `keywords` : [string]
105
+ `keyphrases` : [string]
106
+
107
+ - `extract_images_from_file` - takes a pdf file name (`string`) and returns a list of image filenames (`[string]`).
108
+
109
+ Input:
110
+
111
+ `pdf_file` : string
112
+
113
+ Returns:
114
+
115
+ `images_files` : [string]
116
+
117
+ - `extract_tables_from_file` - takes a pdf file name (`string`) and returns a list of csv filenames (`[string]`).
118
+
119
+ Input:
120
+
121
+ `pdf_file` : string
122
+
123
+ Returns:
124
+
125
+ `csv_files` : [string]
126
+
127
+ - `cluster_lines` - takes a list of lines (`[string]`) and returns the topic-clustered sections (`dict(generated_title: [cluster_abstract])`) and clustered lines (`dict(cluster_id: [cluster_lines])`)
128
+
129
+ Input:
130
+
131
+ `lines` : [string]
132
+
133
+ Returns:
134
+
135
+ `sections` : dict(generated_title: [cluster_abstract])
136
+ `clusters` : dict(cluster_id: [cluster_lines])
137
+
138
+ - `extract_headings` - *[for scientific texts - Assumes an ‘abstract’ heading present]* takes a text file name (`string`) and returns a list of headings (`[string]`) and refined lines (`[string]`).
139
+
140
+ `[Tip 1]` : Use `extract_sections` as a wrapper (e.g. `extract_sections(extract_headings("/path/to/textfile"))`) to get heading-wise sectioned text with refined lines instead (`dict( heading: text)`)
141
+
142
+ `[Tip 2]` : write the word ‘abstract’ at the start of the file text to get an extraction for non-scientific texts as well !!
143
+
144
+ Input:
145
+
146
+ `text_file` : string
147
+
148
+ Returns:
149
+
150
+ `refined` : [string],
151
+ `headings` : [string]
152
+ `sectioned_doc` : dict( heading: text) (Optional - Wrapper case)
153
+
154
+
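Below is a minimal usage sketch of a few of these tools, assuming they are exposed as methods on an initialized `Surveyor` object (as the tip above implies); the file path and variable names here are illustrative placeholders, not part of the commit:

```
from survey import Surveyor   # import as in Example run #2 above

surveyor = Surveyor()
longtext = open("paper.txt").read()       # placeholder input document

summary = surveyor.abstractive_summary(longtext)                              # string
title = surveyor.generate_title(longtext)                                     # string
highlights, keywords, keyphrases = surveyor.extractive_highlights(longtext)   # lists of strings

# heading-wise sectioned text for a scientific text file (see Tip 1 under extract_headings)
sections = surveyor.extract_sections(surveyor.extract_headings("paper.txt"))
```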
155
+ ## Access/Modify defaults:
156
+
157
+ - inside code
158
+ ```
159
+ from survey.Surveyor import DEFAULTS
160
+ from pprint import pprint
161
+
162
+ pprint(DEFAULTS)
163
+ ```
164
+ or,
165
+
166
+ - Modify static config file - `defaults.py`
167
+
168
+ or,
169
+
170
+ - At runtime (utility)
171
+
172
+ ```
173
+ python survey.py --help
174
+ ```
175
+ ```
176
+ usage: survey.py [-h] [--max_search max_metadata_papers]
177
+ [--num_papers max_num_papers] [--pdf_dir pdf_dir]
178
+ [--txt_dir txt_dir] [--img_dir img_dir] [--tab_dir tab_dir]
179
+ [--dump_dir dump_dir] [--models_dir save_models_dir]
180
+ [--title_model_name title_model_name]
181
+ [--ex_summ_model_name extractive_summ_model_name]
182
+ [--ledmodel_name ledmodel_name]
183
+ [--embedder_name sentence_embedder_name]
184
+ [--nlp_name spacy_model_name]
185
+ [--similarity_nlp_name similarity_nlp_name]
186
+ [--kw_model_name kw_model_name]
187
+ [--refresh_models refresh_models] [--high_gpu high_gpu]
188
+ query_string
189
+
190
+ Generate a survey just from a query !!
191
+
192
+ positional arguments:
193
+ query_string your research query/keywords
194
+
195
+ optional arguments:
196
+ -h, --help show this help message and exit
197
+ --max_search max_metadata_papers
198
+ maximium number of papers to gaze at - defaults to 100
199
+ --num_papers max_num_papers
200
+ maximium number of papers to download and analyse -
201
+ defaults to 25
202
+ --pdf_dir pdf_dir pdf paper storage directory - defaults to
203
+ arxiv_data/tarpdfs/
204
+ --txt_dir txt_dir text-converted paper storage directory - defaults to
205
+ arxiv_data/fulltext/
206
+ --img_dir img_dir image storage directory - defaults to
207
+ arxiv_data/images/
208
+ --tab_dir tab_dir tables storage directory - defaults to
209
+ arxiv_data/tables/
210
+ --dump_dir dump_dir all_output_dir - defaults to arxiv_dumps/
211
+ --models_dir save_models_dir
212
+ directory to save models (> 5GB) - defaults to
213
+ saved_models/
214
+ --title_model_name title_model_name
215
+ title model name/tag in hugging-face, defaults to
216
+ 'Callidior/bert2bert-base-arxiv-titlegen'
217
+ --ex_summ_model_name extractive_summ_model_name
218
+ extractive summary model name/tag in hugging-face,
219
+ defaults to 'allenai/scibert_scivocab_uncased'
220
+ --ledmodel_name ledmodel_name
221
+ led model(for abstractive summary) name/tag in
222
+ hugging-face, defaults to 'allenai/led-
223
+ large-16384-arxiv'
224
+ --embedder_name sentence_embedder_name
225
+ sentence embedder name/tag in hugging-face, defaults
226
+ to 'paraphrase-MiniLM-L6-v2'
227
+ --nlp_name spacy_model_name
228
+ spacy model name/tag in hugging-face (if changed -
229
+ needs to be spacy-installed prior), defaults to
230
+ 'en_core_sci_scibert'
231
+ --similarity_nlp_name similarity_nlp_name
232
+ spacy downstream model(for similarity) name/tag in
233
+ hugging-face (if changed - needs to be spacy-installed
234
+ prior), defaults to 'en_core_sci_lg'
235
+ --kw_model_name kw_model_name
236
+ keyword extraction model name/tag in hugging-face,
237
+ defaults to 'distilbert-base-nli-mean-tokens'
238
+ --refresh_models refresh_models
239
+ Refresh model downloads with given names (needs
240
+ atleast one model name param above), defaults to False
241
+ --high_gpu high_gpu High GPU usage permitted, defaults to False
242
+
243
+ ```
244
+
245
+ - At runtime (code)
246
+
247
+ > during surveyor object initialization with `surveyor_obj = Surveyor()` (see the sketch after this list)
248
+ - `pdf_dir`: String, pdf paper storage directory - defaults to `arxiv_data/tarpdfs/`
249
+ - `txt_dir`: String, text-converted paper storage directory - defaults to `arxiv_data/fulltext/`
250
+ - `img_dir`: String, image storage directory - defaults to `arxiv_data/images/`
251
+ - `tab_dir`: String, tables storage directory - defaults to `arxiv_data/tables/`
252
+ - `dump_dir`: String, all_output_dir - defaults to `arxiv_dumps/`
253
+ - `models_dir`: String, directory to save huge models (> 5GB), defaults to `saved_models/`
254
+ - `title_model_name`: String, title model name/tag in hugging-face, defaults to `Callidior/bert2bert-base-arxiv-titlegen`
255
+ - `ex_summ_model_name`: String, extractive summary model name/tag in hugging-face, defaults to `allenai/scibert_scivocab_uncased`
256
+ - `ledmodel_name`: String, led model(for abstractive summary) name/tag in hugging-face, defaults to `allenai/led-large-16384-arxiv`
257
+ - `embedder_name`: String, sentence embedder name/tag in hugging-face, defaults to `paraphrase-MiniLM-L6-v2`
258
+ - `nlp_name`: String, spacy model name/tag in hugging-face (if changed - needs to be spacy-installed prior), defaults to `en_core_sci_scibert`
259
+ - `similarity_nlp_name`: String, spacy downstream trained model(for similarity) name/tag in hugging-face (if changed - needs to be spacy-installed prior), defaults to `en_core_sci_lg`
260
+ - `kw_model_name`: String, keyword extraction model name/tag in hugging-face, defaults to `distilbert-base-nli-mean-tokens`
261
+ - `high_gpu`: Bool, High GPU usage permitted, defaults to `False`
262
+ - `refresh_models`: Bool, Refresh model downloads with given names (needs at least one model name param above), defaults to `False`
263
+
264
+ > during survey generation with `surveyor_obj.survey(query="my_research_query")`
265
+ - `max_search`: int, maximum number of papers to gaze at - defaults to `100`
266
+ - `num_papers`: int, maximum number of papers to download and analyse - defaults to `25`
267
+
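The sketch below combines the two groups of options above; it is illustrative only (the directory values shown are simply the documented defaults, and any of the listed parameters can be swapped in):

```
from survey import Surveyor

surveyor_obj = Surveyor(
    pdf_dir='arxiv_data/tarpdfs/',   # downloaded pdfs
    dump_dir='arxiv_dumps/',         # zipped output artifacts
    models_dir='saved_models/',      # > 5GB of saved models
    high_gpu=False,
)

# survey-time options: search up to 50 papers, select and analyse the top 10
surveyor_obj.survey("my_research_query", max_search=50, num_papers=10)
```

The command-line equivalent would be along the lines of `python survey.py --max_search 50 --num_papers 10 'my_research_query'`, using the flags documented in the usage text above.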
app.py ADDED
@@ -0,0 +1,58 @@
1
+ import streamlit as st
2
+ import pandas as pd
3
+ import numpy as np
4
+
5
+ #from src.Surveyor import Surveyor
6
+
7
+ def run_survey(surveyor, research_keywords, max_search, num_papers):
8
+ zip_file_name, survey_file_name = surveyor.survey(research_keywords,
9
+ max_search=max_search,
10
+ num_papers=num_papers
11
+ )
12
+
13
+ with open(str(zip_file_name), "rb") as file:
14
+ btn = st.download_button(
15
+ label="Download extracted topic-clustered-highlights, images and tables as zip",
16
+ data=file,
17
+ file_name=str(zip_file_name)
18
+ )
19
+
20
+ with open(str(survey_file_name), "rb") as file:
21
+ btn = st.download_button(
22
+ label="Download detailed generated survey file",
23
+ data=file,
24
+ file_name=str(survey_file_name)
25
+ )
26
+
27
+ with open(str(survey_file_name), "rb") as file:
28
+ btn = st.download_button(
29
+ label="Download detailed generated survey file",
30
+ data=file,
31
+ file_name=str(survey_file_name)
32
+ )
33
+ st.write(file.readlines())
34
+
35
+
36
+ def survey_space():
37
+
38
+ st.title('Automated Survey generation from research keywords - Auto-Research V0.1')
39
+
40
+ form = st.sidebar.form(key='survey_form')
41
+ research_keywords = form.text_input("What would you like to research today?")
42
+ max_search = form.number_input("num_papers_to_search", help="maximum number of papers to glance through - defaults to 20",
43
+ min_value=1, max_value=60, value=20, step=1, key='max_search')
44
+ num_papers = form.number_input("num_papers_to_select", help="maximum number of papers to select and analyse - defaults to 8",
45
+ min_value=1, max_value=25, value=8, step=1, key='num_papers')
46
+ submit = form.form_submit_button('Submit')
47
+
48
+ if submit:
49
+ st.write("hello")
50
+ #if surveyor_obj is None:
51
+ # surveyor_obj = Surveyor()
52
+ #run_survey(surveyor_obj, research_keywords, max_search, num_papers)
53
+
54
+
55
+ if __name__ == '__main__':
56
+ global surveyor_obj
57
+ surveyor_obj = None
58
+ survey_space()
arxiv_public_data/__init__.py ADDED
File without changes
arxiv_public_data/__pycache__/__init__.cpython-310.pyc ADDED
Binary file (148 Bytes). View file
arxiv_public_data/__pycache__/config.cpython-310.pyc ADDED
Binary file (1.44 kB). View file
arxiv_public_data/__pycache__/fixunicode.cpython-310.pyc ADDED
Binary file (2.46 kB). View file
arxiv_public_data/__pycache__/fulltext.cpython-310.pyc ADDED
Binary file (8.32 kB). View file
arxiv_public_data/__pycache__/internal_citations.cpython-310.pyc ADDED
Binary file (4.27 kB). View file
arxiv_public_data/__pycache__/pdfstamp.cpython-310.pyc ADDED
Binary file (1.73 kB). View file
arxiv_public_data/__pycache__/regex_arxiv.cpython-310.pyc ADDED
Binary file (4.4 kB). View file
arxiv_public_data/authors.py ADDED
@@ -0,0 +1,469 @@
1
+ # https://github.com/arXiv/arxiv-base@32e6ad0
2
+ """
3
+ Copyright 2017 Cornell University
4
+
5
+ Permission is hereby granted, free of charge, to any person obtaining a copy of
6
+ this software and associated documentation files (the "Software"), to deal in
7
+ the Software without restriction, including without limitation the rights to
8
+ use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies
9
+ of the Software, and to permit persons to whom the Software is furnished to do
10
+ so, subject to the following conditions:
11
+
12
+ The above copyright notice and this permission notice shall be included in all
13
+ copies or substantial portions of the Software.
14
+
15
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21
+ SOFTWARE.
22
+ """
23
+
24
+ """Parse Authors lines to extract author and affiliation data."""
25
+ import re
26
+ import os
27
+ import gzip
28
+ import json
29
+ from itertools import dropwhile
30
+ from typing import Dict, Iterator, List, Tuple
31
+ from multiprocessing import Pool, cpu_count
32
+
33
+ from arxiv_public_data.tex2utf import tex2utf
34
+ from arxiv_public_data.config import LOGGER, DIR_OUTPUT
35
+
36
+ logger = LOGGER.getChild('authorsplit')
37
+
38
+ PREFIX_MATCH = 'van|der|de|la|von|del|della|da|mac|ter|dem|di|vaziri'
39
+
40
+ """
41
+ Takes data from an Author: line in the current arXiv abstract
42
+ file and returns a structured set of data:
43
+
44
+ author_list_ptr = [
45
+ [ author1_keyname, author1_firstnames, author1_suffix, affil1, affil2 ] ,
46
+ [ author2_keyname, author2_firstnames, author1_suffix, affil1 ] ,
47
+ [ author3_keyname, author3_firstnames, author1_suffix ]
48
+ ]
49
+
50
+ Abstracted from Dienst software for OAI1 and other uses. This
51
+ routine should just go away when a better metadata structure is
52
+ adopted that deals with names and affiliations properly.
53
+
54
+ Must remember that there is at least one person on the archive
55
+ who has only one name, this should clearly be considered the key name.
56
+
57
+ Code originally written by Christina Scovel, Simeon Warner Dec99/Jan00
58
+ 2000-10-16 - separated.
59
+ 2000-12-07 - added support for suffix
60
+ 2003-02-14 - get surname prefixes from arXiv::Filters::Index [Simeon]
61
+ 2007-10-01 - created test script, some tidying [Simeon]
62
+ 2018-05-25 - Translated from Perl to Python [Brian C.]
63
+ """
64
+
65
+
66
+ def parse_author_affil(authors: str) -> List[List[str]]:
67
+ """
68
+ Parse author line and returns an list of author and affiliation data.
69
+
70
+ The list for each author will have at least three elements for
71
+ keyname, firstname(s) and suffix. The keyname will always have content
72
+ but the other strings might be empty strings if there is no firstname
73
+ or suffix. Any additional elements after the first three are affiliations;
74
+ there may be zero or more.
75
+
76
+ Handling of prefix "XX collaboration" etc. is duplicated here and in
77
+ arXiv::HTML::AuthorLink -- it shouldn't be. Likely should just be here.
78
+
79
+ This routine is just a wrapper around the two parts that first split
80
+ the authors line into parts, and then back propagate the affiliations.
81
+ The first part is to be used alone for display where we do not want
82
+ to back propagate affiliation information.
83
+
84
+ :param authors: string of authors from abs file or similar
85
+ :return:
86
+ Returns a structured set of data:
87
+ author_list_ptr = [
88
+ [ author1_keyname, author1_firstnames, author1_suffix, affil1, affil2 ],
89
+ [ author2_keyname, author2_firstnames, author1_suffix, affil1 ] ,
90
+ [ author3_keyname, author3_firstnames, author1_suffix ]
91
+ ]
92
+ """
93
+ return _parse_author_affil_back_propagate(
94
+ **_parse_author_affil_split(authors))
95
+
96
+
97
+ def _parse_author_affil_split(author_line: str) -> Dict:
98
+ """
99
+ Split author line into author and affiliation data.
100
+
101
+ Take author line, tidy spacing and punctuation, and then split up into
102
+ individual author an affiliation data. Has special cases to avoid splitting
103
+ an initial collaboration name and records in $back_propagate_affiliation_to
104
+ the fact that affiliations should not be back propagated to collaboration
105
+ names.
106
+
107
+ Does not handle multiple collaboration names.
108
+ """
109
+ if not author_line:
110
+ return {'author_list': [], 'back_prop': 0}
111
+
112
+ names: List[str] = split_authors(author_line)
113
+ if not names:
114
+ return {'author_list': [], 'back_prop': 0}
115
+
116
+ names = _remove_double_commas(names)
117
+ # get rid of commas at back
118
+ namesIter: Iterator[str] = reversed(
119
+ list(dropwhile(lambda x: x == ',', reversed(names))))
120
+ # get rid of commas at front
121
+ names = list(dropwhile(lambda x: x == ',', namesIter))
122
+
123
+ # Extract all names (all parts not starting with comma or paren)
124
+ names = list(map(_tidy_name, filter(
125
+ lambda x: re.match('^[^](,]', x), names)))
126
+ names = list(filter(lambda n: not re.match(
127
+ r'^\s*et\.?\s+al\.?\s*', n, flags=re.IGNORECASE), names))
128
+
129
+ (names, author_list,
130
+ back_propagate_affiliations_to) = _collaboration_at_start(names)
131
+
132
+ (enumaffils) = _enum_collaboration_at_end(author_line)
133
+
134
+ # Split name into keyname and firstnames/initials.
135
+ # Deal with different patterns in turn: prefixes, suffixes, plain
136
+ # and single name.
137
+ patterns = [('double-prefix',
138
+ r'^(.*)\s+(' + PREFIX_MATCH + r')\s(' +
139
+ PREFIX_MATCH + r')\s(\S+)$'),
140
+ ('name-prefix-name',
141
+ r'^(.*)\s+(' + PREFIX_MATCH + r')\s(\S+)$'),
142
+ ('name-name-prefix',
143
+ r'^(.*)\s+(\S+)\s(I|II|III|IV|V|Sr|Jr|Sr\.|Jr\.)$'),
144
+ ('name-name',
145
+ r'^(.*)\s+(\S+)$'), ]
146
+
147
+ # Now go through names in turn and try to get affiliations
148
+ # to go with them
149
+ for name in names:
150
+ pattern_matches = ((mtype, re.match(m, name, flags=re.IGNORECASE))
151
+ for (mtype, m) in patterns)
152
+
153
+ (mtype, match) = next(((mtype, m)
154
+ for (mtype, m) in pattern_matches
155
+ if m is not None), ('default', None))
156
+ if match is None:
157
+ author_entry = [name, '', '']
158
+ elif mtype == 'double-prefix':
159
+ s = '{} {} {}'.format(match.group(
160
+ 2), match.group(3), match.group(4))
161
+ author_entry = [s, match.group(1), '']
162
+ elif mtype == 'name-prefix-name':
163
+ s = '{} {}'.format(match.group(2), match.group(3))
164
+ author_entry = [s, match.group(1), '']
165
+ elif mtype == 'name-name-prefix':
166
+ author_entry = [match.group(2), match.group(1), match.group(3)]
167
+ elif mtype == 'name-name':
168
+ author_entry = [match.group(2), match.group(1), '']
169
+ else:
170
+ author_entry = [name, '', '']
171
+
172
+ # search back in author_line for affiliation
173
+ author_entry = _add_affiliation(
174
+ author_line, enumaffils, author_entry, name)
175
+ author_list.append(author_entry)
176
+
177
+ return {'author_list': author_list,
178
+ 'back_prop': back_propagate_affiliations_to}
179
+
180
+
181
+ def parse_author_affil_utf(authors: str) -> List:
182
+ """
183
+ Call parse_author_affil() and do TeX to UTF conversion.
184
+
185
+ Output structure is the same but should be in UTF and not TeX.
186
+ """
187
+ if not authors:
188
+ return []
189
+ return list(map(lambda author: list(map(tex2utf, author)),
190
+ parse_author_affil(authors)))
191
+
192
+
193
+ def _remove_double_commas(items: List[str]) -> List[str]:
194
+
195
+ parts: List[str] = []
196
+ last = ''
197
+ for pt in items:
198
+ if pt == ',' and last == ',':
199
+ continue
200
+ else:
201
+ parts.append(pt)
202
+ last = pt
203
+ return parts
204
+
205
+
206
+ def _tidy_name(name: str) -> str:
207
+ name = re.sub(r'\s\s+', ' ', name) # also gets rid of CR
208
+ # add space after dot (except in TeX)
209
+ name = re.sub(r'(?<!\\)\.(\S)', r'. \g<1>', name)
210
+ return name
211
+
212
+
213
+ def _collaboration_at_start(names: List[str]) \
214
+ -> Tuple[List[str], List[List[str]], int]:
215
+ """Perform special handling of collaboration at start."""
216
+ author_list = []
217
+
218
+ back_propagate_affiliations_to = 0
219
+ while len(names) > 0:
220
+ m = re.search(r'([a-z0-9\s]+\s+(collaboration|group|team))',
221
+ names[0], flags=re.IGNORECASE)
222
+ if not m:
223
+ break
224
+
225
+ # Add to author list
226
+ author_list.append([m.group(1), '', ''])
227
+ back_propagate_affiliations_to += 1
228
+ # Remove from names
229
+ names.pop(0)
230
+ # Also swallow any following comma or colon
231
+ if names and (names[0] == ',' or names[0] == ':'):
232
+ names.pop(0)
233
+
234
+ return names, author_list, back_propagate_affiliations_to
235
+
236
+
237
+ def _enum_collaboration_at_end(author_line: str)->Dict:
238
+ """Get separate set of enumerated affiliations from end of author_line."""
239
+ # Now see if we have a separate set of enumerated affiliations
240
+ # This is indicated by finding '(\s*('
241
+ line_m = re.search(r'\(\s*\((.*)$', author_line)
242
+ if not line_m:
243
+ return {}
244
+
245
+ enumaffils = {}
246
+ affils = re.sub(r'\s*\)\s*$', '', line_m.group(1))
247
+
248
+ # Now expect to have '1) affil1 (2) affil2 (3) affil3'
249
+ for affil in affils.split('('):
250
+ # Now expect `1) affil1 ', discard if no match
251
+ m = re.match(r'^(\d+)\)\s*(\S.*\S)\s*$', affil)
252
+ if m:
253
+ enumaffils[m.group(1)] = re.sub(r'[\.,\s]*$', '', m.group(2))
254
+
255
+ return enumaffils
256
+
257
+
258
+ def _add_affiliation(author_line: str,
259
+ enumaffils: Dict,
260
+ author_entry: List[str],
261
+ name: str) -> List:
262
+ """
263
+ Add author affiliation to author_entry if one is found in author_line.
264
+
265
+ This should deal with these cases
266
+ Smith B(labX) Smith B(1) Smith B(1, 2) Smith B(1 & 2) Smith B(1 and 2)
267
+ """
268
+ en = re.escape(name)
269
+ namerex = r'{}\s*\(([^\(\)]+)'.format(en.replace(' ', 's*'))
270
+ m = re.search(namerex, author_line, flags=re.IGNORECASE)
271
+ if not m:
272
+ return author_entry
273
+
274
+ # Now see if we have enumerated references (just commas, digits, &, and)
275
+ affils = m.group(1).rstrip().lstrip()
276
+ affils = re.sub(r'(&|and)/,', ',', affils, flags=re.IGNORECASE)
277
+
278
+ if re.match(r'^[\d,\s]+$', affils):
279
+ for affil in affils.split(','):
280
+ if affil in enumaffils:
281
+ author_entry.append(enumaffils[affil])
282
+ else:
283
+ author_entry.append(affils)
284
+
285
+ return author_entry
286
+
287
+
288
+ def _parse_author_affil_back_propagate(author_list: List[List[str]],
289
+ back_prop: int) -> List[List[str]]:
290
+ """Back propagate author affiliation.
291
+
292
+ Take the author list structure generated by parse_author_affil_split(..)
293
+ and propagate affiliation information backwards to preceding author
294
+ entries where none was given. Stop before entry $back_prop to avoid
295
+ adding affiliation information to collaboration names.
296
+
297
+ given, eg:
298
+ a.b.first, c.d.second (affil)
299
+ implies
300
+ a.b.first (affil), c.d.second (affil)
301
+ and in more complex cases:
302
+ a.b.first, c.d.second (1), e.f.third, g.h.forth (2,3)
303
+ implies
304
+ a.b.first (1), c.d.second (1), e.f.third (2,3), g.h.forth (2,3)
305
+ """
306
+ last_affil: List[str] = []
307
+ for x in range(len(author_list) - 1, max(back_prop - 1, -1), -1):
308
+ author_entry = author_list[x]
309
+ if len(author_entry) > 3: # author has affiliation,store
310
+ last_affil = author_entry
311
+ elif last_affil:
312
+ # author doesn't have affil but later one did => copy
313
+ author_entry.extend(last_affil[3:])
314
+
315
+ return author_list
316
+
317
+
318
+ def split_authors(authors: str) -> List:
319
+ """
320
+ Split author string into authors entity lists.
321
+
322
+ Take an author line as a string and return a reference to a list of the
323
+ different name and affiliation blocks. While this does normalize spacing
324
+ and 'and', it is a key feature that the set of strings returned can be
325
+ concatenated to reproduce the original authors line. This code thus
326
+ provides a very graceful degradation for badly formatted authors lines, as
327
+ the text at least shows up.
328
+ """
329
+ # split authors field into blocks with boundaries of ( and )
330
+ if not authors:
331
+ return []
332
+ aus = re.split(r'(\(|\))', authors)
333
+ aus = list(filter(lambda x: x != '', aus))
334
+
335
+ blocks = []
336
+ if len(aus) == 1:
337
+ blocks.append(authors)
338
+ else:
339
+ c = ''
340
+ depth = 0
341
+ for bit in aus:
342
+ if bit == '':
343
+ continue
344
+ if bit == '(': # track open parentheses
345
+ depth += 1
346
+ if depth == 1:
347
+ blocks.append(c)
348
+ c = '('
349
+ else:
350
+ c = c + bit
351
+ elif bit == ')': # track close parentheses
352
+ depth -= 1
353
+ c = c + bit
354
+ if depth == 0:
355
+ blocks.append(c)
356
+ c = ''
357
+ else: # haven't closed, so keep accumulating
358
+ continue
359
+ else:
360
+ c = c + bit
361
+ if c:
362
+ blocks.append(c)
363
+
364
+ listx = []
365
+
366
+ for block in blocks:
367
+ block = re.sub(r'\s+', ' ', block)
368
+ if re.match(r'^\(', block): # it is a comment
369
+ listx.append(block)
370
+ else: # it is a name
371
+ block = re.sub(r',?\s+(and|\&)\s', ',', block)
372
+ names = re.split(r'(,|:)\s*', block)
373
+ for name in names:
374
+ if not name:
375
+ continue
376
+ name = name.rstrip().lstrip()
377
+ if name:
378
+ listx.append(name)
379
+
380
+ # Recombine suffixes that were separated with a comma
381
+ parts: List[str] = []
382
+ for p in listx:
383
+ if re.match(r'^(Jr\.?|Sr\.?\[IV]{2,})$', p) \
384
+ and len(parts) >= 2 \
385
+ and parts[-1] == ',' \
386
+ and not re.match(r'\)$', parts[-2]):
387
+ separator = parts.pop()
388
+ last = parts.pop()
389
+ recomb = "{}{} {}".format(last, separator, p)
390
+ parts.append(recomb)
391
+ else:
392
+ parts.append(p)
393
+
394
+ return parts
395
+
396
+ def parse_authorline(authors: str) -> str:
397
+ """
398
+ The external facing function from this module. Converts a complex authorline
399
+ into a simple one with only UTF-8.
400
+
401
+ Parameters
402
+ ----------
403
+ authors : string
404
+ The raw author line from the metadata
405
+
406
+ Returns
407
+ -------
408
+ clean_authors : string
409
+ String represeting cleaned author line
410
+
411
+ Examples
412
+ --------
413
+ >>> parse_authorline('A. Losev, S. Shadrin, I. Shneiberg')
414
+ 'Losev, A.; Shadrin, S.; Shneiberg, I.'
415
+
416
+ >>> parse_authorline("C. Bal\\'azs, E. L. Berger, P. M. Nadolsky, C.-P. Yuan")
417
+ 'Balázs, C.; Berger, E. L.; Nadolsky, P. M.; Yuan, C. -P.'
418
+
419
+ >>> parse_authorline('Stephen C. Power (Lancaster University), Baruch Solel (Technion)')
420
+ 'Power, Stephen C.; Solel, Baruch'
421
+
422
+ >>> parse_authorline("L. Scheck (1), H.-Th. Janka (1), T. Foglizzo (2), and K. Kifonidis (1)\n ((1) MPI for Astrophysics, Garching; (2) Service d'Astrophysique, CEA-Saclay)")
423
+ 'Scheck, L.; Janka, H. -Th.; Foglizzo, T.; Kifonidis, K.'
424
+ """
425
+ names = parse_author_affil_utf(authors)
426
+ return '; '.join([', '.join([q for q in n[:2] if q]) for n in names])
427
+
428
+ def _parse_article_authors(article_author):
429
+ try:
430
+ return [article_author[0], parse_author_affil_utf(article_author[1])]
431
+ except Exception as e:
432
+ msg = "Author split failed for article {}".format(article_author[0])
433
+ logger.error(msg)
434
+ logger.exception(e)
435
+ return [article_author[0], '']
436
+
437
+ def parse_authorline_parallel(article_authors, n_processes=None):
438
+ """
439
+ Parallelize `parse_authorline`
440
+ Parameters
441
+ ----------
442
+ article_authors : list
443
+ list of tuples (arXiv id, author strings from metadata)
444
+ (optional)
445
+ n_processes : int
446
+ number of processes
447
+ Returns
448
+ -------
449
+ authorsplit : list
450
+ list of author strings in standardized format
451
+ [
452
+ [ author1_keyname, author1_firstnames, author1_suffix, affil1,
453
+ affil2 ] ,
454
+ [ author2_keyname, author2_firstnames, author1_suffix, affil1 ] ,
455
+ [ author3_keyname, author3_firstnames, author1_suffix ]
456
+ ]
457
+ """
458
+ logger.info(
459
+ 'Parsing author lines for {} articles...'.format(len(article_authors))
460
+ )
461
+
462
+ pool = Pool(n_processes)
463
+ parsed = pool.map(_parse_article_authors, article_authors)
464
+ outdict = {aid: auth for aid, auth in parsed}
465
+
466
+ filename = os.path.join(DIR_OUTPUT, 'authors-parsed.json.gz')
467
+ logger.info('Saving to {}'.format(filename))
468
+ with gzip.open(filename, 'wb') as fout:
469
+ fout.write(json.dumps(outdict).encode('utf-8'))
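A short usage sketch for the parser above, mirroring its doctests; it assumes `arxiv_public_data` is importable and that its config module can create its output directories on import.

    from arxiv_public_data.authors import parse_authorline, parse_author_affil_utf

    # flat, display-friendly form
    parse_authorline('A. Losev, S. Shadrin, I. Shneiberg')
    # -> 'Losev, A.; Shadrin, S.; Shneiberg, I.'

    # structured form: [keyname, firstnames, suffix, affil1, ...] per author
    parse_author_affil_utf('Stephen C. Power (Lancaster University), Baruch Solel (Technion)')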
arxiv_public_data/config.py ADDED
@@ -0,0 +1,55 @@
1
+ import os
2
+ import json
3
+ import logging
4
+
5
+ logging.basicConfig(
6
+ level=logging.INFO,
7
+ format='%(asctime)s - %(name)s - %(levelname)s: %(message)s'
8
+ )
9
+ baselog = logging.getLogger('arxivdata')
10
+ logger = baselog.getChild('config')
11
+
12
+ DEFAULT_PATH = os.path.join(os.path.abspath('/'), 'arxiv-data')
13
+ JSONFILE = './config.json'
14
+ KEY = 'ARXIV_DATA'
15
+
16
+ def get_outdir():
17
+ """
18
+ Grab the outdir from:
19
+ 1) Environment
20
+ 2) config.json
21
+ 3) default ($PWD/arxiv-data)
22
+ """
23
+ if os.environ.get(KEY):
24
+ out = os.environ.get(KEY)
25
+ else:
26
+ if os.path.exists(JSONFILE):
27
+ js = json.load(open(JSONFILE))
28
+ if not KEY in js:
29
+ logger.warn('Configuration in "{}" invalid, using default'.format(JSONFILE))
30
+ logger.warn("default output directory is {}".format(DEFAULT_PATH))
31
+ out = DEFAULT_PATH
32
+ else:
33
+ out = js[KEY]
34
+ else:
35
+ logger.warn("default output directory is {}".format(DEFAULT_PATH))
36
+ out = DEFAULT_PATH
37
+ return out
38
+
39
+ try:
40
+ DIR_BASE = get_outdir()
41
+ except Exception as e:
42
+ logger.error(
43
+ "Error attempting to get path from ENV or json conf, "
44
+ "defaulting to current directory"
45
+ )
46
+ DIR_BASE = DEFAULT_PATH
47
+
48
+ DIR_FULLTEXT = os.path.join(DIR_BASE, 'fulltext')
49
+ DIR_PDFTARS = os.path.join(DIR_BASE, 'tarpdfs')
50
+ DIR_OUTPUT = os.path.join(DIR_BASE, 'output')
51
+ LOGGER = baselog
52
+
53
+ for dirs in [DIR_BASE, DIR_PDFTARS, DIR_FULLTEXT, DIR_OUTPUT]:
54
+ if not os.path.exists(dirs):
55
+ os.mkdir(dirs)
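A sketch of pointing the pipeline at a custom data directory; both mechanisms are read by `get_outdir()`, with the environment variable taking precedence, and the `/scratch/arxiv-data` path is just an example.

    # option 1: environment variable, set in the shell that launches the code
    #   export ARXIV_DATA=/scratch/arxiv-data
    # option 2: a config.json in the working directory
    import json
    with open('config.json', 'w') as f:
        json.dump({'ARXIV_DATA': '/scratch/arxiv-data'}, f)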
arxiv_public_data/embeddings/__init__.py ADDED
File without changes
arxiv_public_data/embeddings/tf_hub.py ADDED
@@ -0,0 +1,185 @@
1
+ """
2
+ tf_hub.py
3
+
4
+ Find text embeddings using pre-trained TensorFlow Hub models
5
+ """
6
+
7
+ import os
8
+ import pickle
9
+ import numpy as np
10
+
11
+ from arxiv_public_data.config import DIR_OUTPUT, LOGGER
12
+ from arxiv_public_data.embeddings.util import batch_fulltext
13
+
14
+ logger = LOGGER.getChild('embds')
15
+
16
+ try:
17
+ import tensorflow as tf
18
+ import tensorflow_hub as hub
19
+ import sentencepiece as spm
20
+ except ImportError as e:
21
+ logger.warn("This module requires 'tensorflow', 'tensorflow-hub', and"
22
+ " 'sentencepiece'\n"
23
+ 'Please install these modules to use tf_hub.py')
24
+
25
+
26
+ UNIV_SENTENCE_ENCODER_URL = ('https://tfhub.dev/google/'
27
+ 'universal-sentence-encoder/2')
28
+
29
+ ELMO_URL = "https://tfhub.dev/google/elmo/2"
30
+ ELMO_KWARGS = dict(signature='default', as_dict=True)
31
+ ELMO_MODULE_KWARGS = dict(trainable=True)
32
+ ELMO_DICTKEY = 'default'
33
+
34
+ DIR_EMBEDDING = os.path.join(DIR_OUTPUT, 'embeddings')
35
+ if not os.path.exists(DIR_EMBEDDING):
36
+ os.mkdir(DIR_EMBEDDING)
37
+
38
+ def elmo_strings(batches, filename, batchsize=32):
39
+ """
40
+ Compute and save vector embeddings of lists of strings in batches
41
+ Parameters
42
+ ----------
43
+ batches : iterable of strings to be embedded
44
+ filename : str
45
+ filename to store embeddings
46
+ (optional)
47
+ batchsize : int
48
+ size of batches
49
+ """
50
+ g = tf.Graph()
51
+ with g.as_default():
52
+ module = hub.Module(ELMO_URL, **ELMO_MODULE_KWARGS)
53
+ text_input = tf.placeholder(dtype=tf.string, shape=[None])
54
+ embeddings = module(text_input, **ELMO_KWARGS)
55
+ init_op = tf.group([tf.global_variables_initializer(),
56
+ tf.tables_initializer()])
57
+ g.finalize()
58
+
59
+ with tf.Session(graph=g) as sess:
60
+ sess.run(init_op)
61
+
62
+ for i, batch in enumerate(batches):
63
+ # grab mean-pooling of contextualized word reps
64
+ logger.info("Computing/saving batch {}".format(i))
65
+ with open(filename, 'ab') as fout:
66
+ pickle.dump(sess.run(
67
+ embeddings, feed_dict={text_input: batch}
68
+ )[ELMO_DICTKEY], fout)
69
+
70
+ UNIV_SENTENCE_LITE = "https://tfhub.dev/google/universal-sentence-encoder-lite/2"
71
+
72
+ def get_sentence_piece_model():
73
+ with tf.Session() as sess:
74
+ module = hub.Module(UNIV_SENTENCE_LITE)
75
+ return sess.run(module(signature="spm_path"))
76
+
77
+ def process_to_IDs_in_sparse_format(sp, sentences):
78
+ """
79
+ A utility method that processes sentences with the sentence piece
80
+ processor
81
+ 'sp' and returns the results in tf.SparseTensor-similar format:
82
+ (values, indices, dense_shape)
83
+ """
84
+ ids = [sp.EncodeAsIds(x) for x in sentences]
85
+ max_len = max(len(x) for x in ids)
86
+ dense_shape=(len(ids), max_len)
87
+ values=[item for sublist in ids for item in sublist]
88
+ indices=[[row,col] for row in range(len(ids)) for col in range(len(ids[row]))]
89
+ return (values, indices, dense_shape)
90
+
91
+ def universal_sentence_encoder_lite(batches, filename, spm_path, batchsize=32):
92
+ """
93
+ Compute and save vector embeddings of lists of strings in batches
94
+ Parameters
95
+ ----------
96
+ batches : iterable of strings to be embedded
97
+ filename : str
98
+ filename to store embeddings
99
+ spm_path : str
100
+ path to sentencepiece model from `get_sentence_piece_model`
101
+ (optional)
102
+ batchsize : int
103
+ size of batches
104
+ """
105
+ sp = spm.SentencePieceProcessor()
106
+ sp.Load(spm_path)
107
+
108
+ g = tf.Graph()
109
+ with g.as_default():
110
+ module = hub.Module(UNIV_SENTENCE_LITE)
111
+ input_placeholder = tf.sparse_placeholder(
112
+ tf.int64, shape=(None, None)
113
+ )
114
+ embeddings = module(
115
+ inputs=dict(
116
+ values=input_placeholder.values, indices=input_placeholder.indices,
117
+ dense_shape=input_placeholder.dense_shape
118
+ )
119
+ )
120
+ init_op = tf.group([tf.global_variables_initializer(),
121
+ tf.tables_initializer()])
122
+ g.finalize()
123
+
124
+ with tf.Session(graph=g) as sess:
125
+ sess.run(init_op)
126
+ for i, batch in enumerate(batches):
127
+ values, indices, dense_shape = process_to_IDs_in_sparse_format(sp, batch)
128
+ logger.info("Computing/saving batch {}".format(i))
129
+ emb = sess.run(
130
+ embeddings,
131
+ feed_dict={
132
+ input_placeholder.values: values,
133
+ input_placeholder.indices: indices,
134
+ input_placeholder.dense_shape: dense_shape
135
+ }
136
+ )
137
+ with open(filename, 'ab') as fout:
138
+ pickle.dump(emb, fout)
139
+
140
+ def create_save_embeddings(batches, filename, encoder, headers=[], encoder_args=(),
141
+ encoder_kwargs={}, savedir=DIR_EMBEDDING):
142
+ """
143
+ Create vector embeddings of strings and save them to filename
144
+ Parameters
145
+ ----------
146
+ batches : iterator of strings
147
+ filename: str
148
+ embeddings will be saved in DIR_EMBEDDING/embeddings/filename
149
+ encoder : function(batches, savename, *args, **kwargs)
150
+ encodes strings in batches into vectors and saves them
151
+ (optional)
152
+ headers : list of things to save in embeddings file first
153
+
154
+ Examples
155
+ --------
156
+ # For list of strings, create batched numpy array of objects
157
+ batches = np.array_split(
158
+ np.array(strings, dtype='object'), len(strings)//batchsize
159
+ )
160
+ headers = []
161
+
162
+ # For the fulltext which cannot fit in memory, use `util.batch_fulltext`
163
+ md_index, all_ids, batch_gen = batch_fulltext()
164
+ headers = [md_index, all_ids]
165
+
166
+ # Universal Sentence Encoder Lite:
167
+ spm_path = get_sentence_piece_model()
168
+ create_save_embeddings(batches, filename, universal_sentence_encoder_lite,
169
+ headers=headers, encoder_args=(spm_path,))
170
+
171
+ # ELMO:
172
+ create_save_embeddings(strings, filename, elmo_strings, headers=headers)
173
+ """
174
+ if not os.path.exists(savedir):
175
+ os.makedirs(savedir)
176
+
177
+ savename = os.path.join(savedir, filename)
178
+
179
+ with open(savename, 'ab') as fout:
180
+ for h in headers:
181
+ pickle.dump(h, fout)
182
+
183
+ logger.info("Saving embeddings to {}".format(savename))
184
+ encoder(batches, savename, *encoder_args,
185
+ **encoder_kwargs)
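A sketch of the Universal Sentence Encoder Lite path from the `create_save_embeddings` docstring; it assumes a TensorFlow 1.x environment with tensorflow-hub and sentencepiece installed, and the titles and output filename are placeholders.

    import numpy as np
    from arxiv_public_data.embeddings.tf_hub import (
        get_sentence_piece_model, universal_sentence_encoder_lite, create_save_embeddings)

    titles = ['Deep residual learning for ...', 'Attention is all you need']  # placeholder strings
    batches = np.array_split(np.array(titles, dtype='object'), max(len(titles) // 32, 1))
    spm_path = get_sentence_piece_model()
    create_save_embeddings(batches, 'title-embeddings.pkl',
                           universal_sentence_encoder_lite, encoder_args=(spm_path,))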
arxiv_public_data/embeddings/util.py ADDED
@@ -0,0 +1,151 @@
1
+ """
2
+ util.py
3
+
4
+ author: Colin Clement
5
+ date: 2019-04-05
6
+
7
+ This module contains helper functions for loading embeddings and batch
8
+ loading the full text, since many computers cannot contain the whole
9
+ fulltext in memory.
10
+ """
11
+
12
+ import os
13
+ import re
14
+ import numpy as np
15
+ import pickle
16
+
17
+ from arxiv_public_data.config import DIR_FULLTEXT, DIR_OUTPUT
18
+ from arxiv_public_data.oai_metadata import load_metadata
19
+
20
+ def id_to_pathname(aid):
21
+ """
22
+ Make filename path for text document, matching the format of fulltext
23
+ creation in `s3_bulk_download`
24
+ Parameters
25
+ ----------
26
+ aid : str
27
+ string of arXiv article id as found in metadata
28
+ Returns
29
+ -------
30
+ pathname : str
31
+ pathname in which to store the article following
32
+ Examples
33
+ --------
34
+ >>> id_to_pathname('hep-ph/0001001') #doctest: +ELLIPSIS
35
+ '.../hep-ph/0001/hep-ph0001001.txt'
36
+
37
+ >>> id_to_pathname('1501.13851') #doctest: +ELLIPSIS
38
+ '.../arxiv/1501/1501.13851.txt'
39
+ """
40
+ if '.' in aid: # new style ArXiv ID
41
+ yymm = aid.split('.')[0]
42
+ return os.path.join(DIR_FULLTEXT, 'arxiv', yymm, aid + '.txt')
43
+
44
+ # old style ArXiv ID
45
+ cat, arxiv_id = re.split(r'(\d+)', aid)[:2]
46
+ yymm = arxiv_id[:4]
47
+ return os.path.join(DIR_FULLTEXT, cat, yymm, aid.replace('/', '') + '.txt')
48
+
49
+ def load_generator(paths, batchsize):
50
+ """
51
+ Creates a generator object for batch loading files from paths
52
+ Parameters
53
+ ----------
54
+ paths : list of filepaths
55
+ batchsize : int
56
+ Returns
57
+ -------
58
+ file_contents : list of strings of contents of files in path
59
+ """
60
+ assert type(paths) is list, 'Requires a list of paths'
61
+ assert type(batchsize) is int, 'batchsize must be an int'
62
+ assert batchsize > 0, 'batchsize must be positive'
63
+
64
+ out = []
65
+ for p in paths:
66
+ with open(p, 'r') as fin:
67
+ out.append(fin.read())
68
+ if len(out) == batchsize:
69
+ yield np.array(out, dtype='object')
70
+ out = []
71
+ yield out
72
+
73
+ def batch_fulltext(batchsize=32, maxnum=None):
74
+ """
75
+ Read metadata and find corresponding files in the fulltext
76
+ Parameters
77
+ ----------
78
+ (optional)
79
+ batchsize : int
80
+ number of fulltext files to load into a batch
81
+ maxnum : int
82
+ the maximum number of paths to feed the generator, for
83
+ testing purposes
84
+ Returns
85
+ -------
86
+ md_index, all_ids, load_gen : tuple of (list, list, generator)
87
+ md_index is a mapping of existing fulltext files, in order
88
+ of their appearance, and containing the index of corresponding
89
+ metadata. all_ids is a list of all arXiv IDs in the metadata.
90
+ load_gen is a generator which allows batched loading of the
91
+ full-text, as defined by `load_generator`
92
+ """
93
+ all_ids = [m['id'] for m in load_metadata()]
94
+ all_paths = [id_to_pathname(aid) for aid in all_ids]
95
+ exists = [os.path.exists(p) for p in all_paths]
96
+ existing_paths = [p for p, e in zip(all_paths, exists) if e][:maxnum]
97
+ md_index = [i for i, e in enumerate(exists) if e]
98
+ return md_index, all_ids, load_generator(existing_paths, batchsize)
99
+
100
+ def load_embeddings(filename, headers=0):
101
+ """
102
+ Loads vector embeddings
103
+ Parameters
104
+ ----------
105
+ filename : str
106
+ path to vector embeddings saved by `create_save_embeddings`
107
+ (optional)
108
+ headers : int
109
+ number of pickle calls containing metadata separate from the graphs
110
+ Returns
111
+ -------
112
+ embeddings : dict
113
+ keys 'embeddings' containing vector embeddings and
114
+ 'headers' containining metadata
115
+ """
116
+ out = {'embeddings': [], 'headers': []}
117
+ N = 0
118
+ with open(filename, 'rb') as fin:
119
+ while True:
120
+ try:
121
+ if N < headers:
122
+ out['headers'].append(pickle.load(fin))
123
+ else:
124
+ out['embeddings'].extend(pickle.load(fin))
125
+ except EOFError as e:
126
+ break
127
+ N += 1
128
+ out['embeddings'] = np.array(out['embeddings'])
129
+ return out
130
+
131
+ def fill_zeros(loaded_embedding):
132
+ """
133
+ Fill out zeros in the full-text embedding where full-text is missing
134
+ Parameters
135
+ ----------
136
+ loaded_embedding : dict
137
+ dict as saved from with `load_embeddings` with 2 headers
138
+ of the list of the metadata_index each embedding vector corresponds
139
+ to, the list of all article ids
140
+ Returns
141
+ -------
142
+ embeddings : array_like
143
+ vector embeddings of shape (number of articles, embedding dimension)
144
+ """
145
+ md_index = loaded_embedding['headers'][0]
146
+ all_ids = loaded_embedding['headers'][1]
147
+ vectors = loaded_embedding['embeddings']
148
+ output = np.zeros((len(all_ids), vectors.shape[1]))
149
+ for idx, v in zip(md_index, vectors):
150
+ output[idx,:] = v
151
+ return output
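A sketch of reading embeddings back, assuming the file was written by `create_save_embeddings` with the two `batch_fulltext` headers; the path is illustrative.

    from arxiv_public_data.embeddings.util import load_embeddings, fill_zeros

    emb = load_embeddings('arxiv-data/output/embeddings/fulltext-embeddings.pkl', headers=2)
    vectors = fill_zeros(emb)   # zero rows where no full text was available
    print(vectors.shape)        # (number of articles, embedding dimension)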
arxiv_public_data/fixunicode.py ADDED
@@ -0,0 +1,108 @@
1
+ # -*- coding: utf-8 -*-
2
+ import re
3
+ import unicodedata
4
+
5
+ """
6
+ List of ligatures: https://en.wikipedia.org/wiki/Typographic_ligature
7
+ MKB removed the following elements from the list:
8
+ - et 🙰 U+1F670 &#x1F670;
9
+ - ſs, ſz ẞ, ß U+00DF &szlig;
10
+
11
+ Additional notes:
12
+ * Some classes of characters were listed in the original utf8 fixes but I'm not
13
+ sure they don't belong elsewhere (end user processing). In these cases, pass
14
+ through unidecode should normalize them to proper ascii. They are listed here
15
+ with reasoning:
16
+
17
+ - Ditch combining diacritics http://unicode.org/charts/PDF/U0300.pdf
18
+ r'[\u0300-\u036F]': ''
19
+
20
+ - Ditch chars that sometimes (incorrectly?) appear as combining diacritics
21
+ r'(?:\xa8|[\u02C0-\u02DF])': ''
22
+
23
+ * Should we run ftfy?
24
+ """
25
+
26
+ ligature_table = """
27
+ AA, aa Ꜳ, ꜳ U+A732, U+A733 &#xA732; &#xA733;
28
+ AE, ae Æ, æ U+00C6, U+00E6 &AElig; &aelig;
29
+ AO, ao Ꜵ, ꜵ U+A734, U+A735 &#xA734; &#xA735;
30
+ AU, au Ꜷ, ꜷ U+A736, U+A737 &#xA736; &#xA737;
31
+ AV, av Ꜹ, ꜹ U+A738, U+A739 &#xA738; &#xA739;
32
+ AV, av Ꜻ, ꜻ U+A73A, U+A73B &#xA73A; &#xA73B;
33
+ AY, ay Ꜽ, ꜽ U+A73C, U+A73D &#xA73C; &#xA73D;
34
+ ff ff U+FB00 &#xFB00;
35
+ ffi ffi U+FB03 &#xFB03;
36
+ ffl ffl U+FB04 &#xFB04;
37
+ fi fi U+FB01 &#xFB01;
38
+ fl fl U+FB02 &#xFB02;
39
+ OE, oe Œ, œ U+0152, U+0153 &OElig; &oelig;
40
+ OO, oo Ꝏ, ꝏ U+A74E, U+A74F &#xA74E; &#xA74F;
41
+ st st U+FB06 &#xFB06;
42
+ ſt ſt U+FB05 &#xFB05;
43
+ TZ, tz Ꜩ, ꜩ U+A728, U+A729 &#xA728; &#xA729;
44
+ ue ᵫ U+1D6B &#x1D6B;
45
+ VY, vy Ꝡ, ꝡ U+A760, U+A761 &#xA760; &#xA761;
46
+ db ȸ U+0238 &#x238;
47
+ dz ʣ U+02A3 &#x2A3;
48
+ dʑ ʥ U+02A5 &#x2A5;
49
+ dʒ ʤ U+02A4 &#x2A4;
50
+ fŋ ʩ U+02A9 &#x2A9;
51
+ IJ, ij IJ, ij U+0132, U+0133 &#x132; &#x133;
52
+ ls ʪ U+02AA &#x2AA;
53
+ lz ʫ U+02AB &#x2AB;
54
+ lʒ ɮ U+026E &#x26E;
55
+ qp ȹ U+0239 &#x239;
56
+ tɕ ʨ U+02A8 &#x2A8;
57
+ ts ʦ U+02A6 &#x2A6;
58
+ tʃ ʧ U+02A7 &#x2A7;
59
+ ui ꭐ U+AB50 &#xAB50;
60
+ ui ꭑ U+AB51 &#xAB50;
61
+ """
62
+
63
+ unicode_mapping = {}
64
+
65
+ for row in ligature_table.split('\n'):
66
+ if row.count('\t') <= 1:
67
+ continue
68
+
69
+ unicode_mapping.update(
70
+ {
71
+ u.strip(): unicodedata.normalize('NFKC', a.strip())
72
+ for a, u in zip(*[c.split(',') for c in row.split('\t')[:2]])
73
+ }
74
+ )
75
+
76
+ unicode_mapping.update({
77
+ # 'ẞ, ß': careful, some use this for \beta
78
+ r'(\B)\u00DF': r'\1ss',
79
+
80
+ # Additions (manual normalization that we feel is important)
81
+ # unicode space u'\xa0' (not \x{0c} = ^L keep!)
82
+ '\xa0': ' ',
83
+
84
+ # single + double quotes, dash, and asterisk
85
+ r'[\u2018\u2019]': r"'",
86
+ r'[\u201C\u201D]': r'"',
87
+ r'[\xad\u2014]': r'-',
88
+ r'\xb7': r'*'
89
+ })
90
+
91
+
92
+ def fix_unicode(txt: str) -> str:
93
+ """
94
+ Given UTF-8 encoded text, remove typographical ligatures (normalize to true
95
+ non-display character set) and do a general normalization of the unicode
96
+ so that possible redundant characters and simplified to a single set.
97
+
98
+ Parameters
99
+ ----------
100
+ txt : unicode string
101
+
102
+ Returns
103
+ -------
104
+ output : unicode string
105
+ """
106
+ for search, replace in unicode_mapping.items():
107
+ txt = re.subn(search, replace, txt)[0]
108
+ return unicodedata.normalize('NFKC', txt)
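For example, a quick check of the normalization above:

    from arxiv_public_data.fixunicode import fix_unicode

    fix_unicode('An e\ufb03cient \ufb01eld \u201cmodel\u201d')
    # ligatures and curly quotes are replaced: 'An efficient field "model"'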
arxiv_public_data/fulltext.py ADDED
@@ -0,0 +1,349 @@
1
+ import os
2
+ import re
3
+ import sys
4
+ import glob
5
+ import shlex
6
+ from functools import partial
7
+
8
+ from multiprocessing import Pool
9
+ from subprocess import check_call, CalledProcessError, TimeoutExpired, PIPE
10
+
11
+ from arxiv_public_data.config import LOGGER
12
+ from arxiv_public_data import fixunicode, pdfstamp
13
+
14
+ log = LOGGER.getChild('fulltext')
15
+ TIMELIMIT = 2*60
16
+ STAMP_SEARCH_LIMIT = 1000
17
+
18
+ PDF2TXT = 'pdf2txt.py'
19
+ PDFTOTEXT = 'pdftotext'
20
+
21
+ RE_REPEATS = r'(\(cid:\d+\)|lllll|\.\.\.\.\.|\*\*\*\*\*)'
22
+
23
+
24
+ def reextension(filename: str, extension: str) -> str:
25
+ """ Give a filename a new extension """
26
+ name, _ = os.path.splitext(filename)
27
+ return '{}.{}'.format(name, extension)
28
+
29
+
30
+ def average_word_length(txt):
31
+ """
32
+ Gather statistics about the text, primarily the average word length
33
+
34
+ Parameters
35
+ ----------
36
+ txt : str
37
+
38
+ Returns
39
+ -------
40
+ word_length : float
41
+ Average word length in the text
42
+ """
43
+ #txt = re.subn(RE_REPEATS, '', txt)[0]
44
+ nw = len(txt.split())
45
+ nc = len(txt)
46
+ avgw = nc / (nw + 1)
47
+ return avgw
48
+
49
+
50
+ def process_timeout(cmd, timeout):
51
+ return check_call(cmd, timeout=timeout, stdout=PIPE, stderr=PIPE)
52
+
53
+
54
+ # ============================================================================
55
+ # functions for calling the text extraction services
56
+ # ============================================================================
57
+ def run_pdf2txt(pdffile: str, timelimit: int=TIMELIMIT, options: str=''):
58
+ """
59
+ Run pdf2txt to extract full text
60
+
61
+ Parameters
62
+ ----------
63
+ pdffile : str
64
+ Path to PDF file
65
+
66
+ timelimit : int
67
+ Amount of time to wait for the process to complete
68
+
69
+ Returns
70
+ -------
71
+ output : str
72
+ Full plain text output
73
+ """
74
+ log.debug('Running {} on {}'.format(PDF2TXT, pdffile))
75
+ tmpfile = reextension(pdffile, 'pdf2txt')
76
+
77
+ cmd = '{cmd} {options} -o "{output}" "{pdf}"'.format(
78
+ cmd=PDF2TXT, options=options, output=tmpfile, pdf=pdffile
79
+ )
80
+ cmd = shlex.split(cmd)
81
+ output = process_timeout(cmd, timeout=timelimit)
82
+
83
+ with open(tmpfile) as f:
84
+ return f.read()
85
+
86
+
87
+ def run_pdftotext(pdffile: str, timelimit: int = TIMELIMIT) -> str:
88
+ """
89
+ Run pdftotext on PDF file for extracted plain text
90
+
91
+ Parameters
92
+ ----------
93
+ pdffile : str
94
+ Path to PDF file
95
+
96
+ timelimit : int
97
+ Amount of time to wait for the process to complete
98
+
99
+ Returns
100
+ -------
101
+ output : str
102
+ Full plain text output
103
+ """
104
+ log.debug('Running {} on {}'.format(PDFTOTEXT, pdffile))
105
+ tmpfile = reextension(pdffile, 'pdftotxt')
106
+
107
+ cmd = '{cmd} "{pdf}" "{output}"'.format(
108
+ cmd=PDFTOTEXT, pdf=pdffile, output=tmpfile
109
+ )
110
+ cmd = shlex.split(cmd)
111
+ output = process_timeout(cmd, timeout=timelimit)
112
+
113
+ with open(tmpfile) as f:
114
+ return f.read()
115
+
116
+
117
+ def run_pdf2txt_A(pdffile: str, **kwargs) -> str:
118
+ """
119
+ Run pdf2txt with the -A option which runs 'positional analysis on images'
120
+ and can return better results when pdf2txt combines many words together.
121
+
122
+ Parameters
123
+ ----------
124
+ pdffile : str
125
+ Path to PDF file
126
+
127
+ kwargs : dict
128
+ Keyword arguments to :func:`run_pdf2txt`
129
+
130
+ Returns
131
+ -------
132
+ output : str
133
+ Full plain text output
134
+ """
135
+ return run_pdf2txt(pdffile, options='-A', **kwargs)
136
+
137
+
138
+ # ============================================================================
139
+ # main function which extracts text
140
+ # ============================================================================
141
+ def fulltext(pdffile: str, timelimit: int = TIMELIMIT):
142
+ """
143
+ Given a pdf file, extract the unicode text and run through very basic
144
+ unicode normalization routines. Determine the best extracted text and
145
+ return as a string.
146
+
147
+ Parameters
148
+ ----------
149
+ pdffile : str
150
+ Path to PDF file from which to extract text
151
+
152
+ timelimit : int
153
+ Time in seconds to allow the extraction routines to run
154
+
155
+ Returns
156
+ -------
157
+ fulltext : str
158
+ The full plain text of the PDF
159
+ """
160
+ if not os.path.isfile(pdffile):
161
+ raise FileNotFoundError(pdffile)
162
+
163
+ if os.stat(pdffile).st_size == 0: # file is empty
164
+ raise RuntimeError('"{}" is an empty file'.format(pdffile))
165
+
166
+ try:
167
+ output = run_pdftotext(pdffile, timelimit=timelimit)
168
+ #output = run_pdf2txt(pdffile, timelimit=timelimit)
169
+ except (TimeoutExpired, CalledProcessError, RuntimeError) as e:
170
+ output = run_pdf2txt(pdffile, timelimit=timelimit)
171
+ #output = run_pdftotext(pdffile, timelimit=timelimit)
172
+
173
+ output = fixunicode.fix_unicode(output)
174
+ #output = stamp.remove_stamp(output, split=STAMP_SEARCH_LIMIT)
175
+ wordlength = average_word_length(output)
176
+
177
+ if wordlength <= 45:
178
+ try:
179
+ os.remove(reextension(pdffile, 'pdftotxt')) # remove the tempfile
180
+ except OSError:
181
+ pass
182
+
183
+ return output
184
+
185
+ output = run_pdf2txt_A(pdffile, timelimit=timelimit)
186
+ output = fixunicode.fix_unicode(output)
187
+ #output = stamp.remove_stamp(output, split=STAMP_SEARCH_LIMIT)
188
+ wordlength = average_word_length(output)
189
+
190
+ if wordlength > 45:
191
+ raise RuntimeError(
192
+ 'No accurate text could be extracted from "{}"'.format(pdffile)
193
+ )
194
+
195
+ try:
196
+ os.remove(reextension(pdffile, 'pdftotxt')) # remove the tempfile
197
+ except OSError:
198
+ pass
199
+
200
+ return output
201
+
202
+
203
+ def sorted_files(globber: str):
204
+ """
205
+ Give a globbing expression of files to find. They will be sorted upon
206
+ return. This function is most useful when sorting does not provide
207
+ numerical order,
208
+
209
+ e.g.:
210
+ 9 -> 12 returned as 10 11 12 9 by string sort
211
+
212
+ In this case use num_sort=True, and it will be sorted by numbers in the
213
+ string, then by the string itself.
214
+
215
+ Parameters
216
+ ----------
217
+ globber : str
218
+ Expression on which to search for files (bash glob expression)
219
+
220
+
221
+ """
222
+ files = glob.glob(globber, recursive = True) # return a list of path, including sub directories
223
+ files.sort()
224
+
225
+ allfiles = []
226
+
227
+ for fn in files:
228
+ nums = re.findall(r'\d+', fn) # regular expression, find number in path names
229
+ data = [str(int(n)) for n in nums] + [fn]
230
+ # a list of [first number, second number, ..., filename] as strings, otherwise the sort will fail
231
+ allfiles.append(data) # list of list
232
+
233
+ allfiles = sorted(allfiles)
234
+ return [f[-1] for f in allfiles] # sorted filenames
235
+
236
+
237
+ def convert_directory(path: str, timelimit: int = TIMELIMIT):
238
+ """
239
+ Convert all pdfs in a given `path` to full plain text. For each pdf, a file
240
+ of the same name but extension .txt will be created. If that file exists,
241
+ it will be skipped.
242
+
243
+ Parameters
244
+ ----------
245
+ path : str
246
+ Directory in which to search for pdfs and convert to text
247
+
248
+ Returns
249
+ -------
250
+ output : list of str
251
+ List of converted files
252
+ """
253
+ outlist = []
254
+
255
+ globber = os.path.join(path, '*.pdf')
256
+ pdffiles = sorted_files(globber)
257
+
258
+ log.info('Searching "{}"...'.format(globber))
259
+ log.info('Found: {} pdfs'.format(len(pdffiles)))
260
+
261
+ for pdffile in pdffiles:
262
+ txtfile = reextension(pdffile, 'txt')
263
+
264
+ if os.path.exists(txtfile):
265
+ continue
266
+
267
+ # we don't want this function to stop half way because of one failed
268
+ # file so just charge onto the next one
269
+ try:
270
+ text = fulltext(pdffile, timelimit)
271
+ with open(txtfile, 'w') as f:
272
+ f.write(text)
273
+ except Exception as e:
274
+ log.error("Conversion failed for '{}'".format(pdffile))
275
+ log.exception(e)
276
+ continue
277
+
278
+ outlist.append(pdffile)
279
+ return outlist
280
+
281
+ def convert_directory_parallel(path: str, processes: int, timelimit: int = TIMELIMIT):
282
+ """
283
+ Convert all pdfs in a given `path` to full plain text. For each pdf, a file
284
+ of the same name but extension .txt will be created. If that file exists,
285
+ it will be skipped.
286
+
287
+ Parameters
288
+ ----------
289
+ path : str
290
+ Directory in which to search for pdfs and convert to text
291
+
292
+ Returns
293
+ -------
294
+ output : list of str
295
+ List of converted files
296
+ """
297
+ globber = os.path.join(path, '**/*.pdf') # search expression for glob.glob
298
+ pdffiles = sorted_files(globber) # a list of path
299
+
300
+ log.info('Searching "{}"...'.format(globber))
301
+ log.info('Found: {} pdfs'.format(len(pdffiles)))
302
+
303
+ pool = Pool(processes=processes)
304
+ result = pool.map(partial(convert_safe, timelimit=timelimit), pdffiles)
305
+ pool.close()
306
+ pool.join()
307
+
308
+
309
+ def convert_safe(pdffile: str, timelimit: int = TIMELIMIT):
310
+ """ Conversion function that never fails """
311
+ try:
312
+ convert(pdffile, timelimit=timelimit)
313
+ except Exception as e:
314
+ log.error('File conversion failed for {}: {}'.format(pdffile, e))
315
+
316
+
317
+ def convert(path: str, skipconverted=True, timelimit: int = TIMELIMIT) -> str:
318
+ """
319
+ Convert a single PDF to text.
320
+
321
+ Parameters
322
+ ----------
323
+ path : str
324
+ Location of a PDF file.
325
+
326
+ skipconverted : boolean
327
+ Skip conversion when there is a text file already
328
+
329
+ Returns
330
+ -------
331
+ str
332
+ Location of text file.
333
+ """
334
+ if not os.path.exists(path):
335
+ raise RuntimeError('No such path: %s' % path)
336
+ outpath = reextension(path, 'txt')
337
+
338
+ if os.path.exists(outpath):
339
+ return outpath
340
+
341
+ try:
342
+ content = fulltext(path, timelimit)
343
+ with open(outpath, 'w') as f:
344
+ f.write(content)
345
+ except Exception as e:
346
+ msg = "Conversion failed for '%s': %s"
347
+ log.error(msg, path, e)
348
+ raise RuntimeError(msg % (path, e)) from e
349
+ return outpath
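A usage sketch, assuming the `pdftotext` and `pdf2txt.py` executables are available on PATH; the `papers/` directory and file name are illustrative.

    from arxiv_public_data import fulltext

    txt_path = fulltext.convert('papers/1501.13851.pdf')        # writes papers/1501.13851.txt
    fulltext.convert_directory_parallel('papers', processes=4)  # every **/*.pdf under papers/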
arxiv_public_data/internal_citations.py ADDED
@@ -0,0 +1,128 @@
1
+ #! /usr/bin/env python
2
+ import time
3
+ import re
4
+ import sys
5
+ import glob
6
+ import os
7
+ import gzip
8
+ import json
9
+ import math
10
+ from multiprocessing import Pool,cpu_count
11
+
12
+ from arxiv_public_data.regex_arxiv import REGEX_ARXIV_FLEXIBLE, clean
13
+ from arxiv_public_data.config import DIR_FULLTEXT, DIR_OUTPUT, LOGGER
14
+
15
+ log = LOGGER.getChild('fulltext')
16
+ RE_FLEX = re.compile(REGEX_ARXIV_FLEXIBLE)
17
+ RE_OLDNAME_SPLIT = re.compile(r"([a-z\-]+)(\d+)")
18
+
19
+
20
+ def path_to_id(path):
21
+ """ Convert filepath name of ArXiv file to ArXiv ID """
22
+ name = os.path.splitext(os.path.basename(path))[0]
23
+ if '.' in name: # new ID
24
+ return name
25
+ split = [a for a in RE_OLDNAME_SPLIT.split(name) if a]
26
+ return "/".join(split)
27
+
28
+
29
+ def all_articles(directory=DIR_FULLTEXT):
30
+ """ Find all *.txt files in directory """
31
+ out = []
32
+ # make sure the path is absolute for os.walk
33
+ directory = os.path.abspath(os.path.expanduser(directory))
34
+
35
+ for root, dirs, files in os.walk(directory):
36
+ for f in files:
37
+ if 'txt' in f:
38
+ out.append(os.path.join(root, f))
39
+
40
+ return out
41
+
42
+ def extract_references(filename, pattern=RE_FLEX):
43
+ """
44
+ Parameters
45
+ ----------
46
+ filename : str
47
+ name of file to search for pattern
48
+ pattern : re pattern object
49
+ compiled regex pattern
50
+
51
+ Returns
52
+ -------
53
+ citations : list
54
+ list of found arXiv IDs
55
+ """
56
+ out = []
57
+ with open(filename, 'r') as fn:
58
+ txt = fn.read()
59
+
60
+ for matches in pattern.findall(txt):
61
+ out.extend([clean(a) for a in matches if a])
62
+ return list(set(out))
63
+
64
+ def citation_list_inner(articles):
65
+ """ Find references in all the input articles
66
+ Parameters
67
+ ----------
68
+ articles : list of str
69
+ list of paths to article text
70
+ Returns
71
+ -------
72
+ citations : dict[arXiv ID] = list of arXiv IDs
73
+ dictionary of articles and their references
74
+ """
75
+ cites = {}
76
+ for i, article in enumerate(articles):
77
+ if i > 0 and i % 1000 == 0:
78
+ log.info('Completed {} articles'.format(i))
79
+ try:
80
+ refs = extract_references(article)
81
+ cites[path_to_id(article)] = refs
82
+ except:
83
+ log.error("Error in {}".format(article))
84
+ continue
85
+ return cites
86
+
87
+
88
+ def citation_list_parallel(N=cpu_count(), directory=DIR_FULLTEXT):
89
+ """
90
+ Split the task of checking for citations across some number of processes
91
+ Parameters
92
+ ----------
93
+ N : int
94
+ number of processes
95
+ directory: str
96
+ directory where full text files are stored
97
+ Returns
98
+ -------
99
+ citations : dict[arXiv ID] = list of arXiv IDs
100
+ all arXiv citations in all articles
101
+ """
102
+ articles = all_articles(directory)
103
+ log.info('Calculating citation network for {} articles'.format(len(articles)))
104
+
105
+ pool = Pool(N)
106
+
107
+ A = len(articles)
108
+ divs = list(range(0, A, math.ceil(A/N))) + [A]
109
+ chunks = [articles[s:e] for s, e in zip(divs[:-1], divs[1:])]
110
+
111
+ cites = pool.map(citation_list_inner, chunks)
112
+
113
+ allcites = {}
114
+ for c in cites:
115
+ allcites.update(c)
116
+ return allcites
117
+
118
+
119
+ def default_filename():
120
+ return os.path.join(DIR_OUTPUT, 'internal-citations.json.gz')
121
+
122
+
123
+ def save_to_default_location(citations):
124
+ filename = default_filename()
125
+
126
+ log.info('Saving to "{}"'.format(filename))
127
+ with gzip.open(filename, 'wb') as fn:
128
+ fn.write(json.dumps(citations).encode('utf-8'))
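A sketch of building and saving the citation graph once full text has been extracted into DIR_FULLTEXT:

    from arxiv_public_data import internal_citations

    cites = internal_citations.citation_list_parallel(N=4)  # dict: arXiv ID -> list of cited arXiv IDs
    internal_citations.save_to_default_location(cites)      # writes internal-citations.json.gz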
arxiv_public_data/oai_metadata.py ADDED
@@ -0,0 +1,282 @@
1
+ """
2
+ oai_metadata.py
3
+
4
+ authors: Matt Bierbaum and Colin Clement
5
+ date: 2019-02-25
6
+
7
+ This module interacts with the Open Archives Initiative (OAI) API, downloading
8
+ the metadata for all Arxiv articles.
9
+
10
+ Usage
11
+ =====
12
+
13
+ python oai_metadata.py data/<savefile>.json
14
+
15
+ Notes
16
+ =====
17
+ The save file is not technically JSON, but individual streamed lines of JSON,
18
+ each of which is compressed by gzip. Use the helper function load_metadata
19
+ to be sure to open it without error.
20
+
21
+ Resources
22
+ =========
23
+ * http://www.openarchives.org/OAI/2.0/openarchivesprotocol.htm
24
+ * https://arxiv.org/help/oa/index
25
+ """
26
+
27
+ import os
28
+ import gzip
29
+ import glob
30
+ import json
31
+ import time
32
+ import hashlib
33
+ import datetime
34
+ import requests
35
+ import xml.etree.ElementTree as ET
36
+
37
+ from arxiv_public_data.config import LOGGER, DIR_BASE
38
+
39
+ log = LOGGER.getChild('metadata')
40
+
41
+ URL_ARXIV_OAI = 'https://export.arxiv.org/oai2'
42
+ URL_CITESEER_OAI = 'http://citeseerx.ist.psu.edu/oai2'
43
+ OAI_XML_NAMESPACES = {
44
+ 'OAI': 'http://www.openarchives.org/OAI/2.0/',
45
+ 'arXiv': 'http://arxiv.org/OAI/arXivRaw/'
46
+ }
47
+
48
+ def get_list_record_chunk(resumptionToken=None, harvest_url=URL_ARXIV_OAI,
49
+ metadataPrefix='arXivRaw'):
50
+ """
51
+ Query the OAI API for the metadata of 1000 arXiv articles
52
+
53
+ Parameters
54
+ ----------
55
+ resumptionToken : str
56
+ Token for the API which triggers the next 1000 articles
57
+
58
+ Returns
59
+ -------
60
+ record_chunks : str
61
+ metadata of 1000 arXiv articles as an XML string
62
+ """
63
+ parameters = {'verb': 'ListRecords'}
64
+
65
+ if resumptionToken:
66
+ parameters['resumptionToken'] = resumptionToken
67
+ else:
68
+ parameters['metadataPrefix'] = metadataPrefix
69
+
70
+ response = requests.get(harvest_url, params=parameters)
71
+
72
+ if response.status_code == 200:
73
+ return response.text
74
+
75
+ if response.status_code == 503:
76
+ secs = int(response.headers.get('Retry-After', 20)) * 1.5
77
+ log.info('Requested to wait, waiting {} seconds until retry...'.format(secs))
78
+
79
+ time.sleep(secs)
80
+ return get_list_record_chunk(resumptionToken=resumptionToken)
81
+ else:
82
+ raise Exception(
83
+ 'Unknown error in HTTP request {}, status code: {}'.format(
84
+ response.url, response.status_code
85
+ )
86
+ )
87
+
88
+ def _record_element_text(elm, name):
89
+ """ XML helper function for extracting text from leaf (single-node) elements """
90
+ item = elm.find('arXiv:{}'.format(name), OAI_XML_NAMESPACES)
91
+ return item.text if item is not None else None
92
+
93
+ def _record_element_all(elm, name):
94
+ """ XML helper function for extracting text from queries with multiple nodes """
95
+ return elm.findall('arXiv:{}'.format(name), OAI_XML_NAMESPACES)
96
+
97
+ def parse_record(elm):
98
+ """
99
+ Parse the XML element of a single ArXiv article into a dictionary of
100
+ attributes
101
+
102
+ Parameters
103
+ ----------
104
+ elm : xml.etree.ElementTree.Element
105
+ Element of the record of a single ArXiv article
106
+
107
+ Returns
108
+ -------
109
+ output : dict
110
+ Attributes of the ArXiv article stored as a dict with the keys
111
+ id, submitter, authors, title, comments, journal-ref, doi, abstract,
112
+ report-no, categories, and version
113
+ """
114
+ text_keys = [
115
+ 'id', 'submitter', 'authors', 'title', 'comments',
116
+ 'journal-ref', 'doi', 'abstract', 'report-no'
117
+ ]
118
+ output = {key: _record_element_text(elm, key) for key in text_keys}
119
+ output['categories'] = [
120
+ i.text for i in (_record_element_all(elm, 'categories') or [])
121
+ ]
122
+ output['versions'] = [
123
+ i.attrib['version'] for i in _record_element_all(elm, 'version')
124
+ ]
125
+ return output
126
+
127
+ def parse_xml_listrecords(root):
128
+ """
129
+ Parse XML of one chunk of the metadata of 1000 ArXiv articles
130
+ into a list of dictionaries
131
+
132
+ Parameters
133
+ ----------
134
+ root : xml.etree.ElementTree.Element
135
+ Element containing the records of an entire chunk of ArXiv queries
136
+
137
+ Returns
138
+ -------
139
+ records, resumptionToken : list, str
140
+ records is a list of 1000 dictionaries, each containing the
141
+ attributes of a single arxiv article
142
+ resumptionToken is a string which is fed into the subsequent query
143
+ """
144
+ resumptionToken = root.find(
145
+ 'OAI:ListRecords/OAI:resumptionToken',
146
+ OAI_XML_NAMESPACES
147
+ )
148
+ resumptionToken = resumptionToken.text if resumptionToken is not None else ''
149
+
150
+ records = root.findall(
151
+ 'OAI:ListRecords/OAI:record/OAI:metadata/arXiv:arXivRaw',
152
+ OAI_XML_NAMESPACES
153
+ )
154
+ records = [parse_record(p) for p in records]
155
+
156
+ return records, resumptionToken
157
+
158
+ def check_xml_errors(root):
159
+ """ Check for, log, and raise any OAI service errors in the XML """
160
+ error = root.find('OAI:error', OAI_XML_NAMESPACES)
161
+
162
+ if error is not None:
163
+ raise RuntimeError(
164
+ 'OAI service returned error: {}'.format(error.text)
165
+ )
166
+
167
+ def find_default_locations():
168
+ outfile = os.path.join(DIR_BASE, 'arxiv-metadata-oai-*.json.gz')
169
+ resume = os.path.join(
170
+ DIR_BASE, 'arxiv-metadata-oai-*.json.gz-resumptionToken.txt'
171
+ )
172
+ fn_outfile = sorted(glob.glob(outfile))
173
+ fn_resume = sorted(glob.glob(resume))
174
+
175
+ if len(fn_outfile) > 0:
176
+ return fn_outfile[-1]
177
+ return None
178
+
179
+ def all_of_arxiv(outfile=None, resumptionToken=None, autoresume=True):
180
+ """
181
+ Download the metadata for every article in the ArXiv via the OAI API
182
+
183
+ Parameters
184
+ ----------
185
+ outfile : str (default './arxiv-metadata-oai-<date>.json')
186
+ name of file where data is stored, appending each chunk of 1000
187
+ articles.
188
+ resumptionToken : str (default None)
189
+ token which instructs the OAI server to continue feeding the next
190
+ chunk
191
+ autoresume : bool
192
+ If true, it looks for a saved resumptionToken in the file
193
+ <outfile>-resumptionToken.txt
194
+ """
195
+ date = str(datetime.datetime.now()).split(' ')[0]
196
+
197
+ outfile = (
198
+ outfile or # user-supplied
199
+ find_default_locations() or # already in progress
200
+ os.path.join(
201
+ DIR_BASE, 'arxiv-metadata-oai-{}.json.gz'.format(date)
202
+ ) # new file
203
+ )
204
+
205
+ directory = os.path.split(outfile)[0]
206
+ if directory and not os.path.exists(directory):
207
+ os.makedirs(directory)
208
+ tokenfile = '{}-resumptionToken.txt'.format(outfile)
209
+ chunk_index = 0
210
+ total_records = 0
211
+
212
+ log.info('Saving metadata to "{}"'.format(outfile))
213
+
214
+ resumptionToken = None
215
+ if autoresume:
216
+ try:
217
+ resumptionToken = open(tokenfile, 'r').read()
218
+ except Exception as e:
219
+ log.warn("No tokenfile found '{}'".format(tokenfile))
220
+ log.info("Starting download from scratch...")
221
+
222
+ while True:
223
+ log.info('Index {:4d} | Records {:7d} | resumptionToken "{}"'.format(
224
+ chunk_index, total_records, resumptionToken)
225
+ )
226
+ xml_root = ET.fromstring(get_list_record_chunk(resumptionToken))
227
+ check_xml_errors(xml_root)
228
+ records, resumptionToken = parse_xml_listrecords(xml_root)
229
+
230
+ chunk_index = chunk_index + 1
231
+ total_records = total_records + len(records)
232
+
233
+ with gzip.open(outfile, 'at', encoding='utf-8') as fout:
234
+ for rec in records:
235
+ fout.write(json.dumps(rec) + '\n')
236
+ if resumptionToken:
237
+ with open(tokenfile, 'w') as fout:
238
+ fout.write(resumptionToken)
239
+ else:
240
+ log.info('No resumption token, query finished')
241
+ return
242
+
243
+ time.sleep(12) # OAI server usually requires a 10s wait
244
+
245
+ def load_metadata(infile=None):
246
+ """
247
+ Load metadata saved by all_of_arxiv, as a list of lines of gzip compressed
248
+ json.
249
+
250
+ Parameters
251
+ ----------
252
+ infile : str or None
253
+ name of file saved by gzip. If None, the function attempts to find one
254
+ in the expected location with the expected name.
255
+
256
+ Returns
257
+ -------
258
+ article_attributes : list
259
+ list of dicts, each of which contains the metadata attributes of
260
+ the ArXiv articles
261
+ """
262
+ fname = infile or find_default_locations()
263
+ with gzip.open(fname, 'rt', encoding='utf-8') as fin:
264
+ return [json.loads(line) for line in fin.readlines()]
265
+
266
+ def hash_abstracts(metadata):
267
+ """ Replace abstracts with their MD5 hash for legal distribution """
268
+ metadata_no_abstract = []
269
+ for i in range(len(metadata)):
270
+ m = metadata[i].copy()
271
+ m['abstract_md5'] = hashlib.md5(m['abstract'].encode()).hexdigest()
272
+ del m['abstract']
273
+ metadata_no_abstract.append(m)
274
+ return metadata_no_abstract
275
+
276
+ def validate_abstract_hashes(metadata, metadata_no_abstract):
277
+ """ Validate that abstracts match the hashes """
278
+ for m, n in zip(metadata, metadata_no_abstract):
279
+ md5 = hashlib.md5(m['abstract'].encode()).hexdigest()
280
+ if not md5 == n['abstract_md5']:
281
+ return False
282
+ return True
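The harvesting and hashing helpers above chain together naturally. A minimal usage sketch (assuming `arxiv_public_data` is importable and a dump produced by `all_of_arxiv()` already exists under the configured DIR_BASE):

```
from arxiv_public_data import oai_metadata

# harvest everything (resumable; appends gzipped JSON lines under DIR_BASE)
# oai_metadata.all_of_arxiv()

# load a previously harvested dump as a list of per-article dicts
metadata = oai_metadata.load_metadata()

# replace abstracts with MD5 hashes for redistribution, then sanity-check
stripped = oai_metadata.hash_abstracts(metadata)
assert oai_metadata.validate_abstract_hashes(metadata, stripped)
print(len(stripped), sorted(stripped[0].keys()))
```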
arxiv_public_data/pdfstamp.py ADDED
@@ -0,0 +1,83 @@
1
+ import re
2
+
3
+ SPACE_DIGIT = r'\s*\d\s*'
4
+ SPACE_NUMBER = r'(?:{})+'.format(SPACE_DIGIT)
5
+ SPACE_CHAR = r'\s*[a-zA-Z\.-]\s*'
6
+ SPACE_WORD = r'(?:{})+'.format(SPACE_CHAR)
7
+
8
+ # old style ID, 7 digits in a row
9
+ RE_NUM_OLD = SPACE_DIGIT*7
10
+
11
+ # new style ID, 4 digits, ., 4,5 digits
12
+ RE_NUM_NEW = (
13
+ SPACE_DIGIT*4 +
14
+ r'\.' +
15
+ SPACE_DIGIT*4 + r'(?:{})?'.format(SPACE_DIGIT)
16
+ )
17
+
18
+ # the version part v1 V2 v 1, etc
19
+ RE_VERSION = r'(?:\s*[vV]\s*\d+\s*)?'
20
+
21
+ # the word arxiv, as printed by the autotex, arXiv
22
+ RE_ARXIV = r'\s*a\s*r\s*X\s*i\s*v\s*:\s*'
23
+
24
+ # any words within square brackets [cs.A I]
25
+ RE_CATEGORIES = r'\[{}\]'.format(SPACE_WORD)
26
+
27
+ # two digit date, month, year "29 Jan 2012"
28
+ RE_DATE = SPACE_NUMBER + SPACE_WORD + r'(?:{}){}'.format(SPACE_DIGIT, '{2,4}')
29
+
30
+ # the full identifier for the banner
31
+ RE_ARXIV_ID = (
32
+ RE_ARXIV +
33
+ r'(?:' +
34
+ r'(?:{})|(?:{})'.format(RE_NUM_NEW, RE_NUM_OLD) +
35
+ r')' +
36
+ RE_VERSION +
37
+ RE_CATEGORIES +
38
+ RE_DATE
39
+ )
40
+
41
+ REGEX_ARXIV_ID = re.compile(RE_ARXIV_ID)
42
+
43
+
44
+ def _extract_arxiv_stamp(txt):
45
+ """
46
+ Find location of stamp within the text and remove that section
47
+ """
48
+ match = REGEX_ARXIV_ID.search(txt)
49
+
50
+ if not match:
51
+ return txt, ''
52
+
53
+ s, e = match.span()
54
+ return '{} {}'.format(txt[:s].strip(), txt[e:].strip()), txt[s:e].strip()
55
+
56
+
57
+ def remove_stamp(txt, split=1000):
58
+ """
59
+ Given full text, remove the stamp placed in the pdf by arxiv itself. This
60
+ deserves a bit of consideration since the stamp often becomes mangled by
61
+ the text extraction tool (i.e. hard to find and replace) and can be
62
+ reversed.
63
+
64
+ Parameters
65
+ ----------
66
+ txt : string
67
+ The full text of a document
68
+
69
+ Returns
70
+ -------
71
+ out : string
72
+ Full text without stamp
73
+ """
74
+ t0, t1 = txt[:split], txt[split:]
75
+ txt0, stamp0 = _extract_arxiv_stamp(t0)
76
+ txt1, stamp1 = _extract_arxiv_stamp(t0[::-1])
77
+
78
+ if stamp0:
79
+ return txt0 + t1
80
+ elif stamp1:
81
+ return txt1[::-1] + t1
82
+ else:
83
+ return txt
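A brief sketch of how the banner regex and `remove_stamp` behave; the sample string is illustrative, not real extractor output:

```
from arxiv_public_data.pdfstamp import REGEX_ARXIV_ID, remove_stamp

page = "arXiv:1204.0123v1 [cs.AI] 29 Jan 2012 Title of the paper. Abstract ..."
print(REGEX_ARXIV_ID.search(page) is not None)   # True: the stamp is recognised
print(remove_stamp(page))                        # same text with the stamp cut out
```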
arxiv_public_data/regex_arxiv.py ADDED
@@ -0,0 +1,195 @@
1
+ """
2
+ regex_arxiv.py
3
+
4
+ author: Matt Bierbaum
5
+ date: 2019-03-14
6
+
7
+ RegEx patterns for finding arXiv id citations in fulltext articles.
8
+ """
9
+
10
+ import re
11
+
12
+ # These are all the primary categories present in the OAI ArXiv metadata
13
+ CATEGORIES = [
14
+ "acc-phys", "adap-org", "alg-geom", "ao-sci", "astro-ph", "atom-ph",
15
+ "bayes-an", "chao-dyn", "chem-ph", "cmp-lg", "comp-gas", "cond-mat", "cs",
16
+ "dg-ga", "funct-an", "gr-qc", "hep-ex", "hep-lat", "hep-ph", "hep-th",
17
+ "math", "math-ph", "mtrl-th", "nlin", "nucl-ex", "nucl-th", "patt-sol",
18
+ "physics", "plasm-ph", "q-alg", "q-bio", "quant-ph", "solv-int",
19
+ "supr-con", "eess", "econ", "q-fin", "stat"
20
+ ]
21
+
22
+ # All subcategories with more than 2 capital letters (not SG, SI, SP, etc)
23
+ SUB_CATEGORIES = [
24
+ 'acc-ph', 'ao-ph', 'app-ph', 'atm-clus', 'atom-ph', 'bio-ph', 'chem-ph',
25
+ 'class-ph', 'comp-ph', 'data-an', 'dis-nn', 'ed-ph', 'flu-dyn', 'gen-ph',
26
+ 'geo-ph', 'hist-ph', 'ins-det', 'med-ph', 'mes-hall', 'mtrl-sci', 'optics',
27
+ 'other', 'plasm-ph', 'pop-ph', 'quant-gas', 'soc-ph', 'soft', 'space-ph',
28
+ 'stat-mech', 'str-el', 'supr-con'
29
+ ]
30
+
31
+ __all__ = (
32
+ 'REGEX_ARXIV_SIMPLE',
33
+ 'REGEX_ARXIV_STRICT',
34
+ 'REGEX_ARXIV_FLEXIBLE'
35
+ )
36
+
37
+ dashdict = {c.replace('-', ''): c for c in CATEGORIES if '-' in c}
38
+ dashdict.update({c.replace('-', ''): c for c in SUB_CATEGORIES if '-' in c})
39
+
40
+ REGEX_VERSION_SPLITTER = re.compile(r'([vV][1-9]\d*)')
41
+
42
+ def strip_version(name):
43
+ """ 1501.21981v1 -> 1501.21981 """
44
+ return REGEX_VERSION_SPLITTER.split(name)[0]
45
+
46
+ def format_cat(name):
47
+ """ Strip subcategory, add hyphen to category name if missing """
48
+ if '/' in name: # OLD ID, names contains subcategory
49
+ catsubcat, aid = name.split('/')
50
+ cat = catsubcat.split('.')[0]
51
+ return dashdict.get(cat, cat) + "/" + aid
52
+ else:
53
+ return name
54
+
55
+ def zeropad_1501(name):
56
+ """ Arxiv IDs after yymm=1501 are padded to 5 zeros """
57
+ if not '/' in name: # new ID
58
+ yymm, num = name.split('.')
59
+ if int(yymm) > 1500 and len(num) < 5:
60
+ return yymm + ".0" + num
61
+ return name
62
+
63
+ def clean(name):
64
+ """ Correct common errors in ArXiv IDs to improve matching """
65
+ funcs = [strip_version, format_cat, zeropad_1501]
66
+ for func in funcs:
67
+ name = func(name)
68
+ return name
69
+
70
+ # A common typo is to exclude the hyphen in the category.
71
+ categories = list(set(CATEGORIES + [cat.replace('-', '') for cat in
72
+ CATEGORIES]))
73
+ subcategories = list(set(SUB_CATEGORIES + [cat.replace('-', '') for cat in
74
+ SUB_CATEGORIES]))
75
+
76
+ # capture possible minor categories
77
+ RE_CATEGORIES = r'(?:{})(?:(?:[.][A-Z]{{2}})|(?:{}))?'.format(
78
+ r'|'.join(categories), r'|'.join(subcategories)
79
+ )
80
+
81
+ # valid YYMM date, NOT preceded by any digits
82
+ # NOTE: at the date of writing, it is 2019, so we do not allow
83
+ # proper dates for YY 20 or larger
84
+ RE_DATE = r'(?:(?:[0-1][0-9])|(?:9[1-9]))(?:0[1-9]|1[0-2])'
85
+ RE_VERSION = r'(?:[vV][1-9]\d*)?'
86
+
87
+ # =============================================================================
88
+ RE_NUM_NEW = RE_DATE + r'(?:[.]\d{4,5})' + RE_VERSION
89
+ RE_NUM_OLD = RE_DATE + r'(?:\d{3})' + RE_VERSION
90
+
91
+ # matches: 1612.00001 1203.0023v2
92
+ RE_ID_NEW = r'(?:{})'.format(RE_NUM_NEW)
93
+
94
+ # matches: hep-th/11030234 cs/0112345v2 cs.AI/0112345v2
95
+ RE_ID_OLD = r'(?:{}/{})'.format(RE_CATEGORIES, RE_NUM_OLD)
96
+
97
+ # =============================================================================
98
+ # matches: https://arxiv.org/abs/ abs/ arxiv.org/abs/
99
+ # 3. e-print: eprints
100
+ RE_PREFIX_URL = (
101
+ r'(?:'
102
+ r'(?i:http[s]?\://)?' # we could have a url prefix
103
+ r'(?i:arxiv\.org/)?' # maybe with the arxiv.org bit
104
+ r'(?i:abs/|pdf/)' # at least it has the abs/ part
105
+ r')'
106
+ )
107
+
108
+ # matches: arXiv: arxiv/ arxiv
109
+ RE_PREFIX_ARXIV = r'(?i:arxiv\s*[:/\s,.]*\s*)'
110
+
111
+ # matches: cs.AI/ cs.AI nucl-th
112
+ RE_PREFIX_CATEGORIES = r'(?i:{})'.format(RE_CATEGORIES)
113
+
114
+ # matches: e-prints: e-print eprints:
115
+ RE_PREFIX_EPRINT = r'(?i:e[-]?print[s]?.{1,3})'
116
+
117
+ # =============================================================================
118
+ # matches simple old or new identifiers, no fancy business
119
+ REGEX_ARXIV_SIMPLE = r'(?:{}|{})'.format(RE_ID_OLD, RE_ID_NEW)
120
+
121
+ # this one follows the guide set forth by:
122
+ # https://arxiv.org/help/arxiv_identifier
123
+ REGEX_ARXIV_STRICT = (
124
+ r'(?:{})'.format(RE_PREFIX_ARXIV) +
125
+ r'(?:'
126
+ r'({})'.format(RE_ID_OLD) +
127
+ r'|'
128
+ r'({})'.format(RE_ID_NEW) +
129
+ r')'
130
+ )
131
+
132
+ # this regex essentially accepts anything that looks like an arxiv id and has
133
+ # the slightest smell of being one as well. that is, if it is an id and
134
+ # mentions anything about the arxiv before hand, then it is an id.
135
+ REGEX_ARXIV_FLEXIBLE = (
136
+ r'(?:'
137
+ r'({})'.format(REGEX_ARXIV_SIMPLE) + # capture
138
+ r')|(?:'
139
+ r'(?:'
140
+ r'(?:{})?'.format(RE_PREFIX_URL) +
141
+ r'(?:{})?'.format(RE_PREFIX_EPRINT) +
142
+ r'(?:'
143
+ r'(?:{})?'.format(RE_PREFIX_ARXIV) +
144
+ r'({})'.format(RE_ID_OLD) + # capture
145
+ r'|'
146
+ r'(?:{})'.format(RE_PREFIX_ARXIV) +
147
+ r'(?:{}/)?'.format(RE_CATEGORIES) +
148
+ r'({})'.format(RE_ID_NEW) + # capture
149
+ r')'
150
+ r')'
151
+ r'|'
152
+ r'(?:'
153
+ r'(?:{})|'.format(RE_PREFIX_URL) +
154
+ r'(?:{})|'.format(RE_PREFIX_EPRINT) +
155
+ r'(?:{})|'.format(RE_PREFIX_CATEGORIES) +
156
+ r'(?:{})'.format(RE_PREFIX_ARXIV) +
157
+ r')'
158
+ r'.*?'
159
+ r'({})'.format(REGEX_ARXIV_SIMPLE) + # capture
160
+ r')|(?:'
161
+ r'(?:[\[\(]\s*)'
162
+ r'({})'.format(REGEX_ARXIV_SIMPLE) + # capture
163
+ r'(?:\s*[\]\)])'
164
+ r')'
165
+ )
166
+
167
+ TEST_POSITIVE = [
168
+ 'arXiv:quant-ph 1503.01017v3',
169
+ 'math. RT/0903.2992',
170
+ 'arXiv, 1511.03262',
171
+ 'tions. arXiv preprint arXiv:1607.00021, 2016',
172
+ 'Math. Phys. 255, 577 (2005), hep-th/0306165',
173
+ 'Kuzovlev, arXiv:cond-mat/9903350 ',
174
+ 'arXiv:math.RT/1206.5933,',
175
+ 'arXiv e-prints 1306.1595',
176
+ 'ays, JHEP 07 (2009) 055, [ 0903.0883]',
177
+ ' Rev. D71 (2005) 063534, [ astro-ph/0501562]',
178
+ 'e-print arXiv:1506.02215v1',
179
+ 'available at: http://arxiv.org/abs/1511.08977',
180
+ 'arXiv e-print: 1306.2144',
181
+ 'Preprint arXiv:math/0612139',
182
+ 'Vertices in a Digraph. arXiv preprint 1602.02129 ',
183
+ 'cond-mat/0309488.',
184
+ 'decays, 1701.01871 LHCB-PAPE',
185
+ 'Distribution. In: 1404.2485v3 (2015)',
186
+ '113005 (2013), 1307.4331,',
187
+ 'scalar quantum 1610.07877v1',
188
+ 'cond-mat/0309488.',
189
+ 'cond-mat/0309488.8383'
190
+ ]
191
+
192
+ TEST_NEGATIVE = [
193
+ 'doi: 10.1145/ 321105.321114 ',
194
+ 'doi: 10.1145/ 1105.321114 ',
195
+ ]
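A sketch of the intended use: compile the flexible pattern, scan free-form citation text, and normalise whatever was captured with `clean()`; the sample text reuses strings from `TEST_POSITIVE` above:

```
import re
from arxiv_public_data.regex_arxiv import REGEX_ARXIV_FLEXIBLE, clean

pattern = re.compile(REGEX_ARXIV_FLEXIBLE)
text = 'arXiv:math.RT/1206.5933, Math. Phys. 255, 577 (2005), hep-th/0306165'

# findall yields one tuple of capture groups per match; keep the non-empty ones
ids = [clean(g) for groups in pattern.findall(text) for g in groups if g]
print(ids)   # roughly ['1206.5933', 'hep-th/0306165']
```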
arxiv_public_data/s3_bulk_download.py ADDED
@@ -0,0 +1,397 @@
1
+ """
2
+ s3_bulk_download.py
3
+
4
+ authors: Matt Bierbaum and Colin Clement
5
+ date: 2019-02-27
6
+
7
+ This module uses AWS to request a signed key url, which requests files
8
+ from the ArXiv S3 bucket. It then unpacks and converts the pdfs into text.
9
+
10
+ Note that at the time of writing the ArXiv manifest, it contains 1.15 TB
11
+ of PDFs, which would cost $103 to receive from AWS S3.
12
+
13
+ see: https://arxiv.org/help/bulk_data_s3
14
+
15
+ Usage
16
+ -----
17
+
18
+ Set DIR_FULLTEXT as the directory where the text parsed from pdfs should be placed.
19
+ Set DIR_PDFTARS as the directory where the raw pdf tars should be placed.
20
+
21
+ ```
22
+ import arxiv_public_data.s3_bulk_download as s3
23
+
24
+ # Download manifest file (or load if already downloaded)
25
+ >>> manifest = s3.get_manifest()
26
+
27
+ # Download tar files and convert pdf to text
28
+ # Costs money! Will only download if it does not find files
29
+ >>> s3.process_manifest_files(manifest)
30
+
31
+ # If you just want to download the PDFs and not convert to text use
32
+ >>> s3.download_check_tarfiles(manifest)
33
+ ```
34
+ """
35
+
36
+ import os
37
+ import re
38
+ import gzip
39
+ import json
40
+ import glob
41
+ import shlex
42
+ import shutil
43
+ import tarfile
44
+ import boto3
45
+ import hashlib
46
+ import requests
47
+ import subprocess
48
+
49
+ from functools import partial
50
+ from multiprocessing import Pool
51
+ from collections import defaultdict
52
+ import xml.etree.ElementTree as ET
53
+
54
+ from arxiv_public_data import fulltext
55
+ from arxiv_public_data.config import DIR_FULLTEXT, DIR_PDFTARS, LOGGER
56
+
57
+ logger = LOGGER.getChild('s3')
58
+
59
+ CHUNK_SIZE = 2**20 # 1MB
60
+ BUCKET_NAME = 'arxiv'
61
+ S3_PDF_MANIFEST = 'pdf/arXiv_pdf_manifest.xml'
62
+ S3_TEX_MANIFEST = 'src/arXiv_src_manifest.xml'
63
+ HEADERS = {'x-amz-request-payer': 'requester'}
64
+
65
+ s3 = boto3.client('s3', region_name='us-east-1')
66
+
67
+ def download_file(filename, outfile, chunk_size=CHUNK_SIZE, redownload=False,
68
+ dryrun=False):
69
+ """
70
+ Downloads filename from the ArXiv AWS S3 bucket, and returns streaming md5
71
+ sum of the content
72
+ Parameters
73
+ ----------
74
+ filename : str
75
+ KEY corresponding to AWS bucket file
76
+ outfile : str
77
+ name and path of local file in which downloaded file will be stored
78
+ (optional)
79
+ chunk_size : int
80
+ requests byte streaming size (so 500MB are not stored in memory
81
+ prior to processing)
82
+ redownload : bool
83
+ Look to see if file is already downloaded, and simply return md5sum
84
+ if it exists, unless redownload is True
85
+ dryrun : bool
86
+ If True, only log activity
87
+ Returns
88
+ -------
89
+ md5sum : str
90
+ md5 checksum of the contents of filename
91
+ """
92
+ if os.path.exists(outfile) and not redownload:
93
+ md5 = hashlib.md5()
94
+ md5.update(gzip.open(outfile, 'rb').read())
95
+ return md5.hexdigest()
96
+
97
+ md5 = hashlib.md5()
98
+ url = s3.generate_presigned_url(
99
+ "get_object",
100
+ Params={
101
+ "Bucket": BUCKET_NAME, "Key": filename, "RequestPayer": 'requester'
102
+ }
103
+ )
104
+ if not dryrun:
105
+ logger.info('Requesting "{}" (costs money!)'.format(filename))
106
+ request = requests.get(url, stream=True)
107
+ response_iter = request.iter_content(chunk_size=chunk_size)
108
+ logger.info("\t Writing {}".format(outfile))
109
+ with gzip.open(outfile, 'wb') as fout:
110
+ for i, chunk in enumerate(response_iter):
111
+ fout.write(chunk)
112
+ md5.update(chunk)
113
+ else:
114
+ logger.info('Requesting "{}" (free!)'.format(filename))
115
+ logger.info("\t Writing {}".format(outfile))
116
+ return md5.hexdigest()
117
+
118
+ def default_manifest_filename():
119
+ return os.path.join(DIR_PDFTARS, 'arxiv-manifest.xml.gz')
120
+
121
+ def get_manifest(filename=None, redownload=False):
122
+ """
123
+ Get the file manifest for the ArXiv
124
+ Parameters
125
+ ----------
126
+ redownload : bool
127
+ If true, forces redownload of manifest even if it exists
128
+ Returns
129
+ -------
130
+ file_information : list of dicts
131
+ each dict contains the file metadata
132
+ """
133
+ manifest_file = filename or default_manifest_filename()
134
+ md5 = download_file(
135
+ S3_PDF_MANIFEST, manifest_file, redownload=redownload, dryrun=False
136
+ )
137
+ manifest = gzip.open(manifest_file, 'rb').read()
138
+ return parse_manifest(manifest)
139
+
140
+ def parse_manifest(manifest):
141
+ """
142
+ Parse the XML of the ArXiv manifest file.
143
+
144
+ Parameters
145
+ ----------
146
+ manifest : str
147
+ xml string from the ArXiv manifest file
148
+
149
+ Returns
150
+ -------
151
+ file_information : list of dicts
152
+ One dict for each file, containing the filename, size, md5sum,
153
+ and other metadata
154
+ """
155
+ root = ET.fromstring(manifest)
156
+ return [
157
+ {c.tag: f.find(c.tag).text for c in f.getchildren()}
158
+ for f in root.findall('file')
159
+ ]
160
+
161
+ def _tar_to_filename(filename):
162
+ return os.path.join(DIR_PDFTARS, os.path.basename(filename)) + '.gz'
163
+
164
+ def download_check_tarfile(filename, md5_expected, dryrun=False, redownload=False):
165
+ """ Download filename, check its md5sum, and form the output path """
166
+ outname = _tar_to_filename(filename)
167
+ md5_downloaded = download_file(
168
+ filename, outname, dryrun=dryrun, redownload=redownload
169
+ )
170
+
171
+ if not dryrun:
172
+ if md5_expected != md5_downloaded:
173
+ msg = "MD5 '{}' does not match expected '{}' for file '{}'".format(
174
+ md5_downloaded, md5_expected, filename
175
+ )
176
+ raise AssertionError(msg)
177
+
178
+ return outname
179
+
180
+ def download_check_tarfiles(list_of_fileinfo, dryrun=False):
181
+ """
182
+ Download tar files from the ArXiv manifest and check that their MD5sums
183
+ match
184
+
185
+ Parameters
186
+ ----------
187
+ list_of_fileinfo : list
188
+ Some elements of results of get_manifest
189
+ (optional)
190
+ dryrun : bool
191
+ If True, only log activity
192
+ """
193
+ for fileinfo in list_of_fileinfo:
194
+ download_check_tarfile(fileinfo['filename'], fileinfo['md5sum'], dryrun=dryrun)
195
+
196
+ def _call(cmd, dryrun=False, debug=False):
197
+ """ Spawn a subprocess and execute the string in cmd """
198
+ if dryrun:
199
+ logger.info(cmd)
200
+ return 0
201
+ else:
202
+ return subprocess.check_call(
203
+ shlex.split(cmd), stderr=None if debug else open(os.devnull, 'w')
204
+ )
205
+
206
+ def _make_pathname(filename):
207
+ """
208
+ Make filename path for text document, sorted like on arXiv servers.
209
+ Parameters
210
+ ----------
211
+ filename : str
212
+ string filename of arXiv article
213
+ (optional)
214
+ Returns
215
+ -------
216
+ pathname : str
217
+ pathname in which to store the article following
218
+ * Old ArXiv IDs: e.g. hep-ph0001001.txt returns
219
+ DIR_FULLTEXT/hep-ph/0001/hep-ph0001001.txt
220
+ * New ArXiv IDs: e.g. 1501.13851.txt returns
221
+ DIR_FULLTEXT/arxiv/1501/1501.13851.txt
222
+ """
223
+ basename = os.path.basename(filename)
224
+ fname = os.path.splitext(basename)[0]
225
+ if '.' in fname: # new style ArXiv ID
226
+ yearmonth = fname.split('.')[0]
227
+ return os.path.join(DIR_FULLTEXT, 'arxiv', yearmonth, basename)
228
+ # old style ArXiv ID
229
+ cat, aid = re.split(r'(\d+)', fname)[:2]
230
+ yearmonth = aid[:4]
231
+ return os.path.join(DIR_FULLTEXT, cat, yearmonth, basename)
232
+
233
+ def process_tarfile_inner(filename, pdfnames=None, processes=1, dryrun=False,
234
+ timelimit=fulltext.TIMELIMIT):
235
+ outname = _tar_to_filename(filename)
236
+
237
+ if not os.path.exists(outname):
238
+ msg = 'Tarfile from manifest not found {}, skipping...'.format(outname)
239
+ logger.error(msg)
240
+ return
241
+
242
+ # unpack tar file
243
+ if pdfnames:
244
+ namelist = ' '.join(pdfnames)
245
+ cmd = 'tar --one-top-level -C {} -xf {} {}'
246
+ cmd = cmd.format(DIR_PDFTARS, outname, namelist)
247
+ else:
248
+ cmd = 'tar --one-top-level -C {} -xf {}'.format(DIR_PDFTARS, outname)
249
+ _call(cmd, dryrun)
250
+
251
+ basename = os.path.splitext(os.path.basename(filename))[0]
252
+ pdfdir = os.path.join(DIR_PDFTARS, basename, basename.split('_')[2])
253
+
254
+ # Run fulltext to convert pdfs in tardir into *.txt
255
+ converts = fulltext.convert_directory_parallel(
256
+ pdfdir, processes=processes, timelimit=timelimit
257
+ )
258
+
259
+ # move txt into final file structure
260
+ txtfiles = glob.glob('{}/*.txt'.format(pdfdir))
261
+ for tf in txtfiles:
262
+ mvfn = _make_pathname(tf)
263
+ dirname = os.path.dirname(mvfn)
264
+ if not os.path.exists(dirname):
265
+ _call('mkdir -p {}'.format(dirname), dryrun)
266
+
267
+ if not dryrun:
268
+ shutil.move(tf, mvfn)
269
+
270
+ # clean up pdfs
271
+ _call('rm -rf {}'.format(os.path.join(DIR_PDFTARS, basename)), dryrun)
272
+
273
+ def process_tarfile(fileinfo, pdfnames=None, dryrun=False, debug=False, processes=1):
274
+ """
275
+ Download and process one of the tar files from the ArXiv manifest.
276
+ Download, unpack, and spawn the Docker image for converting pdf2text.
277
+ It will only try to download the file if it does not already exist.
278
+
279
+ The tar file will be stored in DIR_PDFTARS/<fileinfo[filename](tar)> and the
280
+ resulting arXiv articles will be stored in the subdirectory
281
+ DIR_FULLTEXT/arxiv/<yearmonth>/<aid>.txt for new style arXiv IDs and
282
+ DIR_FULLTEXT/<category>/<yearmonth>/<aid>.txt for old style arXiv IDs.
283
+
284
+ Parameters
285
+ ----------
286
+ fileinfo : dict
287
+ dictionary of file information from parse_manifest
288
+ (optional)
289
+ dryrun : bool
290
+ If True, only log activity
291
+ debug : bool
292
+ Silence stderr of Docker _call if debug is False
293
+ """
294
+ filename = fileinfo['filename']
295
+ md5sum = fileinfo['md5sum']
296
+
297
+ if check_if_any_processed(fileinfo):
298
+ logger.info('Tar file appears processed, skipping {}...'.format(filename))
299
+ return
300
+
301
+ logger.info('Processing tar "{}" ...'.format(filename))
302
+ process_tarfile_inner(filename, pdfnames=None, processes=processes, dryrun=dryrun)
303
+
304
+ def process_manifest_files(list_of_fileinfo, processes=1, dryrun=False):
305
+ """
306
+ Download PDFs from the ArXiv AWS S3 bucket and convert each pdf to text.
+ If files are already downloaded, it will only process them.
+
+ Parameters
308
+ ----------
309
+ list_of_fileinfo : list
310
+ Some elements of results of get_manifest
311
+ (optional)
312
+ processes : int
313
+ number of parallel workers to spawn (roughly as many CPUs as you have)
314
+ dryrun : bool
315
+ If True, only log activity
316
+ """
317
+ for fileinfo in list_of_fileinfo:
318
+ process_tarfile(fileinfo, dryrun=dryrun, processes=processes)
319
+
320
+ def check_if_any_processed(fileinfo):
321
+ """
322
+ Spot check a tarfile to see if the pdfs have been converted to text,
323
+ given an element of the s3 manifest
324
+ """
325
+ first = _make_pathname(fileinfo['first_item']+'.txt')
326
+ last = _make_pathname(fileinfo['last_item']+'.txt')
327
+ return os.path.exists(first) and os.path.exists(last)
328
+
329
+ def generate_tarfile_indices(manifest):
330
+ """
331
+ Go through the manifest and for every tarfile, get a list of the PDFs
332
+ that should be contained within it. This is a separate function because
333
+ even checking the tars is rather slow.
334
+
335
+ Returns
336
+ -------
337
+ index : dictionary
338
+ keys: tarfile, values: list of pdfs
339
+ """
340
+ index = {}
341
+
342
+ for fileinfo in manifest:
343
+ name = fileinfo['filename']
344
+ logger.info("Indexing {}...".format(name))
345
+
346
+ tarname = os.path.join(DIR_PDFTARS, os.path.basename(name))+'.gz'
347
+ files = [i for i in tarfile.open(tarname).getnames() if i.endswith('.pdf')]
348
+
349
+ index[name] = files
350
+ return index
351
+
352
+ def check_missing_txt_files(index):
353
+ """
354
+ Use the index file from `generate_tarfile_indices` to check which pdf->txt
355
+ conversions are outstanding.
356
+ """
357
+ missing = defaultdict(list)
358
+ for tar, pdflist in index.items():
359
+ logger.info("Checking {}...".format(tar))
360
+ for pdf in pdflist:
361
+ txt = _make_pathname(pdf).replace('.pdf', '.txt')
362
+
363
+ if not os.path.exists(txt):
364
+ missing[tar].append(pdf)
365
+
366
+ return missing
367
+
368
+ def rerun_missing(missing, processes=1):
369
+ """
370
+ Use the output of `check_missing_txt_files` to attempt to rerun the text
371
+ files which are missing from the conversion. There are various reasons
372
+ that they can fail.
373
+ """
374
+ sort = list(reversed(
375
+ sorted([(k, v) for k, v in missing.items()], key=lambda x: len(x[1]))
376
+ ))
377
+
378
+ for tar, names in sort:
379
+ logger.info("Running {} ({} to do)...".format(tar, len(names)))
380
+ process_tarfile_inner(
381
+ tar, pdfnames=names, processes=processes,
382
+ timelimit=5 * fulltext.TIMELIMIT
383
+ )
384
+
385
+ def process_missing(manifest, processes=1):
386
+ """
387
+ Do the full process of figuring what is missing and running them
388
+ """
389
+ indexfile = os.path.join(DIR_PDFTARS, 'manifest-index.json')
390
+
391
+ if not os.path.exists(indexfile):
392
+ index = generate_tarfile_indices(manifest)
393
+ json.dump(index, open(indexfile, 'w'))
394
+
395
+ index = json.load(open(indexfile))
396
+ missing = check_missing_txt_files(index)
397
+ rerun_missing(missing, processes=processes)
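Beyond the workflow in the module docstring, a cautious first step is to fetch the manifest and dry-run the tar downloads, so nothing large is pulled from the requester-pays bucket. A sketch (assumes boto3 can find AWS credentials; the manifest itself is a small requester-pays request):

```
import arxiv_public_data.s3_bulk_download as s3

manifest = s3.get_manifest()            # one dict per tar file: filename, md5sum, ...
print(len(manifest), manifest[0]['filename'])

# log what would be fetched for the first two tar files, without downloading them
s3.download_check_tarfiles(manifest[:2], dryrun=True)
```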
arxiv_public_data/slice_pdfs.py ADDED
@@ -0,0 +1,93 @@
1
+ import os
2
+ import subprocess
3
+ import shlex
+ import json
4
+ from collections import defaultdict
5
+
6
+ from arxiv_public_data.config import DIR_FULLTEXT, DIR_PDFTARS, LOGGER
7
+
8
+ def id_to_tarpdf(n):
9
+ if '.' in n:
10
+ ym = n.split('.')[0]
11
+ return '{}/{}.pdf'.format(ym, n)
12
+ else:
13
+ ym = n.split('/')[1][:4]
14
+ return '{}/{}.pdf'.format(ym, n.replace('/', ''))
15
+
16
+ def _call(cmd, dryrun=False, debug=False):
17
+ """ Spawn a subprocess and execute the string in cmd """
18
+ return subprocess.check_call(
19
+ shlex.split(cmd), stderr=None if debug else open(os.devnull, 'w')
20
+ )
21
+
22
+ def _tar_to_filename(filename):
23
+ return os.path.join(DIR_PDFTARS, os.path.basename(filename)) + '.gz'
24
+
25
+ def extract_files(tarfile, pdfs, outdir):
26
+ """
27
+ Extract the list of `pdfs` filenames from `tarfile` into the `outdir`
28
+ """
29
+ filename = tarfile
30
+ namelist = ' '.join([id_to_tarpdf(i) for i in pdfs])
31
+
32
+ outname = _tar_to_filename(filename)
33
+ basename = os.path.splitext(os.path.basename(filename))[0]
34
+ tdir = os.path.join(DIR_PDFTARS, basename)
35
+ outpdfs = ' '.join([os.path.join(tdir, id_to_tarpdf(i)) for i in pdfs])
36
+
37
+ cmd0 = 'tar --one-top-level -C {} -xf {} {}'.format(DIR_PDFTARS, outname, namelist)
38
+ cmd1 = 'cp -a {} {}'.format(outpdfs, outdir)
39
+ cmd2 = 'rm -rf {}'.format(tdir)
40
+
41
+ _call(cmd0)
42
+ _call(cmd1)
43
+ _call(cmd2)
44
+
45
+ def call_list(ai, manifest):
46
+ """
47
+ Convert a list of articles and the tar manifest into a dictionary
48
+ of the tarfiles and the pdfs needed from them.
49
+ """
50
+ inv = {}
51
+ for tar, pdfs in manifest.items():
52
+ for pdf in pdfs:
53
+ inv[pdf] = tar
54
+
55
+ tars = defaultdict(list)
56
+ num = 0
57
+ for i in ai:
58
+ aid = i.get('id')
59
+
60
+ tar = id_to_tarpdf(aid)
61
+ if not tar in inv:
62
+ continue
63
+ tars[inv[id_to_tarpdf(aid)]].append(aid)
64
+
65
+ return tars
66
+
67
+ def extract_by_filter(oai, tarmanifest, func, outdir):
68
+ """
69
+ User-facing function that extracts a slice of articles from
70
+ the entire arxiv.
71
+
72
+ Parameters
73
+ ----------
74
+ oai : list of dicts
75
+ The OAI metadata from `oai_metadata.load_metadata`
76
+
77
+ tarmanifest : list of dicts
78
+ Dictionary describing the S3 downloads, `s3_bulk_download.get_manifest`
79
+
80
+ func : function
81
+ Filter to apply to OAI metadata to get list of articles
82
+
83
+ outdir : string
84
+ Directory in which to place the PDFs and metadata for the slice
85
+ """
86
+ articles = func(oai)
87
+ tarmap = call_list(articles, tarmanifest)
88
+
89
+ for tar, pdfs in tarmap.items():
90
+ extract_files(tar, pdfs, outdir=outdir)
91
+
92
+ with open(os.path.join(outdir, 'metadata.json'), 'w') as f:
93
+ json.dump(articles, f)
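A sketch of `extract_by_filter` carving out one category of PDFs. The file names are illustrative: the OAI metadata dump and the tar index (the `manifest-index.json` written by `s3_bulk_download.process_missing`, mapping each tar file to its PDF members) must already exist, and the corresponding tars must have been downloaded.

```
import json, os
from arxiv_public_data import oai_metadata, slice_pdfs

oai = oai_metadata.load_metadata()                    # harvested article metadata
tarmanifest = json.load(open('manifest-index.json'))  # {tarfile: [pdf names]}

def only_quant_ph(articles):
    # keep articles whose category string mentions quant-ph
    return [a for a in articles
            if any('quant-ph' in c for c in (a.get('categories') or []))]

os.makedirs('quant_ph_slice', exist_ok=True)
slice_pdfs.extract_by_filter(oai, tarmanifest, only_quant_ph, outdir='quant_ph_slice')
```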
arxiv_public_data/tex2utf.py ADDED
@@ -0,0 +1,206 @@
1
+ # https://github.com/arXiv/arxiv-base@32e6ad0
2
+ """
3
+ Copyright 2017 Cornell University
4
+
5
+ Permission is hereby granted, free of charge, to any person obtaining a copy of
6
+ this software and associated documentation files (the "Software"), to deal in
7
+ the Software without restriction, including without limitation the rights to
8
+ use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies
9
+ of the Software, and to permit persons to whom the Software is furnished to do
10
+ so, subject to the following conditions:
11
+
12
+ The above copyright notice and this permission notice shall be included in all
13
+ copies or substantial portions of the Software.
14
+
15
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21
+ SOFTWARE.
22
+ """
23
+
24
+ """Convert between TeX escapes and UTF8."""
25
+ import re
26
+ from typing import Pattern, Dict, Match
27
+
28
+ accents = {
29
+ # first accents with non-letter prefix, e.g. \'A
30
+ "'A": 0x00c1, "'C": 0x0106, "'E": 0x00c9, "'I": 0x00cd,
31
+ "'L": 0x0139, "'N": 0x0143, "'O": 0x00d3, "'R": 0x0154,
32
+ "'S": 0x015a, "'U": 0x00da, "'Y": 0x00dd, "'Z": 0x0179,
33
+ "'a": 0x00e1, "'c": 0x0107, "'e": 0x00e9, "'i": 0x00ed,
34
+ "'l": 0x013a, "'n": 0x0144, "'o": 0x00f3, "'r": 0x0155,
35
+ "'s": 0x015b, "'u": 0x00fa, "'y": 0x00fd, "'z": 0x017a,
36
+ '"A': 0x00c4, '"E': 0x00cb, '"I': 0x00cf, '"O': 0x00d6,
37
+ '"U': 0x00dc, '"Y': 0x0178, '"a': 0x00e4, '"e': 0x00eb,
38
+ '"i': 0x00ef, '"o': 0x00f6, '"u': 0x00fc, '"y': 0x00ff,
39
+ '.A': 0x0226, '.C': 0x010a, '.E': 0x0116, '.G': 0x0120,
40
+ '.I': 0x0130, '.O': 0x022e, '.Z': 0x017b, '.a': 0x0227,
41
+ '.c': 0x010b, '.e': 0x0117, '.g': 0x0121, '.o': 0x022f,
42
+ '.z': 0x017c, '=A': 0x0100, '=E': 0x0112, '=I': 0x012a,
43
+ '=O': 0x014c, '=U': 0x016a, '=Y': 0x0232, '=a': 0x0101,
44
+ '=e': 0x0113, '=i': 0x012b, '=o': 0x014d, '=u': 0x016b,
45
+ '=y': 0x0233, '^A': 0x00c2, '^C': 0x0108, '^E': 0x00ca,
46
+ '^G': 0x011c, '^H': 0x0124, '^I': 0x00ce, '^J': 0x0134,
47
+ '^O': 0x00d4, '^S': 0x015c, '^U': 0x00db, '^W': 0x0174,
48
+ '^Y': 0x0176, '^a': 0x00e2, '^c': 0x0109, '^e': 0x00ea,
49
+ '^g': 0x011d, '^h': 0x0125, '^i': 0x00ee, '^j': 0x0135,
50
+ '^o': 0x00f4, '^s': 0x015d, '^u': 0x00fb, '^w': 0x0175,
51
+ '^y': 0x0177, '`A': 0x00c0, '`E': 0x00c8, '`I': 0x00cc,
52
+ '`O': 0x00d2, '`U': 0x00d9, '`a': 0x00e0, '`e': 0x00e8,
53
+ '`i': 0x00ec, '`o': 0x00f2, '`u': 0x00f9, '~A': 0x00c3,
54
+ '~I': 0x0128, '~N': 0x00d1, '~O': 0x00d5, '~U': 0x0168,
55
+ '~a': 0x00e3, '~i': 0x0129, '~n': 0x00f1, '~o': 0x00f5,
56
+ '~u': 0x0169,
57
+ # and now ones with letter prefix \c{c} etc..
58
+ 'HO': 0x0150, 'HU': 0x0170, 'Ho': 0x0151, 'Hu': 0x0171,
59
+ 'cC': 0x00c7, 'cE': 0x0228,
60
+ 'cG': 0x0122, 'cK': 0x0136, 'cL': 0x013b, 'cN': 0x0145,
61
+ 'cR': 0x0156, 'cS': 0x015e, 'cT': 0x0162, 'cc': 0x00e7,
62
+ 'ce': 0x0229, 'cg': 0x0123, 'ck': 0x0137, 'cl': 0x013c,
63
+ # Commented out due ARXIVDEV-2322 (bug reported by PG)
64
+ # 'ci' : 'i\x{0327}' = chr(0x69).ch(0x327) # i with combining cedilla
65
+ 'cn': 0x0146, 'cr': 0x0157, 'cs': 0x015f, 'ct': 0x0163,
66
+ 'kA': 0x0104, 'kE': 0x0118, 'kI': 0x012e, 'kO': 0x01ea,
67
+ 'kU': 0x0172, 'ka': 0x0105, 'ke': 0x0119, 'ki': 0x012f,
68
+ 'ko': 0x01eb, 'ku': 0x0173, 'rA': 0x00c5, 'rU': 0x016e,
69
+ 'ra': 0x00e5, 'ru': 0x016f, 'uA': 0x0102, 'uE': 0x0114,
70
+ 'uG': 0x011e, 'uI': 0x012c, 'uO': 0x014e, 'uU': 0x016c,
71
+ 'ua': 0x0103, 'ue': 0x0115, 'ug': 0x011f,
72
+ 'ui': 0x012d, 'uo': 0x014f, 'uu': 0x016d,
73
+ 'vA': 0x01cd, 'vC': 0x010c, 'vD': 0x010e,
74
+ 'vE': 0x011a, 'vG': 0x01e6, 'vH': 0x021e, 'vI': 0x01cf,
75
+ 'vK': 0x01e8, 'vL': 0x013d, 'vN': 0x0147, 'vO': 0x01d1,
76
+ 'vR': 0x0158, 'vS': 0x0160, 'vT': 0x0164, 'vU': 0x01d3,
77
+ 'vZ': 0x017d, 'va': 0x01ce, 'vc': 0x010d, 'vd': 0x010f,
78
+ 've': 0x011b, 'vg': 0x01e7, 'vh': 0x021f, 'vi': 0x01d0,
79
+ 'vk': 0x01e9, 'vl': 0x013e, 'vn': 0x0148, 'vo': 0x01d2,
80
+ 'vr': 0x0159, 'vs': 0x0161, 'vt': 0x0165, 'vu': 0x01d4,
81
+ 'vz': 0x017e
82
+ }
83
+ r"""
84
+ Hash to lookup tex markup and convert to Unicode.
85
+
86
+ macron: a line above character (overbar \={} in TeX)
87
+ caron: v-shape above character (\v{ } in TeX)
88
+ See: http://www.unicode.org/charts/
89
+
90
+ """
91
+
92
+ textlet = {
93
+ 'AA': 0x00c5, 'AE': 0x00c6, 'DH': 0x00d0, 'DJ': 0x0110,
94
+ 'ETH': 0x00d0, 'L': 0x0141, 'NG': 0x014a, 'O': 0x00d8,
95
+ 'oe': 0x0153, 'OE': 0x0152, 'TH': 0x00de, 'aa': 0x00e5,
96
+ 'ae': 0x00e6,
97
+ 'dh': 0x00f0, 'dj': 0x0111, 'eth': 0x00f0, 'i': 0x0131,
98
+ 'l': 0x0142, 'ng': 0x014b, 'o': 0x00f8, 'ss': 0x00df,
99
+ 'th': 0x00fe,
100
+ # Greek (upper)
101
+ 'Gamma': 0x0393, 'Delta': 0x0394, 'Theta': 0x0398,
102
+ 'Lambda': 0x039b, 'Xi': 0x039E, 'Pi': 0x03a0,
103
+ 'Sigma': 0x03a3, 'Upsilon': 0x03a5, 'Phi': 0x03a6,
104
+ 'Psi': 0x03a8, 'Omega': 0x03a9,
105
+ # Greek (lower)
106
+ 'alpha': 0x03b1, 'beta': 0x03b2, 'gamma': 0x03b3,
107
+ 'delta': 0x03b4, 'epsilon': 0x03b5, 'zeta': 0x03b6,
108
+ 'eta': 0x03b7, 'theta': 0x03b8, 'iota': 0x03b9,
109
+ 'kappa': 0x03ba, 'lambda': 0x03bb, 'mu': 0x03bc,
110
+ 'nu': 0x03bd, 'xi': 0x03be, 'omicron': 0x03bf,
111
+ 'pi': 0x03c0, 'rho': 0x03c1, 'varsigma': 0x03c2,
112
+ 'sigma': 0x03c3, 'tau': 0x03c4, 'upsilon': 0x03c5,
113
+ 'varphi': 0x03C6, # φ
114
+ 'phi': 0x03D5, # ϕ
115
+ 'chi': 0x03c7, 'psi': 0x03c8, 'omega': 0x03c9,
116
+ }
117
+
118
+
119
+ def _p_to_match(tex_to_chr: Dict[str, int]) -> Pattern:
120
+ # textsym and textlet both use the same sort of regex pattern.
121
+ keys = r'\\(' + '|'.join(tex_to_chr.keys()) + ')'
122
+ pstr = r'({)?' + keys + r'(\b|(?=_))(?(1)}|(\\(?= )| |{}|)?)'
123
+ return re.compile(pstr)
124
+
125
+
126
+ textlet_pattern = _p_to_match(textlet)
127
+
128
+ textsym = {
129
+ 'P': 0x00b6, 'S': 0x00a7, 'copyright': 0x00a9,
130
+ 'guillemotleft': 0x00ab, 'guillemotright': 0x00bb,
131
+ 'pounds': 0x00a3, 'dag': 0x2020, 'ddag': 0x2021,
132
+ 'div': 0x00f7, 'deg': 0x00b0}
133
+
134
+ textsym_pattern = _p_to_match(textsym)
135
+
136
+
137
+ def _textlet_sub(match: Match) -> str:
138
+ return chr(textlet[match.group(2)])
139
+
140
+
141
+ def _textsym_sub(match: Match) -> str:
142
+ return chr(textsym[match.group(2)])
143
+
144
+
145
+ def texch2UTF(acc: str) -> str:
146
+ """Convert single character TeX accents to UTF-8.
147
+
148
+ Strip non-word characters from any sequence not recognized (hence
149
+ could return an empty string if there are no word characters in the input
150
+ string).
151
+
152
+ chr(num) will automatically create a UTF8 string for big num
153
+ """
154
+ if acc in accents:
155
+ return chr(accents[acc])
156
+ else:
157
+ return re.sub(r'[^\w]+', '', acc, flags=re.IGNORECASE)
158
+
159
+
160
+ def tex2utf(tex: str, letters: bool = True) -> str:
161
+ r"""Convert some TeX accents and greek symbols to UTF-8 characters.
162
+
163
+ :param tex: Text to filter.
164
+
165
+ :param letters: If False, do not convert greek letters or
166
+ ligatures. Greek symbols can cause problems. Ex. \phi is not
167
+ supposed to look like φ. φ looks like \varphi. See ARXIVNG-1612
168
+
169
+ :returns: string, possibly with some TeX replaced with UTF8
170
+
171
+ """
172
+ # Do dotless i,j -> plain i,j where they are part of an accented i or j
173
+ utf = re.sub(r"/(\\['`\^\"\~\=\.uvH])\{\\([ij])\}", r"\g<1>\{\g<2>\}", tex)
174
+
175
+ # Now work on the Tex sequences, first those with letters only match
176
+ if letters:
177
+ utf = textlet_pattern.sub(_textlet_sub, utf)
178
+
179
+ utf = textsym_pattern.sub(_textsym_sub, utf)
180
+
181
+ utf = re.sub(r'\{\\j\}|\\j\s', 'j', utf) # not in Unicode?
182
+
183
+ # reduce {{x}}, {{{x}}}, ... down to {x}
184
+ while re.search(r'\{\{([^\}]*)\}\}', utf):
185
+ utf = re.sub(r'\{\{([^\}]*)\}\}', r'{\g<1>}', utf)
186
+
187
+ # Accents which have a non-letter prefix in TeX, first \'e
188
+ utf = re.sub(r'\\([\'`^"~=.][a-zA-Z])',
189
+ lambda m: texch2UTF(m.group(1)), utf)
190
+
191
+ # then \'{e} form:
192
+ utf = re.sub(r'\\([\'`^"~=.])\{([a-zA-Z])\}',
193
+ lambda m: texch2UTF(m.group(1) + m.group(2)), utf)
194
+
195
+ # Accents which have a letter prefix in TeX
196
+ # \u{x} u above (breve), \v{x} v above (caron), \H{x} double accute...
197
+ utf = re.sub(r'\\([Hckoruv])\{([a-zA-Z])\}',
198
+ lambda m: texch2UTF(m.group(1) + m.group(2)), utf)
199
+
200
+ # Don't do \t{oo} yet,
201
+ utf = re.sub(r'\\t{([^\}])\}', r'\g<1>', utf)
202
+
203
+ # bdc34: commented out in original Perl
204
+ # $utf =~ s/\{(.)\}/$1/g; # remove { } from around {x}
205
+
206
+ return utf
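A short sketch of the conversion, including the `letters=False` switch discussed in the docstring:

```
from arxiv_public_data.tex2utf import tex2utf

print(tex2utf(r'G\"odel, Erd\H{o}s and Schr\"{o}dinger'))  # Gödel, Erdős and Schrödinger
print(tex2utf(r'\Gamma-ray bursts and \alpha-decay'))      # Γ-ray bursts and α-decay
print(tex2utf(r'\Gamma-ray bursts', letters=False))        # Greek left as TeX: \Gamma-ray bursts
```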
logo.png ADDED
requirements.txt ADDED
@@ -0,0 +1,22 @@
1
+ boto3==1.9.118
2
+ requests==2.20.0
3
+ unicodedata2
4
+ https://github.com/jaepil/pdfminer3k/archive/1.0.4.zip
5
+ sentence-transformers
6
+ pdftotext
7
+ arxiv
8
+ arxiv2bib
9
+ scholarly
10
+ PyMuPDF
11
+ Pillow
12
+ tabula-py
13
+ sentencepiece
14
+ keybert
15
+ spacy[all]
16
+ scispacy
17
+ amrlib
18
+ transformers # >2.2.0
19
+ https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.0/en_core_sci_scibert-0.5.0.tar.gz
20
+ https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.0/en_core_sci_lg-0.5.0.tar.gz
21
+ bert-extractive-summarizer
22
+ streamlit
setup.py ADDED
@@ -0,0 +1,89 @@
1
+ import setuptools
2
+
3
+ with open("README.md", "r", encoding="utf-8") as fh:
4
+ long_description = fh.read()
5
+
6
+ setuptools.setup(
7
+ name="Auto-Research",
8
+ version="1.0",
9
+ author="Sidharth Pal",
10
+ author_email="sidharth.pal1992@gmail.com",
11
+ description="Geberate scientific survey with just a query",
12
+ long_description=long_description,
13
+ long_description_content_type="text/markdown",
14
+ url="https://github.com/sidphbot/Auto-Research",
15
+ project_urls={
16
+ "Docs" : "https://github.com/example/example/README.md",
17
+ "Bug Tracker": "https://github.com/sidphbot/Auto-Research/issues",
18
+ "Demo": "https://www.kaggle.com/sidharthpal/auto-research-generate-survey-from-query",
19
+ },
20
+ classifiers=[
21
+ "Development Status :: 5 - Production/Stable",
22
+ "Environment :: Console",
23
+ "Environment :: Other Environment",
24
+ "Intended Audience :: Developers",
25
+ "Intended Audience :: Education",
26
+ "Intended Audience :: Science/Research",
27
+ "Intended Audience :: Other Audience",
28
+ "Topic :: Education",
29
+ "Topic :: Education :: Computer Aided Instruction (CAI)",
30
+ "Topic :: Scientific/Engineering",
31
+ "Topic :: Scientific/Engineering :: Artificial Intelligence",
32
+ "Topic :: Scientific/Engineering :: Information Analysis",
33
+ "Topic :: Scientific/Engineering :: Medical Science Apps.",
34
+ "Topic :: Scientific/Engineering :: Physics",
35
+ "Natural Language :: English",
36
+ "License :: OSI Approved :: GNU General Public License (GPL)",
37
+ "License :: OSI Approved :: GNU Library or Lesser General Public License (LGPL)",
38
+ "License :: OSI Approved :: GNU Lesser General Public License v3 or later (LGPLv3+)",
39
+ "License :: OSI Approved :: GNU General Public License v3 or later (GPLv3+)",
40
+ "Operating System :: POSIX :: Linux",
41
+ "Operating System :: MacOS :: MacOS X",
42
+ "Environment :: GPU",
43
+ "Environment :: GPU :: NVIDIA CUDA",
44
+ "Programming Language :: Python",
45
+ "Programming Language :: Python :: 3",
46
+ "Programming Language :: Python :: 3 :: Only",
47
+ "Programming Language :: Python :: 3.6",
48
+ ],
49
+ package_dir={"": "src"},
50
+ packages=setuptools.find_packages(where="src"),
51
+ python_requires=">=3.7",
52
+ install_requires=[
53
+ "pip",
54
+ "boto3==1.9.118",
55
+ "requests==2.20.0",
56
+ "unicodedata2",
57
+ "pdfminer3k",
58
+ "sentence-transformers",
59
+ "pdftotext",
60
+ "arxiv",
61
+ "arxiv2bib",
62
+ "scholarly",
63
+ "PyMuPDF",
64
+ "Pillow",
65
+ "tabula-py",
66
+ "sentencepiece",
67
+ "keybert",
68
+ "scispacy",
69
+ "amrlib",
70
+ "transformers",
71
+ "en_core_sci_scibert",
72
+ "bert-extractive-summarizer",
73
+ "en_core_sci_lg",
74
+ ],
75
+ extras_require={
76
+ "spacy": ["all"],
77
+ },
78
+ dependency_links=[
79
+ "https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.0/en_core_sci_scibert-0.5.0.tar.gz#egg=en_core_sci_scibert-0.5.0",
80
+ "https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.0/en_core_sci_lg-0.5.0.tar.gz#egg=en_core_sci_lg-0.5.0"
81
+ ],
82
+ tests_require=["pytest"],
83
+ entry_points={
84
+ 'console_scripts': [
85
+ 'cursive = src.Surveyor:main',
86
+ ],
87
+ },
88
+
89
+ )
src/Auto_Research.egg-info/PKG-INFO ADDED
@@ -0,0 +1,313 @@
1
+ Metadata-Version: 2.1
2
+ Name: Auto-Research
3
+ Version: 1.0
4
+ Summary: Generate scientific survey with just a query
5
+ Home-page: https://github.com/sidphbot/Auto-Research
6
+ Author: Sidharth Pal
7
+ Author-email: sidharth.pal1992@gmail.com
8
+ License: UNKNOWN
9
+ Project-URL: Docs, https://github.com/example/example/README.md
10
+ Project-URL: Bug Tracker, https://github.com/sidphbot/Auto-Research/issues
11
+ Project-URL: Demo, https://www.kaggle.com/sidharthpal/auto-research-generate-survey-from-query
12
+ Platform: UNKNOWN
13
+ Classifier: Development Status :: 5 - Production/Stable
14
+ Classifier: Environment :: Console
15
+ Classifier: Environment :: Other Environment
16
+ Classifier: Intended Audience :: Developers
17
+ Classifier: Intended Audience :: Education
18
+ Classifier: Intended Audience :: Science/Research
19
+ Classifier: Intended Audience :: Other Audience
20
+ Classifier: Topic :: Education
21
+ Classifier: Topic :: Education :: Computer Aided Instruction (CAI)
22
+ Classifier: Topic :: Scientific/Engineering
23
+ Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
24
+ Classifier: Topic :: Scientific/Engineering :: Information Analysis
25
+ Classifier: Topic :: Scientific/Engineering :: Medical Science Apps.
26
+ Classifier: Topic :: Scientific/Engineering :: Physics
27
+ Classifier: Natural Language :: English
28
+ Classifier: License :: OSI Approved :: GNU General Public License (GPL)
29
+ Classifier: License :: OSI Approved :: GNU Library or Lesser General Public License (LGPL)
30
+ Classifier: License :: OSI Approved :: GNU Lesser General Public License v3 or later (LGPLv3+)
31
+ Classifier: License :: OSI Approved :: GNU General Public License v3 or later (GPLv3+)
32
+ Classifier: Operating System :: POSIX :: Linux
33
+ Classifier: Operating System :: MacOS :: MacOS X
34
+ Classifier: Environment :: GPU
35
+ Classifier: Environment :: GPU :: NVIDIA CUDA
36
+ Classifier: Programming Language :: Python
37
+ Classifier: Programming Language :: Python :: 3
38
+ Classifier: Programming Language :: Python :: 3 :: Only
39
+ Classifier: Programming Language :: Python :: 3.6
40
+ Requires-Python: >=3.7
41
+ Description-Content-Type: text/markdown
42
+ Provides-Extra: spacy
43
+ License-File: LICENSE
44
+
45
+ # Auto-Research
46
+ ![Auto-Research][logo]
47
+
48
+ [logo]: https://github.com/sidphbot/Auto-Research/blob/main/logo.png
49
+ A no-code utility to generate a detailed well-cited survey with topic clustered sections (draft paper format) and other interesting artifacts from a single research query.
50
+
51
+ Data Provider: [arXiv](https://arxiv.org/) Open Archive Initiative OAI
52
+
53
+ Requirements:
54
+ - python 3.7 or above
55
+ - poppler-utils - `sudo apt-get install build-essential libpoppler-cpp-dev pkg-config python-dev`
56
+ - list of requirements in requirements.txt - `cat requirements.txt | xargs pip install`
57
+ - 8GB disk space
58
+ - 13GB CUDA (GPU) memory - for a survey of 100 searched papers (max_search) and 25 selected papers (num_papers)
59
+
60
+ #### Demo :
61
+
62
+ Video Demo : https://drive.google.com/file/d/1-77J2L10lsW-bFDOGdTaPzSr_utY743g/view?usp=sharing
63
+
64
+ Kaggle Re-usable Demo : https://www.kaggle.com/sidharthpal/auto-research-generate-survey-from-query
65
+
66
+ (`[TIP]` click 'edit and run' to run the demo for your custom queries on a free GPU)
67
+
68
+
69
+ #### Steps to run (pip coming soon):
70
+ ```
71
+ apt install -y poppler-utils libpoppler-cpp-dev
72
+ git clone https://github.com/sidphbot/Auto-Research.git
73
+
74
+ cd Auto-Research/
75
+ pip install -r requirements.txt
76
+ python survey.py [options] <your_research_query>
77
+ ```
78
+
79
+ #### Artifacts generated (zipped):
80
+ - Detailed survey draft paper as txt file
81
+ - A curated list of top 25+ papers as pdfs and txts
82
+ - Images extracted from above papers as jpegs, bmps etc
83
+ - Heading/Section wise highlights extracted from above papers as a re-usable pure python joblib dump
84
+ - Tables extracted from papers(optional)
85
+ - Corpus of metadata highlights/text of top 100 papers as a re-usable pure python joblib dump
86
+
87
+ ## Example run #1 - python utility
88
+
89
+ ```
90
+ python survey.py 'multi-task representation learning'
91
+ ```
92
+
93
+ ## Example run #2 - python class
94
+
95
+ ```
96
+ from survey import Surveyor
97
+ mysurveyor = Surveyor()
98
+ mysurveyor.survey('quantum entanglement')
99
+ ```
100
+
101
+ ### Research tools:
102
+
103
+ These are independent tools for your research or document text handling needs.
104
+
105
+ ```
106
+ *[Tip]* :(models can be changed in defaults or passed on during init along with `refresh-models=True`)
107
+ ```
108
+
109
+ - `abstractive_summary` - takes a long text document (`string`) and returns a 1-paragraph abstract or “abstractive” summary (`string`)
110
+
111
+ Input:
112
+
113
+ `longtext` : string
114
+
115
+ Returns:
116
+
117
+ `summary` : string
118
+
119
+ - `extractive_summary` - takes a long text document (`string`) and returns a 1-paragraph of extracted highlights or “extractive” summary (`string`)
120
+
121
+ Input:
122
+
123
+ `longtext` : string
124
+
125
+ Returns:
126
+
127
+ `summary` : string
128
+
129
+ - `generate_title` - takes a long text document (`string`) and returns a generated title (`string`)
130
+
131
+ Input:
132
+
133
+ `longtext` : string
134
+
135
+ Returns:
136
+
137
+ `title` : string
138
+
139
+ - `extractive_highlights` - takes a long text document (`string`) and returns a list of extracted highlights (`[string]`), a list of keywords (`[string]`) and key phrases (`[string]`)
140
+
141
+ Input:
142
+
143
+ `longtext` : string
144
+
145
+ Returns:
146
+
147
+ `highlights` : [string]
148
+ `keywords` : [string]
149
+ `keyphrases` : [string]
150
+
151
+ - `extract_images_from_file` - takes a pdf file name (`string`) and returns a list of image filenames (`[string]`).
152
+
153
+ Input:
154
+
155
+ `pdf_file` : string
156
+
157
+ Returns:
158
+
159
+ `images_files` : [string]
160
+
161
+ - `extract_tables_from_file` - takes a pdf file name (`string`) and returns a list of csv filenames (`[string]`).
162
+
163
+ Input:
164
+
165
+ `pdf_file` : string
166
+
167
+ Returns:
168
+
169
+ `csv_files` : [string]
170
+
171
+ - `cluster_lines` - takes a list of lines (`string`) and returns the topic-clustered sections (`dict(generated_title: [cluster_abstract])`) and clustered lines (`dict(cluster_id: [cluster_lines])`)
172
+
173
+ Input:
174
+
175
+ `lines` : [string]
176
+
177
+ Returns:
178
+
179
+ `sections` : dict(generated_title: [cluster_abstract])
180
+ `clusters` : dict(cluster_id: [cluster_lines])
181
+
182
+ - `extract_headings` - *[for scientific texts - Assumes an ‘abstract’ heading present]* takes a text file name (`string`) and returns a list of headings (`[string]`) and refined lines (`[string]`).
183
+
184
+ `[Tip 1]` : Use `extract_sections` as a wrapper (e.g. `extract_sections(extract_headings(“/path/to/textfile”))`) to get heading-wise sectioned text with refined lines instead (`dict( heading: text)`)
185
+
186
+ `[Tip 2]` : write the word ‘abstract’ at the start of the file text to get an extraction for non-scientific texts as well !!
187
+
188
+ Input:
189
+
190
+ `text_file` : string
191
+
192
+ Returns:
193
+
194
+ `refined` : [string],
195
+ `headings` : [string]
196
+ `sectioned_doc` : dict( heading: text) (Optional - Wrapper case)
197
+
198
+
199
+ ## Access/Modify defaults:
200
+
201
+ - inside code
202
+ ```
203
+ from survey.Surveyor import DEFAULTS
204
+ from pprint import pprint
205
+
206
+ pprint(DEFAULTS)
207
+ ```
208
+ or,
209
+
210
+ - Modify static config file - `defaults.py`
211
+
212
+ or,
213
+
214
+ - At runtime (utility)
215
+
216
+ ```
217
+ python survey.py --help
218
+ ```
219
+ ```
220
+ usage: survey.py [-h] [--max_search max_metadata_papers]
221
+ [--num_papers max_num_papers] [--pdf_dir pdf_dir]
222
+ [--txt_dir txt_dir] [--img_dir img_dir] [--tab_dir tab_dir]
223
+ [--dump_dir dump_dir] [--models_dir save_models_dir]
224
+ [--title_model_name title_model_name]
225
+ [--ex_summ_model_name extractive_summ_model_name]
226
+ [--ledmodel_name ledmodel_name]
227
+ [--embedder_name sentence_embedder_name]
228
+ [--nlp_name spacy_model_name]
229
+ [--similarity_nlp_name similarity_nlp_name]
230
+ [--kw_model_name kw_model_name]
231
+ [--refresh_models refresh_models] [--high_gpu high_gpu]
232
+ query_string
233
+
234
+ Generate a survey just from a query !!
235
+
236
+ positional arguments:
237
+ query_string your research query/keywords
238
+
239
+ optional arguments:
240
+ -h, --help show this help message and exit
241
+ --max_search max_metadata_papers
242
+ maximium number of papers to gaze at - defaults to 100
243
+ --num_papers max_num_papers
244
+ maximium number of papers to download and analyse -
245
+ defaults to 25
246
+ --pdf_dir pdf_dir pdf paper storage directory - defaults to
247
+ arxiv_data/tarpdfs/
248
+ --txt_dir txt_dir text-converted paper storage directory - defaults to
249
+ arxiv_data/fulltext/
250
+ --img_dir img_dir image storage directory - defaults to
251
+ arxiv_data/images/
252
+ --tab_dir tab_dir tables storage directory - defaults to
253
+ arxiv_data/tables/
254
+ --dump_dir dump_dir all_output_dir - defaults to arxiv_dumps/
255
+ --models_dir save_models_dir
256
+ directory to save models (> 5GB) - defaults to
257
+ saved_models/
258
+ --title_model_name title_model_name
259
+ title model name/tag in hugging-face, defaults to
260
+ 'Callidior/bert2bert-base-arxiv-titlegen'
261
+ --ex_summ_model_name extractive_summ_model_name
262
+ extractive summary model name/tag in hugging-face,
263
+ defaults to 'allenai/scibert_scivocab_uncased'
264
+ --ledmodel_name ledmodel_name
265
+ led model(for abstractive summary) name/tag in
266
+ hugging-face, defaults to 'allenai/led-
267
+ large-16384-arxiv'
268
+ --embedder_name sentence_embedder_name
269
+ sentence embedder name/tag in hugging-face, defaults
270
+ to 'paraphrase-MiniLM-L6-v2'
271
+ --nlp_name spacy_model_name
272
+ spacy model name/tag in hugging-face (if changed -
273
+ needs to be spacy-installed prior), defaults to
274
+ 'en_core_sci_scibert'
275
+ --similarity_nlp_name similarity_nlp_name
276
+ spacy downstream model(for similarity) name/tag in
277
+ hugging-face (if changed - needs to be spacy-installed
278
+ prior), defaults to 'en_core_sci_lg'
279
+ --kw_model_name kw_model_name
280
+ keyword extraction model name/tag in hugging-face,
281
+ defaults to 'distilbert-base-nli-mean-tokens'
282
+ --refresh_models refresh_models
283
+ Refresh model downloads with given names (needs
284
+ atleast one model name param above), defaults to False
285
+ --high_gpu high_gpu High GPU usage permitted, defaults to False
286
+
287
+ ```
288
+
289
+ - At runtime (code)
290
+
291
+ > during surveyor object initialization with `surveyor_obj = Surveyor()`
292
+ - `pdf_dir`: String, pdf paper storage directory - defaults to `arxiv_data/tarpdfs/`
293
+ - `txt_dir`: String, text-converted paper storage directory - defaults to `arxiv_data/fulltext/`
294
+ - `img_dir`: String, image storage directory - defaults to `arxiv_data/images/`
295
+ - `tab_dir`: String, tables storage directory - defaults to `arxiv_data/tables/`
296
+ - `dump_dir`: String, all_output_dir - defaults to `arxiv_dumps/`
297
+ - `models_dir`: String, directory to save to huge models, defaults to `saved_models/`
298
+ - `title_model_name`: String, title model name/tag in hugging-face, defaults to `Callidior/bert2bert-base-arxiv-titlegen`
299
+ - `ex_summ_model_name`: String, extractive summary model name/tag in hugging-face, defaults to `allenai/scibert_scivocab_uncased`
300
+ - `ledmodel_name`: String, led model(for abstractive summary) name/tag in hugging-face, defaults to `allenai/led-large-16384-arxiv`
301
+ - `embedder_name`: String, sentence embedder name/tag in hugging-face, defaults to `paraphrase-MiniLM-L6-v2`
302
+ - `nlp_name`: String, spacy model name/tag in hugging-face (if changed - needs to be spacy-installed prior), defaults to `en_core_sci_scibert`
303
+ - `similarity_nlp_name`: String, spacy downstream trained model(for similarity) name/tag in hugging-face (if changed - needs to be spacy-installed prior), defaults to `en_core_sci_lg`
304
+ - `kw_model_name`: String, keyword extraction model name/tag in hugging-face, defaults to `distilbert-base-nli-mean-tokens`
305
+ - `high_gpu`: Bool, High GPU usage permitted, defaults to `False`
306
+ - `refresh_models`: Bool, Refresh model downloads with given names (needs at least one model name param above), defaults to False
307
+
308
+ > during survey generation with `surveyor_obj.survey(query="my_research_query")`
309
+ - `max_search`: int, maximum number of papers to gaze at - defaults to `100`
310
+ - `num_papers`: int, maximum number of papers to download and analyse - defaults to `25`
311
+
312
+
313
+
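Putting the runtime (code) options above into one place, a minimal sketch; the keyword arguments are taken from the README text above, so treat the exact parameter names as assumptions rather than the definitive `Surveyor` signature:

```
from survey import Surveyor   # as in 'Example run #2' above

surveyor_obj = Surveyor(
    pdf_dir='arxiv_data/tarpdfs/',
    txt_dir='arxiv_data/fulltext/',
    dump_dir='arxiv_dumps/',
    models_dir='saved_models/',
    high_gpu=False,            # keep GPU usage conservative
)
surveyor_obj.survey('multi-task representation learning',
                    max_search=100, num_papers=25)
```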
src/Auto_Research.egg-info/SOURCES.txt ADDED
@@ -0,0 +1,10 @@
1
+ LICENSE
2
+ README.md
3
+ pyproject.toml
4
+ setup.py
5
+ src/Auto_Research.egg-info/PKG-INFO
6
+ src/Auto_Research.egg-info/SOURCES.txt
7
+ src/Auto_Research.egg-info/dependency_links.txt
8
+ src/Auto_Research.egg-info/entry_points.txt
9
+ src/Auto_Research.egg-info/requires.txt
10
+ src/Auto_Research.egg-info/top_level.txt
src/Auto_Research.egg-info/dependency_links.txt ADDED
@@ -0,0 +1,2 @@
1
+ https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.0/en_core_sci_scibert-0.5.0.tar.gz#egg=en_core_sci_scibert-0.5.0
2
+ https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.0/en_core_sci_lg-0.5.0.tar.gz#egg=en_core_sci_lg-0.5.0
src/Auto_Research.egg-info/entry_points.txt ADDED
@@ -0,0 +1,2 @@
1
+ [console_scripts]
2
+ cursive = src.Surveyor:main
src/Auto_Research.egg-info/requires.txt ADDED
@@ -0,0 +1,24 @@
1
+ pip
2
+ boto3==1.9.118
3
+ requests==2.20.0
4
+ unicodedata2
5
+ pdfminer3k
6
+ sentence-transformers
7
+ pdftotext
8
+ arxiv
9
+ arxiv2bib
10
+ scholarly
11
+ PyMuPDF
12
+ Pillow
13
+ tabula-py
14
+ sentencepiece
15
+ keybert
16
+ scispacy
17
+ amrlib
18
+ transformers
19
+ en_core_sci_scibert
20
+ bert-extractive-summarizer
21
+ en_core_sci_lg
22
+
23
+ [spacy]
24
+ all
src/Auto_Research.egg-info/top_level.txt ADDED
@@ -0,0 +1 @@
1
+
src/Surveyor.py ADDED
@@ -0,0 +1,1518 @@
1
+ from arxiv_public_data.fulltext import convert_directory_parallel
2
+ from arxiv_public_data import internal_citations
3
+ import torch
4
+ import os
5
+ from summarizer import Summarizer
6
+ from sentence_transformers import SentenceTransformer
7
+ import spacy
8
+ import numpy as np
9
+ from keybert import KeyBERT
10
+ import shutil, joblib
11
+ from distutils.dir_util import copy_tree
12
+
13
+ try:
14
+ from transformers import *
15
+ except:
16
+ from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, AutoConfig, AutoModel, LEDTokenizer, \
17
+ LEDForConditionalGeneration
18
+
19
+ from src.defaults import DEFAULTS
20
+
21
+
22
+ class Surveyor:
23
+ '''
24
+ A class to abstract all nlp and data mining helper functions as well as workflows
25
+ required to generate the survey from a single query, with absolute configurability
26
+ '''
27
+
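+ # Rough usage sketch (for orientation only; the arguments shown are the documented
+ # parameters of this class, with their documented defaults):
+ #
+ #     surveyor_obj = Surveyor(high_gpu=False)
+ #     surveyor_obj.survey(query="my_research_query", max_search=100, num_papers=25)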
28
+
29
+ def __init__(
30
+ self,
31
+ pdf_dir=None,
32
+ txt_dir=None,
33
+ img_dir=None,
34
+ tab_dir=None,
35
+ dump_dir=None,
36
+ models_dir=None,
37
+ title_model_name=None,
38
+ ex_summ_model_name=None,
39
+ ledmodel_name=None,
40
+ embedder_name=None,
41
+ nlp_name=None,
42
+ similarity_nlp_name=None,
43
+ kw_model_name=None,
44
+ high_gpu=False,
45
+ refresh_models=False,
46
+ no_save_models=False
47
+ ):
48
+ '''
49
+ Initializes models and directory structure for the surveyor
50
+
51
+ Optional Params:
52
+ - pdf_dir: String, pdf paper storage directory - defaults to arxiv_data/tarpdfs/
53
+ - txt_dir: String, text-converted paper storage directory - defaults to arxiv_data/fulltext/
54
+ - img_dir: String, image storage directory - defaults to arxiv_data/images/
55
+ - tab_dir: String, tables storage directory - defaults to arxiv_data/tables/
56
+ - dump_dir: String, output dump directory (all generated survey artifacts) - defaults to arxiv_dumps/
57
+ - models_dir: String, directory to save the large models to
58
+ - title_model_name: String, title model name/tag in hugging-face, defaults to `Callidior/bert2bert-base-arxiv-titlegen`
59
+ - ex_summ_model_name: String, extractive summary model name/tag in hugging-face, defaults to `allenai/scibert_scivocab_uncased`
60
+ - ledmodel_name: String, led model(for abstractive summary) name/tag in hugging-face, defaults to `allenai/led-large-16384-arxiv`
61
+ - embedder_name: String, sentence embedder name/tag in hugging-face, defaults to `paraphrase-MiniLM-L6-v2`
62
+ - nlp_name: String, spacy model name/tag in hugging-face (if changed - needs to be spacy-installed prior), defaults to `en_core_sci_scibert`
63
+ - similarity_nlp_name: String, spacy downstream trained model(for similarity) name/tag in hugging-face (if changed - needs to be spacy-installed prior), defaults to `en_core_sci_lg`
64
+ - kw_model_name: String, keyword extraction model name/tag in hugging-face, defaults to `distilbert-base-nli-mean-tokens`
65
+ - high_gpu: Bool, High GPU usage permitted, defaults to False
66
+ - refresh_models: Bool, Refresh model downloads with given names (needs at least one model name param above), defaults to False
67
+ - no_save_models: Bool, do not persist models to models_dir (forces a fresh model download/initialization each run), defaults to False
68
+
69
+ - max_search: int, maximum number of papers to search through - defaults to 100
70
+ - num_papers: int, maximum number of papers to download and analyse - defaults to 25
71
+
72
+ '''
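+ # Note on the defaulting behaviour implemented below: any argument left as None is filled
+ # from src.defaults.DEFAULTS. For example, a hypothetical call
+ #     Surveyor(models_dir='saved_models/', refresh_models=True)
+ # re-downloads the default models and re-saves them under models_dir, while passing
+ # no_save_models=True skips persisting them altogether.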
73
+ self.torch_device = 'cuda' if torch.cuda.is_available() else 'cpu'
74
+ print("\nTorch_device: " + self.torch_device)
75
+ if 'cuda' in self.torch_device:
76
+ print("\nloading spacy for gpu")
77
+ spacy.require_gpu()
78
+
79
+ if not kw_model_name:
80
+ kw_model_name = DEFAULTS["kw_model_name"]
81
+ if not high_gpu:
82
+ self.high_gpu = DEFAULTS["high_gpu"]
83
+ else:
84
+ self.high_gpu = high_gpu
85
+ self.num_papers = DEFAULTS['num_papers']
86
+ self.max_search = DEFAULTS['max_search']
87
+ if not models_dir:
88
+ models_dir = DEFAULTS['models_dir']
89
+
90
+ models_found = False
91
+ if os.path.exists(models_dir) and not no_save_models:
92
+ if len(os.listdir(models_dir)) > 6:
93
+ models_found = True
94
+
95
+ if not title_model_name:
96
+ title_model_name = DEFAULTS["title_model_name"]
97
+ if not ex_summ_model_name:
98
+ ex_summ_model_name = DEFAULTS["ex_summ_model_name"]
99
+ if not ledmodel_name:
100
+ ledmodel_name = DEFAULTS["ledmodel_name"]
101
+ if not embedder_name:
102
+ embedder_name = DEFAULTS["embedder_name"]
103
+ if not nlp_name:
104
+ nlp_name = DEFAULTS["nlp_name"]
105
+ if not similarity_nlp_name:
106
+ similarity_nlp_name = DEFAULTS["similarity_nlp_name"]
107
+
108
+ if refresh_models or not models_found:
109
+ print(f'\nInitializing models {"and saving (about 5GB)" if not no_save_models else ""}')
110
+ if not no_save_models:
111
+ self.clean_dirs([models_dir])
112
+
113
+ self.title_tokenizer = AutoTokenizer.from_pretrained(title_model_name)
114
+ self.title_model = AutoModelForSeq2SeqLM.from_pretrained(title_model_name).to(self.torch_device)
115
+ self.title_model.eval()
116
+ if not no_save_models:
117
+ self.title_model.save_pretrained(models_dir + "/title_model")
118
+ #self.title_tokenizer.save_pretrained(models_dir + "/title_tokenizer")
119
+
120
+ # summary model
121
+ self.custom_config = AutoConfig.from_pretrained(ex_summ_model_name)
122
+ self.custom_config.output_hidden_states = True
123
+ self.summ_tokenizer = AutoTokenizer.from_pretrained(ex_summ_model_name)
124
+ self.summ_model = AutoModel.from_pretrained(ex_summ_model_name, config=self.custom_config).to(
125
+ self.torch_device)
126
+ self.summ_model.eval()
127
+ if not no_save_models:
128
+ self.summ_model.save_pretrained(models_dir + "/summ_model")
129
+ #self.summ_tokenizer.save_pretrained(models_dir + "/summ_tokenizer")
130
+ self.model = Summarizer(custom_model=self.summ_model, custom_tokenizer=self.summ_tokenizer)
131
+
132
+ self.ledtokenizer = LEDTokenizer.from_pretrained(ledmodel_name)
133
+ self.ledmodel = LEDForConditionalGeneration.from_pretrained(ledmodel_name).to(self.torch_device)
134
+ self.ledmodel.eval()
135
+ if not no_save_models:
136
+ self.ledmodel.save_pretrained(models_dir + "/ledmodel")
137
+ #self.ledtokenizer.save_pretrained(models_dir + "/ledtokenizer")
138
+
139
+ self.embedder = SentenceTransformer(embedder_name)
140
+ self.embedder.eval()
141
+ if not no_save_models:
142
+ self.embedder.save(models_dir + "/embedder")
143
+ else:
144
+ print("\nInitializing from previously saved models at " + models_dir)
145
+ self.title_tokenizer = AutoTokenizer.from_pretrained(title_model_name)
146
+ self.title_model = AutoModelForSeq2SeqLM.from_pretrained(models_dir + "/title_model").to(self.torch_device)
147
+ self.title_model.eval()
148
+
149
+ # summary model
150
+ #self.summ_config = AutoConfig.from_pretrained(ex_summ_model_name)
151
+ #self.summ_config.output_hidden_states = True
152
+ self.summ_tokenizer = AutoTokenizer.from_pretrained(ex_summ_model_name)
153
+ self.summ_model = AutoModel.from_pretrained(models_dir + "/summ_model").to(
154
+ self.torch_device)
155
+ self.summ_model.eval()
156
+ self.model = Summarizer(custom_model=self.summ_model, custom_tokenizer=self.summ_tokenizer)
157
+
158
+ self.ledtokenizer = LEDTokenizer.from_pretrained(ledmodel_name)
159
+ self.ledmodel = LEDForConditionalGeneration.from_pretrained(models_dir + "/ledmodel").to(self.torch_device)
160
+ self.ledmodel.eval()
161
+
162
+ self.embedder = SentenceTransformer(models_dir + "/embedder")
163
+ self.embedder.eval()
164
+
165
+ self.nlp = spacy.load(nlp_name)
166
+ self.similarity_nlp = spacy.load(similarity_nlp_name)
167
+ self.kw_model = KeyBERT(kw_model_name)
168
+
169
+ self.define_structure(pdf_dir=pdf_dir, txt_dir=txt_dir, img_dir=img_dir, tab_dir=tab_dir, dump_dir=dump_dir)
170
+
171
+ def define_structure(self, pdf_dir=None, txt_dir=None, img_dir=None, tab_dir=None, dump_dir=None):
172
+
173
+ if pdf_dir:
174
+ self.pdf_dir = pdf_dir
175
+ else:
176
+ self.pdf_dir = DEFAULTS["pdf_dir"]
177
+
178
+ if txt_dir:
179
+ self.txt_dir = txt_dir
180
+ else:
181
+ self.txt_dir = DEFAULTS["txt_dir"]
182
+
183
+ if img_dir:
184
+ self.img_dir = img_dir
185
+ else:
186
+ self.img_dir = DEFAULTS["img_dir"]
187
+
188
+ if tab_dir:
189
+ self.tab_dir = tab_dir
190
+ else:
191
+ self.tab_dir = DEFAULTS["tab_dir"]
192
+
193
+ if dump_dir:
194
+ self.dump_dir = dump_dir
195
+ else:
196
+ self.dump_dir = DEFAULTS["dump_dir"]
197
+
198
+ dirs = [self.pdf_dir, self.txt_dir, self.img_dir, self.tab_dir, self.dump_dir]
199
+ if sum([True for dir in dirs if 'arxiv_data/' in dir]):
200
+ base = os.path.dirname("arxiv_data/")
201
+ if not os.path.exists(base):
202
+ os.mkdir(base)
203
+ self.clean_dirs(dirs)
204
+
205
+ def clean_dirs(self, dirs):
206
+ import shutil
207
+ for d in dirs:
208
+ if os.path.exists(d):
209
+ shutil.rmtree(d)
210
+ os.mkdir(d)
211
+
212
+ def pdf_route(self, pdf_dir, txt_dir, img_dir, tab_dir, dump_dir, papers_meta):
213
+ ## Data prep
214
+
215
+ import joblib
216
+ # test full again - check images - check dfs !!
217
+
218
+ self.clean_dirs([pdf_dir, txt_dir, img_dir, tab_dir, dump_dir])
219
+
220
+ papers = papers_meta[:self.num_papers]
221
+ selected_papers = papers
222
+ print("\nFirst stage paper collection...")
223
+ ids_none, papers, cites = self.fetch_papers(dump_dir, img_dir, papers, pdf_dir, tab_dir, txt_dir)
224
+ print("\nFirst stage paper collection complete, papers collected: \n" + ', '.join([p['id'] for p in papers]))
225
+ new_papers = papers_meta[self.num_papers : self.num_papers + len(ids_none)]
226
+ _ = self.get_freq_cited(cites)
227
+ '''
228
+ filtered_idlist = []
229
+ for c in self.get_freq_cited(cites):
230
+ if c in
231
+ _, new_searched_papers = self.search(filtered_idlist)
232
+ new_papers.extend(new_searched_papers)
233
+ '''
234
+ selected_papers.extend(new_papers)
235
+ print("\nSecond stage paper collection...")
236
+ _, new_papers, _ = self.fetch_papers(dump_dir, img_dir, new_papers, pdf_dir, tab_dir, txt_dir, repeat=True)
237
+ print("\nSecond stage paper collection complete, new papers collected: \n" + ', '.join([p['id'] for p in new_papers]))
238
+ papers.extend(new_papers)
239
+
240
+ joblib.dump(papers, dump_dir + 'papers_extracted_pdf_route.dmp')
241
+ copy_tree(img_dir, dump_dir + os.path.basename(img_dir))
242
+ copy_tree(tab_dir, dump_dir + os.path.basename(tab_dir))
243
+
244
+ print("\nExtracting section-wise highlights.. ")
245
+ papers = self.extract_highlights(papers)
246
+
247
+ return papers, selected_papers
248
+
249
+
250
+ def get_freq_cited(self, cites_dict, k=5):
251
+ cites_list = []
252
+ for cited_id, v in cites_dict.items():
253
+ cites_list.append(cited_id)
254
+ [cites_list.append(val) for val in v]
255
+ cite_freqs = {cite: cites_list.count(cite) for cite in set(cites_list)}
256
+ sorted_cites = dict(sorted(cite_freqs.items(), key=lambda item: item[1], reverse=True)[:k])
257
+ print("\nThe most cited paper ids are:\n" + str(sorted_cites))
258
+
259
+ return sorted_cites.keys()
260
+
261
+
262
+ def fetch_papers(self, dump_dir, img_dir, papers, pdf_dir, tab_dir, txt_dir, repeat=False):
263
+ import tempfile
264
+
265
+ if repeat:
266
+ with tempfile.TemporaryDirectory() as dirpath:
267
+ print("\n- downloading extra pdfs.. ")
268
+ # full text preparation of selected papers
269
+ self.download_pdfs(papers, dirpath)
270
+ dirpath_pdfs = os.listdir(dirpath)
271
+ for file_name in dirpath_pdfs:
272
+ full_file_name = os.path.join(dirpath, file_name)
273
+ if os.path.isfile(full_file_name):
274
+ shutil.copy(full_file_name, pdf_dir)
275
+ print("\n- converting extra pdfs.. ")
276
+ self.convert_pdfs(dirpath, txt_dir)
277
+ else:
278
+ print("\n- downloading pdfs.. ")
279
+ # full text preparation of selected papers
280
+ self.download_pdfs(papers, pdf_dir)
281
+ print("\n- converting pdfs.. ")
282
+ self.convert_pdfs(pdf_dir, txt_dir)
283
+ # plugging citations to our papers object
284
+ print("\n- plugging in citation network.. ")
285
+ papers, cites = self.cocitation_network(papers, txt_dir)
286
+ joblib.dump(papers, dump_dir + 'papers_selected_pdf_route.dmp')
287
+ from distutils.dir_util import copy_tree
288
+ copy_tree(txt_dir, dump_dir + os.path.basename(txt_dir))
289
+ copy_tree(pdf_dir, dump_dir + os.path.basename(pdf_dir))
290
+ print("\n- extracting structure.. ")
291
+ papers, ids_none = self.extract_structure(papers, pdf_dir, txt_dir, img_dir, dump_dir, tab_dir)
292
+ return ids_none, papers, cites
293
+
294
+ def tar_route(self, pdf_dir, txt_dir, img_dir, tab_dir, papers):
295
+ ## Data prep
296
+
297
+ import joblib
298
+ # test full again - check images - check dfs !!
299
+
300
+ self.clean_dirs([pdf_dir, txt_dir, img_dir, tab_dir])
301
+
302
+ # full text preparation of selected papers
303
+ self.download_sources(papers, pdf_dir)
304
+ self.convert_pdfs(pdf_dir, txt_dir)
305
+
306
+ # plugging citations to our papers object
307
+ papers, cites = self.cocitation_network(papers, txt_dir)
308
+
309
+ joblib.dump(papers, 'papers_selected_tar_route.dmp')
310
+
311
+ papers = self.extract_structure(papers, pdf_dir, txt_dir, img_dir, tab_dir)
312
+
313
+ joblib.dump(papers, 'papers_extracted_tar_route.dmp')
314
+
315
+ return papers
316
+
317
+ def build_doc(self, research_sections, papers, query=None, filename='survey.txt'):
318
+
319
+ import arxiv2bib
320
+ print("\nbuilding bibliography entries.. ")
321
+ bibentries = arxiv2bib.arxiv2bib([p['id'] for p in papers])
322
+ bibentries = [r.bibtex() for r in bibentries]
323
+
324
+ print("\nbuilding final survey file .. at "+ filename)
325
+ file = open(filename, 'w+')
326
+ if query is None:
327
+ query = 'Internal(existing) research'
328
+ file.write("----------------------------------------------------------------------")
329
+ file.write("Title: A survey on " + query)
330
+ print("")
331
+ print("----------------------------------------------------------------------")
332
+ print("Title: A survey on " + query)
333
+ file.write("Author: Auto-Research (github.com/sidphbot/Auto-Research)")
334
+ print("Author: Auto-Research (github.com/sidphbot/Auto-Research)")
335
+ file.write("Dev: Auto-Research (github.com/sidphbot/Auto-Research)")
336
+ print("Dev: Auto-Research (github.com/sidphbot/Auto-Research)")
337
+ file.write("Disclaimer: This survey is intended to be a research starter. This Survey is Machine-Summarized, "+
338
+ "\nhence some sentences might be wrangled or grammatically incorrect. However all sentences are "+
339
+ "\nmined with proper citations. As all of the text is practically quoted text, to "+
340
+ "\nimprove readability, all the papers are duly cited in the Bibliography section as BibTeX "+
341
+ "\nentries (only to avoid LaTeX overhead). ")
342
+ print("Disclaimer: This survey is intended to be a research starter. This Survey is Machine-Summarized, "+
343
+ "\nhence some sentences might be wrangled or grammatically incorrect. However all sentences are "+
344
+ "\nmined with proper citations. As all of the text is practically quoted text, to "+
345
+ "\nimprove readability, all the papers are duly cited in the Bibliography section as BibTeX "+
346
+ "\nentries (only to avoid LaTeX overhead). ")
347
+ file.write("----------------------------------------------------------------------")
348
+ print("----------------------------------------------------------------------")
349
+ file.write("")
350
+ print("")
351
+ file.write('ABSTRACT')
352
+ print('ABSTRACT')
353
+ print("=================================================")
354
+ file.write("=================================================")
355
+ file.write("")
356
+ print("")
357
+ file.write(research_sections['abstract'])
358
+ print(research_sections['abstract'])
359
+ file.write("")
360
+ print("")
361
+ file.write('INTRODUCTION')
362
+ print('INTRODUCTION')
363
+ print("=================================================")
364
+ file.write("=================================================")
365
+ file.write("")
366
+ print("")
367
+ file.write(research_sections['introduction'])
368
+ print(research_sections['introduction'])
369
+ file.write("")
370
+ print("")
371
+ for k, v in research_sections.items():
372
+ if k not in ['abstract', 'introduction', 'conclusion']:
373
+ file.write(k.upper())
374
+ print(k.upper())
375
+ print("=================================================")
376
+ file.write("=================================================")
377
+ file.write("")
378
+ print("")
379
+ file.write(v)
380
+ print(v)
381
+ file.write("")
382
+ print("")
383
+ file.write('CONCLUSION')
384
+ print('CONCLUSION')
385
+ print("=================================================")
386
+ file.write("=================================================")
387
+ file.write("")
388
+ print("")
389
+ file.write(research_sections['conclusion'])
390
+ print(research_sections['conclusion'])
391
+ file.write("")
392
+ print("")
393
+
394
+ file.write('REFERENCES')
395
+ print('REFERENCES')
396
+ print("=================================================")
397
+ file.write("=================================================")
398
+ file.write("")
399
+ print("")
400
+ for entry in bibentries:
401
+ file.write(entry)
402
+ print(entry)
403
+ file.write("")
404
+ print("")
405
+ print("========================XXX=========================")
406
+ file.write("========================XXX=========================")
407
+ file.close()
408
+
409
+ def build_basic_blocks(self, corpus_known_sections, corpus):
410
+
411
+ research_blocks = {}
412
+ for head, textarr in corpus_known_sections.items():
413
+ torch.cuda.empty_cache()
414
+ # print(head.upper())
415
+ with torch.no_grad():
416
+ summtext = self.model(" ".join([l.lower() for l in textarr]), ratio=0.5)
417
+ res = self.nlp(summtext)
418
+ res = set([str(sent) for sent in list(res.sents)])
419
+ summtext = ''.join([line for line in res])
420
+ # pprint(summtext)
421
+ research_blocks[head] = summtext
422
+
423
+ return research_blocks
424
+
425
+ def abstractive_summary(self, longtext):
426
+ '''
427
+ faulty method
428
+ input_ids = ledtokenizer(longtext, return_tensors="pt").input_ids
429
+ global_attention_mask = torch.zeros_like(input_ids)
430
+ # set global_attention_mask on first token
431
+ global_attention_mask[:, 0] = 1
432
+
433
+ sequences = ledmodel.generate(input_ids, global_attention_mask=global_attention_mask).sequences
434
+ summary = ledtokenizer.batch_decode(sequences)
435
+ '''
436
+ torch.cuda.empty_cache()
437
+ inputs = self.ledtokenizer.prepare_seq2seq_batch(longtext, truncation=True, padding='longest',
438
+ return_tensors='pt').to(self.torch_device)
439
+ with torch.no_grad():
440
+ summary_ids = self.ledmodel.generate(**inputs)
441
+ summary = self.ledtokenizer.batch_decode(summary_ids, skip_special_tokens=True,
442
+ clean_up_tokenization_spaces=True)
443
+ res = self.nlp(summary[0])
444
+ res = set([str(sent) for sent in list(res.sents)])
445
+ summtext = ''.join([line for line in res])
446
+ #print("abstractive summary type:" + str(type(summary)))
447
+ return summtext
448
+
449
+ def get_abstract(self, abs_lines, corpus_known_sections, research_blocks):
450
+
451
+ # abs_lines = " ".join(abs_lines)
452
+ abs_lines = ""
453
+ abs_lines += " ".join([l.lower() for l in corpus_known_sections['abstract']])
454
+ abs_lines += research_blocks['abstract']
455
+ # print(abs_lines)
456
+
457
+ try:
458
+ return self.abstractive_summary(abs_lines)
459
+ except:
460
+ highlights = self.extractive_summary(abs_lines)
461
+ return self.abstractive_summary(highlights)
462
+
463
+ def get_corpus_lines(self, corpus):
464
+ abs_lines = []
465
+ types = set()
466
+ for k, v in corpus.items():
467
+ # print(v)
468
+ types.add(type(v))
469
+ abstext = k + '. ' + v.replace('\n', ' ')
470
+ abstext = self.nlp(abstext)
471
+ abs_lines.extend([str(sent).lower() for sent in list(abstext.sents)])
472
+ #print("unique corpus value types:" + str(types))
473
+ # abs_lines = '\n'.join([str(sent) for sent in abs_lines.sents])
474
+ return abs_lines
475
+
476
+ def get_sectioned_docs(self, papers, papers_meta):
477
+ import random
478
+ docs = []
479
+ for p in papers:
480
+ for section in p['sections']:
481
+ if len(section['highlights']) > 0:
482
+ if self.high_gpu:
483
+ content = self.generate_title(section['highlights'])
484
+ else:
485
+ content = self.extractive_summary(''.join(section['highlights']))
486
+ docs.append(content)
487
+ selected_pids = [p['id'] for p in papers]
488
+ meta_abs = []
489
+ for p in papers_meta:
490
+ if p['id'] not in selected_pids:
491
+ meta_abs.append(self.generate_title(p['abstract']))
492
+ docs.extend(meta_abs)
493
+ #print("meta_abs num"+str(len(meta_abs)))
494
+ #print("selected_pids num"+str(len(selected_pids)))
495
+ #print("papers_meta num"+str(len(papers_meta)))
496
+ #assert (len(meta_abs) + len(selected_pids) == len(papers_meta))
497
+ assert ('str' in str(type(random.sample(docs, 1)[0])))
498
+ return [doc for doc in docs if doc != '']
499
+
500
+
501
+ def cluster_lines(self, abs_lines):
502
+ from sklearn.cluster import KMeans
503
+ # from bertopic import BERTopic
504
+ # topic_model = BERTopic(embedding_model=embedder)
505
+ torch.cuda.empty_cache()
506
+ corpus_embeddings = self.embedder.encode(abs_lines)
507
+ # Normalize the embeddings to unit length
508
+ corpus_embeddings = corpus_embeddings / np.linalg.norm(corpus_embeddings, axis=1, keepdims=True)
509
+ with torch.no_grad():
510
+ optimal_k = self.model.calculate_optimal_k(' '.join(abs_lines), k_max=10)
511
+ # Perform kmean clustering
512
+
513
+ clustering_model = KMeans(n_clusters=optimal_k, n_init=20, n_jobs=-1)
514
+ # clustering_model = AgglomerativeClustering(n_clusters=optimal_k, affinity='cosine', linkage='average') #, affinity='cosine', linkage='average', distance_threshold=0.4)
515
+ clustering_model.fit(corpus_embeddings)
516
+ cluster_assignment = clustering_model.labels_
517
+
518
+ clustered_sentences = {}
519
+ dummy_count = 0
520
+ for sentence_id, cluster_id in enumerate(cluster_assignment):
521
+ if cluster_id not in clustered_sentences:
522
+ clustered_sentences[cluster_id] = []
523
+ '''
524
+ if dummy_count < 5:
525
+ print("abs_line: "+abs_lines[sentence_id])
526
+ print("cluster_ID: "+str(cluster_id))
527
+ print("embedding: "+str(corpus_embeddings[sentence_id]))
528
+ dummy_count += 1
529
+ '''
530
+ clustered_sentences[cluster_id].append(abs_lines[sentence_id])
531
+
532
+ # for i, cluster in clustered_sentences.items():
533
+ # print("Cluster ", i+1)
534
+ # print(cluster)
535
+ # print("")
536
+
537
+ return self.get_clustered_sections(clustered_sentences), clustered_sentences
538
+
539
+
540
+ def get_clusters(self, papers, papers_meta):
541
+ from sklearn.cluster import KMeans
542
+ # from bertopic import BERTopic
543
+ # topic_model = BERTopic(embedding_model=embedder)
544
+ torch.cuda.empty_cache()
545
+ abs_lines = self.get_sectioned_docs(papers, papers_meta)
546
+ corpus_embeddings = self.embedder.encode(abs_lines)
547
+ # Normalize the embeddings to unit length
548
+ corpus_embeddings = corpus_embeddings / np.linalg.norm(corpus_embeddings, axis=1, keepdims=True)
549
+ with torch.no_grad():
550
+ optimal_k = self.model.calculate_optimal_k(' '.join(abs_lines), k_max=10)
551
+ # Perform kmean clustering
552
+
553
+ clustering_model = KMeans(n_clusters=optimal_k, n_init=20, n_jobs=-1)
554
+ # clustering_model = AgglomerativeClustering(n_clusters=optimal_k, affinity='cosine', linkage='average') #, affinity='cosine', linkage='average', distance_threshold=0.4)
555
+ clustering_model.fit(corpus_embeddings)
556
+ cluster_assignment = clustering_model.labels_
557
+
558
+ clustered_sentences = {}
559
+ dummy_count = 0
560
+ for sentence_id, cluster_id in enumerate(cluster_assignment):
561
+ if cluster_id not in clustered_sentences:
562
+ clustered_sentences[cluster_id] = []
563
+ '''
564
+ if dummy_count < 5:
565
+ print("abs_line: "+abs_lines[sentence_id])
566
+ print("cluster_ID: "+str(cluster_id))
567
+ print("embedding: "+str(corpus_embeddings[sentence_id]))
568
+ dummy_count += 1
569
+ '''
570
+ clustered_sentences[cluster_id].append(abs_lines[sentence_id])
571
+
572
+ # for i, cluster in clustered_sentences.items():
573
+ # print("Cluster ", i+1)
574
+ # print(cluster)
575
+ # print("")
576
+
577
+ return self.get_clustered_sections(clustered_sentences), clustered_sentences
578
+
579
+ def generate_title(self, longtext):
580
+ torch.cuda.empty_cache()
581
+
582
+ inputs = self.title_tokenizer.prepare_seq2seq_batch(longtext, truncation=True, padding='longest',
583
+ return_tensors='pt').to(self.torch_device)
584
+ with torch.no_grad():
585
+ summary_ids = self.title_model.generate(**inputs)
586
+ summary = self.title_tokenizer.batch_decode(summary_ids, skip_special_tokens=True,
587
+ clean_up_tokenization_spaces=True)
588
+
589
+ return str(summary[0])
590
+
591
+ def get_clustered_sections(self, clustered_lines):
592
+ clusters_dict = {}
593
+ for i, cluster in clustered_lines.items():
594
+ # print(cluster)
595
+ try:
596
+ clusters_dict[self.generate_title(str(" ".join(cluster)))] = self.abstractive_summary(
597
+ str(" ".join(cluster)).lower())
598
+ except:
599
+ clusters_dict[self.generate_title(str(" ".join(cluster)))] = self.abstractive_summary(
600
+ self.extractive_summary(str(" ".join(cluster)).lower()))
601
+
602
+ return clusters_dict
603
+
604
+ def get_intro(self, corpus_known_sections, research_blocks):
605
+ intro_lines = ""
606
+ intro_lines += str(" ".join([l.lower() for l in corpus_known_sections['introduction']])) + str(
607
+ " ".join([l.lower() for l in corpus_known_sections['conclusion']]))
608
+ intro_lines += research_blocks['introduction'] + research_blocks['conclusion']
609
+ try:
610
+ return self.abstractive_summary(intro_lines)
611
+ except:
612
+ return self.abstractive_summary(self.extractive_summary(intro_lines))
613
+
614
+ def get_conclusion(self, research_sections):
615
+ paper_body = ""
616
+ for k, v in research_sections.items():
617
+ paper_body += v
618
+ return self.abstractive_summary(paper_body)
619
+
620
+ def build_corpus_sectionwise(self, papers):
621
+ known = ['abstract', 'introduction', 'conclusion']
622
+ corpus_known_sections = {}
623
+ for kh in known:
624
+ khtext = []
625
+ for p in papers:
626
+ for section in p['sections']:
627
+ if kh in section['heading']:
628
+ khtext.extend(section['highlights'])
629
+ # print(khtext)
630
+ corpus_known_sections[kh] = khtext
631
+ return corpus_known_sections
632
+
633
+ def standardize_headings(self, papers):
634
+ known = ['abstract', 'introduction', 'discussion', 'relatedwork', 'contribution', 'analysis', 'experiments',
635
+ 'conclusion']
636
+ for p in papers:
637
+ # print("================================")
638
+ headings = [section['heading'] for section in p['sections'] if len(section['heading'].split()) < 3]
639
+ # print("id: "+ str(p['id'])+"\nHeadings: \n"+str('\n'.join(headings)))
640
+ for kh in known:
641
+ for section in p['sections']:
642
+ if len(section['heading'].split()) < 3:
643
+ # print(section['heading'])
644
+ if kh in ''.join(filter(str.isalpha, section['heading'].replace(' ', '').lower())):
645
+ # print("orig head: "+ section['heading'] +", plain head:" + kh)
646
+ section['heading'] = kh
647
+ return papers
648
+
649
+ def build_corpus(self, papers, papers_meta):
650
+ corpus = self.build_meta_corpus(papers_meta)
651
+ for p in papers:
652
+ ph = []
653
+ for sid, section in enumerate(p['sections']):
654
+ ph.extend(section['highlights'])
655
+ for pid, ls in corpus.items():
656
+ if pid == p['id']:
657
+ corpus[pid] = p['abstract'] + str(' '.join(ph))
658
+ '''
659
+ print("================== final corpus ====================")
660
+ print('\n'.join([str("paper: "+ get_by_pid(pid, papers_meta)['title']+" \nhighlight count: " + str(len(phs))) for pid, phs in corpus.items()]))
661
+ print("======== sample point ========")
662
+ p = random.choice(list(papers))
663
+ print("paper: "+ p['title']+" \nhighlights: " + str(corpus[p['id']]))
664
+ print("======== sample meta point ========")
665
+ p = random.choice(list(papers_meta))
666
+ print("meta paper: "+ p['title']+" \nhighlights: " + str(corpus[p['id']]))
667
+ '''
668
+ return corpus
669
+
670
+ def get_by_pid(self, pid, papers):
671
+ for p in papers:
672
+ if p['id'] == pid:
673
+ return p
674
+
675
+ def build_meta_corpus(self, papers):
676
+ meta_corpus = {}
677
+ for p in papers:
678
+ # pprint(p)
679
+ pid = p['id']
680
+ ptext = p['title'] + ". " + p['abstract']
681
+ doc = self.nlp(ptext)
682
+ phs, _, _ = self.extractive_highlights([str(sent) for sent in list(doc.sents)])
683
+ meta_corpus[pid] = str(' '.join(phs))
684
+ '''
685
+ print("================== meta corpus ====================")
686
+ print('\n'.join([str("paper: "+ get_by_pid(pid, papers)['title']+" \nhighlight count: " + str(len(phs))) for pid, phs in meta_corpus.items()]))
687
+ print("======== sample point ========")
688
+ p = random.choice(list(papers))
689
+ print("paper: "+ p['title']+" \nhighlights: " + str(meta_corpus[p['id']]))
690
+ '''
691
+ return meta_corpus
692
+
693
+ def select_papers(self, papers, query, num_papers=20):
694
+ import numpy as np
695
+ # print("paper sample: ")
696
+ # print(papers)
697
+ meta_corpus = self.build_meta_corpus(papers)
698
+ scores = []
699
+ pids = []
700
+ for id, highlights in meta_corpus.items():
701
+ score = self.text_para_similarity(query, highlights)
702
+ scores.append(score)
703
+ pids.append(id)
704
+ print("corpus item: " + str(self.get_by_pid(id, papers)['title']))
705
+
706
+ idx = np.argsort(scores)[:num_papers]
707
+ #for i in range(len(scores)):
708
+ # print("paper: " + str(self.get_by_pid(pids[i], papers)['title']))
709
+ # print("score: " + str(scores[i]))
710
+ # print("argsort ids("+str(num_papers)+" papers): "+ str(idx))
711
+ idx = [pids[i] for i in idx]
712
+ # print("argsort pids("+str(num_papers)+" papers): "+ str(idx))
713
+ papers_selected = [p for p in papers if p['id'] in idx]
714
+ # assert(len(papers_selected)==num_papers)
715
+ print("num papers selected: " + str(len(papers_selected)))
716
+ for p in papers_selected:
717
+ print("Selected Paper: " + p['title'])
718
+
719
+ print("contrast with natural selection: forward")
720
+ for p in papers[:4]:
721
+ print("Selected Paper: " + p['title'])
722
+ print("contrast with natural selection: backward")
723
+ for p in papers[-4:]:
724
+ print("Selected Paper: " + p['title'])
725
+ # arxiv search produces better relevance
726
+ return papers_selected
727
+
728
+ def extractive_summary(self, text):
729
+ torch.cuda.empty_cache()
730
+ with torch.no_grad():
731
+ res = self.model(text, ratio=0.5)
732
+ res_doc = self.nlp(res)
733
+ return " ".join(set([str(sent) for sent in list(res_doc.sents)]))
734
+
735
+ def extractive_highlights(self, lines):
736
+ # text = " ".join(lines)
737
+ # text_doc = nlp(" ".join([l.lower() for l in lines]))
738
+ # text = ' '.join([ str(sent) for sent in list(text_doc.sents)])
739
+ torch.cuda.empty_cache()
740
+ with torch.no_grad():
741
+ res = self.model(" ".join([l.lower() for l in lines]), ratio=0.5, )
742
+ res_doc = self.nlp(res)
743
+ res_lines = set([str(sent) for sent in list(res_doc.sents)])
744
+ # print("\n".join(res_sents))
745
+ with torch.no_grad():
746
+ keywords = self.kw_model.extract_keywords(str(" ".join([l.lower() for l in lines])), stop_words='english')
747
+ keyphrases = self.kw_model.extract_keywords(str(" ".join([l.lower() for l in lines])),
748
+ keyphrase_ngram_range=(4, 4),
749
+ stop_words='english', use_mmr=True, diversity=0.7)
750
+ return res_lines, keywords, keyphrases
751
+
752
+ def extract_highlights(self, papers):
753
+ for p in papers:
754
+ sid = 0
755
+ p['sections'] = []
756
+ for heading, lines in p['body_text'].items():
757
+ hs, kws, kps = self.extractive_highlights(lines)
758
+ p['sections'].append({
759
+ 'sid': sid,
760
+ 'heading': heading,
761
+ 'text': lines,
762
+ 'highlights': hs,
763
+ 'keywords': kws,
764
+ 'keyphrases': kps,
765
+ })
766
+ sid += 1
767
+ return papers
768
+
769
+ def extract_structure(self, papers, pdf_dir, txt_dir, img_dir, dump_dir, tab_dir, tables=False):
770
+ print("\nextracting sections.. ")
771
+ papers, ids_none = self.extract_parts(papers, txt_dir, dump_dir)
772
+
773
+ print("\nextracting images.. for future correlation use-cases ")
774
+ papers = self.extract_images(papers, pdf_dir, img_dir)
775
+
776
+ if tables:
777
+ print("\nextracting tables.. for future correlation use-cases ")
778
+ papers = self.extract_tables(papers, pdf_dir, tab_dir)
779
+
780
+ return papers, ids_none
781
+
782
+ def extract_parts(self, papers, txt_dir, dump_dir):
783
+
784
+ headings_all = {}
785
+ # refined = []
786
+ # model = build_summarizer()
787
+ #for file in glob.glob(txt_dir + '/*.txt'):
788
+ for p in papers:
789
+ file = txt_dir + '/'+ p['id'] +'.txt'
790
+ refined, headings_extracted = self.extract_headings(file)
791
+ sections = self.extract_sections(headings_extracted, refined)
792
+ # highlights = {k: extract_highlights(model,v) for k, v in sections.items()}
793
+ #p = self.get_by_file(file, papers)
794
+ #if len(headings_extracted) > 3:
795
+ p['body_text'] = sections
796
+ # p['body_highlights'] = highlights
797
+ headings_all[p['id']] = headings_extracted
798
+
799
+ ids_none = {i: h for i, h in headings_all.items() if len(h) < 3}
800
+
801
+ '''
802
+ for f, h in headings_all.items():
803
+ if len(h) < 4:
804
+ print("=================headings almost undetected================")
805
+ print(f)
806
+ print(h)
807
+ '''
808
+ # from pprint import pprint
809
+ # pprint({f: len(h) for f,h in headings_all.items()})
810
+ papers_none = [p for p in papers if p['id'] in ids_none]
811
+ for p in papers_none:
812
+ os.remove(txt_dir + '/'+ p['id'] + '.txt')
813
+ papers.remove(p)
814
+
815
+ return papers, ids_none
816
+
817
+ def check_para(self, df):
818
+ size = 0
819
+ for col in df.columns:
820
+ size += df[col].apply(lambda x: len(str(x))).median()
821
+ return size / len(df.columns) > 25
822
+
823
+ def scan_blocks(self, lines):
824
+ lines_mod = [line.strip().replace('\n', '') for line in lines if len(line.strip().replace('\n', '')) > 3]
825
+ for i in range(len(lines_mod)):
826
+ yield lines_mod[i:i + 3]
827
+
828
+ def extract_sections(self, headings, lines, min_part_length=2):
829
+ sections = {}
830
+ self.check_list_elems_in_list(headings, lines)
831
+ head_len = len(headings)
832
+ for i in range(len(headings) - 1):
833
+ start = headings[i]
834
+ end = headings[i + 1]
835
+ section = self.get_section(start, end, lines)
836
+ # print(start + " : "+ str(len(section)) +" lines")
837
+ '''
838
+ if i > 0:
839
+ old = headings[i-1]
840
+ if len(section) < min_part_length + 1:
841
+ sections[old].extend(start)
842
+ sections[old].extend(section)
843
+ else:
844
+ sections[start] = section
845
+ else:
846
+ sections[start] = section
847
+ '''
848
+ sections[start] = section
849
+ return {k: v for k, v in sections.items()}
850
+
851
+ def is_rubbish(self, s, rubbish_tolerance=0.2, min_char_len=4):
852
+ # numbers = sum(c.isdigit() for c in s)
853
+ letters = sum(c.isalpha() for c in s)
854
+ spaces = sum(c.isspace() for c in s)
855
+ # others = len(s) - numbers - letters - spaces
856
+ if len(s) == 0:
857
+ return False
858
+ if ((len(s) - (letters + spaces)) / len(s) >= rubbish_tolerance) or self.alpha_length(s) < min_char_len:
859
+ return True
860
+ else:
861
+ return False
862
+
863
+ def get_section(self, first, last, lines):
864
+ try:
865
+ assert (first in lines)
866
+ assert (last in lines)
867
+ # start = lines.index( first ) + len( first )
868
+ # end = lines.index( last, start )
869
+ start = [i for i in range(len(lines)) if first is lines[i]][0]
870
+ end = [i for i in range(len(lines)) if last is lines[i]][0]
871
+ section_lines = lines[start + 1:end]
872
+ # print("heading: " + str(first))
873
+ # print("section_lines: "+ str(section_lines))
874
+ # print(section_lines)
875
+ return section_lines
876
+ except ValueError:
877
+ print("value error :")
878
+ print("first heading :" + str(first) + ", second heading :" + str(last))
879
+ print("first index :" + str(start) + ", second index :" + str(end))
880
+ return ""
881
+
882
+ def check_list_elems_in_list(self, headings, lines):
883
+ import numpy as np
884
+ # [print(head) for head in headings if head not in lines ]
885
+ return np.all([True if head in lines else False for head in headings])
886
+
887
+ def check_first_char_upper(self, text):
888
+ for c in text:
889
+ if c.isspace():
890
+ continue
891
+ elif c.isalpha():
892
+ return c.isupper()
893
+
894
+ def extract_headings(self, txt_file):
895
+ import re
896
+
897
+ fulltext = self.read_paper(txt_file)
898
+ lines = self.clean_lines(fulltext)
899
+
900
+ refined, headings = self.scan_text(lines)
901
+ assert (self.check_list_elems_in_list(headings, refined))
902
+ headings = self.check_duplicates(headings)
903
+
904
+ # print('===========================================')
905
+ # print(txt_file +": first scan: \n"+str(len(headings))+" headings")
906
+ # print('\n'.join(headings))
907
+
908
+ # scan_failed - rescan with first match for abstract hook
909
+ if len(headings) == 0:
910
+ # print('===================')
911
+ # print("run 1 failed")
912
+ abs_cans = [line for line in lines if 'abstract' in re.sub("\s+", "", line.strip().lower())]
913
+ if len(abs_cans) != 0:
914
+ abs_head = abs_cans[0]
915
+ refined, headings = self.scan_text(lines, abs_head=abs_head)
916
+ self.check_list_elems_in_list(headings, refined)
917
+ headings = self.check_duplicates(headings)
918
+ # print('===================')
919
+ # print(txt_file +": second scan: \n"+str(len(headings))+" headings")
920
+
921
+ # if len(headings) == 0:
922
+ # print("heading scan failed completely")
923
+
924
+ return refined, headings
925
+
926
+ def check_duplicates(self, my_list):
927
+ my_finallist = []
928
+ dups = [s for s in my_list if my_list.count(s) > 1]
929
+ if len(dups) > 0:
930
+ [my_finallist.append(n) for n in my_list if n not in my_finallist]
931
+
932
+ # print("original: "+str(len(my_list))+" new: "+str(len(my_finallist)))
933
+ return my_finallist if my_finallist else my_list  # fall back to the original list when no duplicates were found
934
+
935
+ def clean_lines(self, text):
936
+ import numpy as np
937
+ import re
938
+ # doc = nlp(text)
939
+ # lines = [str(sent) for sent in doc.sents]
940
+ lines = text.replace('\r', '').split('\n')
941
+ lines = [line for line in lines if not self.is_rubbish(line)]
942
+ lines = [line for line in lines if
943
+ re.match("^[a-zA-Z1-9\.\[\]\(\):\-,\"\"\s]*$", line) and not 'Figure' in line and not 'Table' in line]
944
+
945
+ lengths_cleaned = [self.alpha_length(line) for line in lines]
946
+ mean_length_cleaned = np.median(lengths_cleaned)
947
+ lines_standardized = []
948
+ for line in lines:
949
+ if len(line) >= (1.8 * mean_length_cleaned):
950
+ first_half = line[0:len(line) // 2]
951
+ second_half = line[len(line) // 2 if len(line) % 2 == 0 else ((len(line) // 2) + 1):]
952
+ lines_standardized.append(first_half)
953
+ lines_standardized.append(second_half)
954
+ else:
955
+ lines_standardized.append(line)
956
+
957
+ return lines
958
+
959
+ def scan_text(self, lines, abs_head=None):
960
+ import re
961
+ # print('\n'.join(lines))
962
+ record = False
963
+ headings = []
964
+ refined = []
965
+ for i in range(1, len(lines) - 4):
966
+ line = lines[i]
967
+ line = line.replace('\n', '').strip()
968
+ if 'abstract' in re.sub("\s+", "", line.strip().lower()) and len(line) - len('abstract') < 5 or (
969
+ abs_head is not None and abs_head in line):
970
+ record = True
971
+ headings.append(line)
972
+ refined.append(line)
973
+ if 'references' in re.sub("\s+", "", line.strip().lower()) and len(line) - len('references') < 5:
974
+ headings.append(line)
975
+ refined.append(line)
976
+ break
977
+ elif 'bibliography' in re.sub("\s+", "", line.strip().lower()) and len(line) - len('bibliography') < 5:
978
+ headings.append(line)
979
+ refined.append(line)
980
+ break
981
+ refined, headings = self.scanline(record, headings, refined, i, lines)
982
+ # print('=========in scan_text loop i : '+str(i)+' heading count : '+str(len(headings))+' =========')
983
+ return refined, headings
984
+
985
+ def scanline(self, record, headings, refined, id, lines):
986
+ import numpy as np
987
+ import re
988
+ line = lines[id]
989
+
990
+ if not len(line) == 0:
991
+ # print("in scanline")
992
+ # print(line)
993
+ if record:
994
+ refined.append(line)
995
+ if len(lines[id - 1]) == 0 or len(lines[id + 1]) == 0 or re.match(
996
+ "^[1-9XVIABCD]{0,4}(\.{0,1}[1-9XVIABCD]{0,4}){0,3}\s{0,2}[A-Z][a-zA-Z\:\-\s]*$",
997
+ line) and self.char_length(line) > 7:
998
+ # print("candidate")
999
+ # print(line)
1000
+ if np.mean([len(s) for s in lines[id + 2:id + 6]]) > 40 and self.check_first_char_upper(
1001
+ line) and re.match("^[a-zA-Z1-9\.\:\-\s]*$", line) and len(line.split()) < 10:
1002
+ # if len(line) < 20 and np.mean([len(s) for s in lines[i+1:i+5]]) > 30 :
1003
+ headings.append(line)
1004
+ assert (line in refined)
1005
+ # print("selected")
1006
+ # print(line)
1007
+ else:
1008
+ known_headings = ['introduction', 'conclusion', 'abstract', 'references', 'bibliography']
1009
+ missing = [h for h in known_headings if not np.any([True for head in headings if h in head])]
1010
+ # for h in missing:
1011
+ head = [line for h in missing if h in re.sub("\s+", "", line.strip().lower())]
1012
+ # head = [line for known]
1013
+ if len(head) > 0:
1014
+ headings.append(head[0])
1015
+ assert (head[0] in refined)
1016
+
1017
+ return refined, headings
1018
+
1019
+ def char_length(self, s):
1020
+ # numbers = sum(c.isdigit() for c in s)
1021
+ letters = sum(c.isalpha() for c in s)
1022
+ # spaces = sum(c.isspace() for c in s)
1023
+ # others = len(s) - numbers - letters - spaces
1024
+ return letters
1025
+
1026
+ def get_by_file(self, file, papers):
1027
+ import os
1028
+ pid = os.path.basename(file)
1029
+ pid = pid.replace('.txt', '').replace('.pdf', '')
1030
+ for p in papers:
1031
+ if p['id'] == pid:
1032
+ return p
1033
+ print("\npaper not found by file, \nfile: "+file+"\nall papers: "+', '.join([p['id'] for p in papers]))
1034
+
1035
+
1036
+ def alpha_length(self, s):
1037
+ # numbers = sum(c.isdigit() for c in s)
1038
+ letters = sum(c.isalpha() for c in s)
1039
+ spaces = sum(c.isspace() for c in s)
1040
+ # others = len(s) - numbers - letters - spaces
1041
+ return letters + spaces
1042
+
1043
+ def check_append(self, baselist, addstr):
1044
+ check = False
1045
+ for e in baselist:
1046
+ if addstr in e:
1047
+ check = True
1048
+ if not check:
1049
+ baselist.append(addstr)
1050
+ return baselist
1051
+
1052
+ def extract_images(self, papers, pdf_dir, img_dir):
1053
+ import fitz
1054
+ # print("in images")
1055
+ for p in papers:
1056
+ file = pdf_dir + p['id'] + ".pdf"
1057
+ pdf_file = fitz.open(file)
1058
+ images = []
1059
+ for page_index in range(len(pdf_file)):
1060
+ page = pdf_file[page_index]
1061
+ images.extend(page.getImageList())
1062
+ images_files = [self.save_image(pdf_file.extractImage(img[0]), i, p['id'], img_dir) for i, img in
1063
+ enumerate(set(images)) if img[0]]
1064
+ # print(len(images_per_paper))
1065
+ p['images'] = images_files
1066
+ # print(len(p.keys()))
1067
+ # print(papers[0].keys())
1068
+ return papers
1069
+
1070
+
1071
+ def extract_images_from_file(self, pdf_file_name, img_dir):
1072
+ import fitz
1073
+ pdf_file = fitz.open(pdf_file_name)
1074
+ images = []
1075
+ for page_index in range(len(pdf_file)):
1076
+ page = pdf_file[page_index]
1077
+ images.extend(page.getImageList())
1078
+ images_files = [self.save_image(pdf_file.extractImage(img[0]), i, pdf_file_name.replace('.pdf', ''), img_dir) for i, img in
1079
+ enumerate(set(images)) if img[0]]
1080
+ return images_files
1081
+
1082
+ def save_image(self, base_image, img_index, pid, img_dir):
1083
+ from PIL import Image
1084
+ import io
1085
+ image_bytes = base_image["image"]
1086
+ # get the image extension
1087
+ image_ext = base_image["ext"]
1088
+ # load it to PIL
1089
+ image = Image.open(io.BytesIO(image_bytes))
1090
+ # save it to local disk
1091
+ fname = img_dir + "/" + str(pid) + "_" + str(img_index + 1) + "." + image_ext
1092
+ image.save(open(f"{fname}", "wb"))
1093
+ # print(fname)
1094
+ return fname
1095
+
1096
+ def save_tables(self, dfs, pid, tab_dir):
1097
+ # todo
1098
+ dfs = [df for df in dfs if not self.check_para(df)]
1099
+ files = []
1100
+ for i, df in enumerate(dfs):
1101
+ filename = tab_dir + "/" + str(pid) + "_" + str(i) + ".csv"
1102
+ files.append(filename)
1103
+ df.to_csv(filename, index=False)
1104
+ return files
1105
+
1106
+ def extract_tables(self, papers, pdf_dir, tab_dir):
1107
+ import tabula
1108
+ check = True
1109
+ # for file in glob.glob(pdf_dir+'/*.pdf'):
1110
+ for p in papers:
1111
+ dfs = tabula.read_pdf(pdf_dir + p['id'] + ".pdf", pages='all', multiple_tables=True, silent=True)
1112
+ p['tables'] = self.save_tables(dfs, p['id'], tab_dir)
1113
+ # print(papers[0].keys())
1114
+ return papers
1115
+
1116
+ def extract_tables_from_file(self, pdf_file_name, tab_dir):
1117
+ import tabula
1118
+ check = True
1119
+ # for file in glob.glob(pdf_dir+'/*.pdf'):
1120
+ dfs = tabula.read_pdf(pdf_file_name, pages='all', multiple_tables=True, silent=True)
1121
+
1122
+ return self.save_tables(dfs, pdf_file_name.replace('.pdf', ''), tab_dir)
1123
+
1124
+ def search(self, query_text=None, id_list=None, max_search=100):
1125
+ import arxiv
1126
+ from urllib.parse import urlparse
1127
+
1128
+ if query_text:
1129
+ search = arxiv.Search(
1130
+ query=query_text,
1131
+ max_results=max_search,
1132
+ sort_by=arxiv.SortCriterion.Relevance
1133
+ )
1134
+ else:
1135
+ id_list = [id for id in id_list if '.' in id]
1136
+ search = arxiv.Search(
1137
+ id_list=id_list
1138
+ )
1139
+
1140
+ results = [result for result in search.get()]
1141
+
1142
+ searched_papers = []
1143
+ discarded_ids = []
1144
+ for result in results:
1145
+ id = urlparse(result.entry_id).path.split('/')[-1].split('v')[0]
1146
+ if '.' in id:
1147
+ paper = {
1148
+ 'id': id,
1149
+ 'title': result.title,
1150
+ 'comments': result.comment if result.comment else "None",
1151
+ 'journal-ref': result.journal_ref if result.journal_ref else "None",
1152
+ 'doi': str(result.doi),
1153
+ 'primary_category': result.primary_category,
1154
+ 'categories': result.categories,
1155
+ 'license': None,
1156
+ 'abstract': result.summary,
1157
+ 'published': result.published,
1158
+ 'pdf_url': result.pdf_url,
1159
+ 'links': [str(l) for l in result.links],
1160
+ 'update_date': result.updated,
1161
+ 'authors': [str(a.name) for a in result.authors],
1162
+ }
1163
+ searched_papers.append(paper)
1164
+ else:
1165
+ discarded_ids.append(urlparse(result.entry_id).path.split('/')[-1].split('v')[0])
1166
+
1167
+ print("\nPapers discarded due to id error [arxiv api bug: #74] :\n" + str(discarded_ids))
1168
+
1169
+ return results, searched_papers
1170
+
1171
+ def download_pdfs(self, papers, pdf_dir):
1172
+ import arxiv
1173
+ from urllib.parse import urlparse
1174
+ ids = [p['id'] for p in papers]
1175
+ print("\ndownloading below selected papers: ")
1176
+ print(ids)
1177
+ # asert(False)
1178
+ papers_filtered = arxiv.Search(id_list=ids).get()
1179
+ for p in papers_filtered:
1180
+ p_id = str(urlparse(p.entry_id).path.split('/')[-1]).split('v')[0]
1181
+ download_file = pdf_dir + "/" + p_id + ".pdf"
1182
+ p.download_pdf(filename=download_file)
1183
+
1184
+
1185
+ def download_sources(self, papers, src_dir):
1186
+ import arxiv
1187
+ from urllib.parse import urlparse
1188
+ ids = [p['id'] for p in papers]
1189
+ print(ids)
1190
+ # asert(False)
1191
+ papers_filtered = arxiv.Search(id_list=ids).get()
1192
+ for p in papers_filtered:
1193
+ p_id = str(urlparse(p.entry_id).path.split('/')[-1]).split('v')[0]
1194
+ download_file = src_dir + "/" + p_id + ".tar.gz"
1195
+ p.download_source(filename=download_file)
1196
+
1197
+ def convert_pdfs(self, pdf_dir, txt_dir):
1198
+ import glob, shutil
1199
+
1200
+ import multiprocessing
1201
+ # import arxiv_public_data
1202
+
1203
+ convert_directory_parallel(pdf_dir, multiprocessing.cpu_count())
1204
+ for file in glob.glob(pdf_dir + '/*.txt'):
1205
+ shutil.move(file, txt_dir)
1206
+
1207
+ def read_paper(self, path):
1208
+ f = open(path, 'r', encoding="utf-8")
1209
+ text = str(f.read())
1210
+ f.close()
1211
+ return text
1212
+
1213
+ def cocitation_network(self, papers, txt_dir):
1214
+ import multiprocessing
1215
+
1216
+
1217
+ cites = internal_citations.citation_list_parallel(N=multiprocessing.cpu_count(), directory=txt_dir)
1218
+ print("\ncitation-network: ")
1219
+ print(cites)
1220
+
1221
+ for p in papers:
1222
+ p['cites'] = cites[p['id']]
1223
+ return papers, cites
1224
+
1225
+ def lookup_author(self, author_query):
1226
+
1227
+ from scholarly import scholarly
1228
+ import operator
1229
+ # Retrieve the author's data, fill-in, and print
1230
+ print("Searching Author: " + author_query)
1231
+ search_result = next(scholarly.search_author(author_query), None)
1232
+
1233
+ if search_result is not None:
1234
+ author = scholarly.fill(search_result)
1235
+ author_stats = {
1236
+ 'name': author_query,
1237
+ 'affiliation': author['affiliation'] if author['affiliation'] else None,
1238
+ 'citedby': author['citedby'] if 'citedby' in author.keys() else 0,
1239
+ 'most_cited_year': max(author['cites_per_year'].items(), key=operator.itemgetter(1))[0] if len(
1240
+ author['cites_per_year']) > 0 else None,
1241
+ 'coauthors': [c['name'] for c in author['coauthors']],
1242
+ 'hindex': author['hindex'],
1243
+ 'impact': author['i10index'],
1244
+ 'interests': author['interests'],
1245
+ 'publications': [{'title': p['bib']['title'], 'citations': p['num_citations']} for p in
1246
+ author['publications']],
1247
+ 'url_picture': author['url_picture'],
1248
+ }
1249
+ else:
1250
+ print("author not found")
1251
+ author_stats = {
1252
+ 'name': author_query,
1253
+ 'affiliation': "",
1254
+ 'citedby': 0,
1255
+ 'most_cited_year': None,
1256
+ 'coauthors': [],
1257
+ 'hindex': 0,
1258
+ 'impact': 0,
1259
+ 'interests': [],
1260
+ 'publications': [],
1261
+ 'url_picture': "",
1262
+ }
1263
+
1264
+ # pprint(author_stats)
1265
+ return author_stats
1266
+
1267
+ def author_stats(self, papers):
1268
+ all_authors = []
1269
+ for p in papers:
1270
+ paper_authors = [a for a in p['authors']]
1271
+ all_authors.extend(paper_authors)
1272
+
1273
+ searched_authors = [self.lookup_author(a) for a in set(all_authors)]
1274
+
1275
+ return searched_authors
1276
+
1277
+ def text_similarity(self, text1, text2):
1278
+ doc1 = self.similarity_nlp(text1)
1279
+ doc2 = self.similarity_nlp(text2)
1280
+ return doc1.similarity(doc2)
1281
+
1282
+ def text_para_similarity(self, text, lines):
1283
+ doc1 = self.similarity_nlp(text)
1284
+ doc2 = self.similarity_nlp(" ".join(lines))
1285
+ return doc1.similarity(doc2)
1286
+
1287
+ def para_para_similarity(self, lines1, lines2):
1288
+ doc1 = self.similarity_nlp(" ".join(lines1))
1289
+ doc2 = self.similarity_nlp(" ".join(lines2))
1290
+ return doc1.similarity(doc2)
1291
+
1292
+ def text_image_similarity(self, text, image):
1293
+ pass
1294
+
1295
+ def ask(self, corpus, question):
1296
+ text = " ".join(corpus)
1297
+ import torch
1298
+ inputs = self.qatokenizer(question, text, return_tensors='pt')
1299
+ start_positions = torch.tensor([1])
1300
+ end_positions = torch.tensor([3])
1301
+ outputs = self.qamodel(**inputs, start_positions=start_positions, end_positions=end_positions)
1302
+ print("context: " + text)
1303
+ print("question: " + question)
1304
+ print("outputs: " + outputs)
1305
+ return outputs
1306
+
1307
+ def zip_outputs(self, dump_dir, query):
1308
+ import zipfile
1309
+ def zipdir(path, ziph):
1310
+ # ziph is zipfile handle
1311
+ for root, dirs, files in os.walk(path):
1312
+ for file in files:
1313
+ ziph.write(os.path.join(root, file),
1314
+ os.path.relpath(os.path.join(root, file),
1315
+ os.path.join(path, '../..')))
1316
+
1317
+ zip_name = 'arxiv_dumps_'+query.replace(' ', '_')+'.zip'
1318
+ zipf = zipfile.ZipFile(zip_name, 'w', zipfile.ZIP_DEFLATED)
1319
+ zipdir(dump_dir, zipf)
1320
+ return zip_name
1321
+
1322
+ def survey(self, query, max_search=None, num_papers=None, debug=False, weigh_authors=False):
+ import joblib
+ import os, shutil
+ if not max_search:
+ max_search = DEFAULTS['max_search']
+ if not num_papers:
+ num_papers = DEFAULTS['num_papers']
+ # arxiv api relevance search and data preparation
+ print("\nsearching arXiv for top " + str(max_search) + " papers.. ")
+ results, searched_papers = self.search(query, max_search=max_search)
+ joblib.dump(searched_papers, self.dump_dir + 'papers_metadata.dmp')
+ print("\nfound " + str(len(searched_papers)) + " papers")
+
+ # paper selection by scibert vector embedding relevance scores
+ # papers_selected = select_papers(searched_papers, query, num_papers=num_papers)
+
+ papers_highlighted, papers_selected = self.pdf_route(self.pdf_dir, self.txt_dir, self.img_dir, self.tab_dir, self.dump_dir,
+ searched_papers)
+
+ if weigh_authors:
+ authors = self.author_stats(papers_highlighted)
+
+ joblib.dump(papers_highlighted, self.dump_dir + 'papers_highlighted.dmp')
+
+ print("\nStandardizing known section headings per paper.. ")
+ papers_standardized = self.standardize_headings(papers_highlighted)
+ joblib.dump(papers_standardized, self.dump_dir + 'papers_standardized.dmp')
+
+ print("\nBuilding paper-wise corpus.. ")
+ corpus = self.build_corpus(papers_highlighted, searched_papers)
+ joblib.dump(corpus, self.dump_dir + 'corpus.dmp')
+
+ print("\nBuilding section-wise corpus.. ")
+ corpus_sectionwise = self.build_corpus_sectionwise(papers_standardized)
+ joblib.dump(corpus_sectionwise, self.dump_dir + 'corpus_sectionwise.dmp')
+
+ print("\nBuilding basic research highlights.. ")
+ research_blocks = self.build_basic_blocks(corpus_sectionwise, corpus)
+ joblib.dump(research_blocks, self.dump_dir + 'research_blocks.dmp')
+
+ print("\nReducing corpus to lines.. ")
+ corpus_lines = self.get_corpus_lines(corpus)
+ joblib.dump(corpus_lines, self.dump_dir + 'corpus_lines.dmp')
+
+ # temp
+ # searched_papers = joblib.load(dump_dir + 'papers_metadata.dmp')
+ '''
+ papers_highlighted = joblib.load(dump_dir + 'papers_highlighted.dmp')
+ corpus = joblib.load(dump_dir + 'corpus.dmp')
+ papers_standardized = joblib.load(dump_dir + 'papers_standardized.dmp')
+ corpus_sectionwise = joblib.load(dump_dir + 'corpus_sectionwise.dmp')
+ research_blocks = joblib.load(dump_dir + 'research_blocks.dmp')
+ corpus_lines = joblib.load(dump_dir + 'corpus_lines.dmp')
+ '''
+
+ '''
+ print("papers_highlighted types:"+ str(np.unique([str(type(p['sections'][0]['highlights'])) for p in papers_highlighted])))
+ print("papers_highlighted example:")
+ print(random.sample(list(papers_highlighted), 1)[0]['sections'][0]['highlights'])
+ print("corpus types:"+ str(np.unique([str(type(txt)) for k,txt in corpus.items()])))
+ print("corpus example:")
+ print(random.sample(list(corpus.items()), 1)[0])
+ print("corpus_lines types:"+ str(np.unique([str(type(txt)) for txt in corpus_lines])))
+ print("corpus_lines example:")
+ print(random.sample(list(corpus_lines), 1)[0])
+ print("corpus_sectionwise types:"+ str(np.unique([str(type(txt)) for k,txt in corpus_sectionwise.items()])))
+ print("corpus_sectionwise example:")
+ print(random.sample(list(corpus_sectionwise.items()), 1)[0])
+ print("research_blocks types:"+ str(np.unique([str(type(txt)) for k,txt in research_blocks.items()])))
+ print("research_blocks example:")
+ print(random.sample(list(research_blocks.items()), 1)[0])
+ '''
+ # print("corpus types:"+ str(np.unique([type(txt) for k,txt in corpus.items()])))
+
+ print("\nBuilding abstract.. ")
1397
+ abstract_block = self.get_abstract(corpus_lines, corpus_sectionwise, research_blocks)
1398
+ joblib.dump(abstract_block, self.dump_dir + 'abstract_block.dmp')
1399
+ '''
1400
+ print("abstract_block type:"+ str(type(abstract_block)))
1401
+ print("abstract_block:")
1402
+ print(abstract_block)
1403
+ '''
1404
+
1405
+ print("\nBuilding introduction.. ")
1406
+ intro_block = self.get_intro(corpus_sectionwise, research_blocks)
1407
+ joblib.dump(intro_block, self.dump_dir + 'intro_block.dmp')
1408
+ '''
1409
+ print("intro_block type:"+ str(type(intro_block)))
1410
+ print("intro_block:")
1411
+ print(intro_block)
1412
+ '''
1413
+ print("\nBuilding custom sections.. ")
1414
+ clustered_sections, clustered_sentences = self.get_clusters(papers_standardized, searched_papers)
1415
+ joblib.dump(clustered_sections, self.dump_dir + 'clustered_sections.dmp')
1416
+ joblib.dump(clustered_sentences, self.dump_dir + 'clustered_sentences.dmp')
1417
+
1418
+ '''
1419
+ print("clusters extracted")
1420
+ print("clustered_sentences types:"+ str(np.unique([str(type(txt)) for k,txt in clustered_sentences.items()])))
1421
+ print("clustered_sentences example:")
1422
+ print(random.sample(list(clustered_sections.items()), 1)[0])
1423
+ print("clustered_sections types:"+ str(np.unique([str(type(txt)) for k,txt in clustered_sections.items()])))
1424
+ print("clustered_sections example:")
1425
+ print(random.sample(list(clustered_sections.items()), 1)[0])
1426
+ '''
1427
+ clustered_sections['abstract'] = abstract_block
1428
+ clustered_sections['introduction'] = intro_block
1429
+ joblib.dump(clustered_sections, self.dump_dir + 'research_sections.dmp')
1430
+
1431
+ print("\nBuilding conclusion.. ")
1432
+ conclusion_block = self.get_conclusion(clustered_sections)
1433
+ joblib.dump(conclusion_block, self.dump_dir + 'conclusion_block.dmp')
1434
+ clustered_sections['conclusion'] = conclusion_block
1435
+ '''
1436
+ print("conclusion_block type:"+ str(type(conclusion_block)))
1437
+ print("conclusion_block:")
1438
+ print(conclusion_block)
1439
+ '''
1440
+
1441
+ survey_file = 'A_Survey_on_' + query.replace(' ', '_') + '.txt'
+ self.build_doc(clustered_sections, papers_standardized, query=query, filename=self.dump_dir + survey_file)
+
+ shutil.copytree('arxiv_data/', self.dump_dir + '/arxiv_data/')
+ shutil.copy(self.dump_dir + survey_file, survey_file)
+ assert os.path.exists(survey_file)
+ output_zip = self.zip_outputs(self.dump_dir, query)
+ # zip_outputs writes the archive to the current working directory, so report and return that path
+ print("\nSurvey complete.. \nSurvey file path :" + os.path.abspath(
+ survey_file) + "\nAll outputs zip path :" + os.path.abspath(output_zip))
+
+ return os.path.abspath(output_zip), os.path.abspath(survey_file)
+
+
+ if __name__ == '__main__':
+ import argparse
+
+ parser = argparse.ArgumentParser(description='Generate a survey just from a query !!')
+ parser.add_argument('query', metavar='query_string', type=str,
+ help='your research query/keywords')
+ parser.add_argument('--max_search', metavar='max_metadata_papers', type=int, default=None,
+ help='maximum number of papers to gaze at - defaults to 100')
+ parser.add_argument('--num_papers', metavar='max_num_papers', type=int, default=None,
+ help='maximum number of papers to download and analyse - defaults to 20')
+ parser.add_argument('--pdf_dir', metavar='pdf_dir', type=str, default=None,
+ help='pdf paper storage directory - defaults to arxiv_data/tarpdfs/')
+ parser.add_argument('--txt_dir', metavar='txt_dir', type=str, default=None,
+ help='text-converted paper storage directory - defaults to arxiv_data/fulltext/')
+ parser.add_argument('--img_dir', metavar='img_dir', type=str, default=None,
+ help='image storage directory - defaults to arxiv_data/images/')
+ parser.add_argument('--tab_dir', metavar='tab_dir', type=str, default=None,
+ help='tables storage directory - defaults to arxiv_data/tables/')
+ parser.add_argument('--dump_dir', metavar='dump_dir', type=str, default=None,
+ help='all_output_dir - defaults to arxiv_dumps/')
+ parser.add_argument('--models_dir', metavar='save_models_dir', type=str, default=None,
+ help='directory to save models (> 5GB) - defaults to saved_models/')
+ parser.add_argument('--title_model_name', metavar='title_model_name', type=str, default=None,
+ help='title model name/tag in hugging-face, defaults to \'Callidior/bert2bert-base-arxiv-titlegen\'')
+ parser.add_argument('--ex_summ_model_name', metavar='extractive_summ_model_name', type=str, default=None,
+ help='extractive summary model name/tag in hugging-face, defaults to \'allenai/scibert_scivocab_uncased\'')
+ parser.add_argument('--ledmodel_name', metavar='ledmodel_name', type=str, default=None,
+ help='led model (for abstractive summary) name/tag in hugging-face, defaults to \'allenai/led-large-16384-arxiv\'')
+ parser.add_argument('--embedder_name', metavar='sentence_embedder_name', type=str, default=None,
+ help='sentence embedder name/tag in hugging-face, defaults to \'paraphrase-MiniLM-L6-v2\'')
+ parser.add_argument('--nlp_name', metavar='spacy_model_name', type=str, default=None,
+ help='spacy model name/tag in hugging-face (if changed - needs to be spacy-installed prior), defaults to \'en_core_sci_scibert\'')
+ parser.add_argument('--similarity_nlp_name', metavar='similarity_nlp_name', type=str, default=None,
+ help='spacy downstream model (for similarity) name/tag in hugging-face (if changed - needs to be spacy-installed prior), defaults to \'en_core_sci_lg\'')
+ parser.add_argument('--kw_model_name', metavar='kw_model_name', type=str, default=None,
+ help='keyword extraction model name/tag in hugging-face, defaults to \'distilbert-base-nli-mean-tokens\'')
+ parser.add_argument('--refresh_models', metavar='refresh_models', type=str, default=None,
+ help='Refresh model downloads with given names (needs at least one model name param above), defaults to False')
+ parser.add_argument('--high_gpu', metavar='high_gpu', type=str, default=None,
+ help='High GPU usage permitted, defaults to False')
+
+ args = parser.parse_args()
+
+ surveyor = Surveyor(
+ pdf_dir=args.pdf_dir,
+ txt_dir=args.txt_dir,
+ img_dir=args.img_dir,
+ tab_dir=args.tab_dir,
+ dump_dir=args.dump_dir,
+ models_dir=args.models_dir,
+ title_model_name=args.title_model_name,
+ ex_summ_model_name=args.ex_summ_model_name,
+ ledmodel_name=args.ledmodel_name,
+ embedder_name=args.embedder_name,
+ nlp_name=args.nlp_name,
+ similarity_nlp_name=args.similarity_nlp_name,
+ kw_model_name=args.kw_model_name,
+ refresh_models=args.refresh_models,
+ high_gpu=args.high_gpu
+
+ )
+
+ surveyor.survey(args.query, max_search=args.max_search, num_papers=args.num_papers,
+ debug=False, weigh_authors=False)
+
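For reference, a minimal programmatic use of the Surveyor class mirrors the CLI block above. The sketch below is illustrative and not part of this commit: the query and limits are placeholders, and any constructor argument left unset falls back to src/defaults.py (it also assumes the requirements and spaCy/scispaCy models are already installed).

# minimal sketch: programmatic survey generation with illustrative values
from src.Surveyor import Surveyor

surveyor = Surveyor(dump_dir='arxiv_dumps/')  # other dirs/models fall back to DEFAULTS
zip_path, survey_path = surveyor.survey(
    'meta-learning for few-shot classification',  # free-text research query
    max_search=20,                                # arXiv metadata results to scan
    num_papers=8,                                 # papers actually downloaded and analysed
)
print(zip_path, survey_path)                      # absolute paths to the outputs zip and survey text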
src/__pycache__/Surveyor.cpython-310.pyc ADDED
Binary file (47.8 kB). View file
src/__pycache__/defaults.cpython-310.pyc ADDED
Binary file (835 Bytes). View file
src/defaults.py ADDED
@@ -0,0 +1,20 @@
+ # defaults for arxiv
+ DEFAULTS = {
+ "max_search": 100,
+ "num_papers": 20,
+ "high_gpu": False,
+ "pdf_dir": "arxiv_data/tarpdfs/",
+ "txt_dir": "arxiv_data/fulltext/",
+ "img_dir": "arxiv_data/images/",
+ "tab_dir": "arxiv_data/tables/",
+ "dump_dir": "arxiv_dumps/",
+ "models_dir": "saved_models/",
+ "title_model_name": "Callidior/bert2bert-base-arxiv-titlegen",
+ "ex_summ_model_name": "allenai/scibert_scivocab_uncased",
+ "ledmodel_name": "allenai/led-large-16384-arxiv",
+ "embedder_name": "paraphrase-MiniLM-L6-v2",
+ "nlp_name": "en_core_sci_scibert",
+ "similarity_nlp_name": "en_core_sci_lg",
+ "kw_model_name": "distilbert-base-nli-mean-tokens",
+
+ }
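These defaults back every argument that is left as None in the Surveyor constructor, the CLI, and survey(). A minimal sketch of that fallback convention follows; the resolve helper is illustrative only, and the import path assumes the repo root as working directory (as survey.py does):

# minimal sketch: None arguments fall back to the DEFAULTS entry of the same name
from src.defaults import DEFAULTS

def resolve(value, key):
    # return the caller's value if given, otherwise the repo default
    return value if value is not None else DEFAULTS[key]

num_papers = resolve(None, 'num_papers')        # -> 20
models_dir = resolve('my_models/', 'models_dir')  # caller override wins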
src/packages.txt ADDED
File without changes
survey.py ADDED
@@ -0,0 +1,72 @@
+ from src.Surveyor import Surveyor
+
+ import logging
+ logging.basicConfig()
+ logging.getLogger().setLevel(logging.ERROR)
+
+
+ if __name__ == '__main__':
+ import argparse
+
+ parser = argparse.ArgumentParser(description='Generate a survey just from a query !!')
+ parser.add_argument('query', metavar='query_string', type=str,
+ help='your research query/keywords')
+ parser.add_argument('--max_search', metavar='max_metadata_papers', type=int, default=None,
+ help='maximum number of papers to gaze at - defaults to 100')
+ parser.add_argument('--num_papers', metavar='max_num_papers', type=int, default=None,
+ help='maximum number of papers to download and analyse - defaults to 20')
+ parser.add_argument('--pdf_dir', metavar='pdf_dir', type=str, default=None,
+ help='pdf paper storage directory - defaults to arxiv_data/tarpdfs/')
+ parser.add_argument('--txt_dir', metavar='txt_dir', type=str, default=None,
+ help='text-converted paper storage directory - defaults to arxiv_data/fulltext/')
+ parser.add_argument('--img_dir', metavar='img_dir', type=str, default=None,
+ help='image storage directory - defaults to arxiv_data/images/')
+ parser.add_argument('--tab_dir', metavar='tab_dir', type=str, default=None,
+ help='tables storage directory - defaults to arxiv_data/tables/')
+ parser.add_argument('--dump_dir', metavar='dump_dir', type=str, default=None,
+ help='all_output_dir - defaults to arxiv_dumps/')
+ parser.add_argument('--models_dir', metavar='save_models_dir', type=str, default=None,
+ help='directory to save models (> 5GB) - defaults to saved_models/')
+ parser.add_argument('--title_model_name', metavar='title_model_name', type=str, default=None,
+ help='title model name/tag in hugging-face, defaults to \'Callidior/bert2bert-base-arxiv-titlegen\'')
+ parser.add_argument('--ex_summ_model_name', metavar='extractive_summ_model_name', type=str, default=None,
+ help='extractive summary model name/tag in hugging-face, defaults to \'allenai/scibert_scivocab_uncased\'')
+ parser.add_argument('--ledmodel_name', metavar='ledmodel_name', type=str, default=None,
+ help='led model (for abstractive summary) name/tag in hugging-face, defaults to \'allenai/led-large-16384-arxiv\'')
+ parser.add_argument('--embedder_name', metavar='sentence_embedder_name', type=str, default=None,
+ help='sentence embedder name/tag in hugging-face, defaults to \'paraphrase-MiniLM-L6-v2\'')
+ parser.add_argument('--nlp_name', metavar='spacy_model_name', type=str, default=None,
+ help='spacy model name/tag in hugging-face (if changed - needs to be spacy-installed prior), defaults to \'en_core_sci_scibert\'')
+ parser.add_argument('--similarity_nlp_name', metavar='similarity_nlp_name', type=str, default=None,
+ help='spacy downstream model (for similarity) name/tag in hugging-face (if changed - needs to be spacy-installed prior), defaults to \'en_core_sci_lg\'')
+ parser.add_argument('--kw_model_name', metavar='kw_model_name', type=str, default=None,
+ help='keyword extraction model name/tag in hugging-face, defaults to \'distilbert-base-nli-mean-tokens\'')
+ parser.add_argument('--refresh_models', metavar='refresh_models', type=str, default=None,
+ help='Refresh model downloads with given names (needs at least one model name param above), defaults to False')
+ parser.add_argument('--high_gpu', metavar='high_gpu', type=str, default=None,
+ help='High GPU usage permitted, defaults to False')
+
+ args = parser.parse_args()
+
+ surveyor = Surveyor(
+ pdf_dir=args.pdf_dir,
+ txt_dir=args.txt_dir,
+ img_dir=args.img_dir,
+ tab_dir=args.tab_dir,
+ dump_dir=args.dump_dir,
+ models_dir=args.models_dir,
+ title_model_name=args.title_model_name,
+ ex_summ_model_name=args.ex_summ_model_name,
+ ledmodel_name=args.ledmodel_name,
+ embedder_name=args.embedder_name,
+ nlp_name=args.nlp_name,
+ similarity_nlp_name=args.similarity_nlp_name,
+ kw_model_name=args.kw_model_name,
+ refresh_models=args.refresh_models,
+ high_gpu=args.high_gpu
+
+ )
+
+ surveyor.survey(args.query, max_search=args.max_search, num_papers=args.num_papers,
+ debug=False, weigh_authors=False)
+
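An illustrative invocation of this CLI, driven from Python so the flags stay visible next to the parser above; the query and limits are placeholders, and running the equivalent `python survey.py ...` command from a shell works the same way.

# illustrative: call the survey.py CLI with a query and reduced limits
import subprocess

subprocess.run([
    "python", "survey.py", "transformer interpretability",
    "--max_search", "50",          # metadata papers to scan (default 100)
    "--num_papers", "10",          # papers to download and analyse (default 20)
    "--dump_dir", "arxiv_dumps/",  # where the survey text and outputs zip land
], check=True)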
tests/__init__.py ADDED
File without changes
tests/__pycache__/__init__.cpython-310.pyc ADDED
Binary file (136 Bytes). View file
tests/__pycache__/test_survey_files.cpython-310-pytest-7.1.2.pyc ADDED
Binary file (1.21 kB). View file
tests/test_survey_files.py ADDED
@@ -0,0 +1,10 @@
+ import os
+ from src.Surveyor import Surveyor
+
+ def test_files():
+ surveyor = Surveyor()
+ sample_query = 'quantum entanglement'
+ zip_file, survey_file = surveyor.survey(sample_query, max_search=10, num_papers=6,
+ debug=False, weigh_authors=False)
+ assert os.path.exists(zip_file)
+ assert os.path.exists(survey_file)
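Note that this test performs a full end-to-end survey, so it downloads models and papers and is slow and network-dependent. One way to run just this check from Python, assuming pytest is installed, is sketched below; `pytest -q tests/test_survey_files.py` from a shell is equivalent.

# minimal sketch: run only the end-to-end survey test
import pytest

raise SystemExit(pytest.main(["-q", "tests/test_survey_files.py"]))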