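"""Scraper for USACO problem statements and editorials.

Fetches pages from usaco.org, strips out code and sample-data blocks, and
caches the cleaned text under prescraped/usaco/Problems and
prescraped/usaco/Editorials.
"""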
import os

import requests
from bs4 import BeautifulSoup

dir_path = os.path.dirname(os.path.realpath(__file__))

SAVE_PATH = dir_path + '/prescraped/usaco/'
# Make sure the cache directories exist before listing them (first run).
os.makedirs(SAVE_PATH + "Problems", exist_ok=True)
os.makedirs(SAVE_PATH + "Editorials", exist_ok=True)
scraped_problems = os.listdir(SAVE_PATH + "Problems")
scraped_editorials = os.listdir(SAVE_PATH + "Editorials")


def anti_scrape(soup):
    # usaco.org sometimes serves an AES-based JavaScript challenge instead of
    # the real page. Extract the key, IV, and ciphertext from the inline
    # script, decrypt the RCPC cookie, and re-fetch the page with that cookie.
    if soup.text == "Just a moment...Enable JavaScript and cookies to continue":
        print("Bypassing anti-scrape protection...")
        scr = soup.find_all("script")[-1].string
        scr = scr[scr.index("var a=toNumbers"):].split(';')
        line = scr[0]
        abc = []
        while "toNumbers" in line:
            i = line.index("toNumbers")
            line = line[i + 11:]  # skip past 'toNumbers("'
            abc.append(line[:line.index('"')])
        from Crypto.Cipher import AES  # deferred: only needed for the challenge

        def to_numbers(x):
            # Hex string -> bytes, mirroring the page's toNumbers() helper.
            return bytes(int(x[i:i + 2], 16) for i in range(0, len(x), 2))

        key, iv, cipher = map(to_numbers, abc)
        aes = AES.new(key, AES.MODE_CBC, iv)
        rcpc = aes.decrypt(cipher).hex()
        print(f"RCPC = {rcpc}")
        url = scr[-2]
        url = url[url.index('"') + 1:-1]
        r = requests.get(url, cookies={"RCPC": rcpc})
        soup = BeautifulSoup(r.text, "html.parser")
    return soup  # hand the (possibly re-fetched) page back to the caller
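# Usage sketch: callers should re-bind their soup so the bypassed page
# actually gets used, e.g.
#     soup = BeautifulSoup(requests.get(url).text, "html.parser")
#     soup = anti_scrape(soup)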

def read(file_path):
    with open(file_path, 'r') as f:
        return f.read()

def from_url(url):
    return url.split('/')[-1]

def problem(url):
    pid = from_url(url)
    if pid in scraped_problems:
        statement = read(SAVE_PATH + "Problems/" + pid)
        if len(statement):
            return {"statement": statement}

    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    soup = anti_scrape(soup)  # bypass the cookie challenge if it was served

    soup = soup.find(class_='problem-text')

    while soup.pre is not None:  # removes all code and sample-data blocks
        soup.pre.decompose()

    prob = soup.text

    # Strip the sample data: keep the text before the first "SAMPLE INPUT"
    # and the text after the last "SCORING:".
    if "SAMPLE INPUT" in prob:
        head = prob.split("SAMPLE INPUT", 1)[0]
        tail = prob.rsplit("SAMPLE INPUT", 1)[-1].split("SCORING:")[-1]
        prob = head + "SCORING:" + tail

    with open(SAVE_PATH + 'Problems/' + pid, 'w') as f:
        f.write(prob)
    scraped_problems.append(pid)

    return {"statement": prob}



def editorial(prob_url, edi_url, bot=None, query_func=None):
    # TODO: Fix random line breaks in the scrapes
    pid = from_url(edi_url)
    if pid in scraped_editorials:
        edi = read(SAVE_PATH + "Editorials/" + pid)
        if len(edi):
            return edi

    response = requests.get(edi_url)
    soup = BeautifulSoup(response.text, 'html.parser')
    soup = anti_scrape(soup)  # bypass the cookie challenge if it was served

    while soup.pre is not None:  # removes all code blocks
        soup.pre.decompose()

    edi = []
    for tag in soup.find_all('p'):
        # Keep only top-level paragraphs; nested <p> tags are page chrome.
        if tag.parent.name != 'body':
            continue

        latex_content = tag.text

        # In case LaTeX doesn't render automatically with bs4, walk
        # tag.descendants instead and wrap the bodies of
        # <script type="math/tex"> tags in $$$...$$$ delimiters.

        edi.append(latex_content)

    edi = '\n'.join(edi)

    # Optional post-processing hook, currently disabled:
    # if bot:
    #     edi = bot.chat(query_func(problem(prob_url), edi))

    with open(SAVE_PATH + 'Editorials/' + pid, 'w') as f:
        f.write(edi)
    scraped_editorials.append(pid)

    return edi

# Example calls (note that editorial() takes the problem URL first and the
# editorial URL second):
# print(editorial(prob_url, 'https://usaco.org/current/data/sol_prob2_platinum_open24.html'))
# print(problem('https://usaco.org/index.php?page=viewproblem2&cpid=1428')['statement'])
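
# A minimal smoke test, assuming (unverified) that the two URLs above refer
# to a matching problem/editorial pair:
if __name__ == "__main__":
    prob_url = 'https://usaco.org/index.php?page=viewproblem2&cpid=1428'
    edi_url = 'https://usaco.org/current/data/sol_prob2_platinum_open24.html'
    print(problem(prob_url)['statement'])
    print(editorial(prob_url, edi_url))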