Could you explain the details about the finetune dataset?

#2
by hxypqr - opened

In your paper, the section on fine-tuning data reads as follows:
In order to fine-tune our models, we paired each question with up to $N$ correct answers and the same number of incorrect answers. Up to $N$ correct answers were randomly chosen from the answers of the question. Each question in the corpus comes along with tags, i.e. categories indicating the topic of a question such as sequences-and-series or limits. As an incorrect answer for each question, we picked a random answer from one question sharing at least one tag with the original question by chance. This way, we chose up to $N$ incorrect answers independently from another.

This procedure yields 1.9 million examples for N=1 and 2.8 million examples for N=10, of which 90% were used as training data for the fine-tuning task. We presented to the model the entire text of the questions and answers using the structure introduced in the previous section. In addition, we pre-trained an ALBERT Model on MathSE (1) and fine-tuned it on N=1. We then let this model predict 1,000 answers to the 2021 test set. We evaluated the answers against the publicly available test set from last year and paired each correct answer with a randomly selected incorrect answer from the model's results. These question-answer pairs were used as an additional fine-tuning set which we denote by ANNOTATED.
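To make sure I read the sampling step in the first quoted paragraph correctly, here is a minimal sketch of it. The input layout (a dict of question ids mapping to `tags` and `answers`), the function name `build_pairs`, and the choice to skip questions that share no tag with any other answered question are my assumptions, not details from the paper.

```python
import random


def build_pairs(questions, n, seed=0):
    """Pair each question with up to n correct and as many incorrect answers.

    questions: dict mapping question id -> {"tags": set of str, "answers": list of str}
    (hypothetical layout). Returns a list of (question_id, answer_text, label)
    tuples, with label 1 for correct and 0 for incorrect.
    """
    rng = random.Random(seed)

    # Index: tag -> ids of questions carrying that tag and having at least one answer.
    by_tag = {}
    for qid, q in questions.items():
        if q["answers"]:
            for tag in q["tags"]:
                by_tag.setdefault(tag, []).append(qid)

    examples = []
    for qid, q in questions.items():
        # Up to n correct answers, chosen at random from the question's own answers.
        correct = rng.sample(q["answers"], min(n, len(q["answers"])))

        # Other questions sharing at least one tag with this one.
        candidates = sorted(
            {other for tag in q["tags"] for other in by_tag.get(tag, []) if other != qid}
        )
        if not candidates:
            continue  # assumption: no tag-sharing question means no example

        for answer in correct:
            examples.append((qid, answer, 1))

        # The same number of incorrect answers, each drawn independently:
        # pick a random tag-sharing question, then a random answer from it.
        for _ in correct:
            other = rng.choice(candidates)
            examples.append((qid, rng.choice(questions[other]["answers"]), 0))

    return examples
```

Because the incorrect answers are drawn independently, the same incorrect answer can appear more than once for a question in this sketch.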

I would like to ask how the answers are selected here, in particular how the left and right boundaries of the mathematical formulas are determined. Is this done by detecting delimiter symbols such as $, for example with a regular expression?

Hi!
Sorry for the late reply. "One answer" is an entire answer post from Mathematics Stack Exchange, not a single formula, so there is no need to find the left and right ends of a formula. However, we converted the data from the HTML format in which the ARQMath data was provided to plain text with the formulas written in LaTeX. Each formula was enclosed in a math-container HTML element. We parsed the entire answer post using Beautiful Soup, removed all HTML syntax, and enclosed the formulas in $ ... $ after removing the math-container element.
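As a rough illustration of that conversion, here is a minimal sketch. It is not our actual preprocessing script: it uses Python's standard-library `html.parser` as a dependency-free stand-in for Beautiful Soup, and it assumes that formulas sit in elements with the class `math-container` and that those elements contain bare LaTeX. The names `MathTextExtractor` and `html_to_text` are made up for this example.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not be tracked on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "param", "source", "track", "wbr"}


class MathTextExtractor(HTMLParser):
    """Drops all HTML tags and wraps math-container contents in $ ... $."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []   # collected output fragments
        self.stack = []   # one bool per open element: is it a math-container?

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = (dict(attrs).get("class") or "").split()
        is_math = "math-container" in classes
        self.stack.append(is_math)
        if is_math:
            self.parts.append("$")

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.stack and self.stack.pop():
            self.parts.append("$")

    def handle_data(self, data):
        # Text nodes are kept as-is; character references are already decoded.
        self.parts.append(data)


def html_to_text(html):
    parser = MathTextExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


post = '<p>Let <span class="math-container" id="1">x^2 &lt; 4</span> hold.</p>'
print(html_to_text(post))  # Let $x^2 < 4$ hold.
```

The sketch assumes reasonably well-formed markup; badly nested tags could misplace a closing $.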

I hope that helps clarify things! Let me know if you have further questions!
