Ulaşcan Akbulut
Spanish Congress Parliamentary Records (1977-2024)
The Context of the Data Collection: The goal of this data collection has been to create a unified, easily accessible database of all the published documents of El Congreso de los Diputados. A dataset of this kind lets researchers analyze various aspects of Spanish politics and legislation, including studying the law-making process, comparing it to that of other countries, and evaluating the effectiveness of policies. The dataset also promotes transparency and enables research using Natural Language Processing techniques.
Data Collection Methods: The data was collected directly from the Congreso website, www.congreso.es, by crawling through all the documents to collect their metadata and text. The source holds documents dating back to 1977/07/26, coinciding with the restoration of democracy in Spain, so this dataset contains all the documents the Spanish Congress has published since that date. Over the years the format in which these documents are stored has changed considerably, so many fields are used only during some terms rather than across the full dataset.
The Structure of Files: The dataset consists of sixteen json.gz files, each containing the data for all the documents of its corresponding term and named after that term's short code (e.g., XV represents the fifteenth term). Note that the first file is named C after the first term, the so-called Legislatura Constituyente. This structure is meant to facilitate access to specific terms or documents: there is no need to download all sixteen files when only one term is of interest.
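As a minimal sketch of how a single term could be loaded without touching the other fifteen files: the file name pattern <code>.json.gz (for example XV.json.gz) and the JSON layout inside each file (either one JSON array, or one JSON object per line) are assumptions, not something this README states. TERM_CODES and load_term are hypothetical helper names.

```python
import gzip
import json
from pathlib import Path

# Short codes of the sixteen terms: C (Legislatura Constituyente), then I to XV.
TERM_CODES = ["C", "I", "II", "III", "IV", "V", "VI", "VII", "VIII",
              "IX", "X", "XI", "XII", "XIII", "XIV", "XV"]


def load_term(data_dir, code):
    """Load the documents of one term from <data_dir>/<code>.json.gz.

    Assumption: neither the file name pattern nor the JSON layout
    (a single array, or one JSON object per line) is confirmed.
    """
    if code not in TERM_CODES:
        raise ValueError(f"unknown term code: {code!r}")
    path = Path(data_dir) / f"{code}.json.gz"
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        content = handle.read()
    try:
        data = json.loads(content)
    except json.JSONDecodeError:
        # Fall back to JSON Lines: one document per non-empty line.
        return [json.loads(line) for line in content.splitlines() if line.strip()]
    # A single top-level object is wrapped so callers always get a list.
    return data if isinstance(data, list) else [data]
```

Calling load_term("data", "XV") would then read only the file of the fifteenth term, where "data" stands for whatever local folder the files were downloaded to.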
Sources Used: As previously mentioned, all the data contained in these files was obtained directly from the official website of El Congreso de los Diputados, www.congreso.es. The published version was last updated on the 15th of May 2024, which is also the date the source was last accessed.
Any Data Manipulations or Modifications: To obtain a cleaner dataset, the fields boletin and diario have been merged into a single field called pdf_url. This was done after discovering that both hold the same kind of information, the URL to the PDF of the document, and that they are complementary: when one is missing the other is present, so they are never both null or both filled. The new field not only makes the database cleaner but also serves as a unique identifier for each document.
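The merge described above can be sketched as a simple coalescing step. This is an illustration only, not the code actually used to build the dataset: the raw field names boletin and diario are taken from the text, the dictionary record layout is an assumption, and merge_pdf_url is a hypothetical helper name.

```python
def merge_pdf_url(record):
    """Return a copy of record with boletin/diario replaced by pdf_url."""
    boletin = record.get("boletin")
    diario = record.get("diario")
    # The two fields are reported to be complementary: exactly one is filled.
    if bool(boletin) == bool(diario):
        raise ValueError("expected exactly one of 'boletin' and 'diario'")
    merged = {k: v for k, v in record.items() if k not in ("boletin", "diario")}
    merged["pdf_url"] = boletin or diario
    return merged
```

Raising an error when both fields are empty or both are filled makes the complementarity claim an explicit check rather than a silent assumption.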
Data Confidentiality and Permissions: Permission to publish derives from the legal notice on the official website of El Congreso, which establishes that the content must not be modified or distorted. It also specifically asks that the source of the data (Congreso de los Diputados) be cited. More information at https://www.congreso.es/es/cem/aviso-legal. For this reason, we are using a no-derivatives license.
Data Quality: Some data is missing from this dataset. For the most part the gaps are in non-relevant fields, but for 6 documents the raw text (the texto field) could not be recovered, so that field is empty. This is a consequence of the data extraction techniques.
For the documents after 1996, there is always a “texto integro” that directly gives the raw text next to the link to the PDF (this can be found at https://www.congreso.es/es/busqueda-de-publicaciones). For all these documents, extracting the raw text poses no problems, since it only requires fetching the HTML and parsing it. Unfortunately, for the documents before 1996, the raw text can only be extracted from the PDF, and the quality of some of them is questionable. This is because these documents are more than 30 years old and the format has changed a lot through the years. For all these documents, the PDFs are downloaded and the text is extracted with pdfium, but for 6 of them this wasn’t possible.
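A quick way to locate the documents with no recovered text is sketched below. It assumes each document is a dictionary with a texto key, and that "empty" may mean a missing key, a null value, or a whitespace-only string; the README does not say which of these is used. documents_without_text is a hypothetical helper name.

```python
def documents_without_text(records):
    """Return the records whose texto field is missing, None, or blank."""
    return [r for r in records if not (r.get("texto") or "").strip()]
```

Applied to each term file in turn, the lengths of the returned lists should add up to the 6 documents mentioned above if the assumptions hold.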
The Names of Labels and Variables
Field Name: - Field Description - Example or Range - Annotations -
cve: - CVE, or Electronic Verification Code, a set of characters that uniquely identifies any of the official published texts. - BOCG_D_09_1_2 - Only some of the documents after 05/01/2011 contain this field and none before that date. -
pdf_url: - String that when concatenated with www.congreso.es forms the document's PDF URL. Unique identifier across the whole dataset. - /public_oficiales/L0/SEN/DS/S_1977_004.PDF - -
encabezado: - String describing whether the document is a Diario de Sesiones (DS) or a Boletín Oficial de las Cortes Generales (BOCG). - BOCG or DS - -
fecha: - Date in string format when the document was published. - 19770706, as in July the 6th 1977. - -
fecha_mensaje: - Date in date format when the document was published. - de 22/01/2020 - Only present in some of the documents. -
mensaje: - String introducing the document to situate its context. - Congreso de los Diputados, Comisiones, núm. 156 - -
ndia: - Number in string format stating the day of the term on which the document's content took place. - range from 1 to 2977 - For term I, the numbers are not represented correctly. -
numdoc: - A number that uniquely identifies a document within its term. - 1,2,..., up to the number of documents in that term. - -
orga: - Name of the organ in which the session described by the document was held. - Pleno, Comisión Consultiva de Nombramientos,... there are 304 different organs. - Only specified for those special organs such as investigation commissions. -
seri: - String specifying the series the documents take part in. - Pleno y Diputación Permanente - -
texto: - Raw text of the document. - - For 6 documents in the whole DB, it wasn’t possible to extract any text. -
secc: - String stating the section the document corresponds to, either congreso, senado, or cortes generales. - Congreso - -
legislatura: - Short code indicating the term the document was published in. - C, I, II,..., XV - -
desu: - Comments section. - ACTA ADICIONAL - -
desu1: - Comments section 1. - ACUERDO DE LA MESA DE LA CAMARA - -
desu2: - Comments section 2. - ACUERDO SUBSIGUIENTE A LA TOMA EN CONSIDERACIÓN - -
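The field formats listed above can be turned into typed values with a few small helpers. This is a sketch under assumptions: the https scheme in BASE_URL is assumed (the README only gives the host www.congreso.es), the "de " prefix handling is inferred from the single fecha_mensaje example, and all function names are hypothetical.

```python
from datetime import datetime

BASE_URL = "https://www.congreso.es"  # scheme is an assumption


def parse_fecha(value):
    """Parse a fecha string such as '19770706' (YYYYMMDD) into a date."""
    return datetime.strptime(value, "%Y%m%d").date()


def parse_fecha_mensaje(value):
    """Parse a fecha_mensaje string such as 'de 22/01/2020' into a date.

    Returns None when the field is absent or empty, since it is only
    present in some of the documents.
    """
    if not value:
        return None
    text = value.strip()
    if text.startswith("de "):
        text = text[3:]
    return datetime.strptime(text, "%d/%m/%Y").date()


def full_pdf_url(pdf_url):
    """Concatenate the host with the pdf_url field to get the PDF address."""
    return BASE_URL + pdf_url
```

For instance, parse_fecha("19770706") yields the date July 6, 1977, matching the example given for the fecha field.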