Mariusz Kossakowski committed
Commit 802f11a
Parent: 08bbbaf

Change description displaying

clarin_datasets/aspectemo_dataset.py CHANGED

@@ -9,34 +9,36 @@ class AspectEmoDataset(DatasetToShow):
     def __init__(self):
         DatasetToShow.__init__(self)
         self.dataset_name = "clarin-pl/aspectemo"
-        self.description = """
-        AspectEmo Corpus is an extended version of a publicly available PolEmo 2.0
-        corpus of Polish customer reviews used in many projects on the use of different methods in sentiment
-        analysis. The AspectEmo corpus consists of four subcorpora, each containing online customer reviews from the
-        following domains: school, medicine, hotels, and products. All documents are annotated at the aspect level
-        with six sentiment categories: strong negative (minus_m), weak negative (minus_s), neutral (zero),
-        weak positive (plus_s), strong positive (plus_m).
-
-        Tasks (input, output and metrics)
-
-        Aspect-based sentiment analysis (ABSA) is a text analysis method that
-        categorizes data by aspects and identifies the sentiment assigned to each aspect. It is the sequence tagging
-        task.
-
-        Input ('tokens' column): sequence of tokens
-
-        Output ('labels' column): sequence of predicted tokens’ classes ("O" + 6 possible classes: strong negative (
-        a_minus_m), weak negative (a_minus_s), neutral (a_zero), weak positive (a_plus_s), strong positive (
-        a_plus_m), ambiguous (a_amb) )
-
-        Domain: school, medicine, hotels and products
-
-        Measurements:
-
-        Example: ['Dużo', 'wymaga', ',', 'ale', 'bardzo', 'uczciwy', 'i', 'przyjazny', 'studentom', '.', 'Warto', 'chodzić',
-        'na', 'konsultacje', '.', 'Docenia', 'postępy', 'i', 'zaangażowanie', '.', 'Polecam', '.'] → ['O', 'a_plus_s', 'O',
-        'O', 'O', 'a_plus_m', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'a_zero', 'O', 'a_plus_m', 'O', 'O', 'O', 'O', 'O', 'O']
-        """
+        self.description = [
+            """
+            AspectEmo Corpus is an extended version of a publicly available PolEmo 2.0
+            corpus of Polish customer reviews used in many projects on the use of different methods in sentiment
+            analysis. The AspectEmo corpus consists of four subcorpora, each containing online customer reviews from the
+            following domains: school, medicine, hotels, and products. All documents are annotated at the aspect level
+            with six sentiment categories: strong negative (minus_m), weak negative (minus_s), neutral (zero),
+            weak positive (plus_s), strong positive (plus_m).
+            """,
+            "Tasks (input, output and metrics)",
+            """
+            Aspect-based sentiment analysis (ABSA) is a text analysis method that
+            categorizes data by aspects and identifies the sentiment assigned to each aspect. It is the sequence tagging
+            task.
+
+            "Input ('tokens' column): sequence of tokens"
+
+            Output ('labels' column): sequence of predicted tokens’ classes ("O" + 6 possible classes: strong negative (
+            a_minus_m), weak negative (a_minus_s), neutral (a_zero), weak positive (a_plus_s), strong positive (
+            a_plus_m), ambiguous (a_amb) )
+
+            Domain: school, medicine, hotels and products
+
+            Measurements:
+
+            Example: ['Dużo', 'wymaga', ',', 'ale', 'bardzo', 'uczciwy', 'i', 'przyjazny', 'studentom', '.', 'Warto', 'chodzić',
+            'na', 'konsultacje', '.', 'Docenia', 'postępy', 'i', 'zaangażowanie', '.', 'Polecam', '.'] → ['O', 'a_plus_s', 'O',
+            'O', 'O', 'a_plus_m', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'a_zero', 'O', 'a_plus_m', 'O', 'O', 'O', 'O', 'O', 'O']
+            """
+        ]
 
     def load_data(self):
         raw_dataset = load_dataset(self.dataset_name)

@@ -56,7 +58,9 @@ class AspectEmoDataset(DatasetToShow):
 
         with description:
             st.header("Dataset description")
-            st.write(self.description)
+            st.write(self.description[0])
+            st.subheader(self.description[1])
+            st.write(self.description[2])
 
         full_dataframe = pd.concat(self.data_dict.values(), axis="rows")
         tokens_all = full_dataframe["tokens"].tolist()
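The change above is easiest to see in isolation. Below is a minimal, self-contained sketch of the same rendering pattern, assuming only that Streamlit is installed; the description variable and its placeholder strings are illustrative and are not part of the repository.

import streamlit as st

# Standalone sketch of the pattern from the diff above: the description is kept
# as a list of parts so each part can go through a different Streamlit element.
# The placeholder text below is illustrative, not the real dataset card.
description = [
    "Overview paragraph of the corpus ...",                 # body text -> st.write
    "Tasks (input, output and metrics)",                    # section title -> st.subheader
    "Task definition, input/output columns, example ...",   # body text -> st.write
]

st.header("Dataset description")
st.write(description[0])
st.subheader(description[1])
st.write(description[2])

Saved as a script, this can be previewed with the streamlit run command to see how the three parts render under a single header.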
clarin_datasets/kpwr_ner_datasets.py CHANGED

@@ -10,28 +10,31 @@ class KpwrNerDataset(DatasetToShow):
         DatasetToShow.__init__(self)
         self.data_dict_named = None
         self.dataset_name = "clarin-pl/kpwr-ner"
-        self.description = """
-        KPWR-NER is a part the Polish Corpus of Wrocław University of Technology (Korpus Języka
-        Polskiego Politechniki Wrocławskiej). Its objective is named entity recognition for fine-grained categories
-        of entities. It is the ‘n82’ version of the KPWr, which means that number of classes is restricted to 82 (
-        originally 120). During corpus creation, texts were annotated by humans from various sources, covering many
-        domains and genres.
-
-        Tasks (input, output and metrics)
-        Named entity recognition (NER) - tagging entities in text with their corresponding type.
-
-        Input ('tokens' column): sequence of tokens
-
-        Output ('ner' column): sequence of predicted tokens’ classes in BIO notation (82 possible classes, described
-        in detail in the annotation guidelines)
-
-        example:
-
-        [‘Roboty’, ‘mają’, ‘kilkanaście’, ‘lat’, ‘i’, ‘pochodzą’, ‘z’, ‘USA’, ‘,’, ‘Wysokie’, ‘napięcie’, ‘jest’,
-        ‘dużo’, ‘młodsze’, ‘,’, ‘powstało’, ‘w’, ‘Niemczech’, ‘.’] → [‘B-nam_pro_title’, ‘O’, ‘O’, ‘O’, ‘O’, ‘O’,
-        ‘O’, ‘B-nam_loc_gpe_country’, ‘O’, ‘B-nam_pro_title’, ‘I-nam_pro_title’, ‘O’, ‘O’, ‘O’, ‘O’, ‘O’, ‘O’,
-        ‘B-nam_loc_gpe_country’, ‘O’]
-        """
+        self.description = [
+            """
+            KPWR-NER is a part the Polish Corpus of Wrocław University of Technology (Korpus Języka
+            Polskiego Politechniki Wrocławskiej). Its objective is named entity recognition for fine-grained categories
+            of entities. It is the ‘n82’ version of the KPWr, which means that number of classes is restricted to 82 (
+            originally 120). During corpus creation, texts were annotated by humans from various sources, covering many
+            domains and genres.
+            """,
+            "Tasks (input, output and metrics)",
+            """
+            Named entity recognition (NER) - tagging entities in text with their corresponding type.
+
+            Input ('tokens' column): sequence of tokens
+
+            Output ('ner' column): sequence of predicted tokens’ classes in BIO notation (82 possible classes, described
+            in detail in the annotation guidelines)
+
+            example:
+
+            [‘Roboty’, ‘mają’, ‘kilkanaście’, ‘lat’, ‘i’, ‘pochodzą’, ‘z’, ‘USA’, ‘,’, ‘Wysokie’, ‘napięcie’, ‘jest’,
+            ‘dużo’, ‘młodsze’, ‘,’, ‘powstało’, ‘w’, ‘Niemczech’, ‘.’] → [‘B-nam_pro_title’, ‘O’, ‘O’, ‘O’, ‘O’, ‘O’,
+            ‘O’, ‘B-nam_loc_gpe_country’, ‘O’, ‘B-nam_pro_title’, ‘I-nam_pro_title’, ‘O’, ‘O’, ‘O’, ‘O’, ‘O’, ‘O’,
+            ‘B-nam_loc_gpe_country’, ‘O’]
+            """
+        ]
 
     def load_data(self):
         raw_dataset = load_dataset(self.dataset_name)

@@ -67,7 +70,9 @@ class KpwrNerDataset(DatasetToShow):
 
         with description:
            st.header("Dataset description")
-            st.write(self.description)
+            st.write(self.description[0])
+            st.subheader(self.description[1])
+            st.write(self.description[2])
 
         full_dataframe = pd.concat(self.data_dict.values(), axis="rows")
         tokens_all = full_dataframe["tokens"].tolist()
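The KPWR-NER description above refers to BIO labels stored in the ner column. The hypothetical snippet below, which is not part of this commit, shows one way to print tokens next to their label strings. It assumes the Hugging Face dataset has a train split and stores ner as a Sequence of ClassLabel, which is typical for token-classification datasets; adjust the feature access if the schema differs.

from datasets import load_dataset

# Hypothetical helper, not part of the commit: map the integer `ner` ids back to
# their BIO label strings, assuming "ner" is a Sequence of ClassLabel.
raw_dataset = load_dataset("clarin-pl/kpwr-ner")
label_names = raw_dataset["train"].features["ner"].feature.names

example = raw_dataset["train"][0]
for token, label_id in zip(example["tokens"], example["ner"]):
    print(f"{token}\t{label_names[label_id]}")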
clarin_datasets/punctuation_restoration_dataset.py CHANGED

@@ -10,32 +10,34 @@ class PunctuationRestorationDataset(DatasetToShow):
         DatasetToShow.__init__(self)
         self.data_dict_named = None
         self.dataset_name = "clarin-pl/2021-punctuation-restoration"
-        self.description = """
-        Speech transcripts generated by Automatic Speech Recognition (ASR) systems typically do
-        not contain any punctuation or capitalization. In longer stretches of automatically recognized speech,
-        the lack of punctuation affects the general clarity of the output text [1]. The primary purpose of
-        punctuation (PR) and capitalization restoration (CR) as a distinct natural language processing (NLP) task is
-        to improve the legibility of ASR-generated text, and possibly other types of texts without punctuation. Aside
-        from their intrinsic value, PR and CR may improve the performance of other NLP aspects such as Named Entity
-        Recognition (NER), part-of-speech (POS) and semantic parsing or spoken dialog segmentation [2, 3]. As useful
-        as it seems, It is hard to systematically evaluate PR on transcripts of conversational language; mainly
-        because punctuation rules can be ambiguous even for originally written texts, and the very nature of
-        naturally-occurring spoken language makes it difficult to identify clear phrase and sentence boundaries [4,
-        5]. Given these requirements and limitations, a PR task based on a redistributable corpus of read speech was
-        suggested. 1200 texts included in this collection (totaling over 240,000 words) were selected from two
-        distinct sources: WikiNews and WikiTalks. Punctuation found in these sources should be approached with some
-        reservation when used for evaluation: these are original texts and may contain some user-induced errors and
-        bias. The texts were read out by over a hundred different speakers. Original texts with punctuation were
-        forced-aligned with recordings and used as the ideal ASR output. The goal of the task is to provide a
-        solution for restoring punctuation in the test set collated for this task. The test set consists of
-        time-aligned ASR transcriptions of read texts from the two sources. Participants are encouraged to use both
-        text-based and speech-derived features to identify punctuation symbols (e.g. multimodal framework [6]). In
-        addition, the train set is accompanied by reference text corpora of WikiNews and WikiTalks data that can be
-        used in training and fine-tuning punctuation models.
-
-        Task description
-        The purpose of this task is to restore punctuation in the ASR recognition of texts read out loud.
-        """
+        self.description = [
+            """
+            Speech transcripts generated by Automatic Speech Recognition (ASR) systems typically do
+            not contain any punctuation or capitalization. In longer stretches of automatically recognized speech,
+            the lack of punctuation affects the general clarity of the output text [1]. The primary purpose of
+            punctuation (PR) and capitalization restoration (CR) as a distinct natural language processing (NLP) task is
+            to improve the legibility of ASR-generated text, and possibly other types of texts without punctuation. Aside
+            from their intrinsic value, PR and CR may improve the performance of other NLP aspects such as Named Entity
+            Recognition (NER), part-of-speech (POS) and semantic parsing or spoken dialog segmentation [2, 3]. As useful
+            as it seems, It is hard to systematically evaluate PR on transcripts of conversational language; mainly
+            because punctuation rules can be ambiguous even for originally written texts, and the very nature of
+            naturally-occurring spoken language makes it difficult to identify clear phrase and sentence boundaries [4,
+            5]. Given these requirements and limitations, a PR task based on a redistributable corpus of read speech was
+            suggested. 1200 texts included in this collection (totaling over 240,000 words) were selected from two
+            distinct sources: WikiNews and WikiTalks. Punctuation found in these sources should be approached with some
+            reservation when used for evaluation: these are original texts and may contain some user-induced errors and
+            bias. The texts were read out by over a hundred different speakers. Original texts with punctuation were
+            forced-aligned with recordings and used as the ideal ASR output. The goal of the task is to provide a
+            solution for restoring punctuation in the test set collated for this task. The test set consists of
+            time-aligned ASR transcriptions of read texts from the two sources. Participants are encouraged to use both
+            text-based and speech-derived features to identify punctuation symbols (e.g. multimodal framework [6]). In
+            addition, the train set is accompanied by reference text corpora of WikiNews and WikiTalks data that can be
+            used in training and fine-tuning punctuation models.
+            """,
+            "Task description",
+            "The purpose of this task is to restore punctuation in the ASR recognition of texts read out loud.",
+            "clarin_datasets/punctuation_restoration_task.png"
+        ]
 
     def load_data(self):
         raw_dataset = load_dataset(self.dataset_name)

@@ -70,7 +72,10 @@ class PunctuationRestorationDataset(DatasetToShow):
 
         with description:
             st.header("Dataset description")
-            st.write(self.description)
+            st.write(self.description[0])
+            st.subheader(self.description[1])
+            st.write(self.description[2])
+            st.image(self.description[3])
 
         full_dataframe = pd.concat(self.data_dict.values(), axis="rows")
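The fourth description entry added in this file is a repository-relative image path that is rendered with st.image. A slightly defensive variant is sketched below; the os.path.exists guard and the fallback caption are illustrative additions, not part of the commit.

import os

import streamlit as st

# Defensive variant of the new st.image call: only render the diagram when the
# file is actually present, otherwise show a short note instead of failing.
image_path = "clarin_datasets/punctuation_restoration_task.png"
if os.path.exists(image_path):
    st.image(image_path)
else:
    st.caption("Task diagram not found: " + image_path)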