[ { "question": "A large mobile network operating company is buildin g a machine learning model to predict customers who are likely to unsubscribe from the service. The company plans to offer an incentive for these customers as the cost of churn is far greater than the cost of the incent ive. The model produces the following confusion matrix a fter evaluating on a test dataset of 100 customers: Based on the model evaluation results, why is this a viable model for production?", "options": [ "A. The model is 86% accurate and the cost incurred b y the company as a result of false negatives is les s than", "B. The precision of the model is 86%, which is less than the accuracy of the model.", "C. The model is 86% accurate and the cost incurred b y the company as a result of false positives is les s than", "D. The precision of the model is 86%, which is great er than the accuracy of the model." ], "correct": "A. The model is 86% accurate and the cost incurred b y the company as a result of false negatives is les s than", "explanation": "Explanation/Reference:", "references": "" }, { "question": "A Machine Learning Specialist is designing a system for improving sales for a company. The objective i s to use the large amount of information the company has on users' behavior and product preferences to predict which products users would like based on the users' simil arity to other users. What should the Specialist do to meet this objectiv e?", "options": [ "A. Build a content-based filtering recommendation en gine with Apache Spark ML on Amazon EMR", "B. Build a collaborative filtering recommendation en gine with Apache Spark ML on Amazon EMR.", "C. Build a model-based filtering recommendation engi ne with Apache Spark ML on Amazon EMR", "D. Build a combinative filtering recommendation engi ne with Apache Spark ML on Amazon EMR" ], "correct": "B. Build a collaborative filtering recommendation en gine with Apache Spark ML on Amazon EMR.", "explanation": "Many developers want to implement the famous Amazon model that was used to power the \"People who bought this also bought these items\" feature on Ama zon.com. This model is based on a method called Collaborative Filtering. It takes items such as mov ies, books, and products that were rated highly by a set of users and recommending them to other users who also gave them high ratings. This method works well in domains where explicit ratings or implicit user act ions can be gathered and analyzed. 96CE4376707A97CE80D4B1916F054522", "references": "https://aws.amazon.com/blogs/big-data/bu ilding-a-recommendation-engine-with-spark-ml-on- amazon-emr-using-zeppelin/" }, { "question": "A Mobile Network Operator is building an analytics platform to analyze and optimize a company's operat ions using Amazon Athena and Amazon S3. The source systems send data in .CSV format in real time. The Data Engineering team wants to transform the data to the Apache Parquet format before storing it on Amazon S3. Which solution takes the LEAST effort to implement?", "options": [ "A. Ingest .CSV data using Apache Kafka Streams on Am azon EC2 instances and use Kafka Connect S3 to", "B. Ingest .CSV data from Amazon Kinesis Data Streams and use Amazon Glue to convert data into Parquet.", "C. Ingest .CSV data using Apache Spark Structured St reaming in an Amazon EMR cluster and use Apache", "D. Ingest .CSV data from Amazon Kinesis Data Streams and use Amazon Kinesis Data Firehose to convert" ], "correct": "C. 
",
    "references": "https://aws.amazon.com/blogs/big-data/building-a-recommendation-engine-with-spark-ml-on-amazon-emr-using-zeppelin/"
  },
  {
    "question": "A Mobile Network Operator is building an analytics platform to analyze and optimize a company's operations using Amazon Athena and Amazon S3. The source systems send data in .CSV format in real time. The Data Engineering team wants to transform the data to the Apache Parquet format before storing it on Amazon S3. Which solution takes the LEAST effort to implement?",
    "options": [
      "A. Ingest .CSV data using Apache Kafka Streams on Amazon EC2 instances and use Kafka Connect S3 to",
      "B. Ingest .CSV data from Amazon Kinesis Data Streams and use Amazon Glue to convert data into Parquet.",
      "C. Ingest .CSV data using Apache Spark Structured Streaming in an Amazon EMR cluster and use Apache",
      "D. Ingest .CSV data from Amazon Kinesis Data Streams and use Amazon Kinesis Data Firehose to convert"
    ],
    "correct": "D. Ingest .CSV data from Amazon Kinesis Data Streams and use Amazon Kinesis Data Firehose to convert",
    "explanation": "Kinesis Data Firehose supports record format conversion and can convert incoming records to Apache Parquet before delivering them to Amazon S3, so no cluster or transformation code has to be managed; this is the least-effort option.",
    "references": ""
  },
  {
    "question": "A city wants to monitor its air quality to address the consequences of air pollution. A Machine Learning Specialist needs to forecast the air quality in parts per million of contaminates for the next 2 days in the city. As this is a prototype, only daily data from the last year is available. Which model is MOST likely to provide the best results in Amazon SageMaker?",
    "options": [
      "A. Use the Amazon SageMaker k-Nearest-Neighbors (kNN) algorithm on the single time series consisting of",
      "B. Use Amazon SageMaker Random Cut Forest (RCF) on the single time series consisting of the full year of",
      "C. Use the Amazon SageMaker Linear Learner algorithm on the single time series consisting of the full year of",
      "D. Use the Amazon SageMaker Linear Learner algorithm on the single time series consisting of the full year of"
    ],
    "correct": "C. Use the Amazon SageMaker Linear Learner algorithm on the single time series consisting of the full year of",
    "explanation": "Explanation/Reference:
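A regression algorithm such as Linear Learner can be applied to a single time series by reframing forecasting as supervised learning over lagged features; a minimal pandas sketch, with hypothetical column names:

import pandas as pd

# Hypothetical daily PPM readings for one year
df = pd.DataFrame({'ppm': [float(i % 30) for i in range(365)]})

# Lagged copies of the series become the regression features
for lag in (1, 2, 3, 7):
    df['ppm_lag_%d' % lag] = df['ppm'].shift(lag)

df = df.dropna()
X = df.drop(columns='ppm').to_numpy()  # features for the regressor
y = df['ppm'].to_numpy()               # regression target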
",
    "references": "https://aws.amazon.com/blogs/machine-learning/build-a-model-to-predict-the-impact-of-weather-on-urban-air-quality-using-amazon-sagemaker/?ref=Welcome.AI"
  },
  {
    "question": "A Data Engineer needs to build a model using a dataset containing customer credit card information. How can the Data Engineer ensure the data remains encrypted and the credit card information is secure?",
    "options": [
      "A. Use a custom encryption algorithm to encrypt the data and store the data on an Amazon SageMaker",
      "B. Use an IAM policy to encrypt the data on the Amazon S3 bucket and Amazon Kinesis to automatically",
      "C. Use an Amazon SageMaker launch configuration to encrypt the data once it is copied to the SageMaker",
      "D. Use AWS KMS to encrypt the data on Amazon S3 and Amazon SageMaker, and redact the credit card"
    ],
    "correct": "D. Use AWS KMS to encrypt the data on Amazon S3 and Amazon SageMaker, and redact the credit card",
    "explanation": "AWS KMS provides managed encryption at rest for both Amazon S3 and Amazon SageMaker volumes, and redacting the card numbers removes the sensitive values from the training data; SageMaker has no \"launch configuration\" encryption feature.",
    "references": "https://docs.aws.amazon.com/sagemaker/latest/dg/pca.html"
  },
  {
    "question": "A Machine Learning Specialist is using an Amazon SageMaker notebook instance in a private subnet of a corporate VPC. The ML Specialist has important data stored on the Amazon SageMaker notebook instance's Amazon EBS volume, and needs to take a snapshot of that EBS volume. However, the ML Specialist cannot find the Amazon SageMaker notebook instance's EBS volume or Amazon EC2 instance within the VPC. Why is the ML Specialist not seeing the instance visible in the VPC?",
    "options": [
      "A. Amazon SageMaker notebook instances are based on the EC2 instances within the customer account, but",
      "B. Amazon SageMaker notebook instances are based on the Amazon ECS service within customer accounts.",
      "C. Amazon SageMaker notebook instances are based on EC2 instances running within AWS service",
      "D. Amazon SageMaker notebook instances are based on AWS ECS instances running within AWS service"
    ],
    "correct": "C. Amazon SageMaker notebook instances are based on EC2 instances running within AWS service",
    "explanation": "Explanation/Reference:",
    "references": "https://docs.aws.amazon.com/sagemaker/latest/dg/gs-setup-working-env.html"
  },
  {
    "question": "A Machine Learning Specialist is building a model that will perform time series forecasting using Amazon SageMaker. The Specialist has finished training the model and is now planning to perform load testing on the endpoint so they can configure Auto Scaling for the model variant. Which approach will allow the Specialist to review the latency, memory utilization, and CPU utilization during the load test?",
    "options": [
      "A. Review SageMaker logs that have been written to Amazon S3 by leveraging Amazon Athena and Amazon",
      "B. Generate an Amazon CloudWatch dashboard to create a single view for the latency, memory utilization, and",
      "C. Build custom Amazon CloudWatch Logs and then leverage Amazon ES and Kibana to query and visualize",
      "D. Send Amazon CloudWatch Logs that were generated by Amazon SageMaker to Amazon ES and use"
    ],
    "correct": "B. Generate an Amazon CloudWatch dashboard to create a single view for the latency, memory utilization, and",
    "explanation": "Explanation/Reference:",
    "references": "https://docs.aws.amazon.com/sagemaker/latest/dg/monitoring-cloudwatch.html"
  },
  {
    "question": "A manufacturing company has structured and unstructured data stored in an Amazon S3 bucket. A Machine Learning Specialist wants to use SQL to run queries on this data. Which solution requires the LEAST effort to be able to query this data?",
    "options": [
      "A. Use AWS Data Pipeline to transform the data and Amazon RDS to run queries.",
      "B. Use AWS Glue to catalogue the data and Amazon Athena to run queries.",
      "C. Use AWS Batch to run ETL on the data and Amazon Aurora to run the queries.",
      "D. Use AWS Lambda to transform the data and Amazon Kinesis Data Analytics to run queries.",
      "A. Load a smaller subset of the data into the SageMaker notebook and train locally. Confirm that the training",
      "B. Launch an Amazon EC2 instance with an AWS Deep Learning AMI and attach the S3 bucket to the",
      "C. Use AWS Glue to train a model using a small subset of the data to confirm that the data will be compatible",
      "D. Load a smaller subset of the data into the SageMaker notebook and train locally. Confirm that the training"
    ],
    "correct": "A. Load a smaller subset of the data into the SageMaker notebook and train locally. Confirm that the training",
    "explanation": "Explanation/Reference:",
    "references": ""
  },
  {
    "question": "A Machine Learning Specialist has completed a proof of concept for a company using a small data sample, and now the Specialist is ready to implement an end-to-end solution in AWS using Amazon SageMaker. The historical training data is stored in Amazon RDS. Which approach should the Specialist use for training a model using that data?",
    "options": [
      "A. Write a direct connection to the SQL database within the notebook and pull data in",
      "B. Push the data from Microsoft SQL Server to Amazon S3 using an AWS Data Pipeline and provide the S3",
      "C. Move the data to Amazon DynamoDB and set up a connection to DynamoDB within the notebook to pull",
      "D. Move the data to Amazon ElastiCache using AWS DMS and set up a connection within the notebook to pull",
      "A. Latent Dirichlet Allocation (LDA) for the given collection of discrete data to identify patterns in the customer",
      "B. A neural network with a minimum of three layers and random initial weights to identify patterns in the",
      "C. Collaborative filtering based on user interactions and correlations to identify patterns in the customer",
      "D. Random Cut Forest (RCF) over random subsamples to identify patterns in the customer database."
    ],
    "correct": "C. Collaborative filtering based on user interactions and correlations to identify patterns in the customer",
    "explanation": "Explanation/Reference:",
    "references": ""
  },
  {
    "question": "A Machine Learning Specialist is working with a large company to leverage machine learning within its products. The company wants to group its customers into categories based on which customers will and will not churn within the next 6 months. The company has labeled the data available to the Specialist. Which machine learning model type should the Specialist use to accomplish this task?",
    "options": [
      "A. Linear regression",
      "B. Classification",
      "C. Clustering",
      "D. Reinforcement learning"
    ],
    "correct": "B. Classification",
    "explanation": "The goal of classification is to determine to which class or category a data point (customer in our case) belongs. For classification problems, data scientists would use historical data with predefined target variables, AKA labels (churner/non-churner), that is, answers that need to be predicted, to train an algorithm. With classification, businesses can answer the following questions: Will this customer churn or not? Will a customer renew their subscription? Will a user downgrade a pricing plan? Are there any signs of unusual customer behavior?
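As a minimal illustration of this supervised setup, here is a scikit-learn sketch with a hypothetical toy feature matrix and churn labels:

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hypothetical labeled data: rows of customer features, 1 = churner
X = [[0.1, 3.0], [0.9, 1.0], [0.4, 2.5], [0.8, 0.5]]
y = [0, 1, 0, 1]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

clf = LogisticRegression().fit(X_train, y_train)
print(clf.predict(X_test))  # predicted churn labels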
",
    "references": "https://www.kdnuggets.com/2019/05/churn-prediction-machine-learning.html"
  },
  {
    "question": "The displayed graph is from a forecasting model for testing a time series. Considering the graph only, which conclusion should a Machine Learning Specialist make about the behavior of the model?",
    "options": [
      "A. The model predicts both the trend and the seasonality well",
      "B. The model predicts the trend well, but not the seasonality.",
      "C. The model predicts the seasonality well, but not the trend.",
      "D. The model does not predict the trend or the seasonality well.",
      "A. Long short-term memory (LSTM) model with scaled exponential linear unit (SELU)",
      "B. Logistic regression",
      "C. Support vector machine (SVM) with non-linear kernel",
      "D. Single perceptron with tanh activation function"
    ],
    "correct": "C. Support vector machine (SVM) with non-linear kernel",
    "explanation": "Explanation/Reference:",
    "references": ""
  },
  {
    "question": "A Machine Learning Specialist at a company sensitive to security is preparing a dataset for model training. The dataset is stored in Amazon S3 and contains Personally Identifiable Information (PII). The dataset: Must be accessible from a VPC only. Must not traverse the public internet. How can these requirements be satisfied?",
    "options": [
      "A. Create a VPC endpoint and apply a bucket access policy that restricts access to the given VPC endpoint",
      "B. Create a VPC endpoint and apply a bucket access policy that allows access from the given VPC endpoint",
      "C. Create a VPC endpoint and use Network Access Control Lists (NACLs) to allow traffic between only the",
      "D. Create a VPC endpoint and use security groups to restrict access to the given VPC endpoint and an"
    ],
    "correct": "B. Create a VPC endpoint and apply a bucket access policy that allows access from the given VPC endpoint",
    "explanation": "Explanation/Reference:",
    "references": "https://docs.aws.amazon.com/AmazonS3/latest/dev/example-bucket-policies-vpc-endpoint.html"
  },
  {
    "question": "During mini-batch training of a neural network for a classification problem, a Data Scientist notices that training accuracy oscillates. What is the MOST likely cause of this issue?",
    "options": [
      "A. The class distribution in the dataset is imbalanced.",
      "B. Dataset shuffling is disabled.",
      "C. The batch size is too big.",
      "D. The learning rate is very high."
    ],
    "correct": "D. The learning rate is very high.",
    "explanation": "Explanation/Reference:",
    "references": "https://towardsdatascience.com/deep-learning-personal-notes-part-1-lesson-2-8946fe970b95"
  },
  {
    "question": "An employee found a video clip with audio on a company's social media feed. The language used in the video is Spanish. English is the employee's first language, and they do not understand Spanish. The employee wants to do a sentiment analysis. What combination of services is the MOST efficient to accomplish the task?",
    "options": [
      "A. Amazon Transcribe, Amazon Translate, and Amazon Comprehend",
      "B. Amazon Transcribe, Amazon Comprehend, and Amazon SageMaker seq2seq",
      "C. Amazon Transcribe, Amazon Translate, and Amazon SageMaker Neural Topic Model (NTM)",
      "D. Amazon Transcribe, Amazon Translate and Amazon SageMaker BlazingText"
    ],
    "correct": "A. Amazon Transcribe, Amazon Translate, and Amazon Comprehend",
    "explanation": "Amazon Transcribe converts the Spanish audio to text, Amazon Translate translates the text to English, and Amazon Comprehend performs the sentiment analysis; no custom SageMaker model is needed.",
    "references": ""
  },
  {
    "question": "A Machine Learning Specialist is packaging a custom ResNet model into a Docker container so the company can leverage Amazon SageMaker for training. The Specialist is using Amazon EC2 P3 instances to train the model and needs to properly configure the Docker container to leverage the NVIDIA GPUs. What does the Specialist need to do?",
    "options": [
      "A. Bundle the NVIDIA drivers with the Docker image.",
      "B. Build the Docker container to be NVIDIA-Docker compatible.",
      "C. Organize the Docker container's file structure to execute on GPU instances.",
      "D. Set the GPU flag in the Amazon SageMaker CreateTrainingJob request body."
    ],
    "correct": "B. Build the Docker container to be NVIDIA-Docker compatible.",
    "explanation": "Amazon SageMaker exposes the GPUs to containers that are nvidia-docker compatible; the NVIDIA drivers are managed on the host and should not be bundled into the image.",
    "references": ""
  },
  {
    "question": "A Machine Learning Specialist is building a logistic regression model that will predict whether or not a person will order a pizza. The Specialist is trying to build the optimal model with an ideal classification threshold. What model evaluation technique should the Specialist use to understand how different classification thresholds will impact the model's performance?",
    "options": [
      "A. Receiver operating characteristic (ROC) curve",
      "B. Misclassification rate",
      "C. Root Mean Square Error (RMSE)",
      "D. L1 norm"
    ],
    "correct": "A. Receiver operating characteristic (ROC) curve",
    "explanation": "Explanation/Reference:
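The ROC curve makes the threshold trade-off explicit by plotting the true positive rate against the false positive rate at every candidate threshold; a minimal scikit-learn sketch with hypothetical labels and predicted probabilities:

from sklearn.metrics import roc_curve

# Hypothetical true labels and predicted pizza-order probabilities
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

fpr, tpr, thresholds = roc_curve(y_true, y_score)
for f, t, thr in zip(fpr, tpr, thresholds):
    print('threshold=%.2f  FPR=%.2f  TPR=%.2f' % (thr, f, t))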
",
    "references": "https://docs.aws.amazon.com/machine-learning/latest/dg/binary-model-insights.html"
  },
  {
    "question": "An interactive online dictionary wants to add a widget that displays words used in similar contexts. A Machine Learning Specialist is asked to provide word features for the downstream nearest neighbor model powering the widget. What should the Specialist do to meet these requirements?",
    "options": [
      "A. Create one-hot word encoding vectors.",
      "B. Produce a set of synonyms for every word using Amazon Mechanical Turk.",
      "C. Create word embedding vectors that store edit distance with every other word.",
      "D. Download word embeddings pre-trained on a large corpus."
    ],
    "correct": "D. Download word embeddings pre-trained on a large corpus.",
    "explanation": "Pre-trained word embeddings place words used in similar contexts close together in vector space, which is exactly what a nearest neighbor model needs; one-hot vectors are mutually orthogonal and carry no similarity information.",
    "references": "https://aws.amazon.com/blogs/machine-learning/amazon-sagemaker-object2vec-adds-new-features-that-support-automatic-negative-sampling-and-speed-up-training/"
  },
  {
    "question": "A Machine Learning Specialist is configuring Amazon SageMaker so multiple Data Scientists can access notebooks, train models, and deploy endpoints. To ensure the best operational performance, the Specialist needs to be able to track how often the Scientists are deploying models, GPU and CPU utilization on the deployed SageMaker endpoints, and all errors that are generated when an endpoint is invoked. Which services are integrated with Amazon SageMaker to track this information? (Choose two.)",
    "options": [
      "A. AWS CloudTrail",
      "B. AWS Health",
      "C. AWS Trusted Advisor",
      "D. Amazon CloudWatch"
    ],
    "correct": "",
    "explanation": "Explanation/Reference:",
    "references": "https://aws.amazon.com/sagemaker/faqs/"
  },
  {
    "question": "A retail chain has been ingesting purchasing records from its network of 20,000 stores to Amazon S3 using Amazon Kinesis Data Firehose. To support training an improved machine learning model, training records will require new but simple transformations, and some attributes will be combined. The model needs to be retrained daily. Given the large number of stores and the legacy data ingestion, which change will require the LEAST amount of development effort?",
    "options": [
      "A. Require that the stores switch to capturing their data locally on AWS Storage Gateway for loading into",
      "B. Deploy an Amazon EMR cluster running Apache Spark with the transformation logic, and have the cluster",
      "C. Spin up a fleet of Amazon EC2 instances with the transformation logic, have them transform the data",
      "D. Insert an Amazon Kinesis Data Analytics stream downstream of the Kinesis Data Firehose stream that"
    ],
    "correct": "D. Insert an Amazon Kinesis Data Analytics stream downstream of the Kinesis Data Firehose stream that",
    "explanation": "Explanation/Reference:",
    "references": ""
  },
  {
    "question": "A Machine Learning Specialist is building a convolutional neural network (CNN) that will classify 10 types of animals. The Specialist has built a series of layers in a neural network that will take an input image of an animal, pass it through a series of convolutional and pooling layers, and then finally pass it through a dense and fully connected layer with 10 nodes. The Specialist would like to get an output from the neural network that is a probability distribution of how likely it is that the input image belongs to each of the 10 classes. Which function will produce the desired output?",
    "options": [
      "A. Dropout",
      "B. Smooth L1 loss",
      "C. Softmax",
      "D. Rectified linear units (ReLU)"
    ],
    "correct": "C. Softmax",
    "explanation": "The softmax function rescales the 10 output scores into values in [0, 1] that sum to 1, which is exactly a probability distribution over the classes; ReLU and dropout do not normalize the outputs.
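A minimal numpy sketch of the function itself, with hypothetical logits:

import numpy as np

def softmax(logits):
    # Subtract the max for numerical stability before exponentiating
    exps = np.exp(logits - np.max(logits))
    return exps / exps.sum()

scores = np.array([2.0, 1.0, 0.1])  # raw network outputs (logits)
probs = softmax(scores)
print(probs, probs.sum())  # values in [0, 1] that sum to 1.0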
",
    "references": "https://towardsdatascience.com/building-a-convolutional-neural-network-cnn-in-keras-329fbbadc5f5"
  },
  {
    "question": "A Machine Learning Specialist trained a regression model, but the first iteration needs optimizing. The Specialist needs to understand whether the model is more frequently overestimating or underestimating the target. What option can the Specialist use to determine whether it is overestimating or underestimating the target value?",
    "options": [
      "A. Root Mean Square Error (RMSE)",
      "B. Residual plots",
      "C. Area under the curve",
      "D. Confusion matrix"
    ],
    "correct": "B. Residual plots",
    "explanation": "Residual plots show the sign of the prediction error (actual minus predicted) for each observation, so a model that mostly overestimates or underestimates is immediately visible; RMSE, AUC, and confusion matrices do not reveal the direction of the error.
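A minimal matplotlib sketch of such a plot, with hypothetical targets and predictions:

import matplotlib.pyplot as plt

# Hypothetical targets and model predictions
y_true = [10.0, 12.0, 15.0, 18.0, 20.0]
y_pred = [11.0, 13.5, 14.0, 19.0, 22.0]

residuals = [t - p for t, p in zip(y_true, y_pred)]

# Points below zero mean the model overestimated that observation
plt.scatter(y_pred, residuals)
plt.axhline(0.0, linestyle='--')
plt.xlabel('Predicted value')
plt.ylabel('Residual (actual - predicted)')
plt.show()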
",
    "references": ""
  },
  {
    "question": "A company wants to classify user behavior as either fraudulent or normal. Based on internal research, a Machine Learning Specialist would like to build a binary classifier based on two features: age of account and transaction month. The class distribution for these features is illustrated in the figure provided. Based on this information, which model would have the HIGHEST recall with respect to the fraudulent class?",
    "options": [
      "A. Decision tree",
      "B. Linear support vector machine (SVM)",
      "C. Naive Bayesian classifier",
      "D. Single Perceptron with sigmoidal activation function"
    ],
    "correct": "C. Naive Bayesian classifier",
    "explanation": "Explanation/Reference:",
    "references": ""
  },
  {
    "question": "A Machine Learning Specialist kicks off a hyperparameter tuning job for a tree-based ensemble model using Amazon SageMaker with Area Under the ROC Curve (AUC) as the objective metric. This workflow will eventually be deployed in a pipeline that retrains and tunes hyperparameters each night to model click-through on data that goes stale every 24 hours. With the goal of decreasing the amount of time it takes to train these models, and ultimately to decrease costs, the Specialist wants to reconfigure the input hyperparameter range(s). Which visualization will accomplish this?",
    "options": [
      "A. A histogram showing whether the most important input feature is Gaussian.",
      "B. A scatter plot with points colored by target variable that uses t-Distributed Stochastic Neighbor Embedding",
      "C. A scatter plot showing the performance of the objective metric over each training iteration.",
      "D. A scatter plot showing the correlation between maximum tree depth and the objective metric."
    ],
    "correct": "D. A scatter plot showing the correlation between maximum tree depth and the objective metric.",
    "explanation": "Plotting the objective metric against a tunable hyperparameter shows which ranges of that hyperparameter produce good AUC values, so the search ranges can be narrowed; t-SNE embeddings and feature histograms say nothing about the hyperparameter ranges.",
    "references": ""
  },
  {
    "question": "A Machine Learning Specialist is creating a new natural language processing application that processes a dataset comprised of 1 million sentences. The aim is to then run Word2Vec to generate embeddings of the sentences and enable different types of predictions. Here is an example from the dataset: \"The quck BROWN FOX jumps over the lazy dog.\" Which of the following are the operations the Specialist needs to perform to correctly sanitize and prepare the data in a repeatable manner? (Choose three.)",
    "options": [
      "A. Perform part-of-speech tagging and keep the action verb and the nouns only.",
      "B. Normalize all words by making the sentence lowercase.",
      "C. Remove stop words using an English stopword dictionary.",
      "D. Correct the typography on \"quck\" to \"quick.\""
    ],
    "correct": "",
    "explanation": "Explanation/Reference:",
    "references": ""
  },
  {
    "question": "A company is using Amazon Polly to translate plaintext documents to speech for automated company announcements. However, company acronyms are being mispronounced in the current documents. How should a Machine Learning Specialist address this issue for future documents?",
    "options": [
      "A. Convert current documents to SSML with pronunciation tags.",
      "B. Create an appropriate pronunciation lexicon.",
      "C. Output speech marks to guide in pronunciation.",
      "D. Use Amazon Lex to preprocess the text files for pronunciation"
    ],
    "correct": "B. Create an appropriate pronunciation lexicon.",
    "explanation": "A pronunciation lexicon is uploaded once and applied to every future synthesis request, so the acronyms are pronounced correctly without editing each new document; SSML tags would have to be added to every document individually.",
    "references": "https://docs.aws.amazon.com/polly/latest/dg/ssml.html"
  },
  {
    "question": "An insurance company is developing a new device for vehicles that uses a camera to observe drivers' behavior and alert them when they appear distracted. The company created approximately 10,000 training images in a controlled environment that a Machine Learning Specialist will use to train and evaluate machine learning models. During the model evaluation, the Specialist notices that the training error rate diminishes faster as the number of epochs increases and the model is not accurately inferring on the unseen test images. Which of the following should be used to resolve this issue? (Choose two.)",
    "options": [
      "A. Add vanishing gradient to the model.",
      "B. Perform data augmentation on the training data.",
      "C. Make the neural network architecture complex.",
      "D. Use gradient checking in the model.",
      "A. The training channel identifying the location of training data on an Amazon S3 bucket.",
      "B. The validation channel identifying the location of validation data on an Amazon S3 bucket.",
      "C. The IAM role that Amazon SageMaker can assume to perform tasks on behalf of the users.",
      "D. Hyperparameters in a JSON array as documented for the algorithm used."
    ],
    "correct": "",
    "explanation": "Explanation/Reference:",
    "references": ""
  },
  {
    "question": "A monitoring service generates 1 TB of scale metrics record data every minute. A Research team performs queries on this data using Amazon Athena. The queries run slowly due to the large volume of data, and the team requires better performance. How should the records be stored in Amazon S3 to improve query performance?",
    "options": [
      "A. CSV files",
      "B. Parquet files",
      "C. Compressed JSON",
      "D. RecordIO"
    ],
    "correct": "B. Parquet files",
    "explanation": "Explanation/Reference:",
    "references": ""
  },
  {
    "question": "A Machine Learning Specialist is working with a media company to perform classification on popular articles from the company's website. The company is using random forests to classify how popular an article will be before it is published. A sample of the data being used is below. Given the dataset, the Specialist wants to convert the Day_Of_Week column to binary values. What technique should be used to convert this column to binary values?",
    "options": [
      "A. Binarization",
      "B. One-hot encoding",
      "C. Tokenization",
      "D. Normalization transformation"
    ],
    "correct": "B. One-hot encoding",
    "explanation": "One-hot encoding creates one binary (0/1) column per distinct day of the week.
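A minimal pandas sketch of one-hot encoding, with hypothetical day values:

import pandas as pd

df = pd.DataFrame({'Day_Of_Week': ['Mon', 'Tue', 'Mon', 'Sun']})

# One binary 0/1 column per distinct day value
encoded = pd.get_dummies(df, columns=['Day_Of_Week'])
print(encoded)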
",
    "references": ""
  },
  {
    "question": "A gaming company has launched an online game where people can start playing for free, but they need to pay if they choose to use certain features. The company needs to build an automated system to predict whether or not a new user will become a paid user within 1 year. The company has gathered a labeled dataset from 1 million users. The training dataset consists of 1,000 positive samples (from users who ended up paying within 1 year) and 999,000 negative samples (from users who did not use any paid features). Each data sample consists of 200 features including user age, device, location, and play patterns. Using this dataset for training, the Data Science team trained a random forest model that converged with over 99% accuracy on the training set. However, the prediction results on a test dataset were not satisfactory. Which of the following approaches should the Data Science team take to mitigate this issue? (Choose two.)",
    "options": [
      "A. Add more deep trees to the random forest to enable the model to learn more features.",
      "B. Include a copy of the samples in the test dataset in the training dataset.",
      "C. Generate more positive samples by duplicating the positive samples and adding a small amount of noise to",
      "D. Change the cost function so that false negatives have a higher impact on the cost value than false positives."
    ],
    "correct": "",
    "explanation": "Explanation/Reference:",
    "references": ""
  },
  {
    "question": "A Data Scientist is developing a machine learning model to predict future patient outcomes based on information collected about each patient and their treatment plans. The model should output a continuous value as its prediction. The data available includes labeled outcomes for a set of 4,000 patients. The study was conducted on a group of individuals over the age of 65 who have a particular disease that is known to worsen with age. Initial models have performed poorly. While reviewing the underlying data, the Data Scientist notices that, out of 4,000 patient observations, there are 450 where the patient age has been input as 0. The other features for these observations appear normal compared to the rest of the sample population. How should the Data Scientist correct this issue?",
    "options": [
      "A. Drop all records from the dataset where age has been set to 0.",
      "B. Replace the age field value for records with a value of 0 with the mean or median value from the dataset",
      "C. Drop the age feature from the dataset and train the model using the rest of the features.",
      "D. Use k-means clustering to handle missing features"
    ],
    "correct": "A. Drop all records from the dataset where age has been set to 0.",
    "explanation": "Explanation/Reference:",
    "references": ""
  },
  {
    "question": "A Data Science team is designing a dataset repository where it will store a large amount of training data commonly used in its machine learning models. As Data Scientists may create an arbitrary number of new datasets every day, the solution has to scale automatically and be cost-effective. Also, it must be possible to explore the data using SQL. Which storage scheme is MOST adapted to this scenario?",
    "options": [
      "A. Store datasets as files in Amazon S3.",
      "B. Store datasets as files in an Amazon EBS volume attached to an Amazon EC2 instance.",
      "C. Store datasets as tables in a multi-node Amazon Redshift cluster.",
      "D. Store datasets as global tables in Amazon DynamoDB."
    ],
    "correct": "A. Store datasets as files in Amazon S3.",
    "explanation": "Explanation/Reference:",
    "references": ""
  },
  {
    "question": "A Machine Learning Specialist deployed a model that provides product recommendations on a company's website. Initially, the model was performing very well and resulted in customers buying more products on average. However, within the past few months, the Specialist has noticed that the effect of product recommendations has diminished and customers are starting to return to their original habits of spending less. The Specialist is unsure of what happened, as the model has not changed from its initial deployment over a year ago. Which method should the Specialist try to improve model performance?",
    "options": [
      "A. The model needs to be completely re-engineered because it is unable to handle product inventory changes.",
      "B. The model's hyperparameters should be periodically updated to prevent drift.",
      "C. The model should be periodically retrained from scratch using the original data while adding a regularization",
      "D. The model should be periodically retrained using the original training data plus new data as product"
    ],
    "correct": "D. The model should be periodically retrained using the original training data plus new data as product",
    "explanation": "Explanation/Reference:",
    "references": ""
  },
  {
    "question": "A Machine Learning Specialist working for an online fashion company wants to build a data ingestion solution for the company's Amazon S3-based data lake. The Specialist wants to create a set of ingestion mechanisms that will enable future capabilities comprised of: Real-time analytics; Interactive analytics of historical data; Clickstream analytics; Product recommendations. Which services should the Specialist use?",
    "options": [
      "A. AWS Glue as the data catalog; Amazon Kinesis Data Streams and Amazon Kinesis Data Analytics for real-",
      "B. Amazon Athena as the data catalog: Amazon Kinesis Data Streams and Amazon Kinesis Data Analytics for",
      "C. AWS Glue as the data catalog; Amazon Kinesis Data Streams and Amazon Kinesis Data Analytics for",
      "D. Amazon Athena as the data catalog; Amazon Kinesis Data Streams and Amazon Kinesis Data Analytics for"
    ],
    "correct": "A. AWS Glue as the data catalog; Amazon Kinesis Data Streams and Amazon Kinesis Data Analytics for real-",
    "explanation": "Explanation/Reference:",
    "references": ""
  },
  {
    "question": "A company is observing low accuracy while training on the default built-in image classification algorithm in Amazon SageMaker. The Data Science team wants to use an Inception neural network architecture instead of a ResNet architecture. Which of the following will accomplish this? (Choose two.)",
    "options": [
      "A. Customize the built-in image classification algorithm to use Inception and use this for model training.",
      "B. Create a support case with the SageMaker team to change the default image classification algorithm to",
      "C. Bundle a Docker container with TensorFlow Estimator loaded with an Inception network and use this for",
      "D. Use custom code in Amazon SageMaker with TensorFlow Estimator to load the model with an Inception"
    ],
    "correct": "",
    "explanation": "Explanation/Reference:",
    "references": ""
  },
  {
    "question": "A Machine Learning Specialist built an image classification deep learning model. However, the Specialist ran into an overfitting problem in which the training and testing accuracies were 99% and 75%, respectively. How should the Specialist address this issue and what is the reason behind it?",
    "options": [
      "A. The learning rate should be increased because the optimization process was trapped at a local minimum.",
      "B. The dropout rate at the flatten layer should be increased because the model is not generalized enough.",
      "C. The dimensionality of dense layer next to the flatten layer should be increased because the model is not",
      "D. The epoch number should be increased because the optimization process was terminated before it reached"
    ],
    "correct": "B. The dropout rate at the flatten layer should be increased because the model is not generalized enough.",
    "explanation": "A 24-point gap between training and testing accuracy indicates overfitting, so regularization such as a higher dropout rate is needed; training for more epochs would widen the gap further.",
    "references": "https://www.tensorflow.org/tutorials/keras/overfit_and_underfit"
  },
  {
    "question": "A Machine Learning team uses Amazon SageMaker to train an Apache MXNet handwritten digit classifier model using a research dataset. The team wants to receive a notification when the model is overfitting. Auditors want to view the Amazon SageMaker log activity report to ensure there are no unauthorized API calls. What should the Machine Learning team do to address the requirements with the least amount of code and fewest steps?",
    "options": [
      "A. Implement an AWS Lambda function to log Amazon SageMaker API calls to Amazon S3. Add code to push",
      "B. Use AWS CloudTrail to log Amazon SageMaker API calls to Amazon S3. Add code to push a custom metric",
      "C. Implement an AWS Lambda function to log Amazon SageMaker API calls to AWS CloudTrail. Add code to",
      "D. Use AWS CloudTrail to log Amazon SageMaker API calls to Amazon S3. Set up Amazon SNS to receive a"
    ],
    "correct": "B. Use AWS CloudTrail to log Amazon SageMaker API calls to Amazon S3. Add code to push a custom metric",
    "explanation": "AWS CloudTrail records SageMaker API calls natively for the auditors, and pushing a custom CloudWatch metric with an alarm provides the overfitting notification; no Lambda function is required to produce the API log.",
    "references": ""
  },
  {
    "question": "A Machine Learning Specialist is building a prediction model for a large number of features using linear models, such as linear regression and logistic regression. During exploratory data analysis, the Specialist observes that many features are highly correlated with each other. This may make the model unstable. What should be done to reduce the impact of having such a large number of features?",
    "options": [
      "A. Perform one-hot encoding on highly correlated features.",
      "B. Use matrix multiplication on highly correlated features.",
      "C. Create a new feature space using principal component analysis (PCA)",
      "D. Apply the Pearson correlation coefficient."
    ],
    "correct": "C. Create a new feature space using principal component analysis (PCA)",
    "explanation": "PCA projects the correlated features onto a smaller set of orthogonal components, removing the multicollinearity that makes linear models unstable.
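A minimal scikit-learn sketch, with a hypothetical feature matrix containing a nearly collinear pair of columns:

import numpy as np
from sklearn.decomposition import PCA

# Hypothetical feature matrix with highly correlated columns
rng = np.random.default_rng(0)
x = rng.normal(size=(100, 1))
X = np.hstack([x, 2 * x + 0.01 * rng.normal(size=(100, 1)),
               rng.normal(size=(100, 1))])

pca = PCA(n_components=2)          # new, uncorrelated feature space
X_reduced = pca.fit_transform(X)
print(pca.explained_variance_ratio_)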
",
    "references": ""
  },
  {
    "question": "A Machine Learning Specialist is implementing a full Bayesian network on a dataset that describes public transit in New York City. One of the random variables is discrete, and represents the number of minutes New Yorkers wait for a bus given that the buses cycle every 10 minutes, with a mean of 3 minutes. Which prior probability distribution should the ML Specialist use for this variable?",
    "options": [
      "A. Poisson distribution",
      "B. Uniform distribution",
      "C. Normal distribution",
      "D. Binomial distribution"
    ],
    "correct": "A. Poisson distribution",
    "explanation": "The variable is a discrete count (minutes waited) with a known mean of 3, which is the situation a Poisson prior models.",
    "references": ""
  },
  {
    "question": "A Data Science team within a large company uses Amazon SageMaker notebooks to access data stored in Amazon S3 buckets. The IT Security team is concerned that internet-enabled notebook instances create a security vulnerability where malicious code running on the instances could compromise data privacy. The company mandates that all instances stay within a secured VPC with no internet access, and data communication traffic must stay within the AWS network. How should the Data Science team configure the notebook instance placement to meet these requirements?",
    "options": [
      "A. Associate the Amazon SageMaker notebook with a private subnet in a VPC. Place the Amazon SageMaker",
      "B. Associate the Amazon SageMaker notebook with a private subnet in a VPC. Use IAM policies to grant",
      "C. Associate the Amazon SageMaker notebook with a private subnet in a VPC. Ensure the VPC has S3 VPC"
    ],
    "correct": "",
    "explanation": "Explanation/Reference:",
    "references": ""
  },
  {
    "question": "A Machine Learning Specialist has created a deep learning neural network model that performs well on the training data but performs poorly on the test data. Which of the following methods should the Specialist consider using to correct this? (Choose three.)",
    "options": [
      "A. Decrease regularization.",
      "B. Increase regularization.",
      "C. Increase dropout.",
      "D. Decrease dropout."
    ],
    "correct": "",
    "explanation": "Explanation/Reference:",
    "references": ""
  },
  {
    "question": "A Data Scientist needs to create a serverless ingestion and analytics solution for high-velocity, real-time streaming data. The ingestion process must buffer and convert incoming records from JSON to a query-optimized, columnar format without data loss. The output datastore must be highly available, and Analysts must be able to run SQL queries against the data and connect to existing business intelligence dashboards. Which solution should the Data Scientist build to satisfy the requirements?",
    "options": [
      "A. Create a schema in the AWS Glue Data Catalog of the incoming data format. Use an Amazon Kinesis Data",
      "B. Write each JSON record to a staging location in Amazon S3. Use the S3 Put event to trigger an AWS",
      "D. Use Amazon Kinesis Data Analytics to ingest the streaming data and perform real-time SQL queries to"
    ],
    "correct": "A. Create a schema in the AWS Glue Data Catalog of the incoming data format. Use an Amazon Kinesis Data",
    "explanation": "Explanation/Reference:",
    "references": ""
  },
  {
    "question": "An online reseller has a large, multi-column dataset with one column missing 30% of its data. A Machine Learning Specialist believes that certain columns in the dataset could be used to reconstruct the missing data. Which reconstruction approach should the Specialist use to preserve the integrity of the dataset?",
    "options": [
      "A. Listwise deletion",
      "B. Last observation carried forward",
      "C. Multiple imputation",
      "D. Mean substitution"
    ],
    "correct": "C. Multiple imputation",
    "explanation": "Multiple imputation models each missing value conditional on the other columns and draws several plausible values, preserving the relationships in the data better than deletion or mean substitution.
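Scikit-learn's IterativeImputer offers a related model-based approach; a minimal sketch with a hypothetical matrix:

import numpy as np
# IterativeImputer is experimental and must be enabled explicitly
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Hypothetical data: the second column has missing entries
X = np.array([[1.0, 2.0], [2.0, np.nan], [3.0, 6.0], [4.0, np.nan]])

# Each missing value is regressed on the remaining columns
imputer = IterativeImputer(random_state=0)
print(imputer.fit_transform(X))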
",
    "references": "https://worldwidescience.org/topicpages/i/imputing+missing+values.html"
  },
  {
    "question": "A company is setting up an Amazon SageMaker environment. The corporate data security policy does not allow communication over the internet. How can the company enable the Amazon SageMaker service without enabling direct internet access to Amazon SageMaker notebook instances?",
    "options": [
      "A. Create a NAT gateway within the corporate VPC.",
      "B. Route Amazon SageMaker traffic through an on-premises network.",
      "C. Create Amazon SageMaker VPC interface endpoints within the corporate VPC.",
      "D. Create VPC peering with Amazon VPC hosting Amazon SageMaker."
    ],
    "correct": "C. Create Amazon SageMaker VPC interface endpoints within the corporate VPC.",
    "explanation": "Interface VPC endpoints (AWS PrivateLink) keep traffic to the SageMaker service on the AWS network; a NAT gateway would provide exactly the internet access the policy forbids.",
    "references": "https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-dg.pdf (46)"
  },
  {
    "question": "A Machine Learning Specialist is training a model to identify the make and model of vehicles in images. The Specialist wants to use transfer learning and an existing model trained on images of general objects. The Specialist collated a large custom dataset of pictures containing different vehicle makes and models. What should the Specialist do to initialize the model to re-train it with the custom data?",
    "options": [
      "A. Initialize the model with random weights in all layers including the last fully connected layer.",
      "B. Initialize the model with pre-trained weights in all layers and replace the last fully connected layer.",
      "C. Initialize the model with random weights in all layers and replace the last fully connected layer.",
      "D. Initialize the model with pre-trained weights in all layers including the last fully connected layer."
    ],
    "correct": "B. Initialize the model with pre-trained weights in all layers and replace the last fully connected layer.",
    "explanation": "Keeping the pre-trained weights transfers the general visual features, while a new fully connected layer learns the vehicle-specific classes.
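A minimal Keras sketch of this pattern; the class count and input size here are hypothetical:

import tensorflow as tf

# Pre-trained general-purpose backbone without its final classifier
base = tf.keras.applications.ResNet50(
    weights='imagenet', include_top=False, pooling='avg',
    input_shape=(224, 224, 3))

# New fully connected layer for the vehicle make/model classes
outputs = tf.keras.layers.Dense(40, activation='softmax')(base.output)
model = tf.keras.Model(base.input, outputs)

model.compile(optimizer='adam', loss='categorical_crossentropy')
model.summary()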
",
    "references": ""
  },
  {
    "question": "An office security agency conducted a successful pilot using 100 cameras installed at key locations within the main office. Images from the cameras were uploaded to Amazon S3 and tagged using Amazon Rekognition, and the results were stored in Amazon ES. The agency is now looking to expand the pilot into a full production system using thousands of video cameras in its office locations globally. The goal is to identify activities performed by non-employees in real time. Which solution should the agency consider?",
    "options": [
      "A. Use a proxy server at each local office and for each camera, and stream the RTSP feed to a unique",
      "B. Use a proxy server at each local office and for each camera, and stream the RTSP feed to a unique",
      "C. Install AWS DeepLens cameras and use the DeepLens_Kinesis_Video module to stream video to Amazon",
      "D. Install AWS DeepLens cameras and use the DeepLens_Kinesis_Video module to stream video to Amazon"
    ],
    "correct": "D. Install AWS DeepLens cameras and use the DeepLens_Kinesis_Video module to stream video to Amazon",
    "explanation": "Explanation/Reference:",
    "references": "https://aws.amazon.com/blogs/machine-learning/video-analytics-in-the-cloud-and-at-the-edge-with-aws-deeplens-and-kinesis-video-streams/"
  },
  {
    "question": "A Marketing Manager at a pet insurance company plans to launch a targeted marketing campaign on social media to acquire new customers. Currently, the company has the following data in Amazon Aurora: Profiles for all past and existing customers; Profiles for all past and existing insured pets; Policy-level information; Premiums received; Claims paid. What steps should be taken to implement a machine learning model to identify potential new customers on social media?",
    "options": [
      "A. Use regression on customer profile data to understand key characteristics of consumer segments. Find",
      "B. Use clustering on customer profile data to understand key characteristics of consumer segments. Find",
      "C. Use a recommendation engine on customer profile data to understand key characteristics of consumer",
      "D. Use a decision tree classifier engine on customer profile data to understand key characteristics of consumer"
    ],
    "correct": "B. Use clustering on customer profile data to understand key characteristics of consumer segments. Find",
    "explanation": "There are no labels identifying which social media users will become customers, so this is an unsupervised problem: clustering the existing customer profiles reveals the characteristics of the segments to target.",
    "references": ""
  },
  {
    "question": "A manufacturing company has a large set of labeled historical sales data. The manufacturer would like to predict how many units of a particular part should be produced each quarter. Which machine learning approach should be used to solve this problem?",
    "options": [
      "A. Logistic regression",
      "B. Random Cut Forest (RCF)",
      "C. Principal component analysis (PCA)",
      "D. Linear regression"
    ],
    "correct": "D. Linear regression",
    "explanation": "The target, the number of units to produce, is a continuous numeric value with labeled history, so this is a regression problem; logistic regression is for classification and RCF is for anomaly detection.",
    "references": ""
  },
  {
    "question": "A financial services company is building a robust serverless data lake on Amazon S3. The data lake should be flexible and meet the following requirements: Support querying old and new data on Amazon S3 through Amazon Athena and Amazon Redshift Spectrum. Support event-driven ETL pipelines. Provide a quick and easy way to understand metadata. Which approach meets these requirements?",
    "options": [
      "A. Use an AWS Glue crawler to crawl S3 data, an AWS Lambda function to trigger an AWS Glue ETL job, and",
      "B. Use an AWS Glue crawler to crawl S3 data, an AWS Lambda function to trigger an AWS Batch job, and an",
      "C. Use an AWS Glue crawler to crawl S3 data, an Amazon CloudWatch alarm to trigger an AWS Batch job,",
      "D. Use an AWS Glue crawler to crawl S3 data, an Amazon CloudWatch alarm to trigger an AWS Glue ETL"
    ],
    "correct": "A. Use an AWS Glue crawler to crawl S3 data, an AWS Lambda function to trigger an AWS Glue ETL job, and",
    "explanation": "Explanation/Reference:",
    "references": ""
  },
  {
    "question": "A company's Machine Learning Specialist needs to improve the training speed of a time-series forecasting model using TensorFlow. The training is currently implemented on a single-GPU machine and takes approximately 23 hours to complete. The training needs to be run daily. The model accuracy is acceptable, but the company anticipates a continuous increase in the size of the training data and a need to update the model on an hourly, rather than a daily, basis. The company also wants to minimize coding effort and infrastructure changes. What should the Machine Learning Specialist do to the training solution to allow it to scale for future demand?",
    "options": [
      "A. Do not change the TensorFlow code. Change the machine to one with a more powerful GPU to speed up",
      "B. Change the TensorFlow code to implement a Horovod distributed framework supported by Amazon",
      "C. Switch to using a built-in AWS SageMaker DeepAR model. Parallelize the training to as many machines as",
      "D. Move the training to Amazon EMR and distribute the workload to as many machines as needed to achieve"
    ],
    "correct": "B. Change the TensorFlow code to implement a Horovod distributed framework supported by Amazon",
    "explanation": "Explanation/Reference:",
    "references": ""
  },
  {
    "question": "Which of the following metrics should a Machine Learning Specialist generally use to compare/evaluate machine learning classification models against each other?",
    "options": [
      "A. Recall",
      "B. Misclassification rate",
      "C. Mean absolute percentage error (MAPE)",
      "D. Area Under the ROC Curve (AUC)"
    ],
    "correct": "D. Area Under the ROC Curve (AUC)",
    "explanation": "Explanation/Reference:",
    "references": ""
  },
  {
    "question": "A company is running a machine learning prediction service that generates 100 TB of predictions every day. A Machine Learning Specialist must generate a visualization of the daily precision-recall curve from the predictions, and forward a read-only version to the Business team. Which solution requires the LEAST coding effort?",
    "options": [
      "A. Run a daily Amazon EMR workflow to generate precision-recall data, and save the results in Amazon S3.",
      "B. Generate daily precision-recall data in Amazon QuickSight, and publish the results in a dashboard shared",
      "D. Generate daily precision-recall data in Amazon ES, and publish the results in a dashboard shared with the"
    ],
    "correct": "",
    "explanation": "Explanation/Reference:",
    "references": ""
  },
  {
    "question": "A Machine Learning Specialist is preparing data for training on Amazon SageMaker. The Specialist is using one of the SageMaker built-in algorithms for the training. The dataset is stored in .CSV format and is transformed into a numpy.array, which appears to be negatively affecting the speed of the training. What should the Specialist do to optimize the data for training on SageMaker?",
    "options": [
      "A. Use the SageMaker batch transform feature to transform the training data into a DataFrame.",
      "B. Use AWS Glue to compress the data into the Apache Parquet format.",
      "C. Transform the dataset into the RecordIO protobuf format.",
      "D. Use the SageMaker hyperparameter optimization feature to automatically optimize the data."
    ],
    "correct": "C. Transform the dataset into the RecordIO protobuf format.",
    "explanation": "Most SageMaker built-in algorithms are optimized for the protobuf RecordIO format, which also enables streaming the training data from Amazon S3.
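A minimal sketch using the write_numpy_to_dense_tensor helper from the SageMaker Python SDK; the arrays are hypothetical stand-ins for the .CSV-derived data:

import io
import numpy as np
import sagemaker.amazon.common as smac

# Hypothetical training matrix and labels loaded from the .CSV data
X = np.random.rand(100, 5).astype('float32')
y = np.random.randint(0, 2, 100).astype('float32')

# Serialize to the protobuf RecordIO format expected by the algorithm
buf = io.BytesIO()
smac.write_numpy_to_dense_tensor(buf, X, y)
buf.seek(0)  # the buffer can now be uploaded to Amazon S3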
",
    "references": ""
  },
  {
    "question": "A Machine Learning Specialist is required to build a supervised image-recognition model to identify a cat. The ML Specialist performs some tests and records the following results for a neural network-based image classifier: Total number of images available = 1,000. Test set images = 100 (constant test set). The ML Specialist notices that, in over 75% of the misclassified images, the cats were held upside down by their owners. Which techniques can be used by the ML Specialist to improve this specific test error?",
    "options": [
      "A. Increase the training data by adding variation in rotation for training images.",
      "B. Increase the number of epochs for model training",
      "C. Increase the number of layers for the neural network.",
      "D. Increase the dropout rate for the second-to-last layer."
    ],
    "correct": "A. Increase the training data by adding variation in rotation for training images.",
    "explanation": "The misclassifications are concentrated on an orientation (upside-down cats) that the training set does not cover, so rotation augmentation directly addresses this error; more epochs or layers would not add the missing variation.",
    "references": ""
  },
  {
    "question": "A Machine Learning Specialist needs to be able to ingest streaming data and store it in Apache Parquet files for exploration and analysis. Which of the following services would both ingest and store this data in the correct format?",
    "options": [
      "A. AWS DMS",
      "B. Amazon Kinesis Data Streams",
      "C. Amazon Kinesis Data Firehose",
      "D. Amazon Kinesis Data Analytics"
    ],
    "correct": "C. Amazon Kinesis Data Firehose",
    "explanation": "Explanation/Reference:",
    "references": ""
  },
  {
    "question": "A Data Scientist is developing a machine learning model to classify whether a financial transaction is fraudulent. The labeled data available for training consists of 100,000 non-fraudulent observations and 1,000 fraudulent observations. The Data Scientist applies the XGBoost algorithm to the data, resulting in the following confusion matrix when the trained model is applied to a previously unseen validation dataset. The accuracy of the model is 99.1%, but the Data Scientist has been asked to reduce the number of false negatives. Which combination of steps should the Data Scientist take to reduce the number of false negative predictions by the model? (Choose two.)",
    "options": [
      "A. Change the XGBoost eval_metric parameter to optimize based on rmse instead of error.",
      "B. Increase the XGBoost scale_pos_weight parameter to adjust the balance of positive and negative weights.",
      "C. Increase the XGBoost max_depth parameter because the model is currently underfitting the data.",
      "D. Change the XGBoost eval_metric parameter to optimize based on AUC instead of error.",
      "E. Decrease the XGBoost max_depth parameter because the model is currently overfitting the data."
    ],
    "correct": "",
    "explanation": "Explanation/Reference:
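Option B is the lever most directly aimed at false negatives: scale_pos_weight up-weights the rare fraudulent class so missed positives cost more. A minimal XGBoost sketch with hypothetical data:

import numpy as np
import xgboost as xgb

# Hypothetical imbalanced data: roughly 1% positive (fraud) labels
X = np.random.rand(1000, 4)
y = (np.random.rand(1000) < 0.01).astype(int)

dtrain = xgb.DMatrix(X, label=y)
params = {
    'objective': 'binary:logistic',
    'eval_metric': 'auc',
    # ratio of negatives to positives up-weights the fraud class
    'scale_pos_weight': float((y == 0).sum()) / max(1, (y == 1).sum()),
}
model = xgb.train(params, dtrain, num_boost_round=20)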
",
    "references": ""
  },
  {
    "question": "A Machine Learning Specialist is assigned a TensorFlow project using Amazon SageMaker for training, and needs to continue working for an extended period with no Wi-Fi access. Which approach should the Specialist use to continue working?",
    "options": [
      "A. Install Python 3 and boto3 on their laptop and continue the code development using that environment.",
      "B. Download the TensorFlow Docker container used in Amazon SageMaker from GitHub to their local",
      "C. Download TensorFlow from tensorflow.org to emulate the TensorFlow kernel in the SageMaker",
      "D. Download the SageMaker notebook to their local environment, then install Jupyter Notebooks on their"
    ],
    "correct": "B. Download the TensorFlow Docker container used in Amazon SageMaker from GitHub to their local",
    "explanation": "Explanation/Reference:",
    "references": ""
  },
  {
    "question": "A Machine Learning Specialist is working with a large cybersecurity company that manages security events in real time for companies around the world. The cybersecurity company wants to design a solution that will allow it to use machine learning to score malicious events as anomalies on the data as it is being ingested. The company also wants to be able to save the results in its data lake for later processing and analysis. What is the MOST efficient way to accomplish these tasks?",
    "options": [
      "A. Ingest the data using Amazon Kinesis Data Firehose, and use Amazon Kinesis Data Analytics Random Cut",
      "B. Ingest the data into Apache Spark Streaming using Amazon EMR, and use Spark MLlib with k-means to",
      "C. Ingest the data and store it in Amazon S3. Use AWS Batch along with the AWS Deep Learning AMIs to",
      "D. Ingest the data and store it in Amazon S3. Have an AWS Glue job that is triggered on demand transform"
    ],
    "correct": "A. Ingest the data using Amazon Kinesis Data Firehose, and use Amazon Kinesis Data Analytics Random Cut",
    "explanation": "Amazon Kinesis Data Analytics can apply its built-in RANDOM_CUT_FOREST function to score anomalies on the stream as it is ingested, while Kinesis Data Firehose delivers the events and scores to the Amazon S3 data lake with no clusters to manage.",
    "references": ""
  },
  {
    "question": "A Data Scientist wants to gain real-time insights into a data stream of GZIP files. Which solution would allow the use of SQL to query the stream with the LEAST latency?",
    "options": [
      "A. Amazon Kinesis Data Analytics with an AWS Lambda function to transform the data.",
      "B. AWS Glue with a custom ETL script to transform the data.",
      "C. An Amazon Kinesis Client Library to transform the data and save it to an Amazon ES cluster.",
      "D. Amazon Kinesis Data Firehose to transform the data and put it into an Amazon S3 bucket."
    ],
    "correct": "A. Amazon Kinesis Data Analytics with an AWS Lambda function to transform the data.",
    "explanation": "Explanation/Reference:",
    "references": "https://aws.amazon.com/big-data/real-time-analytics-featured-partners/"
  },
  {
    "question": "A retail company intends to use machine learning to categorize new products. A labeled dataset of current products was provided to the Data Science team. The dataset includes 1,200 products. The labeled dataset has 15 features for each product such as title, dimensions, weight, and price. Each product is labeled as belonging to one of six categories such as books, games, electronics, and movies. Which model should be used for categorizing new products using the provided dataset for training?",
    "options": [
      "A. An XGBoost model where the objective parameter is set to multi:softmax",
      "B. A deep convolutional neural network (CNN) with a softmax activation function for the last layer",
      "C. A regression forest where the number of trees is set equal to the number of product categories",
      "D. A DeepAR forecasting model based on a recurrent neural network (RNN)"
    ],
    "correct": "A. An XGBoost model where the objective parameter is set to multi:softmax",
    "explanation": "The data is a small, labeled, tabular dataset with six target classes, which suits a multi-class XGBoost classifier (multi:softmax); CNNs and DeepAR target images and time series, respectively.",
    "references": ""
  },
  {
    "question": "A Data Scientist is working on an application that performs sentiment analysis. The validation accuracy is poor, and the Data Scientist thinks that the cause may be a rich vocabulary and a low average frequency of words in the dataset. Which tool should be used to improve the validation accuracy?",
    "options": [
      "A. Amazon Comprehend syntax analysis and entity detection",
      "B. Amazon SageMaker BlazingText cbow mode",
      "C. Natural Language Toolkit (NLTK) stemming and stop word removal",
      "D. Scikit-learn term frequency-inverse document frequency (TF-IDF) vectorizer"
    ],
    "correct": "D. Scikit-learn term frequency-inverse document frequency (TF-IDF) vectorizer",
    "explanation": "Explanation/Reference:",
    "references": "https://monkeylearn.com/sentiment-analysis/"
  },
  {
    "question": "A Machine Learning Specialist is building a model to predict future employment rates based on a wide range of economic factors. While exploring the data, the Specialist notices that the magnitude of the input features vary greatly. The Specialist does not want variables with a larger magnitude to dominate the model. What should the Specialist do to prepare the data for model training?",
    "options": [
      "A. Apply quantile binning to group the data into categorical bins to keep any relationships in the data by",
      "B. Apply the Cartesian product transformation to create new combinations of fields that are independent of the",
      "C. Apply normalization to ensure each field will have a mean of 0 and a variance of 1 to remove any significant",
      "D. Apply the orthogonal sparse bigram (OSB) transformation to apply a fixed-size sliding window to generate"
    ],
    "correct": "C. Apply normalization to ensure each field will have a mean of 0 and a variance of 1 to remove any significant",
    "explanation": "Standardizing each feature to mean 0 and variance 1 keeps large-magnitude features from dominating the model.
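A minimal scikit-learn sketch of such standardization, with a hypothetical matrix mixing scales:

import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical features on very different scales
X = np.array([[50000.0, 0.2], [82000.0, 0.8], [61000.0, 0.5]])

scaler = StandardScaler()          # per-column mean 0, variance 1
X_scaled = scaler.fit_transform(X)
print(X_scaled.mean(axis=0), X_scaled.std(axis=0))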
{ "question": "A retail company intends to use machine learning to categorize new products. A labeled dataset of current products was provided to the Data Science team. The dataset includes 1,200 products. The labeled dataset has 15 features for each product such as title, dimensions, weight, and price. Each product is labeled as belonging to one of six categories such as books, games, electronics, and movies. Which model should be used for categorizing new products using the provided dataset for training?", "options": [ "A. An XGBoost model where the objective parameter is set to multi:softmax", "B. A deep convolutional neural network (CNN) with a softmax activation function for the last layer", "C. A regression forest where the number of trees is set equal to the number of product categories", "D. A DeepAR forecasting model based on a recurrent neural network (RNN)" ], "correct": "A. An XGBoost model where the objective parameter is set to multi:softmax", "explanation": "This is multi-class classification on a small, structured (tabular) dataset, which suits a gradient-boosted tree model; CNNs are designed for image-like inputs and DeepAR is for time-series forecasting.", "references": "" },
{ "question": "A Data Scientist is working on an application that performs sentiment analysis. The validation accuracy is poor, and the Data Scientist thinks that the cause may be a rich vocabulary and a low average frequency of words in the dataset. Which tool should be used to improve the validation accuracy?", "options": [ "A. Amazon Comprehend syntax analysis and entity detection", "B. Amazon SageMaker BlazingText cbow mode", "C. Natural Language Toolkit (NLTK) stemming and stop word removal", "D. Scikit-learn term frequency-inverse document frequency (TF-IDF) vectorizer" ], "correct": "D. Scikit-learn term frequency-inverse document frequency (TF-IDF) vectorizer", "explanation": "Explanation/Reference:", "references": "https://monkeylearn.com/sentiment-analysis/" },
{ "question": "A Machine Learning Specialist is building a model to predict future employment rates based on a wide range of economic factors. While exploring the data, the Specialist notices that the magnitude of the input features varies greatly. The Specialist does not want variables with a larger magnitude to dominate the model. What should the Specialist do to prepare the data for model training?", "options": [ "A. Apply quantile binning to group the data into categorical bins to keep any relationships in the data by", "B. Apply the Cartesian product transformation to create new combinations of fields that are independent of the", "C. Apply normalization to ensure each field will have a mean of 0 and a variance of 1 to remove any significant", "D. Apply the orthogonal sparse bigram (OSB) transformation to apply a fixed-size sliding window to generate" ], "correct": "C. Apply normalization to ensure each field will have a mean of 0 and a variance of 1 to remove any significant", "explanation": "Explanation/Reference:", "references": "https://docs.aws.amazon.com/machine-learning/latest/dg/data-transformations-reference.html" },
{ "question": "A Machine Learning Specialist must build out a process to query a dataset on Amazon S3 using Amazon Athena. The dataset contains more than 800,000 records stored as plaintext CSV files. Each record contains 200 columns and is approximately 1.5 MB in size. Most queries will span 5 to 10 columns only. How should the Machine Learning Specialist transform the dataset to minimize query runtime?", "options": [ "A. Convert the records to Apache Parquet format.", "B. Convert the records to JSON format.", "C. Convert the records to GZIP CSV format.", "D. Convert the records to XML format." ], "correct": "A. Convert the records to Apache Parquet format.", "explanation": "Parquet is a columnar format, so Athena scans only the 5 to 10 columns a query touches rather than entire rows. Using compression will further reduce the amount of data scanned by Amazon Athena, and also reduce your S3 bucket storage. Supported formats: GZIP, LZO, SNAPPY (Parquet) and ZLIB.", "references": "https://www.cloudforecast.io/blog/using-parquet-on-athena-to-save-money-on-aws/" },
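For the Parquet conversion discussed above, a minimal sketch with pandas; the file names are invented for illustration, and the pyarrow package is assumed to be installed:

```python
import pandas as pd

df = pd.read_csv("records.csv")      # plaintext CSV, ~200 columns per record
df.to_parquet("records.parquet")     # columnar format; uses pyarrow under the hood

# Athena then scans only the handful of columns a query touches instead of
# whole 1.5 MB rows, which is what cuts both runtime and per-query scan cost.
```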
{ "question": "A Machine Learning Specialist is developing a daily ETL workflow containing multiple ETL jobs. The workflow consists of the following processes: · Start the workflow as soon as data is uploaded to Amazon S3. · When all the datasets are available in Amazon S3, start an ETL job to join the uploaded datasets with multiple terabyte-sized datasets already stored in Amazon S3. · Store the results of joining datasets in Amazon S3. · If one of the jobs fails, send a notification to the Administrator. Which configuration will meet these requirements?", "options": [ "A. Use AWS Lambda to trigger an AWS Step Functions workflow to wait for dataset uploads to complete in", "B. Develop the ETL workflow using AWS Lambda to start an Amazon SageMaker notebook instance. Use a", "C. Develop the ETL workflow using AWS Batch to trigger the start of ETL jobs when data is uploaded to", "D. Use AWS Lambda to chain other Lambda functions to read and join the datasets in Amazon S3 as soon as" ], "correct": "A. Use AWS Lambda to trigger an AWS Step Functions workflow to wait for dataset uploads to complete in", "explanation": "Explanation/Reference:", "references": "https://aws.amazon.com/step-functions/use-cases/" },
{ "question": "An agency collects census information within a country to determine healthcare and social program needs by province and city. The census form collects responses for approximately 500 questions from each citizen. Which combination of algorithms would provide the appropriate insights? (Choose two.)", "options": [ "A. The factorization machines (FM) algorithm", "B. The Latent Dirichlet Allocation (LDA) algorithm", "C. The principal component analysis (PCA) algorithm", "D. The k-means algorithm" ], "correct": "C. The principal component analysis (PCA) algorithm and D. The k-means algorithm", "explanation": "The PCA and k-means algorithms are useful here: PCA reduces the roughly 500 correlated responses to a small set of components, and k-means then groups citizens into segments for program planning.", "references": "" },
{ "question": "A large consumer goods manufacturer has the following products on sale: · 34 different toothpaste variants · 48 different toothbrush variants · 43 different mouthwash variants The entire sales history of all these products is available in Amazon S3. Currently, the company is using custom-built autoregressive integrated moving average (ARIMA) models to forecast demand for these products. The company wants to predict the demand for a new product that will soon be launched. Which solution should a Machine Learning Specialist apply?", "options": [ "A. Train a custom ARIMA model to forecast demand for the new product.", "B. Train an Amazon SageMaker DeepAR algorithm to forecast demand for the new product.", "C. Train an Amazon SageMaker k-means clustering algorithm to forecast demand for the new product.", "D. Train a custom XGBoost model to forecast demand for the new product." ], "correct": "B. Train an Amazon SageMaker DeepAR algorithm to forecast demand for the new product.", "explanation": "DeepAR trains a single model across many related time series, so it can produce a cold-start forecast for a new product from the history of similar products, which per-series ARIMA models cannot.", "references": "" },
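For the DeepAR option above, a small sketch of the JSON Lines training layout the algorithm reads; the series values and file name are made up. The cat field is what lets one model share demand patterns across related products, which is how a brand-new product with little history can still be forecast:

```python
import json

# One JSON line per time series; each related product is its own series.
series = [
    {"start": "2024-01-01 00:00:00", "target": [12.0, 15.0, 14.0], "cat": [0]},  # a toothpaste variant
    {"start": "2024-01-01 00:00:00", "target": [3.0, 4.0, 6.0], "cat": [1]},     # a toothbrush variant
]

with open("train.json", "w") as f:
    for s in series:
        f.write(json.dumps(s) + "\n")
```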
{ "question": "A Machine Learning Specialist must read a dataset in Amazon S3 that is encrypted with server-side encryption using an AWS KMS key from an Amazon SageMaker notebook instance. How should the notebook instance be set up so it can read the encrypted data?", "options": [ "A. Define security group(s) to allow all HTTP inbound/outbound traffic and assign those security group(s) to the", "B. Configure the Amazon SageMaker notebook instance to have access to the VPC. Grant permission in the", "C. Assign an IAM role to the Amazon SageMaker notebook with S3 read access to the dataset. Grant", "D. Assign the same KMS key used to encrypt data in Amazon S3 to the Amazon SageMaker notebook" ], "correct": "C. Assign an IAM role to the Amazon SageMaker notebook with S3 read access to the dataset. Grant", "explanation": "The notebook's IAM role needs S3 read access to the dataset, and the KMS key policy must grant that role permission to use the key for decryption.", "references": "https://docs.aws.amazon.com/sagemaker/latest/dg/encryption-at-rest.html" },
{ "question": "A Data Scientist needs to migrate an existing on-premises ETL process to the cloud. The current process runs at regular time intervals and uses PySpark to combine and format multiple large data sources into a single consolidated output for downstream processing. The Data Scientist has been given the following requirements for the cloud solution: · Combine multiple data sources. · Reuse existing PySpark logic. · Run the solution on the existing schedule. · Minimize the number of servers that will need to be managed. Which architecture should the Data Scientist use to build this solution?", "options": [ "A. Write the raw data to Amazon S3. Schedule an AWS Lambda function to submit a Spark step to a", "B. Write the raw data to Amazon S3. Create an AWS Glue ETL job to perform the ETL processing against the", "C. Write the raw data to Amazon S3. Schedule an AWS Lambda function to run on the existing schedule and", "D. Use Amazon Kinesis Data Analytics to stream the input data and perform real-time SQL queries against the" ], "correct": "B. Write the raw data to Amazon S3. Create an AWS Glue ETL job to perform the ETL processing against the", "explanation": "AWS Glue runs PySpark jobs serverlessly and supports scheduled triggers, so the existing logic and schedule can be reused with no servers to manage.", "references": "" },
{ "question": "A Data Scientist is building a model to predict customer churn using a dataset of 100 continuous numerical features. The Marketing team has not provided any insight about which features are relevant for churn prediction. The Marketing team wants to interpret the model and see the direct impact of relevant features on the model outcome. While training a logistic regression model, the Data Scientist observes that there is a wide gap between the training and validation set accuracy. Which methods can the Data Scientist use to improve the model performance and satisfy the Marketing team's needs? (Choose two.)", "options": [ "A. Add L1 regularization to the classifier", "B. Add features to the dataset", "C. Perform recursive feature elimination", "D. Perform t-distributed stochastic neighbor embedding (t-SNE)" ], "correct": "", "explanation": "Explanation/Reference:", "references": "" },
{ "question": "An aircraft engine manufacturing company is measuring 200 performance metrics in a time-series. Engineers want to detect critical manufacturing defects in near-real time during testing. All of the data needs to be stored for offline analysis. What approach would be the MOST effective to perform near-real time defect detection?", "options": [ "A. Use AWS IoT Analytics for ingestion, storage, and further analysis. Use Jupyter notebooks from within AWS", "B. Use Amazon S3 for ingestion, storage, and further analysis. Use an Amazon EMR cluster to carry out", "C. Use Amazon S3 for ingestion, storage, and further analysis. Use the Amazon SageMaker Random Cut Forest (RCF) algorithm to determine anomalies.", "D. Use Amazon Kinesis Data Firehose for ingestion and Amazon Kinesis Data Analytics Random Cut Forest" ], "correct": "D. Use Amazon Kinesis Data Firehose for ingestion and Amazon Kinesis Data Analytics Random Cut Forest", "explanation": "Firehose both delivers the sensor data to storage for offline analysis and feeds Kinesis Data Analytics, whose RANDOM_CUT_FOREST function scores defects in near-real time.", "references": "" },
{ "question": "A Machine Learning team runs its own training algorithm on Amazon SageMaker. The training algorithm requires external assets. The team needs to submit both its own algorithm code and algorithm-specific parameters to Amazon SageMaker. What combination of services should the team use to build a custom algorithm in Amazon SageMaker? (Choose two.)", "options": [ "A. AWS Secrets Manager", "B. AWS CodeStar", "C. Amazon ECR", "D. Amazon ECS" ], "correct": "", "explanation": "Explanation/Reference:", "references": "" },
{ "question": "A Machine Learning Specialist wants to determine the appropriate SageMakerVariantInvocationsPerInstance setting for an endpoint automatic scaling configuration. The Specialist has performed a load test on a single instance and determined that peak requests per second (RPS) without service degradation is about 20 RPS. As this is the first deployment, the Specialist intends to set the invocation safety factor to 0.5. Based on the stated parameters and given that the invocations per instance setting is measured on a per-minute basis, what should the Specialist set as the SageMakerVariantInvocationsPerInstance setting?", "options": [ "A. 10", "B. 30", "C. 600", "D. 2,400" ], "correct": "C. 600", "explanation": "20 RPS × 60 seconds × 0.5 safety factor = 600 invocations per minute.", "references": "" },
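The 600 above follows from straightforward arithmetic, spelled out here for clarity:

```python
peak_rps = 20          # sustained requests per second from the single-instance load test
safety_factor = 0.5    # first-deployment headroom

# The autoscaling metric counts invocations per minute, so convert RPS to
# requests per minute, then apply the safety factor: 20 * 60 * 0.5 = 600.
invocations_per_instance = int(peak_rps * 60 * safety_factor)
print(invocations_per_instance)  # 600
```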
{ "question": "A company uses a long short-term memory (LSTM) model to evaluate the risk factors of a particular energy sector. The model reviews multi-page text documents to analyze each sentence of the text and categorize it as either a potential risk or no risk. The model is not performing well, even though the Data Scientist has experimented with many different network structures and tuned the corresponding hyperparameters. Which approach will provide the MAXIMUM performance boost?", "options": [ "A. Initialize the words by term frequency-inverse document frequency (TF-IDF) vectors pretrained on a large", "B. Use gated recurrent units (GRUs) instead of LSTM and run the training process until the validation loss", "C. Reduce the learning rate and run the training process until the training loss stops decreasing.", "D. Initialize the words by word2vec embeddings pretrained on a large collection of news articles related to the" ], "correct": "D. Initialize the words by word2vec embeddings pretrained on a large collection of news articles related to the", "explanation": "Pretrained domain-related word embeddings inject semantic knowledge the small training corpus cannot provide, which typically yields a far larger boost than optimizer tweaks or swapping LSTM cells for GRUs.", "references": "" },
{ "question": "A Machine Learning Specialist needs to move and transform data in preparation for training. Some of the data needs to be processed in near-real time, and other data can be moved hourly. Existing Amazon EMR MapReduce jobs will perform the cleaning and feature engineering on the data. Which of the following services can feed data to the MapReduce jobs? (Choose two.)", "options": [ "A. AWS DMS", "B. Amazon Kinesis", "C. AWS Data Pipeline", "D. Amazon Athena" ], "correct": "", "explanation": "Explanation/Reference:", "references": "" },
{ "question": "A Machine Learning Specialist has packaged inference code for a trained model into a Docker container and wants Amazon SageMaker to host the model. What should the Specialist do so that Amazon SageMaker can use the container?", "options": [ "A. Build the Docker image with the inference code. Tag the Docker image with the registry hostname and", "B. Serialize the trained model so the format is compressed for deployment. Tag the Docker image with the", "C. Serialize the trained model so the format is compressed for deployment. Build the image and upload it to", "D. Build the Docker image with the inference code. Configure Docker Hub and upload the image to Amazon" ], "correct": "A. Build the Docker image with the inference code. Tag the Docker image with the registry hostname and", "explanation": "Amazon SageMaker pulls hosting images from Amazon ECR, so the image must be tagged with the ECR registry hostname and pushed there; Docker Hub is not supported.", "references": "" },
{ "question": "A trucking company is collecting live image data from its fleet of trucks across the globe. The data is growing rapidly and approximately 100 GB of new data is generated every day. The company wants to explore machine learning use cases while ensuring the data is only accessible to specific IAM users. Which storage option provides the most processing flexibility and will allow access control with IAM?", "options": [ "A. Use a database, such as Amazon DynamoDB, to store the images, and set the IAM policies to restrict", "B. Use an Amazon S3-backed data lake to store the raw images, and set up the permissions using bucket", "C. Set up Amazon EMR with Hadoop Distributed File System (HDFS) to store the files, and restrict access", "D. Configure Amazon EFS with IAM policies to make the data available to Amazon EC2 instances owned by" ], "correct": "B. Use an Amazon S3-backed data lake to store the raw images, and set up the permissions using bucket", "explanation": "An S3-backed data lake offers the most flexible processing options for ML workloads, and bucket policies combined with IAM policies restrict access to specific IAM users, which HDFS does not provide.", "references": "" },
{ "question": "A credit card company wants to build a credit scoring model to help predict whether a new credit card applicant will default on a credit card payment. The company has collected data from a large number of sources with thousands of raw attributes. Early experiments to train a classification model revealed that many attributes are highly correlated, the large number of features slows down the training speed significantly, and that there are some overfitting issues. The Data Scientist on this project would like to speed up the model training time without losing a lot of information from the original dataset. Which feature engineering technique should the Data Scientist use to meet the objectives?", "options": [ "A. Run self-correlation on all features and remove highly correlated features", "B. Normalize all numerical values to be between 0 and 1", "C. Use an autoencoder or principal component analysis (PCA) to replace original features with new features", "D. Cluster raw data using k-means and use sample data from each cluster to build a new dataset" ], "correct": "C. Use an autoencoder or principal component analysis (PCA) to replace original features with new features", "explanation": "PCA (or an autoencoder) collapses the highly correlated attributes into a much smaller set of components, cutting training time and overfitting while retaining most of the information.", "references": "" },
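For the PCA-based reduction discussed above, a minimal scikit-learn sketch on synthetic correlated data; all sizes and the 95% variance threshold are illustrative:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
base = rng.normal(size=(1000, 20))
# Derive 180 extra columns as linear mixes of the base 20: highly correlated features.
X = np.hstack([base, base @ rng.normal(size=(20, 180))])

X_scaled = StandardScaler().fit_transform(X)   # PCA assumes comparable scales
pca = PCA(n_components=0.95)                   # keep components explaining 95% of variance
X_reduced = pca.fit_transform(X_scaled)
print(X.shape, "->", X_reduced.shape)          # far fewer features, faster training
```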
{ "question": "A Data Scientist is training a multilayer perceptron (MLP) on a dataset with multiple classes. The target class of interest is unique compared to the other classes within the dataset, but it does not achieve an acceptable recall metric. The Data Scientist has already tried varying the number and size of the MLP's hidden layers, which has not significantly improved the results. A solution to improve recall must be implemented as quickly as possible. Which techniques should be used to meet these requirements?", "options": [ "A. Gather more data using Amazon Mechanical Turk and then retrain", "B. Train an anomaly detection model instead of an MLP", "C. Train an XGBoost model instead of an MLP", "D. Add class weights to the MLP's loss function and then retrain" ], "correct": "D. Add class weights to the MLP's loss function and then retrain", "explanation": "The target class is rare, so weighting it more heavily in the loss function directly raises recall and only requires retraining the existing network, which is the quickest change.", "references": "" },
{ "question": "A Machine Learning Specialist works for a credit card processing company and needs to predict which transactions may be fraudulent in near-real time. Specifically, the Specialist must train a model that returns the probability that a given transaction may be fraudulent. How should the Specialist frame this business problem?", "options": [ "A. Streaming classification", "B. Binary classification", "C. Multi-category classification", "D. Regression classification" ], "correct": "B. Binary classification", "explanation": "Each transaction is either fraudulent or not, and the model outputs the probability of the positive class, which is exactly binary classification.", "references": "" },
{ "question": "A real estate company wants to create a machine learning model for predicting housing prices based on a historical dataset. The dataset contains 32 features. Which model will meet the business requirement?", "options": [ "A. Logistic regression", "B. Linear regression", "C. K-means", "D. Principal component analysis (PCA)" ], "correct": "B. Linear regression", "explanation": "Explanation/Reference:", "references": "" },
{ "question": "A Machine Learning Specialist is applying a linear least squares regression model to a dataset with 1,000 records and 50 features. Prior to training, the ML Specialist notices that two features are perfectly linearly dependent. Why could this be an issue for the linear least squares regression model?", "options": [ "A. It could cause the backpropagation algorithm to fail during training", "B. It could create a singular matrix during optimization, which fails to define a unique solution", "C. It could modify the loss function during optimization, causing it to fail during training", "D. It could introduce non-linear dependencies within the data, which could invalidate the linear assumptions of" ], "correct": "B. It could create a singular matrix during optimization, which fails to define a unique solution", "explanation": "Perfect collinearity makes the normal-equations matrix X^T X rank deficient (singular), so the least squares problem has no unique solution.", "references": "" },
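The singular-matrix failure described above is easy to demonstrate numerically; this sketch makes the 50th feature an exact multiple of the first:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 49))
X = np.hstack([X, 2.0 * X[:, [0]]])   # 50th column perfectly linearly dependent on the 1st

gram = X.T @ X                         # normal-equations matrix for least squares
print(np.linalg.matrix_rank(gram))     # 49 < 50: rank deficient
# Inverting gram fails (or blows up numerically), so no unique coefficient
# vector minimizes the squared error; infinitely many fits are equally good.
```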
{ "question": "Given the following confusion matrix for a movie classification model, what is the true class frequency for Romance and the predicted class frequency for Adventure?", "options": [ "A. The true class frequency for Romance is 77.56% and the predicted class frequency for Adventure is", "B. The true class frequency for Romance is 57.92% and the predicted class frequency for Adventure is", "C. The true class frequency for Romance is 0.78 and the predicted class frequency for Adventure is (0.47-", "D. The true class frequency for Romance is 77.56% * 0.78 and the predicted class frequency for Adventure is" ], "correct": "B. The true class frequency for Romance is 57.92% and the predicted class frequency for Adventure is", "explanation": "Explanation/Reference:", "references": "" },
{ "question": "A Machine Learning Specialist wants to bring a custom algorithm to Amazon SageMaker and implements the algorithm in a Docker container supported by Amazon SageMaker. How should the Specialist package the Docker container so that Amazon SageMaker can launch the training correctly?", "options": [ "A. Modify the bash_profile file in the container and add a bash command to start the training program", "B. Use CMD config in the Dockerfile to add the training program as a CMD of the image", "C. Configure the training program as an ENTRYPOINT named train", "D. Copy the training program to directory /opt/ml/train" ], "correct": "B. Use CMD config in the Dockerfile to add the training program as a CMD of the image", "explanation": "Explanation/Reference:", "references": "" },
{ "question": "A Data Scientist needs to analyze employment data. The dataset contains approximately 10 million observations on people across 10 different features. During the preliminary analysis, the Data Scientist notices that income and age distributions are not normal. While income levels show a right skew as expected, with fewer individuals having a higher income, the age distribution also shows a right skew, with fewer older individuals participating in the workforce. Which feature transformations can the Data Scientist apply to fix the incorrectly skewed data? (Choose two.)", "options": [ "A. Cross-validation", "B. Numerical value binning", "C. High-degree polynomial transformation", "D. Logarithmic transformation" ], "correct": "", "explanation": "Explanation/Reference:", "references": "" },
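For the logarithmic transformation option above, a small numpy sketch on synthetic right-skewed data; the distribution parameters are invented:

```python
import numpy as np

rng = np.random.default_rng(2)
income = rng.lognormal(mean=10.5, sigma=0.8, size=10_000)  # right-skewed, income-like values

income_log = np.log1p(income)   # log1p also handles zero values safely
# The transformed values are much closer to symmetric/normal, which is the goal
# when correcting right-skewed features such as income or age participation.
print(np.mean(income) / np.median(income), np.mean(income_log) / np.median(income_log))
```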
{ "question": "A web-based company wants to improve its conversion rate on its landing page. Using a large historical dataset of customer visits, the company has repeatedly trained a multi-class deep learning network algorithm on Amazon SageMaker. However, there is an overfitting problem: training data shows 90% accuracy in predictions, while test data shows 70% accuracy only. The company needs to boost the generalization of its model before deploying it into production to maximize conversions of visits to purchases. Which action is recommended to provide the HIGHEST accuracy model for the company's test and validation data?", "options": [ "A. Increase the randomization of training data in the mini-batches used in training", "B. Allocate a higher proportion of the overall data to the training dataset", "C. Apply L1 or L2 regularization and dropouts to the training", "D. Reduce the number of layers and units (or neurons) from the deep learning network" ], "correct": "C. Apply L1 or L2 regularization and dropouts to the training", "explanation": "Regularization and dropout directly combat overfitting without sacrificing model capacity, closing the gap between the 90% training and 70% test accuracy.", "references": "" },
{ "question": "A Machine Learning Specialist is given a structured dataset on the shopping habits of a company's customer base. The dataset contains thousands of columns of data and hundreds of numerical columns for each customer. The Specialist wants to identify whether there are natural groupings for these columns across all customers and visualize the results as quickly as possible. What approach should the Specialist take to accomplish these tasks?", "options": [ "A. Embed the numerical features using the t-distributed stochastic neighbor embedding (t-SNE) algorithm and", "B. Run k-means using the Euclidean distance measure for different values of k and create an elbow plot.", "C. Embed the numerical features using the t-distributed stochastic neighbor embedding (t-SNE) algorithm and", "D. Run k-means using the Euclidean distance measure for different values of k and create box plots for each" ], "correct": "B. Run k-means using the Euclidean distance measure for different values of k and create an elbow plot.", "explanation": "Explanation/Reference:", "references": "" },
{ "question": "A Machine Learning Specialist is planning to create a long-running Amazon EMR cluster. The EMR cluster will have 1 master node, 10 core nodes, and 20 task nodes. To save on costs, the Specialist will use Spot Instances in the EMR cluster. Which nodes should the Specialist launch on Spot Instances?", "options": [ "A. Master node", "B. Any of the core nodes", "C. Any of the task nodes", "D. Both core and task nodes" ], "correct": "C. Any of the task nodes", "explanation": "For a long-running cluster, losing the master node or the core nodes (which hold HDFS data) would take down the cluster, so only stateless task nodes are safe to run on interruptible Spot Instances.", "references": "" },
{ "question": "A manufacturer of car engines collects data from cars as they are being driven. The data collected includes timestamp, engine temperature, rotations per minute (RPM), and other sensor readings. The company wants to predict when an engine is going to have a problem, so it can notify drivers in advance to get engine maintenance. The engine data is loaded into a data lake for training. Which is the MOST suitable predictive model that can be deployed into production?", "options": [ "A. Add labels over time to indicate which engine faults occur at what time in the future to turn this into a", "B. This data requires an unsupervised learning algorithm. Use Amazon SageMaker k-means to cluster the", "C. Add labels over time to indicate which engine faults occur at what time in the future to turn this into a", "D. This data is already formulated as a time series. Use Amazon SageMaker seq2seq to model the time" ], "correct": "B. This data requires an unsupervised learning algorithm. Use Amazon SageMaker k-means to cluster the", "explanation": "Explanation/Reference:", "references": "" },
{ "question": "A company wants to predict the sale prices of houses based on available historical sales data. The target variable in the company's dataset is the sale price. The features include parameters such as the lot size, living area measurements, non-living area measurements, number of bedrooms, number of bathrooms, year built, and postal code. The company wants to use multi-variable linear regression to predict house sale prices. Which step should a machine learning specialist take to remove features that are irrelevant for the analysis and reduce the model's complexity?", "options": [ "A. Plot a histogram of the features and compute their standard deviation. Remove features with high variance.", "B. Plot a histogram of the features and compute their standard deviation. Remove features with low variance.", "C. Build a heatmap showing the correlation of the dataset against itself. Remove features with low mutual", "D. Run a correlation check of all features against the target variable. Remove features with low target variable" ], "correct": "D. Run a correlation check of all features against the target variable. Remove features with low target variable", "explanation": "Features that barely correlate with the sale price contribute little to a linear model; variance alone (options A and B) says nothing about relevance to the target.", "references": "" },
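For the correlation-against-target check above, a minimal pandas sketch; the column names and values are invented for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    "lot_size": [8000, 9500, 4000, 12000, 6500],
    "bedrooms": [3, 4, 2, 5, 3],
    "year_built": [1990, 2005, 1978, 2015, 1999],
    "sale_price": [250_000, 340_000, 180_000, 460_000, 275_000],
})

# Correlate every feature with the target, then flag weak candidates to drop.
corr = df.corr(numeric_only=True)["sale_price"].drop("sale_price")
weak = corr[corr.abs() < 0.1].index.tolist()
print(corr.sort_values(ascending=False))
print("candidates to remove:", weak)
```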
{ "question": "A company wants to classify user behavior as either fraudulent or normal. Based on internal research, a machine learning specialist will build a binary classifier based on two features: age of account, denoted by x, and transaction month, denoted by y. The class distributions are illustrated in the provided figure. The positive class is portrayed in red, while the negative class is portrayed in black. Which model would have the HIGHEST accuracy?", "options": [ "A. Linear support vector machine (SVM)", "B. Decision tree", "C. Support vector machine (SVM) with a radial basis function kernel", "D. Single perceptron with a Tanh activation function" ], "correct": "", "explanation": "Explanation/Reference:", "references": "" },
{ "question": "A health care company is planning to use neural networks to classify their X-ray images into normal and abnormal classes. The labeled data is divided into a training set of 1,000 images and a test set of 200 images. The initial training of a neural network model with 50 hidden layers yielded 99% accuracy on the training set, but only 55% accuracy on the test set. What changes should the Specialist consider to solve this issue? (Choose three.)", "options": [ "A. Choose a higher number of layers", "B. Choose a lower number of layers", "C. Choose a smaller learning rate", "D. Enable dropout" ], "correct": "", "explanation": "Explanation/Reference:", "references": "" },
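The fixes this question points at (fewer layers, a smaller learning rate, dropout) can be combined in one small model definition. A hedged tf.keras sketch; the input size, layer widths, and rates are assumptions for illustration, not tuned values:

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(128, 128)),     # assumed grayscale X-ray size
    tf.keras.layers.Dense(64, activation="relu"),        # far fewer than 50 hidden layers
    tf.keras.layers.Dropout(0.5),                        # randomly silence units during training
    tf.keras.layers.Dense(1, activation="sigmoid"),      # normal vs. abnormal
])
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),  # smaller learning rate
    loss="binary_crossentropy",
    metrics=["accuracy"],
)
```

With only 1,000 training images, shrinking capacity and regularizing is usually what closes a 99%-train / 55%-test gap.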
{ "question": "This graph shows the training and validation loss against the epochs for a neural network. The network being trained is as follows: · Two dense layers, one output neuron · 100 neurons in each layer · 100 epochs · Random initialization of weights Which technique can be used to improve model performance in terms of accuracy in the validation set?", "options": [ "A. Early stopping", "B. Random initialization of weights with appropriate seed", "C. Increasing the number of epochs", "D. Adding another layer with the 100 neurons" ], "correct": "C. Increasing the number of epochs", "explanation": "Explanation/Reference:", "references": "" },
{ "question": "A Machine Learning Specialist is attempting to build a linear regression model. Given the displayed residual plot only, what is the MOST likely problem with the model?", "options": [ "A. Linear regression is inappropriate. The residuals do not have constant variance.", "B. Linear regression is inappropriate. The underlying data has outliers.", "C. Linear regression is appropriate. The residuals have a zero mean.", "D. Linear regression is appropriate. The residuals have constant variance." ], "correct": "D. Linear regression is appropriate. The residuals have constant variance.", "explanation": "Explanation/Reference:", "references": "" },
{ "question": "A large company has developed a BI application that generates reports and dashboards using data collected from various operational metrics. The company wants to provide executives with an enhanced experience so they can use natural language to get data from the reports. The company wants the executives to be able to ask questions using written and spoken interfaces. Which combination of services can be used to build this conversational interface? (Choose three.)", "options": [ "A. Alexa for Business", "B. Amazon Connect", "C. Amazon Lex", "D. Amazon Polly", "E. Amazon Comprehend" ], "correct": "", "explanation": "Explanation/Reference:", "references": "" },
{ "question": "A machine learning specialist works for a fruit processing company and needs to build a system that categorizes apples into three types. The specialist has collected a dataset that contains 150 images for each type of apple and applied transfer learning on a neural network that was pretrained on ImageNet with this dataset. The company requires at least 85% accuracy to make use of the model. After an exhaustive grid search, the optimal hyperparameters produced the following: · 68% accuracy on the training set · 67% accuracy on the validation set What can the machine learning specialist do to improve the system's accuracy?", "options": [ "A. Upload the model to an Amazon SageMaker notebook instance and use the Amazon SageMaker HPO", "B. Add more data to the training set and retrain the model using transfer learning to reduce the bias.", "C. Use a neural network model with more layers that are pretrained on ImageNet and apply transfer learning to", "D. Train a new model using the current neural network architecture." ], "correct": "B. Add more data to the training set and retrain the model using transfer learning to reduce the bias.", "explanation": "Explanation/Reference:", "references": "" },
{ "question": "A company uses camera images of the tops of items displayed on store shelves to determine which items were removed and which ones still remain. After several hours of data labeling, the company has a total of 1,000 hand-labeled images covering 10 distinct items. The training results were poor. Which machine learning approach fulfills the company's long-term needs?", "options": [ "A. Convert the images to grayscale and retrain the model", "B. Reduce the number of distinct items from 10 to 2, build the model, and iterate", "C. Attach different colored labels to each item, take the images again, and build the model", "D. Augment training data for each item using image variants like inversions and translations, build the model," ], "correct": "D. Augment training data for each item using image variants like inversions and translations, build the model,", "explanation": "With only 100 labeled images per item, data augmentation is the scalable way to grow the training set and improve generalization; grayscale conversion discards useful color information.", "references": "" },
{ "question": "A Data Scientist is developing a binary classifier to predict whether a patient has a particular disease on a series of test results. The Data Scientist has data on 400 patients randomly selected from the population. The disease is seen in 3% of the population. Which cross-validation strategy should the Data Scientist adopt?", "options": [ "A. A k-fold cross-validation strategy with k=5", "B. A stratified k-fold cross-validation strategy with k=5", "C. A k-fold cross-validation strategy with k=5 and 3 repeats", "D. An 80/20 stratified split between training and validation" ], "correct": "B. A stratified k-fold cross-validation strategy with k=5", "explanation": "Explanation/Reference:", "references": "" },
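For the stratified k-fold choice above, a minimal scikit-learn sketch showing that each fold preserves the 3% prevalence; the feature values are synthetic:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

rng = np.random.default_rng(3)
X = rng.normal(size=(400, 10))      # 400 patients, 10 test-result features
y = np.zeros(400, dtype=int)
y[:12] = 1                          # 3% positives: 12 diseased patients

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for fold, (train_idx, val_idx) in enumerate(skf.split(X, y)):
    # Each validation fold keeps roughly the same positive rate (~0.03),
    # so no fold ends up with zero diseased patients by chance.
    print(fold, round(y[val_idx].mean(), 3))
```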
{ "question": "A technology startup is using complex deep neural networks and GPU compute to recommend the company's products to its existing customers based upon each customer's habits and interactions. The solution currently pulls each dataset from an Amazon S3 bucket before loading the data into a TensorFlow model pulled from the company's Git repository that runs locally. This job then runs for several hours while continually outputting its progress to the same S3 bucket. The job can be paused, restarted, and continued at any time in the event of a failure, and is run from a central queue. Senior managers are concerned about the complexity of the solution's resource management and the costs involved in repeating the process regularly. They ask for the workload to be automated so it runs once a week, starting Monday and completing by the close of business Friday. Which architecture should be used to scale the solution at the lowest cost?", "options": [ "A. Implement the solution using AWS Deep Learning Containers and run the container as a job using AWS", "B. Implement the solution using a low-cost GPU-compatible Amazon EC2 instance and use the AWS Instance", "C. Implement the solution using AWS Deep Learning Containers, run the workload using AWS Fargate" ], "correct": "A. Implement the solution using AWS Deep Learning Containers and run the container as a job using AWS", "explanation": "AWS Batch schedules containerized GPU jobs from a queue, retries on failure, and provisions capacity only while the job runs, making it the lowest-cost fit; AWS Fargate does not offer GPUs.", "references": "" },
{ "question": "A Machine Learning Specialist prepared the following graph displaying the results of k-means for k = [1..10]: Considering the graph, what is a reasonable selection for the optimal choice of k?", "options": [ "A. 1", "B. 4", "C. 7", "D. 10" ], "correct": "C. 7", "explanation": "Explanation/Reference:", "references": "" },
{ "question": "A media company with a very large archive of unlabeled images, text, audio, and video footage wishes to index its assets to allow rapid identification of relevant content by the Research team. The company wants to use machine learning to accelerate the efforts of its in-house researchers who have limited machine learning expertise. Which is the FASTEST route to index the assets?", "options": [ "A. Use Amazon Rekognition, Amazon Comprehend, and Amazon Transcribe to tag data into distinct", "B. Create a set of Amazon Mechanical Turk Human Intelligence Tasks to label all footage.", "C. Use Amazon Transcribe to convert speech to text. Use the Amazon SageMaker Neural Topic Model (NTM)", "D. Use the AWS Deep Learning AMI and Amazon EC2 GPU instances to create custom models for audio" ], "correct": "A. Use Amazon Rekognition, Amazon Comprehend, and Amazon Transcribe to tag data into distinct", "explanation": "Explanation/Reference:", "references": "" },
{ "question": "A Machine Learning Specialist is working for an online retailer that wants to run analytics on every customer visit, processed through a machine learning pipeline. The data needs to be ingested by Amazon Kinesis Data Streams at up to 100 transactions per second, and the JSON data blob is 100 KB in size. What is the MINIMUM number of shards in Kinesis Data Streams the Specialist should use to successfully ingest this data?", "options": [ "A. 1 shard", "B. 10 shards", "C. 100 shards", "D. 1,000 shards" ], "correct": "B. 10 shards", "explanation": "Each shard ingests up to 1 MB/second; 100 transactions/second × 100 KB = 10 MB/second, so 10 shards are required.", "references": "" },
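The shard count above comes from the per-shard ingest limit of 1 MB per second; the arithmetic spelled out:

```python
record_kb = 100            # JSON blob size per transaction, in KB
tps = 100                  # transactions per second

ingest_mb_per_s = record_kb * tps / 1000   # = 10 MB/s total inbound
shard_ingest_mb_per_s = 1                  # a shard accepts 1 MB/s (or 1,000 records/s)
print(ingest_mb_per_s / shard_ingest_mb_per_s)  # 10.0 -> minimum of 10 shards
```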
{ "question": "A Machine Learning Specialist is deciding between building a naive Bayesian model or a full Bayesian network for a classification problem. The Specialist computes the Pearson correlation coefficients between each feature and finds that their absolute values range between 0.1 and 0.95. Which model describes the underlying data in this situation?", "options": [ "A. A naive Bayesian model, since the features are all conditionally independent.", "B. A full Bayesian network, since the features are all conditionally independent.", "C. A naive Bayesian model, since some of the features are statistically dependent.", "D. A full Bayesian network, since some of the features are statistically dependent." ], "correct": "D. A full Bayesian network, since some of the features are statistically dependent.", "explanation": "Correlation coefficients as high as 0.95 show the features are not conditionally independent, which violates the naive Bayes assumption, so a full Bayesian network better describes the data.", "references": "" },
{ "question": "A Data Scientist is building a linear regression model and will use resulting p-values to evaluate the statistical significance of each coefficient. Upon inspection of the dataset, the Data Scientist discovers that most of the features are normally distributed. The plot of one feature in the dataset is shown in the graphic. What transformation should the Data Scientist apply to satisfy the statistical assumptions of the linear regression model?", "options": [ "A. Exponential transformation", "B. Logarithmic transformation", "C. Polynomial transformation", "D. Sinusoidal transformation" ], "correct": "A. Exponential transformation", "explanation": "Explanation/Reference:", "references": "" },
{ "question": "A Machine Learning Specialist is assigned to a Fraud Detection team and must tune an XGBoost model, which is working appropriately for test data. However, with unknown data, it is not working as expected. The existing parameters are provided as follows. Which parameter tuning guidelines should the Specialist follow to avoid overfitting?", "options": [ "A. Increase the max_depth parameter value.", "B. Lower the max_depth parameter value.", "C. Update the objective to binary:logistic.", "D. Lower the min_child_weight parameter value." ], "correct": "B. Lower the max_depth parameter value.", "explanation": "Explanation/Reference:", "references": "" },
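For the max_depth guidance above, a hedged sketch of typical overfitting controls in XGBoost; the data is synthetic and every value is illustrative rather than tuned:

```python
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(4)
X, y = rng.normal(size=(500, 10)), rng.integers(0, 2, size=500)
dtrain = xgb.DMatrix(X[:400], label=y[:400])
dval = xgb.DMatrix(X[400:], label=y[400:])

params = {
    "objective": "binary:logistic",
    "max_depth": 3,           # shallower trees generalize better than deep ones
    "min_child_weight": 5,    # raising (not lowering) this also restrains the fit
    "subsample": 0.8,         # row sampling adds further regularization
    "eta": 0.1,
}
booster = xgb.train(params, dtrain, num_boost_round=200,
                    evals=[(dval, "val")], early_stopping_rounds=10,
                    verbose_eval=False)
```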
{ "question": "A data scientist is developing a pipeline to ingest streaming web traffic data. The data scientist needs to implement a process to identify unusual web traffic patterns as part of the pipeline. The patterns will be used downstream for alerting and incident response. The data scientist has access to unlabeled historic data to use, if needed. The solution needs to do the following: · Calculate an anomaly score for each web traffic entry. · Adapt unusual event identification to changing web patterns over time. Which approach should the data scientist implement to meet these requirements?", "options": [ "A. Use historic web traffic data to train an anomaly detection model using the Amazon SageMaker Random", "B. Use historic web traffic data to train an anomaly detection model using the Amazon SageMaker built-in", "C. Collect the streaming data using Amazon Kinesis Data Firehose. Map the delivery stream as an input", "D. Collect the streaming data using Amazon Kinesis Data Firehose. Map the delivery stream as an input" ], "correct": "A. Use historic web traffic data to train an anomaly detection model using the Amazon SageMaker Random", "explanation": "Explanation/Reference:", "references": "" },
{ "question": "A Data Scientist received a set of insurance records, each consisting of a record ID, the final outcome among 200 categories, and the date of the final outcome. Some partial information on claim contents is also provided, but only for a few of the 200 categories. For each outcome category, there are hundreds of records distributed over the past 3 years. The Data Scientist wants to predict how many claims to expect in each category from month to month, a few months in advance. What type of machine learning model should be used?", "options": [ "A. Classification month-to-month using supervised learning of the 200 categories based on claim contents.", "B. Reinforcement learning using claim IDs and timestamps where the agent will identify how many claims in", "C. Forecasting using claim IDs and timestamps to identify how many claims in each category to expect from", "D. Classification with supervised learning of the categories for which partial information on claim contents is" ], "correct": "", "explanation": "Explanation/Reference:", "references": "" },
{ "question": "A company that promotes healthy sleep patterns by providing cloud-connected devices currently hosts a sleep tracking application on AWS. The application collects device usage information from device users. The company's Data Science team is building a machine learning model to predict if and when a user will stop utilizing the company's devices. Predictions from this model are used by a downstream application that determines the best approach for contacting users. The Data Science team is building multiple versions of the machine learning model to evaluate each version against the company's business goals. To measure long-term effectiveness, the team wants to run multiple versions of the model in parallel for long periods of time, with the ability to control the portion of inferences served by the models. Which solution satisfies these requirements with MINIMAL effort?", "options": [ "A. Build and host multiple models in Amazon SageMaker. Create multiple Amazon SageMaker endpoints, one", "B. Build and host multiple models in Amazon SageMaker. Create an Amazon SageMaker endpoint", "C. Build and host multiple models in Amazon SageMaker Neo to take into account different types of medical", "D. Build and host multiple models in Amazon SageMaker. Create a single endpoint that accesses multiple" ], "correct": "D. Build and host multiple models in Amazon SageMaker. Create a single endpoint that accesses multiple", "explanation": "Explanation/Reference:", "references": "" },
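For the weighted-variant pattern chosen above, a minimal boto3 sketch; the endpoint, model, and instance names are invented, and both models are assumed to already exist in SageMaker:

```python
import boto3

sm = boto3.client("sagemaker")

# One endpoint, two production variants, traffic split by weight.
sm.create_endpoint_config(
    EndpointConfigName="churn-ab-config",
    ProductionVariants=[
        {"VariantName": "model-v1", "ModelName": "churn-model-v1",
         "InstanceType": "ml.m5.large", "InitialInstanceCount": 1,
         "InitialVariantWeight": 0.9},   # 90% of inferences
        {"VariantName": "model-v2", "ModelName": "churn-model-v2",
         "InstanceType": "ml.m5.large", "InitialInstanceCount": 1,
         "InitialVariantWeight": 0.1},   # 10% long-running canary share
    ],
)
sm.create_endpoint(EndpointName="churn-ab", EndpointConfigName="churn-ab-config")
```

The traffic shares can later be shifted with update_endpoint_weights_and_capacities without redeploying either model, which is what makes this the minimal-effort option.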
{ "question": "An agricultural company is interested in using machine learning to detect specific types of weeds in a 100-acre grassland field. Currently, the company uses tractor-mounted cameras to capture multiple images of the field as 10 × 10 grids. The company also has a large training dataset that consists of annotated images of popular weed classes like broadleaf and non-broadleaf docks. The company wants to build a weed detection model that will detect specific types of weeds and the location of each type within the field. Once the model is ready, it will be hosted on Amazon SageMaker endpoints. The model will perform real-time inferencing using the images captured by the cameras. Which approach should a Machine Learning Specialist take to obtain accurate predictions?", "options": [ "A. Prepare the images in RecordIO format and upload them to Amazon S3. Use Amazon SageMaker to train, test, and validate the model using an image classification algorithm to categorize images into various weed classes.", "B. Prepare the images in Apache Parquet format and upload them to Amazon S3. Use Amazon SageMaker to", "C. Prepare the images in RecordIO format and upload them to Amazon S3. Use Amazon SageMaker to train,", "D. Prepare the images in Apache Parquet format and upload them to Amazon S3. Use Amazon SageMaker to" ], "correct": "C. Prepare the images in RecordIO format and upload them to Amazon S3. Use Amazon SageMaker to train,", "explanation": "Explanation/Reference:", "references": "" },
{ "question": "A manufacturer is operating a large number of factories with a complex supply chain relationship where unexpected downtime of a machine can cause production to stop at several factories. A data scientist wants to analyze sensor data from the factories to identify equipment in need of preemptive maintenance and then dispatch a service team to prevent unplanned downtime. The sensor readings from a single machine can include up to 200 data points including temperatures, voltages, vibrations, RPMs, and pressure readings. To collect this sensor data, the manufacturer deployed Wi-Fi and LANs across the factories. Even though many factory locations do not have reliable or high-speed internet connectivity, the manufacturer would like to maintain near-real-time inference capabilities. Which deployment architecture for the model will address these business requirements?", "options": [ "A. Deploy the model in Amazon SageMaker. Run sensor data through this model to predict which machines", "B. Deploy the model on AWS IoT Greengrass in each factory. Run sensor data through this model to infer", "C. Deploy the model to an Amazon SageMaker batch transformation job. Generate inferences in a daily batch", "D. Deploy the model in Amazon SageMaker and use an IoT rule to write data to an Amazon DynamoDB table." ], "correct": "B. Deploy the model on AWS IoT Greengrass in each factory. Run sensor data through this model to infer", "explanation": "Running the model at the edge with AWS IoT Greengrass keeps inference local to each factory, so predictions remain near-real-time even where internet connectivity is unreliable or slow.", "references": "" },
{ "question": "A Machine Learning Specialist is designing a scalable data storage solution for Amazon SageMaker. There is an existing TensorFlow-based model implemented as a train.py script that relies on static training data that is currently stored as TFRecords. Which method of providing training data to Amazon SageMaker would meet the business requirements with the LEAST development overhead?", "options": [ "A. Use Amazon SageMaker script mode and use train.py unchanged. Point the Amazon SageMaker training", "B. Use Amazon SageMaker script mode and use train.py unchanged. Put the TFRecord data into an Amazon", "C. Rewrite the train.py script to add a section that converts TFRecords to protobuf and ingests the protobuf", "D. Prepare the data in the format accepted by Amazon SageMaker. Use AWS Glue or AWS Lambda to" ], "correct": "B. Use Amazon SageMaker script mode and use train.py unchanged. Put the TFRecord data into an Amazon", "explanation": "Script mode runs the existing train.py as-is, and TensorFlow can read TFRecord files directly from Amazon S3, so no conversion code or new ETL jobs are needed.", "references": "" } ]